Three lessons from building a production ML system
1. Three lessons learned from building a production machine learning system
Michael Manapat
Stripe
@mlmanapat
2. Fraud
• Card numbers are stolen by hacking, malware, etc.
• “Dumps” are sold in “carding” forums
• Fraudsters use numbers in dumps to buy goods,
which they then resell
• Cardholders dispute transactions
• Merchant ends up bearing cost of fraud
3. • We train binary classifiers to predict fraud
• We use open source tools
• Scalding/Summingbird for feature generation
• scikit-learn for model training
(eventually: github.com/stripe/brushfire)
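A rough sketch of the kind of training setup these bullets describe, using scikit-learn; the classifier choice, the synthetic data, and the 0.5 decision threshold are illustrative assumptions, not Stripe's actual pipeline.

    # Minimal sketch: train a binary fraud classifier with scikit-learn.
    # Synthetic data stands in for the features produced upstream
    # (e.g. by Scalding/Summingbird jobs).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_score, recall_score

    X, y = make_classification(n_samples=10_000, n_features=20, weights=[0.98])
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    scores = clf.predict_proba(X_val)[:, 1]   # fraud score per charge
    preds = scores > 0.5                      # illustrative blocking threshold
    print(precision_score(y_val, preds), recall_score(y_val, preds))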
5. Early ML at Stripe
• Focused on training with more and more data and
adding more and more features
• Didn’t think much about
• ML algorithms (e.g., tuning hyperparameters)
• The deeper reasons behind any particular set of
results
Substantial reduction in fraud rate
6. Product development
From a product standpoint:
• We were blocking high-risk charges and surfacing
just the decision
• We wanted to provide Stripe users insight into our
actions—reasons for scores
7. Score reasons
X = 5, Y = 3: score = 0.1
Which feature is “driving” the score more?
[Decision tree: the root splits on X < 10 (True branch left, False branch right). The True branch splits on Y < 5, with leaf scores 0.1 (20 training samples) and 0.3 (30). The False branch splits on X < 15, with leaf scores 0.5 (10) and 0.9 (40).]
8. Score reasons
X = ?, Y = 3:
(20/70) * 0.1 + (10/70) * 0.5 + (40/70) * 0.9 = 0.61
Score Δ = |holdout - original| = |0.61 - 0.1| = 0.51
Now producing richer reasons with multiple predicates
[Same tree with the X predicates held out (X < ?): with Y = 3 the charge can reach the leaves 0.1 (20), 0.5 (10), and 0.9 (40), each weighted by its training count.]
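A small sketch of the holdout computation above. The tree is hard-coded to match the example, and the traversal logic is one plausible reading of the slide's arithmetic rather than the production implementation.

    # Sketch: "holdout" score for a single decision tree. A node is either a
    # leaf (score, training count) or a split (feature, threshold, branches).
    def holdout_score(node, x, held_out=None):
        if "score" in node:                            # leaf
            return node["score"], node["count"]
        feature, threshold = node["feature"], node["threshold"]
        if feature == held_out:
            # Feature held out: descend both branches and weight each
            # by the number of training samples that reached it.
            s_t, n_t = holdout_score(node["true"], x, held_out)
            s_f, n_f = holdout_score(node["false"], x, held_out)
            n = n_t + n_f
            return (n_t * s_t + n_f * s_f) / n, n
        branch = "true" if x[feature] < threshold else "false"
        return holdout_score(node[branch], x, held_out)

    # Tree from the slides: leaves carry (score, training count).
    tree = {"feature": "X", "threshold": 10,
            "true":  {"feature": "Y", "threshold": 5,
                      "true":  {"score": 0.1, "count": 20},
                      "false": {"score": 0.3, "count": 30}},
            "false": {"feature": "X", "threshold": 15,
                      "true":  {"score": 0.5, "count": 10},
                      "false": {"score": 0.9, "count": 40}}}

    original, _ = holdout_score(tree, {"X": 5, "Y": 3})
    without_x, _ = holdout_score(tree, {"X": 5, "Y": 3}, held_out="X")
    print(original, without_x, abs(without_x - original))   # 0.1, ~0.61, ~0.51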
9. Model introspection
If a model didn’t look good in validation, it wasn’t clear what
to do (besides trying more features/data)
What if we used our “score reasons” to debug model issues?
10. • Take all false positives (in validation data or in
production) and group by generated reason
• Were a substantial fraction of the false positives
driven by a few features?
• Did all the comparisons in the explanation
predicates make sense? (Were they comparisons a
human might make for fraud?)
• Our models were overfit!
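A rough illustration of the grouping step in the bullets above; the reason strings and record layout are invented for the example.

    # Sketch: group validation false positives by their generated score reason.
    from collections import Counter

    # Each record: (predicted_fraud, actually_fraud, top_reason) -- illustrative.
    validation = [
        (True, False, "card_country != ip_country"),
        (True, False, "card_country != ip_country"),
        (True, True,  "amount > 500"),
        (True, False, "amount > 500"),
    ]

    false_positive_reasons = Counter(
        reason for predicted, actual, reason in validation
        if predicted and not actual
    )
    # Are a few features driving most of the false positives?
    print(false_positive_reasons.most_common(3))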
12. Summary
• Don’t treat models as black boxes
• Thinking about the learning process (vs. just
features and data) can yield significant payoffs
• Tooling for introspection can accelerate model
development/“debugging”
Julia Evans, Alyssa Frazee, Erik Osheim, Sam Ritchie,
Jocelyn Ross, Tom Switzer
14. • December 31st, 2013
• Train a binary classifier for
disputes on data from Jan
1st to Sep 30th
• Validate on data from Oct
1st to Oct 31st (need to wait
~60 days for labels)
• Based on validation data, pick
a policy for actioning scores:
block if score > 50
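One way the "pick a policy for actioning scores" step might look in code; the precision target and the lowest-threshold-that-meets-it heuristic are assumptions for illustration, not the talk's actual procedure.

    # Sketch: choose a blocking threshold from the validation month's
    # scores (0-100) and labels (1 = disputed charge).
    import numpy as np

    def pick_threshold(scores, labels, min_precision=0.9):
        for t in range(100):
            blocked = scores > t
            if blocked.sum() == 0:
                continue
            precision = labels[blocked].mean()
            recall = labels[blocked].sum() / labels.sum()
            if precision >= min_precision:
                # Lowest threshold hitting the precision target keeps recall high.
                return t, precision, recall
        return None

    # e.g. pick_threshold(np.array([10, 45, 60, 65]), np.array([0, 1, 0, 1]))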
15. Questions (1)
• Business complains about high false positive rate:
what would happen if we changed the policy to
"block if score > 70"?
• What are the production precision and recall of the
model?
16. • December 31st, 2014. We
repeat the exercise from a
year earlier
• Train a model on data
from Jan 1st to Sep 30th
• Validate on data from Oct
1st to Oct 31st (need to
wait ~60 days for labels)
• Validation results look
~ok (but not great)
• We put the model into
production and the results
are terrible
17. Questions (2)
• Why did the validation results for the new model
look so much worse?
• How do we know if the retrained model really is
better than the original model?
18. Counterfactual evaluation
• Our model changes reality (the world is different
because of its existence)
• We can answer some questions (around model
comparisons) with A/B tests
• For all these questions, we want an approximation
of the charge/outcome distribution that would exist
if there were no model
19. One approach
• Probabilistically reverse a
small fraction of our block
decisions
• The higher the score, the lower
probability we let the charge
through
• Weight samples by 1 / P(allow)
• Get information on the area we
want to improve on
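A minimal sketch of this reversal scheme. The shape of the propensity function (how P(allow) falls as the score rises) is an assumed example, not the function Stripe used.

    import random

    # Sketch of the probabilistic reversal of block decisions.
    def allow_probability(score):
        if score <= 50:
            return 1.0                          # policy would allow anyway
        return max(0.05, 1.0 - score / 100.0)   # higher score, lower P(allow)

    def select_action(score):
        p = allow_probability(score)
        if random.random() < p:
            # Allowed (either normally or as a reversed block): the outcome
            # will be observed and weighted by 1 / P(allow).
            return "allow", 1.0 / p
        return "block", None                    # outcome never observed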
21. ID | Score | P(Allow) | Weight | Original Action | Selected Action | Outcome
    1  | 10    | 1.0      | 1      | Allow           | Allow           | OK
    2  | 45    | 1.0      | 1      | Allow           | Allow           | Fraud
    4  | 65    | 0.20     | 5      | Block           | Allow           | Fraud
    6  | 60    | 0.25     | 4      | Block           | Allow           | OK
Evaluating the "block if score > 50" policy
Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83
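A small sketch reproducing the weighted precision/recall arithmetic above from the table rows.

    # Counterfactual precision/recall for "block if score > 50",
    # using the importance-weighted rows from the table.
    rows = [  # (id, score, weight, outcome)
        (1, 10, 1, "OK"),
        (2, 45, 1, "Fraud"),
        (4, 65, 5, "Fraud"),
        (6, 60, 4, "OK"),
    ]

    would_block = [(w, o) for _, score, w, o in rows if score > 50]
    blocked_weight = sum(w for w, _ in would_block)                  # 9
    blocked_fraud = sum(w for w, o in would_block if o == "Fraud")   # 5
    total_fraud = sum(w for _, _, w, o in rows if o == "Fraud")      # 6

    precision = blocked_fraud / blocked_weight   # 5 / 9 ≈ 0.56
    recall = blocked_fraud / total_fraud         # 5 / 6 ≈ 0.83
    print(precision, recall)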
22. • The propensity function controls the exploration/
exploitation tradeoff
• Precision, recall, etc. are estimators
• Variance of the estimators decreases the more we
allow through
• Bootstrap to get error bars (pick rows from the table
uniformly at random with replacement)
• Li, Chen, Kleban, Gupta: "Counterfactual Estimation
and Optimization of Click Metrics for Search Engines"
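A sketch of the bootstrap step from the bullets above; the 1,000 resamples and the 95% interval are illustrative choices, the slide only calls for error bars.

    # Bootstrap error bars for the counterfactual precision estimate:
    # resample logged rows uniformly at random with replacement.
    import random

    rows = [(1, 10, 1, "OK"), (2, 45, 1, "Fraud"), (4, 65, 5, "Fraud"), (6, 60, 4, "OK")]

    estimates = []
    while len(estimates) < 1000:
        sample = random.choices(rows, k=len(rows))
        blocked = [(w, o) for _, score, w, o in sample if score > 50]
        denom = sum(w for w, _ in blocked)
        if denom == 0:
            continue                          # no blocked rows drawn; resample
        estimates.append(sum(w for w, o in blocked if o == "Fraud") / denom)

    estimates.sort()
    print(estimates[25], estimates[975])      # rough 95% interval for precision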
23. Summary
• Have a plan for counterfactual evaluation before
you productionize your first model
• You can back yourself into a corner (with no data to
retrain on) if you address this later
• You should be monitoring the production
performance of your model anyway (cf. next lesson)
Alyssa Frazee, Julia Evans, Roban Kramer, Ryan
Wang
25. Production vs. data stack
• Ruby/Mongo vs. Scala/Hadoop/Thrift
• Some issues
• Divergence between production and training
definitions
• Upstream changes to library code in production
feature generation can change feature definitions
• True vs. “True”
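An illustrative Python rendering of the last bullet's pitfall (the actual stacks were Ruby/Mongo and Scala/Hadoop/Thrift): a feature computed as a boolean in one stack but coming through serialization as a string in the other.

    # Illustrative only: boolean in the data stack, string in production.
    training_feature = True            # computed natively during training
    production_feature = "True"        # arrives as a string after serialization

    print(production_feature == training_feature)   # False -- silent divergence
    print(bool(production_feature))                  # True even for "False"!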
27. [Architecture diagram: a domain-specific scoring service (business logic) sits in front of a "pure" model evaluation service; scoring requests are logged, and aggregation jobs compute all aggregates per model.]
28. Summary
• Monitor the production inputs to and outputs of
your models
• Have dashboards that can be watched on deploys
and alerting for significant anomalies
• Bake the monitoring into generic ML infrastructure
(so that each ML application isn’t redoing this)
Steve Mardenfeld, Tom Switzer
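A sketch of the kind of monitoring described above, comparing the production score distribution on recent charges against a reference window; the KS statistic and the alert threshold are assumptions for illustration.

    # Alert on a large shift in the production score distribution.
    import numpy as np
    from scipy.stats import ks_2samp

    def check_score_drift(reference_scores, recent_scores, alert_threshold=0.1):
        stat, _ = ks_2samp(reference_scores, recent_scores)
        if stat > alert_threshold:
            print(f"ALERT: score distribution shifted (KS statistic = {stat:.3f})")
        return stat

    # e.g. run on every deploy / on a schedule (synthetic stand-in data):
    reference = np.random.beta(2, 8, size=10_000) * 100   # last week's scores
    recent = np.random.beta(2, 6, size=10_000) * 100      # today's scores
    check_score_drift(reference, recent)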
29. • Don’t treat models as black boxes
• Have a plan for counterfactual evaluation before
productionizing your first model
• Build production monitoring for action rates, score
distributions, and feature distributions (and bake
into ML infra)
30. Thanks
Stripe is hiring data scientists, engineers, and
engineering managers!
mlm@stripe.com | @mlmanapat