Causality without headaches
Benoît Rostykus
Senior Machine Learning Researcher
March 2018
Content
● Correlation ≠ causation
● Randomization & counterfactuals
● Experimental vs observational data
● Inverse Propensity Scoring
● Instrumental variables
○ Generalized Method of Moments
○ Scalable IV regression
○ Weak instruments
● Conclusion
Correlation ≠ causation
Should you stop buying margarine to save your marriage?
Or should you stay married to eat less margarine?
Correlation(X, Y) is high. Does it mean...
… X causes Y?
… Y causes X?
In general, neither.
Most common reason: an unobserved confounder C that drives both X and Y (X and Y are observed, C is not).
This is known as “Omitted Variable Bias”.
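Here is a minimal simulation of omitted variable bias. It assumes a hypothetical linear Gaussian model in which C drives both X and Y, and X has no effect at all on Y. The correlation is high, and a regression of Y on X finds a strong "effect". Only adjusting for C recovers the null effect, and that adjustment is impossible when C is unobserved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: C drives both X and Y,
# and X has NO causal effect on Y.
c = rng.normal(size=n)               # confounder (unobserved in practice)
x = c + 0.5 * rng.normal(size=n)     # observed
y = c + 0.5 * rng.normal(size=n)     # observed

corr_xy = np.corrcoef(x, y)[0, 1]

# Regressing Y on X alone: the slope absorbs the omitted variable C.
naive_slope = np.polyfit(x, y, 1)[0]

# Adjusting for C (only possible if C were observed) recovers the null effect.
design = np.column_stack([x, c, np.ones(n)])
adjusted_slope = np.linalg.lstsq(design, y, rcond=None)[0][0]

print(f"corr(X, Y) = {corr_xy:.2f}, naive slope = {naive_slope:.2f}, "
      f"slope adjusted for C = {adjusted_slope:.2f}")
```

With this setup, corr(X, Y) and the naive slope both come out near 0.8 (= 1 / 1.25). The slope adjusted for C is near 0.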
[Figure: Advertising. Probability of buying diapers tomorrow across weeks W1 to W6: when should we advertise?]
● High probability of conversion the day before weekly groceries, irrespective of the ads shown
● The effect of Pampers ads is null in this case
Traditional (correlational) machine learning will fail and waste $ on useless ads.
In practice, Cost-Per-Incremental-Acquisition can be > 100x Cost-Per-Acquisition (!)
[Figure: Recommendations. Probability of watching the next episode across days D1 to D6: when should we promote?]
The Netflix homepage is expensive real estate (high opportunity cost):
- so many titles to promote
- so few opportunities to recommend
Traditional (correlational) ML systems:
- take action if probability of positive reward is high, irrespective of reward base rate
- don’t model incremental effect of taking action (showing recommendation, ad etc.)
Surely we can do better.
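The following toy sketch makes the second point concrete. The two candidate titles and their probabilities are made up. Ranking by P(reward | shown) picks one title. Ranking by lift, P(reward | shown) - P(reward | not shown), picks the other.

```python
# Hypothetical per-title numbers for one member (illustrative only):
# base_rate = P(play | not shown), with_rec = P(play | shown).
titles = {
    "next_episode_of_current_show": {"base_rate": 0.60, "with_rec": 0.62},
    "new_documentary":              {"base_rate": 0.02, "with_rec": 0.10},
}


def pick_by_probability(candidates):
    """Correlational policy: show the title with the highest P(reward | shown)."""
    return max(candidates, key=lambda t: candidates[t]["with_rec"])


def pick_by_lift(candidates):
    """Incremental policy: show the title whose recommendation adds the most plays."""
    return max(candidates,
               key=lambda t: candidates[t]["with_rec"] - candidates[t]["base_rate"])


print(pick_by_probability(titles))  # next_episode_of_current_show
print(pick_by_lift(titles))         # new_documentary
```

The correlational policy spends the slot on a title the member would most likely have played anyway.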
Randomization & counterfactuals
Typical ML pipeline
1) Build model predicting reward probability
2) In AB test, pick UI element that maximizes predicted reward
3) If long-term business metric lift is green, roll out
i.e.
1) Learn P(reward)
2) Maximize P(reward) to pick an arm
3) Evaluate on #(rewards | new policy) - #(rewards | old policy)
Mismatch! We optimize P(reward) but evaluate on the incremental number of rewards.
AB tests
● Offer lift measurement by randomizing treatment (= algo)
● Typically user-level counterfactuals
● Counterfactual: “What would happen if we were to use this new algo?”
We can generalize this!
Counterfactuals: “what would happen if?”
1) Randomize treatment application (binary 1/0 or treatment intensity)
2) Log what happens
3) Use randomized logs to answer counterfactual questions (“what if?”)
● Different levels of randomization granularity possible:
○ User-level
■ What happens to user metric X if I always take action A vs action B?
■ What happens to user metric X if I always take action A vs ∅ ?
○ Session-level
○ Impression-level
● Different flavors offer answers to different causal questions
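The sketch below walks through the three steps for the simplest case: user-level randomization of a binary treatment. It assumes a hypothetical outcome model with a known true lift, so the estimate can be checked. With randomized logs, the difference of means between arms answers the "what if?" question. The standard error comes from the usual two-sample formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users = 200_000

# 1) Randomize treatment at the user level (coin flip per user).
treated = rng.random(n_users) < 0.5

# 2) Log what happens. Hypothetical outcome model: 30% base rate,
#    +2 points when treated (the "true" lift we try to recover).
p_reward = 0.30 + 0.02 * treated
reward = rng.random(n_users) < p_reward

# 3) Answer the counterfactual question from the randomized logs.
mean_t = reward[treated].mean()
mean_c = reward[~treated].mean()
lift = mean_t - mean_c
stderr = np.sqrt(mean_t * (1 - mean_t) / treated.sum()
                 + mean_c * (1 - mean_c) / (~treated).sum())
print(f"estimated lift = {lift:.4f} ± {1.96 * stderr:.4f} (95% CI)")
```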
Experimental vs observational data
● When we’re in control of the production system to produce counterfactuals,
we call that experimental data
● When we don’t control part of the randomization process,
we call that observational data
Incrementality modeling
Simple experimental example
● On X% of traffic, take no action (or random action)
● On (100-X)% of traffic, take action
● X is typically small because it has a product cost (quality / efficiency / …)
● From collected data, learn:
○ P(reward | features, action) = f(features)
○ P(reward | features, no action) = g(features)
○ Predicted lift: lift(features) = f(features) - g(features)
● Use incremental model in production:
○ Max over arms of lift(features(arm))
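Here is a minimal two-model sketch of the recipe above. It assumes a single hypothetical categorical feature describing the candidate arm, so f and g can simply be per-category empirical rates. In practice any classifier can play the role of f and g. The ground-truth rates are made up so that the arm with the highest P(reward | action) is not the one with the highest lift.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
n_types = 4

# Hypothetical logs: one categorical feature describing the candidate arm,
# and the action taken on (100 - X)% of traffic with X = 10%.
arm_type = rng.integers(0, n_types, size=n)
action = rng.random(n) >= 0.10

# Hypothetical ground truth, unknown to the learner.
base_rate = np.array([0.05, 0.20, 0.40, 0.60])    # P(reward | no action)
true_lift = np.array([0.10, 0.05, 0.01, 0.00])
reward = rng.random(n) < base_rate[arm_type] + true_lift[arm_type] * action


def rate_per_type(mask):
    """Empirical P(reward | arm_type) on the rows selected by mask."""
    return np.array([reward[mask & (arm_type == k)].mean() for k in range(n_types)])


f = rate_per_type(action)     # P(reward | features, action)
g = rate_per_type(~action)    # P(reward | features, no action): only ~X% of the data
lift = f - g                  # predicted lift(features)

# In production: choose the arm with the highest predicted lift,
# not the one with the highest P(reward | action) (that would be type 3).
best_type = int(np.argmax(lift))
print("predicted lift per arm type:", np.round(lift, 3), "-> best:", best_type)
```

The g estimates rest on only about 10% of the logs, which illustrates the "X is small" limitation listed below.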
Pros
● simple
● no model assumptions: plug in your favorite ML technique
Cons
● 2 models
● X is small
○ limit on g model accuracy
○ asymmetrical
● doesn’t explicitly model lift
○ can be hard to calibrate
○ offline metrics?
Inverse Propensity Scoring
One IPS solution
Generalizing the previous example...
In production:
● take action with P(treatment | features)
● take counterfactual action with 1 - P(treatment | features)
We can be in control of P (experimental) or not (observational)
Even when we control P, we might want it non-binary (smooth)
- to control for product quality cost
- to provide enough variance if there’s sequential treatment and long-term reward
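1) Learn a model of P(T | features), the propensity of treatment
2) Learn the incremental model through weighted MLE, weighting each logged example by its inverse propensity:
$$\hat{\theta} = \arg\max_{\theta} \sum_i \frac{\log P_{\theta}(y_i \mid x_i, T_i)}{P(T_i \mid x_i)}$$
T: treatment variable (usually binary)
Δ_θ(x) = P_θ(y | x, T = 1) - P_θ(y | x, T = 0): predicted lift
P(T | x) and Δ_θ can have different feature sets
If treatment is sequential, we need to condition on the full treatment path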
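Here is a small numerical sketch of the two steps. It rests on simplifying assumptions of our own, not the slide's:
- One binary feature x drives both treatment and reward.
- The propensity model is just the empirical treatment rate per value of x.
- The outcome model has one Bernoulli parameter per arm, so the lift is constant and the weighted MLE reduces to a weighted mean.

Propensities are clipped, as is common in practice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical logs: a binary feature x drives BOTH the treatment
# probability and the reward (a confounder, but an observed one).
x = rng.random(n) < 0.5
p_treat_true = np.where(x, 0.8, 0.2)
t = rng.random(n) < p_treat_true
reward = (rng.random(n) < 0.1 + 0.4 * x + 0.05 * t).astype(float)  # true lift = 0.05

# Naive (correlational) comparison is biased by the covariate shift.
naive_lift = reward[t].mean() - reward[~t].mean()

# 1) Learn P(T | x): here simply the empirical treatment rate per value of x.
p_hat = np.where(x, t[x].mean(), t[~x].mean())
p_hat = np.clip(p_hat, 0.05, 0.95)            # clipping limits variance blow-up

# 2) Weighted MLE: each row weighted by 1 / P(T_i | x_i). For a Bernoulli
#    model with one parameter per arm, the weighted MLE is a weighted mean.
w = np.where(t, 1 / p_hat, 1 / (1 - p_hat))
p_treated = np.average(reward[t], weights=w[t])
p_control = np.average(reward[~t], weights=w[~t])
ips_lift = p_treated - p_control

print(f"naive lift = {naive_lift:.3f}, IPS lift = {ips_lift:.3f}")
```

The naive comparison lands near 0.29, because treated rows are mostly x = 1. The IPS estimate lands near the simulated 0.05.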
Pros
- unbiased in theory if no unobserved confounders
- explicitly models the (treatment/control) covariate shift
- generic weighted MLE
- plug in your favorite model
- your usual ML library already supports it
Cons
- Not robust to unobserved confounders
- needs enough variance over the feature space
- IPS weighting can blow up the variance of the estimate
- usually resort to clipping
- if the propensity model is biased, what happens to the lift estimate?
Application: Ad incrementality
Does online advertising work?
Clickbait headlines (articles dated 02/08/2018 and 07/27/2017)
Main problem: measurement
Advertising platforms report on metrics such as Cost-Per-Click (CPC) or Cost-Per-Action (CPA)
Based on attribution methodologies such as:
- Last click (only give credit to the last ad clicked before conversion)
- Last view (only give credit to the last ad viewed)
- Any view (any ad “viewed” gets the credit)
- Arbitrary combinations of the previous with fudge factors
...These are non-rigorous ways of estimating the causal effect of an ad.
In practice, reported metrics over-inflate the ad effect by 1 or 2 orders of magnitude…
… because most people would convert anyway, irrespective of the ad:
Cost-Per-Incremental-Action > 100x CPA
Rigorous alternative:
1) Proper lift measurement using counterfactuals (“ghost bids” / “ghost ads”)
2) Incrementality-based bidding to optimize for ad effect
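The gap between reported CPA and Cost-Per-Incremental-Action comes down to the denominator. CPA divides spend by attributed conversions; CPIA divides it by the conversions actually caused by the ads, measured against a holdback group. The sketch below shows the arithmetic with made-up campaign numbers. All figures are hypothetical, including the attributed-conversion count.

```python
def cost_per_action(spend, attributed_conversions):
    """What attribution-based reporting gives: spend per credited conversion."""
    return spend / attributed_conversions


def cost_per_incremental_action(spend, conv_treatment, n_treatment,
                                conv_control, n_control):
    """Spend per conversion actually caused by the ads, using a holdback group.

    Control conversions are rescaled to the treatment group size so that
    the two groups may have different sizes."""
    incremental = conv_treatment - conv_control * (n_treatment / n_control)
    return spend / incremental


# Hypothetical campaign numbers, for illustration only.
spend = 10_000.0
cpa = cost_per_action(spend, attributed_conversions=10_000)
cpia = cost_per_incremental_action(spend, conv_treatment=20_200, n_treatment=1_000_000,
                                   conv_control=20_000, n_control=1_000_000)
print(f"CPA = {cpa:.2f}, CPIA = {cpia:.2f}, ratio = {cpia / cpa:.0f}x")
```

With these numbers, CPA is 1.00 and CPIA is 50.00, because almost all conversions in the treated group would have happened anyway.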
In an ideal world...
[Figure: user space randomly split into a holdback group (control: don’t show ads) and a treatment group (apply treatment: show ads). C1 control users and C2 treated users convert; the others don’t convert.]
Incremental conversions: C2 - C1
In the real world...
[Figure: control and treatment groups split over cookie space]
● Imperfect identity
○ cookie ≠ user, cross-device tracking...
● Imperfect, non-random compliance
○ We don’t show ads to all treatment cookies
○ Non-deterministic: due to auction mechanisms
● Cross-channel issues
○ Can’t easily compare numbers across platforms
○ Holdback group pollution: the control group on one platform is exposed through other platforms
● Heterogeneous logging/reporting tech
● Incrementality varies by ad characteristics
○ Need scores for each impression
Incremental value of an impression: P(conversion | features, impression) - P(conversion | features, no impression)
Trained through MLE with IPS over all bids (bids = impressions + lost auctions).
The auction is a non-random process which decides whether the treatment (impression) is applied.
We need to learn it to get an unbiased estimate of the treatment effect.
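Here is a sketch of "learning the auction". It assumes a hypothetical setting where a single user-activity feature drives both the probability of winning the auction and the probability of converting. A plain logistic regression fits the win propensity on all bids, won and lost. Its inverse is then used as the IPS weight. This illustrates the idea only; it is not the production model.

```python
import numpy as np

rng = np.random.default_rng(4)
n_bids = 100_000

# Hypothetical bid logs: every bid is logged, whether won (impression) or lost.
# A user activity feature drives both auction wins and conversions.
activity = rng.normal(size=n_bids)
won = rng.random(n_bids) < 1 / (1 + np.exp(-(0.5 + 1.5 * activity)))
p_conv = 0.02 + 0.03 * (activity > 0) + 0.01 * won    # true effect of an impression: +0.01
converted = (rng.random(n_bids) < p_conv).astype(float)


def fit_logistic(feature, label, n_iter=25):
    """Fit P(label = 1 | feature) = sigmoid(a + b * feature) with Newton's method
    and return the fitted probabilities."""
    X = np.column_stack([np.ones_like(feature), feature])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (label - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return 1 / (1 + np.exp(-X @ beta))


# Learn the auction, P(impression | features), from all bids, then clip.
p_win = np.clip(fit_logistic(activity, won.astype(float)), 0.02, 0.98)
w = np.where(won, 1 / p_win, 1 / (1 - p_win))

naive_lift = converted[won].mean() - converted[~won].mean()
ips_lift = (np.average(converted[won], weights=w[won])
            - np.average(converted[~won], weights=w[~won]))
print(f"naive lift = {naive_lift:.4f}, IPS lift = {ips_lift:.4f}")
```

The naive comparison of won vs lost bids is inflated, because active users both win more auctions and convert more. The weighted estimate should land near the simulated +0.01.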
Instrumental variables
Instrumental variable
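Under the following model:
$$Y = f(X) + \varepsilon, \qquad \text{with } X \text{ correlated with } \varepsilon$$
An instrument Z is an observed variable with the following 2 properties:
1) Z is correlated with X
2) Z can only influence Y through X¹
Z → X → Y
¹ In practice, we replace 2) with a weaker hypothesis: E[ε | Z] = 0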
Example: Does money make people happy?
Z: $ won at the lottery, X: wealth, Y: happiness (Z → X → Y)
Reasonable instrument: Z = $ won at the lottery
But:
● conditioned on being a gambler (specific demographics)
● Z can affect Y irrespective of $ (fun of gambling)
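To build intuition for how an instrument is used, here is a sketch with a single instrument and the classical IV (Wald) ratio cov(Z, Y) / cov(Z, X). It assumes a hypothetical linear world in which the lottery instrument is valid by construction, so none of the caveats above apply. An unobserved trait confounds wealth and happiness. The true effect of wealth is set to 0.3.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Hypothetical model: an unobserved trait drives both wealth and happiness.
trait = rng.normal(size=n)                                     # unobserved confounder
lottery = rng.exponential(1.0, size=n)                         # Z: $ won, random by design
wealth = 2.0 * lottery + trait + rng.normal(size=n)            # X
happiness = 0.3 * wealth + 1.0 * trait + rng.normal(size=n)    # Y, true effect 0.3


def cov(a, b):
    return np.cov(a, b)[0, 1]


ols = cov(wealth, happiness) / np.var(wealth, ddof=1)   # correlational answer
iv = cov(lottery, happiness) / cov(lottery, wealth)     # IV (Wald) estimator
print(f"OLS = {ols:.3f}, IV = {iv:.3f}")
```

The confounder pulls the plain regression upward, while the IV ratio recovers the value used in the simulation.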
In practice, finding good instruments for observational data is hard:
“Many empirical controversies in economics are essentially disputes about whether or not certain variables constitute valid instruments.” - Davidson & MacKinnon book
Instrumental variable (IV) Regression
● The idea is that we can debias our model by using the fact that our instruments explain the endogenous features independently from the regression residual
● Bread and butter of econometrics
○ Because we don’t have parallel universes to run AB tests on economic policies
○ Observational data is sometimes all we have
Let’s dive into the details since it is less familiar to people with an ML background
IV regression
From the hypothesis E[Y - f(X) | Z] = 0, we derive, for any function h of the instruments:
$$E\big[(Y - f(X))\, h(Z)\big] = 0$$
because conditional expectation is an orthogonal projection*.
*: see here, chapter 4.3
GMM for IV regression
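From there, we can see that the inference becomes:
$$\hat{f} = \arg\min_{f} \; g_n(f)^\top W \, g_n(f), \qquad g_n(f) = \frac{1}{n} \sum_{i=1}^{n} z_i \big(y_i - f(x_i)\big)$$
with h(Z) = Z and W a positive-definite weighting matrix.
This is called (the functional form of) the Generalized Method of Moments for IV regression.
See this book for a thorough treatment of GMM.
The usual econometrics solution to this problem in the linear case is 2-stage least squares (2SLS), which expresses the solution through matrix inversion:
$$\hat{\beta}_{\text{2SLS}} = \big(X^\top P_Z X\big)^{-1} X^\top P_Z\, y, \qquad P_Z = Z \big(Z^\top Z\big)^{-1} Z^\top$$
… this works for small datasets, but breaks (O(n³) complexity, O(n²) storage) for internet-scale data or non-linear models.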
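Here is a sketch of 2SLS with NumPy for the linear case: β = (XᵀP_Z X)⁻¹ XᵀP_Z y with P_Z = Z(ZᵀZ)⁻¹Zᵀ. One implementation choice is ours: the n × n matrix P_Z is never formed. Instead, the first stage is a least-squares projection of X onto the columns of Z, which yields the same estimator. The sketch still relies on dense linear algebra and a linear model, so it does not address the scaling issue raised here. The data are simulated and hypothetical.

```python
import numpy as np


def two_stage_least_squares(X, Z, y):
    """2SLS estimate beta = (X' P_Z X)^{-1} X' P_Z y, with P_Z = Z (Z'Z)^{-1} Z'.

    P_Z (n x n) is never built: the first stage projects X onto the columns
    of Z with a least-squares solve, giving X_hat = P_Z X."""
    gamma, *_ = np.linalg.lstsq(Z, X, rcond=None)    # first stage
    X_hat = Z @ gamma
    # Second stage: X_hat' X = X' P_Z X and X_hat' y = X' P_Z y (P_Z is symmetric).
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)


# Hypothetical data: u confounds x and y; two instruments; true coefficients (1, 2).
rng = np.random.default_rng(8)
n = 20_000
u = rng.normal(size=n)
z = rng.normal(size=(n, 2))
x = z[:, 0] + 0.5 * z[:, 1] + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + 2.0 * u + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # the intercept is exogenous...
Z = np.column_stack([np.ones(n), z])   # ...so it serves as its own instrument

beta_2sls = two_stage_least_squares(X, Z, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("2SLS:", np.round(beta_2sls, 3), " OLS:", np.round(beta_ols, 3))
```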
Scalable GMM for IV regression
We are trying to minimize:
$$\Big\| \frac{1}{n} \sum_{i=1}^{n} z_i \big(y_i - f(x_i)\big) \Big\|^2$$
Joint work with T. Jebara,
to be published
We want a solution that scales linearly with:
- # of training points
- # of non-zero features per row (sparse high-dimensional X)
- # of non-zero instruments per row (sparse high-dimensional Z)
Idea: pairwise importance-sampling SGD
More likely to sample rows which matter (large instruments): will converge fast
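Here is a minimal sketch of the pairwise importance-sampling idea, written by us for illustration. It is not the authors' (unpublished) algorithm. It assumes a linear model, identity weighting, and sampling probabilities proportional to the norm of each row's instruments. The key observation is that the squared norm of an average is a double sum over pairs of rows, so SGD can sample pairs (i, j). Dividing by the sampling probabilities keeps the gradient estimate unbiased.

```python
import numpy as np


def pairwise_is_sgd_gmm(X, Z, y, n_steps=4000, batch=256, lr=0.05, seed=0):
    """Minimize Q(beta) = || (1/n) sum_i z_i (y_i - x_i . beta) ||^2 by SGD.

    Q expands into a double sum over pairs of rows:
        Q = (1/n^2) sum_{i,j} (z_i . z_j) r_i r_j,  with r_i = y_i - x_i . beta.
    Rows i and j are drawn independently with probability q proportional to
    ||z_i||, and each pair is reweighted by 1 / (n^2 q_i q_j), so the
    minibatch gradient is an unbiased estimate of grad Q."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    q = np.linalg.norm(Z, axis=1)
    q = q / q.sum()
    beta = np.zeros(d)
    avg, n_avg = np.zeros(d), 0
    for step in range(1, n_steps + 1):
        i = rng.choice(n, size=batch, p=q)
        j = rng.choice(n, size=batch, p=q)
        r_i = y[i] - X[i] @ beta
        r_j = y[j] - X[j] @ beta
        s = np.sum(Z[i] * Z[j], axis=1) / (n * n * q[i] * q[j])
        # Gradient of s * r_i * r_j w.r.t. beta is -s * (r_j * x_i + r_i * x_j).
        grad = -np.mean(s[:, None] * (r_j[:, None] * X[i] + r_i[:, None] * X[j]), axis=0)
        beta -= lr / np.sqrt(step) * grad
        if step > n_steps // 2:              # average the second half of the iterates
            n_avg += 1
            avg += (beta - avg) / n_avg
    return avg


# Hypothetical just-identified example: x is endogenous (shares u with y), z is a valid instrument.
rng = np.random.default_rng(6)
n = 5000
u, z = rng.normal(size=n), rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = 2.0 * x + 2.0 * u + rng.normal(size=n)
X = np.column_stack([x, np.ones(n)])
Z = np.column_stack([z, np.ones(n)])

beta_sgd = pairwise_is_sgd_gmm(X, Z, y)
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # exact minimizer (Q = 0) in this case
print("SGD:", np.round(beta_sgd, 3), " exact IV:", np.round(beta_iv, 3))
```

In this just-identified example, the minimizer of Q is exactly the IV solution, so that is the reference the sketch is compared against. The step-size schedule and the averaging of the second half of the iterates are arbitrary choices.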
Bonus: extra variance reduction around the point estimate thanks to importance sampling! (hypothesis: same effect as efficient GMM)
Converges faster than known alternatives for non-linear GMM.
Weak instruments
Now that we can run IV regression on problems:
- with millions of features
- with millions of instruments
Should we do it?
Problem: when instruments are weak (low correlation with X), the IV estimator is biased
… first towards the correlational answer
… but then unbounded
Causal answer can become worse than correlational one with IV too!
Joint work with D. Hubbard
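Here is a small Monte Carlo sketch of the first failure mode. It uses a hypothetical linear model where the true effect is 1 and the correlational (OLS) answer is about 2. With one strong instrument, 2SLS is centered near the truth. With 50 very weak instruments, the median 2SLS estimate drifts back toward the OLS answer. The instrument strengths and sample sizes are arbitrary choices.

```python
import numpy as np


def tsls_slope(x, y, Z):
    """No-intercept 2SLS slope of y on x, using the columns of Z as instruments."""
    gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)   # first stage
    x_hat = Z @ gamma
    return (x_hat @ y) / (x_hat @ x)                 # second stage


def median_2sls(first_stage, n=1000, reps=200, seed=7):
    """Median 2SLS estimate over simulated datasets; true effect of x on y is 1."""
    rng = np.random.default_rng(seed)
    k = len(first_stage)
    estimates = []
    for _ in range(reps):
        Z = rng.normal(size=(n, k))
        u = rng.normal(size=n)                       # unobserved confounder
        x = Z @ first_stage + u + rng.normal(size=n)
        y = 1.0 * x + 2.0 * u + rng.normal(size=n)
        estimates.append(tsls_slope(x, y, Z))
    return float(np.median(estimates))


# OLS tends to 1 + 2 * Var(u) / Var(x), i.e. about 2 when instruments are weak.
median_strong = median_2sls(np.array([1.0]))         # 1 strong instrument
median_weak = median_2sls(np.full(50, 0.02))         # 50 very weak instruments
print(f"median 2SLS: strong IV = {median_strong:.2f}, 50 weak IVs = {median_weak:.2f}")
```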
Simple 1D experiment
observed: hours, z, y
unobserved: wealth
Try to recover:
● Can 100 weaker IVs replace 1 good IV? It depends
● Non-monotonic behavior in the very weak regime :(
Controlling for instruments quality is crucial
How to do it in a meaningful and scalable way?
● Partial answer from the statistics literature:
○ Partial F-tests
○ Cragg–Donald
○ Anderson-Rubin
○ …
● “Regularize” the instruments
○ Which cross-validation metric to use? Circular problem. No ground truth!
○ See Hausman Causal Correction from Lewis & Wong
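As a concrete example of the classical diagnostics, here is a sketch of the first-stage F-statistic. It covers a single endogenous feature with no other controls; with exogenous controls, they would belong in both the restricted and unrestricted regressions. A widely used rule of thumb flags instruments as weak when F < 10. It is a heuristic, not a guarantee. The data are hypothetical.

```python
import numpy as np


def first_stage_f(x, Z):
    """F-statistic of regressing x on [1, Z] versus x on an intercept only."""
    n, k = Z.shape
    Z1 = np.column_stack([np.ones(n), Z])
    coef, *_ = np.linalg.lstsq(Z1, x, rcond=None)
    rss_unrestricted = np.sum((x - Z1 @ coef) ** 2)
    rss_restricted = np.sum((x - x.mean()) ** 2)
    return ((rss_restricted - rss_unrestricted) / k) / (rss_unrestricted / (n - k - 1))


rng = np.random.default_rng(9)
n = 1000
z_strong = rng.normal(size=(n, 1))
x_strong = z_strong[:, 0] + rng.normal(size=n)
z_weak = rng.normal(size=(n, 50))
x_weak = 0.02 * z_weak.sum(axis=1) + rng.normal(size=n)

f_strong = first_stage_f(x_strong, z_strong)
f_weak = first_stage_f(x_weak, z_weak)
print(f"first-stage F: strong = {f_strong:.1f}, weak = {f_weak:.1f}")
```

Diagnostics like this one scale poorly to millions of sparse instruments, which is the open question raised above.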
IPS-MLE vs IV-GMM
Both are unbiased and consistent when there are no unobserved confounders.
Typical estimates have higher variance than their correlational counterparts.
IPS-MLE
● Familiar to ML people
● More flexible on model class
● Easy to scale
● Fewer theoretical guarantees
● Not robust to unobserved confounders
● Bias and variance come from IPS weight
IV-GMM
● Familiar to econ people
● Mostly Gaussian residuals
● Harder to scale
● More theoretical guarantees
● Robust to unobserved confounders
● Bias and variance come from IV strength
In both cases, there is no built-in fallback to correlational answer if randomization is poor
(other methods exist! Ex: propensity matching)
Conclusion
Applications
Plenty of use cases for causal inference at Netflix
● Advertising
● Causal recommendations
● Content valuation
● Increased experimentation power
● ...
Causal inference in practice
Hard! because:
Causal effects are small
Asymptotic unbiasedness is useless if the variance dominates, even on large datasets
Variance grows even more when there is sequential treatment
Unobserved confounders can have bigger magnitude than what we try to measure
Plenty of unsatisfactory / unanswered questions in the literature
No clean ground-truth
All estimators have their flaws.
Hard (impossible) to measure and compare biases offline on large-scale problems
When it matters
Correlational models are fine...
● When we only care about fitting the data / predicting
● When your model predictions won’t interact with the product
Causal models can help...
● When there’s a “why?”
○ “why did NFLX stock price move today?”
● When there’s a “what would happen if?”
○ “what would happen to streaming if iOS app was 10% slower?”
● To build cost-efficient ML algorithms
○ incremental models factor in the effect of taking the action suggested
○ aligned with business-metric lift: maximize the likelihood of a green AB test
○ … it’s just a greedy one-step-ahead Reinforcement Learning strategy
Thank you.

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Causality without headaches

  • 1. Causality without headaches. Benoît Rostykus, Senior Machine Learning Researcher, March 2018
  • 2. Content: ● Correlation ≠ causation ● Randomization & counterfactuals ● Experimental vs observational data ● Inverse Propensity Scoring ● Instrumental variables ○ Generalized Method of Moments ○ Scalable IV regression ○ Weak instruments ● Conclusion
  • 4. Should you stop buying margarine to save your marriage?
  • 5. Or should you stay married to eat less margarine?
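  • 6. Correlation(X, Y) is high. Does it mean that X causes Y? That Y causes X? In general, neither. The most common reason is an unobserved confounder C that influences both X and Y (X and Y are observed, C is not). This is known as “Omitted Variable Bias”. [Figure: causal graph with C → X and C → Y]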
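A small simulation makes the omitted-variable story concrete. It is a sketch under invented assumptions: the coefficients are arbitrary, X has no causal effect on Y at all, and C is made available to the second regression only to show what controlling for the confounder would do if it were observed.

```python
import numpy as np

def simulate_confounding(n=100_000, seed=0):
    """Toy data: a confounder C drives both X and Y; X has NO causal effect on Y."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)               # confounder (unobserved in the real problem)
    x = 2.0 * c + rng.normal(size=n)     # X depends on C only
    y = 3.0 * c + rng.normal(size=n)     # Y depends on C only, not on X
    return c, x, y

def ols_slope(features, target):
    """Least-squares coefficients of target on [1, features...], intercept dropped."""
    design = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]

c, x, y = simulate_confounding()
print(np.corrcoef(x, y)[0, 1])   # strong correlation (population value 6/sqrt(50))
print(ols_slope([x], y))         # C omitted: slope near 1.2 instead of the true 0
print(ols_slope([x, c], y))      # C included: slope on X near 0, on C near 3
```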
  • 7. Advertising. [Figure: timeline of weeks W1 to W6, asking at each point “advertise?”, with the probability of buying diapers tomorrow on the vertical axis] ● The probability of conversion is high the day before weekly groceries, irrespective of the ads shown. ● In this case, the effect of Pampers ads is null. Traditional (correlational) machine learning will fail here and waste $ on useless ads: in practice, Cost-Per-Incremental-Acquisition can be more than 100x Cost-Per-Acquisition (!)
  • 8. Recommendations. [Figure: timeline D1 to D6, asking at each point “promote?”, with the probability of watching the next episode on the vertical axis] The Netflix homepage is expensive real estate (high opportunity cost): - so many titles to promote - so few opportunities to recommend. Traditional (correlational) ML systems: - take action if the probability of a positive reward is high, irrespective of the reward base rate - don’t model the incremental effect of taking the action (showing a recommendation, an ad, etc.)
  • 9. Surely we can do better.
  • 13. Typical ML pipeline: 1) Build a model predicting reward probability 2) In an AB test, pick the UI element that maximizes predicted reward 3) If the long-term business metric lift is green, roll out. In other words: 1) Learn P(reward) 2) Max P(reward) to pick an arm 3) Evaluate on #(rewards | new policy) - #(rewards | old policy). Mismatch! Steps 1 and 2 optimize the probability of a reward, while step 3 evaluates the incremental rewards (lift).
  • 15. AB tests ● Offer lift measurement by randomizing treatment (= algo) ● Typically user-level counterfactuals ● Counterfactual: “What would happen if we were to use this new algo?” We can generalize this!
  • 16. Counterfactuals: “what would happen if?” 1) Randomize treatment application (binary 1/0 or treatment intensity) 2) Log what happens 3) Use randomized logs to answer counterfactual questions (“what if?”) ● Different levels of randomization granularity possible: ○ User-level ■ What happens to user metric X if I always take action A vs action B? ■ What happens to user metric X if I always take action A vs ∅ ? ○ Session-level ○ Impression-level ● Different flavors offer answers to different causal questions
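The three steps above can be sketched for the simplest case, user-level randomization of a binary treatment. The data are simulated and the 10% base rate and +1 point effect are hypothetical; the lift is the difference in means between arms, with a normal-approximation standard error.

```python
import numpy as np

def ab_lift(reward_treat, reward_ctrl):
    """Difference-in-means lift and its standard error (normal approximation)."""
    lift = reward_treat.mean() - reward_ctrl.mean()
    se = np.sqrt(reward_treat.var(ddof=1) / len(reward_treat)
                 + reward_ctrl.var(ddof=1) / len(reward_ctrl))
    return lift, se

rng = np.random.default_rng(1)
n = 200_000
# 1) Randomize treatment application (50/50, user level).
treated = rng.random(n) < 0.5
# 2) Log what happens. Hypothetical ground truth: 10% base rate, +1 point causal effect.
reward = (rng.random(n) < 0.10 + 0.01 * treated).astype(float)
# 3) Use the randomized logs to answer "what if everyone got the treatment?"
lift, se = ab_lift(reward[treated], reward[~treated])
print(f"lift = {lift:.4f} +/- {1.96 * se:.4f} (95% CI)")
```

Because assignment is random, the difference in means is an unbiased estimate of the causal effect; no model of the reward is needed.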
  • 18. ● When we’re in control of the production system to produce counterfactuals, we call that experimental data ● When we don’t control part of the randomization process, we call that observational data
  • 19. Incrementality modeling Simple experimental example ● On X% of traffic, take no action (or random action) ● On (100-X)% of traffic, take action ● X is typically small because it has a product cost (quality / efficiency / …) ● From collected data, learn: ○ P(reward | features, action) = f(features) ○ P(reward | features, no action) = g(features) ○ Predicted lift: lift(features) = f(features) - g(features) ● Use incremental model in production: ○ Max over arms of lift(features(arm))
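A minimal simulation of this two-model recipe follows. It assumes four hypothetical user segments that act as both the features and the candidate arms, a 10% no-action holdout, and per-segment empirical rates standing in for the f and g models (any classifier could replace them). It contrasts picking the arm with the highest P(reward | action) against picking the arm with the highest predicted lift.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
holdout_frac = 0.10  # X% of traffic with no action (hypothetical value)

# Hypothetical segments: base reward rate and true causal effect of the action.
base_rate = np.array([0.60, 0.50, 0.10, 0.05])
true_effect = np.array([0.00, 0.01, 0.15, 0.05])

segment = rng.integers(0, len(base_rate), size=n)
action = rng.random(n) >= holdout_frac            # True = take action
p = base_rate[segment] + true_effect[segment] * action
reward = rng.random(n) < p

# f(features) = P(reward | features, action), g(features) = P(reward | features, no action)
f = np.array([reward[(segment == s) & action].mean() for s in range(len(base_rate))])
g = np.array([reward[(segment == s) & ~action].mean() for s in range(len(base_rate))])
lift = f - g

print("correlational pick:", int(np.argmax(f)))     # highest P(reward | action)
print("incremental pick:  ", int(np.argmax(lift)))  # highest predicted lift
```

The correlational policy spends the slot on the segment that converts anyway; the incremental one picks the segment where the action actually changes behavior. Note that g is learned from only the small holdout, which is the accuracy limit listed in the cons.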
  • 20. Incrementality modeling Pros ● simple ● no model assumptions, plug your favorite ML technique Cons ● 2 models ● X is small ○ limit on g model accuracy ○ asymmetrical ● doesn’t explicitly model lift ○ can be hard to calibrate ○ offline metrics?
  • 22. One IPS solution Generalizing the previous example... In production: ● take action with P(treatment | features) ● take counterfactual action with 1 - P(treatment | features) We can be in control of P (experimental) or not (observational) Even when we control P, we might want it non-binary (smooth) - to control for product quality cost - to provide enough variance if there’s sequential treatment and long-term reward
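  • 23. One IPS solution 1) Learn a model of P(treatment | features) 2) Learn the incremental model through weighted MLE, weighting each example by the inverse of its propensity. The treatment variable is usually binary, the model outputs a predicted lift, and its baseline and lift components can use different feature sets. If the treatment is sequential, the propensity needs to be conditioned on the full treatment path.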
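The two steps can be illustrated in the simplest possible case. This is a sketch under stated assumptions, not the slide's model: a single observed binary feature x drives both treatment and reward, the propensity is "learned" as the empirical treatment rate per value of x, and the outcome model is one Bernoulli rate per arm, for which the inverse-propensity-weighted MLE has a closed form (the weighted mean). A real system would pass the same weights to its learner's per-example weight option.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Observed binary feature x drives both the treatment decision and the reward.
x = rng.random(n) < 0.5                   # hypothetical "heavy user" flag
true_propensity = np.where(x, 0.8, 0.2)   # P(treatment | x), not given to the estimator
t = rng.random(n) < true_propensity
p_reward = 0.2 + 0.5 * x + 0.05 * t       # true lift of the treatment = +0.05
y = (rng.random(n) < p_reward).astype(float)

# 1) Learn P(treatment | features): here, the empirical treatment rate per value of x.
e_hat = np.where(x, t[x].mean(), t[~x].mean())

# 2) Weighted MLE with inverse-propensity weights. For a Bernoulli reward rate per
#    arm, the weighted MLE is simply the weighted mean of y within that arm.
w = np.where(t, 1.0 / e_hat, 1.0 / (1.0 - e_hat))
p_treated = np.sum(w * y * t) / np.sum(w * t)
p_control = np.sum(w * y * ~t) / np.sum(w * ~t)

naive_lift = y[t].mean() - y[~t].mean()   # correlational answer, confounded by x
ips_lift = p_treated - p_control
print(f"naive: {naive_lift:.3f}  IPS: {ips_lift:.3f}  true: 0.050")
```

The naive difference mostly measures that heavy users are both more often treated and more likely to convert; reweighting removes that covariate shift because x is observed. An unobserved driver of both t and y would not be corrected.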
  • 24. One IPS solution Pros - unbiased in theory if there are no unobserved confounders - explicitly models the (treatment/control) covariate shift - generic weighted MLE - plug in your favorite model - your usual ML library already supports it Cons - not robust to unobserved confounders - needs enough variance over the feature space - IPSing can blow up the variance of the estimate - usually resort to clipping - if the propensity model is biased, what happens to the lift estimate? [Figure: feature distributions of the treated vs control populations]
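The variance problem and the clipping workaround can be quantified with the Kish effective sample size, (Σw)² / Σw². In this sketch the propensities and the clipping threshold of 10 are hypothetical; one treated example with propensity 0.01 dominates the sum, and clipping its weight trades some bias for a larger effective sample.

```python
import numpy as np

def ips_weights(propensity, clip=None):
    """Inverse-propensity weights 1/P(observed treatment | features), optionally clipped."""
    w = 1.0 / np.asarray(propensity, dtype=float)
    return w if clip is None else np.minimum(w, clip)

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

propensity = [0.5, 0.5, 0.5, 0.01]  # one example that was almost never going to be treated
print(effective_sample_size(ips_weights(propensity)))           # ≈ 1.12
print(effective_sample_size(ips_weights(propensity, clip=10)))  # ≈ 2.29
```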
  • 25. Application: Ad incrementality Does online advertising work?
  • 26. Application: Ad incrementality [Figure: news articles with clickbait headlines about online advertising, dated 02/08/2018 and 07/27/2017]
  • 27. Application: Ad incrementality Main problem: measurement. Advertising platforms report metrics such as Cost-Per-Click (CPC) or Cost-Per-Action (CPA), based on attribution methodologies such as: - Last click (only give credit to the last ad clicked before conversion) - Last view (only give credit to the last ad viewed) - Any view (any ad “viewed” gets the credit) - Arbitrary combinations of the previous with fudge factors. These are non-rigorous ways of estimating the causal effect of an ad. In practice, the reported metrics over-inflate the ad effect by 1 or 2 orders of magnitude, because most people would convert anyway, irrespective of the ad: Cost-Per-Incremental-Action > 100x CPA. Rigorous alternative: 1) proper lift measurement using counterfactuals (“ghost bids” / “ghost ads”) 2) incrementality-based bidding to optimize for ad effect
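The gap between the two cost metrics is plain arithmetic once a holdback group exists. All numbers below are a hypothetical campaign: attribution credits the ads with 1,000 conversions, but equally sized treatment and holdback groups convert almost identically, so only 10 conversions are incremental.

```python
def cost_metrics(spend, attributed_conversions,
                 conv_treatment, n_treatment, conv_holdback, n_holdback):
    """Compare attribution-based CPA with Cost-Per-Incremental-Action."""
    cpa = spend / attributed_conversions
    # Conversions the treatment group would have had anyway, scaled from the holdback.
    baseline = conv_holdback * (n_treatment / n_holdback)
    incremental = conv_treatment - baseline
    cpia = spend / incremental
    return cpa, cpia

# Hypothetical: $10,000 spend, 1,000 attributed conversions,
# 2,010 vs 2,000 conversions in 100,000-user treatment and holdback groups.
cpa, cpia = cost_metrics(10_000, 1_000, 2_010, 100_000, 2_000, 100_000)
print(cpa, cpia, cpia / cpa)  # 10.0 1000.0 100.0
```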
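  • 28. Application: Ad incrementality in an ideal world... [Figure: the user space is split into a control group (holdback group: don’t show ads) and a treatment group (apply treatment: show ads); in each group some users convert and some don’t. Incremental conversions are the difference between the conversions of the two groups (C1 and C2).]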
  • 29. Application: Ad incrementality in the real world... [Figure: the cookie space is split into control (holdback group: don’t show ads) and treatment (apply treatment: show ads)] ● Imperfect identity ○ cookie ≠ user, cross-device tracking... ● Imperfect, non-random compliance ○ We don’t show ads to all treatment cookies ○ Non-deterministic: due to auction mechanisms ● Cross-channel issues ○ Can’t easily compare numbers across platforms ○ Holdback group pollution: the control group on one platform is exposed through other platforms ● Heterogeneous logging/reporting tech ● Incrementality varies by ad characteristics ○ Need scores for each impression
  • 31. Application: Ad incrementality The auction is a non-random process that decides whether the treatment (an impression) is applied, with bids = impressions + lost auctions. We need to learn it to get an unbiased estimate of the treatment effect, so the model is trained through MLE with IPS.
  • 33. Instrumental variable Under the following model: [Figure: causal graph Z → X → Y] An instrument Z is an observed variable with the following 2 properties: 1) Z is correlated with X 2) Z can only influence Y through X. In practice, we replace 2) with a weaker hypothesis on the regression residual.
  • 34. Instrumental variable Example: does money make people happy? X = wealth, Y = happiness, Z = $ won at the lottery. A reasonable instrument, but: ● it is conditioned on being a gambler (specific demographics) ● Z can affect Y irrespective of $ (the fun of gambling). In practice, finding good instruments for observational data is hard: “Many empirical controversies in economics are essentially disputes about whether or not certain variables constitute valid instruments.” - Davidson & MacKinnon book
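A toy version of the lottery example shows the instrument at work. Everything here is invented for illustration: an unobserved "ambition" trait makes people richer but less happy, the true effect of wealth on happiness is 0.5, and lottery winnings are random by construction, so this simulation deliberately sidesteps the two caveats above. With one instrument and one feature, the IV estimate is cov(Z, Y) / cov(Z, X).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
ambition = rng.normal(size=n)                                   # unobserved confounder
lottery = rng.exponential(scale=1.0, size=n)                    # Z: $ won, random
wealth = lottery + 2.0 * ambition + rng.normal(size=n)          # X
happiness = 0.5 * wealth - 1.0 * ambition + rng.normal(size=n)  # Y, true effect 0.5

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

ols = cov(wealth, happiness) / np.var(wealth, ddof=1)  # correlational answer
iv = cov(lottery, happiness) / cov(lottery, wealth)    # single-instrument IV estimate
print(f"OLS: {ols:.3f}  IV: {iv:.3f}  true: 0.500")
```

OLS lands near 1/6 because ambition pushes wealth up and happiness down; the instrument only moves wealth, so the ratio of covariances isolates the causal slope.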
  • 35. Instrumental variable (IV) Regression ● The idea is that we can debias our model by using the fact that our instruments explain the endogenous features independently from the regression residual ● Bread and butter of econometrics ○ Because we don’t have parallel universes to run AB tests on economic policies ○ Observational data is sometimes all we have Let’s dive into the details since it is less familiar to people with an ML background
  • 36. IV regression From the hypothesis on the instruments, we derive a moment condition on the regression residual, because conditional expectation is an orthogonal projection.* (*: see here, chapter 4.3)
  • 37. GMM for IV regression From there, we can see that inference becomes a minimization problem (see this book for a thorough treatment of GMM). This is called (the functional form of) the Generalized Method of Moments for IV regression. The usual econometrics solution to this problem in the linear case is two-stage least squares (2SLS), which expresses the solution through matrix inversion: β̂ = (Xᵀ P_Z X)⁻¹ Xᵀ P_Z y, with P_Z = Z (ZᵀZ)⁻¹ Zᵀ. This works for small datasets, but breaks (O(n³) complexity, O(n²) storage) for internet-scale data or non-linear models.
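For reference, the linear 2SLS solution can be computed as two least-squares fits, which is algebraically the same as the formula above because P_Z is a symmetric, idempotent projection. This is a dense, in-memory sketch on simulated data (three instruments, true coefficient 2.0, all values hypothetical), i.e. exactly the kind of computation that stops scaling in the sparse, internet-scale setting the slide describes.

```python
import numpy as np

def two_stage_least_squares(X, Z, y):
    """Linear IV via 2SLS: beta = (X' P_Z X)^{-1} X' P_Z y, with P_Z = Z (Z'Z)^{-1} Z'.

    Computed as two least-squares fits instead of forming the n x n matrix P_Z.
    """
    gamma, *_ = np.linalg.lstsq(Z, X, rcond=None)  # stage 1: X on the instruments
    X_hat = Z @ gamma                              # X_hat = P_Z X
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)  # stage 2: y on X_hat
    return beta

rng = np.random.default_rng(5)
n, k = 50_000, 3
Z = rng.normal(size=(n, k))                        # instruments
u = rng.normal(size=n)                             # unobserved confounder
x = Z @ np.array([1.0, 0.5, -0.5]) + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)         # true coefficient 2.0
X = x[:, None]

beta_2sls = two_stage_least_squares(X, Z, y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"OLS: {beta_ols[0]:.3f}  2SLS: {beta_2sls[0]:.3f}  true: 2.000")
```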
  • 38. Scalable GMM for IV regression (joint work with T. Jebara, to be published) We are trying to minimize the GMM objective. We want a solution that scales linearly with: - # of training points - # of non-zero features per row (sparse high-dimensional X) - # of non-zero instruments per row (sparse high-dimensional Z)
  • 39. Scalable GMM for IV regression Idea: pairwise importance-sampling SGD. Rows that matter (large instruments) are more likely to be sampled, so it should converge fast. Joint work with T. Jebara, to be published
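The slides do not spell out the algorithm (it is unpublished joint work), so the block below is only a toy reading of "pairwise importance-sampling SGD" for a one-feature, one-instrument linear model, not the authors' method. It relies on the linear GMM objective L(b) = (mean_i z_i r_i)², with r = y - b·x, expanding into a double sum over pairs (i, j). Pairs are sampled with p_i proportional to |z_i| and reweighted by 1/(n² p_i p_j) so the gradient stays unbiased. The sampling distribution, step size, batch size, Polyak averaging and the function name `gmm_iv_pairwise_sgd` are all assumptions.

```python
import numpy as np

def gmm_iv_pairwise_sgd(x, z, y, steps=2_000, batch=256, lr=0.1, seed=0):
    """Hypothetical 1-D linear IV via SGD on L(b) = (1/n^2) sum_ij z_i z_j r_i r_j.

    Pairs (i, j) are drawn independently with p_i proportional to |z_i|;
    z_i z_j / (n^2 p_i p_j) then simplifies to sign(z_i) sign(z_j) * mean(|z|)^2.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    p = np.abs(z) / np.abs(z).sum()     # favor rows with large instruments
    scale = np.abs(z).mean() ** 2
    s = np.sign(z)
    b, averaged = 0.0, []
    for step in range(steps):
        i = rng.choice(n, size=batch, p=p)
        j = rng.choice(n, size=batch, p=p)
        r_i, r_j = y[i] - b * x[i], y[j] - b * x[j]
        # d/db (r_i r_j) = -(x_i r_j + r_i x_j), importance-weighted
        grad = np.mean(-(x[i] * r_j + r_i * x[j]) * s[i] * s[j]) * scale
        b -= lr * grad
        if step >= steps // 2:
            averaged.append(b)          # Polyak averaging over the second half
    return float(np.mean(averaged))

rng = np.random.default_rng(8)
n = 50_000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
x = z + u + rng.normal(size=n)              # endogenous feature
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true coefficient 2.0

b_closed = np.mean(z * y) / np.mean(z * x)  # exact minimizer of the same objective
b_sgd = gmm_iv_pairwise_sgd(x, z, y)
print(f"closed form: {b_closed:.3f}  pairwise SGD: {b_sgd:.3f}")
```

Each step touches only 2 × batch rows, so the cost per step does not grow with n; with sparse rows the products would only involve non-zero entries.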
  • 40. Scalable GMM for IV regression Bonus: extra variance reduction around the point estimate thanks to importance sampling! (hypothesis: same effect as efficient GMM) Converges faster than known alternatives for non-linear GMM. [Figure: convergence comparison with alternative non-linear GMM solvers] Joint work with T. Jebara, to be published
  • 41. Weak instruments Now that we can run IV regression on problems: - with millions of features - with millions of instruments Should we do it? Problem: when instruments are weak (low correlation with X), the IV estimator is biased … first towards the correlational answer … but then unbounded Causal answer can become worse than correlational one with IV too! Joint work with D. Hubbard
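The first half of the weak-instrument claim (bias towards the correlational answer) shows up in a small simulation. The setup is hypothetical: 400 nearly irrelevant instruments, each with a first-stage coefficient of 0.01, and a true causal coefficient of 1.0. With instruments this weak, the first stage mostly fits noise in X, including its confounded part, so 2SLS lands close to OLS.

```python
import numpy as np

def tsls_1d(x, Z, y):
    """2SLS with one endogenous regressor x and an instrument matrix Z (no intercept)."""
    gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)  # first stage
    x_hat = Z @ gamma                              # x_hat = P_Z x
    return float(x_hat @ y / (x_hat @ x))          # (x' P_Z x)^-1 x' P_Z y

rng = np.random.default_rng(6)
n, k = 4_000, 400
Z = rng.normal(size=(n, k))
u = rng.normal(size=n)                             # unobserved confounder
x = Z @ np.full(k, 0.01) + u + rng.normal(size=n)  # 400 very weak instruments
y = 1.0 * x + 2.0 * u + rng.normal(size=n)         # true causal coefficient 1.0

beta_ols = float(x @ y / (x @ x))   # correlational answer
beta_weak = tsls_1d(x, Z, y)
print(f"OLS: {beta_ols:.2f}  2SLS with weak IVs: {beta_weak:.2f}  truth: 1.00")
```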
  • 42. Weak instruments Simple 1D experiment observed: hours, z, y unobserved: wealth Try to recover: Joint work with D. Hubbard
  • 43. Weak instruments ● Can 100 weaker IVs replace 1 good IV? It depends ● Non-monotonic behavior in the very weak regime :( Joint work with D. Hubbard
  • 44. Weak instruments Controlling for instruments quality is crucial How to do it in a meaningful and scalable way? ● Partial answer from the statistics literature: ○ Partial F-tests ○ Cragg–Donald ○ Anderson-Rubin ○ … ● “Regularize” the instruments ○ Which cross-validation metric to use? Circular problem. No ground truth! ○ See Hausman Causal Correction from Lewis & Wong Joint work with D. Hubbard
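Among the diagnostics listed, the first-stage F-test is the simplest to sketch. The helper below is hypothetical and assumes no extra exogenous controls (with controls it would become a partial F-test). A common rule of thumb treats F below about 10 as a warning sign of weak instruments. The instrument strengths 1.0 and 0.01 are arbitrary.

```python
import numpy as np

def first_stage_f(x, Z):
    """F-statistic of the first-stage regression of x on instruments Z (plus intercept).

    Tests H0: all instrument coefficients are zero.
    """
    n, k = Z.shape
    design = np.column_stack([np.ones(n), Z])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    rss = np.sum((x - design @ coef) ** 2)  # residual sum of squares
    tss = np.sum((x - x.mean()) ** 2)       # intercept-only sum of squares
    return ((tss - rss) / k) / (rss / (n - k - 1))

rng = np.random.default_rng(7)
n = 1_000
z = rng.normal(size=(n, 1))
u = rng.normal(size=n)
F_strong = first_stage_f(1.0 * z[:, 0] + u, z)
F_weak = first_stage_f(0.01 * z[:, 0] + u, z)
print(f"strong: F = {F_strong:.1f}   weak: F = {F_weak:.2f}")
```

This only scores instrument strength; it does not test the exclusion restriction, which has no ground truth in observational data.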
  • 45. IPS-MLE vs IV-GMM Both are unbiased and consistent when there are no unobserved confounders. Typical estimates have higher variance than their correlational counterparts. IPS-MLE: ● familiar to ML people ● more flexible on model class ● easy to scale ● fewer theoretical guarantees ● not robust to unobserved confounders ● bias and variance come from the IPS weights. IV-GMM: ● familiar to econ people ● mostly Gaussian residuals ● harder to scale ● more theoretical guarantees ● robust to unobserved confounders ● bias and variance come from IV strength. In both cases, there is no built-in fallback to the correlational answer if randomization is poor (other methods exist! Ex: propensity matching)
  • 47. Applications Plenty of use cases for causal inference at Netflix ● Advertising ● Causal recommendations ● Content valuation ● Increased experimentation power ● ...
  • 48. Causal inference in practice Hard! Because: ● causal effects are small ● asymptotic unbiasedness is useless if the variance dominates, even on large datasets ● variance grows even more when there is sequential treatment ● unobserved confounders can have a bigger magnitude than what we try to measure ● plenty of unsatisfactory / unanswered questions in the literature ● no clean ground truth ● all estimators have their flaws ● it is hard (impossible) to measure and compare biases offline on large-scale problems
  • 49. When it matters Correlational models are fine... ● When we only care about fitting the data / predicting ● When your model predictions won’t interact with the product Causal models can help... ● When there’s a “why?” ○ “why did NFLX stock price move today?” ● When there’s a “what would happen if?” ○ “what would happen to streaming if the iOS app was 10% slower?” ● To build cost-efficient ML algorithms ○ incremental models factor in the effect of taking the suggested action ○ aligned with business metric lift: maximize the likelihood of a green AB test ○ … it’s just a greedy one-step-ahead Reinforcement Learning strategy