In this talk, we present a general multi-armed bandit framework for recommendations on the Netflix homepage. We present two example case studies using MABs at Netflix - a) Artwork Personalization to recommend personalized visuals for each of our members for the different titles and b) Billboard recommendation to recommend the right title to be watched on the Billboard.
8. Case Study I: Artwork Optimization
Goal: Recommend a personalized
artwork or imagery for a title to help
members decide if they will enjoy the
title or not.
9. Case Study II: Billboard Recommendation
Goal: Successfully introduce content
to the right members.
10. Traditional Approaches for
Recommendation
Collaborative Filtering
● Idea is to use the “wisdom of the
crowd” to recommend items
● Well understood and various
algorithms exist (e.g. Matrix
Factorization)
Collaborative Filtering
0 1 0 1 0
0 0 1 1 0
1 0 0 1 1
0 1 0 0 0
0 0 0 0 1
Users
Items
11. Challenges for Traditional Approaches
● Scarce feedback
● Dynamic catalog
● Country availability
● Non-stationary member base
● Time sensitivity
○ Content popularity changes
○ Member interests evolves
○ Respond quickly to member feedback
12. Challenges for Traditional Approaches
Continuous and fast
learning needed
● Scarce feedback
● Dynamic catalog
● Country availability
● Non-stationary member base
● Time sensitivity
○ Content popularity changes
○ Member interests evolves
○ Respond quickly to member feedback
13. Multi-Armed Bandits
Increasingly successful in various practical settings where these challenges occur
Clinical Trials Network Routing
Online Advertising
AI for Games Hyperparameter Optimization
14. Multi-Armed Bandits
● A gambler playing multiple slot machines with
unknown reward distribution
● Which machine to play to maximize reward?
15. Multi-Armed Bandit For Recommendation
Exploration-Exploitation tradeoff :
Recommend the optimal title given the evidence i.e. exploit
OR
Recommend other titles to gather feedback i.e. explore.
16. Numerous Variants
● Different Strategies: ε-Greedy, Thompson Sampling (TS), Upper Confidence
Bound (UCB), etc.
● Different Environments:
○ Stochastic and stationary: Reward is generated i.i.d. from a distribution
specific to the action. No payoff drift.
○ Adversarial: No assumptions on how rewards are generated.
● Different objectives: Cumulative regret, tracking the best expert
● Continuous or discrete set of actions, finite vs infinite
● Extensions: Varying set of arms, Contextual Bandits, etc.
18. Bandit Algorithms Setting
For each (user, show) request:
● Actions: set of candidate images available
● Reward: how many minutes did the user play from that impression
● Environment: Netflix homepage in user’s device
● Learner: its goal is to maximize the cumulative reward after N requests
Learner Environment
Action
Reward
Context
19. Specific challenges
● Play attribution and reward assignment
○ Incremental effect of the image on top of recommender system
● Only one image per title can be presented
○ Although inherently it is a ranking problem
Would you play because the movie is recommended or because of the artwork? Or both?
20. Specific challenges
● Change effect
○ Can changing images too often make users confused?
Session 1 Session 2 Session 3 ... Session N
Sequence A
Sequence B
21. ● We have control over the set of actions
○ How many images per show
○ Image design
● What makes a good asset?
○ Representative (no clickbait)
○ Differential
○ Informative
○ Engaging
Actions
Personal (i.e. contextual)
22. Intuition for Personalized Assets
● Emphasize themes through different artwork according to some
context (user, viewing history, country, etc.)
Preferences in genre
23. Intuition for Personalized Assets
● Emphasize themes through different artwork according to some
context (user, viewing history, country, etc)
Preferences in cast members
24. Epsilon Greedy for MABs
● Unbiased
training data
● Like AB test
across actions
● Greedy
● Select optimal
action
Explore
ε 1-ε
Exploit
25. ● Learn a binary classifier per image to predict probability of play
● Pick the winner (arg max)
Member
(context)
Features
Image Pool
Model 1
Winner
arg
max
Model 2
Model 3
Model 4
Greedy Exploit Policy
26. Take Fraction Example: Luke Cage
Take Fraction = 1 / 3
Play
No play
User A
User B
User C
27. ● Unbiased offline evaluation from explore data
Offline metric: Replay [Li et al, 2010]
Offline Take Fraction = 2 / 3
User 1 User 2 User 3 User 4 User 5 User 6
Random Assignment
Play?
Model Assignment
28. Offline Replay
● Context matters
● Artwork diversity matters
● Personalization wiggles
around most popular images
Lift in Replay in the various algorithms as
compared to the Random baseline
29. Online results
● Rollout to our >125M member base
● Most beneficial for less known titles
● Compression from title -level offline metrics due to cannibalization
between titles
31. Considerations for the greedy policy
● Explore
○ Bandwidth allocation and cost of exploration
○ New vs existing titles
● Exploit
○ Model synchronisation
○ Title availability
○ Frequency of model update
○ Incremental updates vs batch training
■ Stationarity of title popularities
?
?
?
? ??
?
34. Netflix Promotions
Netflix homepage is an expensive real-estate (opportunity cost):
- so many titles to promote
- so few opportunities to win a “moment of truth”
D1 D2 D3 D4 D5
Promote?▶ ▶ ▶ ▶
Probability of
Play
Days
35. Netflix Promotions
Netflix homepage is an expensive real-estate (opportunity cost):
- so many titles to promote
- so few opportunities to win a “moment of truth”
Traditional (correlational) ML systems:
- take action if probability of positive reward is high, irrespective of reward
base rate
- don’t model incremental effect of taking action
D1 D2 D3 D4 D5
Promote?▶ ▶ ▶ ▶
Probability of
Play
Days
36. Incrementality from Advertising
● Goal: Measure ad effectiveness.
● Incrementality: The difference
in the outcome because the ad
was shown; the causal effect of
the ad.
$1.1M
$1.0M
$100k
Other
Advertisers’
Ads
Control Treatment
Revenue
Random Assignment*
*Johnson, Garrett A. and Lewis, Randall A. and Nubbemeyer, Elmar I, Ghost Ads: Improving the Economics of Measuring Online Ad Effectiveness (January 12, 2017).
Simon Business School Working Paper No. FR 15-21. Available at SSRN: https://ssrn.com/abstract=2620078
37. Incrementality Based Policy
● Goal: Select title for promotion that benefits most from being
shown in billboard
○ Member can play title from other sections on the homepage or search
○ Popular titles likely to appear on homepage anyway: Trending Now
○ Better utilize most expensive real-estate on the homepage!
● Define policy to be incremental with respect to probability of play
38. Incrementality Based Policy on Billboard
● Goal: Recommend title which has the largest additional benefit from
being presented on the Billboard
○ Recommend titles with argmax of
39. Which titles benefit from Billboard?
Title A benefits much more
than Title C by being shown
on the Billboard
Scatter plot of incremental vs baseline probability of
play for various members.
40. Offline & Online Results
● Incrementality based policy
sacrifices replay by selecting a
lesser known title that would
benefit from being shown on the
Billboard.
● Our implementation of
incrementality is able to shift
engagement within the candidate
pool.
Lift in Replay in the various algorithms as
compared to a random baseline
42. Action selection orchestration
● Neighboring image selection influences result
● Title-level optimization is not enough
Row A
(diverse
images)
Row B
(the
microphone
row)
Stand-up comedy
43. Automatic image selection
● Generating new artwork is costly and time consuming
● Develop algorithm to predict asset quality from raw image
Raw image Box-art
44. Long-term Reward: Road to RL
● Maximize long term reward: reinforcement learning
○ User long term joy rather than play clicks or duration.