7. Intuition for Personalized Assets
● Emphasize themes through different artwork according to some
context (user, viewing history, country, etc.)
Preferences in genre
8. Intuition for Personalized Assets
● Emphasize themes through different artwork according to some
context (user, viewing history, country, etc.)
Preferences in cast members
9. Bandit Algorithms Setting
For each (user, show) request:
● Actions: set of candidate images available
● Reward: minutes the user played from that impression
● Environment: Netflix homepage on the user’s device
● Learner: its goal is to maximize the cumulative reward after N requests
[Diagram: Learner and Environment loop, with Action, Reward, and Context as the labeled arrows]
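The request loop described on this slide can be sketched as a small simulation. This is a minimal illustration, not the production system: `RandomLearner`, `simulate_minutes_played`, the three theme names, and the play probabilities are all hypothetical stand-ins chosen for the example.

```python
import random


class RandomLearner:
    """Baseline learner: picks a candidate image uniformly at random."""

    def __init__(self, rng):
        self.rng = rng

    def select(self, context, actions):
        return self.rng.choice(actions)

    def update(self, context, action, reward):
        # A real learner would update its reward estimates here.
        pass


def simulate_minutes_played(context, action, rng):
    """Hypothetical environment: minutes played after one impression."""
    # Toy assumption: a play is more likely when the image theme
    # matches the member's favorite theme.
    match = context["favorite_theme"] == action
    played = rng.random() < (0.3 if match else 0.1)
    return rng.uniform(5.0, 60.0) if played else 0.0


def run(learner, n_requests, seed=0):
    """Run n_requests (user, show) requests and return cumulative reward."""
    rng = random.Random(seed)
    actions = ["comedy", "romance", "action"]  # candidate images (made up)
    cumulative = 0.0
    for _ in range(n_requests):
        context = {"favorite_theme": rng.choice(actions)}
        action = learner.select(context, actions)
        reward = simulate_minutes_played(context, action, rng)
        learner.update(context, action, reward)
        cumulative += reward
    return cumulative


total_minutes = run(RandomLearner(random.Random(1)), n_requests=1000)
```

Any learner exposing the same `select` and `update` methods can be dropped into `run`, which makes the random policy a natural baseline for comparison.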
10. Numerous Variants
● Different Strategies: ε-Greedy, Thompson Sampling (TS), Upper Confidence
Bound (UCB), etc.
● Different Environments:
○ Stochastic and stationary: Reward is generated i.i.d. from a distribution
specific to the action. No payoff drift.
○ Adversarial: No assumptions on how rewards are generated.
● Different objectives: Cumulative regret, tracking the best expert
● Different action sets: continuous or discrete, finite or infinite
● Extensions: Varying set of arms, Contextual Bandits, etc.
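For the stochastic, stationary case, the three strategies named above can be sketched on a Bernoulli bandit (reward 1 for a play, 0 otherwise). This is a textbook-style illustration under stated assumptions: the Beta(1, 1) prior, the UCB1 exploration bonus, `epsilon=0.1`, and the `run` helper are choices made for this example, not details taken from the talk.

```python
import math
import random


def epsilon_greedy(counts, successes, rng, epsilon=0.1):
    """Explore uniformly with probability epsilon, else exploit best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    means = [s / c if c else 0.0 for s, c in zip(successes, counts)]
    return max(range(len(counts)), key=means.__getitem__)


def thompson_sampling(counts, successes, rng):
    """Sample each arm's play rate from its Beta posterior, pick the max."""
    samples = [
        rng.betavariate(1 + s, 1 + c - s)  # Beta(1, 1) prior (assumption)
        for s, c in zip(successes, counts)
    ]
    return max(range(len(counts)), key=samples.__getitem__)


def ucb1(counts, successes, rng):
    """Pick the arm with the highest upper confidence bound (UCB1)."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # play every arm once before using the bound
    total = sum(counts)
    scores = [
        s / c + math.sqrt(2.0 * math.log(total) / c)
        for s, c in zip(successes, counts)
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run(strategy, play_probs, n_rounds, seed=0):
    """Simulate a stationary Bernoulli bandit; return per-arm statistics."""
    rng = random.Random(seed)
    counts = [0] * len(play_probs)
    successes = [0] * len(play_probs)
    for _ in range(n_rounds):
        arm = strategy(counts, successes, rng)
        reward = 1 if rng.random() < play_probs[arm] else 0
        counts[arm] += 1
        successes[arm] += reward
    return counts, successes


results = {
    strategy.__name__: run(strategy, play_probs=[0.1, 0.5], n_rounds=2000)
    for strategy in (epsilon_greedy, thompson_sampling, ucb1)
}
```

All three share the same signature so they can be swapped inside `run`; only the rule for trading off exploration against exploitation differs.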
11. Specific challenges
● Play attribution and reward assignment
○ Incremental effect of the image on top of recommender system
● Only one image per title can be presented
○ Although inherently it is a ranking problem
Would you play because the movie is recommended or because of the artwork? Or both?
12. Specific challenges
● Change effect
○ Can changing images too often make users confused?
[Figure: image sequences A and B shown across Session 1, Session 2, Session 3, ..., Session N]
13. Actions
● We have control over the set of actions
○ How many images per show
○ Image design
● What makes a good asset?
○ Representative (no clickbait)
○ Differential
○ Informative
○ Engaging
Personal (i.e. contextual)
15. Greedy Policy Example
● Learn a binary classifier per image to predict probability of play
● Pick the winner (arg max)
[Diagram: member (context) features feed Model 1 to Model 4, one per image in the image pool; arg max over the model outputs selects the winner]
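A greedy policy of this shape can be sketched with one logistic model per image. The slide does not say which classifier is used, so logistic regression is an assumption here, and the image names, weights, and two-feature member vectors are invented for illustration.

```python
import math


def predict_play_probability(weights, bias, features):
    """Logistic model: estimated probability that the member plays."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def greedy_pick(models, features):
    """Score every image with its own model and return the arg max."""
    scores = {
        image: predict_play_probability(m["weights"], m["bias"], features)
        for image, m in models.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical per-image models over two member features:
# [affinity for comedy, affinity for romance].
models = {
    "image_comedy": {"weights": [2.0, -1.0], "bias": -1.0},
    "image_romance": {"weights": [-1.0, 2.0], "bias": -1.0},
}

comedy_fan = [1.0, 0.0]
romance_fan = [0.0, 1.0]

winner_for_comedy_fan, _ = greedy_pick(models, comedy_fan)
winner_for_romance_fan, _ = greedy_pick(models, romance_fan)
```

With these made-up weights the comedy fan is shown `image_comedy` and the romance fan `image_romance`. A purely greedy policy never explores, so in practice it would need to be paired with some exploration to keep collecting data on the other images.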
16. Take Fraction Example: Luke Cage
Take Fraction = 1 / 3
[Figure: Users A, B, and C each see the title; each impression is marked Play or No play, and one of the three plays]
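Take fraction is the share of impressions that led to a play. A minimal sketch follows; the function name and the choice of which of the three users played are illustrative assumptions, since only the 1 / 3 total is given.

```python
def take_fraction(played_flags):
    """Fraction of impressions that resulted in a play."""
    if not played_flags:
        raise ValueError("take fraction is undefined with no impressions")
    return sum(played_flags) / len(played_flags)


# Three impressions, one play (which user played is assumed here).
impressions = {"User A": True, "User B": False, "User C": False}
luke_cage_take_fraction = take_fraction(list(impressions.values()))
```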
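17. Offline metric: Replay [Li et al., 2010]
● Unbiased offline evaluation from explore data
Offline Take Fraction = 2 / 3
[Figure: for Users 1 to 6, the randomly assigned image, whether the user played, and the image the model would assign; only users where the model assignment matches the random assignment count toward the offline take fraction]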
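The Replay idea can be sketched as follows: keep only the logged impressions where the model would have shown the same image that was randomly assigned, then compute take fraction on those. The log format, the image labels, and the six log rows are hypothetical, arranged so that three rows match and two of them are plays. The unbiasedness argument assumes the logged images were assigned uniformly at random.

```python
def replay_take_fraction(logs, policy):
    """Offline take fraction over impressions where policy matches the log.

    logs: iterable of (context, logged_image, played) tuples collected
          under random image assignment.
    policy: function mapping a context to the image the model would show.
    Returns None when no logged impression matches the policy.
    """
    matched = 0
    plays = 0
    for context, logged_image, played in logs:
        if policy(context) == logged_image:
            matched += 1
            plays += int(played)
    return plays / matched if matched else None


# Hypothetical explore data for six users: (context, random image, played).
logs = [
    ({"user": 1}, "A", True),
    ({"user": 2}, "B", False),
    ({"user": 3}, "A", False),
    ({"user": 4}, "B", True),
    ({"user": 5}, "A", True),
    ({"user": 6}, "B", True),
]

# Hypothetical model assignment per user.
model_assignment = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}


def model_policy(context):
    return model_assignment[context["user"]]


offline_take_fraction = replay_take_fraction(logs, model_policy)
```

Here the model agrees with the log for users 1, 3, and 6, of whom two played, so the offline take fraction is 2 / 3. Non-matching rows are discarded, which is why Replay needs a large volume of explore data.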
18. Offline Replay
● Context matters
● Artwork diversity matters
● Personalization wiggles around the most popular images
[Figure: lift in Replay for the various algorithms compared to the Random baseline]
19. Online results
● Rollout to our >130M member base
● Most beneficial for lesser-known titles
● Compression from title-level offline metrics due to cannibalization between titles
21. Action selection orchestration
● Neighboring image selection influences result
● Title-level optimization is not enough
[Figure: two stand-up comedy rows. Row A: diverse images. Row B: "the microphone row".]
22. Automatic image selection
● Generating new artwork is costly and time-consuming
● Develop an algorithm to predict asset quality from the raw image
23. Long-term Reward: Road to RL
● Maximize long-term reward: reinforcement learning
○ Long-term user joy rather than plays