Over the years, the Netflix UI has evolved from a sparse and static webpage into an immersive, video-centric experience tailored to a variety of platforms. In this talk, I’ll describe the simple but powerful framework that Netflix uses to evolve the product experience: we ask our members, through online A/B tests, which of several possible experiences resonate with them. I’ll also describe the steps we are taking to democratize access to experimentation across the company so that we can explore more ideas and identify those that deliver more value to our members.
5. How do we make the right decisions to evolve the product?
● Ask a HiPPO (the Highest Paid Person's Opinion)?
● Group debate?
● Copy the competition?
● Hire some experts?
Answer: none of the above. We A/B test every idea.
6. What is an A/B Test?
Netflix members are split between Version ‘A’ (Control) and Version ‘B’ (Test), and we compare the business outcome.
8. Simply put, experimentation enables better decisions about how to evolve the Netflix service to deliver more joy to our members.
9. Democratization: A/B experimentation scales. Many members “vote” on proposed experiences.
● One: HiPPO and internal experts
● Hundreds: qualitative customer research
● Tens of thousands to millions: A/B experimentation
10. What we try. How we decide. How we scale.
Democratization 1, how we decide: our members vote with their actions on how to evolve the product to deliver more joy!
11. Why not just roll out a new feature and see what happens?
12. Product A: standard box art. Product B: upside-down box art.
Hypothesis: upside-down box art will increase engagement.
Outcome metric: streaming hours.
13. Launch Product B and observe.
On December 21, we launched the new version of our product. Streaming hours spiked!
14. Based on the results, would you roll out this UI?
15–16. It turns out that Bird Box also launched on December 21. Now we don’t know WHY streaming hours spiked.
Correlation tempts us to infer causality.
21. What we try. How we decide. How we scale.
Democratization 2, what we try: because we test, and because most ideas are not winners, product innovation ideas can come from anywhere.
24. Convert this idea into a hypothesis
Hypothesis format: if we make change X (the Action), it will affect member behavior (the Impact) in a way that makes metric Y improve (the Metrics).
25. Hypothesis for this experiment: presenting a row of short previews will increase awareness of these titles and make it easier for members to find something to watch, increasing engagement.
26. Confirm that the experiment is worth running
● Would you release (or not release) the feature regardless of A/B test results?
● Are the potential results meaningful to our business?
● Is it even possible to validate the hypothesis through an experiment?
● Is there a well-defined causal relationship?
27. Run the experiment
Draw a random sample from Netflix members (the “target population”), allocate them between Version ‘A’ (Control) and Version ‘B’ (Test), and compare the business outcome through statistical analysis of metrics.
*Hold everything else constant!
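In practice, the random split is often implemented as deterministic bucketing, so a given member always lands in the same cell for a given experiment. A minimal sketch in Python; the member IDs, experiment name, and 50/50 split are illustrative assumptions, not Netflix's actual allocation service:

```python
import hashlib

def assign_variant(member_id: str, experiment_id: str,
                   variants=("control", "test")) -> str:
    # Hash member + experiment together so assignment is stable per
    # member and independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{member_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Over many members the split should come out close to 50/50.
counts = {"control": 0, "test": 0}
for i in range(10_000):
    counts[assign_variant(f"member-{i}", "previews-row-v1")] += 1
```

Hashing rather than storing a random draw means the allocation needs no lookup table and can be recomputed anywhere in the stack.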
28. Analyze the results
Comparing the behavior of members who saw the new ‘Previews’ experience with those who saw the incumbent, we found:
● Primary metric: no statistically significant difference
● Secondary metric 1: no statistically significant difference
● Secondary metric 2: statistically significant improvement
31. Two ways to be correct: a True Positive (we say “I am a cat”, and it is a cat) and a True Negative (we say “I am not a cat”, and it is not).
32. Two ways to make an error: a False Positive (we say “I am a cat”, but it is not) and a False Negative (we say “I am not a cat”, but it is).
33. Four possible outcomes: True Positive, True Negative, False Positive, False Negative.
34. Suppose we saw a 1% increase in our primary metric in our A/B test. This result is still uncertain: it could be a false positive (also called a “Type I error”), where we think there is an effect due to the experiment, but there isn’t.
35. Suppose we saw no change to our primary metric in our A/B test. This result is also uncertain: it could be a false negative (also called a “Type II error”), where we don’t think there is an effect due to the experiment, but there is.
36. Three things can impact this uncertainty
1. Effect size: the difference in the metric value between the control cell (Version A) and the test cell (Version B). A larger difference leads to easier, more reliable detection.
37. 2. Sample size: the number of members in each test cell. A larger sample leads to easier, more reliable detection.
38. 3. Variance: how disparate (vs. consistent) the metric is among the populations participating in the experiment (e.g., heavy vs. medium vs. light streamers). A smaller variance leads to easier, more reliable detection.
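A quick simulation illustrates the variance knob: with the same true effect and the same sample size, a noisier metric is detected far less often. This is a sketch using a simple two-sided z-test; the effect size, standard deviations, and cell sizes are made up for illustration:

```python
import random
from statistics import NormalDist, mean, stdev

def detected(effect: float, sd: float, n: int = 200) -> bool:
    # One simulated experiment: does a two-sided z-test at
    # alpha = 0.05 detect the (real) difference of `effect`?
    a = [random.gauss(100.0, sd) for _ in range(n)]
    b = [random.gauss(100.0 + effect, sd) for _ in range(n)]
    se = (stdev(a) ** 2 / n + stdev(b) ** 2 / n) ** 0.5
    z = abs(mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(z)) < 0.05

random.seed(7)
# Same effect (+2.0) and sample size; only the metric's variance differs.
low_var = sum(detected(2.0, sd=5.0) for _ in range(500)) / 500
high_var = sum(detected(2.0, sd=20.0) for _ in range(500)) / 500
# low_var lands close to 1.0, high_var well below it.
```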
39. We work with these knobs to reduce the occurrence of false negatives.
We design experiments to have sufficient “statistical power”, so that if there is indeed a difference, we can detect it most of the time.
(A false negative, or Type II error: we don’t think there’s an effect due to the experiment, but there is.)
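The standard power calculation ties the three knobs together: the sample size needed per cell grows with the metric's variance and shrinks with the square of the effect size. A sketch using the usual two-sample z-test approximation; the formula is textbook-standard, and the example numbers are assumptions:

```python
import math
from statistics import NormalDist

def sample_size_per_cell(effect: float, sd: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    # Members per cell needed to detect an absolute difference `effect`
    # in a metric with standard deviation `sd`, holding the
    # false-positive rate at alpha and the power at the given level.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Halving the detectable effect roughly quadruples the required sample.
n_large_effect = sample_size_per_cell(effect=0.7, sd=10.0)
n_small_effect = sample_size_per_cell(effect=0.35, sd=10.0)
```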
40. And we choose a tolerance level for false positives.
We measure “statistical significance”, and we accept that in a few cases (generally 5%), statistically significant results are just noise.
(A false positive, or Type I error: we think there’s an effect due to the experiment, but there isn’t.)
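One way to see where the “generally 5%” comes from: in a simulated “A/A test”, where Control and Test are drawn from the same distribution, roughly 5% of runs still come out statistically significant at a 0.05 threshold. A sketch; the metric distribution and cell sizes are arbitrary:

```python
import random
from statistics import NormalDist, mean, stdev

def two_sample_p_value(a, b) -> float:
    # Two-sided p-value from a large-sample two-sample z-test.
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
trials, false_positives = 1_000, 0
for _ in range(trials):
    # A/A test: both cells come from the same distribution, so any
    # "significant" result is noise by construction.
    a = [random.gauss(100.0, 10.0) for _ in range(500)]
    b = [random.gauss(100.0, 10.0) for _ in range(500)]
    if two_sample_p_value(a, b) < 0.05:
        false_positives += 1
# false_positives / trials lands near 0.05
```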
41. How to interpret “statistical significance”

          Version A (Control)   Version B (Test)
Metric    87.2                  87.9
p-value   N/A                   0.026

The observed effect is a 0.7 increase in this metric. There is only a 2.6% chance of observing an effect at least this large if Control and Test were the same*.
Since the p-value is less than 0.05 (a pre-chosen threshold for false positives), the result is “statistically significant”. We conclude that the test treatment is the reason for the observed 0.7 increase in the metric.
*That is, in an “A/A test”, under the “null hypothesis”.
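For illustration, a p-value like the slide's can be reproduced from per-cell summary statistics with a two-sample z-test. Only the means 87.2 and 87.9 come from the slide; the standard deviation and cell sizes below are hypothetical values chosen so the arithmetic lands near p = 0.026:

```python
from statistics import NormalDist

# Hypothetical per-cell summaries; only the means come from the slide.
n_a, mean_a, sd_a = 5_000, 87.2, 15.7
n_b, mean_b, sd_b = 5_000, 87.9, 15.7

se = (sd_a ** 2 / n_a + sd_b ** 2 / n_b) ** 0.5  # std. error of the difference
z = (mean_b - mean_a) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided p-value, ~0.026
```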
43. But there is still no way to know whether the result of a specific experiment is a false positive / negative.
44. Hence, we also need to interpret results with strong judgment to decide on a “win”:
● Do the results align with the hypothesis?
● Does the metric story hang together?
● Is there other supporting or refuting evidence, such as patterns across similar test cells?
● Do the results repeat?
46. Test results are expected for most decisions, and we’ve invested in an internal platform to support our experimentation program. Democratize access and contributions.
47. Three pillars
1. Trustworthiness
2. Inclusivity
3. Scalability
An interdisciplinary collaboration:
● Engineering: software, data, UI.
● Data science / statistics.
● Computation.
● Product design and product management.
49. Scaling scientists: building a modular, democratized platform.
● Users can contribute to each module (R, Python); engineering concerns are abstracted away.
● Many possible analysis flows.
● Results are surfaced in notebooks and in our UI.
51. Scaling decision makers
1. An efficient, science-centric platform pushes automation deeper into the workflows of test analysts, freeing more time for creative problem solving, exploratory work to generate hypotheses, and research.
52. Scaling decision makers
2. UI solutions: accessible and intuitive presentations of results that permit confident decision making.
54. What we try. How we decide. How we scale.
Democratization 3, how we scale: our platform-level investments allow scientists to contribute directly, and empower a variety of decision makers.
55. Decision making and experimentation at Netflix: democratization, three ways.
56.
● Democratization 1, how we decide: our members vote with their actions on how to evolve the product to deliver more joy!
● Democratization 2, what we try: because we test, and because most ideas are not winners, product innovation ideas can come from anywhere.
● Democratization 3, how we scale: our platform-level investments allow scientists to contribute directly, and empower a variety of decision makers.