Over the years, the Netflix UI has evolved from a sparse and static webpage into an immersive, video-centric experience tailored to a variety of platforms. In this talk, I’ll describe the simple but powerful framework that Netflix uses to evolve the product experience: we ask our members, through online A/B tests, which of several possible experiences resonate with them. I’ll also describe the steps we are taking to democratize access to experimentation across the company so that we can explore more ideas and identify those that deliver more value to our members.
5. How do we make the right decisions to evolve the product?
● Ask a HiPPO (the Highest Paid Person's Opinion)?
● Group debate?
● Copy the competition?
● Hire some experts?
Answer: none of the above. We A/B test every idea.
6. What is an A/B Test?
Netflix members are split between Version ‘A’ (Control) and Version ‘B’ (Test), and we compare the business outcome.
8. Simply put, experimentation enables better decisions about how to evolve the Netflix service to deliver more joy to our members.
9. Democratization: A/B experimentation scales. Many members “vote” on proposed experiences.
● One: HiPPO and internal experts
● Hundreds: qualitative customer research
● Tens of thousands to millions: A/B experimentation
10. What we try. How we decide. How we scale.
Democratization 1, how we decide: our members vote with their actions on how to evolve the product to deliver more joy!
11. Why not just roll out a new feature and see what happens?
12. Product A: standard box art. Product B: upside-down box art.
Hypothesis: upside-down box art will increase engagement.
Outcome metric: streaming hours.
13. Launch Product B and observe.
On December 21, we launched the new version of our product. Streaming hours spiked!
14. Based on the results, would you roll out this UI?
15–16. It turns out that Bird Box also launched on December 21. Now we don’t know WHY streaming hours spiked.
Correlation tempts us to infer causality.
21. What we try. How we decide. How we scale.
Democratization 2, what we try: because we test, and because most ideas are not winners, product innovation ideas can come from anywhere.
24. Convert this idea into a hypothesis
Hypothesis format: if we make change X (the Action), it will affect member behavior (the Impact) in a way that makes metric Y improve (the Metrics).
25. Hypothesis for this experiment: presenting a row of short previews will increase awareness of these titles and make it easier for members to find something to watch, increasing engagement.
26. Confirm that the experiment is worth running
● Would you release (or not release) the feature regardless of A/B test results?
● Are the potential results meaningful to our business?
● Is it even possible to validate the hypothesis through an experiment?
● Is there a well-defined causal relationship?
27. Run the experiment
Draw a random sample from Netflix members (the “target population”), allocate them between Version ‘A’ (Control) and Version ‘B’ (Test), and compare the business outcome through statistical analysis of metrics.
*Hold everything else constant!
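In practice, the random split is often implemented as deterministic bucketing, so a given member always lands in the same cell for a given experiment. A minimal sketch in Python; the member IDs, experiment name, and 50/50 split are illustrative assumptions, not Netflix's actual allocation service:

```python
import hashlib

def assign_variant(member_id: str, experiment_id: str,
                   variants=("control", "test")) -> str:
    # Hash member + experiment together so assignment is stable per
    # member and independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{member_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Over many members the split should come out close to 50/50.
counts = {"control": 0, "test": 0}
for i in range(10_000):
    counts[assign_variant(f"member-{i}", "previews-row-v1")] += 1
```

Hashing rather than storing a random draw means the allocation needs no lookup table and can be recomputed anywhere in the stack.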
28. Analyze the results
Comparing the behavior of members who saw the new ‘Previews’ experience with those who saw the incumbent, we found:
● Primary metric: no statistically significant difference
● Secondary metric 1: no statistically significant difference
● Secondary metric 2: statistically significant improvement
31. Two ways to be correct: a True Positive (we say “I am a cat”, and it is a cat) and a True Negative (we say “I am not a cat”, and it is not).
32. Two ways to make an error: a False Positive (we say “I am a cat”, but it is not) and a False Negative (we say “I am not a cat”, but it is).
33. Four possible outcomes: True Positive, True Negative, False Positive, False Negative.
34. Suppose we saw a 1% increase in our primary metric in our A/B test. This result is still uncertain: it could be a false positive (also called a “Type I error”), where we think there is an effect due to the experiment, but there isn’t.
35. Suppose we saw no change to our primary metric in our A/B test. This result is also uncertain: it could be a false negative (also called a “Type II error”), where we don’t think there is an effect due to the experiment, but there is.
36. Three things can impact this uncertainty
1. Effect size: the difference in the metric value between the control cell (Version A) and the test cell (Version B). A larger difference leads to easier, more reliable detection.
37. 2. Sample size: the number of members in each test cell. A larger sample leads to easier, more reliable detection.
38. 3. Variance: how disparate (vs. consistent) the metric is among the populations participating in the experiment (e.g., heavy vs. medium vs. light streamers). A smaller variance leads to easier, more reliable detection.
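A quick simulation illustrates the variance knob: with the same true effect and the same sample size, a noisier metric is detected far less often. This is a sketch using a simple two-sided z-test; the effect size, standard deviations, and cell sizes are made up for illustration:

```python
import random
from statistics import NormalDist, mean, stdev

def detected(effect: float, sd: float, n: int = 200) -> bool:
    # One simulated experiment: does a two-sided z-test at
    # alpha = 0.05 detect the (real) difference of `effect`?
    a = [random.gauss(100.0, sd) for _ in range(n)]
    b = [random.gauss(100.0 + effect, sd) for _ in range(n)]
    se = (stdev(a) ** 2 / n + stdev(b) ** 2 / n) ** 0.5
    z = abs(mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(z)) < 0.05

random.seed(7)
# Same effect (+2.0) and sample size; only the metric's variance differs.
low_var = sum(detected(2.0, sd=5.0) for _ in range(500)) / 500
high_var = sum(detected(2.0, sd=20.0) for _ in range(500)) / 500
# low_var lands close to 1.0, high_var well below it.
```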
39. We work with these knobs to reduce the occurrence of false negatives.
We design experiments to have sufficient “statistical power”, so that if there is indeed a difference, we can detect it most of the time.
(A false negative, or Type II error: we don’t think there’s an effect due to the experiment, but there is.)
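The standard power calculation ties the three knobs together: the sample size needed per cell grows with the metric's variance and shrinks with the square of the effect size. A sketch using the usual two-sample z-test approximation; the formula is textbook-standard, and the example numbers are assumptions:

```python
import math
from statistics import NormalDist

def sample_size_per_cell(effect: float, sd: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    # Members per cell needed to detect an absolute difference `effect`
    # in a metric with standard deviation `sd`, holding the
    # false-positive rate at alpha and the power at the given level.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Halving the detectable effect roughly quadruples the required sample.
n_large_effect = sample_size_per_cell(effect=0.7, sd=10.0)
n_small_effect = sample_size_per_cell(effect=0.35, sd=10.0)
```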
40. And we choose a tolerance level for false positives.
We measure “statistical significance”, and we accept that in a few cases (generally 5%), statistically significant results are just noise.
(A false positive, or Type I error: we think there’s an effect due to the experiment, but there isn’t.)
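One way to see where the “generally 5%” comes from: in a simulated “A/A test”, where Control and Test are drawn from the same distribution, roughly 5% of runs still come out statistically significant at a 0.05 threshold. A sketch; the metric distribution and cell sizes are arbitrary:

```python
import random
from statistics import NormalDist, mean, stdev

def two_sample_p_value(a, b) -> float:
    # Two-sided p-value from a large-sample two-sample z-test.
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
trials, false_positives = 1_000, 0
for _ in range(trials):
    # A/A test: both cells come from the same distribution, so any
    # "significant" result is noise by construction.
    a = [random.gauss(100.0, 10.0) for _ in range(500)]
    b = [random.gauss(100.0, 10.0) for _ in range(500)]
    if two_sample_p_value(a, b) < 0.05:
        false_positives += 1
# false_positives / trials lands near 0.05
```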
41. How to interpret “statistical significance”

          Version A (Control)   Version B (Test)
Metric    87.2                  87.9
p-value   N/A                   0.026

The observed effect is a 0.7 increase in this metric. There is only a 2.6% chance of observing an effect at least this large if Control and Test were the same*.
Since the p-value is less than 0.05 (a pre-chosen threshold for false positives), the result is “statistically significant”. We conclude that the test treatment is the reason for the observed 0.7 increase in the metric.
*That is, in an “A/A test”, under the “null hypothesis”.
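For illustration, a p-value like the slide's can be reproduced from per-cell summary statistics with a two-sample z-test. Only the means 87.2 and 87.9 come from the slide; the standard deviation and cell sizes below are hypothetical values chosen so the arithmetic lands near p = 0.026:

```python
from statistics import NormalDist

# Hypothetical per-cell summaries; only the means come from the slide.
n_a, mean_a, sd_a = 5_000, 87.2, 15.7
n_b, mean_b, sd_b = 5_000, 87.9, 15.7

se = (sd_a ** 2 / n_a + sd_b ** 2 / n_b) ** 0.5  # std. error of the difference
z = (mean_b - mean_a) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided p-value, ~0.026
```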
43. But there is still no way to know whether the result of a specific experiment is a false positive / negative.
44. Hence, we also need to interpret results with strong judgment to decide on a “win”:
● Do the results align with the hypothesis?
● Does the metric story hang together?
● Is there other supporting or refuting evidence, such as patterns across similar test cells?
● Do the results repeat?
46. Test results are expected for most decisions, and we’ve invested in an internal platform to support our experimentation program. Democratize access and contributions.
47. Three pillars
1. Trustworthiness
2. Inclusivity
3. Scalability
An interdisciplinary collaboration:
● Engineering: software, data, UI.
● Data science / statistics.
● Computation.
● Product design and product management.
49. Scaling scientists: building a modular, democratized platform.
● Users can contribute to each module (R, Python); engineering concerns are abstracted away.
● Many possible analysis flows.
● Results are surfaced in notebooks and in our UI.
51. Scaling decision makers
1. An efficient, science-centric platform pushes automation deeper into the workflows of test analysts, freeing more time for creative problem solving, exploratory work to generate hypotheses, and research.
52. Scaling decision makers
2. UI solutions: accessible and intuitive presentations of results that permit confident decision making.
54. What we try. How we decide. How we scale.
Democratization 3, how we scale: our platform-level investments allow scientists to contribute directly, and empower a variety of decision makers.
55. Decision making and experimentation at Netflix: democratization, three ways.
56.
● Democratization 1, how we decide: our members vote with their actions on how to evolve the product to deliver more joy!
● Democratization 2, what we try: because we test, and because most ideas are not winners, product innovation ideas can come from anywhere.
● Democratization 3, how we scale: our platform-level investments allow scientists to contribute directly, and empower a variety of decision makers.