The document discusses common mistakes in A/B testing and offers guidance on properly designing tests, interpreting results, and making decisions from them. It notes that the most common serious errors include ignoring multiple testing, choosing the wrong metrics, and using improper stopping rules. It emphasizes considering significance and risk during test design, focusing on actual key performance indicators such as profit, and recognizing that the risks in medical testing differ from those in business experimentation.
80. Analogy
Question: How likely is it that my analytics or site are broken?
Non-Answer: We only go a whole day with no conversions once every 2 months.
82. Interpreting the P-value
Question: How likely is it that this variation actually does nothing?
Non-Answer: We’d only see a difference this big 5% of the time.
83. Meanwhile in Industry Tools:
● “Chance to beat baseline”
● “We are 95% certain that the changes in test “B” will improve your conversion rate”
98. A Good A/B Test Result:
“10% Uplift, With 95% Significance”
99. But what about this?
“10% Uplift, With 60% Significance”
100. Jargon: P-Value
● The chance of a result at least this extreme if the null hypothesis is true
● E.g. p = 0.05 for 95% significance
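The definition above can be made concrete with a minimal sketch: a two-sided, two-proportion z-test on conversion counts, using only the standard library. The conversion numbers are made up for illustration; real tools (e.g. statsmodels) offer hardened versions of this.

```python
import math

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates
    (two-proportion z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(|Z| >= |z|) under the standard normal: the chance of data
    # at least this extreme if the variation actually does nothing
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical data: baseline 500/10000, variation 550/10000 (a 10% uplift)
p = two_proportion_pvalue(500, 10_000, 550, 10_000)
```

With these illustrative numbers the p-value lands around 0.11, i.e. roughly "89% significance" in the deck's vocabulary: a 10% observed uplift that still has a sizeable chance of arising from a do-nothing variation.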
101. “10% Uplift, With 60% Significance”
● 40% chance of data at least this extreme if the variation is functionally identical to the baseline
102. “10% Uplift, With 60% Significance”
● 40% chance of data at least this extreme if the variation is functionally identical to the baseline
● The variation is still probably better than the baseline
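The 40% figure can be checked by simulation: run A/A tests in which the "variation" is functionally identical to the baseline, and count how often the p-value still clears the 60%-significance bar purely by chance. All parameters below (sample size, conversion rate, trial count) are illustrative.

```python
import math
import random

def pvalue(conv_a, n_a, conv_b, n_b):
    # Two-sided, two-proportion z-test (normal approximation)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
n, rate, trials = 2_000, 0.05, 1_000
hits = 0
for _ in range(trials):
    # Both "arms" draw from the same 5% conversion rate: a pure A/A test
    a = sum(random.random() < rate for _ in range(n))
    b = sum(random.random() < rate for _ in range(n))
    if pvalue(a, n, b, n) <= 0.40:  # clears "60% significance"
        hits += 1
frac = hits / trials
# frac should come out near 0.40: identical variations reach the
# "60% significance" bar about 40% of the time by chance alone
```

This is why the deck's second bullet is only a "probably": a result at the 60%-significance level is weak evidence, since two identical variations produce something that extreme two times in five.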