Weapons of Math Instruction: Evolving from Data-Driven to Science-Driven

Donal McMahon
Director of Data Science, Indeed

First, I’ll convince you to use the scientific method.
Then, I’ll teach you how.

We’ve hosted 6 eng tech talks on this topic!

Many other industries are also now becoming data-driven

Why data-driven?
1 Democratize decision-making
2 Better decisions
3 Increase decision velocity
4 Improve collaboration via ego removal

Is data-driven an accurate descriptor?

Why do you need to be science-driven?
A cautionary tale

Dreaming big and about to change the world
Donal PM
disclaimer: sadly not real childhood photos

Idea
Modernize our mobile site to improve job seeker experience

Change 1
Increased spacing between jobs
[Screenshots: Control vs. Treatment]

Change 2
Replaced orange text with buttons for:
● New
● Apply with your Indeed Resume
[Screenshots: Control vs. Treatment]

Change 3
Removed sponsored jobs
[Screenshots: Control vs. Treatment]

Change 4
Other minor UI tweaks
● Salary range
● Home button
● Fonts
[Screenshots: Control vs. Treatment]

Convinced software developers to implement

We ran an A/B test and generated lots of data

We drew contradictory conclusions

What is job seeker experience?

What’s a job?
One that’s anywhere on the page, or one that’s viewed?

What’s an acceptable metric trade off?

Resolution strategy: be more data-driven

So, we threw more data at each other

different hypotheses + different data + different metrics
∴ different conclusions

We learn geology the morning
after the earthquake.
Ralph Waldo Emerson.

The Scientific Method
Observation Question Hypothesis Experiment Analysis Conclusion

Remainder of this talk
1 What did we do?
2 Why was it wrong?
3 How can you do it better?

What did we do?
Nothing

Why was it wrong?
1 Didn’t establish a baseline for job seeker experience, or any measures of it
2 When we failed, we had no knowledge backlog for future work

How can you do it better?
Nano: study real job seeker sessions
Micro: partner with experts (UX) to gather qualitative data
Macro: large scale data analysis and observation via experimentation

Nano: study real job seeker sessions
Query 1 → Click on Job A → Click on Job C → Query 2 → Click on Job D → Apply

Not only is the universe stranger
than we imagine, it is stranger
than we can imagine.
Sir Arthur Eddington

How can you do it better? A shameless plug
Micro: partner with experts (UX) to gather qualitative data
medium.com/indeed-data-science

Micro: partner with experts (UX) to gather qualitative data
1 Real-life observation
2 Interviews
3 Content analysis (surveys)

Macro: large scale data analysis and observation via experimentation
Common Question
What’s a worthwhile/launchable metric trade-off?

Reality
You’re making trade-offs implicitly already

Learn your implicit local trade-off function
Run multiple simple perturbation experiments, all the time

Observation via experimentation
[Plot axes: Applies and JobAlert sign-ups, with the current state marked]

Learn your current implicit trade-offs via experimentation
[Same plot, adding Expt 1: bold "Apply with your Indeed Resume" and Expt 2: add pixel whitespace to JobAlert UI]

Compare your current state to all Pareto-efficient alternatives
[Same plot, highlighting the Pareto-efficient alternatives]

For each Pareto-efficient alternative you have a trade-off
[Same plot, marking ΔApplies and ΔJobAlerts between the current state and an alternative]
Implicit tradeoff
Each JobAlert sign-up is worth 1.7 Applies
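
A minimal sketch of how such an exchange rate could be read off a set of experiment results. The numbers, the per-1,000-sessions unit, and the names `Outcome`, `pareto_efficient` and `applies_per_signup` are all hypothetical, chosen only so that the ratio comes out at the 1.7 quoted above; they are not Indeed's data or tooling.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    name: str
    applies: float   # applies per 1,000 sessions (hypothetical unit)
    signups: float   # JobAlert sign-ups per 1,000 sessions (hypothetical unit)


def pareto_efficient(outcomes):
    """Keep the outcomes that no other outcome beats on both metrics."""
    def dominated(a, b):
        # b dominates a: at least as good on both metrics, strictly better on one
        return (b.applies >= a.applies and b.signups >= a.signups
                and (b.applies > a.applies or b.signups > a.signups))
    return [a for a in outcomes if not any(dominated(a, b) for b in outcomes)]


def applies_per_signup(current, alternative):
    """Applies given up per JobAlert sign-up gained by moving to `alternative`."""
    d_applies = alternative.applies - current.applies
    d_signups = alternative.signups - current.signups
    if d_signups == 0:
        raise ValueError("alternative does not change sign-ups")
    return -d_applies / d_signups


# Made-up results for illustration only.
current = Outcome("current state", applies=100.0, signups=20.0)
experiments = [
    Outcome("expt 1: bold Apply with your Indeed Resume", 103.4, 18.0),
    Outcome("expt 2: add whitespace to JobAlert UI", 98.3, 21.0),
    Outcome("expt 3: hypothetical change that loses on both metrics", 97.0, 19.0),
]

frontier = pareto_efficient([current] + experiments)
rates = {o.name: applies_per_signup(current, o) for o in frontier if o is not current}
```

Staying at the current state when both efficient alternatives are available implies a sign-up is valued at roughly the slope between them, which is how a figure such as "1.7 Applies per sign-up" can be made explicit.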

Why was it wrong?
1 We never prioritized the most important question(s)
2 By bundling questions, we couldn’t answer any of them, so we couldn’t learn and improve

Research Question                                                          Potential Impact   Complexity   Time To Learn
What are good measures for job seeker experience?                          ?                  ?            ?
How can we help job seekers navigate to their desired job more quickly?    ?                  ?            ?
How can we clearly denote sponsored content?                               ?                  ?            ?
…                                                                          ...                ...          ...

What did we do?
Modernize the mobile interface to improve job seeker
experience

Why was it wrong?
1 Hypothesis was ill-defined and vague
2 No established metrics
3 No clear success/failure criteria

1 Determine one or more hypotheses
“Does extra whitespace between job cards help job seekers to navigate more quickly?”
2 Agree on the data, metrics and acceptable trade-offs up front
Suggested metrics: (i) time to click, (ii) click rate, (iii) time to hire

Important Question #1
How many metrics?

Spoiler
3

Your product is a high dimensional hypercube

2D hypercube 3D hypercube 4D hypercube
5D hypercube 6D hypercube 7D hypercube

How many metrics?
We need a low-dimensional representation
that preserves almost all of the signal

How many metrics?
Singular value decomposition (SVD)

How many metrics using SVD
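
One way the SVD idea could be sketched: standardize a sessions-by-metrics matrix, take its singular values, and count how many directions are needed to keep most of the variance. The synthetic data below is an assumption made purely for illustration (ten metrics driven by three hidden factors), and `components_needed` and the 95% threshold are hypothetical choices, not the speaker's actual analysis.

```python
import numpy as np


def components_needed(metrics, signal=0.95):
    """Number of singular directions needed to keep `signal` of the total variance."""
    X = np.asarray(metrics, dtype=float)
    X = X - X.mean(axis=0)                    # centre each metric
    std = X.std(axis=0)
    X = X / np.where(std > 0, std, 1.0)       # put metrics on a common scale
    s = np.linalg.svd(X, compute_uv=False)    # singular values, largest first
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # first position where the cumulative share reaches the threshold
    return min(int(np.searchsorted(explained, signal)) + 1, len(s))


# Synthetic example: 10 observed metrics driven by 3 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
loadings = np.zeros((3, 10))
loadings[0, :4] = 1.0     # metrics 0-3 follow factor 0
loadings[1, 4:7] = 1.0    # metrics 4-6 follow factor 1
loadings[2, 7:] = 1.0     # metrics 7-9 follow factor 2
metrics = latent @ loadings + 0.05 * rng.normal(size=(500, 10))

n_metrics = components_needed(metrics)
```

On real product data the count depends on the threshold and on how correlated the metrics are, so the result is a starting point for discussion rather than a fixed answer.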

Important Question #2
How do you choose great metrics?

This is a full academic
discipline
Some people dedicated their 20s to this!

Choosing metrics
You need to decide on a target (θ)

Termed the estimand in statistics (θ)

Choosing metrics
Choose how you’ll aim for the target

Estimator and Estimate (θ)

Mathematical criteria for metric evaluation
1 Bias
2 Variance
3 System complexity

Bias

It can be easy to miss bias

Hidden bias in our example
Estimate “time to hire” for job seekers

Job seeker   First action   Still active   Hired
1            01/01/2016     Yes            No
2            01/22/2016     No             01/25/2016
3            02/04/2016     No             02/23/2016
4            02/17/2016     No             No
...          ...            ...            ...
...          ...            ...            ...
n            04/23/2016     Yes            No

Initial Metric Proposal
Average time to hire for job seekers who were hired

Solution
Estimate typical time to hire using Kaplan-Meier Estimate

[Plot: estimated time to hire against time (t)]
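
A minimal sketch of the Kaplan-Meier product-limit estimate in plain Python, to show why averaging only over hired job seekers is biased. The 3-day and 19-day durations correspond to job seekers 2 and 3 in the table above; the remaining durations and the function names `kaplan_meier` and `median_time` are hypothetical.

```python
def kaplan_meier(durations, hired):
    """Return [(t, S(t))], where S(t) estimates the share still not hired after t days.

    durations: days from first action to hire, or to the last observation if not hired
    hired:     True if the job seeker was hired, False if censored (outcome unknown)
    """
    event_times = sorted({t for t, h in zip(durations, hired) if h})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)   # still being observed at t
        events = sum(1 for d, h in zip(durations, hired) if h and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def median_time(curve):
    """First time at which at least half are estimated to be hired, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Illustrative data: three hires and three censored job seekers.
durations = [3, 19, 40, 60, 90, 120]
hired = [True, True, False, True, False, False]

hired_durations = [d for d, h in zip(durations, hired) if h]
naive_mean = sum(hired_durations) / len(hired_durations)
curve = kaplan_meier(durations, hired)
typical = median_time(curve)
```

In this made-up sample the hired-only average is about 27 days, while the Kaplan-Meier median is 60 days, because the censored job seekers still count as "not yet hired" for as long as they were observed.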

Variance - a measure of data spread
Low variance High variance

Variance is fundamental
for valid statistical inference

Science assumes “innocent until proven guilty”
We often term this our null hypothesis (H0)

Proof required beyond reasonable doubt
In order to reject the null hypothesis

Variance is your estimate of uncertainty, i.e. doubt

Note
We often choose the
Minimum Variance Unbiased Estimator (MVUE)

Not Always MVUE
Occasionally you might trade bias for variance, e.g. machine learning
[Diagram: targets showing low vs. high variance and low vs. high bias]
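
A small simulation sketch of that trade, using a textbook case rather than anything from the talk: estimating the variance of normally distributed data from samples of five. Dividing the sum of squares by n - 1 is unbiased; dividing by n + 1 is biased but has lower variance and lower mean squared error. The sample size, trial count and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials, true_var = 5, 20_000, 1.0

# Many small samples from a normal distribution with known variance.
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
sum_sq = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)


def summarize(estimates):
    """Return (bias, variance, mean squared error) of a set of estimates."""
    bias = estimates.mean() - true_var
    variance = estimates.var()
    mse = ((estimates - true_var) ** 2).mean()
    return bias, variance, mse


unbiased = summarize(sum_sq / (n - 1))   # the usual unbiased sample variance
shrunk = summarize(sum_sq / (n + 1))     # biased low, but less spread out
```

Comparing the two tuples shows the shrunk estimator giving up some bias in exchange for a smaller spread, which is the same kind of trade regularized machine learning models make.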

Product development isn’t linear

Sometimes there are multiple potential targets

Or the target is partially blocked

Or it keeps moving

There is no catch-all mathematical formula
to measure and account for system complexity

But that doesn’t mean you shouldn’t try to estimate
it and factor it into decisions

“I need a job” → Search → Tap → Apply → Interview → Offer → Hire
Covered extensively in Ketan’s talk

Which also involves prediction brackets

You predict a winner for each game and are awarded points if correct
[Bracket diagram: my prediction vs. actual result]

If you predict an upset early, success/failure compounds
[Bracket diagram: my prediction vs. actual result, carried through the later rounds]

System complexity factors
● Downstream compounded loss
● Number of bracket participants
● Points awarded at each stage

How to win your NCAA pool
Simulate the downstream effect of all potential decisions
Check whether it increases/decreases your win probability
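
A toy sketch of that procedure for a four-team bracket, small enough to enumerate every outcome exactly instead of sampling. Everything here is an assumption for illustration: the team ratings, the scoring (1 point per semi-final, 2 for the final), and the idea that every opponent picks the favourites ("chalk"). A real pool would need Monte Carlo simulation and a model of how opponents pick.

```python
from itertools import product

STRENGTH = {"A": 8.0, "B": 4.0, "C": 3.0, "D": 2.0}   # hypothetical team ratings
SEMIS = (("A", "D"), ("B", "C"))                      # semi-final pairings
POINTS = (1, 1, 2)                                    # semi 1, semi 2, final
CHALK = ("A", "B", "A")                               # bracket that picks every favourite


def p_beats(x, y):
    """Assumed probability that team x beats team y."""
    return STRENGTH[x] / (STRENGTH[x] + STRENGTH[y])


def outcomes():
    """Yield every (semi 1 winner, semi 2 winner, champion) with its probability."""
    for w1, w2 in product(SEMIS[0], SEMIS[1]):
        l1 = SEMIS[0][1 - SEMIS[0].index(w1)]
        l2 = SEMIS[1][1 - SEMIS[1].index(w2)]
        p_semis = p_beats(w1, l1) * p_beats(w2, l2)
        yield (w1, w2, w1), p_semis * p_beats(w1, w2)
        yield (w1, w2, w2), p_semis * p_beats(w2, w1)


def score(bracket, result):
    """Points earned by a bracket given the actual result."""
    return sum(pts for pick, actual, pts in zip(bracket, result, POINTS) if pick == actual)


def win_probability(mine, opponents):
    """Chance of finishing top of the pool, splitting ties evenly."""
    total = 0.0
    for result, prob in outcomes():
        scores = [score(mine, result)] + [score(o, result) for o in opponents]
        if scores[0] == max(scores):
            total += prob / scores.count(scores[0])
    return total


def best_bracket(opponents):
    """Try every valid bracket and keep the one with the highest win probability."""
    candidates = [bracket for bracket, _ in outcomes()]
    return max(candidates, key=lambda b: win_probability(b, opponents))


small_pool_pick = best_bracket([CHALK])        # one opponent
large_pool_pick = best_bracket([CHALK] * 9)    # nine opponents
```

Under these assumptions the best pick changes with the number of participants: copying the favourites is hard to beat head-to-head, while a larger pool rewards an early upset pick that separates you from the crowd.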

Reminder - How can you do it better?
1 Determine one or more hypotheses
“Does extra whitespace between job cards help job seekers to navigate more quickly?”
2 Agree on the data, metrics and acceptable trade-offs up front
Metrics: (i) time to click, (ii) click rate, (iii) time to hire

What did we do?
Ran a single treatment experiment where we
simultaneously changed four components

Why was it wrong?
Couldn’t disentangle the effects of the 4 different treatments

Run a full factorial experiment

Full Factorial Experiment
Suggestion
A: Whitespace
B: Orange text
C: Salary range

Increased statistical power, and simultaneous
testing of interaction effects

i.e. you’ll learn more and learn quicker
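
A minimal sketch of a 2^3 full factorial design for the three suggested factors, with main effects and two-way interactions estimated as differences in mean response. The response values are invented and noise-free so the arithmetic is easy to follow; `design_matrix` and `estimate_effects` are hypothetical helper names.

```python
from itertools import product

import numpy as np

FACTORS = ("whitespace", "orange_text", "salary_range")


def design_matrix():
    """All 2^3 combinations, coded -1 (control) and +1 (treatment)."""
    return np.array(list(product((-1, 1), repeat=len(FACTORS))), dtype=float)


def estimate_effects(design, response):
    """Main effects and two-way interactions as differences in mean response."""
    effects = {}
    for i, name in enumerate(FACTORS):
        column = design[:, i]
        effects[name] = response[column > 0].mean() - response[column < 0].mean()
    for i in range(len(FACTORS)):
        for j in range(i + 1, len(FACTORS)):
            contrast = design[:, i] * design[:, j]
            label = f"{FACTORS[i]} x {FACTORS[j]}"
            effects[label] = response[contrast > 0].mean() - response[contrast < 0].mean()
    return effects


design = design_matrix()
a, b, c = design.T
# Invented response (e.g. click rate in %): whitespace helps, orange text helps a
# little, salary range does nothing, and whitespace and orange text interact.
response = 10.0 + 1.0 * a + 0.5 * b + 0.0 * c + 0.25 * a * b

effects = estimate_effects(design, response)
```

Every cell contributes to every estimate, which is where the extra statistical power comes from, and the interaction rows are exactly what a single bundled treatment cannot reveal.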

What did we do?
1 Cobbled data together from different sources
2 Defined different metrics
3 Invested a lot of time analysing tests

To consult the statistician after an
experiment is finished is often merely
to ask her to conduct a post mortem
examination. She can perhaps say
what the experiment died of.
R.A. Fisher

Why was it wrong?
Opinion-driven, time sink, unsatisfying for all involved

With the correct setup, analysis should be trivial

                   Existing metric      New metric
Existing product   Uninteresting        Metric Innovation
New product        Product Innovation   Uninformative

Never use new data or metrics
to validate new products!

What did we do?
Drew two different conclusions

Why was it wrong?
1 Didn’t learn anything
2 Lost team trust

Conclusions should follow directly from the analysis

The Goldilocks syndrome
A/B test
Outcome      (-∞, -5%]   (-5%, -1%]   (-1%, 1%]   (1%, 5%]                      (5%, ∞)
Conclusion   too cold    too cold     too cold    just right, declare victory   too hot

Retain healthy skepticism
Always look for bugs
Check for repeatability via holdbacks

The Complete Scientific Method
Observation: nano, micro, macro
Question: prioritize, implicit trade-offs
Hypothesis: bias & variance, 3 metrics
Experiment: full factorial design
Analysis: trivial, no data innovation
Conclusion: Goldilocks syndrome, repeatability

Parting Thoughts
Data-driven can be disorientating in a world of abundant data
Be science-driven, i.e. use the scientific method to add necessary structure
Invest in the observation, question and hypothesis stages
