2. Example: Hospitalization on health
What’s wrong with estimating this model from observational data?
Health
tomorrow
Hospital
visit today Effect?
Arrow means “X causes Y”
3. Confounds
The effect and cause might be
confounded by a common cause,
and be changing together as a
result
Health
tomorrow
Hospital
visit today Effect?
Health
today
Dashed circle means “unobserved”
4. Confounds
If we only get to observe them
changing together, we can’t
estimate the effect of
hospitalization changing alone
Health
tomorrow
Hospital
visit today Effect?
Health
today
5. “To find out what happens when you change
something, it is necessary to change it.”
-GEORGE BOX
6. Hospital
visit today
Random assignment
Random assignment determines the treatment independent of any
confounds
Health
tomorrowEffect?
Health
today
Coin flip
Double lines mean
“intervention”
7. Counterfactuals
To isolate the causal effect, we have to change one and only one
thing (hospital visits), and compare outcomes
+ vs
(what happened)
Reality
(what would have happened)
Counterfactual
8. Counterfactuals
We never get to observe what would have happened if we did
something else, so we have to estimate it
+ vs
(what happened)
Reality
(what would have happened)
Counterfactual
9. Random assignment
We can use randomization to create two groups that differ only in
which treatment they receive, restoring symmetry
+
World 1 World 2
Heads Tails
10. Random assignment
We can use randomization to create two groups that differ only in
which treatment they receive, restoring symmetry
+
World 1 World 2
Heads Tails
11. Random assignment
We can use randomization to create two groups that differ only in
which treatment they receive, restoring symmetry
+
World 1 World 2
13. Problems
Random assignment is the “gold standard” for causal inference, but
can be misleading under certain circumstances
◦ Small sample sizes
◦ Researcher degrees of freedom
◦ Publication bias
◦ P-hacking
16. Electronic copy available at: https://ssrn.com/abstract=1850704
designed to demonstrate something false: that certain songs
can change listeners’ age. Everything reported here actually
happened.1
Study 1:musical contrast and subjective age
In Study 1, we investigated whether listening to a children’s
song induces an age contrast, making people feel older. In
exchange for payment, 30 University of Pennsylvania under-
graduates sat at computer terminals, donned headphones, and
were randomly assigned to listen to either a control song
(“Kalimba,” an instrumental song by Mr. Scruff that comes
free with the Windows 7 operating system) or a children’s
song (“Hot Potato,” performed by The Wiggles).
After listening to part of the song, participants com-
pleted an ostensibly unrelated survey: They answered the
question “How old do you feel right now?” by choosing
among five options (very young, young, neither young nor
old, old, and very old). They also reported their father’s
age, allowing us to control for variation in baseline age
across participants.
An analysis of covariance (ANCOVA) revealed the pre-
dicted effect: People felt older after listening to “Hot Potato”
tematic analysis of how researcher degrees of freedom influ-
ence statistical significance. Impatient readers can consult
Table 3.
“How Bad Can It Be?” Simulations
Simulationsof common researcher degreesof
freedom
We used computer simulations of experimental data to esti-
mate how researcher degrees of freedom influence the proba-
bility of a false-positive result. These simulations assessed
the impact of four common degrees of freedom: flexibility in
(a) choosing among dependent variables, (b) choosing sample
size, (c) using covariates, and (d) reporting subsets of experi-
mental conditions. We also investigated various combinations
of these degrees of freedom.
We generated random samples with each observation inde-
pendently drawn from a normal distribution, performed sets of
analyses on each sample, and observed how often at least one
of the resulting p values in each sample was below standard
significance levels. For example, imagine a researcher who
collects two dependent variables, say liking and willingness to
by guest on November 20, 2011pss.sagepub.comDownloaded from
1360 Simmonset al.
of ambiguous information and remarkably adept at reaching
justifiable conclusions that mesh with their desires (Babcock
& Loewenstein, 1997; Dawson, Gilovich, & Regan, 2002;
Gilovich, 1983; Hastorf & Cantril, 1954; Kunda, 1990; Zuck-
erman, 1979). This literature suggests that when we as
researchers face ambiguous analytic decisions, we will tend to
conclude, with convincing self-justification, that the appropri-
ate decisions are those that result in statistical significance
(p ≤ .05).
Ambiguity is rampant in empirical research. As an exam-
ple, consider a very simple decision faced by researchers ana-
lyzing reaction times: how to treat outliers. In a perusal of
roughly 30 Psychological Science articles, we discovered con-
siderable inconsistency in, and hence considerable ambiguity
about, this decision. Most (but not all) researchers excluded
some responses for being too fast, but what constituted “too
fast” varied enormously: the fastest 2.5%, or faster than 2 stan-
dard deviations from the mean, or faster than 100 or 150 or
200 or 300 ms. Similarly, what constituted “too slow” varied
enormously: the slowest 2.5% or 10%, or 2 or 2.5 or 3 stan-
dard deviations slower than the mean, or 1.5 standard devia-
tions slower from that condition’s mean, or slower than 1,000
or 1,200 or 1,500 or 2,000 or 3,000 or 5,000 ms. None of these
decisions is necessarily incorrect, but that fact makes any of
them justifiable and hence potential fodder for self-serving
(adjusted M = 2.54 years) than after listening to the control
song (adjusted M = 2.06 years), F(1, 27) = 5.06, p = .033.
In Study 2, we sought to conceptually replicate and extend
Study 1. Having demonstrated that listening to a children’s
song makes people feel older, Study 2 investigated whether
listening to a song about older age makes people actually
younger.
Study 2:musical contrast and chronological
rejuvenation
Using the same method as in Study 1, we asked 20 University
of Pennsylvania undergraduates to listen to either “When I’m
Sixty-Four” by The Beatles or “Kalimba.” Then, in an ostensi-
bly unrelated task, they indicated their birth date (mm/dd/
yyyy) and their father’s age. We used father’s age to control
for variation in baseline age across participants.
An ANCOVA revealed the predicted effect: According to
their birth dates, people were nearly a year-and-a-half younger
after listening to “When I’m Sixty-Four” (adjusted M = 20.1
years) rather than to “Kalimba” (adjusted M = 21.5 years),
F(1, 17) = 4.92, p = .040.
Discussion
18. False-Positive Psychology 1361
Table 1. Likelihood of ObtainingaFalse-Positive Result
Significance level
Researcher degrees of freedom p < .1 p < .05 p < .01
Situation A:two dependent variables (r = .50) 17.8% 9.5% 2.2%
Situation B: addition of 10 more observations
per cell
14.5% 7.7% 1.6%
Situation C: controllingfor gender or interaction
of gender with treatment
21.6% 11.7% 2.7%
Situation D: dropping(or not dropping) one of
three conditions
23.2% 12.6% 2.8%
Combine Situations A and B 26.0% 14.4% 3.3%
Combine Situations A,B,and C 50.9% 30.9% 8.4%
Combine Situations A,B,C,and D 81.5% 60.7% 21.5%
Note: The table reportsthe percentage of 15,000 simulated samplesin which at least one of a
set of analyseswassignificant. Observationswere drawn independently from anormal distribu-
tion.Baseline isatwo-condition design with 20 observationsper cell.Resultsfor SituationA were
obtained by conductingthree t tests,one on each of two dependent variablesand athird on the
average of these two variables.Resultsfor Situation B were obtained by conductingone t test after
collecting20 observationsper cell and another after collectingan additional 10 observationsper
cell.Resultsfor Situation C were obtained by conductinga t test,an analysisof covariance with a
19.
20. Caveats / limitations
Random assignment is the “gold standard” for causal inference, but
it has some limitations:
◦ Randomization often isn’t feasible and/or ethical
◦ Experiments are costly in terms of time and money
◦ It’s difficult to create convincing parallel worlds
◦ Inevitably people deviate from their random assignments
Anyone can flip a coin, but it’s difficult to create convincing parallel
worlds
25. Natural experiments
Sometimes we get lucky and nature effectively runs experiments for
us, e.g.:
◦ As-if random: People are randomly exposed to water sources
◦ Instrumental variables: A lottery influences military service
◦ Discontinuities: Star ratings get arbitrarily rounded
◦ Difference in differences: Minimum wage changes in just one state
26. Natural experiments
Sometimes we get lucky and nature effectively runs experiments for
us, e.g.:
◦ As-if random: People are randomly exposed to water sources
◦ Instrumental variables: A lottery influences military service
◦ Discontinuities: Star ratings get arbitrarily rounded
◦ Difference in differences: Minimum wage changes in just one state
Experiments happen all the time, we just have to notice them
27. As-if random
Idea: Nature randomly assigns
conditions
Example: People are randomly
exposed to water sources (Snow,
1854)
http://bit.ly/johnsnowmap
30. Figure 4: Average Revenue around Discontinuous Changes in Rating
Notes: Each restaurant’s log revenue is de-meaned to normalize a restaurant’s average log
revenue to zero. Normalized log revenues are then averaged within bins based on how far the
restaurant’s rating is from a rounding threshold in that quarter. The graph plots average log
revenue as a function of how far the rating is from a rounding threshold. All points with a
positive (negative) distance from a discontinuity are rounded up (down).
Regression
discontinuities
Idea: Things change around an
arbitrarily chosen threshold
Example: Star ratings get
arbitrarily rounded (Luca, 2011)
http://bit.ly/yelpstars
31. Difference in
differences
Idea: Compare differences after a
sudden change with trends in a
control group
Example: Minimum wage changes
in just one state (Card & Krueger,
1994)
http://stats.stackexchange.com/a/125266
32. Natural experiments: Caveats
Natural experiments are great, but:
◦ Good natural experiments are hard to find
◦ They rely on many (untestable) assumptions
◦ The treated population may not be the one of interest
33. Natural experiments: Caveats
Natural experiments are great, but:
◦ Good natural experiments are hard to find
◦ They rely on many (untestable) assumptions
◦ The treated population may not be the one of interest
Sometimes we can use additional data + algorithms to
automatically find natural experiments
48. Confound: Correlated demand
Some views would have
happened anyway due to
correlated demand
We call these convenience clicks
Direct
views of
hats
Referred
click-
throughs
to gloves
Effect?
Demand
for winter
items
49. Ideally we would run an A/B test where randomly selected people
see recommendations
Ideal experiment
World 1 World 2
50. Natural experiment
Instead, we can exploits sudden shocks in traffic to focal products as
an instrumental variable
Direct
views of
focal
product
Referred
click-
throughsEffect?
Demand
External
shock
51. Usual
approach
Think hard for a source
of random variation that
only directly affects focal
product (e.g., author
wins award)
52. Usual
approach
Think hard for a source
of random variation that
only directly affects focal
product (e.g., author
wins award)
Problem: Impossible to
rule out side effects
53. New approach: Automatically discovering
natural experiments
Look for products that receive shocks in direct traffic, while their
recommendations do not
54. New approach: Automatically discovering
natural experiments
Look for products that receive shocks in direct traffic, while their
recommendations do not
55. New approach: Automatically discovering
natural experiments
Applying this method to 23 million pageviews in the Bing Toolbar
logs, we find over 4,000 such natural experiments
56. New approach: Automatically discovering
natural experiments
Causal click-through rate is just the marginal gain in clicks during the
shock
Causal effect = Change in recommendation clicks / Size of shock
57. Causal click-through rate is just the
marginal gain in clicks during the
shock
Causal effect = Change in rec clicks /
Size of shock
Results
58. Although recommendation click-
throughs account for a large fraction
of traffic, at least 75% of this activity
would likely occur in the absence of
recommendations*
Results
*With lots of caveats