Insights from psychology on lack of reproducibility
1. Insights from psychology
on lack of reproducibility
Dorothy V. M. Bishop
Professor of Developmental Neuropsychology
University of Oxford
@deevybee
Talk given at All Souls seminar on Reproducibility and Open Research, 31/10/18
http://users.ox.ac.uk/~phys1213/ReproAtASC.html
2. The four horsemen of the Apocalypse
P-hacking
Publication biasLow power
HARKing
6. 1956
De Groot
Failure to distinguish between
hypothesis-testing and hypothesis-
generating (exploratory) research
-> misuse of statistical tests
Historical timeline: concerns about reproducibility
Describes P-hacking (though that term not used)
Situation when 10 statistical tests done in a study
“….when N=10 it is as if one participates… in a game of chance
with “probability of losing” α for each “draw” or “throw”. The
probability that we do not lose a single time in 10 draws can be
calculated in the case that the draws are independent; it equals (1
− α)^10. For α = 0.05, the traditional 5% level, this becomes 0.9510
= 0.60. This means, therefore, that we have a 40% chance of
rejecting at least one of our 10 null hypotheses — falsely”
8. 1956
De Groot
1975
Greenwald
The “file drawer” problem
1979
Rosenthal
Prejudice against the null
“As it is functioning in at least some areas of
behavioral science research, the research-
publication system may be regarded as a
device for systematically generating and
propagating anecdotal information.”
Publication bias
1969
Cohen
Nonsignificant findings not
published: literature gives distorted
impression
10. If we’ve known about this for
decades, why haven’t the problems
been fixed?
No one cause: need to consider research
environment and incentives
This talk: focus on cognitive biases that make it
hard to do science well
Idea: doing good science is in opposition to many
of our natural ways of thinking
11. Cognitive biases that make it hard
to do science well
• Failure to understand probability
• Tendency to see patterns in things
• Confirmation bias
• Errors of omission seen as acceptable
• Need for narrative
12. A certain town is served by two hospitals. In the larger hospital about 45
babies are born each day, and in the smaller hospital about 15 babies are
born each day. As you know, about 50% of all babies are boys. However, the
exact percentage varies from day to day. Sometimes it may be higher than
50%, sometimes lower. For a period of 1 year, each hospital recorded the
days on which more than 60% of the babies born were boys. Which hospital
do you think recorded more such days?
1.The larger hospital
2.The smaller hospital
3.About the same (that is, within 5% of each other)
Example from Daniel Kahneman & Amos Tversky
Consider this problem
13. Example from Daniel Kahneman & Amos Tversky
A certain town is served by two hospitals. In the larger hospital about 45
babies are born each day, and in the smaller hospital about 15 babies are
born each day. As you know, about 50% of all babies are boys. However, the
exact percentage varies from day to day. Sometimes it may be higher than
50%, sometimes lower. For a period of 1 year, each hospital recorded the
days on which more than 60% of the babies born were boys. Which hospital
do you think recorded more such days?
1.The larger hospital
2.The smaller hospital
3.About the same (that is, within 5% of each other)
Expected value:
Hosp15 = 57 days
Hosp45 = 26 days
14. Example from Daniel Kahneman & Amos Tversky
A certain town is served by two hospitals. In the larger hospital about 45
babies are born each day, and in the smaller hospital about 15 babies are
born each day. As you know, about 50% of all babies are boys. However, the
exact percentage varies from day to day. Sometimes it may be higher than
50%, sometimes lower. For a period of 1 year, each hospital recorded the
days on which more than 60% of the babies born were boys. Which hospital
do you think recorded more such days?
1.The larger hospital
2.The smaller hospital
3.About the same (that is, within 5% of each other)
Expected value:
Hosp15 = 57 days
Hosp45 = 26 days
Day
Hospital 15 Hospital 45
Small sample gives noisier
estimates: red line bounces
around much more than blue line
15. Insensitivity to sample size
• People have strong intuitions about random
sampling;
• These intuitions are wrong in fundamental
respects;
• These intuitions are shared by naive subjects and
by trained scientists;
• Intuitions are applied with unfortunate
consequences in the course of scientific inquiry
Tversky, A., & Kahneman, D. (1971). Belief in the law of small
numbers. Psychological Bulletin, 76, 105-110.
16. Work in progress: The experimenter game
• You have a budget to improve reading ability across Oxfordshire –
potential for roll-out to hundreds of schools. You’ve been offered
a remedy that claims to boost children’s reading ability by half a
standard deviation
• If you buy it and it turns out useless, you’ll lose a lot of money
• If you buy it and it really works, it will be worth a lot of money
• You’re not sure whether to trust the vendor – you think there’s a
50:50 chance that it really works
• You can run some tests on samples of children, but it costs
money – the more children you test, the more expensive.
• You have an optimization problem!
• So what’s your experimental strategy?
17. You decide to run a study with two groups of N children
What value of N should you start with? - let’s try 20 per group
Here’s a sample of data: do you think this is sufficient to decide
whether to adopt/reject the intervention?
18. This sample was drawn from population with no real difference
19. Here’s another sample of data: do you think this is sufficient to
decide whether to adopt/reject the intervention?
20. This time the sample was drawn from a sample with a true
effect.
With small sample, difference can look small when there is a
true effect – this illustrates problem of LOW POWER
22. This time, the impression from the sample of data gives a better
indication of the true effect in the population
But how reliable this impression is depends on the effect size, i.e.
the separation in the means of the population distributions
23. Population
effect size
= .2
Population
effect size
= .5
Separation between red dots (drawn from population with true
effect) and grey dots (drawn from population with no effect)
shows sample size where can reliably detect a true effect
24. Failure to appreciate power of ‘the prepared mind’
Tendency to see patterns in things
26. Example from Lazic, S. (2016) Experimental Design for Laboratory Biologists
Position of bomb hits:
General has map of bomb hits and wants to know if bombs were
dropped at random or whether some sites are being targeted.
Which map suggests targeting? Blue, red or neither?
27. General message: we tend to assume random data are regular, and
so try to interpret patterns when there is irregularity
The blue map may look as if there is targeting, especially if there are
potential targets at A or B.
In fact, blue X and Y co-ordinates were selected at random.
The red map does not suggest targeting, but it is not random. The co-
ordinates were selected to be evenly distributed, and then jittered
A
B
28. But! seeing novel patterns in complex data is one of the most
important and exciting aspects of science!
Consider Brodmann (1909): identified brain regions with different cell
types – not obvious: required expertise and painstaking study
Bailey and von Bonin (1951) noted problems in Brodmann's approach
— lack of observer independency, reproducibility and objectivity
Yet Brodman’s areas stood test of time: still used today
29. Special expertise or Jesus in toast?
How to decide
• Eradicate subjectivity from methods
• Adopt standards from industry for checking/double-
checking
• Automate data collection and analysis as far as possible
• Make recordings of methods (e.g. Journal of Visualised
Experiments)
• Make data and analysis scripts open
31. How to do good science
That is the idea that we all hope you have learned in
studying science in school… ….
It’s a kind of scientific integrity, a principle of scientific
thought that corresponds to a kind of utter honesty—a
kind of leaning over backwards.
For example, if you’re doing an experiment, you should
report everything that you think might make it invalid—
not only what you think is right about it: other causes
that could possibly explain your results; and things you
thought of that you’ve eliminated by some other
experiment, and how they worked—to make sure the
other fellow can tell they have been eliminated.
Richard Feynman,
Caltech 1974 commencement address
32. Wason task:
a way of thinking about experimental design
Each card has a number on one side and a patch of colour on the
other.
You are asked to test the hypothesis that – for these 4 cards - if an
even number appears on one side, then the opposite side is red.
• Are any of the cards irrelevant to the hypothesis?
• Are any of the cards critical to the hypothesis?
• Which card(s) would you turn over to test the hypothesis?
A B C D
33. Wason task:
a way of thinking about experimental design
Each card has a number on one side and patch of colour on the other.
You are asked to test the hypothesis that – for these 4 cards - if an
even number appears on one side, then the opposite side is red.
• Usual response is B & C are critical.
• But C is not critical (we’re testing ‘if P then Q’, not ‘if Q then P’)
• D is critical as it has potential to disconfirm hypothesis – but usually
overlooked
A B C D
34. Wason task:
Shows how confirmation bias can affect
experimental design
We need to design experiments to look for disconfirmation of a theory .
In practice: "To test a hypothesis, we think of a result that would be found if the
hypothesis were true and then look for that result" (J. Baron, 1988, p. 231).
In survey of 84 scientists (physicists,biologists, psychologists,
sociologists) Mahoney (1976) found fewer than 10% correctly identified
the critical cards
35. “The self-deception comes
in that over the next 20
years, people believed
they saw specks of light
that corresponded to what
they thought Vulcan
should look during an
eclipse: round objects
crossing the face of the
sun, which were
interpreted as transits of
Vulcan.”
Confirmation bias at level of observations:
Seeing what you expect to see
36. • Cherry-picking may not be deliberate
• We find it much easier to process and remember information
that agrees with our viewpoint
Confirmation bias affects how we remember
and process information
37. 37
Twin studies of SLI
probandwise
concordance:
same-sex twins
MZ DZ
Lewis & Thompson, 1992 .86 .48
Bishop et al, 1995 .70 .46
Tomblin & Buckwalter, 1998 .96 .69
Hayiou-Thomas et al, 2005 .36 .33
A personal example: Slide from talks I gave on genetics
of language disorder
Twin concordance
points to genetic
influence when
MZ > DZ
38. 38
Twin studies of SLI
probandwise
concordance:
same-sex twins
MZ DZ
Lewis & Thompson, 1992 .86 .48
Bishop et al, 1995 .70 .46
Tomblin & Buckwalter, 1998 .96 .69
Hayiou-Thomas et al, 2005 .36 .33
I continued to use the original slide after 2005, despite
this additional study I had co-authored
I failed to mention this in talks for several years – I literally forgot about it –
presumably because it did not fit!
39. 39
Example also illustrates how we will do further research
to try to make sense of data that does not fit our ideas –
but look far less closely when data does fit.
For denouement of this story, see
41. Most literature reviews cherry-pick the evidence
(that’s why I’m not identifying this specific e.g.)
“Regardless of etiology, cerebellar neuropathology
commonly occurs in autistic individuals. Cerebellar
hypoplasia and reduced cerebellar Purkinje cell
numbers are the most consistent neuropathologies
linked to autism [8, 9, 10, 11, 12, 13]. MRI studies
report that autistic children have smaller cerebellar
vermal volume in comparison to typically developing
children [14].”
Example: Study published in 2013
• I was surprised by this introduction to a paper, as it did not fit my impression of the
literature on neuropathology in autism: but the authors seemed to cite a lot of
supportive evidence
• I checked to see if there was a relevant meta-analysis: there was….
42. Standardized mean difference is +ve when cerebellar volume is greater in ASD
Meta-analysis: Traut et al (2018) https://doi.org/10.1016/j.biopsych.2017.09.029
Though Webb et al (ref 14) did find area of vermis smaller in ASD after covarying cerebellum size
Ref [14] –
larger
cerebellum
Other studies
mostly found
no difference
or increase –
opposite of
what claimed
in 2013 paper
43. Confirmation bias tends to produce errors of
omission – these are generally thought to be less
serious than errors of commission (i.e. making
stuff up)
But consequences can be major
44. Errors of omission in reporting research
“[I]t is a truly gross ethical violation for a researcher to
suppress reporting of difficult-to-explain or embarrassing
data in order to present a neat and attractive package
to a journal editor.” (Greenwald, 1975, p. 19)
“Failure to report results from a clinical trial is equivalent
to fraud.” Iain Chalmers, personal communication
45. Consequence of omission errors in literature reviews
• When we read a peer-reviewed paper, we tend to trust the
citations that back up a point
• When we come to write our own paper, we cite the same
materials
• A good scientist won’t cite papers without reading them, but
even this won’t save you from bias – you inherit it from prior
papers
• If prior papers only cite materials agreeing with a viewpoint,
that viewpoint gets entrenched
• You won’t know – unless you explicitly search – that there are
other papers that give a different picture
46. The (partial*) solution
Always start with a systematic review
• Systematic review
• Collecting and summarise all empirical evidence that fits
pre-specified eligibility criteria to address a specific
question
• Meta-analysis
• Use statistical methods to summarise the results of
these studies
*But depends on finding all relevant papers
47. Example from Lazic, S. (2016) Experimental Design for Laboratory Biologists
100 relevant studies on gene/disease association
95 studies find no association.
Negative findings tend not to
be mentioned in Abstracts
5 false positive results. Disease
and gene mentioned in
Abstract
Pubmed search for
disease AND gene
5 supporting and one
negative study found
49. Let’s take another look at that cerebellum paper:
statements that are not untrue, but are misleading
“Regardless of etiology, cerebellar neuropathology
commonly occurs in autistic individuals. Cerebellar
hypoplasia and reduced cerebellar Purkinje cell
numbers are the most consistent neuropathologies
linked to autism [8, 9, 10, 11, 12, 13]. MRI studies
report that autistic children have smaller cerebellar
vermal volume in comparison to typically developing
children [14].” Impression of large body of work, but
mostly reviews of same few studies
Study 14 by Webb et al: found overall increase in cerebellum size:
smaller vermis effect only after adjusting total cerebellar volume
50. In terms of ethical behaviour, rank order the following
behaviours:
• Omission of relevant studies
• Stating that a study found something that it didn’t
• Stating a study result that was true, but in a misleading
way
51. I don’t know of studies looking at this in science reporting,
but analogous behaviour rated in studies of negotiation by
Rogers et al (2016)
• Omission of relevant information
• Lying (untrue statement)
• Stating something that is true, but in a misleading way
(paltering)
23%
5%
32%
Honesty
judgement
in
negotiation
study*
*Rogers, T. et al (2016). Artful paltering: The risks and rewards of using truthful statements to
mislead others. Journal of Personality and Social Psychology, 112(3), 456-473.
Neither omission of information nor paltering seen as
honest, but both are more acceptable than lying
52. How common are these in literature reviews? Does it
matter?
• Omission of relevant studies
• Stating that a study found something that it didn’t
• Stating a study result that was true, but in a misleading
way
My view:
Adoption of these behaviours in science is likely to depend on:
(a) Is it rewarded?
(b) Will it be detected?
(c) If it is, could you avoid blame?
(d) Are there obvious victims?
(e) Is ‘everyone doing it’?
Overlooked victims:
• Potential users (patients, etc)
• Researchers trying to build on results
• Funders
53. A further, overarching problem
The need for narrative
“Another reason why HARKed research reports may fare better in
the review and publication process is that they not only provide a
better fit to a specific good science script, they may also provide a
better fit to the more general good story script.
Positing a theory serves as an effective "initiating event." It gives
certain events significance and justifies the investigators'
subsequent purposeful activities directed at the goal of testing the
hypotheses. And, when one HARKs, a "happy ending” (i.e.,
confirmation) is guaranteed.”
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social
54. Darwinian processes in survival of ideas
“Examples of memes are tunes, ideas, catch-
phrases, clothes fashions, ways of making
pots or of building arches. Just as genes
propagate themselves in the gene pool by
leaping from body to body via sperms or
eggs, so memes propagate themselves in the
meme pool by leaping from brain to brain via
a process which, in the broad sense, can be
called imitation.”
R. Dawkins
55. Successful meme
• Easy to understand, remember, and communicate
to others
• Not helped by reporting everything!
• Not helped by reporting null results!
• May be influenced by whether confers advantage to
the person communicating
• Survival does not depend on whether they are
useful, true, or potentially harmful
56. Cognitive biases pervade every
step of the research process
Reading literature Confirmation bias, Omissions
Experimental design
Confirmation bias, Law of
small numbers
Experimental observations
Seeing patterns,
Confirmation bias
Data analysis
Confirmation bias, Seeing
patterns, Law of small
numbers, Omissions
Scientific reporting
Confirmation bias,
Omissions, Need for
narrative
57. Will anything change?
“It really is striking just for how long there have been
reports about the poor quality of research
methodology, inadequate implementation of research
methods and use of inappropriate analysis procedures
as well as lack of transparency of reporting. All have
failed to stir researchers, funders, regulators,
institutions or companies into action”. Bustin, 2014
Reasons for optimism
• Concern from those who use research:
• Doctors and patients
• Pharma companies
• Concern from funders
• Increase in studies quantifying the problem
• Social media
58. 58
Professor Dorothy Bishop, FRS, FMedSci, FBA,
Wellcome Trust Principal Research Fellow,
Department of Experimental Psychology,
Anna Watts Building,
Woodstock Road,
Oxford,
OX2 6GG. @deevybee
http://deevybee.blogspot.com/2012/11/bishopblog-catalogue-
updated-24th-nov.html
https://www.slideshare.net/deevybishop
https://orcid.org/0000-0002-2448-4033