# Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022

29 May 2023

• 1. “Errors of the Error Gatekeepers: The case of Statistical Significance: 2016-2022.” Deborah G. Mayo (error@vt.edu), Virginia Tech. Presented 26 May 2023 at “Malfunction, error, failure: How to learn from scientific mistakes?”, Institute of Philosophy, Hungarian Academy of Sciences, Budapest, Hungary.
• 2. Statistical methods are gatekeepers against error, especially random error
  • “[tests of significance] are constantly in use to distinguish real effects from such apparent effects [due to] errors of random sampling, or of uncontrolled variability” (R. A. Fisher 1956, 79)
  • Statistical significance tests are our “first line of defense against being fooled by randomness” (Benjamini 2016, 1)
• 3. Statistical significance tests are a small part of error statistical tools: tools for controlling the probabilities of misleading interpretations of data, that is, error probabilities (statistical significance tests, confidence intervals, randomization, resampling)
• 4. In order to avoid being fooled by randomness:
  • A small P-value is required before inferring evidence of a genuine effect (i.e., for falsifying H0, “due to chance”)
  • P-values are distance measures with this inversion: the smaller the P-value, the larger the distance between the data x and H0
• 5. Testing reasoning
  • Small P-values indicate* some underlying discrepancy from H0 because very probably (with probability 1 - P) you would have seen a less impressive difference were H0 true.
  *(Until an audit is conducted testing assumptions, I use “indicate.”)
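The testing reasoning above can be made concrete with a small sketch. It assumes a one-sided test of a Normal mean with known standard deviation; the function name `one_sided_p_value` and the numbers are illustrative choices, not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(sample_mean, mu0, sigma, n):
    """P(sample mean >= observed value; H0: mu = mu0), with sigma known.

    Hypothetical helper for illustration only.
    """
    # Distance between the data and H0, in standard-error units.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    return 1.0 - NormalDist().cdf(z)


# Illustrative numbers: observed mean 0.4 from n = 25, sigma = 1, H0: mu = 0.
p_small_gap = one_sided_p_value(sample_mean=0.2, mu0=0.0, sigma=1.0, n=25)
p_large_gap = one_sided_p_value(sample_mean=0.4, mu0=0.0, sigma=1.0, n=25)

# The inversion: the larger the distance from H0, the smaller the P-value.
print(f"gap of 1 SE: P = {p_small_gap:.4f}")
print(f"gap of 2 SE: P = {p_large_gap:.4f}")
# 1 - P is the probability of a less impressive difference were H0 true.
print(f"1 - P at 2 SE = {1 - p_large_gap:.4f}")
```

The value 1 - P is only as trustworthy as the model assumptions, which is why the slide says “indicate” pending an audit.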
• 6. Error control is lost by abuses of P-values
  • Multiple testing, cherry-picking, stopping just when the data look good, and data-dredging invalidate P-values
• 7. Optional stopping, in testing the mean of a standard normal distribution
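A minimal simulation can show why optional stopping destroys error control. This is a sketch under stated assumptions, not the slide's own table: data are drawn from a standard normal with H0 true, a two-sided test with nominal level 0.05 (cutoff 1.96) is applied after every observation, and the function names, the cap of 100 observations, and the number of trials are all illustrative.

```python
import random
from math import sqrt


def rejects_with_optional_stopping(n_max, rng, z_crit=1.96):
    """Test after every observation; stop at the first nominal rejection."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)  # H0 is true: the mean really is 0
        if abs(total / sqrt(n)) >= z_crit:
            return True
    return False


def rejects_with_fixed_n(n, rng, z_crit=1.96):
    """Test once, at the sample size fixed in advance."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / sqrt(n)) >= z_crit


def rejection_rate(rule, n, trials=2000, seed=0):
    """Share of simulated studies that reject the true H0."""
    rng = random.Random(seed)
    return sum(rule(n, rng) for _ in range(trials)) / trials


optional_rate = rejection_rate(rejects_with_optional_stopping, 100)
fixed_rate = rejection_rate(rejects_with_fixed_n, 100)
print(f"fixed n = 100:           {fixed_rate:.3f}")
print(f"optional stopping <=100: {optional_rate:.3f}")
```

With the sample size fixed in advance, the rejection rate stays near the nominal 0.05; with stopping allowed at the first nominally significant result, it is several times higher, even though each reported P-value looks the same on paper.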
• 8. Replication crisis
  • Optional stopping found to be a major source of lack of replication
  • Low P-values not found when an independent group seeks to replicate with a more stringent protocol
• 9. Replication crisis leads to “reforms”
  • Several are welcome: the crisis has encouraged some social sciences to take a page from medical trials and require preregistration of protocols, replication checks, and adjustments to take account of selection effects
  • Others are radical: they lead to replacing P-values with methods less able to control erroneous interpretations of data
• 10. I will be talking about recent gatekeeper controversies
  • Statistical associations, executive directors, journal editors, task forces, meta-scientists
  • Scientists and practitioners, philosophers, skeptical consumers of statistics
• 11. The American Statistical Association (ASA) P-value project, 2015
  • “The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. … much confusion and even doubt about the validity of science is arising.”
  • “We hoped that a statement from the world's largest professional association of statisticians would …draw renewed and vigorous attention to changing the practice of …statistical inference.” (Wasserstein and Lazar 2016, p. 129)
• 12. The ASA P-value “pow wow”* in 2015
  *2 dozen participants; I was a ‘philosophical observer’
• 13. (I) 2016 ASA Statement on P-values
  • “Nothing in the ASA statement is new”
  • P-values cannot be interpreted without knowing how many tests have been done, stopping rules, data-dredging, selective reporting
  • P-values are not measures of effect size and are not posterior probabilities; statistical significance is not substantive importance; no evidence against a null hypothesis is not evidence for it
• 14. (II) 2019 Executive Director Editorial in The American Statistician: Abandon ‘significance’. Surprisingly, in 2019…
  • “It is time to stop using the term ‘statistically significant’ entirely.” (Wasserstein, Schirm & Lazar 2019)
  • You may use P-values, but don’t assess them by preset thresholds (e.g., .05, .01, .005): the no significance/no threshold view
• 15. “Scientists rise up against significance”
  • “Retire Statistical Significance” (Amrhein et al., 2019) was published in Nature to herald the Executive Director’s editorial
• 16. Karen Kafadar (then ASA president): a new gatekeeper is needed
  • Scientists, lawyers, and judges were asking her: “Can I ask you since ASA says we're not supposed to use P values anymore: How are we supposed to evaluate scientific merit of reported studies?”
  • She appoints a new Task Force
• 17. (III) The 2019 American Statistical Association (ASA) Task Force on Statistical Significance and Replicability was put in a very odd position: The ASA appointed us to “address concerns that [(II), a 2019 ASA executive director’s editorial] might be mistakenly interpreted as official ASA policy” [as was (I)] (III: Benjamini et al., 2021)
• 18. Why the confusion?
  • Wasserstein (ASA executive director) is first author of both (I) and (II)
  • (II) claims that (I), the 2016 statement, “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” and announces: “We take that step here….‘statistically significant’: don’t say it and don’t use it” (my emphasis)
• 19. “We take that step here…abandon statistical significance” (II)
• 20. The three documents
  • (I) 2016: American Statistical Association (ASA) Statement on P-values (6 principles) (“nothing new”)
  • (II) 2019: Executive Director Editorial in The American Statistician (Abandon ‘significance’)
  • (III) 2020-2021: The ASA President’s Task Force on Statistical Significance and Replicability (Do not abandon significance)
• 21. The ASA President’s Task Force:
  • Linda Young, National Agric Stats, U of Florida (Co-Chair)
  • Xuming He, University of Michigan (Co-Chair)
  • Yoav Benjamini, Tel Aviv University
  • Dick De Veaux, Williams College (ASA Vice President)
  • Bradley Efron, Stanford University
  • Scott Evans, George Washington U (ASA Pubs Rep)
  • Mark Glickman, Harvard University (ASA Section Rep)
  • Barry Graubard, National Cancer Institute
  • Xiao-Li Meng, Harvard University
  • Vijay Nair, Wells Fargo and University of Michigan
  • Nancy Reid, University of Toronto
  • Stephen Stigler, The University of Chicago
  • Stephen Vardeman, Iowa State University
  • Chris Wikle, University of Missouri
• 22. Despite the pandemic, by July 2020 the Task Force had a document for the ASA Board
  • But the ASA didn’t follow its own charge to “endorse and share [it] with scientists and journal editors”
  • For nearly a year it was in limbo
  • It was turned down for publication in various journals (e.g., Nature, Science)
• 23. The ASA President’s Task Force statement (1 page) opposed abandoning statistical significance tests: “The use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned. Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability”
• 24. The (III) Task Force also states: “P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature”.
• 25. Is Nature or Science keen to write about “Scientists Rise Up In Favor of Well-Understood Methods”?
• 26. What sells: “Scientists Rise Up Against Statistical Significance” (Nature)
• 27. The same drive for sensational findings that leads to biases in science is seen in meta-science, especially if it takes the form of scapegoating statistical significance tests
• 28. Many sign on to the no-threshold view thinking it blocks perverse incentives to data dredge, multiple test, and P-hack when confronted with a large, statistically nonsignificant P-value.
  • Carefully considered, the reverse seems true.
• 29. In a world without thresholds, it would be hard to hold the data dredgers accountable for reporting a nominally small P-value obtained through ransacking, data dredging, trying and trying again.
  • “whether a p-value passes any arbitrary threshold should not be considered at all" in presenting/interpreting data (according to II)
• 30. No thresholds, no tests, no (statistical) falsification
  • If you can’t say ahead of time about any result that it will not be allowed to count in favor of a claim, then you do not test that claim.
  • What is the point of insisting on replication if at no stage can one say the effect failed to replicate (no matter how many failures)?
• 31. To be clear …
  • I’m very sympathetic to moving away from accept/reject uses of tests (the inductive behavior philosophy), as I have argued for decades
  • In my reformulation of tests*, instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are or are not warranted with severity
  *developed over the years with others (Spanos, Cox, Hand)
• 32. Reflects a general philosophy of inference outside statistics
  • A claim C is warranted to the extent C has been subjected to and passes a test that probably would have found it specifiably false (if it is).
  • This probability is the severity with which it has passed (it may be qualitative)
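For a quantitative case, the severity idea can be sketched for a one-sided Normal test (H0: mu <= mu0 against mu > mu0, sigma known), following the general form discussed in Mayo and Spanos (2006) as I read it. The function name and the numbers below are illustrative assumptions, not part of the slides: after a statistically significant result, the severity for the claim “mu > mu1” is taken as the probability of a less impressive sample mean than the one observed, were mu only equal to mu1.

```python
from math import sqrt
from statistics import NormalDist


def severity_mu_greater_than(mu1, sample_mean, sigma, n):
    """Severity for the claim mu > mu1, given the observed sample mean.

    Hypothetical helper: P(sample mean <= observed; mu = mu1), sigma known.
    """
    standard_error = sigma / sqrt(n)
    return NormalDist().cdf((sample_mean - mu1) / standard_error)


# Illustrative numbers: n = 100, sigma = 1, observed mean 0.2 (2 SE above 0).
observed_mean, sigma, n = 0.2, 1.0, 100
severities = {
    mu1: severity_mu_greater_than(mu1, observed_mean, sigma, n)
    for mu1 in (0.0, 0.1, 0.2)
}
for mu1, sev in severities.items():
    print(f"SEV(mu > {mu1}) = {sev:.3f}")
```

The same outcome warrants “mu > 0” with high severity but “mu > 0.2” only poorly, which is how a particular result is used to infer discrepancies that are or are not warranted, rather than a bare significant/not significant verdict.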
• 33. What is learned from the malfunctions at the ASA?
  • The President’s Task Force report (III) finally appeared in July 2021 in Annals of Applied Statistics, where Kafadar is editor-in-chief
  • A disclaimer in the Executive Director’s 2019 editorial (II) would have avoided the confusion and the offense to opposing views within the ASA
  • The episode hurt the profession and is a serious embarrassment for the ASA
• 34. 1. Disclaimers
  • The ASA Board finally required a disclaimer to the executive director’s editorial just about 1 year ago
  • Too little, too late
• 35. 2. An umbrella group like the ASA should provide a neutral forum for debating rival methods. For the ASA to take sides in these long-standing controversies, or even to appear to do so, encourages groupthink, bandwagon effects, appeals to popularity, and fear.
• 36. The opening of the ASA Executive Director’s editorial (II) admits the statistical community does not agree about statistical methods, and “in fact it may never do so” (p. 2). It speaks of “the echoes of ‘statistics wars’ still simmering today (Mayo 2018).” Some see the replication crisis as an opportunity to win the war in favor of their preferred method.
• 37. 3. Signs of going beyond merely enforcing proper use of statistical significance tests:
  • The proposed reform revolves around a conception or philosophy at odds with that of statistical significance testing
• 38. A key disagreement: the role of probability
  • Error statisticians: to control error probabilities
  • Bayesians (likelihoodists): to assign degrees of belief or support in claims (e.g., Bayes factors, Bayesian posterior probabilities, likelihood ratios)
• 39. The 2016 statement (I) was already at the edge of the precipice ... before the step forward in 2019 (II)
• 40. Malfunction began at the very end of (I), the 2016 ASA statement: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. …confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors” (only 2 non-Bayesians at the pow-wow)
• 41. The same data-dredged hypothesis can occur in Bayes factors and Bayesian updating; priors can also be data dependent
  • But these accounts lack the direct grounds to criticize flouting of error statistical control (without violating a standard principle)
  • That principle, the likelihood principle, says to condition on the actual data x, whereas error probabilities consider outcomes other than x
• 42. Many “reforms” offered as alternatives to significance tests reject (frequentist) error probabilities
  • “Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, 100)
  • “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself.” (Berger and Wolpert, The Likelihood Principle 1988, 78)
• 43. Bayesian clinical trialists say they are in a quandary
  • The FDA requires adjustments for multiple testing
  • They admit “the type I error was inflated in the Bayesian adaptive designs … [but] adjustments to posterior probabilities, are not required for multiple looks at the data”
  • “Given the recent discussions to abandon significance testing, it may be useful to move away from controlling type I error entirely in trial designs.” (Ryan et al. 2020)
• 44. There are contexts for Bayesian and frequentist methods
  • The goals are sufficiently different to retain both for different contexts
  • But to avoid being fooled by randomness, you need error probability control; many Bayesians agree
• 45. 4. Who/what was most effective in gatekeeping the gatekeepers of error-prone inference?
  • Persistence (and courage) of the ex-President and her Task Force (III)
  • Leaders in clinical trials, where the incentives for showing an effect are highest (NEJM refuses to give up its gatekeeping role via P-values)
  • Social scientists and skeptical consumers (e.g., philosophers) prepared to critically challenge high priests (chutzpah)
  • Numerous conferences, workshops, editorials
• 47. The 2016 ASA Statement’s Six Principles
  1) P-values can indicate how incompatible the data are with a specified statistical model
  2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
  3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
  4) Proper inference requires full reporting and transparency. (P-values can’t be interpreted without knowing about multiple testing, data-dredging)
  5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result
  6) By itself, a p-value is not a good measure of evidence
• 48. References
  • Amrhein, V., Greenland, S., & McShane, B. (2019). Comment: Retire statistical significance. Nature, 567, 305–308.
  • Bayarri, M., & Berger, J. (2004). The interplay between Bayesian and frequentist analysis. Statistical Science, 19, 58–80.
  • Benjamini, Y. (2016). It's not the P-values’ fault. Comment on Wasserstein and Lazar (2016), supplemental material (online).
  • Benjamini, Y., De Veaux, R., Efron, B., Evans, S., Glickman, M., Graubard, B., He, X., Meng, X.-L., Reid, N., Stigler, S., Vardeman, S., Wikle, C., Wright, T., Young, L., & Kafadar, K. (2021). The ASA President's Task Force statement on statistical significance and replicability. Annals of Applied Statistics, 15(3), 1084–1085.
  • Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd.
  • Hardwicke, T., & Ioannidis, J. (2019). Petitions in scientific argumentation: Dissecting the request to retire statistical significance. European Journal of Clinical Investigation.
  • Harrington, D., D'Agostino, R., Gatsonis, C., Hogan, J., Hunter, M., Normand, S.-L., Drazen, J., & Hamel, M. (2019). New guidelines for statistical reporting in the journal. New England Journal of Medicine, 381(3), 285–286.
• 49. References (continued)
  • Ioannidis, J. (2019). The importance of predefined rules and prespecified statistical analyses: Do not abandon significance. JAMA, 321(21), 2067–2068.
  • Ioannidis, J. (2019). Correspondence: Retiring statistical significance would give bias a free pass. Nature, 567(7749), 461. https://doi.org/10.1038/d41586-019-00969-2
  • Kafadar, K. (2019). The year in review…And more to come. President's corner. Amstatnews, 510, 3–4.
  • Lakens, D., et al. (mega-team of 85 authors). (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171.
  • Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
  • Mayo, D. G. (2016). Don’t throw out the error control baby with the bad statistics bathwater: A commentary on R. Wasserstein and N. Lazar, “The ASA’s Statement on P-values: Context, Process, and Purpose.” The American Statistician, 70(2).
  • Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
  • Mayo, D. G. (2019). P-value thresholds: Forfeit at your peril. European Journal of Clinical Investigation, 49(10). EJCI-2019-0447.
• 50. References (continued)
  • Mayo, D. G. (2020). Rejecting statistical significance tests: Defanging the arguments. In JSM Proceedings, Statistical Consulting Section. Alexandria, VA: American Statistical Association, 236–256.
  • Mayo, D. G. (2020). Significance tests: Vitiated or vindicated by the replication crisis in psychology? Review of Philosophy and Psychology, 12, 101–120. https://doi.org/10.1007/s13164-020-00501-w
  • Mayo, D. G. (2020). P-values on trial: Selective reporting of (best practice guides against) selective reporting. Harvard Data Science Review, 2.1.
  • Mayo, D. G., & Hand, D. (2022). Statistical significance and its critics: Practicing damaging science, or damaging scientific practice? Synthese, 200, 220.
  • Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest. Conservation Biology, 36(1), 13861. https://doi.org/10.1111/cobi.13861
  • Mayo, D. G., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. British Journal for the Philosophy of Science, 57, 323–357.
• 51. References (continued)
  • New England Journal of Medicine (2019). Author guidelines. Available from: https://www.nejm.org/authorcenter/new-manuscripts
  • Ryan, E., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Medical Research Methodology, 20(1), 1–9.
  • Wasserstein, R., & Lazar, N. (2016). The ASA's statement on p-values: Context, process and purpose. American Statistician, 70(2), 129–133.
  • Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05”. American Statistician, 73(S1), 1–19.
  Blog posts on ErrorStatistics.com:
  • (11/04/2019). “On some self defeating aspects of the ASA’s 2019 recommendations on statistical significance tests.”
  • (6/17/2019). “‘The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean’ (Some Recommendations)(ii).”
  • (6/20/2021). “At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability.”