Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors

Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise because of the selection of topics. According to recent surveys of SIGIR, CIKM, ECIR and TOIS papers, the t-test is the most popular choice among IR researchers. However, previous work has suggested computer-intensive tests like the bootstrap or the permutation test, based mainly on theoretical arguments. On empirical grounds, others have suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the question of which tests we should use has accompanied IR and related fields for decades now. Previous theoretical studies on this matter were limited in that we know that test assumptions are not met in IR experiments, and empirical studies were limited in that we do not have the necessary control over the null hypotheses to compute actual Type I and Type II error rates under realistic conditions. Therefore, not only is it unclear which test to use, but also how much trust we should put in them. In contrast to past studies, in this paper we employ a recent simulation methodology fitted to TREC data to work around these limitations. Our study comprises over 500 million p-values computed for a range of tests, systems, effectiveness measures, topic set sizes and effect sizes, and for both the 2-tailed and 1-tailed cases. Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, we are finally in a position to evaluate how well statistical significance tests really behave with IR data, and make sound recommendations for practitioners.

Transcript

  1. Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft. SIGIR 2019 · July 23rd · Paris. Picture by dalbera
  2. Current Statistical Testing Practice • According to surveys by Sakai & Carterette – 60–75% of IR papers use significance testing – In the paired case (2 systems, same topics): • 65% use the paired t-test • 25% use the Wilcoxon test • 10% use others, like Sign, Bootstrap & Permutation 2
  3. t-test and Wilcoxon are the de facto choice. Is this a good choice? 3
  4.–7. Our Journey [timeline figure, 1980–2020, repeated on four slides: van Rijsbergen; Hull @SIGIR; Savoy @IP&M; Wilbur @JIS; Zobel @SIGIR; Voorhees & Buckley @SIGIR; Voorhees @SIGIR; Sanderson & Zobel @SIGIR; Cormack & Lynam @SIGIR; Sakai @SIGIR; Smucker et al. @SIGIR; Smucker et al. @CIKM; Carterette @TOIS; Urbano et al. @SIGIR; Parapar et al. @JASIST; Carterette @ICTIR; Sakai @SIGIR Forum; Urbano @JIR; Urbano & Nagler @SIGIR] • 1st period: statistical testing unpopular; theoretical arguments around test assumptions • 2nd period: empirical studies appear; resampling-based tests and the t-test • 3rd period: wide adoption of statistical testing; long-pending discussion about statistical practice 4
  8. Our Journey • Theoretical and empirical arguments for and against specific tests • 2-tailed tests at α=.05 with AP and P@10, almost exclusively • Limited data, resampling from the same topics • No control over the null hypothesis • Discordances or conflicts among tests, but no actual error rates 5
  9. Main reason? No control of the data-generating process 6
  10. PROPOSAL FROM SIGIR 2018
  11. Stochastic Simulation • Build a generative model of the joint distribution of system scores • So that we can simulate scores on new, random topics (no content, only scores) • Unlimited data • Full control over H0 [diagram: Model → simulated AP scores → test → p-values] (Urbano & Nagler, SIGIR 2018) 9
  12. Stochastic Simulation • Build a generative model of the joint distribution of system scores • So that we can simulate scores on new, random topics (no content, only scores) • Unlimited data • Full control over H0 • The model is flexible, and can be fit to existing data to make it realistic [diagram: TREC IR systems' AP scores → Model → simulated AP scores → test → p-values] (Urbano & Nagler, SIGIR 2018) 10
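
To make the "fit to existing data" step concrete, here is a minimal sketch that fits a Beta marginal to one system's per-topic scores; the Beta family is a stand-in for the richer marginals used in the paper, and the ap_scores array is made up for illustration:

    import numpy as np
    from scipy import stats

    # Hypothetical per-topic AP scores for one system (illustrative only).
    ap_scores = np.array([0.12, 0.33, 0.05, 0.48, 0.27, 0.19, 0.41, 0.08])

    # Fit a Beta marginal by maximum likelihood, fixing the support to [0, 1].
    a, b, loc, scale = stats.beta.fit(ap_scores, floc=0, fscale=1)
    print(a, b, a / (a + b))  # fitted shapes and the implied mean score
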
  13.–14. Stochastic Simulation • We use copula models, which separate: 1. Marginal distributions, of individual systems (these give us full knowledge of and control over H0) 2. Dependence structure, among systems [figure, repeated on both slides: Experimental and Baseline score distributions with means μE and μB] 11
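
As a toy illustration of the copula idea (not the authors' actual model, which fits more flexible copulas and marginals to TREC data), the sketch below pairs Beta marginals, which fix the two means and hence H0, with a Gaussian copula for the dependence between systems; simulate_scores, rho and conc are hypothetical choices:

    import numpy as np
    from scipy import stats

    def simulate_scores(n_topics, mean_b, mean_e, rho=0.8, conc=10, seed=None):
        """Simulate paired AP-like scores for a Baseline and an Experimental system."""
        rng = np.random.default_rng(seed)
        # Dependence structure: correlated standard normals mapped to uniforms.
        z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n_topics)
        u = stats.norm.cdf(z)
        # Marginal distributions: Beta(a, b) has mean a / (a + b).
        baseline = stats.beta.ppf(u[:, 0], mean_b * conc, (1 - mean_b) * conc)
        experimental = stats.beta.ppf(u[:, 1], mean_e * conc, (1 - mean_e) * conc)
        return baseline, experimental

    # Full control over H0: equal means make it true; a shift δ makes it false.
    b, e = simulate_scores(n_topics=50, mean_b=0.25, mean_e=0.25, seed=42)
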
  15. Research Question • Which is the test that… 1. maintains Type I errors at the α level, 2. has the highest statistical power, 3. across measures and sample sizes, 4. with IR-like data? 12
  16. Factors Under Study • Paired test: Student’s t, Wilcoxon, Sign, Bootstrap-shift, Permutation • Measure: AP, nDCG@20, ERR@20, P@10, RR • Topic set size n: 25, 50, 100 • Effect size δ: 0.01, 0.02, …, 0.1 • Significance level α: 0.001, …, 0.1 • Tails: 1 and 2 • Data to fit stochastic models: TREC 5–8 Ad Hoc and 2010–13 Web 13
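
These factors form a simple factorial grid; a hypothetical sketch of how the cells could be enumerated (all names made up):

    from itertools import product

    tests = ["t", "wilcoxon", "sign", "bootstrap-shift", "permutation"]
    measures = ["AP", "nDCG@20", "ERR@20", "P@10", "RR"]
    topic_set_sizes = [25, 50, 100]
    effect_sizes = [round(0.01 * k, 2) for k in range(1, 11)]  # 0.01, ..., 0.1

    # One cell per combination; each cell gets many simulated datasets.
    for measure, n, delta in product(measures, topic_set_sizes, effect_sizes):
        for test in tests:
            pass  # simulate datasets, run the test, record 1- and 2-tailed p-values
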
  17. We report results on >500 million p-values. 1.5 years of CPU time. ¯\_(ツ)_/¯ 14
  18. TYPE I ERRORS
  19.–23. Simulation such that μE = μB [diagram, built up over five slides: TREC systems-by-topics score matrix → fit the Experimental and Baseline models → set the means so that μE = μB → simulate scores on new topics → run the tests → collect p-values] 16
  24. Simulation such that μE = μB • Repeat for each measure and topic set size n – 1,667,000 times – ≈8.3 million 2-tailed p-values – ≈8.3 million 1-tailed p-values • Grand total of >250 million p-values • Any p < α corresponds to a Type I error
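
A minimal sketch of this Type I loop, reusing the simulate_scores helper sketched earlier and the t-test as the example test (the study runs all five tests):

    import numpy as np
    from scipy import stats

    def type1_rate(n_trials, n_topics, alpha=0.05, seed=0):
        """Fraction of true-H0 simulations that a paired t-test rejects."""
        rng = np.random.default_rng(seed)
        rejections = 0
        for _ in range(n_trials):
            # Equal means: H0 is true by construction, so p < alpha is a Type I error.
            b, e = simulate_scores(n_topics, mean_b=0.25, mean_e=0.25,
                                   seed=rng.integers(2**32))
            rejections += stats.ttest_rel(e, b).pvalue < alpha
        return rejections / n_trials  # should be close to alpha

    print(type1_rate(n_trials=2000, n_topics=50))
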
  25. Type I Errors by α | n (2-tailed) [plots]. Not so interested in specific points but in trends 18
  26.–29. Type I Errors by α | n (2-tailed) [plots, annotated over four slides] • Wilcoxon and Sign have higher error rates than expected; Wilcoxon does better on P@10 and RR because of their symmetry; even worse as sample size increases (with RR too) • Bootstrap has high error rates too; it tends to correct with sample size because it estimates the sampling distribution better • Permutation and t-test have nearly ideal behavior; Permutation is very slightly sensitive to sample size, while the t-test is remarkably robust to it 20
  30. Type I Errors - Summary • Wilcoxon, Sign and Bootstrap tests tend to make more errors than expected • Increasing sample size helps Bootstrap, but hurts Wilcoxon and Sign even more • Permutation and t-test have nearly ideal behavior across measures, even with small sample sizes • t-test is remarkably robust • Same conclusions with 1-tailed tests 21
  31. TYPE II ERRORS
  32.–36. Simulation such that μE = μB + δ [diagram, built up over five slides: same pipeline as for Type I errors, but the Experimental model is shifted so that μE = μB + δ] 23
  37. Simulation such that μE = μB + δ • Repeat for each measure, topic set size n and effect size δ – 167,000 times – ≈8.3 million 2-tailed p-values – ≈8.3 million 1-tailed p-values • Grand total of >250 million p-values • Any p > α corresponds to a Type II error
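
The corresponding power loop, again a sketch reusing simulate_scores and the t-test; the only change from the Type I setting is the shift δ in the Experimental mean:

    import numpy as np
    from scipy import stats

    def power(n_trials, n_topics, delta, alpha=0.05, seed=0):
        """Fraction of simulations with a true shift delta that the test detects."""
        rng = np.random.default_rng(seed)
        rejections = 0
        for _ in range(n_trials):
            # mu_E = mu_B + delta: H0 is false, so p > alpha is a Type II error.
            b, e = simulate_scores(n_topics, mean_b=0.25, mean_e=0.25 + delta,
                                   seed=rng.integers(2**32))
            rejections += stats.ttest_rel(e, b).pvalue < alpha
        return rejections / n_trials  # power = 1 - Type II error rate

    print(power(n_trials=2000, n_topics=50, delta=0.03))
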
  38. Power by δ | n (α=.05, 2-tailed) [plot: power curves, with the ideal behavior marked] 25
  39.–42. Power by δ | n (α=.05, 2-tailed) [plots, annotated over four slides] • Clear effects of effect size δ, sample size n, and measure (via σ) • Sign test is consistently the least powerful (it disregards magnitudes) • Bootstrap test is consistently the most powerful, especially for small n • Permutation and t-test are almost identical again, and very close to Bootstrap as sample size increases • Wilcoxon is very similar to Permutation and t-test, even slightly better with small n or δ, especially for AP, nDCG and ERR (it is indeed more efficient with some asymmetric distributions) 26
  43.–44. Power by α | δ (n=50, 2-tailed) [plots] • With small δ, Wilcoxon and Bootstrap are consistently the most powerful • With large δ, Permutation and t-test catch up with Wilcoxon 27
  45. Type II Errors - Summary • All tests, except Sign, behave very similarly • Bootstrap and Wilcoxon are consistently a bit more powerful across significance levels – But more Type I errors! • With larger effect sizes and sample sizes, Permutation and t-test catch up with Wilcoxon, but not with Bootstrap • Same conclusions with 1-tailed tests 28
  46. TYPE III ERRORS
  47. Type III what? • A wrong directional decision based on the correct rejection of a non-directional hypothesis • Example: – We observe a positive result, E > B – We run a 2-tailed test of H0: μE = μB – We find p < α, so we reject and conclude μE > μB – But H0 is non-directional – What if we just got lucky, and really μE < μB? 30
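
Because the simulation knows the true direction, Type III errors can be counted directly: reject with the 2-tailed test, then check whether the observed direction disagrees with the truth. A sketch, reusing simulate_scores:

    import numpy as np
    from scipy import stats

    def type3_rate(n_trials, n_topics, delta, alpha=0.05, seed=0):
        """Among significant 2-tailed results, the fraction pointing the wrong way."""
        rng = np.random.default_rng(seed)
        significant, wrong_direction = 0, 0
        for _ in range(n_trials):
            # The true direction is mu_E > mu_B, since delta > 0.
            b, e = simulate_scores(n_topics, mean_b=0.25, mean_e=0.25 + delta,
                                   seed=rng.integers(2**32))
            if stats.ttest_rel(e, b).pvalue < alpha:
                significant += 1
                if np.mean(e) < np.mean(b):  # rejected, but observed E < B
                    wrong_direction += 1
        return wrong_direction / max(significant, 1)

    print(type3_rate(n_trials=2000, n_topics=50, delta=0.01))
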
  48.–50. Type III Errors by δ | n (α=.05) [plots, annotated over three slides] • Clear effect of δ and n • P@10 and RR are substantially more problematic because of their higher σ • Bootstrap tends to correct with sample size • Wilcoxon stays the same, and the Sign test gets even worse 31
  51. Type III Errors in Practice • How much of a problem could this be? • Example: AP and n=50 topics – Improvement of +0.01 over the baseline – 2-tailed t-test comes up significant – 7.3% probability that it is a Type III error and your system is actually worse – Is that too high? 32
  52. CONCLUSIONS
  53. What We Did • First empirical study of actual error rates with IR-like data • Comprehensive – Paired tests: Student’s t, Wilcoxon, Sign, Bootstrap-shift, Permutation – Measures: AP, nDCG@20, ERR@20, P@10, RR – Topic set sizes: 25, 50, 100 – Effect sizes: 0.01, 0.02, …, 0.1 – Significance levels: 0.001, …, 0.1 – Tails: 1 and 2 • More than 500 million p-values • All data and many more plots are available online: https://github.com/julian-urbano/sigir2019-statistical 34
  54. Recommendations • Don’t use the Wilcoxon or Sign tests anymore • For statistics other than the mean, use the permutation test, and the bootstrap only if you have many topics • For typical tests about mean scores, the t-test is simple, the most robust, behaves as expected w.r.t. Type I errors, and is nearly as powerful as the Bootstrap. Keep using it 35
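
A minimal sketch of what these recommendations look like with scipy, on toy paired scores (the permutation test is shown with the mean for comparability; swap in any other statistic as needed):

    import numpy as np
    from scipy import stats

    # Toy per-topic scores for two systems evaluated on the same topics.
    baseline = np.array([0.21, 0.35, 0.12, 0.40, 0.28])
    experimental = np.array([0.25, 0.33, 0.15, 0.47, 0.30])

    # Recommended default: the paired t-test on mean scores.
    print(stats.ttest_rel(experimental, baseline).pvalue)

    # Paired (within-pair exchange) permutation test, scipy >= 1.7.
    def mean_diff(x, y, axis=-1):
        return np.mean(x - y, axis=axis)

    res = stats.permutation_test((experimental, baseline), mean_diff,
                                 permutation_type="samples", n_resamples=10_000)
    print(res.pvalue)
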
