SlideShare una empresa de Scribd logo
1 de 1
Descargar para leer sin conexión
A Comparison of the Optimality 
of Statistical Significance Tests 
for Information Retrieval Evaluation 
Julián Urbano, Mónica Marrero and Diego Martín 
Department of Computer Science · University Carlos III of Madrid 
The problem: is system A more effective than system B? 
The drill: evaluate with a test collection and run a statistical significance test 
The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation? 
The reason: test assumptions are violated, so which one is optimal in practice? 
Three criteria: power (maximize # of significants), safety (minimize # of errors), exact (keep errors at α) 
Data and Methods 
· TREC Robust 2004: 100 topics from Ad Hoc 7 and 8 
o 110 runs, 5995 pairs of systems 
· Randomly split topics in T1 and T2, as if two collections 
o Evaluate all runs and compute p-values 
o Compare p-values from T1 with p-values from T2 
o 1000 trials, 12M p-values per test, 60M in total 
· Interpret pairs of p-values for different α levels 
T2 
A ≻B A ≺B A ≻≻B A ≺≺B 
T1 
A ≻B Non-significance 
A≻≻B 
Lack of 
power 
Minor 
error 
Success 
Major 
error 
Non-significance rate 
t-test 
permutation 
bootstrap 
Wilcoxon 
sign 
.001 .005 .01 .05 .1 
Significance level a 
Non-significants / Total 
0.3 0.35 0.4 0.45 0.5 0.6 
Previous Work 
Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06 
· Wilcoxon more powerful than t-test, but more errors 
Smucker et al. ‘07, ‘09 
· bootstrap test overly powerful, though similar to t-test 
and permutation 
· Wilcoxon and sign unreliable, should use permutation 
· Power: bootstrap test gives more significant results 
· Safety: t-test gives fewer errors 
· Exactness: Wilcoxon test best tracks the nominal level 
· The permutation test is not optimal in practice 
· Error rates seem lower than expected; focus on power 
Success rate 
.001 .005 .01 .05 .1 
Significance level a 
Successes / Total significants 
0.76 0.78 0.80 0.82 0.84 0.86 
Take-Home Messages 
Lack of power rate 
.001 .005 .01 .05 .1 
Significance level a 
Lacks of power / Total significants 
0.12 0.14 0.16 0.18 0.20 
Minor error rate 
t-test 
permutation 
bootstrap 
Wilcoxon 
sign 
y=x 
.001 .005 .01 .05 .1 
Significance level a 
Minor errors / Total significants 
0.001 0.002 0.005 0.010 0.020 
Major error rate 
.001 .005 .01 .05 .1 
Significance level a 
Major errors / Total significants 
5e-07 5e-06 5e-05 5e-04 
Global error rate 
.0001 .0005.001 .005 .01 .05 .1 .5 
Significance level a 
Minor and Major errors / Total significants 
5e-04 2e-03 5e-03 2e-02 5e-02 
Dublin, Ireland · 30th July 2013 Supported by ACM SIGIR Student Travel Grant

Más contenido relacionado

Similar a A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

Cross validation
Cross validationCross validation
Cross validationRidhaAfrawe
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision makingShankha Goswami
 
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Julián Urbano
 
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...TESCO - The Eastern Specialty Company
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxbelay41
 
Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Matt Hansen
 
Statistical process control ppt @ bec doms
Statistical process control ppt @ bec domsStatistical process control ppt @ bec doms
Statistical process control ppt @ bec domsBabasab Patil
 
Hph7300week14winter2009narr
Hph7300week14winter2009narrHph7300week14winter2009narr
Hph7300week14winter2009narrSarah
 
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykblckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykSMayankSharma
 
Chap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc HkChap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc Hkajithsrc
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
Fault simulation – application and methods
Fault simulation – application and methodsFault simulation – application and methods
Fault simulation – application and methodsSubash John
 
SE 09 (test design techs).pptx
SE 09 (test design techs).pptxSE 09 (test design techs).pptx
SE 09 (test design techs).pptxZohairMughal1
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Processknksmart
 

Similar a A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation (20)

Cross validation
Cross validationCross validation
Cross validation
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision making
 
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
Site Verification: Tools and Best Practices to Accurately Meter Complex, High...
 
Complete Site Testing
Complete Site TestingComplete Site Testing
Complete Site Testing
 
ML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptxML-ChapterFour-ModelEvaluation.pptx
ML-ChapterFour-ModelEvaluation.pptx
 
Burden Testing, Theory, and Practice
Burden Testing, Theory, and PracticeBurden Testing, Theory, and Practice
Burden Testing, Theory, and Practice
 
Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3Process Capability: Steps 1 to 3
Process Capability: Steps 1 to 3
 
Instrument Transformer Testing
Instrument Transformer TestingInstrument Transformer Testing
Instrument Transformer Testing
 
SE%200-Testing%20(2).pptx
SE%200-Testing%20(2).pptxSE%200-Testing%20(2).pptx
SE%200-Testing%20(2).pptx
 
Statistical process control ppt @ bec doms
Statistical process control ppt @ bec domsStatistical process control ppt @ bec doms
Statistical process control ppt @ bec doms
 
Hph7300week14winter2009narr
Hph7300week14winter2009narrHph7300week14winter2009narr
Hph7300week14winter2009narr
 
Facility Location
Facility Location Facility Location
Facility Location
 
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ ykblckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
blckboxtesting.ppt il.;io'/ ulio'[ yjko8i[0'-p/ yk
 
Complete Site Testing
Complete Site TestingComplete Site Testing
Complete Site Testing
 
Chap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc HkChap 9 A Process Capability & Spc Hk
Chap 9 A Process Capability & Spc Hk
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Fault simulation – application and methods
Fault simulation – application and methodsFault simulation – application and methods
Fault simulation – application and methods
 
SE 09 (test design techs).pptx
SE 09 (test design techs).pptxSE 09 (test design techs).pptx
SE 09 (test design techs).pptx
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Process
 

Más de Julián Urbano

Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowJulián Urbano
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationJulián Urbano
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationJulián Urbano
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackJulián Urbano
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...Julián Urbano
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Julián Urbano
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Julián Urbano
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityJulián Urbano
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalJulián Urbano
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityJulián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...Julián Urbano
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...Julián Urbano
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Julián Urbano
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityJulián Urbano
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Julián Urbano
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsJulián Urbano
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksJulián Urbano
 

Más de Julián Urbano (20)

Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
Improving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered ListsImproving the Generation of Ground Truths based on Partially Ordered Lists
Improving the Generation of Ground Truths based on Partially Ordered Lists
 
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity TasksCrowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
 

Último

TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 

Último (20)

TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 

A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation

  • 1. A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation Julián Urbano, Mónica Marrero and Diego Martín Department of Computer Science · University Carlos III of Madrid The problem: is system A more effective than system B? The drill: evaluate with a test collection and run a statistical significance test The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation? The reason: test assumptions are violated, so which one is optimal in practice? Three criteria: power (maximize # of significants), safety (minimize # of errors), exact (keep errors at α) Data and Methods · TREC Robust 2004: 100 topics from Ad Hoc 7 and 8 o 110 runs, 5995 pairs of systems · Randomly split topics in T1 and T2, as if two collections o Evaluate all runs and compute p-values o Compare p-values from T1 with p-values from T2 o 1000 trials, 12M p-values per test, 60M in total · Interpret pairs of p-values for different α levels T2 A ≻B A ≺B A ≻≻B A ≺≺B T1 A ≻B Non-significance A≻≻B Lack of power Minor error Success Major error Non-significance rate t-test permutation bootstrap Wilcoxon sign .001 .005 .01 .05 .1 Significance level a Non-significants / Total 0.3 0.35 0.4 0.45 0.5 0.6 Previous Work Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06 · Wilcoxon more powerful than t-test, but more errors Smucker et al. ‘07, ‘09 · bootstrap test overly powerful, though similar to t-test and permutation · Wilcoxon and sign unreliable, should use permutation · Power: bootstrap test gives more significant results · Safety: t-test gives fewer errors · Exactness: Wilcoxon test best tracks the nominal level · The permutation test is not optimal in practice · Error rates seem lower than expected; focus on power Success rate .001 .005 .01 .05 .1 Significance level a Successes / Total significants 0.76 0.78 0.80 0.82 0.84 0.86 Take-Home Messages Lack of power rate .001 .005 .01 .05 .1 Significance level a Lacks of power / Total significants 0.12 0.14 0.16 0.18 0.20 Minor error rate t-test permutation bootstrap Wilcoxon sign y=x .001 .005 .01 .05 .1 Significance level a Minor errors / Total significants 0.001 0.002 0.005 0.010 0.020 Major error rate .001 .005 .01 .05 .1 Significance level a Major errors / Total significants 5e-07 5e-06 5e-05 5e-04 Global error rate .0001 .0005.001 .005 .01 .05 .1 .5 Significance level a Minor and Major errors / Total significants 5e-04 2e-03 5e-03 2e-02 5e-02 Dublin, Ireland · 30th July 2013 Supported by ACM SIGIR Student Travel Grant