On Comparing Classifiers:
Pitfalls to Avoid and a
Recommended Approach
(cited by 581)
Author: Steven L. Salzberg
Presented by: Mehmet Ali Abbasoğlu &
Mustafa İlker Saraç
10.04.2014
Contents
1. Motivation
2. Comparing Algorithms
3. Definitions
4. Problems
5. Recommended Approach
6. Conclusion
Motivation
● Be careful about comparative studies of classification
and other algorithms.
○ It is easy to reach statistically invalid conclusions.
● How should one choose which algorithm to use for a new
problem?
● Using brute force one can easily find a phenomenon or
pattern that looks impressive.
○ REALLY?
Motivation
● You have lots of data
○ Choose one from UCI repository
● You have many classification methods to compare
But,
● Should every difference in classification accuracy that reaches
statistical significance be reported as important?
○ Think again!
Comparing Algorithms
● Many papers on new algorithms have evaluation problems, according to a
survey conducted by Prechelt.
○ 29% were not evaluated on a real problem
○ Only 8% were compared to more than one alternative on real
data
● A survey by Flexer on experimental neural network
papers in leading journals
○ Only 3 out of 43 used a separate data set for tuning
parameters.
Comparing Algorithms
● Drawbacks of reporting results on a well studied data
set, e.g. a data set from UCI repository
○ It is hard to improve results
○ Prone to statistical accidents
○ They are fine to see initial results for your new
algorithm
● It is tempting to change a known algorithm slightly and then
use comparisons to report improved results.
○ High risk of statistical invalidity
○ Better to apply new algorithms
Definitions
● Statistical significance
○ In statistics, a result is considered significant not because
it is important or meaningful, but because it is judged
unlikely to have occurred by chance alone.
● t-test
○ Used to determine whether the means of two sets of data are
significantly different from each other
● p-value
○ Probability, assuming the null hypothesis is true, of getting a
result at least as extreme as the one observed.
● null hypothesis
○ The default position, e.g. that there is no real difference
between the two algorithms being compared
Problem 1 :
Small repository of datasets
● It is difficult to produce major new results using well-studied
and widely shared data.
● Suppose 100 people are studying the effect of
algorithms A and B, and the two actually perform equally well
● About 5 of them can still be expected to get results statistically
significant at p <= 0.05
● Those results are due to chance alone.
○ The ones who get significant results will publish
○ While others will simply move on to other experiments.
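The "5 out of 100" figure can be checked with a small simulation. This is a minimal sketch, not part of the original slides: it assumes that when A and B are truly equal, each researcher's p-value is uniformly distributed on [0, 1] (which holds for a valid test with a continuous statistic). The function names are hypothetical.

```python
import random

def count_false_positives(n_researchers=100, alpha=0.05, rng=None):
    """Count researchers who see a 'significant' result although A and B are equal."""
    if rng is None:
        rng = random.Random()
    # Assumption: under a true null hypothesis the p-value is uniform on [0, 1],
    # so each researcher has probability alpha of a spurious significant result.
    return sum(rng.random() <= alpha for _ in range(n_researchers))

def average_false_positives(n_repeats=2000, seed=0):
    """Average the count over many imaginary research communities."""
    rng = random.Random(seed)
    counts = [count_false_positives(rng=rng) for _ in range(n_repeats)]
    return sum(counts) / n_repeats

if __name__ == "__main__":
    # The average should land close to 100 * 0.05 = 5.
    print(average_false_positives())
```

If only the researchers counted here publish, the literature contains only chance results, which is the point of the slide.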
Problem 2 :
Statistical validity
● Statistics offers many tests that are designed to measure
the significance of a difference
● These tests were not designed with computational
experiments in mind.
● For example
○ 14 different variations of classifier algorithms
○ 11 different datasets
○ 154 comparisons, so 154 chances to be significant
○ Expected number of results significant by chance at the 0.05 level is 154 * 0.05 = 7.7
○ This is the multiplicity effect
Problem 2 :
Statistical validity
● Let the significance level for each experiment be α
● Chance of making the right conclusion for one experiment
(when there is no real difference) is (1 - α)
● Assuming experiments are independent of one another,
chance of getting all n experiments correct is (1 - α)^n
● Chance of making at least one incorrect conclusion is 1 - (1 - α)^n
● Substituting α = 0.05 and n = 154
● Chance of making at least one incorrect conclusion is 0.9996
● To obtain results significant at the 0.05 level with 154 tests
1 - (1 - α)^154 < 0.05
α < 0.0003
● This adjustment is known as the Bonferroni adjustment.
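The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch with hypothetical function names; it assumes independent tests, as the slide does. The last line prints the simpler α / n approximation for comparison.

```python
def familywise_error(alpha, n):
    """Chance of at least one false 'significant' result among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n

def adjusted_alpha(target, n):
    """Per-test level that keeps the overall error at `target`.

    Solves 1 - (1 - a)^n = target for a.
    """
    return 1.0 - (1.0 - target) ** (1.0 / n)

if __name__ == "__main__":
    n_tests = 14 * 11  # 14 algorithm variations on 11 data sets = 154 tests
    print(familywise_error(0.05, n_tests))  # very close to 1
    print(adjusted_alpha(0.05, n_tests))    # roughly 0.0003
    print(0.05 / n_tests)                   # alpha / n approximation, also roughly 0.0003
```

Feeding the adjusted level back into familywise_error returns the 0.05 target, which is a quick way to confirm the algebra.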
Problem 3 :
Experiments are not independent
● The t-test assumes that the test sets for
each algorithm are independent.
● Generally two algorithms are compared on
the same data set
○ Obviously the test sets are not independent.
Problem 4 :
Only considers overall accuracy
● Comparison must consider four numbers when a common
test set is used for comparing two algorithms
○ A got right and B got wrong ( A > B )
○ B got right and A got wrong ( B > A )
○ Both algorithms got right
○ Both algorithms got wrong
● If only two algorithms are compared
○ Throw out ties
○ Compare A > B vs B > A
● If more than two algorithms are compared
○ Use “Analysis of Variance” (ANOVA)
○ Bonferroni adjustment for multiple tests
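One common way to compare the "A > B" and "B > A" counts after throwing out ties is an exact two-sided binomial (sign) test. The slide does not name the test, so treat the choice as an assumption; the function name is hypothetical. Under the null hypothesis each non-tied example is equally likely to favour A or B.

```python
from math import comb

def sign_test_p_value(a_wins, b_wins):
    """Two-sided exact binomial test on examples where exactly one algorithm is right.

    a_wins: examples A classified correctly and B did not (A > B)
    b_wins: examples B classified correctly and A did not (B > A)
    Ties (both right or both wrong) are assumed to be discarded already.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(a_wins, b_wins)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

if __name__ == "__main__":
    # Hypothetical counts: A wins on 30 examples, B wins on 15.
    print(sign_test_p_value(30, 15))
```

The p-value depends only on the disagreements, so two classifiers with identical overall accuracy can still be told apart, or not, by how they split those cases.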
Problem 5 :
Repeated tuning
● Researchers tune their algorithms repeatedly to perform
optimally on a data set.
● Whenever tuning takes place, every adjustment should
really be considered as a separate experiment.
○ For example, if 10 tuning experiments were
attempted, then the significance level should be 0.005 instead of
0.05.
● When one uses an algorithm that has been used before,
the algorithm may already have been tuned on public
databases.
Problem 5 :
Repeated tuning
● Recommended approach:
○ Reserve a portion of the training set as a tuning set
○ Repeatedly test the algorithm and adjust parameters on tuning
set.
○ Measure accuracy on the test data.
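The tuning-set idea can be sketched as a simple split of the training data. This is an illustration only; the function name, the 25% tuning fraction and the fixed seed are assumptions, not values from the paper.

```python
import random

def hold_out_tuning_set(training_set, tune_fraction=0.25, seed=0):
    """Split the training data into (train, tune); the test data is never touched."""
    items = list(training_set)
    random.Random(seed).shuffle(items)  # shuffle a copy, leave the input as-is
    n_tune = max(1, int(len(items) * tune_fraction))
    return items[n_tune:], items[:n_tune]

if __name__ == "__main__":
    train, tune = hold_out_tuning_set(range(40))
    print(len(train), len(tune))
    # Adjust parameters using `tune` as often as needed; evaluate on the
    # separate test data only after all tuning decisions are final.
```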
Problem 6 :
Generalizing results
● Common methodological approach
○ pick several datasets from UCI repository
○ perform series of experiments
■ measuring classification accuracy
■ learning rates
● It is not valid to generalize from these results to other
datasets.
○ The repository is not an unbiased sample of classification
problems.
● Someone can write an algorithm that works very well on
some of the known datasets
○ Anyone familiar with the data may be biased.
A Recommended Approach
1. Choose other algorithms to include in the comparison.
2. Choose a benchmark data set.
3. Divide the data set into k subsets for cross validation
○ Typically k = 10
○ For small data sets, choose a larger k.
A Recommended Approach
4. Run cross-validation
○ For each of the k subsets D_i of the data set D, create a training
set T = D - D_i
○ Divide T into two subsets: T1 (training) and T2 (tuning)
○ Once parameters are optimized on T2, re-run training on all of T
○ Measure accuracy on the held-out subset D_i
○ Overall accuracy is averaged across all k partitions.
5. Compare algorithms
● In case of multiple data sets, Bonferroni adjustment
should be applied.
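Step 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `fit(examples, params)` and `accuracy(model, examples)` are hypothetical callables supplied by the user, `candidate_params` is a hypothetical list of settings to try, and the 25% tuning fraction is arbitrary.

```python
import random

def cross_validate(data, k, fit, accuracy, candidate_params,
                   tune_fraction=0.25, seed=0):
    """k-fold cross-validation with parameter tuning kept inside each training set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k roughly equal subsets
    scores = []
    for i, held_out in enumerate(folds):
        # T = D minus the held-out subset
        T = [x for j, fold in enumerate(folds) if j != i for x in fold]
        # Split T into T2 (tuning) and T1 (training)
        n_tune = max(1, int(len(T) * tune_fraction))
        T2, T1 = T[:n_tune], T[n_tune:]
        # Pick the parameters that do best on T2 after training on T1
        best = max(candidate_params, key=lambda p: accuracy(fit(T1, p), T2))
        # Re-train on all of T, then measure accuracy on the held-out subset
        model = fit(T, best)
        scores.append(accuracy(model, held_out))
    # Overall accuracy is the average over the k partitions
    return sum(scores) / k
```

The held-out subset is used exactly once per fold, after parameter selection, so tuning never sees the data used for the reported accuracy.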
Conclusion
● The author does not mean to discourage empirical
comparisons
● The paper offers suggestions for avoiding pitfalls
● It suggests that
○ Statistical tools should be used carefully.
○ Every detail of the experiment should be reported.
Thank you!

CS550 Presentation - On comparing classifiers by Salzberg

  • 1. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach (cited by 581) Author: Steven L. Salzberg Presented by: Mehmet Ali Abbasoğlu & Mustafa İlker Saraç 10.04.2014
  • 2. Contents 1. Motivation 2. Comparing Algorithms 3. Definitions 4. Problems 5. Recommended Approach 6. Conclusion
  • 3. Motivation ● Be careful about comparative studies of classification and other algorithms. ○ It is easy to reach statistically invalid conclusions. ● How should one choose which algorithm to use for a new problem? ● Using brute force, one can easily find a phenomenon or pattern that looks impressive. ○ REALLY?
  • 4. Motivation ● You have lots of data ○ Choose a data set from the UCI repository ● You have many classification methods to compare But, ● Should any difference in classification accuracy that reaches statistical significance be reported as important? ○ Think again!
  • 5. Comparing Algorithms ● Many papers on new algorithms have problems, according to a survey conducted by Prechelt. ○ 29% were not evaluated on a real problem ○ Only 8% were compared to more than one alternative on real data ● A survey by Flexer on experimental neural network papers in leading journals ○ Only 3 out of 43 used a separate data set for tuning parameters.
  • 6. Comparing Algorithms ● Drawbacks of reporting results on a well-studied data set, e.g. a data set from the UCI repository ○ It is hard to improve on existing results ○ Results are prone to statistical accidents ○ Such data sets are still fine for getting initial results for a new algorithm ● It seems easy to change a known algorithm a little and then use comparisons to report improved results. ○ High risk of statistical invalidity ○ Better to apply new algorithms
  • 7. Definitions ● Statistical significance ○ In statistics, a result is considered significant not because it is important or meaningful, but because it is judged unlikely to have occurred by chance alone. ● t-test ○ Used to determine whether the means of two sets of data are significantly different from each other ● p-value ○ Probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. ● null hypothesis ○ The default position that there is no real effect or difference, e.g. that two algorithms perform equally well
  • 8. Problem 1 : Small repository of datasets ● It is difficult to produce major new results using well-studied and widely shared data. ● Suppose 100 people are studying the effect of algorithms A and B ● Even if there is no real difference, about 5 of them are expected to get results statistically significant at p <= 0.05 ● Those results are due to chance. ○ The ones who get significant results will publish ○ While the others will simply move on to other experiments.
  • 9. Problem 2 : Statistical validity ● Statistics offers many tests that are designed to measure the significance of a difference ● These tests were not designed with computational experiments in mind. ● For example ○ 14 different variations of classifier algorithms ○ 11 different datasets ○ 14 * 11 = 154 comparisons, i.e. 154 chances to be significant ○ Expected number of results significant at the 0.05 level by chance alone is 154 * 0.05 = 7.7 ○ This is the multiplicity effect
  • 10. Problem 2 : Statistical validity ● Let the significance level for each experiment be $\alpha$ ● The chance of making the right conclusion in one experiment is $(1 - \alpha)$ ● Assuming the experiments are independent of one another, the chance of getting all $n$ experiments correct is $(1 - \alpha)^n$ ● The chance of making at least one incorrect conclusion is $1 - (1 - \alpha)^n$ ● Substituting $\alpha = 0.05$ and $n = 154$, the chance of making at least one incorrect conclusion is 0.9996 ● To obtain results significant at the 0.05 level with 154 tests, we need $1 - (1 - \alpha)^n < 0.05$, which gives $\alpha < 0.0003$ ● This adjustment is known as the Bonferroni adjustment.
  • 11. Problem 3 : Experiments are not independent ● The t-test assumes that the test sets for each algorithm are independent. ● Generally two algorithms are compared on the same data set ○ Obviously the test sets are not independent.
  • 12. Problem 4 : Only considers overall accuracy ● When a common test set is used for comparing two algorithms, the comparison must consider four numbers ○ A got right and B got wrong ( A > B ) ○ B got right and A got wrong ( B > A ) ○ Both algorithms got right ○ Both algorithms got wrong ● If only two algorithms are compared ○ Throw out ties ○ Compare A > B vs B > A ● If more than two algorithms are compared ○ Use “Analysis of Variance” (ANOVA) ○ Apply the Bonferroni adjustment for multiple tests
  • 13. Problem 5 : Repeated tuning ● Researchers tune their algorithms repeatedly to perform optimally on a data set. ● Whenever tuning takes place, every adjustment should really be considered a separate experiment. ○ For example, if 10 tuning experiments were attempted, then the significance threshold should be 0.005 instead of 0.05. ● When one uses an algorithm that has been used before, the algorithm may already have been tuned on public databases.
  • 14. Problem 5 : Repeated tuning ● Recommended approach: ○ Reserve a portion of the training set as a tuning set ○ Repeatedly test the algorithm and adjust parameters on tuning set. ○ Measure accuracy on the test data.
  • 15. Problem 6 : Generalizing results ● Common methodological approach ○ pick several datasets from the UCI repository ○ perform a series of experiments ■ measuring classification accuracy ■ learning rates ● It is not valid to make general statements about other datasets. ○ The repository is not an unbiased sample of classification problems. ● Someone can write an algorithm that works very well on some of the known datasets ○ Anyone familiar with the data may be biased.
  • 16. A Recommended Approach 1. Choose other algorithms to include in the comparison. 2. Choose a benchmark data set. 3. Divide the data set into k subsets for cross-validation ○ Typically k = 10 ○ For small data sets, choose a larger k.
  • 17. A Recommended Approach 4. Run cross-validation ○ For each of the k subsets of the data set D, create a training set T = D - k (D with that subset held out) ○ Divide T into two subsets: T1 (training) and T2 (tuning) ○ Once parameters are optimized on T2, re-run training on all of T ○ Measure accuracy on the held-out subset k ○ Overall accuracy is averaged across all k partitions. 5. Compare algorithms ● In case of multiple data sets, the Bonferroni adjustment should be applied.
  • 18. Conclusion ● The author does not mean to discourage empirical comparisons ● The paper provides suggestions to avoid pitfalls ● It suggests that ○ Statistical tools should be used carefully. ○ Every detail of the experiment should be reported.
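
The multiplicity arithmetic on slides 9 and 10 can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the independence of the 154 tests is the same assumption the slides make. The last helper uses the common alpha / n form of the Bonferroni adjustment, which is slightly stricter than the exact bound solved on slide 10.

```python
# Numbers from slides 9 and 10: 14 algorithm variants x 11 data sets, alpha = 0.05.
# Function names are hypothetical, chosen only for this illustration.

def familywise_error(alpha: float, n: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n


def exact_per_test_alpha(target: float, n: int) -> float:
    """Per-test level solving 1 - (1 - alpha)^n = target (independent tests)."""
    return 1.0 - (1.0 - target) ** (1.0 / n)


def bonferroni_alpha(target: float, n: int) -> float:
    """Simpler Bonferroni form: divide the target level by the number of tests."""
    return target / n


n_tests = 14 * 11                          # 154 comparisons
expected_by_chance = n_tests * 0.05        # expected "significant" results under the null
fwe = familywise_error(0.05, n_tests)
exact_alpha = exact_per_test_alpha(0.05, n_tests)
simple_alpha = bonferroni_alpha(0.05, n_tests)

print(f"tests: {n_tests}")
print(f"expected significant results by chance: {expected_by_chance:.1f}")
print(f"chance of at least one false positive: {fwe:.4f}")
print(f"per-test alpha (exact): {exact_alpha:.6f}")
print(f"per-test alpha (alpha / n): {simple_alpha:.6f}")
```

Both per-test levels come out just above 0.0003, which is why the threshold on slide 10 is an order of magnitude smaller than 0.003.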
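
Slide 12 says to throw out ties and compare the "A > B" cases against the "B > A" cases, but it does not name a test. One common choice is an exact two-sided binomial (sign) test on the discordant pairs, sketched below. The choice of test, the function names, and the toy labels are all assumptions made for illustration, not material from the slides.

```python
from math import comb


def discordant_counts(truth, pred_a, pred_b):
    """Count examples where exactly one of the two classifiers is correct."""
    a_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if b == t and a != t)
    return a_only, b_only


def sign_test_p_value(a_only: int, b_only: int) -> float:
    """Exact two-sided binomial test of 'A and B win equally often' on non-tied cases."""
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant cases, so no evidence either way
    k = min(a_only, b_only)
    # Probability of k or fewer wins for the weaker side under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy labels and predictions (made up): A alone is right 3 times, B alone once.
truth  = [1, 1, 1, 1, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 1]
pred_b = [1, 0, 0, 0, 0, 0, 1, 0]

a_only, b_only = discordant_counts(truth, pred_a, pred_b)
p_value = sign_test_p_value(a_only, b_only)
print(a_only, b_only, p_value)
```

With so few discordant cases the test cannot separate the two classifiers, even though A has the higher raw accuracy on this toy test set.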
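
The procedure on slides 16 and 17 (hold out one subset, split the rest into T1 and T2, tune on T2, retrain on T, score on the held-out subset) might be sketched as follows. Everything concrete here is an assumption for illustration: the synthetic 1-D data, the tiny nearest-neighbour classifier, the candidate parameter values, and the 75/25 split of T into T1 and T2 do not come from the slides.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_knn(train, n_neighbors):
    """'Training' a nearest-neighbour model just stores the data and the parameter."""
    return (train, n_neighbors)


def accuracy(model, examples):
    """Fraction of examples whose majority-vote prediction matches the label."""
    train, n_neighbors = model
    correct = 0
    for x, y in examples:
        nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:n_neighbors]
        votes = sum(label for _, label in nearest)
        predicted = 1 if 2 * votes > len(nearest) else 0
        correct += predicted == y
    return correct / len(examples)


def cross_validate(data, k, candidates, tune_fraction=0.25):
    """k-fold cross-validation where parameters are tuned only inside the training set."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for held_out in folds:
        test = [data[i] for i in held_out]
        # T = D minus the held-out subset.
        train = [data[i] for fold in folds if fold is not held_out for i in fold]
        cut = int(len(train) * (1 - tune_fraction))
        t1, t2 = train[:cut], train[cut:]          # T1 (training) and T2 (tuning)
        best = max(candidates, key=lambda p: accuracy(fit_knn(t1, p), t2))
        model = fit_knn(train, best)               # re-train on all of T
        scores.append(accuracy(model, test))       # test data is touched only here
    return sum(scores) / len(scores)


# Synthetic two-class data: class 0 centred at 0.0, class 1 centred at 2.0.
rng = random.Random(1)
data = [(rng.gauss(2.0 * label, 1.0), label) for label in (0, 1) for _ in range(50)]

mean_accuracy = cross_validate(data, k=10, candidates=(1, 3, 5, 7))
print(f"mean accuracy over 10 folds: {mean_accuracy:.3f}")
```

The point of the structure is that the held-out subset never influences the parameter choice, so the averaged accuracy is not inflated by the repeated tuning described on slides 13 and 14.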