The forgotten practicalities of machine learning:
Evaluating machine learning models
and their diagnostic value
Gaël Varoquaux
Based on Varoquaux & Colliot,
hal.archives-ouvertes.fr/hal-03682454
Evaluation: the Achilles heel of machine learning practice
Brain imaging prediction challenge [Traut... 2022]
Autism status prediction
10 000 € incentives
Analysts overfit the public set
G Varoquaux 1
Evaluation: the Achilles heel of machine learning practice
Kaggle competitions [Varoquaux and Cheplygina 2022]
Observed improvement in score, public vs private leaderboard:
- Lung cancer classification — prize: $1 000 000, test size: max 1K (score range ±0.75)
- Pneumonia detection (localization) — prize: $30 000, test size: 3 000 (score range ±0.15)
- Covid-19 abnormality localization — prize: $100 000, test size: 1 200 (score range ±0.01)
- Nerve segmentation — prize: $100 000, test size: 5.5K (score range ±0.04)
[Plot markers: improvement of top model on 10% best submissions, winner gap, evaluation noise between public and private sets]
Many submissions fall within evaluation error
G Varoquaux 1
Evaluation: the Achilles heel of machine learning practice
Deep learning literature [Bouthillier... 2021]
[Plot: accuracy vs year (2012–2020) on cifar10 (90–100%) and sst2 (85–100%); non-'SOTA' results, with improvements marked significant / non-significant]
Published improvements often fall within error bars
G Varoquaux 1
Problem setting: defining objects
(x, y) ∈ X × Y drawn from D
a function f : X → Y such that f(x) ≈ y    Notation: ŷ := f(x)
Evaluate the risk
Error function l : Y × Y → ℝ captures the cost of errors
Expected error: E_{x,y~D}[ l(f(x), y) ]
Training might be done with
- another D
- another l
G Varoquaux 2
Outline
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
G Varoquaux 3
1 Performance metrics
Metrics must capture and reflect the application
Going further for image analysis [Maier-Hein... 2022]
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Summary metric: accuracy
Accuracy: the fraction of correct predictions (one minus the error rate)
Simple summary
Overly simplified picture
Distinguishing false detections from misses:
eg a false detection of a brain tumor
- Very bad if it leads to brain surgery
- Minor if it leads to confirmatory imaging
The relevant cost (error metric) depends on the application setting
G Varoquaux 6
Confusion matrix: beyond accuracy
                              Truth: disease status
                              D+        D−
Predicted test result   T+   TP        FP
                        T−   FN        TN
G Varoquaux 7
Metrics on positive and negative classes
T denotes test: classifier output; D denotes diseased status.
Sensitivity (also called recall): fraction of positive samples actually retrieved
    Sensitivity = TP / (TP + FN)        Estimates P(T+ | D+)
Specificity: fraction of negative samples actually classified as negative
    Specificity = TN / (TN + FP)        Estimates P(T− | D−)
Positive predictive value (PPV, also called precision): fraction of the positively classified samples which are indeed positive
    PPV = TP / (TP + FP)                Estimates P(D+ | T+)
Negative predictive value (NPV): fraction of the negatively classified samples which are indeed negative
    NPV = TN / (TN + FN)                Estimates P(D− | T−)
G Varoquaux 8
[Varoquaux and Colliot 2022]
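To make these definitions concrete, here is a minimal sketch (toy labels, scikit-learn's confusion_matrix; not code from the slides) computing the four metrics and balanced accuracy:

```python
# Minimal sketch (toy labels): the four metrics from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # D+/D- ground truth (toy data)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # T+/T- classifier output

# rows = true class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall, estimates P(T+ | D+)
specificity = tn / (tn + fp)   # estimates P(T- | D-)
ppv = tp / (tp + fp)           # precision, estimates P(D+ | T+)
npv = tn / (tn + fn)           # estimates P(D- | T-)
balanced_accuracy = 0.5 * (sensitivity + specificity)
print(sensitivity, specificity, ppv, npv, balanced_accuracy)
```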
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Accuracy, sensitivity, specificity
                              Truth: disease status
                              D+        D−
Predicted test result   T+   TP        FP
                        T−   FN        TN
Accuracy
uninformative under class imbalance
90% of class 0
⇒ predicting only class 0 gives Acc=90%
Balanced accuracy: errors on class 0 and class 1
Sensitivity (also called recall): fraction of class 1 retrieved: TP / (TP + FN)
Specificity: fraction of class 0 actually classified as 0: TN / (TN + FP)
Balanced accuracy: Âœ (sensitivity + specificity)
Sensitivity estimates P(T+ | D+); specificity estimates P(T− | D−)
G Varoquaux 10
Asking the right question: P(T+|D+) vs P(D+|T+)
Positive predictive value (via Bayes’ theorem):
    P(D+ | T+) = sensitivity × prevalence / [ (1 − specificity) × (1 − prevalence) + sensitivity × prevalence ]
[Diagram: overlap between diseased individuals and positive tests]
Summary metric: Markedness: PPV + NPV - 1
Drawback: depends on prevalence
⇒ Characterizes not only the classifier, but also the dataset
G Varoquaux 11
Odds ratios for invariance to sampling
Definition: odds of a:  O(a) = P(a) / (1 − P(a))
Likelihood ratio of the positive class:
    LR+ = O(D+ | T+) / O(D+) = Sensitivity / (1 − Specificity)
Independent of class prevalence1
Use prevalence on target population to compute O(D+|T+)
Useful to extrapolate from case-control study to target
population
1Proof in Varoquaux & Colliot “Evaluating machine learning models and their diagnostic value”
G Varoquaux 12
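A minimal sketch (toy sensitivity, specificity and prevalence values) of how LR+ lets one move from a case-control study to a post-test probability in a target population:

```python
# Minimal sketch (toy numbers): from sensitivity/specificity to a post-test
# probability in a target population with a given prevalence.
sensitivity, specificity = 0.80, 0.95  # e.g. measured in a case-control study
prevalence = 0.01                      # prevalence in the target population

# Bayes' theorem, as on the slide
ppv = (sensitivity * prevalence) / (
    (1 - specificity) * (1 - prevalence) + sensitivity * prevalence)

# Equivalent route through the positive likelihood ratio (prevalence-free)
lr_plus = sensitivity / (1 - specificity)      # LR+
pre_test_odds = prevalence / (1 - prevalence)  # O(D+)
post_test_odds = lr_plus * pre_test_odds       # O(D+ | T+)
ppv_from_odds = post_test_odds / (1 + post_test_odds)
print(ppv, ppv_from_odds)  # both ~0.14: a positive test is still mostly wrong
```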
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Multi-threshold metrics: ROC curve
[ROC curves: true positive rate (= sensitivity, recall) vs false positive rate; example curves: excellent AUC=0.95, good AUC=0.88, poor AUC=0.75, chance AUC=0.50]
Classifiers output continuous scores
The ROC curve is obtained by varying the decision threshold
Summary metric: AUC
Area Under the Curve
Chance = .5
G Varoquaux 14
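A minimal sketch with synthetic data (scikit-learn's make_classification and a logistic regression, chosen only for illustration) of how the ROC curve and its AUC are computed from continuous scores:

```python
# Minimal sketch (synthetic data): ROC curve and AUC from continuous scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # continuous scores
fpr, tpr, thresholds = roc_curve(y_test, scores)  # one point per threshold
print("ROC AUC:", roc_auc_score(y_test, scores))
```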
Multi-threshold metrics: Precision Recall curve
[Precision-recall curves: precision (= PPV) vs recall (= TPR, sensitivity); example curves: excellent AUC=0.96, good AUC=0.92, poor AUC=0.78, chance AUC=0.57]
Better dynamic range than ROC for imbalanced classes
Area Under the Curve: chance level depends on class imbalance
G Varoquaux 15
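A similar sketch on an artificially imbalanced problem, computing the precision-recall curve and its area (average precision); the chance level equals the positive fraction:

```python
# Minimal sketch (synthetic, imbalanced data): precision-recall curve and
# average precision; the chance level is the positive fraction.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("Average precision:", average_precision_score(y_test, scores))
print("Chance level (positive fraction):", y_test.mean())
```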
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Interpreting classifier score as a probability? – Calibration
Calibration: among samples with confidence score s, the average error rate matches s
Computed in bins on the score s
ECE (expected calibration error): average of the calibration error over all bins
[Calibration plot: observed fraction of positives vs mean confidence score; diagonal = perfectly calibrated; above the diagonal = underconfident, below = overconfident; example: GaussianNB]
Does not control individual probabilities
G Varoquaux 17
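A minimal sketch (synthetic data, GaussianNB as a deliberately miscalibrated example) computing the calibration-plot data and a simple equal-width-bin ECE:

```python
# Minimal sketch (synthetic data): calibration plot data and a simple
# equal-width-bin expected calibration error (ECE).
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Calibration plot: observed fraction of positives vs mean score, per bin
frac_pos, mean_score = calibration_curve(y_test, proba, n_bins=10)

# ECE computed by hand over 10 equal-width bins, weighted by bin size
bin_ids = np.clip((proba * 10).astype(int), 0, 9)
ece = 0.0
for b in range(10):
    in_bin = bin_ids == b
    if in_bin.any():
        gap = abs(y_test[in_bin].mean() - proba[in_bin].mean())
        ece += in_bin.mean() * gap
print("ECE:", ece)
```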
Classifier score as a probability? – Individual probabilities
Does the classifier approach P(y|X)?
Metric:
Brier score = ÎŁ_i (ŝ_i − y_i)ÂČ        (ŝ_i: confidence score, y_i: observed binary label)
Minimal for ŝ = P(y|X)
Scaled version:
Brier skill score = 1 − Brier(ŝ, y) / Brier(ȳ, y)        (ȳ: class prevalence)
1 = perfect prediction
0 = scores no more informative than the class prevalence
G Varoquaux 18
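A minimal sketch with toy labels and scores: the Brier score, and the Brier skill score relative to always predicting the class prevalence:

```python
# Minimal sketch (toy labels and scores): Brier score and Brier skill score
# relative to always predicting the class prevalence.
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
scores = np.array([.1, .2, .1, .8, .6, .3, .9, .2, .4, .7])

brier = brier_score_loss(y_true, scores)  # mean (s_i - y_i)^2
brier_ref = brier_score_loss(y_true, np.full_like(scores, y_true.mean()))
brier_skill = 1 - brier / brier_ref       # 1 = perfect, 0 = no better than prevalence
print(brier, brier_skill)
```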
Choose your metrics well
Machine learning research chases metrics
These should reflect the application as well as possible
Worry about P(D+ | T+)
- Accuracy is a reasonable proxy only for balanced classes
- LR+ is interesting to keep in mind
Worry about uncertainty
- Calibration quantifies average errors
- The Brier score controls individual uncertainty
A single number does not tell the whole story
G Varoquaux 19
2 Evaluation procedures
Controlled estimation of e = E_{x,y~D}[ l(f(x), y) ]
Corresponding research paper: [Bouthillier... 2021]
Definitions: what are we benchmarking?
Scenario 1: a prediction rule:
We are given f : X → Y
For application claims: eg medicine
Scenario 2: a training procedure:
We are given: a procedure that outputs a prediction rule f̂
from training data (X, y) ∈ (X × Y)n
For machine-learning research (claims on algorithms)
G Varoquaux 21
Model evaluation 101
In-sample error is not meaningful: evaluating f̂ on the same data (X, y) it was trained on estimates the expected error E incorrectly
Typically:
[Diagram: full data split into a train set and a test set]
Test set typically 10%: trade-off between better learning and better estimation
G Varoquaux 22
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
External validation: before putting in production
We are given f : X → Y
Xtest different enough from Xtrain
- No repeated acquisition of the same individual in train & test
[Little... 2017]
- Ideally: show generalization to new site, later in time...
Xtest representative of target population
Sample Xtest:
- To match statistical moments
- To minimize a confounding association (shortcuts)
[Chyzhyk... 2018]
G Varoquaux 24
Evaluation error: Sampling noise on test set
Evaluation quality is limited by number of test examples
Sampling noise1 for ntest = 1 000:
[Histogram: binomial distribution of the error on test accuracy, spanning roughly −2% to +2%]
Confidence intervals:   ntest = 1 000 → interval 5.7%;   ntest = 10 000 → 1.8%;   ntest = 100 000 → 0.6%
1The data at hand (eg the test set) is just a small sample of the full population “in the wild”, and
sampling other data will lead to other results.
G Varoquaux 25
[Varoquaux 2018]
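A minimal sketch reproducing the orders of magnitude above, assuming a true accuracy of 0.7 (a toy value) and a binomial model of the number of correct test predictions:

```python
# Minimal sketch: width of the 95% binomial interval on test accuracy as a
# function of the test-set size, assuming a true accuracy of 0.7 (toy value).
from scipy.stats import binom

acc = 0.7
for n_test in (1_000, 10_000, 100_000):
    lo, hi = binom.interval(0.95, n_test, acc)  # interval on #correct predictions
    print(f"n_test={n_test}: interval width ~ {(hi - lo) / n_test:.1%}")
```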
Evaluation error: Sampling noise on test set
[Plot: standard deviation of test accuracy (% acc) vs test set size (10ÂČ to 10⁶), relative to the population “in the wild”. In theory: from a binomial — Binom(n', 0.66), Binom(n', 0.91), Binom(n', 0.95). In practice: random splits — Glue-RTE BERT (n'=277), Glue-SST2 BERT (n'=872), CIFAR10 VGG11 (n'=10000)]
G Varoquaux 26
[Bouthillier... 2021]
Evaluation noise is not negligible – in Kaggle competitions
- Lung cancer classification — test size: max 1K — smaller improvements than noise (score range ±0.75): diminishing returns
- Schizophrenia classification — test size: 120 (score range ±0.2): diminishing returns
- Lung tumor segmentation — test size: max 6k — poorer score on private set (score range ±0.05): overfit
- Nerve segmentation — test size: 5.5K (score range ±0.04)
[Plot markers: improvement of top model on 10% best submissions, winner gap, evaluation noise between public and private sets, actual improvement]
G Varoquaux 27
[Varoquaux and Cheplygina 2022]
Confidence intervals & statistical testing
Evaluate uncertainty to know when to stop, what to trust
(diminishing returns, creeping complexity)
Confidence interval1: Range of values compatible with the
observations
Null hypothesis testing: p-value: the chance to observe the
results if a null hypothesis were true
Typical null hypothesis, for model comparison:
models p1 and p2 give the same expected error
1Technically, we do not distinguish this from a credible interval
G Varoquaux 28
Statistical testing with a given test set
Simple distribution of metrics,
eg accuracy1: binomial
Confidence intervals:
N 65% 80% 90% 95%
100 [-9.0% 9.0%] [-8.0% 8.0%] [-6.0% 5.0%] [-5.0% 4.0%]
1000 [-3.0% 2.9%] [-2.5% 2.4%] [-1.9% 1.8%] [-1.4% 1.3%]
10000 [-0.9% 0.9%] [-0.8% 0.8%] [-0.6% 0.6%] [-0.4% 0.4%]
100000 [-0.3% 0.3%] [-0.2% 0.2%] [-0.2% 0.2%] [-0.1% 0.1%]
Table from [Varoquaux and Colliot 2022]
1Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
G Varoquaux 29
Statistical testing with a given test set
Simple distribution of metrics,
eg accuracy1: binomial
Non-parametric:
- Confidence intervals of performance of a prediction rule
Bootstrap test set
- Statistical testing (compare prediction rules with correlated errors)
Permutations [Bandos... 2005]
Sample null by randomly switching predictions from p1 and p2.
1Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
G Varoquaux 29
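A minimal sketch with synthetic predictions of both non-parametric procedures: bootstrapping the test set for a confidence interval, and a permutation test that randomly swaps the two rules' predictions per sample:

```python
# Minimal sketch (synthetic predictions): bootstrap CI on a fixed test set, and
# a permutation test comparing two prediction rules with paired errors.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
y_test = rng.integers(0, 2, n)                                # toy ground truth
pred_1 = np.where(rng.random(n) < 0.85, y_test, 1 - y_test)   # ~85% accurate
pred_2 = np.where(rng.random(n) < 0.83, y_test, 1 - y_test)   # ~83% accurate

# (1) Bootstrap the test set: confidence interval on the accuracy of pred_1
accs = [np.mean((pred_1 == y_test)[rng.integers(0, n, n)]) for _ in range(1000)]
print("95% CI:", np.percentile(accs, [2.5, 97.5]))

# (2) Permutation test: randomly swap the two rules' predictions per sample
observed = np.mean(pred_1 == y_test) - np.mean(pred_2 == y_test)
diffs = []
for _ in range(1000):
    swap = rng.random(n) < 0.5
    p1 = np.where(swap, pred_2, pred_1)
    p2 = np.where(swap, pred_1, pred_2)
    diffs.append(np.mean(p1 == y_test) - np.mean(p2 == y_test))
print("p-value:", np.mean(np.abs(diffs) >= abs(observed)))
```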
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
Benchmarking to conclude on good training procedures
We are given: a procedure that outputs a prediction rule f̂
from training data (X, y) ∈ (X × Y)n
We want machine-learning research claims
(novel frobnicate improves prediction)
Many arbitrary components
torch.manual_seed(3407)?? [Picard 2021]
Useless to tune random seeds
(for weights init, dropout, data augmentation)
will not carry over to new training data
G Varoquaux 31
Arbitrary variance in a machine-learning benchmark
Variance when rerunning an evaluation,
modifying arbitrary elements:
[Plot: relative scale of the variance due to numerical noise, dropout, weights init, data order, data augmentation, data (bootstrap), noisy grid search, random search, and Bayesian optimization — for bert-rte, bert-sst2, a segmentation task, vgg, and their average]
Across various computer vision and NLP tasks
G Varoquaux 32
[Bouthillier... 2021]
Uncertainty due to test set sampling
The test set remains a limited sample
of the population
The train-test split is an arbitrary choice
[Plot: same variance sources as on the previous slide, highlighting the variance due to data (bootstrap)]
G Varoquaux 33
[Bouthillier... 2021]
Better evaluation
Sample multiple times these arbitrary choices [Bouthillier... 2021]
Better evaluation procedure
Cross-validation
multiple train-test splits
Randomize all arbitrary factors in folds
G Varoquaux 34
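A minimal sketch of such a procedure (synthetic data, a random forest chosen only for illustration): several random train-test splits instead of a single one, reporting the spread across splits:

```python
# Minimal sketch (synthetic data): multiple random train-test splits instead of
# a single split, reporting the spread of scores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} across splits")
```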
Better evaluation procedure
Cross-validation
multiple train-test splits
Randomize all arbitrary factors in folds
But, a full learning pipeline comes with hyper-parameters
These are set to minimize error
on left-out data
[Diagram: full data split into train, validation, and test sets — the validation set is used to make model choices, the test set to evaluate the model]
G Varoquaux 34
Benchmarks accounting for hyper-parameter selection
Setting hyper-parameters is part of the pipeline
It must be evaluated
One option:
train, validation, and test splits
+ manual tuning of hyperparameters
[Diagram: full data split into train, validation, and test sets — validation for model choices, test for evaluation]
Rampant overfit of validation set
[Makarova... 2021]
Multiple splits
+ automated tuning of hyperparameters
[Diagram: nested cross-validation — outer loop of train/test splits for evaluation, inner validation splits for hyper-parameter tuning]
G Varoquaux 35
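A minimal sketch of this nested scheme with scikit-learn (synthetic data, an SVM and a small grid chosen only for illustration): hyper-parameter tuning happens inside each outer training fold, so the outer test folds never guide model choices:

```python
# Minimal sketch (synthetic data): nested cross-validation, with hyper-parameter
# tuning inside each outer training fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Inner loop: model choices (hyper-parameters) on validation splits
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2]})
# Outer loop: evaluation on test folds never seen by the tuning
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```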
Hyper-parameter optimization procedures
Grid search
Vary each hyper-parameter on a grid
G Varoquaux 36
Hyper-parameter optimization procedures
Grid search
Random search
[Bergstra and Bengio 2012]
preferable to grid search for more than 2 parameters
Sub-optimal hyper-parameters on models routinely lead to invalid conclusions
See refs in [Bouthillier... 2021]
[Diagram: grid search vs randomized search over a region of good hyperparameters, with one important and one unimportant hyperparameter — random search covers the important one more densely]
G Varoquaux 36
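A minimal sketch of random search with scikit-learn's RandomizedSearchCV (synthetic data; the SVM and the log-uniform ranges are illustrative choices only):

```python
# Minimal sketch (synthetic data): random search over hyper-parameters.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=30, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```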
Benchmarking training procedures (eg to compare them)
Control arbitrary fluctuations (that will not generalize)
Sample all:
- data sampling: multiple train-test splits (cross-validation)
- arbitrary choices (seeds): randomize them all
- hyper-parameters: hyper-parameter optimization (too expensive to randomize)
In practice: set hyper-parameters once, then randomize model seeds and data splits [Bouthillier... 2021]
Counterintuitive: randomization decorrelates errors, improving benchmarks
G Varoquaux 37
Accounting for variance in conclusions
Confidence intervals & statistical testing
Statistical testing with multiple folds
Challenge: folds are not independent1
t-test / Wilcoxon across folds are not valid
Don’t divide std by number of folds
Approximate corrections for dependence across folds2
5x2cv: repeat 5 times randomized 2-fold [Dietterich 1998]
Corrected resampled t-test statistic [Nadeau and Bengio 2003]
1Train sets overlap, and often test sets also do.
2Does not account for sources of variance beyond data sampling, eg random seeds, hyper parameters.
G Varoquaux 38
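A minimal sketch of the corrected resampled t-test of [Nadeau and Bengio 2003]: the variance of the per-split score differences is inflated by n_test/n_train to account for overlapping training sets (the per-split differences below are toy numbers):

```python
# Minimal sketch of the corrected resampled t-test [Nadeau and Bengio 2003].
import numpy as np
from scipy import stats

def corrected_resampled_ttest(diffs, n_train, n_test):
    """diffs: per-split differences (score_1 - score_2) over k train-test splits."""
    k = len(diffs)
    mean, var = np.mean(diffs), np.var(diffs, ddof=1)
    # variance inflated by n_test / n_train to account for overlapping train sets
    t = mean / np.sqrt(var * (1 / k + n_test / n_train))
    p = 2 * stats.t.sf(np.abs(t), df=k - 1)
    return t, p

# toy per-split differences from 10 random 90%/10% splits of 1000 samples
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, -0.01, 0.02, 0.01, 0.02]
print(corrected_resampled_ttest(diffs, n_train=900, n_test=100))
```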
Statistical tests: across datasets
(more general claims on training pipelines)
Challenge:
metrics not comparable across datasets
⇒ Tests based on rank statistics
Wilcoxon signed rank test
Tests how often p1 outperforms p2 across datasets
G Varoquaux 39
[Demšar 2006]
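A minimal sketch (toy per-dataset scores): the Wilcoxon signed-rank test on the paired scores of two pipelines across datasets:

```python
# Minimal sketch (toy per-dataset scores): Wilcoxon signed-rank test on the
# paired scores of two pipelines across datasets.
import numpy as np
from scipy.stats import wilcoxon

scores_p1 = np.array([0.81, 0.74, 0.92, 0.68, 0.77, 0.85, 0.90, 0.73])
scores_p2 = np.array([0.79, 0.75, 0.90, 0.65, 0.74, 0.84, 0.88, 0.70])
stat, p_value = wilcoxon(scores_p1, scores_p2)  # rank-based, scale-free
print(p_value)
```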
Statistical tests: multiple pipelines across datasets
Challenge: multiple comparisons1
The Wilcoxon-Holm approach [Demšar 2006]
Pairwise comparisons + Bonferroni-Holm correction
The Friedman-Nemenyi approach2 [Demšar 2006]
1. Friedman test across all pipelines (omnibus)
2. Critical difference via Nemenyi test
[Critical-difference diagram: classifiers ordered by accuracy rank, eg clf3 (1.53), clf5 (2.00), clf4 (3.50), clf2 (3.77), clf1 (4.20)]
Replicability analysis [Dror... 2017]
Perform dataset-level pairwise tests
Combine by testing (partial conjunction multiple-testing):
“Does p1 perform better than p2 on at least u datasets?”
More powerful for a small number of datasets
1If we do many tests, some will show large differences by chance.
2The Holm approach can be more interesting when considering comparisons to a referent classifier.
G Varoquaux 40
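A minimal sketch of the omnibus-then-pairwise route with SciPy and statsmodels (toy scores; pairwise Wilcoxon tests with a Holm correction stand in for the Nemenyi post-hoc test):

```python
# Minimal sketch (toy scores): Friedman omnibus test across pipelines, then
# pairwise Wilcoxon tests with a Holm correction (in place of the Nemenyi test).
import itertools
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# rows: datasets, columns: pipelines
scores = np.array([[0.80, 0.78, 0.83],
                   [0.71, 0.69, 0.74],
                   [0.90, 0.91, 0.93],
                   [0.65, 0.66, 0.70],
                   [0.77, 0.75, 0.80]])

print("Friedman:", friedmanchisquare(*scores.T))

pairs = list(itertools.combinations(range(scores.shape[1]), 2))
p_values = [wilcoxon(scores[:, i], scores[:, j]).pvalue for i, j in pairs]
reject, p_holm, _, _ = multipletests(p_values, method="holm")
print(list(zip(pairs, p_holm, reject)))
```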
Statistical tests: beyond null-hypothesis testing
Sample size is a problem
Across datasets: significance typically requires > 15 datasets
In a dataset (repeating folds, seeds...):
many repetitions makes any difference significant1
Underpowered experiments bring no evidence
Shortcomings of null-hypothesis testing
Significance decreases with more comparisons2
Statistical significance ≠ practical significance
1Though as the total test-set size is limited, they do not bring more evidence for generalization.
2FDR (False Discovery Rate) attempts to solve this.
G Varoquaux 41
[Demšar 2008]
Statistical tests: accounting for effect sizes
Neyman-Pearson view of hypothesis testing
Two hypotheses, H0 and H1
H1: p1 outperforms p2 by a margin1
Which is most likely?
[Diagram: score distributions under H0 and H1]
Requires the choice of the margin
Related to superiority testing in clinical trials
[Lesaffre 2008]
1Related to the rejection region in the Neyman-Pearson lemma.
G Varoquaux 42
Pragmatic compromises
Test whether P(p1 > p2) > ÎŽ
ÎŽ > 0.5: Neyman-Pearson view
Evaluate P(p1 > p2) by resampling
Randomize everything: data splits, seeds, ...
Gaussian approximation: amounts to comparing differences to standard deviations
Not an inference on the expected difference in performance1
1Unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings.
G Varoquaux 43
[Bouthillier... 2021]
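A minimal sketch (toy scores; the all-pairs comparison below is one reasonable choice, not necessarily the one used in [Bouthillier... 2021]) estimating P(p1 > p2) from scores collected over resampled splits and seeds, and comparing it to a margin ÎŽ:

```python
# Minimal sketch (toy scores): estimate P(p1 > p2) from scores collected over
# resampled splits/seeds, and compare it to a margin delta.
import numpy as np

rng = np.random.default_rng(0)
scores_p1 = 0.80 + 0.02 * rng.standard_normal(50)  # one score per rerun
scores_p2 = 0.78 + 0.02 * rng.standard_normal(50)

p1_wins = np.mean(scores_p1[:, None] > scores_p2[None, :])  # over all pairings
delta = 0.75
print(f"P(p1 > p2) ~ {p1_wins:.2f}  (margin delta = {delta})")
```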
Summary – comparing learning procedures
Sample multiple sources of variance:
- data split
- random seeds
- hyper-parameter tuning
Null-hypothesis testing: no t-test on cross-validation!
Don’t mis-interpret p-value:
- Not significant: more data could change that
- Significant differences may be trivial
Detect practical differences:
delta in performance > standard deviation
G Varoquaux 44
Better experimental procedures
Crack the black box open
A prediction score is seldom insightful
Ablation studies: remove/change atomic elements
Learning curves
Better benchmarking in these
Tune hyper-parameters to the same quality
Randomize everything
Account for variance in conclusions
G Varoquaux 45
Summary – Better benchmarking procedures
Reminder: your validation measure is intrinsically unreliable (sampling noise)
An arbitrary choice (random seed) may give seemingly good results that do not generalize
Optimizing test accuracy will explore the tails
Evaluation is a bottleneck
[Plot: confidence interval on accuracy vs number of available samples (from a few hundred down to 30), widening from roughly ±2% to −15%/+12% as the sample shrinks]
G Varoquaux 46
Model evaluation
Choice of performance metrics
Should be suited to the application setting
Machine learning does metric chasing
Evaluation procedures
Account for variance
Different sources of variance for applying prediction rules vs
learning them
Careful benchmarking is crucial
Optimistic flukes will not generalize
What is our purpose? External validity
G Varoquaux 47
@GaelVaroquaux
References I
A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive to differences in
areas for comparing roc curves from a paired design. Statistics in medicine, 24:2873,
2005.
J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of
Machine Learning Research, 13:281, 2012.
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto,
N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ... Accounting for variance in
machine learning benchmarks. Proceedings of Machine Learning and Systems, 3, 2021.
D. Chyzhyk, G. Varoquaux, B. Thirion, and M. Milham. Controlling a confound in predictive
models with a test set minimizing its effect. In 2018 International Workshop on
Pattern Recognition in Neuroimaging (PRNI), pages 1–4. IEEE, 2018.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of
Machine Learning Research, 7:1–30, 2006.
J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop
on Evaluation Methods for Machine Learning in conjunction with ICML, page 65.
Citeseer, 2008.
G Varoquaux 48
References II
T. G. Dietterich. Approximate statistical tests for comparing supervised classification
learning algorithms. Neural computation, 10(7):1895–1923, 1998.
R. Dror, G. Baumer, M. Bogomolov, and R. Reichart. Replicability analysis for natural language
processing: Testing significance with multiple datasets. Transactions of the
Association for Computational Linguistics, 2017.
E. Lesaffre. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU
hospital for joint diseases, 66(2), 2008.
M. A. Little, G. Varoquaux, S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, and K. P. Kording.
Using and understanding cross-validation strategies. Perspectives on Saeb et al.
GigaScience, 6(5):1–6, 2017.
L. Maier-Hein, A. Reinke, E. Christodoulou, B. Glocker, P. Godau, F. Isensee, J. Kleesiek,
M. Kozubek, M. Reyes, M. A. Riegler, ... Metrics reloaded: Pitfalls and recommendations
for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
G Varoquaux 49
References III
A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause, M. Seeger, and
C. Archambeau. Overfitting in bayesian optimization: an empirical study and
early-stopping solution. arXiv preprint arXiv:2104.08166, 2021.
C. Nadeau and Y. Bengio. Inference for the generalization error. Machine learning, 52(3):
239–281, 2003.
D. Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds
in deep learning architectures for computer vision. arXiv:2109.08203, 2021.
N. Traut, K. Heuer, G. LemaĂźtre, A. Beggiato, D. Germanaud, M. Elmaleh, A. Bethegnies,
L. Bonnasse-Gahot, W. Cai, S. Chambon, ... Insights from an autism imaging biomarker
challenge: promises and threats to biomarker discovery. NeuroImage, 255:119171, 2022.
G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars.
Neuroimage, 180:68–77, 2018.
G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological
failures and recommendations for the future. NPJ digital medicine, 5(1):1–8, 2022.
G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic
value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.
G Varoquaux 50

More Related Content

What's hot

Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixAndrew Ferlitsch
 
Machine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityPier Luca Lanzi
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionalityNikhil Sharma
 
07 regularization
07 regularization07 regularization
07 regularizationRonald Teo
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade offVARUN KUMAR
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
An Introduction to Anomaly Detection
An Introduction to Anomaly DetectionAn Introduction to Anomaly Detection
An Introduction to Anomaly DetectionKenneth Graham
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine LearningKnoldus Inc.
 
Support vector machine
Support vector machineSupport vector machine
Support vector machineMusa Hawamdah
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningShubhmay Potdar
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 
Naive Bayes
Naive BayesNaive Bayes
Naive BayesCloudxLab
 
Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜
Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜
Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜AI Robotics KR
 
Lecture10 - NaĂŻve Bayes
Lecture10 - NaĂŻve BayesLecture10 - NaĂŻve Bayes
Lecture10 - NaĂŻve BayesAlbert Orriols-Puig
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Simplilearn
 
(Radhika) presentation on chapter 2 ai
(Radhika) presentation on chapter 2 ai(Radhika) presentation on chapter 2 ai
(Radhika) presentation on chapter 2 aiRadhika Srinivasan
 
Evaluating classification algorithms
Evaluating classification algorithmsEvaluating classification algorithms
Evaluating classification algorithmsUtkarsh Sharma
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsKush Kulshrestha
 

What's hot (20)

Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
 
Machine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and Credibility
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionality
 
07 regularization
07 regularization07 regularization
07 regularization
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
An Introduction to Anomaly Detection
An Introduction to Anomaly DetectionAn Introduction to Anomaly Detection
An Introduction to Anomaly Detection
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine Learning
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜
Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜
Bayesian Inference : Kalman filter 에서 Optimization êčŒì§€ - êč€í™ë°° ë°•ì‚Źë‹˜
 
Lecture10 - NaĂŻve Bayes
Lecture10 - NaĂŻve BayesLecture10 - NaĂŻve Bayes
Lecture10 - NaĂŻve Bayes
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
 
(Radhika) presentation on chapter 2 ai
(Radhika) presentation on chapter 2 ai(Radhika) presentation on chapter 2 ai
(Radhika) presentation on chapter 2 ai
 
Evaluating classification algorithms
Evaluating classification algorithmsEvaluating classification algorithms
Evaluating classification algorithms
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 

Similar to Evaluating machine learning models and their diagnostic value

Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
Errors2
Errors2Errors2
Errors2sjsuchaya
 
13ClassifierPerformance.pdf
13ClassifierPerformance.pdf13ClassifierPerformance.pdf
13ClassifierPerformance.pdfssuserdce5c21
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learnedweka Content
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedDataminingTools Inc
 
EvaluationMetrics.pptx
EvaluationMetrics.pptxEvaluationMetrics.pptx
EvaluationMetrics.pptxshuchismitjha2
 
Download the presentation
Download the presentationDownload the presentation
Download the presentationbutest
 
Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres Hernandez
 
An infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docxAn infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docxgreg1eden90113
 
How to Measure Uncertainty
How to Measure UncertaintyHow to Measure Uncertainty
How to Measure UncertaintyRandox
 
Common mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculationsCommon mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculationsGH Yeoh
 
7. evaluation of diagnostic test
7. evaluation of diagnostic test7. evaluation of diagnostic test
7. evaluation of diagnostic testAshok Kulkarni
 
QCP user manual EN.pdf
QCP user manual EN.pdfQCP user manual EN.pdf
QCP user manual EN.pdfEmerson Ceras
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validationStéphane Canu
 
2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methodsKrish_ver2
 
Screening
ScreeningScreening
Screeningemphemory
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IMachine Learning Valencia
 

Similar to Evaluating machine learning models and their diagnostic value (20)

Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Errors2
Errors2Errors2
Errors2
 
Inferences about Two Proportions
 Inferences about Two Proportions Inferences about Two Proportions
Inferences about Two Proportions
 
13ClassifierPerformance.pdf
13ClassifierPerformance.pdf13ClassifierPerformance.pdf
13ClassifierPerformance.pdf
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learned
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been Learned
 
EvaluationMetrics.pptx
EvaluationMetrics.pptxEvaluationMetrics.pptx
EvaluationMetrics.pptx
 
Download the presentation
Download the presentationDownload the presentation
Download the presentation
 
Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017
 
An infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docxAn infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docx
 
Testing a claim about a mean
Testing a claim about a mean  Testing a claim about a mean
Testing a claim about a mean
 
evaluation and credibility-Part 1
evaluation and credibility-Part 1evaluation and credibility-Part 1
evaluation and credibility-Part 1
 
How to Measure Uncertainty
How to Measure UncertaintyHow to Measure Uncertainty
How to Measure Uncertainty
 
Common mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculationsCommon mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculations
 
7. evaluation of diagnostic test
7. evaluation of diagnostic test7. evaluation of diagnostic test
7. evaluation of diagnostic test
 
QCP user manual EN.pdf
QCP user manual EN.pdfQCP user manual EN.pdf
QCP user manual EN.pdf
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validation
 
2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods
 
Screening
ScreeningScreening
Screening
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 

More from Gael Varoquaux

Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingGael Varoquaux
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Gael Varoquaux
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Gael Varoquaux
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingGael Varoquaux
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesGael Varoquaux
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomesGael Varoquaux
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingGael Varoquaux
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Gael Varoquaux
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingGael Varoquaux
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingGael Varoquaux
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible scienceGael Varoquaux
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovationGael Varoquaux
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data scienceGael Varoquaux
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataGael Varoquaux
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsityGael Varoquaux
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsGael Varoquaux
 

More from Gael Varoquaux (20)

Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imaging
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mapping
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible science
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data science
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questions
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsity
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizations
 

Recently uploaded

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts

Evaluating machine learning models and their diagnostic value

  • 13. Metrics on positive and negative classes (T denotes the test, i.e. classifier output; D denotes disease status)
    - Sensitivity (also called recall): fraction of positive samples actually retrieved.
      Sensitivity = TP / (TP + FN); estimates P(T+ | D+)
    - Specificity: fraction of negative samples actually classified as negative.
      Specificity = TN / (TN + FP); estimates P(T− | D−)
    - Positive predictive value (PPV, also called precision): fraction of the positively classified samples which are indeed positive.
      PPV = TP / (TP + FP); estimates P(D+ | T+)
    - Negative predictive value (NPV): fraction of the negatively classified samples which are indeed negative.
      NPV = TN / (TN + FN); estimates P(D− | T−)
    [Varoquaux and Colliot 2022]
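    To make these definitions concrete, a minimal sketch of computing the four quantities from a confusion matrix with scikit-learn; the label arrays are made-up toy values, not from the slides.

        import numpy as np
        from sklearn.metrics import confusion_matrix

        y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])  # D+/D-: true disease status
        y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])  # T+/T-: classifier output

        # scikit-learn orders the 2x2 confusion matrix as [[TN, FP], [FN, TP]]
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        sensitivity = tp / (tp + fn)  # estimates P(T+ | D+), a.k.a. recall
        specificity = tn / (tn + fp)  # estimates P(T- | D-)
        ppv = tp / (tp + fp)          # estimates P(D+ | T+), a.k.a. precision
        npv = tn / (tn + fn)          # estimates P(D- | T-)
        print(sensitivity, specificity, ppv, npv)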
  • 14. 1 Performance metrics Basic classification metrics Evaluating imbalanced binary classification Multi-threshold metrics for classification Confidence scores and calibration
  • 15. Accuracy, sensitivity, specificity
    Confusion matrix: truth (disease status) D+ / D− against prediction (test result) T+ / T−, giving TP, FP, FN, TN.
    - Accuracy is uninformative under class imbalance: with 90% of class 0, predicting only class 0 gives Acc = 90%.
    - Sensitivity (also called recall): fraction of class 1 retrieved, TP / (TP + FN); estimates P(T+ | D+).
    - Specificity: fraction of class 0 actually classified as 0, TN / (TN + FP); estimates P(T− | D−).
    - Balanced accuracy: (sensitivity + specificity) / 2, balancing errors on class 0 and class 1.
  • 16. Asking the right question: P(T+|D+) vs P(D+|T+)
    Positive predictive value (via Bayes' theorem):
      P(D+ | T+) = (sensitivity × prevalence) / [(1 − specificity) × (1 − prevalence) + sensitivity × prevalence]
    Summary metric: markedness = PPV + NPV − 1.
    Drawback: these quantities depend on prevalence, so they characterize not only the classifier but also the dataset.
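    A small worked example of this formula; the sensitivity, specificity, and prevalence values are arbitrary, chosen only to show how strongly the PPV depends on prevalence.

        def ppv_from_bayes(sensitivity, specificity, prevalence):
            """Positive predictive value P(D+ | T+) via Bayes' theorem."""
            true_pos = sensitivity * prevalence
            false_pos = (1 - specificity) * (1 - prevalence)
            return true_pos / (true_pos + false_pos)

        # A 90%-sensitive, 90%-specific test on a population with 1% prevalence:
        print(ppv_from_bayes(0.9, 0.9, 0.01))  # ~0.08: most positive tests are false alarms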
  • 17. Odds ratios for invariance to sampling
    Definition: odds of a, O(a) = P(a) / (1 − P(a)).
    Likelihood ratio of the positive class: LR+ = O(D+ | T+) / O(D+) = sensitivity / (1 − specificity).
    LR+ is independent of class prevalence (proof in Varoquaux and Colliot, "Evaluating machine learning models and their diagnostic value").
    Use the prevalence of the target population to compute O(D+ | T+): useful to extrapolate from a case-control study to the target population.
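    A sketch of how the likelihood ratio lets one carry a classifier evaluated on a case-control sample over to a target population; the helper names and example numbers are illustrative, not from the slides.

        def lr_positive(sensitivity, specificity):
            """LR+: prevalence-free summary of the classifier."""
            return sensitivity / (1 - specificity)

        def post_test_probability(sensitivity, specificity, prevalence):
            """P(D+ | T+) on a population with the given prevalence."""
            pre_test_odds = prevalence / (1 - prevalence)
            post_test_odds = lr_positive(sensitivity, specificity) * pre_test_odds
            return post_test_odds / (1 + post_test_odds)  # odds -> probability

        print(lr_positive(0.9, 0.9))                  # LR+ = 9, whatever the prevalence
        print(post_test_probability(0.9, 0.9, 0.01))  # ~0.08 on a 1%-prevalence population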
  • 18. 1 Performance metrics Basic classification metrics Evaluating imbalanced binary classification Multi-threshold metrics for classification Confidence scores and calibration
  • 19. Multi-threshold metrics: ROC curve
    [Figure: ROC curves, true positive rate (sensitivity, or recall) vs false positive rate; example classifiers from excellent (AUC=0.95) and good (AUC=0.88) to poor (AUC=0.75) and chance (AUC=0.50).]
    Classifiers output continuous scores; the ROC curve is obtained by varying the decision threshold.
    Summary metric: AUC (Area Under the Curve); chance level = 0.5.
  • 20. Multi-threshold metrics: precision-recall curve
    [Figure: precision (PPV) vs recall (TPR, or sensitivity); example classifiers from excellent (AUC=0.96) and good (AUC=0.92) to poor (AUC=0.78) and chance (AUC=0.57).]
    Better dynamic range than the ROC curve for imbalanced classes.
    Area under the curve: the chance level depends on class imbalance.
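    A minimal sketch of both summary metrics on a synthetic imbalanced problem; the dataset, the 90/10 class ratio, and the logistic-regression model are assumptions for illustration.

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import average_precision_score, roc_auc_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        scores = model.predict_proba(X_test)[:, 1]  # continuous confidence scores
        print("ROC AUC:", roc_auc_score(y_test, scores))            # chance level = 0.5
        print("PR AUC: ", average_precision_score(y_test, scores))  # chance level = prevalence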
  • 21. 1 Performance metrics Basic classification metrics Evaluating imbalanced binary classification Multi-threshold metrics for classification Confidence scores and calibration
  • 22. Interpreting the classifier score as a probability? – Calibration
    Calibration: the average error rate over all samples with confidence score s is s, computed in bins on the score s.
    ECE (expected calibration error): average error over all bins.
    [Figure: calibration plot, observed fraction of positives vs mean confidence score, showing under-confident and over-confident regions, the perfectly calibrated diagonal, and a GaussianNB example.]
    Caveat: calibration does not control individual probabilities.
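    A sketch of a calibration check: scikit-learn's calibration_curve gives the coordinates of the calibration plot, and a simple equal-width-bin ECE can be computed by hand. The data and model below are the same illustrative setup as in the previous snippet, not the one used in the slides.

        import numpy as np
        from sklearn.calibration import calibration_curve
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=5000, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

        # Coordinates of the calibration plot: observed fraction of positives per score bin
        prob_true, prob_pred = calibration_curve(y_test, scores, n_bins=10)

        def expected_calibration_error(y_true, y_score, n_bins=10):
            """ECE with equal-width bins: bin-size-weighted |accuracy - confidence| gap."""
            bin_ids = np.minimum((y_score * n_bins).astype(int), n_bins - 1)
            ece = 0.0
            for b in range(n_bins):
                in_bin = bin_ids == b
                if in_bin.any():
                    ece += in_bin.mean() * abs(y_true[in_bin].mean() - y_score[in_bin].mean())
            return ece

        print(expected_calibration_error(y_test, scores))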
  • 23. Classifier score as a probability? – Individual probabilities
    Does the classifier approach P(y|X)?
    Metric: Brier score = ÎŁ_i (ŝ_i − y_i)², where ŝ_i is the confidence score and y_i the observed (binary) label; minimal for ŝ = P(y|X).
    Scaled version: Brier skill score = 1 − Brier(ŝ, y) / Brier(ȳ, y), where ȳ is the class prevalence.
    1 = perfect prediction; 0 = scores not more informative than the class prevalence.
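    A short sketch of the Brier score and its skill-score rescaling; the label and score arrays are made-up values, and the reference model simply predicts the class prevalence.

        import numpy as np
        from sklearn.metrics import brier_score_loss

        y_test = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])                      # observed labels
        scores = np.array([0.1, 0.2, 0.8, 0.3, 0.6, 0.1, 0.4, 0.9, 0.2, 0.1])  # confidence scores

        brier = brier_score_loss(y_test, scores)
        baseline = np.full_like(scores, y_test.mean())  # always predict the prevalence
        brier_skill = 1 - brier / brier_score_loss(y_test, baseline)
        print(brier, brier_skill)  # skill: 1 = perfect, 0 = no better than the prevalence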
  • 24. Choose your metrics well
    Machine learning research chases metrics; these should reflect the application as well as possible.
    Worry about P(D+ | T+):
    - accuracy is a reasonable proxy only for balanced classes
    - LR+ is interesting to keep in mind
    Worry about uncertainty:
    - calibration quantifies average errors
    - the Brier score controls individual uncertainty
    A single number does not tell the whole story.
  • 25. 2 Evaluation procedures
    Controlled estimation of e = E_{x,y ~ D}[ l(f(x), y) ].
    Corresponding research paper: [Bouthillier... 2021]
  • 26. Definitions: what are we benchmarking?
    Scenario 1: a prediction rule. We are given f : X → Y.
    Scenario 2: a training procedure. We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)^n.
  • 27. Definitions: what are we benchmarking?
    Scenario 1: a prediction rule. We are given f : X → Y. For application claims, e.g. medicine.
    Scenario 2: a training procedure. We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)^n. For machine-learning research (claims on algorithms).
  • 28. Model evaluation 101
    In-sample error is not meaningful: if the data (X, y) used to fit f̂ are also used to evaluate it, the expected error is estimated wrong.
  • 29. Model evaluation 101
    In-sample error is not meaningful: evaluate on data not used for training.
    Typically: split the full data into a train set and a test set (the test set is typically 10%), trading off better learning vs better estimation.
  • 30. 2 Evaluation procedures Evaluating a prediction rule Evaluating a training procedure
  • 31. External validation: before putting in production
    We are given f : X → Y.
    Xtest must be different enough from Xtrain:
    - no repeated acquisition of the same individual in train and test [Little... 2017]
    - ideally, show generalization to a new site, later in time, ...
    Xtest must be representative of the target population. Sample Xtest:
    - to match statistical moments
    - to minimize a confounding association (shortcuts) [Chyzhyk... 2018]
  • 32. Evaluation error: sampling noise on the test set
    Evaluation quality is limited by the number of test examples.
    [Figure: binomial distribution of the error on test accuracy for ntest = 1000, roughly ±2%.]
    Confidence intervals: ntest = 1 000, interval 5.7%; ntest = 10 000, interval 1.8%; ntest = 100 000, interval 0.6%.
    Sampling noise: the data at hand (e.g. the test set) is just a small sample of the full population "in the wild", and sampling other data will lead to other results.
    [Varoquaux 2018]
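    Interval widths of this order can be reproduced, approximately, from the binomial distribution; a sketch, where the observed accuracy of 0.7 is an arbitrary example.

        from scipy.stats import binom

        accuracy, confidence = 0.7, 0.95
        for n_test in (1_000, 10_000, 100_000):
            lo, hi = binom.interval(confidence, n_test, accuracy)  # bounds on #correct predictions
            print(f"n_test={n_test:>7}: 95% interval width = {(hi - lo) / n_test:.1%}")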
  • 33. Evaluation error: sampling noise on the test set "in the wild"
    [Figure: standard deviation of accuracy (%) vs test-set size; in theory, from a binomial distribution; in practice, over random splits, for Glue-RTE BERT (n'=277), Glue-SST2 BERT (n'=872), and CIFAR10 VGG11 (n'=10000).]
    [Bouthillier... 2021]
  • 34. Evaluation noise is not negligible – in Kaggle competitions
    [Figure: improvement of the top model over the 10% best, winner gap, and evaluation noise between public and private sets, for four competitions.]
    - Lung cancer classification (test size: max 1K): improvements smaller than the noise, diminishing returns.
    - Schizophrenia classification (test size: 120): diminishing returns.
    - Lung tumor segmentation (test size: max 6K): poorer score on the private set, overfit.
    - Nerve segmentation (test size: 5.5K): actual improvement.
    [Varoquaux and Cheplygina 2022]
  • 35. Confidence intervals & statistical testing
    Evaluate uncertainty to know when to stop and what to trust (diminishing returns, creeping complexity).
    Confidence interval: range of values compatible with the observations (technically, not distinguishing it from a credible interval).
    Null hypothesis testing: the p-value is the chance of observing the results if a null hypothesis were true. Typical null for model comparison: models p1 and p2 give the same expected error.
  • 36. Statistical testing with a given test set
    Simple distribution of metrics, e.g. accuracy (also for sensitivity, NPV, PPV): binomial.
    Confidence intervals (table from [Varoquaux and Colliot 2022]):
      N        65%            80%            90%            95%
      100      [-9.0% 9.0%]   [-8.0% 8.0%]   [-6.0% 5.0%]   [-5.0% 4.0%]
      1000     [-3.0% 2.9%]   [-2.5% 2.4%]   [-1.9% 1.8%]   [-1.4% 1.3%]
      10000    [-0.9% 0.9%]   [-0.8% 0.8%]   [-0.6% 0.6%]   [-0.4% 0.4%]
      100000   [-0.3% 0.3%]   [-0.2% 0.2%]   [-0.2% 0.2%]   [-0.1% 0.1%]
  • 37. Statistical testing with a given test set
    Simple distribution of metrics, e.g. accuracy (also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]): binomial.
    Non-parametric alternatives:
    - confidence intervals on the performance of a prediction rule: bootstrap the test set
    - statistical testing (comparing prediction rules with correlated errors): permutations [Bandos... 2005], sampling the null by randomly switching predictions between p1 and p2
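    A sketch of both non-parametric recipes; the function and variable names are my own, the inputs are assumed to be NumPy arrays, and `metric` would be e.g. sklearn.metrics.accuracy_score.

        import numpy as np

        rng = np.random.default_rng(0)

        def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, confidence=0.95):
            """Confidence interval of a test-set metric by resampling the test set."""
            n = len(y_true)
            stats = [metric(y_true[idx], y_pred[idx])
                     for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
            return np.quantile(stats, [(1 - confidence) / 2, (1 + confidence) / 2])

        def permutation_pvalue(y_true, pred_1, pred_2, metric, n_perm=1000):
            """Two-sided p-value: sample the null by randomly swapping the two predictions."""
            observed = metric(y_true, pred_1) - metric(y_true, pred_2)
            null = []
            for _ in range(n_perm):
                swap = rng.random(len(y_true)) < 0.5
                null.append(metric(y_true, np.where(swap, pred_2, pred_1))
                            - metric(y_true, np.where(swap, pred_1, pred_2)))
            return np.mean(np.abs(null) >= abs(observed))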
  • 38. 2 Evaluation procedures Evaluating a prediction rule Evaluating a training procedure
  • 39. Benchmarking to conclude on good training procedures
    We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)^n.
    We want machine-learning research claims ("novel frobnicate improves prediction").
    Many arbitrary components: torch.manual_seed(3407)?? [Picard 2021]
    It is useless to tune random seeds (for weight initialization, dropout, data augmentation): the choice will not carry over to new training data.
  • 40. Arbitrary variance in a machine-learning benchmark
    Variance when rerunning an evaluation while modifying arbitrary elements: numerical noise, dropout, weight initialization, data order, data augmentation, data (bootstrap), noisy grid search, random search, Bayesian optimization.
    [Figure: relative scale of these variance sources for bert-rte, bert-sst2, vgg, and on average, across various computer vision and NLP tasks.]
    [Bouthillier... 2021]
  • 41. Uncertainty due to test set sampling
    The test set remains a limited sample of the population; the train-test split is an arbitrary choice.
    [Figure: same variance decomposition as in slide 40.]
    [Bouthillier... 2021]
  • 42. Uncertainty due to test set sampling
    The test set remains a limited sample of the population; the train-test split is an arbitrary choice.
    Better evaluation: sample these arbitrary choices multiple times.
    [Bouthillier... 2021]
  • 43. Better evaluation procedure
    Cross-validation: multiple train-test splits; randomize all arbitrary factors across folds.
  • 44. Better evaluation procedure
    Cross-validation: multiple train-test splits; randomize all arbitrary factors across folds.
    But a full learning pipeline comes with hyper-parameters, which are set to minimize error on left-out data: split the full data into train, validation (to make model choices), and test (to evaluate the model) sets.
  • 45. Benchmarks accounting for hyper-parameter selection
    Setting hyper-parameters is part of the pipeline, so it must be evaluated.
    One option: train, validation, and test splits plus manual tuning of hyper-parameters. Risk: rampant overfit of the validation set [Makarova... 2021].
  • 46. Benchmarks accounting for hyper-parameter selection
    Setting hyper-parameters is part of the pipeline, so it must be evaluated.
    One option: train, validation, and test splits plus manual tuning of hyper-parameters. Risk: rampant overfit of the validation set [Makarova... 2021].
    Better: multiple splits plus automated tuning of hyper-parameters, with the hyper-parameter tuning nested inside an outer evaluation loop.
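    A sketch of this nested scheme with scikit-learn: hyper-parameter tuning (here a small grid over the regularization strength, an arbitrary choice) happens inside each outer training fold, so the outer test folds stay untouched.

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import GridSearchCV, cross_val_score

        X, y = make_classification(n_samples=1000, random_state=0)

        inner = GridSearchCV(LogisticRegression(max_iter=1000),
                             param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
        outer_scores = cross_val_score(inner, X, y, cv=5)  # nested cross-validation
        print(outer_scores.mean(), outer_scores.std())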
  • 47. Hyper-parameter optimization procedures Grid search Vary each hyper-parameter on a grid G Varoquaux 36
  • 48. Hyper-parameter optimization procedures
    Grid search: vary each hyper-parameter on a grid.
    Random search [Bergstra and Bengio 2012]: preferable to grid search when more than ~2 parameters matter.
    Sub-optimal hyper-parameters on models routinely lead to invalid conclusions (see refs in [Bouthillier... 2021]).
    [Figure: grid search vs randomized search over two hyper-parameters (one important, one unimportant), with the region of good hyper-parameters highlighted.]
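    A sketch of random search with scikit-learn, drawing the regularization strength from a log-uniform distribution; the range and the number of draws are arbitrary choices for illustration.

        from scipy.stats import loguniform
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import RandomizedSearchCV

        search = RandomizedSearchCV(
            LogisticRegression(max_iter=1000),
            param_distributions={"C": loguniform(1e-3, 1e3)},
            n_iter=50, cv=5, random_state=0)
        # search.fit(X, y); search.best_params_ then holds the selected hyper-parameters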
  • 49. Benchmarking training procedures (e.g. to compare them)
    Control arbitrary fluctuations (they will not generalize). Sample them all:
    - data sampling: multiple train-test splits (cross-validation)
    - arbitrary choices (seeds): randomize them all
    - hyper-parameters: hyper-parameter optimization, too expensive to randomize
    In practice: set hyper-parameters once, then randomize model seeds and data splits [Bouthillier... 2021].
    Counterintuitive: randomization decorrelates errors, improving benchmarks.
  • 50. Accounting for variance in conclusions: confidence intervals & statistical testing
    Statistical testing with multiple folds. Challenge: folds are not independent (train sets overlap, and often test sets do too).
    - A t-test or Wilcoxon test across folds is not valid: don't divide the standard deviation by the number of folds.
    - Approximate corrections for the dependence across folds (they do not account for sources of variance beyond data sampling, e.g. random seeds or hyper-parameters):
      5x2cv, repeating a randomized 2-fold split 5 times [Dietterich 1998]; the corrected resampled t-test statistic [Nadeau and Bengio 2003].
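    A sketch of the corrected resampled t-test of Nadeau and Bengio as I understand it: the per-split variance is inflated by a factor accounting for the overlap between training sets; `diffs` would hold the per-split score differences between the two models.

        import numpy as np
        from scipy import stats

        def corrected_resampled_ttest(diffs, n_train, n_test):
            """Nadeau-Bengio correction for score differences over repeated train-test splits."""
            diffs = np.asarray(diffs, dtype=float)
            k = len(diffs)                              # number of splits
            mean, var = diffs.mean(), diffs.var(ddof=1)
            t = mean / np.sqrt((1 / k + n_test / n_train) * var)
            p = 2 * stats.t.sf(abs(t), df=k - 1)        # two-sided p-value
            return t, p

        # e.g. corrected_resampled_ttest(scores_a - scores_b, n_train=900, n_test=100)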
  • 51. Statistical tests across datasets (more general claims on training pipelines)
    Challenge: metrics are not comparable across datasets ⇒ use tests based on rank statistics.
    Wilcoxon signed-rank test: tests how often p1 outperforms p2 across datasets. [Demšar 2006]
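    A sketch with SciPy; the per-dataset scores of the two pipelines below are made-up numbers.

        import numpy as np
        from scipy.stats import wilcoxon

        scores_p1 = np.array([0.81, 0.74, 0.90, 0.68, 0.77, 0.83, 0.71])  # one score per dataset
        scores_p2 = np.array([0.79, 0.75, 0.86, 0.65, 0.74, 0.80, 0.70])
        print(wilcoxon(scores_p1, scores_p2))  # rank-based, so scales need not be comparable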
  • 52. Statistical tests: multiple pipelines across datasets
    Challenge: multiple comparisons (if we do many tests, some will show large differences by chance).
    - The Wilcoxon-Holm approach [Demšar 2006]: pairwise comparisons + Bonferroni-Holm correction; can be more interesting when comparing to a referent classifier.
    - The Friedman-Nemenyi approach [Demšar 2006]: 1) Friedman test across all pipelines (omnibus); 2) critical difference via the Nemenyi test.
      [Figure: critical difference diagram ranking classifiers by mean accuracy rank.]
    - Replicability analysis [Dror... 2017]: perform dataset-level pairwise tests, then combine them with partial-conjunction multiple testing: "Does p1 perform better than p2 on at least u datasets?" More powerful for a small number of datasets.
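    A sketch of the omnibus step with SciPy (made-up scores, one row per pipeline and one value per dataset); the post-hoc Nemenyi test is available in third-party packages such as scikit-posthocs, not shown here.

        from scipy.stats import friedmanchisquare

        clf1 = [0.81, 0.74, 0.90, 0.68, 0.77, 0.83, 0.71]
        clf2 = [0.79, 0.75, 0.86, 0.65, 0.74, 0.80, 0.70]
        clf3 = [0.85, 0.78, 0.91, 0.70, 0.80, 0.84, 0.75]
        print(friedmanchisquare(clf1, clf2, clf3))  # omnibus test across the three pipelines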
  • 53. Statistical tests: beyond null-hypothesis testing
    Sample size is a problem.
    - Across datasets: significance typically requires 15 datasets.
    - Within a dataset (repeating folds, seeds, ...): many repetitions make any difference significant, though as the total test-set size is limited they do not bring more evidence for generalization.
    Underpowered experiments bring no evidence.
    Shortcomings of null-hypothesis testing: significance decreases with more comparisons; statistical significance ≠ practical significance. [Demšar 2008]
  • 54. Statistical tests: beyond null-hypothesis testing
    As above; FDR (False Discovery Rate) procedures attempt to address the decrease of significance with more comparisons. [Demšar 2008]
  • 55. Statistical tests: accounting for effect sizes
    Neyman-Pearson view of hypothesis testing: two hypotheses, H0 and H1, where H1 is "p1 outperforms p2 by a margin" (related to the rejection region in the Neyman-Pearson lemma). Which is most likely?
    Requires choosing the margin; related to superiority testing in clinical trials [Lesaffre 2008].
  • 56. Pragmatic compromises
    Test whether P(p1 > p2) > ÎŽ, with ÎŽ > 0.5: a Neyman-Pearson view.
    Evaluate P(p1 > p2) by resampling, randomizing everything: data splits, seeds, ...
    Gaussian approximation: amounts to comparing differences to standard deviations.
    This is not an inference on the expected difference in performance (unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings). [Bouthillier... 2021]
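    A sketch of the resampling estimate of P(p1 > p2): rerun both pipelines under randomized splits and seeds, then compare the resulting score samples. The arrays below are made-up illustration values.

        import numpy as np

        runs_p1 = np.array([0.83, 0.81, 0.84, 0.80, 0.82, 0.85, 0.79, 0.83])
        runs_p2 = np.array([0.82, 0.80, 0.81, 0.82, 0.79, 0.83, 0.80, 0.81])
        # Compare every run of p1 to every run of p2; > 0.5 means p1 tends to win
        prob_p1_wins = (runs_p1[:, None] > runs_p2[None, :]).mean()
        print(prob_p1_wins)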
  • 57. Summary – comparing learning procedures
    Sample multiple sources of variance: data split, random seeds, hyper-parameter tuning.
    Null-hypothesis testing: no t-test on cross-validation! Don't mis-interpret the p-value:
    - not significant: more data could change that
    - significant differences may be trivial
    Detect practical differences: compare the delta in performance to the standard deviation.
  • 58. Better experimental procedures
    Crack the black box open: a prediction score is seldom insightful.
    - Ablation studies: remove or change atomic elements.
    - Learning curves.
    Better benchmarking in these experiments: tune hyper-parameters to the same quality, randomize everything, account for variance in conclusions.
  • 59. Summary – Better benchmarking procedures
    Reminder: your validation measure is intrinsically unreliable (sampling noise). An arbitrary choice (random seed) may give seemingly good results that do not generalize; optimizing test accuracy will explore the tails. Evaluation is a bottleneck.
    [Figure: confidence intervals on accuracy widening as the number of available samples shrinks from 300 to 30, from roughly ±2% down to -15%/+12%.]
  • 60. Model evaluation
    Choice of performance metrics: should be suited to the application setting; machine learning tends to do metric chasing.
    Evaluation procedures: account for variance; the sources of variance differ between applying prediction rules and learning them.
    Careful benchmarking is crucial: optimistic flukes will not generalize.
    What is our purpose? External validity. @GaelVaroquaux
  • 61-63. References
    - Bandos, A. I., Rockette, H. E., and Gur, D. A permutation test sensitive to differences in areas for comparing ROC curves from a paired design. Statistics in Medicine, 24:2873, 2005.
    - Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281, 2012.
    - Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., Mohammadi Sepahvand, N., Raff, E., Madan, K., Voleti, V., et al. Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3, 2021.
    - Chyzhyk, D., Varoquaux, G., Thirion, B., and Milham, M. Controlling a confound in predictive models with a test set minimizing its effect. 2018 International Workshop on Pattern Recognition in Neuroimaging (PRNI), pages 1-4. IEEE, 2018.
    - Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.
    - Demšar, J. On the appropriateness of statistical tests in machine learning. Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML, page 65, 2008.
    - Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1923, 1998.
    - Dror, R., Baumer, G., Bogomolov, M., and Reichart, R. Replicability analysis for natural language processing: testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 2017.
    - Lesaffre, E. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU Hospital for Joint Diseases, 66(2), 2008.
    - Little, M. A., Varoquaux, G., Saeb, S., Lonini, L., Jayaraman, A., Mohr, D. C., and Kording, K. P. Using and understanding cross-validation strategies. Perspectives on Saeb et al. GigaScience, 6(5):1-6, 2017.
    - Maier-Hein, L., Reinke, A., Christodoulou, E., Glocker, B., Godau, P., Isensee, F., Kleesiek, J., Kozubek, M., Reyes, M., Riegler, M. A., et al. Metrics reloaded: pitfalls and recommendations for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
    - Makarova, A., Shen, H., Perrone, V., Klein, A., Faddoul, J. B., Krause, A., Seeger, M., and Archambeau, C. Overfitting in Bayesian optimization: an empirical study and early-stopping solution. arXiv preprint arXiv:2104.08166, 2021.
    - Nadeau, C. and Bengio, Y. Inference for the generalization error. Machine Learning, 52(3):239-281, 2003.
    - Picard, D. Torch.manual_seed(3407) is all you need: on the influence of random seeds in deep learning architectures for computer vision. arXiv:2109.08203, 2021.
    - Traut, N., Heuer, K., Lemaı̂tre, G., Beggiato, A., Germanaud, D., Elmaleh, M., Bethegnies, A., Bonnasse-Gahot, L., Cai, W., Chambon, S., et al. Insights from an autism imaging biomarker challenge: promises and threats to biomarker discovery. NeuroImage, 255:119171, 2022.
    - Varoquaux, G. Cross-validation failure: small sample sizes lead to large error bars. NeuroImage, 180:68-77, 2018.
    - Varoquaux, G. and Cheplygina, V. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):1-8, 2022.
    - Varoquaux, G. and Colliot, O. Evaluating machine learning models and their diagnostic value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.