Decoding, MVPA, and predictive models for neuroimaging diagnosis or prognosis all rely on cross-validation to measure the predictive accuracy of the model and, optionally, to tune the decoder. Cross-validation tests predictive power on left-out data unseen during the training of the predictive model. It is appealing because it is non-parametric and asymptotically unbiased. Common practice in neuroimaging relies on leave-one-out, yet statistical theory [1] suggests that this is suboptimal: the small test set leads to large variance, and the estimate is easily biased by sample correlations.
Decoders usually come with a hyper-parameter that controls the regularization, i.e., a bias/variance tradeoff. In machine learning, this tradeoff is typically adjusted to the signal-to-noise ratio of the data using cross-validation to maximize predictive accuracy. In this case, the accuracy of the decoder must then be assessed on an independent "validation set", using a "nested cross-validation".
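As a concrete illustration, here is a minimal sketch of such nested cross-validation with scikit-learn; the data, parameter grid, and split sizes are illustrative assumptions, not the exact setup of this study. The inner grid-search tunes C, while the outer loop measures accuracy on data unseen by the tuning.

```python
# Minimal nested cross-validation sketch (scikit-learn); the data and
# parameter grid below are illustrative stand-ins.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score

rng = np.random.RandomState(0)
X, y = rng.randn(200, 50), rng.randint(0, 2, 200)  # stand-in features/labels

# Inner loop: tune the regularization parameter C on the training data only
inner_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
tuned_svm = GridSearchCV(LinearSVC(), {"C": np.logspace(-4, 4, 9)}, cv=inner_cv)

# Outer loop: score the tuned decoder on left-out splits
outer_cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=1)
scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print("accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```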
Here we assess these practices empirically on neuroimaging data, to derive guidelines.
# Methods
Given 8 open datasets from openfMRI [2], we assess cross-validation on 35 decoding tasks, 15 of which are within-subject. We leave a large validation set untouched and perform nested cross-validation on the rest of the data. In a first experiment, we compare the accuracy of the decoder as measured by cross-validation with that measured on the left-out validation data. In a second experiment, we use nested cross-validation to tune the decoders, either by refitting with the best parameter or by averaging the best models. We use standard linear decoders: SVM and logistic regression, both sparse (l1 penalty) and non-sparse (l2 penalty).
We assess a variety of cross-validation strategies: leaving out single samples, leaving out full sessions or subjects, and repeated random splits leaving out 20% of the sessions or subjects.
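In scikit-learn terms, these strategies map onto the following splitters; this is a sketch under the assumption that a `groups` array encodes the session (or subject) of each sample:

```python
# Sketch of the three families of cross-validation strategies; X, y and
# groups are illustrative stand-ins for real decoding data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeaveOneGroupOut, GroupShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(120, 50)                # features
y = rng.randint(0, 2, 120)            # labels
groups = np.repeat(np.arange(6), 20)  # session/subject label of each sample

strategies = {
    "leave one sample out": LeaveOneOut(),
    "leave one session/subject out": LeaveOneGroupOut(),
    "20% of sessions/subjects out, 50 splits":
        GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0),
}
for name, cv in strategies.items():
    n = cv.get_n_splits(X, y, groups=groups)
    print("%s: %d folds" % (name, n))
```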
# Conclusions
The first finding is a confirmation of the theory that repeated random splits should be preferred to leave-one-sample-out: they are less fragile and less computationally costly.
Second, we find large error bars on cross-validation estimates of predictive power, 10% or more, particularly for within-subject analyses, likely because of marked sample inhomogeneities.
Finally, we find that setting decoder parameters by nested cross-validation does not yield much gain in prediction, particularly for non-sparse models. This is probably a consequence of our second finding.
These conclusions are crucial for decoding and information mapping, which rely on measuring prediction accuracy. This measure is more fragile than practitioners often assume.
# Slides
1. Cross-validation to assess decoder performance: the good, the bad, and the ugly
Gaël Varoquaux
https://hal.archives-ouvertes.fr/hal-01332785
2. Measuring prediction accuracy
To find the best method (computer scientists)
For information mapping = omnibus test (cognitive neuroimaging)
Cross-validation: asymptotically unbiased, non-parametric
3. Outline: (1) Some theory; (2) Empirical results on brain imaging
6. 1 Cross-validation
Test on independent data
[Diagram: the full data are repeatedly split into a train set and a test set, looping over splits]
Measures prediction accuracy
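The loop in the diagram, written out as a hedged scikit-learn sketch (stand-in data; a real decoding analysis would use fMRI features):

```python
# The cross-validation loop of the diagram, made explicit (stand-in data).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
X, y = rng.randn(200, 50), rng.randint(0, 2, 200)

scores = []
for train, test in ShuffleSplit(n_splits=10, test_size=0.2,
                                random_state=0).split(X):
    decoder = LinearSVC().fit(X[train], y[train])   # train on the train set
    scores.append(decoder.score(X[test], y[test]))  # test on the test set
print("prediction accuracy: %.2f" % np.mean(scores))
```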
7. 1 Choice of cross-validation strategy
Test on independent data: be robust to confounding dependencies; leave subjects out, or sessions out
Loop: more loops give more data points, but the error in training the model must be balanced against the error on the test set
8. 1 Choice of cross-validation strategy: theory
Negative bias (underestimates performance), decreasing with the size of the training set [Arlot & Celisse 2010, sec. 5.1]
Variance decreases with the size of the test set [Arlot & Celisse 2010, sec. 5.2]
Hence: leave out a fraction of 10–20% of the data, with many random splits respecting the dependency structure
10. 1 Tuning hyper-parameters
The computer scientist says: you need to set C in your SVM
[Plot: parameter tuning of C over 10^-4 to 10^4, showing accuracy on the training set and on the validation set]
11. 1 Nested cross-validation
Test on independent data, with two loops
[Diagram: an outer loop holds out a validation set from the full data; a nested loop splits the remaining data into train and test sets]
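The two loops of the diagram written out explicitly, as a sketch (stand-in data and grid; a real analysis would split by session or subject rather than at random):

```python
# Nested cross-validation: an outer loop for scoring, a nested loop for
# picking C (stand-in data; illustrative grid).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.RandomState(0)
X, y = rng.randn(200, 50), rng.randint(0, 2, 200)
Cs = np.logspace(-4, 4, 9)

val_scores = []
outer = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for train, val in outer.split(X):  # outer loop: hold out a validation set
    inner = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)
    # nested loop: pick the C with the best inner cross-validation score
    best_C = max(Cs, key=lambda C: cross_val_score(
        LinearSVC(C=C), X[train], y[train], cv=inner).mean())
    decoder = LinearSVC(C=best_C).fit(X[train], y[train])
    val_scores.append(decoder.score(X[val], y[val]))
print("validation accuracy: %.2f" % np.mean(val_scores))
```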
12. 2 Empirical results on brain imaging
13. 2 Datasets and tasks
7 fMRI datasets (6 from openfMRI)
Haxby: 5 subjects, 15 intra-subject predictions
Inter-subject predictions on 6 studies
OASIS VBM, gender discrimination
HCP MEG task, intra-subject, working memory
Number of samples: ∼200 (min 80, max 400)
Accuracy: min 62%, max 96%
14. 2 Experiment 1: measuring cross-validation error
Leave out a large validation set
Measure error by cross-validation on the rest
Compare
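A sketch of this protocol (with stand-in data; the real experiments split along sessions or subjects):

```python
# Experiment-1 protocol in sketch form: hold out a large validation set,
# measure accuracy by cross-validation on the rest, compare the two.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score

rng = np.random.RandomState(0)
X, y = rng.randn(400, 50), rng.randint(0, 2, 400)
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.5,
                                                random_state=0)

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=1)
cv_accuracy = cross_val_score(LinearSVC(), X_rest, y_rest, cv=cv).mean()
val_accuracy = LinearSVC().fit(X_rest, y_rest).score(X_val, y_val)
print("cross-validated: %.2f  validation set: %.2f"
      % (cv_accuracy, val_accuracy))
```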
15. 2 Cross-validated measure versus validation set
[Scatter plot: accuracy measured by cross-validation (y-axis) against accuracy on the validation set (x-axis), both from 50% to 100%, for intra-subject and inter-subject tasks]
19. 2 Different cross-validation strategies
Difference between the accuracy measured by cross-validation and the accuracy on the validation set (range across tasks):

| Cross-validation strategy     | Intra-subject | Inter-subject |
|-------------------------------|---------------|---------------|
| Leave one sample out          | -22% to +19%  | +3% to +43%   |
| Leave one subject/session out | -10% to +10%  | -21% to +17%  |
| 20% left out, 3 splits        | -11% to +11%  | -24% to +16%  |
| 20% left out, 10 splits       | -9% to +9%    | -24% to +14%  |
| 20% left out, 50 splits       | -9% to +8%    | -23% to +13%  |
22. 2 Different cross-validation strategies
Difference between the accuracy measured by cross-validation and the accuracy on the validation set (range):

| Cross-validation strategy | MEG data     | Simulations  |
|---------------------------|--------------|--------------|
| Leave one sample out      | -16% to +14% | +4% to +33%  |
| Leave one block out       | -15% to +13% | -8% to +8%   |
| 20% left out, 3 splits    | -15% to +12% | -10% to +11% |
| 20% left out, 10 splits   | -13% to +10% | -8% to +8%   |
| 20% left out, 50 splits   | -12% to +10% | -7% to +7%   |
23. 2 Experiment 2: parameter tuning
Compare different strategies on the validation set:
1. Use the default C = 1
2. Use C = 1000
3. Choose the best C by cross-validation and refit
4. Average the best models in cross-validation
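Strategy 4 is the least standard; here is a sketch of one way to implement it (stand-in data, and the selection rule is an assumption, not necessarily the study's exact one): keep the winning decoder of each fold and average their decision functions.

```python
# "Average the best models": keep the per-fold winner over the C grid,
# then average decision functions on new data (stand-in data and rule).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
X, y = rng.randn(160, 50), rng.randint(0, 2, 160)  # training data
X_val = rng.randn(40, 50)                          # validation set
Cs = np.logspace(-4, 4, 9)

fold_winners = []
for train, test in ShuffleSplit(n_splits=10, test_size=0.2,
                                random_state=0).split(X):
    fits = [LinearSVC(C=C).fit(X[train], y[train]) for C in Cs]
    # the fold's winner: best accuracy on this fold's test set
    fold_winners.append(max(fits, key=lambda m: m.score(X[test], y[test])))

# model averaging: mean decision function of the fold winners
decision = np.mean([m.decision_function(X_val) for m in fold_winners], axis=0)
y_pred = (decision > 0).astype(int)
```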
24. 2 Experiment 2: parameter tuning
[Bar plot: accuracy of the four strategies on the validation set, for non-sparse decoders (SVM l2, logistic regression l2) and sparse decoders (SVM l1, logistic regression l1)]
28. @GaelVaroquaux
Cross-validation: lessons learned
Don't use leave-one-out
Use random 10–20% splits respecting the sample structure
Cross-validation has error bars of ±10%
Cross-validation is inefficient for parameter tuning:
- C = 1 for SVM-l2
- model averaging for SVM-l1
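Put together as a scikit-learn sketch (the data and session labels are placeholders for a real decoding dataset):

```python
# The take-home recipe: no leave-one-out; many random splits leaving out
# 10-20% of the sessions/subjects; default C = 1 for an l2 SVM.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GroupShuffleSplit, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(200, 50)                   # placeholder features
y = rng.randint(0, 2, 200)               # placeholder labels
sessions = np.repeat(np.arange(10), 20)  # placeholder session labels

cv = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
scores = cross_val_score(LinearSVC(C=1), X, y, groups=sessions, cv=cv)
# report the spread, not just the mean: error bars can reach +/-10%
print("accuracy: %.2f (min %.2f, max %.2f)"
      % (scores.mean(), scores.min(), scores.max()))
```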
30. References
S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.