Machine learning builds predictive models from data. It is now widely used on medical images, for applications ranging from segmentation to diagnosis.
This is an introductory tutorial on machine learning, giving intuitions from the statistical point of view. It introduces the methodology, the concepts behind the central models, the validation framework, and some caveats to watch for.
It also discusses applications to drawing conclusions from brain imaging, and uses these applications to highlight technical aspects of running machine learning models on high-dimensional data such as medical images.
2. Introduction to Machine Learning in MR Imaging
Gaël Varoquaux
Outline:
1 The machine learning setting
2 A glance at a few models
3 Model evaluation
4 Learning on full-brain images
5 Learning on correlations in brain activity
3. Declaration of Financial Interests or Relationships
Speaker Name: Gaël Varoquaux
I have no financial interests or relationships to disclose with regard to the subject matter of this presentation.
4. 1 The machine learning setting
Adjusting models for prediction
G Varoquaux 3
5. 1 Machine learning in a nutshell: an example
Face recognition
Andrew Bill Charles Dave
7. 1 Machine learning in a nutshell
A simple method, the “nearest neighbor” method:
1 Store all the known (noisy) images and the names that go with them.
2 For a new (noisy) image, find the stored image that is most similar.
How many errors on already-known images?
... 0: no errors
Test data = train data
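The nearest-neighbor method above can be sketched in a few lines; a minimal illustration with scikit-learn (the toolbox named at the end of the deck) on synthetic “images” — the data here is made up for illustration:

```python
# "Nearest neighbor": store the known images, predict the name of the
# most similar stored image.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 100))       # 40 noisy "images", 100 pixels each
y = np.repeat([0, 1, 2, 3], 10)      # 4 "names": Andrew, Bill, Charles, Dave

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Scored on the training images, the most similar image is the image
# itself, so there are no errors -- test data = train data:
print(model.score(X, y))             # 1.0
```

This perfect score is exactly the trap the slide points at: it says nothing about new images.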
10. 1 Machine learning in a nutshell: regression
A single descriptor: 1 dimension
[Figure: data points in the (x, y) plane]
11. 1 Machine learning in a nutshell: regression
A single descriptor: 1 dimension
[Figure: two candidate fits of the same (x, y) data]
Which model to prefer?
12. 1 Machine learning in a nutshell: regression
A single descriptor: 1 dimension
[Figure: a simple fit versus a wiggly fit that interpolates every point]
Problem of “over-fitting”
Minimizing the training error is not always the best strategy: the model learns the noise
Test data = train data
13. 1 Machine learning in a nutshell: regression
A single descriptor: 1 dimension
[Figure: a simple fit versus a complex fit]
Prefer simple models = concept of “regularization”
Balance the number of parameters to learn with the amount of data
Bias–variance tradeoff
15. 1 Machine learning in a nutshell: regression
A single descriptor: 1 dimension; two descriptors: 2 dimensions (X_1, X_2)
More descriptors ⇒ more parameters
⇒ Models with more parameters need much more data
“curse of dimensionality”
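The curse of dimensionality can be made concrete with a small numeric experiment (a sketch, not from the slides): as the dimension grows, all points become almost equally far apart, so notions of similarity, and nearest-neighbor prediction, degrade unless data grows accordingly.

```python
# With random points in [0, 1]^dim, the ratio between the nearest and
# the farthest distance to a reference point tends toward 1 as the
# dimension grows: distances "concentrate".
import numpy as np

rng = np.random.RandomState(0)
ratios = {}
for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, dim))
    d = np.linalg.norm(X - X[0], axis=1)[1:]   # distances to one point
    ratios[dim] = d.min() / d.max()
    print(dim, ratios[dim])                    # ratio grows toward 1
```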
17. 1 Some formalism: bias and regularization
Settings: data (X, y), prediction y ∼ f(X, w)
Our goal: minimize over w: E[‖y − f(X, w)‖]
We can only measure ‖y − f(X, w)‖ on the data at hand

Prediction is very difficult, especially about the future. — Niels Bohr

Solution: bias w, to push it toward a plausible solution
In a minimization framework: minimize over w: ‖y − f(X, w)‖ + p(w)
20. 1 Probabilistic modeling: the Bayesian point of view
P(w|x, y) ∝ P(x, y|w) P(w)   (∗)
“posterior”, the quantity of interest; “forward model”; “prior”: expectations on w
Forward model: y = f(x, w) + e, e: noise ⇒ P(x, y|w) ∝ exp(−ℓ(y − f(x, w)))
Prior: P(w) ∝ exp(−p(w))
Negated log of (∗): ℓ(y − f(x, w)) + p(w)
Minimization framework = maximum a posteriori
Note that the Bayesian interpretation may be more subtle [Gribonval 2011].
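The correspondence stated on the slide can be spelled out in one line (a brief sketch, with the slide's notation):

```latex
P(w \mid x, y) \;\propto\; e^{-\ell(y - f(x, w))}\, e^{-p(w)}
\quad\Longrightarrow\quad
\hat{w}_{\mathrm{MAP}}
= \operatorname*{argmax}_{w} P(w \mid x, y)
= \operatorname*{argmin}_{w}\; \ell\bigl(y - f(x, w)\bigr) + p(w)
```

For instance, Gaussian noise gives a squared-error loss and a Gaussian prior gives a squared-norm penalty, so the MAP estimate is ridge regression.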
21. 1 Summary: elements of a machine-learning method
A forward model: y_pred = f(X, w)
Numerical rules to go from X to y
A loss, or data fit:
a measure of error between y_true and y_pred
Can be given by a noise model
Regularization:
any way of restricting model complexity
- by choices in the model
- via a penalty
22. 2 A glance at a few models
Linear models
Random forests
Gradient boosted trees
23. 2 Classification: categorical variables
y describes categories, say (0, 1)
[Figure: binary labels y as a function of x]
Fit y by a straight line?
A sigmoid is better suited ⇒ logistic regression
SVMs (support vector machines) are similar
Note that points far from the threshold do not influence the fit
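A minimal logistic-regression sketch (scikit-learn assumed; synthetic data, not the slide's): the model fits a sigmoid, so predictions saturate toward 0 and 1 far from the threshold.

```python
# Logistic regression: a sigmoid on top of a linear score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x = rng.normal(size=(100, 1))
y = (x[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)  # labels in (0, 1)

model = LogisticRegression().fit(x, y)
# The fitted sigmoid, not a straight line: near 0, near 0.5, near 1
print(model.predict_proba([[-3.0], [0.0], [3.0]])[:, 1])
```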
26. 2 Classification on multivariate data: SVMs
Choosing a separating hyperplane in the (X_1, X_2) plane
SVM: support vector machine
Discriminative method: choose the hyperplane that maximizes the margin
For this, use the exemplars close to the hyperplane (the support vectors)
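A small SVM sketch (scikit-learn assumed, toy data): the fitted hyperplane is defined by the few exemplars near the margin, which the estimator exposes directly.

```python
# Linear SVM: only the support vectors define the hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated point clouds in 2 dimensions:
X = np.r_[rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=2, size=(50, 2))]
y = np.r_[np.zeros(50), np.ones(50)]

svm = SVC(kernel="linear").fit(X, y)
# Far-away points do not influence the fit:
print(len(svm.support_vectors_), "support vectors out of", len(X))
```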
29. 2 With brain images
Find a separating hyperplane between images
The SVM builds it by combining exemplars
⇒ the hyperplane looks like the train images
31. 2 Risk minimization and penalization
Inverse problem: minimize the error term
ŵ = argmin_w ℓ(y − X w)
Ill-posed: many different w will give the same prediction error
To choose one: inject a prior, with a penalty
ŵ = argmin_w ℓ(y − X w) + p(w)
32. 2 Sparse models: selecting predictive voxels?
Lasso estimator: min_x ‖y − A x‖₂² + λ ‖x‖₁
(data fit + penalization)
[Figure: the ℓ₁ ball in the (x₁, x₂) plane]
Between correlated features, it selects a random subset
[Wainwright 2009, Varoquaux... 2012]
Correlated designs violate the restricted isometry property
[Figure: lasso weight map, 23 non-zero voxels]
35. 2 Underfit versus overfit
Polynomial of degree 1: train explained variance 0.56, test explained variance 0.56 ⇒ underfit
Polynomial of degree 9: train explained variance 0.99, test explained variance −98000 ⇒ overfit
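The degree-1 versus degree-9 comparison is easy to reproduce (a sketch on synthetic data; the numbers will differ from the slide's figure):

```python
# Fit polynomials of degree 1 and 9 to noisy samples of a smooth curve
# and compare explained variance (R^2) on train and test data.
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
truth = lambda x: np.cos(3 * x)
x_train = np.sort(rng.uniform(-1, 1, 15))
x_test = np.sort(rng.uniform(-1, 1, 15))
y_train = truth(x_train) + 0.2 * rng.normal(size=15)
y_test = truth(x_test) + 0.2 * rng.normal(size=15)

r2 = {}
for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    r2[degree] = (r2_score(y_train, np.polyval(coefs, x_train)),   # train
                  r2_score(y_test, np.polyval(coefs, x_test)))     # test
    print(degree, r2[degree])
```

The high-degree fit always wins on the train set; only the test score reveals whether it generalizes.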
37. 2 Random forests
[Figure: 1 staircase, then 10, 50, and 300 staircases averaged]
Fit a staircase of many values (a tree)
Fit several on perturbed data
Average the predictions
Forest = ensemble of trees
Bagging = bootstrap aggregating
Applied to models that overfit
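The averaging effect can be sketched with scikit-learn (toy data; the "staircases" are regression trees): a single fully grown tree overfits the noise, while a forest of trees fit on bootstrap resamples averages out into a much smoother prediction.

```python
# One deep tree (one staircase) versus 300 bagged trees (a forest).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 100))[:, None]
y = np.sin(4 * x[:, 0]) + 0.3 * rng.normal(size=100)

tree = DecisionTreeRegressor().fit(x, y)                       # overfits
forest = RandomForestRegressor(n_estimators=300,
                               random_state=0).fit(x, y)       # averaged

x_new = np.linspace(0, 1, 200)[:, None]
# Jumpiness of each prediction: the averaged staircases are smoother
print(np.std(np.diff(tree.predict(x_new))),
      np.std(np.diff(forest.predict(x_new))))
```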
46. 2 Gradient boosted trees
[Figure: 1 staircase, then 2, 3, and 300 staircases combined]
Fit a staircase of 10 constant values (a tree)
Fit a new staircase on the errors; keep going
“Boosted” regression trees
Boosting = iteratively fitting models on the errors
Applied to models that underfit
Complexity trade-offs: computational + statistical
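The boosting loop described above can be written by hand in a few lines (a sketch; step size and tree size are arbitrary choices for illustration): each new small tree is fit on the residual errors of the current ensemble.

```python
# Boosting by hand: iteratively fit 10-leaf "staircases" on the errors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 200))[:, None]
y = np.sin(4 * x[:, 0]) + 0.1 * rng.normal(size=200)

prediction = np.zeros_like(y)
for _ in range(300):
    residual = y - prediction                           # current errors
    stump = DecisionTreeRegressor(max_leaf_nodes=10).fit(x, residual)
    prediction += 0.1 * stump.predict(x)                # small step on the errors

print(np.mean((y - prediction) ** 2))                   # train error shrinks
```

Each individual tree underfits; it is the accumulation of corrections that makes the ensemble expressive.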
54. 3 Generalization as a test: cross-validation
⇒ Need to test on independent data, to control for model complexity
Split the full data into a train set and a test set, and loop over splits
This measures prediction accuracy [Varoquaux... 2017]
To also tune parameters: a nested loop, with a validation set split out of each train set, and the test set kept for the outer loop
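The outer loop and the nested loop can be composed directly in scikit-learn (a sketch on synthetic data): the inner `GridSearchCV` tunes on validation splits inside each train set, and the outer `cross_val_score` measures accuracy on held-out test sets.

```python
# Nested cross-validation: tune C inside, measure accuracy outside.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Inner (nested) loop: choose C on validation sets within each train set
tuned_svm = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
# Outer loop: prediction accuracy on left-out test sets
scores = cross_val_score(tuned_svm, X, y, cv=5)
print(scores.mean(), scores.std())   # accuracy, and its spread across folds
```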
57. 3 Confounds & dependencies
[Figure: time series X1, X2] Prediction is easy if you look at the last time point
Auto-correlated noise: neighboring data points are not independent
⇒ no evidence of generalization, no evidence of useful prediction
Validate on independent data: not 2 images of the same subject, unless it is a longitudinal study
[Varoquaux... 2017, Little... 2017]
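With several images per subject, the "independent data" requirement means splitting by subject, not by image; scikit-learn's `GroupKFold` does exactly this (a sketch with made-up subject labels):

```python
# Split by subject so that no subject appears in both train and test.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)          # 6 images...
subjects = np.array([0, 0, 1, 1, 2, 2])  # ...from 3 subjects

for train, test in GroupKFold(n_splits=3).split(X, groups=subjects):
    print("train subjects:", set(subjects[train]),
          "test subjects:", set(subjects[test]))
```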
59. 3 Uncertainty of cross-validation [Varoquaux 2017]
[Figure: distribution of prediction accuracies across 50 folds, for leave-one-out (LOO) and 50 splits with 20% test data, as the number of available samples grows from 100 to 1000]
This should be seen as a posterior distribution (or a bootstrap)
= folds are not independent measures
No statistical tests (e.g. T tests) on folds
Sample size   Confidence interval (binomial model)
100           18.0%
1 000         5.7%
10 000        1.8%
100 000       0.568%
Sampling noise – decreases slowly with n
62. 3 Uncertainty of cross-validation [Varoquaux 2017]
[Figure: difference between public and private scores, spread −15% / +14%]
Kaggle competition on r-fMRI for schizophrenia
2 different test sets, of size 30 and 28
Winner of the competition: the first code written
63. 4 Learning on full-brain images
Personal focus on functional imaging
64. 4 Linear models on brain images
Design matrix × coefficients = target
The coefficients are brain maps
65. 4 Finding the predictive regions?
Face vs house visual recognition [Haxby... 2001]
SVM: error 26%
Sparse model: error 19%
Ridge: error 15%
68. 4 Estimating the linear model [Gramfort... 2013]
Inverse problem: minimize the error term
ŵ = argmin_w ℓ(y − X w)
Ill-posed: many different w will give the same prediction error
The choice is driven by (implicit) priors: SVM, sparse, ridge, TV-ℓ₁
Sparse estimators (Lasso): min_x ‖y − A x‖₂² + λ ‖x‖₁ (data fit + penalization)
Total-variation penalization: impose sparsity on the gradient of the image, p(w) = ‖∇w‖₁
In fMRI: [Michel... 2011]
72. 4 Tricks of the trade
Spatial smoothing (12 mm): error rate 43% → 31%, but giving up on resolution
Feature selection: error rate 23%, but giving up on fully multivariate models
73. 4 FREM: ensemble models [Hoyos-Idrobo... 2017]
Very fast, sub-optimal models; average many of them:
learn a parcellation on perturbed data,
estimate linear models,
average the results
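A rough sketch in the spirit of the recipe above (the details here — cluster count, number of models, linear model choice — are assumptions for illustration, not the published FREM method): fast sub-optimal models on clustered ("parcellated") features, fit on perturbed data, then averaged.

```python
# Ensemble of fast models: bootstrap the data, cluster the features into
# "parcels", fit a linear model on the reduced data, average the votes.
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=500, random_state=0)
rng = np.random.RandomState(0)

votes = np.zeros(len(y))
for _ in range(20):
    boot = rng.randint(0, len(y), len(y))          # perturbed (bootstrap) data
    parcels = FeatureAgglomeration(n_clusters=50).fit(X[boot])
    model = LogisticRegression().fit(parcels.transform(X[boot]), y[boot])
    votes += model.predict(parcels.transform(X))   # accumulate the votes

print(np.mean((votes / 20 > 0.5) == y))            # ensemble accuracy (train data)
```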
78. 5 Learning on correlations in brain activity
“connectome” prediction
Brain “at rest” = easier to scan
79. From rest-fMRI to biomarkers
No salient features in rest fMRI
⇒ define functional regions, learn their interactions, detect differences
83. From rest-fMRI to biomarkers
Pipeline: RS-fMRI → region definition → time-series extraction → functional connectivity matrix → supervised learning
Unsupervised learning ⇒ representations adapted to the data (dictionary learning)
1 TB of data → 50 GB of data
More data is always better, but at a computational cost [Mensch... 2016]
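The last two steps of the pipeline can be sketched with plain numpy (region definition is skipped, and the shapes are made up for illustration): a correlation matrix per subject, with its upper triangle vectorized as a feature vector for supervised learning.

```python
# From per-region time series to connectivity features.
import numpy as np

rng = np.random.RandomState(0)
n_subjects, n_timepoints, n_regions = 20, 100, 10

features = []
for _ in range(n_subjects):
    signals = rng.normal(size=(n_timepoints, n_regions))  # extracted series
    corr = np.corrcoef(signals.T)                         # connectivity matrix
    # Keep the upper triangle: one feature vector per subject
    features.append(corr[np.triu_indices(n_regions, k=1)])

features = np.asarray(features)   # ready for a supervised learner
print(features.shape)             # (20, 45)
```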
86. 5 Biomarkers, heterogeneity, & big data: an autism study [Abraham... 2017]
Ill-defined pathology, heterogeneous dataset (ABIDE)
[Figure: accuracy as a function of the fraction of subjects used]
As the number of subjects increases, accuracy keeps improving:
heterogeneity is not the limiting factor
87. @GaelVaroquaux
An introduction to machine learning for MR imaging
Overfit versus underfit: the best fit is not the best performer; minimizing an observed error is the wrong solution
Evaluate on independent data: cross-validation, with error bars
More data trumps being clever
Software: scikit-learn & nilearn, in Python
http://scikit-learn.org http://nilearn.github.io
91. References I
A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras, B. Thirion,
and G. Varoquaux. Deriving reproducible biomarkers from multi-site resting-state
data: An autism-based example. NeuroImage, 147:736–745, 2017.
A. Gramfort, B. Thirion, and G. Varoquaux. Identifying predictive regions from fMRI
with TV-L1 prior. In PRNI, page 17, 2013.
R. Gribonval. Should penalized least squares regression be interpreted as maximum a
posteriori estimation? IEEE Transactions on Signal Processing, 59(5):2405–2410,
2011.
J. V. Haxby, I. M. Gobbini, M. L. Furey, ... Distributed and overlapping representations
of faces and objects in ventral temporal cortex. Science, 293:2425, 2001.
A. Hoyos-Idrobo, G. Varoquaux, Y. Schwartz, and B. Thirion. FREM: scalable and
stable decoding with fast regularized ensemble of models. NeuroImage, 2017.
92. References II
M. A. Little, G. Varoquaux, S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, and K. P.
Kording. Using and understanding cross-validation strategies. Perspectives on Saeb
et al. GigaScience, 6(5):1–6, 2017.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive
matrix factorization. In International Conference on Machine Learning, pages
1737–1746, 2016.
V. Michel, A. Gramfort, G. Varoquaux, E. Eger, and B. Thirion. Total variation
regularization for fMRI-based prediction of behavior. Medical Imaging, IEEE
Transactions on, 30:1328, 2011.
S. M. Smith, T. E. Nichols, D. Vidaurre, A. M. Winkler, T. E. Behrens, M. F. Glasser,
K. Ugurbil, D. M. Barch, D. C. Van Essen, and K. L. Miller. A positive-negative
mode of population covariation links brain connectivity, demographics and behavior.
Nature neuroscience, 18(11):1565–1567, 2015.
G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars.
NeuroImage, 2017.
93. References III
G. Varoquaux and B. Thirion. How machine learning is shaping cognitive
neuroimaging. GigaScience, 3:28, 2014.
G. Varoquaux, A. Gramfort, and B. Thirion. Small-sample brain mapping: sparse
recovery on spatially correlated designs with randomization and clustering. In ICML,
page 1375, 2012.
G. Varoquaux, P. R. Raamana, D. A. Engemann, A. Hoyos-Idrobo, Y. Schwartz, and
B. Thirion. Assessing and tuning brain decoders: cross-validation, caveats, and
guidelines. NeuroImage, 145:166–179, 2017.
M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery
using ℓ1-constrained quadratic programming. Trans Inf Theory, 55:2183, 2009.