Tutorial @Ubicomp 2015: Bridging the Gap -- Machine Learning for Ubiquitous Computing (evaluation session).
A tutorial on promises and pitfalls of Machine Learning for Ubicomp (and Human Computer Interaction). From Practitioners for Practitioners.
Presenter: Nils Hammerla <n.hammerla@gmail.com>
Video recording of the talks as they were held at Ubicomp:
https://youtu.be/LgnnlqOIXJc?list=PLh96aGaacSgXw0MyktFqmgijLHN-aQvdq
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
1. Evaluation*
in ubiquitous computing
* (quantitative evaluation, qualitative studies are a field of their own!)
Nils Hammerla, Research Associate @ Open Lab
2. Evaluation? Why?
• Performance evaluation
• How well does it work?
• Model selection
• Which approach works best?
• Parameter selection
• How do I tune my approach for best results?
• Demonstration of feasibility
• Is it worthwhile investing more time/money?
4.–7. Training and test-sets
• Hold-out validation
• Choose part of the data as training and test-set
• By some heuristic (e.g. 20% test),
• Depending on study design.
• Gives a single performance estimate
• Performance may critically depend on chosen test-set
[Figure: data from Users 1–6 split into training, validation, and test sets; e.g. Runs 1–3 of a user for training, Run 4 for testing]
• User-dependent performance
How would the system perform for more data from a known user?
• User-independent performance
How would the system perform for a new user?
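The two questions correspond to two different ways of cutting the same data. A minimal sketch, assuming a hypothetical layout of 6 users with 4 runs of 10 windows each (all names and sizes invented for illustration):

```python
# Hypothetical data layout: each sample is tagged with the user and run it came from.
samples = [
    {"user": u, "run": r, "window": i}
    for u in range(1, 7)   # Users 1-6
    for r in range(1, 5)   # Runs 1-4 per user
    for i in range(10)     # 10 windows per run
]

# User-independent split: hold out whole users for validation and testing.
train = [s for s in samples if s["user"] <= 4]
val   = [s for s in samples if s["user"] == 5]
test  = [s for s in samples if s["user"] == 6]

# User-dependent split: every user appears in training, but whole runs
# are held out (runs 1-3 for training, run 4 for testing).
train_dep = [s for s in samples if s["run"] <= 3]
test_dep  = [s for s in samples if s["run"] == 4]
```

Splitting at whole-user or whole-run boundaries, rather than at random, is what makes the resulting estimate answer the intended question.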
8. Training and test-sets
• Repeated hold-out
• Split data into k contiguous folds
• A split into k folds leads to k performance estimates
• Gives a more reliable performance estimate
• But: this depends on how the folds are constructed!
• Variants
• Leave-one-subject-out (LOSO) — User-independent
• Leave-one-run-out — User-dependent
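Both variants can be expressed as the same fold generator over a group label: user IDs for LOSO, run IDs for leave-one-run-out. A sketch (the group labels below are made up):

```python
def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs with one whole group held out per fold."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield train_idx, test_idx

# LOSO: group label = user ID of each sample (hypothetical labels)
users = [1, 1, 1, 2, 2, 3, 3, 3]
folds = list(leave_one_group_out(users))
# 3 users -> 3 folds, each testing on one completely unseen user
```

Passing run IDs instead of user IDs turns the same generator into leave-one-run-out.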
9. Training and test-sets
• (Random, stratified) cross-validation
• Split data into k folds at the level of individual samples
• Folds constructed to be stratified w.r.t. class distribution
• Popular in general machine learning
• Standard approach in many ML frameworks
• User-dependent performance estimate
• Requires samples to be statistically independent
• This is rarely the case in ubicomp!
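One common way to construct stratified folds is to deal each class's samples round-robin across the k folds, so every fold mirrors the overall class distribution. A minimal sketch with invented labels (not the implementation of any particular ML framework):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds with roughly equal class distributions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):   # deal each class round-robin
            folds[j % k].append(i)
    return folds

labels = ["walk"] * 9 + ["run"] * 6
folds = stratified_folds(labels, 3)
# each of the 3 folds holds 3 "walk" and 2 "run" samples
```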
10. Pitfalls
• User dependent vs independent performance
• Assumption: User dependent performance is the upper bound of
possible system performance
• Cross-validation therefore popular in ubicomp to demonstrate
feasibility of e.g. new technical approach
[Figure: LOSO vs cross-validation (xval) accuracy (0–100%) reported across six studies in related work]
11. Pitfalls
• Cross-validation in segmented time-series
• When testing on segment i, it is very likely that segments i-1 or i+1 are in the training set.
• Neighbouring segments typically overlap
• They are therefore very similar!
• This biases the results and leads to inflated performance figures
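The bias is easy to make concrete: with a sliding window at 50% overlap, each segment shares half its samples with each neighbour, so a random split almost guarantees near-duplicates of every test segment in the training set. A sketch (window and step sizes below are arbitrary):

```python
# Sliding-window segmentation with 50% overlap.
window, step = 128, 64
n_samples = 1024

segments = [range(start, start + window)
            for start in range(0, n_samples - window + 1, step)]

# Fraction of raw samples shared between neighbouring segments: 0.5
a, b = set(segments[0]), set(segments[1])
shared = len(a & b) / window
```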
12. Pitfalls
• Cross-validation in segmented time-series
• This is an issue, as cross-validation is widely used
• For user-dependent evaluation
• For parameter estimation (e.g. in neural networks)
• A simple extension of cross-validation alleviates the bias
• There will be a talk on this on Friday morning!
13. Pitfalls
• Reporting results
• Always report mean and standard deviation of results
• T-tests for difference in mean performance
• Whether t-tests on cross-validated results are meaningful is still
debated in statistics
• You have to use the same folds when comparing different
approaches!
• Better: repeat whole experiment multiple times (if computationally
feasible)
14.–20. Performance metrics

Confusion matrix (rows: ground truth, columns: prediction):

         A     B     C
   A  1022    13   144
   B   300   542    24
   C    12    55   132

Basic elements for each class (A, B, C):
• true positives
• false positives
• false negatives
• true negatives
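How the four elements fall out of the matrix can be sketched directly (class indices 0, 1, 2 standing for A, B, C):

```python
# Confusion matrix from the slide (rows: ground truth, columns: prediction).
conf = [
    [1022, 13, 144],   # ground truth A
    [300, 542, 24],    # ground truth B
    [12, 55, 132],     # ground truth C
]

def basic_elements(conf, c):
    """tp, fp, fn, tn for class index c."""
    n = len(conf)
    tp = conf[c][c]
    fp = sum(conf[r][c] for r in range(n) if r != c)   # rest of column c
    fn = sum(conf[c][p] for p in range(n) if p != c)   # rest of row c
    tn = sum(conf[r][p] for r in range(n) for p in range(n)
             if r != c and p != c)                     # everything else
    return tp, fp, fn, tn

tp, fp, fn, tn = basic_elements(conf, 0)   # class A: (1022, 312, 157, 753)
```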
21. Performance metrics

For each class (or for two-class problems):

Precision / PPV       tp / (tp + fp)
Recall / Sensitivity  tp / (tp + fn)
Specificity           tn / (tn + fp)
Accuracy              (tp + tn) / (tp + fp + fn + tn)
F1-score              2 * prec * sens / (prec + sens)
22. Performance metrics

For a set of classes C = {A, B, C, …}, with n_c samples of class c and N samples in total:

Mean accuracy      (1/|C|) Σ_{c∈C} acc_c
Overall accuracy   (1/N) Σ_{c∈C} tp_c
Weighted F1-score  (2/N) Σ_{c∈C} n_c · (prec_c × sens_c) / (prec_c + sens_c)
Mean F1-score      (2/|C|) Σ_{c∈C} (prec_c × sens_c) / (prec_c + sens_c)
23. Performance metrics

Normalised confusion matrices
• Divide each row by its sum to get confusion probabilities
[Figure: normalised confusion matrices (classes: asleep, off, on, dys) for a) the HOME data-set, annotated via diary, and b) the LAB data-set, annotated by a clinician]
[Hammerla, Nils Y., et al. "PD Disease State Assessment in Naturalistic Environments using Deep Learning." AAAI 2015]
24. Performance metrics

Receiver Operating Characteristic (ROC)
• Illustrates trade-off between
True Positive Rate (sensitivity / recall), and
False Positive Rate (1 - specificity)
• Useful if approach has a simple parameter, like a threshold
[Figure: ROC curves for different classifiers (knn, c45, logR, PCA−logR)]
[Ladha, Cassim, et al. "ClimbAX: skill assessment for climbing enthusiasts." Ubicomp 2013]
25. Performance metrics

Probabilistic models:
• Log likelihood
• AIC / BIC
• KL-divergence

Regression:
• Mean Square Error (MSE)
• Root Mean Square Error (RMSE)
• Residuals
• Correlation coefficients
26. Example

Confusion matrix (rows: ground truth, columns: prediction):

         A     B     C
   A  1022    13   144
   B   300   542    24
   C    12    55   132

                  A     B     C
PPV            0.77  0.89  0.44
Sensitivity    0.87  0.63  0.66
Specificity    0.71  0.95  0.92
Accuracy       0.79  0.83  0.90
F1-score       0.81  0.73  0.53

Overall Accuracy   0.76
Weighted F1-score  0.76
Mean F1-score      0.69

[Figure: the same confusion matrix, row-normalised (scale 0–1)]
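The aggregate figures in the example can be reproduced from the matrix alone; a minimal sketch applying the formulas from slide 22:

```python
# Confusion matrix from the example (rows: ground truth, columns: prediction).
conf = [
    [1022, 13, 144],
    [300, 542, 24],
    [12, 55, 132],
]
n = len(conf)                         # number of classes |C|
N = sum(sum(row) for row in conf)     # total number of samples

def f1(c):
    """F1-score of class index c."""
    tp = conf[c][c]
    fp = sum(conf[r][c] for r in range(n) if r != c)
    fn = sum(conf[c][p] for p in range(n) if p != c)
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)
    return 2 * prec * sens / (prec + sens)

overall_acc = sum(conf[c][c] for c in range(n)) / N             # 0.76
mean_f1     = sum(f1(c) for c in range(n)) / n                  # 0.69
weighted_f1 = sum(sum(conf[c]) * f1(c) for c in range(n)) / N   # 0.76
```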
27.–28. Which metric should I not use?
• Specificity, or per-class accuracy
• Less meaningful as the number of classes grows
• #TN cells grow in O(n²),
• #FN, #FP cells grow in O(n)!
• Leads to e.g. incredible specificity figures (> 0.99)
• Better: PPV or Recall, which do not involve the TN count
29. Which metric should I not use?
• Overall accuracy: (1/N) Σ_{c∈C} tp_c
• In unbalanced data-sets, overall accuracy can be misleading

Classifier 1 (predicts the majority class almost exclusively):
   10,000     10      2
      100      1      0
       20      1      0
Overall accuracy = 0.99, Weighted F1-score = 0.98, Mean F1-score = 0.34

Classifier 2 (also recognises the rare classes):
    9,950     40     22
       40     45     15
       10      1     10
Overall accuracy = 0.99, Weighted F1-score = 0.99, Mean F1-score = 0.59

Overall accuracy and weighted F1 are nearly identical for both, but the mean F1-score (0.34 << 0.59) reveals the difference.
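The first matrix can be checked with a short sketch; note the guard for classes with no true positives, where precision and recall are undefined and F1 is conventionally taken as 0:

```python
# Classifier that largely ignores the rare classes (first matrix on the slide).
conf = [
    [10000, 10, 2],
    [100, 1, 0],
    [20, 1, 0],
]
n = len(conf)
N = sum(sum(row) for row in conf)

def f1(c):
    tp = conf[c][c]
    fp = sum(conf[r][c] for r in range(n) if r != c)
    fn = sum(conf[c][p] for p in range(n) if p != c)
    if tp == 0:
        return 0.0   # precision and recall undefined; score the class as 0
    prec, sens = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * sens / (prec + sens)

overall_acc = sum(conf[c][c] for c in range(n)) / N   # ~0.99, looks excellent
mean_f1 = sum(f1(c) for c in range(n)) / n            # ~0.34, tells the truth
```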
30. Which metric should I use?

              Balanced                     Unbalanced
Two-class     - Accuracy                   - PPV, Recall for rare class
              - F1-score                   - Mean F1-score
              - Sensitivity, Specificity
Multi-class   - Accuracy                   - Mean F1 vs weighted F1
              - Weighted F1                - Confusion matrices
              - Mean F1
              - Avg. PPV, Sensitivity

No single metric reflects all aspects of performance!
31. Which metric should I use?
• Application requirements
• Different applications have different requirements
• What type of error is more critical?
• False positives can hurt if activities of interest are rare
• Low precision can be a problem in medical applications
• There is always a trade-off between different aspects of
performance
33. Ward et al. (2011)
• Continuous recognition
• Measures like accuracy treat each error the same
• But: The predictions are time-series themselves, and different errors
are possible
• Which type of error is more acceptable in my application?
[Jamie Ward et al., Performance Metrics for Activity Recognition]
34. Ward et al. (2011)
• Continuous recognition
• Similar to bioinformatics:
• Merge, Deletion, Confusion, Fragmentation, Insertion
• Each type of error is scored differently (cost matrix)
• Compound score reflects (adjustable) performance metric
• Systems that show very good performance in regular metrics may
show poor performance in one of these aspects!
[Jamie Ward et al., Performance Metrics for Activity Recognition]
35. Summary
• Be careful when selecting training and test-sets
• Different aspects of performance (user dependent / independent)
• Avoid random cross-validation for segmented time-series!
• Performance metrics
• No single metric reflects all aspects of performance
• Depends on your application!
• More difficult for unbalanced multi-class problems
• Continuous recognition
• Different types of errors have different impact on perceived
performance of a system
• Scoring systems can be tailored for individual application