Evaluation*
in ubiquitous computing
* (quantitative evaluation, qualitative studies are a field of their own!)
Nils Hammerla, Research Associate @ Open Lab
Evaluation? Why?
• Performance evaluation
• How well does it work?
• Model selection
• Which approach works best?
• Parameter selection
• How do I tune my approach for best results?
• Demonstration of feasibility
• Is it worthwhile investing more time/money?
Training and test-sets
Basic elements, segments, images, etc.
Training and test-sets
• Hold-out validation
• Split the data into a training set and a test-set
• By some heuristic (e.g. 20% test), or
• Depending on study design
• Gives a single performance estimate
• Performance may critically depend on the chosen test-set
[Figure: a data sequence divided into training, test and validation portions; the split can be made within the data of each user, across whole users (User 1 … User 6), or across runs (Runs 1-3 for training, Run 4 for testing)]
• User-dependent performance
How would the system perform for more data from a known user?
• User-independent performance
How would the system perform for a new user?
(both kinds of split are sketched below)
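For concreteness, a minimal sketch (not from the slides) of a user-dependent vs. a user-independent hold-out split; the windowed features X, labels y and per-window user ids are hypothetical placeholders:

```python
# Hedged sketch: hold-out splits for user-dependent vs. user-independent
# evaluation. All data below is synthetic placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.RandomState(42)
X = rng.randn(600, 10)                    # 600 windows x 10 features
y = rng.randint(0, 3, size=600)           # 3 activity classes
users = np.repeat(np.arange(6), 100)      # 6 users, 100 windows each

# User-dependent: random 20% hold-out; every user appears in training and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# User-independent: hold out whole users, so the test user is unseen in training
gss = GroupShuffleSplit(n_splits=1, test_size=1 / 6, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=users))
print("held-out user(s):", np.unique(users[test_idx]))
```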
Training and test-sets
• Repeated hold-out
• Split data into k contiguous folds
• A split into k folds gives k performance estimates
• Gives a more reliable performance estimate
• But: this depends on how the folds are constructed!
• Variants (see the sketch below)
• Leave-one-subject-out (LOSO) — User-independent
• Leave-one-run-out — User-dependent
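A minimal sketch of leave-one-subject-out with scikit-learn's LeaveOneGroupOut; data and classifier are placeholders, and replacing the user ids with run ids turns this into leave-one-run-out:

```python
# Hedged sketch: leave-one-subject-out (LOSO) cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(600, 10)                    # placeholder windowed features
y = rng.randint(0, 3, size=600)           # placeholder labels
users = np.repeat(np.arange(6), 100)      # user id per window

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=users, cv=LeaveOneGroupOut())
print("per-user accuracy:", np.round(scores, 2))   # one estimate per held-out user
```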
Training and test-sets
• (Random, stratified) cross-validation
• Split data into k folds, on the lowest level
• Folds constructed to be stratified w.r.t. class distribution
• Popular in general machine learning
• Standard approach in many ML frameworks
• User-dependent performance estimate
• Requires samples to be statistically independent
• This rarely holds in ubicomp! (see the sketch below)
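A minimal sketch of random stratified cross-validation, the default in many ML frameworks; note that it treats every window as statistically independent:

```python
# Hedged sketch: random, stratified k-fold cross-validation on placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(600, 10)
y = rng.randint(0, 3, size=600)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=skf)
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```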
Pitfalls
• User-dependent vs. user-independent performance
• Assumption: user-dependent performance is an upper bound on possible system performance
• Cross-validation is therefore popular in ubicomp to demonstrate feasibility of e.g. a new technical approach
[Figure: "LOSO vs xval performance in related work": accuracy (0-100%) reported in six studies, comparing cross-validation (xval) and leave-one-subject-out (loso)]
Pitfalls
• Cross-validation in segmented time-series
• When testing on a segment i, it is very likely that segments i-1 or i+1 are in the training set
• Neighbouring segments typically overlap
• They are therefore very similar!
• This biases the results and leads to inflated performance figures
Pitfalls
• Cross-validation in segmented time-series
• This is an issue, as cross-validation is widely used
• For user-dependent evaluation
• For parameter estimation (e.g. in neural networks)
• A simple extension of cross-validation alleviates the bias (one common variant is sketched below)
• There will be a talk on this on Friday morning!
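One common way to reduce this leakage, sketched below with placeholder data, is to keep contiguous blocks of overlapping windows inside a single fold (this is not necessarily the exact extension referred to above):

```python
# Hedged sketch: group neighbouring, overlapping windows into contiguous blocks
# and keep each block within one fold, so a test window never sits directly next
# to a training window.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(600, 10)                  # 600 sliding windows (overlapping, assumed)
y = rng.randint(0, 3, size=600)

block_size = 20                         # windows per contiguous block (hypothetical)
blocks = np.arange(len(X)) // block_size

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=blocks, cv=GroupKFold(n_splits=5))
print("block-wise CV accuracy:", np.round(scores, 2))
```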
Pitfalls
• Reporting results
• Always report means and standard deviations of results
• T-tests for differences in mean performance
• Whether t-tests on cross-validated results are meaningful is still debated in statistics
• You have to use the same folds when comparing different approaches! (see the sketch below)
• Better: repeat the whole experiment multiple times (if computationally feasible)
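A minimal sketch of fold-wise reporting: two hypothetical approaches are scored on the same folds, summarised as mean ± standard deviation, and compared with a paired t-test (whose validity on cross-validated scores is, as noted above, debated):

```python
# Hedged sketch: report mean +/- std over shared folds and run a paired t-test.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(600, 10)
y = rng.randint(0, 3, size=600)

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds)
scores_b = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=folds)

print(f"A: {scores_a.mean():.2f} +/- {scores_a.std():.2f}")
print(f"B: {scores_b.mean():.2f} +/- {scores_b.std():.2f}")
t, p = stats.ttest_rel(scores_a, scores_b)     # paired, since the folds are shared
print(f"paired t-test: t={t:.2f}, p={p:.3f}")
```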
Performance metrics
Confusion matrix used in the following (rows: ground truth, columns: prediction):
           A      B      C
    A   1022     13    144
    B    300    542     24
    C     12     55    132
Basic elements, counted separately for each class (A, B, C):
true positives
false positives
false negatives
true negatives
Performance metrics
For each class (or for two-class problems):
Precision / PPV        tp / (tp + fp)
Recall / Sensitivity   tp / (tp + fn)
Specificity            tn / (tn + fp)
Accuracy               (tp + tn) / (tp + fp + fn + tn)
F1-score               2 * prec * sens / (prec + sens)
Performance metrics
For a set of classes C = {A, B, C, …}, with N samples in total and $n_c$ samples of class $c$:
Overall accuracy      $\frac{1}{N}\sum_{c \in C} \mathrm{tp}_c$
Mean accuracy         $\frac{1}{|C|}\sum_{c \in C} \mathrm{acc}_c$
Weighted F1-score     $\frac{2}{N}\sum_{c \in C} n_c \, \frac{\mathrm{prec}_c \times \mathrm{sens}_c}{\mathrm{prec}_c + \mathrm{sens}_c}$
Mean F1-score         $\frac{2}{|C|}\sum_{c \in C} \frac{\mathrm{prec}_c \times \mathrm{sens}_c}{\mathrm{prec}_c + \mathrm{sens}_c}$
(a sketch computing these from a confusion matrix follows below)
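The basic elements and all of the metrics above can be computed directly from a confusion matrix; the helpers below are a minimal sketch (function names are not from the slides):

```python
# Hedged sketch: per-class tp/fp/fn/tn and the metrics defined above, computed
# from a confusion matrix with rows = ground truth and columns = prediction.
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    N = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp            # predicted as c, but actually another class
    fn = cm.sum(axis=1) - tp            # actually c, but predicted as another class
    tn = N - tp - fp - fn
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / N
    f1 = 2 * prec * sens / (prec + sens)
    return prec, sens, spec, acc, f1

def aggregate_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    N, n_c = cm.sum(), cm.sum(axis=1)
    prec, sens, spec, acc, f1 = per_class_metrics(cm)
    overall_acc = np.diag(cm).sum() / N
    mean_acc = acc.mean()
    weighted_f1 = (n_c * f1).sum() / N
    mean_f1 = f1.mean()
    return overall_acc, mean_acc, weighted_f1, mean_f1

cm = np.array([[1022, 13, 144], [300, 542, 24], [12, 55, 132]])
print(np.round(per_class_metrics(cm), 2))   # rows: prec, sens, spec, acc, f1
print(np.round(aggregate_metrics(cm), 2))   # overall acc, mean acc, weighted F1, mean F1
```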
Performance metrics
Normalised confusion matrices
- Divide each row by its sum to get confusion probabilities (sketched below)
[Figure: normalised confusion matrices over four classes (asleep, off, on, dys): a) HOME data-set, annotated by diary; b) LAB data-set, annotated by clinician; prediction on the horizontal axis, colour scale 0 to 1]
[Hammerla, Nils Y., et al. "PD Disease State Assessment in Naturalistic Environments using Deep Learning." AAAI 2015]
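Row-normalisation itself is a one-liner; a minimal sketch using the A/B/C confusion matrix from earlier in the deck:

```python
# Hedged sketch: divide each row by its sum so that each row gives the
# probability of each prediction given the ground-truth class.
import numpy as np

cm = np.array([[1022,  13, 144],
               [ 300, 542,  24],
               [  12,  55, 132]])       # rows = ground truth, columns = prediction

cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_norm, 2))
# [[0.87 0.01 0.12]
#  [0.35 0.63 0.03]
#  [0.06 0.28 0.66]]
```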
Performance metrics
Receiver Operating Characteristic (ROC)
- Illustrates the trade-off between
  True Positive Rate (sensitivity / recall), and
  False Positive Rate (1 - specificity)
- Useful if the approach has a simple parameter, like a threshold (see the sketch below)
[Figure: "ROC curves for different classifiers" (knn, c45, logR, PCA-logR): true positive rate vs. false positive rate]
[Ladha, Cassim, et al. "ClimbAX: skill assessment for climbing enthusiasts." Ubicomp 2013]
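A minimal sketch of an ROC curve in scikit-learn, sweeping the decision threshold of a scoring classifier on synthetic two-class data (all data and names are placeholders):

```python
# Hedged sketch: ROC curve for a classifier with a threshold parameter.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
y = (X[:, 0] + 0.5 * rng.randn(1000) > 0).astype(int)    # synthetic two-class task

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)           # one point per threshold
plt.plot(fpr, tpr, label=f"logR (AUC = {roc_auc_score(y_te, scores):.2f})")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```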
Performance metrics
Probabilistic models:
- Log likelihood
- AIC / BIC
- KL-divergence
Regression:
- Mean Square Error (MSE)
- Root Mean Square Error (RMSE)
- Residuals
- Correlation coefficients
(a sketch of the regression metrics follows below)
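A minimal sketch of the regression metrics on hypothetical model output:

```python
# Hedged sketch: residuals, MSE, RMSE and the correlation coefficient for a
# continuous prediction target (all data is synthetic).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)
y_true = rng.randn(200)
y_pred = y_true + 0.3 * rng.randn(200)     # hypothetical model output

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
r, _ = pearsonr(y_true, y_pred)            # correlation coefficient
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, r = {r:.3f}")
```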
Example
Confusion matrix (rows: ground truth, columns: prediction):
           A      B      C
    A   1022     13    144
    B    300    542     24
    C     12     55    132

                  A      B      C
PPV             0.77   0.89   0.44
Sensitivity     0.87   0.63   0.66
Specificity     0.71   0.95   0.92
Accuracy        0.79   0.83   0.90
F1-score        0.81   0.73   0.53

Overall Accuracy    0.76
Weighted F1-score   0.76
Mean F1-score       0.69
[Figure: the corresponding row-normalised confusion matrix, colour scale 0 to 1]
(a scikit-learn check of these figures is sketched below)
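As a check, the table above can be reproduced with scikit-learn by expanding the confusion matrix back into label arrays (a sketch, not the original analysis; specificity and per-class accuracy are not built in and would come from the tp/fp/fn/tn helper sketched earlier):

```python
# Hedged sketch: reproduce the example's figures with scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

cm = np.array([[1022, 13, 144],
               [300, 542, 24],
               [12, 55, 132]])           # rows = ground truth, columns = prediction
y_true = np.repeat(np.arange(3), cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(np.arange(3), row) for row in cm])

prec, sens, f1, _ = precision_recall_fscore_support(y_true, y_pred)
print("PPV:        ", np.round(prec, 2))    # [0.77 0.89 0.44]
print("Sensitivity:", np.round(sens, 2))    # [0.87 0.63 0.66]
print("F1-score:   ", np.round(f1, 2))      # [0.81 0.73 0.53]
print("Overall accuracy:", round(accuracy_score(y_true, y_pred), 2))                 # 0.76
print("Weighted F1:", round(f1_score(y_true, y_pred, average='weighted'), 2))        # 0.76
print("Mean F1:    ", round(f1_score(y_true, y_pred, average='macro'), 2))           # 0.69
```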
Which metric should I not use?
• Specificity tn / (tn + fp), or per-class accuracy (tp + tn) / (tp + fp + fn + tn)
• Less meaningful as the number of classes increases
• #TN cells grow in O(n^2),
• #FN, #FP cells grow in O(n)!
• Leads to e.g. incredible specificity figures (> 0.99)
• Better: PPV tp / (tp + fp) or Recall tp / (tp + fn)
(a small demonstration follows below)
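A small demonstration (with synthetic labels) of why these figures inflate: even random guessing over ten balanced classes yields a respectable-looking specificity and per-class accuracy:

```python
# Hedged sketch: with many classes, true negatives dominate, so specificity and
# per-class accuracy stay high even for a classifier that guesses at random.
import numpy as np

rng = np.random.RandomState(0)
n_classes, n = 10, 10_000
y_true = rng.randint(0, n_classes, n)
y_pred = rng.randint(0, n_classes, n)            # pure guessing

c = 0                                            # inspect one class
tp = np.sum((y_true == c) & (y_pred == c))
fp = np.sum((y_true != c) & (y_pred == c))
fn = np.sum((y_true == c) & (y_pred != c))
tn = n - tp - fp - fn
print("specificity:       ", round(tn / (tn + fp), 2))   # ~0.90, approaches 1 with more classes
print("per-class accuracy:", round((tp + tn) / n, 2))    # ~0.82 despite random guessing
```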
Which metric should I not use?
• Overall accuracy   $\frac{1}{N}\sum_{c \in C} \mathrm{tp}_c$
• In unbalanced data-sets, overall accuracy can be misleading (see the sketch below)

   10,000     10      2            9,950     40     22
      100      1      0               40     45     15
       20      1      0               10      1     10

Overall accuracy  = 0.99    ~=    Overall accuracy  = 0.99
Weighted F1-score = 0.98     <    Weighted F1-score = 0.99
Mean F1-score     = 0.34    <<    Mean F1-score     = 0.59
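A minimal sketch reproducing the left-hand example with scikit-learn: overall accuracy looks excellent while the mean (macro) F1-score exposes that the two rare classes are barely recognised:

```python
# Hedged sketch: overall accuracy vs. weighted and mean F1 on an unbalanced matrix.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

cm = np.array([[10000, 10, 2],
               [  100,  1, 0],
               [   20,  1, 0]])          # rows = ground truth, columns = prediction
y_true = np.repeat(np.arange(3), cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(np.arange(3), row) for row in cm])

print("overall accuracy:", round(accuracy_score(y_true, y_pred), 2))                # 0.99
print("weighted F1:     ", round(f1_score(y_true, y_pred, average='weighted'), 2))  # 0.98
print("mean F1:         ", round(f1_score(y_true, y_pred, average='macro'), 2))     # 0.34
```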
Which metric should I use?
• Two-class, balanced: Accuracy; F1-score; Sensitivity, Specificity
• Two-class, unbalanced: PPV, Recall for the rare class; Mean F1-score
• Multi-class, balanced: Accuracy; Weighted F1; Mean F1; Avg. PPV, Sensitivity
• Multi-class, unbalanced: Mean F1 vs. weighted F1; Confusion matrices
No single metric reflects all aspects of performance!
Which metric should I use?
• Application requirements
• Different applications have different requirements
• What type of error is more critical?
• False positives can hurt if activities of interest are rare
• Low precision can be a problem in medical applications
• There is always a trade-off between different aspects of performance
Ward et al. (2011)
• Continuous recognition
• Measures like accuracy treat each error the same
• But: the predictions are time-series themselves, and different errors are possible
• Which type of error is more acceptable in my application?
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Ward et al. (2011)
• Continuous recognition
• Similar to bioinformatics:
• Merge, Deletion, Confusion, Fragmentation, Insertion
• Each type of error is scored differently (cost matrix)
• Compound score reflects (adjustable) performance metric
• Systems that show very good performance in regular metrics may show poor performance in one of these aspects!
[Jamie Ward et al., Performance Metrics for Activity Recognition]
Summary
• Be careful when selecting training and test-sets
• Different aspects of performance (user-dependent / independent)
• Avoid random cross-validation for segmented time-series!
• Performance metrics
• No single metric reflects all aspects of performance
• Depends on your application!
• More difficult for unbalanced multi-class problems
• Continuous recognition
• Different types of errors have different impact on the perceived performance of a system
• Scoring systems can be tailored to the individual application