Data Mining
Model Overfitting
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
Classification Errors
• Training errors: errors committed on the training set
• Test errors: errors committed on the test set
• Generalization errors: the expected error of a model over a random selection of records from the same distribution
[Figure: the classification framework — a learning algorithm performs induction on the Training Set to learn a Model, which is then applied (deduction) to the Test Set.]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
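To make the three error types concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (the slides themselves are library-agnostic): training error is measured on the records used for induction, test error on the held-out records.

```python
# Minimal sketch: training vs. test error of a decision tree
# (make_classification is an illustrative stand-in for the table above).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training error:", 1 - tree.score(X_train, y_train))
print("test error:    ", 1 - tree.score(X_test, y_test))
```

The test error serves as an estimate of the generalization error, since the test records were not used during induction.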
Example Data Set
Two-class problem:
• + : 5400 instances
  – 5000 instances generated from a Gaussian centered at (10,10)
  – 400 noisy instances added
• o : 5400 instances
  – Generated from a uniform distribution

10% of the data is used for training and 90% for testing.
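A sketch that reproduces this setup. The Gaussian's standard deviation and the range of the uniform distribution are assumptions, since the slide does not state them.

```python
# Sketch of the example data set; scale and uniform range are assumed.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
plus_core  = rng.normal(loc=10.0, scale=1.0, size=(5000, 2))  # Gaussian at (10,10)
plus_noise = rng.uniform(0.0, 20.0, size=(400, 2))            # 400 noisy '+' points
circles    = rng.uniform(0.0, 20.0, size=(5400, 2))           # 'o' class, uniform

X = np.vstack([plus_core, plus_noise, circles])
y = np.array([1] * 5400 + [0] * 5400)                         # 1 = '+', 0 = 'o'

# 10% of the data for training, 90% for testing, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, random_state=0, stratify=y)
```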
Increasing number of nodes in Decision Trees
[Figure: decision tree with 4 nodes, and its decision boundaries on the training data.]
[Figure: decision tree with 50 nodes, and its decision boundaries on the training data.]
Which tree is better?
[Figure: the 4-node tree and the 50-node tree side by side on the same training data.]
Model Underfitting and Overfitting
• Underfitting: when the model is too simple, both training and test errors are large
• Overfitting: when the model is too complex, the training error is small but the test error is large
• As the model becomes more and more complex, the test error can start increasing even though the training error may still be decreasing
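The following sketch makes this visible by sweeping tree size. It reuses X_train/X_test/y_train/y_test from the synthetic two-class data sketch above; the grid of leaf counts is illustrative.

```python
# Sweep model complexity: training error keeps falling, while test error
# eventually rises (overfitting). Reuses the split defined above.
from sklearn.tree import DecisionTreeClassifier

for k in (2, 4, 8, 16, 32, 64, 128, 256):
    t = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0)
    t.fit(X_train, y_train)
    print(f"{k:4d} leaves:  train err = {1 - t.score(X_train, y_train):.3f}"
          f"  test err = {1 - t.score(X_test, y_test):.3f}")
```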
Model Overfitting – Impact of Training Data Size
Using twice the number of data instances:
• Increasing the size of the training data reduces the difference between training and testing errors at a given model size
Reasons for Model Overfitting
• Not enough training data
• High model complexity
  – Multiple comparison procedure
Effect of Multiple Comparison Procedure
• Consider the task of predicting whether the stock market will rise or fall on each of the next 10 trading days
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
Day 1 Up
Day 2 Down
Day 3 Down
Day 4 Up
Day 5 Down
Day 6 Down
Day 7 Up
Day 8 Up
Day 9 Up
Day 10 Down
P(# correct ≥ 8) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = 56/1024 ≈ 0.0547
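A two-line check of this binomial tail probability:

```python
# P(at least 8 of 10 fair-coin guesses correct) = 56 / 1024
from math import comb

p = sum(comb(10, k) for k in (8, 9, 10)) / 2**10
print(p)  # 0.0546875, i.e. ~0.0547
```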
Effect of Multiple Comparison Procedure
• Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst that makes the largest number of correct predictions
• Probability that at least one analyst makes at least 8 correct predictions:
P(at least one analyst with # correct ≥ 8) = 1 − (1 − 0.0547)^50 ≈ 0.9399
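And the same check for the best-of-50 version, which shows how multiple comparisons inflate an individually rare event into a near-certainty:

```python
# Probability that at least one of 50 independent guessers gets >= 8 right
p_single = 56 / 2**10                 # 0.0547 from the previous slide
p_best_of_50 = 1 - (1 - p_single)**50
print(round(p_best_of_50, 4))         # ~0.9399
```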
Effect of Multiple Comparison Procedure
• Many algorithms employ the following greedy strategy:
  – Initial model: M
  – Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  – Keep M′ if the improvement Δ(M, M′) > α
• Oftentimes, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Effect of Multiple Comparison - Example
Compare using only X and Y as attributes against using X, Y, and 100 additional noisy variables generated from a uniform distribution. In both cases, 30% of the data is used for training and 70% for testing.
Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Training error does not provide a good estimate of how well the tree will perform on previously unseen records
• We need ways to estimate generalization error
Model Selection
• Performed during model building
• Its purpose is to ensure that the model is not overly complex (to avoid overfitting)
• Needs an estimate of the generalization error:
  – Using a validation set
  – Incorporating model complexity
Model Selection: Using a Validation Set
• Divide the training data into two parts:
  – Training set: used for model building
  – Validation set: used for estimating generalization error
• Note: the validation set is not the same as the test set
• Drawback: less data available for training
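A sketch of this procedure, reusing X_train/y_train from the data sketch earlier; the candidate leaf counts are illustrative.

```python
# Model selection with a validation set: split the training data further,
# fit candidates on the sub-training set, pick by validation error.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

best_k, best_err = None, float("inf")
for k in (2, 4, 8, 16, 32, 64):
    t = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0).fit(X_tr, y_tr)
    err = 1 - t.score(X_val, y_val)    # estimated generalization error
    if err < best_err:
        best_k, best_err = k, err
print("selected max_leaf_nodes:", best_k)
```

The test set stays untouched throughout the search; it is used only once, for the final evaluation.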
Model Selection: Incorporating Model Complexity
• Rationale: Occam's Razor
  – Given two models with similar generalization errors, one should prefer the simpler model over the more complex one
  – A complex model has a greater chance of being fitted accidentally
  – Therefore, one should include model complexity when evaluating a model

Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees
• Pessimistic error estimate of a decision tree T with k leaf nodes:

  e_gen(T) = err(T) + Ω × k / N_train

  – err(T): error rate on all training records
  – Ω: trade-off hyper-parameter (similar to α), the relative cost of adding a leaf node
  – k: number of leaf nodes
  – N_train: total number of training records
Estimating the Complexity of Decision Trees: Example
[Figure: two decision trees built from the same 24 training records, TL with 7 leaf nodes and TR with 4 leaf nodes, with the (+, −) class counts shown at each leaf.]

e(TL) = 4/24
e(TR) = 6/24

With Ω = 1:
e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
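A small check of these numbers:

```python
# Pessimistic error estimate: e_gen(T) = err(T) + Omega * k / N_train
def e_gen(n_errors, n_leaves, n_train, omega=1.0):
    return (n_errors + omega * n_leaves) / n_train

print(e_gen(4, 7, 24))  # TL: 11/24 ~= 0.458
print(e_gen(6, 4, 24))  # TR: 10/24 ~= 0.417 -> TR is preferred
```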
Estimating the Complexity of Decision Trees
• Resubstitution estimate:
  – Uses the training error as an optimistic estimate of the generalization error
  – Referred to as the optimistic error estimate

[Figure: the same two trees TL and TR as on the previous slide.]

e(TL) = 4/24
e(TR) = 6/24
Minimum Description Length (MDL)
• Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
  – Cost is the number of bits needed for encoding.
  – Search for the least costly model.
• Cost(Data|Model) encodes the misclassification errors.
• Cost(Model) uses node encoding (number of children) plus splitting-condition encoding.

[Figure: person A, who knows the class labels y for records X1…Xn, encodes a decision tree (internal nodes A?, B?, C? with 0/1 leaves) and transmits it to person B, who uses it to label the same, unlabeled records.]
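A hedged sketch of one possible accounting. The slide defines only the two cost terms; the concrete encoding below (log2 of the record count per error, log2 of the attribute count per internal node, log2 of the class count per leaf) is an assumption for illustration.

```python
# One illustrative MDL accounting; the encoding scheme is an assumption.
from math import ceil, log2

def mdl_cost(n_errors, n_records, n_internal, n_leaves,
             n_attributes, n_classes):
    # Cost(Data|Model): point out each misclassified record to the receiver
    cost_data = n_errors * ceil(log2(n_records))
    # Cost(Model): encode each split's attribute and each leaf's label
    cost_model = (n_internal * ceil(log2(n_attributes))
                  + n_leaves * ceil(log2(n_classes)))
    return cost_data + cost_model

# e.g. a 4-leaf tree with 6 errors on 24 records, 4 attributes, 2 classes
print(mdl_cost(6, 24, 3, 4, 4, 2), "bits")
```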
Model Selection for Decision Trees
• Pre-pruning (early stopping rule)
  – Stop the algorithm before it grows a fully-grown tree
  – Typical stopping conditions for a node:
    ◦ Stop if all instances belong to the same class
    ◦ Stop if all the attribute values are the same
  – More restrictive conditions (see the sketch below for how these map onto library hyper-parameters):
    ◦ Stop if the number of instances is less than some user-specified threshold
    ◦ Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test)
    ◦ Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
    ◦ Stop if the estimated generalization error falls below a certain threshold
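As a sketch of how these conditions map onto library hyper-parameters (the thresholds are illustrative, and scikit-learn is an assumed choice):

```python
# Pre-pruning expressed as growth-stopping hyper-parameters.
from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(
    max_depth=8,                 # hard cap on tree depth
    min_samples_split=20,        # too few instances at a node -> stop
    min_impurity_decrease=1e-3,  # split barely improves impurity -> stop
    random_state=0,
)
```

The first two "typical" conditions (pure node, identical attribute values) are applied automatically by most implementations.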
Model Selection for Decision Trees
• Post-pruning
  – Grow the decision tree to its entirety
  – Subtree replacement:
    ◦ Trim the nodes of the decision tree in a bottom-up fashion
    ◦ If the generalization error improves after trimming, replace the subtree with a leaf node
    ◦ The class label of the leaf node is determined from the majority class of instances in the subtree
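scikit-learn does not implement the pessimistic-error subtree replacement described here; its cost-complexity pruning, sketched below as an alternative post-pruning scheme, serves the same purpose. This reuses the X_tr/X_val split from the validation-set sketch above.

```python
# Post-pruning via cost-complexity pruning (a different scheme from
# pessimistic-error subtree replacement). Reuses X_tr/X_val/y_tr/y_val.
from sklearn.tree import DecisionTreeClassifier

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

pruned = min(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
     for a in alphas),
    key=lambda t: 1 - t.score(X_val, y_val),  # pick by validation error
)
print("leaves after pruning:", pruned.get_n_leaves())
```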
Example of Post-Pruning
[Figure: a node with 20 instances of Class = Yes and 10 of Class = No (error = 10/30), split on attribute A into four children A1–A4 with (Yes, No) counts (8, 4), (3, 4), (4, 1), (5, 1).]

Training error (before splitting) = 10/30
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
The pessimistic error increases after splitting, so: PRUNE!
Examples of Post-pruning
Simplified Decision Tree:
depth = 1 :
| ImagePages <= 0.1333 : class 1
| ImagePages > 0.1333 :
| | breadth <= 6 : class 0
| | breadth > 6 : class 1
depth > 1 :
| MultiAgent = 0: class 0
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1
Decision Tree:
depth = 1 :
| breadth > 7 : class 1
| breadth <= 7 :
| | breadth <= 3 :
| | | ImagePages > 0.375 : class 0
| | | ImagePages <= 0.375 :
| | | | totalPages <= 6 : class 1
| | | | totalPages > 6 :
| | | | | breadth <= 1 : class 1
| | | | | breadth > 1 : class 0
| | breadth > 3 :
| | | MultiIP = 0:
| | | | ImagePages <= 0.1333 : class 1
| | | | ImagePages > 0.1333 :
| | | | | breadth <= 6 : class 0
| | | | | breadth > 6 : class 1
| | | MultiIP = 1:
| | | | TotalTime <= 361 : class 0
| | | | TotalTime > 361 : class 1
depth > 1 :
| MultiAgent = 0:
| | depth > 2 : class 0
| | depth <= 2 :
| | | MultiIP = 1: class 0
| | | MultiIP = 0:
| | | | breadth <= 6 : class 0
| | | | breadth > 6 :
| | | | | RepeatedAccess <= 0.0322 : class 0
| | | | | RepeatedAccess > 0.0322 : class 1
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1
[Annotations in the original figure: "Subtree Replacement" and "Subtree Raising" mark where each pruning operation was applied.]
Model Evaluation
• Purpose: to estimate the performance of a classifier on previously unseen data (the test set)
• Holdout
  – Reserve k% for training and (100-k)% for testing
  – Random subsampling: repeated holdout
• Cross-validation
  – Partition the data into k disjoint subsets
  – k-fold: train on k-1 partitions, test on the remaining one
  – Leave-one-out: k = n
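A sketch of the cross-validation estimate, reusing the X, y arrays from the data sketch earlier:

```python
# Holdout is a single split; k-fold averages over k train/test rotations.
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)

scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True,
                                             random_state=0))
print("10-fold accuracy:", scores.mean())

# Leave-one-out is the k = n special case (expensive for large n):
# cross_val_score(clf, X, y, cv=LeaveOneOut())
```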
Cross-validation Example
• 3-fold cross-validation: the data is partitioned into three folds, and each fold in turn serves as the test set while the other two are used for training
Variations on Cross-validation
• Repeated cross-validation
  – Perform cross-validation a number of times
  – Gives an estimate of the variance of the generalization error
• Stratified cross-validation
  – Guarantees the same percentage of class labels in the training and test sets
  – Important when classes are imbalanced and the sample is small
• Use a nested cross-validation approach for model selection and evaluation (a combined sketch follows below)
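A sketch combining the three variations (again reusing X, y; the parameter grid is illustrative): stratified folds preserve class proportions, repetition exposes the variance of the estimate, and nesting keeps model selection from biasing the evaluation.

```python
# Repeated stratified CV outside, grid-search model selection inside.
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

inner = GridSearchCV(                      # model selection (inner loop)
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [4, 8, 16, 32]},
    cv=5,
)
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

scores = cross_val_score(inner, X, y, cv=outer)   # model evaluation (outer)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```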