SlideShare a Scribd company logo
1 of 39
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model  for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible.
Examples of Classification Task Predicting tumor cells as benign or malignant Classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc
Classification Techniques Decision Tree based Methods Rule-based Methods Memory based reasoning Neural Networks Naïve Bayes and Bayesian Belief Networks Support Vector Machines
Decision Tree Induction Many Algorithms: Hunt’s Algorithm (one of the earliest) CART ID3, C4.5 SLIQ,SPRINT
Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting
How to Specify Test Condition? Depends on attribute types Nominal Ordinal Continuous Depends on number of ways to split 2-way split Multi-way split
Splitting Based on Nominal Attributes CarType Family Luxury Sports Multi-way split: Use as many partitions as distinct values.
Contd….. Binary split:  Divides values into two subsets. Need to find optimal partitioning CarType {Sports, Luxury} {Family}
Splitting Based on Continuous Attributes Different ways of handling Discretization to form an ordinal categorical attribute  Static – discretize once at the beginning  Dynamic – ranges can be found by equal interval 		bucketing, equal frequency bucketing		(percentiles), or clustering.
Contd…. Binary Decision: (A < v) or (A  v)  consider all possible splits and finds the best cut  can be more compute intensive
How to determine the Best Split Greedy approach:  Nodes with homogeneous class distribution are preferred Need a measure of node impurity:
Measures of Node Impurity Gini Index Entropy Misclassification error
Measure of Impurity: GINI Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information Minimum (0.0) when all records belong to one class, implying most interesting information
Splitting Based on GINI Used in CART, SLIQ, SPRINT. When a node p is split into k partitions (children), the quality of split is computed as, where,	ni = number of records at child i,     	          n = number of records at node p.
Binary Attributes: Computing GINI Index ,[object Object]
Effect of Weighing partitions:
Larger and Purer Partitions are sought for.,[object Object]
Two-way split (find best partition of values) Use Binary Decisions based on one value Several Choices for the splitting value Number of possible splitting values = Number of distinct values Each splitting value has a count matrix associated with it Class counts in each of the partitions, A < v and A  v Simple method to choose best v For each v, scan the database to gather count matrix and compute its Gini index Computationally Inefficient! Repetition of work.
Measure of Impurity: Entropy Entropy at a given node t: Measures homogeneity of a node.  Maximum (log nc) when records are equally distributed among all classes implying least Information Minimum (0.0) when all records belong to one class, implying most information
Splitting based on Entropy Parent Node, p is split into k partitions ni is the number of records in partition i Classification error at a node t :
Stopping Criteria for Tree Induction Stop expanding a node when all the records belong to the same class Stop expanding a node when all the records have similar attribute values Early termination (to be discussed later)
Decision Tree Based Classification Advantages: Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets
Practical Issues of Classification Underfitting and Overfitting Missing Values Costs of Classification
Notes on Overfitting Overfitting results in decision trees that are more complex than necessary Training error no longer provides a good estimate of how well the tree will perform on previously unseen records Need new ways for estimating errors
How to Address Overfitting Stop the algorithm before it becomes a fully-grown tree Typical stopping conditions for a node:  Stop if all instances belong to the same class  Stop if all the attribute values are the same More restrictive conditions:  Stop if number of instances is less than some user-specified threshold  Stop if class distribution of instances are independent of the available features (e.g., using  2 test)
How to Address Overfitting… Post-pruning Grow decision tree to its entirety Trim the nodes of the decision tree in a bottom-up fashion If generalization error improves after trimming, replace sub-tree by a leaf node. Class label of leaf node is determined from majority class of instances in the sub-tree Can use MDL for post-pruning
Other Issues Data Fragmentation Search Strategy Expressiveness Tree Replication
Data Fragmentation Number of instances gets smaller as you traverse down the tree Number of instances at the leaf nodes could be too small to make any statistically significant decision
Search Strategy Finding an optimal decision tree is NP-hard The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution Other strategies? Bottom-up Bi-directional
Expressiveness Decision tree provides expressive representation for learning discrete-valued function But they do not generalize well to certain types of Boolean functions Not expressive enough for modeling continuous variables Particularly when test condition involves only a single attribute at-a-time
Tree Replication Same subtree appears in multiple branches
Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?
Metrics for Performance Evaluation Focus on the predictive capability of a model Rather than how fast it takes to classify or build models, scalability, etc. It is determined using: Confusion matrix Cost matrix
Methods for Performance Evaluation How to obtain a reliable estimate of performance? Performance of a model may depend on other factors besides the learning algorithm: Class distribution Cost of misclassification Size of training and test sets
Methods of Estimation Holdout Reserve 2/3 for training and 1/3 for testing  Random subsampling Repeated holdout Cross validation Partition data into k disjoint subsets k-fold: train on k-1 partitions, test on the remaining one Leave-one-out:   k=n Stratified sampling  oversampling vsundersampling Bootstrap Sampling with replacement
Methods for Model Comparison -ROC Developed in 1950s for signal detection theory to analyze noisy signals  Characterize the trade-off between positive hits and false alarms ROC curve plots TP (on the y-axis) against FP (on the x-axis) Performance of each classifier represented as a point on the ROC curve changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point
Test of Significance Given two models: Model M1: accuracy = 85%, tested on 30 instances Model M2: accuracy = 75%, tested on 5000 instances Can we say M1 is better than M2? How much confidence can we place on accuracy of M1 and M2? Can the difference in performance measure be explained as a result of random fluctuations in the test set?
Conclusion Decision tree induction Algorithm for decision tee induction Model Overfitting Evaluating the performance of a classifier  are studied  in detail

More Related Content

What's hot

What's hot (20)

Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Lecture13 - Association Rules
Lecture13 - Association RulesLecture13 - Association Rules
Lecture13 - Association Rules
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Data clustering
Data clustering Data clustering
Data clustering
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Bagging.pptx
Bagging.pptxBagging.pptx
Bagging.pptx
 
Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Decision tree induction
Decision tree inductionDecision tree induction
Decision tree induction
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Random Forest Classifier in Machine Learning | Palin Analytics
Random Forest Classifier in Machine Learning | Palin AnalyticsRandom Forest Classifier in Machine Learning | Palin Analytics
Random Forest Classifier in Machine Learning | Palin Analytics
 
Neural Networks in Data Mining - “An Overview”
Neural Networks  in Data Mining -   “An Overview”Neural Networks  in Data Mining -   “An Overview”
Neural Networks in Data Mining - “An Overview”
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Naïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptxNaïve Bayes Classifier Algorithm.pptx
Naïve Bayes Classifier Algorithm.pptx
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 

Viewers also liked

Presentazione oroblu
Presentazione orobluPresentazione oroblu
Presentazione oroblu
robyroby65
 
Direct-services portfolio
Direct-services portfolioDirect-services portfolio
Direct-services portfolio
vlastakolaja
 

Viewers also liked (20)

Huidige status van de testtaal TTCN-3
Huidige status van de testtaal TTCN-3Huidige status van de testtaal TTCN-3
Huidige status van de testtaal TTCN-3
 
Data Applied: Developer Quicklook
Data Applied: Developer QuicklookData Applied: Developer Quicklook
Data Applied: Developer Quicklook
 
Bernoullis Random Variables And Binomial Distribution
Bernoullis Random Variables And Binomial DistributionBernoullis Random Variables And Binomial Distribution
Bernoullis Random Variables And Binomial Distribution
 
Info Chimps: What Makes Infochimps.org Unique
Info Chimps: What Makes Infochimps.org UniqueInfo Chimps: What Makes Infochimps.org Unique
Info Chimps: What Makes Infochimps.org Unique
 
Data Applied: Similarity
Data Applied: SimilarityData Applied: Similarity
Data Applied: Similarity
 
Quantica Construction Search
Quantica Construction SearchQuantica Construction Search
Quantica Construction Search
 
Ccc
CccCcc
Ccc
 
MS Sql Server: Doing Calculations With Functions
MS Sql Server: Doing Calculations With FunctionsMS Sql Server: Doing Calculations With Functions
MS Sql Server: Doing Calculations With Functions
 
Txomin Hartz Txikia
Txomin Hartz TxikiaTxomin Hartz Txikia
Txomin Hartz Txikia
 
Quick Look At Classification
Quick Look At ClassificationQuick Look At Classification
Quick Look At Classification
 
SPSS: Data Editor
SPSS: Data EditorSPSS: Data Editor
SPSS: Data Editor
 
MS Sql Server: Deleting A Database
MS Sql Server: Deleting A DatabaseMS Sql Server: Deleting A Database
MS Sql Server: Deleting A Database
 
Mysql:Operators
Mysql:OperatorsMysql:Operators
Mysql:Operators
 
Oracle: Joins
Oracle: JoinsOracle: Joins
Oracle: Joins
 
Presentazione oroblu
Presentazione orobluPresentazione oroblu
Presentazione oroblu
 
Direct-services portfolio
Direct-services portfolioDirect-services portfolio
Direct-services portfolio
 
PresentacióN De Quimica
PresentacióN De QuimicaPresentacióN De Quimica
PresentacióN De Quimica
 
Test
TestTest
Test
 
RapidMiner: Nested Subprocesses
RapidMiner:   Nested SubprocessesRapidMiner:   Nested Subprocesses
RapidMiner: Nested Subprocesses
 
Facebook: An Innovative Influenza Pandemic Early Warning System
Facebook: An Innovative Influenza Pandemic Early Warning SystemFacebook: An Innovative Influenza Pandemic Early Warning System
Facebook: An Innovative Influenza Pandemic Early Warning System
 

Similar to Classification

Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
On cascading small decision trees
On cascading small decision treesOn cascading small decision trees
On cascading small decision trees
Julià Minguillón
 
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxDr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
madlynplamondon
 

Similar to Classification (20)

IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESIMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
decisiontrees (3).ppt
decisiontrees (3).pptdecisiontrees (3).ppt
decisiontrees (3).ppt
 
decisiontrees.ppt
decisiontrees.pptdecisiontrees.ppt
decisiontrees.ppt
 
decisiontrees.ppt
decisiontrees.pptdecisiontrees.ppt
decisiontrees.ppt
 
data mining.pptx
data mining.pptxdata mining.pptx
data mining.pptx
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees
 
My8clst
My8clstMy8clst
My8clst
 
Decision trees
Decision treesDecision trees
Decision trees
 
On cascading small decision trees
On cascading small decision treesOn cascading small decision trees
On cascading small decision trees
 
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docxDr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
Dr. Oner CelepcikayCS 4319CS 4319Machine LearningW.docx
 
Decision tree
Decision treeDecision tree
Decision tree
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Models
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Models
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Models
 
ClustIII.ppt
ClustIII.pptClustIII.ppt
ClustIII.ppt
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 

More from DataminingTools Inc

More from DataminingTools Inc (20)

Techniques Machine Learning
Techniques Machine LearningTechniques Machine Learning
Techniques Machine Learning
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 
Areas of machine leanring
Areas of machine leanringAreas of machine leanring
Areas of machine leanring
 
AI: Planning and AI
AI: Planning and AIAI: Planning and AI
AI: Planning and AI
 
AI: Logic in AI 2
AI: Logic in AI 2AI: Logic in AI 2
AI: Logic in AI 2
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
AI: Learning in AI 2
AI: Learning in AI 2AI: Learning in AI 2
AI: Learning in AI 2
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
 
AI: Introduction to artificial intelligence
AI: Introduction to artificial intelligenceAI: Introduction to artificial intelligence
AI: Introduction to artificial intelligence
 
AI: Belief Networks
AI: Belief NetworksAI: Belief Networks
AI: Belief Networks
 
AI: AI & Searching
AI: AI & SearchingAI: AI & Searching
AI: AI & Searching
 
AI: AI & Problem Solving
AI: AI & Problem SolvingAI: AI & Problem Solving
AI: AI & Problem Solving
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Classification

  • 1. Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
  • 2. Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible.
  • 3. Examples of Classification Task Predicting tumor cells as benign or malignant Classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc
  • 4. Classification Techniques Decision Tree based Methods Rule-based Methods Memory based reasoning Neural Networks Naïve Bayes and Bayesian Belief Networks Support Vector Machines
  • 5. Decision Tree Induction Many Algorithms: Hunt’s Algorithm (one of the earliest) CART ID3, C4.5 SLIQ,SPRINT
  • 6. Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting
  • 7. How to Specify Test Condition? Depends on attribute types Nominal Ordinal Continuous Depends on number of ways to split 2-way split Multi-way split
  • 8. Splitting Based on Nominal Attributes CarType Family Luxury Sports Multi-way split: Use as many partitions as distinct values.
  • 9. Contd….. Binary split: Divides values into two subsets. Need to find optimal partitioning CarType {Sports, Luxury} {Family}
  • 10. Splitting Based on Continuous Attributes Different ways of handling Discretization to form an ordinal categorical attribute Static – discretize once at the beginning Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  • 11. Contd…. Binary Decision: (A < v) or (A  v) consider all possible splits and finds the best cut can be more compute intensive
  • 12. How to determine the Best Split Greedy approach: Nodes with homogeneous class distribution are preferred Need a measure of node impurity:
  • 13. Measures of Node Impurity Gini Index Entropy Misclassification error
  • 14. Measure of Impurity: GINI Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information Minimum (0.0) when all records belong to one class, implying most interesting information
  • 15. Splitting Based on GINI Used in CART, SLIQ, SPRINT. When a node p is split into k partitions (children), the quality of split is computed as, where, ni = number of records at child i, n = number of records at node p.
  • 16.
  • 17. Effect of Weighing partitions:
  • 18.
  • 19. Two-way split (find best partition of values) Use Binary Decisions based on one value Several Choices for the splitting value Number of possible splitting values = Number of distinct values Each splitting value has a count matrix associated with it Class counts in each of the partitions, A < v and A  v Simple method to choose best v For each v, scan the database to gather count matrix and compute its Gini index Computationally Inefficient! Repetition of work.
  • 20. Measure of Impurity: Entropy Entropy at a given node t: Measures homogeneity of a node. Maximum (log nc) when records are equally distributed among all classes implying least Information Minimum (0.0) when all records belong to one class, implying most information
  • 21. Splitting based on Entropy Parent Node, p is split into k partitions ni is the number of records in partition i Classification error at a node t :
  • 22. Stopping Criteria for Tree Induction Stop expanding a node when all the records belong to the same class Stop expanding a node when all the records have similar attribute values Early termination (to be discussed later)
  • 23. Decision Tree Based Classification Advantages: Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets
  • 24. Practical Issues of Classification Underfitting and Overfitting Missing Values Costs of Classification
  • 25. Notes on Overfitting Overfitting results in decision trees that are more complex than necessary Training error no longer provides a good estimate of how well the tree will perform on previously unseen records Need new ways for estimating errors
  • 26. How to Address Overfitting Stop the algorithm before it becomes a fully-grown tree Typical stopping conditions for a node: Stop if all instances belong to the same class Stop if all the attribute values are the same More restrictive conditions: Stop if number of instances is less than some user-specified threshold Stop if class distribution of instances are independent of the available features (e.g., using  2 test)
  • 27. How to Address Overfitting… Post-pruning Grow decision tree to its entirety Trim the nodes of the decision tree in a bottom-up fashion If generalization error improves after trimming, replace sub-tree by a leaf node. Class label of leaf node is determined from majority class of instances in the sub-tree Can use MDL for post-pruning
  • 28. Other Issues Data Fragmentation Search Strategy Expressiveness Tree Replication
  • 29. Data Fragmentation Number of instances gets smaller as you traverse down the tree Number of instances at the leaf nodes could be too small to make any statistically significant decision
  • 30. Search Strategy Finding an optimal decision tree is NP-hard The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution Other strategies? Bottom-up Bi-directional
  • 31. Expressiveness Decision tree provides expressive representation for learning discrete-valued function But they do not generalize well to certain types of Boolean functions Not expressive enough for modeling continuous variables Particularly when test condition involves only a single attribute at-a-time
  • 32. Tree Replication Same subtree appears in multiple branches
  • 33. Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models?
  • 34. Metrics for Performance Evaluation Focus on the predictive capability of a model Rather than how fast it takes to classify or build models, scalability, etc. It is determined using: Confusion matrix Cost matrix
  • 35. Methods for Performance Evaluation How to obtain a reliable estimate of performance? Performance of a model may depend on other factors besides the learning algorithm: Class distribution Cost of misclassification Size of training and test sets
  • 36. Methods of Estimation Holdout Reserve 2/3 for training and 1/3 for testing Random subsampling Repeated holdout Cross validation Partition data into k disjoint subsets k-fold: train on k-1 partitions, test on the remaining one Leave-one-out: k=n Stratified sampling oversampling vsundersampling Bootstrap Sampling with replacement
  • 37. Methods for Model Comparison -ROC Developed in 1950s for signal detection theory to analyze noisy signals Characterize the trade-off between positive hits and false alarms ROC curve plots TP (on the y-axis) against FP (on the x-axis) Performance of each classifier represented as a point on the ROC curve changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point
  • 38. Test of Significance Given two models: Model M1: accuracy = 85%, tested on 30 instances Model M2: accuracy = 75%, tested on 5000 instances Can we say M1 is better than M2? How much confidence can we place on accuracy of M1 and M2? Can the difference in performance measure be explained as a result of random fluctuations in the test set?
  • 39. Conclusion Decision tree induction Algorithm for decision tee induction Model Overfitting Evaluating the performance of a classifier are studied in detail
  • 40. Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net