Classification: Decision Tree
Data Mining: Dr. Iram Naim, Dept. of CSIT, MJPRU
Introduction
 A classification scheme which generates a tree and a set of rules from a given data set.
 The set of records available for developing classification methods is divided into two disjoint subsets: a training set and a test set.
 The attributes of the records are categorised into two types:
 Attributes whose domain is numerical are called numerical attributes.
 Attributes whose domain is not numerical are called categorical attributes.
 A decision tree is a tree with the following properties:
 An inner node represents an attribute.
 An edge represents a test on the attribute of the parent node.
 A leaf represents one of the classes.
 Construction of a decision tree
 Based on the training data
 Top-Down strategy
Decision Tree Example
 The data set has five attributes.
 There is a special attribute: the attribute class is the class label.
 The attributes temp (temperature) and humidity are numerical attributes.
 The other attributes are categorical, that is, they cannot be ordered.
 Based on the training data set, we want to find a set of rules that tell us, from the values of outlook, temperature, humidity and windy, whether or not to play golf.
Decision Tree Example
 We have five leaf nodes.
 In a decision tree, each leaf node represents a rule.
 We have the following rules corresponding to the tree given in the figure.
 RULE 1: If it is sunny and the humidity is not above 75%, then play.
 RULE 2: If it is sunny and the humidity is above 75%, then do not play.
 RULE 3: If it is overcast, then play.
 RULE 4: If it is rainy and not windy, then play.
 RULE 5: If it is rainy and windy, then don't play.
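The five rules translate directly into code. A minimal Python sketch (function and value spellings are illustrative, not from the slides):

```python
def play_golf(outlook, temp, humidity, windy):
    """Classify a weather record using the five rules of the example tree."""
    # temp is accepted for completeness, but the tree never tests it.
    if outlook == "sunny":
        return "play" if humidity <= 75 else "no play"  # Rules 1 and 2
    if outlook == "overcast":
        return "play"                                   # Rule 3
    if outlook == "rain":
        return "no play" if windy else "play"           # Rules 4 and 5
    raise ValueError("unknown outlook: " + str(outlook))
```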
Classification
 The classification of an unknown input vector is done by
traversing the tree from the root node to a leaf node.
 A record enters the tree at the root node.
 At the root, a test is applied to determine which child
node the record will encounter next.
 This process is repeated until the record arrives at a leaf
node.
 All the records that end up at a given leaf of the tree are
classified in the same way.
 There is a unique path from the root to each leaf.
 The path is a rule which is used to classify the records.
 In our tree, we can carry out the classification
for an unknown record as follows.
 Let us assume that for the record we know the values of the first four attributes (but not the value of the class attribute):
 outlook = rain; temp = 70; humidity = 65; and windy = true.
 We start from the root node and check the value of the attribute associated with the root node.
 This attribute is the splitting attribute at this node.
 For a decision tree, at every node there is an attribute associated with the node, called the splitting attribute.
 In our example, outlook is the splitting attribute at the root.
 Since for the given record outlook = rain, we move to the right-most child node of the root.
 At this node, the splitting attribute is windy, and we find that for the record we want to classify, windy = true.
 Hence, we move to the left child node and conclude that the class label is "no play".
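Using the hypothetical play_golf sketch from earlier, this traversal is a single call:

```python
record = {"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}
print(play_golf(**record))  # -> "no play": outlook = rain, then windy = true
```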
 The accuracy of the classifier is determined by the percentage of the
test data set that is correctly classified.
 We can see that for Rule 1 there are two records of the test data set satisfying outlook = sunny and humidity ≤ 75, and only one of these is correctly classified as play.
 Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is 0.66.
RULE 1: If it is sunny and the humidity is not above 75%, then play.
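Rule accuracy can be computed mechanically: filter the test set down to the records the rule covers, then count how many carry the predicted class. A sketch with placeholder test records (the slide's actual test set is not reproduced here):

```python
def rule_accuracy(test_set, condition, predicted):
    """Fraction of covered test records whose actual class matches the rule."""
    covered = [r for r in test_set if condition(r)]
    correct = sum(1 for r in covered if r["class"] == predicted)
    return correct / len(covered) if covered else None

# Placeholder test records consistent with the Rule 1 example above.
test_set = [
    {"outlook": "sunny", "humidity": 70, "class": "play"},
    {"outlook": "sunny", "humidity": 72, "class": "no play"},
]

acc = rule_accuracy(test_set,
                    lambda r: r["outlook"] == "sunny" and r["humidity"] <= 75,
                    "play")
print(acc)  # -> 0.5, as computed for Rule 1
```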
Concept of Categorical Attributes
 Consider the following training
data set.
 There are three attributes,
namely, age, pincode and class.
 The attribute class is used as the class label.
The attribute age is a numeric attribute, whereas pincode is a categorical one.
Though the domain of pincode is numeric, no ordering can be defined among pincode values.
Knowing that one pincode is greater than another conveys no useful information.
 Figure gives a decision tree for the
training data.
 The splitting attribute at the root is
pincode and the splitting criterion
here is pincode = 500 046.
 Similarly, for the left child node, the
splitting criterion is age < 48 (the
splitting attribute is age).
 Although the right child node has the same splitting attribute (age), its splitting criterion is different.
At the root level, we have 9 records.
The associated splitting criterion is pincode = 500 046.
As a result, we split the records into two subsets: records 1, 2, 4, 8, and 9 go to the left child node and the remaining records to the right child node.
The process is repeated at every node.
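The root-level split is just a partition of the record list by a boolean criterion. A minimal sketch with invented records (the slide's actual ages and classes are not reproduced):

```python
# Placeholder records shaped like the slide's training set.
records = [
    {"id": 1, "pincode": "500 046", "age": 38},
    {"id": 2, "pincode": "500 046", "age": 52},
    {"id": 3, "pincode": "500 001", "age": 40},
]

def partition(records, criterion):
    """Split records into the (left, right) subsets induced by the criterion."""
    left = [r for r in records if criterion(r)]
    right = [r for r in records if not criterion(r)]
    return left, right

# Root split: records with pincode = 500 046 go left, the rest go right.
left, right = partition(records, lambda r: r["pincode"] == "500 046")
```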
Advantages and Shortcomings of Decision Tree Classification
 A decision tree construction process is concerned with identifying
the splitting attributes and splitting criterion at every level of the tree.
 Major strengths:
 Decision trees are able to generate understandable rules.
 They are able to handle both numerical and categorical attributes.
 They provide a clear indication of which fields are most important for prediction or classification.
 Weaknesses:
 The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field is examined before its best split can be found.
 Some decision tree algorithms can only deal with binary-valued target classes.
Tree Construction Principle
Splitting Attribute
Splitting Criterion
 3 main phases:
 - Construction phase
 - Pruning phase
 - Processing the pruned tree to improve its understandability
Guillotine Cut
 The test is of the form (X > Z) or (X < Z); it creates a guillotine-cut subdivision of the Cartesian space of attribute ranges.
 Problem:
 Attributes may be correlated, e.g. the height and weight of a person.
 Solution:
 Oblique decision trees.
 Splitting criteria involving more than one attribute.
Avoid Overfitting in Classification
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due
to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
 Use a set of data different from the training data to
decide which is the “best pruned tree”
Decision Tree: Splitting Indices, Splitting Criteria, Decision Tree Construction Algorithms
Constructing decision trees
 Strategy: top down
Recursive divide-and-conquer fashion
 First: select attribute for root node
Create branch for each possible attribute value
 Then: split instances into subsets
One for each branch extending from the node
 Finally: repeat recursively for each branch, using
only instances that reach the branch
 Stop if all instances have the same class
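A skeleton of this recursive divide-and-conquer procedure, assuming a best_attribute scoring helper (e.g. information gain, discussed later) that is not shown here:

```python
from collections import Counter

def majority_class(instances):
    """Most frequent class label among the instances."""
    return Counter(inst["class"] for inst in instances).most_common(1)[0][0]

def build_tree(instances, attributes):
    """Top-down, recursive decision tree construction (ID3-style sketch).
    best_attribute(instances, attributes) is assumed to return the
    attribute with, e.g., the highest information gain."""
    classes = {inst["class"] for inst in instances}
    if len(classes) == 1:               # stop: all instances share one class
        return {"leaf": classes.pop()}
    if not attributes:                  # stop: no attribute left to split on
        return {"leaf": majority_class(instances)}
    attr = best_attribute(instances, attributes)
    node = {"split": attr, "children": {}}
    for value in {inst[attr] for inst in instances}:   # one branch per value
        subset = [inst for inst in instances if inst[attr] == value]
        node["children"][value] = build_tree(subset, attributes - {attr})
    return node
```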
Play or not?
• The weather dataset
Which attribute to select?
Best Split
1. Evaluate the splits for each attribute and select the best split, determining the splitting attribute.
2. Determine the splitting condition on the selected splitting attribute.
3. Partition the data using the best split.
Splitting Indices
 Determining the goodness of a split
1. Information Gain (from information theory: entropy)
2. Gini Index (from economics: a measure of diversity)
Computing purity: the information measure
• Information is a measure of a reduction of uncertainty.
• It represents the expected amount of information that would be needed to "place" a new instance in the branch.
Which attribute to select?
Final decision tree
 Splitting stops when data can’t be split any further
Criterion for attribute selection
 Which is the best attribute?
 Want to get the smallest tree
 Heuristic: choose the attribute that produces the
“purest” nodes
• Information gain: increases with the average purity of the subsets.
• Strategy: choose the attribute that gives the greatest information gain.
How to compute Information Gain: Entropy
1. When the number of either yes or no instances is zero (that is, the node is pure), the information is zero.
2. When the numbers of yes and no instances are equal, the information reaches its maximum, because we are very uncertain about the outcome.
3. Complex scenarios: the measure should be applicable to a multiclass situation, where a multi-staged decision must be made.
Entropy: Formulas
 Formulas for computing entropy:
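The formulas were on the slide image and did not survive extraction; the standard definitions that the examples below rely on are:

```latex
\mathrm{entropy}(p_1,\dots,p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i,
\qquad
\mathrm{info}([c_1,\dots,c_n]) = \mathrm{entropy}\!\left(\frac{c_1}{c},\dots,\frac{c_n}{c}\right),
\quad c = \sum_{i} c_i
```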
Entropy: Outlook, sunny
 For outlook = sunny the class counts are [2, 3], so:
info([2,3]) = –(2/5) log2(2/5) – (3/5) log2(3/5) = 0.97095059445 bits
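A small Python check of this value, using an info helper that mirrors the info([...]) notation on these slides:

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a list of class counts, e.g. info([2, 3])."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(info([2, 3]))  # -> 0.9709505944546686, the value computed above
```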
Measures: Information & Entropy
• Entropy is a probabilistic measure of uncertainty or ignorance, and information is a measure of a reduction of uncertainty.
• However, in our context we use entropy (i.e. the quantity of uncertainty) to measure the purity of a node.
Example: Outlook
Computing Information Gain
 Information gain: information before splitting – information after splitting
gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
 Information gain for the attributes of the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
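Continuing the info sketch from above, the Outlook computation weights each branch's entropy by its share of the 14 records:

```python
def weighted_info(branches):
    """Entropy of each branch's class counts, weighted by branch size."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * info(b) for b in branches)

gain_outlook = info([9, 5]) - weighted_info([[2, 3], [4, 0], [3, 2]])
print(round(gain_outlook, 3))  # -> 0.247 (bits)
```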
Information Gain Drawbacks
 Problematic: attributes with a large number
of values (extreme case: ID code)
Weather data with ID code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
Tree stump for ID code attribute
 Entropy of split:
 Information gain is maximal for ID code (namely 0.940
bits)
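The split-entropy expression itself was an image, but the claim follows directly: each of the 14 ID code branches holds a single record and is therefore pure, so

```latex
\mathrm{info}([0,1]) = \mathrm{info}([1,0]) = 0
\quad\Longrightarrow\quad
\mathrm{gain}(\mathrm{ID\ code}) = \mathrm{info}([9,5]) - 0 = 0.940\ \mathrm{bits}
```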
Information Gain Limitations
 Problematic: attributes with a large number
of values (extreme case: ID code)
 Subsets are more likely to be pure if there is
a large number of values
 Information gain is biased towards choosing
attributes with a large number of values
 This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
 (Another problem: fragmentation)
Gain ratio
 Gain ratio: a modification of the information gain
that reduces its bias
 Gain ratio takes number and size of branches into
account when choosing an attribute
 It corrects the information gain by taking the intrinsic
information of a split into account
 Intrinsic information: the entropy of the distribution of instances across the branches (information about the class is disregarded).
Gain ratios for weather data
Outlook: info = 0.693; gain = 0.940 – 0.693 = 0.247; split info = info([5,4,5]) = 1.577; gain ratio = 0.247/1.577 = 0.157
Temperature: info = 0.911; gain = 0.940 – 0.911 = 0.029; split info = info([4,6,4]) = 1.557; gain ratio = 0.029/1.557 = 0.019
Humidity: info = 0.788; gain = 0.940 – 0.788 = 0.152; split info = info([7,7]) = 1.000; gain ratio = 0.152/1 = 0.152
Windy: info = 0.892; gain = 0.940 – 0.892 = 0.048; split info = info([8,6]) = 0.985; gain ratio = 0.048/0.985 = 0.049
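The split info row is simply the entropy of the branch sizes. Reusing the info and weighted_info helpers sketched earlier, for Outlook:

```python
gain = info([9, 5]) - weighted_info([[2, 3], [4, 0], [3, 2]])  # 0.247 bits
split_info = info([5, 4, 5])   # entropy of the branch sizes: 1.577 bits
print(gain / split_info)       # -> 0.156..., which the table's rounded
                               #    figures give as 0.247/1.577 = 0.157
```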
More on the gain ratio
 “Outlook” still comes out top
 However: “ID code” has greater gain ratio
 Standard fix: ad hoc test to prevent splitting on that
type of attribute
 Problem with gain ratio: it may overcompensate
 May choose an attribute just because its intrinsic
information is very low
 Standard fix: only consider attributes with greater
than average information gain
Gini index
 All attributes are assumed continuous-valued
 Assume there exist several possible split
values for each attribute
 May need other tools, such as
clustering, to get the possible split
values
 Can be modified for categorical attributes
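The Gini formulas were on two image-only slides that did not survive extraction; the standard definitions, consistent with the binary-split discussion below, are:

```latex
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^{2},
\qquad
\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```

where p_i is the relative frequency of class i in D, and D1, D2 are the partitions induced by a binary split on attribute A.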
Splitting Criteria
 Let attribute A be a numerical-valued attribute. We must determine the best split point for A (binary split).
 Sort the values of A in increasing order.
 Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values ai and ai+1.
 The point with the minimum expected information requirement for A is selected as the split point.
Split
 D1 is the set of tuples in D satisfying A ≤ split-point
 D2 is the set of tuples in D satisfying A > split-point
Binary Split
 Numerical-valued attributes
 Examine each possible split point. The midpoint between each pair of (sorted) adjacent values is taken as a possible split-point.
 For each split-point, compute the weighted sum of the impurity of each of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point).
 The point that gives the minimum Gini index for attribute A is selected as its split-point.
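A sketch of this search, scoring every midpoint with the weighted Gini index of the two partitions (helper names are illustrative; gini follows the standard definition given earlier):

```python
def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of the
    squared class frequencies."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """Try the midpoint of each adjacent pair of sorted values; return the
    split point whose two partitions have the minimum weighted Gini index."""
    pairs = sorted(zip(values, labels))
    best_score, best_split = None, None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best_score is None or score < best_score:
            best_score, best_split = score, split
    return best_split
```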
Class Histogram
Two class histograms are used to store the class
distribution for numerical attributes.
Binary Split
 Categorical Attributes
 Examine the partitions resulting from all possible subsets of {a1, …, av}
 Each subset SA defines a binary test on attribute A of the form "A ∈ SA?"
 There are 2^v possible subsets. We exclude the full set and the empty set, leaving 2^v – 2 candidate subsets.
 The subset that gives the minimum Gini index for attribute
A is selected as its splitting subset
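Enumerating the 2^v – 2 candidate subsets takes a few lines with itertools (a sketch; scoring each resulting partition is assumed to use the Gini machinery above):

```python
from itertools import combinations

def candidate_subsets(values):
    """Yield every non-trivial subset S_A of an attribute's value set:
    the empty set and the full set are excluded, leaving 2**v - 2."""
    values = list(values)
    for k in range(1, len(values)):          # subset sizes 1 .. v-1
        for subset in combinations(values, k):
            yield set(subset)

# For v = 3 values there are 2**3 - 2 = 6 binary tests of the form "A in S?"
print(sum(1 for _ in candidate_subsets(["a1", "a2", "a3"])))  # -> 6
```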
Count Matrix
The count matrix stores the class distribution of
each value of a categorical attribute.
Decision tree construction algorithm
1. Information Gain
 • ID3
 • C4.5
 • C5.0
 • J48
2. Gini Index
 • SPRINT
 • SLIQ
Decision tree construction algorithms (continued)
 PUBLIC
 RainForest
 BOAT (1999)
The End