2. Introduction
Data Mining Dr. Iram Naim, Dept. of CSIT, MJPRU
A classification scheme that generates a tree and a set of rules from a given data set.
The set of records available for developing
classification methods is divided into two disjoint
subsets – a training set and a test set.
The attributes of the records are categorised into two types:
Attributes whose domain is numerical are called numerical attributes.
Attributes whose domain is not numerical are called categorical attributes.
3. Introduction
A decision tree is a tree with the following properties:
An inner node represents an attribute.
An edge represents a test on the attribute of the parent node.
A leaf represents one of the classes.
Construction of a decision tree
Based on the training data
Top-Down strategy
4. Decision Tree Example
The data set has five attributes.
There is a special attribute: the attribute class is the class label.
The attributes temp (temperature) and humidity are numerical attributes.
Other attributes are categorical, that is, they cannot be ordered.
Based on the training data set, we want to find a set of rules that tell us what values of outlook, temperature, humidity and windy determine whether or not to play golf.
5. Decision Tree Example
We have five leaf nodes.
In a decision tree, each leaf node represents a rule.
We have the following rules corresponding to the tree given in the figure:
RULE 1: If it is sunny and the humidity is not above 75%, then play.
RULE 2: If it is sunny and the humidity is above 75%, then do not play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then don't play.
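These five rules can be turned directly into a classifier. A minimal sketch in Python (the function name, record format and return labels are illustrative, not from the slides):

    def classify_play(outlook, humidity, windy):
        """Classify a record using the five rules read off the decision tree."""
        if outlook == "sunny":
            # Rules 1 and 2: decided by humidity
            return "play" if humidity <= 75 else "no play"
        if outlook == "overcast":
            # Rule 3: always play
            return "play"
        if outlook == "rain":
            # Rules 4 and 5: decided by windy
            return "no play" if windy else "play"
        raise ValueError("unknown outlook value: " + str(outlook))

    # Example: sunny, humidity 70%, not windy -> "play" (Rule 1)
    print(classify_play("sunny", 70, False))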
6. Classification
The classification of an unknown input vector is done by
traversing the tree from the root node to a leaf node.
A record enters the tree at the root node.
At the root, a test is applied to determine which child
node the record will encounter next.
This process is repeated until the record arrives at a leaf
node.
All the records that end up at a given leaf of the tree are
classified in the same way.
There is a unique path from the root to each leaf.
The path is a rule which is used to classify the records.
7. In our tree, we can carry out the classification
for an unknown record as follows.
Let us assume, for the record, that we know the values of the first four attributes (but we do not know the value of the class attribute) as outlook = rain; temp = 70; humidity = 65; and windy = true.
8. We start from the root node and check the value of the attribute associated with the root node.
This attribute is the splitting attribute at this node.
For a decision tree, at every node there is an attribute associated
with the node called the splitting attribute.
In our example, outlook is the splitting attribute at root.
Since for the given record, outlook = rain, we move to the right-
most child node of the root.
At this node, the splitting attribute is windy and we find that, for the record we want to classify, windy = true.
Hence, we move to the left child node and conclude that the class label is "no play".
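The traversal just described can be sketched in Python with the tree stored as nested dictionaries; this representation is only an assumption for illustration, since the slides show the tree as a figure:

    # Decision tree from the figure as nested dicts: inner nodes name a
    # splitting attribute, leaves hold the class label.
    tree = {
        "attr": "outlook",
        "branches": {
            "sunny": {"attr": "humidity<=75",
                      "branches": {True: "play", False: "no play"}},
            "overcast": "play",
            "rain": {"attr": "windy",
                     "branches": {True: "no play", False: "play"}},
        },
    }

    def classify(node, record):
        """Walk from the root to a leaf, applying the test at each node."""
        while isinstance(node, dict):
            attr = node["attr"]
            key = record["humidity"] <= 75 if attr == "humidity<=75" else record[attr]
            node = node["branches"][key]
        return node  # the leaf, i.e. the class label

    record = {"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}
    print(classify(tree, record))  # -> "no play"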
9. The accuracy of the classifier is determined by the percentage of the
test data set that is correctly classified.
We can see that for Rule 1 there are two records of the test data set satisfying outlook = sunny and humidity < 75, and only one of these is correctly classified as play.
Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the
accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is
0.66.
RULE 1: If it is sunny and the humidity is not above 75%, then play.
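Rule accuracy can be computed by selecting the test records that satisfy a rule's condition and counting how many carry the predicted class. A sketch, with hypothetical test records (the slides give the real test set only as a figure):

    def rule_accuracy(test_records, condition, predicted_class):
        """Fraction of matching test records whose class equals the prediction."""
        matching = [r for r in test_records if condition(r)]
        if not matching:
            return None
        correct = sum(1 for r in matching if r["class"] == predicted_class)
        return correct / len(matching)

    # Hypothetical test records, just to reproduce the 50% figure for Rule 1.
    test_records = [
        {"outlook": "sunny", "humidity": 70, "class": "play"},
        {"outlook": "sunny", "humidity": 72, "class": "no play"},
    ]
    print(rule_accuracy(test_records,
                        lambda r: r["outlook"] == "sunny" and r["humidity"] < 75,
                        "play"))  # 0.5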
10. Concept of Categorical Attributes
Consider the following training
data set.
There are three attributes,
namely, age, pincode and class.
The attribute class is used as the class label.
The attribute age is a numeric attribute, whereas pincode is a categorical
one.
Though the domain of pincode is numeric, no ordering can be defined
among pincode values.
You cannot derive any useful information from the fact that one pincode is greater than another pincode.
11. The figure gives a decision tree for the training data.
The splitting attribute at the root is
pincode and the splitting criterion
here is pincode = 500 046.
Similarly, for the left child node, the
splitting criterion is age < 48 (the
splitting attribute is age).
Although the right child node has
the same attribute as the splitting
attribute, the splitting criterion is
different.
At root level, we have 9 records.
The associated splitting criterion is
pincode = 500 046.
As a result, we split the records into two subsets. Records 1, 2, 4, 8, and 9 go to the left child node and the remaining records to the right child node.
The process is repeated at every
node.
12. Advantages and Shortcomings of Decision Tree Classification
A decision tree construction process is concerned with identifying
the splitting attributes and splitting criterion at every level of the tree.
Major strengths are:
Decision trees are able to generate understandable rules.
They are able to handle both numerical and categorical attributes.
They provide a clear indication of which fields are most important for prediction or classification.
Weaknesses are:
The process of growing a decision tree is computationally expensive. At
each node, each candidate splitting field is examined before its best split
can be found.
Some decision tree algorithms can only deal with binary-valued target classes.
13. Tree construction Principle
Splitting Attribute
Splitting Criterion
Three main phases:
- Construction phase
- Pruning phase
- Processing the pruned tree to improve understandability
14. Guillotine Cut
The test is of the form (X > z) or (X < z), since it creates a guillotine-cut subdivision of the Cartesian space of the attribute ranges.
Problem:
When attributes are correlated, e.g. the height and weight of a person.
Solution:
Oblique decision trees.
Splitting criteria involving more than one attribute.
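The contrast between a guillotine (axis-parallel) cut and an oblique split can be sketched as follows; the attribute names, threshold and coefficients are made up for illustration:

    # Guillotine cut: tests a single attribute against a threshold, so the
    # decision boundary is parallel to one axis of the attribute space.
    def axis_parallel_test(record):
        return record["height"] > 170

    # Oblique split: tests a linear combination of attributes, which can
    # follow the correlation between height and weight.
    def oblique_test(record):
        return 0.5 * record["height"] - record["weight"] > 10

    person = {"height": 180, "weight": 75}
    print(axis_parallel_test(person), oblique_test(person))  # True True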
15. Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due
to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to
decide which is the “best pruned tree”
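Both ideas map onto parameters of common library implementations. A minimal sketch with scikit-learn, assuming that library and its bundled iris data are available (prepruning via max_depth/min_impurity_decrease, postpruning via cost-complexity pruning with ccp_alpha, which would normally be chosen on held-out data):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Prepruning: halt growth early with a depth limit and a goodness threshold.
    pre = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01)
    pre.fit(X_train, y_train)

    # Postpruning: grow the tree fully, then prune with cost-complexity pruning.
    post = DecisionTreeClassifier(ccp_alpha=0.02)
    post.fit(X_train, y_train)

    print(pre.score(X_test, y_test), post.score(X_test, y_test))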
16. Decision Tree
Splitting Indices, Splitting Criteria,
Decision tree construction algorithm
17. Constructing decision trees
Strategy: top down
Recursive divide-and-conquer fashion
First: select attribute for root node
Create branch for each possible attribute value
Then: split instances into subsets
One for each branch extending from the node
Finally: repeat recursively for each branch, using
only instances that reach the branch
Stop if all instances have the same class
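This top-down, divide-and-conquer strategy can be written as a short recursive procedure; the attribute-selection step is left as a parameter here, to be filled in with information gain or the Gini index described on the later slides:

    from collections import Counter

    def build_tree(records, attributes, select_attribute):
        """Recursive top-down construction over categorical attributes."""
        classes = [r["class"] for r in records]
        # Stop if all instances have the same class (pure node) ...
        if len(set(classes)) == 1:
            return classes[0]
        # ... or if no attributes remain: predict the majority class.
        if not attributes:
            return Counter(classes).most_common(1)[0][0]
        attr = select_attribute(records, attributes)
        remaining = [a for a in attributes if a != attr]
        tree = {"attr": attr, "branches": {}}
        # One branch per attribute value, built only from the records that reach it.
        for value in {r[attr] for r in records}:
            subset = [r for r in records if r[attr] == value]
            tree["branches"][value] = build_tree(subset, remaining, select_attribute)
        return tree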
18. Play or not?
• The weather
dataset
20. Best Split
1. Evaluation of the splits for each attribute and selection of the best split (determination of the splitting attribute).
2. Determination of the splitting condition on the selected splitting attribute.
3. Partitioning the data using the best split.
21. Splitting Indices
Determining the goodness of a split
1. Information Gain
(from information theory, based on entropy)
2. Gini Index
(from economics, a measure of diversity)
22. Computing purity: the information measure
• Information is a measure of a reduction of uncertainty.
• It represents the expected amount of information that would be needed to "place" a new instance in the branch.
23. Which attribute to select?
24. Final decision tree
Splitting stops when data can’t be split any further
25. Criterion for attribute selection
Which is the best attribute?
Want to get the smallest tree
Heuristic: choose the attribute that produces the
“purest” nodes
26. Information gain: increases with the average purity of the subsets.
Strategy: choose the attribute that gives the greatest information gain.
27. How to compute Information Gain: Entropy
1. When the number of either yes or no instances is zero (that is, the node is pure), the information is zero.
2. When the numbers of yes and no instances are equal, the information reaches its maximum because we are very uncertain about the outcome.
3. Complex scenarios: the measure should be
applicable to a multiclass situation, where a multi-
staged decision must be made.
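A small sketch of the entropy measure makes these properties concrete (the function name is illustrative; the input is a list of per-class counts):

    from math import log2

    def entropy(counts):
        """Entropy in bits of a class distribution given as a list of counts."""
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    print(entropy([3, 0]))     # 0.0 -> pure node, zero information needed
    print(entropy([2, 2]))     # 1.0 -> equal yes/no, maximum uncertainty
    print(entropy([4, 3, 2]))  # also applies to the multiclass case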
29. Entropy: Outlook, sunny
Formula for computing the entropy:
info([2,3]) = -(2/5) × log2(2/5) - (3/5) × log2(3/5) = 0.971
30. Measures: Information & Entropy
• Entropy is a probabilistic measure of uncertainty or ignorance, and information is a measure of a reduction of uncertainty.
• However, in our context we use entropy (i.e. the quantity of uncertainty) to measure the purity of a node.
32. Computing Information Gain
Information gain: information before splitting –
information after splitting
gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
Information gain for attributes from weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
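The gain for Outlook can be reproduced with a few lines of Python; the function names are illustrative, the class counts are those of the weather data:

    from math import log2

    def info(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    def gain(parent_counts, child_counts_list):
        """Information before splitting minus weighted information after."""
        n = sum(parent_counts)
        after = sum(sum(c) / n * info(c) for c in child_counts_list)
        return info(parent_counts) - after

    # Outlook splits [9,5] into sunny [2,3], overcast [4,0] and rainy [3,2].
    print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247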
34. Weather data with ID code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
35. Tree stump for ID code attribute
Entropy of the split: 0 bits (each branch contains a single instance, so every branch is pure).
Information gain is therefore maximal for ID code (namely 0.940 bits).
36. Information Gain Limitations
Problematic: attributes with a large number
of values (extreme case: ID code)
Subsets are more likely to be pure if there is
a large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
(Another problem: fragmentation)
37. Gain ratio
Gain ratio: a modification of the information gain
that reduces its bias
Gain ratio takes number and size of branches into
account when choosing an attribute
It corrects the information gain by taking the intrinsic
information of a split into account
Intrinsic information: the entropy of the distribution of instances into branches (information about the class is disregarded).
38. Gain ratios for weather data
Outlook: Info: 0.693; Gain: 0.940 – 0.693 = 0.247; Split info: info([5,4,5]) = 1.577; Gain ratio: 0.247/1.577 = 0.157
Temperature: Info: 0.911; Gain: 0.940 – 0.911 = 0.029; Split info: info([4,6,4]) = 1.557; Gain ratio: 0.029/1.557 = 0.019
Humidity: Info: 0.788; Gain: 0.940 – 0.788 = 0.152; Split info: info([7,7]) = 1.000; Gain ratio: 0.152/1.000 = 0.152
Windy: Info: 0.892; Gain: 0.940 – 0.892 = 0.048; Split info: info([8,6]) = 0.985; Gain ratio: 0.048/0.985 = 0.049
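The Outlook entry of this table can be checked directly: the split information is the entropy of the branch sizes, and the gain ratio is the gain divided by it (a sketch reusing the info() helper from above):

    from math import log2

    def info(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    # Outlook: gain 0.247, branch sizes 5 (sunny), 4 (overcast), 5 (rainy).
    split_info = info([5, 4, 5])   # intrinsic information of the split
    print(round(split_info, 3), round(0.247 / split_info, 3))  # 1.577 0.157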
39. More on the gain ratio
“Outlook” still comes out top
However: “ID code” has greater gain ratio
Standard fix: ad hoc test to prevent splitting on that
type of attribute
Problem with gain ratio: it may overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix: only consider attributes with greater
than average information gain
40. Gini index
All attributes are assumed continuous-
valued
Assume there exist several possible split
values for each attribute
May need other tools, such as
clustering, to get the possible split
values
Can be modified for categorical attributes
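For reference, the Gini index of a node is 1 minus the sum of the squared class proportions; a minimal sketch:

    def gini(counts):
        """Gini index of a class distribution given as a list of counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    print(round(gini([9, 5]), 3))  # 0.459 for the weather data before any split
    print(gini([4, 0]))            # 0.0 for a pure node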
43. Splitting Criteria
Let attribute A be a numerical-valued attribute. We must determine the best split point for A (binary split).
Sort the values of A in increasing order.
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values ai and ai+1.
The point with the minimum expected information requirement for A is selected as the split point.
Split:
D1 is the set of tuples in D satisfying A ≤ split-point
D2 is the set of tuples in D satisfying A > split-point
44. Binary Split
Numerical-Valued Attributes
Examine each possible split point. The midpoint between each pair
of (sorted) adjacent values is taken as a possible split-point
For each split-point, compute the weighted sum of the impurity of each of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point).
The point that gives the minimum Gini index for attribute A is
selected as its split-point
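A minimal sketch of this search for a numerical attribute: sort the values, try the midpoint between each pair of adjacent distinct values, and keep the split-point with the lowest weighted Gini index (the sample values and labels below are made up for illustration):

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_numeric_split(values, labels):
        """Return (split_point, weighted_gini) minimising the weighted impurity."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = (None, float("inf"))
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue  # no midpoint between equal adjacent values
            split = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [c for v, c in pairs if v <= split]
            right = [c for v, c in pairs if v > split]
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            if weighted < best[1]:
                best = (split, weighted)
        return best

    humidity = [65, 70, 70, 75, 80, 85, 90, 95]
    play = ["yes", "yes", "yes", "yes", "no", "no", "no", "yes"]
    print(best_numeric_split(humidity, play))  # e.g. (77.5, 0.1875)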
45. Class Histogram
Two class histograms are used to store the class
distribution for numerical attributes.
46. Binary Split
Categorical Attributes
Examine the partitions resulting from all possible subsets of {a1, …, av}.
Each subset SA defines a binary test on attribute A of the form "A ∈ SA?".
There are 2^v possible subsets. Excluding the full set and the empty set, we have 2^v − 2 candidate subsets.
The subset that gives the minimum Gini index for attribute
A is selected as its splitting subset
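A sketch of the subset search for a categorical attribute: it enumerates the 2^v − 2 non-trivial subsets (each partition is visited twice, once per side, which does not change the result) and keeps the one with the lowest weighted Gini index. The sample values are illustrative:

    from itertools import combinations

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_categorical_split(values, labels):
        """Return (subset, weighted_gini) over all non-empty proper subsets."""
        domain = sorted(set(values))
        n = len(labels)
        best = (None, float("inf"))
        for size in range(1, len(domain)):      # excludes empty set and full set
            for subset in combinations(domain, size):
                inside = [c for v, c in zip(values, labels) if v in subset]
                outside = [c for v, c in zip(values, labels) if v not in subset]
                weighted = (len(inside) / n * gini(inside)
                            + len(outside) / n * gini(outside))
                if weighted < best[1]:
                    best = (set(subset), weighted)
        return best

    outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"]
    play = ["no", "no", "yes", "yes", "no", "yes"]
    print(best_categorical_split(outlook, play))  # e.g. ({'overcast'}, 0.25)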
47. Count Matrix
The count matrix stores the class distribution of
each value of a categorical attribute.
48. Decision tree construction algorithm
1. Information Gain
• ID3
• C4.5
• C5.0
• J48
2. Gini Index
• SPRINT
• SLIQ