2. Introduction
Data Mining Dr. Iram Naim, Dept. of CSIT, MJPRU
A classification scheme that generates a tree and a set of rules from a given data set.
The set of records available for developing
classification methods is divided into two disjoint
subsets – a training set and a test set.
The attributes of the records are categorised into two types:
Attributes whose domain is numerical are called numerical attributes.
Attributes whose domain is not numerical are called categorical attributes.
3. Introduction
A decision tree is a tree with the following properties:
An inner node represents an attribute.
An edge represents a test on the attribute of the parent node.
A leaf represents one of the classes.
Construction of a decision tree
Based on the training data
Top-Down strategy
4. Decision Tree Example
The data set has five attributes.
There is a special attribute: the attribute class is the class label.
The attributes temp (temperature) and humidity are numerical attributes.
Other attributes are categorical, that is, they cannot be ordered.
Based on the training data set, we want to find a set of rules that tell us what values of outlook, temperature, humidity and windy determine whether or not to play golf.
5. Decision Tree Example
We have five leaf nodes.
In a decision tree, each leaf node represents a rule.
We have the following rules corresponding to the tree given in the figure:
RULE 1: If it is sunny and the humidity is not above 75%, then play.
RULE 2: If it is sunny and the humidity is above 75%, then do not play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then don't play.
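These five rules can be turned directly into a classifier. A minimal sketch in Python (the function name, record format and return labels are illustrative, not from the slides):

    def classify_play(outlook, humidity, windy):
        """Classify a record using the five rules read off the decision tree."""
        if outlook == "sunny":
            # Rules 1 and 2: decided by humidity
            return "play" if humidity <= 75 else "no play"
        if outlook == "overcast":
            # Rule 3: always play
            return "play"
        if outlook == "rain":
            # Rules 4 and 5: decided by windy
            return "no play" if windy else "play"
        raise ValueError("unknown outlook value: " + str(outlook))

    # Example: sunny, humidity 70%, not windy -> "play" (Rule 1)
    print(classify_play("sunny", 70, False))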
6. Classification
The classification of an unknown input vector is done by
traversing the tree from the root node to a leaf node.
A record enters the tree at the root node.
At the root, a test is applied to determine which child
node the record will encounter next.
This process is repeated until the record arrives at a leaf
node.
All the records that end up at a given leaf of the tree are
classified in the same way.
There is a unique path from the root to each leaf.
The path is a rule which is used to classify the records.
7. In our tree, we can carry out the classification
for an unknown record as follows.
Let us assume, for the record, that we know the values of the first four attributes (but we do not know the value of the class attribute) as outlook = rain; temp = 70; humidity = 65; and windy = true.
8. We start from the root node and check the value of the attribute associated with the root node.
This attribute is the splitting attribute at this node.
For a decision tree, at every node there is an attribute associated
with the node called the splitting attribute.
In our example, outlook is the splitting attribute at root.
Since for the given record, outlook = rain, we move to the right-
most child node of the root.
At this node, the splitting attribute is windy and we find that, for the record we want to classify, windy = true.
Hence, we move to the left child node and conclude that the class label is "no play".
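The traversal just described can be sketched in Python with the tree stored as nested dictionaries; this representation is only an assumption for illustration, since the slides show the tree as a figure:

    # Decision tree from the figure as nested dicts: inner nodes name a
    # splitting attribute, leaves hold the class label.
    tree = {
        "attr": "outlook",
        "branches": {
            "sunny": {"attr": "humidity<=75",
                      "branches": {True: "play", False: "no play"}},
            "overcast": "play",
            "rain": {"attr": "windy",
                     "branches": {True: "no play", False: "play"}},
        },
    }

    def classify(node, record):
        """Walk from the root to a leaf, applying the test at each node."""
        while isinstance(node, dict):
            attr = node["attr"]
            key = record["humidity"] <= 75 if attr == "humidity<=75" else record[attr]
            node = node["branches"][key]
        return node  # the leaf, i.e. the class label

    record = {"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}
    print(classify(tree, record))  # -> "no play"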
9. The accuracy of the classifier is determined by the percentage of the
test data set that is correctly classified.
We can see that for Rule 1 there are two records of the test data set satisfying outlook = sunny and humidity < 75, and only one of these is correctly classified as play.
Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the
accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is
0.66.
RULE 1: If it is sunny and the humidity is not above 75%, then play.
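Rule accuracy can be computed by selecting the test records that satisfy a rule's condition and counting how many carry the predicted class. A sketch, with hypothetical test records (the slides give the real test set only as a figure):

    def rule_accuracy(test_records, condition, predicted_class):
        """Fraction of matching test records whose class equals the prediction."""
        matching = [r for r in test_records if condition(r)]
        if not matching:
            return None
        correct = sum(1 for r in matching if r["class"] == predicted_class)
        return correct / len(matching)

    # Hypothetical test records, just to reproduce the 50% figure for Rule 1.
    test_records = [
        {"outlook": "sunny", "humidity": 70, "class": "play"},
        {"outlook": "sunny", "humidity": 72, "class": "no play"},
    ]
    print(rule_accuracy(test_records,
                        lambda r: r["outlook"] == "sunny" and r["humidity"] < 75,
                        "play"))  # 0.5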
10. Concept of Categorical Attributes
Consider the following training
data set.
There are three attributes,
namely, age, pincode and class.
The attribute class is used as the class label.
The attribute age is a numeric attribute, whereas pincode is a categorical
one.
Though the domain of pincode is numeric, no ordering can be defined
among pincode values.
You cannot derive any useful information from the fact that one pincode is greater than another pincode.
11. The figure gives a decision tree for the training data.
The splitting attribute at the root is
pincode and the splitting criterion
here is pincode = 500 046.
Similarly, for the left child node, the
splitting criterion is age < 48 (the
splitting attribute is age).
Although the right child node has
the same attribute as the splitting
attribute, the splitting criterion is
different.
At root level, we have 9 records.
The associated splitting criterion is
pincode = 500 046.
As a result, we split the records into two subsets. Records 1, 2, 4, 8, and 9 go to the left child node and the remaining records to the right child node.
The process is repeated at every
node.
12. Advantages and Shortcomings of Decision Tree Classification
A decision tree construction process is concerned with identifying
the splitting attributes and splitting criterion at every level of the tree.
Major strengths are:
Decision trees are able to generate understandable rules.
They are able to handle both numerical and categorical attributes.
They provide a clear indication of which fields are most important for prediction or classification.
Weaknesses are:
The process of growing a decision tree is computationally expensive. At
each node, each candidate splitting field is examined before its best split
can be found.
Some decision tree algorithms can only deal with binary-valued target classes.
13. Tree construction Principle
Splitting Attribute
Splitting Criterion
Three main phases:
- Construction phase
- Pruning phase
- Processing the pruned tree to improve understandability
14. Guillotine Cut
The test is of the form (X > z) or (X < z), since it creates a guillotine-cut subdivision of the Cartesian space of the attribute ranges.
Problem:
When attributes are correlated, e.g. the height and weight of a person.
Solution:
Oblique decision trees.
Splitting criteria involving more than one attribute.
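The contrast between a guillotine (axis-parallel) cut and an oblique split can be sketched as follows; the attribute names, threshold and coefficients are made up for illustration:

    # Guillotine cut: tests a single attribute against a threshold, so the
    # decision boundary is parallel to one axis of the attribute space.
    def axis_parallel_test(record):
        return record["height"] > 170

    # Oblique split: tests a linear combination of attributes, which can
    # follow the correlation between height and weight.
    def oblique_test(record):
        return 0.5 * record["height"] - record["weight"] > 10

    person = {"height": 180, "weight": 75}
    print(axis_parallel_test(person), oblique_test(person))  # True True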
15. Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due
to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to
decide which is the “best pruned tree”
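Both ideas map onto parameters of common library implementations. A minimal sketch with scikit-learn, assuming that library and its bundled iris data are available (prepruning via max_depth/min_impurity_decrease, postpruning via cost-complexity pruning with ccp_alpha, which would normally be chosen on held-out data):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Prepruning: halt growth early with a depth limit and a goodness threshold.
    pre = DecisionTreeClassifier(max_depth=3, min_impurity_decrease=0.01)
    pre.fit(X_train, y_train)

    # Postpruning: grow the tree fully, then prune with cost-complexity pruning.
    post = DecisionTreeClassifier(ccp_alpha=0.02)
    post.fit(X_train, y_train)

    print(pre.score(X_test, y_test), post.score(X_test, y_test))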
16. Decision Tree
Splitting Indices, Splitting Criteria,
Decision tree construction algorithm
17. Constructing decision trees
Strategy: top down
Recursive divide-and-conquer fashion
First: select attribute for root node
Create branch for each possible attribute value
Then: split instances into subsets
One for each branch extending from the node
Finally: repeat recursively for each branch, using
only instances that reach the branch
Stop if all instances have the same class
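This top-down, divide-and-conquer strategy can be written as a short recursive procedure; the attribute-selection step is left as a parameter here, to be filled in with information gain or the Gini index described on the later slides:

    from collections import Counter

    def build_tree(records, attributes, select_attribute):
        """Recursive top-down construction over categorical attributes."""
        classes = [r["class"] for r in records]
        # Stop if all instances have the same class (pure node) ...
        if len(set(classes)) == 1:
            return classes[0]
        # ... or if no attributes remain: predict the majority class.
        if not attributes:
            return Counter(classes).most_common(1)[0][0]
        attr = select_attribute(records, attributes)
        remaining = [a for a in attributes if a != attr]
        tree = {"attr": attr, "branches": {}}
        # One branch per attribute value, built only from the records that reach it.
        for value in {r[attr] for r in records}:
            subset = [r for r in records if r[attr] == value]
            tree["branches"][value] = build_tree(subset, remaining, select_attribute)
        return tree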
18. Play or not?
• The weather
dataset
20. Best Split
1. Evaluation of the splits for each attribute and selection of the best split (determination of the splitting attribute).
2. Determination of the splitting condition on the selected splitting attribute.
3. Partitioning the data using the best split.
21. Splitting Indices
Determining the goodness of a split
1. Information Gain
(from information theory, based on entropy)
2. Gini Index
(from economics, a measure of diversity)
22. Computing purity: the information measure
• Information is a measure of a reduction of uncertainty.
• It represents the expected amount of information that would be needed to "place" a new instance in the branch.
23. Which attribute to select?
24. Final decision tree
Splitting stops when data can’t be split any further
25. Criterion for attribute selection
Which is the best attribute?
Want to get the smallest tree
Heuristic: choose the attribute that produces the
“purest” nodes
26. Information gain: increases with the average purity of the subsets.
Strategy: choose the attribute that gives the greatest information gain.
27. How to compute Information Gain: Entropy
1. When the number of either yes or no instances is zero (that is, the node is pure), the information is zero.
2. When the numbers of yes and no instances are equal, the information reaches its maximum because we are very uncertain about the outcome.
3. Complex scenarios: the measure should be
applicable to a multiclass situation, where a multi-
staged decision must be made.
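A small sketch of the entropy measure makes these properties concrete (the function name is illustrative; the input is a list of per-class counts):

    from math import log2

    def entropy(counts):
        """Entropy in bits of a class distribution given as a list of counts."""
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    print(entropy([3, 0]))     # 0.0 -> pure node, zero information needed
    print(entropy([2, 2]))     # 1.0 -> equal yes/no, maximum uncertainty
    print(entropy([4, 3, 2]))  # also applies to the multiclass case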
29. Entropy: Outlook, sunny
Formula for computing the entropy:
info([2,3]) = -(2/5) × log2(2/5) - (3/5) × log2(3/5) = 0.971
30. Measures: Information & Entropy
• Entropy is a probabilistic measure of uncertainty or ignorance, and information is a measure of a reduction of uncertainty.
• However, in our context we use entropy (i.e. the quantity of uncertainty) to measure the purity of a node.
32. Computing Information Gain
Information gain: information before splitting –
information after splitting
gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
Information gain for attributes from weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
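The gain for Outlook can be reproduced with a few lines of Python; the function names are illustrative, the class counts are those of the weather data:

    from math import log2

    def info(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    def gain(parent_counts, child_counts_list):
        """Information before splitting minus weighted information after."""
        n = sum(parent_counts)
        after = sum(sum(c) / n * info(c) for c in child_counts_list)
        return info(parent_counts) - after

    # Outlook splits [9,5] into sunny [2,3], overcast [4,0] and rainy [3,2].
    print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247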
34. Weather data with ID code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
35. Tree stump for ID code attribute
Entropy of the split: 0 bits (each branch contains a single instance, so every branch is pure).
Information gain is therefore maximal for ID code (namely 0.940 bits).
36. Information Gain Limitations
Problematic: attributes with a large number
of values (extreme case: ID code)
Subsets are more likely to be pure if there is
a large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
(Another problem: fragmentation)
37. Gain ratio
Gain ratio: a modification of the information gain
that reduces its bias
Gain ratio takes number and size of branches into
account when choosing an attribute
It corrects the information gain by taking the intrinsic
information of a split into account
Intrinsic information: the entropy of the distribution of instances into branches (information about the class is disregarded).
38. Gain ratios for weather data
Outlook: Info: 0.693; Gain: 0.940 – 0.693 = 0.247; Split info: info([5,4,5]) = 1.577; Gain ratio: 0.247/1.577 = 0.157
Temperature: Info: 0.911; Gain: 0.940 – 0.911 = 0.029; Split info: info([4,6,4]) = 1.557; Gain ratio: 0.029/1.557 = 0.019
Humidity: Info: 0.788; Gain: 0.940 – 0.788 = 0.152; Split info: info([7,7]) = 1.000; Gain ratio: 0.152/1.000 = 0.152
Windy: Info: 0.892; Gain: 0.940 – 0.892 = 0.048; Split info: info([8,6]) = 0.985; Gain ratio: 0.048/0.985 = 0.049
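The Outlook entry of this table can be checked directly: the split information is the entropy of the branch sizes, and the gain ratio is the gain divided by it (a sketch reusing the info() helper from above):

    from math import log2

    def info(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    # Outlook: gain 0.247, branch sizes 5 (sunny), 4 (overcast), 5 (rainy).
    split_info = info([5, 4, 5])   # intrinsic information of the split
    print(round(split_info, 3), round(0.247 / split_info, 3))  # 1.577 0.157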
39. More on the gain ratio
“Outlook” still comes out top
However: “ID code” has greater gain ratio
Standard fix: ad hoc test to prevent splitting on that
type of attribute
Problem with gain ratio: it may overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix: only consider attributes with greater
than average information gain
40. Gini index
All attributes are assumed continuous-
valued
Assume there exist several possible split
values for each attribute
May need other tools, such as
clustering, to get the possible split
values
Can be modified for categorical attributes
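For reference, the Gini index of a node is 1 minus the sum of the squared class proportions; a minimal sketch:

    def gini(counts):
        """Gini index of a class distribution given as a list of counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    print(round(gini([9, 5]), 3))  # 0.459 for the weather data before any split
    print(gini([4, 0]))            # 0.0 for a pure node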
43. Splitting Criteria
Let attribute A be a numerical-valued attribute. We must determine the best split point for A (binary split).
Sort the values of A in increasing order.
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values ai and ai+1.
The point with the minimum expected information requirement for A is selected as the split point.
Split:
D1 is the set of tuples in D satisfying A ≤ split-point
D2 is the set of tuples in D satisfying A > split-point
44. Binary Split
Numerical-Valued Attributes
Examine each possible split point. The midpoint between each pair
of (sorted) adjacent values is taken as a possible split-point
For each split-point, compute the weighted sum of the impurity of each of the two resulting partitions (D1: A ≤ split-point, D2: A > split-point).
The point that gives the minimum Gini index for attribute A is
selected as its split-point
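A minimal sketch of this search for a numerical attribute: sort the values, try the midpoint between each pair of adjacent distinct values, and keep the split-point with the lowest weighted Gini index (the sample values and labels below are made up for illustration):

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_numeric_split(values, labels):
        """Return (split_point, weighted_gini) minimising the weighted impurity."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = (None, float("inf"))
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue  # no midpoint between equal adjacent values
            split = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [c for v, c in pairs if v <= split]
            right = [c for v, c in pairs if v > split]
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            if weighted < best[1]:
                best = (split, weighted)
        return best

    humidity = [65, 70, 70, 75, 80, 85, 90, 95]
    play = ["yes", "yes", "yes", "yes", "no", "no", "no", "yes"]
    print(best_numeric_split(humidity, play))  # e.g. (77.5, 0.1875)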
45. Class Histogram
Two class histograms are used to store the class
distribution for numerical attributes.
46. Binary Split
Categorical Attributes
Examine the partitions resulting from all possible subsets of {a1, …, av}.
Each subset SA defines a binary test on attribute A of the form "A ∈ SA?".
There are 2^v possible subsets. Excluding the full set and the empty set, we have 2^v − 2 candidate subsets.
The subset that gives the minimum Gini index for attribute
A is selected as its splitting subset
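A sketch of the subset search for a categorical attribute: it enumerates the 2^v − 2 non-trivial subsets (each partition is visited twice, once per side, which does not change the result) and keeps the one with the lowest weighted Gini index. The sample values are illustrative:

    from itertools import combinations

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_categorical_split(values, labels):
        """Return (subset, weighted_gini) over all non-empty proper subsets."""
        domain = sorted(set(values))
        n = len(labels)
        best = (None, float("inf"))
        for size in range(1, len(domain)):      # excludes empty set and full set
            for subset in combinations(domain, size):
                inside = [c for v, c in zip(values, labels) if v in subset]
                outside = [c for v, c in zip(values, labels) if v not in subset]
                weighted = (len(inside) / n * gini(inside)
                            + len(outside) / n * gini(outside))
                if weighted < best[1]:
                    best = (set(subset), weighted)
        return best

    outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"]
    play = ["no", "no", "yes", "yes", "no", "yes"]
    print(best_categorical_split(outlook, play))  # e.g. ({'overcast'}, 0.25)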
47. Count Matrix
The count matrix stores the class distribution of
each value of a categorical attribute.
48. Decision tree construction algorithm
1. Information Gain
• ID3
• C4.5
• C5.0
• J48
2. Gini Index
• SPRINT
• SLIQ