10 Decision Tree
1. Prof. Neeraj Bhargava
Vishal Dutt
Department of Computer Science, School of Engineering & System Sciences
MDS University, Ajmer
2. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (we will talk about continuous-valued attributes later).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning (a sketch in code follows this list):
- All examples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
- There are no examples left.
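Taken together, these bullets describe the classic ID3-style recursion. The following is a minimal sketch, assuming examples are represented as dicts of categorical attribute values with a parallel list of class labels; the helper names (entropy, info_gain, build_tree) and the tiny dataset are my own illustration, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Entropy reduction from partitioning the examples on one categorical attribute."""
    n = len(labels)
    groups = {}
    for ex, y in zip(examples, labels):
        groups.setdefault(ex[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

def build_tree(examples, labels, attrs):
    # Stop: all examples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -- classify the leaf by majority vote.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: branch on the attribute with maximum information gain.
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))
    node = {"attr": best, "branches": {}}
    # Branching only on observed values means no branch is ever empty; if we
    # branched on all possible values, an empty branch would take the parent's
    # majority class (the "no examples left" stopping condition).
    for value in sorted({ex[best] for ex in examples}):
        sub = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == value]
        sub_ex, sub_y = [ex for ex, _ in sub], [y for _, y in sub]
        node["branches"][value] = build_tree(sub_ex, sub_y, [a for a in attrs if a != best])
    return node

# Tiny illustrative run (made-up data):
X = [{"outlook": "sunny", "windy": "false"},
     {"outlook": "sunny", "windy": "true"},
     {"outlook": "rain",  "windy": "false"},
     {"outlook": "rain",  "windy": "true"}]
y = ["yes", "yes", "yes", "no"]
print(build_tree(X, y, ["outlook", "windy"]))
```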
3. Building a compact tree
The key to building a decision tree is deciding which attribute to choose in order to branch.
The heuristic is to choose the attribute with the maximum Information Gain, based on information theory.
Put another way, we want to choose the attribute that reduces uncertainty as much as possible.
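Entropy and information gain are defined formally on the next few slides; as a preview, the sketch below (illustrative function names and made-up data, not from the slides) shows that an attribute which separates the classes cleanly has a higher gain than one that leaves the branches just as mixed:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Expected entropy reduction from splitting the labels by an attribute's values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
attr_a = ["t", "t", "f", "f"]     # separates the classes perfectly
attr_b = ["t", "f", "t", "f"]     # leaves both branches as mixed as before
print(info_gain(attr_a, labels))  # 1.0 -> maximum gain, branch on this one
print(info_gain(attr_b, labels))  # 0.0 -> no uncertainty reduction
```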
4. Information theory
Information theory provides a mathematical basis for measuring information content.
To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads.
If one already has a good guess about the answer, then the actual answer is less informative.
If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin.
5. Information theory (cont …)
For a fair (honest) coin, you have no advance information, and you are willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information.
Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits.
One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
6. Information theory
In general, if the possible answers v_i have probabilities P(v_i), then the information content I (entropy) of the actual answer is given by

I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)

For example, for the tossing of a fair coin we get

I\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}

If the coin is loaded to give 99% heads we get I \approx 0.08 bits, and as the probability of heads goes to 1, the information content of the actual answer goes to 0.
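These two numbers are easy to verify in a few lines of Python; the function name entropy_bits is my own:

```python
import math

def entropy_bits(probs):
    """I(p_1, ..., p_n) = -sum(p_i * log2(p_i)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin   -> 1.0
print(entropy_bits([0.99, 0.01]))  # loaded coin -> ~0.0808
```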
7. Back to decision tree learning
For a given example, what is the correct classification?
We may think of a decision tree as conveying information about the classification of the examples in the training table.
The entropy measure characterizes the (im)purity of an arbitrary collection of examples.
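To make the (im)purity reading concrete, the short sketch below (made-up labels, my own function name) shows that a pure collection has entropy 0, a 50/50 binary split has entropy 1 bit, and anything in between falls in between:

```python
import math
from collections import Counter

def collection_entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(collection_entropy(["yes"] * 8))               # pure        -> 0.0
print(collection_entropy(["yes"] * 4 + ["no"] * 4))  # 50/50 split -> 1.0
print(collection_entropy(["yes"] * 9 + ["no"] * 5))  # mixed       -> ~0.940
```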