The document discusses classification and prediction using decision trees and classification rules. It begins by defining classification as predicting categorical labels from data, such as predicting whether a loan applicant is "safe" or "risky", while prediction involves estimating continuous or ordered values, such as how much a customer will spend. It then describes how decision trees perform classification by recursively splitting the data into purer subsets based on attribute values, with leaf nodes representing class labels, using information gain as the criterion for selecting the splitting attribute. Because information gain is biased toward attributes with many values (such as ID codes), which can lead to overfitting, the gain ratio is introduced as a correction, and pruning is discussed as a way to avoid overfitting. The document then covers rule-based classification: the 1R algorithm, covering algorithms such as PRISM, and rule pruning based on statistical significance.
DBM630 Data Mining Classification and Prediction
1. DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 6
Classification and Prediction
Decision Tree and Classification Rules
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. Topics
What Is Classification, What Is Prediction?
Decision Tree
Classification Rule: Covering Algorithm
3. What Is Classification?
Case
A bank loans officer needs analysis of her data in order to learn
which loan applicants are “safe” and which are “risky” for the bank
A marketing manager needs data analysis to help guess whether a
customer with a given profile will buy a new computer or not
A medical researcher wants to analyze breast cancer data in order
to predict which one of three specific treatments a patient should receive
The data analysis task is classification, where the model or
classifier is constructed to predict categorical labels
The model is a classifier
4. What Is Prediction?
Suppose that the marketing manager would like to predict how
much a given customer will spend during a sale at the shop
This data analysis task is numeric prediction, where the model
constructed predicts a continuous value or ordered values, as
opposed to a categorical label
This model is a predictor
Regression analysis is a statistical methodology that is most often
used for numeric prediction
5. How does classification work?
Data classification is a two-step process
In the first step (the learning step or training phase):
A model is built describing a predetermined set of data classes or concepts
The data tuples used to build the classification model are called the training data set
If the class labels are provided, this step is known as supervised learning; otherwise it is called unsupervised learning
The learned model may be represented in the form of classification rules, decision trees, Bayesian classifiers, mathematical formulae, etc.
6. How does classification work?
In the second step,
The learned model is used for classification
Estimate the predictive accuracy of the model using a hold-out data set (a test set of class-labeled samples which are randomly selected and are independent of the training samples)
If the accuracy of the model were estimated on the training data set itself, the model would tend to overfit the data
If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown
In experiments, there are three kinds of data sets: the training data set, the hold-out data set (or validation data set), and the test data set
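As a minimal sketch of this two-step process (my illustration, assuming scikit-learn and its bundled iris data, neither of which appears in the lecture), a decision tree can be learned on a training split and its accuracy estimated on an independent hold-out split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1 (learning step): build the classifier from the training data set.
X, y = load_iris(return_X_y=True)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: estimate predictive accuracy on the independent hold-out set.
# Estimating accuracy on the training data itself looks overly optimistic,
# because the tree tends to overfit the training data.
print("hold-out accuracy:", accuracy_score(y_holdout, clf.predict(X_holdout)))
print("training accuracy:", accuracy_score(y_train, clf.predict(X_train)))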
7. Issues Regarding Classification/Prediction
Comparing classification methods
The criteria to compare and evaluate classification and prediction methods
Accuracy: an ability of a given classifier to correctly predict the class label of new
or unseen data
Speed: the computation costs involved in generating and using the given classifier
or predictor
Robustness: an ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values
Scalability: an ability to construct the classifier or predictor efficiently given large
amounts of data
Interpretability: the level of understanding and insight that is provided by the
classifier or predictor – subjective and more difficult to assess
8. Decision Tree
A decision tree is a flow-chart-like tree structure,
each internal node denotes a test on an attribute,
each branch represents an outcome of the test,
each leaf node represents a class
The top-most node in a tree is the root node
Instead of using the complete set of features jointly to make a decision, different subsets of features are used at different levels of the tree when making a decision
The decision tree below represents the concept buys_computer:
Age?
  <=30   → student?        (no → no, yes → yes)
  31…40  → yes
  >40    → credit_rating?  (excellent → no, fair → yes)
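As a small illustration (not from the slides), the buys_computer tree can be read as a chain of nested attribute tests; a hypothetical predict function mirroring it:

def buys_computer(age, student, credit_rating):
    # Hypothetical reading of the tree above as nested tests.
    if age == "<=30":                 # root test on Age
        return "yes" if student == "yes" else "no"
    elif age == "31..40":             # middle-aged customers: always yes
        return "yes"
    else:                             # age > 40: test on credit_rating
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))   # -> yes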
9. Decision Tree Induction
Normal procedure: a greedy, top-down algorithm in recursive divide-and-conquer fashion
First: attribute is selected for root node and branch is
created for each possible attribute value
Then: the instances are split into subsets (one for each
branch extending from the node)
Finally: procedure is repeated recursively for each
branch, using only instances that reach the branch
The process stops if:
all instances for a given node belong to the same class, or
there is no remaining attribute on which the samples may be further partitioned (a majority vote is employed), or
there is no sample for the branch to test the attribute (a majority vote is employed)
10. Decision Tree Representation
(An Example)
The decision tree (DT) of the weather example is:
Outlook   Temp.  Humid.  Windy  Play
sunny     hot    high    false  N
sunny     hot    high    true   N
overcast  hot    high    false  Y
rainy     mild   high    false  Y
rainy     cool   normal  false  Y
rainy     cool   normal  true   N
overcast  cool   normal  true   Y
sunny     mild   high    false  N
sunny     cool   normal  false  Y
rainy     mild   normal  false  Y
sunny     mild   normal  true   Y
overcast  mild   high    true   Y
overcast  hot    normal  false  Y
rainy     mild   high    true   N

Decision tree induced from the data:
outlook
  sunny    → humidity  (high → no, normal → yes)
  overcast → yes
  rainy    → windy     (false → yes, true → no)
11. An Example
(Which attribute is the best?)
There are four possible splits, one for each of the four attributes
12. Criteria for Attribute Selection
Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces the “purest”
nodes
Popular impurity criterion: information gain
Information gain increases with the average purity
of the subsets that an attribute produces
Strategy: the attribute with the highest information gain is chosen as the test attribute for the current node
13. Computing “Information”
Information is measured in bits
Given a probability distribution, the information required to predict an event is the distribution's entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Information gain measures the goodness of a split
Formula for computing expected information:
Let S be a set consisting of s data instances, and let the class label attribute have n distinct classes Ci (for i = 1, …, n)
Let si be the number of instances in class Ci
The expected information or entropy is
info([s1, s2, …, sn]) = entropy(s1/s, s2/s, …, sn/s) = - Σi pi log2(pi)
where pi is the probability that an instance belongs to class Ci, i.e., pi = si/s
Formula for computing information gain:
Find an information gain of attribute A
gain(A) = info. before splitting – info. after splitting
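A small helper (an illustrative sketch, not code from the lecture) that computes this expected information from a list of class counts:

from math import log2

def info(counts):
    # Expected information (entropy, in bits) of a list of class counts.
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

print(info([9, 5]))   # ~0.940 bits (the whole weather data set)
print(info([2, 3]))   # ~0.971 bits
print(info([4, 0]))   #  0     bits (a pure subset)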
14. Expected Information for “Outlook”
“Outlook” = “sunny”:
info([2,3]) = entropy(2/5, 3/5) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits
“Outlook” = “overcast”:
info([4,0]) = entropy(1, 0) = -(1)log2(1) - (0)log2(0) = 0 bits
“Outlook” = “rainy”:
info([3,2]) = entropy(3/5, 2/5) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits
Expected information for attribute “Outlook”:
info([2,3],[4,0],[3,2]) = (5/14)×info([2,3]) + (4/14)×info([4,0]) + (5/14)×info([3,2])
= (5/14)×0.971 + (4/14)×0 + (5/14)×0.971
= 0.693 bits
(The weather data table is the one shown on slide 10.)
15. Information Gain for “Outlook”
Information gain:
info. before splitting – info. after splitting
gain(”Outlook”) = info([9,5]) - info([2,3],[4,0],[3,2])
= 0.940-0.693
= 0.247 bits
Information gain for attributes from weather data:
gain(”Outlook”) = 0.247 bits
gain(”Temperature”) = 0.029 bits
gain(“Humidity”) = 0.152 bits
gain(“Windy”) = 0.048 bits
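The whole computation can be sketched end to end (again my illustration, not the lecture's code); run on the 14 weather instances it reproduces the four gains listed above:

from math import log2
from collections import Counter

# (Outlook, Temperature, Humidity, Windy, Play) for the 14 weather instances
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "Y"), ("rainy", "mild", "high", "false", "Y"),
    ("rainy", "cool", "normal", "false", "Y"), ("rainy", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "Y"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "Y"), ("rainy", "mild", "normal", "false", "Y"),
    ("sunny", "mild", "normal", "true", "Y"), ("overcast", "mild", "high", "true", "Y"),
    ("overcast", "hot", "normal", "false", "Y"), ("rainy", "mild", "high", "true", "N"),
]

def info(counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

def gain(col):
    # info before splitting minus the weighted info of the subsets after splitting
    before = info(list(Counter(row[-1] for row in data).values()))
    after = 0.0
    for value in {row[col] for row in data}:
        subset = [row[-1] for row in data if row[col] == value]
        after += len(subset) / len(data) * info(list(Counter(subset).values()))
    return before - after

for col, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(name, round(gain(col), 3))   # 0.247, 0.029, 0.152, 0.048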
16. An Example of Gain Criterion
(Which attribute is the best?)
Gain(outlook)     = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.247  (the best)
Gain(temperature) = info([9,5]) - info([2,2],[4,2],[3,1]) = 0.029
Gain(humidity)    = info([9,5]) - info([3,4],[6,1])       = 0.152
Gain(windy)       = info([9,5]) - info([6,2],[3,3])       = 0.048
17. Continuing to Split
If “Outlook” = “sunny”
gain(”Temperature”) = 0.571 bits
gain(“Humidity”) = 0.971 bits
gain(“Windy”) = 0.020 bits
18. The Final Decision Tree
Note: not all leaves need to be pure; sometimes identical
instances have different classes
Splitting stops when data can't be split any further
19. Properties for a Purity Measure
Properties we require from a purity measure:
When node is pure, measure should be zero
When impurity is maximal (i. e. all classes equally likely),
measure should be maximal
Measure should obey multistage property (i. e. decisions can be
made in several stages):
measure([2,3,4]) =
measure([2,7]) + (7/9) measure([3,4])
Entropy is the only function that satisfies all
three properties!
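A quick numeric check (illustrative, not part of the slides) that entropy satisfies the multistage property stated above:

from math import log2

def info(counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

# measure([2,3,4]) = measure([2,7]) + (7/9) * measure([3,4])
print(info([2, 3, 4]))                        # ~1.530 bits
print(info([2, 7]) + (7 / 9) * info([3, 4]))  # ~1.530 bits, the same value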
21. A Problem: Highly-Branching Attributes
Problematic: attributes with a large number of
values (extreme case: ID code)
Subsets are more likely to be pure if there is a
large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction) and
fragmentation
22. Example: Highly-Branching Attributes
ID  Outlook   Temp.  Humid.  Windy  Play
A   sunny     hot    high    false  N
B   sunny     hot    high    true   N
C   overcast  hot    high    false  Y
D   rainy     mild   high    false  Y
E   rainy     cool   normal  false  Y
F   rainy     cool   normal  true   N
G   overcast  cool   normal  true   Y
H   sunny     mild   high    false  N
I   sunny     cool   normal  false  Y
J   rainy     mild   normal  false  Y
K   sunny     mild   normal  true   Y
L   overcast  mild   high    true   Y
M   overcast  hot    normal  false  Y
N   rainy     mild   high    true   N

Splitting on ID gives one single-instance branch per value (A → no, B → no, …, M → yes, N → no), so every subset is pure:
Entropy / split info(ID) = info([0,1],[0,1],[1,0],…,[0,1]) = 0 bits
gain(ID) = 0.940 - 0 = 0.940 bits (the maximum)
23. Modification: The Gain Ratio as a Split Criterion
Gain ratio: a modification of the information
gain that reduces its bias
Gain ratio takes number and size of branches
into account when choosing an attribute
It corrects the information gain by taking the
intrinsic information of a split into account
Intrinsic information: entropy of distribution of
instances into branches
(i.e. how much info do we need to tell which branch an
instance belongs to)
24. Computing the Gain Ratio
Example: intrinsic information (split info) for the ID code:
intrinsic_info([1,1,…,1]) = 14 × ( -(1/14) log2(1/14) ) = 3.807
The value of an attribute decreases as its intrinsic information gets larger
Definition of gain ratio:
gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
Example:
gain_ratio("ID") = gain("ID") / intrinsic_info("ID") = 0.940 bits / 3.807 bits = 0.246
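A small sketch (mine, not the lecture's) of this correction for the ID attribute:

from math import log2

def info(counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

gain_id = info([9, 5]) - 0.0      # splitting on ID leaves only pure subsets
split_info_id = info([1] * 14)    # intrinsic info of 14 single-instance branches
print(round(split_info_id, 3))            # 3.807 bits
print(round(gain_id / split_info_id, 3))  # gain ratio ~0.247 (0.246 on the slide, rounding)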
25. Gain Ratio for Weather Data
26. Gain Ratio for Weather Data(Discussion)
“Outlook” still comes out top
However: “ID” has greater gain ratio
Standard fix: ad hoc test to prevent splitting on
that type of attribute
Problem with gain ratio: it may
overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix: only consider attributes with greater
than average information gain
27. Avoiding Overfitting the Data
The naïve DT algorithm grows each branch of
the tree just deeply enough to perfectly classify
the training examples.
This algorithm may produce trees that overfit
the training examples but do not work well for
general cases.
Reason: the training set may have some noise, or it may be too small to produce a representative sample of the true target tree (function).
28. Avoid Overfitting: Pruning
Pruning simplifies a decision tree to prevent overfitting to noise
in the data
Two main pruning strategies:
1. Prepruning: stops growing a tree when there is no statistically significant
association between any attribute and the class at a particular node.
Most popular test: the chi-squared test; only statistically significant
attributes were allowed to be selected by the information gain procedure
2. Postpruning: takes a fully-grown decision tree and discards unreliable
parts by two main pruning operations, i.e., subtree replacement and
subtree raising with some possible strategies, e.g., error estimation,
significance testing, MDL principle.
Prepruning is preferred in practice because of early stopping
29. Subtree Replacement
Bottom-up: tree is considered for replacement once all its
subtrees have been considered
30. Subtree Raising
Deletes node and redistributes instances
Slower than subtree replacement (Worthwhile?)
31. Tree to Rule vs. Rule to Tree
Tree → Rule
The tree (outlook: sunny → humidity [high → no, normal → yes]; overcast → yes; rainy → windy [false → yes, true → no]) converts directly to:
If outlook=sunny & humidity=high then class=no
If outlook=sunny & humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=false then class=yes
If outlook=rainy & windy=true then class=no

Rule → Tree?
If outlook=sunny & humidity=high then class=no
If humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=true then class=no
Question: what should the tree answer for outlook=rainy & windy=true & humidity=normal, or for outlook=rainy & windy=false & humidity=high?
32. Classification Rule: Algorithms
Two main algorithms are:
Inferring Rudimentary rules
1R: 1-level decision tree
Covering Algorithms:
Algorithm to construct the rules
Pruning Rules & Computing Significance
Hypergeometric Distribution vs. Binomial Distribution
Incremental Reduce-Error Pruning
33. Inferring Rudimentary Rules (1R Rule) (Holte, 93)
1R learns a 1-level decision tree
Generate a set of rules that all test on one particular attribute
Focus on each attribute
Pseudo-code
• For each attribute,
• For each value of the attribute, make a rule as
follows:
• count how often each class appears
• find the most frequent class
• make the rule assign that class to this
attribute-value
• Calculate the error rate of the rules
• Choose the rules with the smallest error rate
Note: “missing” can be treated as a separate attribute value
1R’s simple rules performed not much worse than much more complex
decision trees.
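A sketch of 1R for nominal attributes (illustrative code, not from the lecture); rows are tuples whose last element is the class, and on the weather data it would pick Outlook or Humidity with 4/14 errors:

from collections import Counter

def one_r(data, n_attrs):
    # For each attribute, build one rule per value (assign the most frequent
    # class); keep the attribute whose rule set makes the fewest errors.
    best = None                                   # (errors, attribute index, rules)
    for a in range(n_attrs):
        rules, errors = {}, 0
        for value in {row[a] for row in data}:
            classes = Counter(row[-1] for row in data if row[a] == value)
            majority, count = classes.most_common(1)[0]
            rules[value] = majority
            errors += sum(classes.values()) - count
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best

# usage (with the weather tuples shown earlier): errors, attr, rules = one_r(data, 4)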
34. An Example: Evaluating the Weather
Attributes (Nominal, Ordinal)
(The weather data is the nominal table shown on slide 10.)

Attribute      Rule                      Error   Total error
Outlook (O)    O = sunny    → no         2/5     4/14
               O = overcast → yes        0/4
               O = rainy    → yes        2/5
Temp. (T)      T = hot      → no         2/4     5/14
               T = mild     → yes        2/6
               T = cool     → yes        1/4
Humidity (H)   H = high     → no         3/7     4/14
               H = normal   → yes        1/7
Windy (W)      W = false    → yes        2/8     5/14
               W = true     → no         3/6

1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Outlook" or "Humidity".
35. An Example: Evaluating the Weather
Attributes (Numeric)
Outlook   Temp.  Humidity  Windy  Play
sunny     85     85        false  no
sunny     80     90        true   no
overcast  83     86        false  yes
rainy     70     96        false  yes
rainy     68     80        false  yes
rainy     65     70        true   no
overcast  64     65        true   yes
sunny     72     95        false  no
sunny     69     70        false  yes
rainy     75     80        false  yes
sunny     75     70        true   yes
overcast  72     90        true   yes
overcast  81     75        false  yes
rainy     71     91        true   no

Attribute      Rule                      Error   Total error
Outlook (O)    O = sunny      → no       2/5     4/14
               O = overcast   → yes      0/4
               O = rainy      → yes      2/5
Temp. (T)      T <= 77.5      → yes      3/10    5/14
               T > 77.5       → no       2/4
Humidity (H)   H <= 82.5      → yes      1/7     3/14
               82.5 < H <= 95.5 → no     2/6
               H > 95.5       → yes      0/1
Windy (W)      W = false      → yes      2/8     5/14
               W = true       → no       3/6

1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Humidity".
36. Dealing with Numeric Attributes
Numeric attributes are discretized: the range of the
attribute is divided into a set of intervals
Instances are sorted according to attribute’s values
Breakpoints are placed where the (majority) class changes
(so that the total error is minimized)
Example: Temperature from the weather data (values sorted left to right):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N
Breakpoints at every class change:
Y | N | Y Y Y | N N Y | Y Y | N | Y Y | N
With a minimum of 3 instances per interval (min = 3):
Y N Y Y Y | N N Y Y Y | N Y Y N
After merging adjacent intervals with the same majority category:
Y N Y Y Y N N Y Y Y | N Y Y N
37. Covering Algorithm
Separate-and-conquer: select the test that maximizes the number of covered positive examples and minimizes the number of negative examples that pass the test; it usually pays no attention to the examples that do not pass the test.
Divide-and-conquer: optimize for all outcomes of the test.
Separate-and-conquer algorithm:
Focus on each class in turn
Seek a way of covering all instances in the class
More rules could be added for a perfect rule set
Comparing to a decision tree (DT):
A decision tree uses divide-and-conquer: it focuses on all classes at each step and seeks an attribute to split on that best separates the classes
A DT can be converted into a rule set; the straightforward conversion gives an overly complex rule set, and more effective conversions are not trivial
In multiclass situations, a covering algorithm concentrates on one class at a time whereas a DT learner takes all classes into account
38. Constructing Classification Rule
(An Example)
(Instance-space plots of classes a and b omitted; thresholds x = 1.2 and y = 2.6 are marked in the plots.)
Classification rules built so far:
If x <= 1.2 then class = b
If x > 1.2 then class = b
Rule after adding a new test:
If x > 1.2 & y <= 2.6 then class = b
Corresponding decision tree:
x > 1.2?  no → b;  yes → y > 2.6?  no → b;  yes → ?
More rules could be added for a "perfect" rule set
39. A Simple Covering Algorithm
Generates a rule by adding tests that maximize the rule's accuracy, even though each new test reduces the rule's coverage
Similar to situation in decision trees: problem of selecting
an attribute to split
Decision tree inducer maximizes overall purity.
Covering algorithm maximizes rule accuracy.
Goal: maximizing accuracy
t: total number of instances covered by rule
p: positive examples of the class covered by rule
t-p: number of errors made by rule
One option: select test that maximizes the ratio p/t
We are finished when p/t = 1 or the set of instances
cannot be split any further.
40. An Example: Contact Lenses Data
Age             Spectacle prescription   Astigmatism   Tear prod. rate   Recom. lenses
young           myope                    no            reduced           none
young           myope                    no            normal            soft
young           myope                    yes           reduced           none
young           myope                    yes           normal            hard
young           hypermetrope             no            reduced           none
young           hypermetrope             no            normal            soft
young           hypermetrope             yes           reduced           none
young           hypermetrope             yes           normal            hard
pre-presbyopic  myope                    no            reduced           none
pre-presbyopic  myope                    no            normal            soft
pre-presbyopic  myope                    yes           reduced           none
pre-presbyopic  myope                    yes           normal            hard
pre-presbyopic  hypermetrope             no            reduced           none
pre-presbyopic  hypermetrope             no            normal            soft
pre-presbyopic  hypermetrope             yes           reduced           none
pre-presbyopic  hypermetrope             yes           normal            none
presbyopic      myope                    no            reduced           none
presbyopic      myope                    no            normal            none
presbyopic      myope                    yes           reduced           none
presbyopic      myope                    yes           normal            hard
presbyopic      hypermetrope             no            reduced           none
presbyopic      hypermetrope             no            normal            soft
presbyopic      hypermetrope             yes           reduced           none
presbyopic      hypermetrope             yes           normal            none

First, try to find a rule for "hard".
41. An Example: Contact Lenses Data
(Finding a good choice)
Rule we seek:
If ? then recommendation = hard
Possible tests:
Age = Young 2/8
Age = Pre-presbyopic 1/8
Age = Presbyopic 1/8
Spectacle prescription = Myope 3/12
Spectacle prescription = Hypermetrope 1/12
Astigmatism = no 0/12
Astigmatism = yes 4/12
Tear production rate = Reduced 0/12
Tear production rate = Normal 4/12
(Tie between "Astigmatism = yes" and "Tear production rate = Normal", both 4/12; either could be chosen.)
42. Modified Rule and Resulting Data
Rule with best test added:
If astigmatism = yes then recommendation = hard

(The data is the contact-lens table on slide 40; the rows with astigmatism = yes are the ones matched by the rule.)
• The matched rows are not all "hard", so the rule still needs to be refined.
43. Further Refinement
Current State:
If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young 2/4
Age = Pre-presbyopic 1/4
Age = Presbyopic 1/4
Spectacle prescription = Myope 3/6
Spectacle prescription = Hypermetrope 1/6
Tear production rate = Reduced 0/6
Tear production rate = Normal 4/6
44. Modified Rule and Resulting Data
Rule with best test added:
If astigmatism = yes and tear prod. rate = normal then recommendation = hard

(The data is the contact-lens table on slide 40; the rows with astigmatism = yes and tear prod. rate = normal are the ones matched by the rule.)
• The matched rows are not all "hard", so the rule still needs to be refined.
45. Further Refinement
Current State:
If astigmatism = yes and tear prod. rate = normal and ? then
recommendation = hard
Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3
Tie between the first and the fourth test (both have accuracy p/t = 1)
We choose the one with greater coverage (Spectacle prescription = Myope, which covers 3 instances)
46. Modified Rule and Resulting Data
Final rule with best test added:
If astigmatism = yes and tear prod. rate = normal and spectacle prescription = myope then recommendation = hard

(The data is the contact-lens table on slide 40; exactly three rows match the rule.)
• All three matching rows are "hard".
• There is no need to refine the rule further; the rule is now perfect.
47. Finding More Rules
Second rule for recommending “hard lenses”: (built from
instances not covered by first rule)
If age = young and astigmatism = yes and
tear production rate = normal then
recommendation = hard
These two rules cover all "hard lenses" cases:
(1) If astigmatism = yes & tear prod. rate = normal & spectacle prescription = myope then recommendation = hard
(2) If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
The process is then repeated with the other two classes, that is, "soft lenses" and "none".
48. Pseudo-code for PRISM Algorithm
For each class C
• Initialize E to the instance set
• While E contains instances in class C
• Create a rule R with an empty left-hand-side that
predicts class C
• Until R is perfect (or there are no more
attributes to use) do
• For each attribute A not mentioned in R, and
each value v,
• Consider adding the condition A = v to the
left-hand side of R
• Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with
the largest p)
• Add A = v to R
• Remove the instances covered by R from E
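A compact Python sketch of this pseudo-code (my illustration, not the lecture's own code); instances are dicts mapping attribute names to values, with the class stored under class_attr:

def prism(instances, attributes, class_attr):
    # For each class, repeatedly grow a rule by greedily adding the condition
    # with the best accuracy p/t (ties broken by larger p), until the rule is
    # perfect or no attributes remain; then remove the instances it covers.
    rules = []
    for c in sorted({row[class_attr] for row in instances}):
        E = list(instances)
        while any(row[class_attr] == c for row in E):
            conditions, covered = {}, list(E)
            while (any(row[class_attr] != c for row in covered)
                   and len(conditions) < len(attributes)):
                best = None                       # (p/t, p, attribute, value)
                for a in attributes:
                    if a in conditions:
                        continue
                    for v in {row[a] for row in covered}:
                        subset = [row for row in covered if row[a] == v]
                        p = sum(row[class_attr] == c for row in subset)
                        cand = (p / len(subset), p, a, v)
                        if best is None or cand[:2] > best[:2]:
                            best = cand
                conditions[best[2]] = best[3]
                covered = [row for row in covered if row[best[2]] == best[3]]
            rules.append((dict(conditions), c))
            E = [row for row in E
                 if not all(row[a] == v for a, v in conditions.items())]
    return rules

# usage (hypothetical attribute names for the contact-lens data):
# prism(rows, ["age", "spectacle", "astigmatism", "tear_rate"], "lenses")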
49. Order Dependency among Rules
PRISM without outerloop generates a decision list for
one class
Subsequent rules are designed for instances that are not covered by previous rules
Here, order does not matter because all rules predict the
same class
Outer loop considers all classes separately
No order dependence implied
Two problems are
overlapping rules
default rule required
50. Separate-and-Conquer
Methods like PRISM (for dealing with one class) are separate-and-
conquer algorithms:
First, a rule is identified
Then, all instances covered by the rule are separated out
Finally, the remaining instances are “conquered”
Difference to divide-and-conquer methods:
Subset covered by rule doesn’t need to be explored any further
Variety in separate-and-conquer approach.
Search method (e.g. greedy, beam search, ...)
Test selection criteria (e.g. accuracy, ...)
Pruning method (e.g. MDL, hold-out set, ...)
Stopping criterion (e.g. minimum accuracy)
Post-processing step
Also: Decision list vs. one rule set for each class
51. Good Rules and Bad Rules
(overview)
Sometimes it is better not to generate perfect rules that guarantee correct classification of all training instances, in order to avoid overfitting.
How do we decide which rules are worthwhile?
How do we tell when it becomes counterproductive to continue adding terms to a rule just to exclude a few pesky instances of the wrong type?
Two main strategies for pruning rules:
Global pruning (post-pruning): create all perfect rules, then prune
Incremental pruning (pre-pruning): prune a rule while it is being generated
Three pruning criteria:
MDL principle (Minimum Description Length): rule size + exceptions
Statistical significance (as in INDUCT)
Error on a hold-out set (reduced-error pruning)
52. Hypergeometric Distribution
The dataset contains T examples
The rule selects t examples
The class contains P examples
The p examples out of the t examples selected by the rule are correctly covered
(2×2 diagram contrasting P with T-P and p with t-p omitted.)
53. Computing Significance
We want the probability that a random rule does at least as well (the statistical significance of the rule):

m(R) = Σ (for i = p to min(t, P)) of  C(P, i) · C(T-P, t-i) / C(T, t)

Here, C(p, q) = p! / ( q! (p-q)! ) is the number of ways of choosing q items out of p.
54. Good/Bad Rules by Statistical significance
(An Example)
A reduced probability means a better rule; an increased probability means a worse rule.

(1) If astigmatism = yes then recommendation = hard
    success fraction = 4/12; no-information success fraction = 4/24
    P = p = 4, T = 24, t = 12
    probability of achieving 4/12 given 4/24 at random:
    m(R) = C(4,4)·C(20,8) / C(24,12) = (1 × 20!/(8!·12!)) / (24!/(12!·12!)) = (20!·12!) / (8!·24!) ≈ 0.047

(2) If astigmatism = yes and tear production rate = normal then recommendation = hard
    success fraction = 4/6; no-information success fraction = 4/24
    probability of achieving 4/6 given 4/24 at random ≈ 0.0014  → the best rule

(3) If astigmatism = yes and tear prod. rate = normal and age = young then recommendation = hard
    success fraction = 2/2; no-information success fraction = 4/24
    probability of achieving 2/2 given 4/24 at random ≈ 0.022  → worse than rule (2)
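These three values can be checked with a few lines of Python (an illustrative sketch using math.comb, not part of the lecture):

from math import comb

def m(T, P, t, p):
    # Hypergeometric tail m(R): probability that a random rule covering t of
    # the T instances gets at least p of the P class members right.
    return sum(comb(P, i) * comb(T - P, t - i)
               for i in range(p, min(t, P) + 1)) / comb(T, t)

# Contact-lens "hard" class: P = 4 positives out of T = 24 instances
print(round(m(24, 4, 12, 4), 4))   # rule (1): t = 12, p = 4  ->  ~0.047
print(round(m(24, 4, 6, 4), 4))    # rule (2): t = 6,  p = 4  ->  ~0.0014
print(round(m(24, 4, 2, 2), 4))    # rule (3): t = 2,  p = 2  ->  ~0.022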
55. Good/Bad Rules by Statistical significance
(Another Example)
(4) If astigmatism = yes and tear production rate = normal then recommendation = none
    success fraction = 2/6; no-information success fraction = 15/24
    probability = 0.985  → high probability, a bad rule

(5) If astigmatism = no and tear production rate = normal then recommendation = soft
    success fraction = 5/6; no-information success fraction = 5/24
    probability = 0.0001  → low probability, a good rule

(6) If tear production rate = reduced then recommendation = none
    success fraction = 12/12; no-information success fraction = 15/24
    probability = 0.0017
56. The Binomial Distribution
Approximation: can use sampling with replacement instead of sampling
without replacement
Dataset contains T examples
Rule selects t examples; the class contains P examples
p examples are correctly covered

m(R) ≈ Σ (for i = p to min(t, P)) of  C(t, i) · (P/T)^i · (1 - P/T)^(t-i)
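For comparison, a sketch of this binomial approximation (mine, not the lecture's); on the rule-(1) numbers from slide 54 it overestimates the exact hypergeometric value:

from math import comb

def m_binomial(T, P, t, p):
    # Binomial approximation of m(R): sampling with replacement.
    q = P / T
    return sum(comb(t, i) * q**i * (1 - q)**(t - i)
               for i in range(p, min(t, P) + 1))

print(round(m_binomial(24, 4, 12, 4), 3))   # ~0.089 (exact hypergeometric ~0.047)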
57. Pruning Strategies
For better estimation, a rule should be evaluated on data not used for
training.
This requires a growing set and a pruning set
Two options are
Reduced-error pruning for rules builds a full unpruned rule set and
simplifies it subsequently
Incremental reduced-error pruning simplifies a rule immediately after it
has been built.
58. INDUCT (Incremental Pruning Algorithm)
Initialize E to the instance set
Until E is empty do
For each class C for which E contains an instance
Use basic covering algorithm to create best perfect rule for C
Calculate significance m(R) for rule and significance
m(R-) for rule with final condition omitted
If (m(R-) < m(R)), prune rule and repeat previous step
From the rules for the different classes, select
the most significant one
(i.e. the one with smallest m(R))
Print the rule
Remove the instances covered by rule from E
Continue
INDUCT’s significance computation for a rule:
• Probability of completely random rule with same coverage performing at least as well.
• Random rule R selects t cases at random from the dataset
• We want to know how likely it is that p of these belong to the correct class?
• This probability is given by the hypergeometric distribution
59. Example:
Classification task is to predict whether a customer will buy a computer
RID age income student Credit_rating Class:buys_computer
1 youth High No Fair No
2 youth High No Excellent No
3 middle_age High No Fair Yes
4 senior Medium No Fair Yes
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No
7 middle_age Low Yes Excellent Yes
8 youth Medium No Fair No
9 youth Low Yes Fair Yes
10 senior Medium Yes Fair Yes
11 youth Medium Yes Excellent Yes
12 middle_age Medium No Excellent Yes
13 middle_age High Yes Fair Yes
14 senior Medium No Excellent No