The document discusses classification and prediction using decision trees and classification rules. It begins by defining classification as predicting categorical labels from data, such as predicting whether a loan applicant is "safe" or "risky", while prediction involves estimating continuous or ordered values, such as how much a customer will spend. It then describes how decision trees perform classification by recursively splitting the data into purer subsets based on attribute values, with leaf nodes representing class labels, using information gain as the criterion for selecting the splitting attribute. Because information gain is biased toward attributes with many values (such as ID codes), which can lead to overfitting, the gain ratio is introduced as a correction, and pruning is discussed as a way to avoid overfitting. The document then covers rule-based classification: the 1R algorithm, covering algorithms such as PRISM, and rule pruning based on statistical significance.
DBM630 Data Mining Classification and Prediction
1. DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 6
Classification and Prediction
Decision Tree and Classification Rules
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. Topics
What Is Classification, What Is Prediction?
Decision Tree
Classification Rule: Covering Algorithm
3. What Is Classification?
Case
A bank loans officer needs analysis of her data in order to learn
which loan applicants are “safe” and which are “risky” for the bank
A marketing manager needs data analysis to help guess whether a
customer with a given profile will buy a new computer or not
A medical researcher wants to analyze breast cancer data in order
to predict which one of three specific treatments a patient should receive
The data analysis task is classification, where the model or
classifier is constructed to predict categorical labels
The model is a classifier
4. What Is Prediction?
Suppose that the marketing manager would like to predict how
much a given customer will spend during a sale at the shop
This data analysis task is numeric prediction, where the model
constructed predicts a continuous value or ordered values, as
opposed to a categorical label
This model is a predictor
Regression analysis is a statistical methodology that is most often
used for numeric prediction
5. How does classification work?
Data classification is a two-step process
In the first step (the learning step or training phase):
A model is built describing a predetermined set of data classes or concepts
The data tuples used to build the classification model are called the training data set
If the class labels are provided, this step is known as supervised learning; otherwise it is called unsupervised learning
The learned model may be represented in the form of classification rules, decision trees, Bayesian classifiers, mathematical formulae, etc.
6. How does classification work?
In the second step,
The learned model is used for classification
Estimate the predictive accuracy of the model using a hold-out data set (a test set of class-labeled samples which are randomly selected and are independent of the training samples)
If the accuracy of the model were estimated on the training data set itself, the model would tend to overfit the data
If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown
In experiments, there are three kinds of data sets: the training data set, the hold-out data set (or validation data set), and the test data set
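As a minimal sketch of this two-step process (my illustration, assuming scikit-learn and its bundled iris data, neither of which appears in the lecture), a decision tree can be learned on a training split and its accuracy estimated on an independent hold-out split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1 (learning step): build the classifier from the training data set.
X, y = load_iris(return_X_y=True)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: estimate predictive accuracy on the independent hold-out set.
# Estimating accuracy on the training data itself looks overly optimistic,
# because the tree tends to overfit the training data.
print("hold-out accuracy:", accuracy_score(y_holdout, clf.predict(X_holdout)))
print("training accuracy:", accuracy_score(y_train, clf.predict(X_train)))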
7. Issues Regarding Classification/Prediction
Comparing classification methods
The criteria to compare and evaluate classification and prediction methods
Accuracy: an ability of a given classifier to correctly predict the class label of new
or unseen data
Speed: the computation costs involved in generating and using the given classifier
or predictor
Robustness: an ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values
Scalability: an ability to construct the classifier or predictor efficiently given large
amounts of data
Interpretability: the level of understanding and insight that is provided by the
classifier or predictor – subjective and more difficult to assess
8. Decision Tree
A decision tree is a flow-chart-like tree structure,
each internal node denotes a test on an attribute,
each branch represents an outcome of the test,
each leaf node represents a class
The top-most node in a tree is the root node
Instead of using the complete set of features jointly to make a decision, different subsets of features are used at different levels of the tree when making a decision
The decision tree below represents the concept buys_computer:
Age?
  <=30   → student?        (no → no, yes → yes)
  31…40  → yes
  >40    → credit_rating?  (excellent → no, fair → yes)
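As a small illustration (not from the slides), the buys_computer tree can be read as a chain of nested attribute tests; a hypothetical predict function mirroring it:

def buys_computer(age, student, credit_rating):
    # Hypothetical reading of the tree above as nested tests.
    if age == "<=30":                 # root test on Age
        return "yes" if student == "yes" else "no"
    elif age == "31..40":             # middle-aged customers: always yes
        return "yes"
    else:                             # age > 40: test on credit_rating
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))   # -> yes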
9. Decision Tree Induction
Normal procedure: a greedy, top-down algorithm in recursive divide-and-conquer fashion
First: attribute is selected for root node and branch is
created for each possible attribute value
Then: the instances are split into subsets (one for each
branch extending from the node)
Finally: procedure is repeated recursively for each
branch, using only instances that reach the branch
The process stops if:
all instances for a given node belong to the same class, or
there is no remaining attribute on which the samples may be further partitioned (a majority vote is employed), or
there is no sample for the branch to test the attribute (a majority vote is employed)
10. Decision Tree Representation
(An Example)
The decision tree (DT) of the weather example is:
Outlook   Temp.  Humid.  Windy  Play
sunny     hot    high    false  N
sunny     hot    high    true   N
overcast  hot    high    false  Y
rainy     mild   high    false  Y
rainy     cool   normal  false  Y
rainy     cool   normal  true   N
overcast  cool   normal  true   Y
sunny     mild   high    false  N
sunny     cool   normal  false  Y
rainy     mild   normal  false  Y
sunny     mild   normal  true   Y
overcast  mild   high    true   Y
overcast  hot    normal  false  Y
rainy     mild   high    true   N

Decision tree induced from the data:
outlook
  sunny    → humidity  (high → no, normal → yes)
  overcast → yes
  rainy    → windy     (false → yes, true → no)
11. An Example
(Which attribute is the best?)
There are four possible splits, one for each of the four attributes
12. Criteria for Attribute Selection
Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces the “purest”
nodes
Popular impurity criterion: information gain
Information gain increases with the average purity
of the subsets that an attribute produces
Strategy: the attribute with the highest information gain is chosen as the test attribute for the current node
13. Computing “Information”
Information is measured in bits
Given a probability distribution, the information required to predict an event is the distribution's entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Information gain measures the goodness of a split
Formula for computing expected information:
Let S be a set consisting of s data instances, and let the class label attribute have n distinct classes Ci (for i = 1, …, n)
Let si be the number of instances in class Ci
The expected information or entropy is
info([s1, s2, …, sn]) = entropy(s1/s, s2/s, …, sn/s) = - Σi pi log2(pi)
where pi is the probability that an instance belongs to class Ci, i.e., pi = si/s
Formula for computing information gain:
Find an information gain of attribute A
gain(A) = info. before splitting – info. after splitting
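A small helper (an illustrative sketch, not code from the lecture) that computes this expected information from a list of class counts:

from math import log2

def info(counts):
    # Expected information (entropy, in bits) of a list of class counts.
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

print(info([9, 5]))   # ~0.940 bits (the whole weather data set)
print(info([2, 3]))   # ~0.971 bits
print(info([4, 0]))   #  0     bits (a pure subset)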
14. Expected Information for “Outlook”
“Outlook” = “sunny”:
info([2,3]) = entropy(2/5, 3/5) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits
“Outlook” = “overcast”:
info([4,0]) = entropy(1, 0) = -(1)log2(1) - (0)log2(0) = 0 bits
“Outlook” = “rainy”:
info([3,2]) = entropy(3/5, 2/5) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits
Expected information for attribute “Outlook”:
info([2,3],[4,0],[3,2]) = (5/14)×info([2,3]) + (4/14)×info([4,0]) + (5/14)×info([3,2])
= (5/14)×0.971 + (4/14)×0 + (5/14)×0.971
= 0.693 bits
(The weather data table is the one shown on slide 10.)
15. Information Gain for “Outlook”
Information gain:
info. before splitting – info. after splitting
gain(”Outlook”) = info([9,5]) - info([2,3],[4,0],[3,2])
= 0.940-0.693
= 0.247 bits
Information gain for attributes from weather data:
gain(”Outlook”) = 0.247 bits
gain(”Temperature”) = 0.029 bits
gain(“Humidity”) = 0.152 bits
gain(“Windy”) = 0.048 bits
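The whole computation can be sketched end to end (again my illustration, not the lecture's code); run on the 14 weather instances it reproduces the four gains listed above:

from math import log2
from collections import Counter

# (Outlook, Temperature, Humidity, Windy, Play) for the 14 weather instances
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "Y"), ("rainy", "mild", "high", "false", "Y"),
    ("rainy", "cool", "normal", "false", "Y"), ("rainy", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "Y"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "Y"), ("rainy", "mild", "normal", "false", "Y"),
    ("sunny", "mild", "normal", "true", "Y"), ("overcast", "mild", "high", "true", "Y"),
    ("overcast", "hot", "normal", "false", "Y"), ("rainy", "mild", "high", "true", "N"),
]

def info(counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

def gain(col):
    # info before splitting minus the weighted info of the subsets after splitting
    before = info(list(Counter(row[-1] for row in data).values()))
    after = 0.0
    for value in {row[col] for row in data}:
        subset = [row[-1] for row in data if row[col] == value]
        after += len(subset) / len(data) * info(list(Counter(subset).values()))
    return before - after

for col, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(name, round(gain(col), 3))   # 0.247, 0.029, 0.152, 0.048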
16. An Example of Gain Criterion
(Which attribute is the best?)
Gain(outlook)     = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.247  (the best)
Gain(temperature) = info([9,5]) - info([2,2],[4,2],[3,1]) = 0.029
Gain(humidity)    = info([9,5]) - info([3,4],[6,1])       = 0.152
Gain(windy)       = info([9,5]) - info([6,2],[3,3])       = 0.048
17. Continuing to Split
If “Outlook” = “sunny”
gain(”Temperature”) = 0.571 bits
gain(“Humidity”) = 0.971 bits
gain(“Windy”) = 0.020 bits
18. The Final Decision Tree
Note: not all leaves need to be pure; sometimes identical
instances have different classes
Splitting stops when data can't be split any further
19. Properties for a Purity Measure
Properties we require from a purity measure:
When node is pure, measure should be zero
When impurity is maximal (i. e. all classes equally likely),
measure should be maximal
Measure should obey multistage property (i. e. decisions can be
made in several stages):
measure([2,3,4]) =
measure([2,7]) + (7/9) measure([3,4])
Entropy is the only function that satisfies all
three properties!
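A quick numeric check (illustrative, not part of the slides) that entropy satisfies the multistage property stated above:

from math import log2

def info(counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

# measure([2,3,4]) = measure([2,7]) + (7/9) * measure([3,4])
print(info([2, 3, 4]))                        # ~1.530 bits
print(info([2, 7]) + (7 / 9) * info([3, 4]))  # ~1.530 bits, the same value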
21. A Problem: Highly-Branching Attributes
Problematic: attributes with a large number of
values (extreme case: ID code)
Subsets are more likely to be pure if there is a
large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction) and
fragmentation
22. Example: Highly-Branching Attributes
ID  Outlook   Temp.  Humid.  Windy  Play
A   sunny     hot    high    false  N
B   sunny     hot    high    true   N
C   overcast  hot    high    false  Y
D   rainy     mild   high    false  Y
E   rainy     cool   normal  false  Y
F   rainy     cool   normal  true   N
G   overcast  cool   normal  true   Y
H   sunny     mild   high    false  N
I   sunny     cool   normal  false  Y
J   rainy     mild   normal  false  Y
K   sunny     mild   normal  true   Y
L   overcast  mild   high    true   Y
M   overcast  hot    normal  false  Y
N   rainy     mild   high    true   N

Splitting on ID gives one single-instance branch per value (A → no, B → no, …, M → yes, N → no), so every subset is pure:
Entropy / split info(ID) = info([0,1],[0,1],[1,0],…,[0,1]) = 0 bits
gain(ID) = 0.940 - 0 = 0.940 bits (the maximum)
23. Modification: The Gain Ratio as a Split Criterion
Gain ratio: a modification of the information
gain that reduces its bias
Gain ratio takes number and size of branches
into account when choosing an attribute
It corrects the information gain by taking the
intrinsic information of a split into account
Intrinsic information: entropy of distribution of
instances into branches
(i.e. how much info do we need to tell which branch an
instance belongs to)
24. Computing the Gain Ratio
Example: intrinsic information (split info) for the ID code:
intrinsic_info([1,1,…,1]) = 14 × ( -(1/14) log2(1/14) ) = 3.807
The value of an attribute decreases as its intrinsic information gets larger
Definition of gain ratio:
gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
Example:
gain_ratio("ID") = gain("ID") / intrinsic_info("ID") = 0.940 bits / 3.807 bits = 0.246
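A small sketch (mine, not the lecture's) of this correction for the ID attribute:

from math import log2

def info(counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

gain_id = info([9, 5]) - 0.0      # splitting on ID leaves only pure subsets
split_info_id = info([1] * 14)    # intrinsic info of 14 single-instance branches
print(round(split_info_id, 3))            # 3.807 bits
print(round(gain_id / split_info_id, 3))  # gain ratio ~0.247 (0.246 on the slide, rounding)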
25. Gain Ratio for Weather Data
26. Gain Ratio for Weather Data(Discussion)
“Outlook” still comes out top
However: “ID” has greater gain ratio
Standard fix: ad hoc test to prevent splitting on
that type of attribute
Problem with gain ratio: it may
overcompensate
May choose an attribute just because its intrinsic
information is very low
Standard fix: only consider attributes with greater
than average information gain
27. Avoiding Overfitting the Data
The naïve DT algorithm grows each branch of
the tree just deeply enough to perfectly classify
the training examples.
This algorithm may produce trees that overfit
the training examples but do not work well for
general cases.
Reason: the training set may have some noise, or it may be too small to produce a representative sample of the true target tree (function).
28. Avoid Overfitting: Pruning
Pruning simplifies a decision tree to prevent overfitting to noise
in the data
Two main pruning strategies:
1. Prepruning: stops growing a tree when there is no statistically significant
association between any attribute and the class at a particular node.
Most popular test: the chi-squared test; only statistically significant
attributes were allowed to be selected by the information gain procedure
2. Postpruning: takes a fully-grown decision tree and discards unreliable
parts by two main pruning operations, i.e., subtree replacement and
subtree raising with some possible strategies, e.g., error estimation,
significance testing, MDL principle.
Prepruning is preferred in practice because of early stopping
29. Subtree Replacement
Bottom-up: tree is considered for replacement once all its
subtrees have been considered
30. Subtree Raising
Deletes node and redistributes instances
Slower than subtree replacement (Worthwhile?)
31. Tree to Rule vs. Rule to Tree
Tree → Rule
The tree (outlook: sunny → humidity [high → no, normal → yes]; overcast → yes; rainy → windy [false → yes, true → no]) converts directly to:
If outlook=sunny & humidity=high then class=no
If outlook=sunny & humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=false then class=yes
If outlook=rainy & windy=true then class=no

Rule → Tree?
If outlook=sunny & humidity=high then class=no
If humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=true then class=no
Question: what should the tree answer for outlook=rainy & windy=true & humidity=normal, or for outlook=rainy & windy=false & humidity=high?
32. Classification Rule: Algorithms
Two main algorithms are:
Inferring Rudimentary rules
1R: 1-level decision tree
Covering Algorithms:
Algorithm to construct the rules
Pruning Rules & Computing Significance
Hypergeometric Distribution vs. Binomial Distribution
Incremental Reduce-Error Pruning
33. Inferring Rudimentary Rules (1R Rule) (Holte, 93)
1R learns a 1-level decision tree
Generate a set of rules that all test on one particular attribute
Focus on each attribute
Pseudo-code
• For each attribute,
• For each value of the attribute, make a rule as
follows:
• count how often each class appears
• find the most frequent class
• make the rule assign that class to this
attribute-value
• Calculate the error rate of the rules
• Choose the rules with the smallest error rate
Note: “missing” can be treated as a separate attribute value
1R’s simple rules performed not much worse than much more complex
decision trees.
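A sketch of 1R for nominal attributes (illustrative code, not from the lecture); rows are tuples whose last element is the class, and on the weather data it would pick Outlook or Humidity with 4/14 errors:

from collections import Counter

def one_r(data, n_attrs):
    # For each attribute, build one rule per value (assign the most frequent
    # class); keep the attribute whose rule set makes the fewest errors.
    best = None                                   # (errors, attribute index, rules)
    for a in range(n_attrs):
        rules, errors = {}, 0
        for value in {row[a] for row in data}:
            classes = Counter(row[-1] for row in data if row[a] == value)
            majority, count = classes.most_common(1)[0]
            rules[value] = majority
            errors += sum(classes.values()) - count
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best

# usage (with the weather tuples shown earlier): errors, attr, rules = one_r(data, 4)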
34. An Example: Evaluating the Weather
Attributes (Nominal, Ordinal)
(The weather data is the nominal table shown on slide 10.)

Attribute      Rule                      Error   Total error
Outlook (O)    O = sunny    → no         2/5     4/14
               O = overcast → yes        0/4
               O = rainy    → yes        2/5
Temp. (T)      T = hot      → no         2/4     5/14
               T = mild     → yes        2/6
               T = cool     → yes        1/4
Humidity (H)   H = high     → no         3/7     4/14
               H = normal   → yes        1/7
Windy (W)      W = false    → yes        2/8     5/14
               W = true     → no         3/6

1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Outlook" or "Humidity".
35. An Example: Evaluating the Weather
Attributes (Numeric)
Outlook   Temp.  Humidity  Windy  Play
sunny     85     85        false  no
sunny     80     90        true   no
overcast  83     86        false  yes
rainy     70     96        false  yes
rainy     68     80        false  yes
rainy     65     70        true   no
overcast  64     65        true   yes
sunny     72     95        false  no
sunny     69     70        false  yes
rainy     75     80        false  yes
sunny     75     70        true   yes
overcast  72     90        true   yes
overcast  81     75        false  yes
rainy     71     91        true   no

Attribute      Rule                      Error   Total error
Outlook (O)    O = sunny      → no       2/5     4/14
               O = overcast   → yes      0/4
               O = rainy      → yes      2/5
Temp. (T)      T <= 77.5      → yes      3/10    5/14
               T > 77.5       → no       2/4
Humidity (H)   H <= 82.5      → yes      1/7     3/14
               82.5 < H <= 95.5 → no     2/6
               H > 95.5       → yes      0/1
Windy (W)      W = false      → yes      2/8     5/14
               W = true       → no       3/6

1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Humidity".
36. Dealing with Numeric Attributes
Numeric attributes are discretized: the range of the
attribute is divided into a set of intervals
Instances are sorted according to attribute’s values
Breakpoints are placed where the (majority) class changes
(so that the total error is minimized)
Example: Temperature from the weather data (values sorted left to right):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N
Breakpoints at every class change:
Y | N | Y Y Y | N N Y | Y Y | N | Y Y | N
With a minimum of 3 instances per interval (min = 3):
Y N Y Y Y | N N Y Y Y | N Y Y N
After merging adjacent intervals with the same majority category:
Y N Y Y Y N N Y Y Y | N Y Y N
37. Covering Algorithm
Separate-and-conquer: select the test that maximizes the number of covered positive examples and minimizes the number of negative examples that pass the test; it usually pays no attention to the examples that do not pass the test.
Divide-and-conquer: optimize for all outcomes of the test.
Separate-and-conquer algorithm:
Focus on each class in turn
Seek a way of covering all instances in the class
More rules could be added for a perfect rule set
Comparing to a decision tree (DT):
A decision tree uses divide-and-conquer: it focuses on all classes at each step and seeks an attribute to split on that best separates the classes
A DT can be converted into a rule set; the straightforward conversion gives an overly complex rule set, and more effective conversions are not trivial
In multiclass situations, a covering algorithm concentrates on one class at a time whereas a DT learner takes all classes into account
38. Constructing Classification Rule
(An Example)
(Instance-space plots of classes a and b omitted; thresholds x = 1.2 and y = 2.6 are marked in the plots.)
Classification rules built so far:
If x <= 1.2 then class = b
If x > 1.2 then class = b
Rule after adding a new test:
If x > 1.2 & y <= 2.6 then class = b
Corresponding decision tree:
x > 1.2?  no → b;  yes → y > 2.6?  no → b;  yes → ?
More rules could be added for a "perfect" rule set
39. A Simple Covering Algorithm
Generates a rule by adding tests that maximize the rule's accuracy, even though each new test reduces the rule's coverage
Similar to situation in decision trees: problem of selecting
an attribute to split
Decision tree inducer maximizes overall purity.
Covering algorithm maximizes rule accuracy.
Goal: maximizing accuracy
t: total number of instances covered by rule
p: positive examples of the class covered by rule
t-p: number of errors made by rule
One option: select test that maximizes the ratio p/t
We are finished when p/t = 1 or the set of instances
cannot be split any further.
40. An Example: Contact Lenses Data
Age             Spectacle prescription   Astigmatism   Tear prod. rate   Recom. lenses
young           myope                    no            reduced           none
young           myope                    no            normal            soft
young           myope                    yes           reduced           none
young           myope                    yes           normal            hard
young           hypermetrope             no            reduced           none
young           hypermetrope             no            normal            soft
young           hypermetrope             yes           reduced           none
young           hypermetrope             yes           normal            hard
pre-presbyopic  myope                    no            reduced           none
pre-presbyopic  myope                    no            normal            soft
pre-presbyopic  myope                    yes           reduced           none
pre-presbyopic  myope                    yes           normal            hard
pre-presbyopic  hypermetrope             no            reduced           none
pre-presbyopic  hypermetrope             no            normal            soft
pre-presbyopic  hypermetrope             yes           reduced           none
pre-presbyopic  hypermetrope             yes           normal            none
presbyopic      myope                    no            reduced           none
presbyopic      myope                    no            normal            none
presbyopic      myope                    yes           reduced           none
presbyopic      myope                    yes           normal            hard
presbyopic      hypermetrope             no            reduced           none
presbyopic      hypermetrope             no            normal            soft
presbyopic      hypermetrope             yes           reduced           none
presbyopic      hypermetrope             yes           normal            none

First, try to find a rule for "hard".
41. An Example: Contact Lenses Data
(Finding a good choice)
Rule we seek:
If ? then recommendation = hard
Possible tests:
Age = Young 2/8
Age = Pre-presbyopic 1/8
Age = Presbyopic 1/8
Spectacle prescription = Myope 3/12
Spectacle prescription = Hypermetrope 1/12
Astigmatism = no 0/12
Astigmatism = yes 4/12
Tear production rate = Reduced 0/12
Tear production rate = Normal 4/12
(Tie between "Astigmatism = yes" and "Tear production rate = Normal", both 4/12; either could be chosen.)
42. Modified Rule and Resulting Data
Rule with best test added:
If astigmatism = yes then recommendation = hard

(The data is the contact-lens table on slide 40; the rows with astigmatism = yes are the ones matched by the rule.)
• The matched rows are not all "hard", so the rule still needs to be refined.
43. Further Refinement
Current State:
If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young 2/4
Age = Pre-presbyopic 1/4
Age = Presbyopic 1/4
Spectacle prescription = Myope 3/6
Spectacle prescription = Hypermetrope 1/6
Tear production rate = Reduced 0/6
Tear production rate = Normal 4/6
44. Modified Rule and Resulting Data
Rule with best test added:
If astigmatism = yes and tear prod. rate = normal then recommendation = hard

(The data is the contact-lens table on slide 40; the rows with astigmatism = yes and tear prod. rate = normal are the ones matched by the rule.)
• The matched rows are not all "hard", so the rule still needs to be refined.
45. Further Refinement
Current State:
If astigmatism = yes and tear prod. rate = normal and ? then
recommendation = hard
Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3
Tie between the first and the fourth test (both have accuracy p/t = 1)
We choose the one with greater coverage (Spectacle prescription = Myope, which covers 3 instances)
46. Modified Rule and Resulting Data
Final rule with best test added:
If astigmatism = yes and tear prod. rate = normal and spectacle prescription = myope then recommendation = hard

(The data is the contact-lens table on slide 40; exactly three rows match the rule.)
• All three matching rows are "hard".
• There is no need to refine the rule further; the rule is now perfect.
47. Finding More Rules
Second rule for recommending “hard lenses”: (built from
instances not covered by first rule)
If age = young and astigmatism = yes and
tear production rate = normal then
recommendation = hard
These two rules cover all "hard lenses" cases:
(1) If astigmatism = yes & tear prod. rate = normal & spectacle prescription = myope then recommendation = hard
(2) If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
The process is then repeated with the other two classes, that is, "soft lenses" and "none".
48. Pseudo-code for PRISM Algorithm
For each class C
• Initialize E to the instance set
• While E contains instances in class C
• Create a rule R with an empty left-hand-side that
predicts class C
• Until R is perfect (or there are no more
attributes to use) do
• For each attribute A not mentioned in R, and
each value v,
• Consider adding the condition A = v to the
left-hand side of R
• Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with
the largest p)
• Add A = v to R
• Remove the instances covered by R from E
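A compact Python sketch of this pseudo-code (my illustration, not the lecture's own code); instances are dicts mapping attribute names to values, with the class stored under class_attr:

def prism(instances, attributes, class_attr):
    # For each class, repeatedly grow a rule by greedily adding the condition
    # with the best accuracy p/t (ties broken by larger p), until the rule is
    # perfect or no attributes remain; then remove the instances it covers.
    rules = []
    for c in sorted({row[class_attr] for row in instances}):
        E = list(instances)
        while any(row[class_attr] == c for row in E):
            conditions, covered = {}, list(E)
            while (any(row[class_attr] != c for row in covered)
                   and len(conditions) < len(attributes)):
                best = None                       # (p/t, p, attribute, value)
                for a in attributes:
                    if a in conditions:
                        continue
                    for v in {row[a] for row in covered}:
                        subset = [row for row in covered if row[a] == v]
                        p = sum(row[class_attr] == c for row in subset)
                        cand = (p / len(subset), p, a, v)
                        if best is None or cand[:2] > best[:2]:
                            best = cand
                conditions[best[2]] = best[3]
                covered = [row for row in covered if row[best[2]] == best[3]]
            rules.append((dict(conditions), c))
            E = [row for row in E
                 if not all(row[a] == v for a, v in conditions.items())]
    return rules

# usage (hypothetical attribute names for the contact-lens data):
# prism(rows, ["age", "spectacle", "astigmatism", "tear_rate"], "lenses")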
49. Order Dependency among Rules
PRISM without outerloop generates a decision list for
one class
Subsequent rules are designed for instances that are not covered by previous rules
Here, order does not matter because all rules predict the
same class
Outer loop considers all classes separately
No order dependence implied
Two problems are
overlapping rules
default rule required
50. Separate-and-Conquer
Methods like PRISM (for dealing with one class) are separate-and-
conquer algorithms:
First, a rule is identified
Then, all instances covered by the rule are separated out
Finally, the remaining instances are “conquered”
Difference to divide-and-conquer methods:
Subset covered by rule doesn’t need to be explored any further
Variety in separate-and-conquer approach.
Search method (e.g. greedy, beam search, ...)
Test selection criteria (e.g. accuracy, ...)
Pruning method (e.g. MDL, hold-out set, ...)
Stopping criterion (e.g. minimum accuracy)
Post-processing step
Also: Decision list vs. one rule set for each class
51. Good Rules and Bad Rules
(overview)
Sometimes it is better not to generate perfect rules that guarantee correct classification of all training instances, in order to avoid overfitting.
How do we decide which rules are worthwhile?
How do we tell when it becomes counterproductive to continue adding terms to a rule just to exclude a few pesky instances of the wrong type?
Two main strategies for pruning rules:
Global pruning (post-pruning): create all perfect rules, then prune
Incremental pruning (pre-pruning): prune a rule while it is being generated
Three pruning criteria:
MDL principle (Minimum Description Length): rule size + exceptions
Statistical significance (as in INDUCT)
Error on a hold-out set (reduced-error pruning)
52. Hypergeometric Distribution
The dataset contains T examples
The rule selects t examples
The class contains P examples
The p examples out of the t examples selected by the rule are correctly covered
(2×2 diagram contrasting P with T-P and p with t-p omitted.)
53. Computing Significance
We want the probability that a random rule does at least as well (the statistical significance of the rule):

m(R) = Σ (for i = p to min(t, P)) of  C(P, i) · C(T-P, t-i) / C(T, t)

Here, C(p, q) = p! / ( q! (p-q)! ) is the number of ways of choosing q items out of p.
54. Good/Bad Rules by Statistical significance
(An Example)
A reduced probability means a better rule; an increased probability means a worse rule.

(1) If astigmatism = yes then recommendation = hard
    success fraction = 4/12; no-information success fraction = 4/24
    P = p = 4, T = 24, t = 12
    probability of achieving 4/12 given 4/24 at random:
    m(R) = C(4,4)·C(20,8) / C(24,12) = (1 × 20!/(8!·12!)) / (24!/(12!·12!)) = (20!·12!) / (8!·24!) ≈ 0.047

(2) If astigmatism = yes and tear production rate = normal then recommendation = hard
    success fraction = 4/6; no-information success fraction = 4/24
    probability of achieving 4/6 given 4/24 at random ≈ 0.0014  → the best rule

(3) If astigmatism = yes and tear prod. rate = normal and age = young then recommendation = hard
    success fraction = 2/2; no-information success fraction = 4/24
    probability of achieving 2/2 given 4/24 at random ≈ 0.022  → worse than rule (2)
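These three values can be checked with a few lines of Python (an illustrative sketch using math.comb, not part of the lecture):

from math import comb

def m(T, P, t, p):
    # Hypergeometric tail m(R): probability that a random rule covering t of
    # the T instances gets at least p of the P class members right.
    return sum(comb(P, i) * comb(T - P, t - i)
               for i in range(p, min(t, P) + 1)) / comb(T, t)

# Contact-lens "hard" class: P = 4 positives out of T = 24 instances
print(round(m(24, 4, 12, 4), 4))   # rule (1): t = 12, p = 4  ->  ~0.047
print(round(m(24, 4, 6, 4), 4))    # rule (2): t = 6,  p = 4  ->  ~0.0014
print(round(m(24, 4, 2, 2), 4))    # rule (3): t = 2,  p = 2  ->  ~0.022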
55. Good/Bad Rules by Statistical significance
(Another Example)
(4) If astigmatism = yes and tear production rate = normal then recommendation = none
    success fraction = 2/6; no-information success fraction = 15/24
    probability = 0.985  → high probability, a bad rule

(5) If astigmatism = no and tear production rate = normal then recommendation = soft
    success fraction = 5/6; no-information success fraction = 5/24
    probability = 0.0001  → low probability, a good rule

(6) If tear production rate = reduced then recommendation = none
    success fraction = 12/12; no-information success fraction = 15/24
    probability = 0.0017
56. The Binomial Distribution
Approximation: can use sampling with replacement instead of sampling
without replacement
Dataset contains T examples
Rule selects t examples; the class contains P examples
p examples are correctly covered

m(R) ≈ Σ (for i = p to min(t, P)) of  C(t, i) · (P/T)^i · (1 - P/T)^(t-i)
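For comparison, a sketch of this binomial approximation (mine, not the lecture's); on the rule-(1) numbers from slide 54 it overestimates the exact hypergeometric value:

from math import comb

def m_binomial(T, P, t, p):
    # Binomial approximation of m(R): sampling with replacement.
    q = P / T
    return sum(comb(t, i) * q**i * (1 - q)**(t - i)
               for i in range(p, min(t, P) + 1))

print(round(m_binomial(24, 4, 12, 4), 3))   # ~0.089 (exact hypergeometric ~0.047)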
57. Pruning Strategies
For better estimation, a rule should be evaluated on data not used for
training.
This requires a growing set and a pruning set
Two options are
Reduced-error pruning for rules builds a full unpruned rule set and
simplifies it subsequently
Incremental reduced-error pruning simplifies a rule immediately after it
has been built.
58. INDUCT (Incremental Pruning Algorithm)
Initialize E to the instance set
Until E is empty do
For each class C for which E contains an instance
Use basic covering algorithm to create best perfect rule for C
Calculate significance m(R) for rule and significance
m(R-) for rule with final condition omitted
If (m(R-) < m(R)), prune rule and repeat previous step
From the rules for the different classes, select
the most significant one
(i.e. the one with smallest m(R))
Print the rule
Remove the instances covered by rule from E
Continue
INDUCT’s significance computation for a rule:
• Probability of completely random rule with same coverage performing at least as well.
• Random rule R selects t cases at random from the dataset
• We want to know how likely it is that p of these belong to the correct class?
• This probability is given by the hypergeometric distribution
59. Example:
Classification task is to predict whether a customer will buy a computer
RID age income student Credit_rating Class:buys_computer
1 youth High No Fair No
2 youth High No Excellent No
3 middle_age High No Fair Yes
4 senior Medium No Fair Yes
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No
7 middle_age Low Yes Excellent Yes
8 youth Medium No Fair No
9 youth Low Yes Fair Yes
10 senior Medium Yes Fair Yes
11 youth Medium Yes Excellent Yes
12 middle_age Medium No Excellent Yes
13 middle_age High Yes Fair Yes
14 senior Medium No Excellent No