DBM630 Lecture 08
1. DBM630: Data Mining and Data Warehousing
M.S. IT, Rangsit University
Semester 2/2011
Lecture 8: Classification and Prediction – Evaluation
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. Topics
Train, Test and Validation sets
Evaluation on Large data
Unbalanced data
Evaluation on Small data
Cross validation
Bootstrap
Comparing data mining schemes
Significance test
Lift Chart / ROC curve
Numeric Prediction Evaluation
3. Evaluation in Classification Tasks
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data.
Q: Why?
A: Because new data will probably not be exactly the same as the training data!
Overfitting – fitting the training data too precisely – usually leads to poor results on new data.
4. Model Selection and Bias-Variance Tradeoff
Typical behavior of the test and training error as model complexity is varied.
[Figure: prediction error vs. model complexity for the training sample and the test sample. Low complexity corresponds to high bias and low variance; high complexity corresponds to low bias and high variance.]
5. Classifier error rate
Natural performance measure for classification problems: the error rate.
Success: the instance's class is predicted correctly.
Error: the instance's class is predicted incorrectly.
Error rate: proportion of errors made over the whole set of instances.
Resubstitution error: error rate on the training data (too optimistic!).
Generalization error: error rate on test data.
Example: if accuracy rises from 85% to 87%, the error rate (1 - accuracy) falls from 15% to 13%. In absolute terms that is a 2% improvement and a 2% error reduction; in relative terms it is a 2.35% improvement rate (2/85) but a 13.3% error reduction rate (2/15).
6. Evaluation on LARGE data
If many (thousands of) examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?
A simple evaluation is sufficient.
For example, randomly split the data into training and test sets (usually 2/3 for training, 1/3 for testing).
Build a classifier using the training set and evaluate it using the test set (see the sketch below).
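A minimal sketch of this simple holdout evaluation, assuming scikit-learn is available; the toy dataset, the decision-tree learner and the variable names are illustrative and not prescribed by the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy data standing in for the "large" dataset on the slide.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

# Randomly split the data: 2/3 for training, 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Build a classifier on the training set only ...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ... and estimate its error rate on the unseen test set.
test_error = 1 - accuracy_score(y_test, model.predict(X_test))
print(f"test error rate: {test_error:.3f}")
```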
7. Classification Step 1: Split data into train and test sets
[Diagram: labeled historical data ("the past", results known, +/- labels) is split into a training set and a testing set.]
8. Classification Step 2: Build a model on the training set
[Diagram: the training set feeds the model builder; the testing set is held aside.]
9. Classification Step 3: Evaluate on the test set (and possibly re-train)
[Diagram: the model built from the training set makes +/- predictions on the testing set; the known results are compared against the predictions (Y/N), and this feedback may be used to re-train the model.]
10. A note on parameter tuning
It is important that the test data is not used in any way to create the classifier.
Some learning schemes operate in two stages:
Stage 1: build the basic structure
Stage 2: optimize parameter settings
The test data cannot be used for parameter tuning!
The proper procedure uses three sets: training data, validation data, and test data.
The validation data is used to optimize parameters.
11. Classification: Train, Validation, Test split
[Diagram: data with known results is split three ways. The training set feeds the model builder; the validation set is used to evaluate the predictions (Y/N) and tune the model; the final test set is used only for the final evaluation of the final model.]
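A minimal sketch of such a three-way split, again with scikit-learn on toy data; the 60/20/20 proportions and the choice to tune k of a k-NN classifier are illustrative assumptions, not values given on the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=0)

# Carve off the final test set first, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Optimize a parameter (here k) using the validation set only.
best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    acc = accuracy_score(
        y_val, KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# The final test set is touched exactly once, for the final evaluation.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
print("chosen k:", best_k,
      "final test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```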
12. Unbalanced data
Sometimes classes have very unequal frequency:
Accommodation prediction: 97% stay, 3% don't stay
Medical diagnosis: 90% healthy, 10% disease
eCommerce: 99% don't buy, 1% buy
Security: >99.99% of Americans are not terrorists
A similar situation arises with multiple classes.
A majority-class classifier can be 97% correct, but useless.
Solution: with two classes, a good approach is to build BALANCED train and test sets and train the model on the balanced set (see the sketch below):
Randomly select the desired number of minority-class instances
Add an equal number of randomly selected majority-class instances
That is, we ignore the effect of the number of instances in each class.
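A minimal sketch of building such a balanced set by down-sampling the majority class, assuming the data sit in a pandas DataFrame with a label column; the column names and the 97/3 split (mirroring the accommodation example) are illustrative:

```python
import pandas as pd

# Toy unbalanced data: 97% "stay", 3% "leave".
df = pd.DataFrame({"label": ["stay"] * 970 + ["leave"] * 30,
                   "feature": range(1000)})

minority = df[df["label"] == "leave"]
majority = df[df["label"] == "stay"]

# Keep all minority instances and an equal number of randomly chosen majority instances.
balanced = pd.concat([minority,
                      majority.sample(n=len(minority), random_state=0)])
print(balanced["label"].value_counts())  # 30 of each class
```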
13. Evaluation on SMALL data
The holdout method reserves a certain amount of the data for testing and uses the remainder for training.
Usually: one third for testing, the rest for training.
For "unbalanced" datasets, the samples might not be representative: few or no instances of some classes.
Stratified sample: an advanced version of balancing the data. Make sure that each class is represented with approximately equal proportions in both subsets.
What if we have a small data set?
The chosen 2/3 for training may not be representative.
The chosen 1/3 for testing may not be representative.
14. Repeated Holdout Method
The holdout estimate can be made more reliable by repeating the process with different subsamples.
In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
The error rates on the different iterations are averaged to yield an overall error rate.
This is called the repeated holdout method.
Still not optimum: the different test sets overlap.
Can we prevent overlapping?
15. Cross-validation
Cross-validation avoids overlapping test sets.
First step: the data is split into k subsets of equal size.
Second step: each subset in turn is used for testing and the remainder for training.
This is called k-fold cross-validation.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
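A minimal sketch of stratified k-fold cross-validation with scikit-learn; the toy data and the decision-tree learner are assumptions, while k = 10 follows the ten-fold recommendation made two slides below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=0)

errors = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=0).split(X, y):
    # Each subset in turn is the test set; the remainder is the training set.
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Average the fold error rates to get the overall estimate.
print(f"10-fold CV error estimate: {np.mean(errors):.3f}")
```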
16. Cross-validation Example
Break the data up into groups of the same size (possibly with stratification).
Hold aside one group for testing and use the rest to build the model.
Repeat with another group as the test data until every group has been used for testing.
[Diagram: the data split into equal groups; in each round one group is the test set and the remaining groups form the training set.]
17. More on cross-validation
The standard method for evaluation: stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation.
E.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance).
18. Leave-One-Out cross-validation
Leave-One-Out: remove one instance for testing and use the others for training.
A particular form of cross-validation:
Set the number of folds equal to the number of training instances,
i.e., for n training instances, build the classifier n times.
Makes best use of the data.
Involves no random subsampling.
Very computationally expensive.
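A minimal leave-one-out sketch on toy data; the small sample size and the k-NN learner are illustrative assumptions (on real datasets this builds the classifier n times, hence the expense):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=60, random_state=0)

hits = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # One instance held out for testing, all others used for training.
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    hits.append(model.predict(X[test_idx])[0] == y[test_idx][0])

# One prediction per instance; the error rate is the fraction of misses.
print(f"leave-one-out error estimate: {1 - np.mean(hits):.3f}")
```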
19. Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible.
It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: a random dataset split equally into two classes.
The best inducer predicts the majority class, giving 50% accuracy on fresh data.
But the Leave-One-Out-CV estimate is 100% error: whenever an instance is held out, the other class has a slight majority among the remaining instances, so the prediction is always wrong.
20. *The bootstrap
CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.
The bootstrap uses sampling with replacement to form the training set:
Sample a dataset of n instances n times with replacement to form a new dataset of n instances.
Use this data as the training set.
Use the instances from the original dataset that do not occur in the new training set for testing.
21. Evaluating the Accuracy of a Classifier or Predictor
Bootstrap method:
The training tuples are sampled uniformly with replacement.
Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
There are several bootstrap methods – the commonly used one is the .632 bootstrap, which works as follows:
Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a bootstrap sample (training set) of d tuples.
It is very likely that some of the original data tuples will occur more than once in this sample.
The data tuples that did not make it into the training set end up forming the test set.
Suppose we try this out several times – on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set.
22. *The 0.632 bootstrap
Also called the 0.632 bootstrap:
For n instances, a particular instance has a probability of 1 − 1/n of not being picked in one draw.
Thus its probability of ending up in the test data (i.e. never being picked in n draws) is:
(1 − 1/n)^n ≈ e^(−1) ≈ 0.368
This means the training data will contain approximately 63.2% of the distinct instances.
Note: e is an irrational constant approximately equal to 2.718281828.
23. *Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic, since the model is trained on just ~63% of the instances.
Therefore, combine it with the resubstitution error:
err = 0.632 × err_test_instances + 0.368 × err_training_instances
The resubstitution error gets less weight than the error on the test data.
Repeat the process several times with different replacement samples and average the results.
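A minimal sketch of the 0.632 bootstrap estimate; the toy data, the decision-tree learner and the number of repetitions are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=100, random_state=0)
rng = np.random.default_rng(0)
n = len(X)

estimates = []
for _ in range(50):  # repeat with different replacement samples and average
    # Sample n instances with replacement to form the training set.
    train_idx = rng.integers(0, n, size=n)
    test_idx = np.setdiff1d(np.arange(n), train_idx)  # instances never drawn (~36.8%)

    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    err_test = 1 - accuracy_score(y[test_idx], model.predict(X[test_idx]))
    err_train = 1 - accuracy_score(y[train_idx], model.predict(X[train_idx]))

    # Combine test error and resubstitution error with the 0.632 / 0.368 weights.
    estimates.append(0.632 * err_test + 0.368 * err_train)

print(f"0.632 bootstrap error estimate: {np.mean(estimates):.3f}")
```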
24. *More on the bootstrap
Probably the best way of estimating performance for very small datasets.
However, it has some problems.
Consider the random dataset from above:
A perfect memorizer will achieve 0% resubstitution error and ~50% error on the test data.
Bootstrap estimate for this classifier:
err = 0.632 × 50% + 0.368 × 0% = 31.6%
True expected error: 50%
25. Evaluating Two-class Classification (Lift Chart vs. ROC Curve)
Information retrieval or search engines: applications that find a set of related documents given a set of keywords.
Hard decision vs. soft decision – the focus here is on soft decisions:
Class probability (ranking)
Class-by-class evaluation (also for multiclass problems)
Example: promotional mailout
Situation 1: the classifier predicts that 0.1% of all households will respond
Situation 2: the classifier predicts that 0.4% of the 100,000 most promising households will respond
26. Confusion Matrix (Two-class)
Also called a contingency table:

                         Actual class
                         Yes               No
Predicted    Yes         True Positive     False Positive
class        No          False Negative    True Negative
27. Measures in Information Retrieval
Precision: percentage of retrieved documents that are relevant.
    precision = TP / (TP + FP)
Recall: percentage of relevant documents that are returned.
    recall = TP / (TP + FN)
F-measure: the combined measure of recall and precision.
    F-measure = 2 · recall · precision / (recall + precision)
Precision/recall curves have a hyperbolic shape.
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall).
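A minimal sketch of these measures computed directly from confusion-matrix counts; the helper name and the example counts are made up for illustration:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

# Illustrative counts, not taken from the slides.
p, r, f = precision_recall_f1(tp=60, fp=20, fn=15)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```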
28. Measures in Two-Class Classification
For the positive class:
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
For the negative class:
    precision = TN / (TN + FN)
    recall    = TN / (TN + FP)
For both classes:
    accuracy  = (TP + TN) / (TP + TN + FP + FN)
Usually we focus only on the positive class ("True" or "Yes" cases); therefore only the precision and recall of the positive class are used for performance comparison.
    TP Rate = TP / (TP + FN) = recall
    FP Rate = FP / (TN + FP)
29. Confusion matrix: Example

Predicted \ Actual       buys_computer = yes   buys_computer = no   Total
buys_computer = yes      6,954                 46                   7,000
buys_computer = no       412                   2,588                3,000
Total                    7,366                 2,634                10,000

Number of tuples of class buys_computer = yes that were labeled by the classifier as buys_computer = no: FN (= 412)
Number of tuples of class buys_computer = no that were labeled by the classifier as buys_computer = yes: FP (= 46)
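A short sketch that computes the two-class measures for this example directly from the matrix above (the counts are taken from the slide; the script itself is illustrative):

```python
# Counts from the buys_computer confusion matrix above.
TP, FP = 6954, 46    # predicted yes: actually yes / actually no
FN, TN = 412, 2588   # predicted no:  actually yes / actually no

precision = TP / (TP + FP)                    # 6954 / 7000  ≈ 0.993
recall    = TP / (TP + FN)                    # 6954 / 7366  ≈ 0.944
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 9542 / 10000 = 0.954
fp_rate   = FP / (TN + FP)                    # 46 / 2634    ≈ 0.017

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} FP rate={fp_rate:.3f}")
```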
30. Cumulative Gains Chart / Lift Chart / ROC Curve
They are visual aids for measuring model performance.
Cumulative gains measure the effectiveness of a predictive model in terms of the TP rate (% of true responses captured).
Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model.
Cumulative gains and lift charts consist of a lift curve and a baseline; the greater the area between the lift curve and the baseline, the better the model.
ROC measures the effectiveness of a predictive model as TP rate against FP rate (% positive responses against % negative responses); the greater the area under the ROC curve, the better the model.
31. Generating Charts
Instances are sorted according to their predicted probability of being a true positive.
32. Cumulative Gains Chart
The x-axis shows the percentage of samples contacted.
The y-axis shows the percentage of positive responses (true positive rate), as a percentage of the total possible positive responses.
Baseline (overall response rate): if we sample X% of the data, we will receive X% of the total positive responses.
Lift curve: using the predictions of the response model, calculate the percentage of positive responses captured for each percentage of customers contacted and map these points to create the curve.
34. Lift Chart
The x-axis shows the percentage of samples contacted.
The y-axis shows the ratio of true positives obtained with the model to true positives obtained without the model.
To plot the chart: calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model.
Example: when contacting 10% of customers, using no model we should reach 10% of responders, while using the given model we should reach 30% of responders. The y-value of the lift curve at 10% is therefore 30 / 10 = 3.
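A minimal sketch of how the cumulative-gains and lift points can be computed from scored instances, assuming each instance has a model score and an actual 0/1 response; the helper name and the toy scores are illustrative:

```python
import numpy as np

def gains_and_lift(scores, responses):
    """Cumulative gains (% of all positives captured) and lift at each contact depth."""
    order = np.argsort(scores)[::-1]                       # sort by score, best first
    hits = np.cumsum(np.asarray(responses)[order])         # positives captured so far
    depth = np.arange(1, len(scores) + 1) / len(scores)    # fraction of samples contacted
    gains = hits / hits[-1]                                # fraction of positives captured
    lift = gains / depth                                   # ratio vs. the no-model baseline
    return depth, gains, lift

scores    = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]  # toy model scores
responses = [1,   1,   0,   1,   0,   1,    0,   0,   0,   0]     # toy actual responses
depth, gains, lift = gains_and_lift(scores, responses)
print("gains:", gains)
print("lift: ", lift)
```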
36. ROC Curve
ROC curves are similar to lift charts.
"ROC" stands for "receiver operating characteristic", used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
Differences:
The y-axis shows the percentage of true positives in the sample.
The x-axis shows the percentage of false positives in the sample (rather than the sample size).
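A minimal ROC sketch using scikit-learn's roc_curve on the same kind of scored data; the labels and scores are made up for illustration:

```python
from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]                         # actual responses
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]   # model scores

# fpr / tpr give the x / y coordinates of the ROC curve; AUC summarizes it.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("area under ROC curve:", auc(fpr, tpr))
```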
38. Extending to Multiple-Class Classification

                          PREDICTED CLASS
                 C1       C2       ...   Cn       Sum    Recall
ACTUAL   C1      n11      n12      ...   n1n      R1     n11 / R1
CLASS    C2      n21      n22      ...   n2n      R2     n22 / R2
         ...
         Cn      nn1      nn2      ...   nnn      Rn     nnn / Rn
Sum              P1       P2       ...   Pn       T
Precision        n11/P1   n22/P2   ...   nnn/Pn          (n11 + n22 + ... + nnn) / T
39. Measures in Multiple-Class Classification
precision(Ci) = nii / Pi
recall(Ci)    = nii / Ri
accuracy      = (n11 + n22 + ... + nnn) / T
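A minimal sketch of these per-class measures from a multi-class confusion matrix using numpy; the 3×3 example counts are illustrative:

```python
import numpy as np

# Rows = actual class, columns = predicted class (illustrative counts).
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 44]])

row_sums = cm.sum(axis=1)   # Ri: actual instances of each class
col_sums = cm.sum(axis=0)   # Pi: predictions made for each class
diag = np.diag(cm)          # nii: correct predictions per class

precision = diag / col_sums          # precision(Ci) = nii / Pi
recall    = diag / row_sums          # recall(Ci)    = nii / Ri
accuracy  = diag.sum() / cm.sum()    # (n11 + ... + nnn) / T

print("precision per class:", precision)
print("recall per class:   ", recall)
print("accuracy:", accuracy)
```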
40. Numeric Prediction Evaluation
Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: the error measures (generalization error).
Actual target values: a1, a2, ..., an
Predicted target values: p1, p2, ..., pn
Most popular measure: the mean-squared error
MSE = ((p1 − a1)² + (p2 − a2)² + ... + (pn − an)²) / n
It is easy to manipulate mathematically.
41. Other Measures
The root mean-squared (RMS) error:
RMSE = sqrt( ((p1 − a1)² + (p2 − a2)² + ... + (pn − an)²) / n )
The mean absolute error is less sensitive to outliers than the mean-squared error:
MAE = (|p1 − a1| + |p2 − a2| + ... + |pn − an|) / n
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).
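A minimal sketch of these error measures with numpy; the actual/predicted vectors are made up for illustration:

```python
import numpy as np

a = np.array([500.0, 320.0, 410.0, 270.0])   # actual target values
p = np.array([550.0, 300.0, 400.0, 300.0])   # predicted target values

mse  = np.mean((p - a) ** 2)        # mean-squared error
rmse = np.sqrt(mse)                 # root mean-squared error
mae  = np.mean(np.abs(p - a))       # mean absolute error

print(f"MSE={mse:.1f} RMSE={rmse:.1f} MAE={mae:.1f}")
```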
42. Improvement on the Mean
Often we want to know how much the scheme improves on simply predicting the average.
The relative squared error is:
RSE = ((p1 − a1)² + (p2 − a2)² + ... + (pn − an)²) / ((ā − a1)² + (ā − a2)² + ... + (ā − an)²)
where ā is the average of the actual values.
The relative absolute error is:
RAE = (|p1 − a1| + |p2 − a2| + ... + |pn − an|) / (|ā − a1| + |ā − a2| + ... + |ā − an|)
43. The Correlation Coefficient
Measures the statistical correlation between the predicted values and the actual values:
correlation = S_PA / sqrt(S_P · S_A)
where
S_PA = Σi (pi − p̄)(ai − ā) / (n − 1)
S_P  = Σi (pi − p̄)² / (n − 1)
S_A  = Σi (ai − ā)² / (n − 1)
Scale independent, between −1 and +1.
Good performance leads to large values!
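A minimal sketch of the relative errors and the correlation coefficient, reusing the illustrative vectors from the previous sketch:

```python
import numpy as np

a = np.array([500.0, 320.0, 410.0, 270.0])   # actual values
p = np.array([550.0, 300.0, 400.0, 300.0])   # predicted values
a_bar = a.mean()

rse = np.sum((p - a) ** 2) / np.sum((a_bar - a) ** 2)      # relative squared error
rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a_bar - a))    # relative absolute error

# Correlation coefficient between predictions and actuals (the (n - 1) factors cancel).
corr = np.corrcoef(p, a)[0, 1]

print(f"RSE={rse:.3f} RAE={rae:.3f} correlation={corr:.3f}")
```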
44. Practice 1
A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains chart and a lift chart from the following data.
45. Results
46. Practice 2
Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains chart, the lift chart and the ROC curve. Ties in the ranking should be broken arbitrarily by assigning a higher rank to whoever appears first in the table.

Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N
B               50000    21    Y
C               65000    25    Y
D               62000    30    Y
E               67000    19    Y
F               69000    48    N
G               65000    12    Y
H               64000    51    N
I               71000    65    Y
J               73000    42    N
47. Practice 2 – Solution: P(x) and ranking

Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N                  49     10
B               50000    21    Y                  71      9
C               65000    25    Y                  90      6
D               62000    30    Y                  92      5
E               67000    19    Y                  86      7
F               69000    48    N                 117      2
G               65000    12    Y                  77      8
H               64000    51    N                 115      3
I               71000    65    Y                 136      1
J               73000    42    N                 115      4

Ordered by P(x):
Customer Name   Actual Response   P(x)   Rank
I               Y                 136     1
F               N                 117     2
H               N                 115     3
J               N                 115     4
D               Y                  92     5
C               Y                  90     6
E               Y                  86     7
G               Y                  77     8
B               Y                  71     9
A               N                  49    10

[Charts: cumulative gains chart (% positive responses vs. % of samples contacted, with baseline) and lift chart (lift vs. % of samples contacted, with baseline).]
48. Practice 2 – Solution: ROC points

Customer Name   Actual Response   #Sample   TP   FP   TPR       FPR
I               Y                 1         1    0    16.67%    0.00%
F               N                 2         1    1    16.67%    25.00%
H               N                 3         1    2    16.67%    50.00%
J               N                 4         1    3    16.67%    75.00%
C               Y                 5         2    3    33.33%    75.00%
E               Y                 6         3    3    50.00%    75.00%
D               Y                 7         4    3    66.67%    75.00%
G               Y                 8         5    3    83.33%    75.00%
B               Y                 9         6    3    100.00%   75.00%
A               N                 10        6    4    100.00%   100.00%

[Chart: ROC curve, TPR against FPR, rising from (0%, 0%) to (100%, 100%).]
49. *Predicting performance
Assume the estimated error rate is 25%. How close is this to the true error rate?
It depends on the amount of test data.
Prediction is just like tossing a biased (!) coin: "head" is a "success", "tail" is an "error".
In statistics, a succession of independent events like this is called a Bernoulli process.
Statistical theory provides us with confidence intervals for the true underlying proportion!
50. *Confidence intervals
We can say: p lies within a certain specified interval with a certain specified confidence.
Example: S = 750 successes in N = 1000 trials
Estimated success rate: 75%
How close is this to the true success rate p?
Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
Another example: S = 75 and N = 100
Estimated success rate: 75%
With 80% confidence, p ∈ [69.1%, 80.1%]
51. *Mean and variance (also Mod 7)
Mean and variance for a Bernoulli trial: p, p(1 − p)
Expected success rate: f = S/N
Mean and variance for f: p, p(1 − p)/N
For large enough N, f follows a Normal distribution.
The c% confidence interval [−z ≤ X ≤ z] for a random variable X with 0 mean is given by:
Pr[−z ≤ X ≤ z] = c
With a symmetric distribution:
Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z]
52. *Confidence limits
Confidence limits for the normal distribution with 0 mean and a variance of 1:

Pr[X ≥ z]   z
0.1%        3.09
0.5%        2.58
1%          2.33
5%          1.65
10%         1.28
20%         0.84
40%         0.25

Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
To use this we have to reduce our random variable f to have 0 mean and unit variance.
53. *Transforming f
Transformed value for f: (f − p) / sqrt(p(1 − p)/N)
(i.e. subtract the mean and divide by the standard deviation)
Pr[ −z ≤ (f − p) / sqrt(p(1 − p)/N) ≤ z ] = c
Solving for p gives the resulting equation:
p = ( f + z²/(2N) ± z · sqrt( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
54. *Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
Note that the normal distribution assumption is only valid for large N (i.e. N > 100):
f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881]
(should be taken with a grain of salt)
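A minimal sketch of this interval computation, which should reproduce the examples above; the function name is made up, and z = 1.28 corresponds to 80% confidence per the table on slide 52:

```python
from math import sqrt

def confidence_interval(f: float, n: int, z: float = 1.28) -> tuple[float, float]:
    """Confidence interval for the true success rate p, given observed rate f over n trials."""
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

for n in (1000, 100, 10):
    low, high = confidence_interval(0.75, n)
    print(f"f=75%, N={n}: p in [{low:.3f}, {high:.3f}]")
```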