3. Experimental Methodology: A Pictorial Overview
[Diagram: a collection of classified examples is split into training examples (a train’ set and a tune set) and testing examples; the LEARNER generates candidate solutions from the train’ set, the tune set is used to select the best, and the resulting classifier’s accuracy on the testing examples estimates its expected accuracy on future examples]
Statistical techniques such as 10-fold cross validation and t-tests are used to get meaningful results
13. Confusion Matrices – A Useful Way to Report Test-Set Errors
Useful for the NETtalk testbed – the task of pronouncing written words
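A minimal sketch of tallying a confusion matrix from predictions; the three-class labels here are hypothetical stand-ins (in the spirit of NETtalk’s phoneme classes), not from the slides:

```python
from collections import Counter

# Hypothetical actual and predicted labels for six test examples
actual    = ['a', 'b', 'a', 'c', 'b', 'a']
predicted = ['a', 'a', 'a', 'c', 'b', 'b']

# Each cell of the confusion matrix counts one (actual, predicted) pair
matrix = Counter(zip(actual, predicted))

# Diagonal cells are correct classifications; off-diagonal cells are errors
correct = sum(matrix[(c, c)] for c in set(actual))
```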
14. Scatter Plots – Compare Two Algo’s on Many Datasets
[Plot: Algo A’s error rate vs. Algo B’s error rate; each dot is the error rate of the two algo’s on ONE dataset]
29. The Null Hypothesis Graphically (View #1)
Assume a zero mean and use the sample’s variance (sample = experiment)
Put ½(1 – M) of the probability mass in each tail (i.e., M inside); typically M = 0.95
Does our measured δ lie in the tail regions indicated by the arrows? If so, reject the null hypothesis, since it is unlikely we’d get such a δ by chance
[Plot: P(δ) with the two rejection tails marked]
30. View #2 – The Confidence Interval for δ
Use the sample’s mean and variance
Is zero inside the M% of probability mass? If NOT, reject the null hypothesis
[Plot: P(δ) with the M% confidence interval marked]
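The two views above correspond to the usual paired t-test computation. A minimal sketch, using made-up per-fold accuracy differences (δ) between two learners on 10 folds; the numbers and the hard-coded critical value are illustrative assumptions:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical per-fold accuracy differences between two learners
deltas = [0.02, 0.01, 0.03, -0.01, 0.02, 0.04, 0.00, 0.01, 0.02, 0.03]

n = len(deltas)
d_bar = mean(deltas)                 # measured mean delta
se = stdev(deltas) / sqrt(n)         # standard error of the mean delta
t_stat = d_bar / se                  # t statistic under the null (true delta = 0)

# With M = 0.95 and n - 1 = 9 degrees of freedom, the two-tailed critical
# value is about 2.262; reject the null hypothesis if |t| exceeds it
t_crit = 2.262
reject_null = abs(t_stat) > t_crit
```

Equivalently (View #2): the null is rejected exactly when zero falls outside the interval d_bar ± t_crit · se.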
39. Stability
Stability = how much the model an algorithm learns changes due to minor perturbations of the training set
Paired t-test assumptions are a better match to stable algorithms
Example: k-NN – the higher the k, the more stable
40. More on the Paired t-Test Assumption
Ideally, train on one data set and then do a 10-fold paired t-test
What we should do: train, then test1 … test10
What we usually do: train1/test1 … train10/test10
However, there is usually not enough data to do the ideal
If we make the train data part of each paired experiment, we violate the independence assumptions – each train set overlaps 90% with every other train set
In the ideal setup, the learned model does not vary while we’re measuring its performance
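A quick sketch of why the usual 10-fold train sets are not independent; the 100-example dataset is hypothetical:

```python
# Partition 100 hypothetical examples into 10 folds of 10
n_folds = 10
folds = [set(range(i * 10, (i + 1) * 10)) for i in range(n_folds)]

# Each train set is the union of the 9 folds not held out for testing
train_sets = []
for held_out in range(n_folds):
    train = set().union(*(f for i, f in enumerate(folds) if i != held_out))
    train_sets.append(train)

# Any two train sets share 8 of their 9 folds, i.e. roughly 90% overlap
overlap = len(train_sets[0] & train_sets[1]) / len(train_sets[0])
```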
42. One vs. Two-Tailed Tests Graphically
[Plot: P(x) vs. x – the one-tailed test puts 2.5% of the probability mass in one tail; the two-tailed test puts 2.5% in each tail]
49. TPR and FPR
True Positive Rate (TPR) = n(1,1) / ( n(1,1) + n(0,1) )
= correctly categorized +’s / total positives
= P(algo outputs + | + is correct)
False Positive Rate (FPR) = n(1,0) / ( n(1,0) + n(0,0) )
= incorrectly categorized –’s / total negatives
= P(algo outputs + | – is correct)
Can similarly define False Negative Rate and True Negative Rate
See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
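The definitions above in code, using the slides’ n(algorithm output, correct answer) notation; the counts plugged in are hypothetical:

```python
def rates(n11, n10, n01, n00):
    """n11 = TP, n10 = FP, n01 = FN, n00 = TN."""
    tpr = n11 / (n11 + n01)   # P(algo outputs + | + is correct)
    fpr = n10 / (n10 + n00)   # P(algo outputs + | - is correct)
    return tpr, fpr

# Hypothetical counts: 50 positives (40 caught), 50 negatives (5 mislabeled +)
tpr, fpr = rates(n11=40, n10=5, n01=10, n00=45)
```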
51. ROC Curves Graphically
[Plot: true positive rate, P(alg outputs + | + is correct), vs. false positive rate, P(alg outputs + | – is correct), each from 0 to 1.0; the ideal spot is the upper-left corner; the curves for Alg 1 and Alg 2 cross]
Different algorithms can work better in different parts of ROC space. This depends on the cost of false +’s vs. false –’s
54. Plotting ROC Curves – Example
ML Algo Output (Sorted)   Correct Category
Ex 9   .99                +
Ex 7   .98                +
Ex 1   .72                –
Ex 2   .70                +
Ex 6   .65                +
Ex 10  .51                –
Ex 3   .39                –
Ex 5   .24                +
Ex 4   .11                –
Ex 8   .01                –
[Plot of TPR = P(alg outputs + | + is correct) vs. FPR = P(alg outputs + | – is correct), stepping through the points:]
(TPR, FPR) = (2/5, 0/5), (2/5, 1/5), (4/5, 1/5), (4/5, 3/5), (5/5, 3/5), (5/5, 5/5)
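The points on this slide can be recomputed by sweeping a threshold down the sorted outputs; a minimal sketch using the slide’s (score, label) pairs:

```python
# (score, label) pairs sorted by descending ML algo output, from the slide
examples = [(.99, '+'), (.98, '+'), (.72, '-'), (.70, '+'), (.65, '+'),
            (.51, '-'), (.39, '-'), (.24, '+'), (.11, '-'), (.01, '-')]

pos = sum(1 for _, y in examples if y == '+')
neg = len(examples) - pos

# Lower the threshold one example at a time, tracking (TPR, FPR)
points, tp, fp = [], 0, 0
for _, y in examples:
    if y == '+':
        tp += 1
    else:
        fp += 1
    points.append((tp / pos, fp / neg))
```

The slide’s six plotted points are the subset where the curve changes direction; the sweep also visits the intermediate thresholds.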
64. The Relationship between Precision-Recall and ROC Curves
Jesse Davis & Mark Goadrich
Department of Computer Sciences, University of Wisconsin
74. A1: Dominance Theorem For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in PR space
85. Example Interpolation
A dataset with 20 positive and 2000 negative examples
        A       ·       B
TP      5       6      10
FP      5      10      30
REC    0.25    0.3     0.5
PREC   0.5     0.375   0.25
86. Example Interpolation
A dataset with 20 positive and 2000 negative examples
        A       ·       ·       ·       ·       B
TP      5       6       7       8       9      10
FP      5      10      15      20      25      30
REC    0.25    0.3     0.35    0.4     0.45    0.5
PREC   0.5     0.375   0.318   0.286   0.265   0.25
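The interpolated points above follow from stepping TP one at a time between A and B and interpolating FP linearly, rather than drawing a straight line in PR space. A minimal sketch of that computation:

```python
def interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Interpolate in PR space between points A and B (Davis & Goadrich style)."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)   # extra FPs incurred per extra TP
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + slope * x
        recall = tp / total_pos
        precision = tp / (tp + fp)
        points.append((round(recall, 3), round(precision, 3)))
    return points

# A = (TP=5, FP=5), B = (TP=10, FP=30), 20 positives total, as on the slide
pts = interpolate(tp_a=5, fp_a=5, tp_b=10, fp_b=30, total_pos=20)
```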
89. Dominance Theorem For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in Precision-Recall space
90. For Fixed N, P and TPR: FPR ≠ Precision
                     True Answer
                      +        –
Algorithm    +       75      100
Answer       –       25      900
                   P = 100  N = 1000
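The point of this matrix in code: with the counts shown, TPR, FPR, and precision are all different quantities, so fixing N, P, and TPR does not make FPR and precision interchangeable.

```python
# Counts from the confusion matrix on this slide
tp, fn = 75, 25      # true answer is +  (P = 100)
fp, tn = 100, 900    # true answer is -  (N = 1000)

tpr = tp / (tp + fn)            # fraction of positives caught
fpr = fp / (fp + tn)            # fraction of negatives mislabeled +
precision = tp / (tp + fp)      # fraction of + outputs that are correct
```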