3. Experimental Methodology: A Pictorial Overview
[Diagram: a collection of classified examples is split into training examples (a train’ set and a tune set) and testing examples; the LEARNER generates candidate solutions from the train’ set, the tune set is used to select the best, and the resulting classifier’s accuracy on the testing examples estimates its expected accuracy on future examples]
Statistical techniques such as 10-fold cross validation and t-tests are used to get meaningful results
13. Confusion Matrices – A Useful Way to Report Test-Set Errors
Useful for the NETtalk testbed – the task of pronouncing written words
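A minimal sketch of tallying a confusion matrix from predictions; the three-class labels here are hypothetical stand-ins (in the spirit of NETtalk’s phoneme classes), not from the slides:

```python
from collections import Counter

# Hypothetical actual and predicted labels for six test examples
actual    = ['a', 'b', 'a', 'c', 'b', 'a']
predicted = ['a', 'a', 'a', 'c', 'b', 'b']

# Each cell of the confusion matrix counts one (actual, predicted) pair
matrix = Counter(zip(actual, predicted))

# Diagonal cells are correct classifications; off-diagonal cells are errors
correct = sum(matrix[(c, c)] for c in set(actual))
```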
14. Scatter Plots – Compare Two Algo’s on Many Datasets
[Plot: Algo A’s error rate vs. Algo B’s error rate; each dot is the error rate of the two algo’s on ONE dataset]
29. The Null Hypothesis Graphically (View #1)
Assume a zero mean and use the sample’s variance (sample = experiment)
Put ½(1 – M) of the probability mass in each tail (i.e., M inside); typically M = 0.95
Does our measured δ lie in the tail regions indicated by the arrows? If so, reject the null hypothesis, since it is unlikely we’d get such a δ by chance
[Plot: P(δ) with the two rejection tails marked]
30. View #2 – The Confidence Interval for δ
Use the sample’s mean and variance
Is zero inside the M% of probability mass? If NOT, reject the null hypothesis
[Plot: P(δ) with the M% confidence interval marked]
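The two views above correspond to the usual paired t-test computation. A minimal sketch, using made-up per-fold accuracy differences (δ) between two learners on 10 folds; the numbers and the hard-coded critical value are illustrative assumptions:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical per-fold accuracy differences between two learners
deltas = [0.02, 0.01, 0.03, -0.01, 0.02, 0.04, 0.00, 0.01, 0.02, 0.03]

n = len(deltas)
d_bar = mean(deltas)                 # measured mean delta
se = stdev(deltas) / sqrt(n)         # standard error of the mean delta
t_stat = d_bar / se                  # t statistic under the null (true delta = 0)

# With M = 0.95 and n - 1 = 9 degrees of freedom, the two-tailed critical
# value is about 2.262; reject the null hypothesis if |t| exceeds it
t_crit = 2.262
reject_null = abs(t_stat) > t_crit
```

Equivalently (View #2): the null is rejected exactly when zero falls outside the interval d_bar ± t_crit · se.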
39. Stability
Stability = how much the model an algorithm learns changes due to minor perturbations of the training set
Paired t-test assumptions are a better match to stable algorithms
Example: k-NN – the higher the k, the more stable
40. More on the Paired t-Test Assumption
Ideally, train on one data set and then do a 10-fold paired t-test
What we should do: train, then test1 … test10
What we usually do: train1/test1 … train10/test10
However, there is usually not enough data to do the ideal
If we make the train data part of each paired experiment, we violate the independence assumptions – each train set overlaps 90% with every other train set
In the ideal setup, the learned model does not vary while we’re measuring its performance
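A quick sketch of why the usual 10-fold train sets are not independent; the 100-example dataset is hypothetical:

```python
# Partition 100 hypothetical examples into 10 folds of 10
n_folds = 10
folds = [set(range(i * 10, (i + 1) * 10)) for i in range(n_folds)]

# Each train set is the union of the 9 folds not held out for testing
train_sets = []
for held_out in range(n_folds):
    train = set().union(*(f for i, f in enumerate(folds) if i != held_out))
    train_sets.append(train)

# Any two train sets share 8 of their 9 folds, i.e. roughly 90% overlap
overlap = len(train_sets[0] & train_sets[1]) / len(train_sets[0])
```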
42. One vs. Two-Tailed Tests Graphically
[Plot: P(x) vs. x – the one-tailed test puts 2.5% of the probability mass in one tail; the two-tailed test puts 2.5% in each tail]
49. TPR and FPR
True Positive Rate (TPR) = n(1,1) / ( n(1,1) + n(0,1) )
= correctly categorized +’s / total positives
= P(algo outputs + | + is correct)
False Positive Rate (FPR) = n(1,0) / ( n(1,0) + n(0,0) )
= incorrectly categorized –’s / total negatives
= P(algo outputs + | – is correct)
Can similarly define False Negative Rate and True Negative Rate
See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
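The definitions above in code, using the slides’ n(algorithm output, correct answer) notation; the counts plugged in are hypothetical:

```python
def rates(n11, n10, n01, n00):
    """n11 = TP, n10 = FP, n01 = FN, n00 = TN."""
    tpr = n11 / (n11 + n01)   # P(algo outputs + | + is correct)
    fpr = n10 / (n10 + n00)   # P(algo outputs + | - is correct)
    return tpr, fpr

# Hypothetical counts: 50 positives (40 caught), 50 negatives (5 mislabeled +)
tpr, fpr = rates(n11=40, n10=5, n01=10, n00=45)
```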
51. ROC Curves Graphically
[Plot: true positive rate, P(alg outputs + | + is correct), vs. false positive rate, P(alg outputs + | – is correct), each from 0 to 1.0; the ideal spot is the upper-left corner; the curves for Alg 1 and Alg 2 cross]
Different algorithms can work better in different parts of ROC space. This depends on the cost of false +’s vs. false –’s
54. Plotting ROC Curves – Example
ML Algo Output (Sorted)   Correct Category
Ex 9   .99                +
Ex 7   .98                +
Ex 1   .72                –
Ex 2   .70                +
Ex 6   .65                +
Ex 10  .51                –
Ex 3   .39                –
Ex 5   .24                +
Ex 4   .11                –
Ex 8   .01                –
[Plot of TPR = P(alg outputs + | + is correct) vs. FPR = P(alg outputs + | – is correct), stepping through the points:]
(TPR, FPR) = (2/5, 0/5), (2/5, 1/5), (4/5, 1/5), (4/5, 3/5), (5/5, 3/5), (5/5, 5/5)
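The points on this slide can be recomputed by sweeping a threshold down the sorted outputs; a minimal sketch using the slide’s (score, label) pairs:

```python
# (score, label) pairs sorted by descending ML algo output, from the slide
examples = [(.99, '+'), (.98, '+'), (.72, '-'), (.70, '+'), (.65, '+'),
            (.51, '-'), (.39, '-'), (.24, '+'), (.11, '-'), (.01, '-')]

pos = sum(1 for _, y in examples if y == '+')
neg = len(examples) - pos

# Lower the threshold one example at a time, tracking (TPR, FPR)
points, tp, fp = [], 0, 0
for _, y in examples:
    if y == '+':
        tp += 1
    else:
        fp += 1
    points.append((tp / pos, fp / neg))
```

The slide’s six plotted points are the subset where the curve changes direction; the sweep also visits the intermediate thresholds.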
64. The Relationship between Precision-Recall and ROC Curves
Jesse Davis & Mark Goadrich
Department of Computer Sciences, University of Wisconsin
74. A1: Dominance Theorem For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in PR space
85. Example Interpolation
A dataset with 20 positive and 2000 negative examples
        A       ·       B
TP      5       6      10
FP      5      10      30
REC    0.25    0.3     0.5
PREC   0.5     0.375   0.25
86. Example Interpolation
A dataset with 20 positive and 2000 negative examples
        A       ·       ·       ·       ·       B
TP      5       6       7       8       9      10
FP      5      10      15      20      25      30
REC    0.25    0.3     0.35    0.4     0.45    0.5
PREC   0.5     0.375   0.318   0.286   0.265   0.25
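The interpolated points above follow from stepping TP one at a time between A and B and interpolating FP linearly, rather than drawing a straight line in PR space. A minimal sketch of that computation:

```python
def interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Interpolate in PR space between points A and B (Davis & Goadrich style)."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)   # extra FPs incurred per extra TP
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + slope * x
        recall = tp / total_pos
        precision = tp / (tp + fp)
        points.append((round(recall, 3), round(precision, 3)))
    return points

# A = (TP=5, FP=5), B = (TP=10, FP=30), 20 positives total, as on the slide
pts = interpolate(tp_a=5, fp_a=5, tp_b=10, fp_b=30, total_pos=20)
```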
89. Dominance Theorem For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in Precision-Recall space
90. For Fixed N, P and TPR: FPR ≠ Precision
                     True Answer
                      +        –
Algorithm    +       75      100
Answer       –       25      900
                   P = 100  N = 1000
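The point of this matrix in code: with the counts shown, TPR, FPR, and precision are all different quantities, so fixing N, P, and TPR does not make FPR and precision interchangeable.

```python
# Counts from the confusion matrix on this slide
tp, fn = 75, 25      # true answer is +  (P = 100)
fp, tn = 100, 900    # true answer is -  (N = 1000)

tpr = tp / (tp + fn)            # fraction of positives caught
fpr = fp / (fp + tn)            # fraction of negatives mislabeled +
precision = tp / (tp + fp)      # fraction of + outputs that are correct
```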