CptS 570 Machine Learning, School of EECS, Washington State University

IEEE Expert, October 1996

Given a sample S drawn from the space of all possible examples D, learner L learns hypothesis h based on S. Sample error: error_S(h). True error: error_D(h). Example: hypothesis h misclassifies 12 of the 40 examples in S, so error_S(h) = 0.3. What is error_D(h)?

Learner A learns hypothesis h_A on sample S, and learner B learns hypothesis h_B on sample S. We observe error_S(h_A) < error_S(h_B). Is error_D(h_A) < error_D(h_B)? Is learner A better than learner B?

How can we estimate the true error of a classifier? How can we determine whether one learner is better than another? Using the sample error is too optimistic. Using the error on a separate test set is better, but might still be misleading. Repeating this for multiple iterations, each with a different training/testing split, yields a better estimate of the true error.

David Wolpert, 1995: for any learning algorithm there are datasets on which it does well, and datasets on which it does poorly. Performance estimates are based on specific datasets, not an estimate of the learner's performance over all datasets. There is no single best learning algorithm.

Multiple iterations of learning on a training set and testing on a separate validation set are only for evaluation and parameter tuning; final learning should be done on all available data. If the validation set is used to choose or tune a learning method, then it cannot also be used to compare performance against another learning algorithm. We need yet another test set that is unseen during tuning and learning.

Other criteria for comparing learners: error costs (false positives vs. false negatives); training time and space complexity; testing time and space complexity; interpretability; ease of implementation.

Given dataset X, for each of K trials: randomly divide X into a training set (2/3) and a testing set (1/3); learn a classifier on the training set; test the classifier on the testing set (compute the error). Compute the average error over the K trials. Problem: the training and testing sets overlap between trials, which biases the results.
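A minimal sketch of this repeated random-split procedure, assuming scikit-learn is available; the iris data and the decision tree are placeholders, since the slide does not prescribe a dataset or classifier:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)    # placeholder dataset
K = 10                               # number of random trials

errors = []
for k in range(K):
    # Randomly split into 2/3 training and 1/3 testing on each trial.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=k)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(1.0 - clf.score(X_te, y_te))      # test-set error for this trial

print("mean error over %d trials: %.3f" % (K, np.mean(errors)))
# Caveat from the slide: the K test sets overlap, which biases the estimate.
```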

K-fold cross-validation: given dataset X, partition X into K disjoint sets X_1, ..., X_K. For i = 1 to K: learn a classifier on the training set X - X_i and test it on the testing set X_i (compute the error). Compute the average error over the K trials. The testing sets no longer overlap, but the training sets still do.
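The same loop with disjoint test folds, sketched with scikit-learn's KFold (dataset and classifier are again placeholders):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

errors = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                      # train on X - X_i
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))  # test on X_i

print("10-fold CV error: %.3f +/- %.3f" % (np.mean(errors), np.std(errors)))
```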

Stratification: the distribution of classes in the training and testing sets should be the same as in the original dataset; this is called stratified cross-validation. Leave-one-out cross-validation: K = N = |X|; used when labeled data is scarce.
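Stratified folds and leave-one-out are small variations in the same framework; a sketch assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (StratifiedKFold, LeaveOneOut,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Stratified K-fold: each fold preserves the original class proportions.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc_strat = cross_val_score(clf, X, y, cv=skf)

# Leave-one-out: K = N = |X|, useful when labeled data is scarce.
acc_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print("stratified 10-fold error:", 1.0 - acc_strat.mean())
print("leave-one-out error:     ", 1.0 - acc_loo.mean())
```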

Tom Dietterich, 1998 (5x2 cross-validation): for each of 5 trials (shuffling X each time), divide X into two halves X_1 and X_2; compute the error using X_1 for training and X_2 for testing, then compute the error using X_2 for training and X_1 for testing. Compute the average error over all 10 results. Five trials is the recommended number to minimize overlap among the training and testing sets.
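A sketch of Dietterich's 5x2 procedure: five shuffles, two half-splits each, ten error estimates in total (dataset and classifier are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

errors = []
for trial in range(5):                      # 5 trials, reshuffling each time
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=trial)
    for (Xtr, ytr, Xte, yte) in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:
        clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
        errors.append(1.0 - clf.score(Xte, yte))   # one of the 10 error estimates

print("5x2 CV mean error: %.3f" % np.mean(errors))
```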

If there is not enough data for k-fold cross-validation (the bootstrap): generate multiple samples of size N from X by sampling with replacement. Each sample contains approximately 63% of the distinct examples in X, since the probability that a given example is never drawn is (1 - 1/N)^N, which approaches e^(-1) = 0.37. Compute the average error over all samples.
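A bootstrap sketch: draw N examples with replacement for training and, as one common convention not spelled out on the slide, evaluate on the examples left out of each bootstrap sample:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
N = len(X)
rng = np.random.default_rng(0)

errors = []
for b in range(200):                               # number of bootstrap samples
    idx = rng.integers(0, N, size=N)               # sample N indices with replacement
    oob = np.setdiff1d(np.arange(N), idx)          # ~37% of examples are left out
    if len(oob) == 0:
        continue
    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    errors.append(1.0 - clf.score(X[oob], y[oob]))  # error on the left-out examples

print("bootstrap (out-of-bag) error: %.3f" % np.mean(errors))
```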

Confusion matrix (rows: true class, columns: predicted class):

True \ Predicted    Positive               Negative               Total
Positive            tp (true positive)     fn (false negative)    p
Negative            fp (false positive)    tn (true negative)     n
Total               p'                     n'                     N

Name          Formula
error         (fp + fn) / N
accuracy      (tp + tn) / N
tp-rate       tp / p
fp-rate       fp / n
precision     tp / p'
recall        tp / p   (= tp-rate)
sensitivity   tp / p   (= tp-rate)
specificity   tn / n   (= 1 - fp-rate)

F-measure: F = 2 * precision * recall / (precision + recall)
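These quantities follow directly from the confusion-matrix counts; a small sketch with made-up counts:

```python
# Illustrative counts; any confusion matrix works.
tp, fn = 30, 10    # true positives, false negatives  (p = tp + fn actual positives)
fp, tn = 5, 55     # false positives, true negatives  (n = fp + tn actual negatives)

p, n = tp + fn, fp + tn
N = p + n

error       = (fp + fn) / N
accuracy    = (tp + tn) / N
tp_rate     = tp / p                 # recall, sensitivity
fp_rate     = fp / n
precision   = tp / (tp + fp)         # tp over predicted positives p'
specificity = tn / n                 # = 1 - fp_rate
f_measure   = 2 * precision * tp_rate / (precision + tp_rate)

print(f"error={error:.3f} acc={accuracy:.3f} prec={precision:.3f} "
      f"recall={tp_rate:.3f} F={f_measure:.3f}")
```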

=== Run information ===

Scheme:     weka.classifiers.rules.OneR -B 6
Relation:   labor-neg-data
Instances:  57
Attributes: 17
            duration
            wage-increase-first-year
            wage-increase-second-year
            wage-increase-third-year
            cost-of-living-adjustment
            working-hours
            pension
            standby-pay
            shift-differential
            education-allowance
            statutory-holidays
            vacation
            longterm-disability-assistance
            contribution-to-dental-plan
            bereavement-assistance
            contribution-to-health-plan
            class

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

wage-increase-first-year:
    < 2.9   -> bad
    >= 2.9  -> good
    ?       -> good
(48/57 instances correct)

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          43                75.4386 %
Incorrectly Classified Instances        14                24.5614 %
Kappa statistic                          0.4063
Mean absolute error                      0.2456
Root mean squared error                  0.4956
Relative absolute error                 53.6925 %
Root relative squared error            103.7961 %
Coverage of cases (0.95 level)          75.4386 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               57

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.45      0.081     0.75        0.45     0.563       0.684      bad
               0.919     0.55      0.756       0.919    0.829       0.684      good
Weighted Avg.  0.754     0.385     0.754       0.754    0.736       0.684

=== Confusion Matrix ===

  a  b   <-- classified as
  9 11 |  a = bad
  3 34 |  b = good

Most comparisons of machine learning algorithms use classification error. Problems with this approach: there may be different costs associated with false positive and false negative errors, and the training data may not reflect the true class distribution.

Receiver Operating Characteristic (ROC): originated in signal detection theory, is common in medical diagnosis, and is becoming common in ML evaluations. ROC curves assess predictive behavior independent of error costs or class distributions. Area Under the ROC Curve (AUC): a single measure of learning algorithm performance, independent of error costs and class distributions.

[ROC plot: true positive rate vs. false positive rate for learners L1, L2, and L3, plus the random-guessing diagonal.]

Learner L1 dominates L2 if L1's ROC curve is always above L2's curve. If L1 dominates L2, then L1 is better than L2 for all possible error costs and class distributions. If neither dominates (as with L2 and L3), then different classifiers are better under different conditions.

Assume the classifier outputs P(C|x) instead of just C (the predicted class for instance x). Let θ be a threshold such that if P(C|x) > θ, then x is classified as C, else not C. Compute the fp-rate and tp-rate for different values of θ from 0 to 1. Plot each (fp-rate, tp-rate) point and interpolate (or take the convex hull). If multiple points share the same fp-rate (e.g., from k-fold cross-validation), average their tp-rates.
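A sketch of sweeping the threshold θ over posterior scores to trace the curve; here a decision tree's predict_proba stands in for P(C|x), and scikit-learn's roc_curve does the per-threshold bookkeeping (a manual loop over θ would give the same points):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)          # binary placeholder dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]               # P(C | x) for the positive class

# Each threshold theta yields one (fp-rate, tp-rate) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC = %.3f" % roc_auc_score(y_te, scores))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"theta={th:.2f}  fp-rate={f:.2f}  tp-rate={t:.2f}")
```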

What if the classifier does not provide P(C|x), but just C (e.g., a decision tree or rule learner)? Generally, even these discrete classifiers maintain statistics usable for scoring: decision tree leaf nodes store the proportion of training examples of each class, and rules record the number of examples they cover. These statistics can be compared against a varying threshold θ.

[ROC curves for J48 vs. NN on the labor dataset: true positive rate vs. false positive rate.]

We have seen several ways to estimate learning performance: train/test split, cross-validation, ROC, AUC. But how good are these estimates of the true performance? E.g., how close is error_S(h) to error_D(h)?

Estimate the mean μ of a normal distribution N(μ, σ²). Given a sample X = {x^t} of size N, estimate m = Σ_t x^t / N, where m ~ N(μ, σ²/N). Define the statistic Z with a unit normal distribution N(0, 1): √N (m - μ) / σ ~ Z.

95% of Z lies in (-1.96, 1.96); 99% of Z lies in (-2.58, 2.58). P(-1.96 < Z < 1.96) = 0.95 gives a two-sided confidence interval.

P(-1.96 < √N (m - μ)/σ < 1.96) = 0.95
P(m - 1.96 σ/√N < μ < m + 1.96 σ/√N) = 0.95
In general: P(m - z_{α/2} σ/√N < μ < m + z_{α/2} σ/√N) = 1 - α

1 - α:    0.99  0.98  0.95  0.90  0.80  0.68  0.50
z_{α/2}:  2.58  2.33  1.96  1.64  1.28  1.00  0.67
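The tabulated values are just quantiles of the unit normal; a sketch computing them and the resulting two-sided interval with scipy (m, σ, and N below are made-up numbers):

```python
import numpy as np
from scipy.stats import norm

# z_{alpha/2} for the common confidence levels in the table above.
for conf in [0.99, 0.98, 0.95, 0.90, 0.80, 0.68, 0.50]:
    print(f"1-alpha={conf:.2f}  z={norm.ppf(1 - (1 - conf) / 2):.2f}")

# Two-sided 95% interval for mu when sigma is known (illustrative numbers).
m, sigma, N, alpha = 3.0, 0.15, 10, 0.05
half = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(N)
print(f"95% CI for mu: ({m - half:.3f}, {m + half:.3f})")
```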

One-sided:
P(√N (m - μ)/σ < 1.64) = 0.95
P(m - 1.64 σ/√N < μ) = 0.95
In general: P(m - z_α σ/√N < μ) = 1 - α

1 - α: 0.99  0.95  0.90
z_α:   2.33  1.64  1.28

The previous analysis requires that we know σ². We can instead use the sample variance S² = Σ_t (x^t - m)² / (N - 1). When x^t ~ N(μ, σ²), (N - 1) S²/σ² is chi-squared with N - 1 degrees of freedom. Since m and S² are independent, √N (m - μ)/S is t-distributed with N - 1 degrees of freedom.

The t distribution is similar to the normal, but with a larger spread (longer tails), corresponding to the additional uncertainty from using the sample variance.

When σ² is not known: √N (m - μ)/S ~ t_{N-1}, so P(m - t_{α/2, N-1} S/√N < μ < m + t_{α/2, N-1} S/√N) = 1 - α. E.g., t_{0.025,9} = 2.685 and t_{0.025,29} = 2.364 (2-tailed).

Sample: x^t = 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0 (t = 1, ..., 10).
m = 30/10 = 3.0, S² = 0.2/9 = 0.022, S = 0.149.
α = 0.05, df = N - 1 = 9, t_{0.025,9} = 2.685.
P{3 - 0.127 ≤ μ ≤ 3 + 0.127} = P{2.873 ≤ μ ≤ 3.127} = 0.95.
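A sketch of the same computation with scipy. The critical value comes from scipy's percent-point function with α/2 in each tail; printed t tables index their tail probabilities in more than one way, so treat the exact critical value as convention-dependent:

```python
import numpy as np
from scipy.stats import t

x = np.array([3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0])
N = len(x)
m = x.mean()
S2 = x.var(ddof=1)            # sample variance, divides by N - 1
S = np.sqrt(S2)

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=N - 1)
half = t_crit * S / np.sqrt(N)
print(f"m={m:.3f}  S^2={S2:.4f}  t_crit={t_crit:.3f}")
print(f"95% CI for mu: ({m - half:.3f}, {m + half:.3f})")
```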

We want to claim a hypothesis H_1, e.g., H_1: error_D(h) < 0.10. Define the opposite of H_1 to be the null hypothesis H_0, e.g., H_0: error_D(h) ≥ 0.10. Perform an experiment collecting data about error_D(h). With what probability can we reject H_0?

Example: sample X = {x^t} of size N from N(μ, σ²); estimate the mean m = Σ_t x^t / N. We want to test whether μ equals some constant μ_0. Null hypothesis H_0: μ = μ_0; alternative hypothesis H_1: μ ≠ μ_0. Reject H_0 if m is too far from μ_0.

Example (cont.): we fail to reject H_0 with level of significance α if μ_0 lies in the (1 - α) confidence interval, i.e., if √N (m - μ_0)/σ lies in (-z_{α/2}, z_{α/2}). We reject H_0 if μ_0 falls outside this interval on either side (two-sided test).

Example (cont.): one-sided test, H_0: μ ≤ μ_0 vs. H_1: μ > μ_0. Fail to reject H_0 with level of significance α if √N (m - μ_0)/σ lies in (-∞, z_α); reject H_0 if it falls outside this interval.

Example (cont.): if the variance σ² is unknown, use the sample variance S²; the statistic is then described by the Student's t distribution: √N (m - μ_0)/S ~ t_{N-1}. Fail to reject H_0 with level of significance α if √N (m - μ_0)/S lies in (-∞, t_{α, N-1}); reject H_0 if it falls outside this interval.

Example (cont.): H_0: μ ≤ μ_0 vs. H_1: μ > μ_0 (one-sided), with the same sample x^t = 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0. μ_0 = 2.9, m = 30/10 = 3.0, S² = 0.2/9 = 0.022, S = 0.149. α = 0.05, df = N - 1 = 9, t_{0.05,9} = 1.833. √N (m - μ_0)/S = 2.121, which falls outside (-∞, 1.833), so reject H_0. Note that t_{0.03145,9} = 2.121, i.e., the p-value is about 0.031.
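The same one-sided test can be run with scipy's one-sample t test; on the ten measurements above it gives a statistic near 2.12 and a p-value near 0.03, consistent with the slide (the alternative argument assumes scipy >= 1.6):

```python
import numpy as np
from scipy.stats import ttest_1samp

x = np.array([3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0])
mu0 = 2.9

# H0: mu <= 2.9  vs  H1: mu > 2.9 (one-sided).
res = ttest_1samp(x, popmean=mu0, alternative='greater')
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
# Reject H0 at alpha = 0.05 if p < 0.05.
```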

Learn a classifier on the training set and test it on a test set V of size N. Assume the classifier makes an error with probability p, and let X be the number of errors made by the classifier on V. X is described by a binomial distribution: P{X = j} = (N choose j) p^j (1 - p)^(N-j).

Test the hypothesis H_0: p ≤ p_0 vs. H_1: p > p_0. Reject H_0 with significance α if P{X ≥ e} = Σ_{j=e}^{N} (N choose j) p_0^j (1 - p_0)^(N-j) < α, where e is the number of errors observed on V.

Approximating X with a normal distribution: X is the sum of N independent random variables from the same distribution, so X/N is approximately normal for large N with mean p_0 and variance p_0(1 - p_0)/N (central limit theorem): (X/N - p_0) / √(p_0(1 - p_0)/N) ~ Z. Fail to reject H_0 (p ≤ p_0) with significance α if this statistic falls in (-∞, z_α) (e.g., z_{0.05} = 1.64); reject H_0 otherwise. The approximation works well for Np ≥ 5 and N(1 - p) ≥ 5.

Example: recall the earlier example with error_S(h) = 0.3, N = |S| = 40, X = 12, and error_D(h) = p unknown. Test H_0: p ≤ p_0 vs. H_1: p > p_0 with p_0 = 0.2 and α = 0.05. (X/N - p_0) / √(p_0(1 - p_0)/N) = (0.3 - 0.2) / √(0.2 · 0.8 / 40) = 1.58, which lies in (-∞, 1.64) = (-∞, z_{0.05}), so we fail to reject H_0.
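A sketch of both the exact binomial computation and the normal approximation for this example (N = 40, X = 12, p_0 = 0.2), using scipy:

```python
import numpy as np
from scipy.stats import binom, norm

N, X, p0, alpha = 40, 12, 0.2, 0.05

# Exact binomial test: P{X >= 12 | p = p0}.
p_exact = binom.sf(X - 1, N, p0)        # sf(k) = P{X > k}, so sf(X-1) = P{X >= X}
print(f"exact binomial p-value: {p_exact:.4f}")

# Normal approximation: (X/N - p0) / sqrt(p0 (1 - p0) / N) ~ Z.
z = (X / N - p0) / np.sqrt(p0 * (1 - p0) / N)
z_alpha = norm.ppf(1 - alpha)           # one-sided critical value, ~1.64
print(f"z = {z:.2f}, critical value = {z_alpha:.2f}")
# Here z ~= 1.58 < 1.64, so we fail to reject H0: p <= 0.2.
```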

Example (cont.): what is the 95% (α = 0.05) confidence interval around error_D(h) = p? Let p_0 = error_S(h) = 0.3. P(p_0 - z_{α/2} √(p_0(1 - p_0)/N) < p < p_0 + z_{α/2} √(p_0(1 - p_0)/N)) = 1 - α, so P(0.3 - 1.96 √(0.3 · 0.7 / 40) < p < 0.3 + 1.96 √(0.3 · 0.7 / 40)) = P{0.3 - 0.142 < p < 0.3 + 0.142} = P{0.158 < p < 0.442} = 0.95.
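The corresponding interval as a quick scipy sketch:

```python
import numpy as np
from scipy.stats import norm

p_hat, N, alpha = 0.3, 40, 0.05
half = norm.ppf(1 - alpha / 2) * np.sqrt(p_hat * (1 - p_hat) / N)
print(f"95% CI for error_D(h): ({p_hat - half:.3f}, {p_hat + half:.3f})")
# Matches the slide: roughly (0.158, 0.442).
```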

Evaluate the learner on K training/testing sets, yielding errors p_i, 1 ≤ i ≤ K. Let m = Σ_{i=1}^{K} p_i / K and S² = Σ_{i=1}^{K} (p_i - m)² / (K - 1). Then √K (m - p_0) / S ~ t_{K-1}. Reject H_0 with significance α if this value is greater than t_{α, K-1}. Typically K is 10 or 30 (t_{0.05,9} = 1.83, t_{0.05,29} = 1.70).
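A sketch of this test on K error estimates; the error values and p_0 below are made up:

```python
import numpy as np
from scipy.stats import t

# Illustrative per-fold error estimates from K = 10 train/test runs.
errors = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.16, 0.12, 0.10, 0.14])
p0, alpha = 0.10, 0.05          # H0: true error <= p0  vs  H1: true error > p0
K = len(errors)

m = errors.mean()
S = errors.std(ddof=1)
stat = np.sqrt(K) * (m - p0) / S
t_crit = t.ppf(1 - alpha, df=K - 1)     # t_{alpha, K-1}, one-sided

print(f"statistic = {stat:.3f}, t_crit = {t_crit:.3f}")
print("reject H0" if stat > t_crit else "fail to reject H0")
```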

K-fold cross-validated paired t test. Paired test: both learners get the same training/testing sets. Use K-fold CV to get K training/testing folds. Let p_i^1 and p_i^2 be the errors of learners 1 and 2 on fold i, and p_i = p_i^1 - p_i^2 the paired difference on fold i. The null hypothesis is that p_i has mean 0: H_0: μ = 0 vs. H_1: μ ≠ 0, with m = Σ_{i=1}^{K} p_i / K and s² = Σ_{i=1}^{K} (p_i - m)² / (K - 1). Then √K · m / s ~ t_{K-1}; accept H_0 at significance α if √K · m / s lies in (-t_{α/2, K-1}, t_{α/2, K-1}).
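A paired-test sketch: both learners are scored on the same K folds and the per-fold differences are tested against mean zero. The two classifiers are illustrative choices, and scikit-learn's cross_val_score supplies the per-fold accuracies:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # same folds for both learners

err1 = 1.0 - cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
err2 = 1.0 - cross_val_score(GaussianNB(), X, y, cv=cv)

# H0: the mean of the paired differences p_i = p_i^1 - p_i^2 is zero.
res = ttest_rel(err1, err2)
print(f"mean difference = {np.mean(err1 - err2):+.3f}, "
      f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```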

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Percent_correct
Datasets:   8
Resultsets: 2
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       10/6/10 12:00 AM

Dataset                   (1) rules.on   (2) bayes
--------------------------------------------------
loan              (100)        39.50        84.50 v
contact-lenses    (100)        72.17        76.17
iris              (100)        93.53        95.53
labor-neg-data    (100)        72.77        93.57 v
segment           (100)        63.33        81.12 v
soybean           (100)        39.75        92.94 v
weather           (100)        36.00        67.50
weather.symbolic  (100)        38.00        57.50
--------------------------------------------------
                             (v/ /*)      (4/4/0)

Be careful when comparing more than two learners. Each comparison has probability α of yielding an incorrect conclusion (incorrectly rejecting the null hypothesis, i.e., incorrectly concluding that learner A is better than learner B). The probability of at least one incorrect conclusion among c comparisons is 1 - (1 - α)^c. One approach to controlling this: analysis of variance (ANOVA).
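A quick numeric illustration of how the chance of at least one incorrect conclusion grows with the number of comparisons c, together with the simple Bonferroni adjustment (ANOVA, mentioned above, is the more principled alternative):

```python
alpha = 0.05
for c in [1, 2, 5, 10, 20]:
    familywise = 1 - (1 - alpha) ** c          # P(at least one false rejection)
    bonferroni = alpha / c                     # per-comparison level keeping ~alpha overall
    print(f"c={c:2d}  P(>=1 incorrect conclusion)={familywise:.3f}  "
          f"Bonferroni per-test alpha={bonferroni:.4f}")
```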

Evaluating a learning algorithm: error of learned hypotheses (and other measures); K-fold and 5x2 cross-validation; ROC curves and AUC; confidence in the error estimate; comparing two learners.