Smart Home Health Analytics
Information Systems, University of Maryland Baltimore County
[Image credit: IEEE Expert, October 1996]
Given sample S from all possible examples D
Learner L learns hypothesis h based on S
Sample error: error_S(h)
True error: error_D(h)
Example: hypothesis h misclassifies 12 of the 40 examples in S, so error_S(h) = 0.3
What is error_D(h)?
Learner A learns hypothesis h_A on sample S
Learner B learns hypothesis h_B on sample S
Observe: error_S(h_A) < error_S(h_B)
Is error_D(h_A) < error_D(h_B)?
Is learner A better than learner B?
How can we estimate the true error of a classifier?
How can we determine if one learner is better than another?
Using sample error is too optimistic
Using error on a separate test set is better, but might still be misleading
Repeating the above for multiple iterations, each with different training/testing sets, yields a better estimate of the true error
David Wolpert, 1995
For any learning algorithm there are datasets for which it does well, and datasets for which it does poorly
Performance estimates are based on specific datasets, not an estimate of the learner on all datasets
There is no one best learning algorithm
Multiple iterations of learning on a training set and testing on a separate validation set are only for evaluation and parameter tuning
Final learning should be done on all available data
If the validation set is used to choose/tune a learning method, then it cannot also be used to compare performance against another learning algorithm
Need yet another test set that is unseen during tuning/learning
Error costs (false positives vs. false negatives)
Training time and space complexity
Testing time and space complexity
Interpretability
Ease of implementation
Given dataset X
For each of K trials:
  Randomly divide X into a training set (2/3) and a testing set (1/3)
  Learn classifier on the training set
  Test classifier on the testing set (compute error)
Compute average error over the K trials
Problem: training and testing sets overlap between trials, which biases the results
(A sketch of this procedure appears below.)
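A minimal sketch of the repeated random train/test split above in plain Python. The train_fn/error_fn interface, the seed, and the 2/3 split fraction are illustrative assumptions, not part of the original slides.

```python
import random

def repeated_holdout(X, y, train_fn, error_fn, K=10, train_frac=2/3, seed=0):
    """Average test error over K random 2/3-1/3 train/test splits (illustrative)."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(K):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(train_frac * n)
        train_idx, test_idx = idx[:cut], idx[cut:]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors.append(error_fn(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(errors) / K
```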
Given dataset X
Partition X into K disjoint sets X_1, ..., X_K
For i = 1 to K:
  Learn classifier on training set X \ X_i
  Test classifier on testing set X_i (compute error)
Compute average error over the K trials
Testing sets no longer overlap
Training sets still overlap
(A sketch of K-fold cross validation appears below.)
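A minimal sketch of K-fold cross validation, assuming the same hypothetical train_fn/error_fn interface as above; every example lands in exactly one test fold.

```python
import random

def kfold_indices(n, K=10, seed=0):
    """Split indices 0..n-1 into K disjoint folds (illustrative)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::K] for i in range(K)]  # fold i takes every K-th shuffled index

def kfold_error(X, y, train_fn, error_fn, K=10, seed=0):
    """Average test error over the K folds."""
    folds = kfold_indices(len(X), K, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i+1:] for j in f]  # X \ X_i
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        errors.append(error_fn(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return sum(errors) / K
```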
Stratification: the distribution of classes in the training and testing sets should be the same as in the original dataset (called stratified cross validation)
Leave-one-out cross validation: K = N = |X|
  Used when classified data is scarce, e.g., medical diagnosis
Tom Dietterich, 1998 (5x2 cross validation)
For each of 5 trials (shuffling X each time):
  Divide X randomly into two halves X_1 and X_2
  Compute error using X_1 as training and X_2 as testing
  Compute error using X_2 as training and X_1 as testing
Compute the average error of all 10 results
5 trials is the best number to minimize overlap among training and testing sets
(A sketch appears below.)
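A minimal sketch of the 5x2 procedure above, again assuming the hypothetical train_fn/error_fn interface used earlier.

```python
import random

def five_by_two_errors(X, y, train_fn, error_fn, seed=0):
    """Return the 10 test errors from 5 trials of two-fold cross validation (illustrative)."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(5):
        idx = list(range(n))
        rng.shuffle(idx)
        half1, half2 = idx[:n // 2], idx[n // 2:]
        for train_idx, test_idx in ((half1, half2), (half2, half1)):
            model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
            errors.append(error_fn(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return errors  # average these, or feed them to the 5x2 paired t test
```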
If there is not enough data for k-fold cross validation (bootstrapping):
Generate multiple samples of size N from X by sampling with replacement
Each sample contains approximately 63% of the distinct examples in X
Compute average error over all samples
Draw instances from a dataset with replacement
Probability that a given instance is not picked after N draws: $(1 - \frac{1}{N})^N \approx e^{-1} \approx 0.368$
That is, only 36.8% of the instances are "new" (never drawn), so each bootstrap sample contains about 63.2% of the distinct instances
(A small simulation follows below.)
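A small simulation of the 63.2% figure above; the dataset size N and the number of trials are arbitrary choices for illustration.

```python
import random

def bootstrap_coverage(N=1000, trials=100, seed=0):
    """Estimate the fraction of distinct instances that appear in a bootstrap sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(trials):
        sample = {rng.randrange(N) for _ in range(N)}  # draw N times with replacement
        fractions.append(len(sample) / N)
    return sum(fractions) / trials

print(bootstrap_coverage())  # approximately 0.632, i.e. 1 - e**-1
```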
Confusion matrix (rows: true class, columns: predicted class):

True class   Predicted positive    Predicted negative    Total
Positive     tp (true positive)    fn (false negative)   p
Negative     fp (false positive)   tn (true negative)    n
Total        tp + fp               fn + tn               N
Name          Formula
error         (fp + fn) / N
accuracy      (tp + tn) / N
tp-rate       tp / p
fp-rate       fp / n
precision     tp / (tp + fp)
recall        tp / p = tp-rate
sensitivity   tp / p = tp-rate
specificity   tn / n = 1 - fp-rate

F-measure: $F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
Error rate = # of errors / # of instances = (FN + FP) / N
Precision = # of found positives / # of found = TP / (TP + FP)
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 - specificity
(Helper functions computing these measures appear below.)
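A small helper that computes the measures above from raw confusion-matrix counts; purely illustrative, with no checks beyond avoiding division by zero.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the standard measures from confusion-matrix counts (illustrative)."""
    n_total = tp + fp + tn + fn
    safe = lambda num, den: num / den if den else 0.0
    precision = safe(tp, tp + fp)
    recall = safe(tp, tp + fn)                    # = sensitivity = tp-rate = hit rate
    return {
        "error": safe(fn + fp, n_total),
        "accuracy": safe(tp + tn, n_total),
        "precision": precision,
        "recall": recall,
        "specificity": safe(tn, tn + fp),         # = 1 - false alarm rate
        "fp_rate": safe(fp, fp + tn),             # false alarm rate
        "f_measure": safe(2 * precision * recall, precision + recall),
    }

# Example: counts from the labor-negotiations confusion matrix shown later,
# treating class "good" as the positive class
print(confusion_metrics(tp=34, fp=11, tn=9, fn=3))
```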
Sensitivity is the same as tp-rate and recall
Specificity is how well we detect the negatives: # of true negatives / total # of negatives = 1 - false alarm rate
Sensitivity vs. specificity curve for different thresholds
=== Run information ===
Scheme:     weka.classifiers.rules.OneR -B 6
Relation:   labor-neg-data
Instances:  57
Attributes: 17
  duration, wage-increase-first-year, wage-increase-second-year, wage-increase-third-year, cost-of-living-adjustment, working-hours, pension, standby-pay, shift-differential, education-allowance, statutory-holidays, vacation, longterm-disability-assistance, contribution-to-dental-plan, bereavement-assistance, contribution-to-health-plan, class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

wage-increase-first-year:
  < 2.9  -> bad
  >= 2.9 -> good
  ?      -> good
(43/57 instances correct)

Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        43      75.4386 %
Incorrectly Classified Instances      14      24.5614 %
Kappa statistic                        0.4063
Mean absolute error                    0.2456
Root mean squared error                0.4956
Relative absolute error               53.6925 %
Root relative squared error          103.7961 %
Coverage of cases (0.95 level)        75.4386 %
Mean rel. region size (0.95 level)    50 %
Total Number of Instances             57
=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.45     0.081    0.75       0.45    0.563      0.684     bad
               0.919    0.55     0.756      0.919   0.829      0.684     good
Weighted Avg.  0.754    0.385    0.754      0.754   0.736      0.684

=== Confusion Matrix ===

  a  b   <-- classified as
  9 11 | a = bad
  3 34 | b = good

As a table (rows: true class, columns: predicted class):

True class       Predicted a    Predicted b    Total
Class a (bad)         9              11          20
Class b (good)        3              34          37
Total                12              45          57
Most comparisons of machine learning algorithms use classification error
Problems with this approach:
  There may be different costs associated with false positive and false negative errors
  Training data may not reflect the true class distribution
Receiver Operating Characteristic (ROC)
  Originated from signal detection theory
  Common in medical diagnosis
  Becoming common in ML evaluations
  ROC curves assess predictive behavior independent of error costs or class distributions
Area Under ROC Curve (AUC)
  Single measure of learning algorithm performance independent of error costs and class distributions
[ROC plot: true positive rate (0 to 1.0) vs. false positive rate (0 to 1.0) for learners L1, L2, L3, and a random classifier]
Learner L1 dominates L2 if L1's ROC curve is always above L2's curve
If L1 dominates L2, then L1 is better than L2 for all possible error costs and class distributions
If neither dominates (e.g., L2 and L3), then different classifiers are better under different conditions
Assume the classifier outputs P(C|x) instead of just C (the predicted class for instance x)
Let θ be a threshold such that if P(C|x) > θ, then x is classified as C, else not C
Compute fp-rate and tp-rate for different values of θ from 0 to 1
Plot each (fp-rate, tp-rate) point and interpolate (or take the convex hull)
If multiple points have the same fp-rate, then average their tp-rates (k-fold cross-validation)
(A sketch of this construction, plus AUC via the trapezoid rule, appears below.)
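A minimal sketch of building ROC points by sweeping a threshold over scores P(C|x), and computing AUC with the trapezoid rule; the score and label arrays are made-up example data.

```python
def roc_points(scores, labels):
    """(fp_rate, tp_rate) points obtained by sweeping a threshold over the scores."""
    p = sum(labels)            # number of positives (labels are 0/1)
    n = len(labels) - p        # number of negatives
    points = []
    for theta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 0)
        points.append((fp / n, tp / p))
    return [(0.0, 0.0)] + points + [(1.0, 1.0)]

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Made-up scores P(C|x) and true labels (1 = positive)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(auc(roc_points(scores, labels)))
```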
What if the classifier does not provide P(C|x), but just C? (e.g., a decision tree or rule set)
Generally, even these discrete classifiers maintain statistics for classification:
  Decision tree leaf nodes use the proportion of examples of each class
  Rules have the number of examples covered by the rule
These statistics can be compared against a varying threshold θ
[ROC plot: J48 vs. NN on the labor dataset; true positive rate vs. false positive rate]
We have seen several ways to estimate learning performance: train/test split, cross-validation, ROC, AUC
But how good are these at estimating the true performance? E.g., is error_S(h) ≈ error_D(h)?
Estimate the mean μ of a normal distribution N(μ, σ²)
Given sample X = {x^t} of size N
Estimate m = Σ_t x^t / N, where m ~ N(μ, σ²/N)
Define a statistic Z with a unit normal distribution N(0, 1):
$\frac{\sqrt{N}(m - \mu)}{\sigma} \sim Z$
95% of Z lies in (-1.96, 1.96); 99% of Z lies in (-2.58, 2.58)
P(-1.96 < Z < 1.96) = 0.95
Two-sided confidence interval
$P\left(-1.96 < \frac{\sqrt{N}(m - \mu)}{\sigma} < 1.96\right) = 0.95$
$P\left(m - 1.96\frac{\sigma}{\sqrt{N}} < \mu < m + 1.96\frac{\sigma}{\sqrt{N}}\right) = 0.95$
In general: $P\left(m - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} < \mu < m + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}\right) = 1 - \alpha$

1-α    z_{α/2}
0.99   2.58
0.98   2.33
0.95   1.96
0.90   1.64
0.80   1.28
0.68   1.00
0.50   0.67

(A small helper computing such intervals appears below.)
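A minimal helper for the two-sided interval above, using scipy for the normal quantile; the m, sigma, and N values in the example call are made up for illustration.

```python
from math import sqrt
from scipy.stats import norm

def mean_confidence_interval(m, sigma, N, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for mu when sigma is known."""
    z = norm.ppf(1 - alpha / 2)          # e.g. 1.96 for alpha = 0.05
    half_width = z * sigma / sqrt(N)
    return m - half_width, m + half_width

print(mean_confidence_interval(m=3.0, sigma=0.15, N=10))  # illustrative numbers
```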
One-sided:
$P\left(\frac{\sqrt{N}(m - \mu)}{\sigma} < 1.64\right) = 0.95$
$P\left(\mu > m - 1.64\frac{\sigma}{\sqrt{N}}\right) = 0.95$

1-α    z_α
0.99   2.33
0.95   1.64
0.90   1.28
The previous analysis requires that we know σ²
We can use the sample variance $S^2 = \frac{\sum_t (x^t - m)^2}{N - 1}$
When x^t ~ N(μ, σ²), then (N-1)S²/σ² is chi-squared with N-1 degrees of freedom
Since m and S² are independent, $\frac{\sqrt{N}(m - \mu)}{S}$ is t-distributed with N-1 degrees of freedom
Similar to the normal, but with larger spread (longer tails)
Corresponds to the additional uncertainty from using the sample variance
When σ² is not known:
$\frac{\sqrt{N}(m - \mu)}{S} \sim t_{N-1}$, where $S^2 = \frac{\sum_t (x^t - m)^2}{N - 1}$
$P\left(m - t_{\alpha/2, N-1}\frac{S}{\sqrt{N}} < \mu < m + t_{\alpha/2, N-1}\frac{S}{\sqrt{N}}\right) = 1 - \alpha$
E.g., t_{0.025,9} = 2.685/2.262, t_{0.025,29} = 2.364/2.045 (2-tailed)
Example:
  Data x^t: 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0 (N = 10, Σ_t x^t = 30)
  m = 30/10 = 3.0
  S² = 0.2/9 ≈ 0.022, S ≈ 0.149
  α = 0.05, df = N - 1 = 9
  $P\left(m - t_{\alpha/2, N-1}\frac{S}{\sqrt{N}} < \mu < m + t_{\alpha/2, N-1}\frac{S}{\sqrt{N}}\right) = 1 - \alpha$
  With t_{0.025,9} = 2.685: P(3 - 0.127 < μ < 3 + 0.127) = P(2.873 < μ < 3.127) = 0.95
We want to claim a hypothesis H_1, e.g., H_1: error_D(h) < 0.10
Define the opposite of H_1 to be the null hypothesis H_0, e.g., H_0: error_D(h) ≥ 0.10
Perform an experiment collecting data about error_D(h)
With what probability can we reject H_0?
Example:
  Sample X = {x^t} of size N from N(μ, σ²)
  Estimate the mean m = Σ_t x^t / N
  Want to test whether μ equals some constant μ_0
  Null hypothesis H_0: μ = μ_0
  Alternative hypothesis H_1: μ ≠ μ_0
  Reject H_0 if m is too far from μ_0
Example (cont.):
We fail to reject H_0 with level of significance α if μ_0 lies in the (1 - α) confidence interval, i.e., if
$\frac{\sqrt{N}(m - \mu_0)}{\sigma} \in (-z_{\alpha/2}, z_{\alpha/2})$
We reject H_0 if μ_0 falls outside this interval on either side (two-sided test)
Example (cont.):
One-sided test: H_0: μ ≤ μ_0 vs. H_1: μ > μ_0
Fail to reject H_0 with level of significance α if
$\frac{\sqrt{N}(m - \mu_0)}{\sigma} \in (-\infty, z_{\alpha})$
Reject H_0 if it falls outside this interval
Example (cont.):
If the variance σ² is unknown, use the sample variance S²
The statistic is now described by the Student t distribution:
$\frac{\sqrt{N}(m - \mu_0)}{S} \sim t_{N-1}$
Fail to reject H_0 with level of significance α if
$\frac{\sqrt{N}(m - \mu_0)}{S} \in (-\infty, t_{\alpha, N-1})$
Reject H_0 if it falls outside this interval
Example (cont.):
  H_0: μ ≤ μ_0 vs. H_1: μ > μ_0 (one-sided)
  Data x^t: 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0 (as before), μ_0 = 2.9
  m = 30/10 = 3.0, S² ≈ 0.022, S ≈ 0.149
  α = 0.05, df = N - 1 = 9, t_{0.05,9} = 1.833
  $\frac{\sqrt{N}(m - \mu_0)}{S} = \frac{\sqrt{10}(3.0 - 2.9)}{0.149} \approx 2.121 \notin (-\infty, 1.833)$, so reject H_0
  Note that t_{0.03145,9} = 2.121, i.e., the p-value is about 0.031
  (A sketch of this test in code appears below.)
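A minimal sketch of the one-sided t test above using scipy; the data are the ten values from the example.

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

x = [3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0]
mu0, alpha = 2.9, 0.05
N = len(x)
m, S = mean(x), stdev(x)                 # sample mean and sample std (N - 1 denominator)
stat = sqrt(N) * (m - mu0) / S           # ~ t with N - 1 degrees of freedom under H0
critical = t.ppf(1 - alpha, N - 1)       # one-sided critical value, ~1.833
p_value = 1 - t.cdf(stat, N - 1)
print(stat, critical, p_value)           # ~2.12 > 1.833, p ~ 0.03, so reject H0
```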
Learn classifier on the training set
Test classifier on a test set V of size N
Assume probability p of error by the classifier
X = number of errors made by the classifier on V
X is described by the binomial distribution:
$P(X = j) = \binom{N}{j} p^j (1 - p)^{N-j}$
Test the hypothesis H_0: p ≤ p_0 vs. H_1: p > p_0
Reject H_0 with significance α if the observed number of errors e is improbably large under p_0, i.e., if
$P(X \geq e) = \sum_{j=e}^{N} \binom{N}{j} p_0^j (1 - p_0)^{N-j} < \alpha$
(under H_0 the expected number of errors is p_0·N)
Single training/validation set: binomial test
If the error probability is p_0, the probability that there are e errors or fewer in N validation trials is
$P(X \leq e) = \sum_{j=1}^{e} \binom{N}{j} p_0^j (1 - p_0)^{N-j}$
Accept if this probability is less than 1 - α
[Figure: binomial distribution for N = 100, e = 20, with the 1 - α region marked]
(A sketch of this computation appears below.)
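A minimal sketch of the binomial computation above with scipy; N and e follow the figure's values, while p_0 is an assumed illustrative value.

```python
from scipy.stats import binom

N, e, p0, alpha = 100, 20, 0.15, 0.05   # p0 = 0.15 is an assumed illustrative value

# P(X <= e) under error probability p0
# (binom.cdf includes j = 0, which the slide's sum omits; the difference is negligible)
prob_at_most_e = binom.cdf(e, N, p0)
print(prob_at_most_e)
print("accept H0" if prob_at_most_e < 1 - alpha else "reject H0")
```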
The number of errors X is approximately normal with mean Np_0 and variance Np_0(1 - p_0):
$\frac{X - Np_0}{\sqrt{Np_0(1 - p_0)}} \sim Z$
Accept H_0 if this value for X = e is less than z_{1-α}
Approximating X with a normal distribution:
X is the sum of N independent random variables from the same distribution, so X/N is approximately normal for large N with mean p_0 and variance p_0(1 - p_0)/N (central limit theorem)
$\frac{X/N - p_0}{\sqrt{p_0(1 - p_0)/N}} \sim Z$
Fail to reject H_0 (p ≤ p_0) with significance α if
$\frac{X/N - p_0}{\sqrt{p_0(1 - p_0)/N}} \in (-\infty, z_{\alpha})$   (e.g., z_{0.05} = 1.64)
Reject H_0 if it falls outside this interval
Works well for Np ≥ 5 and N(1 - p) ≥ 5
Example:
  Recall the earlier example: error_S(h) = 0.3, N = |S| = 40, X = 12, error_D(h) = p (?)
  H_0: p ≤ p_0, H_1: p > p_0
  Let p_0 = 0.2, α = 0.05
  $\frac{X/N - p_0}{\sqrt{p_0(1 - p_0)/N}} = \frac{0.3 - 0.2}{\sqrt{0.2 \cdot 0.8 / 40}} \approx 1.58 \in (-\infty, 1.64) = (-\infty, z_{0.05})$
  Fail to reject H_0
Example (cont.):
What is the 95% (α = 0.05) confidence interval around error_D(h) = p?
Let p_0 = error_S(h) = 0.3
$P\left(p_0 - z_{\alpha/2}\sqrt{\frac{p_0(1 - p_0)}{N}} < p < p_0 + z_{\alpha/2}\sqrt{\frac{p_0(1 - p_0)}{N}}\right) = 1 - \alpha$
$P\left(0.3 - 1.96\sqrt{\frac{0.3 \cdot 0.7}{40}} < p < 0.3 + 1.96\sqrt{\frac{0.3 \cdot 0.7}{40}}\right) = 0.95$
$P(0.3 - 0.142 < p < 0.3 + 0.142) = P(0.158 < p < 0.442) = 0.95$
(A sketch of this test and interval in code follows.)
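A minimal sketch combining the one-sided test and the confidence interval above; the numbers match the running example (N = 40, 12 errors).

```python
from math import sqrt
from scipy.stats import norm

N, errors = 40, 12
p_hat = errors / N                      # sample error, 0.3

# One-sided test of H0: p <= p0 against H1: p > p0
p0, alpha = 0.2, 0.05
z_stat = (p_hat - p0) / sqrt(p0 * (1 - p0) / N)
print(z_stat, norm.ppf(1 - alpha))      # ~1.58 < 1.64, so fail to reject H0

# Two-sided 95% confidence interval around the sample error
half_width = norm.ppf(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / N)
print(p_hat - half_width, p_hat + half_width)   # ~ (0.158, 0.442)
```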
Evaluate the learner on K training/testing sets, yielding errors p_i, 1 ≤ i ≤ K
$m = \frac{\sum_{i=1}^{K} p_i}{K}, \qquad S^2 = \frac{\sum_{i=1}^{K}(p_i - m)^2}{K - 1}$
$\frac{\sqrt{K}(m - p_0)}{S} \sim t_{K-1}$
Reject H_0 with significance α if this value is greater than t_{α,K-1}
Typically K is 10 or 30 (t_{0.05,9} = 1.83, t_{0.05,29} = 1.70)
K-fold cross-validated paired t test
Paired test: both learners get the same train/test sets
Use K-fold CV to get K training/testing folds
p_i^1, p_i^2: errors of learners 1 and 2 on fold i
p_i = p_i^1 - p_i^2: paired difference on fold i
The null hypothesis is that p_i has mean 0: H_0: μ = 0 vs. H_1: μ ≠ 0
$m = \frac{\sum_{i=1}^{K} p_i}{K}, \qquad s^2 = \frac{\sum_{i=1}^{K}(p_i - m)^2}{K - 1}$
Under H_0: $\frac{\sqrt{K}(m - 0)}{s} = \frac{\sqrt{K}\, m}{s} \sim t_{K-1}$
Accept (fail to reject) H_0 if $\frac{\sqrt{K}\, m}{s} \in (-t_{\alpha/2, K-1}, t_{\alpha/2, K-1})$
(A sketch of this test in code appears below.)
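A minimal sketch of the paired t test above; the per-fold error lists are made-up example numbers, not results from the slides.

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

# Made-up per-fold errors for two learners over the same K = 10 folds
errs1 = [0.21, 0.25, 0.19, 0.22, 0.24, 0.20, 0.23, 0.26, 0.21, 0.22]
errs2 = [0.18, 0.22, 0.17, 0.21, 0.20, 0.19, 0.20, 0.24, 0.19, 0.20]

diffs = [a - b for a, b in zip(errs1, errs2)]   # paired differences p_i
K = len(diffs)
m, s = mean(diffs), stdev(diffs)
stat = sqrt(K) * m / s                          # ~ t with K - 1 df under H0 (mean difference 0)
alpha = 0.05
critical = t.ppf(1 - alpha / 2, K - 1)          # two-sided critical value
print(stat, critical)
print("difference is significant" if abs(stat) > critical else "no significant difference")
```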
Tester:      weka.experiment.PairedCorrectedTTester
Analysing:   Percent_correct
Datasets:    8
Resultsets:  2
Confidence:  0.05 (two tailed)
Sorted by:   -
Date:        10/6/10 12:00 AM

Dataset                   (1) rules.OneR | (2) bayes
--------------------------------------------------
loan               (100)        39.50    |   84.50 v
contact-lenses     (100)        72.17    |   76.17
iris               (100)        93.53    |   95.53
labor-neg-data     (100)        72.77    |   93.57 v
segment            (100)        63.33    |   81.12 v
soybean            (100)        39.75    |   92.94 v
weather            (100)        36.00    |   67.50
weather.symbolic   (100)        38.00    |   57.50
--------------------------------------------------
                  (v/ /*)                |  (4/4/0)
Be careful when comparing more than two learners
Each comparison has probability α of yielding an incorrect conclusion (incorrectly rejecting the null hypothesis, i.e., incorrectly concluding learner A is better than learner B)
The probability of at least one incorrect conclusion among c comparisons is 1 - (1 - α)^c
One approach: analysis of variance (ANOVA)
(A small illustration follows below.)
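A small illustration of the family-wise error rate above, plus the common Bonferroni adjustment (dividing α by the number of comparisons); the Bonferroni correction is mentioned here as a standard remedy, not something from the slides.

```python
alpha, c = 0.05, 10                      # e.g. 10 pairwise comparisons

family_wise = 1 - (1 - alpha) ** c       # P(at least one incorrect conclusion)
print(family_wise)                       # ~0.40 for alpha = 0.05, c = 10

bonferroni_alpha = alpha / c             # per-comparison level keeping family-wise error <= alpha
print(bonferroni_alpha)                  # 0.005
```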
Evaluating a learning algorithm
  Error of learned hypotheses (and other measures)
  K-fold and 5x2 cross validation
  ROC curve and AUC
  Confidence in the error estimate
Comparing two learners