CptS 570 Machine Learning, School of EECS, Washington State University
IEEE Expert, October 1996
Given a sample S from all possible examples D, learner L learns hypothesis h based on S. Sample error: error_S(h). True error: error_D(h). Example: hypothesis h misclassifies 12 of the 40 examples in S, so error_S(h) = 0.3. What is error_D(h)?
Learner A learns hypothesis h_A on sample S; learner B learns hypothesis h_B on sample S. We observe error_S(h_A) < error_S(h_B). Is error_D(h_A) < error_D(h_B)? Is learner A better than learner B?
How can we estimate the true error of a classifier? How can we determine if one learner is better than another? Using sample error is too optimistic. Using error on a separate test set is better, but might still be misleading. Repeating the above for multiple iterations, each with different training/testing sets, yields a better estimate of the true error.
David Wolpert, 1995: for any learning algorithm there are datasets on which it does well, and datasets on which it does poorly. Performance estimates are based on specific datasets, not an estimate of the learner on all datasets. There is no one best learning algorithm.
Multiple iterations of learning on a training set and testing on a separate validation set are only for evaluation and parameter tuning; final learning should be done on all available data. If the validation set is used to choose or tune a learning method, then it cannot also be used to compare performance against another learning algorithm. We need yet another test set that is unseen during tuning/learning.
Other criteria for comparing learners: error costs (false positives vs. false negatives); training time and space complexity; testing time and space complexity; interpretability; ease of implementation.
Given dataset X, for each of K trials: randomly divide X into a training set (2/3) and a testing set (1/3); learn a classifier on the training set; test the classifier on the testing set (compute error). Compute the average error over the K trials. Problem: training and testing sets overlap between trials, which biases the results.
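This random-subsampling procedure can be sketched as follows; the `learn` and `test` callables are placeholders for whatever learner and error measure are in use, not anything from the slides:

```python
import random

def repeated_holdout(X, learn, test, k=10, train_frac=2/3, seed=0):
    """Average test error over k random 2/3-train / 1/3-test splits of X.

    learn(train) returns a classifier; test(clf, test_set) returns its error.
    Note: the test sets of different trials overlap, which biases the estimate.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(k):
        data = list(X)
        rng.shuffle(data)
        cut = int(len(data) * train_frac)
        errors.append(test(learn(data[:cut]), data[cut:]))
    return sum(errors) / k
```

For example, a trivial majority-class learner can be plugged in to see the loop run end to end.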
Given dataset X, partition X into K disjoint sets X_1, ..., X_K. For i = 1 to K: learn a classifier on training set X − X_i; test the classifier on testing set X_i (compute error). Compute the average error over the K trials. The testing sets no longer overlap, but the training sets still overlap.
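As a minimal Python sketch of K-fold cross-validation (again with placeholder `learn`/`test` callables):

```python
def kfold_error(X, learn, test, k=10):
    """Average error over K disjoint test folds X_1, ..., X_K.

    Fold i is used once for testing; the classifier is trained on X - X_i.
    Testing sets are disjoint, but training sets still overlap across folds.
    """
    folds = [X[i::k] for i in range(k)]  # K disjoint subsets covering X
    errors = []
    for i in range(k):
        train = [x for j in range(k) if j != i for x in folds[j]]
        errors.append(test(learn(train), folds[i]))
    return sum(errors) / k
```

(In practice X should be shuffled, or stratified as described below, before the folds are formed.)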
Stratification: the distribution of classes in the training and testing sets should be the same as in the original dataset; this is called stratified cross-validation. Leave-one-out cross-validation: K = N = |X|; used when labeled data is scarce.
Tom Dietterich, 1998: for each of 5 trials (shuffling X each time), divide X into two halves X_1 and X_2; compute error using X_1 for training and X_2 for testing; then compute error using X_2 for training and X_1 for testing. Compute the average error over all 10 results. Five trials is the best number to minimize overlap among training and testing sets.
If there is not enough data for K-fold cross-validation: generate multiple samples of size N from X by sampling with replacement. Each sample contains approximately 63% of the distinct examples in X. Compute the average error over all samples.
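The 63% figure can be checked empirically: drawing N items with replacement leaves each example out with probability (1 − 1/N)^N ≈ 1/e, so about 1 − 1/e ≈ 0.632 of the distinct examples appear. A small sketch:

```python
import random

def bootstrap_sample(X, rng):
    """Draw a sample of size N = len(X) with replacement."""
    return [rng.choice(X) for _ in X]

# Fraction of distinct examples in one bootstrap sample; for large N this
# approaches 1 - 1/e ~ 0.632, the "63%" quoted above.
rng = random.Random(0)
X = list(range(10000))
frac_distinct = len(set(bootstrap_sample(X, rng))) / len(X)
```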
Confusion matrix:

                      Predicted class
  True class    Positive              Negative              Total
  Positive      tp: true positive     fn: false negative    p
  Negative      fp: false positive    tn: true negative     n
  Total         p'                    n'                    N
  Name          Formula
  error         (fp + fn)/N
  accuracy      (tp + tn)/N
  tp-rate       tp/p
  fp-rate       fp/n
  precision     tp/p'
  recall        tp/p = tp-rate
  sensitivity   tp/p = tp-rate
  specificity   tn/n = 1 - fp-rate

F-measure: F = 2 * precision * recall / (precision + recall)
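The table translates directly into code. A sketch, using the confusion-matrix counts defined above:

```python
def metrics(tp, fn, fp, tn):
    """Measures from a binary confusion matrix.

    p and n are the actual positive/negative totals; precision divides by
    the number of *predicted* positives (tp + fp), i.e. p'.
    """
    p, n = tp + fn, fp + tn
    N = p + n
    precision = tp / (tp + fp)
    recall = tp / p                    # = tp-rate = sensitivity
    return {
        "error": (fp + fn) / N,
        "accuracy": (tp + tn) / N,
        "tp_rate": recall,
        "fp_rate": fp / n,
        "precision": precision,
        "recall": recall,
        "specificity": tn / n,         # = 1 - fp-rate
        "f_measure": 2 * precision * recall / (precision + recall),
    }
```

For instance, the labor-data confusion matrix shown later (tp = 9, fn = 11, fp = 3, tn = 34, taking "bad" as the positive class) gives accuracy 43/57 ≈ 0.754 and precision 9/12 = 0.75.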
=== Run information ===
Scheme: weka.classifiers.rules.OneR -B 6
Relation: labor-neg-data
Instances: 57
Attributes: 17
  duration, wage-increase-first-year, wage-increase-second-year,
  wage-increase-third-year, cost-of-living-adjustment, working-hours,
  pension, standby-pay, shift-differential, education-allowance,
  statutory-holidays, vacation, longterm-disability-assistance,
  contribution-to-dental-plan, bereavement-assistance,
  contribution-to-health-plan, class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
wage-increase-first-year:
  < 2.9   -> bad
  >= 2.9  -> good
  ?       -> good
(48/57 instances correct)

Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      43    75.4386 %
Incorrectly Classified Instances    14    24.5614 %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error                           %
Root relative squared error                       %
Coverage of cases (0.95 level)                    %
Mean rel. region size (0.95 level)          50    %
Total Number of Instances           57
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.45     0.081    0.75       0.45    0.563                bad
               0.919    0.55     0.756      0.919   0.829                good
Weighted Avg.  0.754    0.386    0.754      0.754   0.736

=== Confusion Matrix ===
  a  b   <-- classified as
  9 11 |  a = bad
  3 34 |  b = good
Most comparisons of machine learning algorithms use classification error. Problems with this approach: there may be different costs associated with false positive and false negative errors, and the training data may not reflect the true class distribution.
Receiver Operating Characteristic (ROC): originated in signal detection theory, is common in medical diagnosis, and is becoming common in ML evaluations. ROC curves assess predictive behavior independent of error costs or class distributions. Area Under the ROC Curve (AUC): a single measure of learning algorithm performance, independent of error costs and class distributions.
[Figure: ROC curves (true positive rate vs. false positive rate, each from 0 to 1.0) for learners L1, L2, and L3, plus the diagonal of a random classifier.]
Learner L1 dominates L2 if L1's ROC curve is always above L2's curve. If L1 dominates L2, then L1 is better than L2 for all possible error costs and class distributions. If neither dominates (as with L2 and L3), then different classifiers are better under different conditions.
Assume the classifier outputs P(C|x) instead of just C (the predicted class for instance x). Let θ be a threshold such that if P(C|x) > θ, then x is classified as C, else not C. Compute the fp-rate and tp-rate for different values of θ from 0 to 1. Plot each (fp-rate, tp-rate) point and interpolate (or take the convex hull). If multiple points share the same fp-rate, average their tp-rates (as in K-fold cross-validation).
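The threshold sweep can be sketched in a few lines (a minimal version with no interpolation or convex hull; names are illustrative):

```python
def roc_points(scores, labels):
    """(fp-rate, tp-rate) pairs from sweeping a threshold theta over scores.

    scores[i] is the classifier's P(C|x) for example i; labels[i] is 1 if the
    example truly belongs to C, else 0. Each distinct score is used as a
    threshold, from highest (nothing classified C) to lowest (everything C).
    """
    p = sum(labels)                 # actual positives
    n = len(labels) - p             # actual negatives
    points = []
    thresholds = sorted(set(scores), reverse=True) + [-1.0]
    for theta in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 0)
        points.append((fp / n, tp / p))
    return points
```

The first point is always (0, 0) and the last (1, 1); plotting the points traces the ROC curve.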
What if the classifier does not provide P(C|x), but just C, e.g., a decision tree or rule? Generally, even these discrete classifiers maintain statistics useful for ranking: decision tree leaf nodes record the proportion of examples of each class, and rules record the number of examples they cover. These statistics can be compared against a varying threshold θ.
[Figure: ROC curves for J48 vs. NN on the Labor dataset (true positive rate vs. false positive rate).]
We have seen several ways to estimate learning performance: train/test split, cross-validation, ROC, AUC. But how good are these at estimating the true performance? E.g., is error_S(h) ≈ error_D(h)?
Estimate the mean μ of a normal distribution N(μ, σ²). Given a sample X = {x_t} of size N, estimate m = Σ_t x_t / N, where m ~ N(μ, σ²/N). Define a statistic Z with a unit normal distribution N(0, 1):
√N (m − μ) / σ ~ Z
95% of Z lies in (−1.96, 1.96); 99% of Z lies in (−2.58, 2.58). P(−1.96 < Z < 1.96) = 0.95 gives a two-sided confidence interval.
P( −z_(α/2) < √N (m − μ)/σ < z_(α/2) ) = 1 − α
P( m − z_(α/2) σ/√N < μ < m + z_(α/2) σ/√N ) = 1 − α
One-sided:
P( √N (m − μ)/σ < z_α ) = 1 − α
P( m − z_α σ/√N < μ ) = 1 − α
The previous analysis requires that we know σ². We can use the sample variance S² instead:
S² = Σ_t (x_t − m)² / (N − 1)
When x_t ~ N(μ, σ²), (N − 1)S²/σ² is chi-squared with N − 1 degrees of freedom. Since m and S² are independent, √N (m − μ)/S is t-distributed with N − 1 degrees of freedom.
The t distribution is similar to the normal, but with larger spread (longer tails), corresponding to the additional uncertainty from using the sample variance.
When σ² is not known:
√N (m − μ) / S ~ t_(N−1)
P( m − t_(α/2, N−1) S/√N < μ < m + t_(α/2, N−1) S/√N ) = 1 − α
E.g., t_(0.025,9) = 2.685, t_(0.025,29) = 2.364 (2-tailed).
Example: Σ_t x_t = 30, m = 3.0, S² = 0.022, α = 0.05, df = N − 1 = 9
P{ m − t_(0.025,9) S/√N < μ < m + t_(0.025,9) S/√N } = 0.95
P{ 2.873 < μ < 3.127 } = 0.95
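This interval can be checked numerically. In the sketch below, S² = 0.022, N = 10, the slide's 3.127 upper bound, and its quoted t value t_(0.025,9) = 2.685 come from the slide; m = 3.0 is an inferred value:

```python
import math

# S^2, N, and t_{0.025,9} are taken from the slide; m = 3.0 is inferred.
m, S2, N = 3.0, 0.022, 10
t_val = 2.685                       # two-tailed 95% t value quoted above
half = t_val * math.sqrt(S2 / N)    # t * S / sqrt(N)
lo, hi = m - half, m + half         # upper bound close to the slide's 3.127
```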
We want to claim a hypothesis H_1, e.g., H_1: error_D(h) < 0.10. Define the opposite of H_1 to be the null hypothesis H_0, e.g., H_0: error_D(h) ≥ 0.10. Perform an experiment collecting data about error_D(h). With what probability can we reject H_0?
Example: sample X = {x_t} of size N from N(μ, σ²); estimate the mean m = Σ_t x_t / N. We want to test whether μ equals some constant μ_0. Null hypothesis H_0: μ = μ_0; alternative hypothesis H_1: μ ≠ μ_0. Reject H_0 if m is too far from μ_0.
Example (cont.): we fail to reject H_0 with level of significance α if μ_0 lies in the (1 − α) confidence interval, i.e., if
√N (m − μ_0)/σ ∈ ( −z_(α/2), z_(α/2) )
We reject H_0 if μ_0 falls outside this interval on either side (two-sided test).
Example (cont.): one-sided test, H_0: μ ≤ μ_0 vs. H_1: μ > μ_0. Fail to reject H_0 with level of significance α if
√N (m − μ_0)/σ ∈ ( −∞, z_α )
Reject H_0 if outside this interval.
Example (cont.): if the variance σ² is unknown, use the sample variance S². The statistic is then described by the Student t distribution:
√N (m − μ_0)/S ~ t_(N−1)
Fail to reject H_0 with level of significance α if this value lies in ( −∞, t_(α, N−1) ); reject H_0 if outside this interval.
Example (cont.): H_0: μ ≤ μ_0 vs. H_1: μ > μ_0 (one-sided)
μ_0 = 2.9, Σ_t x_t = 30, m = 3.0, S² = 0.022, S = 0.148, α = 0.05, df = N − 1 = 9
√N (m − μ_0)/S = √10 (3.0 − 2.9)/0.148 = 2.13 ∉ ( −∞, t_(0.05,9) ) = ( −∞, 1.833 )
Reject H_0. Note that t_(0.025,9) = 2.685 > 2.13, so a two-sided test would fail to reject.
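A quick numeric check of the test statistic (μ_0, S², N, and t_(0.05,9) as on the slide; m = 3.0 is an inferred value):

```python
import math

# mu_0, S^2, N, and t_{0.05,9} come from the slide; m = 3.0 is inferred.
mu0, m, S2, N = 2.9, 3.0, 0.022, 10
t_stat = math.sqrt(N) * (m - mu0) / math.sqrt(S2)
t_crit = 1.833                      # t_{0.05,9}, one-sided
reject = t_stat > t_crit            # statistic ~2.13 exceeds 1.833: reject H0
```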
Learn a classifier on the training set and test it on a test set V of size N. Assume the classifier makes an error with probability p. Let X = the number of errors made by the classifier on V; X is described by a binomial distribution:
P{X = j} = (N choose j) p^j (1 − p)^(N−j)
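The binomial probabilities are straightforward to compute with the standard library; a small sketch:

```python
from math import comb

def binom_pmf(j, N, p):
    """P{X = j} for X ~ Binomial(N, p)."""
    return comb(N, j) * p**j * (1 - p)**(N - j)

def binom_tail(e, N, p):
    """P{X >= e}: probability of at least e errors out of N."""
    return sum(binom_pmf(j, N, p) for j in range(e, N + 1))
```

For the later example (X = 12 errors on N = 40 with p_0 = 0.2), the exact tail probability P{X ≥ 12} is roughly 0.09, above α = 0.05, which matches the fail-to-reject conclusion of the normal approximation.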
Test hypothesis H_0: p ≤ p_0 vs. H_1: p > p_0. Reject H_0 with significance α if
P{X ≥ e} = Σ_(j=e..N) (N choose j) p_0^j (1 − p_0)^(N−j) < α
where e is the number of errors observed on V.
Approximating X with a normal distribution: X is the sum of N independent random variables from the same distribution, so X/N is approximately normal for large N, with mean p_0 and variance p_0(1 − p_0)/N (central limit theorem):
(X/N − p_0) / √( p_0 (1 − p_0) / N ) ~ Z
Fail to reject H_0 (p ≤ p_0) with significance α if this statistic lies in ( −∞, z_α ) (e.g., z_0.05 = 1.64); reject H_0 if outside. Works well for Np ≥ 5 and N(1 − p) ≥ 5.
Example: recall the earlier example with error_S(h) = 0.3, N = |S| = 40, X = 12; error_D(h) = p (?). H_0: p ≤ p_0, H_1: p > p_0. Let p_0 = 0.2, α = 0.05:
(X/N − p_0) / √( p_0 (1 − p_0) / N ) = (0.3 − 0.2) / √( 0.2 · 0.8 / 40 ) = 1.58 ∈ ( −∞, z_0.05 ) = ( −∞, 1.64 )
Fail to reject H_0.
Example (cont.): what is the 95% (α = 0.05) confidence interval around error_D(h) = p? Let p_0 = error_S(h) = 0.3:
P{ p_0 − z_(α/2) √( p_0 (1 − p_0) / N ) < p < p_0 + z_(α/2) √( p_0 (1 − p_0) / N ) } = 1 − α
P{ 0.3 − 1.96 √( 0.3 (0.7) / 40 ) < p < 0.3 + 1.96 √( 0.3 (0.7) / 40 ) } = 0.95
P{ 0.158 < p < 0.442 } = 0.95
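Both the test statistic and the interval for this example reduce to a few lines of arithmetic:

```python
import math

N, X = 40, 12                     # 12 errors on 40 test examples
p_hat = X / N                     # sample error 0.3

# One-sided test of H0: p <= 0.2 (normal approximation, z_0.05 = 1.64)
p0 = 0.2
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / N)
# z ~ 1.58 < 1.64, so H0 is not rejected

# 95% confidence interval around the sample error (z_0.025 = 1.96)
half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / N)
lo, hi = p_hat - half, p_hat + half   # ~ (0.158, 0.442)
```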
Evaluate the learner on K training/testing sets, yielding errors p_i, 1 ≤ i ≤ K:
m = Σ_(i=1..K) p_i / K,   S² = Σ_(i=1..K) (p_i − m)² / (K − 1)
√K (m − p_0) / S ~ t_(K−1)
Reject H_0 with significance α if this value is greater than t_(α, K−1). Typically K is 10 or 30 (t_(0.05,9) = 1.83, t_(0.05,29) = 1.70).
K-fold cross-validated paired t test. Paired test: both learners get the same train/test sets. Use K-fold CV to get K training/testing folds. p_i1, p_i2: errors of learners 1 and 2 on fold i. p_i = p_i1 − p_i2: paired difference on fold i. The null hypothesis is that p_i has mean 0:
H_0: μ = 0 vs. H_1: μ ≠ 0
m = Σ_(i=1..K) p_i / K,   s² = Σ_(i=1..K) (p_i − m)² / (K − 1)
√K m / s ~ t_(K−1)
Accept H_0 if this value lies in ( −t_(α/2, K−1), t_(α/2, K−1) ).
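The paired statistic can be sketched directly from the formulas; the per-fold error lists in the test are hypothetical values, not results from the slides:

```python
import math

def paired_t_statistic(errors1, errors2):
    """sqrt(K) * m / s for per-fold paired differences p_i = p_i1 - p_i2.

    errors1[i] and errors2[i] are the two learners' errors on the same fold i;
    compare the result against t_{alpha/2, K-1} for a two-sided test.
    """
    K = len(errors1)
    d = [a - b for a, b in zip(errors1, errors2)]       # paired differences
    m = sum(d) / K
    s = math.sqrt(sum((x - m) ** 2 for x in d) / (K - 1))
    return math.sqrt(K) * m / s
```

Note the pairing: differencing per fold removes the fold-to-fold variation that both learners share.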
Tester: weka.experiment.PairedCorrectedTTester
Analysing: Percent_correct
Datasets: 8
Resultsets: 2
Confidence: 0.05 (two tailed)
Sorted by: -
Date: 10/6/10 12:00 AM

Dataset                   (1) rules.OneR    (2) bayes
loan             (100)                      v
contact-lenses   (100)
iris             (100)
labor-neg-data   (100)                      v
segment          (100)                      v
soybean          (100)                      v
weather          (100)
weather.symbolic (100)
                          (v/ /*)           (4/4/0)
Be careful when comparing more than two learners. Each comparison has probability α of yielding an incorrect conclusion: incorrectly rejecting the null hypothesis, i.e., incorrectly concluding learner A is better than learner B. The probability of at least one incorrect conclusion among c comparisons is 1 − (1 − α)^c. One approach to control this: analysis of variance (ANOVA).
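A quick computation shows how fast this family-wise error grows at α = 0.05:

```python
# Probability of at least one incorrect conclusion among c pairwise
# comparisons, each carried out at significance level alpha = 0.05.
alpha = 0.05
p_any = {c: 1 - (1 - alpha) ** c for c in (1, 5, 10, 20)}
# e.g. with 10 comparisons the chance of at least one false conclusion
# is already about 0.40
```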
Summary: evaluating a learning algorithm; error of learned hypotheses (and other measures); K-fold and 5x2 cross-validation; ROC curve and AUC; confidence in the error estimate; comparing two learners.
More information[Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4]
Evaluating Hypotheses [Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4] Sample error, true error Condence intervals for observed hypothesis error Estimators Binomial distribution, Normal distribution,
More informationIntroductory Econometrics. Review of statistics (Part II: Inference)
Introductory Econometrics Review of statistics (Part II: Inference) Jun Ma School of Economics Renmin University of China October 1, 2018 1/16 Null and alternative hypotheses Usually, we have two competing
More informationDecision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University
Decision Tree Learning Mitchell, Chapter 3 CptS 570 Machine Learning School of EECS Washington State University Outline Decision tree representation ID3 learning algorithm Entropy and information gain
More informationDecision Support. Dr. Johan Hagelbäck.
Decision Support Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Decision Support One of the earliest AI problems was decision support The first solution to this problem was expert systems
More informationSignificance Tests for Bizarre Measures in 2-Class Classification Tasks
R E S E A R C H R E P O R T I D I A P Significance Tests for Bizarre Measures in 2-Class Classification Tasks Mikaela Keller 1 Johnny Mariéthoz 2 Samy Bengio 3 IDIAP RR 04-34 October 4, 2004 D a l l e
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More information8.1-4 Test of Hypotheses Based on a Single Sample
8.1-4 Test of Hypotheses Based on a Single Sample Example 1 (Example 8.6, p. 312) A manufacturer of sprinkler systems used for fire protection in office buildings claims that the true average system-activation
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationPartitioning the Parameter Space. Topic 18 Composite Hypotheses
Topic 18 Composite Hypotheses Partitioning the Parameter Space 1 / 10 Outline Partitioning the Parameter Space 2 / 10 Partitioning the Parameter Space Simple hypotheses limit us to a decision between one
More informationModels, Data, Learning Problems
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Models, Data, Learning Problems Tobias Scheffer Overview Types of learning problems: Supervised Learning (Classification, Regression,
More informationLecture 7: Hypothesis Testing and ANOVA
Lecture 7: Hypothesis Testing and ANOVA Goals Overview of key elements of hypothesis testing Review of common one and two sample tests Introduction to ANOVA Hypothesis Testing The intent of hypothesis
More informationLast week: Sample, population and sampling distributions finished with estimation & confidence intervals
Past weeks: Measures of central tendency (mean, mode, median) Measures of dispersion (standard deviation, variance, range, etc). Working with the normal curve Last week: Sample, population and sampling
More informationDecision Trees. Tirgul 5
Decision Trees Tirgul 5 Using Decision Trees It could be difficult to decide which pet is right for you. We ll find a nice algorithm to help us decide what to choose without having to think about it. 2
More informationStatistical Inference
Statistical Inference Classical and Bayesian Methods Class 6 AMS-UCSC Thu 26, 2012 Winter 2012. Session 1 (Class 6) AMS-132/206 Thu 26, 2012 1 / 15 Topics Topics We will talk about... 1 Hypothesis testing
More informationhypotheses. P-value Test for a 2 Sample z-test (Large Independent Samples) n > 30 P-value Test for a 2 Sample t-test (Small Samples) n < 30 Identify α
Chapter 8 Notes Section 8-1 Independent and Dependent Samples Independent samples have no relation to each other. An example would be comparing the costs of vacationing in Florida to the cost of vacationing
More informationArticle from. Predictive Analytics and Futurism. July 2016 Issue 13
Article from Predictive Analytics and Futurism July 2016 Issue 13 Regression and Classification: A Deeper Look By Jeff Heaton Classification and regression are the two most common forms of models fitted
More informationA Posteriori Corrections to Classification Methods.
A Posteriori Corrections to Classification Methods. Włodzisław Duch and Łukasz Itert Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland; http://www.phys.uni.torun.pl/kmk
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationNull Hypothesis Significance Testing p-values, significance level, power, t-tests Spring 2017
Null Hypothesis Significance Testing p-values, significance level, power, t-tests 18.05 Spring 2017 Understand this figure f(x H 0 ) x reject H 0 don t reject H 0 reject H 0 x = test statistic f (x H 0
More informationT.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS
ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS In our work on hypothesis testing, we used the value of a sample statistic to challenge an accepted value of a population parameter. We focused only
More informationChapter IR:VIII. VIII. Evaluation. Laboratory Experiments Performance Measures Training and Testing Logging
Chapter IR:VIII VIII. Evaluation Laboratory Experiments Performance Measures Logging IR:VIII-62 Evaluation HAGEN/POTTHAST/STEIN 2018 Statistical Hypothesis Testing Claim: System 1 is better than System
More informationIntroduction to Signal Detection and Classification. Phani Chavali
Introduction to Signal Detection and Classification Phani Chavali Outline Detection Problem Performance Measures Receiver Operating Characteristics (ROC) F-Test - Test Linear Discriminant Analysis (LDA)
More informationMachine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?
Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do
More informationLECTURE 5. Introduction to Econometrics. Hypothesis testing
LECTURE 5 Introduction to Econometrics Hypothesis testing October 18, 2016 1 / 26 ON TODAY S LECTURE We are going to discuss how hypotheses about coefficients can be tested in regression models We will
More informationEmpirical Risk Minimization Algorithms
Empirical Risk Minimization Algorithms Tirgul 2 Part I November 2016 Reminder Domain set, X : the set of objects that we wish to label. Label set, Y : the set of possible labels. A prediction rule, h:
More informationBIO5312 Biostatistics Lecture 6: Statistical hypothesis testings
BIO5312 Biostatistics Lecture 6: Statistical hypothesis testings Yujin Chung October 4th, 2016 Fall 2016 Yujin Chung Lec6: Statistical hypothesis testings Fall 2016 1/30 Previous Two types of statistical
More information