Hypothesis Evaluation

Size: px

Start display at page:

Download "Hypothesis Evaluation"

Anis Lindsay Young
5 years ago
Views:

1 Hypothesis Evaluation Machine Learning Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

2 Table of contents 1 Introduction 2 Some performance measures of classifiers 3 Evaluating the performance of a classifier 4 Estimating true error 5 Confidence intervals 6 Paired t Test Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

3 Outline 1 Introduction 2 Some performance measures of classifiers 3 Evaluating the performance of a classifier 4 Estimating true error 5 Confidence intervals 6 Paired t Test Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

4 Introduction 1 Machine Learning algorithms induce hypothesis that depend on the training set, and there is a need for statistical testing to Asses expected performance of a hypothesis and Compare expected Performances of two hypothesis to compare them. 2 Classifier evaluation criteria Accuracy (or Error) The ability of a hypothesis to correctly predict the label of new/previously unseen data. Speed The computational costs involved in generating and using a hypothesis. Robustness The ability of a hypothesis to make correct predictions given noisy data or data with missing values. Scalability The ability to construct a hypothesis efficiently given large amounts of data. Interpretability The level of understanding and insight that is provided by the hypothesis. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

5 Evaluating the accuracy/error of classifiers 1 Given the observed accuracy of a hypothesis over a limited sample data, how well does this estimate its accuracy over additional examples. 2 Given one hypothesis outperforms another over sample data, how probable is that this hypothesis is more accurate in general. 3 When data is limited what is the best way to use this data to learn a hypothesis and estimate its accuracy. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

6 Measuring performance of a classifier 1 Measuring performance of a hypothesis is partitioning data to Training set Validation set different from training set. Test set different from training and validation sets. 2 Problems with this approach Training and validation sets may be small and may contain exceptional instances such as noise, which may mislead us. The learning algorithm may depend on other random factors affecting the accuracy such as initial weights of a neural network trained with BP. We must train/test several times and average the results. 3 Important points Performance of a hypothesis estimated using training set conditioned on the used data set and cant used to compare algorithms in domain independent ways. Validation set is used for model selection, comparing two algorithms, and decide to stop learning. In order to report the expected performance, we should use a separate test set unused during learning. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

7 Error of a classifier Definition (Sample error) The sample error (denoted E E (h)) of hypothesis h with respect to target concept c and data sample S of size N is. Definition (True error) E E (h) = 1 I[c(x) h(x)] N x S The true error (denoted E(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to distribution D. E(h) = P x S [c(x) h(x)] Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

8 Notions of errors 1 True error is Instance space X c h 2 Our concern How we can estimate the true error (E(h)) of hypothesis using its sample error (E E (h))? Can we bound true error of h given sample error of h? Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

9 Outline 1 Introduction 2 Some performance measures of classifiers 3 Evaluating the performance of a classifier 4 Estimating true error 5 Confidence intervals 6 Paired t Test Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

10 Some performance measures of classifiers 1 Error rate The error rate is the fraction of incorrect predictions for the classifier over the testing set, defined as E E (h) = 1 I[c(x) h(x)] N x S Error rate is an estimate of the probability of misclassification. 2 Accuracy The accuracy of a classifier is the fraction of correct predictions over the testing set: Accuracy(h) = 1 I[c(x) = h(x)] = 1 E E (h) N x S Accuracy gives an estimate of the probability of a correct prediction. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

11 Some performance measures of classifiers 1 What you can say about the accuracy of 90% or the error of 10%? 2 For example, if 3 4% of examples are from negative class, clearly accuracy of 90% is not acceptable. 3 Confusion matrix Actual label (+) ( ) Predicted label (+) ( ) TP FN FP TN 4 Given C classes, a confusion matrix is a table of C C. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

12 Some performance measures of classifiers 1 Precision (Positive predictive value) Precision is proportion of predicted positives which are actual positive and defined as Precision(h) = TP TP + FP 2 Recall (Sensitivity) Recall is proportion of actual positives which are predicted positive and defined as Recall(h) = TP TP + FN 3 Specificity Specificity is proportion of actual negative which are predicted negative and defined as Specificity(h) = TN TN + FP Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

13 Some performance measures of classifiers 1 Balanced classification rate (BCR) Balanced classification rate provides an average of recall (sensitivity) and specificity, it gives a more precise picture of classifier effectiveness. Balanced classification rate defined as BCR(h) = 1 [Specificity(h) + Recall(h)] 2 2 F-measure F-measure is harmonic mean between precision and recall and defined as F Measure(h) = 2 Persicion(h) Recall(h) Persicion(h) + Recall(h) Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

14 Outline 1 Introduction 2 Some performance measures of classifiers 3 Evaluating the performance of a classifier 4 Estimating true error 5 Confidence intervals 6 Paired t Test Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

15 Evaluating the performance of a classifier 1 Hold-out method 2 Random Sub-sampling 3 Cross validation method 4 Leave-one-out method Cross validation method 6 Bootstrapping method Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

16 4 Hold-out method 1 Hold-out Hold-out partitions the given data into two independent sets : training and test sets. The holdout method Split dataset into two groups Training set: used to train the classifier Test set: used to estimate the error rate of the trained classifier Total number of examples Training Set Test Set A typical application the holdout method is determining a stopping point for the back propagation error Typically two-thirds of the data are allocated to the training set and the remaining one-third is allocated to the test set. The training set is used to drive the model. The test set is used to estimate the accuracy of the model. MSE Test set error Stopping point Training set error Epochs Intelligent Sensor Systems Ricardo Gutierrez-Osuna Wright State University Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

17 6 Random sub-sampling method 1 Random sub-sampling Random sub-sampling is a variation of the hold-out method in which hold-out is repeated k times. Random Subsampling Random Subsampling performs K data splits of the dataset Each split randomly selects a (fixed) no. examples without replacement For each data split we retrain the classifier from scratch with the training examples and estimate E i with the test examples Total number of examples Test example Experiment 1 Experiment 2 Experiment 3 The true error estimate is obtained as the average of the separate estimates E i This estimate is significantly better than the holdout estimate i K 1 The estimated error rate is the average of the error rates for classifiers derived for the independently and randomly generated test partitions. Random subsampling can produce better error estimates than a single train-and-test partition. = 1 E i = E K Intelligent Sensor Systems Ricardo Gutierrez-Osuna Wright State University Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

18 7 Cross validation method 1 K-fold cross validation The initial data are randomly partitioned into K mutually exclusive subsets or folds, S 1, S 2,..., S K, each of approximately equal size.. K-Fold Cross-validation Create a K-fold partition of the the dataset For each of K experiments, use K-1 folds for training and the remaining one for testing Total number of examples Experiment 1 Experiment 2 Experiment 3 Test examples Experiment 4 K-Fold Cross validation is similar to Random Subsampling The advantage of K-Fold Cross validation is that all the examples in the i K 1 Training and testing is performed K times. In iteration k, partition S k is used for test and the remaining partitions collectively used for training. The accuracy is the percentage of the total number of correctly classified test examples. The advantage of K-fold cross validation is that all the examples in the Hamid Beigy (Sharifdataset University of are Technology) eventually used Hypothesis for Evaluation both training and testing. Fall / 31 dataset are eventually used for both training and testing As before, the true error is estimated as the average error rate E i = 1 = E K Intelligent Sensor Systems Ricardo Gutierrez-Osuna Wright State University

19 8 Leave-one-out method i 1 1 Leave-one-out Leave-one-out is a special case of K-fold cross validation where K is set to number of examples in dataset. Leave-one-out Cross Validation Leave-one-out is the degenerate case of K-Fold Cross Validation, where K is chosen as the total number of examples For a dataset with N examples, perform N experiments For each experiment use N-1 examples for training and the remaining example for testing Total number of examples Experiment 1 Experiment 2 Experiment 3 Single test example For a dataset with N examples, perform N experiments. For each experiment use N 1 examples for training and the remaining example for testing Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31 Experiment N As usual, the true error is estimated as the average error rate on test examples N E i = = E 1 N Intelligent Sensor Systems Ricardo Gutierrez-Osuna Wright State University

5 2 Cross validation method p1 s 2 1 p2 s 2 2 p3 s 2 3 1 5 2 Cross validation method repeates five times 2-fold cross validation method. 2 Training and testing is performed 10 times.

20 5 2 Cross validation method p1 s 2 1 p2 s 2 2 p3 s Cross validation method repeates five times 2-fold cross validation method. 2 Training and testing is performed 10 times. 3 The estimated error rate is the average of the error rates for classifiers derived for the independently and randomly generated test partitions. 5x2cv F test p (1,1) p (1,1) p (1) A B 1 p4 s 2 4 p (1,2) p (1,2) p (2) A B 1 (2,1) (2,1) (1) p (2,1) A p (2,1) B p (1) 2 p5 s 2 5 (2,2) (2,2) (2) p (2,2) A p (2,2) B p (2) 2 (3,1) (3,1) (1) p (3,1) A p (3,1) B p (1) 3 (3,2) (3,2) (2) p (3,2) A p (3,2) B p (2) 3 p (1) 4 p (4,1) B p (4,1) A (4,2) (4,2) (2) p (4,2) A p (4,2) B p (2) 4 p (5,1) p (5,1) p (1) A B 5 p (5,2) p (5,2) p (2) A B 5 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

21 Bootstrapping method 1 Bootstrapping The bootstrap uses sampling with replacement to form the training set. Sample a dataset of N instances N times with replacement to form a new dataset of N instances. Use this data as the training set. An instance may occur more than once in the training set. Use the instances from the original dataset that dont occur in the new training set for testing. This method traines classifieron just on 63% of the instances. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

22 Outline 1 Introduction 2 Some performance measures of classifiers 3 Evaluating the performance of a classifier 4 Estimating true error 5 Confidence intervals 6 Paired t Test Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

23 Estimating true error 1 How well does E E (h) estimate E(h)? Bias in the estimate If training / test set is small, then the accuracy of the resulting hypothesis is a poor estimator of its accuracy over future examples. bias = E[E E (h)] E(h). For unbiased estimate, h and S must be chosen independently. Variance in the estimate Even with unbiased S, E E (h) may still vary from E(h). The smaller test set results in a greater expected variance. 2 Hypothesis h misclassifies 12 of the 40 examples in S. What is E(h)? 3 We use the following experiment Choose sample S of size N according to distribution D. Measure E E (h) E E (h) is a random variable (i.e., result of an experiment). E E (h) is an unbiased estimator for E(h). 4 Given observed E E (h), what can we conclude about E(h)? Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

24 Distribution of error 1 E E (h) is a random variable with binomial distribution, for the experiment with different randomly drawn S of size N, the probability of observing r misclassified examples is N! P(r) = r!(n r)! E(h)r [1 E(h)] N r 2 For example for N = 40 and E(h) = p = 0.2, Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

25 Distribution of error 1 For binomial distribution, we have E(r) = Np Var(r) = Np(1 p) We have shown that the random variable E E (h) obeys a Binomial distribution. 2 Let p be the probability that the result of trial is a success. 3 The E E (h) and E(h) are E E (h) = r N E(h) = p where N is the number of instances in the sample S, r is the number of instances from S misclassified by h, and p is the probability of misclassifying a single instance drawn from D. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

26 Distribution of error 1 It can be shown that E E (h) is unbiased estimator for E(h). 2 Since r is Binomially distributed, its variance is Np(1 p). 3 Unfortunately p is unknown, but we can substitute our estimate r N for p. 4 In general, given r errors in a sample of N independently drawn test examples, the standard deviation for E E (h) is given by Var [E E (h)] = p(1 p) N EE (h)(1 E E (h)) N Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

27 Outline 1 Introduction 2 Some performance measures of classifiers 3 Evaluating the performance of a classifier 4 Estimating true error 5 Confidence intervals 6 Paired t Test Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

28 Confidence intervals 1 One common way to describe the uncertainty associated with an estimate is to give an interval within which the true value is expected to fall, along with the probability with which it is expected to fall into this interval. 2 How can we derive confidence intervals for E E (h)? 3 For a given value of M, how can we find the size of the interval that contains M% of the probability mass? 4 Unfortunately, for the Binomial distribution this calculation can be quite tedious. 5 Fortunately, however, an easily calculated and very good approximation can be found in most cases, based on the fact that for sufficiently large sample sizes the Binomial distribution can be closely approximated by the Normal distribution. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

29 Confidence intervals 1 In Normal Distribution N(µ, σ 2 ), M% of area (probability) lies in µ ± z M σ, where M% 50% 68% 80% 90% 95% 98% 99% z M y σ x Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

30 Comparing two hypotheses Test h 1 on sample S 1 and test h 2 on S 2. 1 Pick parameter to estimate d = E(h 1 ) E(h 2 ). 2 Choose an estimator ˆd = E E (h 1 ) E E (h 2 ). 3 Determine probability distribution that governs estimator: 4 E E (h 1 ) and E E (h 2 ) can be approximated by Normal Distribution. 5 Difference of two Normal distributions is also a Normal distribution, ˆd will also approximated by a Normal distribution with mean d and variance of this distribution is equal to sum of variances of E E (h 1 ) and E E (h 2 ). Hence, we have E E (h 1 )(1 E E (h 1 )) Var[ ˆd] = + E E (h 2 )(1 E E (h 2 )) N 1 N 2 6 Find interval (L, U) such that M% of probability mass falls in interval E E (h 1 )(1 E E (h 1 )) ˆd ± z M + E E (h 2 )(1 E E (h 2 )) N 1 N 2 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

31 Comparing two algorithms For comparing two learning algorithms L A and L B, we would like to estimate E S D [E(L A (S)) E(L B (S))] where L(S) is the hypothesis output by learner L using training set S. This shows the expected difference in true error between hypotheses output by learners L A and L B, when trained using randomly selected training sets S drawn according to distribution D. But, given limited data S 0, what is a good estimator? 1 Could partition S 0 into training set S tr 0 ˆd = E S ts 0 E and test set S ts 0 (L A(h 1 )) E Sts 0 E (L B(h 2 )). where h1 and h 2 are trained using training set S0 tr error using test set S ts 0. and E S ts 0 E 2 Even better, repeat this many times and average the results., and measure is emprical Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

32 Outline 1 Introduction 2 Some performance measures of classifiers 3 Evaluating the performance of a classifier 4 Estimating true error 5 Confidence intervals 6 Paired t Test Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

33 Paired t Test consider the following estimation problem 1 We are given the observed values of a set of independent, identically distributed random variables Y 1, Y 2,..., Y K. 2 We wish to estimate the mean p of the probability distribution governing these Y i. 3 The estimator we will use is the sample mean Ȳ = 1 K K k=1 Y k ithe task is to estimate the sample mean of a collection of independent, identically and Normally distributed random variables. The approximate M% confidence interval for estimating Ȳ is given by where SȲ = Ȳ ± +t M,K 1 SȲ 1 K (Y i K(K 1 Ȳ )2 k=1 Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

34 Paired t Test Values alues of t M,K 1 for two-sided confidence intervals. M 90% 95% 98% 99% K 1 = K 1 = K 1 = As K, t M,K 1 approaches z M. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

35 ROC Curves 1 ROC puts false positive rate (FPR = FP/NEG) on x axis. 2 ROC puts true positive rate (TPR = TP/POS) on y axis. 3 Each classifier represented by a point in ROC space corresponding to its (FPR, TPR) pair TP rate FP rate Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

36 Practical aspects 1 A note on parameter tuning Some learning schemes operate in two stages: Stage 1: builds the basic structure Stage 2: optimizes parameter settings It is important that the test data is not used in any way to create the classifier The test data cant be used for parameter tuning! Proper procedure uses three sets: training data, validation data, and test data Validation data is used to optimize parameters 2 No Free Lunch Theorem For any ML algorithm there exist data sets on which it performs well and there exist data sets on which it performs badly! We hope that the latter sets do not occur too often in real life. Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

37 References 1 C. Schaffer, Selecting a Classification Method by cross-validation, Machine Learning, vol. 13, pp , E. Alpaydin, Combined 5x2 CVF Test for Supervised Classification Learning Algrrithm, Neural Computation, Vol. 11, , C. Schaffer, Multiple Comparisons in Induction Algorithms, Machine Learning, vol. 38, pp , Tom Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) Hamid Beigy (Sharif University of Technology) Hypothesis Evaluation Fall / 31

Performance Evaluation and Hypothesis Testing

Performance Evaluation and Hypothesis Testing 1 Motivation Evaluating the performance of learning systems is important because: Learning systems are usually designed to predict the class of future unlabeled