CSC314 / CSC763 Introduction to Machine Learning COMSATS Institute of Information Technology Dr. Adeel Nawab
More on Evaluating Hypotheses/Learning Algorithms

Lecture Outline:
- Review of Confidence Intervals for Discrete-Valued Hypotheses
- General Approach to Deriving Confidence Intervals
- Difference in Error of Two Hypotheses
- Comparing Learning Algorithms
- k-fold cross-validation
- Disadvantages of Accuracy as a Measure
- Confusion matrices and ROC graphs

Reading: Chapter 5 of Mitchell; Chapter 5 of Witten and Frank, 2nd ed.
Reference: M. H. DeGroot, Probability and Statistics, 2nd ed., Addison-Wesley, 1986.
Review of Confidence Intervals for Discrete-Valued Hypotheses

Last lecture we examined the question of how we might evaluate a hypothesis. More precisely:
1. Given a hypothesis h and a data sample S of n instances drawn at random according to D, what is the best estimate of the accuracy of h over future instances drawn from D?
2. What is the possible error in this accuracy estimate?
Review of Confidence Intervals for Discrete-Valued Hypotheses (cont.)

In answer, we saw that, assuming:
- the n instances in S are drawn independently of one another, and independently of h, according to probability distribution D
- n ≥ 30

then:
1. the most probable value of error_D(h) is error_S(h)
2. with approximately N% probability, error_D(h) lies in the interval

    error_S(h) ± z_N √( error_S(h)(1 − error_S(h)) / n )    (1)
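As a concrete illustration, here is a minimal Python sketch of interval (1). The test results (12 errors out of n = 40) are made up, and the z_N values come from a standard normal table:

```python
import math

# Hypothetical values for illustration: h misclassifies 12 of n = 40 test examples.
n = 40
error_s = 12 / n

# z_N values for common two-sided confidence levels (standard normal table).
z = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

# Interval (1): error_S(h) +/- z_N * sqrt(error_S(h)(1 - error_S(h)) / n)
for level, z_n in z.items():
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    print(f"{level:.0%} interval: {error_s:.3f} +/- {half_width:.3f}")
```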
Review of Confidence Intervals for Discrete-Valued Hypotheses (cont.)

Values of z_N for two-sided N% confidence intervals:

    Confidence level N%:  50    68    80    90    95    98    99
    z_N:                  0.67  1.00  1.28  1.64  1.96  2.33  2.58
Review (cont.)

Equation (1) was derived by observing that error_S(h) follows a Binomial distribution with mean value error_D(h) and standard deviation approximated by

    σ_{error_S(h)} ≈ √( error_S(h)(1 − error_S(h)) / n )
Review (cont.)

The N% confidence interval for estimating the mean µ of a Normal distribution from a random variable Y with observed value y can be calculated by noting that µ falls in y ± z_N σ N% of the time.
Two-Sided and One-Sided Bounds

The confidence intervals discussed so far offer two-sided bounds, above and below. We may only be interested in a one-sided bound, e.g. we may only care about an upper bound on the error (the answer to the question: what is the probability that error_D(h) is at most U?) and not mind if the error is lower than our estimate.
Two-Sided and One-Sided Bounds (cont.)

Because the Normal distribution is symmetric, a two-sided N% confidence interval converts directly into a one-sided bound: with confidence N + (100 − N)/2 %, error_D(h) ≤ error_S(h) + z_N σ. E.g. the z value for a 90% two-sided interval (z = 1.64) gives a 95% one-sided upper bound.
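A sketch of the corresponding one-sided computation, reusing the made-up figures from the earlier example:

```python
import math

n, error_s = 40, 12 / 40
z_90_two_sided = 1.64  # z_N for a 90% two-sided interval

# The same z gives a one-sided upper bound with confidence 90 + (100 - 90)/2 = 95%.
upper = error_s + z_90_two_sided * math.sqrt(error_s * (1 - error_s) / n)
print(f"With ~95% confidence, error_D(h) <= {upper:.3f}")
```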
General Approach to Deriving Confidence Intervals: Central Limit Theorem

Consider a set of independent, identically distributed random variables Y_1, ..., Y_n, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

    Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i
General Approach to Deriving Confidence Intervals: Central Limit Theorem (cont.)

Central Limit Theorem. As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.

Significance: we know the form of the distribution of the sample mean even if we do not know the distribution of the underlying Y_i that are being observed. This is useful because whenever we pick an estimator that is the mean of some sample (e.g. error_S(h)), the distribution governing the estimator can be approximated by the Normal distribution for suitably large n (typically n ≥ 30). E.g. we use the Normal distribution to approximate the Binomial distribution that more accurately describes error_S(h).
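A small simulation (illustrative only) makes the theorem concrete: sample means of an exponential distribution, which is far from Normal, still have approximately the mean and variance the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underlying Y_i: a decidedly non-Normal (exponential) distribution
# with mean mu = 1 and variance sigma^2 = 1.
n, trials = 30, 10_000
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# The Central Limit Theorem predicts mean ~ mu and variance ~ sigma^2 / n.
print(f"mean of sample means: {sample_means.mean():.3f}  (CLT predicts 1.000)")
print(f"variance of sample means: {sample_means.var():.4f}  (CLT predicts {1.0 / n:.4f})")
```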
General Approach to Deriving Confidence Intervals (cont.)

We now have a general approach to deriving confidence intervals for many estimation problems:
1. Pick the parameter p to estimate, e.g. error_D(h)
2. Choose an estimator, e.g. error_S(h)
3. Determine the probability distribution that governs the estimator, e.g. error_S(h) is governed by the Binomial distribution, approximated by the Normal when n ≥ 30
4. Find the interval (Lower, Upper) such that N% of the probability mass falls in the interval, e.g. use a table of z_N values

Things are made easier if we pick an estimator that is the mean of some sample: then (by the Central Limit Theorem) we can ignore the probability distribution underlying the sample and approximate the distribution governing the estimator by the Normal distribution.
Example: Difference in Error of Two Hypotheses

Suppose we have two hypotheses h_1 and h_2 for a discrete-valued target function. h_1 is tested on sample S_1 and h_2 on S_2, with S_1 and S_2 independently drawn from the same distribution. We wish to estimate the difference d in true error between h_1 and h_2. Use the 4-step generic procedure to derive a confidence interval for d:
Example: Difference in Error of Two Hypotheses (cont.)

1. Parameter to estimate: d ≡ error_D(h_1) − error_D(h_2)
2. Estimator: d̂ ≡ error_S1(h_1) − error_S2(h_2)
3. d̂ is the difference of two approximately Normal random variables, so it is itself approximately Normally distributed; since the variances of independent variables add, its standard deviation is approximated by

    σ_d̂ ≈ √( error_S1(h_1)(1 − error_S1(h_1))/n_1 + error_S2(h_2)(1 − error_S2(h_2))/n_2 )

4. The approximate N% confidence interval for d is then d̂ ± z_N σ_d̂
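A minimal sketch of this computation in Python, with made-up error counts for h_1 and h_2:

```python
import math

# Hypothetical test results: h1 errs on 30 of n1 = 100 examples from S1,
# h2 on 20 of n2 = 100 examples from S2.
n1, n2 = 100, 100
e1, e2 = 30 / n1, 20 / n2

d_hat = e1 - e2  # estimator for d = error_D(h1) - error_D(h2)

# Variances of independent estimators add.
sigma_d = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)

z_95 = 1.96
print(f"d_hat = {d_hat:.3f}, 95% interval: {d_hat:.3f} +/- {z_95 * sigma_d:.3f}")
```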
Comparing Learning Algorithms

Suppose we want to compare two learning algorithms rather than two specific hypotheses. There is not complete agreement in the machine learning community about the best way to do this. One way is to determine whether learning algorithm L_A is better on average for learning a target function f than learning algorithm L_B.
Comparing Learning Algorithms (cont.)

By "better on average" here we mean relative performance across all training sets of size n drawn from instance distribution D. I.e. we want to estimate:

    E_{S⊂D} [ error_D(L_A(S)) − error_D(L_B(S)) ]

where L(S) is the hypothesis output by learner L using training set S, i.e. the expected difference in true error between the hypotheses output by learners L_A and L_B, when trained using randomly selected training sets S drawn according to distribution D.
Comparing Learning Algorithms (cont.)

But, given limited data D_0, what is a good estimator?
- We could partition D_0 into training set S_0 and test set T_0, and measure error_T0(L_A(S_0)) − error_T0(L_B(S_0)).
- Even better, repeat this many times and average the results.
Comparing Learning Algorithms (cont.)

Rather than divide the limited training/testing data just once, do so multiple times and average the results. This is called k-fold cross-validation:
1. Partition the data D_0 into k disjoint test sets T_1, T_2, ..., T_k of equal size, where this size is at least 30.
Comparing Learning Algorithms (cont.)

2. For i from 1 to k, use T_i for the test set and the remaining data for the training set S_i:
       S_i ← {D_0 − T_i}
       h_A ← L_A(S_i)
       h_B ← L_B(S_i)
       δ_i ← error_Ti(h_A) − error_Ti(h_B)
3. Return the average difference

    δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i
Comparing Learning Algorithms (cont.)

An approximate N% confidence interval for estimating the expected difference in true error using δ̄ is given by

    δ̄ ± t_{N,k−1} s_δ̄,   where   s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δ_i − δ̄)² )

and t_{N,k−1} is a constant analogous to z_N, with k − 1 degrees of freedom.
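A sketch of the interval computation, assuming hypothetical per-fold differences δ_i; the scipy.stats.t.ppf call looks up the two-sided t value:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold differences delta_i = error_Ti(h_A) - error_Ti(h_B)
# from a k = 10 fold comparison (made-up numbers for illustration).
deltas = np.array([0.02, -0.01, 0.03, 0.01, 0.00, 0.02, 0.04, -0.02, 0.01, 0.03])
k = len(deltas)

delta_bar = deltas.mean()
# s_delta_bar = sqrt( sum((delta_i - delta_bar)^2) / (k * (k - 1)) )
s_delta_bar = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))

t_95 = stats.t.ppf(0.975, df=k - 1)  # two-sided 95%, k - 1 degrees of freedom
print(f"delta_bar = {delta_bar:.4f} +/- {t_95 * s_delta_bar:.4f}")
```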
Comparing Learning Algorithms: Further Considerations

We can determine approximate N% confidence intervals for the estimator δ̄ using a statistical test called a paired t test. A paired test is one where the hypotheses are compared over identical samples (unlike the discussion of comparing hypotheses above). The t test uses the t distribution (instead of the Normal distribution).
Comparing Learning Algorithms: Further Considerations (cont.)

Another paired test which is increasingly used is the Wilcoxon signed rank test. It has the advantage that, unlike the t test, it does not assume any particular distribution underlying the error (i.e. it is a non-parametric test). Rather than partitioning the available data D_0 into k disjoint equal-sized partitions, we can repeatedly randomly select a test set of n ≥ 30 examples from D_0 and use the rest for training.
Comparing Learning Algorithms: Further Considerations (cont.)

- We can do this indefinitely many times, to shrink the confidence intervals to arbitrary width.
- However, the test sets are no longer independently drawn from the underlying instance distribution D, since instances will recur in separate test sets.
- In k-fold cross-validation each instance is included in only one test set.
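For illustration, scipy.stats.wilcoxon applies the Wilcoxon signed rank test directly to a set of per-fold differences; the values below are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold differences delta_i; the Wilcoxon signed rank test
# asks whether their median differs from zero without assuming Normality.
deltas = np.array([0.021, -0.011, 0.032, 0.014, 0.024,
                   0.041, -0.022, 0.012, 0.033, 0.018])

statistic, p_value = stats.wilcoxon(deltas)
print(f"Wilcoxon statistic = {statistic}, p = {p_value:.3f}")
```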
Disadvantages of Accuracy as a Measure (I)

Accuracy is not always a good measure. Consider a two-class classification problem where 995 of 1000 instances in a test sample are negative and 5 are positive: a classifier that always predicts negative will have an accuracy of 99.5%, even though it never correctly predicts positive examples.
Confusion Matrices

We can get deeper insights into classifier behaviour by using a confusion matrix.
Confusion Matrices (cont.)

For a two-class problem, the confusion matrix tabulates predicted against actual classes:

                    Predicted +ve   Predicted −ve
    Actual +ve      TP              FN
    Actual −ve      FP              TN

From these counts we can define:
- true positive rate (TPR, sensitivity): TP / (TP + FN)
- false positive rate (FPR): FP / (FP + TN)
- accuracy: (TP + TN) / (TP + FN + FP + TN)
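A minimal sketch of tallying these counts from paired actual/predicted labels; the labels here are invented for illustration, with 1 encoding +ve and 0 encoding −ve:

```python
# Made-up actual and predicted labels for a two-class problem.
actual    = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

print(f"TPR = {tp / (tp + fn):.2f}, FPR = {fp / (fp + tn):.2f}, "
      f"accuracy = {(tp + tn) / len(actual):.2f}")
```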
Disadvantages of Accuracy as a Measure (II)

Accuracy also ignores the possibility of different misclassification costs: incorrectly predicting +ve may be more or less costly than incorrectly predicting −ve, e.g.:
- not treating an ill patient vs. treating a healthy one
- refusing credit to a credit-worthy client vs. extending credit to a client who defaults
Disadvantages of Accuracy as a Measure (II) (cont.)

To address this, many classifiers have parameters that can be adjusted to allow an increased TPR at the cost of an increased FPR, or a decreased FPR at the cost of a decreased TPR. For each such parameter setting a (TPR, FPR) pair results, and the results may be plotted on a ROC graph (ROC = receiver operating characteristic).
Disadvantages of Accuracy as a Measure (II) (cont.)

The ROC graph provides a graphical summary of the trade-offs between sensitivity and specificity. The term originated in signal detection theory, e.g. identifying radar signals of enemy aircraft in noisy environments. See Witten and Frank, Chapter 5.7, and http://en.wikipedia.org/wiki/receiver-operating-characteristic for more detail.
ROC Graphs

[ROC graph: FPR on the x-axis, TPR on the y-axis. The point (0, 1) corresponds to a perfect classifier, (0, 0) to always predicting −ve, (1, 1) to always predicting +ve, and the diagonal TPR = FPR to random guessing.]
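A sketch of how such (FPR, TPR) pairs arise in practice: sweep a decision threshold over classifier confidence scores and compute one point per threshold. The scores and labels below are hypothetical:

```python
import numpy as np

# Hypothetical classifier scores (higher = more confident the example is +ve)
# and true labels; sweeping a decision threshold traces out the ROC curve.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   1,   0,   0,   0])

pos, neg = labels.sum(), (1 - labels).sum()
for threshold in np.unique(scores)[::-1]:
    predicted = (scores >= threshold).astype(int)
    tpr = ((predicted == 1) & (labels == 1)).sum() / pos
    fpr = ((predicted == 1) & (labels == 0)).sum() / neg
    print(f"threshold {threshold:.2f}: (FPR, TPR) = ({fpr:.2f}, {tpr:.2f})")
```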
Summary

- Confidence intervals give us a way of assessing how likely the true error of a hypothesis is to fall within an interval around the error observed over a sample.
- For many practical purposes we will be interested in a one-sided confidence interval only.
- The approach to confidence intervals for sample error may be generalized to apply to any estimator which is the mean of some sample, e.g. to derive a confidence interval for the estimated difference in true error between two hypotheses.
Summary (cont.)

- Differences between learning algorithms, as opposed to hypotheses, are typically assessed by k-fold cross-validation.
- Accuracy (the complement of error) has a number of disadvantages as the sole measure of a learning algorithm.
- Deeper insight may be obtained using a confusion matrix, which allows us to distinguish the numbers of false positives/negatives from true positives/negatives.
- The costs of different classification errors may be taken into account using ROC graphs.