[Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4]

Size: px

Start display at page:

Download "[Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4]"

Ada Pope
5 years ago
Views:

1 Evaluating Hypotheses [Read Ch. 5] [Recommended exercises: 5.2, 5.3, 5.4] Sample error, true error Condence intervals for observed hypothesis error Estimators Binomial distribution, Normal distribution, Central Limit Theorem Paired t tests Comparing learning methods 74 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

2 Two Denitions of Error The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D. error D (h) Pr [f(x) 6= h(x)] x2d The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassies error S (h) 1 n X x2s (f(x) 6= h(x)) Where (f(x) 6= h(x)) is 1 if f(x) 6= h(x), and 0 otherwise. How well does error S (h) estimate error D (h)? 75 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

3 Problems Estimating Error 1. Bias: If S is training set, error S (h) is optimistically biased bias E[error S (h)]? error D (h) For unbiased estimate, h and S must be chosen independently 2. Variance: Even with unbiased S, error S (h) may still vary from error D (h) 76 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

4 Example Hypothesis h misclassies 12 of the 40 examples in S error S (h) = = :30 What is error D (h)? 77 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

5 Estimators Experiment: 1. choose sample S of size n according to distribution D 2. measure error S (h) error S (h) is a random variable (i.e., result of an experiment) error S (h) is an unbiased estimator for error D (h) Given observed error S (h) what can we conclude about error D (h)? 78 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

6 Condence Intervals If S contains n examples, drawn independently of h and each other n 30 Then With approximately 95% probability, error D (h) lies in interval vu u error S (h) 1:96 error S(h)(1? error S (h)) t n 79 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

7 Condence Intervals If S contains n examples, drawn independently of h and each other n 30 Then With approximately N% probability, error D (h) lies in interval where vu u error S (h) z error S(h)(1? error S (h)) N t N%: 50% 68% 80% 90% 95% 98% 99% z N : n 80 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

8 error S (h) is a Random Variable Rerun the experiment with dierent randomly drawn S (of size n) Probability of observing r misclassied examples: P(r) 0.14 Binomial distribution for n = 40, p = P (r) = n! r!(n? r)! error D(h) r (1? error D (h)) n?r 81 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

9 Binomial Probability Distribution P(r) Binomial distribution for n = 40, p = n! P (r) = (1 r!(n? r)! pr? p) n?r Probability P (r) of r heads in n coin ips, if p = Pr(heads) Expected, or mean value of X, E[X], is Variance of X is E[X] n X i=0 ip (i) = np V ar(x) E[(X? E[X]) 2 ] = np(1? p) Standard deviation of X, X, is X r E[(X? E[X]) 2 ] = r np(1? p) 82 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

10 Normal Distribution Approximates Binomial error S (h) follows a Binomial distribution, with mean errors (h) = error D (h) standard deviation errors (h) errors (h) = vu u t error D(h)(1? error D (h)) n Approximate this by a Normal distribution with mean errors (h) = error D (h) standard deviation errors (h) errors (h) vu u t error S(h)(1? error S (h)) n 83 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

11 Normal Probability Distribution Normal distribution with mean 0, standard deviation p(x) = 1 p 2 2 e? 1 2 ( x? )2 The probability that X will fall into the interval (a; b) is given by Z b a p(x)dx Expected, or mean value of X, E[X], is Variance of X is E[X] = V ar(x) = 2 Standard deviation of X, X, is X = 84 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

12 Normal Probability Distribution % of area (probability) lies in 1:28 N% of area (probability) lies in z N N%: 50% 68% 80% 90% 95% 98% 99% z N : lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

13 Condence Intervals, More Correctly If S contains n examples, drawn independently of h and each other n 30 Then With approximately 95% probability, error S (h) lies in interval vu u error D (h) 1:96 error D(h)(1? error D (h)) t equivalently, error D (h) lies in interval vu u error S (h) 1:96 error D(h)(1? error D (h)) t which is approximately vu u error S (h) 1:96 error S(h)(1? error S (h)) t n n n 86 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

14 Central Limit Theorem Consider a set of independent, identically distributed random variables Y 1 : : : Y n, all governed by an arbitrary probability distribution with mean and nite variance 2. Dene the sample mean, Y 1 n nx Y i i=1 Central Limit Theorem. As n! 1, the distribution governing Y approaches a Normal distribution, with mean and variance 2 n. 87 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

15 Calculating Condence Intervals 1. Pick parameter p to estimate error D (h) 2. Choose an estimator error S (h) 3. Determine probability distribution that governs estimator error S (h) governed by Binomial distribution, approximated by Normal when n Find interval (L; U) such that N% of probability mass falls in the interval Use table of z N values 88 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

16 Dierence Between Hypotheses Test h 1 on sample S 1, test h 2 on S 2 1. Pick parameter to estimate 2. Choose an estimator d error D (h 1 )? error D (h 2 ) ^d error S1 (h 1 )? error S2 (h 2 ) 3. Determine probability distribution that governs estimator ^d s error S1 (h 1 )(1? error S1 (h 1 )) n 1 + error S2 (h 2)(1? error S2 (h 2)) n 2 4. Find interval (L; U) such that N% of probability mass falls in the interval ^dz N v uuuuut error S1 (h 1 )(1? error S1 (h 1 )) n 1 + error S2 (h 2)(1? error S n 2 89 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

17 Paired t test to compare h A,h B 1. Partition data into k disjoint test sets T 1 ; T 2 ; : : : ; T k of equal size, where this size is at least For i from 1 to k, do i error Ti (h A )? error Ti (h B ) 3. Return the value, where 1 k kx i i=1 N% condence interval estimate for d: Note i s v u u t t N;k?1 s 1 k(k? 1) kx i=1 ( i? ) 2 approximately Normally distributed 90 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

18 Comparing learning algorithms L A and L B What we'd like to estimate: E SD [error D (L A (S))? error D (L B (S))] where L(S) is the hypothesis output by learner L using training set S i.e., the expected dierence in true error between hypotheses output by learners L A and L B, when trained using randomly selected training sets S drawn according to distribution D. But, given limited data D 0, what is a good estimator? could partition D 0 into training set S and training set T 0, and measure error T0 (L A (S 0 ))? error T0 (L B (S 0 )) even better, repeat this many times and average the results (next slide) 91 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

19 Comparing learning algorithms L A and L B 1. Partition data D 0 into k disjoint test sets T 1 ; T 2 ; : : : ; T k of equal size, where this size is at least For i from 1 to k, do use T i for the test set, and the remaining data for training set S i S i fd 0? T i g h A L A (S i ) h B L B (S i ) i error Ti (h A )? error Ti (h B ) 3. Return the value, where 1 k kx i i=1 92 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

20 Comparing learning algorithms L A and L B Notice we'd like to use the paired t test on to obtain a condence interval but not really correct, because the training sets in this algorithm are not independent (they overlap!) more correct to view algorithm as producing an estimate of instead of E SD0 [error D (L A (S))? error D (L B (S))] E SD [error D (L A (S))? error D (L B (S))] but even this approximation is better than no comparison 93 lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997

Evaluating Hypotheses

Evaluating Hypotheses IEEE Expert, October 1996 1 Evaluating Hypotheses Sample error, true error Confidence intervals for observed hypothesis error Estimators Binomial distribution, Normal distribution,