Evaluating Hypotheses
IEEE Expert, October 1996
Evaluating Hypotheses
- Sample error, true error
- Confidence intervals for observed hypothesis error
- Estimators
- Binomial distribution, Normal distribution, Central Limit Theorem
- Paired t tests
- Comparing learning methods
Evaluating Hypotheses and Learners
Consider hypotheses H_1 and H_2 learned by learners L_1 and L_2.
- How do we learn a hypothesis H and estimate its accuracy with limited data?
- How well does the observed accuracy of H over a limited sample estimate its accuracy over unseen data?
- If H_1 outperforms H_2 on the sample, will H_1 outperform H_2 in general?
- Can we draw the same conclusion for L_1 and L_2?
Two Definitions of Error
The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
    error_D(h) ≡ Pr_{x ∈ D}[f(x) ≠ h(x)]
The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:
    error_S(h) ≡ (1/n) Σ_{x ∈ S} δ(f(x) ≠ h(x))
where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
How well does error_S(h) estimate error_D(h)?
Problems Estimating Error
1. Bias: If S is the training set, error_S(h) is optimistically biased:
       bias ≡ E[error_S(h)] − error_D(h)
   For an unbiased estimate, h and S must be chosen independently.
2. Variance: Even with an unbiased S, error_S(h) may still vary from error_D(h).
Example
Hypothesis h misclassifies 12 of the 40 examples in S:
    error_S(h) = 12/40 = 0.30
What is error_D(h)?
Estimators
Experiment:
1. Choose sample S of size n according to distribution D.
2. Measure error_S(h).
error_S(h) is a random variable (i.e., the result of an experiment).
error_S(h) is an unbiased estimator for error_D(h).
Given observed error_S(h), what can we conclude about error_D(h)?
Confidence Intervals
If S contains n examples, drawn independently of h and of each other, and n ≥ 30, then:
With approximately 95% probability, error_D(h) lies in the interval
    error_S(h) ± 1.96 √(error_S(h)(1 − error_S(h)) / n)
Confidence Intervals
If S contains n examples, drawn independently of h and of each other, and n ≥ 30, then:
With approximately N% probability, error_D(h) lies in the interval
    error_S(h) ± z_N √(error_S(h)(1 − error_S(h)) / n)
where
    N%:  50%   68%   80%   90%   95%   98%   99%
    z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58
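The interval above is straightforward to compute. A minimal sketch (the function name is illustrative, not from the slides), using the slides' running example of 12 misclassifications out of 40:

```python
import math

def error_confidence_interval(error_s, n, z=1.96):
    """Two-sided confidence interval for error_D(h) given sample error.

    error_s: observed sample error error_S(h)
    n:       number of test examples (rule of thumb: n >= 30)
    z:       z_N value, e.g. 1.96 for approximately 95% confidence
    """
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# Slide example: 12 of 40 examples misclassified, 95% confidence
lo, hi = error_confidence_interval(0.30, 40)  # roughly (0.16, 0.44)
```

Note that with only n = 40 the interval is wide: the point estimate 0.30 could plausibly correspond to a true error anywhere from about 0.16 to 0.44.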
error_S(h) is a Random Variable
Rerun the experiment with a different randomly drawn S (of size n).
Probability of observing r misclassified examples:
    P(r) = (n! / (r!(n − r)!)) · error_D(h)^r (1 − error_D(h))^(n−r)
[Figure: Binomial distribution for n = 40, p = 0.3]
Binomial Probability Distribution
[Figure: Binomial distribution for n = 40, p = 0.3]
    P(r) = (n! / (r!(n − r)!)) · p^r (1 − p)^(n−r)
Probability P(r) of r heads in n coin flips, if p = Pr(heads).
Expected, or mean, value of X: E[X] ≡ Σ_{i=0}^{n} i·P(i) = np
Variance of X: Var(X) ≡ E[(X − E[X])²] = np(1 − p)
Standard deviation of X: σ_X ≡ √(E[(X − E[X])²]) = √(np(1 − p))
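The binomial quantities above can be checked directly; a small sketch using the figure's parameters n = 40, p = 0.3:

```python
import math

def binomial_pmf(r, n, p):
    """P(r): probability of exactly r misclassifications among n examples,
    when the true error rate is p = error_D(h)."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.3
mean = n * p                    # E[X] = np = 12.0
var = n * p * (1 - p)           # Var(X) = np(1-p) = 8.4
peak = binomial_pmf(12, n, p)   # the distribution's mode sits at r = np
```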
Normal Distribution Approximates Binomial
error_S(h) follows a Binomial distribution, with
    mean μ_{error_S(h)} = error_D(h)
    standard deviation σ_{error_S(h)} = √(error_D(h)(1 − error_D(h)) / n)
Approximate this by a Normal distribution with
    mean μ_{error_S(h)} = error_D(h)
    standard deviation σ_{error_S(h)} ≈ √(error_S(h)(1 − error_S(h)) / n)
Normal Probability Distribution
[Figure: Normal distribution with mean 0, standard deviation 1]
    p(x) = (1/√(2πσ²)) e^{−(1/2)((x − μ)/σ)²}
The probability that X falls into the interval (a, b) is given by ∫_a^b p(x) dx.
Expected, or mean, value of X: E[X] = μ
Variance of X: Var(X) = σ²
Standard deviation of X: σ_X = σ
Normal Probability Distribution
[Figure: standard Normal density with shaded central region]
80% of the area (probability) lies in μ ± 1.28σ.
N% of the area (probability) lies in μ ± z_N σ:
    N%:  50%   68%   80%   90%   95%   98%   99%
    z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58
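The z_N table can be reproduced from the standard Normal's inverse CDF; a sketch using Python's built-in `statistics.NormalDist` (the table rounds, e.g. 68% strictly corresponds to z ≈ 0.99, while 1.00 covers 68.27%):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_value(confidence_pct):
    """z_N such that N% of the probability mass lies within mu +/- z_N*sigma."""
    alpha = 1 - confidence_pct / 100
    return std_normal.inv_cdf(1 - alpha / 2)

for n_pct in (50, 80, 90, 95, 98, 99):
    print(f"{n_pct}%: z_N = {z_value(n_pct):.2f}")
```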
Confidence Intervals, More Correctly
If S contains n examples, drawn independently of h and of each other, and n ≥ 30, then:
With approximately 95% probability, error_S(h) lies in the interval
    error_D(h) ± 1.96 √(error_D(h)(1 − error_D(h)) / n)
Equivalently, error_D(h) lies in the interval
    error_S(h) ± 1.96 √(error_D(h)(1 − error_D(h)) / n)
which is approximately
    error_S(h) ± 1.96 √(error_S(h)(1 − error_S(h)) / n)
Two-Sided and One-Sided Bounds
[Figures: two-sided vs. one-sided tails of the Normal density]
If μ − z_N σ ≤ y ≤ μ + z_N σ with confidence N = 100(1 − α)%, then
    y ≤ μ + z_N σ with confidence N = 100(1 − α/2)%, and
    μ − z_N σ ≤ y with confidence N = 100(1 − α/2)%.
Example: n = 40, r = 12
Two-sided, 95% confidence (α = 0.05): P(0.16 ≤ y ≤ 0.44) = 0.95
One-sided: P(y ≤ 0.44) = P(y ≥ 0.16) = 1 − α/2 = 0.975
Calculating Confidence Intervals
1. Pick the parameter p to estimate: error_D(h).
2. Choose an estimator: error_S(h).
3. Determine the probability distribution that governs the estimator: error_S(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30.
4. Find the interval (L, U) such that N% of the probability mass falls in the interval: use the table of z_N values.
Central Limit Theorem
Consider a set of independent, identically distributed random variables Y_1 … Y_n, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean
    Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i
Central Limit Theorem. As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean μ and variance σ²/n.
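A quick simulation illustrates the theorem with a decidedly non-Normal Y_i: the 0/1 misclassification indicator (Bernoulli with p = 0.3). The sample means cluster Normally around μ = 0.3 with variance p(1 − p)/n, as the CLT predicts. (The parameters here are illustrative, not from the slides.)

```python
import random
import statistics

random.seed(0)
p, n, trials = 0.3, 40, 5000

# Each trial: average n Bernoulli(p) draws, i.e., one sample error measurement
sample_means = [
    sum(1 if random.random() < p else 0 for _ in range(n)) / n
    for _ in range(trials)
]

mean_of_means = statistics.mean(sample_means)     # close to mu = 0.3
var_of_means = statistics.variance(sample_means)  # close to p(1-p)/n = 0.00525
```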
Difference Between Hypotheses
Test h_1 on sample S_1, test h_2 on S_2.
1. Pick the parameter to estimate:
       d ≡ error_D(h_1) − error_D(h_2)
2. Choose an estimator:
       d̂ ≡ error_{S1}(h_1) − error_{S2}(h_2)
3. Determine the probability distribution that governs the estimator:
       σ_d̂ ≈ √(error_{S1}(h_1)(1 − error_{S1}(h_1))/n_1 + error_{S2}(h_2)(1 − error_{S2}(h_2))/n_2)
4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
       d̂ ± z_N √(error_{S1}(h_1)(1 − error_{S1}(h_1))/n_1 + error_{S2}(h_2)(1 − error_{S2}(h_2))/n_2)
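This interval can be sketched directly (the function name is illustrative); here with error_{S1}(h_1) = 0.30 and error_{S2}(h_2) = 0.20 on independent test samples of 100 examples each:

```python
import math

def diff_confidence_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for d = error_D(h1) - error_D(h2), estimated
    from independent test samples S1 (size n1) and S2 (size n2)."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

# error_S1(h1) = 0.30 on n1 = 100, error_S2(h2) = 0.20 on n2 = 100
lo, hi = diff_confidence_interval(0.30, 100, 0.20, 100)
```

Note the 95% interval includes zero, so even an observed difference of 0.10 does not rule out d = 0 at this sample size.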
Hypothesis Testing Example
P(error_D(h_1) > error_D(h_2)) = ?
|S_1| = |S_2| = 100
error_{S1}(h_1) = 0.30, error_{S2}(h_2) = 0.20
d̂ = 0.10, σ_d̂ ≈ 0.061
P(d̂ < μ_d̂ + 0.10) = probability that d̂ does not overestimate d by more than 0.10
z_N σ_d̂ = 0.10 ⟹ z_N ≈ 1.64
P(d̂ < μ_d̂ + 1.64 σ_d̂) ≈ 0.95
I.e., reject the null hypothesis at the 0.05 level of significance.
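The arithmetic on this slide amounts to a one-sided z-test; a sketch reproducing its numbers:

```python
import math
from statistics import NormalDist

# Observed errors and sample sizes from the slide
e1, n1 = 0.30, 100
e2, n2 = 0.20, 100

d_hat = e1 - e2                                  # 0.10
sigma = math.sqrt(e1 * (1 - e1) / n1
                  + e2 * (1 - e2) / n2)          # ~0.061

# One-sided test of the null hypothesis d = 0
z = d_hat / sigma                                # ~1.64
p_h1_worse = NormalDist().cdf(z)                 # ~0.95, so reject at 0.05
```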
Paired t Test to Compare h_A, h_B
1. Partition the data into k disjoint test sets T_1, T_2, …, T_k of equal size, where this size is at least 30.
2. For i from 1 to k, do: δ_i ← error_{Ti}(h_A) − error_{Ti}(h_B)
3. Return the value δ̄, where δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i
N% confidence interval estimate for d: δ̄ ± t_{N,k−1} s_δ̄
where
    s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δ_i − δ̄)² )
Note: the δ_i are approximately Normally distributed.
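The steps above can be sketched as follows. Since the t critical value depends on the degrees of freedom, it is passed in from a t-table (2.26 is the standard two-sided 95% value for k = 10 folds, i.e., 9 degrees of freedom); the per-fold differences below are hypothetical.

```python
import math
import statistics

def paired_t_interval(deltas, t_crit):
    """Confidence interval for d from paired per-fold differences
    delta_i = error_Ti(hA) - error_Ti(hB).

    t_crit: t_{N, k-1} from a t-table, e.g. 2.26 for 95% with k = 10.
    """
    k = len(deltas)
    delta_bar = statistics.mean(deltas)
    # s_delta_bar = sqrt( (1/(k(k-1))) * sum (delta_i - delta_bar)^2 )
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar - t_crit * s, delta_bar + t_crit * s

# Hypothetical per-fold error differences over k = 10 test sets
deltas = [0.02, 0.04, -0.01, 0.03, 0.00, 0.05, 0.01, 0.02, 0.03, 0.01]
lo, hi = paired_t_interval(deltas, t_crit=2.26)
```

Because every δ_i compares both hypotheses on the same test set, the fold-to-fold variation cancels, typically giving a much tighter interval than testing h_A and h_B on independent samples.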
Comparing Learning Algorithms L_A and L_B
What we'd like to estimate:
    E_{S⊂D}[error_D(L_A(S)) − error_D(L_B(S))]
where L(S) is the hypothesis output by learner L using training set S, i.e., the expected difference in true error between the hypotheses output by learners L_A and L_B, when trained on randomly selected training sets S drawn according to distribution D.
But given limited data D_0, what is a good estimator?
- Could partition D_0 into training set S_0 and test set T_0, and measure
      error_{T0}(L_A(S_0)) − error_{T0}(L_B(S_0))
- Even better: repeat this many times and average the results (next slide).
Comparing Learning Algorithms L_A and L_B
1. Partition data D_0 into k disjoint test sets T_1, T_2, …, T_k of equal size, where this size is at least 30.
2. For i from 1 to k, use T_i for the test set and the remaining data for the training set S_i:
       S_i ← {D_0 − T_i}
       h_A ← L_A(S_i)
       h_B ← L_B(S_i)
       δ_i ← error_{Ti}(h_A) − error_{Ti}(h_B)
3. Return the value δ̄, where δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i
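The k-fold procedure above can be sketched as below. `learner_a` and `learner_b` are hypothetical stand-ins: each takes a training set of (example, label) pairs and returns a classifier (a function from example to label).

```python
def compare_learners(data, learner_a, learner_b, k):
    """Return delta_bar, the mean per-fold test-error difference
    between the hypotheses output by two learners."""
    fold_size = len(data) // k
    deltas = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]              # T_i
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]   # S_i = D0 - T_i
        h_a, h_b = learner_a(train), learner_b(train)               # train both on S_i
        def err(h):
            return sum(h(x) != y for x, y in test) / len(test)      # error_Ti(h)
        deltas.append(err(h_a) - err(h_b))                          # delta_i
    return sum(deltas) / k                                          # delta_bar
```

A negative δ̄ means L_A's hypotheses had lower test error on average than L_B's.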
Comparing Learning Algorithms L_A and L_B
Notice we'd like to use the paired t test on δ̄ to obtain a confidence interval.
But this is not really correct, because the training sets in this algorithm are not independent (they overlap!).
It is more correct to view the algorithm as producing an estimate of
    E_{S⊂D_0}[error_D(L_A(S)) − error_D(L_B(S))]
instead of
    E_{S⊂D}[error_D(L_A(S)) − error_D(L_B(S))]
But even this approximation is better than no comparison.
More on t-Tests
Good for comparing two learners, but not for multiple pairs: P(reject null hypothesis) grows rapidly with the number of pairs.
Null hypothesis: the learners perform equally.
    α_e = 1 − (1 − α_c)^m,   m = k(k − 1)/2
where
    α_e = P(incorrectly rejecting "all means equal")
    α_c = P(incorrectly rejecting the null hypothesis for one pair)
    k = number of learners
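The family-wise error formula above is easy to evaluate; for instance, comparing 5 learners pairwise at α_c = 0.05 already gives about a 40% chance of at least one spurious rejection:

```python
def familywise_error(alpha_c, k):
    """alpha_e = 1 - (1 - alpha_c)^m for m = k(k-1)/2 pairwise t tests,
    each run at per-comparison significance alpha_c."""
    m = k * (k - 1) // 2          # number of learner pairs
    return 1 - (1 - alpha_c) ** m

alpha_e = familywise_error(0.05, 5)   # 10 pairs -> ~0.40
```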
Analysis of Variance (ANOVA)
P(means are not equal); MS = mean square.
    MS_between = (1/df_b) Σ_j (x̄_j − x̄)²
    MS_within = (1/df_w) Σ_j Σ_k (x_jk − x̄_j)²
where
    j = number of groups (learners)
    k = number of trials per group
    x̄_j = mean of the trials for group j (i.e., error)
    x̄ = mean of all trials over all groups
    df_b = degrees of freedom between groups = j − 1
    df_w = degrees of freedom within all groups = j(k − 1)
If there is no difference between groups, then MS_within ≈ MS_between.
    ANOVA: F = MS_between / MS_within
Maps (F, df_b, df_w) to the probability of rejecting the null hypothesis.
Increased F leads to decreased P(means are equal).
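A minimal sketch of the one-way ANOVA F statistic following the slide's structure. Note one assumption: the standard formulation scales the between-group sum by k, the number of trials per group, so that MS_between ≈ MS_within under the null hypothesis; that scaling is included here.

```python
def anova_f(groups):
    """One-way ANOVA F statistic. `groups` is a list of j equal-sized
    lists of k error measurements, one list per learner."""
    j = len(groups)
    k = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (j * k)   # x_bar
    group_means = [sum(g) / k for g in groups]           # x_bar_j

    df_b, df_w = j - 1, j * (k - 1)
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / df_b
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / df_w
    return ms_between / ms_within

# Hypothetical error measurements for two learners over k = 3 trials each
f = anova_f([[0.31, 0.32, 0.33], [0.24, 0.25, 0.26]])
```

The resulting (F, df_b, df_w) triple would then be looked up in an F-table to get the significance level.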
Summary
Evaluating hypotheses:
- Sample error vs. true error
- Confidence intervals for observed hypothesis error
- Error estimators
- Binomial distribution, Normal distribution, Central Limit Theorem
Comparing hypotheses and learners:
- Hypothesis testing
- Paired t tests
- Cross-validation
- Analysis of variance