Empirical Evaluation (Ch 5)

Gertrude Cameron
5 years ago

1 Empirical Evaluation (Ch 5)
How accurate is a hypothesis/model/decision tree? Given 2 hypotheses, which is better?
Accuracy on the training set is a biased estimate. Training error: $\mathrm{error}_{train}(h) = \#\text{misclassifications} / |S_{train}|$, and typically $\mathrm{error}_D(h) \ge \mathrm{error}_{train}(h)$.
We could set aside a random subset of the data for testing.
The sample error for any finite sample $S$ drawn randomly from $D$ is unbiased, but not necessarily the same as the true error: in general $\mathrm{error}_S(h) \ne \mathrm{error}_D(h)$.
What we want is an estimate of the true accuracy over the distribution $D$.

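2 Confidence Intervals
Put a bound on $\mathrm{error}_D(h)$ based on the Binomial distribution.
Suppose the sample error rate is $\mathrm{error}_S(h) = p$ on a sample of size $n$. Then the 95% CI for $\mathrm{error}_D(h)$ is
$p \pm 1.96\,\sqrt{p(1-p)/n}$
The estimate is centered on $\mathrm{error}_S(h) = p$, with variance $\sigma^2 = p(1-p)/n$.
Standard deviation $\sigma = \sqrt{var}$; the factor 1.96 comes from the confidence level (95%).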

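3 Binomial Distribution
Put a bound on $\mathrm{error}_D(h)$ based on the Binomial distribution.
Suppose the true error rate is $\mathrm{error}_D(h) = p$.
On a sample of size $n$ we would expect $np$ errors on average, but the count could vary around that due to sampling: its standard deviation is $\sqrt{np(1-p)}$.
For the error rate, as a proportion: mean $p$, standard deviation $\sigma = \sqrt{p(1-p)/n}$.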
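The sampling variation described on this slide can be checked with a small simulation. This is only a sketch: the true error rate `p = 0.2`, the sample size `n = 100`, the number of trials, and the helper name `simulate_error_counts` are all assumptions made for illustration, not values from the slides.

```python
import math
import random

def simulate_error_counts(p, n, trials, seed=0):
    """Draw `trials` test sets of size n; return the error count on each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]

p, n = 0.2, 100   # hypothetical true error rate and test-set size
counts = simulate_error_counts(p, n, trials=2000)

mean_count = sum(counts) / len(counts)
rates = [c / n for c in counts]              # error rate as a proportion
mean_rate = sum(rates) / len(rates)
std_rate = math.sqrt(sum((r - mean_rate) ** 2 for r in rates) / len(rates))

print(f"mean errors per test set: {mean_count:.1f} (np = {n * p:.0f})")
print(f"std. dev. of error rate:  {std_rate:.3f} "
      f"(sqrt(p(1-p)/n) = {math.sqrt(p * (1 - p) / n):.3f})")
```

The simulated mean should land close to np and the spread of the error rate close to sqrt(p(1-p)/n), which is the quantity the confidence intervals are built from.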

4 Hypothesis Testing
Is $\mathrm{error}_D(h) < 0.2$ (is the error rate of h less than 20%)?
Example: is h better than the majority classifier? (suppose $\mathrm{error}_{maj}(h) = 20\%$)
If we approximate the Binomial as Normal, then $\mu \pm 2\sigma$ should bound 95% of the likely range for $\mathrm{error}_D(h)$.
Two-tailed test: the risk of the true error being higher or lower is 5%, so $\Pr[\text{Type I error}] \le 0.05$.
Restrictions: $n \ge 30$ or $np(1-p) \ge 5$.

5 Gaussian Distribution
$\mu \pm 1.28\sigma$ covers 80% of the distribution.
z-score: relative distance of a value $x$ from the mean, $z = (x - \mu)/\sigma$.

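6 For a one-tailed test, all of the risk $\alpha$ sits in one tail, whereas a two-sided interval puts only $\alpha/2$ in each tail. So use the z value of the two-sided interval whose confidence level is $1 - 2\alpha$.
For example, suppose $\mathrm{error}_S(h) = 0.19$ and you want 95% confidence that $\mathrm{error}_D(h) < 20\%$. Then test whether $0.2 - \mathrm{error}_S(h) > 1.64\sigma$.
The 1.64 comes from the z-score for the two-sided 90% interval.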
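A minimal sketch of this one-tailed check follows. The slide gives error_S(h) = 0.19 and the 20% threshold but no test-set size, so the sizes tried below are hypothetical, as is the function name `error_below_threshold`; sigma is taken from the Binomial formula sqrt(p(1-p)/n).

```python
import math

def error_below_threshold(sample_error, n, threshold, z_one_sided=1.64):
    """One-tailed test: is threshold - sample_error larger than z * sigma?"""
    sigma = math.sqrt(sample_error * (1 - sample_error) / n)
    return threshold - sample_error > z_one_sided * sigma

# error_S(h) = 0.19 is from the slide; the test-set sizes are assumptions.
for n in (100, 1000, 10000):
    verdict = error_below_threshold(0.19, n, threshold=0.20)
    print(f"n = {n:>5}: 95% confident that error_D(h) < 20%? {verdict}")
```

A one-point gap between 19% and 20% only becomes significant once sigma is small, which in this sketch requires the largest of the three sample sizes.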

7 Notice that the confidence interval on the error rate tightens with larger sample sizes.
Example: compare 2 trials that both have 10% error.
Test set A has 100 examples, h makes 10 errors: $p = 10/100$, $\sigma = \sqrt{0.1 \times 0.9 / 100} = 0.03$, $CI_{95\%}(err(h)) = 10 \pm 6\% = [4\%, 16\%]$
Test set B has 100,000 examples, 10,000 errors: $p = 10{,}000/100{,}000$, $\sigma = \sqrt{0.1 \times 0.9 / 100{,}000} \approx 0.00095$, $CI_{95\%}(err(h)) = 10 \pm 0.19\% = [9.81\%, 10.19\%]$
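The two intervals in this example can be recomputed in a few lines. This is a sketch using the slide's rounding of the 95% factor to 2 standard deviations; the function name `error_ci` is made up for illustration.

```python
import math

def error_ci(errors, n, z=2.0):
    """Return (error rate, half-width of the CI) under the Normal approximation."""
    p = errors / n
    sigma = math.sqrt(p * (1 - p) / n)
    return p, z * sigma

for name, errors, n in [("A", 10, 100), ("B", 10_000, 100_000)]:
    p, half = error_ci(errors, n)
    print(f"test set {name}: {100 * p:.0f}% +/- {100 * half:.2f}% "
          f"= [{100 * (p - half):.2f}%, {100 * (p + half):.2f}%]")
```

Because the half-width scales as 1/sqrt(n), multiplying the test set size by 1000 shrinks the interval by a factor of about 32.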

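8 Comparing 2 hypotheses (decision trees)
Estimate the difference $d = \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2)$ and test whether 0 is in the confidence interval of the difference.
Add the variances: $\sigma_d^2 = p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2$
example...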
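As a sketch of the "add variances" step, the snippet below builds a 95% interval for the difference in error between two hypotheses tested on independent samples and checks whether it contains 0. The error rates and sample sizes are hypothetical, and `difference_ci` is an illustrative name.

```python
import math

def difference_ci(e1, n1, e2, n2, z=1.96):
    """CI for error_D(h1) - error_D(h2), assuming independent test samples."""
    d = e1 - e2
    # Variances of the two Binomial proportions add for independent samples.
    sigma_d = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d - z * sigma_d, d + z * sigma_d

# Hypothetical results: h1 errs on 30% of 100 examples, h2 on 20% of 100.
low, high = difference_ci(0.30, 100, 0.20, 100)
significant = not (low <= 0.0 <= high)
print(f"95% CI for the difference: [{low:.3f}, {high:.3f}]")
print("difference is significant" if significant else "0 is inside the CI")
```

With only 100 examples per test set, even a 10-point difference leaves 0 inside the interval in this sketch.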

9 Estimating the accuracy of a Learning Algorithm
$\mathrm{error}_S(h)$ is the error rate of one particular hypothesis, which depends on the training data it was built from.
What we want is an estimate of the error for any training set drawn from the distribution.
We could repeat the splitting of the data into independent training/testing sets, build and test k decision trees, and take the average.

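10 k-fold Cross-Validation
Partition the dataset D into k subsets of equal size ($\ge 30$), $T_1 .. T_k$
for i from 1 to k do:
    S_i = D - T_i          // training set, 90% of D (for k=10)
    h_i = L(S_i)           // build decision tree
    e_i = error(h_i, T_i)  // test d-tree on 10% held out
$\mu = \frac{1}{k}\sum_{i=1}^{k} e_i$
$\sigma = \sqrt{\frac{1}{k}\sum_{i=1}^{k}(e_i - \mu)^2}$
$SE = \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(e_i - \mu)^2}$
$CI_{95} = \mu \pm t_{dof,\alpha}\,SE$ ($t_{dof,\alpha} = 2.23$ for k=10 and $\alpha$=95%; 2.23 is the value for dof = 10, and for dof = k-1 = 9 it is 2.26)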
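The loop above can be sketched in plain Python. Everything concrete here is an assumption for illustration: the toy dataset, the helper names, and in particular `majority_learner`, a trivial stand-in for the decision-tree learner L that the slides have in mind.

```python
import random

def k_fold_errors(data, learner, k=10, seed=0):
    """Return the held-out error rate e_i for each of the k folds."""
    data = data[:]                           # shuffle a copy, not the caller's list
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # T_1..T_k, sizes differ by at most 1
    errors = []
    for i in range(k):
        # S_i = D - T_i: every fold except the held-out one
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learner(train)                   # h_i = L(S_i)
        wrong = sum(h(x) != y for x, y in folds[i])
        errors.append(wrong / len(folds[i])) # e_i = error(h_i, T_i)
    return errors

def majority_learner(train):
    """Toy learner (assumption): always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

# Hypothetical dataset: 300 examples, one in five labeled 1.
data = [(i, 1 if i % 5 == 0 else 0) for i in range(300)]
errors = k_fold_errors(data, majority_learner)
mu = sum(errors) / len(errors)
print("per-fold errors:", [round(e, 3) for e in errors])
print(f"mean error: {mu:.3f}")
```

Any function that maps a training list to a callable hypothesis can be passed as `learner`, so a real decision-tree builder could be swapped in without changing the loop.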

11 k-fold Cross-Validation
Typically k=10.
Note that this is a biased estimator: it probably under-estimates the true accuracy because each tree is built from fewer examples.
This is a disadvantage of CV: building d-trees with only 90% of the data (and it takes 10 times as long).

12 What to do with the 10 accuracies from CV?
The accuracy of the algorithm is just the mean, $\mu = \frac{1}{k}\sum_{i=1}^{k} acc_i$.
For the CI, use the standard error (SE), the standard deviation for the estimate of the mean:
$\sigma = \sqrt{\frac{1}{k}\sum_{i=1}^{k}(e_i - \mu)^2}$
$SE = \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(e_i - \mu)^2}$
$95\%\ CI = \mu \pm t_{dof,\alpha}\sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(e_i - \mu)^2}$
Central Limit Theorem: we are estimating a statistic (a parameter of a distribution, e.g. the mean) from multiple trials. Regardless of the underlying distribution, the estimate of the mean approaches a Normal distribution. If the std. dev. of the underlying distribution is $\sigma$, then the std. dev. of the mean is $\sigma/\sqrt{n}$.
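A short sketch of the mean, standard error, and 95% interval for ten cross-validation accuracies. The accuracy values are invented for illustration, and the t value 2.262 is the two-sided 95% entry for dof = k-1 = 9.

```python
import math

def mean_and_standard_error(values):
    """Mean of the values and the standard error of that mean."""
    k = len(values)
    mu = sum(values) / k
    se = math.sqrt(sum((v - mu) ** 2 for v in values) / (k * (k - 1)))
    return mu, se

# Hypothetical accuracies from 10-fold CV.
accs = [0.70, 0.74, 0.68, 0.71, 0.66, 0.73, 0.69, 0.72, 0.67, 0.70]
mu, se = mean_and_standard_error(accs)

t_9 = 2.262   # two-sided 95% t value for 9 degrees of freedom
ci = (mu - t_9 * se, mu + t_9 * se)
print(f"accuracy = {mu:.3f}, SE = {se:.4f}")
print(f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Using the t value instead of 1.96 widens the interval slightly, which accounts for estimating the spread from only ten folds.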

13 Example: multiple trials of testing the accuracy of a learner, assuming true acc = 70% and $\sigma$ = 7%.
There is intrinsic variability in accuracy between different trials.
With more trials, the distribution converges to the underlying one (std. dev. stays around 7), but the estimate of the mean (vertical bars, $\pm 2\sigma/\sqrt{n}$) gets tighter.
[Figure: histograms of trial accuracies for increasing numbers of trials, each labeled with its estimate of the true mean.]
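The behavior on this slide can be reproduced with a small simulation. The true accuracy of 70% and spread of 7% come from the slide; drawing each trial from a Gaussian and the particular trial counts (10, 100, 1000) are assumptions made for this sketch.

```python
import math
import random
import statistics

rng = random.Random(0)
TRUE_ACC, SIGMA = 70.0, 7.0   # from the slide, in percent

def run_trials(n):
    """Simulate n trial accuracies (Gaussian noise is an assumption)."""
    return [rng.gauss(TRUE_ACC, SIGMA) for _ in range(n)]

results = {}
for n in (10, 100, 1000):
    accs = run_trials(n)
    mean = statistics.mean(accs)
    spread = statistics.stdev(accs)           # stays near 7 as n grows
    half_width = 2 * spread / math.sqrt(n)    # shrinks like 1/sqrt(n)
    results[n] = (mean, spread, half_width)
    print(f"n = {n:>4}: est. of true mean = {mean:.1f} +/- {half_width:.2f}, "
          f"std. dev. = {spread:.1f}")
```

The spread of individual trials does not shrink with more trials; only the uncertainty about the mean does.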

14 Student's t distribution is similar to the Normal distribution, but adjusted for small sample size; dof = k-1.
Example: $t_{9,\alpha}$ (Table 5.6)

15 Comparing 2 Learning Algorithms
e.g. ID3 with 2 different pruning methods
Approach 1: run each algorithm 10 times (using CV) independently to get a CI for the accuracy of each algorithm: acc(A), SE(A) and acc(B), SE(B).
T-test: statistical test of whether the difference of the means is $\ne 0$, with $d = acc(A) - acc(B)$.
Problem: the variance is additive (unpooled), $SE_d^2 = SE(A)^2 + SE(B)^2$.

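16 Suppose the mean acc for A is 61% ± 2 and the mean acc for B is 64% ± 2.
Table of results on the same test sets $T_i$, with columns $acc(L_A, T_i)$, $acc(L_B, T_i)$, $d = B - A$; surviving row: ...% 63% +3%, SE = 1%.
The mean difference is the same, but B is systematically higher than A.
$d = acc(B) - acc(A)$: mean = 3%, SE = 3.7 (just a guess)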

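17 Approach 2: Paired T-test
Run the algorithms in parallel on the same divisions of the data into training and test sets.
Test whether 0 is in the CI of the differences:
$\delta_i = acc(L_B, T_i) - acc(L_A, T_i)$, $\bar{\delta} = \frac{1}{k}\sum_{i=1}^{k}\delta_i$, $CI = \bar{\delta} \pm t_{k-1,\alpha}\sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i - \bar{\delta})^2}$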
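To see why pairing helps, the sketch below runs both approaches on the same numbers. The per-fold accuracies are invented so that the means are 61% and 64% as in the preceding example. The t values are the standard two-sided 95% entries for 9 and 18 degrees of freedom, and using 18 for the unpaired case is an assumption of this sketch.

```python
import math

def mean_se(values):
    """Mean and standard error of the mean."""
    k = len(values)
    mu = sum(values) / k
    se = math.sqrt(sum((v - mu) ** 2 for v in values) / (k * (k - 1)))
    return mu, se

# Hypothetical accuracies (percent) of algorithms A and B on the same 10 folds.
acc_a = [55, 66, 59, 67, 57, 63, 61, 68, 53, 61]
acc_b = [58, 69, 62, 69, 61, 66, 64, 71, 56, 64]

# Approach 2 (paired): statistics of the per-fold differences, dof = 9.
diffs = [b - a for a, b in zip(acc_a, acc_b)]
d_mean, d_se = mean_se(diffs)
t_9 = 2.262
paired_ci = (d_mean - t_9 * d_se, d_mean + t_9 * d_se)

# Approach 1 (unpaired): the variances of the two means add.
mean_a, se_a = mean_se(acc_a)
mean_b, se_b = mean_se(acc_b)
unpaired_se = math.sqrt(se_a ** 2 + se_b ** 2)
t_18 = 2.101  # dof = 18 assumed for two independent groups of 10
unpaired_ci = ((mean_b - mean_a) - t_18 * unpaired_se,
               (mean_b - mean_a) + t_18 * unpaired_se)

print(f"paired:   d = {d_mean:.1f}, CI = [{paired_ci[0]:.2f}, {paired_ci[1]:.2f}]")
print(f"unpaired: d = {mean_b - mean_a:.1f}, "
      f"CI = [{unpaired_ci[0]:.2f}, {unpaired_ci[1]:.2f}]")
```

Both approaches report the same mean difference, but the fold-to-fold variation common to A and B cancels in the paired differences, so only the paired interval excludes 0 for this data.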

CS 543 Page 1 John E. Boon, Jr.

CS 543 Machine Learning Spring 2010 Lecture 05 Evaluating Hypotheses I. Overview A. Given observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over