Accuracy & confidence


1 Accuracy & confidence
Most of the course so far: estimating things from data. Today: how much do we trust our estimates?
Last week: one answer to this question was to prove ahead of time that the training-set estimate of prediction error will have accuracy ε with probability 1 - δ. We had to handle two issues:
limited data: we can't get the exact error of even a single model
selection bias: we pick a lucky model rather than the right one
Error is just one thing we could estimate from data, and tail bounds are just one way to get the accuracy of an estimate.

2 Selection bias
[Figure: CDF of the max of n samples of N(μ = 2, σ^2 = 1), representing error estimates for n models, plotted for n = 1, 4, 30, with a vertical line at the true value 2.]
Each sample represents an estimate of accuracy for a single model we're evaluating (true accuracy = 2).
With 4 samples, there is only a 2^-4 = .0625 chance that the max falls below μ. With 30 samples, typical values of the max are near the upper 2.5% quantile of a single sample (we need 28 samples to have a 50% chance that the max exceeds μ + 1.96σ).
So evaluating just 4 models gives almost a 95% chance of thinking we found a positive effect when there is none.
>> zs = -2:.05:4; ps = (1+erf(zs/sqrt(2)))/2;                    % standard normal CDF
   plot(zs+2, ps, zs+2, ps.^4, zs+2, ps.^30, 'linewidth', 2);    % CDF of max of n iid draws = CDF^n
   vertline(2);                                                  % course helper: vertical line at the true value
   legend({'n=1', 'n=4', 'n=30'}, 'location', 'nw'); set(gca, 'fontsize', 24)
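The selection-bias effect is easy to reproduce numerically. A minimal sketch (not from the slides; the n and repetition counts are arbitrary choices): draw n error estimates from N(2, 1) and look at their maximum.

% Sketch: how the max of n noisy error estimates overstates the best model's true accuracy
n = 4; reps = 100000;
ests = 2 + randn(reps, n);            % each row: n estimates, all with true value 2
best = max(ests, [], 2);              % the "winning" estimate in each repetition
fprintf('P(max > true value) = %.3f (analytic: %.4f)\n', mean(best > 2), 1 - 0.5^n);
fprintf('mean of max = %.2f, vs. true value 2\n', mean(best));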

3 Overfitting
Overfitting = selection bias when fitting complex models to little or noisy data. (Complex models correspond to large model classes.)
To limit overfitting: limit the noise in the data, get more data, or simplify the model class.
Today: not trying to limit overfitting. Instead, try to evaluate the accuracy of the selected model (and, recursively, the accuracy of our accuracy estimate). This can lead to detection of overfitting.

4 What is accuracy?
Simple problem: estimate μ and σ^2 for a Gaussian from samples x_1, x_2, ..., x_N ~ Normal(μ, σ^2).
Typical estimator: the sample mean xbar = sum_i x_i / N.
E(xbar) = E(sum_i x_i / N) = sum_i E(x_i) / N   [linearity of expectation]
        = N μ / N = μ
I.e., the sample mean is *unbiased*: bias = E(statistic) - parameter = 0.
V(xbar) = E((xbar - E(xbar))^2) = E((sum_i x_i / N - μ)^2)
Pretend μ = 0 for simplicity:
        = E((sum_i x_i)^2) / N^2 = E(sum_s sum_i x_s x_i) / N^2
        = E(sum_i x_i^2) / N^2   [independence kills the cross terms]
        = N σ^2 / N^2 = σ^2 / N
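A quick simulation check of these two facts (a sketch, not part of the slides; N, μ, σ are arbitrary choices):

% Sketch: verify E(xbar) = mu and V(xbar) = sigma^2/N
mu = 2; sigma = 1; N = 25; reps = 200000;
xbars = mean(mu + sigma*randn(N, reps));      % one sample mean per column
fprintf('mean of xbar = %.3f (mu = %g)\n', mean(xbars), mu);
fprintf('var of xbar  = %.4f (sigma^2/N = %.4f)\n', var(xbars), sigma^2/N);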

5 Bias vs. variance vs. residual
Mean squared prediction error: predict x_{N+1}.
E((xbar - x_{N+1})^2)   [= prediction error]
 = E(((xbar - μ) - (x_{N+1} - μ))^2)
 = E((xbar - μ)^2 - 2(xbar - μ)(x_{N+1} - μ) + (x_{N+1} - μ)^2)
 = E((xbar - μ)^2) + E((x_{N+1} - μ)^2)   [E(product of independent zero-mean variables) = 0]
 = E((xbar - μ)^2) + σ^2
 = E(((xbar - E(xbar)) - (μ - E(xbar)))^2) + σ^2
 = E((xbar - E(xbar))^2) - 2 E(xbar - E(xbar)) (μ - E(xbar)) + (μ - E(xbar))^2 + σ^2
   [E(xbar - E(xbar)) = 0 by linearity, so the cross term vanishes]
 = E((xbar - E(xbar))^2) + (μ - E(xbar))^2 + σ^2
 = V(xbar) + bias^2 + σ^2
 = bias^2 + variance + residual^2
 = (estimation error)^2 + residual^2
This decomposition holds for the squared error of any prediction.
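The decomposition can be checked numerically. Here is a sketch for the sample-mean predictor (bias = 0, so the prediction error should come out near σ^2/N + σ^2; all constants are arbitrary):

% Sketch: E((xbar - x_{N+1})^2) = bias^2 + V(xbar) + sigma^2 = sigma^2/N + sigma^2 for the sample mean
mu = 2; sigma = 1; N = 10; reps = 200000;
xbars = mean(mu + sigma*randn(N, reps));      % predictor computed from each training sample
xnew  = mu + sigma*randn(1, reps);            % fresh observation to predict
fprintf('simulated prediction error = %.3f, theory = %.3f\n', ...
        mean((xbars - xnew).^2), sigma^2/N + sigma^2);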

6 Bias-variance tradeoff
Can't do much about the residual, so we're mostly concerned with estimation error = bias^2 + variance.
Can trade bias vs. variance to some extent: e.g., always estimate 0; then variance = 0, but the bias is big.
Cramér-Rao bound on estimation error: if \hat\theta is an estimator of θ with bias E(\hat\theta - θ) = b(θ), then (under mild conditions)
E((\hat\theta - θ)^2) >= b(θ)^2 + (1 + b'(θ))^2 / I(θ)
where I(θ) is the Fisher information (positive; describes how hard the estimation problem is; high information = easy problem).
Note: b = 0 means the bound is 1/I(θ).
Note: if b'(θ) < 0, a biased estimator can beat an unbiased one.
The Wikipedia page gives useful proofs.
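As a concrete illustration that a biased estimator can win (a sketch, not from the slides): shrink the sample mean toward zero; when μ is small, the added squared bias is outweighed by the reduction in variance.

% Sketch: a biased (shrunken) estimator of mu beating the unbiased sample mean in MSE
mu = 0.3; sigma = 1; N = 10; reps = 200000; c = 0.8;   % c < 1 shrinks toward 0
xbars = mean(mu + sigma*randn(N, reps));
mse_unbiased = mean((xbars - mu).^2);          % approx sigma^2/N = 0.1
mse_shrunk   = mean((c*xbars - mu).^2);        % approx bias^2 + c^2*sigma^2/N = 0.0036 + 0.064
fprintf('MSE unbiased = %.4f, MSE shrunk = %.4f\n', mse_unbiased, mse_shrunk);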

7 Prediction error vs. estimation error
Several ways to get at accuracy:
prediction error (bias^2 + variance + residual^2): talks only about predictions; E[(x - \hat x)^2] for an observation x
estimation error (bias^2 + variance): same idea, but tries to concentrate on the error due to estimation, since we can't reduce the residual
parameter error E((μ - \hat μ)^2): talks about parameters rather than predictions; in the simple case it is numerically equal to the estimation error, but it only makes sense if our model class is right

8 Evaluating accuracy
In the N(μ, σ^2) example, we were able to derive bias, variance, and residual from first principles.
In general, we have to estimate prediction error, estimation error, or parameter error from data.
Tools: holdout data, tail bounds, normal theory (use the CLT and tables of the normal distribution), and today's topics: cross-validation and the bootstrap.
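For instance, normal theory gives the familiar CLT-based interval; a sketch (the data here are simulated stand-ins):

% Sketch: normal-theory 95% confidence interval for a mean, via the CLT
xs = 1.5 + 0.6*randn(100, 1);                  % stand-in sample
xbar = mean(xs); se = std(xs)/sqrt(length(xs));
fprintf('estimate %.3f, 95%% CI [%.3f, %.3f]\n', xbar, xbar - 1.96*se, xbar + 1.96*se);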

9 Goal: estimate sampling variability
We've computed something from our sample: a classification error rate, a parameter vector, a mean squared prediction error, ... For simplicity, assume it is a single number (e.g., the i-th component of a weight vector):
t = f(x_1, x_2, ..., x_N)
Here t is the number we estimated, f is the estimation procedure, and x_1, ... is the sample.
How much would t vary if we had taken a different sample?
For concreteness: f = sample mean (an estimate of the population mean). Sample mean: bias = 0, variance = σ^2/N.

10 Gold standard: new samples
Get M independent data sets: x^j_1 through x^j_N, for j = 1..M.
Run our computation M times: t_j = f(x^j_1, ..., x^j_N), giving t_1, t_2, ..., t_M.
Look at the distribution of the t_j: mean, variance, upper and lower 2.5% quantiles, ...
A tad wasteful of data.
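In code, the gold standard looks like the sketch below (assuming, unrealistically, that we can keep drawing fresh data sets from the population):

% Sketch: the gold standard -- M fresh data sets, recompute t on each
M = 1000; N = 50; mu = 1.5; sigma = 1;
ts = zeros(M, 1);
for j = 1:M
    xj = mu + sigma*randn(N, 1);               % a brand-new sample x^j_1..N
    ts(j) = mean(xj);                          % t_j = f(x^j_1, ..., x^j_N)
end
ts = sort(ts);
fprintf('std of t = %.3f; 2.5%%/97.5%% quantiles = %.3f / %.3f\n', ...
        std(ts), ts(round(0.025*M)), ts(round(0.975*M)));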

11 Cross-validation & bootstrap
CV and the bootstrap approximate the gold standard, but more cheaply: we spend computation instead of data.
They work for nearly arbitrarily complicated models, and are typically tighter than tail bounds, but involve difficult-to-verify approximations and assumptions.
Basic idea: surrogate samples. Rearrange/modify x_1, ..., x_N to build each new sample. We might repeat 10 times, 1000 times, or 10k times, depending on how much computation we can afford.
Getting something from nothing (hence the name "bootstrap")? No: we are taking advantage of difficult-to-verify assumptions.
CV and the bootstrap are appropriate for complicated learners where tight theory is hard to get; they also get tighter by taking averages in situations that are closer to reality (e.g., taking advantage of correlations among learners).

12 For example
[Figure: a two-component Gaussian mixture density with true mean μ = 1.5, with the sample estimate \hat μ marked.]
True variance of a single sample: E(x^2) - E(x)^2 = mu.^2*w' + sig^2 - 1.5^2 = 1.36
True stdev of \hat μ: sqrt(1.36/N)
>> zs = -2:.05:4; sig = .6; mu = [-.5 2]; w = [.2 .8];
   p1 = w(1) * exp(-0.5*(zs-mu(1)).^2/sig^2);   % unnormalized component densities (shape only)
   p2 = w(2) * exp(-0.5*(zs-mu(2)).^2/sig^2);
   plot(zs, p1+p2, 'linewidth', 2)

13 Basic bootstrap
Treat x_1, ..., x_N as our estimate of the true distribution.
To get a new sample, draw N times from this estimate (with replacement). Do this M times.
Each original x_i is part of many samples (on average a fraction 1 - 1/e of them, about 63%); each sample contains many repeated values (a single x_i selected multiple times).
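The 63% figure is easy to confirm by simulation (a sketch; N and the rep count are arbitrary):

% Sketch: fraction of distinct original points appearing in one bootstrap resample
N = 200; reps = 10000; frac = zeros(reps, 1);
for j = 1:reps
    idx = randi(N, N, 1);                      % N draws with replacement
    frac(j) = numel(unique(idx)) / N;          % fraction of originals that got used
end
fprintf('average fraction = %.3f (1 - 1/e = %.3f)\n', mean(frac), 1 - exp(-1));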

14 Basic bootstrap
[Figure: histogram of the original sample and of several bootstrap resamples, each with its mean \hat μ marked.]
Repeat 100k times and compare the estimated stdev of \hat μ to the true stdev. We get about 3 significant figures with 100k reps; to get 1 significant figure, we need only ~10 reps.
(True variance of a single sample: E(x^2) - E(x)^2 = mu.^2*w' + sig^2 - 1.5^2 = 1.36; true stdev of \hat μ = sqrt(1.36/n) = .0825.)
>> k = 100000; muhats = zeros(k, 1);
   for j = 1:k; idx = randi(n, n, 1); xx = xs(idx); muhats(j) = mean(xx); end   % resample the original xs with replacement, recompute the mean
>> sqrt(var(muhats))   % bootstrap estimate of the stdev of \hat μ

15 What can go wrong?
Convergence is only asymptotic (large original sample). Here: what if the original sample hits mostly the larger mode? Then the bootstrap will badly underestimate the variance (the original sample is more compact than the actual distribution). The chance of this gets higher as the original sample gets smaller (e.g., about a 10% chance that a sample of size 10 hits *only* the larger mode).
The original sample might not be i.i.d.: an unmeasured covariate. E.g., suppose we measure yields of 100 plots of a new feed-corn variety. Now suppose it was 10 plots on each of 10 farms, or 50 plots on each of 2 farms; the unmeasured covariate is which farm. In the 2-farm case, what if both happen to be farms with higher-than-average yield? The bootstrap will underestimate the variance again.
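The "hits only the larger mode" probability quoted in the notes can be checked directly (a sketch, using the .2/.8 mixture weights from the example):

% Sketch: chance that a sample of size 10 contains only points from the larger (weight-0.8) mode
n = 10; reps = 100000;
labels = rand(n, reps) < 0.8;                  % 1 = point came from the larger mode
fprintf('analytic 0.8^%d = %.3f, simulated = %.3f\n', n, 0.8^n, mean(all(labels, 1)));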

16 Types of errors
Conservative estimate of uncertainty: tends to be high (too uncertain). That is the good direction.
Optimistic estimate of uncertainty: tends to be low (too certain). That is the bad direction.
Both of the failure modes above can lead to optimism.

17 Should we worry?
New drug: mean outcome [higher is better]; old drug: a lower mean outcome.
The bootstrap underestimates the spread: estimated σ = .04, true σ = .08.
We tell investors the new drug is better than the old one, enter Phase III trials (cost: millions of dollars), and whoops, it isn't better after all.

18 Blocked resampling
A partial fix for one issue (the original sample not being i.i.d.): divide the sample into blocks that tend to share the unmeasured covariates, and resample blocks.
E.g., time series: break up into blocks of adjacent times; assumes the unmeasured covariates change slowly. (For time series one could also use a GP bootstrap or a parametric bootstrap.)
E.g., a matrix: break up by rows or columns; assumes the unmeasured covariates are associated with rows or columns (e.g., user preferences in Netflix).
Issue: we need enough blocks to resample from, or else the variance of the estimate is very high.
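A minimal sketch of non-overlapping block resampling for a time series (the series, block length B, and rep count are made-up stand-ins; a real analysis has to choose B):

% Sketch: non-overlapping block bootstrap for a correlated series x
x = 1.5 + 0.1*cumsum(randn(200, 1));           % toy correlated series (stand-in for real data)
B = 20; nblocks = numel(x)/B; reps = 5000;
muhats = zeros(reps, 1);
for j = 1:reps
    pick = randi(nblocks, nblocks, 1);         % choose blocks with replacement
    idx = zeros(nblocks*B, 1);
    for b = 1:nblocks
        idx((b-1)*B + (1:B)) = (pick(b)-1)*B + (1:B);   % indices inside the b-th chosen block
    end
    muhats(j) = mean(x(idx));
end
fprintf('block-bootstrap std of muhat = %.3f\n', std(muhats));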

19 Further reading
Hesterberg et al. (2005). Bootstrap methods and permutation tests. In Moore & McCabe, Introduction to the Practice of Statistics. (pdf/moore14.pdf)

20 Cross-validation
Used to estimate classification error, RMSE, or a similar error measure of an algorithm.
Surrogate sample: exactly the same as x_1, ..., x_N except for the train-test split.
k-fold CV: randomly permute x_1, ..., x_N; split into folds (first N/k samples, second N/k samples, ...); train on k - 1 folds and measure error on the remaining fold; repeat k times, with each fold being the holdout set once.
Here f = the function from the whole sample to a single number = train the model on k - 1 folds, then evaluate its error on the remaining one.
CV uses the sample-splitting idea twice: first, split into train and validation; second, repeat to estimate variability. Only the second is approximated.
k = N: leave-one-out CV (LOOCV).
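A minimal k-fold CV sketch (the data and the least-squares "model" are placeholders; in practice f would train whatever learner we are evaluating):

% Sketch: k-fold cross-validation estimate of squared prediction error
N = 100; k = 10;
x = randn(N, 1); y = 2*x + 0.5*randn(N, 1);    % toy regression data (stand-in for the real sample)
perm = randperm(N); fold = ceil((1:N)*k/N);    % random permutation, then k roughly equal folds
cverr = zeros(k, 1);
for fi = 1:k
    test  = perm(fold == fi);                  % held-out fold
    train = perm(fold ~= fi);                  % the other k-1 folds
    w = x(train(:)) \ y(train(:));             % "train": least-squares slope (placeholder model)
    cverr(fi) = mean((y(test(:)) - w*x(test(:))).^2);   % error on the held-out fold
end
fprintf('CV estimate of prediction error = %.3f\n', mean(cverr));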

21 Cross-validation: caveats
The original sample might not be i.i.d.
The size of the surrogate sample is wrong: we want to estimate the error we'd get on a sample of size N, but we actually use samples of size N(k - 1)/k.
Failure of i.i.d. among the surrogate samples, even if the original sample was i.i.d. (the k training sets overlap, so the fold errors are correlated).
Two of these are potentially optimistic; the middle one is conservative (but usually a pretty small effect).
