Chapter 5: Resampling Methods


1. 5.1 Validation set approach
2. 5.2 Cross-validation
3. 5.3 Bootstrap

About Resampling
- An important statistical tool
- Treat the data as if it were the population, and repeatedly draw samples from it
- Main task: assess the validity/accuracy of statistical methods and models
- Of particular importance for statistical prediction, where it is used to assess test error
- Cross-validation and the bootstrap are the two types addressed in this chapter

Validation and Cross-validation
- Validation set approach
- LOOCV (leave-one-out cross-validation)
- K-fold cross-validation

Bootstrap
- Sampling with replacement, typically n times, where n is the sample size of the data
- Especially useful in statistical inference (can be even more accurate than the normal approximation)
- An all-purpose resampling procedure
- Used in, for example, bagging and random forests

5.1 Validation set approach
- Training error is easily computable from the training data
- Because of the possibility of over-fitting, training error cannot be used to properly assess test error
- It is possible to estimate the test error by, for example, adjusting the training error
- Adjusted R-squared, AIC, BIC, etc. serve this purpose
- These methods rely on certain assumptions and are not general-purpose

5.1 Validation set approach
- Test error would also be easily computable if test data were clearly designated
- Normally we are just given data, so we have to create test data for the purpose of computing test error
- Artificially separating the data into training data and test data for validation purposes is called cross-validation
- The test data here are more accurately called validation data or hold-out data, meaning that they are not used in training models
- Model fitting uses only the training data

5.1 Validation set approach
- In a data-rich scenario, we can afford to separate the data into three parts:
  - training data: used to train various models
  - validation data: used to assess the models and identify the best one
  - test data: used to test the results of the best model
- In common usage, validation data or hold-out data are often also called test data

5.1 Validation set approach
Figure 5.1: A schematic display of the validation set approach. A set of n observations is randomly split into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a validation set (shown in beige, containing observation 91, among others). The statistical learning method is fit on the training set, and its performance is evaluated on the validation set.

5.1 Validation set approach
Example: Auto data
- There is a non-linear relationship between mpg and horsepower
- mpg ~ horsepower + horsepower^2 is better than mpg ~ horsepower
- Should we add higher-order terms to the model, such as cubic or even higher?
- One could check the p-values of the regression coefficients to answer this question
- It is in fact a model selection problem, and we can use the validation set approach

5.1 Validation set approach
Example: Auto data
- Randomly split the 392 observations into two sets: a training set containing 196 of the data points and a validation set containing the remaining 196 observations
- Fit various regression models on the training sample
- The validation set error rates result from evaluating their performance on the validation sample
- Here we use MSE as the measure of validation set error; the results are shown in the left-hand panel of Figure 5.2
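The following is a minimal sketch of this procedure in Python with scikit-learn; the file name Auto.csv and the column names mpg and horsepower are assumptions about how the data are stored, not part of the slides.

```python
# Validation set approach on the Auto data: one random 196/196 split,
# then validation MSE for polynomial fits of increasing degree.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

auto = pd.read_csv("Auto.csv")          # assumed file location
X = auto[["horsepower"]].to_numpy()
y = auto["mpg"].to_numpy()

# Randomly split the observations into equal-sized training and validation halves.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in range(1, 11):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    mse = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    print(f"degree {degree}: validation MSE = {mse:.2f}")
```

Re-running with a different random_state mimics the repeated splits shown in the right-hand panel of Figure 5.2.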

5.1 Validation set approach
[Figure: validation MSE versus degree of polynomial, two panels]
Figure 5.2: The validation set approach was used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower. Left: validation error estimates for a single split into training and validation data sets. Right: the validation method was repeated ten times, each time using a different random split of the observations into a training set and a validation set. This illustrates the variability in the estimated test MSE that results from this approach.

5.1 Validation set approach
Example: Auto data
- The validation set MSE for the quadratic fit is considerably smaller than for the linear fit
- The validation set MSE for the cubic fit is actually slightly larger than for the quadratic fit
- This implies that including a cubic term in the regression does NOT lead to better prediction than simply using a quadratic term
- If we repeat the process of randomly splitting the sample into two parts, we will get a somewhat different estimate of the test MSE

5.1 Validation set approach
Example: Auto data
- The model with a quadratic term has a dramatically smaller validation set MSE than the model with only a linear term
- There is not much benefit in including cubic or higher-order polynomial terms in the model
- Each of the ten curves gives a different test MSE estimate for each of the ten regression models considered
- There is no consensus among the curves as to which model results in the smallest validation set MSE
- Based on the variability among these curves, all that we can conclude with any confidence is that the linear fit is not adequate for these data
- The validation set approach is conceptually simple and easy to implement

5.1 Validation set approach
A summary
- The validation estimate of the test error rate can be highly variable, depending on the random split
- Only a subset of the observations (the training set) is used to fit the model
- Statistical methods tend to perform worse when trained on fewer observations
- The validation set error rate may therefore tend to overestimate the test error rate for the model fit on the entire data set

5.2 Cross-validation
- Cross-validation overcomes the drawbacks of the validation set approach
- Our ultimate goal is to produce the best model, with the best prediction accuracy
- The validation set approach has the drawback of using ONLY the training data to fit the model
- The validation data do not participate in model building, only in model assessment: a waste of data
- We want more of the data to participate in model building

5.2 Cross-validation
Another drawback of the validation set approach
- It may over-estimate the test error of the model fit with all the data
- Statistical methods tend to perform worse when trained on fewer observations
- The validation set error rate may therefore tend to overestimate the test error rate for the model fit on the entire data set
- Cross-validation overcomes these drawbacks by effectively using EVERY data point in model building!

5.2 Cross-validation
The leave-one-out cross-validation (LOOCV)
- Suppose the data contain n data points
- First, pick data point 1 as the validation set and the rest as the training set; fit the model on the training set and evaluate the test error on the validation set, denoted $\mathrm{MSE}_1$
- Second, pick data point 2 as the validation set and the rest as the training set; fit the model on the training set and evaluate the test error on the validation set, denoted $\mathrm{MSE}_2$
- Repeat the procedure for every data point
- Obtain an estimate of the test error by combining the $\mathrm{MSE}_i$, $i = 1, \ldots, n$:
$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i$$
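A minimal sketch of this LOOCV loop, reusing the X and y arrays assumed in the earlier validation-set snippet:

```python
# LOOCV for polynomial regression: fit the model n times, each time
# holding out one observation, then average the n squared errors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import PolynomialFeatures

def loocv_mse(X, y, degree):
    """CV_(n): the average of the n held-out squared errors."""
    Xp = PolynomialFeatures(degree=degree).fit_transform(X)
    sq_errors = []
    for train_idx, val_idx in LeaveOneOut().split(Xp):
        model = LinearRegression().fit(Xp[train_idx], y[train_idx])
        pred = model.predict(Xp[val_idx])[0]
        sq_errors.append((y[val_idx][0] - pred) ** 2)
    return np.mean(sq_errors)
```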

5.2 Cross-validation
LOOCV
Figure 5.3: A schematic display of LOOCV. A set of n data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in beige). The test error is then estimated by averaging the n resulting MSEs. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.

5.2 Cross-validation
Advantages of LOOCV
- Far less bias, since the training data size (n - 1) is close to the entire data size (n)
- A single test error estimate (thanks to the averaging), without the split-to-split variability of the validation set approach
Disadvantages
- Can be computationally expensive, since the model needs to be fit n times
- The $\mathrm{MSE}_i$ may be too highly correlated with one another

5.2 Cross-validation
LOOCV applied to the Auto data
[Figure: MSE versus degree of polynomial; left panel LOOCV, right panel 10-fold CV]
Figure 5.4: Cross-validation was used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower. Left: the LOOCV error curve. Right: 10-fold CV was run nine separate times, each with a different random split of the data into ten parts. The figure shows the nine slightly different CV error curves.

5.2 Cross-validation
Complexity of LOOCV in the linear model?
Consider the linear model $y_i = x_i^T \beta + \epsilon_i$, $i = 1, \ldots, n$, with fitted values $\hat{y}_i = x_i^T \hat{\beta}$, where $\hat{\beta}$ is the least squares estimate of $\beta$ based on all the data $(x_i, y_i)$, $i = 1, \ldots, n$. Using LOOCV,
$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i^{(i)}\bigr)^2$$
where $\hat{y}_i^{(i)} = x_i^T \hat{\beta}^{(i)}$ is the prediction of $y_i$ from the linear model fitted on all the data except $(x_i, y_i)$ (delete-one), i.e., $\hat{\beta}^{(i)}$ is the least squares estimate of $\beta$ based on all the data but $(x_i, y_i)$.

5.2 Cross-validation
Complexity of LOOCV in the linear model?
- It looks complicated: the least squares estimate must apparently be computed n times
- Easy formula:
$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2$$
where $\hat{y}_i$ is the fitted value from the least squares fit based on all the data, and $h_i$ is the leverage.

5.2 Cross-validation
Simplicity of LOOCV in the linear model
- One fit (with all the data) does it all!
- The prediction error (in terms of MSE) is just a weighted average of the squared least squares residuals
- High-leverage points get more weight in the prediction error estimate
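A minimal sketch of this one-fit shortcut; the design matrix Xp (including an intercept column) is an assumed input. For least squares it agrees with the explicit n-fit loop above, up to floating-point error.

```python
# LOOCV for least squares from a single fit, via the leverage formula.
import numpy as np

def loocv_mse_linear(Xp, y):
    """CV_(n) = mean of ((y_i - yhat_i) / (1 - h_i))^2 from one fit."""
    beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)   # one fit on all data
    y_hat = Xp @ beta
    # Leverages h_i: diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
    h = np.diag(Xp @ np.linalg.solve(Xp.T @ Xp, Xp.T))
    return np.mean(((y - y_hat) / (1.0 - h)) ** 2)
```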

5.2 Cross-validation
K-fold cross-validation
- Divide the data into K subsets, usually of equal or similar sizes (about n/K each)
- Treat one subset as the validation set and the rest together as the training set
- Run the model fitting on the training set
- Compute the test error estimate on the validation set, denoted $\mathrm{MSE}_i$
- Repeat the procedure over every subset
- Average the K test error estimates to obtain
$$\mathrm{CV}_{(K)} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{MSE}_i$$
- LOOCV is a special case of K-fold cross-validation: namely, n-fold cross-validation
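A minimal sketch of K-fold CV, again reusing the assumed X and y arrays:

```python
# K-fold CV: average the K validation MSEs over a random partition.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures

def kfold_cv_mse(X, y, degree, K=10, seed=0):
    """CV_(K): average validation MSE over the K folds."""
    Xp = PolynomialFeatures(degree=degree).fit_transform(X)
    fold_mses = []
    for train_idx, val_idx in KFold(n_splits=K, shuffle=True,
                                    random_state=seed).split(Xp):
        model = LinearRegression().fit(Xp[train_idx], y[train_idx])
        resid = y[val_idx] - model.predict(Xp[val_idx])
        fold_mses.append(np.mean(resid ** 2))
    return np.mean(fold_mses)
```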

5.2 Cross-validation
K-fold cross-validation
Figure 5.5: A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

5.2 Cross-validation
K-fold cross-validation
- Common choices of K: K = 5 or K = 10
- Advantages over LOOCV: (1) computationally lighter, especially for complex models with large data; (2) likely less variance (to be addressed later)
- Advantage over the validation set approach: less variability resulting from the data split, thanks to the averaging

5.2 Cross-validation
[Figure: three panels of MSE versus flexibility]
Figure 5.6: True and estimated test MSE for the simulated data sets in Figures 2.9 (left), 2.10 (center), and 2.11 (right). The true test MSE is shown in blue, the LOOCV estimate is shown as a black dashed line, and the 10-fold CV estimate is shown in orange. The crosses indicate the minimum of each of the MSE curves.

5.2 Cross-validation
Figure 2.9
[Figure: left panel, simulated data with three fits; right panel, MSE versus flexibility]
Figure 2.9: Left: data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve) and two smoothing spline fits (blue and green curves). Right: training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.

5.2 Cross-validation
Figure 2.10
Figure 2.10: Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.

5.2 Cross-validation
Figure 2.11
Figure 2.11: Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.

5.2 Cross-validation
Special interest in the complexity parameter at minimum test error
- Consider a family of models indexed by a parameter, usually representing the flexibility or complexity of the models
- Such a parameter is often called a tuning parameter; it could even be the number of variables
- Example: the order of the polynomial in horsepower in the Auto data example
- Example: the penalization parameters in ridge, lasso, etc. (to be addressed in the next chapter)
- We intend to find the best model within this family, i.e., to find the best value of this tuning parameter
- We care less about the actual value of the test error
- In the simulated data above, all of the CV curves come close to identifying the correct level of flexibility

5.2 Cross-validation
Bias-variance trade-off
- In terms of bias in estimating the test error: the validation set approach has more bias, due to the smaller size of the training data; LOOCV is nearly unbiased; K-fold CV (e.g., K = 5 or 10) has intermediate bias
- From the viewpoint of bias, LOOCV is most preferred, with K-fold cross-validation next
- But K-fold cross-validation has smaller variance than LOOCV
- The n training sets in LOOCV are very similar to each other; as a result, the trained models are highly positively correlated
- The K training sets in K-fold cross-validation are much less similar to each other; as a result, K-fold cross-validation generally has less variance than LOOCV

5.2 Cross-validation
Cross-validation for classification
- MSE is a popular criterion for measuring prediction/estimation accuracy in regression, but there are other criteria
- For classification with a qualitative response, a natural choice is: 1 for an incorrect classification and 0 for a correct one
- For LOOCV, this leads to $\mathrm{Err}_i = I(y_i \neq \hat{y}_i^{(i)})$, where $\hat{y}_i^{(i)}$ is the classification of the i-th observation based on the model fitted without the i-th observation
- Then
$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Err}_i$$
which is just the proportion of incorrect classifications
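A minimal sketch of LOOCV with the 0-1 loss; the feature matrix X2, the label vector y2, and the choice of logistic regression as the classifier are all assumptions for illustration.

```python
# LOOCV misclassification rate: 0-1 loss averaged over held-out points.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

def loocv_error_rate(X2, y2):
    """CV_(n) with 0-1 loss: the proportion of held-out misclassifications."""
    errs = []
    for train_idx, val_idx in LeaveOneOut().split(X2):
        clf = LogisticRegression().fit(X2[train_idx], y2[train_idx])
        errs.append(int(clf.predict(X2[val_idx])[0] != y2[val_idx][0]))
    return float(np.mean(errs))
```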

5.2 Cross-validation
Example
[Figure: scatter plot of the simulated two-class data in predictors X1 and X2; the caption appears as Figure 2.13 on the next slide]

5.2 Cross-validation
Figure 2.13: A simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid indicates the region in which a test observation will be assigned to the blue class.

5.2 Cross-validation
[Figure: four panels titled Degree=1, Degree=2, Degree=3, and Degree=4; the caption appears as Figure 5.7 on the next slide]

5.2 Cross-validation
Figure 5.7: Logistic regression fits on the two-dimensional classification data displayed in Figure 2.13. The Bayes decision boundary is represented by a purple dashed line. Estimated decision boundaries from linear, quadratic, cubic, and quartic (degrees 1 to 4) logistic regressions are displayed in black. The (TRUE) test error rates for the four logistic regression fits are respectively 0.201, 0.197, 0.160, and 0.162, while the Bayes error rate is 0.133.

5.2 Cross-validation
Remark on the simulated example
- The previous example is simulated, so the true population distribution is known
- The figures 0.201, 0.197, 0.160, 0.162, and 0.133 (the Bayes error rate) are true test errors, computed from the true population distribution
- In practice the true population distribution is unknown, so the true test error cannot be computed
- We use cross-validation to solve this problem

5.2 Cross-validation
[Figure: error rate versus order of polynomial (left) and versus 1/K for KNN (right)]
Figure 5.8: Test error (brown), training error (blue), and 10-fold CV error (black) on the two-dimensional classification data displayed in Figure 5.7. Left: logistic regression using polynomial functions of the predictors; the order of the polynomials used is displayed on the x-axis. Right: the KNN classifier with different values of K, the number of neighbors used in the KNN classifier.

5.2 Cross-validation
- Training error generally declines as model complexity increases, sometimes even reaching 0
- Test error generally declines first and then increases
- 10-fold cross-validation provides a reasonable estimate of the test error, with slight under-estimation

5.3 Bootstrap
Bootstrap as a resampling procedure
- Suppose we have data $x_1, \ldots, x_n$ representing the ages of n randomly selected people in HK
- Use the sample mean $\bar{x}$ to estimate the population mean $\mu$, the average age of all residents of HK
- How do we assess the estimation error $\bar{x} - \mu$?
- Usually via a t-confidence interval or a test of hypothesis; these rely on a normality assumption or the central limit theorem
- Is there another reliable way? Just bootstrap:

5.3 Bootstrap
Bootstrap as a resampling procedure
- Take a random sample of size n (with replacement) from $x_1, \ldots, x_n$ and calculate the sample mean of the re-sample, denoted $\bar{x}^*_1$
- Repeat the above a large number M of times, obtaining $\bar{x}^*_1, \bar{x}^*_2, \ldots, \bar{x}^*_M$
- Use the distribution of $\bar{x}^*_1 - \bar{x}, \ldots, \bar{x}^*_M - \bar{x}$ to approximate that of $\bar{x} - \mu$
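A minimal sketch of this procedure, assuming ages is a 1-D NumPy array holding the n observed ages:

```python
# Bootstrap the sample mean: M resamples of size n, with replacement.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mean_deviations(ages, M=10_000):
    """Return the M deviations xbar*_m - xbar; their distribution
    approximates that of xbar - mu."""
    n, xbar = len(ages), ages.mean()
    boot_means = np.array([rng.choice(ages, size=n, replace=True).mean()
                           for _ in range(M)])
    return boot_means - xbar

# For example, the spread of these deviations gives a standard error for xbar:
# se = bootstrap_mean_deviations(ages).std()
```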

5.3 Bootstrap
Essential idea:
- Treat the data distribution (more formally, the empirical distribution) as a proxy for the population distribution
- Mimic data generation from the true population by resampling from the empirical distribution
- Mimic your statistical procedure (such as computing an estimate $\bar{x}$) on the data by doing the same on the resampled data
- Evaluate your statistical procedure (which may be difficult directly, because it involves randomness and the unknown population distribution) by evaluating the analogous procedure on the re-samples

5.3 Bootstrap
Example
- X and Y are two random variables. The minimizer of $\mathrm{var}(\alpha X + (1 - \alpha) Y)$ is
$$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$$
- Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$. We can compute sample variances and covariances, and estimate $\alpha$ by
$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}$$
- How do we evaluate $\hat{\alpha} - \alpha$ (remembering that $\hat{\alpha}$ is random and $\alpha$ is unknown)? Use the bootstrap.

5.3 Bootstrap
Example
- Draw a resample of size n (with replacement) from $(X_1, Y_1), \ldots, (X_n, Y_n)$, compute the sample variances and covariance for this resample, and then compute
$$\hat{\alpha}^* = \frac{(\hat{\sigma}_Y^*)^2 - \hat{\sigma}_{XY}^*}{(\hat{\sigma}_X^*)^2 + (\hat{\sigma}_Y^*)^2 - 2\hat{\sigma}_{XY}^*}$$
- Repeat this procedure, obtaining $\hat{\alpha}^*_1, \ldots, \hat{\alpha}^*_M$ for a large M
- Use the distribution of $\hat{\alpha}^*_1 - \hat{\alpha}, \ldots, \hat{\alpha}^*_M - \hat{\alpha}$ to approximate the distribution of $\hat{\alpha} - \alpha$
- For example, we can use
$$\frac{1}{M} \sum_{j=1}^{M} (\hat{\alpha}^*_j - \hat{\alpha})^2$$
to estimate $E(\hat{\alpha} - \alpha)^2$
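A minimal sketch of this bootstrap for $\hat{\alpha}$, assuming paired 1-D arrays X and Y of the n observed returns:

```python
# Bootstrap alpha-hat: resample (X_i, Y_i) pairs, recompute alpha-hat,
# and use the spread of the bootstrap estimates to gauge its error.
import numpy as np

rng = np.random.default_rng(0)

def alpha_hat(x, y):
    """Plug-in estimate of alpha from sample variances and covariance."""
    cov = np.cov(x, y)                       # 2x2 sample covariance matrix
    sx2, sy2, sxy = cov[0, 0], cov[1, 1], cov[0, 1]
    return (sy2 - sxy) / (sx2 + sy2 - 2.0 * sxy)

def bootstrap_alpha_mse(X, Y, M=1000):
    """Estimate E(alpha_hat - alpha)^2 by (1/M) * sum (alpha*_j - alpha_hat)^2."""
    n, a_hat = len(X), alpha_hat(X, Y)
    boot = np.empty(M)
    for m in range(M):
        idx = rng.integers(0, n, size=n)     # resample pairs with replacement
        boot[m] = alpha_hat(X[idx], Y[idx])
    return np.mean((boot - a_hat) ** 2)
```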

5.3 Bootstrap
[Figure: four scatter panels of returns Y versus X]
Figure 5.9: Each panel displays 100 simulated returns for investments X and Y. From left to right and top to bottom, the resulting estimates for α are 0.576, 0.532, 0.657, and 0.651.

5.3 Bootstrap
[Figure: two histograms and a boxplot of the estimates of α]
Figure 5.10: Left: a histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. Center: a histogram of the estimates of α obtained from 1,000 bootstrap samples from a single data set. Right: the estimates of α displayed in the left and center panels, shown as boxplots. In each panel, the pink line indicates the true value of α.

5.3 Bootstrap
Original data Z:
Obs  X    Y
1    4.3  2.4
2    2.1  1.1
3    5.3  2.8

Bootstrap data set Z*1 (gives α̂*1):
Obs  X    Y
3    5.3  2.8
1    4.3  2.4
3    5.3  2.8

Bootstrap data set Z*2 (gives α̂*2):
Obs  X    Y
2    2.1  1.1
3    5.3  2.8
1    4.3  2.4

...

Bootstrap data set Z*B (gives α̂*B):
Obs  X    Y
2    2.1  1.1
2    2.1  1.1
1    4.3  2.4

5.3 Bootstrap
Figure 5.11: A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set contains n observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of α.

5.3 Bootstrap
Exercises
Section 5.4 of ISLR: Exercises 1-3 and 8

End of Chapter 5