Chapter 3: Other Issues in Multiple Regression (Part 1)

1 Model (variable) selection

The difficulty with model selection: for p predictors, there are 2^p different candidate models. When we have many predictors (with many possible interactions), it can be difficult to find a good model. Model selection tries to simplify this task.

Suppose we have P predictors X_1, ..., X_P, but the true model only depends on a subset of X_1, ..., X_P. In other words, in the model

    Y = β_0 + β_1 X_1 + ... + β_P X_P + ε

some of the coefficients are zero. We need to find those predictors with nonzero coefficients. We call the set of predictors with nonzero coefficients the best subset, and the predictors in the best subset important variables.

Criteria: statistical tests; indices of the model; predictability. (Distinction between predictive and explanatory research.)

Example 1.1 (Surgical Unit example) X_1: blood clotting score; X_2: prognostic index; X_3: enzyme function test score; X_4: liver function test score; X_5: age in years; X_6: indicator of gender (0 = male, 1 = female); X_7, X_8: indicators for alcohol use; Y: survival time. If we only consider the first 4 predictors, we have the following calculation for the 2^4 = 16
possible models:

    Variables selected   p   SSE      R^2     R^2_a   C_p       AIC        SBC (BIC)   PRESS (CV)
    None                 1   12.808   0       0       151.4     -75.7      -73.7       13.3
    X1                   2   12.0     0.06    0.043   141       -77        -73         13.5
    X2                   2   9.98     0.21    0.21    108.5     -87.17     -83.2       10.74
    X3                   2   7.3      0.428   0.417   66.49     -103.8     -99.84      8.32
    X4                   2   7.4      0.422   0.410   67.715    -103.26    -99.28      8.025
    X1, X2               3   9.44     0.26    0.237   102.037   -88.16     -82.19      11.06
    X1, X3               3   5.71     0.549   0.531   43.85     -114.65    -108.69     6.98
    X1, X4               3   7.29     0.43    0.408   67.97     -102.067   -96.1       8.472
    X2, X3               3   4.312    0.663   0.65    20.52     -130.48    -124.5      5.065
    X2, X4               3   6.62     0.483   0.463   57.21     -107.32    -101.357    7.476
    X3, X4               3   5.13     0.6     0.58    33.5      -121.1     -115.146    6.12
    X1, X2, X3           4   3.109    0.757   0.743   3.391     -146.161   -138.2      3.91
    X1, X2, X4           4   6.57     0.487   0.456   58.39     -105.74    -97.79      7.9
    X1, X3, X4           4   4.9      0.61    0.589   32.93     -120.8     -112.88     6.2
    X2, X3, X4           4   3.6      0.718   0.7     11.42     -138.023   -130.067    4.597
    X1, X2, X3, X4       5   3.08     0.759   0.74    5.00      -144.59    -134.65     4.07

where p is the number of coefficients included in the model.

2 R^2 and R^2_a Criterion

1. R^2: can be used to compare models with the same number of parameters/coefficients.
2. R^2_a: can be used to compare models with different numbers of parameters/coefficients. We choose the model with the largest R^2_a.

3 Mallows C_p Criterion

Suppose we select p predictors, p <= P, and fit a model with the selected predictors. Denote its SSE by SSE_p. The criterion is

    C_p = SSE_p / MSE(X_1, ..., X_P) - (n - 2p),

where p is the number of coefficients, including the intercept (if there is one).

Criterion: we seek to identify subsets of X for which (1) the C_p value is small and (2) the C_p value is near p.
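As a numerical check on the R^2_a and C_p formulas above, here is a small sketch (in Python for illustration; the course itself uses R) that recomputes rows of the table from the SSE column alone. The sample size n = 54 is an assumption not stated on this page, but it reproduces the tabulated values.

```python
# Recompute R^2_a and Mallows' C_p from SSE values in the table above.
# Assumption: n = 54 (Surgical Unit data); SST is the SSE of the model
# with no predictors, and MSE(full) comes from the model with X1,...,X4.

n = 54
sst = 12.808                      # SSE with no predictors (row "None")
mse_full = 3.08 / (n - 5)         # MSE(X1,...,X4), p = 5 coefficients

def r2_adj(sse, p):
    """Adjusted R^2 for a model with p coefficients (incl. intercept)."""
    return 1 - (sse / (n - p)) / (sst / (n - 1))

def mallows_cp(sse, p):
    """C_p = SSE_p / MSE(X_1,...,X_P) - (n - 2p)."""
    return sse / mse_full - (n - 2 * p)

# Row (X2, X3), p = 3: the table lists R^2_a ~ 0.65 and C_p ~ 20.5
print(round(r2_adj(4.312, 3), 3), round(mallows_cp(4.312, 3), 1))
```

Note that C_p of the full model equals p by construction (here 5.00, matching the last row of the table).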
If a selected model includes all the important variables (but possibly some other unimportant variables), the model is still correct. Then we have

    E{SSE_p} = (n - p) σ^2.

On the other hand,

    E{MSE(X_1, ..., X_P)} = σ^2.

Roughly speaking, we then have

    C_p ≈ (n - p) - (n - 2p) = p.

Question: are the estimators still unbiased?

If a selected model does not include all the important variables, the model is wrong. Then

    SSE_p >> SSE_P, so C_p >> p.

Question: are the estimators still unbiased?

4 Akaike's information criterion (AIC)

We cannot use SSE alone for the selection: as p increases, SSE_p decreases. AIC tries to balance the number of parameters and SSE_p:

    AIC_p = n log(SSE_p) - n log(n) + 2p,

or equivalently

    AIC_p = n log(SSE_p / n) + 2p.
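The two AIC forms are the same function up to rearrangement, and either can be checked against the AIC column of the table in Section 1. A Python sketch (n = 54 assumed, as before; the course itself uses R):

```python
import math

# AIC_p = n*log(SSE_p/n) + 2p, checked against the table in Section 1.
# Assumption: n = 54 for the Surgical Unit data.

n = 54

def aic(sse, p):
    return n * math.log(sse / n) + 2 * p

# Model with no predictors (SSE = 12.808, p = 1): table lists -75.7
# Model (X1, X2, X3)      (SSE = 3.109,  p = 4): table lists -146.161
print(round(aic(12.808, 1), 2), round(aic(3.109, 4), 2))
```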
5 Schwarz's Bayesian criterion (BIC or SBC)

Theoretically, it has been found that AIC does not always give the right number of variables. Schwarz proposed the BIC:

    BIC_p = n log(SSE_p) - n log(n) + log(n) p,

or equivalently

    BIC_p = n log(SSE_p / n) + log(n) p.

BIC gives a bigger penalty to the number of parameters (log(n) > 2 whenever n ≥ 8).

6 Prediction sum of squares (PRESS) or cross-validation criterion (CV)

A better model should have better prediction. Most of the time, we do not have new data on which to predict. A simple way is to partition the data into two parts: a training set and a prediction set (or validation set). Use the training set to estimate the model and the prediction set to check the predictability. A simple case is that each time, the prediction set contains one sample in turn. There are many such partitions; using all the partitions is the idea of cross-validation (CV). The idea was proposed by M. Stone (1974).

If we use 1 observation for validation and the other n - 1 for model estimation, it is leave-one-observation-out cross-validation. If we use m observations for validation and the other n - m for model estimation, it is leave-m-observations-out cross-validation.

We need to select variables from X_1, ..., X_p to be included in the model. There are many candidate models. For example,

    model 1: Y = a_0 + a_1 X_1 + ε
    model 2: Y = b_0 + b_1 X_1 + b_2 X_4 + ε
    model 3: Y = c_0 + c_1 X_2 + ε
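The same check works for SBC/BIC: the only change from AIC is the penalty log(n) per coefficient instead of 2. (Python sketch for illustration; n = 54 assumed as before.)

```python
import math

# SBC/BIC_p = n*log(SSE_p/n) + log(n)*p, checked against the SBC column
# of the table in Section 1.  Assumption: n = 54 (Surgical Unit data).

n = 54

def sbc(sse, p):
    return n * math.log(sse / n) + math.log(n) * p

# Model (X1, X2, X3) (SSE = 3.109, p = 4): table lists SBC ~ -138.2
print(round(sbc(3.109, 4), 2))
```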
Suppose we have n samples. For each i = 1, ..., n, we use the data (Y_1, X_1), ..., (Y_{i-1}, X_{i-1}), (Y_{i+1}, X_{i+1}), ..., (Y_n, X_n), where X_i = (X_{i1}, ..., X_{ip}), to estimate the models. The estimated models are, say,

    model 1: Y = â_0^(i) + â_1^(i) X_{i1}
    model 2: Y = b̂_0^(i) + b̂_1^(i) X_{i1} + b̂_2^(i) X_{i4}
    model 3: Y = ĉ_0^(i) + ĉ_1^(i) X_{i2}

where the superscript (i) indicates that observation i was left out of the estimation. The prediction errors for (Y_i, X_i) are respectively

    model 1: err_1(i) = {Y_i - â_0^(i) - â_1^(i) X_{i,1}}^2
    model 2: err_2(i) = {Y_i - b̂_0^(i) - b̂_1^(i) X_{i,1} - b̂_2^(i) X_{i,4}}^2
    model 3: err_3(i) = {Y_i - ĉ_0^(i) - ĉ_1^(i) X_{i,2}}^2

The overall prediction errors (also called cross-validation values) are respectively

    model 1: CV_1 = (1/n) Σ_{i=1}^n err_1(i)
    model 2: CV_2 = (1/n) Σ_{i=1}^n err_2(i)
    model 3: CV_3 = (1/n) Σ_{i=1}^n err_3(i)

The model with the smallest CV value is the model we prefer.
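The leave-one-out procedure above can be sketched directly. The code below is an illustration in Python (the course materials use R); the data are simulated and the variable names are hypothetical.

```python
import numpy as np

def loocv(X, y):
    """Leave-one-out CV: for each i, fit least squares without row i,
    predict Y_i, and average the squared prediction errors."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errs))

# Toy comparison: y truly depends on x1 only, so the model using x1
# should have the smaller CV value.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=30), rng.normal(size=30)
y = 1 + 2 * x1 + rng.normal(scale=0.3, size=30)
ones = np.ones(30)
cv_x1 = loocv(np.column_stack([ones, x1]), y)
cv_x2 = loocv(np.column_stack([ones, x2]), y)
print(cv_x1 < cv_x2)
```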
Example 6.1 For the same data above (data), our candidate models are

    model 0: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + β_5 X_5 + ε
    model 1: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + ε
    model 2: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_5 X_5 + ε
    model 3: Y = β_0 + β_1 X_1 + β_2 X_2 + β_4 X_4 + β_5 X_5 + ε
    model 4: Y = β_0 + β_1 X_1 + β_3 X_3 + β_4 X_4 + β_5 X_5 + ε
    model 5: Y = β_0 + β_2 X_2 + β_3 X_3 + β_4 X_4 + β_5 X_5 + ε

The CV values for the above models are respectively

    CV(model 0) = 0.3633548, CV(model 1) = 0.333161, CV(model 2) = 1.216745,
    CV(model 3) = 0.3922781, CV(model 4) = 1.400237, CV(model 5) = 0.4589498.

Thus model 1 is selected (and variable X_5 is deleted). (R code for the calculation.)

K-fold cross-validation

In K-fold cross-validation, the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K - 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.

7 Searching for the best subset

Forward selection: starting with no variables in the model, trying out the variables one by one and including each if it is statistically significant or increases the predictability.
Backward elimination: starting with all candidate variables and testing them one by one for statistical significance, deleting any that are not significant or whose deletion increases the predictability.

Stepwise: a combination of the above, testing at each stage for variables to be included or excluded.

8 R code

    step(object, direction = c("both", "backward", "forward"), steps = 1000, k = ??)

where k can be any positive value; k = 2 gives AIC, and k = log(n) gives BIC (SBC).

Example 8.1 For the first example above with the data, the selected model variables are

    based on AIC: X1 + X2 + X3 + X5 + X6 + X8

or

    based on BIC: X1 + X2 + X3 + X8

(code)
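For reference, here is a minimal forward-selection loop driven by AIC, sketching what step(..., direction = "forward", k = 2) does. It is written in Python for illustration (the notes use R), the data are simulated, and all names are hypothetical; note that with k = 2 the procedure can occasionally keep a spurious variable.

```python
import math
import numpy as np

def sse_of(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def aic(sse, p, n):
    return n * math.log(sse / n) + 2 * p

def forward_select(cols, y):
    """Greedily add whichever column lowers AIC most; stop when none helps."""
    n = len(y)
    chosen, X = [], np.ones((n, 1))          # start from intercept only
    best = aic(sse_of(X, y), 1, n)
    improved = True
    while improved:
        improved = False
        for name in (c for c in cols if c not in chosen):
            Xc = np.column_stack([X, cols[name]])
            a = aic(sse_of(Xc, y), Xc.shape[1], n)
            if a < best:
                best, pick, Xpick, improved = a, name, Xc, True
        if improved:
            chosen.append(pick)
            X = Xpick
    return chosen

# Simulated data in which only X1 and X3 truly matter.
rng = np.random.default_rng(1)
n = 60
cols = {f"X{j}": rng.normal(size=n) for j in range(1, 5)}
y = 0.5 + 2 * cols["X1"] + 1.5 * cols["X3"] + rng.normal(scale=0.5, size=n)
selected = forward_select(cols, y)
print(sorted(selected))
```

Backward elimination works the same way in reverse: start from the full design matrix and drop the column whose removal lowers AIC most.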