Lecture 22: Review for Exam 2

1 Basic Model Assumptions (without Gaussian Noise)

We model one continuous response variable Y as a linear function of p numerical predictors, plus noise:

    Y = β_0 + β_1 X_1 + ... + β_p X_p + ε.    (1)

Linearity is an assumption, which can be wrong. Further assumptions take the form of restrictions on the noise: E[ε|X] = 0, Var[ε|X] = σ^2. Moreover, we assume ε is uncorrelated across observations.

We convert this to matrix form:

    Y = Xβ + ε    (2)

Y is an n×1 matrix of random variables; X is an n×(p+1) matrix, with an extra column of all 1s; ε is an n×1 matrix. Beyond linearity, the assumptions translate to

    E[ε|X] = 0,  Var[ε|X] = σ^2 I.    (3)

We don't know β. If we guess it is b, we will make a vector of predictions Xb and have a vector of errors Y − Xb. The mean squared error, as a function of b, is then

    MSE(b) = (1/n)(Y − Xb)^T (Y − Xb).    (4)

2 Least Squares Estimation and Its Properties

The least squares estimate of the coefficients is the one which minimizes the MSE:

    β̂ ≡ argmin_b MSE(b).    (5)

To find this, we need the derivatives:

    ∇_b MSE = −(2/n)(X^T Y − X^T X b).    (6)

We set the derivative to zero at the optimum:

    (1/n) X^T (Y − X β̂) = 0.    (7)

The term in parentheses is the vector of errors when we use the least-squares estimate. This is the vector of residuals,

    e ≡ Y − X β̂    (8)

so then we have the normal, estimating or score equations,

    (1/n) X^T e = 0.    (9)
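The normal equations (9) are easy to check numerically. A minimal sketch in Python with numpy (synthetic data; all variable names are my own, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
# Design matrix with a leading column of 1s for the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least-squares estimate: solve the normal equations X^T X b = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat                      # residuals

# The score equations say (1/n) X^T e = 0: the residuals are orthogonal
# to every column of X, including the column of 1s.
print(np.max(np.abs(X.T @ e / n)))        # numerically ~ 0
```

Because the first column of X is all 1s, the first of the p + 1 equations says the residuals sum to zero.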
We say "equations", plural, because this is equivalent to the set of p + 1 equations

    (1/n) Σ_{i=1}^n e_i = 0    (10)

    (1/n) Σ_{i=1}^n e_i X_ij = 0    (11)

(Many people omit the factor of 1/n.) This tells us that while e is an n-dimensional vector, it is subject to p + 1 linear constraints, so it is confined to a linear subspace of dimension n − p − 1. Thus n − p − 1 is the number of residual degrees of freedom.

The solution to the estimating equations is

    β̂ = (X^T X)^{−1} X^T Y.    (12)

This is one of the two most important equations in the whole subject. It says that the coefficients are a linear function of the response vector Y.

The least squares estimate is a constant plus noise:

    β̂ = (X^T X)^{−1} X^T Y    (13)
       = (X^T X)^{−1} X^T (Xβ + ε)    (14)
       = (X^T X)^{−1} X^T X β + (X^T X)^{−1} X^T ε    (15)
       = β + (X^T X)^{−1} X^T ε.    (16)

The least squares estimate is unbiased:

    E[β̂] = β + (X^T X)^{−1} X^T E[ε] = β.    (17)

Its variance is

    Var[β̂] = σ^2 (X^T X)^{−1}.    (18)

Since the entries in X^T X are usually proportional to n, it can be helpful to write this as

    Var[β̂] = (σ^2/n) ((1/n) X^T X)^{−1}.    (19)

The variance of any one coefficient estimator is

    Var[β̂_i] = (σ^2/n) ((1/n) X^T X)^{−1}_{i+1,i+1}.    (20)

The vector of fitted means or conditional values is

    Ŷ ≡ X β̂.    (21)

This is more conveniently expressed in terms of the original matrices:

    Ŷ = X (X^T X)^{−1} X^T Y = HY.    (22)
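Since Ŷ = HY, the matrix H can be formed explicitly and examined. A quick numerical sketch (my own synthetic design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix: a function of X alone

print(np.allclose(H, H.T))               # symmetric: True
print(np.allclose(H @ H, H))             # idempotent: True
print(np.trace(H))                       # ~ p + 1 = 4
print(np.trace(np.eye(n) - H))           # ~ n - p - 1 = 46
```

Forming H explicitly is wasteful for large n, but it is fine for illustration.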
The fitted values are thus linear in Y: set the responses all to zero and all the fitted values will be zero; double all the responses and all the fitted values will double. The hat matrix H ≡ X (X^T X)^{−1} X^T, also called the influence, projection or prediction matrix, controls the fitted values. It is a function of X alone, ignoring the response variable totally. It is an n × n matrix with several important properties:

- It is symmetric, H^T = H.
- It is idempotent, H^2 = H.
- Its trace tr H = Σ_i H_ii = p + 1, the number of degrees of freedom for the fitted values.

The variance-covariance matrix of the fitted values is

    Var[Ŷ] = H σ^2 I H^T = σ^2 H.    (23)

To make a prediction at a new point, not in the data used for estimation, we take its predictor coordinates and group them into a 1 × (p + 1) matrix X_new (including the 1 for the intercept). The point prediction for Y is then X_new β̂. The expected value is X_new β, and the variance is

    Var[X_new β̂] = X_new Var[β̂] X_new^T = σ^2 X_new (X^T X)^{−1} X_new^T.

The residuals are also linear in the response:

    e ≡ Y − Ŷ = (I − H) Y.    (24)

The trace of I − H is n − p − 1. The variance-covariance matrix of the residuals is

    Var[e] = σ^2 (I − H).    (25)

The mean squared error (training error) is

    MSE = (1/n) Σ_{i=1}^n e_i^2 = (1/n) e^T e.    (26)

Its expectation value is slightly below σ^2:

    E[MSE] = σ^2 (n − p − 1)/n.    (27)

(This may be proved using the trace of I − H.) An unbiased estimate of σ^2, which I will call σ̂^2 throughout the rest of this, is

    σ̂^2 ≡ (n/(n − p − 1)) MSE.    (28)

The leverage of data point i is H_ii. This has several interpretations:

1. Var[Ŷ_i] = σ^2 H_ii; the leverage controls how much variance there is in the fitted value.
2. ∂Ŷ_i/∂Y_i = H_ii; the leverage says how much changing the response value for point i changes the fitted value there.
3. Cov[Ŷ_i, Y_i] = σ^2 H_ii; the leverage says how much covariance there is between the i-th response and the i-th fitted value.
4. Var[e_i] = σ^2 (1 − H_ii); the leverage controls how big the i-th residual is.

The standardized residual is

    r_i = e_i / (σ̂ √(1 − H_ii)).    (29)

The only restriction we have to impose on the predictor variables X_i is that (X^T X)^{−1} needs to exist. This is equivalent to

- X is not collinear: none of its columns is a linear combination of other columns;

which is also equivalent to

- The eigenvalues of X^T X are all > 0.

(If there are zero eigenvalues, the corresponding eigenvectors indicate linearly-dependent combinations of predictor variables.) Nearly-collinear predictor variables tend to lead to large variances for coefficient estimates, with high levels of correlation among the estimates.

It is perfectly OK for one column of X to be a function of another, provided it is a nonlinear function. Thus in polynomial regression we add extra columns for powers of one or more of the predictor variables. (Any other nonlinear function is however also legitimate.) This complicates the interpretation of coefficients as slopes, just as though we had done a transformation of a column. Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.

One column of X could be a (nonlinear) function of two or more of the other columns; this is how we represent interactions. Usually the interaction column is just a product of two other columns, for a product or multiplicative interaction; this also complicates the interpretation of coefficients as slopes. (See the notes on interactions.) Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.

We can include qualitative predictor variables with k discrete categories or levels by introducing binary indicator variables for k − 1 of the levels, and adding them to X.
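Indicator coding can be sketched directly. Here a hypothetical 3-level category is encoded with k − 1 = 2 dummy columns alongside a numerical predictor, and least squares recovers the per-level shifts (all names and numbers are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 90
cat = rng.integers(0, 3, size=n)            # a 3-level category; 0 = baseline
x = rng.normal(size=n)

# k - 1 = 2 indicator columns, one per non-baseline level.
D = np.column_stack([(cat == 1).astype(float), (cat == 2).astype(float)])
X = np.column_stack([np.ones(n), x, D])

offsets = np.array([0.0, 3.0, -1.5])        # true per-category shifts
Y = 1.0 + 2.0 * x + offsets[cat] + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
# beta_hat = [intercept, slope, level-1 shift, level-2 shift]
print(beta_hat)                             # ~ [1.0, 2.0, 3.0, -1.5]
```

The coefficients on the two dummies are the estimated differences from the baseline level, i.e., each category gets its own intercept.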
The coefficients on these indicators tell us about amounts that are added (or subtracted) to the response for every individual who is a member of that category or level, compared to what would be predicted for an otherwise-identical individual in the baseline category. Equivalently, every category gets its own intercept. Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.

Interacting the indicator variables for categories with other variables gives coefficients which say what amount is added to the slope used for each member of that category (compared to the slope for members of the baseline level). Equivalently, each category gets its own slope. Estimation and inference for the coefficients on these predictor variables goes exactly like estimation and inference for any other coefficient.

Model selection for prediction aims at picking a model which will predict well on new data drawn from the same distribution as the data we've seen. One way to estimate this out-of-sample performance is to look at what the expected squared error would be on new data with the same X
matrix, but a new, independent realization Y′ of the response. In the notes on model selection, we showed that

    E[(1/n)(Y′ − m̂)^T (Y′ − m̂)] = E[(1/n)(Y − m̂)^T (Y − m̂)] + (2/n) Σ_{i=1}^n Cov[Y_i, m̂_i]    (30)
                                = E[(1/n)(Y − m̂)^T (Y − m̂)] + (2/n) σ^2 tr H    (31)
                                = E[(1/n)(Y − m̂)^T (Y − m̂)] + (2/n) σ^2 (p + 1).    (32)

Mallows' C_p estimates this by

    MSE + (2/n) σ̂^2 (p + 1)    (33)

using the σ̂^2 from the largest model being selected among (which includes all the other models as special cases). An alternative is leave-one-out cross-validation, which amounts to

    (1/n) Σ_{i=1}^n (e_i / (1 − H_ii))^2.    (34)

We also considered K-fold cross-validation, AIC and BIC.

3 Gaussian Noise

The Gaussian noise assumption is added on to the other assumptions already made. It is that ε_i ~ N(0, σ^2), independent of the predictor variables and all other ε_j. In other words, ε has a multivariate Gaussian distribution,

    ε ~ MVN(0, σ^2 I).    (35)

Under this assumption, it follows that, since β̂ is a linear function of ε, it also has a multivariate Gaussian distribution:

    β̂ ~ MVN(β, σ^2 (X^T X)^{−1})    (36)

and

    Ŷ ~ MVN(Xβ, σ^2 H).    (37)

It follows from this that

    β̂_i ~ N(β_i, σ^2 (X^T X)^{−1}_{i+1,i+1})    (38)

and

    Ŷ_i ~ N(X_i β, σ^2 H_ii).    (39)

The sampling distribution of the estimated conditional mean at a new point X_new is N(X_new β, σ^2 X_new (X^T X)^{−1} X_new^T). The mean squared error follows a χ^2 distribution:

    n MSE / σ^2 ~ χ^2_{n−p−1}.    (40)
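The χ^2 distribution of the MSE can be seen in simulation: hold the design fixed, redraw Gaussian noise many times, and check the moments of n·MSE/σ^2 against those of a χ^2 variable. A sketch (synthetic data; variable names mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 25, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([0.5, 1.0, -2.0])
proj = np.linalg.inv(X.T @ X) @ X.T          # maps Y to beta-hat
df = n - p - 1                               # = 22

# Under Gaussian noise, n*MSE/sigma^2 is chi^2 with df degrees of
# freedom, so its mean should be df and its variance 2*df.
stats = np.empty(20000)
for k in range(stats.size):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    e = Y - X @ (proj @ Y)
    stats[k] = e @ e / sigma**2              # = n * MSE / sigma^2

print(stats.mean())                          # ~ 22
print(stats.var())                           # ~ 44
```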
Moreover, the MSE is statistically independent of β̂. We may therefore define

    ŝe[β̂_i] = σ̂ √((X^T X)^{−1}_{i+1,i+1})    (41)

and

    ŝe[Ŷ_i] = σ̂ √H_ii    (42)

and get t distributions:

    (β̂_i − β_i) / ŝe[β̂_i] ~ t_{n−p−1} ≈ N(0, 1)    (43)

and

    (Ŷ_i − m(X_i)) / ŝe[m̂_i] ~ t_{n−p−1} ≈ N(0, 1).    (44)

The Wald test for the hypothesis that β_i = β*_i therefore forms the test statistic

    (β̂_i − β*_i) / ŝe[β̂_i]    (45)

and rejects the hypothesis if it is too large (above or below zero) compared to the quantiles of a t_{n−p−1} distribution. The summary function of R runs such a test of the hypothesis that β_i = 0. There is nothing magic or even especially important about testing for a 0 coefficient, and the same test works for testing whether a slope = 42 (for example).

Important! The null hypothesis being tested is

    "Y is a linear function of X_1, ... X_p, and of no other predictor variables, with independent, constant-variance Gaussian noise, and the coefficient β_i = 0 exactly."

and the alternative hypothesis is

    "Y is a linear function of X_1, ... X_p, and of no other predictor variables, with independent, constant-variance Gaussian noise, and the coefficient β_i ≠ 0."

The Wald test does not test any of the model assumptions (it presumes them all), and it cannot say whether in an absolute sense X_i matters for Y; adding or removing other predictors can change whether the true β_i = 0.

Warning! Retaining the null hypothesis β_i = 0 can happen either if the parameter is precisely estimated, and confidently known to be close to zero, or if it is imprecisely estimated, and might as well be zero or something huge on either side. Saying "We can ignore this because we can be quite sure it's small" can make sense; saying "We can ignore this because we have no idea what it is" is preposterous.

To test whether several coefficients (β_j : j ∈ S) are all simultaneously zero, use an F test. The null hypothesis is

    H_0: β_j = 0 for all j ∈ S

and the alternative is

    H_1: β_j ≠ 0 for at least one j ∈ S.
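The t distribution of the Wald statistic also shows up in simulation: generate data whose true slope is zero, form the statistic by hand, and its spread matches t_{n−p−1} (slightly heavier-tailed than N(0,1)). A sketch (synthetic data; names mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 30, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # p = 1
XtX_inv = np.linalg.inv(X.T @ X)
df = n - 2                                   # n - p - 1

tstats = np.empty(20000)
for k in range(tstats.size):
    Y = 1.0 + rng.normal(scale=sigma, size=n)    # true slope = 0
    b = XtX_inv @ (X.T @ Y)
    e = Y - X @ b
    s2 = e @ e / df                              # unbiased sigma^2 estimate
    se = np.sqrt(s2 * XtX_inv[1, 1])             # estimated se of the slope
    tstats[k] = b[1] / se                        # Wald statistic for slope = 0

print(tstats.mean())                 # ~ 0
print(tstats.var())                  # ~ df/(df-2) = 28/26, > 1 as for a t
```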
The F statistic is

    F_stat = ((σ̂^2_null − σ̂^2_full)/s) / (σ̂^2_full / (n − p − 1))    (46)

where s is the number of elements in S. Under that null hypothesis,

    F_stat ~ F_{s, n−p−1}.    (47)

If we are testing a subset of coefficients, we have a partial F test. A full F test sets s = p, i.e., it tests the null hypothesis of an intercept-only model (with independent, constant-variance Gaussian noise) against the alternative of the linear model on X_1, ... X_p (and only those variables, with independent, constant-variance Gaussian noise). This is only of interest under very unusual circumstances.

Once again, no F test is capable of checking any modeling assumptions. This is because both the null hypothesis and the alternative hypothesis presume that all of the modeling assumptions are exactly correct.

A 1 − α confidence interval for β_i is

    β̂_i ± ŝe[β̂_i] t_{n−p−1}(α/2) ≈ β̂_i ± ŝe[β̂_i] z_{α/2}.    (48)

We saw how to create a confidence ellipsoid for several coefficients. These make a simultaneous guarantee: all the parameters are trapped inside the confidence region with probability 1 − α. A simpler way to get a simultaneous confidence region for all p parameters is to use 1 − α/p confidence intervals for each one ("Bonferroni correction"). This gives a confidence hyper-rectangle.

A 1 − α confidence interval for the regression function at a point is

    m̂(X_i) ± ŝe[m̂(X_i)] t_{n−p−1}(α/2).    (49)

Residuals. The cross-validated or studentized residuals are:

1. Temporarily hold out data point i.
2. Re-estimate the coefficients to get β̂_(−i) and σ̂_(−i).
3. Make a prediction for Y_i, namely, Ŷ_{i(−i)} = m̂_(−i)(X_i).
4. Calculate

    t_i = (Y_i − Ŷ_{i(−i)}) / √(σ̂^2_(−i) + ŝe[m̂_(−i)(X_i)]^2).    (50)

This can be done without recourse to actually re-fitting the model:

    t_i = r_i √((n − p − 2) / (n − p − 1 − r_i^2)).    (51)

(Note that for large n, this is typically extremely close to r_i.) Also,

    t_i ~ t_{n−p−2}.    (52)

(The −2 is because we're using n − 1 data points to estimate p + 1 coefficients.)

Cook's distance for point i is the (scaled) sum of the squared changes to all the fitted values if i were omitted; it is

    D_i = (1/(p + 1)) e_i^2 H_ii / (σ̂^2 (1 − H_ii)^2).    (53)
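The closed form for Cook's distance can be checked against its definition, since it agrees exactly with actually refitting without point i and summing the squared changes in the fitted values. A sketch (synthetic data; variable names mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                  # leverages
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - X @ beta_hat
s2 = e @ e / (n - p - 1)                        # unbiased sigma^2 estimate

i = 0                                           # examine the first point
# Closed form: D_i = e_i^2 H_ii / ((p+1) sigma-hat^2 (1 - H_ii)^2).
D_formula = e[i]**2 * h[i] / ((p + 1) * s2 * (1 - h[i])**2)

# Definition: refit without point i, sum the squared changes in all n
# fitted values, and scale by (p+1) sigma-hat^2.
mask = np.arange(n) != i
beta_drop = np.linalg.lstsq(X[mask], Y[mask], rcond=None)[0]
D_direct = np.sum((X @ beta_hat - X @ beta_drop)**2) / ((p + 1) * s2)

print(D_formula, D_direct)                      # agree up to rounding
```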