
THE UNIVERSITY OF CHICAGO
Booth School of Business
Business 41912, Spring Quarter 2012, Mr. Ruey S. Tsay

Lecture 4: Multivariate Linear Regression

Linear regression analysis is one of the most widely used statistical methods. We shall start with the multiple linear regression (MLR) analysis before studying the multivariate linear regression model. Our discussion of the multiple linear regression uses matrix algebra.

1 The classical linear regression model

The multiple linear regression model consists of a single response variable with multiple predictors and assumes the form
\[
Y_i = \beta_0 + \beta_1 Z_{i1} + \beta_2 Z_{i2} + \cdots + \beta_r Z_{ir} + \epsilon_i, \quad i = 1, \ldots, n, \tag{1}
\]
where the error terms $\{\epsilon_i\}$ satisfy

1. $E(\epsilon_i) = 0$ for all $i$;
2. $\mathrm{Var}(\epsilon_i) = \sigma^2$, a constant; and
3. $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ if $i \neq j$.

In matrix notation, Eq. (1) becomes
\[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix}
1 & Z_{11} & Z_{12} & \cdots & Z_{1r} \\
1 & Z_{21} & Z_{22} & \cdots & Z_{2r} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & Z_{n1} & Z_{n2} & \cdots & Z_{nr}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{bmatrix}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix},
\]
or
\[
Y_{n \times 1} = Z_{n \times (r+1)} \, \beta_{(r+1) \times 1} + \epsilon_{n \times 1},
\]
and the assumptions become

1. $E(\epsilon) = 0$; and
2. $\mathrm{Cov}(\epsilon) = E(\epsilon \epsilon') = \sigma^2 I_{n \times n}$.

The matrix $Z$ is referred to as the design matrix.
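As a small illustration of model (1), the following R sketch simulates a data set with hypothetical values ($n = 50$, $r = 3$ predictors, $\sigma = 2$, and an arbitrary coefficient vector) and fits it with lm; the same simulated example is reused in later sketches.

# A minimal sketch of model (1) with simulated data (all values are hypothetical)
set.seed(1)
n <- 50; r <- 3
Z <- cbind(1, matrix(rnorm(n * r), n, r))   # design matrix: column of ones plus r predictors
beta <- c(1, 2, 0, -1)                      # assumed true coefficients (beta_0, ..., beta_r)
y <- drop(Z %*% beta + rnorm(n, sd = 2))    # model (1) with sigma = 2
fit <- lm(y ~ Z[, -1])                      # lm() adds the intercept itself
coef(fit)                                   # estimates should be close to beta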

2 Least squares estimation

The least squares estimate (LSE) of $\beta$ is obtained by minimizing the sum of squares of errors,
\[
S(\beta) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 Z_{i1} - \cdots - \beta_r Z_{ir})^2 = (Y - Z\beta)'(Y - Z\beta).
\]
Before deriving the LSE, we first define $\hat{\beta} = (Z'Z)^{-1}Z'Y$ and $H = Z(Z'Z)^{-1}Z'$, assuming that the inverse exists. Then, it is easy to see that $H$ is symmetric and $H^2 = H$. The latter property says that $H$ is an idempotent matrix. Define
\[
\hat{\epsilon} = Y - Z\hat{\beta} = Y - Z(Z'Z)^{-1}Z'Y = [I - H]Y.
\]
Then, using $Z'H = Z'$, we have
\[
Z'\hat{\epsilon} = Z'[I - H]Y = [Z' - Z'H]Y = [Z' - Z']Y = 0.
\]
In other words, $\hat{\epsilon}$ is orthogonal to the column space of the design matrix $Z$.

Result 7.1. Assume that $Z$ is of full rank $r+1$, where $r + 1 \leq n$. The LSE of $\beta$ is
\[
\hat{\beta} = (Z'Z)^{-1}Z'Y.
\]
Proof. Since
\[
Y - Z\beta = Y - Z\hat{\beta} + Z\hat{\beta} - Z\beta = Y - Z\hat{\beta} + Z(\hat{\beta} - \beta),
\]
we have
\begin{align*}
S(\beta) &= (Y - Z\beta)'(Y - Z\beta) \\
&= (Y - Z\hat{\beta})'(Y - Z\hat{\beta}) + (\hat{\beta} - \beta)'Z'Z(\hat{\beta} - \beta) + 2(Y - Z\hat{\beta})'Z(\hat{\beta} - \beta) \\
&= (Y - Z\hat{\beta})'(Y - Z\hat{\beta}) + (\hat{\beta} - \beta)'Z'Z(\hat{\beta} - \beta),
\end{align*}
because $(Y - Z\hat{\beta})'Z = \hat{\epsilon}'Z = 0$. The first term of $S(\beta)$ does not depend on $\beta$ and the second term is the squared length of $Z(\hat{\beta} - \beta)$. Because $Z$ has full rank, $Z(\hat{\beta} - \beta) \neq 0$ if $\beta \neq \hat{\beta}$, so the minimum of $S(\beta)$ is unique and occurs at $\beta = \hat{\beta}$. Consequently, the LSE of $\beta$ is $\hat{\beta} = (Z'Z)^{-1}Z'Y$. QED

The deviations
\[
\hat{\epsilon}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 Z_{i1} - \cdots - \hat{\beta}_r Z_{ir}, \quad i = 1, \ldots, n,
\]
are the residuals of the MLR model. Let $\hat{Y} = Z\hat{\beta} = HY$ denote the fitted values of $Y$. The matrix $H$ is called the hat matrix. The LS residuals then become
\[
\hat{\epsilon} = Y - \hat{Y} = [I - Z(Z'Z)^{-1}Z']Y = [I - H]Y.
\]
Note that $I - H$ is also a symmetric and idempotent matrix.
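The least squares quantities above can be computed directly with matrix algebra and compared with lm(). A sketch, continuing the simulated example from Section 1:

# LSE, hat matrix, and residuals "by hand" (continuing the simulated example)
ZtZ.inv  <- solve(crossprod(Z))              # (Z'Z)^{-1}
beta.hat <- ZtZ.inv %*% crossprod(Z, y)      # beta-hat = (Z'Z)^{-1} Z'y
H        <- Z %*% ZtZ.inv %*% t(Z)           # hat matrix
e.hat    <- drop(y - H %*% y)                # residuals (I - H)y
max(abs(H %*% H - H))                        # H is idempotent (zero up to rounding)
max(abs(crossprod(Z, e.hat)))                # Z' e-hat = 0: residuals orthogonal to columns of Z
cbind(by.hand = drop(beta.hat), via.lm = coef(fit))   # the two computations agree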

Our prior discussion shows that the residuals satisfy (a) $Z'\hat{\epsilon} = 0$ and (b) $\hat{Y}'\hat{\epsilon} = 0$. Also, the residual sum of squares (RSS) is
\[
\mathrm{RSS} = \sum_{i=1}^n \hat{\epsilon}_i^2 = \hat{\epsilon}'\hat{\epsilon} = Y'[I - H]Y.
\]

Sum of squares decomposition: Since $\hat{Y}'\hat{\epsilon} = 0$, we have
\[
Y'Y = (\hat{Y} + \hat{\epsilon})'(\hat{Y} + \hat{\epsilon}) = \hat{Y}'\hat{Y} + \hat{\epsilon}'\hat{\epsilon}.
\]
Since the first column of $Z$ is $\mathbf{1}$, the result $Z'\hat{\epsilon} = 0$ includes $0 = \mathbf{1}'\hat{\epsilon}$. Consequently, we have $\bar{Y} = \bar{\hat{Y}}$. Subtracting $n\bar{Y}^2 = n(\bar{\hat{Y}})^2$ from the prior decomposition, we have
\[
Y'Y - n\bar{Y}^2 = \hat{Y}'\hat{Y} - n(\bar{\hat{Y}})^2 + \hat{\epsilon}'\hat{\epsilon},
\]
or
\[
\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n \hat{\epsilon}_i^2.
\]
The quantity
\[
R^2 = 1 - \frac{\sum_{i=1}^n \hat{\epsilon}_i^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}
\]
is the coefficient of determination. It is the proportion of the total variation in the $Y_i$'s explained by the predictors $Z_1, \ldots, Z_r$.

Sampling properties:

1. The LSE $\hat{\beta}$ satisfies $E(\hat{\beta}) = \beta$ and $\mathrm{Cov}(\hat{\beta}) = \sigma^2 (Z'Z)^{-1}$.
2. The residuals satisfy $E(\hat{\epsilon}) = 0$ and $\mathrm{Cov}(\hat{\epsilon}) = \sigma^2 [I - H]$.
3. $E(\hat{\epsilon}'\hat{\epsilon}) = (n - r - 1)\sigma^2$, so that, defining
\[
s^2 = \frac{\hat{\epsilon}'\hat{\epsilon}}{n - r - 1} = \frac{Y'[I - H]Y}{n - r - 1},
\]
we have $E(s^2) = \sigma^2$.
4. $\hat{\beta}$ and $\hat{\epsilon}$ are uncorrelated.

Proof. First,
\[
E(\hat{\beta}) = E[(Z'Z)^{-1}Z'Y] = (Z'Z)^{-1}Z'E(Y) = (Z'Z)^{-1}Z'[Z\beta + E(\epsilon)] = \beta.
\]
Also,
\[
\mathrm{Cov}(\hat{\beta}) = (Z'Z)^{-1}Z'\mathrm{Cov}(Y)Z(Z'Z)^{-1} = (Z'Z)^{-1}Z'(\sigma^2 I)Z(Z'Z)^{-1} = \sigma^2 (Z'Z)^{-1}.
\]
Next, $E(\hat{\epsilon}) = [I - H]E(Y) = [I - H]Z\beta = [Z - Z]\beta = 0$. Also, $\mathrm{Cov}(\hat{\epsilon}) = [I - H]\mathrm{Cov}(Y)[I - H]' = \sigma^2 [I - H]$, because $I - H$ is symmetric and idempotent.

4 Next, using tr(h) = tr(z(z Z) 1 Z ) = tr[(z Z) 1 Z Z] = tr(i r+1 ) = r + 1, we have E( ɛ ɛ) = E[tr( ɛ ɛ)] = E[tr( ɛ ɛ )] = tr[cov( ɛ)] = tr[σ 2 (I H)] = σ 2 [tr(i) tr(h)] = σ 2 (n r 1) Finally, Cov( β, ɛ) = Cov[(Z Z) 1 Z Y, (I H)Y ] = (Z Z) 1 Z Cov(Y, Y )(I H) = (Z Z) 1 Z (σ 2 I)(I H) = σ 2 (Z Z) 1 Z (I H) = σ 2 (Z Z) 1 (Z Z ) = 0 so that β and ɛ are uncorrelated Gauss least squares theorem Let Y = Zβ + ɛ, where E(ɛ) = 0, Cov(ɛ) = σ 2 I, and Rank(Z) = r + 1 For any c, the estimator c β of c β has the smallest possible variance among all linear estiamtors of the form a Y that are unbiased for c β Proof For any fixed c, let a Y be an unibased estimator of c β Then, E(a Y ) = c β, whatever the value of β But E(a Y ) = E[a (Zβ + ɛ)] = a Zβ Consequently, c β = a Zβ or equivalently, (c a Z)β = 0 for all β In particular, choosing β = (c a Z), we obtain c = a Z for any unbiased estimator Next, c β = c (Z Z) 1 Z Y a Y, where a = Z(Z Z) 1 c Since E( β) = β, so c ˆβ = a Y is an unbiased estimator of c β The result of the last paragraph says that c = (a ) Z Finally, for any a satisfying the unbiased requirement c = a Z, we have Var(a Y ) = Var(a Zβ + a ɛ) = Var(a ɛ) = σ 2 a a = σ 2 (a a + a ) (a a + a ) = σ 2 [(a a ) (a a ) + (a ) a ], where (a a ) a = (a a ) Z(Z Z) 1 c = (a Z (a ) Z)(Z Z) 1 c = (c c )(Z Z) 1 c = 0 Because a is fixed, and (a a ) (a a ) is positive unless a = a, Var(a Y ) is minimized by the choice of a = a and we have, (a ) Y = Ec (Z Z) 1 Z Y = c ˆβ QED The result says that the LSE c ˆβ is the best linear unbiased estiamtor (BLUE) of c β 3 Inference For the multiple linear regression model in (1), we further assume that ɛ N n (0, σ 2 I) Result 74 β is also the maximum likelihood estimate of β In addition, β N r+1 [β, σ 2 (Z Z) 1 ] and is independent of the residuals ɛ = Y Z β Let σ 2 be the maximum likelihood estimate of σ 2 Then, n σ 2 = ɛ ɛ σ 2 χ 2 n r 1 Proof Follows what we discussed before for the MLE of multivariate model random sample QED 4

Note that the MLE $\hat{\sigma}^2$ of $\sigma^2$ is $\hat{\epsilon}'\hat{\epsilon}/n$, which is different from the LSE $s^2$.

Result 7.5. For the Gaussian MLR model, a $100(1-\alpha)$ percent confidence region for $\beta$ is given by
\[
(\hat{\beta} - \beta)'Z'Z(\hat{\beta} - \beta) \leq (r+1)\, s^2\, F_{r+1, n-r-1}(\alpha),
\]
where $s^2 = \hat{\epsilon}'\hat{\epsilon}/(n-r-1)$ is the LSE of $\sigma^2$. Also, simultaneous $100(1-\alpha)$ percent confidence intervals for the $\beta_i$ are
\[
\hat{\beta}_i \pm \sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_i)} \sqrt{(r+1) F_{r+1, n-r-1}(\alpha)}, \quad i = 0, 1, \ldots, r,
\]
where $\widehat{\mathrm{Var}}(\hat{\beta}_i)$ is the diagonal element of $s^2 (Z'Z)^{-1}$ corresponding to $\hat{\beta}_i$.

Proof. Consider the vector $V = (Z'Z)^{1/2}(\hat{\beta} - \beta)$, which is normally distributed with mean zero and covariance matrix $\sigma^2 I$. Consequently, $V'V \sim \sigma^2 \chi^2_{r+1}$. In addition, $(n-r-1)s^2 = \hat{\epsilon}'\hat{\epsilon}$ is distributed as $\sigma^2 \chi^2_{n-r-1}$ and is independent of $V$. The result then follows. QED

Remark: The R command for MLR is lm, which stands for linear model.

Likelihood ratio tests for the regression parameters: Consider
\[
H_0: \beta_{q+1} = \beta_{q+2} = \cdots = \beta_r = 0 \quad \text{vs} \quad H_a: \beta_i \neq 0 \ \text{for some } q+1 \leq i \leq r.
\]
Under $H_0$, the model is
\[
Y = Z_1 \beta_{(1)} + \epsilon. \tag{2}
\]
Under $H_a$, the model is
\[
Y = Z_1 \beta_{(1)} + Z_2 \beta_{(2)} + \epsilon, \tag{3}
\]
where
\[
Z = [Z_1, Z_2], \qquad \beta = \begin{bmatrix} \beta_{(1)} \\ \beta_{(2)} \end{bmatrix}.
\]

Result 7.6. Let $Z$ have full rank $r+1$ and $\epsilon \sim N_n(0, \sigma^2 I)$. The likelihood ratio test for the null hypothesis $H_0: \beta_{(2)} = 0$ is based on
\[
\frac{[SS_{res}(Z_1) - SS_{res}(Z)]/(r-q)}{s^2} \sim F_{r-q,\, n-r-1},
\]
where $SS_{res}(Z_1)$ and $SS_{res}(Z)$ are the residual sums of squares of the models in (2) and (3), respectively.

Proof: Under the model in (3), the maximized likelihood function is
\[
L(\hat{\beta}, \hat{\sigma}^2) = \frac{1}{(2\pi)^{n/2}\hat{\sigma}^n} e^{-n/2},
\]

where $\hat{\beta} = (Z'Z)^{-1}Z'Y$ and $\hat{\sigma}^2 = (Y - Z\hat{\beta})'(Y - Z\hat{\beta})/n$. On the other hand, under the submodel in (2), the maximized likelihood function is
\[
L(\hat{\beta}_{(1)}, \hat{\sigma}_1^2) = \frac{1}{(2\pi)^{n/2}\hat{\sigma}_1^n} e^{-n/2},
\]
where $\hat{\beta}_{(1)} = (Z_1'Z_1)^{-1}Z_1'Y$ and $\hat{\sigma}_1^2 = (Y - Z_1\hat{\beta}_{(1)})'(Y - Z_1\hat{\beta}_{(1)})/n$. Thus, the likelihood ratio is
\[
\frac{L(\hat{\beta}_{(1)}, \hat{\sigma}_1^2)}{L(\hat{\beta}, \hat{\sigma}^2)} = \left( \frac{\hat{\sigma}_1^2}{\hat{\sigma}^2} \right)^{-n/2} = \left( 1 + \frac{\hat{\sigma}_1^2 - \hat{\sigma}^2}{\hat{\sigma}^2} \right)^{-n/2},
\]
which gives rise to the test statistic
\[
\frac{n(\hat{\sigma}_1^2 - \hat{\sigma}^2)/(r-q)}{n\hat{\sigma}^2/(n-r-1)} = \frac{[SS_{res}(Z_1) - SS_{res}(Z)]/(r-q)}{s^2} \sim F_{r-q,\, n-r-1}.
\]
This completes the proof.

Alternatively, one can construct a matrix $C$ such that the null hypothesis becomes $H_0: C\beta = 0$. In this way, $C\hat{\beta} \sim N_{r-q}(C\beta, \sigma^2 C(Z'Z)^{-1}C')$, which can be used to perform the test.

4 Inferences from the fitted model

Consider a specific point of interest, say $z_0 = (1, z_{01}, \ldots, z_{0r})'$, in the design-matrix space. Then, the model says $E(Y_0 \mid z_0) = z_0'\beta$, and the LSE of this expectation is $z_0'\hat{\beta}$. In addition, $\mathrm{Var}(z_0'\hat{\beta}) = \sigma^2 z_0'(Z'Z)^{-1}z_0$. Consequently, under the normality assumption, a $100(1-\alpha)\%$ confidence interval for $z_0'\beta$ is
\[
z_0'\hat{\beta} \pm t_{n-r-1}(\alpha/2) \sqrt{z_0'(Z'Z)^{-1}z_0 \, s^2}.
\]

Forecasting: The point prediction of $Y$ at $z_0$ is $z_0'\hat{\beta}$, which is an unbiased estimator. Since $Y_0 = z_0'\beta + \epsilon_0$, the variance of the forecast error is $\sigma^2 (1 + z_0'(Z'Z)^{-1}z_0)$, where we use the property that $\hat{\beta}$ and $\epsilon_0$ are uncorrelated. Therefore, a $100(1-\alpha)\%$ prediction interval for $Y_0$ is
\[
z_0'\hat{\beta} \pm t_{n-r-1}(\alpha/2) \sqrt{s^2 (1 + z_0'(Z'Z)^{-1}z_0)}.
\]

5 Model checking

Studentized residuals: From $\hat{\epsilon} = [I - H]Y$, we have $\mathrm{Cov}(\hat{\epsilon}) = \sigma^2 [I - H]$. In particular, $\mathrm{Var}(\hat{\epsilon}_j) = \sigma^2 (1 - h_{jj})$, for $j = 1, \ldots, n$. The studentized residuals are
\[
\hat{\epsilon}_j^* = \frac{\hat{\epsilon}_j}{\sqrt{s^2 (1 - h_{jj})}}, \quad j = 1, \ldots, n.
\]
If the fitted regression model is adequate, we expect the studentized residuals to look like independent draws from an $N(0, 1)$ distribution.
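The nested-model F test of Result 7.6, the prediction interval of Section 4, and the studentized residuals just defined are all available through standard R functions (anova, predict, rstandard). A sketch, with the simulated data placed in a data frame and a hypothetical new point z0:

# Nested F test, prediction interval, and studentized residuals (continuing the example)
dat <- data.frame(y = y, Z[, -1])
names(dat)[-1] <- c("z1", "z2", "z3")
fit2 <- lm(y ~ z1 + z2 + z3, data = dat)          # full model (3)
fit1 <- lm(y ~ z1, data = dat)                    # submodel (2), dropping z2 and z3
anova(fit1, fit2)                                 # F statistic of Result 7.6

z0 <- data.frame(z1 = 0.5, z2 = -1, z3 = 0.2)     # hypothetical new point
predict(fit2, newdata = z0, interval = "confidence")   # CI for z0' beta
predict(fit2, newdata = z0, interval = "prediction")   # prediction interval for Y0

r.stu <- rstandard(fit2)                          # studentized residuals
max(abs(r.stu - residuals(fit2) / sqrt(summary(fit2)$sigma^2 * (1 - hatvalues(fit2)))))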

5.1 Residual plots

The residuals $\hat{\epsilon}_j$ or studentized residuals $\hat{\epsilon}_j^*$ are used to obtain various residual plots for model checking:

1. Plot $\hat{\epsilon}_j^*$ against the fitted value $\hat{y}_j = z_j'\hat{\beta}$, where $z_j'$ is the $j$th row of the design matrix $Z$. This plot can be used to check (a) the validity of the linear model assumption and (b) the constant variance of $\epsilon_j$.
2. Plot $\hat{\epsilon}_j^*$ against an individual explanatory variable, e.g. $Z_1$. This is less common when the number of regressors is large.
3. QQ-plot of $\hat{\epsilon}_j^*$ to check the normality assumption and possible outliers.
4. Time plot of $\hat{\epsilon}_j^*$ to check for serial correlations. This is often accompanied by the Durbin-Watson statistic
\[
DW = \frac{\sum_{j=2}^n (\hat{\epsilon}_j - \hat{\epsilon}_{j-1})^2}{\sum_{j=1}^n \hat{\epsilon}_j^2} \approx 2(1 - \hat{\rho}_1),
\]
where $\hat{\rho}_1$ is the lag-1 autocorrelation function of the residuals defined as
\[
\hat{\rho}_1 = \frac{\sum_{j=2}^n \hat{\epsilon}_j \hat{\epsilon}_{j-1}}{\sum_{j=1}^n \hat{\epsilon}_j^2}.
\]
The range of the DW statistic is $[0, 4]$ with 2 as the ideal value. A DW statistic greater than 2 indicates negative correlation between the residuals. In practice, when the data have time or spatial characteristics, one should also check higher lags of autocorrelations of the residuals.

5.2 High leverage points and influential observations

From $\hat{Y} = HY$, we have
\[
\hat{y}_j = \sum_{i=1}^n h_{ji} y_i = h_{jj} y_j + \sum_{i \neq j} h_{ji} y_i.
\]
In addition, it can be shown that $0 < h_{jj} < 1$ for all $j$. In fact, $\sum_{j=1}^n h_{jj} = \mathrm{tr}(H) = r + 1$, under the assumption that $Z$ is of full rank $r+1$. Thus, if $h_{jj}$ is large relative to the other $h_{ji}$ (in magnitude), then $y_j$ will be a major contributor to the fitted value $\hat{y}_j$. Consequently, $h_{jj}$ is called the leverage of the $j$th observation. A large $h_{jj}$ tends to pull the regression line toward the $j$th data point.

The leverage $h_{jj}$ has another interpretation. It measures the distance of $z_j$ to the center of the explanatory variables, where $z_j'$ is the $j$th row of the design matrix $Z$. For instance, consider the simple linear regression $y_j = \beta_0 + \beta_1 z_j + \epsilon_j$. It can be shown that
\[
h_{jj} = \frac{1}{n} + \frac{(z_j - \bar{z})^2}{\sum_{i=1}^n (z_i - \bar{z})^2}.
\]

Consequently, if the $j$th data point of the explanatory variables is far away from the center, then it has a high leverage and pulls the model fit toward itself. It is, therefore, useful in linear regression analysis to check the high leverage points.

Influential observations of a linear regression model are defined as those points that significantly affect the inferences drawn from the data. Methods for assessing the influence are often derived from the change in the LSE $\hat{\beta}$ when observations are removed from the data. The best-known statistic for assessing influential observations is Cook's distance. The Cook's distance for the $i$th observation is defined as
\[
D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'(Z'Z)(\hat{\beta}_{(i)} - \hat{\beta})}{(r+1)s^2}, \tag{4}
\]
where $\hat{\beta}_{(i)}$ is the LSE of $\beta$ with the $i$th data point removed, and $s^2 = \hat{\epsilon}'\hat{\epsilon}/(n-r-1)$ is the LSE of $\sigma^2$. See Cook (1977, Technometrics). It is the squared distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$ relative to the fixed geometry of $Z'Z$. A large $D_i$ indicates that the $i$th data point is influential, because removing it from the data leads to a substantial change in the parameter estimates. It can be shown that
\[
D_i = \frac{1}{r+1} \, \frac{\hat{\epsilon}_i^2 \, h_{ii}}{s^2 (1 - h_{ii})^2} = \frac{1}{r+1} \, \frac{h_{ii}}{1 - h_{ii}} \, (\hat{\epsilon}_i^*)^2,
\]
where $\hat{\epsilon}_i^*$ is the studentized residual. This expression leads to several interpretations of Cook's distance. For instance, $h_{ii}/(1 - h_{ii})$ is a monotonic function of the leverage $h_{ii}$, and $\hat{\epsilon}_i^*$ is large for an outlying observation. Thus, $D_i$ is the product of a random deviation and a leverage measure. In addition, $h_{ii}/(1 - h_{ii}) = \mathrm{Var}(\hat{Y}_i)/\mathrm{Var}(\hat{\epsilon}_i)$. It can also be shown that $z_i'(Z_{(i)}'Z_{(i)})^{-1}z_i = h_{ii}/(1 - h_{ii})$. Finally,
\[
\frac{h_{ii}}{1 - h_{ii}} = \frac{\sum_{j=1}^n \mathrm{Var}(z_j'\hat{\beta}_{(i)}) - \sum_{j=1}^n \mathrm{Var}(z_j'\hat{\beta})}{\sigma^2}.
\]
Thus, $h_{ii}/(1 - h_{ii})$ is proportional to the total change in the variance of prediction at $z_1, \ldots, z_n$ when $z_i$ is deleted.
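Leverages and Cook's distances are returned by hatvalues() and cooks.distance(); the following sketch, still with the simulated example, also checks the closed-form expression for $D_i$ given above.

# Leverage and Cook's distance (continuing the simulated example)
h <- hatvalues(fit2)                        # leverages h_ii
sum(h)                                      # equals r + 1 = 4
D <- cooks.distance(fit2)                   # Cook's distance, Eq. (4)
# closed form: D_i = rstandard_i^2 * h_ii / ((1 - h_ii) * (r + 1))
max(abs(D - rstandard(fit2)^2 * h / ((1 - h) * length(coef(fit2)))))
which(D > 4 / nrow(dat))                    # a common rough cutoff for flagging influential points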

6 Variable selection

For the MLR model considered, we have $r$ predictors. In theory, there are $2^r - 1$ possible linear regression models available. Because the $r$ predictors may be highly correlated, this can lead to the problem of multi-collinearity, which, in turn, results in an unstable or overfitted model. An important question then is to select the best MLR model among those candidate models. This is the variable selection problem in MLR analysis.

The problem of variable selection has been widely studied in the literature. It is an ongoing project for many statisticians and researchers. The current research investigates variable selection when both $r$ and $n$ (the sample size) go to infinity. See LASSO, boosting, and some Bayesian methods in the literature.

In this lecture, we focus on the traditional approach of variable selection. That is, $r$ is fixed, but $n$ may increase. The methods we discuss include (a) stepwise regression, (b) Mallows' $C_p$ criterion, (c) information criteria such as AIC and BIC, and (d) stochastic search variable selection by George and McCulloch (1993, JASA, Variable selection via Gibbs sampling).

6.1 Stepwise regression

It is an iterative procedure alternating between forward selection and backward elimination. The procedure requires two prespecified thresholds, which I shall denote by $F_{in}$ and $F_{out}$. Typically, $F_{in}$ and $F_{out}$ are squares of some critical values of a Student-$t$ distribution with $n-1$ degrees of freedom. In some cases, one may use $F_{in} = F_{out} = 4$. To describe the procedure, we divide the $r$ regressors into two subsets, say $S_{in}$ and $S_{out}$, where $S_{in}$ contains all the regressors already selected and $S_{out}$ denotes the remaining regressors. $S_{in}$ or $S_{out}$ may be the empty set. Suppose that $S_{in}$ has $p$ variables and denote the corresponding MLR model as
\[
Y_i = \beta_0 + \beta_1 Z_{i1} + \cdots + \beta_p Z_{ip} + e_i, \quad i = 1, \ldots, n. \tag{5}
\]

Forward-selection step: The goal of forward selection is to expand the regression by adding a regressor to the model. Let $\hat{e}_i$ be the residuals of the MLR in (5).

1. Find the variable, say $Z_{p+1}$, in $S_{out}$ that has the maximum correlation with $\hat{e}_i$.
2. Perform the MLR
\[
Y_i = \beta_0 + \beta_1 Z_{i1} + \cdots + \beta_p Z_{ip} + \beta_{p+1} Z_{i(p+1)} + e_i, \quad i = 1, \ldots, n.
\]
3. Let $t_{p+1}$ be the t-ratio of the LSE $\hat{\beta}_{p+1}$ of the prior regression. Compare $t_{p+1}^2$ with $F_{in}$. If $t_{p+1}^2 \geq F_{in}$, then move $Z_{p+1}$ from $S_{out}$ to $S_{in}$ and go to the backward-elimination step. If $t_{p+1}^2 < F_{in}$, stop the procedure.

Backward-elimination step: The goal of backward elimination is to simplify the existing model by removing one variable from $S_{in}$. Again, consider the existing model in (5).

1. Find the regressor in $S_{in}$ that has the smallest t-ratio, in absolute value, in the MLR in (5). Denote the regressor by $Z_j$ and the corresponding t-ratio by $t_j$. Compare $t_j^2$ with $F_{out}$. If $t_j^2 < F_{out}$, move $Z_j$ from $S_{in}$ to $S_{out}$ and go to the forward-selection step. On the other hand, if $t_j^2 \geq F_{out}$, do not move $Z_j$ and go to the forward-selection step.

In practice, one starts with $S_{in}$ being the empty set and $S_{out}$ containing all regressors. One can also start with the backward-elimination step with $S_{in}$ containing all regressors. The stepwise procedure will stop after a certain number of iterations. Obviously, the result depends on the choices of $F_{in}$ and $F_{out}$.
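A small sketch of one forward-selection iteration as described above, with $F_{in} = 4$ and with the simulated variables standing in for $S_{in}$ and $S_{out}$ (both sets are hypothetical here):

# One forward-selection step (sketch; F.in = 4 as in the text)
F.in  <- 4
S.in  <- c("z1")                                        # suppose z1 has already been selected
S.out <- c("z2", "z3")
res   <- residuals(lm(y ~ ., data = dat[, c("y", S.in)]))
cand  <- S.out[which.max(abs(cor(dat[S.out], res)))]    # regressor most correlated with residuals
fit.new <- lm(y ~ ., data = dat[, c("y", S.in, cand)])
t.cand  <- summary(fit.new)$coefficients[cand, "t value"]
if (t.cand^2 >= F.in) S.in <- c(S.in, cand)             # accept and move to backward elimination
S.in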

6.2 Mallows' C_p

Mallows (1973, Technometrics, Vol. 15) proposes the following criterion for model selection:
\[
C_p = \frac{SSE_p}{s_F^2} - n + 2(p+1),
\]
where $p$ denotes the number of variables in the regression
\[
Y_i = \beta_0 + \beta_1 Z_{i1}^* + \cdots + \beta_p Z_{ip}^* + \epsilon_i, \quad i = 1, \ldots, n, \tag{6}
\]
where $\{Z_1^*, \ldots, Z_p^*\}$ is a subset of the available regressors $\{Z_1, \ldots, Z_r\}$, $SSE_p = \sum_{i=1}^n (Y_i - \hat{Y}_{pi})^2$ with $\hat{Y}_{pi}$ denoting the fitted value of the MLR in (6), and $s_F^2$ is the residual mean squared error of the MLR model in (1), i.e., the residual mean squared error when ALL predictors are used. Here we use the subscript $F$ to denote the FULL model.

Suppose that the subset model in (6) is the true model. Then, $s_F^2$ converges to $\sigma^2$ (the variance of $\epsilon_i$) as $n$ increases, and
\[
E(C_p) \approx \frac{E(SSE_p)}{\sigma^2} - n + 2(p+1) = \frac{(n-p-1)\sigma^2}{\sigma^2} - n + 2(p+1) = p + 1.
\]
Therefore, in practice, one selects a model such that $C_p$ is less than, but close to, $p + 1$.

Remark: In the discussion of $C_p$, I use a constant term in the MLR. If no constant term is used or the constant term is treated as a predictor, then $C_p$ is defined as
\[
C_p = \frac{SSE_p}{s_F^2} - n + 2p.
\]

6.3 Information criteria

Assume that $\epsilon_i$ is normally distributed. For a given subset of regressors, say $S_{in} = \{Z_1^*, \ldots, Z_p^*\}$, one estimates the model parameters of the MLR in (6) by the maximum likelihood method. As shown before, the estimate of the coefficient vector $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ is the same as the LSE. However, the MLE of the variance of $\epsilon_i$ is
\[
\tilde{\sigma}^2 = \frac{SSE}{n},
\]
where $SSE$ denotes the sum of squared errors. The AIC criterion for the fitted MLR model is then
\[
AIC(p) = \ln(\tilde{\sigma}^2) + \frac{2p}{n}. \tag{7}
\]
The first term of the AIC is a goodness-of-fit statistic measuring the fidelity of the model to the data, and the second term is a penalty function. From the definition, AIC uses a penalty of 2 for each parameter used. Note that the constant term is assumed to be included in all models so that it is not counted. In practice, one entertains a set of possible MLR models for $Y_i$ and selects the model that has the smallest AIC value. If one changes the penalty term to $p\ln(n)/n$, then the criterion becomes the BIC.
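The $C_p$ of a candidate subset and the AIC of Eq. (7) are easy to compute by hand; the sketch below does so for one hypothetical subset of the simulated example. (R's built-in AIC() is on the $-2\log$-likelihood scale, so its values differ by constants but rank models the same way.)

# Mallows' C_p and AIC/BIC for a candidate subset (continuing the example)
full <- lm(y ~ z1 + z2 + z3, data = dat)         # full model; s2F is its residual mean squared error
s2F  <- summary(full)$sigma^2
sub  <- lm(y ~ z1 + z3, data = dat)              # a candidate subset with p = 2 regressors
p    <- length(coef(sub)) - 1
n.obs <- nrow(dat)
Cp   <- sum(residuals(sub)^2) / s2F - n.obs + 2 * (p + 1)
Cp                                               # compare with p + 1 = 3
aic  <- log(sum(residuals(sub)^2) / n.obs) + 2 * p / n.obs        # Eq. (7)
bic  <- log(sum(residuals(sub)^2) / n.obs) + p * log(n.obs) / n.obs
c(AIC.eq7 = aic, BIC = bic)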

Example. Consider the silver-zinc battery data set of the textbook. The dependent variable is the logarithm of the cycles to failure of the battery. The five predictors are (1) charge rate ($Z_1$), (2) discharge rate ($Z_2$), (3) depth of discharge ($Z_3$), (4) temperature ($Z_4$), and (5) end of charge voltage ($Z_5$). We shall use the leaps package and the command step in R to carry out the computation. Since $r = 5$ is small, we have $2^5 - 1 = 31$ possible regressions and may entertain all of them in model selection. When $r$ is large, some procedure will become useful in reducing the number of MLR models entertained. The leaps package can be used to consider all possible regressions.

> setwd("c:/users/rst/teaching/ama")
> da=read.table("t7-5.dat",header=T)
> da
   z1 z2 z3 z4 z5  y
   ... (listing of the 20 observations)
> y=log(da$y)
> x=as.matrix(da[,1:5])
> x
      z1 z2 z3 z4 z5
 [1,] ... (20 rows of predictor values)

> library(leaps)
> help(leaps)
>
> leaps(x,y,nbest=3)
$which
      1     2     3     4     5
1 FALSE FALSE FALSE  TRUE FALSE
1 FALSE  TRUE FALSE FALSE FALSE
1  TRUE FALSE FALSE FALSE FALSE
2 FALSE  TRUE FALSE  TRUE FALSE
2 FALSE FALSE FALSE  TRUE  TRUE
2 FALSE FALSE  TRUE  TRUE FALSE
3 FALSE  TRUE FALSE  TRUE  TRUE
3  TRUE  TRUE FALSE  TRUE FALSE
3 FALSE  TRUE  TRUE  TRUE FALSE
4  TRUE  TRUE FALSE  TRUE  TRUE
4 FALSE  TRUE  TRUE  TRUE  TRUE
4  TRUE  TRUE  TRUE  TRUE FALSE
5  TRUE  TRUE  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1"           "2"           "3"           "4"
[6] "5"

$size
[1] ...

$Cp
[1] ...

>
> leaps(x,y,nbest=1)

$which
      1     2     3     4     5
1 FALSE FALSE FALSE  TRUE FALSE
2 FALSE  TRUE FALSE  TRUE FALSE
3 FALSE  TRUE FALSE  TRUE  TRUE
4  TRUE  TRUE FALSE  TRUE  TRUE
5  TRUE  TRUE  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1"           "2"           "3"           "4"
[6] "5"

$size
[1] ...

$Cp
[1] ...

> x1=data.frame(x)
> nn=lm(y~.,data=x1)
> step(nn)
Start:  AIC=7.58
y ~ z1 + z2 + z3 + z4 + z5

       Df Sum of Sq RSS AIC
       ... (one row per candidate deletion; z3 is dropped)

Step:  AIC=6.18
y ~ z1 + z2 + z4 + z5

       Df Sum of Sq RSS AIC
       ... (one row per candidate deletion; z1 is dropped)

Step:  AIC=4.8
y ~ z2 + z4 + z5

       Df Sum of Sq RSS AIC
<none>                  ...
- z2                    ...
- z4                    ...
- z5                    ...

Call:
lm(formula = y ~ z2 + z4 + z5, data = x1)

Coefficients:
(Intercept)           z2           z4           z5
        ...          ...          ...          ...

>
> m1=step(nn,trace=F)
> m1

Call:
lm(formula = y ~ z2 + z4 + z5, data = x1)

Coefficients:
(Intercept)           z2           z4           z5
        ...          ...          ...          ...

> m2=lm(y~z2+z4+z5,data=x1)
> summary(m2)

Call:
lm(formula = y ~ z2 + z4 + z5, data = x1)

Residuals:
    Min      1Q  Median      3Q     Max
    ...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ...        ...     ...      ...
z2               ...        ...     ...      ...
z4               ...        ...     ...      ...  ***
z5               ...        ...     ...      ...

Residual standard error: 1.032 on 16 degrees of freedom
Multiple R-squared: 0.6421,     Adjusted R-squared: 0.575
F-statistic: 9.568 on 3 and 16 DF,  p-value: ...

6.4 Stochastic search variable selection

This approach uses a hierarchical model and Gibbs sampling to perform variable selection. Readers are referred to George and McCulloch (1993, JASA) for details. Given the observed response $Y$ and the design matrix $Z$, the approach assumes that
\[
Y \mid \beta, \sigma^2 \sim N(Z\beta, \sigma^2 I),
\]
where $I$ is the $n \times n$ identity matrix. The key concept of this hierarchical approach is that each component of $\beta = (\beta_0, \ldots, \beta_r)'$ follows a mixture of two normal distributions with mean zero, but different variances. Specifically,
\[
\beta_i \sim (1 - \gamma_i) N(0, \tau_i^2) + \gamma_i N(0, c_i \tau_i^2), \quad i = 0, \ldots, r, \tag{8}
\]
where $c_i$ is a sufficiently large positive real number, and $\gamma_i$ is a Bernoulli variable such that
\[
P(\gamma_i = 1) = 1 - P(\gamma_i = 0) = p_i. \tag{9}
\]
The idea is as follows: First, one sets $\tau_i$ close to zero so that if $\gamma_i = 0$, then $\beta_i$ would probably be so small that it could be safely estimated by 0. Second, one sets $c_i$ large (in practice $c_i > 1$ always) so that if $\gamma_i = 1$, then a non-zero estimate of $\beta_i$ should probably be included in the final model. See George and McCulloch (1993) for discussion of selecting $c_i$ and $\tau_i$. Based on this setup, $p_i$ can be regarded as the prior probability that $\beta_i$ will require a non-zero estimate, i.e., that the predictor $Z_i$ should be selected.

From (8), $\beta \mid \gamma \sim N(0, D_\gamma R D_\gamma)$, where $\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_r)'$, $R$ is a prior correlation matrix, and $D_\gamma = \mathrm{diag}\{a_0 \tau_0, a_1 \tau_1, \ldots, a_r \tau_r\}$, where $a_i = 1$ if $\gamma_i = 0$ and $a_i = c_i$ if $\gamma_i = 1$. Finally, the prior distribution of $\sigma^2$ given $\gamma$ is inverse gamma, $IG(\nu_\gamma/2, \nu_\gamma \lambda_\gamma/2)$. This says that $\nu_\gamma \lambda_\gamma / \sigma^2$ is distributed as $\chi^2_{\nu_\gamma}$.

Under the framework described earlier, the variable selection is governed by the posterior distribution of $\gamma$ given the data. We denote it as $f(\gamma \mid Y)$. Markov chain Monte Carlo (MCMC) methods with Gibbs sampling can be used to obtain this posterior distribution.

What happens when $r$ is really large and $n$ is not large? For instance, $r = 3000$ with a much smaller $n$.

1. Boosting: Bühlmann, P. (2006). Boosting for high-dimensional linear models. Annals of Statistics, 34. Bühlmann and Yu (2003). Boosting with the L2-loss: regression and classification. Journal of the American Statistical Association, 98.

L2 Boosting: For simplicity, assume that the data are mean-corrected so that there is no constant term in the linear regression. Boosting is an iterated procedure as follows. Let $m = 0$ and define $\hat{Y}^{(0)} = 0$.

(a) Construct the residuals of the $m$th iteration as $R^{(m)} = Y - \hat{Y}^{(m)}$.
(b) Fit the $r$ simple linear regressions $R_i^{(m)} = \beta_j Z_{ij} + \epsilon_{ij}$, $j = 1, \ldots, r$, and compute the resulting sums of squares of residuals.
(c) Let $j_m$ be the explanatory variable that has the smallest sum of squares of residuals, i.e., the maximum $R^2$ among the $r$ simple linear regressions. Compute the fitted value of this simple linear regression, $\hat{Y}^{(j_m)} = \hat{\beta}_{j_m} Z_{j_m}$.
(d) Update the fit as $\hat{Y}^{(m+1)} = \hat{Y}^{(m)} + v \hat{Y}^{(j_m)}$, where $v \in (0, 1]$ is a tuning parameter.
(e) Advance $m$ by 1 and go to Step (a).

The iteration is stopped by some information criterion such as AIC. A smaller $v$ requires a longer iteration, but limited experience shows that it works better. For example, $v = 0.05$ has been used.

2. LASSO: Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58.

The LASSO estimate of $\beta$, denoted by $\hat{\beta}_l$, is obtained by
\[
\hat{\beta}_l = \mathop{\mathrm{argmin}}_{\beta} \sum_{i=1}^n \left( Y_i - \beta_0 - \sum_{j=1}^r Z_{ij} \beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^r |\beta_j| \leq s,
\]
where $s$ is a tuning parameter. In practice, $s$ can be chosen by cross-validation. Typically, a 10-fold cross-validation is used.
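The L2 boosting loop (a)-(e) takes only a few lines; the sketch below runs it on the simulated example with $v = 0.05$ and a fixed number of iterations (an information criterion would normally decide when to stop), and then notes how a LASSO fit could be obtained with the glmnet package (assumed installed), which chooses the tuning parameter by 10-fold cross-validation.

# Componentwise L2 boosting, steps (a)-(e), on mean-corrected data (sketch)
Zc <- scale(as.matrix(dat[, c("z1", "z2", "z3")]), center = TRUE, scale = FALSE)
yc <- dat$y - mean(dat$y)
v  <- 0.05                                    # tuning parameter, as suggested in the text
fit.boost  <- rep(0, nrow(Zc))                # Y-hat^(0) = 0
coef.boost <- rep(0, ncol(Zc))
for (m in 1:200) {                            # fixed number of iterations for simplicity
  res <- yc - fit.boost                                  # (a) current residuals
  b   <- drop(crossprod(Zc, res)) / colSums(Zc^2)        # (b) slopes of the r simple regressions
  sse <- colSums((res - sweep(Zc, 2, b, "*"))^2)         #     their residual sums of squares
  j   <- which.min(sse)                                  # (c) best single regressor
  fit.boost     <- fit.boost + v * b[j] * Zc[, j]        # (d) update the fit
  coef.boost[j] <- coef.boost[j] + v * b[j]
}
coef.boost                                    # accumulated coefficients after 200 steps

# LASSO with the glmnet package (assumed available); cv.glmnet does 10-fold CV by default
# library(glmnet)
# cv <- cv.glmnet(as.matrix(dat[, c("z1", "z2", "z3")]), dat$y, nfolds = 10)
# coef(cv, s = "lambda.min")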

7 Omitted variables

In linear regression analysis, if some relevant predictors are omitted, then the coefficient estimates may contain bias and the residuals may have some serial dependence. Suppose that the true model is
\[
Y = Z_1 \beta_{(1)} + Z_2 \beta_{(2)} + \epsilon,
\]
where the column dimension of $Z_1$ is $q + 1 < r + 1$. Suppose that the investigator unknowingly fits the model
\[
Y = Z_1 \beta_{(1)} + \epsilon.
\]
The LSE of $\beta_{(1)}$ is $\hat{\beta}_{(1)} = (Z_1'Z_1)^{-1}Z_1'Y$. In this case,
\[
E(\hat{\beta}_{(1)}) = (Z_1'Z_1)^{-1}Z_1'E(Y) = (Z_1'Z_1)^{-1}Z_1'(Z_1 \beta_{(1)} + Z_2 \beta_{(2)} + E(\epsilon)) = \beta_{(1)} + (Z_1'Z_1)^{-1}Z_1'Z_2 \beta_{(2)}.
\]
Consequently, $\hat{\beta}_{(1)}$ is a biased estimate of $\beta_{(1)}$ unless $Z_1'Z_2 = 0$.

Appendix

In this appendix, we consider some useful matrix properties for the linear regression model.

Property 1. Assume that the inverse matrices exist. Then
\[
\begin{bmatrix} A & B \\ B' & D \end{bmatrix}^{-1}
=
\begin{bmatrix} A^{-1} + F E^{-1} F' & -F E^{-1} \\ -E^{-1} F' & E^{-1} \end{bmatrix},
\]
where $E = D - B'A^{-1}B$ and $F = A^{-1}B$.

Property 2. Assume that the design matrix $Z_{n \times (r+1)}$ is of full rank $r+1$. The matrices $H = Z(Z'Z)^{-1}Z'$ and $I - H$ are idempotent.

Property 3. Partition the design matrix as $Z = [Z_1, Z_2]$, where $Z_1$ is an $n \times q$ matrix of full rank $q$. Define the hat matrix of the design matrix $Z_1$ as $H_1 = Z_1(Z_1'Z_1)^{-1}Z_1'$. Regress $Z_2$ on $Z_1$. The residuals of such a linear regression are $W_2 = [I - H_1]Z_2$. Now, consider $W_2$ as a design matrix and construct the associated hat matrix $T$ as
\[
T = W_2(W_2'W_2)^{-1}W_2'.
\]
Then, we have
\[
H = H_1 + T.
\]
Proof: From the partition, we have
\[
Z'Z = \begin{bmatrix} Z_1'Z_1 & Z_1'Z_2 \\ Z_2'Z_1 & Z_2'Z_2 \end{bmatrix}.
\]
To obtain $(Z'Z)^{-1}$, we apply Property 1 with $A = Z_1'Z_1$ and use the property that $I - H_1$ is an idempotent matrix. In particular,
\[
E = D - B'A^{-1}B = Z_2'Z_2 - (Z_2'Z_1)(Z_1'Z_1)^{-1}(Z_1'Z_2) = Z_2'[I - Z_1(Z_1'Z_1)^{-1}Z_1']Z_2 = Z_2'(I - H_1)Z_2 = W_2'W_2,
\]
\[
F = (Z_1'Z_1)^{-1}Z_1'Z_2,
\]

and
\[
F E^{-1} = (Z_1'Z_1)^{-1}Z_1'Z_2 [Z_2'(I - H_1)Z_2]^{-1}.
\]
Then,
\[
Z_1(A^{-1} + F E^{-1} F')Z_1' = H_1 + H_1 Z_2 E^{-1} Z_2' H_1.
\]
Finally,
\begin{align*}
H &= Z_1(A^{-1} + F E^{-1} F')Z_1' - Z_2 E^{-1} F' Z_1' - Z_1 F E^{-1} Z_2' + Z_2 E^{-1} Z_2' \\
&= H_1 + H_1 Z_2 E^{-1} Z_2' H_1 - Z_2 E^{-1} Z_2' H_1 - H_1 Z_2 E^{-1} Z_2' + Z_2 E^{-1} Z_2' \\
&= H_1 - (I - H_1) Z_2 E^{-1} Z_2' H_1 + (I - H_1) Z_2 E^{-1} Z_2' \\
&= H_1 + (I - H_1) Z_2 E^{-1} Z_2' (I - H_1) \\
&= H_1 + W_2 (W_2'W_2)^{-1} W_2'.
\end{align*}
This property says that we can project on the subspace of $Z_1$, then project on the subspace of $Z_2$ that is orthogonal to $Z_1$.

Property 4. Let $Z_1 = \mathbf{1}$ be the $n$-dimensional vector of ones. Applying Property 3, we have
\[
H = \mathbf{1}\mathbf{1}'/n + \tilde{H},
\]
where $\tilde{H}$ is the hat matrix of the centered design matrix $\tilde{Z}_{n \times r}$ given by
\[
\tilde{Z} =
\begin{bmatrix}
Z_{11} - \bar{Z}_1 & Z_{12} - \bar{Z}_2 & \cdots & Z_{1r} - \bar{Z}_r \\
Z_{21} - \bar{Z}_1 & Z_{22} - \bar{Z}_2 & \cdots & Z_{2r} - \bar{Z}_r \\
\vdots & \vdots & & \vdots \\
Z_{n1} - \bar{Z}_1 & Z_{n2} - \bar{Z}_2 & \cdots & Z_{nr} - \bar{Z}_r
\end{bmatrix},
\]
where $\bar{Z}_i = \sum_{j=1}^n Z_{ji}/n$ is the mean of the $i$th column of $Z$.

Property 5. Let $z_i'$ be the $i$th row of the design matrix $Z$ and $Z_{(i)}$ be the matrix $Z$ with $z_i'$ removed. Then,
\[
(Z_{(i)}'Z_{(i)})^{-1} = (Z'Z - z_i z_i')^{-1} = (Z'Z)^{-1} + \frac{(Z'Z)^{-1} z_i z_i' (Z'Z)^{-1}}{1 - z_i'(Z'Z)^{-1} z_i}.
\]
Proof: Let $A = Z'Z$, $a = z_i'$, and $b = -z_i'$. The result then follows from the identity
\[
(A + a'b)^{-1} = A^{-1} - A^{-1} a' (I + b A^{-1} a')^{-1} b A^{-1},
\]
where $A$ is a $k \times k$ symmetric and invertible matrix and $a$ and $b$ are $q \times k$ matrices of rank $q$.

Note that $z_i'\hat{\beta} = \hat{Y}_i$ is the fitted value of the $i$th observation, so the $i$th residual is $\hat{\epsilon}_i = Y_i - z_i'\hat{\beta}$. Also, from the definition, $h_{ii} = z_i'(Z'Z)^{-1}z_i$.
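Properties 3 and 5 are easy to verify numerically with the simulated design matrix from the earlier sketches:

# Numerical check of Properties 3 and 5 (uses Z, H, and ZtZ.inv from earlier sketches)
Z1 <- Z[, 1:2]; Z2 <- Z[, 3:4]
H1 <- Z1 %*% solve(crossprod(Z1)) %*% t(Z1)
W2 <- (diag(n) - H1) %*% Z2
T.mat <- W2 %*% solve(crossprod(W2)) %*% t(W2)
max(abs(H - (H1 + T.mat)))                     # Property 3: H = H1 + T

i   <- 7                                       # drop an arbitrary row
Zi  <- Z[-i, , drop = FALSE]
lhs <- solve(crossprod(Zi))
rhs <- ZtZ.inv + (ZtZ.inv %*% Z[i, ] %*% t(Z[i, ]) %*% ZtZ.inv) /
       (1 - drop(t(Z[i, ]) %*% ZtZ.inv %*% Z[i, ]))
max(abs(lhs - rhs))                            # Property 5: rank-one downdate of (Z'Z)^{-1}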

Property 6.
\[
\hat{\beta}_{(i)} - \hat{\beta} = -\frac{(Z'Z)^{-1} z_i \hat{\epsilon}_i}{1 - h_{ii}}.
\]
Proof. Note that $Z'Y = Z_{(i)}'Y_{(i)} + z_i Y_i$, where $Y_i$ is the $i$th element of $Y$ and $Y_{(i)}$ is the vector $Y$ with $Y_i$ removed. Using Property 5, we have
\begin{align*}
\hat{\beta}_{(i)} &= (Z_{(i)}'Z_{(i)})^{-1} Z_{(i)}'Y_{(i)} \\
&= \left[ (Z'Z)^{-1} + \frac{(Z'Z)^{-1} z_i z_i' (Z'Z)^{-1}}{1 - h_{ii}} \right] (Z'Y - z_i Y_i) \\
&= \hat{\beta} + \frac{(Z'Z)^{-1} z_i z_i' \hat{\beta}}{1 - h_{ii}} - (Z'Z)^{-1} z_i Y_i - \frac{(Z'Z)^{-1} z_i h_{ii} Y_i}{1 - h_{ii}} \\
&= \hat{\beta} + \frac{(Z'Z)^{-1} z_i \hat{Y}_i}{1 - h_{ii}} - \frac{(Z'Z)^{-1} z_i Y_i}{1 - h_{ii}} \\
&= \hat{\beta} - \frac{(Z'Z)^{-1} z_i \hat{\epsilon}_i}{1 - h_{ii}}.
\end{align*}
Consequently,
\[
\hat{\beta}_{(i)} - \hat{\beta} = -\frac{(Z'Z)^{-1} z_i \hat{\epsilon}_i}{1 - h_{ii}}.
\]

Property 7. Cook's distance can be written as
\[
D_i = \frac{1}{r+1} \, \frac{\hat{\epsilon}_i^2 \, h_{ii}}{s^2 (1 - h_{ii})^2}.
\]
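A numerical check of Properties 6 and 7 with the fit from the earlier sketches (refitting without the $i$th case and comparing with R's cooks.distance):

# Numerical check of Properties 6 and 7 (continuing the simulated example)
h <- hatvalues(fit2); e <- residuals(fit2)
i <- 3                                                     # an arbitrary case to delete
fit.del <- lm(y ~ z1 + z2 + z3, data = dat[-i, ])
delta   <- coef(fit.del) - coef(fit2)                      # beta-hat_(i) - beta-hat
max(abs(delta + drop(ZtZ.inv %*% Z[i, ]) * e[i] / (1 - h[i])))        # Property 6
s2.fit <- summary(fit2)$sigma^2
D.i <- e[i]^2 * h[i] / (length(coef(fit2)) * s2.fit * (1 - h[i])^2)   # Property 7
c(by.formula = D.i, via.R = unname(cooks.distance(fit2)[i]))          # the two agree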
