Lecture 15: Multiple Regression I (Chapter 6, Set 2)

Least Squares Estimation

The quadratic form to be minimized is
\[
Q = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2} - \cdots - \beta_{p-1} X_{i,p-1} \right)^2,
\]
which in matrix notation is
\[
Q = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta.
\]
Now consider the partial differential vector operator
\[
\frac{\partial}{\partial \beta}(Q) =
\begin{pmatrix}
\partial Q / \partial \beta_0 \\
\partial Q / \partial \beta_1 \\
\vdots \\
\partial Q / \partial \beta_{p-1}
\end{pmatrix}.
\]
Then
\[
\frac{\partial}{\partial \beta}(Q) = -2X'Y + 2X'X\beta.
\]
Note: This result was verified in the context of simple linear regression (where it is easy to verify), and it holds for multiple regression as well.

Setting $\frac{\partial}{\partial \beta}(Q) = 0$ and replacing $\beta$ with its least squares estimator $b$ leads to the normal equations
\[
(X'X)b = X'Y,
\]
whose solution is
\[
b = (X'X)^{-1}(X'Y).
\]
Note: The matrix solution looks the same as in the case of simple linear regression. However, now $b$ is a $p \times 1$ column vector of least squares estimators, $(X'X)^{-1}$ is a symmetric $p \times p$ constant matrix, and $X'Y$ is a $p \times 1$ vector.

In the general multiple regression case with $p-1$ predictor variables, it is not possible to obtain nice algebraic expressions for the elements of $b$ as in the case of simple regression. Instead, the least squares estimates are found numerically by first computing $(X'X)^{-1}$ and then multiplying on the right by $X'Y$. All matrices involved in these computations can be recovered from SAS. Refer to the SAS output that accompanies these notes.
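As a concrete illustration of the numerical recipe just described, here is a minimal sketch in Python/NumPy (the notes themselves use SAS; the data matrix below is made up purely for illustration):

```python
import numpy as np

# Hypothetical data: n = 6 observations, p - 1 = 2 predictor variables.
# The first column of 1s corresponds to the intercept beta_0.
X = np.array([[1.0, 2.0, 5.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 6.0],
              [1.0, 5.0, 5.0],
              [1.0, 6.0, 7.0],
              [1.0, 7.0, 8.0]])
Y = np.array([7.0, 8.0, 11.0, 12.0, 15.0, 17.0])

# Normal equations (X'X) b = X'Y, solved as b = (X'X)^{-1} (X'Y).
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ (X.T @ Y)
print(b)  # least squares estimates b_0, b_1, b_2
```

In production code one would solve the linear system directly (e.g. with `np.linalg.lstsq`) rather than form the explicit inverse, but the inverse mirrors the computation described in the notes and is what SAS reports.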
Properties of the Least Squares Estimators

1. The least squares estimators are unbiased, i.e. $E(b) = \beta$.
2. The variance-covariance matrix of $b$ is $V(b) = \sigma^2 (X'X)^{-1}$.
3. The estimated variance-covariance matrix of $b$ is $\hat{V}(b) = MSE \, (X'X)^{-1}$.

In the notation of the text, with $V(b) = E[(b - \beta)(b - \beta)']$, this estimated $p \times p$ variance-covariance matrix is written
\[
\hat{V}(b) =
\begin{pmatrix}
S^2(b_0) & S(b_0, b_1) & \cdots & S(b_0, b_{p-1}) \\
S(b_1, b_0) & S^2(b_1) & \cdots & S(b_1, b_{p-1}) \\
\vdots & \vdots & \ddots & \vdots \\
S(b_{p-1}, b_0) & S(b_{p-1}, b_1) & \cdots & S^2(b_{p-1})
\end{pmatrix},
\]
which is output by SAS when using PROC REG with the option COVB. Also, the estimated standard error of $b_k$ is
\[
\mathrm{Std}\{b_k\} = S(b_k) = \sqrt{S^2(b_k)}, \qquad k = 0, 1, \ldots, p-1,
\]
and these are automatically output by SAS when PROC REG is used.

Remark: By the Gauss-Markov Theorem, the least squares estimators $b$ are the best (in the sense of smallest variance) linear unbiased estimators (BLUE) of $\beta$.

Predicted Values and Residuals

As in simple linear regression, $\hat{Y} = Xb$, where
\[
\hat{Y} =
\begin{pmatrix}
\hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n
\end{pmatrix},
\qquad
X =
\begin{pmatrix}
1 & X_{11} & \cdots & X_{1,p-1} \\
1 & X_{21} & \cdots & X_{2,p-1} \\
\vdots & \vdots & & \vdots \\
1 & X_{n1} & \cdots & X_{n,p-1}
\end{pmatrix},
\qquad
b =
\begin{pmatrix}
b_0 \\ b_1 \\ \vdots \\ b_{p-1}
\end{pmatrix}.
\]
Thus
\[
\hat{Y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_{p-1} X_{i,p-1}.
\]
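Continuing the hypothetical NumPy sketch from above, the fitted values, residuals, and the estimated variance-covariance matrix $\hat{V}(b) = MSE \, (X'X)^{-1}$ (the COVB-style output) can be computed as follows:

```python
# Continues the hypothetical X, Y, b, XtX_inv from the previous sketch.
n, p = X.shape                   # n observations, p regression coefficients
Y_hat = X @ b                    # fitted values, Y_hat = X b
e = Y - Y_hat                    # residuals
MSE = (e @ e) / (n - p)          # SSE / (n - p); defined formally below

cov_b = MSE * XtX_inv            # estimated var-cov matrix of b (cf. COVB)
se_b = np.sqrt(np.diag(cov_b))   # estimated standard errors S(b_k)
print(cov_b)
print(se_b)
```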
Since $b = (X'X)^{-1}(X'Y)$, we have
\[
\hat{Y} = HY, \qquad H = X(X'X)^{-1}X'.
\]
Remark: The so-called hat matrix $H$ is $n \times n$ no matter how many predictor variables are involved in the regression. Of course, it is more difficult to compute when there are $p-1$ predictor variables. As in the case of simple linear regression,
\[
H' = H, \qquad H^2 = H.
\]
And the residual vector is
\[
e = Y - \hat{Y} = Y - HY = (I_n - H)Y.
\]
Note: $H$ transforms $Y$ into the estimated mean response vector $\hat{Y}$, while $I_n - H$ transforms $Y$ into $e$, the vector of residuals.

Variance-Covariance Matrices of the Predicted Values and Residuals

The variance-covariance matrix of the predicted values $\hat{Y}$ is
\[
V(\hat{Y}) = \sigma^2 H,
\]
and since $\sigma^2$ is not known, the estimated variance-covariance matrix of $\hat{Y}$ is
\[
\hat{V}(\hat{Y}) = MSE \, H.
\]
Similarly,
\[
V(e) = \sigma^2 (I_n - H), \qquad \hat{V}(e) = MSE \, (I_n - H).
\]
Note: The mean square error MSE is defined in the ANOVA section below.

Analysis of Variance

As in simple linear regression, the fundamental identity on which the analysis of variance is based is
\[
SSTO = SSR + SSE,
\]
where
\[
SSTO = \sum (Y_i - \bar{Y})^2 = Y'\left(I_n - \tfrac{1}{n}J\right)Y, \qquad J = \mathbf{1}\mathbf{1}',
\]
\[
SSE = \sum e_i^2 = e'e = Y'(I_n - H)Y,
\]
\[
SSR = \sum (\hat{Y}_i - \bar{Y})^2 = Y'\left(H - \tfrac{1}{n}J\right)Y.
\]
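The hat matrix properties and the three quadratic forms are easy to verify numerically; here is a sketch continuing the same hypothetical data:

```python
# Continues the hypothetical X, Y, XtX_inv, n from the sketches above.
H = X @ XtX_inv @ X.T               # hat matrix, n x n
assert np.allclose(H, H.T)          # H' = H  (symmetric)
assert np.allclose(H @ H, H)        # H^2 = H (idempotent)

I_n = np.eye(n)
J = np.ones((n, n))                 # J = 1 1'
SSTO = Y @ (I_n - J / n) @ Y        # total sum of squares
SSE  = Y @ (I_n - H) @ Y            # error sum of squares
SSR  = Y @ (H - J / n) @ Y          # regression sum of squares
assert np.isclose(SSTO, SSR + SSE)  # the fundamental ANOVA identity
```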
The degrees of freedom associated with the above sums of squares are
\[
df_{SSTO} = n - 1, \qquad df_{SSE} = n - p, \qquad df_{SSR} = p - 1.
\]
Thus the corresponding mean squares are
\[
MSE = \frac{SSE}{n - p}, \qquad MSR = \frac{SSR}{p - 1}.
\]
And by Cochran's Theorem, SSR and SSE are independent with the following $\chi^2$ distributions:
\[
\frac{SSE}{\sigma^2} \sim \chi^2(n - p), \qquad \frac{SSR}{\sigma^2} \sim \chi^2(p - 1, \, \theta/\sigma^2),
\]
where
\[
\theta = \sum \left[ \beta_1 (X_{i1} - \bar{X}_1) + \beta_2 (X_{i2} - \bar{X}_2) + \cdots + \beta_{p-1} (X_{i,p-1} - \bar{X}_{p-1}) \right]^2.
\]
Remark:
\[
E(MSE) = \sigma^2, \qquad E(MSR) = \sigma^2 + \frac{\theta}{p - 1}.
\]
We see the consistency when this is compared to $E(MSR) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$ for $p - 1 = 1$.

ANOVA Table:

Source              df       Sum of Squares   Mean Square          Expected MS                  F Ratio
Regression (model)  $p - 1$  SSR              $MSR = SSR/(p-1)$    $\sigma^2 + \theta/(p-1)$    $F = MSR/MSE$
Error               $n - p$  SSE              $MSE = SSE/(n-p)$    $\sigma^2$
Total               $n - 1$  SSTO

F Test for Regression

The F ratio in the above ANOVA table tests the hypotheses
\[
H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0, \qquad H_a: \beta_k \neq 0 \text{ for at least one } k = 1, 2, \ldots, p-1.
\]
The test statistic is then
\[
F = \frac{MSR}{MSE},
\]
and under $H_0$,
\[
F \sim F(p - 1, n - p).
\]
Thus, the decision rules for an $\alpha$-level test are:

Decision rule I: Accept $H_0$ if $F \le F(1 - \alpha; p - 1, n - p)$; reject $H_0$ if $F > F(1 - \alpha; p - 1, n - p)$.

Decision rule II: Accept $H_0$ if $P_v = P\left(F(p - 1, n - p) > F\right) \ge \alpha$; reject $H_0$ if $P_v < \alpha$.
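The F test is then one line of arithmetic. A sketch continuing the hypothetical example, using SciPy's F distribution for the P-value (SciPy is assumed to be available; the notes themselves rely on SAS for this):

```python
from scipy import stats

# Continues SSR, SSE, MSE, n, p from the sketches above.
MSR = SSR / (p - 1)
F = MSR / MSE                            # observed F ratio
p_value = stats.f.sf(F, p - 1, n - p)    # P( F(p-1, n-p) > F )

alpha = 0.05
print("reject H0" if p_value < alpha else "accept H0")  # decision rule II
```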
Coefficient of Multiple Determination

\[
R^2 = \frac{SSR}{SSTO}
\]
measures the proportion of the total variation in the response variable $Y$ that is due to its linear relationship with the explanatory variables $X_1, X_2, \ldots, X_{p-1}$. Thus $R^2$ plays the same role in multiple regression that $R^2$ does in simple regression.

Comments: A large value of $R^2$ does not necessarily imply that the fitted model is a useful one. For instance,

1. Nonlinearity may exist even if $R^2$ is large.
2. Most of the observations may have been taken at certain ranges of the predictor variables. Despite a high $R^2$ in this case, the fitted model may not be useful if most predictions require extrapolations outside the region of observations.
3. Even though $R^2$ is large, MSE may still be too large for inferences to be useful when high precision is required.
4. The above F test statistic can also be written in terms of $R^2$:
\[
F = \frac{MSR}{MSE} = \left( \frac{n - p}{p - 1} \right) \frac{R^2}{1 - R^2}.
\]

Adjusted $R^2$

Recall
\[
R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}.
\]
The adjusted $R^2$ is obtained by dividing SSE and SSTO by their respective degrees of freedom. That is,
\[
R_a^2 = 1 - \frac{SSE/(n - p)}{SSTO/(n - 1)} = 1 - \left( \frac{n - 1}{n - p} \right) \frac{SSE}{SSTO}.
\]
Remark: Adding another explanatory variable to the multiple regression model will always decrease SSE and thus increase $R^2$. However, $R_a^2$ may actually decrease when another explanatory variable is added to the model, because the decrease in SSE may be more than offset by the loss of a degree of freedom in the denominator $n - p$.

Coefficient of Multiple Correlation

\[
R = \sqrt{R^2}.
\]
$R$ does not have a direct interpretation in terms of the reduction in the variability of the dependent variable as $R^2$ does. It is not often used.
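Closing the running numerical sketch, $R^2$, $R_a^2$, and the $R^2$ form of the F statistic follow directly from the quantities already computed:

```python
# Continues SSTO, SSE, SSR, F, n, p from the sketches above.
R2 = SSR / SSTO                                  # coefficient of multiple determination
R2_adj = 1 - (n - 1) / (n - p) * (SSE / SSTO)    # adjusted R^2
R = np.sqrt(R2)                                  # coefficient of multiple correlation

F_from_R2 = (n - p) / (p - 1) * R2 / (1 - R2)    # same value as F = MSR/MSE
assert np.isclose(F_from_R2, F)
```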