Dr. Maddah ENMG 617 EM Statistics 11/28/12 Multiple Regression (3) (Chapter 15, Hines)

Problems in multiple regression: Multicollinearity

This arises when the independent variables x_1, x_2, ..., x_k are highly inter-correlated. Multicollinearity leads to poor estimates of the regression coefficients and negatively affects the applicability of the regression model.

To illustrate the effect of multicollinearity, consider a regression with two independent variables, x_1 and x_2, and assume the X'X matrix has been written in correlation form. Then, it can be shown that

$$ C = (X'X)^{-1} = \frac{1}{1 - r_{12}^2} \begin{pmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{pmatrix}, $$

where r_12 is the correlation between x_1 and x_2. Note that the variances of the estimated regression coefficients, V(β̂_j) = σ² C_jj (estimated by MS_E C_jj), j = 1, 2, become very large as the correlation between x_1 and x_2 increases, r_12 → ±1.

In general, with k > 2 independent variables, it can be shown, when C = (X'X)^{-1} has been written in correlation form, that C_jj = 1 / (1 − R_j²). Here, R_j² is the coefficient of multiple determination resulting from regressing x_j on the other k − 1 regressor variables.
The term 1 / (1 − R_j²) is called the variance inflation factor, VIF_j, of β̂_j.

In brief, multicollinearity leads to high variances of the estimated regression coefficients β̂_j and damages the applicability of the regression model. Specifically, multicollinearity significantly affects the ability of the model to extrapolate and predict responses for x values outside the range of the data.

Multicollinearity can be detected in the following ways, assuming the X'X matrix is written in correlation form.
1. Large VIFs, specifically, VIF_j = C_jj > 10.
2. Low determinant of X'X, specifically, |X'X| ≈ 0, when X'X is written in correlation form.
3. Small eigenvalues of X'X: one or more eigenvalues are close to 0, or λ_max / λ_min > 10, where λ_max and λ_min are the largest and smallest eigenvalues.
4. High correlation coefficients, |r_ij| ≈ 1.
5. The F-test for significance of regression is significant, but the individual regression coefficients are not significant.

Some remedial measures for multicollinearity include augmenting the data, when possible, and deleting some independent variables.
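As an illustration of the VIF diagnostic above, the following is a minimal sketch (not from the text) that computes VIF_j = 1 / (1 − R_j²) by regressing each x_j on the remaining regressors. The synthetic two-variable data and the function names are assumptions made only for illustration.

```python
import numpy as np

def r_squared(y, X):
    """R^2 of an ordinary least-squares fit of y on X (with an intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_e = resid @ resid
    ss_t = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_e / ss_t

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    k = X.shape[1]
    vifs = []
    for j in range(k):
        others = np.delete(X, j, axis=1)      # regress x_j on the other regressors
        vifs.append(1.0 / (1.0 - r_squared(X[:, j], others)))
    return np.array(vifs)

# Example with two nearly collinear regressors (synthetic data):
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)          # x2 almost equal to x1
X = np.column_stack([x1, x2])
print(variance_inflation_factors(X))          # both VIFs far above 10
```

With x_2 nearly equal to x_1, both VIFs come out far above the threshold of 10, signaling multicollinearity.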
Some more advanced remedies involve using more robust estimates of the regression coefficients, obtained with methods other than least squares. The text presents one such method, ridge regression.

Example on detecting multicollinearity (Ex. 15-14)

Consider the data on the heat generated (in calories per gram) from cement as a function of the quantities of four additives. First, the data are coded in standard (correlation) form by applying the transformation

$$ x_{ij} = \frac{z_{ij} - \bar{z}_j}{S_{jj}^{1/2}} . $$

The resulting X'X matrix has several large correlation coefficients. In addition, three of the four VIFs, the diagonal entries of (X'X)^{-1}, are larger than 10.
Finally, the eigenvalues of X'X are 3.657, 0.2679, 0.07127, and 0.004014, implying that λ_max / λ_min = 3.657 / 0.004014 = 911.06 > 10. Therefore, multicollinearity problems are likely present.

Influential Observations

Sometimes, a small subset of the data significantly affects the parameters and the quality of the regression model. This typically happens with observations which are far from the range of the rest of the data. It then becomes important to identify these points in order to eliminate them if they were collected by mistake. Even if the influential observations are correct, it is good to detect them to understand what drives the model.

A measure used to detect influential points is Cook's distance. For an observation i, this distance measures the change in the regression coefficients if observation i is removed.
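The remaining checks from the example (correlation-form coding, the determinant, and the eigenvalue ratio) can be sketched as below. The data here are synthetic, not the cement data of Ex. 15-14, and the function names are illustrative.

```python
import numpy as np

def correlation_form(Z):
    """x_ij = (z_ij - z_bar_j) / S_jj^(1/2), with S_jj the sum of squared deviations."""
    centered = Z - Z.mean(axis=0)
    scale = np.sqrt(np.sum(centered ** 2, axis=0))
    return centered / scale

rng = np.random.default_rng(1)
z1 = rng.normal(size=30)
z2 = z1 + 0.01 * rng.normal(size=30)           # nearly collinear with z1
z3 = rng.normal(size=30)
Z = np.column_stack([z1, z2, z3])

X = correlation_form(Z)
XtX = X.T @ X                                   # correlation matrix of the regressors
eig = np.linalg.eigvalsh(XtX)

print("det(X'X)              =", np.linalg.det(XtX))       # close to 0
print("lambda_max/lambda_min =", eig.max() / eig.min())     # large ratio, well above 10
```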
Letting β̂ and β̂_(i) be the estimates of the regression coefficients with the full data and after removing observation i, respectively, Cook's distance is given by

$$ D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})' X'X (\hat{\beta}_{(i)} - \hat{\beta})}{p \, MS_E}, \qquad i = 1, 2, \ldots, n, $$

where p is the number of regression parameters. It can be shown that D_i can be written as

$$ D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}, $$

where r_i is the studentized residual of observation i and h_ii is the i-th diagonal element of the hat matrix H = X(X'X)^{-1}X'. It can also be shown that the matrix H relates the response variable, y, to the fits, ŷ, as ŷ = Hy. So, the distance D_i reflects both the leverage of observation i and how well the model fits it. A value of D_i > 1 indicates an influential point.

Example on influential observations (Ex. 15-15)

For the peach damage example, the Cook's distance measures are as shown below. All values are significantly below 1, which implies that the data contain no influential observations.
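A minimal sketch of computing Cook's distances from the hat-matrix form above. The data and variable names are illustrative assumptions; the remote, poorly fit last observation is constructed so that it is likely flagged as influential.

```python
import numpy as np

def cooks_distance(X, y):
    """D_i = (r_i^2 / p) * h_ii / (1 - h_ii), with p the number of parameters."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix: y_hat = H y
    h = np.diag(H)
    e = y - H @ y                              # ordinary residuals
    ms_e = e @ e / (n - p)
    r = e / np.sqrt(ms_e * (1.0 - h))          # studentized residuals
    return (r ** 2 / p) * h / (1.0 - h)

# Illustrative use: one observation far from the range of the rest of the x's.
rng = np.random.default_rng(2)
x = np.append(rng.uniform(0, 1, 20), 10.0)     # last observation is remote
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=21)
y[-1] += 15.0                                  # and poorly fit
X = np.column_stack([np.ones_like(x), x])
D = cooks_distance(X, y)
print(D[-1], D[-1] > 1)                        # likely flagged as influential
```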
Autocorrelation

If the error terms ε_i are correlated, then the regression models discussed thus far are not applicable. Correlation may occur in time series observations, where observations at time t are related to those at time t − 1.

One test for autocorrelation is the Durbin-Watson test, which applies to simple regression and assumes a first-order autoregressive model,

$$ \varepsilon_t = \rho \, \varepsilon_{t-1} + a_t . $$

Here, ρ is the autocorrelation coefficient and the a_t are IID normal random variables.
The test checks the significance of ρ based on the statistic

$$ D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}, $$

where e_t is the residual at time t. If autocorrelation is present, then e_t and e_{t-1} will be close, and D will be small, which leads to rejecting the hypothesis ρ = 0. The text describes two one-sided tests on ρ and tabulates critical values of D at different significance levels.

Model-building: Selecting variables

Typically, not all candidate independent variables are necessary to adequately model the response variable. One is then interested in screening the candidate independent variables to obtain the best regression model. A good model balances performance (e.g., prediction accuracy) with tractability (ease of estimation and usage). There is no straightforward technique for selecting the best model. Search techniques are used that require interaction and judgment by the analyst.
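A short sketch of computing the Durbin-Watson statistic from a residual series, using the formula above. The simulated AR(1) residuals are only an assumption to show that positive autocorrelation pushes D toward 0, while uncorrelated residuals give D near 2.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic from a residual series e_1, ..., e_n."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
a = rng.normal(size=100)
e_corr = np.empty(100)
e_corr[0] = a[0]
for t in range(1, 100):                      # first-order autoregressive errors
    e_corr[t] = 0.9 * e_corr[t - 1] + a[t]

print(durbin_watson(e_corr))                 # small (well below 2): autocorrelation
print(durbin_watson(rng.normal(size=100)))   # near 2: no autocorrelation
```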
All possible regressions

This is a technique that finds a good regression model based on k candidate variables by exploring all possible subsets of independent variables. E.g., with three candidate variables, x_1, x_2, and x_3, this technique explores regression models with the subsets {x_1}, {x_2}, {x_3}, {x_1, x_2}, {x_1, x_3}, {x_2, x_3}, and {x_1, x_2, x_3}, as well as the model with no regressors. In general, with k candidate independent variables, this method explores 2^k possibilities, which tends to be too large.

Several criteria are used to compare the candidate models. With p − 1 variables in the model (p parameters, including the intercept), the most common criteria are the coefficient of determination R_p², the mean square error MS_E(p), and the C_p statistic.

C_p statistic

The C_p statistic is a measure of the total mean square error. It is an estimate of the total standardized mean square error, Γ_p.
C_p is given by

$$ C_p = \frac{SS_E(p)}{\hat{\sigma}^2} - n + 2p, $$

where the variance is estimated based on the full model, σ̂² = MS_E(k + 1). In a model with no bias, it can be shown that E[C_p] = p. So, models with C_p close to p are desired.

Model selection in all possible regressions

A model with a large R_p², a small MS_E(p), and a small C_p that is close to p is desired. A large adjusted R², R_adj², is also desirable, but this is equivalent to having a small MS_E(p). One could also use the F-score of the significance-of-regression test; a large F-score is desirable.

Typically, R_p² increases while MS_E(p) and C_p decrease as p increases. One chooses a value of p where a further increase of p yields insignificant improvement.
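The following sketch illustrates the all-possible-regressions search with the C_p criterion as defined above. The helper names and the synthetic data are assumptions for illustration, not the peach damage data of the text.

```python
import numpy as np
from itertools import combinations

def fit_ss_e(X, y):
    """SS_E of an OLS fit of y on X (X already contains the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return e @ e

def all_possible_regressions(Xfull, y):
    """Return (subset, p, C_p) for every nonempty subset of the candidate columns."""
    n, k = Xfull.shape
    intercept = np.ones((n, 1))
    # sigma_hat^2 = MS_E of the full model with all k candidate variables
    sigma2 = fit_ss_e(np.hstack([intercept, Xfull]), y) / (n - k - 1)
    results = []
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            X = np.hstack([intercept, Xfull[:, list(subset)]])
            p = size + 1                                   # parameters incl. intercept
            cp = fit_ss_e(X, y) / sigma2 - n + 2 * p
            results.append((subset, p, cp))
    return results

rng = np.random.default_rng(4)
Xfull = rng.normal(size=(40, 3))
y = 1.0 + 2.0 * Xfull[:, 0] - 1.5 * Xfull[:, 1] + rng.normal(size=40)
for subset, p, cp in all_possible_regressions(Xfull, y):
    print(subset, p, round(cp, 2))             # look for C_p small and close to p
```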
Example of all possible regressions

Consider an augmented peach damage model with five candidate independent variables. Values of R_p², MS_E(p), and C_p have been computed for all possible regression models, 2^5 − 1 = 31 in total. These are tabulated on the next page. Also shown are plots of R_p², MS_E(p), and C_p versus p, where the values corresponding to the best model with p − 1 variables have been plotted. These plots indicate that the model with four variables, {x_1, x_2, x_3, x_4}, is an appropriate choice.
Example of all possible regressions (continued): table of R_p², MS_E(p), and C_p for the 31 candidate models, and the corresponding plots versus p.
Stepwise regression

This is a widely used variable selection technique. It consists of an iterative procedure where variables are added or removed one at a time, and it continues until no variables can be added or removed.

Specifically, critical values of the F-score, F_in and F_out, should be chosen such that F_in ≥ F_out. The procedure starts by selecting the single-variable model with the highest F-score greater than F_in, if any. Then, a second variable is chosen to enter the model, namely the one having the highest F-score greater than F_in, if any. E.g., if x_j has been chosen at the first step, then the second variable chosen is the x_l having the highest F_l > F_in, where

$$ F_l = \frac{SS_R(\beta_l \mid \beta_0, \beta_j)}{MS_E(x_j, x_l)} . $$

Then, a third variable, x_m, is added in the same way. After this, the procedure tests whether one of the two variables x_j and x_l added at the first two steps should be deleted from the model, based on the lowest F-score smaller than F_out, if any. And so on, until the F-scores for adding a variable are all < F_in and those for deleting a variable are all > F_out.
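A simplified sketch of the stepwise procedure just described, using partial F-scores with critical values F_in ≥ F_out. The function names, thresholds, and data are illustrative assumptions, not the text's implementation.

```python
import numpy as np

def ss_e(cols, Xfull, y):
    """SS_E of the OLS fit on the intercept plus the listed columns."""
    n = len(y)
    X = np.hstack([np.ones((n, 1)), Xfull[:, list(cols)]]) if cols else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return e @ e

def stepwise(Xfull, y, f_in=4.0, f_out=4.0):
    """Add/remove variables one at a time until no change occurs."""
    n, k = Xfull.shape
    model = []
    while True:
        changed = False
        # Add the candidate with the largest partial F-score above f_in, if any.
        best = None
        for j in set(range(k)) - set(model):
            new = model + [j]
            ms_e = ss_e(new, Xfull, y) / (n - len(new) - 1)
            f = (ss_e(model, Xfull, y) - ss_e(new, Xfull, y)) / ms_e
            if f > f_in and (best is None or f > best[1]):
                best = (j, f)
        if best:
            model.append(best[0])
            changed = True
        # Delete the variable with the smallest partial F-score below f_out, if any.
        worst = None
        ms_e = ss_e(model, Xfull, y) / (n - len(model) - 1)
        for j in model:
            reduced = [v for v in model if v != j]
            f = (ss_e(reduced, Xfull, y) - ss_e(model, Xfull, y)) / ms_e
            if f < f_out and (worst is None or f < worst[1]):
                worst = (j, f)
        if worst:
            model.remove(worst[0])
            changed = True
        if not changed:
            return model

rng = np.random.default_rng(5)
Xfull = rng.normal(size=(60, 5))
y = 3.0 * Xfull[:, 0] - 2.0 * Xfull[:, 2] + rng.normal(size=60)
print(stepwise(Xfull, y))        # expected to select columns 0 and 2
```

Requiring F_in ≥ F_out is the standard safeguard against a variable being removed and immediately re-entered in successive steps.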
Variants of stepwise regression

Forward regression is the same as stepwise regression, but without variable deletion. Backward regression begins with all candidate variables in the model and eliminates variables one at a time.