CHAPTER 3: Multicollinearity and Model Selection


Prof. Alan Wan

Table of contents

1. Multicollinearity
   1.1 What is Multicollinearity?
   1.2 Consequences and Identification of Multicollinearity
   1.3 Solutions to multicollinearity
   1.4 Ridge Regression
2. Model Selection Techniques

1. Multicollinearity

1.1 What is Multicollinearity?

There is another serious consequence of adding too many variables to a model besides depleting the model's degrees of freedom (d.o.f.). If a model has many variables, it is likely that some of the variables will be strongly correlated.

It is not desirable for strong relationships to exist among the explanatory variables. This problem, known as multicollinearity, can drastically alter the results from one model to another, making them harder to interpret.

How serious the problem is depends on the degree of multicollinearity. Low correlations among the explanatory variables generally do not result in serious deterioration of the quality of O.L.S. results, but high correlations may result in highly unstable estimates.

The most extreme form of multicollinearity is perfect multicollinearity. It refers to the situation where an explanatory variable can be expressed as an exact linear combination of some of the others. Under perfect multicollinearity, O.L.S. fails to produce estimates of the coefficients: the linear dependency in the columns of X makes X'X singular, so (X'X)^{-1} does not exist. A classic example of perfect multicollinearity is the dummy variable trap.

Dummy Variable Trap

A dummy variable takes on two values, 0 or 1, to indicate whether a sample observation does or does not belong to a certain category. For example, a dummy variable could be used to indicate whether an individual is employed, by constructing the variable as
$$D_i = \begin{cases} 1 & \text{if individual } i \text{ is employed} \\ 0 & \text{if individual } i \text{ is unemployed} \end{cases}$$
One can also define the dummy variable in the opposite way, i.e.,
$$D_i^{*} = \begin{cases} 1 & \text{if individual } i \text{ is unemployed} \\ 0 & \text{if individual } i \text{ is employed} \end{cases}$$

Obviously, one cannot use D_i and D_i^* simultaneously in the same regression because D_i + D_i^* = 1 for every observation: the vector containing this sum is perfectly correlated with the intercept term (also a vector of ones).

For the same reason, in seasonal analysis we use m - 1, instead of m, dummy variables to represent the m seasons. The default season is inherently defined within the m - 1 dummy variables (a zero value for all m - 1 dummy variables indicates the default season); a minimal coding sketch is given below.
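For instance, with quarterly data one might create three dummies and let the fourth quarter be the default season. The sketch below is illustrative only: it assumes a hypothetical dataset SALESDATA with a quarter indicator QTR coded 1 to 4 and a response SALES; none of these names come from the slides.

data salesdata2;
   set salesdata;
   /* three dummies for quarters 1-3; quarter 4 is the default season */
   d1 = (qtr = 1);
   d2 = (qtr = 2);
   d3 = (qtr = 3);
run;

proc reg data=salesdata2;
   /* using only m-1 = 3 dummies together with the intercept avoids the dummy variable trap */
   model sales = d1 d2 d3;
run;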

(Imperfect) multicollinearity is also known as near collinearity: the explanatory variables are linearly correlated but they do not obey an exact linear relationship.

Consider the following three models that explain the relationship between HOUSING (number of housing starts, in thousands, in the U.S.) and POP (U.S. population in millions), GDP (U.S. Gross Domestic Product in billions of dollars) and INTRATE (new home mortgage interest rate) between 1963 and 1985:

1) HOUSING_i = β_1 + β_2 POP_i + β_3 INTRATE_i + ε_i
2) HOUSING_i = β_1 + β_4 GDP_i + β_3 INTRATE_i + ε_i
3) HOUSING_i = β_1 + β_2 POP_i + β_3 INTRATE_i + β_4 GDP_i + ε_i
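A minimal SAS sketch for fitting the three specifications. The dataset name HOUSING is an assumption (it is not given in the slides); the variable names match the models above.

proc reg data=housing;
   /* model 1: population and interest rate */
   model housing = pop intrate;
   /* model 2: GDP and interest rate */
   model housing = gdp intrate;
   /* model 3: all three regressors */
   model housing = pop gdp intrate;
run;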

Results for the first model (SAS PROC REG output, 23 observations: ANOVA table and parameter estimates for Intercept, POP and INTRATE).

Results for the second model (SAS PROC REG output, 23 observations: ANOVA table and parameter estimates for Intercept, GDP and INTRATE).

Results from Models 1) and 2) both make sense: estimates of the coefficients are of the expected signs (β_2 > 0, β_3 < 0 and β_4 > 0) and the coefficients are all highly significant.

Consider the third model, which combines the regressors of the first and second models (SAS PROC REG output, 23 observations: ANOVA table and parameter estimates for Intercept, POP, GDP and INTRATE).

In the third model, POP and GDP become insignificant, even though both are significant when entered separately in the first and second models. This is because the three explanatory variables are strongly correlated. The pairwise sample correlations are r_{GDP,POP} = 0.99 and r_{GDP,INTRATE} = 0.88, and r_{POP,INTRATE} is also high.
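These pairwise correlations can be obtained with PROC CORR; a minimal sketch, again assuming the dataset is named HOUSING:

proc corr data=housing;
   /* pairwise correlations among the explanatory variables */
   var pop gdp intrate;
run;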

Consider another example that relates EXPENSES, the cumulative expenditure on the maintenance of an automobile, to MILES, the cumulative mileage in thousands of miles, and WEEKS, the automobile's age in weeks since first purchase, for 57 automobiles. The following three models are considered:

1) EXPENSES_i = β_1 + β_2 WEEKS_i + ε_i
2) EXPENSES_i = β_1 + β_3 MILES_i + ε_i
3) EXPENSES_i = β_1 + β_2 WEEKS_i + β_3 MILES_i + ε_i

A priori, we expect β_2 > 0 and β_3 > 0: a car that is driven more should have a greater maintenance expense; similarly, the older the car, the greater the cost of maintaining it.

Results for the three models (SAS PROC REG output, 57 observations):
- Model 1: ANOVA table and parameter estimates for Intercept and WEEKS (both significant at p < 0.0001).
- Model 2: ANOVA table and parameter estimates for Intercept and MILES (both significant at p < 0.0001).
- Model 3: ANOVA table and parameter estimates for Intercept, WEEKS and MILES (WEEKS and MILES both remain significant).

It is interesting to note that even though the coefficient estimate for MILES is positive in the second model, it is negative in the third model. Thus there is a reversal in sign.

The magnitude of the coefficient estimate for WEEKS also changes substantially.

The t-statistics for MILES and WEEKS are also much lower in the third model, even though both variables are still significant.

The problem is again due to the high correlation between WEEKS and MILES.

To explain, consider the model Y_i = β_1 + β_2 X_{2i} + β_3 X_{3i} + ε_i. It can be shown that
$$\mathrm{var}(b_2) = \frac{\sigma^2}{\sum_{i=1}^{n}(X_{2i}-\bar{X}_2)^2\,(1-r_{23}^2)}
\quad\text{and}\quad
\mathrm{var}(b_3) = \frac{\sigma^2}{\sum_{i=1}^{n}(X_{3i}-\bar{X}_3)^2\,(1-r_{23}^2)},$$
where r_{23} is the sample correlation between X_{2i} and X_{3i}.

The effect of increasing r_{23} on var(b_3): when r_{23} = 0, var(b_3) = σ² / Σ_{i=1}^{n}(X_{3i} - X̄_3)² = V, say. As r_{23} grows, var(b_3) = V / (1 - r_{23}²) increases without bound; for example, r_{23} = 0.9 gives roughly 5.3V and r_{23} = 0.99 gives roughly 50V. (The factor 1/(1 - r_{23}²) is precisely the variance inflation factor introduced later in this chapter.)

The sign reversal and decrease in t values (in absolute terms) are caused by the inflated variances of the estimators.

1.2 Consequences and Identification of Multicollinearity

Common consequences of multicollinearity:
- The standard errors associated with the coefficient estimates are disproportionately large, leading to wide confidence intervals and insignificant t statistics even when the associated variables are important in explaining the variation in Y.
- R² is high, so the F test convincingly rejects H_0: β_2 = β_3 = ... = β_k = 0, yet few t values are significant.
- O.L.S. estimates are unstable and very sensitive to small changes in the data because of the high standard errors.

Multicollinearity is very much the norm in regression analysis involving non-experimental data. It can never be eliminated. The question is not about the existence or non-existence of multicollinearity, but how serious the problem is.

How to identify multicollinearity?
- High R² (and a significant F value) but low t statistics. This method is not always effective because multicollinearity can result in some, but not all, of the t values being small. It also cannot answer whether a variable is genuinely unimportant or merely appears so because of multicollinearity.
- Coefficient estimates that are sensitive to small changes in the model specification.

- High pairwise correlations between the explanatory variables; but the converse need not be true. In other words, multicollinearity can still be a problem even though the correlation between two variables is not high. It is possible for three or more variables to be strongly correlated while all pairwise correlations are low: for example, X_1 may be highly correlated with a_2 X_2 + a_3 X_3 even though the pairwise correlations between X_1 and each of X_2 and X_3 are small.

- One rule of thumb that has been suggested as an indication of serious multicollinearity is when any of the pairwise correlations among the X variables is larger than the largest of the correlations between Y and the X variables. This approach still suffers from the same limitation concerning more complex relationships among the X variables.

- The variance inflation factor (VIF). The VIF for the variable X_j is
$$VIF_j = \frac{1}{1-R_j^2},$$
where R_j² is the coefficient of determination of the regression of X_j on the remaining explanatory variables. The VIF measures the strength of the relationship between each explanatory variable and all other explanatory variables. The relationship between R_j² and VIF_j:

  R_j²:  0     0.50   0.80   0.90   0.95   0.99
  VIF_j: 1.0   2.0    5.0    10.0   20.0   100.0

Rules of thumb for using the VIF:
- An individual VIF_j larger than 10 indicates that multicollinearity may be seriously influencing the least squares estimates of the regression coefficients.
- If the average of the VIF_j's of the model exceeds 5, then multicollinearity is considered to be serious.
- If all the VIF_j's are less than 1/(1 - R²), where R² is the coefficient of determination of the full regression of Y on the X variables, then multicollinearity is not strong enough to affect the coefficient estimates. In this case, the independent variables are more strongly related to the Y variable than they are to each other.
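In SAS, VIFs are requested with the VIF option on the MODEL statement of PROC REG. A minimal sketch for the housing example (dataset name assumed, as before):

proc reg data=housing;
   /* the VIF option adds a Variance Inflation column to the parameter estimates table */
   model housing = pop gdp intrate / vif;
run;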

For the HOUSING example, the SAS PROC REG output for the third model with the VIF option (23 observations) reports the variance inflation factors for POP, GDP and INTRATE alongside the parameter estimates.

1.3 Solutions to multicollinearity

Benign neglect: if an analyst is less interested in interpreting individual coefficients and more interested in forecasting, then multicollinearity may not be a serious concern. Even with high correlations among the independent variables, if the regression coefficients are significant and have meaningful signs and magnitudes, one need not be too concerned with multicollinearity.

Eliminating variables: removing the variable that is most strongly correlated with the rest will generally improve the significance of the other variables. There is a danger, however, in removing too many variables from the model, because that would lead to bias in the estimates. Another drawback of this approach is that no information is obtained on the omitted variable.

Respecify the model: for example, in the housing regression we can express the variables in per-capita terms rather than including population as a separate explanatory variable, leading to

HOUSING_i / POP_i = β_1 + β_2 GDP_i / POP_i + β_3 INTRATE_i + ε_i

The SAS PROC REG output for this respecified model (dependent variable PHOUSING, regressors PGDP and INTRATE, 23 observations, VIF option requested) reports the parameter estimates and variance inflation factors for PGDP and INTRATE.
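A minimal sketch of how the per-capita variables might be constructed and the model refitted. The names PHOUSING and PGDP match the output described above; the input dataset name HOUSING remains an assumption.

data housing2;
   set housing;
   /* per-capita housing starts and per-capita GDP */
   phousing = housing / pop;
   pgdp = gdp / pop;
run;

proc reg data=housing2;
   model phousing = pgdp intrate / vif;
run;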

Increase the sample size if additional information is available.

Use alternative estimation techniques such as ridge regression and principal component analysis. We will touch on ridge regression, but principal component analysis is beyond the scope of this course.

1.4 Ridge Regression

Ridge regression was introduced by Hoerl and Kennard (1970, Technometrics).

Motivation: if the b_j's are unconstrained, they can explode and be susceptible to very high variance. To control the variance, we consider regularising the coefficients, i.e., controlling how large the coefficient estimates can grow.

Ridge regression is based on minimising the usual least squares criterion plus a penalty term. As such, it shrinks the coefficient estimates towards zero. This introduces bias but reduces the variance.
$$b^{ridge} = \operatorname*{argmin}_{\beta \in \mathbb{R}^k} \sum_{i=1}^{n}(y_i - \beta_1 - \beta_2 x_{i2} - \cdots - \beta_k x_{ik})^2 + \lambda \sum_{j=1}^{k} \beta_j^2
= \operatorname*{argmin}_{\beta \in \mathbb{R}^k} (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta$$

This results in the solution
$$b^{ridge} = (X'X + \lambda I)^{-1} X'Y$$
This is, in general, a biased estimator of β, but it is more efficient than b:
$$E(b^{ridge}) = (X'X + \lambda I)^{-1} X'X\,\beta, \qquad
Cov(b^{ridge}) = \sigma^2 (X'X + \lambda I)^{-1} X'X (X'X + \lambda I)^{-1}$$
The bias is zero if λ = 0, while Cov(b) - Cov(b^{ridge}) is a positive definite matrix for λ > 0. Over some range of λ, b^{ridge} has a smaller mean squared error (MSE) than b.
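A minimal sketch of computing this closed-form estimator directly with SAS/IML, assuming a dataset HOUSING as before and a fixed illustrative λ. This is for illustration only; PROC REG's RIDGE= option, used later in the chapter, is the more convenient route. Note also that the raw penalty is sensitive to the scaling of the regressors, so standardising the X variables before applying it is common practice.

proc iml;
   use housing;
   read all var {pop gdp intrate} into X;
   read all var {housing} into y;
   close housing;

   n = nrow(X);
   X = j(n, 1, 1) || X;                        /* add an intercept column */
   k = ncol(X);
   lambda = 0.1;                               /* illustrative tuning parameter */

   b_ols   = inv(X`*X) * X`*y;                 /* O.L.S. estimator */
   b_ridge = inv(X`*X + lambda*I(k)) * X`*y;   /* ridge estimator */

   print b_ols b_ridge;
quit;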

Here, λ ≥ 0 is a tuning parameter that controls the strength of the penalty term (the amount of regularisation); a large λ means more shrinkage. Note that:
- When λ = 0, we obtain the O.L.S. estimator.
- When λ = ∞, b^{ridge} = 0.
- For λ in between, we are balancing two ideas: fitting a linear model of Y on X and shrinking the coefficients.
For each λ we have a solution, so the λ's trace out a path of solutions.

We need to tune the value of λ. In their original paper, Hoerl and Kennard (1970) introduced "ridge traces":
- Plot the ridge estimates against λ. Choose a λ for which the coefficient estimates are not rapidly changing and have sensible signs.
- This has no objective basis and has been heavily criticised by others.

Hoerl and Kennard (1970) also suggested estimating λ using the O.L.S. coefficient and variance estimates. This leads to a feasible generalised ridge regression estimator.

However, if λ is estimated, this introduces a new stochastic element into the estimator. Consequently, the MSE of b^{ridge} is not necessarily smaller than that of the O.L.S. estimator, and the resulting tests based on t and F distributions are not valid (and may therefore be misleading).

Example 3.1. The following example comes from a study of manpower needs for operating a U.S. Navy Bachelor Officers Quarters (BOQ). The observations are recorded for 24 establishments. The response variable is the monthly man-hours (MANH) required to operate each establishment, and the independent variables are:
- OCCUP = average daily occupancy
- CHECKIN = monthly average number of check-ins
- HOURS = weekly hours of service desk operation
- COMMON = square feet of common use area
- WINGS = number of building wings
- CAP = operational berthing capacity
- ROOMS = number of rooms

Results obtained using SAS (PROC REG, 24 observations, VIF option): ANOVA table and parameter estimates, with variance inflation factors, for the full model with regressors OCCUP, CHECKIN, HOURS, COMMON, WINGS, CAP and ROOMS (overall F test significant at p < 0.0001).

The O.L.S. results clearly indicate serious multicollinearity.
- R² is rather high: since 1/(1 - R²) = 68.5, we have R² ≈ 0.985. Any variable whose VIF exceeds 68.5 is more closely related to the other explanatory variables than it is to the dependent variable.
- ROOMS has a VIF that exceeds 68.5, and it is not statistically significant.
- OCCUP has a VIF that exceeds 10 and a t statistic with a very small p value. One may conclude that multicollinearity exists and decreases the reliability of the estimates.
- CAP has a large VIF and is not significant; this variable might have been useful had it not been associated with multicollinearity.

The ridge trace plot (coefficient estimates plotted against the ridge parameter k) shows the paths of the estimates for OCCUP, CHECKIN, HOURS, COMMON, WINGS, CAP and ROOMS.

VIFs of the estimates at varying values of λ (SAS listing giving, for each value of _RIDGE_, the VIFs of OCCUP, CHECKIN, HOURS, COMMON, WINGS, CAP and ROOMS).

Coefficient estimates at varying values of λ (SAS listing giving, for each value of _RIDGE_, the estimates of the Intercept and of OCCUP, CHECKIN, HOURS, COMMON, WINGS, CAP and ROOMS).

Picking a value of λ is quite subjective. It appears that the coefficient estimates stabilise, have the right signs, and have reasonably small VIFs at a fairly small value of λ; the SAS program below extracts the O.L.S. results together with the ridge estimates at λ = 0.06. The printed output contains rows of type PARMS, SEB, RIDGEVIF, RIDGE and RIDGESEB for the Intercept and the seven regressors.

/* SAS program for the BOQ ridge regression example
   (the data lines for establishments A through Y are abbreviated in the transcription) */
data boq;
   input id $ occup checkin hours common wings cap rooms manh;
   drop id;
   cards;
A
B
C
:
:
X
Y
;
run;

/* O.L.S. fit with variance inflation factors */
proc reg data=boq;
   model manh = occup checkin hours common wings cap rooms / vif;
run;

/* ridge regression over a grid of ridge parameters; estimates, standard errors
   and VIFs are written to BFOUT, and a ridge trace is plotted */
proc reg data=boq outvif outseb outest=bfout ridge=0 to 0.2 by 0.02;
   model manh = occup checkin hours common wings cap rooms / noprint;
   plot / ridgeplot nomodel nostat;
run;

/* VIFs of the ridge estimates at each value of the ridge parameter */
proc print data=bfout;
   var _RIDGE_ occup checkin hours common wings cap rooms;
   where _TYPE_='RIDGEVIF';
run;

/* ridge coefficient estimates at each value of the ridge parameter */
proc print data=bfout;
   var _RIDGE_ _RMSE_ intercept occup checkin hours common wings cap rooms;
   where _TYPE_='RIDGE';
run;

/* keep the O.L.S. results (_RIDGE_ missing) and the ridge results at lambda = 0.06 */
data fbout2;
   set bfout;
   if _RIDGE_=. or _RIDGE_=0.06;
run;

proc print data=fbout2;
   var _TYPE_ _RMSE_ Intercept -- rooms;
run;

2. Model Selection Techniques

We consider two broad types of model selection techniques: all possible regressions and penalised regression.

Penalised regression bears a similarity to ridge regression, except that penalised regression actually allows some coefficients to become identically zero.

For all possible regressions, we consider the following commonly used criteria for choosing between models (a SAS sketch follows this list):
- Adjusted R² criterion: select the model with the highest adjusted R².
- Mallows C_p criterion.
- Information criteria.
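A minimal sketch of an all-possible-regressions run in SAS, using the BOQ data of Example 3.1 as an illustration. SELECTION=RSQUARE with the ADJRSQ and CP options lists each subset of regressors with its R², adjusted R² and Mallows C_p; BEST= limits the listing to the top subsets of each size.

proc reg data=boq;
   model manh = occup checkin hours common wings cap rooms
         / selection=rsquare adjrsq cp best=3;
run;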

In the process we will also examine the effects of omitting important regressors and of including irrelevant regressors.

If important variables are omitted, the effects of those variables are not taken into account, and the estimators of the other coefficients become biased (a sketch of the standard omitted-variable bias result is given below).

If unimportant variables are included, then the variances of the coefficient estimators become inflated. Thus, forecasts and estimates will be more variable than they would have been had the irrelevant regressors been excluded.
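A standard illustration of the omitted-variable bias, stated here for completeness (it is not derived above): suppose the true model contains X_2 and X_3, but X_3 is omitted and Y is regressed on X_2 alone. Then
$$E(b_2) = \beta_2 + \beta_3\,\delta_{32}, \qquad
\delta_{32} = \frac{\sum_{i=1}^{n}(X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum_{i=1}^{n}(X_{2i}-\bar{X}_2)^2},$$
so b_2 is biased unless β_3 = 0 or X_2 and X_3 are uncorrelated in the sample.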

Mallows C_p

The Mallows C_p is one of the most commonly used criteria for choosing between alternative regressions with different combinations of regressors. The C_p formula is
$$C_p = \frac{SSE_p}{MSE_F} - (n - 2p),$$
where SSE_p is the sum of squared errors of the regression with p coefficients and MSE_F is the MSE corresponding to the full model, i.e., the model that contains all of the explanatory variables.

When the estimated regression has no bias, C_p is equal to p. When evaluating which model is best, it is recommended that regressions with small C_p values, and with values close to p, be considered; a short numerical illustration is given below. If C_p is substantially larger than p, then there is a large bias component in the model.
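A hypothetical worked example (the numbers are purely illustrative): suppose the full model has MSE_F = 20, the sample size is n = 24, and a candidate sub-model with p = 4 coefficients has SSE_p = 500. Then
$$C_p = \frac{500}{20} - (24 - 2\times 4) = 25 - 16 = 9,$$
which is well above p = 4, suggesting that the sub-model omits relevant regressors and carries a sizeable bias component.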

The development of the Mallows C_p is based on estimating the trace of the MSE matrix (also known as the risk under squared error loss) of an estimator of Xβ, scaled by σ².

Consider the O.L.S. predictor Xb. The risk of Xb is given by
$$\begin{aligned}
R(Xb) &= E[(Xb - X\beta)'(Xb - X\beta)] \\
      &= E[(Xb - E(Xb) + E(Xb) - X\beta)'(Xb - E(Xb) + E(Xb) - X\beta)] \\
      &= E[(Xb - E(Xb))'(Xb - E(Xb))] + (E(Xb) - X\beta)'(E(Xb) - X\beta)
\end{aligned}$$

The first term is the sum of the variances of the elements of Ŷ = Xb, while the second term is the sum of the squared biases of the elements of Ŷ when E(Y) = Xβ is the unknown quantity of interest.

Assuming that the model is correctly specified, so that E(ε) = 0, b is unbiased and the second term vanishes.

If b is unbiased, the first term (the sum of variances) may be written as
$$E(\epsilon' X(X'X)^{-1}X'\epsilon) = E[\operatorname{tr}((X'X)^{-1}X'\epsilon\epsilon' X)] = \sigma^2 \operatorname{tr}(I_k) = \sigma^2 k$$

Now, suppose that the choice of X is uncertain and X_p is used as the regressor matrix instead. So Y = X_p β_p + u, where u = ε + X_e β_e or u = ε - X_e β_e, with X_e denoting the columns of X not included in X_p.

Thus b_p = (X_p'X_p)^{-1}X_p'Y and
$$\begin{aligned}
X_p b_p - E(X_p b_p)
&= X_p\big(\beta_p + (X_p'X_p)^{-1}X_p'\epsilon \pm (X_p'X_p)^{-1}X_p'X_e\beta_e\big) - X_p\beta_p \mp X_p(X_p'X_p)^{-1}X_p'X_e\beta_e \\
&= X_p(X_p'X_p)^{-1}X_p'\epsilon
\end{aligned}$$
Hence
$$E[(X_p b_p - E(X_p b_p))'(X_p b_p - E(X_p b_p))] = \sigma^2 p$$

So, the sum of the variances of ŷ_1, ..., ŷ_n changes from σ²k to σ²p as the number of coefficients changes from k to p. Thus, if the model is under-fitted (i.e., p < k), the sum of the variances actually decreases, whereas if the model is over-fitted (i.e., p > k), this sum increases.

However, when the model is misspecified, the second term in the MSE expression, which relates to the bias, is not always zero. Note that E(X_p b_p) = X_p(X_p'X_p)^{-1}X_p'Xβ, which equals Xβ if X_p b_p is unbiased.

Obviously, when the model is under-fitted, X_p(X_p'X_p)^{-1}X_p'Xβ ≠ Xβ. That is, a bias is introduced into the estimator of E(Y).

On the other hand, when the model is over-fitted, we can write X = X_p Z, where
$$Z = \begin{bmatrix} I_{k \times k} \\ 0_{(p-k) \times k} \end{bmatrix},$$
leading to X_p(X_p'X_p)^{-1}X_p'Xβ = Xβ. That is, the estimator remains unbiased when the model is over-fitted (a one-line verification is given below).
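To verify the last claim, a step left implicit above: substituting X = X_p Z,
$$X_p(X_p'X_p)^{-1}X_p'X\beta = X_p(X_p'X_p)^{-1}X_p'X_p Z\beta = X_p Z\beta = X\beta.$$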

So, when the model is under-fitted, the O.L.S. estimator of E(Y) is biased but has a reduced sum of variances. When the model is over-fitted, the O.L.S. estimator of E(Y) remains unbiased but its sum of variances increases.

Now, the sum of the squared biases of Ŷ is
$$\begin{aligned}
(E(X_p b_p) - X\beta)'(E(X_p b_p) - X\beta)
&= (X_p(X_p'X_p)^{-1}X_p'X\beta - X\beta)'(X_p(X_p'X_p)^{-1}X_p'X\beta - X\beta) \\
&= \beta'X'(I - X_p(X_p'X_p)^{-1}X_p')X\beta
\end{aligned}$$

To estimate the bias, which is unobservable, note that
$$(Y - X_p b_p)'(Y - X_p b_p) = Y'(I - X_p(X_p'X_p)^{-1}X_p')Y = SSE_p$$
is the sum of squared errors in the observation sample based on the estimated model with p coefficients. Using Theorem 1.17 of Seber (2008), A Matrix Handbook for Statisticians, E(Y'AY) = E(Y)'A E(Y) + tr(ΣA), where Σ = Cov(Y), we can write
$$\begin{aligned}
E[(Y - X_p b_p)'(Y - X_p b_p)]
&= E(Y)'(I - X_p(X_p'X_p)^{-1}X_p')E(Y) + \sigma^2\operatorname{tr}(I - X_p(X_p'X_p)^{-1}X_p') \\
&= \beta'X'(I - X_p(X_p'X_p)^{-1}X_p')X\beta + \sigma^2 n - \sigma^2 p
\end{aligned}$$

The first term is the sum of the squared biases. We can therefore estimate this sum by SSE_p - σ²(n - p).

Recall that the Mallows C_p is defined as R(X_p b_p)/σ². By our derivation,
$$R(X_p b_p)/\sigma^2 = p + \frac{SSE_p - \sigma^2(n-p)}{\sigma^2}$$
σ² is unknown but can be estimated by e'e/(n - k), the MSE in the ANOVA table of the full model. This yields the formula
$$C_p = p + \frac{SSE_p}{MSE_F} - n + p = \frac{SSE_p}{MSE_F} - (n - 2p)$$

Thus, if the model is correctly specified, the bias is zero and the (scaled) sum of variances, and hence C_p, should equal p.

For two models whose C_p's are both close to their respective p's, the model with the smaller C_p is preferred, because a large C_p probably indicates over-fitting, which results in a larger sum of variances than otherwise.

When the model is grossly under-fitted, the bias term will dominate the reduced variance, leading to a value of C_p substantially larger than p.

Information Criteria

Information criteria use the observed data to give each candidate model a score; this then leads to a fully ranked list of candidate models from worst to best.
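Two commonly used information criteria, stated here for reference in the linear regression setting (these are standard forms, and the expressions used in the remainder of the course may differ by additive constants): for a model with p coefficients and sum of squared errors SSE_p,
$$AIC = n\ln\!\left(\frac{SSE_p}{n}\right) + 2p, \qquad
SBC = n\ln\!\left(\frac{SSE_p}{n}\right) + p\ln n,$$
and the model with the smallest criterion value is selected. SBC (also called BIC) penalises model size more heavily than AIC whenever ln n > 2, i.e., for n ≥ 8.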


The simple linear regression model discussed in Chapter 13 was written as 1519T_c14 03/27/2006 07:28 AM Page 614 Chapter Jose Luis Pelaez Inc/Blend Images/Getty Images, Inc./Getty Images, Inc. 14 Multiple Regression 14.1 Multiple Regression Analysis 14.2 Assumptions of the Multiple

More information

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity

LECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists

More information

FAQ: Linear and Multiple Regression Analysis: Coefficients

FAQ: Linear and Multiple Regression Analysis: Coefficients Question 1: How do I calculate a least squares regression line? Answer 1: Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable

More information

Topic 18: Model Selection and Diagnostics

Topic 18: Model Selection and Diagnostics Topic 18: Model Selection and Diagnostics Variable Selection We want to choose a best model that is a subset of the available explanatory variables Two separate problems 1. How many explanatory variables

More information

5. Multiple Regression (Regressioanalyysi) (Azcel Ch. 11, Milton/Arnold Ch. 12) The k-variable Multiple Regression Model

5. Multiple Regression (Regressioanalyysi) (Azcel Ch. 11, Milton/Arnold Ch. 12) The k-variable Multiple Regression Model 5. Multiple Regression (Regressioanalyysi) (Azcel Ch. 11, Milton/Arnold Ch. 12) The k-variable Multiple Regression Model The population regression model of a dependent variable Y on a set of k independent

More information

The general linear regression with k explanatory variables is just an extension of the simple regression as follows

The general linear regression with k explanatory variables is just an extension of the simple regression as follows 3. Multiple Regression Analysis The general linear regression with k explanatory variables is just an extension of the simple regression as follows (1) y i = β 0 + β 1 x i1 + + β k x ik + u i. Because

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Linear Regression Models

Linear Regression Models Linear Regression Models November 13, 2018 1 / 89 1 Basic framework Model specification and assumptions Parameter estimation: least squares method Coefficient of determination R 2 Properties of the least

More information

Multiple Regression Analysis: Estimation. Simple linear regression model: an intercept and one explanatory variable (regressor)

Multiple Regression Analysis: Estimation. Simple linear regression model: an intercept and one explanatory variable (regressor) 1 Multiple Regression Analysis: Estimation Simple linear regression model: an intercept and one explanatory variable (regressor) Y i = β 0 + β 1 X i + u i, i = 1,2,, n Multiple linear regression model:

More information

Multiple Regression Analysis

Multiple Regression Analysis Chapter 4 Multiple Regression Analysis The simple linear regression covered in Chapter 2 can be generalized to include more than one variable. Multiple regression analysis is an extension of the simple

More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines)

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines) Dr. Maddah ENMG 617 EM Statistics 11/28/12 Multiple Regression (3) (Chapter 15, Hines) Problems in multiple regression: Multicollinearity This arises when the independent variables x 1, x 2,, x k, are

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 4 4- Basic Business Statistics th Edition Chapter 4 Introduction to Multiple Regression Basic Business Statistics, e 9 Prentice-Hall, Inc. Chap 4- Learning Objectives In this chapter, you learn:

More information

Regression Analysis. BUS 735: Business Decision Making and Research

Regression Analysis. BUS 735: Business Decision Making and Research Regression Analysis BUS 735: Business Decision Making and Research 1 Goals and Agenda Goals of this section Specific goals Learn how to detect relationships between ordinal and categorical variables. Learn

More information

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors: Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.

Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Multicollinearity Read Section 7.5 in textbook. Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Example of multicollinear

More information

Föreläsning /31

Föreläsning /31 1/31 Föreläsning 10 090420 Chapter 13 Econometric Modeling: Model Speci cation and Diagnostic testing 2/31 Types of speci cation errors Consider the following models: Y i = β 1 + β 2 X i + β 3 X 2 i +

More information

CHAPTER 6: SPECIFICATION VARIABLES

CHAPTER 6: SPECIFICATION VARIABLES Recall, we had the following six assumptions required for the Gauss-Markov Theorem: 1. The regression model is linear, correctly specified, and has an additive error term. 2. The error term has a zero

More information

Lecture 4: Multivariate Regression, Part 2

Lecture 4: Multivariate Regression, Part 2 Lecture 4: Multivariate Regression, Part 2 Gauss-Markov Assumptions 1) Linear in Parameters: Y X X X i 0 1 1 2 2 k k 2) Random Sampling: we have a random sample from the population that follows the above

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H.

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H. ACE 564 Spring 2006 Lecture 8 Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information by Professor Scott H. Irwin Readings: Griffiths, Hill and Judge. "Collinear Economic Variables,

More information

Lecture 8. Using the CLR Model. Relation between patent applications and R&D spending. Variables

Lecture 8. Using the CLR Model. Relation between patent applications and R&D spending. Variables Lecture 8. Using the CLR Model Relation between patent applications and R&D spending Variables PATENTS = No. of patents (in 000) filed RDEP = Expenditure on research&development (in billions of 99 $) The

More information

Business Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Model Evaluation and Selection Predictive Ability of a Model: Denition and Estimation We aim at achieving a balance between parsimony

More information

Chapter 12: Multiple Regression

Chapter 12: Multiple Regression Chapter 12: Multiple Regression 12.1 a. A scatterplot of the data is given here: Plot of Drug Potency versus Dose Level Potency 0 5 10 15 20 25 30 0 5 10 15 20 25 30 35 Dose Level b. ŷ = 8.667 + 0.575x

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Table 1: Fish Biomass data set on 26 streams

Table 1: Fish Biomass data set on 26 streams Math 221: Multiple Regression S. K. Hyde Chapter 27 (Moore, 5th Ed.) The following data set contains observations on the fish biomass of 26 streams. The potential regressors from which we wish to explain

More information

a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF).

a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF). STAT3503 Test 2 NOTE: a. YOU MAY USE ONE 8.5 X11 TWO-SIDED CHEAT SHEET AND YOUR TEXTBOOK (OR COPY THEREOF). b. YOU MAY USE ANY ELECTRONIC CALCULATOR. c. FOR FULL MARKS YOU MUST SHOW THE FORMULA YOU USE

More information

Multiple Regression Methods

Multiple Regression Methods Chapter 1: Multiple Regression Methods Hildebrand, Ott and Gray Basic Statistical Ideas for Managers Second Edition 1 Learning Objectives for Ch. 1 The Multiple Linear Regression Model How to interpret

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.

More information

Measurement Error. Often a data set will contain imperfect measures of the data we would ideally like.

Measurement Error. Often a data set will contain imperfect measures of the data we would ideally like. Measurement Error Often a data set will contain imperfect measures of the data we would ideally like. Aggregate Data: (GDP, Consumption, Investment are only best guesses of theoretical counterparts and

More information

Multiple Linear Regression CIVL 7012/8012

Multiple Linear Regression CIVL 7012/8012 Multiple Linear Regression CIVL 7012/8012 2 Multiple Regression Analysis (MLR) Allows us to explicitly control for many factors those simultaneously affect the dependent variable This is important for

More information

THE MULTIVARIATE LINEAR REGRESSION MODEL

THE MULTIVARIATE LINEAR REGRESSION MODEL THE MULTIVARIATE LINEAR REGRESSION MODEL Why multiple regression analysis? Model with more than 1 independent variable: y 0 1x1 2x2 u It allows : -Controlling for other factors, and get a ceteris paribus

More information

Project Report for STAT571 Statistical Methods Instructor: Dr. Ramon V. Leon. Wage Data Analysis. Yuanlei Zhang

Project Report for STAT571 Statistical Methods Instructor: Dr. Ramon V. Leon. Wage Data Analysis. Yuanlei Zhang Project Report for STAT7 Statistical Methods Instructor: Dr. Ramon V. Leon Wage Data Analysis Yuanlei Zhang 77--7 November, Part : Introduction Data Set The data set contains a random sample of observations

More information

ECON3150/4150 Spring 2016

ECON3150/4150 Spring 2016 ECON3150/4150 Spring 2016 Lecture 6 Multiple regression model Siv-Elisabeth Skjelbred University of Oslo February 5th Last updated: February 3, 2016 1 / 49 Outline Multiple linear regression model and

More information

Handout 11: Measurement Error

Handout 11: Measurement Error Handout 11: Measurement Error In which you learn to recognise the consequences for OLS estimation whenever some of the variables you use are not measured as accurately as you might expect. A (potential)

More information

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues

Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Overfitting Categorical Variables Interaction Terms Non-linear Terms Linear Logarithmic y = a +

More information

Linear Regression with Multiple Regressors

Linear Regression with Multiple Regressors Linear Regression with Multiple Regressors (SW Chapter 6) Outline 1. Omitted variable bias 2. Causality and regression analysis 3. Multiple regression and OLS 4. Measures of fit 5. Sampling distribution

More information

APPLICATION OF RIDGE REGRESSION TO MULTICOLLINEAR DATA

APPLICATION OF RIDGE REGRESSION TO MULTICOLLINEAR DATA Journal of Research (Science), Bahauddin Zakariya University, Multan, Pakistan. Vol.15, No.1, June 2004, pp. 97-106 ISSN 1021-1012 APPLICATION OF RIDGE REGRESSION TO MULTICOLLINEAR DATA G. R. Pasha 1 and

More information

Econ107 Applied Econometrics

Econ107 Applied Econometrics Econ107 Applied Econometrics Topics 2-4: discussed under the classical Assumptions 1-6 (or 1-7 when normality is needed for finite-sample inference) Question: what if some of the classical assumptions

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators Multiple Regression Relating a response (dependent, input) y to a set of explanatory (independent, output, predictor) variables x, x 2, x 3,, x q. A technique for modeling the relationship between variables.

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Use of Dummy (Indicator) Variables in Applied Econometrics

Use of Dummy (Indicator) Variables in Applied Econometrics Chapter 5 Use of Dummy (Indicator) Variables in Applied Econometrics Section 5.1 Introduction Use of Dummy (Indicator) Variables Model specifications in applied econometrics often necessitate the use of

More information

Topic 16: Multicollinearity and Polynomial Regression

Topic 16: Multicollinearity and Polynomial Regression Topic 16: Multicollinearity and Polynomial Regression Outline Multicollinearity Polynomial regression An example (KNNL p256) The P-value for ANOVA F-test is

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

CHAPTER 1: Decomposition Methods

CHAPTER 1: Decomposition Methods CHAPTER 1: Decomposition Methods Prof. Alan Wan 1 / 48 Table of contents 1. Data Types and Causal vs.time Series Models 2 / 48 Types of Data Time series data: a sequence of observations measured over time,

More information

Regression coefficients may even have a different sign from the expected.

Regression coefficients may even have a different sign from the expected. Multicolinearity Diagnostics : Some of the diagnostics e have just discussed are sensitive to multicolinearity. For example, e kno that ith multicolinearity, additions and deletions of data cause shifts

More information

Lecture 11. Correlation and Regression

Lecture 11. Correlation and Regression Lecture 11 Correlation and Regression Overview of the Correlation and Regression Analysis The Correlation Analysis In statistics, dependence refers to any statistical relationship between two random variables

More information