School of Mathematical Sciences. Question 1. Best Subsets Regression

MTH5120 Statistical Modelling I
Practical 9 and Assignment 8 Solutions

Response is Crime

Vars  R-Sq  R-Sq(adj)  Mallows Cp       S
   1  47.3       46.1        36.1  28.393
   2  58.0       56.1        22.0  25.619
   3  66.6       64.2        11.2  23.131
   4  70.0       67.2         8.0  22.154
   5  73.0       69.7         5.6  21.301
   6  74.8       71.0         4.9  20.827
   7  75.4       71.0         6.0  20.843
   8  76.4       71.4         6.5  20.680
   9  76.6       70.9         8.1  20.847
  10  76.7       70.2        10.0  21.110
  11  76.7       69.4        12.0  21.409

The candidate predictors are Age, Ed, PE, PE-1, LF, M, Pop, UE1, UE2, Wealth and IncInequ. Each row describes the best model with the number of predictors given in the Vars column.

Figure 1: Plots of the model evaluation measures.

The model with six variables (Age, Ed, PE, UE2, Wealth and IncInequ) seems to have the best values of the measures.
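The comparison of the measures across model sizes can also be sketched in code. The snippet below is only an illustration, not part of the Minitab output: it copies the rows of the best-subsets table and reports where each measure is extreme. The rule "Cp no larger than the number of parameters p = Vars + 1" is a common rule of thumb for small bias, used here as an assumption; all variable names are made up for the example.

```python
# Rows copied from the best-subsets output: (Vars, R-Sq, R-Sq(adj), Cp, S).
rows = [
    (1, 47.3, 46.1, 36.1, 28.393),
    (2, 58.0, 56.1, 22.0, 25.619),
    (3, 66.6, 64.2, 11.2, 23.131),
    (4, 70.0, 67.2, 8.0, 22.154),
    (5, 73.0, 69.7, 5.6, 21.301),
    (6, 74.8, 71.0, 4.9, 20.827),
    (7, 75.4, 71.0, 6.0, 20.843),
    (8, 76.4, 71.4, 6.5, 20.680),
    (9, 76.6, 70.9, 8.1, 20.847),
    (10, 76.7, 70.2, 10.0, 21.110),
    (11, 76.7, 69.4, 12.0, 21.409),
]

# Model size at which each measure attains its best value.
best_adj = max(rows, key=lambda r: r[2])[0]  # largest adjusted R-Sq
best_cp = min(rows, key=lambda r: r[3])[0]   # smallest Mallows Cp
best_s = min(rows, key=lambda r: r[4])[0]    # smallest S

# Rule of thumb (an assumption here): Cp <= p, with p = Vars + 1
# parameters including the constant, suggests little bias.
small_bias = [v for v, _, _, cp, _ in rows if cp <= v + 1]

print("largest R-Sq(adj) at Vars =", best_adj)
print("smallest Cp at Vars =", best_cp)
print("smallest S at Vars =", best_s)
print("sizes with Cp <= p:", small_bias)
```

The extremes do not all fall at the same model size, and the differences between five and eight variables are small, so the choice is a judgement about where the measures level off rather than a mechanical rule.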

The regression fit for the full model.

Figure 2: There are no clear indications against the model assumptions of normality and constant variance of the errors.

The regression equation is
Crime = - 684 + 1.01 Age + 1.79 Ed + 1.62 PE - 0.67 PE-1 - 0.001 LF + 0.147 M - 0.035 Pop - 0.512 UE1 + 1.68 UE2 + 0.130 Wealth + 0.736 IncInequ

Predictor     Coef  SE Coef      T      P     VIF
Constant    -684.2    151.2  -4.52  0.000
Age         1.0061   0.3803   2.65  0.012   2.293
Ed          1.7872   0.6250   2.86  0.007   4.905
PE           1.624    1.027   1.58  0.123  93.572
PE-1        -0.675    1.098  -0.61  0.543  94.566
LF         -0.0006   0.1286  -0.00  0.996   2.711
M           0.1472   0.1973   0.75  0.461   3.393
Pop        -0.0350   0.1260  -0.28  0.783   2.308
UE1        -0.5116   0.3966  -1.29  0.206   5.131
UE2         1.6841   0.8153   2.07  0.046   4.758
Wealth      0.1300   0.1004   1.29  0.204   9.416
IncInequ    0.7364   0.2070   3.56  0.001   6.846

S = 21.4095   R-Sq = 76.7%   R-Sq(adj) = 69.4%
PRESS = 30579.5   R-Sq(pred) = 55.56%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression      11  52766.5  4797.0  10.47  0.000
Residual Error  35  16042.8   458.4
Total           46  68809.3

Source    DF   Seq SS
Age        1    550.8
Ed         1   7259.7
PE         1  31738.5
PE-1       1   1243.6
LF         1    225.6
M          1   1056.1
Pop        1    409.7
UE1        1      1.2
UE2        1   3697.7
Wealth     1    783.1
IncInequ   1   5800.4

Variables PE and PE-1 have very large VIFs. Wealth and IncInequ also have somewhat inflated VIFs. It is also interesting that the sequential sum of squares for UE1 is very small, whereas for UE2 it is very large. This suggests that, for fitting the crime rate model, the unemployment of younger men is much less important than the unemployment of middle-aged men.
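To see how strong the collinearity behind these VIFs is, it may help to recall the standard definition of the variance inflation factor. Here R_j^2 denotes the coefficient of determination obtained when the j-th predictor is regressed on all the other predictors; this symbol is introduced only for the illustration and does not appear in the Minitab output. The numbers are rounded.

```latex
\[
  \mathrm{VIF}_j = \frac{1}{1 - R_j^2}
  \quad\Longrightarrow\quad
  R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j},
\]
\[
  \text{PE: } 1 - \frac{1}{93.572} \approx 0.989, \qquad
  \text{PE-1: } 1 - \frac{1}{94.566} \approx 0.989, \qquad
  \text{Wealth: } 1 - \frac{1}{9.416} \approx 0.894 .
\]
```

So about 99% of the variation in PE (and in PE-1) is explained by the remaining predictors in the full model, and about 89% of the variation in Wealth.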

Matrix Plot

Figure 3: The matrix plot suggests that PE and PE-1 are strongly positively linearly related; Wealth and IncInequ are quite strongly negatively related; some positive relationship is also seen between UE1 and UE2. There are mild relationships among many other pairs of predictors.

The predicted R^2 = 55.56% is much smaller than the adjusted R^2 = 69.4%. This means that the model may be over-fitted.

From MINITAB's Help:

Predicted R^2

Used in regression analysis to indicate how well the model predicts responses for new observations, whereas R^2 indicates how well the model fits your data. Predicted R^2 can prevent overfitting the model and can be more useful than adjusted R^2 for comparing models because it is calculated using observations not included in model estimation. Overfitting refers to models that appear to explain the relationship between the predictor and response variables for the data set used for model calculation but fail to provide valid predictions for new observations.

Predicted R^2 is calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation. Predicted R^2 ranges between 0 and 100% and is calculated from the PRESS statistic. Larger values of predicted R^2 suggest models of greater predictive ability.

For example, you work for a financial consulting firm and are developing a model to predict future market conditions. The model you settle on looks promising because it has an R^2 of 87%. However, when you calculate the predicted R^2 you see that it drops to 52%. This may indicate an overfitted model and suggests that your model will not predict new observations nearly as well as it fits your existing data.
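The link between PRESS and the predicted R^2 can be checked numerically. The sketch below assumes the usual formulas, predicted R^2 = 1 - PRESS / SS_Total and adjusted R^2 = 1 - MS_E / (SS_Total / (n - 1)), and applies them to the PRESS and error sums of squares reported for the three fits in this solution sheet. The function and variable names are invented for the example.

```python
# Values copied from the Minitab output of the three fitted models.
SS_TOTAL = 68809.3  # total sum of squares
DF_TOTAL = 46       # n - 1, with n = 47 observations

fits = {
    "full model (11 predictors)": {"press": 30579.5, "sse": 16042.8, "df_error": 35},
    "six predictors": {"press": 25837.8, "sse": 17351.1, "df_error": 40},
    "five predictors": {"press": 24971.3, "sse": 18604.0, "df_error": 41},
}


def r_sq_pred(press):
    """Predicted R^2 in percent: 1 - PRESS / SS_Total."""
    return 100.0 * (1.0 - press / SS_TOTAL)


def r_sq_adj(sse, df_error):
    """Adjusted R^2 in percent: 1 - MS_E / (SS_Total / (n - 1))."""
    return 100.0 * (1.0 - (sse / df_error) / (SS_TOTAL / DF_TOTAL))


summary = {
    name: (r_sq_pred(f["press"]), r_sq_adj(f["sse"], f["df_error"]))
    for name, f in fits.items()
}

for name, (pred, adj) in summary.items():
    print(f"{name}: R-Sq(pred) = {pred:.2f}%, R-Sq(adj) = {adj:.1f}%")
```

Up to rounding, the printed values agree with the R-Sq(pred) and R-Sq(adj) figures in the Minitab output, and the gap between the two measures shrinks as the weak predictors are removed.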

The regression fit for the model with the six best variables.

Figure 4: There are no clear indications against the model assumptions of normality and constant variance of the errors.

The regression equation is
Crime = - 619 + 1.13 Age + 1.82 Ed + 1.05 PE + 0.828 UE2 + 0.160 Wealth + 0.824 IncInequ

Predictor     Coef  SE Coef      T      P    VIF
Constant    -618.5    108.2  -5.71  0.000
Age         1.1252   0.3509   3.21  0.003  2.062
Ed          1.8179   0.4803   3.79  0.001  3.061
PE          1.0507   0.1752   6.00  0.000  2.876
UE2         0.8282   0.4274   1.94  0.060  1.382
Wealth     0.15956  0.09390   1.70  0.097  8.706
IncInequ    0.8236   0.1815   4.54  0.000  5.560

S = 20.8273   R-Sq = 74.8%   R-Sq(adj) = 71.0%
PRESS = 25837.8   R-Sq(pred) = 62.45%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       6  51458.2  8576.4  19.77  0.000
Residual Error  40  17351.1   433.8
Total           46  68809.3

Source    DF   Seq SS
Age        1    550.8
Ed         1   7259.7
PE         1  31738.5
UE2        1   2173.9
Wealth     1    803.0
IncInequ   1   8932.3

Unusual Observations

Obs  Age   Crime     Fit  SE Fit  Residual  St Resid
 11  124  167.40  112.86    8.24     54.54     2.85R
 29  119  104.30  142.61   11.55    -38.31    -2.21R

The variable Wealth has a rather large VIF (although not bigger than 10), and its p-value suggests that it may not be strongly significant in the presence of the other variables. Both the adjusted and the predicted R^2 have improved, and the predicted R^2 = 62.45% is now a bit closer to the adjusted R^2 = 71.0%.

The regression fit for the model with the best five variables.

Figure 5: There is no apparent contradiction to the model assumptions of normality and constant variance.

The regression equation is
Crime = - 524 + 1.02 Age + 2.03 Ed + 1.23 PE + 0.914 UE2 + 0.635 IncInequ

Predictor     Coef  SE Coef      T      P    VIF
Constant   -524.37    95.12  -5.51  0.000
Age         1.0198   0.3532   2.89  0.006  1.998
Ed          2.0308   0.4742   4.28  0.000  2.853
PE          1.2331   0.1416   8.71  0.000  1.796
UE2         0.9136   0.4341   2.10  0.041  1.363
IncInequ    0.6349   0.1468   4.32  0.000  3.480

S = 21.3013   R-Sq = 73.0%   R-Sq(adj) = 69.7%
PRESS = 24971.3   R-Sq(pred) = 63.71%

Analysis of Variance

Source          DF     SS     MS      F      P
Regression       5  50206  10041  22.13  0.000
Residual Error  41  18604    454
Total           46  68809

Source    DF  Seq SS
Age        1     551
Ed         1    7260
PE         1   31739
UE2        1    2174
IncInequ   1    8483

Unusual Observations

Obs  Age   Crime     Fit  SE Fit  Residual  St Resid
 11  124  167.40  104.44    6.74     62.96     3.12R
 19  130   75.00  118.39    6.22    -43.39    -2.13R
 29  119  104.30  149.64   11.02    -45.34    -2.49R

Here we see that each explanatory variable is significant in the presence of the other variables and that the variance inflation factors are not large, although the VIF for IncInequ is slightly raised. The adjusted R^2 = 69.7% is not much smaller than for the model with six regressors, and the predicted R^2 = 63.71% is better still.

We may say that the crime rate in the USA in 1960 depended on age distribution, years of schooling, police expenditure, the unemployment rate of middle-aged men and income inequality. Other variables, not included in the model, are either not relevant or are correlated with those which are in the model. For example, Wealth is not included in the model because it does not contribute much to the regression sum of squares SS_R given that IncInequ is already in the model. It does not mean, however, that the crime rate is unrelated to the variable Wealth.

Question 2

2.1

The regression equation is
Crime = - 524 + 1.02 Age + 2.03 Ed + 1.23 PE + 0.914 UE2 + 0.635 IncInequ

Predictor     Coef  SE Coef      T      P    VIF
Constant   -524.37    95.12  -5.51  0.000
Age         1.0198   0.3532   2.89  0.006  1.998
Ed          2.0308   0.4742   4.28  0.000  2.853
PE          1.2331   0.1416   8.71  0.000  1.796
UE2         0.9136   0.4341   2.10  0.041  1.363
IncInequ    0.6349   0.1468   4.32  0.000  3.480

(a) Predictor PE shows a strong positive linear relationship with PE-1. This makes the corresponding columns of the design matrix X nearly linearly dependent, and so makes the determinant of X^T X close to zero. The result is inflated standard errors of the parameter estimates and hence smaller T-statistics, which have the standard errors in the denominator. Eliminating the main near-linear column, PE-1, from X reduces the multicollinearity, so the standard errors are no longer as large and the T-statistics are larger.

(b) From MINITAB:

Predicted Values for New Observations

NewObs    Fit  SE Fit          95% CI           95% PI
     1  91.70    3.12  (85.40, 97.99)  (48.22, 135.17)

Values of Predictors for New Observations

NewObs  Age   Ed    PE   UE2  IncInequ
     1  139  106  85.0  34.0       194

We may say with 95% confidence that the expected crime rate in an average state lies in the interval (85.40, 97.99); that is, the expected number of offences known to the police per one million population is between 85 and 98. The 95% prediction interval is (48.22, 135.17): we predict with 95% confidence that the observed crime rate in an average state will be between about 48 and 135 offences per one million population.

(c) The formulae for the PI and the CI are as follows.

A 100(1 - α)% prediction interval for a new observation $Y_0$ at $x_0$ is given by
\[
  \hat{Y}_0 \pm t_{\alpha/2,\,n-p} \sqrt{S^2 \left\{ 1 + x_0^T (X^T X)^{-1} x_0 \right\}} .
\]

A 100(1 - α)% confidence interval for the expected response $E(Y_0)$ at $x_0$ is given by
\[
  \hat{Y}_0 \pm t_{\alpha/2,\,n-p} \sqrt{S^2 \, x_0^T (X^T X)^{-1} x_0} .
\]

Here, $\hat{Y}_0$ is the fitted response, $S^2$ is the $MS_E$ of the fitted model, $x_0$ is the vector of the predictor values and $X$ is the design matrix. Also, $t_{\alpha/2,\,n-p}$ denotes the upper α/2 point of the t-distribution with n - p degrees of freedom, where n is the number of observations and p is the number of model parameters.
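The two intervals in (b) can be reproduced from the quantities Minitab reports, which gives a quick check of the formulae in (c). The sketch below relies on the fact that SE Fit is the square root of S^2 x_0^T (X^T X)^{-1} x_0, so the prediction interval only needs S^2 + (SE Fit)^2 under the square root. The critical value 2.0195 for the t-distribution with 41 degrees of freedom is an approximate value taken from tables and hard-coded as an assumption; the variable names are invented for the example.

```python
import math

# Quantities read from the Minitab output for the five-predictor model.
fit = 91.70     # fitted value at x0
se_fit = 3.12   # SE Fit = sqrt(S^2 * x0' (X'X)^{-1} x0)
s = 21.3013     # S = sqrt(MS_E)

# Assumption: upper 2.5% point of the t-distribution with
# n - p = 47 - 6 = 41 degrees of freedom, approximately 2.0195 (from tables).
T_CRIT = 2.0195

# Confidence interval for the expected response: fit +/- t * SE Fit.
ci_half = T_CRIT * se_fit
ci = (fit - ci_half, fit + ci_half)

# Prediction interval for a new observation:
# fit +/- t * sqrt(S^2 + SE Fit^2), since S^2 {1 + x0'(X'X)^{-1} x0}
# equals S^2 + (SE Fit)^2.
pi_half = T_CRIT * math.sqrt(s**2 + se_fit**2)
pi = (fit - pi_half, fit + pi_half)

print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The results agree with Minitab's intervals up to rounding of the reported SE Fit and of the critical value. The prediction interval is much wider because it includes the variability S^2 of a single new observation in addition to the uncertainty in the fitted mean.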

2.2

The graph in Figure 1 can help to choose a candidate set of explanatory variables. It presents the values of the measures described above for the best model of each fixed size. The point where the values of these measures level off, or attain a minimum (or maximum), indicates the potentially best set of candidate explanatory variables. The graph cannot show which variables in the fitted model will be significant or which variables may be related to each other, nor can it tell us anything about the residual diagnostics.