Multiple Linear Regression Spatial Application I: State Homicide Rates

Equations taken from Zar, 1984.

ŷ = a + b1x1 + b2x2 + … + bnxn, where n is the number of independent variables

Example: In an earlier bivariate regression example we attempted to predict the state homicide rate (Homicide) using only poverty (PCTPov). That model could explain approximately 30% of the variation in homicide rates using poverty as the only explanatory variable. Now we want to improve the earlier model by incorporating additional explanatory (independent) variables. In this example we have added percent minority population (PCTMinority) and the per capita alcohol consumption level. Our hypothesis is that these three independent variables together will explain a significant portion of the variation in state homicide rates. However, we are concerned that the independent variables may be correlated with one another, which would violate an assumption of multiple regression.

Multiple regression assumes that the independent variables are not correlated with one another. When the independent variables are themselves related, the condition is termed multicollinearity. Remember that in regression we partition out the explanatory power of one variable while holding the others constant. Think of multicollinearity as overlap in explanatory power among the independent variables. This overlap makes it impossible to determine which of the independent variables is explaining the dependent variable; in other words, we cannot hold the independent variables constant, since they are associated with each other. Luckily there are several tools available in SPSS that allow us to gauge the severity of any collinearity among the independent variables. Since this process is very similar to that of bivariate linear regression, we will let SPSS do the calculations.

For our example: y = Homicide, x1 = percent poverty (PCTPov), x2 = percent minority (PCTMinority), x3 = per capita alcohol consumption.

State   PCTPov   PCTMinority   Alcohol   Homicide
AL      6.       28.9          .555      4.4
AZ      3.9      24.54         2.2       0.
AR      5.8      20.06         .455      3.2
CA      4.2      40.59         .837      4.5
CO      9.3      7.27          2.204     6.2
CT      7.9      8.43          .853      7.3
DE      9.2      25.38         2.649     5
D.C.    20.2     69.36         3.47      78.
FL      2.5      22.02         2.297     .2
GA      3        34.93         .774      2.4
ID      .8       9.07          .942      3.4
IL      0.7      26.54         .907      3.2
IN      9.5      2.55          .602      7.2
IA      9.       6.03          .692      2.3
KS      9.9      4.53                    8.6
KY      5.8      9.96          .444      6.7
LA      9.6      36.09         .924      22.7
ME      0.9      3.02          2.033     .9
MD      8.5      35.98         .775      4.6
MA      9.3      5.5           2.066     4.4
MI      0.5      9.9           .739      .9
MN      7.9      0.52          2.03      3.4
MS      9.9      38.64         .728      9.2
MO      .7       5.6           .88       2.6
MT      4.6      9.38          2.85      4.7
NE      9.7      0.37          .828      4.
NV      0.5      24.78         3.232     .9
NH      6.5      3.99          3.454     2.3
NJ      8.5      27.5          .855      5.9
NM      8.4      33.22         .976      0.8
NY      4.6      32.07         .598      4.5
NC      2.3      27.92         .687      2.6
ND      .9       7.54          2.097     3.4
OH      0.6      5.08          .656      5.9
OK      4.7      23.94         .584      0.
OR      .6       3.56          .996      5
PA      4.6      2.83                    7.9
RI      .9       5.03          2.054     4.9
SC      4.       32.8          .967      .5
SD      3.2      .3            .966      4.4
TN      3.5      9.8           .638      .6
TX      5.4      29.03         .83       3
UT      9.4      0.82          .033      3.6
VT      9.4      3.28          2.0       2.6
VA      9.6      27.7          .77       8.8
WA      0.6      8.3           .848      5.7
WV      7.9      5.02          .437      7.9
WI      8.7      2.34                    5
WY      .4       8.04          2.357     3.2
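As a cross-check on what SPSS will compute below, the model ŷ = a + b1x1 + b2x2 + b3x3 can be fit by ordinary least squares in a few lines. This is a minimal NumPy sketch; the observations are invented for illustration, not the actual state values from the table.

```python
# A minimal least-squares sketch of y-hat = a + b1*x1 + b2*x2 + b3*x3.
# The numbers below are invented illustration data, NOT the state data above.
import numpy as np

# columns: poverty %, minority %, alcohol consumption (gallons/year)
X = np.array([
    [13.9, 24.5, 2.2],
    [15.8, 20.1, 1.5],
    [14.2, 40.6, 1.8],
    [ 9.3,  7.3, 2.2],
    [19.6, 36.1, 1.9],
    [10.9,  3.0, 2.0],
])
y = np.array([10.1, 13.2, 14.5, 6.2, 22.7, 1.9])  # homicide rate

# prepend a column of ones so the first fitted coefficient is the intercept a
A = np.column_stack([np.ones(len(y)), X])
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ coef
r_squared = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(coef.round(3), round(float(r_squared), 3))
```

Here `coef` holds a, b1, b2, b3, and `r_squared` plays the role of the R Square column in the SPSS output.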
SPSS Output for the State Homicide Rates Example: First Try

Descriptive Statistics
                      Mean      Std. Deviation   N
Homicide              9.94      11.0069          49
PCTPov                12.88     3.3408           49
PCTMinority           20.26     12.47224         49
Alcohol consumption   1.94733   .443452          49

Model Summary(b)
R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
.853a   .728       .710                5.9252                       2.327
a. Predictors: (Constant), alcohol consumption, PCTMinority, PCTPov
b. Dependent Variable: Homicide

The multivariate model has substantially more explanatory power than the earlier bivariate model (adjusted R square of 0.710 vs. 0.32). Another means of assessing the power of the model is to compare the standard deviation of the dependent variable (Homicide) to the standard error of the estimate. Without prior knowledge of poverty, minority population, and alcohol consumption, the best guess at the homicide rate is 9.94, with a standard deviation of 11.00. Note that the standard error of the estimate is only 5.9252, or about half of the standard deviation of homicides. This indicates that the predicted values from our model have a much lower error level (amount of deviation).

Correlations
Pearson Correlation
                      Homicide   PCTPov   PCTMinority   Alcohol
Homicide              1.000      .566     .82           .265
PCTPov                .566       1.000    .53           -.139
PCTMinority           .82        .53      1.000         .128
Alcohol               .265       -.139    .128          1.000

Sig. (1-tailed)
                      Homicide   PCTPov   PCTMinority   Alcohol
Homicide              .          .000     .000          .033
PCTPov                .000       .        .000          .170
PCTMinority           .000       .000     .             .190
Alcohol               .033       .170     .190          .

You can have SPSS print out a correlation matrix for the regression variables by accessing Analyze > Regression > Linear > Statistics, then clicking the Part and partial correlations radio button. There is some correlation among the independent variables; the most potentially problematic is between PCTPov and PCTMinority (r = 0.53, p = 0.000). The other correlations among the independent variables are not significant: PCTMinority and alcohol consumption (r = 0.128, p = 0.190), and PCTPov and alcohol consumption (r = -0.139, p = 0.170).
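The correlation matrix SPSS prints is an ordinary Pearson correlation matrix. A minimal sketch of the same computation, using invented illustration values rather than the state data:

```python
# Sketch of the pairwise Pearson correlation matrix among the independent
# variables. Values are invented for illustration.
import numpy as np

poverty  = np.array([13.9, 15.8, 14.2,  9.3, 19.6, 10.9, 19.9])
minority = np.array([24.5, 20.1, 40.6,  7.3, 36.1,  3.0, 38.6])
alcohol  = np.array([ 2.2,  1.5,  1.8,  2.2,  1.9,  2.0,  1.7])

# each row holds one variable's correlations with the others;
# the diagonal is each variable's correlation with itself (exactly 1)
r = np.corrcoef([poverty, minority, alcohol])
print(r.round(3))
```

The matrix is symmetric, so each off-diagonal correlation appears twice, just as in the SPSS table.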
ANOVA(b)
            Sum of Squares   df   Mean Square   F        Sig.
Regression  4235.465         3    1411.822      40.213   .000a
Residual    1579.875         45   35.108
Total       5815.340         48
a. Predictors: (Constant), alcohol consumption, PCTMinority, PCTPov
b. Dependent Variable: Homicide

From the F test we see that at least one of the slope coefficients is significantly different from zero, so the model has meaning.

Coefficients(a)
              Unstandardized        Standardized                    Correlations                   Collinearity Statistics
              B         Std. Error  Beta           t       Sig.    Zero-order  Partial  Part      Tolerance   VIF
(Constant)    -22.282   5.500                      -4.05   .000
PCTPov        .828      .312        .25            2.656   .011    .566        .368     .206      .675        1.482
PCTMinority   .574      .083        .65            6.89    .000    .82         .77      .535      .677        1.478
Alcohol       5.390     2.006       .27            2.686   .010    .265        .372     .209      .924        1.082
a. Dependent Variable: Homicide

The table above gives the collinearity diagnostics; it can be accessed through Analyze > Regression > Linear > Statistics, then clicking the Collinearity diagnostics radio button. Tolerance is the proportion of the variance in a given independent variable that cannot be explained by the other independent variables. In this case 67.5% of the variance in PCTPov, 67.7% of the variance in PCTMinority, and 92.4% of the variance in alcohol consumption cannot be explained by the others, meaning that relatively little of the variance in each independent variable is explained by the rest. Also, VIFs (variance inflation factors) higher than 2 are considered problematic, and our VIFs are all less than 1.5. These independent variables are not highly correlated with each other, and therefore multicollinearity should not be a problem.

Also from the above table we get the regression model:

Homicide = -22.282 + 0.828(PCTPov) + 0.574(PCTMinority) + 5.390(alcohol consumption)

From this model we can see that all of the variables are positively related to homicide, meaning that as each increases, homicides increase. PCTPov and PCTMinority have approximately the same influence on the model (0.828 and 0.574) since they are both in percent units. Alcohol consumption is measured in different units (gallons per year), so its slope parameter is not directly comparable to the others.
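Tolerance and VIF can be reproduced outside SPSS: regress each independent variable on the remaining ones, then Tolerance = 1 - R² of that auxiliary regression and VIF = 1/Tolerance. A sketch with invented illustration data; the function name is ours, not an SPSS or NumPy API:

```python
# Sketch of how the collinearity statistics are defined: for each
# independent variable, Tolerance = 1 - R^2 from regressing it on the
# remaining independent variables, and VIF = 1 / Tolerance.
import numpy as np

def tolerance_and_vif(X):
    """X: (n_obs, n_vars) array of independent variables."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, _, _, _ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()       # R^2 of the auxiliary regression
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out

# invented illustration data: poverty %, minority %, alcohol (gal/yr)
X = np.array([[13.9, 24.5, 2.2],
              [15.8, 20.1, 1.5],
              [14.2, 40.6, 1.8],
              [ 9.3,  7.3, 2.2],
              [19.6, 36.1, 1.9],
              [10.9,  3.0, 2.0],
              [19.9, 38.6, 1.7]])
for tol, vif in tolerance_and_vif(X):
    print(round(tol, 3), round(vif, 3))
```

Because R² of the auxiliary regression is between 0 and 1, Tolerance is at most 1 and VIF is at least 1; VIF grows without bound as collinearity worsens.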
[Residual plot: regression standardized residuals vs. regression standardized predicted values; dependent variable: Homicide]

The residual plot reveals two important considerations: the extreme outlier (Washington, D.C.), and that as the residual values increase, the predicted values decrease.

[Scatterplot: Homicide vs. unstandardized predicted value; R Sq Linear = 0.728]

The Washington, D.C. observation is having a substantial influence on the regression line. Given its undue influence, it might be best to remove it from the analysis and rerun the model.
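One way to flag an extreme observation like Washington, D.C. numerically, rather than by eye, is to z-score the residuals and look for values beyond about +/-2. A sketch with invented residuals; the 25.0 mimics a D.C.-like outlier:

```python
# Sketch: flagging outliers by z-scoring the residuals; values beyond
# about +/-2 warrant a closer look. Residuals are invented for illustration.
import numpy as np

residuals = np.array([1.2, -0.8, 0.3, -1.1, 0.6, 25.0])  # last mimics D.C.
std_resid = (residuals - residuals.mean()) / residuals.std(ddof=1)
outliers = np.where(np.abs(std_resid) > 2)[0]
print(outliers)
```

Only the last observation exceeds the threshold, which is the same judgment we made from the residual plot.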
SPSS Output for the State Homicide Rates Example: Second Try

Model Summary(b)
R       R Square   Adjusted R Square   Std. Error of the Estimate
.885a   .784       .769                2.293
a. Predictors: (Constant), alcohol consumption, PCTMinority, PCTPov
b. Dependent Variable: Homicide

The explanatory power of the model has now increased from 0.710 to 0.769 (adjusted R square) and the standard error of the estimate has decreased from 5.9252 to 2.293. The decrease in the standard error of the estimate means that on average the predicted values are much closer to the observed values. It appears that the model has improved.

ANOVA(b)
            Sum of Squares   df   Mean Square   F        Sig.
Regression  837.819          3    279.273       53.110   .000a
Residual    231.369          44   5.258
Total       1069.188         47
a. Predictors: (Constant), alcohol consumption, PCTMinority, PCTPov
b. Dependent Variable: Homicide

The F test tells us that the variation explained by the model is not due to chance.

Coefficients(a)
              Unstandardized        Standardized
              B        Std. Error   Beta           t       Sig.
(Constant)    -2.446   2.463                       -.993   .326
PCTPov        .470     .123         .32            3.832   .000
PCTMinority   .32      .036         .695           8.948   .000
Alcohol       -.458    .858         -.040          -.534   .596
a. Dependent Variable: Homicide

Note that the y-intercept (constant) is not significant (p = 0.326). This is a product of the alcohol consumption variable, which is also not significant (p = 0.596). This variable should be dropped from the analysis since it is not helpful in predicting homicides.

[Residual plot: regression standardized residuals vs. regression standardized predicted values; dependent variable: Homicide]

Note that the predicted vs. residual plot now appears to be randomly distributed. This signals that we are on the right path to perfecting our model.
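The t statistic SPSS reports for each coefficient is simply B divided by its standard error, which is why the non-significant rows stand out. Two rows of the second-try coefficient table can be reproduced directly:

```python
# t = B / Std. Error, reproducing two rows of the second-try coefficient
# table (B and Std. Error values taken from the SPSS output above)
b_const, se_const = -2.446, 2.463      # (Constant)
b_alcohol, se_alcohol = -0.458, 0.858  # alcohol consumption

t_const = b_const / se_const        # -0.993 in the SPSS table
t_alcohol = b_alcohol / se_alcohol  # -0.534 in the SPSS table
print(round(t_const, 3), round(t_alcohol, 3))
```

Small |t| values like these correspond to the large Sig. values (0.326 and 0.596) that led us to drop the alcohol variable.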
SPSS Output for the State Homicide Rates Example: Third Try

Model Summary(b)
R       R Square   Adjusted R Square   Std. Error of the Estimate
.884a   .782       .773                2.2748
a. Predictors: (Constant), PCTMinority, PCTPov
b. Dependent Variable: Homicide

The explanatory power of the model has again increased, from 0.769 to 0.773 (adjusted R square), and the standard error of the estimate has decreased from 2.293 to 2.2748. It appears that the model has continued to improve.

ANOVA(b)
            Sum of Squares   df   Mean Square   F        Sig.
Regression  836.322          2    418.161       80.807   .000a
Residual    232.866          45   5.175
Total       1069.188         47
a. Predictors: (Constant), PCTMinority, PCTPov
b. Dependent Variable: Homicide

Again, the F test tells us that the variation explained by the model is not due to chance.

Coefficients(a)
              Unstandardized        Standardized
              B        Std. Error   Beta           t        Sig.
(Constant)    -3.556   1.306                       -2.724   .009
PCTPov        .490     .116         .325           4.23     .000
PCTMinority   .32      .036         .695           9.06     .000
a. Dependent Variable: Homicide

However, now the y-intercept and both independent variables are significant, resulting in the final model:

Homicide = -3.556 + 0.490(PCTPov) + 0.32(PCTMinority)

[Scatterplot: observed homicide rate vs. unstandardized predicted value, points labeled by state; R Sq Linear = 0.782]

The observed vs. predicted plot shows that the model fits the data well. There does not appear to be any spatial bias to the predicted values (i.e., no single region appears to be over- or under-predicted). Given the variables in our data set, this is the best model for explaining homicides. It took several attempts to develop the final model; very rarely are models perfect the first time. Use the tools available to produce the best possible model.
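The final model can be applied to new values directly. A sketch using the coefficients above; the 15% poverty, 25% minority state is hypothetical:

```python
# Final model from the third try:
# Homicide = -3.556 + 0.490*(PCTPov) + 0.32*(PCTMinority)
def predict_homicide(pct_pov, pct_minority):
    """Predicted homicide rate from the final regression model."""
    return -3.556 + 0.490 * pct_pov + 0.32 * pct_minority

# hypothetical state: 15% poverty, 25% minority population
print(round(predict_homicide(15.0, 25.0), 2))  # -3.556 + 7.35 + 8.0, about 11.79
```

As the prose notes, both slopes are positive: raising either poverty or minority percentage raises the predicted homicide rate.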