Multiple Regression Analysis
Whereas simple linear regression has 2 variables (1 dependent, 1 independent):

ŷ = a + bx

multiple linear regression has more than 2 variables (1 dependent, many independent):

ŷ = a + b1x1 + b2x2 + ... + bnxn

The problems and solutions are the same as in bivariate regression, except there are more parameters to estimate.
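The estimation itself works the same way as in the bivariate case: the coefficients minimize the sum of squared residuals. A minimal sketch for the two-predictor case, solving the normal equations (XᵀX)b = Xᵀy in pure Python (the data below are made up for illustration):

```python
# Fit y = a + b1*x1 + b2*x2 by ordinary least squares, solving the
# normal equations (X'X)b = X'y with Gaussian elimination.

def fit_two_predictors(x1, x2, y):
    X = [[1.0, u, v] for u, v in zip(x1, x2)]          # design matrix with intercept
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    # Augmented matrix, Gaussian elimination with partial pivoting
    A = [row[:] + [t] for row, t in zip(XtX, Xty)]
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        coeffs[r] = (A[r][n] - sum(A[r][c] * coeffs[c]
                                   for c in range(r + 1, n))) / A[r][r]
    return coeffs  # [a, b1, b2]

# Synthetic data generated exactly from y = 1 + 2*x1 + 3*x2:
a, b1, b2 = fit_two_predictors([0, 1, 2, 3, 1], [0, 0, 1, 2, 3], [1, 3, 8, 13, 12])
```

With noise-free data the fitted coefficients recover a = 1, b1 = 2, b2 = 3.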
In bivariate regression we fit a line through points plotted in 2-dimensional space:
In multiple regression with 3 variables we fit a plane through points plotted in 3-dimensional space. Additional variables add additional dimensions to the variable space.
In addition to the assumptions of bivariate regression, multiple regression adds the assumption of no multicollinearity among the independent variables. Multicollinearity occurs when two or more of the independent variables are highly correlated, making it difficult to separate their effects on the dependent variable.
Example: Determine the strength of the relationship between Native American male standing height, average yearly minimum temperature, and annual temperature range.

Variables:
  MHT         Male Standing Height (cm)   Dependent
  AnnMinTemp  Annual Minimum Temp (°F)    Independent
  AnnRange    Annual Temp Range (°F)      Independent
Model Summary(b)
  Model 1: R = .654(a), R Square = .428, Adjusted R Square = .416,
  Std. Error of the Estimate = 30.04066, Durbin-Watson = 1.683
  a. Predictors: (Constant), AnnRange, AnnMinTemp
  b. Dependent Variable: MHT
41.6% of height is explained by minimum temperature and range.

ANOVA(b)
  Regression: Sum of Squares = 63546.875, df = 2, Mean Square = 31773.438, F = 35.208, Sig. = .000(a)
  Residual:   Sum of Squares = 84829.502, df = 94, Mean Square = 902.442
  Total:      Sum of Squares = 148376.496, df = 96
  a. Predictors: (Constant), AnnRange, AnnMinTemp
  b. Dependent Variable: MHT
The model is significant.

Coefficients(a)
              B         Std. Error  Beta   t        Sig.   Tolerance  VIF
  (Constant)  1665.620  15.964             104.334  .000
  AnnMinTemp  4.492     .603        .855   7.446    .000   .462       2.166
  AnnRange    1.565     .552        .325   2.834    .006   .462       2.166
  a. Dependent Variable: MHT
The slopes are not zero. There is some collinearity.
We interpret the regression equation as follows:

Male Standing Height = 1665.6 + 4.49(°F min temp) + 1.57(°F temp range)

Every 1 °F increase in minimum temperature adds 4.49 centimeters to male standing height, holding the temperature range constant. Similarly, every 1 °F increase in the annual temperature range adds 1.57 centimeters to male standing height, holding the minimum temperature constant.
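Evaluating the fitted equation for a pair of temperature values is just a weighted sum; a sketch using the coefficients from the SPSS table:

```python
# Predicted male standing height from the two-predictor model
# (coefficients as reported in the SPSS Coefficients table).
def predict_height(min_temp_f, temp_range_f):
    return 1665.620 + 4.492 * min_temp_f + 1.565 * temp_range_f
```

Increasing `min_temp_f` by 1 raises the prediction by exactly 4.492 units, and increasing `temp_range_f` by 1 raises it by 1.565, which is what "holding the other variable constant" means.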
Normality of the residuals is one of the most important assumptions of linear regression. In this case the residuals are normally distributed.
The plot of observed versus predicted values does not display any systematic bias; such bias would indicate that the independent variables vary systematically with each other.
From the coefficients table above (Tolerance = .462 for both predictors): Tolerance is the amount of the variance in a given independent variable that cannot be explained by the other independent variables. In this case 46.2% of the variance in one cannot be explained by the other, meaning that 53.8% of the variance IS shared, or collinear.

This is why the standard error of the estimate (30.04066 in the model summary) is so large. The standard error of the estimate is the average error expressed in the original units (e.g. centimeters). 30 cm is a foot of error... in a person's height.
VIFs (variance inflation factors) higher than 2 are considered problematic (according to SPSS), and our VIFs, at 2.166, are just over 2.1.
The standardized beta values indicate the strength of the relationship between each independent variable and the dependent variable. Minimum temperature (Beta = .855) is a much stronger predictor of height than annual range (Beta = .325).
The question becomes: do these collinearity statistics rise to the level of indicating multicollinearity among the independent variables? In this example they do.

Correlations (N = 97)
  AnnMinTemp vs. AnnRange: Pearson Correlation = -.734**, Sig. (2-tailed) = .000
  **. Correlation is significant at the 0.01 level (2-tailed).
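With only two independent variables, Tolerance and VIF follow directly from their Pearson correlation: Tolerance = 1 − r², VIF = 1/Tolerance. A quick check against the SPSS output (values differ slightly because r is rounded to -.734):

```python
# Tolerance and VIF from the correlation between the two predictors.
r = -0.734  # Pearson correlation between AnnMinTemp and AnnRange

tolerance = 1 - r ** 2   # share of variance NOT explained by the other predictor
vif = 1 / tolerance      # variance inflation factor
```

This reproduces the table's Tolerance of .462 and VIF of 2.166 to within rounding.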
Misspecification is an error in the regression equation due to the exclusion of an independent variable that influences the dependent variable, OR the inclusion of an independent variable that does not influence the dependent variable. Misspecification errors are common, since it is difficult to know a priori which factors influence the dependent variable. Misspecification is a hypothesis issue, not a statistical one.
Data Transformation
Often the association between two variables is not linear. Data transformation (log, etc.) is perfectly acceptable. The type of transformation must be stated in your summary statement.
In this case, log transforming the population data created a linear relationship.
Converting to natural log is easy. For example, the mining town of Argentine has a population of 100; its natural log is: ln(pop) = ln(100) = 4.60517. Converting back to the original units is also easy: pop = e^4.60517 = 100.
Calculator transformations: Converting to a log: use the ln key. Converting from a log: use the e^x key.
SPSS transformations: Converting to a log: Transform > Compute Variable > Arithmetic > Ln. Converting from a log: Transform > Compute Variable > Arithmetic > Exp.
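The same conversions in Python use the standard library's `math.log` and `math.exp`:

```python
import math

# Argentine's population of 100, converted to a natural log and back:
pop = 100
pop_ln = math.log(pop)       # ln(100) = 4.60517...
pop_back = math.exp(pop_ln)  # e**4.60517... = 100
```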
Population and Elevation in Colorado Mining Towns. The model is significant. What is the standard error of the estimate telling us? What are the units?

Predicted Population = 46852.9 − 4.238(Elevation)
Population and Elevation in Colorado Mining Towns: Log Transformation. The model is significant. What is the standard error of the estimate telling us? What are the units?

ln(population) = 33.108 − 0.003(elevation)
Town          Population  Elevation (ft)  ln(population)  ln(predicted)  ln(residual)
Argentine     100         11161           4.61            4.90195        -0.29678
Boreas        200         11535           5.30            3.95677         1.34155
Breckenridge  8000        9597            8.99            8.85453         0.13267
Buckskin Joe  500         10860           6.21            5.66264         0.55196
Chihuahua     200         10571           5.30            6.39301        -1.09469
Dudley        200         10400           5.30            6.82517        -1.52685
Fairplay      8000        9931            8.99            8.01043         0.97676
Hamilton      3000        9997            8.01            7.84364         0.16273
Horseshoe     800         10544           6.68            6.46125         0.22337
Lamartine     500         10485           6.21            6.61035        -0.39574
Lincoln       1500        10384           7.31            6.86560         0.44762
Montezuma     800         10358           6.68            6.93131        -0.24670
Mosquito      250         10720           5.52            6.01645        -0.49499
Park City     300         10587           5.70            6.35258        -0.64879
Parkville     10000       9944            9.21            7.97758         1.23276
Quartzville   200         11424           5.30            4.23729         1.06103
Rexford       50          11201           3.91            4.80086        -0.88884
Sacramento    100         11398           4.61            4.30300         0.30217
Saints John   200         10798           5.30            5.81933        -0.52101
Silverheels   150         10771           5.01            5.88757        -0.87693
Swandyke      200         11093           5.30            5.07380         0.22452
Silver Plume  5500        9825            8.61            8.27832         0.33418
Converting to Original Units from a Log Transformation

Town = Horseshoe; Population = 800; Elevation = 10,544 ft
Predicted ln(population) = 6.46125
Calculated ln(residual) = 0.22337

Converting to original units (people): population = e^6.46125 = 640
Converting the residual: residual = e^0.22337 = 1.25028. This is the ratio of the actual value to the predicted value.
Original population = (640)(1.25028) = 800.2

Observed − Predicted = residual
Residual in original units (people): difference = 800 − 640 = 160
i.e., the equation under-predicted Horseshoe's population by 160 people.
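The Horseshoe back-conversion above, carried out with `math.exp`:

```python
import math

# Back-convert Horseshoe's predicted value and residual from the log model.
predicted = math.exp(6.46125)        # about 640 people
ratio = math.exp(0.22337)            # about 1.25028: actual / predicted
original = predicted * ratio         # about 800 people
difference = 800 - round(predicted)  # 160 people under-predicted
```

Note that a residual back-transformed from log units is a *ratio*, not a difference; the difference in people must be computed after both values are in original units.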
Town = Argentine; Population = 100; Elevation = 11161 ft
Observed Population = 100
(ln)predicted Population = 4.90195
(ln)residual = -0.29678

Predicted population (e^predicted) =
Residual = observed − predicted =

What are the predicted population and residual values, in the original units?
Iterative Regression
If you are exploring a database for associations, one method is to use iterative regression. Iterative Regression: an iterative procedure which either adds or removes variables from a regression model based on their significance.
IMPORTANT: The SPSS stepwise procedure gives results that are inconsistent with the other methods. Because of this inconsistency, it is recommended that the stepwise procedure not be used. A better method of performing iterative regression is to use all variables with the Enter procedure, then remove insignificant variables individually, OR use the Backward or Forward procedures.
Types of Iterative Regression:
  Enter: all variables are entered in a single step.
  Stepwise: independent variables are entered based on the smallest F probability. Variables already in the equation are removed if their probability of F becomes too large.
  Backward: all variables are entered into the equation and then sequentially removed based on the smallest partial correlation.
  Forward: a stepwise variable selection procedure in which variables are sequentially entered into the model.
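The backward logic can be sketched in a few lines. This is a simplified illustration with fixed, made-up p-values (a real backward procedure refits the model and recomputes significance after each removal):

```python
# Simplified backward elimination: repeatedly drop the least-significant
# variable until every remaining variable is significant.
P_REMOVE = 0.05  # removal threshold (illustrative)

def backward_eliminate(variables, pvalues):
    vars_in = list(variables)
    while vars_in:
        worst = max(vars_in, key=lambda v: pvalues[v])  # largest p-value
        if pvalues[worst] <= P_REMOVE:
            break  # everything left is significant
        vars_in.remove(worst)
    return vars_in

# Hypothetical p-values (not from the slide's SPSS run):
pvals = {"SquareFeet": 0.001, "YearBuilt": 0.010, "Pool": 0.430, "Garage": 0.210}
kept = backward_eliminate(pvals.keys(), pvals)  # drops Pool, then Garage
```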
Harrisburg Housing Value (Iterative using the Enter procedure)
[SPSS output: two of the independent variables are not significant.]
With the insignificant variables removed (no changes to the remaining coefficients), all slopes are significant.

Predicted value ($) = -233435.212 + 19.515(Square Feet) + 143.475(Year Built) − 3848.55(Bedrooms) + 10101.928(Half Baths) + 4.545(Parcel Size) − 12.126(Distance to Front St)
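A sketch of evaluating this equation for an arbitrary property, using the coefficients as printed on the slide (the parameter names are illustrative):

```python
# Predicted housing value ($) from the final Enter-procedure model.
def predicted_value(square_feet, year_built, bedrooms, half_baths,
                    parcel_size, dist_front_st):
    return (-233435.212
            + 19.515 * square_feet
            + 143.475 * year_built
            - 3848.55 * bedrooms
            + 10101.928 * half_baths
            + 4.545 * parcel_size
            - 12.126 * dist_front_st)
```

Each coefficient is the marginal dollar effect of a one-unit change in that variable, holding the others constant (e.g. +$19.515 per square foot).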
Standardized Coefficients

Standardized, or beta, coefficients are slope values that have been standardized so that the variables' variances are 1. They can be used to determine which independent variables have a greater effect on the dependent variable when the variables are measured in different units. In this case, Square Feet and Distance to Front Street have the greatest effect.
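A beta coefficient is the unstandardized slope rescaled by the ratio of the predictor's standard deviation to the dependent variable's; a minimal sketch:

```python
import statistics

# beta = b * sd(x) / sd(y): the slope in standard-deviation units.
def standardized_beta(b, x, y):
    return b * statistics.pstdev(x) / statistics.pstdev(y)

# With y exactly 2x, the slope is 2 and the beta is 1 (a perfect predictor).
beta = standardized_beta(2, [1, 2, 3, 4], [2, 4, 6, 8])
```

Because betas are unit-free, they can be compared across predictors measured in different units, which raw slopes cannot.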
705 ½ South Front Street
Value = $133,900; Square Feet = 2380; Parcel Size = 2975; Distance to Front Street = 84; Year Built = 1900; Bedrooms = 3; Half Baths = 1

Predicted value ($) = -233435.212 + 19.515(2380) + 143.475(1900) − 3848.55(3) + 10101.928(1) + 4.545(2975) − 12.126(84)
Predicted value ($) = -233435.212 + 46445.7 + 272602.5 − 11545.65 + 10101.928 + 13521.375 − 1018.884
Predicted value ($) = 109236.3
Residual ($) = 109236.3 − 133900 = -24663.7

This is not surprising considering that the r² was 0.591. Over 40% of the variation in housing value is not explained by this model.
Mapping Regression Residuals
Temperature Recording Sites, Kyrgyzstan Region
Average yearly temperature is influenced by:
  Elevation: 6.4 °C per 1000 m elevation change.
  Latitude: 4.0 °C per 1000 km latitude change.
To what degree can we predict temperature based on both elevation and latitude?
[Maps of the region: Elevation; Latitude]
Model: Elevation

Model Summary(b)
  Model 1: R = .824(a), R Square = .679, Adjusted R Square = .677, Std. Error of the Estimate = 3.48693
  a. Predictors: (Constant), Elevation
  b. Dependent Variable: Average Temperature

ANOVA(a)
  Regression: Sum of Squares = 4936.797, df = 1, Mean Square = 4936.797, F = 406.031, Sig. = .000(b)
  Residual:   Sum of Squares = 2334.466, df = 192, Mean Square = 12.159
  Total:      Sum of Squares = 7271.264, df = 193
  a. Dependent Variable: Average Temperature
  b. Predictors: (Constant), Elevation

Coefficients(a)
              B       Std. Error  Beta   t        Sig.
  (Constant)  14.683  .390               37.691   .000
  Elevation   -.005   .000        -.824  -20.150  .000
  a. Dependent Variable: Average Temperature

Predicted Temperature = 14.683 − 0.005(Elevation)
The standard error of the estimate is about 3.5 °C, roughly half the 6.4 °C change per 1000 m of elevation. This model is not very accurate.
Unknown: missing explanatory variable.
Model: Latitude

Model Summary(b)
  Model 1: R = .254(a), R Square = .065, Adjusted R Square = .060, Std. Error of the Estimate = 5.95185
  a. Predictors: (Constant), Latitude
  b. Dependent Variable: Average Temperature
This R² is really low.

ANOVA(a)
  Regression: Sum of Squares = 469.747, df = 1, Mean Square = 469.747, F = 13.260, Sig. = .000(b)
  Residual:   Sum of Squares = 6801.516, df = 192, Mean Square = 35.425
  Total:      Sum of Squares = 7271.264, df = 193
  a. Dependent Variable: Average Temperature
  b. Predictors: (Constant), Latitude

Coefficients(a)
              B       Std. Error  Beta   t       Sig.
  (Constant)  30.470  6.002              5.077   .000
  Latitude    -.531   .146        -.254  -3.641  .000
  a. Dependent Variable: Average Temperature
The standard error of the estimate is about 6 °C, nearly as large as the 6.4 °C change per 1000 m of elevation. This model is also not very accurate. By itself, latitude is not a good predictor of temperature.
This similarity in pattern suggests that elevation and latitude together may produce a strong predictive model.
Model: Elevation + Latitude

Model Summary(b)
  Model 1: R = .952(a), R Square = .907, Adjusted R Square = .906, Std. Error of the Estimate = 1.88058
  a. Predictors: (Constant), Elevation, Latitude
  b. Dependent Variable: Average Temperature

ANOVA(a)
  Regression: Sum of Squares = 6595.775, df = 2, Mean Square = 3297.887, F = 932.505, Sig. = .000(b)
  Residual:   Sum of Squares = 675.489, df = 191, Mean Square = 3.537
  Total:      Sum of Squares = 7271.264, df = 193
  a. Dependent Variable: Average Temperature
  b. Predictors: (Constant), Elevation, Latitude

Coefficients(a)
              B       Std. Error  Beta   t        Sig.   Zero-order  Partial  Part   Tolerance  VIF
  (Constant)  57.936  2.008              28.852   .000
  Elevation   -.005   .000        -.949  -41.620  .000   -.824       -.949    -.918  .936       1.068
  Latitude    -1.032  .048        -.494  -21.658  .000   -.254       -.843    -.478  .936       1.068
  a. Dependent Variable: Average Temperature
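The combined model can be evaluated directly from the coefficients table. A sketch using the slide's (rounded) coefficients; rounding in the elevation slope makes predictions at high elevations approximate:

```python
# Predicted average temperature (deg C) from the combined model.
def predict_temp(elevation_m, latitude_deg):
    return 57.936 - 0.005 * elevation_m - 1.032 * latitude_deg

# Gasan-kuli (elevation 23 m, latitude 52.22) has an observed temperature
# of 16.06 C; observed minus predicted reproduces its residual of ~12.13.
resid = 16.06 - predict_temp(23, 52.22)
```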
The standard error of the estimate is less than 2 °C, by far the best of the three models. This model is very accurate (R² = .907, about 90% of the variance explained).
Outlier?
Significantly over/under predicted locations.
There does not appear to be any spatial pattern to the distribution of residuals; they appear to be spatially random. The numbers of large over- and under-predictions are about equal. It might be a good idea to examine the largest over/under-predicted locations in greater detail.
Over-prediction
  Name        Lon    Lat    Elev  Temp   Resid
  Humrogi     71.33  38.28  1737  12.17  3.10
  Dzhergetal  73.1   41.57  1800  10.43  5.10
  Gasan-kuli  39.22  52.22  23    16.06  12.13

Under-prediction
  Name     Lon    Lat    Elev  Temp   Resid
  Kushka   62.35  35.28  57    15.23  -5.99
  Susamyr  74     42.2   2087  -1.95  -5.06
  Aksai    76.49  42.07  3135  -7.27  -4.86

An initial inspection does not show any locational influences, with the exception of Gasan-kuli, which is located far from the other sites.
[Map: locations of Gasan-kuli, Dzhergetal, Susamyr, Aksai, Kushka, and Humrogi]
Key Points:
1. Let theory drive your selection of independent variables. The individual-variable analyses (regressions) were misleading.
2. Use the tools available: both statistics and graphs.
3. Map residuals and look for patterns. Patterns may be of interest; the absence of patterns is NOT a failure.