Multiple Regression and Model Building
11.220 Lecture 20, 1 May 2006
R. Ryznar
Building Models: Making Sure the Assumptions Hold

1. There is a linear relationship between the explanatory (independent) variable(s) and the response (dependent) variable. Check a scatterplot of your data.
2. The error terms are normally distributed with a mean of zero. In the regression model Y = α + βX + ε, we assume that the values of the error term ε (i.e., observed minus predicted values) are distributed normally with a mean of zero. We can check this by plotting the error values from our data. Why is this important? It can be shown that, if these errors are normally distributed, then our estimate b is an unbiased estimator of the true slope in the population, β.
3. The errors have equal variance for all values of X. This property is called homoscedasticity. In plain language, the variance of the error term does not change systematically with the value of X.
4. The error values are independent. The value of any given error term is independent of the value of any other error term. This is most frequently a problem with data collected over time.
5. The data are ratio or interval. All dependent and independent variables are continuous and measured on ratio or interval scales. This is technically correct; in practice, however, this rule is regularly compromised, for example by using dummy variables to incorporate categorical data into regression models.
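Assumption 2 is easy to check by hand. The following sketch (made-up data, not from the lecture) fits a simple least-squares line with numpy and inspects the residuals; with an intercept in the model, the residuals always average to zero by construction:

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

b, a = np.polyfit(x, y, 1)        # slope b, intercept a
residuals = y - (a + b * x)       # observed minus predicted values

print(abs(residuals.mean()) < 1e-9)  # True: mean residual is zero
```

In practice one would also plot the residuals (a histogram or normal P-P plot, as later in these slides) to judge normality, not just check the mean.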
Dummy variables: a way to use nominal (qualitative) data in the regression equation. We create one or more variables, each of which takes on the value 0 or 1 only. The number of dummy variables we need is k - 1, where k is the number of categories in the original nominal variable. The regression coefficient for a dummy variable can be interpreted as the predicted change in Y when an observation is a member of that particular category, as compared to the reference category (explained shortly).
How NOT to use dummy variables:

Let RACE = 1 if African American
           2 if Asian American
           3 if Caucasian
           4 if Hispanic
           5 if Other
The correct way is to use a set of indicator ("dummy") variables and code them in this manner:

Let AFRAMER = 1 if African American and 0 otherwise
Let ASIAMER = 1 if Asian American and 0 otherwise
Let CAUCAS  = 1 if Caucasian and 0 otherwise
Let HISPAN  = 1 if Hispanic and 0 otherwise
Let OTHER   = 1 if Other and 0 otherwise
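The k - 1 coding above can be sketched in a few lines of Python (the helper name `make_dummies` is ours, not from the slides):

```python
# Build k-1 indicator (dummy) variables from a nominal variable,
# dropping one reference category (here: African American).
def make_dummies(values, reference):
    categories = sorted(set(values))
    categories.remove(reference)   # the omitted/reference group gets no dummy
    return {c: [1 if v == c else 0 for v in values] for c in categories}

race = ["African American", "Asian American", "Caucasian",
        "Hispanic", "Other", "Caucasian"]
dummies = make_dummies(race, reference="African American")

print(len(dummies))           # 4 dummies for k = 5 categories
print(dummies["Caucasian"])   # [0, 0, 1, 0, 0, 1]
```

A reference-group member is coded 0 on every dummy, which is exactly why only k - 1 variables are needed.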
Suppose our conceptual model is:

Y = α + β1 X1 + β2 X2 + ε

Income = α + β1 Race + β2 Education + ε

With the dummy coding, the estimated model becomes (AFRAMER serves as the reference category):

Income = a + b1 ASIAMER + b2 CAUCAS + b3 HISPAN + b4 OTHER + b5 EDUC
Possible model results (income in thousands of dollars):

Income = 5.41 + 1.9 ASIAMER + 2.5 CAUCAS - 0.7 HISPAN - 2.2 OTHER + 0.95 EDUC

Thus, to find the predicted income for individuals of different races, each with 12 years of schooling:

Asian American   = a + b1 + (12 × b5) = 5.41 + 1.9 + (12 × 0.95) = 18,710
Caucasian        = a + b2 + (12 × b5) = 5.41 + 2.5 + (12 × 0.95) = 19,310
Hispanic         = a + b3 + (12 × b5) = 5.41 - 0.7 + (12 × 0.95) = 16,110
Other            = a + b4 + (12 × b5) = 5.41 - 2.2 + (12 × 0.95) = 14,610
African American = a + (12 × b5)      = 5.41 + (12 × 0.95)       = 16,810

Do you know another way of determining the effect of race?
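The arithmetic above is easy to verify in code (coefficients copied from the slide; income in thousands):

```python
# Predicted income (in thousands) for each group at 12 years of schooling.
# African American is the reference group, so its prediction uses the
# intercept plus the education term only.
a = 5.41
b = {"ASIAMER": 1.9, "CAUCAS": 2.5, "HISPAN": -0.7, "OTHER": -2.2}
b_educ = 0.95
educ = 12

pred = {"AFRAMER": a + b_educ * educ}
for group, coef in b.items():
    pred[group] = a + coef + b_educ * educ

for group, income in sorted(pred.items()):
    print(f"{group}: {income * 1000:,.0f}")
```

This reproduces the five predicted incomes on the slide (18,710 for Asian American down to 14,610 for Other).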
The category that is not coded is the category to which all others will be compared. It is called the omitted or reference group.

How do you interpret the intercept? The intercept is the mean of the omitted group (here, with the other predictors at zero).

How do you interpret the other beta coefficients? The b1 coefficient is the mean of the ASIAMER group minus the mean of the AFRAMER group; the b2 coefficient is the mean of the CAUCAS group minus the mean of the AFRAMER group; and so on.
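This interpretation can be demonstrated numerically. In the sketch below (made-up incomes, two groups, no other predictors), the fitted intercept equals the reference-group mean and the dummy coefficient equals the difference in group means:

```python
import numpy as np

# Hypothetical incomes for two groups; group A is the reference.
y_a = np.array([10.0, 12.0, 14.0])   # reference group, mean 12
y_b = np.array([20.0, 22.0, 24.0])   # dummy-coded group, mean 22

y = np.concatenate([y_a, y_b])
dummy = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(y), dummy])

(intercept, coef), *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(float(intercept), 6))  # 12.0 -> mean of the reference group
print(round(float(coef), 6))       # 10.0 -> mean(B) - mean(A)
```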
Building Models: Making Sure the Assumptions Hold

1. There is a linear relationship between the explanatory (independent) variable(s) and the response (dependent) variable.
2. The error terms are normally distributed with a mean of zero.
3. The errors have equal variance for all values of X. This property is called homoscedasticity.
4. The error values are independent.
5. The data are ratio or interval.
Linearity (and how to get it)

Monthly Electrical Usage and Size of Home

Size of Home    Monthly Electrical Usage
(Square Feet)   (Kilowatt Hours)
1,290           1,182
1,350           1,172
1,470           1,264
1,600           1,493
1,710           1,571
1,840           1,711
1,980           1,804
2,230           1,840
2,400           1,956
2,930           1,954
[Scatterplot: Energy Use (kilowatt hours) vs. Home Size in Square Feet]

Electrical usage appears to increase in a curvilinear manner with the size of the home.
Transformations for Nonlinear Relation Only

[Figure: three prototype regression patterns (A, B, C) with candidate transformations of X]

Pattern A: X' = log10(X) or X' = √X
Pattern B: X' = X² or X' = exp(X)
Pattern C: X' = 1/X or X' = exp(-X)

Figure by MIT OCW.
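For the home-energy data above, a concave (pattern A) transformation of X should tighten the linear association. A quick check with numpy (our own sketch, not part of the original slides):

```python
import numpy as np

# Home sizes (sq ft) and monthly electrical usage (kWh) from the table.
size = np.array([1290, 1350, 1470, 1600, 1710, 1840,
                 1980, 2230, 2400, 2930], dtype=float)
usage = np.array([1182, 1172, 1264, 1493, 1571, 1711,
                  1804, 1840, 1956, 1954], dtype=float)

r_linear = np.corrcoef(size, usage)[0, 1]        # untransformed X
r_log = np.corrcoef(np.log(size), usage)[0, 1]   # pattern-A transform

print(round(r_linear, 3))  # 0.912, matching the slides' R = .912
print(r_log > r_linear)    # True: the log transform straightens the curve
```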
[Two scatterplots: EnergyUse vs. HomeSize (left) and EnergyUse vs. lnx, the log of home size (right)]
Prototype Regression Patterns with Unequal Error Variances and Simple Transformations of Y

[Figure: three prototype regression patterns (A, B, C) with candidate transformations of Y]

Pattern A: Y' = √Y
Pattern B: Y' = log10(Y)
Pattern C: Y' = 1/Y

Note: A simultaneous transformation on X may also be helpful or necessary.

Figure by MIT OCW.
[Scatterplot: EnergyUse vs. HomeSize, showing the observed points with linear and quadratic fits]

Linear model:    y = β0 + β1x + ε
Quadratic model: y = β0 + β1x + β2x² + ε
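The quadratic fit can be reproduced with numpy (a sketch; the SPSS output later in these slides reports R² = .982 for this model):

```python
import numpy as np

size = np.array([1290, 1350, 1470, 1600, 1710, 1840,
                 1980, 2230, 2400, 2930], dtype=float)
usage = np.array([1182, 1172, 1264, 1493, 1571, 1711,
                  1804, 1840, 1956, 1954], dtype=float)

# Fit y = b0 + b1*x + b2*x^2 by least squares.
b2, b1, b0 = np.polyfit(size, usage, 2)
fitted = b0 + b1 * size + b2 * size**2

sse = np.sum((usage - fitted) ** 2)          # residual sum of squares
sst = np.sum((usage - usage.mean()) ** 2)    # total sum of squares
r2 = 1 - sse / sst

print(round(r2, 3))  # 0.982, matching the Model Summary
print(b2 < 0)        # True: the curve bends downward
```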
Building Models: Making Sure the Assumptions Hold

1. There is a linear relationship between the explanatory (independent) variable(s) and the response (dependent) variable.
2. The error terms are normally distributed with a mean of zero.
3. The errors have equal variance for all values of X. This property is called homoscedasticity.
4. The error values are independent.
5. The data are ratio or interval.
Residual Plots against Energy Use

[Two scatterplots of regression standardized residuals vs. EnergyUse (dependent variable: EnergyUse): before inclusion of X² (left) and after inclusion of X² (right)]
[Two normal P-P plots of the regression standardized residuals (expected vs. observed cumulative probability, dependent variable: EnergyUse), one for each model:]

y = β0 + β1x + ε
y = β0 + β1x + β2x² + ε
Homoscedasticity

Dispersion of residual errors around a regression. The graph on the left (A) shows a homoscedastic regression: the variance of residuals given x is constant. The graph on the right (B) shows a heteroscedastic regression: the variance of residuals increases with x. In this case, higher x values predict y with less certainty.

[Figure: panels A and B, y vs. x]

Figure by MIT OCW.
Linear Regression Statistics: Model Fit

Durbin-Watson: tests for serial correlation among residuals. The test value ranges from 0 to 4. Values close to 0 indicate positive correlation; values close to 4 indicate negative correlation. Values between 1.5 and 2.5 are expected, i.e., indicate no correlation.

Model Summary(b)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      .991(a)  .982      .977               46.801                      2.079

a. Predictors: (Constant), SizeSquared, HomeSize
b. Dependent Variable: EnergyUse
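The statistic itself is simple to compute from the residuals. A sketch using the standard Durbin-Watson formula (the formula is not shown on the slide; the example residual series are made up):

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Independent residuals give a value near 2 ...
rng = np.random.default_rng(0)
dw_iid = durbin_watson(rng.standard_normal(1000))
print(1.5 < dw_iid < 2.5)   # True

# ... while a slowly varying (positively correlated) series pushes it toward 0.
trend = np.sin(np.linspace(0, np.pi, 1000))
print(durbin_watson(trend) < 0.5)   # True
```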
Model Utility

R² = SSR/SST

F test:

F = (R²/k) / [(1 - R²)/(n - (k + 1))]

where n = the number of observations and k = the number of independent (predictor) variables.

The F test tests the global utility of the model, i.e., whether at least one of the coefficients is nonzero. Find the critical value F_α in a table with k df in the numerator and n - (k + 1) df in the denominator. The rejection region is where F > F_α.
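Plugging in the home-energy numbers (R² = .982, k = 2, n = 10) as a quick check of the formula:

```python
# Global F test computed from R-squared, as in the formula above.
def f_statistic(r2, n, k):
    return (r2 / k) / ((1 - r2) / (n - (k + 1)))

f = f_statistic(r2=0.982, n=10, k=2)
print(round(f, 1))   # about 190; the SPSS table reports 189.710
                     # (the small gap comes from rounding R-squared to .982)
print(f > 4.74)      # True: exceeds the critical F(2, 7) value
```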
Model Summary(b)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .991(a)  .982      .977               46.801

a. Predictors: (Constant), SizeSquared, HomeSize
b. Dependent Variable: EnergyUse

R² = SSR/SST, or 1 - (SSE/SST)
Adjusted R² = 1 - [(SSE/(n - (k + 1))) / (SST/(n - 1))]

ANOVA(b)

            Sum of Squares   df  Mean Square  F        Sig.
Regression  831069.5         2   415534.773   189.710  .0001(a)
Residual    15332.554 (SSE)  7   2190.365
Total       846402.1 (SST)   9

a. Predictors: (Constant), SizeSquared, HomeSize
b. Dependent Variable: EnergyUse

s² = SSE/[n - (k + 1)], sometimes called MSE
F = (R²/k) / [(1 - R²)/(n - (k + 1))], where k = the number of X variables

Coefficients(a), for the model y = β0 + β1x + β2x² + ε

             Unstandardized B  Std. Error    Standardized Beta  t       Sig.
(Constant)   -1216.1438870     242.80636850                     -5.009  .00155
HomeSize     2.39893018        .24583560     4.049              9.758   .00003
SizeSquared  -.00045004        .00005908     -3.161             -7.618  .00012

a. Dependent Variable: EnergyUse
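These summary quantities can be verified directly from the ANOVA table's sums of squares (values copied from the output above):

```python
# SSE and SST from the ANOVA table; n = 10 observations, k = 2 predictors.
sse, sst = 15332.554, 846402.1
n, k = 10, 2

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))
mse = sse / (n - (k + 1))

print(round(r2, 3))          # 0.982
print(round(adj_r2, 3))      # 0.977
print(round(mse ** 0.5, 3))  # 46.801 -> the "Std. Error of the Estimate"
```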
For the home energy example, the critical value of F was 4.74, with 2 df in the numerator and 7 df in the denominator. Since the computed F = 189.71, we reject H0 and conclude that at least one of the model coefficients β1 and β2 is nonzero.
The simple linear model: y = β0 + β1x + ε

Model Summary(b)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .912(a)  .832      .811               133.438

a. Predictors: (Constant), HomeSize
b. Dependent Variable: EnergyUse

ANOVA(b)

            Sum of Squares  df  Mean Square  F       Sig.
Regression  703957.2        1   703957.183   39.536  .000(a)
Residual    142444.9        8   17805.615
Total       846402.1        9

a. Predictors: (Constant), HomeSize
b. Dependent Variable: EnergyUse

Coefficients(a)

            Unstandardized B  Std. Error  Standardized Beta  t      Sig.
(Constant)  578.928           166.968                        3.467  .008
HomeSize    .540              .086        .912               6.288  .000

a. Dependent Variable: EnergyUse
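As a cross-check, this simple linear fit is easy to reproduce with numpy (a sketch; the numbers should match the output above up to rounding):

```python
import numpy as np

size = np.array([1290, 1350, 1470, 1600, 1710, 1840,
                 1980, 2230, 2400, 2930], dtype=float)
usage = np.array([1182, 1172, 1264, 1493, 1571, 1711,
                  1804, 1840, 1956, 1954], dtype=float)

b1, b0 = np.polyfit(size, usage, 1)
r2 = 1 - np.sum((usage - (b0 + b1 * size)) ** 2) / \
         np.sum((usage - usage.mean()) ** 2)

print(round(b0, 3))  # about 578.928 (the Constant)
print(round(b1, 3))  # about 0.540 (the HomeSize slope)
print(round(r2, 3))  # 0.832
```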