Ch14. Multiple Regression Analysis
Goals: multiple regression analysis
Model building and estimation: more than one independent variable; quantitative independent variables; qualitative independent variables (dummy variables); regression coefficients; multiple standard error of estimate.
Model evaluation: goodness of fit (global and individual linearity); multicollinearity; model assumption diagnostics (analysis of residuals, residual plots).
Model-building flow: fit a model, then check
1. Linearity (global and individual)
2. Multicollinearity
3. Model assumptions: independence, normality, equal variance
If a check fails, apply an adequate remedy or try a new model.
A multiple regression analysis: when there are k independent variables X1, X2, ..., Xk, the multiple regression equation is:
µY = α + β1X1 + β2X2 + ... + βkXk
Where α = the Y-intercept = the mean of Y when X1 = X2 = ... = Xk = 0.
β1, β2, ..., βk = net/partial regression coefficients.
β1 = the net change in the mean of Y for each unit change in X1 when the other variables X2, ..., Xk are held constant.
If there are k=2 independent variables, see Chart 14-1.
For qualitative X: recall that a regression model establishes a systematic relationship between two continuous variables, the independent and the dependent variable. What if some independent variables are nominal-scale/qualitative? Answer: use a dummy variable to replace the original variable.
Example. X1 and Y are continuous variables, while X2 (gender) is nominal.
If X2 = male, y = 0.6 + 0.9X1
If X2 = female, y = 5.6 + 0.9X1
How to express such a model?
Dummy variable: a variable with only two possible outcomes, 0 or 1. I = 1 if success; I = 0 if failure.
Example. Let I2 = 1 if X2 = female, I2 = 0 if X2 = male. The previous model is expressed as one multiple regression model:
µY = 0.6 + 0.9X1, if X2 = male (I2 = 0)
µY = 0.6 + 0.9X1 + 5, if X2 = female (I2 = 1)
i.e. µY = 0.6 + 0.9X1 + 5I2 = α + β1X1 + β2I2
α = 0.6: the intercept when I2 = 0 (male). β1 = 0.9. β2 = 5: the difference in the mean of Y between female (I2 = 1) and male (I2 = 0) at any fixed X1.
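The intercept-shift model above can be sketched as a small prediction function: the indicator I2 adds β2 = 5 to the intercept while the slope stays 0.9.

```python
# Sketch of the dummy-variable model mu_Y = 0.6 + 0.9*X1 + 5*I2.
def mean_y(x1, female):
    i2 = 1 if female else 0           # dummy variable: 1 = female, 0 = male
    return 0.6 + 0.9 * x1 + 5 * i2

# At X1 = 10 the two groups differ by exactly beta2 = 5.
print(mean_y(10, False), mean_y(10, True))  # 9.6 and 14.6
```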
[Chart: two parallel regression lines, the female line 5 units above the male line.]
A dummy variable can also change the slope (an interaction term). Example:
µY = 0.6 + 0.9X1, if X2 = male (I2 = 0)
µY = 0.6 + 0.9X1 + 0.3X1, if X2 = female (I2 = 1)
i.e. µY = 0.6 + 0.9X1 + 0.3X1I2 = α + β1X1 + β2X1I2
Female line: µY = 0.6 + 1.2x. Male line: µY = 0.6 + 0.9x. Same intercept 0.6, different slopes.
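The interaction model above can be sketched the same way: I2 now multiplies X1, so the female slope is 0.9 + 0.3 = 1.2 while both lines share the intercept 0.6.

```python
# Sketch of the interaction model mu_Y = 0.6 + 0.9*X1 + 0.3*X1*I2.
def mean_y(x1, female):
    i2 = 1 if female else 0
    return 0.6 + 0.9 * x1 + 0.3 * x1 * i2

# The slope of each line is the change in mean Y per unit change in X1.
male_slope = mean_y(1, False) - mean_y(0, False)
female_slope = mean_y(1, True) - mean_y(0, True)
print(round(male_slope, 1), round(female_slope, 1))  # 0.9 and 1.2
```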
Bonus 1 (1%): express as a single multiple regression model the case where both the intercept and the slope differ:
µY = 0.6 + 0.9X1, if X2 = male (I2 = 0)
µY = 5.6 + 1.2X1, if X2 = female (I2 = 1)
Bonus 2: X1 and Y are continuous, and X2 is a qualitative variable with three categories. Express the following as a single multiple regression model:
µY = 0.6 + 0.9X1, if X2 = category 1
µY = 3.6 + 0.9X1, if X2 = category 2
µY = 2.6 + 0.9X1, if X2 = category 3
Estimating the regression equation: the multiple regression equation is estimated by
Y' = a + b1X1 + b2X2 + ... + bkXk
where a, b1, ..., bk are the least squares estimates (LSE). The calculations become tedious as k grows. Example: for k = 2, two independent variables, solve the normal equations:
Σy = an + b1Σx1 + b2Σx2
Σx1y = aΣx1 + b1Σx1² + b2Σx1x2
Σx2y = aΣx2 + b1Σx1x2 + b2Σx2²
Many software packages provide the LSEs.
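The three normal equations above can be solved directly. A minimal sketch, using synthetic (made-up) data generated from y = 2 + 3x1 + 4x2 with no noise, so the least squares estimates should recover (2, 3, 4):

```python
# Solve the k=2 normal equations for (a, b1, b2) by Gauss-Jordan elimination.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3*u + 4*v for u, v in zip(x1, x2)]   # exact linear data
n = len(y)

S = lambda v: sum(v)
Sxy = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

# Augmented matrix of the three normal equations.
M = [
    [n,     S(x1),       S(x2),       S(y)],
    [S(x1), Sxy(x1, x1), Sxy(x1, x2), Sxy(x1, y)],
    [S(x2), Sxy(x1, x2), Sxy(x2, x2), Sxy(x2, y)],
]

# Gauss-Jordan: normalize each pivot row, clear the column elsewhere.
for i in range(3):
    p = M[i][i]
    M[i] = [v / p for v in M[i]]
    for j in range(3):
        if j != i:
            M[j] = [vj - M[j][i] * vi for vi, vj in zip(M[i], M[j])]

a, b1, b2 = (round(M[i][3], 6) for i in range(3))
print(a, b1, b2)  # expect 2.0 3.0 4.0
```

With noisy real data the same equations give the least squares fit rather than an exact recovery; statistical packages do exactly this (in matrix form) for general k.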
Example (P477). Salsberry Realty sells homes along the east coast of the USA. "How much can one expect to pay to heat a home during the winter?" is a question frequently asked by customers.
Independent variables (X's):
1. The mean daily outside temperature
2. The number of inches of insulation in the attic
3. The age of the furnace
Dependent variable: Y = heating cost. n = 20 houses were sampled and investigated.
Answer the following questions:
1. Determine the multiple regression equation.
2. Discuss the regression coefficients. What does it indicate that some are positive and some are negative?
3. What is the intercept?
4. What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?
[Table: sample data for the 20 homes — Heating cost (Y), Mean outside temp. (X1), Attic insulation (X2), Age of furnace (X3).]
EXCEL regression output: Y' = 427.19 − 4.58X1 − 14.83X2 + 6.10X3
Findings:
1. Y' = 427.19 − 4.58X1 − 14.83X2 + 6.10X3
2. The intercept is 427.19.
3. b1, b2 are negative: X1, X2 have an inverse relationship with Y. As the outside temperature X1 increases, the mean heating cost goes down, which is reasonable. For each degree the mean temperature increases, the mean heating cost decreases by 4.58 per month. The more insulation in the attic, the lower the heating cost.
4. b3 = 6.1 > 0: X3 has a direct relationship with Y. An older furnace means a higher heating cost.
5. If X1 = 30, X2 = 5, X3 = 10, the estimated heating cost is Y' = 427.19 − 4.58(30) − 14.83(5) + 6.10(10) = 276.60
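The prediction in item 5 is easy to verify. Note that the slide's 276.60 comes from the unrounded EXCEL coefficients; with the rounded coefficients shown above the result differs slightly in the last digit.

```python
# Predicted heating cost from the fitted equation, rounded coefficients.
def heating_cost(x1, x2, x3):
    return 427.19 - 4.58 * x1 - 14.83 * x2 + 6.10 * x3

print(round(heating_cost(30, 5, 10), 2))  # 276.64 with rounded coefficients
```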
More on estimation. Multiple standard error of estimate: a measure of the error or variability in the prediction.
Formula: S_{Y·12...k} = sqrt( Σ(Y − Y')² / (n − (k+1)) ) = sqrt( SSE / (n − (k+1)) ) = sqrt(MSE)
Residual = Y − Y'. The standard error of estimate helps to construct confidence intervals and prediction intervals.
Why are the degrees of freedom n − (k+1)? There are n responses Y1, ..., Yn, and Y' is determined by the fitted equation with (k+1) estimated coefficients.
[Table: residuals (Y − Y') and squared residuals (Y − Y')² for the 20 homes; Σ(Y − Y')² = 41695.28.]
S_{Y·123} = sqrt( 41695.28 / (20 − (3+1)) ) = 51.05
Or, the estimate can be found in the EXCEL regression output:
S_{Y·123} = 51.05 = sqrt(MSE) = sqrt(2606)
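The two routes to the standard error agree, as a quick check of the figures above shows:

```python
import math

# SSE = 41695.28, n = 20, k = 3, so df = n - (k + 1) = 16.
sse, n, k = 41695.28, 20, 3
mse = sse / (n - (k + 1))
s = math.sqrt(mse)
print(round(mse), round(s, 2))  # 2606 and 51.05
```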
Model Evaluation
Model fit:
1. There is a linear relationship between each of X1, ..., Xk and Y.
   Global test: (X1, ..., Xk) vs. Y
   Individual regression coefficients: Xi vs. Y
2. There is no correlation among X1, ..., Xk. If there is, multicollinearity exists. Diagnosed by the correlation matrix.
Model assumption diagnostics:
1. All observations (X1, ..., Xk, Y) are independent. Residual plots.
2. The random error e = Y − Y' ~ Normal(0, σ²), with equal variance. Residual plot, normal plot. Homoscedasticity = equal variance.
Linearity: linear relationship between X1, ..., Xk and Y.
Global linearity: jointly, (X1, ..., Xk) has a linear relationship with Y.
Individual linearity: each of X1, X2, ..., Xk has a linear relationship with Y.
Methods: subjective (eyeball, r²); objective (statistical tests).
Linearity: subjective methods. Individual linearity between Xi and Y.
Scatter diagrams: plots of (X1, Y), (X2, Y), ..., (Xk, Y); look for a linear relationship, positive or negative.
Correlation matrix: a matrix showing the correlation coefficient r between all pairs of variables. The off-diagonal entries are the correlation coefficients. The correlations between (X1, Y), ..., (Xk, Y) should be close to 1 or −1.
Example (P485): scatter plots.
Example (P486), EXCEL correlation matrix: X1, X2 are negatively related to Y, while X3 is positively related. X1 has the strongest correlation with Y; X2 has the weakest correlation with Y.
Linearity: subjective methods. Global linear relationship between (X1, ..., Xk) and Y.
Coefficient of multiple determination r² (from the ANOVA table): the proportion of the total variation of Y explained by X1, ..., Xk.
r² = SSR / SStotal = 1 − SSE / SStotal
ANOVA Table
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square    | F
Regression          | SSR            | k                  | SSR/k = MSR    | MSR/MSE
Error               | SSE            | n−k−1              | SSE/(n−k−1) = MSE |
Total               | SStotal        | n−1                |                |
Example. r² can be found in the EXCEL regression output: r² = 80.42%, or
r² = SSR / SStotal = 171220.5 / 212915.8 = 0.8042
X1, X2, X3 together explain about 80% of the variation in Y.
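Recomputing r² from the sums of squares quoted above:

```python
# r^2 = SSR / SStotal, with SSE = SStotal - SSR as a consistency check.
ssr, ss_total = 171220.5, 212915.8
r2 = ssr / ss_total
print(round(r2, 4))  # 0.8042
```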
Linearity: objective methods. The hypotheses of linearity are tested objectively.
Global linearity: does (X1, ..., Xk) jointly have a linear relationship with Y? I.e., are the population coefficients β1, ..., βk not all 0?
H0: β1 = ... = βk = 0 -- the F test in ANOVA!
Individual linearity: does each of X1, X2, ..., Xk individually have a linear relationship with Y? I.e., is a particular coefficient among β1, ..., βk nonzero?
E.g., for X1, testing H0: β1 = 0 vs H1: β1 ≠ 0 -- the t-test!
Individual test (P490)
Step 1. Hypotheses: H0: β1 = 0 vs H1: β1 ≠ 0
Step 2. Significance level α
Step 3. Test statistic: t = (b1 − 0) / SE(b1) = b1 / s_{b1}
Step 4. Decision rule: a two-sided t-test. Under the null hypothesis, t ~ t distribution with d.f. n − (k+1).
H0 is rejected if t ≤ −t(n−(k+1), α/2) or t ≥ t(n−(k+1), α/2), or if p-value ≤ α.
Step 5. Conclusion.
Example (P518). Since n − k − 1 = 16 and α = 0.05, the critical value is t(16, 0.025) = 2.12.
Conclusions at α = 0.05:
1. For the intercept, a = 427.19, SE(a) = 59.6, t = 7.17, p-value ≈ 0: significant!
2. For X1, b1 = −4.58, SE(b1) = 0.77, t = −5.93, p-value = 0.00: significant!
3. For X2, b2 = −14.83, SE(b2) = 4.75, t = −3.12, p-value = 0.0066: significant!
4. For X3, b3 = 6.10, SE(b3) = 4.01, t = 1.52, p-value = 0.1479 > 0.05: not significant!
Recall: in the correlation matrix, r(Y, X3) = 0.53 is quite large, so why is the linearity insignificant here? And r(Y, X2) = −0.25 is close to 0, so why is the linearity significant here?
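The t statistics above are just b/SE(b). Recomputing them from the rounded coefficients and comparing with the critical value (small last-digit differences vs. the slide come from rounding):

```python
# Coefficient and standard error for each X, from the EXCEL output above.
coef = {"X1": (-4.58, 0.77), "X2": (-14.83, 4.75), "X3": (6.10, 4.01)}
t_crit = 2.12  # t(16, 0.025)

for name, (b, se) in coef.items():
    t = b / se
    verdict = "significant" if abs(t) >= t_crit else "not significant"
    print(name, round(t, 2), verdict)
```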
Global test (P487-488)
Step 1. Hypotheses: H0: β1 = β2 = ... = βk = 0
Step 2. Significance level α
Step 3. Test statistic: F = (SSR / k) / (SSE / (n − (k+1))) = MSR / MSE
Step 4. Decision rule: a one-sided F-test; significant if F is large. Under the null hypothesis, F ~ F distribution with d.f. (k, n − (k+1)).
H0 is rejected if F ≥ F(k, n−(k+1), α), or if p-value ≤ α.
Step 5. Conclusion.
Example (P488). Since k = 3, n − k − 1 = 16, α = 0.05, the critical value is F(3, 16, 0.05) = 3.24.
Conclusion: at α = 0.05, H0 is rejected, since F = 21.9 > 3.24, or p-value = 0.000007 < 0.05. At least one of β1, β2, β3 is not 0.
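The F statistic follows directly from the sums of squares quoted in the r² example (SSR = 171220.5, SSE = 41695.28):

```python
# Global F test: F = MSR / MSE with k = 3 predictors and n = 20 homes.
ssr, sse, k, n = 171220.5, 41695.28, 3, 20
msr = ssr / k
mse = sse / (n - (k + 1))
f = msr / mse
print(round(f, 1))  # 21.9
```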
Strategy for model selection (P490): how many independent variables should be in the model?
1. Develop a multiple regression equation based on all independent variables.
   1) Global test: significant? If not, stop and conclude that (X1, ..., Xk) are uncorrelated with Y. If yes, continue to 2).
   2) Individual tests: significant? If all are, go to 3. If some are and some are not, go to 2.
2. Remove the X with the largest p-value (delete the most insignificant independent variable) and go back to 1.
3. When the global and individual linearity are both significant, check the model assumptions.
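The backward-elimination loop in this strategy can be sketched as follows. The p-values are taken from the heating-cost example; in practice the model is refitted after every deletion and the p-values change, so the fixed dictionary here is purely illustrative.

```python
# Backward elimination: repeatedly drop the X with the largest p-value
# until every remaining variable is individually significant.
ALPHA = 0.05
p_values = {"X1": 0.00, "X2": 0.0066, "X3": 0.1479}  # from the example

kept = dict(p_values)
while kept:
    worst = max(kept, key=kept.get)       # variable with largest p-value
    if kept[worst] <= ALPHA:
        break                             # all remaining X's significant
    kept.pop(worst)                       # delete it; refit in practice
print(sorted(kept))  # ['X1', 'X2']
```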
Is there any nonlinear relationship between X and Y? Residual = e = Y − Y' = unexplained error/variation. Is there any systematic pattern in the residual plot (X, e)?
If the model is right, Y ~ N(µY = α + β1x, σ²) and Y' = a + bx, so e = Y − (a + bx) ~ N(0, σ²) approximately: the residuals are centered around 0 and independent of X.
If the model is not right, e.g. Y ~ N(µY = α + β1x + β2x², σ²) but Y' = a + bx is fitted, then e = Y − (a + bx) has mean approximately β2x² plus linear terms: the residual is a quadratic function of X.
If a nonlinear relationship exists, the model should be modified.
2. Check the multicollinearity among the X's.
Multicollinearity: correlation exists among the independent variables.
Multicollinearity can distort SE(b) and lead to incorrect conclusions in hypothesis testing: SE(b) becomes large, so the conclusion is insignificant. In the previous example, X1 and X3 are correlated.
Method: check the X part of the correlation matrix. Multicollinearity exists if r > 0.7 or r < −0.7.
Strategy: if multicollinearity exists, drop one of the independent variables and rebuild the model.
Example (P486), EXCEL: slight correlations between (X1, X2) and (X2, X3); moderate negative correlation between (X1, X3). Recall that H0: β3 = 0 is not rejected.
Model assumptions: if the model is correct, Y ~ N(µY = α + βx, σ²), Y' = a + bx, and e = Y − Y' ~ N(0, σ²) approximately. Then:
Y1, ..., Yn are independent → e1, ..., en are approximately independent.
Y1, ..., Yn come from a normal population → e1, ..., en are approximately normal.
Y1, ..., Yn have constant variance at each level of x → the residuals have constant spread.
3. Assumption of independence. Under independence:
1. The observed values should be independent of the sampling order i. Residual plot: (i, e).
2. Successive observations should be uncorrelated. Residual plot: (e(i), e(i+1)).
3. Assumption of independence (1): the residuals ei should be independent of the order i. Plot (i, ei).
3. Assumption of independence (2): there should be no systematic pattern between successive observations. Plot (ei, ei+1).
4. Assumption of normality and equal variance.
Normal distribution: the residuals e ~ normal.
1. Histogram of the e's: should be bell-shaped and symmetric.
Example. Model: X1, X2, X3 (P505 residual histogram).
2. Normal probability plot (p-p plot): should be linear.
Example. Model: X1, X2, X3 (P505). Nearly a straight line, so we can conclude that normality holds.
Equal variance/homoscedasticity: the distributions of Y at different X levels have equal variances.
If the variances are not equal, SE(regression coeff.) is understated, the t-statistic is too large, and we may incorrectly conclude that X is significant.
Remedies if the variances are not equal: select other independent variables, or apply some transformation to X or Y.
The residuals should have equal variation at different X levels. Check the residual plot (X, e) or (Ŷ, e). (P526)
Example 1. Unequal variance: the spread of the residuals increases with X.
Example 2. Another association, e.g. quadratic, may exist.
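A rough numerical version of this check is to split the residuals by low vs. high fitted value and compare spreads. The (fitted, residual) pairs below are hypothetical; a variance ratio far from 1 flags the fan-shaped pattern of Example 1.

```python
from statistics import pvariance

# Hypothetical (fitted value, residual) pairs with spread growing in Y-hat.
pairs = sorted([(110, 3.0), (150, -2.5), (190, 4.1), (230, -12.0),
                (270, 15.5), (310, -18.2), (350, 21.0), (390, -25.4)])
res = [e for _, e in pairs]          # residuals ordered by fitted value
low, high = res[:4], res[4:]         # bottom half vs. top half

ratio = pvariance(high) / pvariance(low)
print(round(ratio, 1))  # a ratio well above 1 suggests increasing spread
```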
Example. An analyst is studying the effect of tire pressure on fuel economy (mpg) for a fleet of 24 sedans used by regional supervisors. Four different cars are driven at each tire pressure of 30, 31, 32, 33, 34 and 35 pounds per square inch. Develop an appropriate regression model to relate tire pressure to fuel economy. What appears to be the best level for tire pressure?
The mileage seems to be curvilinear in the pressure.
EXCEL output (linear fit): Y' = 4.53 + 0.89(Pressure). The r² is low.
According to the residual plot, there is a nonlinear relation between the residuals and the pressure.
EXCEL output (quadratic fit): Y' = −1208.43 + 75.74(Pressure) − 1.15(Pressure)²
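Since the fitted quadratic opens downward, the pressure maximizing predicted mpg is at the vertex, P* = −b1/(2·b2), using the coefficients above:

```python
# Best tire pressure from the fitted quadratic mpg model.
b0, b1, b2 = -1208.43, 75.74, -1.15
p_best = -b1 / (2 * b2)               # vertex of the parabola
mpg_best = b0 + b1 * p_best + b2 * p_best**2
print(round(p_best, 1), round(mpg_best, 1))  # best pressure is about 32.9 psi
```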
According to the residual plots (normal probability plot and residual plot), there is no severe departure from the model assumptions.
EXCEL: Data Analysis, Regression.
Exercises: 9, 10, 11, 13, 14, 15. Excel: 17, 21, 23, 25.
Bonus (1%): Exercise 14.25. Use EXCEL to analyze the data and build a model, checking:
1. Linear relationship between X and Y? Global linearity; individual linearity.
2. Multicollinearity?
3. Independent observations (X1, ..., Xk, Y)?
4. Normal distribution?
5. Equal variance?
Attach the EXCEL output.