Model Building (Chap. 5, p. 251)

Models with one qualitative variable (5.7, p. 277)

Example 4
Colours: blue, green, lemon yellow and white

Row  Blue  Green  Lemon  Insects trapped
  1     0      0      1               45
  2     0      0      1               59
  3     0      0      1               48
  4     0      0      1               46
  5     0      0      1               38
  6     0      0      1               47
  7     0      0      0               21
  8     0      0      0               12
  9     0      0      0               14
 10     0      0      0               17
 11     0      0      0               13
 12     0      0      0               17
 13     0      1      0               37
 14     0      1      0               32
 15     0      1      0               15
 16     0      1      0               25
 17     0      1      0               39
 18     0      1      0               41
 19     1      0      0               16
 20     1      0      0               11
 21     1      0      0               20
 22     1      0      0               21
 23     1      0      0               14
 24     1      0      0                7

Descriptive Statistics: Insects

Variable  Colour  N  N*   Mean
Insects   B       6   0  14.83
          G       6   0  31.50
          L       6   0  47.17
          W       6   0  15.67

The regression equation is
Insects trapped = 15.7 - 0.83 Blue + 15.8 Green + 31.5 Lemon

Predictor     Coef  StDev      T      P
Constant    15.667  2.770   5.66  0.000
Blue        -0.833  3.917  -0.21  0.834
Green       15.833  3.917   4.04  0.001
Lemon       31.500  3.917   8.04  0.000

S = 6.784   R-Sq = 82.1%   R-Sq(adj) = 79.4%
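The dummy coding in the data display can be reproduced in a few lines. The sketch below is illustrative only: the names `counts` and `dummy_row` are assumptions made for this example and are not part of the Minitab session. It treats white as the baseline level (all three indicators equal to 0) and recomputes the group means shown in the descriptive statistics.

```python
# Insects trapped per colour, copied from the data display.
counts = {
    "Lemon": [45, 59, 48, 46, 38, 47],
    "White": [21, 12, 14, 17, 13, 17],
    "Green": [37, 32, 15, 25, 39, 41],
    "Blue": [16, 11, 20, 21, 14, 7],
}


def dummy_row(colour):
    """Return the (Blue, Green, Lemon) indicators; White is the baseline."""
    return tuple(int(colour == level) for level in ("Blue", "Green", "Lemon"))


# Sample mean of each colour group.
means = {colour: sum(values) / len(values) for colour, values in counts.items()}

for colour in counts:
    print(colour, dummy_row(colour), round(means[colour], 2))
```

Three indicators are enough for four colours because the fourth colour is identified by all indicators being zero.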
Analysis of Variance

Source          DF      SS      MS      F      P
Regression       3  4218.5  1406.2  30.55  0.000
Residual Error  20   920.5    46.0

Estimate the betas using the means (descriptive statistics).

State whether the following statements are true or false.
a) The value of the F-statistic for testing any differences among the colours is 30.55.
b) We have evidence at p < 0.01 that the means for green and white are different.
c) We have evidence at p < 0.01 that the means for blue and white are different.
d) A 95% confidence interval for the difference between the means for lemon yellow and white is (23.3, 39.7).
e) We may say that 82.1% of the variation in the number of insects trapped has been accounted for by the above model.
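The quantities needed for these questions can be checked by hand from the printed output. The following is a minimal sketch, not a worked solution from the source: the variable names are illustrative, and the critical value `t_crit = 2.086` for 20 degrees of freedom is taken from a standard t table rather than from the Minitab output.

```python
# Group means from the descriptive statistics; White is the baseline level.
means = {"Blue": 14.83, "Green": 31.50, "Lemon": 47.17, "White": 15.67}

# With 0/1 indicators, the constant is the baseline mean and each
# coefficient is the difference between a group mean and the baseline mean.
b0 = means["White"]
betas = {colour: means[colour] - b0 for colour in ("Blue", "Green", "Lemon")}

# Overall F statistic: MS(regression) / MS(error), from the ANOVA table.
ss_reg, df_reg = 4218.5, 3
ss_err, df_err = 920.5, 20
f_stat = (ss_reg / df_reg) / (ss_err / df_err)

# R-squared = SSR / SST, where SST = SSR + SSE.
r_sq = ss_reg / (ss_reg + ss_err)

# 95% confidence interval for the Lemon coefficient (Lemon minus White).
t_crit = 2.086  # assumption: t(0.025, 20) read from a t table
coef, se = 31.500, 3.917
ci = (coef - t_crit * se, coef + t_crit * se)

print(b0, betas)
print(round(f_stat, 2), round(100 * r_sq, 1), ci)
```

The coefficients computed from the rounded means agree with the Minitab coefficients up to rounding.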
Models with two qualitative variables (5.8, p. 282)

Data Display

Row  C1    perform  F2  F3  B2  F2B2  F3B2  F   B
  1  F1B1       65   0   0   0     0     0  F1  B1
  2  F1B1       73   0   0   0     0     0  F1  B1
  3  F1B1       68   0   0   0     0     0  F1  B1
  4  F1B2       36   0   0   1     0     0  F1  B2
  5  F2B1       78   1   0   0     0     0  F2  B1
  6  F2B1       82   1   0   0     0     0  F2  B1
  7  F2B2       50   1   0   1     1     0  F2  B2
  8  F2B2       43   1   0   1     1     0  F2  B2
  9  F3B1       48   0   1   0     0     0  F3  B1
 10  F3B1       46   0   1   0     0     0  F3  B1
 11  F3B2       61   0   1   1     0     1  F3  B2
 12  F3B2       62   0   1   1     0     1  F3  B2

Example 5.10 (p. 286)

Main effects model

The regression equation is
perform = 64.5 + 6.70 F2 - 2.30 F3 - 15.8 B2

Predictor     Coef  StDev      T      P
Constant    64.455  7.180   8.98  0.000
F2           6.705  9.941   0.67  0.519
F3          -2.295  9.941  -0.23  0.823
B2         -15.818  8.291  -1.91  0.093

S = 13.75   R-Sq = 36.2%   R-Sq(adj) = 12.3%

Analysis of Variance

Source          DF      SS     MS     F      P
Regression       3   858.3  286.1  1.51  0.284
Residual Error   8  1512.4  189.1
Total           11  2370.7

Source  DF  Seq SS
F2       1    92.0
F3       1    78.1
B2       1   688.1
Interaction model

Descriptive Statistics: perform

Variable  C1    N  N*    Mean  StDev
perform   F1B1  3   0   68.67   4.04
          F1B2  1   0  36.000      *
          F2B1  2   0   80.00   2.83
          F2B2  2   0   46.50   4.95
          F3B1  2   0   47.00   1.41
          F3B2  2   0  61.500  0.707

The regression equation is
perform = 68.7 + 11.3 F2 - 21.7 F3 - 32.7 B2 - 0.83 F2B2 + 47.2 F3B2

Predictor     Coef  StDev      T      P
Constant    68.667  1.939  35.42  0.000
F2          11.333  3.066   3.70  0.010
F3         -21.667  3.066  -7.07  0.000
B2         -32.667  3.878  -8.42  0.000
F2B2        -0.833  5.130  -0.16  0.876
F3B2        47.167  5.130   9.19  0.000

S = 3.358   R-Sq = 97.1%   R-Sq(adj) = 94.8%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5  2303.00  460.60  40.84  0.000
Residual Error   6    67.67   11.28
Total           11  2370.67

Source  DF  Seq SS
F2       1   92.04
F3       1   78.13
B2       1  688.09
F2B2     1  491.30
F3B2     1  953.44
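Because the interaction model has one parameter per cell, its coefficients are simple contrasts of the six cell means. The sketch below is an illustration under that assumption, with F1B1 as the baseline cell; the dictionary `cell` and the `b_...` names are hypothetical labels chosen for this example.

```python
# Cell means from the descriptive statistics, keyed by (F level, B level).
cell = {
    ("F1", "B1"): 68.67, ("F1", "B2"): 36.00,
    ("F2", "B1"): 80.00, ("F2", "B2"): 46.50,
    ("F3", "B1"): 47.00, ("F3", "B2"): 61.50,
}

base = cell[("F1", "B1")]

# Main-effect coefficients: difference from the baseline cell F1B1.
b_f2 = cell[("F2", "B1")] - base
b_f3 = cell[("F3", "B1")] - base
b_b2 = cell[("F1", "B2")] - base

# Interaction coefficients: how much the B2 effect at F2 (or F3)
# differs from the B2 effect at F1.
b_f2b2 = (cell[("F2", "B2")] - cell[("F2", "B1")]) - b_b2
b_f3b2 = (cell[("F3", "B2")] - cell[("F3", "B1")]) - b_b2

print(round(base, 2), round(b_f2, 2), round(b_f3, 2), round(b_b2, 2))
print(round(b_f2b2, 2), round(b_f3b2, 2))
```

The values agree with the Minitab coefficients up to the rounding of the printed means.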
[Figure: Interaction plot, data means for perform. Horizontal axis: F (F1, F2, F3); vertical axis: mean of perform; one line for each level of B (B1, B2).]

Estimate the regression equation using the means (descriptive statistics).

Test whether there is an interaction between brand and fuel type.
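One common way to test the two interaction terms together is a partial (nested-model) F test that compares the main effects model with the interaction model. The sketch below uses the residual sums of squares printed in the two ANOVA tables; the critical value 5.14 for F(0.05; 2, 6) is an assumption read from a standard F table, and the variable names are illustrative.

```python
# Residual sums of squares and degrees of freedom from the two ANOVA tables.
sse_reduced, df_reduced = 1512.4, 8   # main effects model
sse_full, df_full = 67.67, 6          # interaction model

# Number of interaction terms being tested (F2B2 and F3B2).
num_df = df_reduced - df_full

# Partial F statistic: drop in SSE per tested term, over the full-model MSE.
f_stat = ((sse_reduced - sse_full) / num_df) / (sse_full / df_full)

f_crit = 5.14  # assumption: F(0.05; 2, 6) from an F table
print(round(f_stat, 2), f_stat > f_crit)
```

A statistic far above the critical value indicates that the effect of fuel type depends on brand, which is consistent with the crossing lines in the interaction plot.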
Variable screening methods (Chap. 6, p. 321)

Stepwise regression (p. 323)

A hospital surgical unit was interested in predicting the survival times of patients undergoing a particular type of liver operation. A random sample of patients was available for analysis. From each patient record, the following information was extracted from the preoperative evaluation:

X1 = blood clotting score
X2 = prognostic index
X3 = enzyme function test score
X4 = liver function test score
X5 = age in years
X6 = indicator variable for gender (0 = M, 1 = F)
X7 and X8 = indicator variables for history of alcohol use (categorical: none, moderate, severe)
  X7 = indicator of moderate
  X8 = indicator of severe

Data Display

Row   X1  X2   X3    X4  X5  X6  X7  X8     Y    lny
  1  6.7  62   81  2.59  50   0   1   0   695  6.544
  2  5.1  59   66  1.70  39   0   0   0   403  5.999
  3  7.4  57   83  2.16  55   0   0   0   710  6.565
  4  6.5  73   41  2.01  48   0   0   0   349  5.854
  5  7.8  65  115  4.30  45   0   0   1  2343  7.759
  6  5.8  38   72  1.42  65   1   1   0   348  5.852
  7  5.7  46   63  1.91  49   1   0   1   518  6.250
...
 50  3.9  82  103  4.55  50   0   1   0  1078  6.983
 51  6.6  77   46  1.95  50   0   1   0   405  6.005
 52  6.4  85   40  1.21  58   0   0   1   579  6.361
 53  6.4  59   85  2.33  63   0   1   0   550  6.310
 54  8.8  78   72  3.20  56   0   0   0   651  6.478
The regression equation is
Y = -1149 + 62.4 X1 + 8.97 X2 + 9.89 X3 + 50.4 X4 - 0.95 X5 + 15.9 X6 + 7.7 X7 + 321 X8

Predictor      Coef  StDev      T      P
Constant    -1148.8  242.3  -4.74  0.000
X1            62.39  24.47   2.55  0.014
X2            8.973  1.874   4.79  0.000
X3            9.888  1.742   5.68  0.000
X4            50.41  44.96   1.12  0.268
X5           -0.951  2.649  -0.36  0.721
X6            15.87  58.47   0.27  0.787
X7             7.71  64.96   0.12  0.906
X8           320.70  85.07   3.77  0.000

S = 201.4   R-Sq = 78.2%   R-Sq(adj) = 74.3%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       8  6543615  817952  20.16  0.000
Residual Error  45  1825906   40576
Total           53  8369521

[Figure: Residuals versus the fitted values (response is Y).]
[Figure: Normal probability plot of the residuals (response is Y); normal score versus residual.]

[Figure: Histogram of the residuals (response is Y); frequency versus residual.]
The regression equation is
lny = 4.05 + 0.0685 X1 + 0.0135 X2 + 0.0150 X3 + 0.0080 X4 - 0.00357 X5 + 0.0842 X6 + 0.0579 X7 + 0.388 X8

Predictor       Coef     StDev      T      P
Constant      4.0505    0.2518  16.09  0.000
X1           0.06851   0.02542   2.70  0.010
X2          0.013452  0.001947   6.91  0.000
X3          0.014954  0.001809   8.26  0.000
X4           0.00802   0.04671   0.17  0.865
X5         -0.003566  0.002752  -1.30  0.202
X6           0.08421   0.06075   1.39  0.173
X7           0.05786   0.06748   0.86  0.396
X8           0.38838   0.08838   4.39  0.000

S = 0.2093   R-Sq = 84.6%   R-Sq(adj) = 81.9%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       8  10.8370  1.3546  30.93  0.000
Residual Error  45   1.9707  0.0438
Total           53  12.8077

[Figure: Residuals versus the fitted values (response is lny).]
[Figure: Normal probability plot of the residuals (response is lny); normal score versus residual.]

[Figure: Histogram of the residuals (response is lny); frequency versus residual.]
Stepwise Regression

F-to-Enter: 4.00    F-to-Remove: 4.00

Response is lny on 8 predictors, with N = 54

Step           1       2       3       4
Constant   5.264   4.351   4.291   3.852

X3        0.0151  0.0154  0.0145  0.0155
T-Value     6.23    8.19    9.33   11.07

X2                0.0141  0.0149  0.0142
T-Value             5.98    7.68    8.20

X8                         0.429   0.353
T-Value                     5.08    4.57

X1                                 0.073
T-Value                             3.86

S          0.375   0.291   0.238   0.211
R-Sq       42.76   66.33   77.80   82.99
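The stepwise output reports T-values rather than F-values. For a single predictor added to or dropped from a model, the partial F statistic equals the square of its t statistic, so the entry and removal decisions can be checked directly. This is a small illustrative sketch; the dictionary names are assumptions made for the example.

```python
F_TO_ENTER = 4.00
F_TO_REMOVE = 4.00

# T-value of each predictor at the step where it entered the model.
entering_t = {"X3": 6.23, "X2": 5.98, "X8": 5.08, "X1": 3.86}

# T-values of all predictors in the final (step 4) model.
final_t = {"X3": 11.07, "X2": 8.20, "X8": 4.57, "X1": 3.86}

# Partial F for one term is t squared.
entering_f = {name: t ** 2 for name, t in entering_t.items()}
final_f = {name: t ** 2 for name, t in final_t.items()}

# A variable enters when its partial F exceeds F-to-enter; it would be
# removed later if its partial F dropped below F-to-remove.
all_entered = all(f > F_TO_ENTER for f in entering_f.values())
none_removed = all(f >= F_TO_REMOVE for f in final_f.values())

for name, f in entering_f.items():
    print(name, round(f, 2))
print(all_entered, none_removed)
```

An F-to-enter of 4.00 corresponds to an absolute T-value of 2, which is why the procedure stops after step 4 once no remaining predictor reaches that threshold.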
Minitab commands for stepwise regression
All possible regressions selection procedure (6.3, p. 327)

R-sq criterion:

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

Response is lny

Vars  R-Sq  Adj. R-Sq    C-p        s  Predictors in model (X1..X8)
   1  42.8       41.7  117.4  0.37549  X
   1  42.2       41.0  119.2  0.37746  X
   1  22.1       20.6  177.9  0.43807  X
   1  13.9       12.2  201.8  0.46052  X
   1   6.1        4.3  224.7  0.48101  X
   2  66.3       65.0   50.5  0.29079  X X
   2  59.9       58.4   69.1  0.31715  X X
   2  54.9       53.1   84.0  0.33668  X X
   2  51.6       49.7   93.4  0.34850  X X
   2  50.8       48.9   95.9  0.35157  X X
   3  77.8       76.5   18.9  0.23845  X X X
   3  75.7       74.3   25.0  0.24934  X X X
   3  71.8       70.1   36.5  0.26885  X X X
   3  68.1       66.2   47.3  0.28587  X X X
   3  67.6       65.7   48.7  0.28802  X X X
   4  83.0       81.6    5.8  0.21087  X X X X
   4  81.4       79.9   10.3  0.22023  X X X X
   4  78.9       77.2   17.8  0.23498  X X X X
   4  78.4       76.6   19.3  0.23785  X X X X
   4  78.0       76.2   20.4  0.23982  X X X X
   5  83.7       82.1    5.5  0.20827  X X X X X
   5  83.6       81.9    6.0  0.20931  X X X X X
   5  83.3       81.6    6.8  0.21100  X X X X X
   5  83.2       81.4    7.2  0.21193  X X X X X
   5  81.8       79.9   11.3  0.22044  X X X X X
   6  84.3       82.3    5.8  0.20655  X X X X X X
   6  83.9       81.9    7.0  0.20934  X X X X X X
   6  83.9       81.8    7.2  0.20964  X X X X X X
   6  83.8       81.8    7.2  0.20982  X X X X X X
   6  83.7       81.6    7.6  0.21066  X X X X X X
   7  84.6       82.3    7.0  0.20705  X X X X X X X
   7  84.4       82.0    7.7  0.20867  X X X X X X X
   7  84.0       81.6    8.7  0.21081  X X X X X X X
   7  84.0       81.5    8.9  0.21136  X X X X X X X
   7  82.1       79.4   14.3  0.22306  X X X X X X X
   8  84.6       81.9    9.0  0.20927  X X X X X X X X
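As a quick check of the R-sq formula, both forms can be evaluated from the ANOVA table of the full lny model (8 predictors). This is a minimal sketch with illustrative variable names.

```python
# Sums of squares from the ANOVA table for the full lny model.
ssr = 10.8370   # regression
sse = 1.9707    # residual error
sst = 12.8077   # total

# The two equivalent forms of R-squared.
r_sq_from_ssr = ssr / sst
r_sq_from_sse = 1 - sse / sst

print(round(100 * r_sq_from_ssr, 1), round(100 * r_sq_from_sse, 1))
```

Both forms give the 84.6% reported for the 8-variable model, since SSR + SSE = SST.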
Best Subsets Regression

Response is lny

Vars  R-Sq  Adj. R-Sq    C-p        s  Predictors in model (X1..X8)
   1  42.8       41.7  117.4  0.37549  X
   2  66.3       65.0   50.5  0.29079  X X
   3  77.8       76.5   18.9  0.23845  X X X
   4  83.0       81.6    5.8  0.21087  X X X X
   5  83.7       82.1    5.5  0.20827  X X X X X
   6  84.3       82.3    5.8  0.20655  X X X X X X
   7  84.6       82.3    7.0  0.20705  X X X X X X X
   8  84.6       81.9    9.0  0.20927  X X X X X X X X

[Figure: R-sq versus number of variables (vars) for the best subset of each size.]
Ex:

Response is crimes

Vars  R-Sq  Adj. R-Sq   C-p      s  Predictors in model
   1  75.4       75.3  23.6  39995  X
   2  78.3       78.1  -0.2  37660  X X
   3  78.4       78.1   1.0  37671  X X X
   4  78.5       78.0   2.6  37732  X X X X
   5  78.5       78.0   4.1  37784  X X X X X
   6  78.5       77.9   6.1  37875  X X X X X X
   7  78.5       77.8   8.0  37968  X X X X X X X

[Figure: R-sq versus number of variables (Vars).]
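Other criteria

2) R-sq(adj) criterion:

$R^2_{adj} = 1 - \frac{MSE}{SST/(n-1)}$

3) $C_p$ criterion (p. 328):

$C_p = \frac{SSE_p}{MSE_k} + 2(p+1) - n$

Here $SSE_p$ is the error sum of squares of the subset model with p predictors, and $MSE_k$ is the mean square error of the full model with all k predictors.

The $C_p$ criterion selects as the best model the subset model with
1) a small value of $C_p$
2) a value of $C_p$ near p + 1 (p is the number of predictors)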
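Both criteria can be checked against the best-subsets table for the four-variable model selected by the stepwise procedure. The sketch below is illustrative: it reconstructs $SSE_p$ from the rounded R-Sq of 82.99% in the stepwise output, so the results match the printed values only up to rounding. The variable names are assumptions made for this example.

```python
n = 54            # number of observations
k = 8             # predictors in the full model
p = 4             # predictors in the subset model

sst = 12.8077                 # total sum of squares for lny
mse_full = 1.9707 / (n - k - 1)   # MSE of the full model (SSE / 45)

# SSE of the subset model, recovered from its (rounded) R-squared.
r_sq_subset = 0.8299
sse_p = (1 - r_sq_subset) * sst

# Adjusted R-squared: 1 - MSE_p / (SST / (n - 1)).
mse_p = sse_p / (n - p - 1)
r_sq_adj = 1 - mse_p / (sst / (n - 1))

# Mallows' C_p: SSE_p / MSE_k + 2(p + 1) - n.
c_p = sse_p / mse_full + 2 * (p + 1) - n

print(round(100 * r_sq_adj, 1), round(c_p, 2))
```

The adjusted R-sq agrees with the 81.6 shown in the table, and $C_p$ comes out close to the tabulated 5.8, which is also near p + 1 = 5.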
MINITAB commands