Models with qualitative explanatory variables p216

Example

gen = 1 for female

Row   gpa   hsm  gen
  1   3.32   10    0
  2   2.26    6    0
  3   2.35    8    0
  4   2.08    9    0
  5   3.38    8    0
  6   3.29   10    0
  7   3.21    8    0
  8   2.00    3    0
  9   3.18    9    0
 10   2.34    7    0
 11   3.08    9    0
...
218   2.86    9    1
219   3.32   10    1
220   2.07    9    1
221   0.85    7    1
222   1.86    7    1
223   2.59    5    1
224   2.28    9    1

gpa = 0.903 + 0.207 hsm + 0.0269 gen

Predictor      Coef   SE Coef     T      P
Constant     0.9029    0.2447  3.69  0.000
hsm         0.20704   0.02885  7.18  0.000
gen         0.02693   0.09874  0.27  0.785

S = 0.7043   R-Sq = 19.1%   R-Sq(adj) = 18.3%

Source           DF       SS      MS      F      P
Regression        2   25.847  12.923  26.06  0.000
Residual Error  221  109.616   0.496
Total           223  135.463
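The fitted equation describes two parallel lines in hsm, one for each value of gen, separated vertically by the gen coefficient. A minimal Python sketch, using the rounded coefficients from the output above (the function name predict_gpa and the choice hsm = 8 are illustrative assumptions, not part of the original example):

```python
# Coefficients copied from the MINITAB output above.
B0, B_HSM, B_GEN = 0.9029, 0.20704, 0.02693

def predict_gpa(hsm, gen):
    """Fitted gpa for a given hsm score and gen dummy (gen = 1 for female)."""
    return B0 + B_HSM * hsm + B_GEN * gen

# hsm = 8 is an arbitrary illustrative value.
gpa_gen0 = predict_gpa(8, 0)
gpa_gen1 = predict_gpa(8, 1)
gap = gpa_gen1 - gpa_gen0   # equals the gen coefficient at every hsm

print(round(gpa_gen0, 3), round(gpa_gen1, 3), round(gap, 5))
```

The gap is the same at every hsm because the model has no hsm*gen term. With t = 0.27 (P = 0.785) for gen, the output gives no evidence that the two lines differ.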

If the qualitative variable has more than two levels (say, l levels), introduce l-1 dummy variables.

Example

Length = length of stay in hospital (days)
Nnurses = number of nurses
Region: there are 4 regions: NC, NE, S and W

Row  length  nnurses  region  NC  NE   S
  1    7.13      241  W        0   0   0
  2    8.82       52  NC       1   0   0
  3    8.34       54  S        0   0   1
  4    8.95      148  W        0   0   0
  5   11.20      151  NE       0   1   0
  6    9.76      106  NC       1   0   0
  7    9.68      129  S        0   0   1
  8   11.18      360  NC       1   0   0
  9    8.67      118  S        0   0   1
...
109   11.80      469  NC       1   0   0
110    9.50       46  S        0   0   1
111    7.70      136  W        0   0   0
112   17.94      407  NE       0   1   0
113    9.41       22  S        0   0   1

length = 7.52 + 0.00401 nnurses + 1.42 NC + 2.80 NE + 1.03 S

Predictor       Coef   SE Coef      T      P
Constant      7.5218    0.4272  17.61  0.000
nnurses     0.004010  0.001083   3.70  0.000
NC            1.4178    0.4869   2.91  0.004
NE            2.8028    0.4988   5.62  0.000
S             1.0256    0.4744   2.16  0.033

S = 1.585   R-Sq = 33.7%   R-Sq(adj) = 31.3%

Source           DF       SS      MS      F      P
Regression        4  138.000  34.500  13.74  0.000
Residual Error  108  271.211   2.511
Total           112  409.210
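The l-1 coding used in the table can be sketched in Python as follows. The region W is the level whose dummies are all zero in the data, so it acts as the baseline. The function names dummies and predict_length, and the input nnurses = 150 in region NC, are illustrative assumptions; the coefficients are the rounded values printed above.

```python
LEVELS = ["NC", "NE", "S", "W"]   # the 4 regions
BASELINE = "W"                    # level coded as all zeros in the table above

def dummies(region):
    """Return the l-1 = 3 indicator values (NC, NE, S) for one region."""
    return [1 if region == lev else 0 for lev in LEVELS if lev != BASELINE]

# Coefficients copied from the MINITAB output above.
B0, B_NURSES = 7.5218, 0.004010
B_REGION = [1.4178, 2.8028, 1.0256]   # NC, NE, S

def predict_length(nnurses, region):
    """Fitted length of stay (days) from the dummy-variable model."""
    d = dummies(region)
    return B0 + B_NURSES * nnurses + sum(b * x for b, x in zip(B_REGION, d))

print(dummies("W"), dummies("NE"))
fit_nc = predict_length(150, "NC")
print(round(fit_nc, 3))
```

Each region coefficient is the estimated difference in mean length of stay between that region and W at the same number of nurses.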

Predicted Values (at nnurses = 150, NC = 1, NE = 0, S = 0)

   Fit  StDev Fit         95.0% CI           95.0% PI
 9.541      0.283  ( 8.981, 10.102)  ( 6.350, 12.732)

MINITAB Commands

Example

Row  Blue  Green  Lemon  Insects trapped
  1     0      0      1               45
  2     0      0      1               59
  3     0      0      1               48
  4     0      0      1               46
  5     0      0      1               38
  6     0      0      1               47
  7     0      0      0               21
  8     0      0      0               12
  9     0      0      0               14
 10     0      0      0               17
 11     0      0      0               13
 12     0      0      0               17
 13     0      1      0               37
 14     0      1      0               32
 15     0      1      0               15
 16     0      1      0               25
 17     0      1      0               39
 18     0      1      0               41
 19     1      0      0               16
 20     1      0      0               11
 21     1      0      0               20
 22     1      0      0               21
 23     1      0      0               14
 24     1      0      0                7

Test whether some colors are more attractive than others to beetles.

Insects trapped = 15.7 - 0.83 Blue + 15.8 Green + 31.5 Lemon

Predictor     Coef  SE Coef      T      P
Constant    15.667    2.770   5.66  0.000
Blue        -0.833    3.917  -0.21  0.834
Green       15.833    3.917   4.04  0.001
Lemon       31.500    3.917   8.04  0.000

S = 6.784   R-Sq = 82.1%   R-Sq(adj) = 79.4%

Source          DF      SS      MS      F      P
Regression       3  4218.5  1406.2  30.55  0.000
Residual Error  20   920.5    46.0
Total           23  5139.0
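With dummy variables as the only predictors, the constant is the mean of the baseline group (rows 7 to 12, all dummies zero) and each coefficient is that color's mean minus the baseline mean. The overall F statistic then tests whether the color means differ. The sketch below recomputes these quantities from the table in plain Python; the label "baseline" and the variable names are assumptions made for the illustration.

```python
# Insects trapped, copied from the table above.
# "baseline" is the level with Blue = Green = Lemon = 0 (rows 7-12).
counts = {
    "Lemon":    [45, 59, 48, 46, 38, 47],
    "baseline": [21, 12, 14, 17, 13, 17],
    "Green":    [37, 32, 15, 25, 39, 41],
    "Blue":     [16, 11, 20, 21, 14, 7],
}

def mean(xs):
    return sum(xs) / len(xs)

means = {k: mean(v) for k, v in counts.items()}

# Dummy coefficients are differences from the baseline mean.
coef = {k: means[k] - means["baseline"] for k in ("Blue", "Green", "Lemon")}

# Overall F test: between-group variation against within-group variation.
all_obs = [x for v in counts.values() for x in v]
grand = mean(all_obs)
ss_reg = sum(len(v) * (means[k] - grand) ** 2 for k, v in counts.items())
ss_err = sum((x - means[k]) ** 2 for k, v in counts.items() for x in v)
df_reg = len(counts) - 1             # 3 dummy variables
df_err = len(all_obs) - len(counts)  # 24 - 4 = 20
f_stat = (ss_reg / df_reg) / (ss_err / df_err)

print(round(means["baseline"], 3), {k: round(c, 3) for k, c in coef.items()})
print(round(ss_reg, 1), round(ss_err, 1), round(f_stat, 2))
```

The recomputed values agree with the Constant, Blue, Green and Lemon rows and with the Regression line of the analysis of variance table.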

A test for comparing nested models p231

Definition

Two models are nested if one model contains all the terms of the second model and at least one additional term. The model with more terms is called the complete (or full) model. The model with fewer terms is called the reduced (or restricted) model.

Example 4.10 p 223 (Data from Table 4.4 p214)

Row    wt  distance  cost  wt*dist    wt**2  dist**2
  1  5.90        47   2.6    277.3  34.8100     2209
  2  3.20       145   3.9    464.0  10.2400    21025
  3  4.40       202   8.0    888.8  19.3600    40804
  4  6.60       160   9.2   1056.0  43.5600    25600
  5  0.75       280   4.4    210.0   0.5625    78400
  6  0.70        80   1.5     56.0   0.4900     6400
  7  6.50       240  14.5   1560.0  42.2500    57600
  8  4.50        53   1.9    238.5  20.2500     2809
  9  0.60       100   1.0     60.0   0.3600    10000
 10  7.50       190  14.0   1425.0  56.2500    36100
 11  5.10       240  11.0   1224.0  26.0100    57600
 12  2.40       209   5.0    501.6   5.7600    43681
 13  0.30       160   2.0     48.0   0.0900    25600
 14  6.20       115   6.0    713.0  38.4400    13225
 15  2.70        45   1.1    121.5   7.2900     2025
 16  3.50       250   8.0    875.0  12.2500    62500
 17  4.10        95   3.3    389.5  16.8100     9025
 18  8.10       160  12.1   1296.0  65.6100    25600
 19  7.00       260  15.5   1820.0  49.0000    67600
 20  1.10        90   1.7     99.0   1.2100     8100

a) Fit a complete second order model.

cost = 0.827 - 0.609 wt + 0.00402 dist + 0.00733 wt*dist + 0.0898 wt**2 + 0.000015 dist**2

Predictor         Coef     SE Coef      T      P
Constant        0.8270      0.7023   1.18  0.259
wt             -0.6091      0.1799  -3.39  0.004
dist          0.004021    0.007998   0.50  0.623
wt*dist      0.0073271   0.0006374  11.49  0.000
wt**2          0.08975     0.02021   4.44  0.001
dist**2     0.00001507  0.00002243   0.67  0.513

S = 0.4428   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Source          DF       SS      MS       F      P
Regression       5  449.341  89.868  458.39  0.000
Residual Error  14    2.745   0.196
Total           19  452.086

Source    DF   Seq SS
wt         1  270.553
dist       1  143.631
wt*dist    1   31.268
wt**2      1    3.800
dist**2    1    0.088

Test the hypothesis that the terms wt**2 and dist**2 can be dropped from the model.

cost = - 0.141 + 0.019 wt + 0.00772 distance + 0.00780 wt*dist

Predictor        Coef    SE Coef      T      P
Constant      -0.1405     0.6481  -0.22  0.831
wt             0.0191     0.1582   0.12  0.905
distance     0.007721   0.003906   1.98  0.066
wt*dist     0.0077957  0.0008977   8.68  0.000

S = 0.6439   R-Sq = 98.5%   R-Sq(adj) = 98.3%

Source          DF      SS      MS       F      P
Regression       3  445.45  148.48  358.15  0.000
Residual Error  16    6.63    0.41
Total           19  452.09

Ex 5.14, p270, 5.15, p271
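The nested-model test compares the residual sums of squares of the two fits: F = [(SSE_reduced - SSE_complete) / (number of terms dropped)] / [SSE_complete / (error df of the complete model)]. A minimal Python sketch using the rounded SSE values printed in the two outputs above (the function name nested_f is an assumption for the illustration):

```python
def nested_f(sse_reduced, sse_complete, n_dropped, df_error_complete):
    """F statistic for testing that the dropped terms all have zero coefficients."""
    numerator = (sse_reduced - sse_complete) / n_dropped
    denominator = sse_complete / df_error_complete
    return numerator / denominator

# Reduced model (wt, distance, wt*dist): SSE = 6.63
# Complete second order model:           SSE = 2.745 on 14 error df
# Terms dropped: wt**2 and dist**2, so n_dropped = 2.
f_stat = nested_f(6.63, 2.745, 2, 14)
print(round(f_stat, 2))
```

The statistic is referred to an F distribution with 2 and 14 degrees of freedom, whose tabulated 5% critical value is about 3.74. The same numerator can be obtained from the sequential sums of squares of the complete fit, (3.800 + 0.088) / 2, up to rounding.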

Examples (STA221 Apr 98 Final Exam)

Weight = 0.0265 - 0.0729 Diameter + 0.0628 diam**2

Predictor       Coef   SE Coef      T      P
Constant     0.02652   0.02133   1.24  0.240
Diameter    -0.07287   0.01553  -4.69  0.001
diam**2     0.062755  0.002609  24.06  0.000

S = 0.01117   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Source          DF       SS       MS        F      P
Regression       2  1.32300  0.66150  5299.38  0.000
Residual Error  11  0.00137  0.00012
Total           13  1.32437

Source     DF   Seq SS
Diameter    1  1.25077
diam**2     1  0.07223

The test of significance for the contribution of the second order term in diameter has an F-value of (to the nearest 50)
A) 7600
B) 5300
C) 2650
D) 600
E) 350

Weight = - 0.237 + 0.447 Diameter - 0.150 Height

Predictor       Coef  SE Coef      T      P
Constant    -0.23658  0.06340  -3.73  0.003
Diameter     0.44689  0.03921  11.40  0.000
Height      -0.15043  0.03622  -4.15  0.002

S = 0.05104   R-Sq = 97.8%   R-Sq(adj) = 97.4%

Source          DF       SS       MS       F      P
Regression       2  1.29571  0.64786  248.65  0.000
Residual Error  11  0.02866  0.00261
Total           13  1.32437
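For the multiple-choice question on the quadratic-in-diameter fit, the F-value for a single added term can be obtained in two equivalent ways: as the square of that term's t-ratio, or as its sequential sum of squares divided by the mean square error. The Python sketch below works from the rounded figures printed in that output, so the two results agree only approximately; the helper nearest_50 is an assumption for the illustration.

```python
# Route 1: square of the t-ratio for diam**2.
t_diam_sq = 24.06
f_from_t = t_diam_sq ** 2

# Route 2: Seq SS for diam**2 divided by MSE = SSE / error df.
seq_ss_diam_sq = 0.07223
sse, df_error = 0.00137, 11
f_from_ss = seq_ss_diam_sq / (sse / df_error)

def nearest_50(x):
    """Round x to the nearest multiple of 50."""
    return 50 * round(x / 50)

print(round(f_from_t, 1), round(f_from_ss, 1))
print(nearest_50(f_from_t), nearest_50(f_from_ss))
```

Both routes round to the same multiple of 50. The printed F of 5299.38 is the overall test for both terms together, not the test for diam**2 alone.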

Weight = 0.0216 - 0.151 Diameter + 0.0467 Height + 0.0721 diam**2 - 0.00290 ht**2

Predictor        Coef   SE Coef      T      P
Constant      0.02156   0.03595   0.60  0.563
Diameter     -0.15141   0.04792  -3.16  0.012
Height        0.04666   0.03762   1.24  0.246
diam**2      0.072104  0.006467  11.15  0.000
ht**2       -0.002898  0.004179  -0.69  0.505

S = 0.01057   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Source          DF       SS       MS        F      P
Regression       4  1.32336  0.33084  2958.44  0.000
Residual Error   9  0.00101  0.00011
Total           13  1.32437

Source     DF   Seq SS
Diameter    1  1.25077
Height      1  0.04494
diam**2     1  0.02760
ht**2       1  0.00005

[Figure: Residuals Versus Weight (response is Weight); vertical axis Residual, horizontal axis Weight]

[Figure: Residuals Versus Height (response is Weight); vertical axis Residual, horizontal axis Height]

[Figure: Residuals Versus the Fitted Values (response is Weight); vertical axis Residual, horizontal axis Fitted Value]

[Figure: Normal Probability Plot of the Residuals (response is Weight); vertical axis Normal Score, horizontal axis Residual]

[Figure: Histogram of the Residuals (response is Weight); vertical axis Frequency, horizontal axis Residual]

Weight = 0.117 + 0.0982 Diameter - 0.159 Height + 0.0513 diam*ht

Predictor       Coef  SE Coef      T      P
Constant     0.11742  0.08189   1.43  0.182
Diameter     0.09820  0.07567   1.30  0.224
Height      -0.15942  0.02090  -7.63  0.000
diam*ht      0.05133  0.01063   4.83  0.001

S = 0.02934   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Source          DF       SS       MS       F      P
Regression       3  1.31577  0.43859  509.61  0.000
Residual Error  10  0.00861  0.00086
Total           13  1.32437

7) Which of the following are true?

I) If we test the extra contribution of both height and height squared to the model with only diameter and diameter squared, the calculated F-statistic would be less than 2.

II) If we test the extra contribution of adding both height squared and diameter squared to the first order model with just height and diameter, the calculated F-statistic is less than 200.

III) If we assume the appropriateness of the model with diameter, height and their product, we see that the effect on dry weight of an increase in diameter is not independent of the height of the trees.

IV) Residual plots indicate problems with the second order model containing diameter, height and their respective squares.
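Statements I and II both call for the nested-model F statistic, with the four-term model (diameter, height and their squares) as the complete model. A hedged Python sketch follows, using the rounded residual sums of squares and error degrees of freedom read from the printed analysis of variance tables; the dictionary keys and the function name extra_f are assumptions for the illustration.

```python
# (SSE, error df) read from the printed ANOVA tables; SSE values are rounded.
models = {
    "diam, diam**2":                (0.00137, 11),
    "diam, height":                 (0.02866, 11),
    "diam, height, diam**2, ht**2": (0.00101, 9),
}

def extra_f(reduced, complete):
    """F statistic for the extra terms in `complete` beyond those in `reduced`."""
    sse_r, df_r = models[reduced]
    sse_c, df_c = models[complete]
    n_added = df_r - df_c   # each added term uses one error degree of freedom
    return ((sse_r - sse_c) / n_added) / (sse_c / df_c)

complete = "diam, height, diam**2, ht**2"
f_statement_1 = extra_f("diam, diam**2", complete)   # adds Height and ht**2
f_statement_2 = extra_f("diam, height", complete)    # adds diam**2 and ht**2

print(round(f_statement_1, 2), round(f_statement_2, 1))
```

Because the printed SSE values carry only a few significant digits, the results are approximate, but they are precise enough to compare with the thresholds 2 and 200 named in the statements.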

Sequential Sums of Squares

Regression Analysis: cost versus wt, distance, wt*dist, wt**2, dist**2

cost = 0.827 - 0.609 wt + 0.00402 distance + 0.00733 wt*dist + 0.0898 wt**2 + 0.000015 dist**2

Predictor         Coef     SE Coef      T      P
Constant        0.8270      0.7023   1.18  0.259
wt             -0.6091      0.1799  -3.39  0.004
distance      0.004021    0.007998   0.50  0.623
wt*dist      0.0073271   0.0006374  11.49  0.000
wt**2          0.08975     0.02021   4.44  0.001
dist**2     0.00001507  0.00002243   0.67  0.513

S = 0.442778   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Source          DF       SS      MS       F      P
Regression       5  449.341  89.868  458.39  0.000
Residual Error  14    2.745   0.196
Total           19  452.086

Source    DF   Seq SS
wt         1  270.553
distance   1  143.631
wt*dist    1   31.268
wt**2      1    3.800
dist**2    1    0.088

Regression Analysis: cost versus wt

cost = 0.28 + 1.49 wt

Predictor     Coef  SE Coef     T      P
Constant     0.276    1.368  0.20  0.842
wt          1.4932   0.2883  5.18  0.000

S = 3.17571   R-Sq = 59.8%   R-Sq(adj) = 57.6%

Source          DF      SS      MS      F      P
Regression       1  270.55  270.55  26.83  0.000
Residual Error  18  181.53   10.09
Total           19  452.09

Regression Analysis: cost versus wt, distance

cost = - 4.67 + 1.29 wt + 0.0369 distance

Predictor       Coef   SE Coef      T      P
Constant     -4.6728    0.8911  -5.24  0.000
wt            1.2924    0.1378   9.38  0.000
distance    0.036936  0.004602   8.03  0.000

S = 1.49314   R-Sq = 91.6%   R-Sq(adj) = 90.6%

Source          DF      SS      MS      F      P
Regression       2  414.18  207.09  92.89  0.000
Residual Error  17   37.90    2.23
Total           19  452.09

Regression Analysis: cost versus wt, distance, wt*dist

cost = - 0.141 + 0.019 wt + 0.00772 distance + 0.00780 wt*dist

Predictor        Coef    SE Coef      T      P
Constant      -0.1405     0.6481  -0.22  0.831
wt             0.0191     0.1582   0.12  0.905
distance     0.007721   0.003906   1.98  0.066
wt*dist     0.0077957  0.0008977   8.68  0.000

S = 0.643880   R-Sq = 98.5%   R-Sq(adj) = 98.3%

Source          DF      SS      MS       F      P
Regression       3  445.45  148.48  358.15  0.000
Residual Error  16    6.63    0.41
Total           19  452.09

Regression Analysis: cost versus wt, distance, wt*dist, wt**2

cost = 0.475 - 0.578 wt + 0.00908 distance + 0.00726 wt*dist + 0.0867 wt**2

Predictor        Coef    SE Coef      T      P
Constant       0.4747     0.4585   1.04  0.317
wt            -0.5782     0.1707  -3.39  0.004
distance     0.009078   0.002654   3.42  0.004
wt*dist     0.0072587  0.0006176  11.75  0.000
wt**2         0.08674    0.01934   4.49  0.000

S = 0.434604   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Source          DF      SS      MS       F      P
Regression       4  449.25  112.31  594.62  0.000
Residual Error  15    2.83    0.19
Total           19  452.09

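The sequence of fits shows where the sequential sums of squares come from: the Seq SS for a term is the increase in the regression sum of squares when that term is added to the model containing the terms listed before it. A short Python check using the Regression SS printed for each successive fit (the variable names are assumptions for the illustration):

```python
# Regression SS of each successive fit, in the order the terms were added.
ss_regression = [
    ("wt",       270.55),
    ("distance", 414.18),
    ("wt*dist",  445.45),
    ("wt**2",    449.25),
    ("dist**2",  449.341),
]

# Seq SS for a term = regression SS with the term minus regression SS without it.
seq_ss = {}
previous = 0.0
for term, ssr in ss_regression:
    seq_ss[term] = ssr - previous
    previous = ssr

for term, value in seq_ss.items():
    print(f"{term:10s} {value:8.3f}")
```

The differences reproduce the printed Seq SS column up to rounding. The last one comes out near 0.091 rather than 0.088 because the four-term regression SS is printed to only two decimals.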