Stat 401B Exam 3 Fall 2016 (Corrected Version)


I have neither given nor received unauthorized assistance on this exam.

Name Signed ______________________  Date ______________
Name Printed ______________________

ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning will receive NO partial credit. Correct numerical answers to difficult questions unaccompanied by supporting reasoning may not receive full credit. SHOW YOUR WORK/EXPLAIN YOURSELF!

1. There are data on the UCI Machine Learning Repository, due originally to P. Tüfekci and H. Kaya, concerning the running of a power plant. Hourly information on atmospheric conditions and a plant operating variable was collected over a number of years, along with the hourly energy output of the plant. This question concerns MLR analyses of a random sample of 200 of the hourly periods, treating mean "PE" (electrical power) as a function of the variables "AT" (ambient temperature in °C), "AP" (ambient pressure in millibars), "RH" (relative humidity in %), and "V" (exhaust vacuum in cm Hg).

5 pts  a) Below is a graphic from the "leaps" function regsubsets() for the n = 200 periods. Which 2 predictors seem to be most effective in predicting PE? What fraction of the raw variability in PE do they account for?

[regsubsets() summary plot not reproduced in this transcription.]

8 pts  b) Give the value of and degrees of freedom for an F statistic comparing the full model involving all 4 predictors to the best 2-predictor model. (While it is not really needed to answer this question, SSTot = 54670 for these n = 200 cases.)

F = ________    d.f. = _____ , _____
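For concreteness, here is a minimal R sketch of the full-versus-reduced comparison part b) asks for. The data frame name ccpp is an assumption (it is not on the printout), and AT, V stand in for whatever 2-predictor model part a) identifies:

reduced <- lm(PE ~ AT + V, data = ccpp)             # best 2-predictor model from a)
full    <- lm(PE ~ AT + V + AP + RH, data = ccpp)   # all 4 predictors
anova(reduced, full)   # F statistic on 2 and 195 = 200 - 5 degrees of freedom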

Below are some results (cross-validation root mean squared prediction error) from repeated 10-fold cross-validation, and values of RMSE = √MSE and R² for several MLR models for PE.

Model   Predictors Included   CV-RMSPE   RMSE   R-Squared
  1     V                       8.92     8.81     .720
  2     AT                      5.30     5.24     .901
  3     AT,V                    4.66     4.70     .921
  4     AT,RH                   4.56     4.60     .925
  5     AT,AP,RH                4.61     4.60     .925
  6     AT,V,RH                 4.31     4.32     .934
  7     AT,V,AP,RH              4.36     4.34     .934

4 pts  c) Which of Models 1-7 is most attractive on the basis of the table above? Explain.

4 pts  d) What about the table above suggests that none of the models fit there suffers dramatic overfitting?

Below are some scatterplots of the data from the 200 sample hours.

[Scatterplot matrix not reproduced in this transcription.]

4 pts  e) Is there evidence of multicollinearity in these plots? If so, what is it?
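The exact cross-validation scheme behind the table is not shown in the transcription, but a minimal sketch of how a CV-RMSPE of this kind might be computed (again assuming a data frame ccpp, and using Model 6 as the example) is:

set.seed(1)
n <- nrow(ccpp)
rmspe <- replicate(20, {                         # 20 repeats of 10-fold CV
  fold <- sample(rep(1:10, length.out = n))      # random fold assignment
  pred <- numeric(n)
  for (k in 1:10) {
    fit <- lm(PE ~ AT + V + RH, data = ccpp[fold != k, ])
    pred[fold == k] <- predict(fit, newdata = ccpp[fold == k, ])
  }
  sqrt(mean((ccpp$PE - pred)^2))                 # RMSPE for this repeat
})
mean(rmspe)   # compare to the CV-RMSPE column for Model 6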

2. There is an interesting "Banknote Authentication" data set on the UCI Machine Learning Repository that consists of 4 numerical features extracted from 400 × 400 gray scale images of real and counterfeit banknotes. There are 610 counterfeit and 762 real notes represented in the data set. There is a printout beginning on Page 8 of this exam from an attempt to model the probability that a note is counterfeit (V5 = 1) as a function of the features (V1, V2, V3, V4). Use it to answer the following questions.

a) Which of the features V1, V2, V3, V4 appears to be least important in modeling the probability that V5 = 1 (the note is counterfeit)? Explain.

b) Recall that if p(u) = exp(u) / (1 + exp(u)), then the "log odds" are ln(p(u) / (1 − p(u))) = u. Give approximately 95% confidence limits for the increase in log odds that a banknote is counterfeit accompanying a unit increase in V1 if the other features V2, V3, V4 are held fixed.

c) Give 2-sided approximately 95% confidence limits for the probability that a banknote with features V1 = .2, V2 = .8, V3 = .4, V4 = −.6 is counterfeit.
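A hedged sketch of the arithmetic parts b) and c) call for, using the estimates, standard errors, and predict() output from the printout on Page 8 (± 2 standard errors is used here as the approximate 95% multiplier; this is one standard choice, not necessarily the one intended):

# b) Wald 95% limits for the V1 coefficient (log odds scale): estimate +/- 2(SE)
-7.8593 + c(-2, 2) * 1.7383                  # roughly (-11.34, -4.38)

# c) 95% limits on the log odds scale at (V1,V2,V3,V4) = (.2, .8, .4, -.6),
#    then mapped to probabilities through the inverse logit p(u) = exp(u)/(1+exp(u))
lims <- 0.6453872 + c(-2, 2) * 0.4574428     # $fit and $se.fit from the printout
exp(lims) / (1 + exp(lims))                  # roughly (0.43, 0.83)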

3. A data set in the book Regression Analysis by Graybill and Iyer concerns how an optical reading, y, measuring light transmitted through a chemical solution depends upon the concentration of a chemical, x (in mg/l). A possible nonlinear (in coefficients β1, β2, and β3) form for the relationship between x and mean y is

μ_y|x = β1 + β2 exp(−β3 x)     (*)

A printout beginning on Page 9 summarizes an analysis of the n = 12 pairs in the data set.

a) Suppose relationship (*) above holds and that for a given concentration the optical reading is normally distributed with standard deviation σ. Give approximate 95% two-sided confidence limits for this model parameter.

5 pts  b) According to relationship (*), as concentration, x, goes from 0 to ∞, the mean light transmitted goes from β1 + β2 to β1. The value of concentration, x, at which half of the decrease in light transmission has been realized might be of interest. What is this in terms of the model parameters? Give 95% two-sided confidence limits for this value of x.

4. On page 217 of the white Vardeman and Jobe text there are data of Koh, Morden, and Ogbourne that concern axial breaking strengths of wooden dowel rods of 3 different lengths and 3 different diameters. A printout beginning on Page 9 of this exam summarizes some computations with these data.

a) What about the printed analyses of dowel strength makes direct analysis of y under the usual one-way normal model assumptions seem inappropriate? Instead we will henceforth consider analysis of y′ = ln(y).
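A quick hedged check related to part a), using the raw-strength cell summaries from the printout on Page 9: the cell standard deviations range from about 0.57 to about 30 and grow with the cell means, which is exactly the pattern a log transform tends to stabilize.

cellmean <- aggregate(strength, by = list(type), FUN = mean)$x
cellsd   <- aggregate(strength, by = list(type), FUN = sd)$x
range(cellsd)            # about 0.57 to 30.4: constant variance is untenable for raw y
plot(cellmean, cellsd)   # SD grows with mean, suggesting analysis of ln(y)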

b) Make an interaction plot enhanced with error bars based on 95% confidence limits for combination mean log strengths. What are your "margins of error" for this plotting? (Give a number. A sketch of one way to draw such a plot appears after part d).)

± margin: ________

c) Based on the plot above, which effects appear to be both statistically detectable and most important? (Consider diameter and length main effects and interactions. List an order of importance.)

d) What items on the printout support your judgment in c)? Explain how they lend support.
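A hedged sketch of one way to draw the enhanced interaction plot part b) asks for, using base R, the vectors diam, length, and logstrength from the printout, and the pooled residual standard error 0.1962 on 27 df (m = 4 observations per diameter × length cell; the error-bar x positions assume interaction.plot's default 1, 2, 3 coding of the x factor):

cell <- aggregate(logstrength, by = list(diam = diam, length = length), FUN = mean)
margin <- qt(.975, 27) * 0.1962 / sqrt(4)    # +/- margin for each cell mean
interaction.plot(cell$length, cell$diam, cell$x, type = "b",
                 xlab = "length", ylab = "mean log strength", trace.label = "diam")
xpos <- as.numeric(as.factor(cell$length))   # x positions used by interaction.plot
arrows(xpos, cell$x - margin, xpos, cell$x + margin,
       angle = 90, code = 3, length = 0.05)  # vertical 95% error bars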

5. Beginning on Page 12 there is R code and output corresponding to a balanced 3 × 2 × 2 experiment on paper airplane flight distances (carried out in an undergraduate engineering statistics class). There are 3 levels of the factor "Design," 2 levels of the factor (nose) "Weight," and 2 levels of the factor "Paper" (type) in the study. Use the R output to answer the rest of the questions on this exam.

a) What is the value of s_pooled for this data set? (Say where you found your value.) What does this measure in the present context?

b) What is the relatively simple interpretation that is possible for these data? (What factorial effect(s) dominate(s), and what does that mean about the flying of paper airplanes?) What on the output tells you that this is so?

c) What type or types of airplanes fly furthest (according to the outcome of this study)? Explain.

d) What do you predict for the average flight distance of the type or types of planes you identified in part c) based on a good simple model here?
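A hedged pointer for part a): in a balanced factorial fit with lm(), the "Residual standard error" reported by summary() is exactly s_pooled, so it can be read off the printout or extracted directly:

fit <- lm(dist ~ Design * Weight * Paper)
summary(fit)$sigma   # 2.905 on the printout: the pooled within-cell standard deviation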

R Code and Output for the Banknote Data

> Banknote[1:5,]
       V1      V2      V3       V4 V5
1 3.62160  8.6661 -2.8073 -0.44699  0
2 4.54590  8.1674 -2.4586 -1.46210  0
3 3.86600 -2.6383  1.9242  0.10645  0
4 3.45660  9.5228 -4.0112 -3.59440  0
5 0.32924 -4.4552  4.5718 -0.98880  0
> summary(Banknote)
       V1                 V2                 V3                V4               V5
 Min.   :-7.0421   Min.   :-13.773   Min.   :-5.2861   Min.   :-8.5482   Min.   :0.0000
 1st Qu.:-1.7730   1st Qu.: -1.708   1st Qu.:-1.5750   1st Qu.:-2.4135   1st Qu.:0.0000
 Median : 0.4962   Median :  2.320   Median : 0.6166   Median :-0.5867   Median :0.0000
 Mean   : 0.4337   Mean   :  1.922   Mean   : 1.3976   Mean   :-1.1917   Mean   :0.4446
 3rd Qu.: 2.8215   3rd Qu.:  6.815   3rd Qu.: 3.1793   3rd Qu.: 0.3948   3rd Qu.:1.0000
 Max.   : 6.8248   Max.   : 12.952   Max.   :17.9274   Max.   : 2.4495   Max.   :1.0000
> bank.out<-glm(as.factor(V5)~V1+V2+V3+V4,data=Banknote,family=binomial())
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(bank.out)

Call:
glm(formula = as.factor(V5) ~ V1 + V2 + V3 + V4, family = binomial(),
    data = Banknote)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.70001   0.00000   0.00000   0.00029   2.24614

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   7.3218     1.5589   4.697 2.64e-06 ***
V1           -7.8593     1.7383  -4.521 6.15e-06 ***
V2           -4.1910     0.9041  -4.635 3.56e-06 ***
V3           -5.2874     1.1612  -4.553 5.28e-06 ***
V4           -0.6053     0.3307  -1.830   0.0672 .

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1885.122  on 1371  degrees of freedom
Residual deviance:   49.891  on 1367  degrees of freedom
AIC: 59.891

Number of Fisher Scoring iterations: 12

> unknown<-data.frame(V1=.2,V2=.8,V3=.4,V4=-.6)
> predict(bank.out,newdata=unknown,se.fit=TRUE)
$fit
        1
0.6453872

$se.fit
[1] 0.4574428

$residual.scale
[1] 1

R Code and Output for the Optical Data

> optical.out<-nls(y~b1+b2*exp(-b3*x),start=c(b1=0,b2=3,b3=1),trace=T)
1.142691 :  0 3 1
0.4897814 :  0.08377278 2.66283847 0.67210762
0.4604279 :  0.02919644 2.72294772 0.68326005
0.4604271 :  0.02874071 2.72328367 0.68274958
0.4604271 :  0.02875388 2.72327628 0.68276289
> summary(optical.out)

Formula: y ~ b1 + b2 * exp(-b3 * x)

Parameters:
   Estimate Std. Error t value Pr(>|t|)
b1  0.02875    0.17152   0.168 0.870571
b2  2.72328    0.21054  12.935 4.05e-07 ***
b3  0.68276    0.14166   4.820 0.000947 ***

Residual standard error: 0.2262 on 9 degrees of freedom

Number of iterations to convergence: 4
Achieved convergence tolerance: 7.998e-07

> confint(optical.out)
Waiting for profiling to be done...
0.5970897 :  2.7232763 0.6827629
0.4785793 :  2.8120293 0.6082035
...
1.042523 :  0.3770538 2.4924962
1.041343 :  0.3664283 2.4959833
         2.5%     97.5%
b1 -0.5093296 0.3499205
b2  2.2623059 3.2411076
b3  0.4017537 1.0651215
> predict(optical.out)
 [1] 2.7520302 2.7520302 1.4046053 1.4046053 0.7238604 0.7238604 0.3799351 0.3799351
 [9] 0.2061774 0.2061774 0.1183916 0.1183916
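One hedged route to the limits question 3 a) asks for, treating 9s²/σ² as chi-square with 9 degrees of freedom and taking s = 0.2262 from the nls() summary above (a standard approach, though not necessarily the only acceptable one):

s <- 0.2262
s * sqrt(9 / qchisq(c(.975, .025), df = 9))   # approximate 95% limits, about (0.156, 0.413)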

R Code and Output for the Dowel Strength Data

> cbind(type,diam,length,strength)
      type   diam length strength
 [1,]    1 0.1250      4     51.5
 [2,]    1 0.1250      4     37.4
 [3,]    1 0.1250      4     59.3
 [4,]    1 0.1250      4     58.5
 [5,]    2 0.1250      8      5.2
 [6,]    2 0.1250      8      6.4
 [7,]    2 0.1250      8      9.0
 [8,]    2 0.1250      8      6.3
 [9,]    3 0.1250     12      2.5
[10,]    3 0.1250     12      3.3
[11,]    3 0.1250     12      2.6
[12,]    3 0.1250     12      1.9
[13,]    4 0.1875      4    225.3
[14,]    4 0.1875      4    233.9
[15,]    4 0.1875      4    211.2
[16,]    4 0.1875      4    212.8
[17,]    5 0.1875      8     47.0
[18,]    5 0.1875      8     79.2
[19,]    5 0.1875      8     88.7
[20,]    5 0.1875      8     70.2
[21,]    6 0.1875     12     18.4
[22,]    6 0.1875     12     22.4
[23,]    6 0.1875     12     18.9
[24,]    6 0.1875     12     16.6
[25,]    7 0.2500      4    358.8
[26,]    7 0.2500      4    309.6
[27,]    7 0.2500      4    343.5
[28,]    7 0.2500      4    357.8
[29,]    8 0.2500      8    127.1
[30,]    8 0.2500      8    158.0
[31,]    8 0.2500      8    194.0
[32,]    8 0.2500      8    133.0
[33,]    9 0.2500     12     68.9
[34,]    9 0.2500     12     40.5
[35,]    9 0.2500     12     50.3
[36,]    9 0.2500     12     65.6

> options(contrasts = rep("contr.sum", 2))

> aggregate(strength,by=list(type),FUN=mean)
  Group.1       x
1       1  51.675
2       2   6.725
3       3   2.575
4       4 220.800
5       5  71.275
6       6  19.075
7       7 342.425
8       8 153.025
9       9  56.325
> aggregate(strength,by=list(type),FUN=sd)
  Group.1          x
1       1 10.1411291
2       2  1.6111590
3       3  0.5737305
4       4 10.7706391
5       5 17.8593346
6       6  2.4267605
7       7 22.9722115
8       8 30.4237161
9       9 13.3027253
> summary(lm(strength~as.factor(type)))

Call:
lm(formula = strength ~ as.factor(type))

Residuals:
    Min      1Q  Median      3Q     Max
-32.825  -3.363  -0.125   7.025  40.975

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       102.656      2.592  39.604  < 2e-16 ***
as.factor(type)1  -50.981      7.331  -6.954 1.79e-07 ***
as.factor(type)2  -95.931      7.331 -13.085 3.34e-13 ***
as.factor(type)3 -100.081      7.331 -13.651 1.23e-13 ***
as.factor(type)4  118.144      7.331  16.115 2.24e-15 ***
as.factor(type)5  -31.381      7.331  -4.280  0.00021 ***
as.factor(type)6  -83.581      7.331 -11.400 7.95e-12 ***
as.factor(type)7  239.769      7.331  32.704  < 2e-16 ***
as.factor(type)8   50.369      7.331   6.870 2.21e-07 ***

Residual standard error: 15.55 on 27 degrees of freedom
Multiple R-squared:  0.9848,    Adjusted R-squared:  0.9803
F-statistic:   219 on 8 and 27 DF,  p-value: < 2.2e-16

> logstrength<-log(strength)
> logstrength
 [1] 3.9415818 3.6216707 4.0826093 4.0690268 1.6486586 1.8562980 2.1972246 1.8405496
 [9] 0.9162907 1.1939225 0.9555114 0.6418539 5.4174328 5.4548937 5.3528056 5.3603528
[17] 3.8501476 4.3719763 4.4852599 4.2513483 2.9123507 3.1090610 2.9391619 2.8094027
[25] 5.8827651 5.7352811 5.8391871 5.8799742 4.8449742 5.0625950 5.2678582 4.8903491
[33] 4.2326562 3.7013020 3.9180051 4.1835757

> aggregate(logstrength,by=list(type),FUN=mean)
  Group.1         x
1       1 3.9287221
2       2 1.8856827
3       3 0.9268946
4       4 5.3963712
5       5 4.2396830
6       6 2.9424941
7       7 5.8343019
8       8 5.0164441
9       9 4.0088847
> aggregate(logstrength,by=list(type),FUN=sd)
  Group.1          x
1       1 0.21433043
2       2 0.22813681
3       3 0.22618831
4       4 0.04852411
5       5 0.27669684
6       6 0.12433499
7       7 0.06895316
8       8 0.19204237
9       9 0.24728988
> summary(lm(logstrength~as.factor(type)))

Call:
lm(formula = logstrength ~ as.factor(type))

Residuals:
     Min       1Q   Median       3Q      Max
-0.38954 -0.09291  0.00828  0.13430  0.31154

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)
(Intercept)       3.79772    0.03269  116.162  < 2e-16 ***
as.factor(type)1  0.13100    0.09247    1.417    0.168
as.factor(type)2 -1.91204    0.09247  -20.677  < 2e-16 ***
as.factor(type)3 -2.87083    0.09247  -31.046  < 2e-16 ***
as.factor(type)4  1.59865    0.09247   17.288 3.95e-16 ***
as.factor(type)5  0.44196    0.09247    4.780 5.51e-05 ***
as.factor(type)6 -0.85523    0.09247   -9.249 7.38e-10 ***
as.factor(type)7  2.03658    0.09247   22.024  < 2e-16 ***
as.factor(type)8  1.21872    0.09247   13.180 2.82e-13 ***

Residual standard error: 0.1962 on 27 degrees of freedom
Multiple R-squared:  0.9878,    Adjusted R-squared:  0.9842
F-statistic: 273.8 on 8 and 27 DF,  p-value: < 2.2e-16

> summary(lm(logstrength~as.factor(diam)*as.factor(length)))

Call:
lm(formula = logstrength ~ as.factor(diam) * as.factor(length))

Residuals:
     Min       1Q   Median       3Q      Max
-0.38954 -0.09291  0.00828  0.13430  0.31154

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                          3.79772    0.03269 116.162  < 2e-16 ***
as.factor(diam)1                    -1.55062    0.04624 -33.538  < 2e-16 ***
as.factor(diam)2                     0.39513    0.04624   8.546 3.69e-09 ***
as.factor(length)1                   1.25541    0.04624  27.153  < 2e-16 ***
as.factor(length)2                  -0.08378    0.04624  -1.812  0.08111 .
as.factor(diam)1:as.factor(length)1  0.42621    0.06539   6.518 5.46e-07 ***
as.factor(diam)2:as.factor(length)1 -0.05189    0.06539  -0.794  0.43435
as.factor(diam)1:as.factor(length)2 -0.27763    0.06539  -4.246  0.00023 ***
as.factor(diam)2:as.factor(length)2  0.13062    0.06539   1.998  0.05593 .

Residual standard error: 0.1962 on 27 degrees of freedom
Multiple R-squared:  0.9878,    Adjusted R-squared:  0.9842
F-statistic: 273.8 on 8 and 27 DF,  p-value: < 2.2e-16

> anova(lm(logstrength~as.factor(diam)*as.factor(length)))
Analysis of Variance Table

Response: logstrength
                                  Df Sum Sq Mean Sq F value    Pr(>F)
as.factor(diam)                    2 46.748 23.3742 607.462 < 2.2e-16 ***
as.factor(length)                  2 35.470 17.7348 460.900 < 2.2e-16 ***
as.factor(diam):as.factor(length)  4  2.081  0.5202  13.518 3.579e-06 ***
Residuals                         27  1.039  0.0385

R Code and Output for the Paper Airplane Data

> cbind(design,weight,paper,dist)
      design weight paper  dist
 [1,]      1      1     1  5.00
 [2,]      1      1     2  6.00
 [3,]      1      1     1  6.25
 [4,]      1      1     2  7.00
 [5,]      1      1     1  4.75
 [6,]      1      1     2  4.50
 [7,]      1      2     1  6.75
 [8,]      1      2     2  7.25
 [9,]      1      2     1  7.00
[10,]      1      2     2 10.00
[11,]      1      2     1  4.50
[12,]      1      2     2  4.50
[13,]      2      1     1 10.00
[14,]      2      1     2  8.50
[15,]      2      1     1 15.50
[16,]      2      1     2 10.00
[17,]      2      1     1  5.50
[18,]      2      1     2  6.00
[19,]      2      2     1 10.00
[20,]      2      2     2 14.75
[21,]      2      2     1 16.50
[22,]      2      2     2 16.00
[23,]      2      2     1  6.00
[24,]      2      2     2  5.75
[25,]      3      1     1  4.50
[26,]      3      1     2  4.50
[27,]      3      1     1  5.75
[28,]      3      1     2  4.50
[29,]      3      1     1  4.50
[30,]      3      1     2  4.25
[31,]      3      2     1  4.50
[32,]      3      2     2  5.00

[33,]      3      2     1  5.25
[34,]      3      2     2  4.50
[35,]      3      2     1  5.75
[36,]      3      2     2  4.25

> Design<-as.factor(design)
> Weight<-as.factor(weight)
> Paper<-as.factor(paper)

> summary(lm(dist~Design*Weight*Paper))

Call:
lm(formula = dist ~ Design * Weight * Paper)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4167 -0.6042  0.0417  0.8542  5.6667

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             7.09028    0.48422  14.643 1.83e-13 ***
Design1                -0.96528    0.68479  -1.410    0.171
Design2                 3.28472    0.68479   4.797 6.96e-05 ***
Weight1                -0.59028    0.48422  -1.219    0.235
Paper1                  0.02083    0.48422   0.043    0.966
Design1:Weight1         0.04861    0.68479   0.071    0.944
Design2:Weight1        -0.53472    0.68479  -0.781    0.443
Design1:Paper1         -0.43750    0.68479  -0.639    0.529
Design2:Paper1          0.18750    0.68479   0.274    0.787
Weight1:Paper1          0.34028    0.48422   0.703    0.489
Design1:Weight1:Paper1 -0.17361    0.68479  -0.254    0.802
Design2:Weight1:Paper1  0.53472    0.68479   0.781    0.443

Residual standard error: 2.905 on 24 degrees of freedom
Multiple R-squared:  0.5392,    Adjusted R-squared:  0.328
F-statistic: 2.553 on 11 and 24 DF,  p-value: 0.02659

> anova(lm(dist~Design*Weight*Paper))
Analysis of Variance Table

Response: dist
                    Df  Sum Sq Mean Sq F value    Pr(>F)
Design               2 205.212 102.606 12.1557 0.0002259 ***
Weight               1  12.543  12.543  1.4860 0.2346823
Paper                1   0.016   0.016  0.0019 0.9660381
Design:Weight        2   6.295   3.148  0.3729 0.6926604
Design:Paper         2   3.469   1.734  0.2055 0.8156812
Weight:Paper         1   4.168   4.168  0.4938 0.4889847
Design:Weight:Paper  2   5.358   2.679  0.3174 0.7310780
Residuals           24 202.583   8.441