Statistics GIDP Ph.D. Qualifying Exam Methodology


Statistics GIDP Ph.D. Qualifying Exam Methodology
May 26, 2017, 9:00am-1:00pm

Instructions: Put your ID (not your name) on each sheet. Complete exactly 5 of 6 problems; turn in only those sheets you wish to have graded. Each question, but not necessarily each part, is equally weighted. Provide answers on the supplied pads of paper and/or use a Microsoft Word document or equivalent to report your software code and outputs. Number each problem. You may turn in only one electronic document. Embed relevant code and output/graphics into your Word document. Write on only one side of each sheet if you use paper. You may use the computer and/or a calculator. Stay calm and do your best. Good luck!

1. A process engineer is testing the yield of a product manufactured on five machines. Each machine has two operators, one for the day shift and one for the night shift. Assume the operator factor is random. We take five samples from each machine for each operator and obtain the following data (machine.csv):

Machine            1      2      3      4      5
Day operator     0.125  0.118  0.123  0.126  0.118
                 0.127  0.122  0.125  0.128  0.129
                 0.125  0.120  0.125  0.126  0.127
                 0.126  0.124  0.124  0.127  0.120
                 0.128  0.119  0.126  0.129  0.121
Night operator   0.124  0.116  0.122  0.126  0.125
                 0.128  0.125  0.121  0.129  0.123
                 0.127  0.119  0.124  0.125  0.114
                 0.126  0.125  0.126  0.130  0.124
                 0.129  0.120  0.125  0.124  0.117

(a) What design is this?
(b) State the statistical model and assumptions.
(c) Analyze the data and draw a conclusion.
(d) If these five machines were randomly selected from many machines in the factory, would the conclusion be the same as the one obtained in (c)? Explain (no calculation needed).
(e) Attach your SAS code here.

2. A nickel-titanium alloy is used to make components for jet turbine aircraft engines. Cracking is a potentially serious problem, as it can lead to non-recoverable failure. A test is run at the parts producer to determine the effects of four factors on cracks. The four factors are pouring temperature (A), titanium content (B), heat treatment method (C), and the amount of grain refiner used (D). Each factor has two levels and 16 runs are performed. Two operators need to take care of these 16 runs, and there might be some variation between the two operators.

 A   B   C   D   Operator
 -   -   -   -
 +   -   -   -
 -   +   -   -
 +   +   -   -
 -   -   +   -
 +   -   +   -
 -   +   +   -
 +   +   +   -
 -   -   -   +
 +   -   -   +
 -   +   -   +
 +   +   -   +
 -   -   +   +
 +   -   +   +
 -   +   +   +
 +   +   +   +

(a) Help them to divide the workload equally by filling in the table above.
(b) Assume the response measurements in the above table (from top to bottom) are 25, 71, 48, 45, 68, 40, 60, 65, 43, 80, 25, 104, 55, 86, 70, 76, and the dataset is given in aircraft.csv. Use SAS code to estimate the factor effects. Which factor effects appear to be large? Is there a large variation between the two operators?
(c) Conduct an analysis of variance to verify the conclusion of (b).
(d) Attach your SAS code here.

3. A study was carried out to compare the writing lifetime of four premium brands of pens. It was thought that the writing surface would affect lifetime, so three different surfaces were used and the data are given below.

              Surface
Brand        1        2        3     Average
  1         734      724      756      738
  2         659      688      699      682
  3         646      670      697      671
  4         676      675      680      677
Average   678.75   689.25    708       692

(a) What design is this?
(b) State the statistical model with assumptions.
(c) How would you check whether there exists any significant interaction between the surfaces and brands of pens? State your hypothesis in mathematical notation.
(d) Analyze the data using the given dataset pen.csv.
(e) Attach your SAS code.
(f) Assume that in this study 3 observations were collected for each combination, so that each value in the above table is the average of 3 replicates. A two-way ANOVA model with interaction is fitted and the MSE is 340.4. Complete the following ANOVA table and draw conclusions.

Source        DF    SS    MS    F-value    P-value
Brand
Surface
Interaction
Error

4. In a study of carbohydrate uptake (Y) as a function of other factors in male diabetics, observations were taken as follows:

 Y    Age, x1    Weight, x2    Dietary Protein, x3
 33     33          100               14
 40     47           92               15
 37     49          135               18
 27     35          144               12
 25     46          140               15
 43     52          101               15
 34     62           95               14
 48     23          101               17
 30     32           98               15
 38     42          105               14
 50     31          108               17
 51     61           85               19
 30     63          130               19
 36     40          127               20
 41     50          109               15
 42     64          107               16
 46     56          117               18
 24     61          100               13
 35     48          118               18
 37     28          102               14

Analyze these data to determine which (if any) of the predictor variables (including any appropriate interactions) appear to significantly affect carbohydrate uptake. Throughout, set α = 0.05, but for simplicity do not employ any adjustments for multiplicity/multiple inferences when assessing the effects of the predictor variables. Remember to assess the quality of the fit via standard diagnostics. Attach supporting components of your computer code. Report your findings. The data are found in the file diet.csv.

5. A large dataset of n = 1030 samples of concrete involving a total of p − 1 = 8 predictor variables was collected:

x1 = Age
x2 = Cement
x3 = Furnace Slag
x4 = Superplasticizer
x5 = Water
x6 = Fly Ash
x7 = Coarse Aggregate
x8 = Fine Aggregate

along with Y = Compressive Strength. The data appear in the file concrete.csv. Consider a multiple linear regression (MLR) model for these data, with

E[Y] = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + β7X7 + β8X8.

Conduct a variable selection search to identify a possible reduced model among the eight predictor variables with this data set. Employ backward elimination and take minimum BIC

as your selection criterion. Attach supporting components of your computer code. What is the recommended set of variables for further study?

6. Consider the simple linear regression model: Yi ~ indep. N(β0 + β1Xi, σ²), i = 1, ..., n, where in particular it is known that β0 = 1, so that E[Y] = 1 + β1X. Suppose interest exists in estimating the X value at which E[Y] = 0. Let this target parameter be ξ.

a) Find ξ as a function of β1. Also find the maximum likelihood estimator for ξ. Call this ξ̂.

b) Recall from Section 5.5.4 in Casella & Berger that the Delta Method can be used to determine the asymptotic features of a function of random variables. In particular, for a random variable U and a differentiable function g(u), where E[U] = θ, a first-order approximation to E[g(U)] is

E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ} E(U − θ).

Use this to find a first-order approximation for E[ξ̂].

c) In part (b), a second-order approximation to E[g(U)] is also available from Casella & Berger's book:

E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ} E(U − θ) + ½ {∂²g(θ)/∂θ²} E[(U − θ)²].

Use this to find a second-order approximation for E[ξ̂].

2017 May - Methodology Solutions

1. A process engineer is testing the yield of a product manufactured on five machines. Each machine has two operators, one for the day shift and one for the night shift. Assume the operator factor is random. We take five samples from each machine for each operator and obtain the following data (machine.csv):

Machine            1      2      3      4      5
Day operator     0.125  0.118  0.123  0.126  0.118
                 0.127  0.122  0.125  0.128  0.129
                 0.125  0.120  0.125  0.126  0.127
                 0.126  0.124  0.124  0.127  0.120
                 0.128  0.119  0.126  0.129  0.121
Night operator   0.124  0.116  0.122  0.126  0.125
                 0.128  0.125  0.121  0.129  0.123
                 0.127  0.119  0.124  0.125  0.114
                 0.126  0.125  0.126  0.130  0.124
                 0.129  0.120  0.125  0.124  0.117

(a) What design is this?

This is a nested design: operator is nested within machine.

(b) State the statistical model and assumptions.

yijk = μ + τi + βj(i) + εijk, with Σi τi = 0, βj(i) ~ N(0, σβ²), and εijk ~ N(0, σ²),

where τi is the fixed machine effect and βj(i) is the random effect of operator j within machine i.

(c) Analyze the data and draw a conclusion.

From the SAS output for the above model, machine has a significant effect (p-value 0.0027), while operator(machine) does not (p-value 0.8249):

Type 1 Analysis of Variance

Source             DF   Sum of Squares   Mean Square   Expected Mean Square                                    Error Term              Error DF   F Value   Pr > F
machine             4   0.000303         0.0000758     Var(Residual) + 5 Var(operator(machine)) + Q(machine)   MS(operator(machine))          5     20.38   0.0027
operator(machine)   5   0.0000186        0.00000372    Var(Residual) + 5 Var(operator(machine))                MS(Residual)                  40      0.43   0.8249
Residual           40   0.000346         0.00000865    Var(Residual)

Alternatively, using the default setting method=reml, we get:

Type 3 Tests of Fixed Effects

Effect    Num DF   Den DF   F Value   Pr > F
machine        4        5      9.36   0.0153

(d) If these five machines were randomly selected from many machines in the factory, would the conclusion be the same as the one obtained in (c)? Explain (no calculation needed).

Yes, as the F-value for testing the machine effect would be the same as the F-value in (c): machine is still tested against MS(operator(machine)).

(e) Attach your SAS code here.

data Q1;
input operator machine y @@;
datalines;
1 1 0.125  1 2 0.118  1 3 0.123  1 4 0.126  1 5 0.118
1 1 0.127  1 2 0.122  1 3 0.125  1 4 0.128  1 5 0.129
1 1 0.125  1 2 0.120  1 3 0.125  1 4 0.126  1 5 0.127
1 1 0.126  1 2 0.124  1 3 0.124  1 4 0.127  1 5 0.120
1 1 0.128  1 2 0.119  1 3 0.126  1 4 0.129  1 5 0.121
2 1 0.124  2 2 0.116  2 3 0.122  2 4 0.126  2 5 0.125
2 1 0.128  2 2 0.125  2 3 0.121  2 4 0.129  2 5 0.123
2 1 0.127  2 2 0.119  2 3 0.124  2 4 0.125  2 5 0.114
2 1 0.126  2 2 0.125  2 3 0.126  2 4 0.130  2 5 0.124
2 1 0.129  2 2 0.120  2 3 0.125  2 4 0.124  2 5 0.117
;
run;

proc mixed method=type1 data=Q1;
class operator machine;
model y = machine;
random operator(machine);
run;
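For readers who prefer R to SAS, here is a minimal cross-check of the nested analysis (a sketch, assuming machine.csv carries columns named machine, operator, and y, as in the SAS step above):

# Nested ANOVA in R: machine fixed, operator random and nested in machine.
d <- read.csv("machine.csv")
d$machine  <- factor(d$machine)
d$operator <- factor(d$operator)

# The Error() stratum makes aov() test machine against MS(operator(machine)),
# matching the expected-mean-square logic in the SAS Type 1 output above.
fit <- aov(y ~ machine + Error(machine:operator), data = d)
summary(fit)

The machine F-statistic from this fit should reproduce the F = 20.38 reported above.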

2. A nickel-titanium alloy is used to make components for jet turbine aircraft engines. Cracking is a potentially serious problem, as it can lead to non-recoverable failure. A test is run at the parts producer to determine the effects of four factors on cracks: pouring temperature (A), titanium content (B), heat treatment method (C), and the amount of grain refiner used (D). Each factor has two levels and 16 runs are performed. Two operators need to take care of these 16 runs, and there might be some variation between the two operators.

(a) Help them to divide the workload equally by filling in the table above.

Assign the operators by confounding the operator factor with the four-factor interaction ABCD: runs with ABCD at its high level go to operator 1 and the rest to operator 2, so each operator performs 8 runs and the operator difference is aliased only with ABCD.

 A   B   C   D   Operator
 -   -   -   -      1
 +   -   -   -      2
 -   +   -   -      2
 +   +   -   -      1
 -   -   +   -      2
 +   -   +   -      1
 -   +   +   -      1
 +   +   +   -      2
 -   -   -   +      2
 +   -   -   +      1
 -   +   -   +      1
 +   +   -   +      2
 -   -   +   +      1
 +   -   +   +      2
 -   +   +   +      2
 +   +   +   +      1

(b) Assume the response measurements in the above table (from top to bottom) are 25, 71, 48, 45, 68, 40, 60, 65, 43, 80, 25, 104, 55, 86, 70, 76 and the dataset is given in aircraft.csv. Use SAS code to estimate the factor effects. Which factor effects appear to be large? Is there a large variation between the two operators?

It appears that A, C, D, AC, and AD have large effects, and that there is also a large variation between the two operators:

_NAME_       COL1     effect
operator   -9.3125   -18.625
AC         -9.0625   -18.125
BCD        -1.3125    -2.625
ACD        -0.8125    -1.625
CD         -0.5625    -1.125
BD         -0.1875    -0.375
AB          0.0625     0.125
ABC         0.9375     1.875
BC          1.1875     2.375
B           1.5625     3.125
ABD         2.0625     4.125
C           4.9375     9.875
D           7.3125    14.625
AD          8.3125    16.625
A          10.8125    21.625

(c) Conduct an analysis of variance to verify the conclusion of (b).

The ANOVA result below shows that the factors A, C, D and the interactions AC and AD are significant, as is the operator effect.

Source     DF   Type III SS   Mean Square   F Value   Pr > F
A           1   1870.562500   1870.562500     89.76   <.0001
C           1    390.062500    390.062500     18.72   0.0019
D           1    855.562500    855.562500     41.05   0.0001
AC          1   1314.062500   1314.062500     63.05   <.0001
AD          1   1105.562500   1105.562500     53.05   <.0001
operator    1   1387.562500   1387.562500     66.58   <.0001

(d) Attach your SAS code here.

data Q2;
input A B C D operator y;
datalines;
-1 -1 -1 -1  1  25
 1 -1 -1 -1 -1  71
-1  1 -1 -1 -1  48
 1  1 -1 -1  1  45
-1 -1  1 -1 -1  68
 1 -1  1 -1  1  40
-1  1  1 -1  1  60
 1  1  1 -1 -1  65
-1 -1 -1  1 -1  43
 1 -1 -1  1  1  80
-1  1 -1  1  1  25
 1  1 -1  1 -1 104
-1 -1  1  1  1  55
 1 -1  1  1 -1  86
-1  1  1  1 -1  70
 1  1  1  1  1  76
;
run;

data inter; set Q2;
AB=A*B; AC=A*C; AD=A*D; BC=B*C; BD=B*D; CD=C*D;
ABC=AB*C; ABD=AB*D; ACD=AC*D; BCD=BC*D;
* operator was assigned by confounding with ABCD, so it stands in for the ABCD column;
run;

proc reg outest=effects data=inter;
model y=A B C D operator AB AC AD BC BD CD ABC ABD ACD BCD;
run;

data effect2; set effects;
drop y Intercept _RMSE_;
run;

proc transpose data=effect2 out=effect3;
run;

data effect4; set effect3;
effect=COL1*2;
run;

proc sort data=effect4; by effect;
proc print data=effect4;
run;

proc rank data=effect4 out=effect5 normal=blom;
var effect;
ranks neff;
run;

proc gplot data=effect5;
plot effect*neff=_NAME_;
run;

proc glm data=inter;
class A C D AC AD operator;
model y=A C D AC AD operator;
run;
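The same effect estimates can be reproduced in R; a minimal sketch (assuming aircraft.csv holds the ±1 columns A, B, C, D, operator and the response y; under ±1 coding, each factor effect is twice the corresponding regression coefficient):

# Effect estimation for the 2^4 design with operator confounded with ABCD.
d <- read.csv("aircraft.csv")   # assumed columns: A, B, C, D, operator, y

# All main effects and interactions up to order 3, plus operator in place of
# the aliased ABCD term; the model is saturated (16 runs, 16 parameters).
fit <- lm(y ~ (A + B + C + D)^4 - A:B:C:D + operator, data = d)
eff <- 2 * coef(fit)[-1]        # effects = 2 * coefficients under +/-1 coding
sort(eff)

# Normal plot of the effects, analogous to the proc rank / proc gplot step:
qqnorm(eff); qqline(eff)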

3. A study was carried out to compare the writing lifetime of four premium brands of pens. It was thought that the writing surface would affect lifetime, so three different surfaces were used and the data are given below.

              Surface
Brand        1        2        3     Average
  1         734      724      756      738
  2         659      688      699      682
  3         646      670      697      671
  4         676      675      680      677
Average   678.75   689.25    708       692

(a) What design is this?

Randomized complete block design (RCBD): the surfaces represent a known source of variation, so surface serves as the block factor.

(b) State the statistical model with assumptions.

yij = μ + τi + βj + εij, with Σi τi = 0, Σj βj = 0, and εij ~ N(0, σ²).

(c) How would you check whether there exists any significant interaction between the surfaces and brands of pens? State your hypothesis in mathematical notation.

Use Tukey's one-degree-of-freedom test for nonadditivity, which augments the additive model with a multiplicative interaction term:

yij = μ + τi + βj + γτiβj + εij,   H0: γ = 0  vs.  H1: γ ≠ 0.

(d) Analyze the data using the given dataset pen.csv.

Tukey's one-degree-of-freedom test shows that the interaction is not significant (p-value = 0.7626):

Source    DF   Type III SS   Mean Square   F Value   Pr > F
surface    2   35.68462402   17.84231201      0.10   0.9102
brand      3   35.67090351   11.89030117      0.06   0.9767
q          1   18.94670637   18.94670637      0.10   0.7626

So use the additive model yij = μ + τi + βj + εij. The Type III SS ANOVA table shows that both surface and brand have significant effects:

Source    DF   Type III SS   Mean Square   F Value   Pr > F
surface    2   1756.500000    878.250000      5.55   0.0432
brand      3   8646.000000   2882.000000     18.21   0.0020

Check model adequacy: the residual plot and QQ-plot (not reproduced here) show no unusual pattern, and none of the formal normality tests rejects:

Tests for Normality

Test                   Statistic          p Value
Shapiro-Wilk           W     0.882162     Pr < W      0.0934
Kolmogorov-Smirnov     D     0.198836     Pr > D     >0.1500
Cramer-von Mises       W-Sq  0.092878     Pr > W-Sq   0.1279
Anderson-Darling       A-Sq  0.589591     Pr > A-Sq   0.0981

(e) Attach your SAS code.

data Q3;
input surface brand lifetime;
datalines;
1 1 734
1 2 659
1 3 646
1 4 676
2 1 724
2 2 688
2 3 670
2 4 675
3 1 756
3 2 699
3 3 697
3 4 680
;

proc glm data=Q3;
class surface brand;
model lifetime=surface brand;
output out=diag r=res p=pred;
run;

data two; set diag;
q=pred*pred;

proc glm data=two;
class surface brand;
model lifetime=surface brand q/ss3;
run;

proc sgplot data=diag;
scatter x=pred y=res;
refline 0;
run;

proc univariate data=diag normal;
var res;
qqplot res/normal (L=1 mu=est sigma=est);
run;
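For completeness, the same Tukey one-degree-of-freedom test is easy to run in R; a minimal sketch (assuming pen.csv has columns surface, brand, and lifetime; the term q below is the squared fitted value from the additive fit, exactly as in the SAS code):

# Tukey's 1-df test for nonadditivity, mirroring the SAS steps above.
d <- read.csv("pen.csv")
d$surface <- factor(d$surface)
d$brand   <- factor(d$brand)

add.fit <- lm(lifetime ~ surface + brand, data = d)  # additive RCBD model
d$q <- fitted(add.fit)^2                             # Tukey's 1-df term
tukey.fit <- lm(lifetime ~ surface + brand + q, data = d)
anova(add.fit, tukey.fit)                            # F-test of H0: gamma = 0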

(f) Assume that in this study 3 observations were collected for each combination, so that each value in the above table is the average of 3 replicates. A two-way ANOVA model with interaction is fitted and the MSE is 340.4. Complete the following ANOVA table and draw conclusions.

Source        DF      SS        MS       F-value   P-value
Brand          3   25938.0    8646.00     25.40    <0.001
Surface        2    5269.5    2634.75      7.74    <0.01
Interaction    6    2848.5     474.75      1.39    >0.1
Error         24    8169.6     340.40

Both the brand and the surface have a significant effect on the lifetime, but their interaction does not. The effect estimates are recovered from the table of averages:

μ̂ = 692
τ̂1 = 738 − 692 = 46,  τ̂2 = 682 − 692 = −10,  τ̂3 = 671 − 692 = −21,  τ̂4 = 677 − 692 = −15
β̂1 = 678.75 − 692 = −13.25,  β̂2 = 689.25 − 692 = −2.75,  β̂3 = 708 − 692 = 16

(τβ)11 = 734 − 738 − 678.75 + 692 = 9.25
(τβ)12 = 724 − 738 − 689.25 + 692 = −11.25
(τβ)13 = 756 − 738 − 708 + 692 = 2
(τβ)21 = 659 − 682 − 678.75 + 692 = −9.75
(τβ)22 = 688 − 682 − 689.25 + 692 = 8.75
(τβ)23 = 699 − 682 − 708 + 692 = 1
(τβ)31 = 646 − 671 − 678.75 + 692 = −11.75
(τβ)32 = 670 − 671 − 689.25 + 692 = 1.75
(τβ)33 = 697 − 671 − 708 + 692 = 10
(τβ)41 = 676 − 677 − 678.75 + 692 = 12.25
(τβ)42 = 675 − 677 − 689.25 + 692 = 0.75
(τβ)43 = 680 − 677 − 708 + 692 = −13

With a = 4 brands, b = 3 surfaces, and n = 3 replicates:

SS_Brand = bn Σi τ̂i² = 3·3·(46² + 10² + 21² + 15²) = 25,938
SS_Surface = an Σj β̂j² = 4·3·(13.25² + 2.75² + 16²) = 5,269.5
SS_Interaction = n Σij (τβ)ij² = 3·(9.25² + 11.25² + 2² + 9.75² + 8.75² + 1² + 11.75² + 1.75² + 10² + 12.25² + 0.75² + 13²) = 2,848.5

Or, equivalently, work with the totals: since each table entry is the average of n = 3 replicates, the treatment totals are recovered as yij. = 3 ȳij. (e.g., y11. = 3 × 734 = 2202), and likewise for the row, column, and grand totals. Then

SS_Brand = Σi yi..²/(bn) − y...²/(abn) = 155385378/(3·3) − 620607744/(4·3·3) = 25,938
SS_Surface = Σj y.j.²/(an) − y...²/(abn) = 206932482/(4·3) − 620607744/(4·3·3) = 5,269.5
SS_Subtotals = Σij yij.²/n − y...²/(abn) = 51819480/3 − 620607744/(4·3·3) = 34,056
SS_Interaction = SS_Subtotals − SS_Brand − SS_Surface = 2,848.5
SS_E = MSE × ab(n − 1) = 340.4 × 24 = 8,169.6
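These hand computations are easy to verify numerically; a minimal R sketch, built only from the 4 x 3 table of cell averages and the stated MSE:

# Rebuild the problem 3(f) ANOVA table from the table of cell averages.
means <- matrix(c(734, 724, 756,
                  659, 688, 699,
                  646, 670, 697,
                  676, 675, 680), nrow = 4, byrow = TRUE)
a <- 4; b <- 3; n <- 3; mse <- 340.4
mu <- mean(means)                                      # 692

ss.brand   <- b * n * sum((rowMeans(means) - mu)^2)    # 25938
ss.surface <- a * n * sum((colMeans(means) - mu)^2)    # 5269.5
cell.inter <- means - outer(rowMeans(means), colMeans(means), "+") + mu
ss.inter   <- n * sum(cell.inter^2)                    # 2848.5
ss.error   <- mse * a * b * (n - 1)                    # 8169.6

c(F.brand   = (ss.brand/(a - 1))/mse,                  # 25.40
  F.surface = (ss.surface/(b - 1))/mse,                # 7.74
  F.inter   = (ss.inter/((a - 1)*(b - 1)))/mse)        # 1.39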

4. In a study of carbohydrate uptake (Y) as a function of other factors in male diabetics, observations were taken as follows:

 Y    Age, x1    Weight, x2    Dietary Protein, x3
 33     33          100               14
 40     47           92               15
 37     49          135               18
 27     35          144               12
 25     46          140               15
 43     52          101               15
 34     62           95               14
 48     23          101               17
 30     32           98               15
 38     42          105               14
 50     31          108               17
 51     61           85               19
 30     63          130               19
 36     40          127               20
 41     50          109               15
 42     64          107               16
 46     56          117               18
 24     61          100               13
 35     48          118               18
 37     28          102               14

Analyze these data to determine which (if any) of the predictor variables (including any appropriate interactions) appear to significantly affect carbohydrate uptake. Throughout, set α = 0.05, but for simplicity do not employ any adjustments for multiplicity/multiple inferences when assessing the effects of the predictor variables. Remember to assess the quality of the fit via standard diagnostics. Attach supporting components of your computer code. Report your findings. The data are found in the file diet.csv.

To start, always plot the data! Sample R code:

diet.df = read.csv( file.choose() )
attach( diet.df )
Y = Y; X1 = Age; X2 = Weight; X3 = Dietary.Protein
pairs( cbind(Y, X1, X2, X3), pch=19 )

No disturbing patterns are seen in the scatterplot matrix (not reproduced here). Now fit the model; sample R code follows (notice the use of centered predictors to properly accommodate the higher-order interaction terms):

x1 = Age - mean(Age); x2 = Weight - mean(Weight)
x3 = Dietary.Protein - mean(Dietary.Protein)
diet.lm = lm( Y ~ x1*x2*x3 )
anova( diet.lm )

This yields the following ANOVA table (output edited):

Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value   Pr(>F)
x1         1   3.77    3.77  0.1035 0.753166
x2         1 242.92  242.92  6.6793 0.023902
x3         1 367.44  367.44 10.1032 0.007943
x1:x2      1  37.77   37.77  1.0386 0.328259
x1:x3      1   3.97    3.97  0.1091 0.746814
x2:x3      1 100.03  100.03  2.7505 0.123112
x1:x2:x3   1   0.22    0.22  0.0061 0.939262
Residuals 12 436.43   36.37

Reading the sequential sums of squares from the bottom of the table up (where they coincide with the partial SS), none of the interaction terms is significant (pointwise, at the 5% level). Formally, we test all the interactions at once via:

anova( lm(Y~x1+x2+x3), diet.lm )

producing

Model 1: Y ~ x1 + x2 + x3
Model 2: Y ~ x1 * x2 * x3
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     16 578.42
2     12 436.43  4    141.99 0.9761 0.4563

The P-value for testing all four interactions is P = 0.4563 > 0.05 = α. Again, no interactions are significant. Move now to a reduced model with only main-effect terms (so return to the original, uncentered predictor variables):

dietrm.lm = lm( Y~Age+Weight+Dietary.Protein ); anova( dietrm.lm )

Analysis of Variance Table

Response: Y
                Df Sum Sq Mean Sq F value   Pr(>F)
Age              1   3.77    3.77  0.1042 0.751077
Weight           1 242.92  242.92  6.7195 0.019650
Dietary.Protein  1 367.44  367.44 10.1640 0.005719
Residuals       16 578.42   36.15

The partial SS (here, the final sequential term) shows that Dietary Protein is significant at the (pointwise) 5% level. To study the other terms we can either (i) rearrange the sequential order of the reduced-model ANOVA to isolate Weight and then Age, and test their partial SS contributions, or (ii), since each is a 1 d.f. test, just examine the t-tests for each pointwise β-coefficient. The latter approach is faster:

summary( dietrm.lm )

producing (output edited)

Call:
lm(formula = Y ~ Age + Weight + Dietary.Protein)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      38.83618   13.19459   2.943  0.00954
Age              -0.11797    0.11036  -1.069  0.30095
Weight           -0.25771    0.08407  -3.065  0.00740
Dietary.Protein   2.04319    0.64088   3.188  0.00572

Residual standard error: 6.013 on 16 degrees of freedom
Multiple R-squared: 0.515, Adjusted R-squared: 0.424
F-statistic: 5.663 on 3 and 16 DF, p-value: 0.007706

We see Age is insignificant (P = 0.301 > 0.05), but Weight is significant (P = 0.0074 < 0.05), each at the (pointwise) 5% level. Thus a final reduced model retains only Weight and Dietary Protein:

dietfinal.lm = lm( Y~Weight+Dietary.Protein ); summary( dietfinal.lm )

Call:
lm(formula = Y ~ Weight + Dietary.Protein)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      34.8619    12.7130   2.742  0.01389
Weight           -0.2511     0.0842  -2.982  0.00836
Dietary.Protein   1.9047     0.6303   3.022  0.00769

Residual standard error: 6.038 on 17 degrees of freedom
Multiple R-squared: 0.4803, Adjusted R-squared: 0.4192
F-statistic: 7.857 on 2 and 17 DF, p-value: 0.003834

For diagnostic quality assessment:

(i) Check VIFs for multicollinearity between X2 and X3:

library( car )
vif( dietfinal.lm ); mean(vif( dietfinal.lm ))

         Weight Dietary.Protein
       1.022166        1.022166

Since both values are far below 10 and their mean of 1.022166 is well below 6, no concerns with multicollinearity are evident.

(ii) Check the normal Q-Q plot: sample R code

qqnorm( resid(dietfinal.lm), main=NULL, pch=19)
qqline( resid(dietfinal.lm) )

produces a Q-Q plot (not reproduced here) evidencing no substantive concerns.

(iii) Studentized residual plot (with outlier screen): sample R code is

n = length(Y); p = length( coef(dietfinal.lm) )
tcrit = qt( 1-.5*(.05/n), n-p-1 )
plot( rstudent(dietfinal.lm) ~ fitted(dietfinal.lm), pch=19,
      ylim=c(-ceiling(tcrit),ceiling(tcrit)) )
abline( h=0 )
abline( h=tcrit, lty=2 ); abline( h=-tcrit, lty=2 )

From the residual plot (not reproduced here), no troublesome patterns are seen, and no outliers extend past the screening limits of ±t(1 − 0.05/(2n); n − p − 1) = ±3.580522.

(iv) Influence measures: sample R code is

influence.measures( dietfinal.lm )

which produces the following output (edited)

Influence measures of lm(formula = Y ~ Weight + Dietary.Protein):

     dfb.1_  dfb.Wght  dfb.Dt.P   dffit  cov.r   cook.d    hat inf
1  -0.15514    0.0742   0.10912 -0.1985  1.255 0.013655 0.1029
2  -0.01504    0.0149   0.00336 -0.0209  1.362 0.000155 0.1196
3  -0.11675    0.1070   0.05910  0.1525  1.456 0.008190 0.1910
4   0.02535    0.9383  -0.85093  1.2489  1.744 0.500999 0.4871   *
5   0.11488   -0.2950   0.10765 -0.3401  1.467 0.040031 0.2371
6   0.14945   -0.1060  -0.06465  0.2366  1.137 0.018982 0.0734
7  -0.20056    0.1295   0.11339 -0.2409  1.270 0.020037 0.1245
8   0.03239   -0.1694   0.15000  0.3266  1.069 0.035274 0.0861
9  -0.32872    0.2725   0.11328 -0.4874  0.842 0.072589 0.0853
10  0.10495   -0.0254  -0.09430  0.1524  1.264 0.008111 0.0910
11 -0.05802   -0.1017   0.22427  0.4782  0.738 0.067341 0.0659
12  0.00754   -0.1237   0.11468  0.1716  1.736 0.010392 0.3173   *
13  0.66249   -0.4048  -0.51899 -0.8204  0.946 0.204316 0.2015
14  0.45918   -0.1850  -0.44434 -0.5701  1.358 0.108690 0.2562
15  0.08808   -0.0085  -0.07789  0.2087  1.121 0.014777 0.0587
16  0.02935   -0.0322   0.01092  0.1393  1.188 0.006725 0.0529
17 -0.20720    0.0641   0.23797  0.3650  1.073 0.043898 0.1001
18 -0.71997    0.2341   0.62819 -0.8683  0.696 0.210965 0.1506
19  0.15496   -0.0572  -0.16814 -0.2627  1.194 0.023553 0.1018
20  0.04511   -0.0178  -0.03486  0.0601  1.320 0.001279 0.0970

We see the observations at i = 4 and i = 12 are marked for further study:

At i = 4 and i = 12 the hat matrix diagonals h_ii exceed 2p/n = 0.3, indicating high leverage at these points. At i = 4 the value of DFFITS also exceeds 1 in absolute value, so this point again exhibits high influence.

The Cook's distance D_i values are available as the sixth column of the influence.measures object, so we can check their associated F-probability values via

Di = influence.measures(dietfinal.lm)$infmat[,6]
which( pf(Di, df1=p, df2=n-p) > 0.5 )

the result of which is empty. Thus no influence is flagged by the Cook's distance metric. Lastly, no values of DFBETAS2 or DFBETAS3 exceed 1 in absolute value, so no influence is seen on that measure.

5. A large dataset of n = 1030 samples of concrete involving a total of p − 1 = 8 predictor variables was collected:

x1 = Age
x2 = Cement
x3 = Furnace Slag
x4 = Superplasticizer
x5 = Water
x6 = Fly Ash
x7 = Coarse Aggregate
x8 = Fine Aggregate

along with Y = Compressive Strength. The data appear in the file concrete.csv. Consider a multiple linear regression (MLR) model for these data, with

E[Y] = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + β7X7 + β8X8.

Conduct a variable selection search to identify a possible reduced model among the eight predictor variables with this data set. Employ backward elimination and take minimum BIC as your selection criterion. Attach supporting components of your computer code. What is the recommended set of variables for further study?

Begin by loading the data and creating the X variables (notice that the response variable Y has already been transformed to Compressive Strength):

concrete.df = read.csv(file.choose())
attach( concrete.df )
Y = Y
x1 = age
x2 = cement
x3 = slag
x4 = superplasticizer
x5 = water
x6 = fly.ash
x7 = coarse.aggregate
x8 = fine.aggregate

Always plot the data! The command pairs( concrete.df ) produces a scatterplot matrix (not reproduced here) in which a number of interesting patterns appear. None, however, are grossly disturbing at face value.

Next, build the regression fit and apply backward elimination under BIC control:

library( leaps )
cement.lm = lm( Y ~ x1+x2+x3+x4+x5+x6+x7+x8 )
n = length(Y)
step( cement.lm, direction="backward", k=log(n) )   #BIC

This produces [output edited; note that R labels the criterion "AIC" in the step() output, but we instituted the stepwise regression with the BIC penalty k = log(n)]:

Start: AIC=-125.26

Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8

       Df Sum of Sq     RSS      AIC
- x8    1      1.85  860.26 -129.981
- x7    1      2.42  860.83 -129.301
<none>              858.41 -125.263
- x4    1      8.17  866.58 -122.439
- x5    1      9.66  868.07 -120.672
- x6    1     43.87  902.28  -80.861
- x3    1     78.49  936.90  -42.082
- x2    1    156.17 1014.58   39.963
- x1    1    376.76 1235.17  242.597

Step: AIC=-129.98
Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7

       Df Sum of Sq     RSS     AIC
- x7    1      0.58  860.85 -136.22
<none>              860.26 -129.98
- x4    1      7.17  867.43 -128.37
- x5    1     50.42  910.68  -78.25
- x6    1     80.54  940.80  -44.74
- x3    1    225.52 1085.78  102.88
- x1    1    375.00 1235.26  235.74
- x2    1    486.73 1346.99  324.92

Step: AIC=-136.22
Y ~ x1 + x2 + x3 + x4 + x5 + x6

       Df Sum of Sq     RSS     AIC
<none>              860.85 -136.22
- x4    1      6.70  867.54 -135.18
- x5    1     70.57  931.42  -62.00
- x6    1     80.08  940.93  -51.54
- x3    1    238.90 1099.74  109.11
- x1    1    376.94 1237.79  230.90
- x2    1    505.79 1366.63  332.90

Call:
lm(formula = Y ~ x1 + x2 + x3 + x4 + x5 + x6)

Coefficients:
(Intercept)        x1        x2        x3        x4        x5        x6
   4.826634  0.010086  0.009174  0.007385  0.021023 -0.017053  0.006650

We see that after two backward steps, a reduced model retaining only the first six predictors,

x1 = age
x2 = cement
x3 = slag
x4 = superplasticizer
x5 = water
x6 = fly.ash

is recommended for further study.
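As a cross-check on the step() search, one could also score the backward path directly by BIC using leaps::regsubsets; a sketch, assuming the same Y and x1, ..., x8 objects created above:

library( leaps )
# Backward search over all model sizes, scored by BIC.
cand <- data.frame(Y, x1, x2, x3, x4, x5, x6, x7, x8)
rs <- regsubsets(Y ~ ., data = cand, nvmax = 8, method = "backward")
rs.sum <- summary(rs)
rs.sum$bic                        # BIC (up to an additive constant) by model size
coef(rs, which.min(rs.sum$bic))   # coefficients of the minimum-BIC model

If the two searches agree, the minimum-BIC model here should again be the one with x1 through x6.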

6. Consider the simple linear regression model: Yi ~ indep. N(β0 + β1Xi, σ²), i = 1, ..., n, where in particular it is known that β0 = 1, so that E[Y] = 1 + β1X. Suppose interest exists in estimating the X value at which E[Y] = 0. Let this target parameter be ξ.

a) Find ξ as a function of β1. Also find the maximum likelihood estimator for ξ. Call this ξ̂.

This is essentially a one-parameter inverse regression problem. We have E[Y] = 1 + β1X. Clearly, at E[Y] = 0 we have 0 = 1 + β1X, so solving for X produces ξ = −1/β1.

To find the MLE ξ̂, appeal to ML invariance and first find β̂1. The fastest way to do so is to recognize that if E[Y] = 1 + β1X, then E[Y − 1] = β1X. That is, we essentially regress the new response variable (Yi − 1) against Xi through the origin! Referring to the various equations in Sec. 4.4 of Kutner et al., we find that the least squares estimator for β1 is

β̂1 = Σi Xi(Yi − 1) / Σi Xi².

Under the homogeneous-variance, normal-parent assumption here, this estimator remains identical to the MLE, so take

ξ̂ = −1/β̂1 = −Σi Xi² / Σi Xi(Yi − 1).

b) Recall from Section 5.5.4 in Casella & Berger that the Delta Method can be used to determine the asymptotic features of a function of random variables. In particular, for a random variable U and a differentiable function g(u), where E[U] = θ, a first-order approximation to E[g(U)] is

E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ} E(U − θ).

Use this to find a first-order approximation for E[ξ̂].

Let g(β1) = ξ = −1/β1. We know that the MLE for β1 is unbiased, so that E[β̂1] = β1 and hence E(β̂1 − β1) = 0. Then from the Delta Method we see

E[ξ̂] = E[−1/β̂1] ≈ g(β1) + {∂g(β1)/∂β1} E(β̂1 − β1) = −1/β1 + {∂g(β1)/∂β1}(0) = −1/β1 = ξ.

Thus, to first order, ξ̂ is approximately unbiased for ξ.

c) In part (b), a second-order approximation to E[g(U)] is also available from Casella & Berger's book:

E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ} E(U − θ) + ½ {∂²g(θ)/∂θ²} E[(U − θ)²].

Use this to find a second-order approximation for E[ξ̂].

Again, let g(β1) = ξ = −1/β1. We know that the MLE for β1 is unbiased, so that E(β̂1 − β1) = 0 and E[(β̂1 − β1)²] = Var[β̂1]. This latter quantity is

Var[β̂1] = Var[Σi Xi(Yi − 1) / Σj Xj²]
        = Var[Σi Xi(Yi − 1)] / (Σj Xj²)²
        = Σi Xi² Var[Yi − 1] / (Σj Xj²)²
        = Σi Xi² Var[Yi] / (Σj Xj²)²
        = σ² Σi Xi² / (Σj Xj²)²
        = σ² / Σj Xj².

Collecting all this together, with ∂²g(β1)/∂β1² = −2/β1³, yields

E[ξ̂] = E[−1/β̂1] ≈ g(β1) + {∂g(β1)/∂β1}(0) + ½ {∂²g(β1)/∂β1²} Var[β̂1]
      = −1/β1 − (2σ²) / (2β1³ Σj Xj²)
      = −1/β1 − σ² / (β1³ Σj Xj²)
      = ξ (1 + ξ²σ² / Σj Xj²).

(We see that to second order, a bias exists in the point estimator. However, it can in fact be shown that E[ξ̂] does not exist, as E[|ξ̂|] diverges. Thus, one must always be careful with these sorts of approximate expansions.)
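The second-order bias term is easy to visualize by simulation; below is a minimal R sketch (the values of β1, σ, and the design points are hypothetical choices for illustration, and, because E[ξ̂] does not strictly exist, the simulated sampling distribution is summarized by a lightly trimmed mean):

# Illustrate E[xi.hat] =~ xi * (1 + xi^2 * sigma^2 / sum(X^2)) by simulation.
set.seed(1)
beta1 <- -0.5; sigma <- 1              # hypothetical parameter values
X <- seq(1, 10, length.out = 25)       # hypothetical fixed design points
xi <- -1/beta1                         # true X value at which E[Y] = 0

xi.hat <- replicate(100000, {
  Yv <- 1 + beta1 * X + rnorm(length(X), sd = sigma)
  b1 <- sum(X * (Yv - 1)) / sum(X^2)   # MLE of beta1 (through the origin)
  -1/b1                                # MLE of xi, by ML invariance
})

mean(xi.hat, trim = 0.01)              # simulated (trimmed) mean of xi.hat
xi * (1 + xi^2 * sigma^2 / sum(X^2))   # second-order approximation to E[xi.hat]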