Statistics GIDP Ph.D. Qualifying Exam Methodology


January 9, 2018, 9:00am-1:00pm

Instructions: Put your ID (not your name) on each sheet. Complete exactly 5 of the 6 problems; turn in only those sheets you wish to have graded. Each question, but not necessarily each part, is equally weighted. Provide answers on the supplied pads of paper and/or use a Microsoft Word document or equivalent to report your software code and output. Number each problem. You may turn in only one electronic document; embed relevant code and output/graphics into your Word document. Write on only one side of each sheet if you use paper. You may use the computer and/or a calculator. Stay calm and do your best. Good luck!

1. A chemist wishes to test the effect of five chemical agents on the strength of a particular type of cloth. She selects five bolts of cloth and applies the chemicals to them. However, because of resource limitations, she can only run the design below, in which each cloth receives four of the five chemicals:

[cloth-by-chemical layout table not reproduced]

(a) What design is this?

(b) State the statistical model and assumptions.

(c) Are the Type I and Type III sums of squares equal in the SAS output for the model y = chemical + cloth + ε? Why?

(d) Besides the number of levels of the two factors, what other parameter(s) would you use to describe this design? Also give their value(s).

(e) If you are given Σ_i τ̂_i² = 12.52, what is SS_chemical(adjusted)?

(f) Fill in the blanks in the ANOVA table below and draw conclusions at α = 0.05.

Source     DF   Seq SS   Adj SS   Adj MS   F   P
chemical
cloth
Error
Total

[numeric entries not reproduced]

(g) If the chemist had enough materials to run 5 × 5 = 25 runs, what design would you suggest to her? And what is the statistical model?

(h) If the chemist had enough materials to run 50 runs and suspected that there might be some interaction between cloth and chemical, what design would you suggest to her? And what is the statistical model?

2. An experimenter is studying the absorbing rate of three medicines. Four batches of pills are randomly selected from each medicine, and three determinations of the absorbing rate are made on each batch. Examine the data in the file medicine.csv.

(a) From examining the data, what design was used?

(b) Give the appropriate statistical model, with assumptions.

(c) What are the hypotheses for testing the batch effect?

(d) What are the hypotheses for testing the medicine effect?

(e) Conduct an analysis of variance on these data. Do any of the factors affect the absorbing rate? Use α = 0.05. Include your SAS code.

3. The yield of a food product process is being studied. The two factors of interest are temperature and pressure. Three levels of each factor are selected; however, only 9 runs can be made in one day. The experimenter runs a complete replicate of the design on each day, with the pressure-temperature combinations run in random order. The data are shown in the following table.

[yield data for Days 1 and 2, by Temperature (rows) and Pressure (Low/Medium/High); numeric entries not reproduced]

Here is a portion of the associated SAS output:

Source      Sum of Squares   DF   Mean Square   F-value   Prob > F
Day
temp
pres
temp*pres
Residual
Cor Total

[numeric entries not reproduced]

(a) What design is this?

(b) State the statistical model and the corresponding assumptions.

(c) Fill in the blanks in the ANOVA table below:

Source      Sum of Squares   DF   Mean Square   F-value   Prob > F
Day
temp
pres
temp*pres
Residual
Cor Total

(d) Is it reasonable to use the ANOVA F-ratio and p-value for the term Day to evaluate the significance of this factor? If yes, calculate it. If not, explain why not.

(e) Draw conclusions at α = 0.05.

(f) If the experimenter did not include the factor Day in the statistical model, what would the new ANOVA table look like? (Use the information given in the SAS output above.)

Source      Sum of Squares   DF   Mean Square   F-Value   Prob > F

4. A study recorded laboratory animal responses (Y) to a drug as related to three quantitative predictor variables: X1 = body weight (grams), X2 = age (months), and X3 = administered drug dose (mg). The data are found in the file animal3.csv. Assume a homogeneous-variance, multiple linear regression model containing only first-order terms is appropriate for these data. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) Check the predictor variables for any concerns with multicollinearity. What do you find?

(b) Perform a ridge regression on these data: construct and display a trace plot and suggest a reasonable value for the tuning parameter c. State why you chose this value.

(c) Consider the following values of c: −2.7, 1.5, 8.9. Choose the most reasonable from among these values and employ it as your biasing constant in a ridge regression of Y on X1, X2, and X3. Give the resulting ridge estimators for all the regression coefficients.

5. In a biomonitoring study of workplace chemical exposure, retired factory workers were assayed for blood concentrations (Y) of an industrial chemical. Three predictor variables were also recorded: X1 = Years worked, X2 = Years retired, and X3 = Age. The data are available in the file chemical.csv. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) One might argue that when Age = 0, we would expect the response to be zero. Explain why we would then also expect the response to be zero when X1 = X2 = 0. Operating under the assumption that the response is zero when all three predictor variables are zero, fit an appropriate multiple regression model (use only first-order terms) to these data using all three predictors. Identify which, if any, of the three predictors significantly affects E[Y]. Conduct your tests at a family-wise error rate (FWER) of 0.5%.

(b) Plot the residuals from the model fit in part (a) against the predicted values from the fit. Do any untoward patterns appear?

(c) Assess whether any observations possess high leverage in the model fit from part (a).

(d) Since X1 and X2 represent time at or after possible workplace exposure, consider the joint null hypothesis H0: β1 = β2 = 0 vs. Ha: any departure. Perform a single test of H0 against Ha at a false positive rate of 0.5%.

6. Consider the multiple linear regression model Y ~ N_n(Xβ, σ²I), where Y is an n × 1 vector, X is an n × p matrix, and β is a p × 1 vector. Show that the maximum likelihood estimator for β is derived from the same estimating equations as the least squares estimator for β, making the two estimators identical. For simplicity, you may assume that σ² is known. [Hint: from matrix calculus, recall that ∂(Va)/∂a = V^T and ∂(a^T U a)/∂a = (U + U^T)a, for conformable matrices U and V and vector a.]

Statistics GIDP Ph.D. Qualifying Exam Methodology
January 9, 2018, 9:00am-1:00pm — Solutions

1. A chemist wishes to test the effect of five chemical agents on the strength of a particular type of cloth. She selects five bolts of cloth and applies the chemicals to them. However, because of resource limitations, she can only run the design below, in which each cloth receives four of the five chemicals:

[cloth-by-chemical layout table not reproduced]

(a) What design is this?
A BIBD (balanced incomplete block design).

(b) State the statistical model and assumptions.
y_ij = μ + τ_i + β_j + ε_ij, for i = 1,...,5 and j = 1,...,5 (observed cells only), with Σ τ_i = 0, Σ β_j = 0, and ε_ij ~ iid N(0, σ²).

(c) Are the Type I and Type III sums of squares equal in the SAS output for the model y = chemical + cloth + ε? Why?
No: the incomplete layout is not orthogonal, so the sequential (Type I) and partial (Type III) sums of squares differ.
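The non-orthogonality answer in part (c) can be checked numerically. The sketch below (Python with NumPy, on a small made-up unbalanced layout, not the exam's data) computes a sequential (Type I-style) and an adjusted (Type III-style) sum of squares for the treatment factor by comparing residual sums of squares of nested fits; in a non-orthogonal layout the two differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def rss(X, y):
    # residual sum of squares from a least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

# Small unbalanced (incomplete) two-factor layout: not every
# chemical appears on every cloth, so the factors are non-orthogonal.
chem  = np.array([0, 0, 1, 1, 2, 2, 0, 1])
cloth = np.array([0, 1, 1, 2, 0, 2, 2, 0])
y = rng.normal(size=chem.size) + chem.astype(float)

def dummies(f):
    # full-rank dummy coding, dropping the first level
    levels = np.unique(f)
    return (f[:, None] == levels[1:][None, :]).astype(float)

ones = np.ones((chem.size, 1))
C, B = dummies(chem), dummies(cloth)

# Type I (sequential) SS for chemical, entered first:
ss_seq = rss(ones, y) - rss(np.hstack([ones, C]), y)
# Type III (adjusted) SS for chemical, adjusted for cloth:
ss_adj = rss(np.hstack([ones, B]), y) - rss(np.hstack([ones, B, C]), y)

print(ss_seq, ss_adj)   # differ because the layout is non-orthogonal
```

In a balanced complete layout the two quantities would coincide, which is exactly why SAS reports equal Type I and Type III sums of squares only under orthogonality.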

(d) Besides the number of levels of the two factors, what other parameter(s) would you use to describe this design? Also give their value(s).
The other parameters are k, the number of treatments per block; r, the number of times each treatment appears; and λ, the number of blocks in which each pair of treatments appears together. In this case, k = 4, r = 4, and λ = 3.

(e) If you are given Σ_i τ̂_i² = 12.52, what is SS_chemical(adjusted)?
SS_chemical(adjusted) = (λa/k) Σ_i τ̂_i² = (3)(5)(12.52)/4 = 46.95.

(f) Fill in the blanks in the ANOVA table below and draw conclusions at α = 0.05.

Source     DF   Seq SS   Adj SS   Adj MS   F   P
chemical
cloth
Error
Total

[numeric entries not reproduced]

We see both chemical and cloth are significant at α = 0.05.

(g) If the chemist had enough materials to run 5 × 5 = 25 runs, what design would you suggest to her? And what is the statistical model?
An RCBD (randomized complete block design):
y_ij = μ + τ_i + β_j + ε_ij, i = 1,...,5; j = 1,...,5, with Σ τ_i = 0, Σ β_j = 0, and ε_ij ~ iid N(0, σ²).

(h) If the chemist had enough materials to run 50 runs and suspected that there might be some interaction between cloth and chemical, what design would you suggest to her? And what is the statistical model?
A 5 × 5 factorial design with two replicates of each combination:
y_ijk = μ + τ_i + β_j + (τβ)_ij + ε_ijk, i = 1,...,5; j = 1,...,5; k = 1, 2, with Σ τ_i = 0, Σ β_j = 0, Σ_i (τβ)_ij = Σ_j (τβ)_ij = 0, and ε_ijk ~ iid N(0, σ²).

2. An experimenter is studying the absorbing rate of three medicines. Four batches of pills are randomly selected from each medicine, and three determinations of the absorbing rate are made on each batch. Examine the data in the file medicine.csv.

(a) From examining the data, what design was used?
A nested design (batches nested within medicines).

(b) Give the appropriate statistical model, with assumptions.

Y_ijk = μ + τ_i + β_j(i) + ε_k(ij), where τ_i represents the medicine effect [a fixed effect with Σ τ_i = 0], β_j(i) represents the batch effect [a random effect with β_j(i) ~ iid N(0, σ_β²)], and ε_k(ij) ~ iid N(0, σ²).

(c) What are the hypotheses for testing the batch effect?
H0: σ_β² = 0 vs. H1: σ_β² > 0.

(d) What are the hypotheses for testing the medicine effect?
H0: τ_1 = τ_2 = τ_3 = 0 vs. H1: at least one τ_i ≠ 0.

(e) Conduct an analysis of variance on these data. Do any of the factors affect the absorbing rate? Use α = 0.05. Include your SAS code.

Analysis of Variance for Absorbing Rate
Source             DF   SS   MS   F   P
Medicine
Batch(Medicine)
Error
Total

[numeric entries not reproduced]

There is a significant batch effect at α = 0.05, as the associated p-value is well below 0.05.

Sample SAS code:

data Q2;
  input Medicine Batch Rate;
  datalines;
  ...
;

proc print data=Q2; run;

proc mixed data=Q2 method=type1;
  class Medicine Batch;
  model Rate = Medicine;
  random Batch(Medicine);
run;

3. The yield of a food product process is being studied. The two factors of interest are temperature and pressure. Three levels of each factor are selected; however, only 9 runs can be made in one day. The experimenter runs a complete replicate of the design on each day, with the pressure-temperature combinations run in random order. The data are shown in the following table.

[yield data for Days 1 and 2, by Temperature (rows) and Pressure (Low/Medium/High); numeric entries not reproduced]

Here is a portion of the associated SAS output:

Source      Sum of Squares   DF   Mean Square   F-value   Prob > F
Day
temp
pres
temp*pres
Residual
Cor Total

[numeric entries not reproduced]

(a) What design is this?
A factorial design run in blocks (days), i.e., a blocked 3 × 3 factorial.

(b) State the statistical model and the corresponding assumptions.
y_ijk = μ + τ_i + β_j + (τβ)_ij + δ_k + ε_ijk, i = 1, 2, 3; j = 1, 2, 3; k = 1, 2, where δ_k is the (Day) block effect, with Σ τ_i = 0, Σ β_j = 0, Σ_i (τβ)_ij = Σ_j (τβ)_ij = 0, Σ δ_k = 0, and ε_ijk ~ iid N(0, σ²).

(c) Fill in the blanks in the ANOVA table below:

Source      Sum of Squares   DF   Mean Square   F-value   Prob > F
Day
temp
pres                                            93.98     <
temp*pres
Residual
Cor Total

[remaining numeric entries not reproduced]

(d) Is it reasonable to use the ANOVA F-ratio and p-value for the term Day to evaluate the significance of this factor? If yes, calculate it. If not, explain why not.
Yes. For Day, F_Day = 13.01/0.53 = 24.55, which on (1, 8) d.f. is clearly significant.

(e) Draw conclusions at α = 0.05.
Both main effects, temperature and pressure, and the blocking factor Day are significant, as all corresponding p-values are less than α = 0.05.

(f) If the experimenter did not include the factor Day in the statistical model, what would the new ANOVA table look like? (Use the information given in the SAS output above.) Dropping Day pools its sum of squares and degree of freedom into the residual line:

Source      Sum of Squares   DF   Mean Square   F-Value   Prob > F
temp
pres
temp*pres
Residual
Cor Total

[numeric entries not reproduced]
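The bookkeeping behind part (f) is just pooling: when Day is dropped, its sum of squares and degree of freedom fold into the residual line, and every F-ratio is recomputed against the new residual mean square. A quick sketch in Python; the Day and residual values are taken from part (d)'s F_Day = 13.01/0.53 on (1, 8) d.f., while the temp line is a hypothetical stand-in since the scanned table lost its entries:

```python
# Day line from part (d): MS_Day = 13.01 on 1 d.f., MS_resid = 0.53 on 8 d.f.
ss_day, df_day = 13.01, 1            # SS = MS here since df = 1
ss_resid, df_resid = 0.53 * 8, 8     # residual SS from the full model

# Hypothetical treatment line (stand-in values, not from the exam):
ss_temp, df_temp = 5.51, 2

# Dropping the blocking factor pools its SS and DF into residual:
ss_resid_new = ss_resid + ss_day
df_resid_new = df_resid + df_day
ms_resid_new = ss_resid_new / df_resid_new

# Each effect's F-ratio is then recomputed against the new MS_resid:
f_temp_new = (ss_temp / df_temp) / ms_resid_new
print(ss_resid_new, df_resid_new, round(f_temp_new, 3))
```

Note the pooled residual d.f. rises from 8 to 9, but because Day truly matters here, the pooled residual mean square is badly inflated (about 1.92 versus 0.53), so every recomputed F-ratio shrinks.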

4. A study recorded laboratory animal responses (Y) to a drug as related to three quantitative predictor variables: X1 = body weight (grams), X2 = age (months), and X3 = administered drug dose (mg). The data are found in the file animal3.csv. Assume a homogeneous-variance, multiple linear regression model containing only first-order terms is appropriate for these data. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) Check the predictor variables for any concerns with multicollinearity. What do you find?
To start, always plot the data! Sample R code:

animal.df = read.csv( file.choose() )
attach( animal.df )
pairs( cbind(Y, X1, X2, X3) )

[scatterplot matrix not reproduced]

The plot shows an especially strong linear relationship between X1 and X3. To assess multicollinearity more closely, examine the correlations and the VIFs. Sample R code:

cor( cbind(X1, X2, X3) )
library( car )
vif( lm(Y ~ X1 + X2 + X3) )
mean( vif(lm(Y ~ X1 + X2 + X3)) )

[numeric output not reproduced]

Since max{VIF_k} clearly exceeds 10, and the mean VIF is far larger than 6.0, there is a clear problem with multicollinearity here. The scatterplot matrix and correlation matrix both suggest that X1 and X3 are the greater culprits. (If one removes X3, the VIFs drop markedly. But one does not remove predictor variables just because they are highly collinear!)

(b) Perform a ridge regression on these data: construct and display a trace plot and suggest a reasonable value for the tuning parameter c. State why you chose this value.
Given the high multicollinearity, ridge regression is a valid alternative, but remember first to center the response variable and standardize the predictor variables. Sample R code:

U = Y - mean(Y)
Z1 = scale( X1 ); Z2 = scale( X2 ); Z3 = scale( X3 )
library( genridge )
const = seq( .001, 5, .001 )
animal.ridge = ridge( U ~ Z1 + Z2 + Z3, lambda = const )
traceplot( animal.ridge, pch='.', cex=1.5 )
detach( animal.df )

This produces the trace plot below. [trace plot not reproduced] The trace for Z2 is uninformative, but the traces for Z1 and Z3 suggest that their coefficient curves flatten by about c = 1.5, or possibly c = 2, so anything in this range would be a reasonable choice for c.
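If genridge is unavailable, the trace-plot logic is easy to reproduce from the ridge formula itself, β_R(c) = (Z^T Z + cI)^(-1) Z^T U. The sketch below (Python with NumPy, on synthetic collinear data since animal3.csv is not reproduced here) computes the coefficient path over a grid of c; note genridge's lambda may be scaled differently, so numeric values need not match its output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the animal data: x3 nearly collinear with x1.
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.98 * x1 + 0.05 * rng.normal(size=n)
y = 2.0 * x1 - x2 + x3 + rng.normal(scale=0.5, size=n)

# Center the response and standardize the predictors, as in the solution.
U = y - y.mean()
Z = np.column_stack([(v - v.mean()) / v.std(ddof=1) for v in (x1, x2, x3)])

def ridge_coef(Z, U, c):
    # beta_R(c) = (Z'Z + c I)^(-1) Z'U
    return np.linalg.solve(Z.T @ Z + c * np.eye(Z.shape[1]), Z.T @ U)

# A trace plot is just this solved over a grid of c values:
for c in (0.0, 0.5, 1.5, 5.0):
    print(c, np.round(ridge_coef(Z, U, c), 3))
```

At c = 0 this is ordinary least squares on the standardized scale; as c grows the coefficient vector shrinks, and the value of c where the curves flatten is the visual choice made in part (b).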

By the way: it is interesting to notice the two dashed vertical lines in the trace plot near c = 0. Digging into this R function, we find that these are recommended values for c from two standard sources: the HKB value suggested by Hoerl et al. (1975, Communications in Statistics 4) and the LW value suggested by Lawless and Wang (1976, Communications in Statistics 5). Find these in the ridge() object as

c( animal.ridge$khkb, animal.ridge$klw )

[numeric output not reproduced]

As can be seen, these are much smaller than the visual indication that c is near 1.5 or 2.0. Selection of the biasing parameter c in a ridge regression remains a developing area of study.

(c) Consider the following values of c: −2.7, 1.5, 8.9. Choose the most reasonable from among these values and employ it as your biasing constant in a ridge regression of Y on X1, X2, and X3. Give the resulting ridge estimators for all the regression coefficients.
From among the choices c = −2.7, 1.5, 8.9, the choice c = −2.7 is nonsensical, as c cannot be negative, while c = 8.9 gives far too large a biasing effect, as the trace plot in part (b) stabilizes well before this point. Thus c = 1.5 is the most reasonable choice, in concordance with the analysis presented in part (b). To perform the ridge analysis, we continue to use the centered response variable U and the standardized predictors Z1, Z2, Z3. Sample R code is:

library( genridge )
const = 1.5
animal1.ridge = ridge( U ~ Z1 + Z2 + Z3, lambda = const )
coef( animal1.ridge )

with consequent output giving the three ridge estimates b_R1, b_R2, and b_R3 for Z1, Z2, and Z3 [numeric values not reproduced]. (We could also have extracted the estimates directly as animal1.ridge$coef.)

5. In a biomonitoring study of workplace chemical exposure, retired factory workers were assayed for blood concentrations (Y) of an industrial chemical. Three predictor variables were also recorded: X1 = Years worked, X2 = Years retired, and X3 = Age. The data are available in the file chemical.csv. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) One might argue that when Age = 0, we would expect the response to be zero. Explain why we would then also expect the response to be zero when X1 = X2 = 0. Operating under the assumption that the response is zero when all three predictor variables are zero, fit an appropriate multiple regression model (use only first-order terms) to these data using all three predictors. Identify which, if any, of the three predictors significantly affects E[Y]. Conduct your tests at a family-wise error rate (FWER) of 0.5%.
Obviously, when Age = X3 = 0 the subject has just been born, so s/he could not have worked any years (so X1 = 0), nor could s/he have spent any years in retirement (so X2 = 0).

Next, always plot the data! Sample R code and consequent scatterplot:

chemical.df = read.csv( file.choose() )
attach( chemical.df )
X1 = Years.worked
X2 = Years.retired
X3 = Age
pairs( cbind(Y, X1, X2, X3) )

[scatterplot matrix not reproduced]

The various patterns between Y and the predictor variables are mixed; a formal analysis will prove interesting. Also, there appears to be some slight correlation between X2 and X3, but less for the other pairings. This is verified by checking the pairwise correlations:

cor( cbind(X1, X2, X3) )

[numeric output not reproduced]

So, for the record, some slight problems with multicollinearity may be present. (VIFs are not useful in linear regression through the origin, so they are not worth calculating here.)

Next, perform the LS fit. We operate without an intercept, as per the instructions in the problem. The model sets E[Y_i] = β1 X_i1 + β2 X_i2 + β3 X_i3. We test the three separate hypotheses H0j: βj = 0 vs. Haj: βj ≠ 0 (no indication was given for any one-sided alternatives) for j = 1, 2, 3. Sample R code and (edited) output:

chemical.lm = lm( Y ~ X1 + X2 + X3 - 1 )
summary( chemical.lm )

Call:
lm(formula = Y ~ X1 + X2 + X3 - 1)

Coefficients:
     Estimate   Std. Error   t value   Pr(>|t|)
X1
X2
X3                                     e-05 ***

Residual standard error: on 23 degrees of freedom
[most numeric entries not reproduced]

To test the g = 3 hypotheses, extract their corresponding p-values from the partial t-tests and then apply a Bonferroni adjustment:

pvals = summary( chemical.lm )$coefficients[,4]
p.adjust( pvals, method = "bonferroni" )

The Bonferroni-adjusted p-values for H01: β1 = 0 and H02: β2 = 0 are reported as 1.0, so they clearly exceed α = 0.005. Hence, we fail to reject H01 and H02. For H03: β3 = 0, the adjusted p-value is on the order of 10^-5, which is below α = 0.005, so we reject H03. We conclude that
- there is no significant effect of Years worked on E[Y];
- there is no significant effect of Years retired on E[Y];
- there is a significant effect of Age on E[Y].

(b) Plot the residuals from the model fit in part (a) against the predicted values from the fit. Do any untoward patterns appear?
Residual plot: sample R code and resulting residual plot are

plot( resid(chemical.lm) ~ predict(chemical.lm), pch=19 )
abline( h=0 )

[residual plot not reproduced]

We see no troublesome patterns in the residual plot. (The one residual far to the right of the plot is eye-catching, but it does not indicate a problem with the model fit. We might think to check for possible leverage points, however; see part (c) below.)

(c) Assess whether any observations possess high leverage in the model fit from part (a).
Leverage analysis: sample R code and (edited) output are

hii = hatvalues( chemical.lm )
p = length( coef(chemical.lm) ); n = length( Y )
print( 2*p/n )
[1] 0.2307692

The rule-of-thumb cut-off is h_ii > 2p/n = (2)(3)/26 = 0.2308. A fast check via R employs

which( hii > 2*p/n )

producing the observations at i = 5, 7, 18; i.e., these three observations exert high leverage on the model fit. Referring back to the residual plot, we can query what the predicted values were for these three leverage points:

predict( chemical.lm )[ which(hii > 2*p/n) ]

The extreme predicted value Ŷ5, far to the right of the residual plot, is indeed seen to be one of the high-leverage points!

(d) Since X1 and X2 represent time at or after possible workplace exposure, consider the joint null hypothesis H0: β1 = β2 = 0 vs. Ha: any departure. Perform a single test of H0 against Ha at a false positive rate of 0.5%.
Reduced-model hypothesis test: sample R code and (edited) output are

anova( lm(Y ~ X3 - 1), chemical.lm )

Model 1: Y ~ X3 - 1
Model 2: Y ~ X1 + X2 + X3 - 1
  Res.Df   RSS   Df   Sum of Sq   F   Pr(>F)
[numeric entries not reproduced]

We see the test statistic F* has (2, 23) d.f. The corresponding p-value, P[F(2,23) ≥ F*], is clearly larger than α = 0.005. Thus we fail to reject

H0 and conclude that Years worked and Years retired do not contribute significantly to the model fit. Alternatively, using a rejection-region approach: find the critical point F(0.995, 2, 23); as F* falls below this critical point, we again fail to reject H0.

6. Consider the multiple linear regression model Y ~ N_n(Xβ, σ²I), where Y is an n × 1 vector, X is an n × p matrix, and β is a p × 1 vector. Show that the maximum likelihood estimator for β is derived from the same estimating equations as the least squares estimator for β, making the two estimators identical. For simplicity, you may assume that σ² is known. [Hint: from matrix calculus, recall that ∂(Va)/∂a = V^T and ∂(a^T U a)/∂a = (U + U^T)a, for conformable matrices U and V and vector a.]

The likelihood function is

L(β) = (σ(2π)^(1/2))^(-n) exp{ -(Y - Xβ)^T (Y - Xβ) / (2σ²) },

so the log-likelihood becomes

ℓ(β) = -n log(σ(2π)^(1/2)) - (Y - Xβ)^T (Y - Xβ) / (2σ²)
     = -n log(σ(2π)^(1/2)) - {Y^T Y - β^T X^T Y - Y^T Xβ + β^T X^T Xβ} / (2σ²).

Notice that β^T X^T Y is a scalar, so it must equal its transpose: β^T X^T Y = (β^T X^T Y)^T = Y^T Xβ. Then

ℓ(β) = -n log(σ(2π)^(1/2)) - {Y^T Y - 2Y^T Xβ + β^T X^T Xβ} / (2σ²).

Now take the first derivative with respect to β:

∂ℓ(β)/∂β = -(1/(2σ²)) { -∂(2Y^T Xβ)/∂β + ∂(β^T X^T Xβ)/∂β }.

From the Hint, ∂(Y^T Xβ)/∂β = (Y^T X)^T = X^T Y, while ∂(β^T X^T Xβ)/∂β = (X^T X + {X^T X}^T)β = 2X^T Xβ, the latter equality holding since X^T X is clearly symmetric. Then

∂ℓ(β)/∂β = (1/(2σ²)) (2X^T Y - 2X^T Xβ),

so setting ∂ℓ(β)/∂β = 0 (a p × 1 vector of zeroes) yields 2X^T Y - 2X^T Xβ = 0, or simply X^T Xβ = X^T Y. This is equivalent to the least squares normal equations for β given in Equation (6.24) of Kutner et al.'s textbook, which shows that the two estimation methods lead to the same estimating equations.
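As a numerical footnote to the derivation (not part of the exam solution): any least-squares routine solves the same normal equations X^T Xβ = X^T Y that maximize the Gaussian likelihood, so the two estimators agree to machine precision. A minimal check in Python with NumPy on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a Gaussian linear model Y = X beta + eps.
n, p = 40, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(size=n)

# MLE / least squares via the normal equations X'X b = X'Y:
b_ne = np.linalg.solve(X.T @ X, X.T @ Y)

# Least squares via a generic solver (what lm() or lstsq does):
b_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.max(np.abs(b_ne - b_ls)))  # agreement to machine precision
```

The two routes differ only in numerical method, not in the estimating equations being solved, which is exactly the content of the derivation above.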


More information

Ph.D. Preliminary Examination Statistics June 2, 2014

Ph.D. Preliminary Examination Statistics June 2, 2014 Ph.D. Preliminary Examination Statistics June, 04 NOTES:. The exam is worth 00 points.. Partial credit may be given for partial answers if possible.. There are 5 pages in this exam paper. I have neither

More information

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA)

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) 22s:152 Applied Linear Regression Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) We now consider an analysis with only categorical predictors (i.e. all predictors are

More information

Lecture 1 Linear Regression with One Predictor Variable.p2

Lecture 1 Linear Regression with One Predictor Variable.p2 Lecture Linear Regression with One Predictor Variablep - Basics - Meaning of regression parameters p - β - the slope of the regression line -it indicates the change in mean of the probability distn of

More information

MODELS WITHOUT AN INTERCEPT

MODELS WITHOUT AN INTERCEPT Consider the balanced two factor design MODELS WITHOUT AN INTERCEPT Factor A 3 levels, indexed j 0, 1, 2; Factor B 5 levels, indexed l 0, 1, 2, 3, 4; n jl 4 replicate observations for each factor level

More information

Name: Biostatistics 1 st year Comprehensive Examination: Applied in-class exam. June 8 th, 2016: 9am to 1pm

Name: Biostatistics 1 st year Comprehensive Examination: Applied in-class exam. June 8 th, 2016: 9am to 1pm Name: Biostatistics 1 st year Comprehensive Examination: Applied in-class exam June 8 th, 2016: 9am to 1pm Instructions: 1. This is exam is to be completed independently. Do not discuss your work with

More information

(4) 1. Create dummy variables for Town. Name these dummy variables A and B. These 0,1 variables now indicate the location of the house.

(4) 1. Create dummy variables for Town. Name these dummy variables A and B. These 0,1 variables now indicate the location of the house. Exam 3 Resource Economics 312 Introductory Econometrics Please complete all questions on this exam. The data in the spreadsheet: Exam 3- Home Prices.xls are to be used for all analyses. These data are

More information

Regression and the 2-Sample t

Regression and the 2-Sample t Regression and the 2-Sample t James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Regression and the 2-Sample t 1 / 44 Regression

More information

Lec 1: An Introduction to ANOVA

Lec 1: An Introduction to ANOVA Ying Li Stockholm University October 31, 2011 Three end-aisle displays Which is the best? Design of the Experiment Identify the stores of the similar size and type. The displays are randomly assigned to

More information

MULTICOLLINEARITY AND VARIANCE INFLATION FACTORS. F. Chiaromonte 1

MULTICOLLINEARITY AND VARIANCE INFLATION FACTORS. F. Chiaromonte 1 MULTICOLLINEARITY AND VARIANCE INFLATION FACTORS F. Chiaromonte 1 Pool of available predictors/terms from them in the data set. Related to model selection, are the questions: What is the relative importance

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

22s:152 Applied Linear Regression. 1-way ANOVA visual:

22s:152 Applied Linear Regression. 1-way ANOVA visual: 22s:152 Applied Linear Regression 1-way ANOVA visual: Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Y We now consider an analysis

More information

Essential of Simple regression

Essential of Simple regression Essential of Simple regression We use simple regression when we are interested in the relationship between two variables (e.g., x is class size, and y is student s GPA). For simplicity we assume the relationship

More information

STATISTICS 110/201 PRACTICE FINAL EXAM

STATISTICS 110/201 PRACTICE FINAL EXAM STATISTICS 110/201 PRACTICE FINAL EXAM Questions 1 to 5: There is a downloadable Stata package that produces sequential sums of squares for regression. In other words, the SS is built up as each variable

More information

Comparing Nested Models

Comparing Nested Models Comparing Nested Models ST 370 Two regression models are called nested if one contains all the predictors of the other, and some additional predictors. For example, the first-order model in two independent

More information

Analysis of Variance. Read Chapter 14 and Sections to review one-way ANOVA.

Analysis of Variance. Read Chapter 14 and Sections to review one-way ANOVA. Analysis of Variance Read Chapter 14 and Sections 15.1-15.2 to review one-way ANOVA. Design of an experiment the process of planning an experiment to insure that an appropriate analysis is possible. Some

More information

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph.

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph. Regression, Part I I. Difference from correlation. II. Basic idea: A) Correlation describes the relationship between two variables, where neither is independent or a predictor. - In correlation, it would

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

Masters Comprehensive Examination Department of Statistics, University of Florida

Masters Comprehensive Examination Department of Statistics, University of Florida Masters Comprehensive Examination Department of Statistics, University of Florida May 6, 003, 8:00 am - :00 noon Instructions: You have four hours to answer questions in this examination You must show

More information

CAS MA575 Linear Models

CAS MA575 Linear Models CAS MA575 Linear Models Boston University, Fall 2013 Midterm Exam (Correction) Instructor: Cedric Ginestet Date: 22 Oct 2013. Maximal Score: 200pts. Please Note: You will only be graded on work and answers

More information

Activity #12: More regression topics: LOWESS; polynomial, nonlinear, robust, quantile; ANOVA as regression

Activity #12: More regression topics: LOWESS; polynomial, nonlinear, robust, quantile; ANOVA as regression Activity #12: More regression topics: LOWESS; polynomial, nonlinear, robust, quantile; ANOVA as regression Scenario: 31 counts (over a 30-second period) were recorded from a Geiger counter at a nuclear

More information

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017 UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Tuesday, January 17, 2017 Work all problems 60 points are needed to pass at the Masters Level and 75

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

STAT22200 Spring 2014 Chapter 14

STAT22200 Spring 2014 Chapter 14 STAT22200 Spring 2014 Chapter 14 Yibi Huang May 27, 2014 Chapter 14 Incomplete Block Designs 14.1 Balanced Incomplete Block Designs (BIBD) Chapter 14-1 Incomplete Block Designs A Brief Introduction to

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

Final Exam. Name: Solution:

Final Exam. Name: Solution: Final Exam. Name: Instructions. Answer all questions on the exam. Open books, open notes, but no electronic devices. The first 13 problems are worth 5 points each. The rest are worth 1 point each. HW1.

More information

Regression. Marc H. Mehlman University of New Haven

Regression. Marc H. Mehlman University of New Haven Regression Marc H. Mehlman marcmehlman@yahoo.com University of New Haven the statistician knows that in nature there never was a normal distribution, there never was a straight line, yet with normal and

More information

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

Lecture 10. Factorial experiments (2-way ANOVA etc)

Lecture 10. Factorial experiments (2-way ANOVA etc) Lecture 10. Factorial experiments (2-way ANOVA etc) Jesper Rydén Matematiska institutionen, Uppsala universitet jesper@math.uu.se Regression and Analysis of Variance autumn 2014 A factorial experiment

More information

More about Single Factor Experiments

More about Single Factor Experiments More about Single Factor Experiments 1 2 3 0 / 23 1 2 3 1 / 23 Parameter estimation Effect Model (1): Y ij = µ + A i + ɛ ij, Ji A i = 0 Estimation: µ + A i = y i. ˆµ = y..  i = y i. y.. Effect Modell

More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

Solution to Final Exam

Solution to Final Exam Stat 660 Solution to Final Exam. (5 points) A large pharmaceutical company is interested in testing the uniformity (a continuous measurement that can be taken by a measurement instrument) of their film-coated

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Statistics - Lecture Three. Linear Models. Charlotte Wickham 1.

Statistics - Lecture Three. Linear Models. Charlotte Wickham   1. Statistics - Lecture Three Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Linear Models 1. The Theory 2. Practical Use 3. How to do it in R 4. An example 5. Extensions

More information

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test, October 2013

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test, October 2013 UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test, October 2013 STAC67H3 Regression Analysis Duration: One hour and fifty minutes Last Name: First Name: Student

More information

Analysis of variance. Gilles Guillot. September 30, Gilles Guillot September 30, / 29

Analysis of variance. Gilles Guillot. September 30, Gilles Guillot September 30, / 29 Analysis of variance Gilles Guillot gigu@dtu.dk September 30, 2013 Gilles Guillot (gigu@dtu.dk) September 30, 2013 1 / 29 1 Introductory example 2 One-way ANOVA 3 Two-way ANOVA 4 Two-way ANOVA with interactions

More information

The Random Effects Model Introduction

The Random Effects Model Introduction The Random Effects Model Introduction Sometimes, treatments included in experiment are randomly chosen from set of all possible treatments. Conclusions from such experiment can then be generalized to other

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

COMPARING SEVERAL MEANS: ANOVA

COMPARING SEVERAL MEANS: ANOVA LAST UPDATED: November 15, 2012 COMPARING SEVERAL MEANS: ANOVA Objectives 2 Basic principles of ANOVA Equations underlying one-way ANOVA Doing a one-way ANOVA in R Following up an ANOVA: Planned contrasts/comparisons

More information

EXST Regression Techniques Page 1. We can also test the hypothesis H :" œ 0 versus H :"

EXST Regression Techniques Page 1. We can also test the hypothesis H : œ 0 versus H : EXST704 - Regression Techniques Page 1 Using F tests instead of t-tests We can also test the hypothesis H :" œ 0 versus H :" Á 0 with an F test.! " " " F œ MSRegression MSError This test is mathematically

More information

Practice Final Examination

Practice Final Examination Practice Final Examination Mth 136 = Sta 114 Wednesday, 2000 April 26, 2:20 3:00 pm This is a closed-book examination so please do not refer to your notes, the text, or to any other books You may use a

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Chapter 5 Introduction to Factorial Designs Solutions

Chapter 5 Introduction to Factorial Designs Solutions Solutions from Montgomery, D. C. (1) Design and Analysis of Experiments, Wiley, NY Chapter 5 Introduction to Factorial Designs Solutions 5.1. The following output was obtained from a computer program that

More information

STA 101 Final Review

STA 101 Final Review STA 101 Final Review Statistics 101 Thomas Leininger June 24, 2013 Announcements All work (besides projects) should be returned to you and should be entered on Sakai. Office Hour: 2 3pm today (Old Chem

More information

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model Lab 3 A Quick Introduction to Multiple Linear Regression Psychology 310 Instructions.Work through the lab, saving the output as you go. You will be submitting your assignment as an R Markdown document.

More information

36-707: Regression Analysis Homework Solutions. Homework 3

36-707: Regression Analysis Homework Solutions. Homework 3 36-707: Regression Analysis Homework Solutions Homework 3 Fall 2012 Problem 1 Y i = βx i + ɛ i, i {1, 2,..., n}. (a) Find the LS estimator of β: RSS = Σ n i=1(y i βx i ) 2 RSS β = Σ n i=1( 2X i )(Y i βx

More information

Biostatistics for physicists fall Correlation Linear regression Analysis of variance

Biostatistics for physicists fall Correlation Linear regression Analysis of variance Biostatistics for physicists fall 2015 Correlation Linear regression Analysis of variance Correlation Example: Antibody level on 38 newborns and their mothers There is a positive correlation in antibody

More information

Exam: high-dimensional data analysis February 28, 2014

Exam: high-dimensional data analysis February 28, 2014 Exam: high-dimensional data analysis February 28, 2014 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question (not the subquestions) on a separate piece of paper.

More information

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES BIOL 458 - Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES PART 1: INTRODUCTION TO ANOVA Purpose of ANOVA Analysis of Variance (ANOVA) is an extremely useful statistical method

More information

Swarthmore Honors Exam 2012: Statistics

Swarthmore Honors Exam 2012: Statistics Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may

More information

13 Simple Linear Regression

13 Simple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity

More information

Econometrics Midterm Examination Answers

Econometrics Midterm Examination Answers Econometrics Midterm Examination Answers March 4, 204. Question (35 points) Answer the following short questions. (i) De ne what is an unbiased estimator. Show that X is an unbiased estimator for E(X i

More information

Sociology 593 Exam 1 Answer Key February 17, 1995

Sociology 593 Exam 1 Answer Key February 17, 1995 Sociology 593 Exam 1 Answer Key February 17, 1995 I. True-False. (5 points) Indicate whether the following statements are true or false. If false, briefly explain why. 1. A researcher regressed Y on. When

More information

Ch 13 & 14 - Regression Analysis

Ch 13 & 14 - Regression Analysis Ch 3 & 4 - Regression Analysis Simple Regression Model I. Multiple Choice:. A simple regression is a regression model that contains a. only one independent variable b. only one dependent variable c. more

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the

More information

SMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning

SMA 6304 / MIT / MIT Manufacturing Systems. Lecture 10: Data and Regression Analysis. Lecturer: Prof. Duane S. Boning SMA 6304 / MIT 2.853 / MIT 2.854 Manufacturing Systems Lecture 10: Data and Regression Analysis Lecturer: Prof. Duane S. Boning 1 Agenda 1. Comparison of Treatments (One Variable) Analysis of Variance

More information

Unbalanced Data in Factorials Types I, II, III SS Part 1

Unbalanced Data in Factorials Types I, II, III SS Part 1 Unbalanced Data in Factorials Types I, II, III SS Part 1 Chapter 10 in Oehlert STAT:5201 Week 9 - Lecture 2 1 / 14 When we perform an ANOVA, we try to quantify the amount of variability in the data accounted

More information

STAT 510 Final Exam Spring 2015

STAT 510 Final Exam Spring 2015 STAT 510 Final Exam Spring 2015 Instructions: The is a closed-notes, closed-book exam No calculator or electronic device of any kind may be used Use nothing but a pen or pencil Please write your name and

More information

Stats fest Analysis of variance. Single factor ANOVA. Aims. Single factor ANOVA. Data

Stats fest Analysis of variance. Single factor ANOVA. Aims. Single factor ANOVA. Data 1 Stats fest 2007 Analysis of variance murray.logan@sci.monash.edu.au Single factor ANOVA 2 Aims Description Investigate differences between population means Explanation How much of the variation in response

More information

STATISTICS 479 Exam II (100 points)

STATISTICS 479 Exam II (100 points) Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida First Year Examination Department of Statistics, University of Florida May 6, 2011, 8:00 am - 12:00 noon Instructions: 1. You have four hours to answer questions in this examination. 2. You must show your

More information

Multiple Predictor Variables: ANOVA

Multiple Predictor Variables: ANOVA Multiple Predictor Variables: ANOVA 1/32 Linear Models with Many Predictors Multiple regression has many predictors BUT - so did 1-way ANOVA if treatments had 2 levels What if there are multiple treatment

More information

This exam contains 5 questions. Each question is worth 10 points. Therefore, this exam is worth 50 points.

This exam contains 5 questions. Each question is worth 10 points. Therefore, this exam is worth 50 points. GROUND RULES: This exam contains 5 questions. Each question is worth 10 points. Therefore, this exam is worth 50 points. Print your name at the top of this page in the upper right hand corner. This is

More information

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat).

Statistics 512: Solution to Homework#11. Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat). Statistics 512: Solution to Homework#11 Problems 1-3 refer to the soybean sausage dataset of Problem 20.8 (ch21pr08.dat). 1. Perform the two-way ANOVA without interaction for this model. Use the results

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information