Statistics GIDP Ph.D. Qualifying Exam Methodology


Statistics GIDP Ph.D. Qualifying Exam Methodology
January 9, 2018, 9:00am-1:00pm

Instructions: Put your ID (not your name) on each sheet. Complete exactly 5 of the 6 problems; turn in only those sheets you wish to have graded. Each question, but not necessarily each part, is equally weighted. Provide answers on the supplied pads of paper and/or use a Microsoft Word document or equivalent to report your software code and outputs. Number each problem. You may turn in only one electronic document. Embed relevant code and output/graphics into your Word document. Write on only one side of each sheet if you use paper. You may use the computer and/or a calculator. Stay calm and do your best. Good luck!

1. A chemist wishes to test the effect of five chemical agents on the strength of a particular type of cloth. She selects five bolts of cloth on which to apply the chemicals. However, because of resource limitations she can run only the following design, in which each chemical is applied to just four of the five bolts (strength measurements shown by chemical):

   chemical 1: 14, 14, 13, 10
   chemical 2: 12, 13, 12, 9
   chemical 3: 13, 11, 11, 12
   chemical 4: 11, 12, 10, 8
   chemical 5: 17, 14, 13, 12

(a) What design is this?

(b) State the statistical model and assumptions.

(c) Are the Type I and Type III sums of squares equal in the SAS output for the model y = chemical + cloth + $\varepsilon$? Why?

(d) Besides the number of levels of the two factors, what other parameter(s) would you use to describe this design? Also give their value(s).

(e) If you are given $\sum_i \hat{\tau}_i^2 = 12.52$, what is $SS_{\mathrm{chemical(adjusted)}}$?

(f) Fill in the blanks in the ANOVA table below and draw conclusions at $\alpha = 0.05$.

Source     DF   Seq SS    Adj SS    Adj MS    F      P
chemical        31.7000
cloth           35.2333   35.2333   8.8083    9.67   0.001
Error           10.0167   10.0167   0.9106
Total           76.9500

(g) If the chemist had enough materials to run 5 x 5 = 25 runs, what design would you suggest to her? And what is the statistical model?

(h) If the chemist had enough materials to run 50 runs and suspected that there might be some interaction between cloth and chemical, what design would you suggest to her? And what is the statistical model?

2. An experimenter is studying the absorption rate of three medicines. Four batches of pills are randomly selected from each medicine, and three determinations of the absorption rate are made on each batch. Examine the data in the file medicine.csv.

(a) From examining the data, what design was used?

(b) Give the appropriate statistical model, with assumptions.

(c) What are the hypotheses for testing the batch effect?

(d) What are the hypotheses for testing the medicine effect?

(e) Conduct an analysis of variance on these data. Do any of the factors affect the absorption rate? Use $\alpha = 0.05$. Include your SAS code.

3. The yield of a food product process is being studied. The two factors of interest are temperature and pressure. Three levels of each factor are selected; however, only 9 runs can be made in one day. The experimenter therefore runs a complete replicate of the design on each day, with the combinations of pressure and temperature levels run in random order. The data are shown in the following table.

                    Day 1                  Day 2
                    Pressure               Pressure
Temperature    250    260    270      250    260    270
Low           86.3   84.0   85.8     86.1   85.2   87.3
Medium        88.5   87.3   89.0     89.4   89.9   90.3
High          89.1   90.2   91.3     91.7   93.2   93.7

Here is a portion of the associated SAS output:

                Sum of        Mean
Source         Squares  DF  Square  F-value  Prob > F
Day              13.01   1   13.01
temp              5.51   2    2.75
pres             99.85   2   49.93
temp*pres         4.45   4    1.11
Residual          4.25   8    0.53
Cor Total       127.07  17

(a) What design is this?

(b) State the statistical model and the corresponding assumptions.

(c) Fill in the blanks in the ANOVA table below:

                Sum of        Mean
Source         Squares  DF  Square  F-value  Prob > F
Day              13.01   1   13.01
temp              5.51   2    2.75
pres             99.85   2   49.93
temp*pres         4.45   4    1.11
Residual          4.25   8    0.53
Cor Total       127.07  17

(d) Is it reasonable to use the ANOVA term and p-value for the term Day to evaluate the significance of this factor? If yes, calculate it. If not, explain why not.

(e) Draw conclusions at $\alpha = 0.05$.

(f) If the experimenter did not include the factor Day in the statistical model, what would the new ANOVA table look like? (Use the information given in the SAS output above.)

                Sum of        Mean
Source         Squares  DF  Square  F-value  Prob > F

4. A study recorded laboratory animal responses (Y) to a drug, as related to three quantitative predictor variables: X1 = body weight (grams), X2 = age (months), and X3 = administered drug dose (mg). These were recorded as follows:

   Y         X1    X2    X3        Y         X1    X2    X3
 140.4080   176    6.5  0.88    121.0736   165    7.9  0.84
 127.9948   176    9.5  0.88    123.2062   158    6.9  0.80
 141.9612   190    9.0  1.00    111.7565   148    7.3  0.74
 134.7643   176    8.9  0.88    125.1585   149    5.2  0.75
 166.8457   200    7.2  1.00    119.8591   163    8.4  0.81
 122.4732   167    8.9  0.83    138.1980   170    7.2  0.85
 150.4503   188    8.0  0.94    149.1850   186    6.8  0.94
 145.0751   195   10.0  0.98    105.1598   146    7.3  0.73
 141.4478   176    8.0  0.88    128.8160   181    9.0  0.90
 116.8525   149    6.4  0.75

The data are found in the file animal3.csv. Assume a homogeneous-variance, multiple linear regression model containing only first-order terms is appropriate for these data. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) Check the predictor variables for any concerns with multicollinearity. What do you find?

(b) Perform a ridge regression on these data: construct and display a trace plot and suggest a reasonable value for the tuning parameter c. State why you chose this value.

(c) Consider the following values of c: -2.7, 1.5, 8.9. Choose the most reasonable from among these values and employ it as your biasing constant in a ridge regression of Y on X1, X2, and X3. Give the resulting ridge estimators for all the regression coefficients.

5. In a biomonitoring study of workplace chemical exposure, retired factory workers were assayed for blood concentrations of an industrial chemical. Three predictor variables were also recorded: X1 = Years worked, X2 = Years retired, and X3 = Age. The data are available in the file chemical.csv. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) One might argue that when Age = 0, we would expect the response to be zero. Explain why we would then also expect the response to be zero when X1 = X2 = 0. Operating under this assumption that the response is zero when all three predictor variables are zero, fit an appropriate multiple regression model (use only first-order terms) to these data using all three predictors. Identify which, if any, of the three predictors significantly affects E[Y]. Conduct your tests at a family-wise error rate (FWER) of 0.5%.

(b) Plot the residuals from the model fit in part (a) against the predicted values from the fit. Do any untoward patterns appear?

(c) Assess whether any observations possess high leverage in the model fit from part (a).

(d) Since X1 and X2 represent time at or after possible workplace exposure, consider the joint null hypothesis $H_0: \beta_1 = \beta_2 = 0$ vs. $H_a$: any departure. Perform a single test of $H_0$ against $H_a$ at a false positive rate of 0.5%.

6. Consider the multiple linear regression model $Y \sim N_n(X\beta, \sigma^2 I)$, where $Y$ is an $n \times 1$ vector, $X$ is an $n \times p$ matrix, and $\beta$ is a $p \times 1$ vector. Show that the maximum likelihood estimator for $\beta$ is derived from the same estimating equations as the least squares estimator for $\beta$, making the two estimators identical. For simplicity, you may assume that $\sigma^2$ is known. [Hint: from matrix calculus, recall that $\partial(Va)/\partial a = V^T$ and $\partial(a^T U a)/\partial a = (U + U^T)a$, for conformable matrices $U$ and $V$ and vector $a$.]

Statistics GIDP Ph.D. Qualifying Exam Methodology (Solutions)
January 9, 2018, 9:00am-1:00pm

Instructions: Put your ID (not your name) on each sheet. Complete exactly 5 of the 6 problems; turn in only those sheets you wish to have graded. Each question, but not necessarily each part, is equally weighted. Provide answers on the supplied pads of paper and/or use a Microsoft Word document or equivalent to report your software code and outputs. Number each problem. You may turn in only one electronic document. Embed relevant code and output/graphics into your Word document. Write on only one side of each sheet if you use paper. You may use the computer and/or a calculator. Stay calm and do your best. Good luck!

1. A chemist wishes to test the effect of five chemical agents on the strength of a particular type of cloth. She selects five bolts of cloth on which to apply the chemicals. However, because of resource limitations she can run only the following design, in which each chemical is applied to just four of the five bolts (strength measurements shown by chemical):

   chemical 1: 14, 14, 13, 10
   chemical 2: 12, 13, 12, 9
   chemical 3: 13, 11, 11, 12
   chemical 4: 11, 12, 10, 8
   chemical 5: 17, 14, 13, 12

(a) What design is this?

A BIBD (balanced incomplete block design).

(b) State the statistical model and assumptions.

$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$, $i = 1,\ldots,5$; $j = 1,\ldots,5$, with $\sum_i \tau_i = 0$, $\sum_j \beta_j = 0$, and $\varepsilon_{ij} \sim$ iid $N(0, \sigma^2)$.

(c) Are the Type I and Type III sums of squares equal in the SAS output for the model y = chemical + cloth + $\varepsilon$? Why?

No: because the design is incomplete, the chemical and cloth factors are not orthogonal, so the sequential (Type I) and partial (Type III) sums of squares differ.
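To illustrate (c) concretely, here is a minimal R sketch on a toy unbalanced two-factor layout (hypothetical numbers, not the exam data): with nonorthogonal factors, the sequential sums of squares change with the order of entry, while drop1() gives the partial (Type III, for this main-effects-only model) tests.

# Toy unbalanced two-factor layout (hypothetical data, for illustration only)
d <- data.frame(
  chemical = factor(c(1, 1, 2, 2, 2, 3, 3)),
  cloth    = factor(c(1, 2, 1, 1, 2, 2, 2)),
  y        = c(10, 12, 9, 11, 13, 14, 15)
)
fit <- lm(y ~ chemical + cloth, data = d)
anova(fit)                # sequential (Type I) SS: depend on order of entry
drop1(fit, test = "F")    # partial SS for each term (Type III analysis here)
anova(lm(y ~ cloth + chemical, data = d))  # reversed order: Type I SS change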

(d) Besides the number of levels of the two factors, what other parameter(s) would you use to describe this design? Also give their value(s).

The other parameters are k, the number of treatments per block; $\gamma$, the number of times each treatment appears; and $\lambda$, the number of blocks in which each pair of treatments appears together. In this case k = 4, $\gamma$ = 4, and $\lambda$ = 3.

(e) If you are given $\sum_i \hat{\tau}_i^2 = 12.52$, what is $SS_{\mathrm{chemical(adjusted)}}$?

$SS_{\mathrm{chemical(adjusted)}} = (\lambda a/k) \sum_i \hat{\tau}_i^2 = (3)(5)(12.52)/4 = 46.95$.

(f) Fill in the blanks in the ANOVA table below and draw conclusions at $\alpha = 0.05$.

Source     DF   Seq SS    Adj SS    Adj MS    F       P
chemical    4   31.7000   46.95     11.74     12.89   < 0.001
cloth       4   35.2333   35.2333    8.8083    9.67     0.001
Error      11   10.0167   10.0167    0.9106
Total      19   76.9500

We see that both chemical and cloth are significant at $\alpha = 0.05$.

(g) If the chemist had enough materials to run 5 x 5 = 25 runs, what design would you suggest to her? And what is the statistical model?

An RCBD (randomized complete block design):
$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$, $i = 1,\ldots,5$; $j = 1,\ldots,5$, with $\sum_i \tau_i = 0$, $\sum_j \beta_j = 0$, and $\varepsilon_{ij} \sim$ iid $N(0, \sigma^2)$.

(h) If the chemist had enough materials to run 50 runs and suspected that there might be some interaction between cloth and chemical, what design would you suggest to her? And what is the statistical model?

A factorial design, with two replicates of each chemical-cloth combination:
$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}$, $i = 1,\ldots,5$; $j = 1,\ldots,5$; $k = 1, 2$, with $\sum_i \tau_i = 0$, $\sum_j \beta_j = 0$, $\sum_i (\tau\beta)_{ij} = \sum_j (\tau\beta)_{ij} = 0$, and $\varepsilon_{ijk} \sim$ iid $N(0, \sigma^2)$.
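As a quick arithmetic check of parts (e) and (f), a short R sketch using only the quantities given above:

# Verify the adjusted chemical SS and its F test from the given quantities
lambda <- 3; a <- 5; k <- 4
sum.tau2 <- 12.52
ss.chem.adj <- lambda * a * sum.tau2 / k     # 46.95
ms.chem <- ss.chem.adj / (a - 1)             # 11.74 on 4 df
mse <- 10.0167 / 11                          # 0.9106 on 11 df
f.chem <- ms.chem / mse                      # 12.89
pf(f.chem, a - 1, 11, lower.tail = FALSE)    # p < 0.001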

2. An experimenter is studying the absorption rate of three medicines. Four batches of pills are randomly selected from each medicine, and three determinations of the absorption rate are made on each batch. Examine the data in the file medicine.csv.

(a) From examining the data, what design was used?

A nested design.

(b) Give the appropriate statistical model, with assumptions.

$Y_{ijk} = \mu + \tau_i + \beta_{j(i)} + \varepsilon_{k(ij)}$, where $\tau_i$ represents the medicine effect [a fixed effect with $\sum_i \tau_i = 0$], $\beta_{j(i)}$ represents the batch effect [a random effect with $\beta_{j(i)} \sim N(0, \sigma_\beta^2)$], and $\varepsilon_{k(ij)} \sim$ iid $N(0, \sigma^2)$.

(c) What are the hypotheses for testing the batch effect?

$H_0: \sigma_\beta^2 = 0$ vs. $H_1: \sigma_\beta^2 > 0$

(d) What are the hypotheses for testing the medicine effect?

$H_0: \tau_1 = \tau_2 = \tau_3 = 0$ vs. $H_1$: at least one $\tau_i \neq 0$

(e) Conduct an analysis of variance on these data. Do any of the factors affect the absorption rate? Use $\alpha = 0.05$. Include your SAS code.

Analysis of Variance for Absorption Rate
Source            DF        SS        MS       F        P
Medicine           2    676.06    338.03    1.46    0.281
Batch(Medicine)    9   2077.58    230.84   12.20  <0.0001
Error             24    454.00     18.92
Total             35   3207.64

There is a significant batch effect at $\alpha = 0.05$, as the associated p-value satisfies P < 0.0001 < 0.05; the medicine effect is not significant (P = 0.281).

data Q2;
input Medicine Batch Rate @@;
datalines;
1 1 25  1 1 30  1 1 26  2 1 19  2 1 17  2 1 14  3 1 14  3 1 15  3 1 20
1 2 19  1 2 28  1 2 20  2 2 23  2 2 24  2 2 21  3 2 35  3 2 21  3 2 24
1 3 15  1 3 17  1 3 14  2 3 18  2 3 21  2 3 17  3 3 38  3 3 54  3 3 50
1 4 15  1 4 16  1 4 13  2 4 35  2 4 27  2 4 25  3 4 25  3 4 29  3 4 33
;
proc print data=Q2; run;
proc mixed data=Q2 method=type1;
  class Medicine Batch;
  model Rate = Medicine;
  random Batch(Medicine);
run;
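For comparison, a minimal R sketch of the same nested analysis (assuming medicine.csv contains columns named Medicine, Batch, and Rate; the column names are an assumption here). The batch stratum then supplies the denominator for the medicine F test, matching the SAS Type 1 analysis:

# Nested ANOVA in R: Medicine fixed, Batch random within Medicine
med <- read.csv("medicine.csv")
med$Medicine <- factor(med$Medicine)
med$Batch <- factor(med$Batch)
fit <- aov(Rate ~ Medicine + Error(Medicine:Batch), data = med)
summary(fit)  # Medicine is tested against the Batch(Medicine) mean square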

3. The yield of a food product process is being studied. The two factors of interest are temperature and pressure. Three levels of each factor are selected; however, only 9 runs can be made in one day. The experimenter therefore runs a complete replicate of the design on each day, with the combinations of pressure and temperature levels run in random order. The data are shown in the following table.

                    Day 1                  Day 2
                    Pressure               Pressure
Temperature    250    260    270      250    260    270
Low           86.3   84.0   85.8     86.1   85.2   87.3
Medium        88.5   87.3   89.0     89.4   89.9   90.3
High          89.1   90.2   91.3     91.7   93.2   93.7

Here is a portion of the associated SAS output:

                Sum of        Mean
Source         Squares  DF  Square  F-value  Prob > F
Day              13.01   1   13.01
temp              5.51   2    2.75
pres             99.85   2   49.93
temp*pres         4.45   4    1.11
Residual          4.25   8    0.53
Cor Total       127.07  17

(a) What design is this?

A blocked factorial design (a 3 x 3 factorial run in complete blocks, with Day as the block).

(b) State the statistical model and the corresponding assumptions.

$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \delta_k + \varepsilon_{ijk}$, $i = 1,2,3$; $j = 1,2,3$; $k = 1,2$, with $\sum_i \tau_i = 0$, $\sum_j \beta_j = 0$, $\sum_i (\tau\beta)_{ij} = 0$, $\sum_j (\tau\beta)_{ij} = 0$, $\sum_k \delta_k = 0$, and $\varepsilon_{ijk} \sim$ iid $N(0, \sigma^2)$.

(c) Fill in the blanks in the ANOVA table below:

                Sum of        Mean
Source         Squares  DF  Square  F-value  Prob > F
Day              13.01   1   13.01
temp              5.51   2    2.75     5.18    0.0360
pres             99.85   2   49.93    93.98  < 0.0001
temp*pres         4.45   4    1.11     2.09    0.173
Residual          4.25   8    0.53
Cor Total       127.07  17

(d) Is it reasonable to use the ANOVA term and p-value for the term Day to evaluate the significance of this factor? If yes, calculate it. If not, explain why not.

Yes. For Day, $F_{Day}$ = 13.01/0.53 = 24.55, with p-value $\approx$ 0.001.

(e) Draw conclusions at $\alpha = 0.05$.

Both main effects, temperature and pressure, and the blocking factor Day are significant, as all corresponding p-values are less than $\alpha = 0.05$; the temperature*pressure interaction is not significant (p $\approx$ 0.17).

(f) If the experimenter did not include the factor Day in the statistical model, what would the new ANOVA table look like? (Use the information given in the SAS output above.) The Day sum of squares and its degree of freedom are pooled into the residual, giving:

                Sum of        Mean
Source         Squares  DF  Square  F-value  Prob > F
temp              5.51   2   2.75     1.434   0.28805
pres             99.85   2  49.93    26.032   0.00018
temp*pres         4.45   4   1.11     0.579   0.68566
Residual         17.26   9   1.918
Cor Total       127.07  17
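As a check of the pooled table in part (f), a short R sketch that recomputes the F statistics and p-values from the sums of squares reported in the SAS output (small rounding differences are expected):

# Recompute the no-Day ANOVA: pool the Day SS and DF into the residual
ss   <- c(temp = 5.51, pres = 99.85, "temp*pres" = 4.45)
df   <- c(2, 2, 4)
ssE  <- 4.25 + 13.01   # residual SS absorbs the Day SS
dfE  <- 8 + 1          # and its single degree of freedom
msE  <- ssE / dfE      # 1.918
Fval <- (ss / df) / msE
pval <- pf(Fval, df, dfE, lower.tail = FALSE)
round(cbind(MS = ss / df, F = Fval, p = pval), 5)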

4. A study recorded laboratory animal responses (Y) to a drug, as related to three quantitative predictor variables: X1 = body weight (grams), X2 = age (months), and X3 = administered drug dose (mg). These were recorded as follows:

   Y         X1    X2    X3        Y         X1    X2    X3
 140.4080   176    6.5  0.88    121.0736   165    7.9  0.84
 127.9948   176    9.5  0.88    123.2062   158    6.9  0.80
 141.9612   190    9.0  1.00    111.7565   148    7.3  0.74
 134.7643   176    8.9  0.88    125.1585   149    5.2  0.75
 166.8457   200    7.2  1.00    119.8591   163    8.4  0.81
 122.4732   167    8.9  0.83    138.1980   170    7.2  0.85
 150.4503   188    8.0  0.94    149.1850   186    6.8  0.94
 145.0751   195   10.0  0.98    105.1598   146    7.3  0.73
 141.4478   176    8.0  0.88    128.8160   181    9.0  0.90
 116.8525   149    6.4  0.75

The data are found in the file animal3.csv. Assume a homogeneous-variance, multiple linear regression model containing only first-order terms is appropriate for these data. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) Check the predictor variables for any concerns with multicollinearity. What do you find?

To start, always plot the data! Sample R code:

animal.df = read.csv( file.choose() )
attach( animal.df )
pairs( cbind(Y, X1, X2, X3) )

The plot shows an especially strong linear relationship between X1 and X3. To assess multicollinearity more closely, examine the pairwise correlations and the VIFs. Sample R code:

cor( cbind(X1, X2, X3) )
library( car )
vif( lm(Y ~ X1 + X2 + X3) )
mean( vif(lm(Y ~ X1 + X2 + X3)) )

Output (edited) is:

          X1        X2        X3
X1 1.0000000 0.5000101 0.9902126
X2 0.5000101 1.0000000 0.4900711
X3 0.9902126 0.4900711 1.0000000

for the correlations, and

       X1        X2        X3
52.101917  1.335679 51.427154

for the VIFs, and

[1] 34.95492

for the mean VIF, $\overline{VIF}$. Since max{VIF_k} = 52.102 clearly exceeds 10, and $\overline{VIF}$ = 34.95 is far larger than 6.0, there is a clear problem with multicollinearity here. The scatterplot matrix and correlation matrix both suggest that X1 and X3 are the greater culprits. (If one removes X3, the VIFs drop to near 1.33. But one does not remove predictor variables just because they are highly collinear!)

(b) Perform a ridge regression on these data: construct and display a trace plot and suggest a reasonable value for the tuning parameter c. State why you chose this value.

Given the high multicollinearity, ridge regression is a valid alternative, but remember first to center the response variable and standardize the predictor variables. Sample R code:

U = Y - mean(Y)
Z1 = scale( X1 ); Z2 = scale( X2 ); Z3 = scale( X3 )
library( genridge )
const = seq(.001, 5, .001)
animal.ridge = ridge( U ~ Z1 + Z2 + Z3, lambda = const )
traceplot( animal.ridge, pch='.', cex=1.5 )
detach( animal.df )

This produces the trace plot below. The trace for Z2 is uninformative, but the traces for Z1 and Z3 suggest that their coefficient curves flatten by about c = 1.5, or possibly c = 2; thus something in this range would be a reasonable choice for c.

By the way: it is interesting to notice the two dashed vertical lines in the plot near c = 0. Digging into this R function, we find that these are the recommended values for c from two standard sources: the HKB value suggested by Hoerl et al. (1975, Communications in Statistics 4, pp. 105-123) and the LW value suggested by Lawless and Wang (1976, Communications in Statistics 5, pp. 307-323). Find these in the ridge() object via

c( animal.ridge$kHKB, animal.ridge$kLW )

producing

[1] 0.01914972 0.05385612

As can be seen, these are much smaller than the visual indication that c is near 1.5 or 2.0. Selection of the biasing parameter c in ridge regression remains a developing area of study.

(c) Consider the following values of c: -2.7, 1.5, 8.9. Choose the most reasonable from among these values and employ it as your biasing constant in a ridge regression of Y on X1, X2, and X3. Give the resulting ridge estimators for all the regression coefficients.

From among the choices c = -2.7, 1.5, 8.9, the choice c = -2.7 is nonsensical, as c cannot be negative, while c = 8.9 gives far too large a biasing effect, as the trace plot in part (b) stabilizes well before this point. Thus c = 1.5 is the most reasonable choice, in concordance with the analysis presented in part (b). To perform the ridge analysis, we continue to use the centered response variable U and the standardized predictors Z1, Z2, Z3. Sample R code is:

library( genridge )
const = 1.5
animal1.ridge = ridge( U ~ Z1 + Z2 + Z3, lambda = const )
coef( animal1.ridge )

with consequent output

         Z1        Z2       Z3
1  9.237563 -5.558832 6.200083

(We could also have extracted the estimates directly as animal1.ridge$coef.) The three estimates are $b_{R1}$ = 9.237563, $b_{R2}$ = -5.558832, and $b_{R3}$ = 6.200083.

5. In a biomonitoring study of workplace chemical exposure, retired factory workers were assayed for blood concentrations of an industrial chemical. Three predictor variables were also recorded: X1 = Years worked, X2 = Years retired, and X3 = Age. The data are available in the file chemical.csv. In any computer calculations you perform below, supply both your supporting code and pertinent output for your answers.

(a) One might argue that when Age = 0, we would expect the response to be zero. Explain why we would then also expect the response to be zero when X1 = X2 = 0. Operating under this assumption that the response is zero when all three predictor variables are zero, fit an appropriate multiple regression model (use only first-order terms) to these data using all three predictors. Identify which, if any, of the three predictors significantly affects E[Y]. Conduct your tests at a family-wise error rate (FWER) of 0.5%.

Obviously when Age = X3 = 0, the subject has just been born, so s/he could not have worked any years (so X1 = 0), nor could s/he have spent any years in retirement (so X2 = 0).

Next, always plot the data! Sample R code and consequent scatterplot:

chemical.df = read.csv( file.choose() )
attach( chemical.df )
X1 = Years.worked
X2 = Years.retired
X3 = Age
pairs( cbind(Y, X1, X2, X3) )

The various patterns between Y and the predictor variables are mixed; a formal analysis will prove interesting. Also, there appears to be some correlation between X2 and X3, but less among the other pairings. This is verified by checking the pairwise correlations:

cor( cbind(X1, X2, X3) )

           X1        X2         X3
X1  1.0000000 0.1464776 -0.2580107
X2  0.1464776 1.0000000  0.7641656
X3 -0.2580107 0.7641656  1.0000000

So, for the record, some slight problems with multicollinearity may be present. (VIFs are not useful in linear regression through the origin, so they are not worth calculating here.)

Next, perform the LS fit. We operate without an intercept, as per the instructions in the problem. The model sets $E[Y_i] = \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3}$. We test the three separate hypotheses $H_{0j}: \beta_j = 0$ vs. $H_{aj}: \beta_j \neq 0$ (no indication was given for any one-sided alternatives) for $j = 1, 2, 3$. Sample R code and (edited) output:

chemical.lm = lm( Y ~ X1 + X2 + X3 - 1 )
summary( chemical.lm )

Call:
lm(formula = Y ~ X1 + X2 + X3 - 1)

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
X1  -0.003378   0.015378  -0.220    0.828
X2   0.026205   0.034861   0.752    0.460
X3   0.043152   0.008185   5.272 2.38e-05 ***

Residual standard error: 0.4163 on 23 degrees of freedom
Multiple R-squared: 0.9779, Adjusted R-squared: 0.9751
F-statistic: 339.8 on 3 and 23 DF, p-value: < 2.2e-16

To test the g = 3 hypotheses, extract their corresponding p-values from the partial t-tests and then apply a Bonferroni adjustment:

pvals = summary( chemical.lm )$coefficients[,4]
p.adjust( pvals, method = "bonferroni" )

          X1           X2           X3
1.000000e+00 1.000000e+00 7.143729e-05

The Bonferroni-adjusted p-values for $H_{01}: \beta_1 = 0$ and $H_{02}: \beta_2 = 0$ are reported as 1.0, so they clearly exceed $\alpha = 0.005$. Hence we fail to reject $H_{01}$ and $H_{02}$. For $H_{03}: \beta_3 = 0$ we find $p_3 = 7.1437 \times 10^{-5} < 0.005 = \alpha$, so we reject $H_{03}$. We conclude that there is no significant effect of Years worked on E[Y], no significant effect of Years retired on E[Y], and a significant effect of Age on E[Y].

(b) Plot the residuals from the model fit in part (a) against the predicted values from the fit. Do any untoward patterns appear?

Residual plot: sample R code and resulting residual plot are

plot( resid(chemical.lm) ~ predict(chemical.lm), pch=19 )
abline( h=0 )

We see no troublesome patterns in the residual plot. (The one residual far to the right in the plot is eye-catching, but it does not indicate a problem with the model fit. We might think to check for possible leverage points, however; see part (c) below.)

(c) Assess whether any observations possess high leverage in the model fit from part (a).

Leverage analysis: sample R code and (edited) output are

hii = hatvalues( chemical.lm )
p = length( coef(chemical.lm) ); n = length(Y)
print( 2*p/n )

[1] 0.2307692

The rule-of-thumb cut-off is seen to be $h_{ii} > 2p/n$ = (2)(3)/26 = 0.2308. A fast check via R employs

which( hii > 2*p/n )

producing

 5  7 18

i.e., the three observations at i = 5, 7, 18 exert high leverage on the model fit. Referring back to the residual plot, we can query what the predicted values were for these three leverage points:

predict(chemical.lm)[ which(hii > 2*p/n) ]

      5       7      18
3.44315 2.80281 2.67382

The extreme predicted value at $\hat{Y}_5$ = 3.44315 is indeed seen to be one of the high leverage points!

(d) Since X1 and X2 represent time at or after possible workplace exposure, consider the joint null hypothesis $H_0: \beta_1 = \beta_2 = 0$ vs. $H_a$: any departure. Perform a single test of $H_0$ against $H_a$ at a false positive rate of 0.5%.

Reduced-model hypothesis test: sample R code and (edited) output are

anova( lm(Y ~ X3 - 1), chemical.lm )

Model 1: Y ~ X3 - 1
Model 2: Y ~ X1 + X2 + X3 - 1
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     25 4.1011
2     23 3.9856  2   0.11551 0.3333   0.72

We see the test statistic is F* = 0.3333 with (2, 23) d.f. The corresponding p-value is $P[F(2,23) \geq 0.3333] = 0.7200$, which is clearly larger than $\alpha = 0.005$. Thus we fail to reject

$H_0$ and conclude that Years Worked and Years Retired do not contribute significantly to the model fit.

Alternatively, using a rejection-region approach: find the critical point as F(0.995; 2, 23) = 6.7300. Since F* = 0.3333 < 6.7300, we again fail to reject $H_0$.

6. Consider the multiple linear regression model $Y \sim N_n(X\beta, \sigma^2 I)$, where $Y$ is an $n \times 1$ vector, $X$ is an $n \times p$ matrix, and $\beta$ is a $p \times 1$ vector. Show that the maximum likelihood estimator for $\beta$ is derived from the same estimating equations as the least squares estimator for $\beta$, making the two estimators identical. For simplicity, you may assume that $\sigma^2$ is known. [Hint: from matrix calculus, recall that $\partial(Va)/\partial a = V^T$ and $\partial(a^T U a)/\partial a = (U + U^T)a$, for conformable matrices $U$ and $V$ and vector $a$.]

The likelihood function is

$L(\beta) = (\sigma\sqrt{2\pi})^{-n} \exp\{-(Y - X\beta)^T (Y - X\beta)/2\sigma^2\}$,

so the log-likelihood becomes

$\ell(\beta) = -n \log(\sigma\sqrt{2\pi}) - (Y - X\beta)^T (Y - X\beta)/2\sigma^2$
$\qquad\;\, = -n \log(\sigma\sqrt{2\pi}) - \{Y^T Y - \beta^T X^T Y - Y^T X\beta + \beta^T X^T X\beta\}/2\sigma^2$.

Notice that $\beta^T X^T Y$ is a scalar, so it must equal its transpose: $\beta^T X^T Y = (\beta^T X^T Y)^T = Y^T X\beta$. Then

$\ell(\beta) = -n \log(\sigma\sqrt{2\pi}) - \{Y^T Y - 2Y^T X\beta + \beta^T X^T X\beta\}/2\sigma^2$.

Now take the first derivative with respect to $\beta$:

$\dfrac{\partial \ell(\beta)}{\partial \beta} = -\dfrac{1}{2\sigma^2}\left(-2\dfrac{\partial\, Y^T X\beta}{\partial \beta} + \dfrac{\partial\, \beta^T X^T X\beta}{\partial \beta}\right)$.

From the Hint, $\partial(Y^T X\beta)/\partial\beta = (Y^T X)^T = X^T Y$, while $\partial(\beta^T X^T X\beta)/\partial\beta = (X^T X + \{X^T X\}^T)\beta = 2X^T X\beta$, the latter equality holding since $X^T X$ is clearly symmetric. Then

$\dfrac{\partial \ell(\beta)}{\partial \beta} = \dfrac{1}{2\sigma^2}\left(2X^T Y - 2X^T X\beta\right)$,

so setting $\partial\ell(\beta)/\partial\beta = 0$ (a $p \times 1$ vector of zeroes) yields $2X^T Y - 2X^T X\beta = 0$, or simply $X^T X\beta = X^T Y$. These are the least squares normal equations for $\beta$ given in Equation (6.24) of Kutner et al.'s textbook, which shows that the two estimation methods lead to the same estimating equations.
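As a closing numerical illustration (a sketch with simulated data; the design matrix, coefficients, and seed below are arbitrary choices, not from the exam), solving the normal equations $X^T X b = X^T Y$ directly reproduces the least squares fit from lm():

# Simulate a small regression and compare the normal-equations solution to lm()
set.seed(1)
n <- 30
X <- cbind(1, rnorm(n), runif(n))          # arbitrary n x 3 design matrix
beta <- c(2, -1, 0.5)                      # arbitrary true coefficients
Y <- drop(X %*% beta) + rnorm(n)
b.normal <- solve(t(X) %*% X, t(X) %*% Y)  # solves X'X b = X'Y
b.lm <- coef(lm(Y ~ X - 1))                # LS fit, no extra intercept
cbind(b.normal, b.lm)                      # agree up to numerical error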