Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Size: px

Start display at page:

Download "Statistical Methods III Statistics 212. Problem Set 2 - Answer Key"

Marjorie Hall
5 years ago
Views:

1 Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423 patients being treated in a kidney-stone clinic. The data set contains the number of kidney stones that each patient has formed from the beginning of treatment at the clinic, together with the number of years of treatment. Patient age (at baseline) and sex are recorded. A small number of patients have only one functional kidney, either because the kidney was surgically removed or never functioned; this is also recorded in the data set. Only patients with at least one year of follow-up (treatment) are included in the dataset. The data can be found on the course web-site at dgillen/stat212/data/stones.txt. Scientific interest lies in determining wether patients with only one functioning kidney have a higher or lower rate of stone development when compared to those patients that have two functioning kidneys. (a) Produce descriptive statistics relevant to the scientific question of interest. Solution: : Answers to this will vary. At minimum, summary statistics of the population stratified by the predictor of interest should have been produced (this should have included years of followup). A histogram of the rate of stone development stratified by the predictor of interest would also have help to visualize the fact that the data are zero-inflated. Finally a boxplot of rates of stone developments by the predictor of interest could also have been produced. (b) Specify an a priori model for answering the scientific question of interest. choice of adjustment variables in your model. You should justify the Solution: : There are few covariates available for the analysis. A very quick reading of any literature would show that age and male sex are risk factors for kidney stones. In addition, multiple reports have suggested that there is an interaction between these covariates (at younger ages, males have higher rates of stone development, but females catch up at later ages). As such, I would be adjusting for these as potential precision variables (and possible confounders, though there is no information on why individuals have one kidney because of failure or because they were a donor). I note that there are a lot of potential unmeasured confounding factors. Among them are sodium intake, water consumption, hypertension, and hereditary kidney disease. (c) Notationally, write down your a priori regression model. Explain, using your model, how the rate is related to the actual count of kidney stones. Provide an interpretation of the intercept (presuming you have one in your model) and the coefficient associated with the predictor of interest. Note that you will probably wish to transform these parameters to provide meaningful interpretations. Solution: : My a priori model is given as follows: log(λ i ) = β 0 + β 1 (age i 40) + β 2 sex i + β 3 (age i 40) sex i + β 4 nx1 i where λ i is the rate of kidney stone development for subject i. If Y i is the number of kidney stones for subject i and Y i has mean µ i, then µ i = λ i yrfu i and so we have the model log(µ i ) = β 0 + β 1 (age i 40) + β 2 sex i + β 3 (age i 40) sex i + β 4 nx1 i + log(yrfu i ) (d) Use Poisson regression to fit your model and interpret the estimated intercept (presuming you have one in your model) and the estimated coefficient associated with the predictor of interest. Note that 1

2 you will probably wish to transform these parameters to provide meaningful interpretations. Solution: : Below is my initial model fit. Note that I have computed the average age over the course of followup as this is probably a more honest measure of subject age (given the high variance of followup times). ### Create variable for the average age over the course of followup > stones$mean.age <- (stones$age + (stones$age+stones$yrfu))/2 > ### Fit a priori regression model with Poisson regression > fit <- glm( stones ~ I(mean.age-40)*sex + nx1, data=stones, family=poisson, offset=log(yrfu) ) > glmci(fit) exp( Est ) ci95.lo ci95.hi z value Pr(> z ) (Intercept) e+00 I(mean.age - 40) e+00 sex e-04 nx e-04 I(mean.age - 40):sex e-04 Interpretations of the (transformed) intercept and coefficient associated with the predictor of interest are as follows: From our model, we estimated that the rate of stones among 40 year old males patients with two functioning kidneys is approximately.1174 per person per year. We estimate that the rate of stone development among patients with one functioning kidney is approximately 50% that of patients with two functioning kidneys that are similar with respect to age and sex. (e) Using your fitted model, examine the data for overdispersion. What do you conclude. Specifically, do the kidney stone count data have a Poisson distribution for a randomly-selected person given their covariates? Why or why not? Solution:From the squared Pearson residual plot, it certainly appears that these data are overdispersed relative to the Poisson distribution (smoother is consistently above the horizontal line y=1. Note that the plot does not support a simple scalar (ie. quasipoisson) form of overdispersion. (f) Refit your model accounting for overdispersion by: i. using a scaled overdispersion model, scaling the standard errors by the Pearson statistic ii. using the robust variance estimator to adjust the standard errors of the regression coefficient estimates Solution: ### Quasi fit and robust SE > fit.quasi <- glm( stones ~ I(mean.age-40)*sex + nx1, data=stones, family=quasipoisson, offset=log(yrfu) ) > summary( fit.quasi ) Call: 2

3 Squared Pearson Residual Fitted mean Figure 1: Squared Pearson residual plot to assess overdispersion (plot is zoomed-in for a better visual representation. glm(formula = stones ~ I(mean.age - 40) * sex + nx1, family = quasipoisson, data = stones, offset = log(yrfu)) Deviance Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** I(mean.age - 40) * sex nx I(mean.age - 40):sex Signif. codes: 0 *** ** 0.01 * (Dispersion parameter for quasipoisson family taken to be ) Null deviance: on 1422 degrees of freedom Residual deviance: on 1418 degrees of freedom AIC: NA Number of Fisher Scoring iterations: 6 > glmci( fit.quasi ) exp( Est ) ci95.lo ci95.hi z value Pr(> z ) (Intercept) I(mean.age - 40) sex nx

4 I(mean.age - 40):sex > > glmci( fit, robust=true ) exp( Est ) robust ci95.lo robust ci95.hi robust z value robust Pr(> z ) (Intercept) I(mean.age - 40) sex nx I(mean.age - 40):sex Notice from the above that regardless of the method, the variance estimates are much wider after accounting for overdispersion. While the scaled version is the more extreme correction, the Pearson plot does not support such a model for the overdispersion. (g) Now refit your model using negative binomial regression fit via maximum likelihood (see Problem Set 1). How do your estimates and corresponding inference compare with the Poisson fit and the overdisperion models in (f)? Solution: ### Negative binomial fit > library(mass) > fit.nb <- glm.nb( stones ~ I(mean.age-40)*sex + nx1, data=stones ) > summary(fit.nb) Call: glm.nb(formula = stones ~ I(mean.age - 40) * sex + nx1, data = stones, init.theta = , link = log) Deviance Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) I(mean.age - 40) sex nx I(mean.age - 40):sex Signif. codes: 0 *** ** 0.01 * (Dispersion parameter for Negative Binomial(0.1394) family taken to be 1) Null deviance: on 1422 degrees of freedom Residual deviance: on 1418 degrees of freedom AIC: 2658 Number of Fisher Scoring iterations: 1 Theta:

5 Std. Err.: x log-likelihood: We can see that there are slight differences in the regression parameter estimates, because of the different weights being applied with the negative binomial model (recall that the quasipoisson and robust variance estimator will have the same coefficient estimates as the usual Poisson model). However, the estimates are certainly not qualitatively different and the resulting inference from the negative binomial model is similar to that from using the robust variance estimator. (h) Which of the overdispersion models would you select and why? Solution: Given the relatively large sample size, I would prefer to use the robust variance estimator for these data. The data clearly seem to be overdispersed relative to the Poisson distribution and the scalar form of overdispersion is not supported by the Pearson plot. The Pearson plot would be approximately linear (with non-zero slope) under the negative-binomial model. This again is not supported by the data. The robust variance makes no assumption about the form of the mean-variance relationship. The robust variance estimator may produce less efficient estimators than the negative binomial model if indeed the negative binomial assumption were correct (since the optimal weights are likely not being used), but that certainly does not seem to be the case here and truly is less of a concern to me than guaranteeing valid inference across a wide range for possible mean-variance relationships. (i) Based upon your final model, what do you conclude regarding the association between the number of functioning kidneys and the rate of kidney stone development? What are the limitations of your analysis? Solution: We estimate that the rate of stone development among patients with one functioning kidney is approximately 50% that of patients with two functioning kidneys that are similar with respect to age and sex (arr=0.499; 95% CI:.0.286, 0.871; p=.0145). However, this studies comes with multiple limitations: All patients were sampled from a single specialty clinic, calling into account the generalizability of the result. Further, there is no information the reason for patients having only one kidney, nor is there information on the actual timing of stone development. Finally, there are numerous unmeasured potential confounding factors that were previously discussed. 2. (To be turned in on Wednesday, April 25th) Following our discussion in Lecture 1 that pertained to the impact of correlation on precision, in this problem we will consider the impact of correlated data in the setting of linear regression. Specifically, we will consider the validity of ordinary least squares (OLS) estimates when outcomes are dependent and covariates vary within and between clusters. We will induce correlation by assuming each subject has a random intercept term so that a subject s response is given by the model: Y ij = β 0 + β 1 t ij + b 0,i + ɛ ij, i = 1,..., n, j = 1,..., J, where β = (β 0, β 1 ) consists of fixed effect parameters defining the mean model for the response, and we assume that b 0,i Normal(0, τ 2 ) and ɛ ij iid Normal(0, σ 2 ), with b 0,i and ɛ ij independent. Further assume that observations on different clusters are independent, so that the correlation between Y ij and Y kl is 0 for i k. (a) Based upon the above model specification, what is the mean and variance of Y ij? 5

6 Expectation: E[Y ij ] = E[β 0 + β 1 t ij + b 0,i + ɛ ij ] = E[β 0 ] + E[β 1 t ij ] + E[b 0,i ] + E[ɛ ij ] = β 0 + β 1 t ij = β 0 + β 1 t ij Variance: V ar[y ij ] = V ar[β 0 + β 1 t ij + b 0,i + ɛ ij ] = V ar[b 0,i + ɛ ij ] = V ar[b 0,i ] + V ar[ɛ ij ] + 2Cov(b 0,i, ɛ ij ) = τ 2 + σ = τ 2 + σ 2 (b) Based upon the above model specification, what is the covariance and correlation between Y ij and Y ij, j j? Combine this with your answer from (a) to write down the covariance matrix for Y i, the vector of responses for cluster i. Covariance: Cov(Y ij, Y ij ) = Cov(β 0 + β 1 t ij + b 0,i + ɛ ij, β 0 + β 1 t ij + b 0,i + ɛ ij ) = Cov(b 0,i + ɛ ij, b 0,i + ɛ ij ) = Cov(b 0,i, b 0,i ) + Cov(b 0,i, ɛ ij ) + Cov(ɛ ij, b 0,i ) + Cov(ɛ ij, ɛ ij ) = τ = τ 2 Cov(Y Correlation: Corr(Y ij, Y ij ) = ij,y ij ) = τ 2 (V ar(yij))v ar(y ij )) τ 2 +σ 2 τ 2 + σ 2 τ 2 τ 2 τ 2 τ 2 + σ 2 τ 2 Cov(Y i ) : Cov(Y i ) J,J = τ 2 τ 2 τ 2 + σ 2 (c) Now consider using OLS to estimate β where t ij is constant within a cluster. Consider the case n = 25 and J = 10 (i.e. 10 measurements on each of 25 clusters). Further suppose that σ 2 = 10 and τ 2 = 5. Sample the values of t ij from a Uniform(1,10) distribution using R with a random seed of Using these values of t ij, i = 1,..., 25, j = 1,..., 10, compute the true variance of the OLS estimator of β. (Hint: It may be useful to note that the OLS estimator can be written as ( n ) 1 n β = i=1 XT i X i i=1 XT i Y i, where X i is the design matrix (dimension 10 2) corresponding to cluster i.) set.seed(12345) n <- 25 J <- 10 id <- rep( 1:n, each=j ) t.ij <- rep( runif(n,1,10), each=j ) # cbind(id,t.ij) Sigmai <- matrix(5,nrow=j,ncol=j) + diag(10,nrow=j) XtX <- Reduce( +, lapply( split(t.ij,id), function(x){ t(cbind(1,x)) %*% cbind(1,x) } ) ) XtSigmaX <- Reduce( +, lapply( split(t.ij,id), function(x){ t(cbind(1,x)) %*% Sigmai %*% cbind(1,x) } ) ) Var.beta <- solve(xtx) %*% XtSigmaX %*% solve(xtx) colnames(var.beta) <- c("beta_0", "Beta_1") rownames(var.beta) <- c("beta_0", "Beta_1") Var.beta x x (d) Now simulate 10,000 datasets for each parameter scenario given in the table below using the model and sampling scheme given in (c) (you may assume that the values of t ij are fixed by design and hence 6

7 do not vary by simulation). For each dataset, compute the OLS estimate of β 1 and the model based variance estimate as computed by lm(). Use the results of your simulation study to fill in a table of the form: β 0 β 1 σ 2 τ 2 E[ β 1 ] Mean Var[ β 1 ] Obs Var[ β 1 ] Cov. Prob. Type I Error where the columns represent the mean OLS estimate of β 1, the mean of the model-based variance, the observed variance, the coverage probability of a 95% confidence interval for β 1 based upon the model-based variance, and the observed type I error rate (only for the case β 1 = 0). Comment on the validity of the OLS estimate and corresponding inference in each of the above cases. Specifically comment on how the mean of the model-based and observed variance relate to the true variance you computed in (c) for the case τ 2 = 5. Code is posted on the course webpage. Briefly, OLS is consistent to the true value regardless of the value of τ 2. When τ 2 = 0 (i.e. under uncorrelated data), the model based variance (i.e Mean V ˆ ar[ ˆβ 1 ]) and the observed variance, which is an estimate of the true variance, are the same. Therefore, the inferences made based on the model has the correct coverage probability and correct Type I error rate. For τ 2 > 0 (i.e. under correlated data), as the table in part d shows, model-based variance is less than the observed variance indicating that OLS underestimates the true variance. This results in lower coverage probability and higher Type I error rate compared with the true coverage probability and the true Type I error rate (i.e unreliable inference when τ 2 > 0). In particular, when τ 2 = 5 observed variance, which is expected to be an estimate of the true variance, is very close to the variance estimated in part c (both are about 0.034) while the model-based variance (OLS model) is far less than the true variance. This leads to a lower coverage probability and higher type I error rate. (e) Now consider using OLS to estimate β where t ij varies within a cluster. Again consider the case n = 25, J = 10, and suppose that σ 2 = 10 and τ 2 = 5. Assume that t ij = j for all i = 1,..., 25, j = 1,..., 10. Compute the true variance of the OLS estimator of β. ## ##### Compute true variance of OLSE under varying within-cluster covariate ## n <- 25 J <- 10 id <- rep( 1:n, each=j ) t.ij <- rep( 1:J, n ) Sigmai <- matrix(5,nrow=j,ncol=j) + diag(10,nrow=j) XtX <- Reduce( +, lapply( split(t.ij,id), function(x){ t(cbind(1,x)) %*% cbind(1,x) } ) ) XtSigmaX <- Reduce( +, lapply( split(t.ij,id), function(x){ t(cbind(1,x)) 7

8 %*% Sigmai %*% cbind(1,x) } ) ) Var.beta <- solve(xtx) %*% XtSigmaX %*% solve(xtx) colnames(var.beta) <- c("beta_0", "Beta_1") rownames(var.beta) <- c("beta_0", "Beta_1") Var.beta x x (f) Again simulate 10,000 datasets using the above model and values of t ij give in (e). For each dataset, compute the OLS estimate of β 1 and the model based variance estimate as computed by lm(). and produce an analogous table that given in part (d) (same values for β 1, σ 2, and τ 2 ). Comment on the validity of the OLS estimate and corresponding inference in each of the above cases. Specifically comment on how the mean of the model-based and observed variance relate to the true variance you computed in (e) for the case τ 2 = 5. β 0 β 1 σ 2 τ 2 E[ β 1 ] Mean Var[ β 1 ] Obs Var[ β 1 ] Cov. Prob. Type I Error Again, OLS is consistent to the true value regardless of the value of τ 2. When τ 2 = 0 (i.e. under uncorrelated data), the model based variance (i.e Mean V ˆ ar[ ˆβ 1 ]) and the observed variance, which is an estimate of the true variance, are the same. Therefore, the inferences made based on the model has the correct coverage probability and correct Type I error rate. For τ 2 > 0 (i.e. under correlated data), as the table in part e shows, model-based variance is greater than the observed variance indicating that OLS overestimates the true variance. This results in higher coverage probability and lower Type I error rate compared to the true coverage probability and the true Type I error rate (i.e unreliable inference when τ 2 > 0). In particular, when τ 2 = 5 observed variance, which is expected to be an estimate of the true variance, is very close to the variance estimated in part e (both are about ) while the model-based variance (OLS model) is greater than the true variance. This leads to a higher coverage probability and lower type I error rate. (g) Summarize the conclusions from this exercise regarding the impact of correlated data when using OLS to estimate model parameters. Comment on how the results you observed here relate to our discussion of the repeated measures and case-crossover designs from Lecture 1. Ordinary Least Square (OLS) estimates ignore any correlation in data. Under correlated data, model-based variance of the OLS estimates are incorrect, however, the estimates still remain consistent. Depending on the study design, model-based variance may underestimate (part d) or may 8

9 overestimate (part f) the true variance. In Part d, data are simulated under the repeated measure design. Under this design, OLS model underestimates the true variance and therefore, leads to lower coverage probability and higher type I error rate. In Part f, data are simulated under the cross-over design. Under this design, OLS model overestimates the true variance and therefore, leads to higher coverage probability and lower type I error rate. 9

Modeling Overdispersion

Modeling Overdispersion James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 1 Introduction 2 Introduction In this lecture we discuss the problem of overdispersion in