are censored. Because artificial censoring reduces the information available and leads to nonsmooth estimating functions, prior research has noted

Size: px

Start display at page:

Download "are censored. Because artificial censoring reduces the information available and leads to nonsmooth estimating functions, prior research has noted"

Oliver Tate
6 years ago
Views:

1 ABSTRACT VOCK, DAVID MICHAEL. Advanced Statistical Methods for Complex Longitudinal Data. (Under the direction of Marie Davidian and Anastasios A. Tsiatis.) Longitudinally collected data play an important role in many biomedical applications. As researchers have greater capacity to collect and store large amounts of data over time, biomedical research has shifted from trying to understand the effects of therapy at a single time point, to a more nuanced understanding of how therapy, which could change over time, longitudinally affects a patient. However, many standard methods of analyzing longitudinal data are not adequate to address questions of interest. The first half of the thesis addresses methodological challenges when the longitudinally collected response may be censored. Mixed models are commonly used to represent longitudinal or repeated measures data. An additional complication arises when the response is censored, for example, due to limits of quantification of the assay used. While Gaussian random effects are routinely assumed, little work has characterized the consequences of misspecifying the random effects distribution nor has a more flexible distribution been studied for censored longitudinal data. We show that, in general, maximum likelihood estimators will not be consistent when the random effects density is misspecified, and the effect of misspecification is likely to be greatest when the true random effects density deviates substantially from normality and the number of non-censored observations on each subject is small. We develop a mixed model framework for censored longitudinal data in which the random effects are represented by the flexible seminonparametric (SNP) density and show how to obtain estimates in SAS procedure NLMIXED. Simulations show that this approach can lead to reduction in bias and increase in efficiency relative to assuming Gaussian random effects. The methods are demonstrated on data from a study of hepatitis C virus. In the second portion, we develop methods to assess the effect of organ transplantation, a time-varying treatment, on survival. Because the number of patients waiting for organ transplants exceeds the number of organs available, a better understanding of how transplantation affects the distribution of residual lifetime is needed to improve organ allocation. However, there has been little work to assess the survival benefit of transplantation from a causal perspective. Previous methods developed to estimate the causal effects of treatment in the presence of timevarying confounders have assumed that treatment assignment was independent across patients, which is not true for organ transplantation. We develop a version of G-estimation that accounts for the fact that treatment assignment is not independent across individuals to estimate the parameters of a structural nested failure time model. In addition, G-estimation for failure time models requires the use of artificial censoring, a technique where some subjects observed to fail

2 are censored. Because artificial censoring reduces the information available and leads to nonsmooth estimating functions, prior research has noted that finding the solutions to estimating equations can be difficult. We suggest some computational strategies to mitigate the problems typically encountered with artificial censoring. We derive the asymptotic properties of our estimator and confirm through simulation studies that our method leads to valid inference of the effect of transplantation on the distribution of residual lifetime. We demonstrate our method on the survival benefit of lung transplantation using data from the United Network for Organ Sharing (UNOS).

4 Advanced Statistical Methods for Complex Longitudinal Data by David Michael Vock A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy Statistics Raleigh, North Carolina 2012 APPROVED BY: Dennis D. Boos Daowen Zhang Marie Davidian Co-chair of Advisory Committee Anastasios A. Tsiatis Co-chair of Advisory Committee

5 DEDICATION To my parents, Janice and Dave; my sister, Emily; and my best friend, Laura. ii

6 BIOGRAPHY Originally from Wheaton, Illinois, David M. Vock is the son of David J. and Janice Vock. David graduated from Wheaton North High School in 2003 and was captain of the two-time Illinois state champion scholastic bowl team for Wheaton North. After high school, David attended St. Olaf College in Northfield, Minnesota where he majored in mathematics and chemistry with a concentration in statistics. While at St. Olaf College, he had the opportunity to study biostatistics and collaborate with World Health Organization researchers in Geneva, Switzerland. David was elected to Phi Beta Kappa and graduated summa cum laude with a Bachelor of Arts degree and distinction in statistics from St. Olaf in Interested in many different academic disciplines, David decided to pursue graduate studies in statistics because, in the words of John Tukey, he could play in everyone s backyard. He began his graduate studies in 2007 at North Carolina State University in Raleigh, North Carolina. David received a Master of Statistics degree from Noth Carolina State University in He actively collaborates with medical researchers at Duke Clinical Research Institute. David s collaborative work has largely focused on hepatitis C and lung transplantation research and has motivated much of his methodological research in longitudinal data analysis, causal inference, and dynamic treatment regimes. He intends to complete his Ph.D. under the direction of Drs. Marie Davidian and Anastasios A. Tsiatis. David has accepted a position as an assistant professor in the Division of Biostatistics at the University of Minnesota School of Public Health. iii

7 ACKNOWLEDGEMENTS Many people have helped and guided me through my graduate studies and dissertation research. I would like to express my gratitude to a few people below while acknowledging that many more people deserve praise. I would like especially to thank my advisors, Drs. Marie Davidian and Butch Tsiatis. They have showed incredible patience in allowing me to explore various research lines and I am very thankful for their latitude. Their expertise and gentle guidance have been immensely helpful in completing this work. I also would like to thank them for offering me a research assistantship to support me during most of my time at North Carolina State. Mostly, I would like to express my gratitude to them for being great faculty role models and mentors. They have guided me through the peer-review process, made sure that I would gain teaching experience, and given me valuable advice on searching for academic jobs. Butch and Marie have shown me how to contribute to interdisciplinary research and how to use collaborative research to motivate methodological work. I wish to thank all the faculty and staff at North Carolina State that have helped me along the way. I want to thank especially the Directors of the Statistics Graduate Program during my first years at North Carolina State, Drs. Pam Arroway and Jacqueline Hughes-Oliver, for being patient in answering all my questions and taking the time to help me find interesting funding opportunities. I am also appreciative of all the advice Dr. Eric Laber has given me about how to interview for faculty jobs and how to be a successful junior faculty member. Finally, I would like to thank Drs. Dennis Boos and Daowen Zhang for serving on my committee, writing letters of recommendation, and being wonderful teachers of some of the most interesting and useful courses I have taken in graduate school, especially Advanced Inference and Mixed Models and Variance Components. I am also thankful for the opportunity to work with Karen Pieper at Duke Clinical Research Institute. She has been an invaluable resource on how to work with clinical investigators and artfully explain advanced methodology to a broad audience. I have learned a lot from her ability to prioritize the most urgent needs of a statistical analysis and to cajole subject area experts to buy-in to rigorous statistical analysis. In addition, I would like to thank all the clinical investigators that I have worked with at Duke. Besides teaching me a lot about their own area of research, they have been patient and nurturing. I am especially grateful for the great education that I received at St. Olaf College that well prepared me for graduate school. I want to thank my mentors and research advisors, Drs. Rebecca Judge, Julie Legler, and Paul Roback. They have been extremely supportive and provided a lot of encouragement that I could be successful in graduate school. iv

8 Finally, I want to thank all my family and friends for their love and support. Simply put, without them I would not have succeeded. My parents have given so much to see that I have been successful and have always been loving and caring. Both my parents and my sister, Emily, have taught me to laugh, stay humble, and never to take myself too seriously. Finally, I would like to thank my girlfriend Laura for her patience, support, and good humor while I wrote this dissertation. v

9 TABLE OF CONTENTS List of Tables viii List of Figures xii Chapter 1 Mixed Model Analysis of Censored Longitudinal Data with Flexible Random Effects Density Introduction Gaussian Linear Mixed Model with Censored Response Semiparametric Linear Mixed Model with Censored Response Estimation Using SAS NLMIXED Simulation Application Discussion Chapter 2 Assessing the Effect of Organ Transplantation on the Distribution of Residual Lifetime Introduction Developing a Statistical Framework Potential Outcome Framework Failure Time Model Observed Data Organ Allocation Model Development of Class of Estimators Proposed Class of Estimators Asymptotic Properties of the Class of Estimators Reasonable Estimators in the Class Nuisance Parameters Censored Survival Times Artificial Censoring Mitigating Problems of Artificial Censoring Simulation Study UNOS Dataset Description of Cohort Model Specification Results Diagnostics and Non-parametric Estimate of Survival Conclusion References Appendices Appendix A Asymptotic Bias of Parameter Estimators for Censored Longitudinal Mixed Models vi

10 Appendix B SAS Code for Censored Longitudinal Models B.1 Obtain Empirical Bayes Estimates B.2 Obtain Grid of Starting Values B.3 SNP Estimation Appendix C Additional Simulations for Mixed Model Analysis of Censored Longitudinal Data Appendix D Joint Distribution of Potential Outcomes and Observed Data Appendix E Consistency and Asymptotic Normality of the Class of Estimators E.1 Proof of Consistency E.2 Proof of Asymptotic Normality Appendix F Improving the Efficiency of Class of Estimators Appendix G Doubly Robust Estimators Appendix H Estimating Nuisance Parameters H.1 Proof: Estimation of Nuisance Parameters Does Not Affect Asymptotic Variance H.2 Suggestions on Estimating ξ Appendix I Organ Transplantation Simulation Details vii

11 LIST OF TABLES Table 1.1 Table 1.2 Table 1.3 Table 1.4 Table 1.5 Table 1.6 Table 1.7 Table 1.8 Proportion of subjects by number of non-censored observation for the four random effects densities considered in the simulations with q = 2 described in Section Simulation results when Gaussian random effects were assumed for all models regardless of the true distribution of the random effects. The simulation included 500 data sets with 500 subjects each. MC Avg: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; CP: Monte Carlo coverage probability of the 95 percent Wald-type confidence intervals Proportion of subjects by number of non-censored observation when the true random effects density was bimodal and the time points where data were collected was varied. t A = (0, 1, 2, 3, 4) T ; t B = (0, 1, 2, 3, 3.5, 4, 4.5) T ; t C = ( 2, 1, 0, 1, 2, 3, 4) T Simulation results when Gaussian random effects were assumed for the bimodal random effects but t i in (1.2) was varied. The simulation included 500 data sets with 500 subjects each. t A = (0, 1, 2, 3, 4) T ; t B = (0, 1, 2, 3, 3.5, 4, 4.5) T ; t C = ( 2, 1, 0, 1, 2, 3, 4) T ; MC Avg: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; CP: Monte Carlo coverage probability of the 95 percent Waldtype confidence intervals Proportion of time K was selected using AIC, HQIC, and BIC when the SNP density was used to model the random effects density. The simulation included 500 data sets with 500 subjects each Simulation results when the SNP density was used to model the random effects density. Models with K = 0, K = 1, and K = 2 were fit and K was selected using HQIC. The simulation included 500 data sets with 500 subjects each. MC Avg: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; Ratio MSE: Ratio of the Monte Carlo mean square error between K = 0 and K selected by the HQIC Proportion of responses censored at each time point for the 811 patients with CT genotype in the IDEAL trial Proportion of subjects by number of uncensored viral load measurements for the 811 patients with the CT genotype in the IDEAL trial Table 1.9 Information criteria and parameter estimates from fitting model (1.10) concerning the IDEAL study with K = 0, K = 1, and K = Table 1.10 Parameter estimates, standard errors, and a 95 percent Wald-type confidence limits from fitting model (1.10) assuming the random effects followed the SNP density with K = viii

12 Table 2.1 Table 2.2 Summary of the parameter estimates for θ 0 by estimator from 1,000 Mote Carlo simulations when the number of organs allocated was 1,250. Converge: Proportion of Monte Carlo datasets the estimator converged; MC mean: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; Ratio Avg SE:SD: Ratio of the average of the standard error estimates to the standard deviation of the parameter estimates; Ratio MSE: Ratio of the average squared error of the fac estimates to the average squared error of the estimates of the given estimator; CP: Monte Carlo coverage probability of 95 percent Wald-type confidence intervals. Approximate standard errors for the values in the column MC Mean ranged from to , the column MC SD ranged from to , the column Avg SE ranged from to , the column Ratio Avg SE:SD ranged from to 0.045, the column Ratio MSE ranged from to 0.27, the column CP ranged from to The true value of θ 0 is Summary of the parameter estimates for θ 1 by estimator from 1,000 Mote Carlo simulations when the number of organs allocated was 1,250. Converge: Proportion of Monte Carlo datasets the estimator converged; MC mean: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; Ratio Avg SE:SD: Ratio of the average of the standard error estimates to the standard deviation of the parameter estimates; Ratio MSE: Ratio of the average squared error of the fac estimates to the average squared error of the estimates of the given estimator; CP: Monte Carlo coverage probability of 95 percent Wald-type confidence intervals. Approximate standard errors for the values in the column MC Mean ranged from to , the column MC SD ranged from to , the column Avg SE ranged from to , the column Ratio Avg SE:SD ranged from to 0.045, the column Ratio MSE ranged from to 0.11, the column CP ranged from to The true value of θ 1 is ix

13 Table 2.3 Table 2.4 Table 2.5 Summary of the parameter estimates for θ 0 by estimator from 1,000 Mote Carlo simulations when the number of organs allocated was 2,500. Converge: Proportion of Monte Carlo datasets the estimator converged; MC mean: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; Ratio Avg SE:SD: Ratio of the average of the standard error estimates to the standard deviation of the parameter estimates; Ratio MSE: Ratio of the average squared error of the fac estimates to the average squared error of the estimates of the given estimator; CP: Monte Carlo coverage probability of 95 percent Wald-type confidence intervals. Approximate standard errors for the values in the column MC Mean ranged from to , the column MC SD ranged from to , the column Avg SE ranged from to , the column Ratio Avg SE:SD ranged from to 0.045, the column Ratio MSE ranged from to 0.13, the column CP ranged from to The true value of θ 0 is Summary of the parameter estimates for θ 1 by estimator from 1,000 Mote Carlo simulations when the number of organs allocated was 2,500. Converge: Proportion of Monte Carlo datasets the estimator converged; MC mean: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; Ratio Avg SE:SD: Ratio of the average of the standard error estimates to the standard deviation of the parameter estimates; Ratio MSE: Ratio of the average squared error of the fac estimates to the average squared error of the estimates of the given estimator; CP: Monte Carlo coverage probability of 95 percent Wald-type confidence intervals. Approximate standard errors for the values in the column MC Mean ranged from to , the column MC SD ranged from to , the column Avg SE ranged from to , the column Ratio Avg SE:SD ranged from to 0.045, the column Ratio MSE ranged from to 0.090, the column CP ranged from to The true value of θ 1 is Parameter estimates for the model of the propensity to receive an available organ. Group A: primarily COPD; Group B: primarily pulmonary hypertension; Group C: primarily CF; Group D: primarily IPF; Single LTx: single lung transplantation. The reference for Private Insurance and Medicare Insurance is Medicaid or other public funding for payment. The reference for each UNOS Region is Region 11. To allow for a non-linear relationship between LAS and the log propensity to receive treatment, we considered a restricted cubic spline transformation of LAS knot points at 32, 36, 43, and 85. LAS and LAS refer to the non-linear terms of this transformation x

14 Table 2.6 Table B.1 Parameter estimates of the scale accelerated failure time model for the effect of lung transplantation. Group A: primarily COPD; Group B: primarily pulmonary hypertension; Group C: primarily CF; Group D: primarily IPF; Single LTx: single lung transplantation. The reference for Private Insurance and Medicare Insurance is Medicaid or other public funding for payment. LAS has been centered at 30, so the interpretation of the intercept term is the log acceleration factor at a LAS of Conversion between the notation used in Section 1 and the names used in the example code Table C.1 Proportion of subjects by the number of non-censored observations for the four random effect densities considered in the simulations with q = Table C.2 Simulation results when Gaussian random effects were assumed for all models regardless of the true distribution of the random effects. The simulation included 500 data sets with 500 subjects each. MC Avg: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; CP: Monte Carlo coverage probability of the 95 percent Wald-type confidence intervals Table C.3 Proportion of time K was selected using AIC, HQIC, and BIC for q = 1 simulations when SNP was used to estimate the random effects Table C.4 Simulation results when SNP was used to estimate the random effects for q = 1 simulations. Models with K = 0, K = 1, and K = 2 were fit and K was selected using HQIC. The simulation included 500 data sets with 500 subjects each. MC Avg: Monte Carlo average of the parameter estimates; MC SD: Monte Carlo standard deviation of the parameter estimates; Avg SE: Average of the standard error estimates; Ratio MSE: Ratio of the Monte Carlo mean square error between K = 0 and K selected by the HQIC Table I.1 Quantile (years) of the simulated residual life had the patient never been transplanted by LAS at transplantation. The maximum residual follow-up time is 5.66 years in the simulation xi

15 LIST OF FIGURES Figure 1.1 Figure 1.2 Trajectory of log 10 viral load of 25 randomly selected subjects with genotype CT from the IDEAL study. The lower limit of quantification, log 10 IU/mL, is shown in the green dotted line. For graphical purposes, the lower limit of quantification was imputed for the censored responses.. 2 Contour plot of the true random effects densities from the skewed and bimodal scenarios considered in the simulation described in Section 1.5. In each case, the densities were a mixture of two normals Figure 1.3 Average estimated contour and marginal density plots from fitting 500 data sets with skewed random effects where the density was assumed to follow the SNP density with K selected by HQIC. The blue dotted lines on the marginal density plots are the true marginal density. The true bivariate random effects density is plotted in Figure 1.2 and described in Section Figure 1.4 Average estimated contour and marginal density plots from fitting 500 data sets with bimodal random effects where the density was assumed to follow the SNP density with K selected by HQIC. The blue dotted lines on the marginal density plots are the true marginal density. The true bivariate random effects density is plotted in Figure 1.2 and described in Section Figure 1.5 Contour plot of the bivariate density estimate and marginal density estimates for the subject-specific intercepts and slopes with K = 2 from model (1.10) concerning the IDEAL study Figure 2.1 Figure 2.2 Figure 2.3 Plot of ij (θ, F j) and Ṽ ij (θ, F j) as a function of R ij where Cij (θ, F j) = 500 and ɛ = 50. These are the smooth approximations to ij (θ, F j ) = I{ R ij (θ) < Cij (θ, F j)} and Ṽij(θ, F j ) = min{ R ij (θ), Cij (θ, F j)}, respectively Histogram of all LAS scores each time an organ was offered (left) and LAS scores at transplantation for those patients that were transplanted (right) Kaplan-Meier residual survival curves for patients from the time of listing on the waitlist and transplantation. We have stratified by LAS at listing and transplantation, respectively, for these curves. The waitlist survival was censored at the time the patient was transplanted xii

16 Figure 2.4 Figure 2.5 Figure C.1 Figure C.2 Figure I.1 Logarithm of the estimated scale acceleration factor in the transformation model for the causal effect of lung transplantation in Group A (primarily COPD, left) and Group D (primarily IPF, right) by LAS and transplantation type. The curves here are drawn for patients under 60 years of age, not in UNOS regions 6-8, and not paying for the transplant through private insurance or Medicare. We emphasize that negative (positive) values of the log scale factor suggest that transplantation increases (decreases) lifespan. The histogram in the background gives the number of patients transplanted for that range of LAS and native disease group Kaplan-Meier survival curves of the transformed residual lifetime for all (i, j) pairs by LAS group and native disease. The rows correspond to native disease groups A, C, and D from top to bottom. The solid blue line corresponds to all (i, j) pairs where dn ij = 1 and the red dotted line corresponds to all (i, j) pairs where dn ij = 0. If the transformation model has been correctly specified then the two lines should be approximately the same Estimated random effects density for each of the four random effect densities considered (clockwise from upper left: normal, t 5, bimodal, and skewed) when SNP was used to estimate the random effects and K was selected using HQIC. The estimated density is plotted for 100 randomly selected Monte Carlo data sets. The truth is superimposed in red, and the average of all 500 Monte Carlo data sets is superimposed in blue Histogram of the estimated variance of the random effect when SNP was used to estimate the random effects density and the true random effect density was t 5 (left). The estimated random effects densities for those data sets where the estimate of var(b 0i ) is greater than 4 are also plotted (right) Logarithm of rate parameter from the exponential distribution used to generate the residual lifetime had the patient never been transplanted xiii

17 Chapter 1 Mixed Model Analysis of Censored Longitudinal Data with Flexible Random Effects Density 1.1 Introduction Longitudinal or repeated measures data are commonly represented by mixed effects models. A complication occurs when the response is censored for some of the observations, which often arises when assay measures are collected over time and the assay procedure is subject to limits of quantification (Hughes, 1999; Jacqmin-Gadda et al., 2000; Wu, 2002; Gray et al., 2004; Wang and Fygenson, 2009, and references therein). As an example, we consider data from the Individual Dosing Efficacy versus Flat Dosing to Assess Optimal Pegylated Interferon Therapy (IDEAL) study, which tracked the viral load progression of treatment-naive patients with hepatitis C (Thompson et al., 2010). One objective was to characterize the viral load decline in patients receiving standard treatment, pegylated interferon-alpha and ribavirin, over the first twelve weeks of treatment. Within-subjects, the trajectory of the log 10 viral load over the first 12 weeks is approximately linear, but responses were censored from below at log 10 IU/mL, the lower limit of quantification of the assay used. Over the first 12 weeks, 10.5 percent of all observations were censored, and 35.2 percent were censored at week 12. The trajectories of 25 randomly selected subjects are shown in Figure 1.1. A standard analysis would impute the censored responses as the limit or half limit of quantification and then fit a mixed model for uncensored longitudinal data. However, Jacqmin-Gadda et al. (2000) showed through simulation studies that such crude parameter estimators are badly biased even for modest levels of censoring. A refined analysis would postulate a likelihood that 1

18 Time (weeks) log Viral Load (IU/mL) Figure 1.1: Trajectory of log 10 viral load of 25 randomly selected subjects with genotype CT from the IDEAL study. The lower limit of quantification, log 10 IU/mL, is shown in the green dotted line. For graphical purposes, the lower limit of quantification was imputed for the censored responses. 2

19 accounts for the censoring and obtain the maximum likelihood estimates (MLE). Substantial research has gone into efficient methods to solve for the MLE. Hughes (1999) developed a Monte Carlo Expectation-Maximization (MCEM) estimation procedure to obtain the MLE for linear mixed effects models with censored responses. Wu (2002) extended the work of Hughes by developing a EM algorithm for nonlinear mixed-effects models with censored responses that allowed for covariates to be measured with error. Wu (2004) further extended the EM algorithm to joint models of multiple longitudinal processes when the responses may be censored. Vaida and Liu (2009) derived a closed form expression for the expectation step rather than use Monte Carlo simulation, which greatly reduced the computational time. The estimation procedure developed by Vaida and Liu (2009) is available in the R package lmec. Other computational approaches have been proposed besides the EM algorithm. Fitzgerald et al. (2002) implemented a multiple imputation approach that avoids maximization of a censored likelihood. Jacqmin-Gadda et al. (2000) noted that the full likelihood is the product of the marginal density of the non-censored responses and the conditional distribution of the censored responses, evaluated at the limit of detection, given the non-censored responses. The authors used numerical integration to estimate the conditional distribution and then combined the Simplex algorithm and the Marquardt algorithm in the Fortran language to solve for the MLE. In contrast, both Thiébaut and Jacqmin-Gadda (2004) and Lyles et al. (2000) approximate the full likelihood of mixed effect model by numerically integrating over the random effects densities and then using standard optimizers to obtain parameter estimates. Previous methods cited above using this full-likelihood approach have all assumed Gaussian random effects and intra-subject error. While intra-subject error may be reasonably assumed to be normally distributed, the assumption of Gaussian random effects may be too restrictive in many applications. In the IDEAL study, some patients do not respond to treatment while others show marked declines in the viral load, so the distribution of subject-specific slopes is unlikely to be approximately Gaussian. For non-censored linear mixed effects models, the maximum likelihood estimators for the fixed effects and covariance components are consistent under broad regularity conditions even when the random effects distribution is misspecified (Verbeke and Lesaffre, 1997). For non-linear mixed effects models, which include mixed models with censored responses and generalized linear mixed models, the maximum likelihood estimators will not, in general, be consistent if the random effects distribution is misspecified. The effect of misspecification has been well-studied in generalized linear mixed models. Broadly speaking, researchers have noted that there is considerable bias and loss of efficiency in the maximum likelihood parameter estimators assuming Gaussian random effects when either the true random effects density deviates substantially from normality or the variance of the random effects is large and there is moderate misspecification of the random effects (Neuhaus et al., 1992; Hartford and Davidian, 2000; Heagerty and Kurland, 3

20 2001; Agresti et al., 2004; Litière et al., 2008). The potential effect of misspecification of the random effects distribution when the response is censored has not been examined. A variety of methods have been proposed to relax the Gaussian assumption for the random effects (Magder and Zeger, 1996; Verbeke and Lesaffre, 1997; Kleinman and Ibrahim, 1998; Aitkin, 1999; Zhang and Davidian, 2001; Chen et al., 2002; Lee and Thompson, 2008; Ho and Hu, 2008; Komarek and Lesaffre, 2008; Lin and Lee, 2008; Bandyopadhyay et al., 2010; Lachos et al., 2010); however, none of these methods has been implemented for censored longitudinal data. For linear mixed models, Zhang and Davidian (2001) assumed the random effects follow a smooth density that can be represented by the seminonparametric (SNP) formulation proposed by Gallant and Nychka (1987). They showed the log likelihood can be written in a closed form leading to straightforward estimation. We show how the SNP representation can be implemented for the random effects density in linear mixed models when the response is potentially censored. In Section 1.2, we formulate the censored-response mixed model with Gaussian random effects and show that incorrectly assuming the random effects are Gaussian leads to inconsistent estimators of model parameters. In addition, we suggest general scenarios where estimation is likely to be particularly poor. In Section 1.3, we discuss the censored-response linear mixed model, where the random effects are assumed to follow a smooth, continuous density and introduce the SNP representation. Section 1.4 describes how to fit the SNP model using SAS procedure NLMIXED, how to select the degree of flexility in the model, and how to choose starting values. Section 1.5 presents the results of several simulations. In Section 1.6, we illustrate the method by application to data from the IDEAL study. 1.2 Gaussian Linear Mixed Model with Censored Response For ease of presentation, we assume responses are potentially left-censored; developments are easily generalized to right- or interval-censored data. Let Y ij, i = 1,..., m and j = 1,..., n i be the j th response for subject i that would have been been observed had there been no censoring. Consider the usual linear mixed model for longitudinal data Y ij = v T ijδ + s T ijα i + ɛ ij = x T ijβ + s T ijb i + ɛ ij, (1.1) where v ij = (x T ij, st ij )T ; δ = (β T, γ T ); b i = γ + α i ; β and γ are the p- and q-dimensional fixed effects parameters associated with covariates x ij and s ij, respectively; α i are the q-dimensional, mean zero subject-specific random effect vectors associated with covariates s ij, independent across i; and ɛ ij iid N(0, σ 2 ) are the intra-subject error, which are independent of α i. We adopt the rightmost representation in (1.1) and note that the centered random effects b i can be written 4

21 as b i = µ + RZ i, where µ is a (q 1) vector, R is a (q q) lower triangular matrix, and Z i is a (q 1) vector of random effects. Typically, Z i are assumed to follow a standard q-dimensional normal distribution, which implies b i are normally distributed with mean µ = γ and covariance matrix var(α i ) = RR T. As an example, a special case of (1.1) is the linear random coefficient model with baseline covariate x i given by Y ij = x i β + b 0i + b 1i t ij + ɛ ij, (1.2) where b i = (b 0i, b 1i ) T and s ij = (1, t ij ) T. Due to left-censoring, we observe Q ij which takes the value Y ij for Y ij > l ij and takes the value l ij, the known lower limit of quantification for the j th response on subject i, otherwise. Letting r = vech(r) be the non-zero elements of R, the parameters of interest are θ = (β T, µ T, r T, σ 2 ) T, and the likelihood assuming Gaussian random effects is m L(θ; Q) = i=1 n i j=1 [ {Φ 1 (m ij )} I(Q ij=l ij ) { 1 σ φ 1(m ij ) } I(Q ij >l ij ) ] φ q (z i )dz i, (1.3) where m ij = {Q ij x T ij β st ij (µ + RZ i)}/σ, Q is the vector of all responses observed, and φ n ( ) and Φ n ( ) are the standard n-dimensional normal density and distribution. Alternatively, we can express (1.1) as Y i = V i δ + S i α i + ɛ i, where V i {n i (p + q)} and S i (n i q) are the matrices with rows vij T and st ij, respectively, and Y i = (Y i1,..., Y ini ) T and similarly for ɛ i, l i, and Q i. To simplify the following argument, we assume that the design matrix is fixed, V i = V, and n i = n for i = 1,..., m and also suppress the index i throughout. To express (1.3) using vector notation, let f Y (y; θ) and F Y (y; θ) be the multivariate normal density and distribution, respectively, with mean V δ and covariance matrix Σ = SRR T S T + σ 2 I n. Define f Y1 Y 2 (y 1 y 2 ; θ) and F Y1 Y 2 (y 1 y 2 ; θ) to be the conditional multivariate normal density and distribution, respectively, of the vector Y 1 given the vector Y 2. There are 2 n 2 patterns of censoring/non-censoring that could be observed for an individual with at least one censored and one non-censored observation. Index each of these 2 n 2 distinct censoring patterns by k. Let Y k,o (Y k,c, respectively) be the random vector containing elements of Y that would be observed (censored) under censoring pattern k. If the k th pattern were observed for a particular subject, then the likelihood contribution from that subject assuming normal random effects is given by F Yk,c Y k,o (l k,c Q k,o ; θ)f Yk,o (Q k,o ; θ), where Q k,o is the vector of non-censored responses and l k,c (l k,o, respectively) is the limit of quantification for the censored observations (non-censored observations) under censoring pat- 5

22 tern k. The likelihood contribution from one subject can now be written as L(θ, Q) = f Y (Q; θ) I(Q>l) F Y (l; θ) I(Q=l) {F Yk,c Y k,o (l k,c Q k,o ; θ)f Yk,o (Q k,o ; θ)} I(Q k,c=l k,c,q k,o >l k,o ). 2 n 2 k=1 If the random effects distribution is incorrectly specified, the maximum likelihood estimator ˆθ will converge to value of θ that solves E { / θ log L(θ; Q)} = 0, where the expectation is taken with respect to the true random effects distribution. Assuming that the covariance components are known and using a first-order Taylor series expansion, we show in Appendix A that ˆδ will converge to the value of δ that solves l 0 = D(δ, V )(δ δ) n 2 k=1 [ l k,o lk,c { } V T Σ 1 (q V δ GY (l) ) F Y (l) f Y (q) g Y (q) dq (1.4) V T k,c Σ 1 k,c {q k,c V k,c δ } { GYk,c Y (l } ] k,o k,c q k,o ) F Yk,c Y k,o (l k,c q k,o ) f Y k,c Y k,o (q k,c q k,o ) g Yk,c Y k,o (q k,c q k,o ) g Yk,o (q k,o )dq k,c dq k,o, where g Y (q; δ ) and G Y (q; δ ) are the true marginal density and distribution, respectively, of Y ; all densities in (1.4) are evaluated at δ, the true parameter value; and D(δ, V ), Q k,c, V and Σ k,c are given in the Appendix A. While complex, the approximation reveals important insights on the asymptotic bias when the random effects are incorrectly specified. In the likely case that the random effects distribution is misspecified, the bias is driven by the difference in the true conditional distribution of the censored response given the non-censored response (that is g Yk,c Y k,o (q k,c q k,o )) and the same conditional distribution induced by the Gaussian random effects (that is f Yk,c Y k,o (q k,c q k,o )), weighted by the likelihood of the non-censored response (that is g Yk,o (q k,o )). Practically, this suggests that small deviations from normality will lead to only slight asymptotic bias. If we have correctly specified the intra-subject error distribution, then g Yk,c Y k,o (y k,c y k,o ) will be approximately normal if the dimension of Y k,o is large regardless of the true random effects density. Thus, if there is a large probability of observing many non-censored observations on each subject, then the asymptotic bias is likely to be slight. This suggests that the overall percentage of censored responses is immaterial in determining the asymptotic bias. Instead the absolute number of non-censored responses for each subject is critical. k,c, 6

23 1.3 Semiparametric Linear Mixed Model with Censored Response We offer a summary of semiparametric linear mixed models and refer the reader to Zhang and Davidian (2001) for a complete description. We now assume that the distribution of Z i belongs to the smooth class of continuous densities described by Gallant and Nychka (1987). This class is sufficiently flexible to include skewed, thick- and thin-tailed, and multimodal distributions but does not include densities with jumps, kinks, or oscillations. These densities can be represented as an infinite series but can be approximated by a truncated series. The densities that are part of the truncated series are referred to as seminonparametric or SNP. We thus assume that the density of Z i can be approximated by the SNP representation with degree of truncation K given by h K (z) = PK(z)φ 2 q (z) = (j j q) K a j1,...,j q (z j 1 1 zjq q ) 2 φ q (z), (1.5) where j l 0 for l = 1,..., q and K is the order of the polynomial P K (z). For example, with K = 2 and q = 2, P K (z) = a 00 + a 10 z 1 + a 01 z 2 + a 20 z1 2 + a 11z 1 z 2 + a 02 z2 2. When K = 0, we show below that a 0,...,0 must equal 1, and the density of Z i is a standard q-dimensional normal. K controls the degree of flexibility of the density h K (z); we discuss in Section 1.4 how to select K. The coefficients a j1,...,j q must be chosen so that h K (z) integrates to one. The constraint hk (z)dz = 1 is equivalent to imposing E{PK 2 (U)} = 1, where U follows a standard q- dimensional normal distribution (Zhang and Davidian, 2001). Let a be the d-dimensional vector of coefficients for the polynomial P K (z) and let j 1,..., j q be the subscripts corresponding to the j th element. Then the above constraint { can be rewritten as E{PK 2 } (U)} = at Aa = 1, where A is the matrix with (j, k) element equal to E(U j 1+k 1 1 ) E(U jq+kq 1 ) and U 1 is distributed as a standard normal. Rather than impose a constraint on a, we may rewrite the (d 1) vector a as a function of d 1 parameters. Because A must be positive definite, there exists a matrix B such that A = B 2. If we let c = Ba, then a T Aa = 1 implies c T c = 1. As c and c will result in the same density h K (z), c = (c 1,..., c d ) T must lie in the half-unit sphere of R d. Therefore, we can express c in terms of a polar coordinate transformation, where c 1 = sin(ξ 1 ), c 2 = cos(ξ 1 ) sin(ξ 2 ),..., c d 1 = cos(ξ 1 )... cos(ξ d 2 ) sin(ξ d 1 ),c d = cos(ξ 1 )... cos(ξ d 1 ), π/2 ξ j π/2 for j = 1,..., d 1, and ξ = (ξ 1,..., ξ d 1 ) T. The vector a can be written as B 1 c, which only involves d 1 parameters. The parameters of interest are now θ K SNP = (βt, µ T, r T, σ 2, ξ T ) T, which includes an additional d 1 parameters compared to when normality is assumed. 7

24 Note that the SNP density does not impose E(Z i ) = 0, so that E(b i ) = γ = µ + R{E(Z i )} and Var(b i ) = R{Var(Z i )}R T. The moments of Z i are linear combinations of the moments of a standard normal density, which can be found using recursion formulas. 1.4 Estimation Using SAS NLMIXED In the case where there is no censoring of the response variable, the log likelihood assuming the SNP representation of the random effects can be expressed in a closed form (Zhang and Davidian, 2001). Censoring necessitates numerical integration. For a fixed K, the likelihood of θ K SNP is given by m L(θSNP K ; Q) = i=1 n i j=1 [ {Φ 1 (m ij )} I(Q ij=l ij ) { 1 σ φ 1(m ij ) } I(Q ij >l ij ) ] P 2 K(z i )φ q (z i )dz i (1.6) For each observation i, this requires evaluation of a q-dimensional integral, which, in practice, would likely only be one- or two-dimensional. In contrast, we could integrate over the marginal density of Y i for each censored observation as Jacqmin-Gadda et al. (2000) proposed. However, the dimension of that integral would equal the number of censored observations for subject i, which could be quite large and computationally intractable. Even though the integral in (1.6) is typically only one- or two-dimensional, approximating the integral can be difficult for complex random effects densities such as the SNP density. More generally, we need to be able to approximate the following integral, L i (θ) = n i j=1 { } f Qij Z i (q ij z i ; β, µ, r, x ij ) g Zi (z i )dz i. (1.7) where f Qij Z i (q ij z i ; β, µ, r, x ij ) and g Zi (z i ) are the density of the response conditioned on the random effects and random effects density, respectively. In this application, and g Zi (z i ) = P 2 K (z i)φ q (z i ). f Qij Z i (q ij z i ; β, µ, r, x ij ) = {Φ 1 (m ij )} I(Q ij=l ij ) { 1 σ φ 1(m ij ) } I(Q ij >l ij ) Many of the standard numerical integration techniques perform quite poorly when the random effects density is complex. For example, adaptive Gaussian quadrature requires maximizing the integrand in (1.7) for each of the m subjects at each updated estimate of ˆβ, ˆµ, and ˆr. The maximizations can be too time consuming if g Zi is sufficiently complex as is the case for the SNP density. Traditional non-adaptive Gaussian quadrature centers the quadrature points at the mean of the random effects and scales them based on the variance of the random effects. 8

25 The mode of the integrand in (1.7) may not be near the mean of the random effects and a large number of quadrature points may be needed to achieve a reasonable approximation to the integral. For q 2, the number of quadrature points needed is usually too large to be computationally feasible. Other approximation techniques, such as the Laplace approximation (Wolfinger, 1993), first-order approximation (Beal and Sheiner, 1988), and importance sampling (Pinheiro and Bates, 1995) are often quite poor or infeasible if g bi is non-gaussian. Even though adaptive Gaussian quadrature is computationally infeasible, it is instructive to examine the underpinnings of the integral approximation (Pinheiro and Bates, 1995). Let Ẑi be the empirical Bayes estimate of Z i given the current estimates of β, µ and r, defined as ˆβ (k), ˆµ (k), and ˆr (k), respectively, and G i be the estimated covariance matrix of Ẑi. That is Ẑ i = arg max zi n i j=1 { f Qij Z i (q ij z i ; ˆβ } (k), ˆµ (k), ˆr (k), x ij ) g Zi (z i ) We can write Z i = Ẑi + G i 1/2 U i for the random variable U i and then perform a change of variable to have the integral in (1.7) taken with respect to U i : L i (θ) = n i G 1/2 i j=1 { } f Qij Z i (q ij Ẑi + G 1/2 i u i ; β, µ, r, x ij ) g Zi (Ẑi + G 1/2 i u i )du i. SAS procedure NLMIXED then approximates the integral using standard Gaussian quadrature with quadrature points based on the standard normal kernel. Using the approximation to (1.7) described above, we can then maximize (1.6) with respect to β, µ, and r to obtain ˆβ (k+1), ˆµ (k+1), and ˆr (k+1), respectively. Typically one would iterate, until convergence, between finding the empirical Bayes estimates to re-approximate (1.7), and then updating β, µ, and r. This requires m maximizations at each iteration because the integral in (1.7) is re-centered and re-scaled about the updated empirical Bayes estimates at each iteration. Heuristically, we would like to center and scale the integral one time at something reasonably close to Z i to avoid m maximizations at each iteration. We propose to center and scale the quadrature points at the empirical Bayes estimates and their estimated variance from assuming Gaussian random effects. Here we need to be careful as the Z i are assumed to have mean zero when Gaussian random effects are assumed, but this need not be the case for a flexible random effects density, such as the SNP density. Let ˆb i,g be the empirical Bayes estimate of b i, the centered random effect, from assuming Gaussian random effects and G i,g the estimated variance of the empirical Bayes estimates. 9

26 Then we have µ + RZ i = ˆb i,g + G 1/2 i,g U i Z i = R 1 (ˆb i,g µ) + R 1 G 1/2 i,g U i, which we can substitute into (1.7) and take the integral with respect to U i : L i (θ) = n i R 1 G 1/2 i,g j=1 [f Qij Z i { q ij R 1 (ˆb i,g µ) + R 1 G 1/2 i,g u i; β, µ, r, x ij }] (1.8) { } g Zi R 1 (ˆb i,g µ) + R 1 G 1/2 i,g u i; η du i. We can now use standard Gaussian quadrature with quadrature points based on the standard normal kernel to approximate (1.8). Because SAS procedure NLMIXED does not allow the user to center the quadrature points at arbitrary values, to force NLMIXED to use the integration strategy above we must use programming statements to center the integration as in (1.8). In addition to forcing NLMIXED to center the quadrature points at arbitrary values, we must also trick NLMIXED to allow non-gaussian random effects. The SAS procedure NLMIXED has been developed to obtain maximum likelihood estimates for mixed models with Gaussian random effects. Except when K = 0, the SNP random effects density will not be normally distributed. However, if we consider n i j=1 [ {Φ 1 (m ij )} I(Q ij=l ij ) { 1 σ φ 1(m ij ) } I(Q ij >l ij ) ] P 2 K(z i ) (1.9) to be the likelihood for Q ij conditioned on the random effects, then the random effects Z i can be thought to follow a standard q-dimensional normal distribution (Liu and Yu, 2008). Example code to implement SNP for left-censored mixed models is given in Appendix B. Among the optimization routines available in NLMIXED, dual-quasi Newton optimization works well. The inverse Hessian matrix may be used to obtain standard errors for the parameter estimates and is computed as part of the standard output in NLMIXED. The preceding discussion assumes a fixed K. We treat K as a tuning parameter and select K by visually comparing the estimated densities and computing information criteria evaluated at the estimates, ˆθ SNP K, of θk SNP from model fits for different values of K (Zhang and Davidian, 2001). The information criteria are of the form 2L(ˆθ SNP K, Q) + C(m)p where p is the number of parameters in the model. For the Akaike information criterion (AIC), C(m) = 2; Hannan- Quinn information criterion (HQIC), C(m) = 2 log log m; and Bayesian information criterion 10

Mixed model analysis of censored longitudinal data with flexible random-effects density

Biostatistics (2012), 13, 1, pp. 61 73 doi:10.1093/biostatistics/kxr026 Advance Access publication on September 13, 2011 Mixed model analysis of censored longitudinal data with flexible random-effects