A weighted simulation-based estimator for incomplete longitudinal data models

Size: px

Start display at page:

Download "A weighted simulation-based estimator for incomplete longitudinal data models"

Antonia Snow
5 years ago
Views:

1 To appear in Statistics and Probability Letters, 113 (2016), doi /j.spl A weighted simulation-based estimator for incomplete longitudinal data models Daniel H. Li 1 and Liqun Wang 2 1 Juno Therapeutics, Seattle, Washington, USA 2 University of Manitoba, Winnipeg, Canada Final Version, February 2016 Abstract Recently, Li and Wang (2012a, b) and Wang (2007) have proposed a simulation-based estimator for generalized linear and nonlinear mixed models with complete longitudinal data. This estimator is constructed using the simulation-by-parts technique which leads to the unique feature that it is consistent even using finite number of simulated random points. This paper extends the methodology to deal with incomplete longitudinal data by applying the inverse probability weighting method for the monotone missing-at-random response data. The finite sample performance of this estimator is investigated through simulation studies and compared with the multiple imputation approach. Keywords: Simulation-based estimator. Generalized linear mixed model; Inverse probability weighting; Missing data; 1 Introduction In biomedical, environmental and social sciences research, longitudinal data analysis is widely used and constitutes the fundamental statistical research methodologies. Generalized linear mixed models (GLMM) have been widely used in the modeling of longitudinal data. Li and Wang (2012a) proposed a simulation-based estimator (SBE) for GLMM based on the first two marginal moments of the response variables, which does not rely on the normality distribution assumption for random effects. Li and Wang (2012b) extended the SBE to the GLMM where some covariates are Corresponding author, Department of Statistics, University of Manitoba, 186 Dysart Road, Winnipeg, Manitoba, Canada R3T 2N2. liqun.wang@umanitoba.ca 1

2 Li and Wang (2016): A weighted simulation-based estimator 2 measured with error. This approach was originally studied by Wang (2007) for nonlinear mixed effects models. The SBE is constructed using a novel simulation-by-parts technique to ensure its consistency by using finite number of simulated random points. This is the key difference from many other simulation-based estimators proposed in the literature, where they require the number of simulated random points go to infinity to achieve consistency. So far, the SBE is only studied under complete data settings although incomplete or missing data are common in longitudinal studies. For example, in clinical trials, missing data are almost inevitable because subjects may decide to withdraw from the study at anytime prior to completion or subjects are not compliant to protocol for scheduled assessments. Problems arise if the mechanism leading to the missing data depends on the response process. It is known that ignoring missing data and using naive methods may introduce bias, reduce the power of inference and lead to misleading conclusions (Little and Rubin, 2002). The extension of the SBE to account for incomplete longitudinal data is non-trivial and needs to be addressed to allow this estimator used in more general settings. In this paper, we discuss the validity of SBE under different missing data mechanism and modify it for the data missing at random (MAR) with monotone missingness through the inverse probability weighting (IPW) method. The IPW is a general methodology for constructing parameter estimators in semi-parametric models with complete as well as missing data (Robins, et al., 1995; Yi and Cook, 2002). Another popular approach to deal with missing data is the multiple imputation (Rubin, 1987; Schafer, 1997). We also investigate the performance of the SBE using this strategy. The structure of the paper is as follows. In Section 2, we introduce and review the missing data mechanism, pattern and estimation. Section 3 provides details on the proposed weighted simulationbased estimator, and addresses some practical computational issues. In particular, Section 3.2 handles the data missing completely at random (MCAR), while Section 3.3 focuses on the data MAR. Some simulation studies are conducted in Section 4 to examine the finite sample performance of the proposed estimator, and concluding remarks are given in section 5. 2 Missing Data Framework and Notation 2.1 Missing data mechanism To obtain valid inferences, it is essential to consider the reason for missingness. Let Y ij be the j th response for the i th subject, R i = (r i1, r i2,, r in ) be the vector of missing data indicators for Y i = (y i1,, y in ), such that r ij = 1 if response y ij is observed, and 0 otherwise. We partition Y i into Yi O and Yi M, where Yi O contains those y ij for which r ij = 1 and Yi M contains the remaining components. Assuming X i = (x i1,, x in ) to be a vector of covariates always observed, Little and Rubin (2002) classified missing data mechanism into three types: (1) MCAR, where the missingness is unrelated to the responses so that P (R i Y i, X i ) = P (R i X i ). (2) MAR, where the missingness depends only on the observed responses so that P (R i Y i, X i ) = P (R i Yi O, X i ). This is a weaker

3 Li and Wang (2016): A weighted simulation-based estimator 3 and more plausible assumption than MCAR. (3) MNAR, where the missingness depends on both observed and unobserved responses. 2.2 Missing data patterns There are two broad classes of missing data patterns: intermittent missing and dropout. Intermittent missing pattern refers to the scenario that a subject completes the study but skips a few occasions in the middle of the study period. Dropout (attrition, lost of follow-up) is a particular example of monotone pattern of missingness, which means if one observation is missing, then all subsequent observations are unobserved. Intermittent missing is often easier to deal with because the subject is still participating in the study and the reason of missing values can be ascertained. Dropout is more serious because the subject is no longer available and it is not certain whether the dropout is related to the observed or unobserved outcome. MAR mechanisms are commonly assumed when the interest lies on the parameter estimation (Robins et al., 1995: Lindsey, 2000). 2.3 Estimation of missing data process Let λ ij = P (r ij = 1 r i,j 1 = 1, X i, Yi O ) be the conditional probability that subject i is observed at time j, given that the subject is present at time j 1; and π ij = P (r ij = 1 X i, Yi O ) be the marginal probability that subject i is present at time j. Then π ij = j t=2 λ it. Generally it is assumed that all individuals are observed on the first occasion so that r i1 = λ i1 = 1. Further, let π ijk = P (r ij = 1, r ik = 1 X i, Yi O ) be the probability of observing both y ij and y ik given the response history and covariates. Usually λ ij is estimated using a logistic regression model logitλ ij = A ij α, where A ij is a vector consisting of information on X i and response history, and α is the vector of parameters (Diggle and Kenward, 1994; Fitzmaurice etal., 1996; Molenberghs etal., 1997; Yi and Cook 2002). 3 Weighted Simulation-based Estimator 3.1 GLMM formulation Suppose subject i is measured repeatedly on n i occasions. In a GLMM it is assumed that, given the covariates and random effects b i IR q, the responses y ij are conditionally independent and have distribution from the exponential family { } ωij y ij a(ω ij ) f(y ij b i, X i, Z i ) = exp + c(y ij, φ), i = 1,, N, j = 1,, n i, φ (3.1) where φ is a dispersion parameter, ω ij is the canonical parameter and a( ) and c( ) are known functions. The conditional mean and variance µ c ij = E(y ij b i, X i, Z i ) = a (1) (ω ij ) (3.2) vij c = V ar(y ij b i, X i, Z i ) = φa (2) (ω ij ) (3.3)

4 Li and Wang (2016): A weighted simulation-based estimator 4 satisfy g 1 {µ c ij } = x ij β + z ij b i and vij c = φν(µc ij ), where a(d) denotes the d th derivative against ω ij, g 1 ( ) and ν( ) are known link and variance functions, respectively. The random effects are assumed to have mean zero and distribution f b (u; θ) with unknown parameters θ IR r. Our approach is motivated by the fact that all the model parameters of interest can be identified and consistently estimated using the first two marginal moments E(y ij X i, Z i ) = g(x ijβ + z iju)f b (u; θ)du, (3.4) E(y ij y ik X i, Z i ) = g(x ijβ + z iju)g(x ik β + z ik u)f b(u; θ)du + δ jk φ ν(g(x ijβ + z iju))f b (u; θ)du j k, (3.5) where δ jk = 1 if j = k and 0 otherwise. For example, in a linear model y ij = x ij β + z ij b i + ɛ ij with b i N(0, D(θ)), the first two marginal moments are E(y ij X i, Z i ) = x ij β and E(y ijy ik X i, Z i ) = (x ij β)(x ik β) + z ij D(θ)z ik + δ jk φ. Another example is a logistic model logitp (y ij = 1 b i ) = x ij β + z ij b i, where the moments are given in (3.4)-(3.5) with logit link function g(w) = (1 + e w ) 1. In general the integrals on the right-hand sides of (3.4)-(3.5) are intractable but can be approximated using the Monte Carlo simulation techniques such as importance sampling. 3.2 Simulation-based estimator for the data MCAR Li and Wang (2012a, b) and Wang (2007) used a simulation-by-parts technique to construct two sets of simulated moments which are unbiased estimates of the true moments. First, a known density h(u) is chosen such that its support covers that of the integrands in (3.4)-(3.5). Second, a set of random points u is, s = 1, 2,..., 2S are generated from h(u) which are used to construct the simulated moments as µ ij,1 (ψ) = 1 S η ijk,1 (ψ) = 1 S S g(x ij β + z ij u is)f b (u is ; θ), (3.6) h(u is ) s=1 S g(x ij β + z ij u is)g(x ik β + z ik u is)f b (u is ; θ) + h(u is ) s=1 δ jk φ S S ν(g(x ij β + z ij u is))f b (u is ; θ) s=1 h(u is ) and µ ij,2 (ψ) and η ijk,2 (ψ) are constructed similarly using the second half of the points u is, s = S + 1, S + 2,..., 2S. Finally the SBE for ψ = (β, θ, φ) is obtained by minimizing (3.7) Q N,S (ψ) = N ρ i,1(ψ)w i ρ i,2 (ψ) (3.8) i=1 within a compact parameter space Ψ, where ρ i,t (ψ) = (y ij µ ij,t (ψ), 1 j n i, y ij y ik η ijk,t (ψ), 1 j k n i ), t = 1, 2, and W i = W (X i, Z i ) is a nonnegative definite weight matrix. As is shown in

5 Li and Wang (2016): A weighted simulation-based estimator 5 Li and Wang (2012a, b) that in the case of complete data the SBE is consistent and asymptotically normal as N for any finite S. Their simulation studies and real data applications have also shown that the SBE works well in the finite sample situations with moderately large S. Now for the case of data MCAR, we define the SBE ˆψ N,S as the solution of the score equation N i=1 ρ i,1 (ψ) W i i ρ i,2 (ψ) = 0, (3.9) where i = diag(r ij, 1 i n i, r ij r ik, 1 j k n i ). It is easy to see that (3.9) is an unbiased estimating equation because under MCAR i does not depend on Y i and therefore [ ρ ] [ ρ ] E W i i ρ i,2 (ψ 0 ) = E W i i E(ρ i,2 (ψ 0 ) X i, Z i ) = 0, where ψ 0 = (β 0, θ 0, φ 0) denotes the true parameter value. Moreover, following Li and Wang (2012a) it is straightforward to show that for a finite S, N( ˆψ N,S ψ 0 ) L N(0, B 1 CB 1 ), where [ ρ ] ρ i,2 (ψ 0 ) B = E W i i (3.10) and [ ρ ] C = E W i i ρ ρ i,2(ψ 0 )ρ i,2 (ψ 0 ) i W i. (3.11) An approximately optimal choice of W i as derived in Li and Wang (2012a) is given by A( ˆψ N1 ) = 1 N N ρ i,1 ( ˆψ N1 ) i ρ i,2( ˆψ N1 ), (3.12) i=1 where ˆψ N1 is an initial consistent estimator of ψ. 3.3 Weighted simulation-based estimator for the data MAR Now we modify the SBE to handle the data MAR by using the IPW method. The idea is to weight each subject s contribution in the estimation by the inverse probability that the subject drops out at the time of dropping out (Robins et al., 1995). The weights are obtained based on models for the missing data process as specified in section 2.3. Specifically, let i = diag(r ij /π ij, 1 j n i, r ij r ik /π ijk, 1 j k n i ) be the weight matrix accommodating missingness. Then we define the weighted SBE (WSBE) ψ N,S as the solution of N i=1 ρ i,1 (ψ) W i i ρ i,2 (ψ) = 0. (3.13) This is an unbiased estimating equation because under MAR E[ i X i, Z i, Y i ] is an identity matrix and therefore by the law of iterated expectation we have [ ρ ] [ ρ E W i i ρ i,2 (ψ 0 ) = E [ ρ = E ] W i E[ i X i, Z i, Y i ]ρ i,2 (ψ 0 ) ] W i ρ i,2 (ψ 0 ) = 0,

6 Li and Wang (2016): A weighted simulation-based estimator 6 It follows that ψ N,S is consistent and asymptotically normally distributed with asymptotic covariance matrix B 1 CB 1, where B and C are given by (3.10) and (3.11) respectively with i replaced by i. Similarly, the approximately optimal weight Ãi is calculated as in (3.12) with i replaced by i. For the computation of A i or Ãi, the moment estimator described in Li and Wang (2012a) needs to be modified because the length of ρ i (ψ) is different across subjects. The second-order marginal moments can be calculated using (3.7), and the third- and fourth-order moments can be calculated using the same simulation method by constructing the conditional moments first. For all j, k, l, t, cov(y ij, y ik y il b i, X i, Z i ) = E(y ij y ik y il b i, X i, Z i ) E(y ij b i, X i, Z i )E(y ik y il b i, X i, Z i ), and cov(y ij y ik, y il y it b i, X i, Z i ) = E(y ij y ik y il y it b i, X i, Z i ) E(y ij y ik b i, X i, Z i )E(y il y it b i, X i, Z i ). Alternatively, one can adopt the idea of working variance matrix (Prentice and Zhao, 1991; Vonesh et al., 2002) to construct Ãi. For example, assuming y i is multivariate normal, then cov(y ij, y ik y il b i, X i, Z i ) = µ c il σ ijk + µ c ik σ ijl, and cov(y ij y ik, y il y it b i, X i, Z i ) = σ ijl σ ikt + σ ijt σ ikl + µ c ik µc il σ ijt + µ c ij µc il σ ikt + µ c ik µc it σ ijl + µ c ij µ itσikl c, where σ ijk = E[(y ij u ij )(y ik u ik ) b i, X i, Z i ]. Thus, both third and fourth moments can be obtained through the first two moments. We can also assume independence among the elements of y i, in which case the third and fourth moments of y i are respectively given by E[(y ij u c ij )3 ] + 2µ c ij σ ijj 2(µ c ij )3 if j = k = l, σ ijj µ c ik if j = l k, cov(y ij, y ik y il b i, X i, Z i ) = σ ijj µ c il if j = k l, 0 otherwise and E[yij 4 ] (µc ij )2 σ ijj if j = k = l = t, E[(y ij u c ij cov(y ij y ik, y il y it b i, X i, Z i ) = )3 ]µ it + 2µ c ij µc it σ ijj if j = k = l t, E[(y ij u ij ) 3 ]µ c il + 2µc ij µc il σ ijj if j = k = t l, 0 otherwise. If we further assume that the distribution of y i is symmetric, then we have E[(y ij u c ij )3 ] = 0. 4 Monte Carlo Simulation Studies In this section we conduct simulation studies to assess the finite sample performance of the proposed WSBE under the MCAR and MAR scenarios with various amount of missing data. We consider two models for two types of response variables. In particular, we consider a linear mixed model for the continuous response y ij = β 0 + β 1 x ij + b i + ɛ ij with ɛ ij N(0, 1), and a mixed Poisson model for the count data log E(y ij b i ) = β 0 + β 1 x ij + b i. In both models we set β = (1, 1) and b i N(0, θ) with θ = The covariate x ij is generated from normal distribution N(1, 1) in the linear model and N(0.5, 1) in the Poisson model. The missing indicator r ij is generated from the logistic model logitλ ij = α 0 + α 1 y i,j 1, with α = (2, 0), (2, 0.5), (2, 1) for the linear model and α = (3, 0), (0.5, 0.1), (0.5, 0.5) for the Poisson model respectively. Note that α 1 = 0 represents the

7 Li and Wang (2016): A weighted simulation-based estimator 7 scenario of data MCAR, while α 1 0 represents data MAR. For a given α 0, the smaller α 1 results in higher percentage of missing data. The combined choice of α 0 and α 1 leads to about 10-40% drop-out, spread over time points 2-4. Therefore, these parameter setups not only lead to different missing data mechanism but also different percentage of missing data. For comparisons, we also calculate the naive SBE that ignores the missing data and the SBE based on the multiple imputed data. The multiple imputation is done using the R package MICE with predictive mean matching method and iteration time as the seed for random number generation (Horton and Lipsitz 2001). Further, we set the number of multiple imputations to be 5 which is generally sufficient to yield efficient results (Rubin, 1987). The sample sizes are N = 50, 100, 200, 500 and the number of observations per subject is n i = 4. In each simulation we generate M = 500 datasets and report the average biases (1/M) M i=1 ˆψ i ψ 0 and the root mean square errors (RMSE) ( M i=1 ( ˆψ i ψ 0 ) 2 /M) 1/2. Table 1 and 2 contain the numerical results for two models respectively. These results show that in the case of MCAR (α 1 = 0), the SBE and WSBE perform similarly, which is consistent with theory. However, in linear model the MI-SBE has larger bias and RMSE than the other two estimators except for the variance parameter θ due to the relatively high percentage of missing data. When we repeat the simulation with lower percentage of missing data, the MI-SBE performs actually slightly better than the SBE and WSBE. Moreover, the RMSE of all estimators reduce when the sample size increases. In the case of MAR (α 1 0), the WSBE has smaller bias and RMSE than the SBE for variance parameters. For the regression parameters, the WSBE performs similarly as the SBE for small sample sizes, but improves fast and is significantly better for large sample sizes. In general, MI-SBE clearly outperforms the WSBE and SBE, especially in the Poisson model. This is not surprising since it is documented in the literature that MI is generally more efficient than the IPW method (Robins et al., 1995). More general discussions about the IPW and MI methods can be found in e.g., Carpenter et al. (2006) and Seaman et al. (2012). To improve efficiency, one may consider applying augmented inverse probability weight method (Robins et al., 1995). Furthermore, we notice that the numerical computation of the MI-SBE is more stable. We have repeated our simulations with various values of (α 0, α 1 ) and observed similar patterns as discussed above. 5 Concluding remarks Incomplete longitudinal data are common in practical applications. For a valid analysis, a study of the missing mechanism is necessary. Although comprehensive theoretical work and application of the SBE for GLMM were discussed in Li and Wang (2012a, b), there is still a strong need to examine this approach when missing data are present. In this paper, we show that the SBE based on observed data is only valid for data MCAR, and hence we adopt the inverse probability weighting method to construct the WSBE for data MAR. We also investigate the performance of

8 Li and Wang (2016): A weighted simulation-based estimator 8 SBE for incomplete longitudinal data by the means of multiple imputation. Our simulation studies demonstrate that the proposed WSBE is feasible to compute, performs well under finite sample sizes, and is comparable to the multiple imputation approach in many cases. Furthermore, this paper suggests a few ways to compute the optimal weight matrix under the incomplete longitudinal data setting. Since the weight matrix contains the third and fourth moments, the computation can be cumbersome even using simulated moments. In our experience, diagonal weight works quite well and can reduce the computational burden substantially. The proposed WSBE is formulated under the GLMM framework, however, it can be easily extended to nonlinear mixed effects models. In principle, the WSBE and MI-SBE can also be used for data with intermittent missing pattern or longitudinal data with unequally spaced repeated measures. Missing data and measurement error often arise simultaneously in a real world problem, so it would be valuable to develop the proposed methodology to cope with these situations. Another future research is to further extend the SBE to deal with the non-ignorable missing data problems. Acknowledgments The authors are grateful to the editor and two anonymous referees for their valuable comments and suggestions that helped to improve this paper. The research was supported by grant RGPIN/ from the Natural Sciences and Engineering Research Council of Canada (NSERC). References [1] Carpenter, J.R., Kenward, M.G., Vansteelandt, S. (2006). A comparison of multiple imputation and doubly robust estimation for analyses with missing data. Journal of the Royal Statistical Society, Series A, 169, [2] Diggle, P. and Kenward, M. G. (1994). Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics, 43, [3] Fitzmaurice, G.M., Laird, N.M. and Zahner, G.E.P. (1996). Multivariate logistic models for incomplete binary responses. Journal of the American Statistical Association, 91, [4] Horton, N. J. and Lipsitz, S. R. (2001). Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician, 55, [5] Li, H. and Wang, L. (2012a) A consistent simulation-based estimator in generalized linear mixed models. Journal of Statistical Computation and Simulation, 82, [6] Li, H. and Wang, L. (2012b) Consistent estimation in generalized linear mixed models with measurement error. Journal of Biometrics and Biostatistics, S7:007. DOI / S7-007.

9 Li and Wang (2016): A weighted simulation-based estimator 9 [7] Lindsey, J. K. (2000). Dropouts in longitudinal studies: definitions and models. Journal of Biopharmaceutical Statistics, 10, [8] Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, Second edition, John Wiley & Sons, Hoboken, N.J.. [9] Molenberghs, G., Kenward, M.G. and Lesaffre, E. (1997). The analysis of longitudinal ordinal data with nonrandom drop-out. Biometrika, 84, [10] Prentice, R.L. and Zhao, L.P. (1991). Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics, 47, [11] Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90, [12] Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley. [13] Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data, New York: Chapman & Hall. [14] Seaman, S.R., White, I.R., Copas, A.J., and Li L. (2012). Combining multiple imputation and inverse-probability weighting. Biometrics, 68, [15] Vonesh, E. F., Wang, H., Nie L. and Majumdar, D. (2002). Conditional second-order generalized estimating equations for generalized linear and nonlinear mixed effects models. Journal of the American Statistical Association, 97, [16] Wang, L. (2007). A unified approach to estimation of nonlinear mixed effects and Berkson measurement error models. The Canadian Journal of Statistics, 35, [17] Yi, G. Y. and Cook, R. J. (2002). Marginal methods for incomplete longitudinal data arising in clusters. Journal of the American Statistical Association, 97,

10 Li and Wang (2016): A weighted simulation-based estimator 10 Table 1: Simulation results for the linear regression model Missingness N SBE WSBE MI-SBE Bias RMSE Bias RMSE Bias RMSE (2,0) 50 β β θ φ β β θ φ β β θ φ β β θ φ (2,0.5) 50 β β θ φ β β θ φ β β θ φ β β θ φ (2,1) 50 β β θ φ β β θ φ β β θ φ β β θ φ

11 Li and Wang (2016): A weighted simulation-based estimator 11 Table 2: Simulation results for the Poisson regression model Missingness N SBE WSBE MI-SBE Bias RMSE Bias RMSE Bias RMSE (3,0) 50 β β θ β β θ β β θ β β θ (0.5,0.1) 50 β β θ β β θ β β θ β β θ (0.5,0.5) 50 β β θ β β θ β β θ β β θ

Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs

Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs Michael J. Daniels and Chenguang Wang Jan. 18, 2009 First, we would like to thank Joe and Geert for a carefully