Comparison of multiple imputation methods for systematically and sporadically missing multilevel data
V. Audigier, I. White, S. Jolani, T. Debray, M. Quartagno, J. Carpenter, S. van Buuren, M. Resche-Rigon
INSERM, UMR 1153, ECSTRA team, Saint-Louis Hospital, Paris
MODAL Seminar, November 22nd, Lille

Motivation: GREAT data (Great Network, 2013)
Risk factors associated with short-term mortality in acute heart failure
- 28 observational cohorts, 11685 patients, 2 binary and 8 continuous variables (patient characteristics and potential risk factors)
- sporadically and systematically missing data
[Figure: data matrix with columns $X_1, X_2, X_3, \ldots, X_p, Y$ (LVEF); variables gender, bmi, age, SBP, DBP, HR, bnpl, AFib, LVEF against observation index]
Aim: explain the relationship between biomarkers (BNP, AFIB, ...) and the left ventricular ejection fraction (LVEF)
$y_{ik} = x_{ik}\beta + z_{ik} b_k + \varepsilon_{ik}$, with $b_k \sim N(0, \Psi)$ and $\varepsilon_{ik} \sim N(0, \sigma^2)$
Quantities of interest: $\hat\beta$ and the associated variability $\mathrm{var}(\hat\beta)$

Methods to handle missing values
Missing data are often assumed to be missing at random (MAR)
Ad hoc methods
- Complete-case analysis
  - generally leads to biased estimates
  - increases standard errors
- Single imputation
  - leads to unbiased estimates
  - standard errors are biased downward

Methods to handle missing values
Relevant methods
- Likelihood approaches (frequentist framework: EM algorithm; Bayesian framework: data augmentation)
  - lead to unbiased estimates
  - specific to the analysis model
  - not always feasible
- Multiple imputation
  - leads to unbiased estimates
  - can be used for several analysis models

Multiple imputation (Rubin, 1987)
1. Generate a set of M parameters $(\theta_m)_{1 \le m \le M}$ of an imputation model to generate M plausible imputed data sets, drawn from $P(X^{miss} \mid X^{obs}, \theta_1), \ldots, P(X^{miss} \mid X^{obs}, \theta_M)$
2. Fit the analysis model on each imputed data set: $\hat\beta_m$, $\mathrm{Var}(\hat\beta_m)$
3. Combine the results:
   $\hat\beta = \frac{1}{M} \sum_{m=1}^{M} \hat\beta_m$
   $T = \frac{1}{M} \sum_{m=1}^{M} \mathrm{Var}(\hat\beta_m) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat\beta_m - \hat\beta\right)^2$
Provides estimates of the parameters and of their variability

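
The combination step (step 3) can be sketched in a few lines of Python for a scalar parameter. This is only an illustration: the function name `pool_rubin` and the toy numbers are assumptions, not taken from the talk.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's rules (scalar case)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    pooled = estimates.mean()            # beta_hat: mean of the M estimates
    within = variances.mean()            # average within-imputation variance
    between = estimates.var(ddof=1)      # between-imputation variance, divisor M - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Hypothetical results from M = 5 imputed data sets
beta_hat, total_var = pool_rubin([0.10, 0.12, 0.11, 0.09, 0.13],
                                 [0.004, 0.005, 0.004, 0.005, 0.004])
print(beta_hat, total_var)
```

The between-imputation term is what single imputation ignores, which is why its standard errors are too small.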
MI for multilevel data
Two standard ways to perform MI
- Fully conditional specification (FCS, MICE): a conditional imputation model for each variable
- Joint modelling (JM): a joint imputation model for all variables
The imputation model (joint or conditional) needs to be in line with the data
- need to account for the heterogeneity between clusters
- need to account for the types of data (continuous and binary)

MI for multilevel data
Handles missing data:
Type and name   Sporadic?            Systematic?   In binary variable?   Coded in R
JM-pan          yes                  yes           no                    yes (pan)
JM-REALCOM      yes                  yes           yes                   no
JM-jomo         yes                  yes           yes                   yes (jomo)
JM-Mplus        yes                  yes           yes                   no
FCS-2lnorm      yes                  no            no                    yes (mice)
FCS-1stage      yes, using variant   yes           yes                   yes
FCS-2stage      yes                  yes           yes, using variant    yes

Outline
1. Introduction
   - GREAT data
   - Multiple imputation
   - MI for multilevel data
2. MI methods for multilevel data
   - Continuous variables (univariate case, multivariate case)
   - Binary variables
3. Comparisons
   - Simulations
   - Application
4. Conclusion

Continuous variables
Heteroscedastic random effect model as imputation model
$y_{ik} = x_{ik}\beta + z_{ik} b_k + \varepsilon_{ik}$, with $b_k \sim N(0, \Psi)$ and $\varepsilon_{ik} \sim N(0, \Sigma_k)$
Multiple imputation under this model
1. generating M sets of parameters $(\theta_m)_{1 \le m \le M}$, with $\theta_m = \left(\beta_m, \Psi_m, (\Sigma_k^m)_{1 \le k \le K}\right)$
   - Bayesian formulation: draw $\theta_m$ from its posterior distribution
   - asymptotic method: estimate $\theta$, then draw $\theta_m$ from the asymptotic distribution of the estimator
2. imputing the data according to each set $\theta_m$
   - draw $b_k^m \mid y_k^{obs}, \theta_m$
   - draw $y_{ik}^{miss} \mid \theta_m, b_k^m$
Specific issues
1. how to generate $\Sigma_k$ without $y_{ik}$? (systematic)
2. how to draw $b_k^m$ without $y_{ik}$ (systematic) or given $y_{ik}$ (sporadic)?

FCS-1stage (Jolani et al., 2015)
Conditional imputation models
$y_{ik} = x_{ik}\beta + z_{ik} b_k + \varepsilon_{ik}$, with $b_k \sim N(0, \Psi)$ and $\varepsilon_{ik} \sim N(0, \sigma^2)$
For each incomplete variable
1. generate $(\theta_m)_{1 \le m \le M}$, with $\theta_m = (\beta_m, \Psi_m, \sigma^2_m)$
   - prior: non-informative (Jeffreys)
   - posterior distribution (requires REML estimates):
     $\beta_m \sim N\left(\hat\beta, \mathrm{var}(\hat\beta)\right)$
     $\Psi_m^{-1} \sim W\left(K, \hat{b}\hat{b}^\top\right)$
     $\sigma^2_m \sim \text{Inv-}\Gamma\left(\frac{n-p}{2}, \frac{(n-p)\hat\sigma^2}{2}\right)$
2. impute in each cluster k
   - with systematically missing data: draw $b_k \sim N(0, \Psi_m)$
   - with sporadically missing data: draw $b_k \sim N\left(\mu_{b_k \mid y_k}, \Psi_{b_k \mid y_k}\right)$
   - impute data according to the imputation model

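
The two ways of drawing $b_k$ can be illustrated with the standard conditional distribution of a random effect in a homoscedastic linear mixed model. The sketch below assumes that $\mu_{b_k \mid y_k}$ and $\Psi_{b_k \mid y_k}$ on the slide are these textbook moments; the function names and toy values are hypothetical.

```python
import numpy as np

def conditional_moments(y_obs, X_obs, Z_obs, beta, Psi, sigma2):
    """Mean and covariance of b_k given the observed part of cluster k.

    Assumed model: y = X beta + Z b + eps, b ~ N(0, Psi), eps ~ N(0, sigma2 * I).
    """
    V = np.linalg.inv(np.linalg.inv(Psi) + Z_obs.T @ Z_obs / sigma2)
    mu = V @ Z_obs.T @ (y_obs - X_obs @ beta) / sigma2
    return mu, V

def draw_random_effect(y_obs, X_obs, Z_obs, beta, Psi, sigma2, rng):
    """Draw b_k; with no observed value the moments reduce to (0, Psi)."""
    mu, V = conditional_moments(y_obs, X_obs, Z_obs, beta, Psi, sigma2)
    return rng.multivariate_normal(mu, V)

rng = np.random.default_rng(0)
beta, Psi, sigma2 = np.array([0.0]), np.array([[1.0]]), 1.0

# Sporadically missing: three observed values in the cluster (random intercept only)
ones = np.ones((3, 1))
mu_spor, V_spor = conditional_moments(np.array([2.0, 2.0, 2.0]), ones, ones, beta, Psi, sigma2)

# Systematically missing: no observed value, so b_k comes from its marginal N(0, Psi)
empty = np.empty((0, 1))
mu_syst, V_syst = conditional_moments(np.empty(0), empty, empty, beta, Psi, sigma2)

b_k = draw_random_effect(np.array([2.0, 2.0, 2.0]), ones, ones, beta, Psi, sigma2, rng)
print(mu_spor, V_spor, mu_syst, V_syst, b_k)
```

The systematic case is not a separate formula: with an empty cluster the conditional moments collapse to the marginal ones, which is why the same code path serves both situations.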
FCS-2stage (Resche-Rigon and White, 2016)
Conditional imputation models
$y_{ik} = x_{ik}(\beta + b_k) + \varepsilon_{ik}$, with $b_k \sim N(0, \Psi)$ and $\varepsilon_{ik} \sim N(0, \sigma_k^2)$
The same imputation model, with a heteroscedastic assumption
1. generate $\theta_m = \left(\beta_m, \Psi_m, (\sigma_1^2, \ldots, \sigma_K^2)_m\right)$
   - estimate $\theta$ and $\mathrm{var}(\hat\theta)$ with a two-stage estimator
   - draw $\theta_m$ from the asymptotic distribution of the estimator, with expectation $\hat\theta$ and variance $\mathrm{var}(\hat\theta)$
2. impute in each cluster k

FCS-2stage (Resche-Rigon and White, 2016)
1. generate $\theta_m = \left(\beta_m, \Psi_m, (\sigma_1^2, \ldots, \sigma_K^2)_m\right)$
   - stage 1: fit a linear model to each observed cluster
     $\hat\beta_k = (X_k^\top X_k)^{-1} X_k^\top y_k$
     $\hat\sigma_k = \| y_k - X_k \hat\beta_k \|_2 / \sqrt{n_k - p - 1}$
   - stage 2: combine the estimates
     $\log \hat\sigma_k = (\log \sigma + s_k) + \varepsilon_k$, with $s_k \sim N(0, \Psi_s)$ and $\varepsilon_k \sim N(0, \mathrm{var}(\log \hat\sigma_k))$
     $\hat\beta_k = (\beta + b_k) + \varepsilon_k$, with $b_k \sim N(0, \Psi)$ and $\varepsilon_k \sim N(0, \mathrm{var}(\hat\beta_k))$
   - two estimators available: REML or method of moments
     - $\log \sigma$, $\Psi_s$ and their associated (asymptotic) variances
     - $\Psi$, $\beta$ and their associated (asymptotic) variances
2. impute in each cluster k
   - with systematically missing data: draw $b_k$ from its marginal distribution
   - with sporadically missing data: draw $b_k$ conditionally on $\hat\beta_k$
   - impute data according to the imputation model

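
The two stages can be sketched for a single scalar coefficient. Stage 2 here uses the univariate DerSimonian-Laird moment estimator as a stand-in, which is an assumption: the talk does not spell out its multivariate method of moments. Function names and simulated values are hypothetical, and the residual variance is divided by n minus the number of columns of X (intercept included).

```python
import numpy as np

def stage1_ols(X, y):
    """Stage 1: ordinary least squares in one cluster; returns beta_k and its covariance."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)      # residual variance of the cluster
    return beta, sigma2 * XtX_inv

def stage2_moments(est, var):
    """Stage 2: DerSimonian-Laird pooling of scalar cluster estimates."""
    est, var = np.asarray(est, float), np.asarray(var, float)
    w = 1.0 / var
    fixed = np.sum(w * est) / np.sum(w)
    q = np.sum(w * (est - fixed) ** 2)
    denom = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (est.size - 1)) / denom)   # between-cluster variance
    w_star = 1.0 / (var + tau2)
    pooled = np.sum(w_star * est) / np.sum(w_star)
    return pooled, tau2, 1.0 / np.sum(w_star)

# Simulated example: 10 clusters with a random slope around 2.0
rng = np.random.default_rng(0)
slopes, slope_vars = [], []
for k in range(10):
    x = rng.normal(size=50)
    X = np.column_stack([np.ones(50), x])
    y = 1.0 + (2.0 + rng.normal(scale=0.3)) * x + rng.normal(scale=0.5, size=50)
    beta_k, cov_k = stage1_ols(X, y)
    slopes.append(beta_k[1])
    slope_vars.append(cov_k[1, 1])

pooled, tau2, pooled_var = stage2_moments(slopes, slope_vars)
print(pooled, tau2, pooled_var)
```

Because stage 1 fits each cluster separately, it needs enough observations per cluster, which is consistent with the remark elsewhere in the talk that large clusters are required.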
JM-jomo (Quartagno and Carpenter, 2016)
$y_{ik} = x_{ik}\beta + z_{ik} b_k + \varepsilon_{ik}$, with $b_k \sim N(0, \Psi)$ and $\varepsilon_{ik} \sim N(0, \Sigma_k)$
1. Bayesian formulation to generate $(\theta_m)_{1 \le m \le M}$, with $\theta_m = (\beta_m, \Psi_m, \Sigma_m)$
   - (informative) prior: $\beta \propto 1$, $\Psi^{-1} \sim W(\nu_1, \Lambda_1)$, $\Sigma_k^{-1} \mid \nu_2, \Lambda_2 \sim W(\nu_2, \Lambda_2)$
   - posterior: not known explicitly, but most of the conditional posterior distributions are known
     - Gibbs sampler
     - does not require REML estimates
     - unknown conditional distributions can be simulated by MCMC
2. Imputation (given by step 1)

Binary variables
- FCS-1stage (Jolani et al., 2015)
  - fit a logistic model with mixed effects to all clusters
  - sporadically missing values not handled
- FCS-2stage (Resche-Rigon and White, 2016)
  - fit a logistic model with fixed effects to each cluster
  - combine estimates using a meta-analysis
  - large clusters are required
- JM-jomo (Quartagno and Carpenter, 2016)
  - probit link: outcomes are latent normal variables, with the error variance fixed to 1
  - draw latent normal variables
  - derive categories
  - more time consuming

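
The last two JM-jomo steps (draw latent normal variables, derive categories) can be illustrated for one binary variable. This is a simplified sketch, not the jomo implementation: it assumes a linear predictor `eta` is already available, whereas the real method obtains it from the joint multilevel model. All names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def probit_prob(eta):
    """P(x = 1) under a probit link: standard normal CDF at eta."""
    return 0.5 * (1.0 + erf(eta / sqrt(2.0)))

def impute_binary_probit(eta, rng):
    """Draw latent normals with unit error variance, then threshold at 0."""
    z = eta + rng.standard_normal(eta.shape)   # latent normal variable
    return (z > 0).astype(int)                 # derived category

rng = np.random.default_rng(1)
eta = np.full(200_000, 0.5)                    # hypothetical linear predictor
x_imp = impute_binary_probit(eta, rng)
print(x_imp.mean(), probit_prob(0.5))
```

Fixing the error variance to 1 is what makes the latent scale identifiable, since only the sign of the latent variable is ever observed.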
Summary
Method       Bayesian / asymptotic   Prior for covariance matrices   Heteroscedasticity assumption for errors   Binary variables
FCS-1stage   Bayesian                Jeffreys                        no                                         logistic link
FCS-2stage   asymptotic              -                               yes                                        logistic link
JM-jomo      Bayesian                Wishart                         yes                                        probit link

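Simulation design
Data generation: 500 incomplete data sets are independently simulated ($n = 11685$, $K = 28$, $18 \le n_k \le 1834$)
$y_{ik} = \beta_0 + \beta_1 x^{(1)}_{ik} + \beta_2 x^{(2)}_{ik} + b^0_k + b^1_k x^{(1)}_{ik} + \varepsilon_{ik}$
with $\beta = (.72, .11, .03)$, $\Psi = \begin{bmatrix} .0077 & .0015 \\ .0015 & .0004 \end{bmatrix}$, $\sigma = .15$
- $x^{(1)}_{ik} \sim N(2.9 + \mu_k, .36)$
- $x^{(2)}_{ik}$: $\mathrm{logit}\left(P(x^{(2)}_{ik} = 1)\right) = 4.2 + \nu_k$
- $x^{(3)}_{ik} \sim N(2.9 + \xi_k, .36)$
- $(\mu_k, \nu_k, \xi_k) \sim N\left(0, \begin{bmatrix} .12 & .001 & .001 \\ .001 & .12 & .001 \\ .001 & .001 & .12 \end{bmatrix}\right)$
Missing values are added on $x^{(1)}$ and $x^{(2)}$ with $\pi_{syst} = .25$ and $\pi_{spor} = .25$
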
Simulation design
Methods
- JM-jomo, FCS-1stage, FCS-2stage
- Full, CC, FCS-fix, FCS-noclust, JM-pan
- M = 5 imputed arrays
Estimands: $\beta$ and $\mathrm{var}(\hat\beta)$
Criteria: bias, RMSE, variance estimate, coverage

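
The criteria can be computed from the per-replicate results as sketched below. The function name and toy inputs are hypothetical, and the confidence interval uses the normal quantile 1.96 as a simplifying assumption (the talk does not state which reference distribution was used after pooling).

```python
import numpy as np

def evaluate(estimates, model_vars, truth, z=1.96):
    """Summarise simulation replicates for one coefficient."""
    estimates = np.asarray(estimates, dtype=float)
    se = np.sqrt(np.asarray(model_vars, dtype=float))
    covered = np.abs(estimates - truth) <= z * se     # does the interval contain the truth?
    return {
        "bias": estimates.mean() - truth,
        "rmse": np.sqrt(np.mean((estimates - truth) ** 2)),
        "model_se": se.mean(),                         # average estimated standard error
        "emp_se": estimates.std(ddof=1),               # spread of the estimates across replicates
        "coverage": covered.mean(),
    }

# Hypothetical results from four replicates, true value 1.0
res = evaluate([0.9, 1.1, 1.0, 1.3], [0.01, 0.01, 0.01, 0.01], truth=1.0)
print(res)
```

Comparing `model_se` with `emp_se` is what reveals under- or over-estimated variability: a model standard error below the empirical one goes with coverage below the nominal 95%.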
Base-case configuration
[Figure: $\hat\beta - \beta_{true}$ for $\beta_1$ and $\beta_2$, by method: Full, CC, FCS-fix, FCS-noclust, JM-pan, FCS-1stage, FCS-2stageMM, FCS-2stageRE, JM-jomo]

Base-case configuration
               $\widehat{\mathrm{var}}(\hat\beta)$   $\mathrm{var}(\hat\beta)$   95% cover       Time (min)
Method         β1       β2       β1       β2       β1     β2
Full           0.0047   0.0029   0.0048   0.0030   93.8   94.2
CC             0.0070   0.0053   0.0071   0.0053   92.2   94.4
FCS-fix        0.0043   0.0043   0.0058   0.0042   85.6   94.6   0.732
FCS-noclust    0.0041   0.0043   0.0067   0.0046   59.6   92.4   0.601
JM-pan         0.0048   0.0042   0.0058   0.0039   83.0   95.2   0.006
FCS-1stage     0.0049   0.0046   0.0056   0.0043   90.4   96.8   63.43
FCS-2stagemm   0.0059   0.0054   0.0058   0.0044   95.0   97.0   0.538
FCS-2stagere   0.0059   0.0049   0.0058   0.0044   95.0   96.2   1.304
JM-jomo        0.0066   0.0069   0.0057   0.0050   96.8   97.6   6.739

Robustness to the cluster size: point estimate
[Figure: relative bias for $\beta_1$ and $\beta_2$ against cluster size $n_k$ (0 to 400), for JM, FCS-1stage, FCS-2stageRE and FCS-2stageMM]

Robustness to the number of clusters: point estimate
[Figure: relative bias for $\beta_1$ and $\beta_2$ against the number of clusters $K$ (10 to 25), for JM, FCS-1stage, FCS-2stageRE and FCS-2stageMM]

Robustness to $\pi_{syst}$: point estimate
[Figure: relative bias for $\beta_1$ and $\beta_2$ against $\pi_{syst}$ (0.10 to 0.40), for JM, FCS-1stage, FCS-2stageRE and FCS-2stageMM]

Robustness to $\pi_{syst}$: variance estimate
[Figure: model SE (JM, FCS-1stage, FCS-2stageRE, FCS-2stageMM) and empirical SE for $\beta_1$ and $\beta_2$ against $\pi_{syst}$ (0.10 to 0.40)]

Robustness to the type of imputed variables
[Figure: $\hat\beta - \beta_{true}$ for $\beta_1$ and $\beta_2$, by method: Full, FCS-1stage, FCS-2stageMM, FCS-2stageRE, JM-jomo]
               $\widehat{\mathrm{var}}(\hat\beta)$   $\mathrm{var}(\hat\beta)$   95% cover       Time (min)
Method         β1       β2       β1       β2       β1     β2
Full           0.0050   0.0029   0.0049   0.0028   95.0   95.0
FCS-1stage     0.0057   0.0044   0.0059   0.0043   92.0   95.2   103.665
FCS-2stagemm   0.0063   0.0051   0.0060   0.0044   94.0   96.2   0.652
FCS-2stagere   0.0056   0.0045   0.0061   0.0044   90.4   95.0   1.572
JM-jomo        0.0074   0.0072   0.0064   0.0047   97.0   98.6   5.612

Other configurations
Methods have similar performances when
- the missing data mechanism is MAR
- the outcome of the analysis model is binary
- the variance of random effects is higher or lower
- binary variables are generated using a probit link

Application to GREAT data
Explain the relationship between easily measurable biomarkers (BNP, AFIB) and the left ventricular ejection fraction
y = LVEF, X = BNP, AFib
MI using M = 20 imputed arrays
[Figure: data matrix; variables gender, bmi, age, SBP, DBP, HR, bnpl, AFib, LVEF against observation index]
                   CC        JM        FCS-1stage   FCS-2stagere   FCS-2stagemm
β_BNP    Est       -0.1132   -0.0891   -0.0902      -0.0854        -0.1009
         ModelSE    0.0108    0.0078    0.0153       0.0099         0.0112
β_AFIB   Est        0.0268    0.0216    0.0251       0.0215         0.0273
         ModelSE    0.0071    0.0046    0.0047       0.0040         0.0045
Time                          94.0      18609.3      361.3          31.8

Conclusion
An overview of MI methods for multilevel data
- FCS-1stage, FCS-2stage and JM-jomo all appear to perform well
- They outperform ad hoc methods
FCS-2stage
- the MM version provides a quick way to obtain first results
- for large clusters
FCS-1stage
- relevant with few systematically missing values
- time consuming with binary variables
JM-jomo
- advised with a lot of incomplete categorical variables
- be careful with few clusters
Methods are implemented in R
- mice package for FCS methods
- jomo package for JM-jomo

Limits and perspectives
Limits
- congeniality: $y_{ik} = \beta_0 + \beta_1 x^{(1)}_{ik} + \beta_2 x^{(2)}_{ik} + b^0_k + b^1_k x^{(1)}_{ik} + \varepsilon_{ik}$
- convergence of the FCS approaches
- computational time
Perspectives
- correction for FCS-2stage with small clusters?
- handle logistic link for FCS-1stage?

References
Great Network. Managing acute heart failure in the ED - case studies from the acute heart failure academy, 2013. http://www.greatnetwork.org.
D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, 1987.
S. van Buuren. Multiple imputation of multilevel data. In The Handbook of Advanced Multilevel Analysis. Routledge, Milton Park, UK, 2010.
S. Jolani, T. P. A. Debray, H. Koffijberg, S. van Buuren, and K. G. M. Moons. Imputation of systematically missing predictors in an individual participant data meta-analysis: a generalized approach using MICE. Statistics in Medicine, 34(11):1841-1863, 2015.
M. Resche-Rigon, I. R. White, J. Bartlett, S. A. E. Peters, S. G. Thompson, and on behalf of the PROG-IMT Study Group. Multiple imputation for handling systematically missing confounders in meta-analysis of individual participant data. Statistics in Medicine, 32(28):4890-4905, 2013. ISSN 1097-0258. doi: 10.1002/sim.5894.
M. Resche-Rigon and I. White. Multiple imputation by chained equations for systematically and sporadically missing multilevel data. Statistical Methods in Medical Research, 2016.
J. L. Schafer. Imputation of missing covariates under a multivariate linear mixed model. Technical report, Dept. of Statistics, The Pennsylvania State University, 1997.
M. Quartagno and J. R. Carpenter. Multiple imputation for IPD meta-analysis: allowing for heterogeneity and studies with missing covariates. Statistics in Medicine, 35(17):2938-2954, 2016. ISSN 1097-0258.