Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level
A Monte Carlo Simulation to Test the Tenability of the SuperMatrix Approach
Kyle M. Lang
Quantitative Psychology Training Program
University of Kansas
Lawrence, KS
February 3, 2013
Outline
- Introduction to the problem and motivation for the current project
- Description of the simulation study
- Discussion of key findings
- Discussion of limitations of the current work and suggestions for future directions
Kyle M. Lang (KU) Streamlining Missing Data Analysis February 3, 2013 2 / 16
Motivation for the Current Work
The Motivating Problem: How to judge the adequacy of latent variable models fit to multiply imputed data?
- Currently, there is no strong consensus on how to combine fit measures across imputations
- Rubin's Rules (Rubin, 1987) are not directly applicable to pooling χ² statistics
- This makes it difficult to assess the adequacy of latent variable models
- Extant solutions to this problem (e.g., Cai & Lee, 2009; Lee & Cai, 2012; Meng & Rubin, 1992) tend to entail complicated calculations
- I was interested in developing an easily implemented technique to combine χ² statistics across imputations: the SuperMatrix (SM) Technique
What is the SuperMatrix Technique?
1) Create m imputed data sets
   [Diagram: an incomplete data set with variables X, Y, Z and four rows is imputed m times, giving complete data sets (X1, Y1, Z1), (X2, Y2, Z2), ..., (Xm, Ym, Zm)]
2) Stack all m imputed data sets into a single data frame
   [Diagram: the m imputed data sets stacked row-wise under the common columns X, Y, Z]
3) Compute a single covariance matrix from the aggregate data frame

        X         Y         Z
   X    σ²_X
   Y    σ_Y,X     σ²_Y
   Z    σ_Z,X     σ_Z,Y     σ²_Z
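The three steps above can be sketched in a few lines of Python. This is only an illustration: `toy_impute` is a hypothetical stand-in that fills each missing cell with the observed column mean plus noise, which is not a proper imputation model, and the n - 1 denominator applied to the stacked rows is an assumption rather than something the slides specify.

```python
import random

def toy_impute(data, rng):
    """Hypothetical stand-in for an imputation routine (NOT a real MI model):
    fills each missing cell with the observed column mean plus random noise."""
    p = len(data[0])
    col_means = []
    for j in range(p):
        observed = [row[j] for row in data if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [[v if v is not None else col_means[j] + rng.gauss(0, 0.5)
             for j, v in enumerate(row)] for row in data]

def stack_imputations(imputed_sets):
    """Step 2: stack the m imputed data sets row-wise into one data frame."""
    stacked = []
    for data in imputed_sets:
        stacked.extend(data)
    return stacked

def covariance_matrix(rows):
    """Step 3: covariance matrix of the stacked rows (n - 1 denominator assumed)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            s = sum((row[j] - means[j]) * (row[k] - means[k]) for row in rows)
            cov[j][k] = cov[k][j] = s / (n - 1)
    return cov

# Toy incomplete data with columns X, Y, Z (None marks a missing cell).
incomplete = [[1.0, None, None],
              [2.0, 2.5, 1.5],
              [3.0, 3.5, None],
              [4.0, None, 3.0]]

rng = random.Random(1)
m = 3                                                             # number of imputations
imputed_sets = [toy_impute(incomplete, rng) for _ in range(m)]    # step 1
stacked = stack_imputations(imputed_sets)                         # step 2
super_matrix = covariance_matrix(stacked)                         # step 3
```

The resulting `super_matrix` is the single covariance matrix that would then be handed to the latent variable model in place of m separate analyses.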
Methods: Data Generation
Data Generating Model
[Figure: path diagram of the data generating model with Factor A (indicators A1, A2, ..., A10), Factor B (indicators B1, B2, ..., B10), Covariate 1, and Covariate 2]
Procedure
For each replication:
- A single population realization was generated
- These fully observed data were used to fit the complete data comparison models
- MAR missingness was then introduced to the complete data
- This incomplete data set was submitted to the various missing data treatments under study
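As a rough illustration of the "MAR missingness was then introduced" step, the sketch below deletes values of one variable with a probability that depends only on a second, fully observed variable. The rank-based deletion rule, the function name `impose_mar`, and the two-column toy data are all assumptions made for this example; the slides do not describe the actual missingness mechanism.

```python
import random

def impose_mar(rows, target_col, predictor_col, prop_missing, rng):
    """Set target_col to None with a probability that rises with the rank of
    predictor_col (hypothetical rule). The predictor itself stays complete,
    so missingness depends only on observed data (MAR). The deletion
    probabilities average to prop_missing when prop_missing <= 0.5."""
    n = len(rows)
    order = sorted(range(n), key=lambda i: rows[i][predictor_col])
    rank = {i: r for r, i in enumerate(order)}
    out = []
    for i, row in enumerate(rows):
        p_delete = min(1.0, 2.0 * prop_missing * (rank[i] + 0.5) / n)
        new_row = list(row)
        if rng.random() < p_delete:
            new_row[target_col] = None
        out.append(new_row)
    return out

rng = random.Random(2013)
# Toy complete data: column 0 is a fully observed covariate, column 1 an indicator.
complete = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
incomplete = impose_mar(complete, target_col=1, predictor_col=0,
                        prop_missing=0.30, rng=rng)
observed_rate = sum(row[1] is None for row in incomplete) / len(incomplete)
```

Because the deletion probability is a function of the covariate alone, rows with missing indicator values have systematically higher covariate values than rows with observed values.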
Methods: Simulation Conditions
Parameters Varied
- Sample Size: 100, 120, ..., 980, 1000
- Percent Missing: 2%, 4%, ..., 48%, 50%
Simulation Structure
- Final Conditions
  - 500 Replications
  - 3 Missingness Treatments: SuperMatrix, Naive Approach, & FIML
  - 2 Model Structures: Full (ψ_{2,1} = ψ̂_{2,1}) and Restricted (ψ_{2,1} = 0)
- 3 (Missing Data Treatments) × 2 (Model Structures) × 46 (Sample Sizes) × 25 (Percents Missing) = 6900 Crossed Conditions
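The condition count can be checked by enumerating the fully crossed design. The variable names below are illustrative and not taken from the original simulation code.

```python
from itertools import product

treatments = ["SuperMatrix", "Naive", "FIML"]
model_structures = ["Full", "Restricted"]
sample_sizes = list(range(100, 1001, 20))    # 100, 120, ..., 1000
percents_missing = list(range(2, 51, 2))     # 2%, 4%, ..., 50%

# Every combination of treatment, model structure, N, and percent missing.
conditions = list(product(treatments, model_structures,
                          sample_sizes, percents_missing))
n_conditions = len(conditions)
```

Stepping N by 20 from 100 to 1000 gives the 46 sample sizes, and stepping percent missing by 2 from 2 to 50 gives the 25 missingness rates, so `n_conditions` matches the 6900 reported on the slide.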
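Methods: Analysis Strategy
Analysis Model 1 and Analysis Model 2
[Figures: path diagrams of the two analysis models. Both show Factor A (indicators A1, A2, ..., A10) and Factor B (indicators B1, B2, ..., B10) with latent covariance ψ_{2,1}, loadings λ_{1,1}, ..., λ_{10,1} and λ_{11,2}, ..., λ_{20,2}, and residual variances θ_{1,1}, ..., θ_{20,20}. The diagram labels also include Covariate 1 and Covariate 2 with parameters θ_{21,1}, ..., θ_{21,20}, θ_{21,21}, θ_{22,1}, ..., θ_{22,20}, θ_{22,21}, θ_{22,22}]
Test Statistics:

$$\mathrm{PRB} = 100 \times \frac{K^{-1} \sum_{i=1}^{K} \hat{T}_i - T}{T}$$

$$\mathrm{RMSE} = \sqrt{K^{-1} \sum_{i=1}^{K} \left( \hat{T}_i - T \right)^2} = \sqrt{\left( \bar{\hat{T}} - T \right)^2 + \left( SE_{\hat{T}} \right)^2}$$

where $\hat{T}_i$ is the estimate from replication $i$, $T$ is the true value, $K$ is the number of replications, and $\bar{\hat{T}} = K^{-1} \sum_{i=1}^{K} \hat{T}_i$.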
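A small numeric sketch of the two outcome measures may help. The estimates and true value below are made-up numbers, and the second RMSE function assumes that the standard error in the decomposition is the standard deviation of the estimates across replications computed with a K denominator.

```python
import math

def percent_relative_bias(estimates, true_value):
    """PRB: 100 times the mean deviation from the true value, relative to it."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - true_value) / true_value

def rmse(estimates, true_value):
    """RMSE computed directly from the squared deviations."""
    k = len(estimates)
    return math.sqrt(sum((t - true_value) ** 2 for t in estimates) / k)

def rmse_from_bias_and_se(estimates, true_value):
    """RMSE rebuilt from squared bias plus squared empirical SE (K denominator)."""
    k = len(estimates)
    mean_est = sum(estimates) / k
    variance = sum((t - mean_est) ** 2 for t in estimates) / k
    return math.sqrt((mean_est - true_value) ** 2 + variance)

# Hypothetical estimates from K = 5 replications and a hypothetical true value.
estimates = [9.0, 10.5, 11.0, 10.0, 12.0]
true_value = 10.0

prb = percent_relative_bias(estimates, true_value)
rmse_direct = rmse(estimates, true_value)
rmse_decomposed = rmse_from_bias_and_se(estimates, true_value)
```

With these toy numbers the mean estimate is 10.5, so the PRB is 5%, and the two RMSE computations agree because the mean squared error splits exactly into squared bias plus sampling variance.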
Hypotheses
Hypothesis 1: Convergence
- SuperMatrix will lead to higher convergence rates than FIML
Hypothesis 2: Direct Model Fit
- SM-based model fit will be trivially different from complete data-based model fit
Hypothesis 3: Relative Performance
- Naive-based model fit will show universally larger deviations from complete data-based estimates than will SM-based fit
Hypothesis 4: Nested Model χ² Testing
- χ² tests derived from SuperMatrix χ² values will show negligible deviation from analogous tests derived from complete data χ² values
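For readers unfamiliar with the nested model test behind Hypothesis 4, here is a minimal sketch of a χ² difference test. It assumes the restricted model differs from the full model by a single constraint (the latent covariance fixed to zero), so the difference has 1 degree of freedom; the χ² values are hypothetical and the function names are illustrative.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.
    Uses the identity P(Z^2 > x) = erfc(sqrt(x / 2)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

def delta_chi2_test(chi2_restricted, chi2_full):
    """Chi-square difference test for two nested models differing by 1 df."""
    delta = chi2_restricted - chi2_full
    p_value = chi2_sf_df1(max(delta, 0.0))
    return delta, p_value

# Hypothetical fit statistics for the restricted and full models.
delta, p_value = delta_chi2_test(chi2_restricted=310.0, chi2_full=205.0)
```

Using the closed form for 1 degree of freedom keeps the example dependency-free; a general-purpose version would call a chi-square survival function from a statistics library instead.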
Results: Important Findings 1
Hypothesis 1 was strongly supported
- 100% convergence for all imputation conditions
- Very low convergence rates for several FIML conditions
  - N < 200
  - PM > 40%
  - 0% convergence when N = 100 and PM = 50%
[Figure: Convergence Rates of FIML Models Plotted by Sample Size and Percent Missing; axes: N, PM, Convergence Rate]
Results: Important Findings 2
Hypothesis 3 was also definitively supported
- Across all conditions, model fit derived from the SM technique more closely approximated the complete data values than did the model fit derived from the Naive approach
[Figure, Plate 1: CFI for the SuperMatrix, Naive, and Complete Data Conditions; axes: N, PM, CFI]
[Figure, Plate 2: TLI for the SuperMatrix, Naive, and Complete Data Conditions; axes: N, PM, TLI]
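Both fit indices plotted above are functions of the model and baseline χ² values, which is why an inflated or deflated χ² carries straight through to CFI and TLI. The sketch below uses the standard textbook definitions; the χ² and df values are hypothetical and are not results from the study.

```python
def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index from the model and baseline chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    if d_base == 0.0:
        return 1.0
    return 1.0 - d_model / d_base

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis Index from the chi-square to df ratios."""
    ratio_model = chi2_model / df_model
    ratio_base = chi2_base / df_base
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

# Hypothetical fit statistics for a target model and its baseline model.
cfi_value = cfi(chi2_model=200.0, df_model=169, chi2_base=3000.0, df_base=190)
tli_value = tli(chi2_model=200.0, df_model=169, chi2_base=3000.0, df_base=190)
```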
Results: Important Findings 3
Hypothesis 4 was supported, as well
- SM-based χ² values accurately replicated the complete data values across nearly all conditions
  - Bias becomes unacceptable when PM > 40% and N ≤ 200
- FIML-based χ² values quickly become negatively biased
  - For all sample sizes, bias was unacceptable for PM > 0%
[Figure, Plate 1: Δχ² for the Complete Data and SuperMatrix Conditions; axes: N, PM, Δχ²]
[Figure, Plate 2: Δχ² for the Complete Data and FIML Conditions; axes: N, PM, Δχ²]
Results: Important Findings 4
Hypothesis 2 was not supported
Conclusion: Limitations
- Trivially simple models
- Only multivariate normally distributed indicators
- Small number of comparison conditions
- Inability to assess Power and Type I Error Rates
  - Because of the large effect size associated with the latent covariance (i.e., r = .5), rejection rates could not be scrutinized
Conclusion: Future Directions
- Include currently recommended techniques as comparison conditions
  - Expectation Maximization (EM) Algorithm (Dempster, Laird, & Rubin, 1977)
  - Yuan & Bentler Two Stage Estimator (Yuan & Bentler, 2000)
  - Satorra-Bentler Robust χ² (Satorra & Bentler, 1994)
- Address the poor performance of the SuperMatrix when assessing direct model fit
  - Convert the SM technique into a two-stage estimator
  - Correct the likelihood ratio statistic so that it follows a χ² distribution
    - Manipulate the sample size term of the χ² expression
    - Implement the Lee and Cai (2012) correction to the minimized fit function value
- Assess Power and Type I Error rates of hypothesis tests conducted under the SM technique
References
Cai, L., & Lee, T. (2009). Covariance structure model fit testing under missing data: An application of the supplemented EM algorithm. Multivariate Behavioral Research, 44, 281–304.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
Lee, T., & Cai, L. (2012). Alternative multiple imputation inference for mean and covariance structure modeling. Journal of Educational and Behavioral Statistics, 37(6), 675–702.
Meng, X. L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79, 103–111.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys (Vol. 59). Wiley Online Library.
Satorra, A., & Bentler, P. (1994). Corrections to test statistics and standard errors in covariance structure analysis.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30, 165–200. Retrieved from http://dx.doi.org/10.1111/0081-1750.00078
Thank you for your time
Questions/Comments?