Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level

Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level: A Monte Carlo Simulation to Test the Tenability of the SuperMatrix Approach
Kyle M. Lang
Quantitative Psychology Training Program, University of Kansas, Lawrence, KS
February 13, 2013

Outline
- Introduction to the problem and motivation for the current project
- Description of the simulation study
- Discussion of key findings
- Discussion of limitations of the current work and suggestions for future directions

Motivation for the Current Work
The Motivating Problem: How to judge the adequacy of latent variable models fit to multiply imputed data?
- Currently no strong consensus on how to combine fit measures across imputations
- Rubin's Rules (Rubin, 1987) are not directly applicable to pooling χ² statistics
- Makes it difficult to assess the adequacy of latent variable models
- Extant solutions to this problem (e.g., Cai & Lee, 2009; Lee & Cai, 2012; Meng & Rubin, 1992) tend to entail complicated calculations
I was interested in developing an easily implemented technique to combine χ² statistics across imputations: the SuperMatrix (SM) Technique

What is the SuperMatrix Technique?
1) Create m imputed data sets
2) Stack all m imputed data sets into a single data frame
3) Compute a single covariance matrix from the aggregate data frame
[Figure: an incomplete data set with variables X, Y, Z and four rows is imputed m times; the m completed data sets are stacked row-wise, and one covariance matrix of X, Y, Z (σ²_X, σ_{Y,X}, σ²_Y, σ_{Z,X}, σ_{Z,Y}, σ²_Z) is computed from the stack]
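Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation: the function names, the toy values, and the choice of the n − 1 denominator are assumptions, and step 1 (the imputation itself) is taken as already done by some imputation routine.

```python
from statistics import mean

def stack_imputations(imputed_sets):
    """Step 2: row-bind the m imputed data sets into one long data set."""
    return [list(row) for data in imputed_sets for row in data]

def covariance_matrix(rows):
    """Step 3: sample covariance matrix (n - 1 denominator) of the stacked rows."""
    n, p = len(rows), len(rows[0])
    means = [mean(row[j] for row in rows) for j in range(p)]
    return [
        [sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
         for k in range(p)]
        for j in range(p)
    ]

# Hypothetical output of step 1: m = 2 imputed versions of a 4-row data set
# with columns X, Y, Z. The values are made up for illustration only.
imputed_1 = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [3.0, 4.0, 5.0], [4.0, 3.0, 6.0]]
imputed_2 = [[1.0, 2.0, 3.0], [2.0, 2.0, 4.0], [3.0, 4.0, 5.0], [4.0, 3.0, 5.0]]

stacked = stack_imputations([imputed_1, imputed_2])  # 8 rows, 3 columns
sm_cov = covariance_matrix(stacked)                  # 3 x 3 matrix
```

Note that the stacked data set has m·N rows. Which sample size is then passed to the model-fitting software is a separate decision that this sketch does not address.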

Methods: Data Generation
Data Generating Model
[Figure: path diagram of the data generating model: two correlated factors (Factor A, Factor B) with indicators A1, A2, ..., A10 and B1, B2, ..., B10, plus Covariate 1 and Covariate 2; the parameter values on the paths are not recoverable from the extracted text]
Procedure
For each replication:
- A single population realization was generated
- These fully observed data were used to fit the complete data comparison models
- MAR missingness was then introduced to the complete data
- This incomplete data set was submitted to the various missing data treatments under study
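The slide does not say how the MAR missingness was imposed, so the following is only one possible sketch of that step. It deletes values in a target column with a probability that rises with the rank of a fully observed predictor column, which makes the missingness depend only on observed data. The function name, the rank-based rule, and the toy data are all assumptions.

```python
import random

def impose_mar(rows, target_col, predictor_col, prop_missing, rng):
    """Return a copy of rows with MAR missingness (None) in target_col.

    The deletion probability grows linearly with the rank of the fully
    observed predictor, from 0 up to 2 * prop_missing, so the expected
    proportion missing equals prop_missing (requires prop_missing <= 0.5).
    """
    n = len(rows)
    order = sorted(range(n), key=lambda i: rows[i][predictor_col])
    out = [list(r) for r in rows]
    for rank, i in enumerate(order):
        p_miss = 2.0 * prop_missing * rank / (n - 1)
        if rng.random() < p_miss:
            out[i][target_col] = None
    return out

# Hypothetical complete data: column 0 is an indicator, column 1 a covariate.
rng = random.Random(2013)
complete = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
incomplete = impose_mar(complete, target_col=0, predictor_col=1,
                        prop_missing=0.30, rng=rng)
observed_rate = sum(r[0] is None for r in incomplete) / len(incomplete)
```

The predictor column is never deleted, which is what keeps the mechanism MAR rather than MNAR in this toy setup.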

Methods: Simulation Conditions
Parameters Varied
- Sample Size: 100, 120, ..., 980, 1000
- Percent Missing: 2%, 4%, ..., 48%, 50%
Final Conditions
- 500 Replications
- 3 Missingness Treatments: SuperMatrix, Naive Approach, & FIML
- 2 Model Structures: Full (ψ_{2,1} = ψ̂) and Restricted (ψ_{2,1} = 0)
Simulation Structure
3 (Missing Data Treatments) × 2 (Model Structures) × 46 (Sample Sizes) × 25 (Percents Missing) = 6900 Crossed Conditions
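The condition count can be checked by enumerating the crossed design. The sketch below assumes the sample sizes run from 100 to 1000 in steps of 20 and the percents missing from 2 to 50 in steps of 2, which is the reading that yields the 46 and 25 levels stated on the slide; the label strings are placeholders.

```python
from itertools import product

treatments = ["SuperMatrix", "Naive", "FIML"]
model_structures = ["Full", "Restricted"]
sample_sizes = list(range(100, 1001, 20))   # 100, 120, ..., 1000
percents_missing = list(range(2, 51, 2))    # 2, 4, ..., 50

# Fully crossed design: one tuple per condition.
conditions = list(product(treatments, model_structures,
                          sample_sizes, percents_missing))
n_conditions = len(conditions)
```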

Methods: Analysis Strategy
[Figure: path diagrams of Analysis Model 1 and Analysis Model 2. Each shows Factor A and Factor B with latent covariance ψ_{2,1}, indicators A1, A2, ..., A10 and B1, B2, ..., B10 with loadings λ_{1,1}, ..., λ_{20,2} and residual variances θ_{1,1}, ..., θ_{20,20}. Covariate 1 and Covariate 2 appear with parameters θ_{21,j} and θ_{22,j}. The exact layout is not recoverable from the extracted text]
Test Statistics:
PRB = 100 \cdot \frac{K^{-1} \sum_{i=1}^{K} \hat{T}_i - T}{T}
RMSE = \sqrt{K^{-1} \sum_{i=1}^{K} (\hat{T}_i - T)^2} = \sqrt{(\bar{\hat{T}} - T)^2 + SE_{\hat{T}}^2}
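The two test statistics on this slide, percent relative bias (PRB) and root mean squared error (RMSE), can be computed as follows for K replicated estimates of a quantity with true (comparison) value T. This is a sketch under assumptions: the function names and the toy numbers are invented, and the standard error in the RMSE decomposition is taken to be the standard deviation of the estimates with a K denominator.

```python
import math
from statistics import mean, pvariance

def percent_relative_bias(estimates, true_value):
    """PRB = 100 * (mean of the estimates - T) / T."""
    return 100.0 * (mean(estimates) - true_value) / true_value

def rmse(estimates, true_value):
    """RMSE = sqrt of the mean squared deviation of the estimates from T."""
    return math.sqrt(mean((t - true_value) ** 2 for t in estimates))

# Hypothetical estimates from K = 4 replications with true value T = 10.
estimates = [9.0, 10.0, 11.0, 12.0]
T = 10.0

prb_value = percent_relative_bias(estimates, T)
rmse_value = rmse(estimates, T)

# Decomposition: RMSE^2 = squared bias + sampling variance (K denominator).
decomposed = math.sqrt((mean(estimates) - T) ** 2 + pvariance(estimates))
```

With these toy numbers the mean estimate is 10.5, so the PRB is 5 percent, and both routes to the RMSE give the square root of 1.5.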

Hypotheses
- Hypothesis 1: Convergence. SuperMatrix will lead to higher convergence rates than FIML
- Hypothesis 2: Direct Model Fit. SM-based model fit will be trivially different from complete data-based model fit
- Hypothesis 3: Relative Performance. Naive-based model fit will show universally larger deviations from complete data-based estimates than will SM-based fit
- Hypothesis 4: Nested Model χ² Testing. χ² tests derived from SuperMatrix χ² values will show negligible deviation from analogous tests derived from complete data χ² values
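For Hypothesis 4, a nested-model test compares the χ² of the Restricted and Full models. Because the two model structures in this study differ only in whether ψ_{2,1} is estimated, the difference has one degree of freedom, and the χ² tail probability with 1 df can be written with the complementary error function. The sketch below uses only that special case; the function name and the χ² values are hypothetical, not results from the study.

```python
import math

def chi_square_difference_test(chisq_restricted, chisq_full, alpha=0.05):
    """Likelihood ratio (delta chi-square) test for a 1-df nested comparison.

    For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)).
    Returns (delta chi-square, p-value, reject flag).
    """
    delta = chisq_restricted - chisq_full
    p_value = math.erfc(math.sqrt(max(delta, 0.0) / 2.0))
    return delta, p_value, p_value < alpha

# Hypothetical chi-square values for the Restricted and Full models.
delta, p_value, reject = chi_square_difference_test(215.4, 180.2)
```

Comparisons with more than one degree of freedom would need a general χ² distribution function, which this sketch deliberately leaves out.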

Results: Important Findings 1
Hypothesis 1 was strongly supported
- 100% convergence for all imputation conditions
- Very low convergence rates for several FIML conditions (N < 200, PM > 40%)
- 0% convergence when N = 100 and PM = 50%
[Figure: Convergence Rates of FIML Models Plotted by Sample Size and Percent Missing; axes: N, PM, Convergence Rate]

Results: Important Findings 2
Hypothesis 3 was also definitively supported
- Across all conditions, model fit derived from the SM technique more closely approximated the complete data values than did the model fit derived from the Naive approach
[Figure, Plate 1: CFI for the SuperMatrix, Naive, and Complete Data Conditions; axes: PM, N, CFI]
[Figure, Plate 2: TLI for the SuperMatrix, Naive, and Complete Data Conditions; axes: PM, N, TLI]

Results: Important Findings 3
Hypothesis 4 was supported, as well
- SM-based χ² values accurately replicated the complete data values across nearly all conditions
- Bias becomes unacceptable when PM > 40% and N [symbol lost in extraction] 200
- FIML-based χ² values quickly become negatively biased
- For all sample sizes, bias was unacceptable for PM > 0%
[Figure, Plate 1: Δχ² for the Complete Data and SuperMatrix Conditions; axes: PM, N, Δχ²]
[Figure, Plate 2: Δχ² for the Complete Data and FIML Conditions; axes: PM, N, Δχ²]

Results: Important Findings 4
Hypothesis 2 was not supported

Conclusion: Limitations
- Trivially simple models
- Only multivariate normally distributed indicators
- Small number of comparison conditions
- Inability to assess Power and Type I Error Rates
  - Because of the large effect size associated with the latent covariance (i.e., r = .5), rejection rates could not be scrutinized

Conclusion: Future Directions
- Include currently recommended techniques as comparison conditions
  - Expectation Maximization (EM) Algorithm (Dempster, Laird, & Rubin, 1977)
  - Yuan & Bentler Two Stage Estimator (Yuan & Bentler, 2000)
  - Satorra-Bentler Robust χ² (Satorra & Bentler, 1994)
- Address the poor performance of the SuperMatrix when assessing direct model fit
  - Convert the SM technique into a two-stage estimator
  - Correct the likelihood ratio statistic so that it follows a χ² distribution
  - Manipulate the sample size term of the χ² expression
  - Implement the Lee and Cai (2012) correction to the minimized fit function value
- Assess Power and Type I Error rates of hypothesis tests conducted under the SM technique

References
Cai, L., & Lee, T. (2009). Covariance structure model fit testing under missing data: An application of the supplemented EM algorithm. Multivariate Behavioral Research, 44, 281–304.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1–38.
Lee, T., & Cai, L. (2012). Alternative multiple imputation inference for mean and covariance structure modeling. Journal of Educational and Behavioral Statistics, 37(6), 675–702.
Meng, X. L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79, 103–111.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys (Vol. 59). Wiley Online Library.
Satorra, A., & Bentler, P. (1994). Corrections to test statistics and standard errors in covariance structure analysis.
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30, 165–200. Retrieved from http://dx.doi.org/10.1111/0081-1750.00078

Thank you for your time
Questions/Comments?