Handling Missing Data in R with MICE

Stef van Buuren (1, 2)
1 Methodology and Statistics, FSBS, Utrecht University
2 Netherlands Organization for Applied Scientific Research TNO, Leiden
Winnipeg, June 11, 2017

Why this course?
- Missing data are everywhere
- Ad-hoc fixes often do not work
- Multiple imputation is broadly applicable, yields correct statistical inferences, and is supported by good software
- Goal of the course: get comfortable with a modern and powerful way of solving missing data problems

Course materials
https://github.com/stefvanbuuren/winnipeg

Reading materials
1. Van Buuren, S. and Groothuis-Oudshoorn, C.G.M. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/article/view/v045i03
2. Van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL. Chapters 1-6, 10. http://www.crcpress.com/product/isbn/9781439868249

Flexible Imputation of Missing Data (FIMD)

R software and examples
- R: install from https://cran.r-project.org
- RStudio: install from https://www.rstudio.com
- R package mice 2.30 or higher: from CRAN or from https://github.com/stefvanbuuren/mice
- More examples: http://www.multiple-imputation.com

Time table (morning)
Time         Session  L/P  Description
09.00-09.15           L    Overview
09.15-10.00  I        L    Introduction to missing data
10.00-10.30  I        P    Ad hoc methods + MICE
10.30-10.45                PAUSE
10.45-11.30  II       L    Multiple imputation
11.30-12.00  II       P    Boys data
12.00-13.15                PAUSE

Time table (afternoon)
Time         Session  L/P  Description
13.15-14.00  III      L    Generating plausible imputations
14.00-14.30  III      P    Algorithmic convergence and pooling
14.30-14.45                PAUSE
14.45-15.15  IV       L    Imputation in practice
15.15-15.45  IV       P    Post-processing and passive imputation
15.45-16.00  V        L    Guidelines for reporting

SESSION I

Why are missing data interesting?
- "Obviously the best way to treat missing data is not to have them." (Orchard and Woodbury, 1972)
- "Sooner or later (usually sooner), anyone who does statistical analysis runs into problems with missing data." (Allison, 2002)
- Missing data problems are the heart of statistics

Causes of missing data
- Respondent skipped the item
- Data transmission/coding error
- Drop out in longitudinal research
- Refusal to cooperate
- Sample from population
- Question not asked, different forms
- Censoring

Consequences of missing data
- Less information than planned
- Enough statistical power?
- Different analyses, different n's
- Cannot calculate even the mean
- Systematic biases in the analysis
- Appropriate confidence interval, P-values?
In general, missing data can severely complicate interpretation and analysis.
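
The point that one cannot calculate even the mean can be seen in a few lines of base R. This is a small sketch, assuming the built-in airquality data as a stand-in for the ozone data shown in the plots; the variable names are illustrative only.

```r
# Assumption: airquality (shipped with base R) stands in for the course data
ozone <- airquality$Ozone

n_missing  <- sum(is.na(ozone))          # number of missing ozone readings
naive_mean <- mean(ozone)                # NA: a single missing value is enough
cc_mean    <- mean(ozone, na.rm = TRUE)  # mean of the observed values only

n_missing
naive_mean
cc_mean
```

Using na.rm = TRUE silently switches to a complete-case estimate, which is only safe under the assumptions discussed for listwise deletion.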

Listwise deletion
- Analyze only the complete records
- Also known as Complete Case Analysis (CCA)
Advantages
- Simple (default in most software)
- Unbiased under MCAR
- Correct standard errors, significance levels
- Two special properties in regression
Disadvantages
- Wasteful
- Large standard errors
- Biased under MAR, even for simple statistics like the mean
- Inconsistencies in reporting
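
Listwise deletion is what lm() does by default in R, which makes it easy to overlook how many records are lost. A minimal sketch, again assuming the built-in airquality data; the model itself is only an illustration.

```r
# Assumption: airquality stands in for the course data; the model is illustrative
vars <- c("Ozone", "Wind", "Solar.R")

# lm() drops incomplete rows silently (default na.action = na.omit)
fit <- lm(Ozone ~ Wind + Solar.R, data = airquality)

n_total    <- nrow(airquality)                       # records collected
n_used     <- nobs(fit)                              # records actually analysed
n_complete <- sum(complete.cases(airquality[, vars]))

c(total = n_total, used = n_used, complete = n_complete)
```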

Mean imputation
- Replace the missing values by the mean of the observed data
Advantages
- Simple
- Unbiased for the mean, under MCAR
[Figure: mean imputation. Histogram of Ozone (ppb) and scatterplot of Ozone (ppb) against Solar Radiation (lang)]
Disadvantages
- Disturbs the distribution
- Underestimates the variance
- Biases correlations to zero
- Biased under MAR
AVOID (unless you know what you are doing)
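
The variance problem of mean imputation can be checked directly. The following base R sketch assumes the airquality data as a stand-in for the ozone example; it is an illustration, not the code behind the figure.

```r
# Assumption: airquality stands in for the ozone data in the figure
ozone <- airquality$Ozone

# Replace every missing value by the mean of the observed values
imputed <- ifelse(is.na(ozone), mean(ozone, na.rm = TRUE), ozone)

var_obs <- var(ozone, na.rm = TRUE)  # variance of the observed values
var_imp <- var(imputed)              # variance after mean imputation

c(observed = var_obs, imputed = var_imp)
```

The mean is unchanged by construction, but every imputed value sits exactly at the centre, so the variance of the completed variable shrinks.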

Regression imputation
- Also known as prediction
- Fit model for Y_obs under listwise deletion
- Predict Y_mis for records with missing Y's
- Replace missing values by prediction
Advantages
- Unbiased estimates of regression coefficients (under MAR)
- Good approximation to the (unknown) true data if explained variance is high
- Prediction is the favorite among non-statisticians
[Figure: regression imputation. Histogram of Ozone (ppb) and scatterplot of Ozone (ppb) against Solar Radiation (lang)]
Disadvantages
- Artificially increases correlations
- Systematically underestimates the variance
- Too optimistic P-values and too short confidence intervals
AVOID. Harmful to statistical inference.
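
A sketch of regression imputation in base R, assuming airquality with Temp as the single (complete) predictor of Ozone. The choice of predictor is an assumption made for the example; the figure uses solar radiation.

```r
# Assumption: airquality with Temp as predictor; illustrative only
dat <- airquality[, c("Ozone", "Temp")]
mis <- is.na(dat$Ozone)

fit <- lm(Ozone ~ Temp, data = dat)                   # fitted on complete cases
dat$Ozone[mis] <- predict(fit, newdata = dat[mis, ])  # fill in the fitted values

cor_before <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
cor_after  <- cor(dat$Ozone, dat$Temp)

c(before = cor_before, after = cor_after)
```

All imputed points lie exactly on the regression line, which is why the correlation in the completed data is at least as strong as in the observed data.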

Stochastic regression imputation
- Like regression imputation, but adds appropriate noise to the predictions to reflect uncertainty
Advantages
- Preserves the distribution of Y_obs
- Preserves the correlation between Y and X in the imputed data
[Figure: stochastic regression imputation. Histogram of Ozone (ppb) and scatterplot of Ozone (ppb) against Solar Radiation (lang)]
Disadvantages
- The assumption of symmetric, constant-variance errors is restrictive
- Single imputation does not take the uncertainty of the imputed data into account, and incorrectly treats them as real
- Not so simple anymore
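
Stochastic regression imputation adds a random residual to each prediction. A minimal base R sketch, under the same assumptions as before (airquality, Temp as predictor, normal errors with the residual standard deviation of the fitted model):

```r
set.seed(1)  # arbitrary seed, for reproducibility only
dat <- airquality[, c("Ozone", "Temp")]
mis <- is.na(dat$Ozone)

fit   <- lm(Ozone ~ Temp, data = dat)
pred  <- predict(fit, newdata = dat[mis, ])
noise <- rnorm(sum(mis), mean = 0, sd = summary(fit)$sigma)  # residual noise

dat$Ozone[mis] <- pred + noise  # imputations scatter around the line
range(dat$Ozone[mis])
```

Because the noise is symmetric with constant variance, imputed values can fall outside the plausible range (for example, negative ozone), which illustrates the first disadvantage listed.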

Single imputation methods, wrap-up
- Underestimate uncertainty caused by the missing data
- Unbiased only under restrictive assumptions

Alternatives
- Maximum Likelihood, Direct Likelihood
- Weighting
- Multiple Imputation
Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. Second Edition. John Wiley & Sons, New York.

SESSION II

Rising popularity of multiple imputation
[Figure: number of publications (log scale) per year, 1975-2010, for early publications, 'multiple imputation' in abstract, and 'multiple imputation' in title]

Main steps used in multiple imputation
Incomplete data -> Imputed data -> Analysis results -> Pooled results

Steps in mice
Stage             Function  Object
incomplete data             data frame
imputed data      mice()    mids
analysis results  with()    mira
pooled results    pool()    mipo
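
The three steps map onto three function calls. A minimal sketch using the nhanes data that ships with mice; the analysis model, seed and m are arbitrary choices for illustration.

```r
library(mice)

# Step 1: impute (data frame -> mids); seed and m chosen for illustration
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Step 2: analyse each completed data set (mids -> mira)
fit <- with(imp, lm(chl ~ bmi + age))

# Step 3: pool the m results (mira -> mipo)
est <- pool(fit)
summary(est)
```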

Estimand
- Q is a quantity of scientific interest in the population.
- Q can be a vector of population means, population regression weights, population variances, and so on.
- Q may not depend on the particular sample, thus Q cannot be a standard error, sample mean, p-value, and so on.

Goal of multiple imputation
- Estimate Q by $\hat{Q}$ or $\bar{Q}$, accompanied by a valid estimate of its uncertainty.
- What is the difference between $\hat{Q}$ and $\bar{Q}$?
- $\hat{Q}$ and $\bar{Q}$ both estimate Q
- $\hat{Q}$ accounts for the sampling uncertainty
- $\bar{Q}$ accounts for the sampling and missing data uncertainty

Pooled estimate $\bar{Q}$
- $\hat{Q}_\ell$ is the estimate of the $\ell$-th repeated imputation
- $\hat{Q}_\ell$ contains k parameters and is represented as a k × 1 column vector
- The pooled estimate $\bar{Q}$ is simply the average
$$\bar{Q} = \frac{1}{m} \sum_{\ell=1}^{m} \hat{Q}_\ell \qquad (1)$$

Within-imputation variance
- Average of the complete-data variances:
$$\bar{U} = \frac{1}{m} \sum_{\ell=1}^{m} \bar{U}_\ell, \qquad (2)$$
  where $\bar{U}_\ell$ is the variance-covariance matrix of $\hat{Q}_\ell$ obtained for the $\ell$-th imputation
- $\bar{U}_\ell$ is the variance of the estimate, not the variance in the data
- The within-imputation variance is large if the sample is small

Between-imputation variance
- Variance between the m complete-data estimates is given by
$$B = \frac{1}{m-1} \sum_{\ell=1}^{m} (\hat{Q}_\ell - \bar{Q})(\hat{Q}_\ell - \bar{Q})', \qquad (3)$$
  where $\bar{Q}$ is the pooled estimate (cf. equation 1)
- The between-imputation variance is large if there are many missing data

Total variance
- The total variance is not simply $T = \bar{U} + B$
- The correct formula is
$$T = \bar{U} + B + B/m = \bar{U} + \left(1 + \frac{1}{m}\right) B \qquad (4)$$
  for the total variance of $\bar{Q}$, and hence of $(Q - \bar{Q})$ if $\bar{Q}$ is unbiased
- The term B/m is the simulation error

Three sources of variation
In summary, the total variance T stems from three sources:
1. $\bar{U}$, the variance caused by the fact that we are taking a sample rather than the entire population. This is the conventional statistical measure of variability;
2. B, the extra variance caused by the fact that there are missing values in the sample;
3. B/m, the extra simulation variance caused by the fact that $\bar{Q}$ itself is based on finite m.
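
For a scalar estimand, equations (1) to (4) reduce to a few lines of arithmetic. The function below is a hand-rolled sketch for illustration (pool_scalar is a hypothetical name, not part of mice; in practice pool() does this work), applied to made-up numbers.

```r
# Hypothetical helper: pool m scalar estimates and their variances
pool_scalar <- function(q_hat, u) {
  m       <- length(q_hat)
  q_bar   <- mean(q_hat)                # pooled estimate, eq. (1)
  u_bar   <- mean(u)                    # within-imputation variance, eq. (2)
  b       <- var(q_hat)                 # between-imputation variance, eq. (3)
  t_total <- u_bar + (1 + 1 / m) * b    # total variance, eq. (4)
  list(q_bar = q_bar, u_bar = u_bar, b = b, t_total = t_total)
}

# Made-up estimates from m = 3 imputations, each with variance 0.5
res <- pool_scalar(q_hat = c(1, 2, 3), u = c(0.5, 0.5, 0.5))
str(res)
```

With these numbers the pooled estimate is 2, the within variance is 0.5, the between variance is 1, and the total variance is 0.5 + (1 + 1/3) × 1.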

Variance ratios (1)
- Proportion of the variation attributable to the missing data
$$\lambda = \frac{B + B/m}{T} \qquad (5)$$
- Relative increase in variance due to nonresponse
$$r = \frac{B + B/m}{\bar{U}} \qquad (6)$$
- These are related by $r = \lambda / (1 - \lambda)$.

Variance ratios (2)
- Fraction of information about Q missing due to nonresponse
$$\gamma = \frac{r + 2/(\nu + 3)}{1 + r} \qquad (7)$$
- This measure needs an estimate of the degrees of freedom $\nu$.
- Relation between $\gamma$ and $\lambda$. The literature often confuses $\gamma$ and $\lambda$.
$$\gamma = \frac{\nu + 1}{\nu + 3} \lambda + \frac{2}{\nu + 3} \qquad (8)$$

Statistical inference for $\bar{Q}$ (1)
The 100(1 − α)% confidence interval for $\bar{Q}$ is calculated as
$$\bar{Q} \pm t_{(\nu, 1-\alpha/2)} \sqrt{T}, \qquad (9)$$
where $t_{(\nu, 1-\alpha/2)}$ is the quantile corresponding to probability 1 − α/2 of $t_\nu$.
For example, use t(10, 0.975) = 2.23 for the 95% confidence interval for ν = 10.

Statistical inference for $\bar{Q}$ (2)
Suppose we test the null hypothesis $Q = Q_0$ for some specified value $Q_0$. We can find the p-value of the test as the probability
$$P_s = \Pr\left[ F_{1,\nu} > \frac{(Q_0 - \bar{Q})^2}{T} \right] \qquad (10)$$
where $F_{1,\nu}$ is an F distribution with 1 and ν degrees of freedom.
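
Equations (9) and (10) translate directly into base R quantile and distribution functions. The pooled estimate, total variance and degrees of freedom below are made-up values for illustration.

```r
# Hypothetical pooled results for a scalar estimand
q_bar   <- 2     # pooled estimate
t_total <- 1.5   # total variance T
nu      <- 10    # degrees of freedom
q0      <- 0     # null value Q_0

# 95% confidence interval, eq. (9)
t_quant <- qt(0.975, df = nu)
ci <- q_bar + c(-1, 1) * t_quant * sqrt(t_total)

# p-value, eq. (10): upper tail of F(1, nu)
p_value <- pf((q0 - q_bar)^2 / t_total, df1 = 1, df2 = nu, lower.tail = FALSE)

round(t_quant, 2)
ci
p_value
```

For a scalar estimand the F test is equivalent to a two-sided t test on $(\bar{Q} - Q_0)/\sqrt{T}$ with ν degrees of freedom.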

Degrees of freedom (1)
- With missing data, n is effectively lower. Thus, the degrees of freedom in statistical tests need to be adjusted.
- The old formula assumes n = ∞:
$$\nu_{old} = (m - 1)\left(1 + \frac{1}{r}\right)^2 = \frac{m - 1}{\lambda^2} \qquad (11)$$

Degrees of freedom (2)
- The new formula is
$$\nu = \frac{\nu_{old}\,\nu_{obs}}{\nu_{old} + \nu_{obs}}, \qquad (12)$$
  where the estimated observed-data degrees of freedom that accounts for the missing information is
$$\nu_{obs} = \frac{\nu_{com} + 1}{\nu_{com} + 3}\,\nu_{com}(1 - \lambda), \qquad (13)$$
  with $\nu_{com} = n - k$.
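
The variance ratios and the degrees of freedom can be computed together from $\bar{U}$, B, m, n and k. The function below is an illustrative sketch (mi_df is a hypothetical name, not a mice function). It assumes a scalar estimand and uses the adjusted ν of equation (12) inside the formula for γ; the input values are made up.

```r
# Hypothetical helper: variance ratios and degrees of freedom for a scalar Q
mi_df <- function(u_bar, b, m, n, k) {
  t_total <- u_bar + (1 + 1 / m) * b
  lambda  <- (b + b / m) / t_total                    # eq. (5)
  r       <- (b + b / m) / u_bar                      # eq. (6)
  nu_old  <- (m - 1) / lambda^2                       # eq. (11)
  nu_com  <- n - k
  nu_obs  <- (nu_com + 1) / (nu_com + 3) * nu_com * (1 - lambda)  # eq. (13)
  nu      <- nu_old * nu_obs / (nu_old + nu_obs)      # eq. (12)
  gamma   <- (r + 2 / (nu + 3)) / (1 + r)             # eq. (7)
  c(lambda = lambda, r = r, nu_old = nu_old, nu_obs = nu_obs,
    nu = nu, gamma = gamma)
}

# Made-up inputs: m = 3 imputations, n = 100 cases, k = 2 parameters
out <- mi_df(u_bar = 0.5, b = 1, m = 3, n = 100, k = 2)
round(out, 3)
```

The adjusted ν is always smaller than both $\nu_{old}$ and $\nu_{obs}$, so the resulting intervals are never narrower than those from the old formula.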

How large should m be?
- Classic advice: m = 3, 5, 10. More recently: set m higher: 20-100.
Some advice
1. Use m = 5 or m = 10 if the fraction of missing information is low, γ < 0.2.
2. Develop your model with m = 5. Do final run with m equal to percentage of incomplete cases.
3. Repeat the analysis with m = 5 with different seeds. If there are large differences for some parameters, this means that the data contain little information about them.
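
The second piece of advice is easy to compute. A small sketch, assuming the built-in airquality data as an example; rounding to the nearest integer is a choice made here, not part of the advice.

```r
# Assumption: airquality as example data
pct_incomplete <- 100 * mean(!complete.cases(airquality))  # % incomplete cases
m_final <- round(pct_incomplete)                           # m for the final run

pct_incomplete
m_final
```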

The legacy

Introductions to multiple imputation
1. Schafer, J.L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8(1), 3-15.
2. Sterne et al. (2009). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ, 338, b2393.
3. Van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL.

SESSION III

Creating imputations, univariate
[Figure series: gas consumption (cubic feet) against temperature (°C)]
- Relation between temperature and gas consumption
- (a) We delete gas consumption of observation 47 (the deleted observation)
- (b) Predict imputed value from regression line
- (c) Predicted value + noise
- (d) Predicted value + noise + parameter uncertainty
- (e) Imputation based on two predictors (before insulation, after insulation)

Predictive mean matching: Y given X
[Figure series: gas consumption (cubic feet) against temperature (°C), before insulation and after insulation]
1. Add two regression lines
2. Predicted given 5 °C, after insulation
3. Define a matching range ŷ ± δ
4. Select potential donors
5. Bayesian PMM: Draw a line
6. Define a matching range ŷ ± δ
7. Select potential donors
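
The donor-selection idea can be sketched in base R. This is a simplified illustration, not the algorithm in mice: it assumes the airquality data instead of the gas consumption data, takes the d = 5 closest donors rather than a fixed range, and skips the parameter draw of the Bayesian variant.

```r
set.seed(2)  # arbitrary seed
dat <- airquality[, c("Ozone", "Temp")]
obs <- !is.na(dat$Ozone)

fit  <- lm(Ozone ~ Temp, data = dat)
yhat <- predict(fit, newdata = dat)   # predicted mean for every record

# For one target prediction: pick the d observed cases with the closest
# predicted means, then borrow the observed value of one random donor
impute_one <- function(target, d = 5) {
  donors <- order(abs(yhat[obs] - target))[seq_len(d)]
  as.numeric(sample(dat$Ozone[obs][donors], 1))
}

imputed <- vapply(yhat[!obs], impute_one, numeric(1))
dat$Ozone[!obs] <- imputed
summary(imputed)
```

Every imputed value is a value that was actually observed, so the imputations cannot leave the range of the data.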

Imputation of a binary variable
- logistic regression
$$\Pr(y_i = 1 \mid X_i, \beta) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)} \qquad (14)$$
[Figure series: probability (0 to 1) against the linear predictor (−3 to 3)]
1. Fit logistic model
2. Draw parameter estimate
3. Read off the probability
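
The three steps for a binary variable can be sketched in base R on simulated data. This is an illustration under stated assumptions (simulated x and y, a normal approximation for the parameter draw); it is not the implementation behind the logreg method in mice.

```r
set.seed(3)  # arbitrary seed
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + x))   # simulated binary outcome
y[sample(n, 40)] <- NA               # make 40 outcomes missing
mis <- is.na(y)

# 1. Fit the logistic model on the observed cases
fit <- glm(y ~ x, family = binomial)

# 2. Draw parameters from a normal approximation around the estimate
beta_hat  <- coef(fit)
beta_star <- beta_hat + drop(t(chol(vcov(fit))) %*% rnorm(length(beta_hat)))

# 3. Read off the probabilities and draw 0/1 imputations
p <- drop(plogis(cbind(1, x[mis]) %*% beta_star))
y[mis] <- rbinom(sum(mis), 1, p)
table(y)
```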

Impute ordered categorical variable
- K ordered categories k = 1, ..., K
- ordered logit model, or proportional odds model
$$\Pr(y_i = k \mid X_i, \beta) = \frac{\exp(\lambda_k + X_i \beta)}{\sum_{k=1}^{K} \exp(\lambda_k + X_i \beta)} \qquad (15)$$
[Figure series: probability (0 to 1) against the linear predictor (−3 to 3)]
1. Fit ordered logit model
2. Read off the probability

Other types of variables
- Count data
- Semi-continuous data
- Censored data
- Truncated data
- Rounded data

Univariate imputation in mice
Method       Description                         Scale type
pmm          Predictive mean matching            numeric
norm         Bayesian linear regression          numeric
norm.nob     Linear regression, non-Bayesian     numeric
norm.boot    Linear regression with bootstrap    numeric
mean         Unconditional mean imputation       numeric
2L.norm      Two-level linear model              numeric
logreg       Logistic regression                 factor, 2 levels
logreg.boot  Logistic regression with bootstrap  factor, 2 levels
polyreg      Multinomial logit model             factor, > 2 levels
polr         Ordered logit model                 ordered, > 2 levels
lda          Linear discriminant analysis        factor
sample       Simple random sample                any
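
The method is set per column through the method argument of mice(). A short sketch using the nhanes2 data from the package; switching bmi to norm is an arbitrary choice for illustration.

```r
library(mice)

# Dry run (maxit = 0) to obtain the default method for each column
ini  <- mice(nhanes2, maxit = 0)
meth <- ini$method
meth                      # defaults depend on the scale type of each column

# Illustrative change: Bayesian linear regression instead of pmm for bmi
meth["bmi"] <- "norm"
imp <- mice(nhanes2, method = meth, m = 5, seed = 456, printFlag = FALSE)
imp$method
```

Complete columns get the empty method "" and are not imputed.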

Problems in multivariate imputation
- Predictors themselves can be incomplete
- Mixed measurement levels
- Order of imputation can be meaningful
- Too many predictor variables
- Relations could be nonlinear
- Higher order interactions
- Impossible combinations

Three general strategies
- Monotone data imputation
- Joint modeling
- Fully conditional specification (FCS)

Imputation of monotone pattern
[Figure: monotone missing data pattern over the variables X1, Y1, Y2, Y3, Y4]

Joint Modeling (JM)
1. Specify joint model P(Y, X, R)
2. Derive P(Y_mis | Y_obs, X, R)
3. Use MCMC techniques to draw imputations Y_mis

Joint modeling: Software
R/S-Plus     norm, cat, mix, pan, Amelia
SAS          proc MI, proc MIANALYZE
Stata        MI command
Stand-alone  Amelia, solas, norm, pan

Joint Modeling: Pros
- Yields correct statistical inference under the assumed JM
- Efficient parametrization (if the model fits)
- Known theoretical properties
- Works very well for parameters close to the center
- Many applications

Joint Modeling: Cons
- Lack of flexibility
- May lead to large models
- Can assume more than the complete data problem
- Can impute impossible data

Fully Conditional Specification (FCS)
1. Specify P(Y_mis | Y_obs, X, R)
2. Use MCMC techniques to draw imputations Y_mis

Multivariate Imputation by Chained Equations (MICE)
MICE algorithm
- Specify imputation model for each incomplete column
- Fill in starting imputations
- And iterate
- Model: Fully Conditional Specification (FCS)
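
The loop structure of the algorithm can be sketched in base R. This is a deliberately simplified illustration, not the mice implementation: it assumes the airquality data, uses stochastic linear regression for every incomplete column, produces a single imputed data set, and omits the parameter draws.

```r
set.seed(4)  # arbitrary seed
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
mis <- is.na(dat)                                  # remember where the holes are
incomplete <- names(which(colSums(mis) > 0))

# Starting imputations: random draws from the observed values of each column
for (v in incomplete) {
  dat[mis[, v], v] <- sample(dat[!mis[, v], v], sum(mis[, v]), replace = TRUE)
}

# Iterate: re-impute each incomplete column given the current other columns
for (iter in 1:10) {
  for (v in incomplete) {
    form <- reformulate(setdiff(names(dat), v), response = v)
    fit  <- lm(form, data = dat[!mis[, v], ])      # fit where v was observed
    pred <- predict(fit, newdata = dat[mis[, v], ])
    dat[mis[, v], v] <- pred + rnorm(sum(mis[, v]), sd = summary(fit)$sigma)
  }
}
colSums(is.na(dat))
```

Repeating the whole procedure m times with different random draws would give m imputed data sets; mice() automates this and offers a separate imputation method per column.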

Fully Conditional Specification: Cons
- Theoretical properties only known in special cases
- Cannot use computational shortcuts, like sweep-operator
- Joint distribution may not exist (incompatibility)

Fully Conditional Specification: Pros
- Easy and flexible
- Imputes close to the data, prevents impossible data
- Subset selection of predictors
- Modular, can preserve valuable work
- Works well, both in simulations and practice

Fully Conditional Specification (FCS): Software
R            mice, transcan, mi, VIM, baboon
SPSS         V17 procedure multiple imputation
SAS          IVEware, SAS 9.3
Stata        ice command, multiple imputation command
Stand-alone  Solas, Mplus

How many iterations?
- Quick convergence
- 5-10 iterations is adequate for most problems
- More iterations if λ is high
- Inspect the generated imputations
- Monitor convergence to detect anomalies

Non-convergence
[Figure: trace plots of the mean and sd of the imputed values of hgt, wgt and bmi against iteration (1-20)]

Convergence
[Figure: trace plots of the mean and sd of the imputed values of hgt, wgt and bmi against iteration (1-20)]
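
Trace plots like these come from calling plot() on a mids object. A minimal sketch with the nhanes data; the seed and the number of iterations are arbitrary.

```r
library(mice)

# Run 20 iterations so the trace lines are long enough to judge
imp <- mice(nhanes, m = 5, maxit = 20, seed = 789, printFlag = FALSE)

plot(imp)            # mean and sd of the imputations per iteration
dim(imp$chainMean)   # variables x iterations x imputations

# Continue the same chains for 10 more iterations if in doubt
imp2 <- mice.mids(imp, maxit = 10, printFlag = FALSE)
imp2$iteration
```

Healthy chains intermingle freely and show no trend; lines that drift or stay apart suggest a problem such as a feedback loop between derived variables.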

SESSION IV

Imputation model choices
1. MAR or MNAR
2. Form of the imputation model
3. Which predictors
4. Derived variables
5. What is m?
6. Order of imputation
7. Diagnostics, convergence

Which predictors?
1. Include all variables that appear in the complete-data model
2. In addition, include the variables that are related to the nonresponse
3. In addition, include variables that explain a considerable amount of variance
4. Remove from the variables selected in steps 2 and 3 those variables that have too many missing values within the subgroup of incomplete cases.
Functions quickpred() and flux()
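
Both helper functions can be tried on the nhanes data. A short sketch; the mincor threshold of 0.3 and the seed are arbitrary choices for illustration.

```r
library(mice)

# Keep only predictors with absolute correlation of at least 0.3
pred <- quickpred(nhanes, mincor = 0.3)
pred                 # rows: variable to impute, columns: predictors (0/1)

# Influx and outflux summarize how each variable connects to the others
fx <- flux(nhanes)
fx

# Use the reduced predictor matrix in the imputation
imp <- mice(nhanes, predictorMatrix = pred, seed = 321, printFlag = FALSE)
```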

Derived variables
- ratio of two variables
- sum score
- index variable
- quadratic relations
- interaction term
- conditional imputation
- compositions

How to impute a ratio?
- Weight/height ratio: whr = wgt/hgt (kg/m)
- Easy if only one of wgt or hgt or whr is missing
Methods
- POST: Impute wgt and hgt, and calculate whr after imputation
- JAV: Impute whr as just another variable
- PASSIVE1: Impute wgt and hgt, and calculate whr during imputation
- PASSIVE2: As PASSIVE1 with adapted predictor matrix

Method POST
    library(mice)
    imp1 <- mice(boys)
    # stack the original data and all imputed data sets
    long <- complete(imp1, "long", include = TRUE)
    long$whr <- with(long, wgt / (hgt / 100))
    # the slide used long2mids(); as.mids() replaces it in current mice
    imp2 <- as.mids(long)

Method JAV: Just another variable
    boys$whr <- boys$wgt / (boys$hgt / 100)
    imp.jav <- mice(boys, m = 1, seed = 32093, maxit = 10)

Method JAV
[Figure: Weight/Height (kg/m) against Height (cm); panels JAV, passive, passive 2]

Method PASSIVE
    meth["whr"] <- "~I(wgt/(hgt/100))"

Method PASSIVE, predictor matrix
     age hgt wgt bmi  hc gen phb  tv reg whr
age    0   0   0   0   0   0   0   0   0   0
hgt    1   0   1   0   1   1   1   1   1   0
wgt    1   1   0   0   1   1   1   1   1   0
bmi    1   1   1   0   1   1   1   1   1   0
hc     1   1   1   1   0   1   1   1   1   1
gen    1   1   1   1   1   0   1   1   1   1
phb    1   1   1   1   1   1   0   1   1   1
tv     1   1   1   1   1   1   1   0   1   1
reg    1   1   1   1   1   1   1   1   0   1
whr    1   1   1   0   1   1   1   1   1   0

Method PASSIVE
[Figure: Weight/Height (kg/m) against Height (cm); panels JAV, passive, passive 2]
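
The method line and the predictor matrix fit together as follows. This is a sketch of one way to set up the PASSIVE run, assuming a copy of the boys data (named dat here) so the package data stay untouched; m and maxit are kept small only to make the example quick.

```r
library(mice)

dat <- boys                                   # work on a copy
dat$whr <- dat$wgt / (dat$hgt / 100)          # kg/m, missing if wgt or hgt is

ini  <- mice(dat, maxit = 0)                  # dry run for default settings
meth <- ini$method
pred <- ini$predictorMatrix

meth["whr"] <- "~I(wgt / (hgt / 100))"        # passive: recompute, do not model
pred[c("hgt", "wgt", "bmi"), "whr"] <- 0      # whr does not predict its parts
pred[c("hgt", "wgt", "whr"), "bmi"] <- 0      # bmi does not predict hgt and wgt

imp <- mice(dat, method = meth, predictorMatrix = pred,
            m = 2, maxit = 2, seed = 32093, printFlag = FALSE)
comp <- complete(imp, 1)
head(comp[, c("hgt", "wgt", "whr")])
```

In the completed data whr equals wgt/(hgt/100) in every row, which is the consistency that the JAV method does not guarantee.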

Method PASSIVE2
    pred[c("wgt", "hgt", "hc", "reg"), "bmi"] <- 0
    pred[c("gen", "phb", "tv"), c("hgt", "wgt", "hc")] <- 0
    pred[, "whr"] <- 0

Method PASSIVE2, predictor matrix
     age hgt wgt bmi  hc gen phb  tv reg whr
age    0   0   0   0   0   0   0   0   0   0
hgt    1   0   1   0   1   1   1   1   1   0
wgt    1   1   0   0   1   1   1   1   1   0
bmi    1   1   1   0   1   1   1   1   1   0
hc     1   1   1   0   0   1   1   1   1   0
gen    1   0   0   1   0   0   1   1   1   0
phb    1   0   0   1   0   1   0   1   1   0
tv     1   0   0   1   0   1   1   0   1   0
reg    1   1   1   0   1   1   1   1   0   0
whr    1   1   1   1   1   1   1   1   1   0

Method PASSIVE2
[Figure: Weight/Height (kg/m) against Height (cm); panels JAV, passive, passive 2]

Derived variables: summary
- Derived variables pose special challenges
- Plausible values respect data dependencies
- If you can, create derived variables after imputation
- If you cannot, use passive imputation
- Break up direct feedback loops using the predictor matrix

Standard diagnostic plots in mice
Since mice 2.5, plots for imputed data:
- one-dimensional scatter: stripplot
- box-and-whisker plot: bwplot
- densities: densityplot
- scattergram: xyplot

Stripplot
    library(mice)
    imp <- mice(nhanes, seed = 29981)
    stripplot(imp, pch = c(1, 19))

[Figure: stripplot(imp, pch = c(1, 19)); panels age, bmi, hyp, chl against imputation number (0-5)]

A larger data set
    imp <- mice(boys, seed = 24331, maxit = 1)
    bwplot(imp)

[Figure: bwplot(imp); panels age, hgt, wgt, bmi, hc, tv against imputation number (0-5)]

[Figure: densityplot(imp); density panels for hgt, wgt, bmi, hc, tv]

SESSION V

Reporting guidelines
1. Amount of missing data
2. Reasons for missingness
3. Differences between complete and incomplete data
4. Method used to account for missing data
5. Software
6. Number of imputed datasets
7. Imputation model
8. Derived variables
9. Diagnostics
10. Pooling
11. Listwise deletion
12. Sensitivity analysis