Basics of Modern Missing Data Analysis


Basics of Modern Missing Data Analysis Kyle M. Lang Center for Research Methods and Data Analysis University of Kansas March 8, 2013

Topics to be Covered An introduction to the missing data problem; missing data mechanisms; ad hoc missing data techniques; modern missing data analysis (EM, MI, FIML); syntax examples of MI and FIML

A Comment on Notation Y = (Y_obs, Y_miss): the partitioned complete data. P(X): the probability of X. P(X | Y): the probability of X given Y (conditioned on Y). P(X | Y, Z): the probability of X given Y and Z (conditioned on Y and Z). l: the loglikelihood

What is Missing Data? Unobserved data with true population values: nonresponse on sample surveys; attrition in longitudinal studies; un-administered materials in planned missingness designs; data lost unintentionally. Note: not all unobserved data are truly missing

Missing Data Patterns Three primary patterns of missingness: 1) Univariate 2) Monotone 3) Arbitrary

Univariate Pattern of Missing Missingness occurs on only a single variable. Example (a dot denotes a missing value):

X Y Z
2 5 8
3 4 3
4 5 7
5 3 5
3 4 .
6 2 .
4 3 .
6 1 .
2 6 .

Monotone Pattern of Missing The pattern of missingness at time t + 1 contains the pattern at time t (PM_{t+1} ⊇ PM_t). Most commonly due to attrition; very common in longitudinal data. Example:

T1 T2 T3 T4
1  3  4  6
2  3  5  2
3  6  3  3
2  5  6  .
3  3  4  .
2  6  .  .
6  3  .  .
6  .  .  .
7  .  .  .

Arbitrary Pattern of Missing There is no apparent pattern to the missing values. Probably the most common manifestation of missing data. Example:

X Y Z
1 5 2
. 5 5
6 . 6
. . 7
3 4 .
5 . 7
7 6 .
4 3 5
. 4 3

The Causes of Missingness Missing data are missing for a reason: the missing data mechanism. Missing Completely at Random (MCAR); Missing at Random (MAR); Missing Not at Random (MNAR)

Missing Completely at Random MCAR missingness can be viewed as a random sample of the complete data. There are no meaningful correlates of the missing data, either measured or unmeasured. Example: a survey is administered to an undergraduate biology class, and some item responses are not collected because, for a few of the students, a page of the measure was accidentally excluded from the survey packet. Almost entirely recoverable

Missing at Random The missingness is correlated with a measured variable other than the one that contains the missingness. The missingness can have a systematic cause, but the cause has been captured in the dataset. Example: highly religious subjects fail to respond to items assessing attitudes towards sex. If you have a measure of religiosity in the data, this is MAR missingness. Very well recovered with appropriate techniques

Missing Not at Random The missingness is correlated with the variable containing the missingness or with an unmeasured variable. Example: subjects with very high or very low household income fail to respond to an SES item; participants' level on the item dictates their response rates. If there is no measure of religiosity in the previous example, that missingness is MNAR. Very poorly recovered even with appropriate techniques

Ignorability Almost all missing data techniques require that the missingness mechanism be ignorable. To safely ignore the missing data mechanism, Rubin (1976) showed that 1) a MAR mechanism must give rise to the loss of information, and 2) the parameters of substantive interest (θ) and those which contribute to the missingness (ξ) must be distinct

Ignorability II Specifically, for the first tenet to hold:

P(R | Y_obs, Y_miss, ξ) = P(R | Y_obs, ξ)

The probability of missingness conditioned on both Y_obs and Y_miss must be no more informative than that conditioned on only Y_obs

Ignorability III For the second tenet to hold: Frequentist: The joint parameter space (θ,ξ) must be the Cartesian cross-product of the individual parameter spaces θ and ξ. Bayesian: Any possible joint prior applied to (θ,ξ) must factor into unique marginal distributions of θ and ξ.

Ad Hoc Techniques Listwise & Casewise Deletion Mean Substitution Regression Imputation / Conditional Mean Substitution

Listwise & Casewise Deletion Both entail deleting observed information. Listwise deletion removes all cases with at least one unobserved value. Casewise deletion removes pairs of values from the calculation of sufficient statistics. Both can entail a substantial loss of sample size: in a 100 × 100 data matrix with 20 missing values, only 0.2% of the data are missing, yet up to 20% of the cases may be deleted. In the case of a nonrandom missingness mechanism, the truncated dataset will cease to be generalizable to the larger population.
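The deletion arithmetic on this slide can be sketched quickly (a hypothetical 100 × 100 matrix; a minimal illustration, not from the original deck):

```python
# Sketch of the slide's arithmetic: 20 missing cells scattered over a
# hypothetical 100 x 100 data matrix can cost far more than 20 cells.
n_rows, n_cols, n_missing = 100, 100, 20

pct_cells_missing = 100 * n_missing / (n_rows * n_cols)  # share of values missing
# Worst case: every missing cell falls in a different row, so listwise
# deletion drops one whole case (row) per missing value.
pct_rows_deleted = 100 * n_missing / n_rows

print(pct_cells_missing, pct_rows_deleted)  # 0.2 20.0
```

The gap between 0.2% of values and 20% of cases is why deletion methods can be so wasteful even with little missingness.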

Mean Substitution Every missing value is replaced with the mean of its scale. This can lead to a substantial loss of variability, causes regression towards the mean, and can attenuate the true population relationships between variables
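A minimal simulation (hypothetical normal data, not from the deck) shows the variance attenuation that mean substitution causes:

```python
import random
import statistics

random.seed(1)
y = [random.gauss(50, 10) for _ in range(1000)]  # hypothetical complete data

# Delete roughly 30% of the values completely at random (MCAR).
observed = [v for v in y if random.random() > 0.30]

# Mean substitution: replace every missing value with the observed mean.
ybar = statistics.mean(observed)
imputed = observed + [ybar] * (len(y) - len(observed))

# Every filled-in value sits exactly at the mean, so the imputed sample's
# variance is attenuated relative to the complete data.
print(statistics.pvariance(y), statistics.pvariance(imputed))
```

With ~30% missingness the imputed variance shrinks to roughly 70% of the true value, which in turn shrinks covariances and correlations.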

Regression Imputation The variable with missingness is regressed on all other variables, and the predicted values are used as replacements for the missing values. Leads to artificially precise standard errors. Amplifies the conformity of the missing values with the linear trend, which can artificially inflate the correlations between variables

General Problem with Single Imputation Techniques Single imputation techniques assume that the imputed value is an unbiased estimate of the true population value. This leads to artificially precise standard errors, which bias Wald tests and confidence intervals: coverage falls below 95% for an alpha of .05, increasing the Type I error rate

Example A company administers IQ tests when hiring. Applicants with IQs < 99 are not hired, so no job performance scores can be collected for them. What is the relationship between job performance and IQ?

Example The IQ = 99 cutoff splits the complete data from the missing data; listwise deletion only analyzes the observed data

Example Mean imputation adds no error and uses unrealistic values to replace the true values, changing the regression line. Regression imputation doesn't change the line, but assumes it wouldn't: still no error, and false certainty that this is the right line

Example Compared to the complete data, Stochastic Regression Imputation (SRI) adds an error term to the regression line and more closely resembles the original complete data, but different random error terms would change the outcome. Multiple Imputation (MI) averages over different SRIs
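A rough sketch of stochastic regression imputation under the deck's IQ/job-performance setup (all data and coefficients here are hypothetical; a plain least-squares fit stands in for whatever model one would actually use):

```python
import random
import statistics

random.seed(7)
n = 400
iq = [random.gauss(100, 15) for _ in range(n)]
jp = [0.5 * x + random.gauss(0, 5) for x in iq]  # hypothetical "truth"

# MAR deletion mimicking the example: JP unobserved whenever IQ < 99.
jp_obs = [y if x >= 99 else None for x, y in zip(iq, jp)]

# Least-squares fit of JP on IQ using the complete cases.
cc = [(x, y) for x, y in zip(iq, jp_obs) if y is not None]
mx = statistics.mean(x for x, _ in cc)
my = statistics.mean(y for _, y in cc)
b = sum((x - mx) * (y - my) for x, y in cc) / sum((x - mx) ** 2 for x, _ in cc)
a = my - b * mx
res_sd = statistics.stdev(y - (a + b * x) for x, y in cc)

def sri():
    # One stochastic regression imputation: prediction plus a random error
    # drawn from the residual distribution.
    return [y if y is not None else a + b * x + random.gauss(0, res_sd)
            for x, y in zip(iq, jp_obs)]

# Plain regression imputation would omit the error term.  The m stochastic
# imputations differ from one another, which is what MI builds on.
imputations = [sri() for _ in range(5)]
```

Observed values are identical across imputations; only the filled-in values vary, and that between-imputation variability is exactly what MI later quantifies.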

Modern Missing Data Analysis Expectation Maximization Dempster, Laird, & Rubin (1977) Multiple Imputation Rubin (1976, 1987) Full Information Maximum Likelihood Anderson (1957)

Expectation Maximization (EM) Maximum likelihood approach to estimating a complete covariance matrix in the presence of missing data. Calculates the expected values of sufficient statistics (E-Step). Maximizes the conditional loglikelihood derived from the above sufficient statistics (M-Step). Converges to a stationary covariance matrix when the change in the loglikelihood between two subsequent M-Steps falls below a specified convergence criterion

Factored complete data PDF Assuming the ignorability of the missing data mechanism:

P(Y | θ) = P(Y_obs | θ) P(Y_miss | Y_obs, θ)

EM replaces Y_miss with the expected value of Y_miss | Y_obs, θ

Expectation (E) Step Separate regression equations are calculated for every pattern of missingness The predicted values from these equations are used to fill in the missing values This complete data frame is used to calculate updated estimates of the sufficient statistics (i.e., sums of squares and sums of cross products)

Maximization (M) Step The sufficient statistics produced by the E-Step are used to compute the complete-data loglikelihood This loglikelihood is maximized as a complete data problem to reach new estimates of μ and Σ

EM Continued The E-Step is calculating:

Q(θ | θ^(t)) = ∫ l(θ | Y) P(Y_miss | Y_obs, θ^(t)) dY_miss

The M-Step maximizes the above such that:

Q(θ^(t+1) | θ^(t)) ≥ Q(θ | θ^(t)), for all θ
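A minimal EM sketch for the simplest interesting case, a bivariate normal with one variable partially missing (hypothetical data; the E-step fills in expected sufficient statistics, including the residual variance for the expected squares, and the M-step re-estimates the complete-data moments):

```python
import random

random.seed(3)
n = 300
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.8 * v + random.gauss(0, 0.6) for v in x]  # hypothetical truth
missing = [i % 3 == 0 for i in range(n)]               # every third y unobserved

# Initial estimates from the observed values only.
y_obs = [b for b, m in zip(y, missing) if not m]
mu_x = sum(x) / n
mu_y = sum(y_obs) / len(y_obs)
var_x = sum((a - mu_x) ** 2 for a in x) / n
var_y = sum((b - mu_y) ** 2 for b in y_obs) / len(y_obs)
cov_xy = 0.0

for step in range(50):
    # E-step: expected sufficient statistics.  A missing y_i contributes its
    # regression prediction, and the conditional (residual) variance is
    # added to its expected square.
    beta = cov_xy / var_x
    alpha = mu_y - beta * mu_x
    res_var = var_y - beta * cov_xy
    s_y = s_yy = s_xy = 0.0
    for a, b, m in zip(x, y, missing):
        ey = alpha + beta * a if m else b
        s_y += ey
        s_yy += ey ** 2 + (res_var if m else 0.0)
        s_xy += a * ey
    # M-step: complete-data ML estimates of the mean and covariance.
    mu_y = s_y / n
    var_y = s_yy / n - mu_y ** 2
    cov_xy = s_xy / n - mu_x * mu_y

print(round(mu_y, 2), round(cov_xy / var_x, 2))
```

The first iteration is effectively mean imputation (the starting covariance is zero); subsequent iterations pull the estimates toward the ML solution, and the implied regression slope approaches the generating value.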

Multiple Imputation Bayesian simulation approach to imputation Developed to address many of the shortcomings of single imputation and early ML techniques Accounts for our uncertainty in the true population values of Y miss Does not lead to downwardly biased standard errors

MI Continued Through an iterative simulation technique m>1 plausible values of Y miss are estimated Creates m complete datasets The variability in the values of Y miss between every imputation (i.e., between imputation variance) quantifies our uncertainty in the true population values of Y miss m analysis models are run, and the resulting estimates are combined for statistical inference

Data Augmentation Data Augmentation (Tanner & Wong, 1987) is a two step iterative procedure to create multiple imputations I-Step uses a series of stochastic regression equations to fill in the missing values P-Step pulls random draws from probability distributions derived from the complete data from the I-Step to introduce variability into the estimates of the parameters

Priors for MI Prior distributions must be specified to inform initial estimates of θ Informative Priors Stationary EM covariance matrix Sufficient statistics from a single regression imputation Uninformative Priors μ is drawn from a standard normal distribution Σ is drawn from a Wishart distribution

Imputation (I) Step Initial estimates of μ and Σ are used to calculate separate stochastic regression equations for each pattern of missingness The predicted values from the above equations are used to replace Y miss

Posterior (P) Step The complete data frame from the I-Step is used to parameterize a normal and χ 2 distribution New estimates of μ and Σ are estimated based on random draws from the above distributions

P-Step Continued New estimates of σ² are drawn according to the following relationship:

s²*_Y ~ (n − 1) s²_Y / χ²_{n−1}

New estimates of μ are drawn as follows:

ȳ* ~ N(ȳ, s²*_Y / n)
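These two draws can be sketched directly (hypothetical data standing in for the I-step output; the χ² draw is obtained through the gamma distribution, since χ²_k is Gamma(k/2, scale 2)):

```python
import random
import statistics

random.seed(11)
# Hypothetical "complete" variable coming out of an I-step.
y = [random.gauss(10, 2) for _ in range(50)]
n = len(y)
ybar = statistics.mean(y)
s2 = statistics.variance(y)  # sample variance, n - 1 denominator

# P-step: draw new parameter values rather than keeping point estimates.
# sigma2* ~ (n - 1) s^2 / chi^2_{n-1}
chi2_draw = random.gammavariate((n - 1) / 2, 2)
sigma2_star = (n - 1) * s2 / chi2_draw
# mu* ~ N(ybar, sigma2*/n)
mu_star = random.gauss(ybar, (sigma2_star / n) ** 0.5)

print(round(mu_star, 2), round(sigma2_star, 2))
```

Because the parameters themselves are random draws, each P-step propagates parameter uncertainty into the next I-step's imputations, which is what separates data augmentation from repeated stochastic regression imputation.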

Markov Chains The estimates from each subsequent P-Step create a Markov chain: θ^(t+1) is dependent on θ^(t). After a sufficient number of iterations, elements far apart in the chain become independent: θ^(t+p) is independent of θ^(t)

Independent Draws To achieve valid imputations, a number of P-Steps must be discarded (burn-in iterations) before the first draw. This ensures that the initial imputed estimates are independent of the prior. Between every imputation after the first, many more P-Steps must be discarded. This ensures that subsequent imputed values are independent: imputation m_t is independent of imputation m_{t+1}. A common recommendation for the number of P-Steps to discard is twice the number of iterations it takes the EM algorithm to converge

Combining Parameters in MI All m sets of parameters must be combined into a single vector of statistics in order to reach an inferential conclusion Rubin (1987) developed simple formulae to combine all m point estimates and variances into single statistics which can be used for Wald tests or CIs Rubin s Rules

Rubin's Rules for Point Estimates The aggregate point estimate is simply the mean of all m statistics:

Q̄ = (1/m) Σ_{j=1..m} Q̂^(j)

Rubin's Rules for Variance Components The variability around Q̄ has two components. Within-imputation variance:

Ū = (1/m) Σ_{j=1..m} U^(j)

Between-imputation variance:

B = (1/(m − 1)) Σ_{j=1..m} (Q̂^(j) − Q̄)²

Total Variance of MI Estimates The total variance of the MI estimates is a weighted sum of the within- and between-imputation variances:

T = Ū + (1 + 1/m) B

As m becomes larger, the influence of B is weighted downward and T becomes smaller
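Rubin's rules are short enough to implement in a few lines (the estimates and variances below are made-up values for one slope from m = 5 imputed datasets):

```python
def pool(estimates, variances):
    """Rubin's rules: pool m point estimates and their squared standard errors."""
    m = len(estimates)
    qbar = sum(estimates) / m                                # pooled point estimate
    ubar = sum(variances) / m                                # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    t = ubar + (1 + 1 / m) * b                               # total variance
    return qbar, t

# Hypothetical slope estimates and their sampling variances.
q, t = pool([0.48, 0.52, 0.50, 0.47, 0.53], [0.010, 0.011, 0.009, 0.010, 0.010])
print(q, t)
```

The pooled standard error is sqrt(T); note that T exceeds the average within-imputation variance Ū whenever the imputations disagree, which is how MI avoids the false precision of single imputation.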

Assumptions of MI Three assumptions of MI The probability distribution giving rise to the data is of a specified form Usually the normal distribution The missingness mechanism is ignorable The prior distribution used to inform the initial estimates of θ is appropriate Note: Assumptions 1 & 2 must also be made for ML techniques

Full Information Maximum Likelihood (FIML) FIML is a maximum likelihood estimator which is robust to the presence of missing data Calculates an individual loglikelihood for each case in the data frame Allows separate estimation of parameters for every pattern of missingness without the need to address the true population values of Y miss

Multivariate Normal PDF The multivariate normal PDF from which FIML derives the joint loglikelihood it maximizes is characterized by the following equation:

f(y) = (2π)^(−p/2) |Σ|^(−1/2) exp( −(1/2)(y − μ)' Σ⁻¹ (y − μ) )

Casewise Loglikelihood For each case in the data, the following loglikelihood is computed:

l_i = −(p/2) log(2π) − (1/2) log|Σ| − (1/2)(y_i − μ)' Σ⁻¹ (y_i − μ)

which becomes the following in the presence of missing data, where p_i, μ_i, and Σ_i retain only the elements corresponding to case i's observed values:

l_i = −(p_i/2) log(2π) − (1/2) log|Σ_i| − (1/2)(y_i − μ_i)' Σ_i⁻¹ (y_i − μ_i)

Joint Loglikelihood Each of the previous casewise loglikelihoods is summed to get:

l(μ, Σ | Y_obs) = Σ_i [ −(p_i/2) log(2π) − (1/2) log|Σ_i| − (1/2)(y_i − μ_i)' Σ_i⁻¹ (y_i − μ_i) ]

FIML then maximizes the above equation to reach estimates of μ and Σ
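A minimal sketch of the casewise loglikelihood for a bivariate normal (illustrative μ, Σ, and data, not from the deck; each case contributes only its observed elements):

```python
import math

# Illustrative parameters for a bivariate normal.
mu = [0.0, 0.0]
sigma = [[1.0, 0.5],
         [0.5, 2.0]]

def loglik_case(y):
    """Casewise FIML loglikelihood; y uses None for missing entries."""
    idx = [i for i, v in enumerate(y) if v is not None]
    yo = [y[i] for i in idx]
    mo = [mu[i] for i in idx]
    so = [[sigma[i][j] for j in idx] for i in idx]
    p = len(idx)
    if p == 1:                      # 1x1 determinant and inverse
        det, inv = so[0][0], [[1.0 / so[0][0]]]
    else:                           # 2x2 determinant and inverse
        det = so[0][0] * so[1][1] - so[0][1] * so[1][0]
        inv = [[ so[1][1] / det, -so[0][1] / det],
               [-so[1][0] / det,  so[0][0] / det]]
    d = [a - b for a, b in zip(yo, mo)]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(p) for j in range(p))
    return -0.5 * (p * math.log(2 * math.pi) + math.log(det) + quad)

# One complete case and two incomplete cases.
data = [[1.0, 2.0], [0.5, None], [None, -1.0]]
total = sum(loglik_case(y) for y in data)
print(round(total, 3))  # approximately -5.82
```

Only the observed subvector and submatrix enter each term, so no missing value ever has to be filled in; maximizing this sum over μ and Σ is exactly what FIML does.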

FIML Continued Because an individual loglikelihood is calculated for every case, the final joint loglikelihood is based only on Y obs Allows the estimation of parameters based on only observed information No need to address the true population values of Y miss

Example: Compare MI to FIML The MCAR simulation randomly removed Job Performance (JP) data (unrelated to anything). The MAR simulation deleted JP data for low IQ (the reason for missingness is included in the model). The MNAR simulation deleted JP data for low JP scores (the reason for JP missingness is not recoverable because it is due to JP itself)

Example: Compare FIML to LD The same simulation, but comparing FIML (which yields the same results as MI) to Listwise Deletion (LD). Whereas FIML/MI yield bias only for parameters affected by the missingness (JP), LD yields bias for all estimates. LD is unbiased only under MCAR (and most missing data are not MCAR)

Inclusive vs. Exclusive Strategies Both FIML and MI can be implemented with either an inclusive or an exclusive approach. Inclusive: Including covariates that are correlated with the causes of missingness Exclusive: Running either the imputation or analysis models with only the variables of substantive interest included.

Inclusive Approach in MI An inclusive approach in MI entails including fully observed covariates in the imputation model (auxiliary variables). Increases the predictive power of the regressions used in the I- step Improves the precision and efficiency of the imputed values

Inclusive Approach in FIML Similar to MI, covariates are added to the model to contribute information about the missing values. Auxiliary variables must be included in the analysis model. The Saturated Correlates approach (Graham, 2003) is the most widely recommended

Saturated Correlates Technique Three rules to implement in latent variable models: 1) Correlate auxiliary variables with manifest predictors 2) Correlate auxiliary variables with the residual terms of manifest indicators 3) Correlate auxiliary variables with other auxiliary variables

Problem with Calculating Incremental Fit Indices Incremental fit indices require the comparison of the fitted model to a null model, which is usually specified as a model with no relationships between the indicators. Under the saturated correlates approach this leaves the auxiliary information out of the null model, which upwardly biases the incremental fit indices

Take-Home Message Missing data cannot simply be ignored. Ad hoc techniques lead to substantial bias in all but the most restrictive of cases. Modern missing data techniques recover information very well, provided conformity with a few loose assumptions. All of the techniques described today are very easily implemented in nearly all common statistical packages. More precisely and efficiently recovered information leads to more valid parameter estimates, higher statistical power, more valid statistical inferences, and better science

Special Thanks Terrence Jorgensen Wei Wu Todd Little Kris Preacher Sunthud Pornprasertmanit Alex Schoemann Mijke Rhemtulla Rawni Anderson Waylon Howard John Geldhof The CRMDA Missing Data Workgroup