Statistical Methods for Handling Missing Data


Statistical Methods for Handling Missing Data Jae-Kwang Kim Department of Statistics, Iowa State University July 5th, 2014

Outline Textbook: Statistical Methods for Handling Incomplete Data by Kim and Shao (2013) Part 1: Basic Theory (Chapters 2-3) Part 2: Imputation (Chapter 4) Part 3: Propensity score approach (Chapter 5) Part 4: Nonignorable missing data (Chapter 6) Jae-Kwang Kim (ISU) July 5th, 2014 2 / 181

Statistical Methods for Handling Missing Data Part 1: Basic Theory Jae-Kwang Kim Department of Statistics, Iowa State University

1 Introduction Definitions for likelihood theory The likelihood function of θ is defined as L(θ) = f(y; θ), where f(y; θ) is the joint pdf of y. The estimator θ̂ is the maximum likelihood estimator (MLE) of θ_0 if it satisfies L(θ̂) = max_{θ ∈ Θ} L(θ). A parametric family of densities, P = {f(y; θ); θ ∈ Θ}, is identifiable if, for all y, f(y; θ_1) ≠ f(y; θ_2) for every θ_1 ≠ θ_2. Jae-Kwang Kim (ISU) Part 1 4 / 181

1 Introduction - Fisher information Definition 1 Score function: S(θ) = ∂ log L(θ)/∂θ. 2 Fisher information = curvature of the log-likelihood: I(θ) = −∂² log L(θ)/∂θ∂θ′ = −∂S(θ)/∂θ′. 3 Observed (Fisher) information: I(θ̂_n), where θ̂_n is the MLE. 4 Expected (Fisher) information: 𝓘(θ) = E_θ{I(θ)}. The observed information is always positive. The observed information applies to a single dataset. The expected information is meaningful as a function of θ across the admissible values of θ. The expected information is an average quantity over all possible datasets. I(θ̂) = 𝓘(θ̂) for the exponential family. Jae-Kwang Kim (ISU) Part 1 5 / 181

1 Introduction - Fisher information Lemma 1. Properties of score functions [Theorem 2.3 of KS] Under regularity conditions allowing the exchange of the order of integration and differentiation, E_θ{S(θ)} = 0 and V_θ{S(θ)} = 𝓘(θ). Jae-Kwang Kim (ISU) Part 1 6 / 181

1 Introduction - Fisher information Remark Under some regularity conditions, the MLE θ̂ converges in probability to the true parameter θ_0. Thus, we can apply a Taylor linearization to S(θ̂) = 0 to get θ̂ − θ_0 ≅ {𝓘(θ_0)}^{-1} S(θ_0). Here, we use the fact that I(θ) = −∂S(θ)/∂θ′ converges in probability to 𝓘(θ). Thus, the (asymptotic) variance of the MLE is V(θ̂) ≅ {𝓘(θ_0)}^{-1} V{S(θ_0)} {𝓘(θ_0)}^{-1} = {𝓘(θ_0)}^{-1}, where the last equality follows from Lemma 1. Jae-Kwang Kim (ISU) Part 1 7 / 181

2 Observed Likelihood Basic Setup Let y = (y_1, ..., y_p) be a p-dimensional random vector with probability density function f(y; θ) whose dominating measure is μ. Let δ_ij be the response indicator function of y_ij, with δ_ij = 1 if y_ij is observed and δ_ij = 0 otherwise. δ_i = (δ_i1, ..., δ_ip): p-dimensional random vector with density P(δ | y), assuming P(δ | y) = P(δ | y; φ) for some φ. Let (y_i,obs, y_i,mis) be the observed part and missing part of y_i, respectively. Let R(y_obs, δ) = {y; y_obs(y_i, δ_i) = y_i,obs, i = 1, ..., n} be the set of all possible values of y with the same realized value of y_obs for given δ, where y_obs(y_i, δ_i) is the function that returns the values y_ij with δ_ij = 1. Jae-Kwang Kim (ISU) Part 1 8 / 181

2 Observed Likelihood Definition: Observed likelihood Under the above setup, the observed likelihood of (θ, φ) is L_obs(θ, φ) = ∫_{R(y_obs, δ)} f(y; θ) P(δ | y; φ) dμ(y). Under the IID setup, the observed likelihood is L_obs(θ, φ) = ∏_{i=1}^n [ ∫ f(y_i; θ) P(δ_i | y_i; φ) dμ(y_i,mis) ], where it is understood that, if y_i = y_i,obs and y_i,mis is empty, there is nothing to integrate out. In the special case of scalar y, the observed likelihood is L_obs(θ, φ) = ∏_{δ_i=1} [f(y_i; θ) π(y_i; φ)] ∏_{δ_i=0} [ ∫ f(y; θ) {1 − π(y; φ)} dy ], where π(y; φ) = P(δ = 1 | y; φ). Jae-Kwang Kim (ISU) Part 1 9 / 181

2 Observed Likelihood Example 1 [Example 2.3 of KS] Let t_1, t_2, ..., t_n be an IID sample from a distribution with density f_θ(t) = θ e^{−θt} I(t > 0). Instead of observing t_i, we observe (y_i, δ_i), where y_i = t_i if δ_i = 1 and y_i = c if δ_i = 0, with δ_i = 1 if t_i ≤ c and δ_i = 0 if t_i > c, where c is a known censoring time. The observed likelihood for θ can be derived as L_obs(θ) = ∏_{i=1}^n {f_θ(t_i)}^{δ_i} {P(t_i > c)}^{1−δ_i} = θ^{Σ_i δ_i} exp(−θ Σ_i y_i). Jae-Kwang Kim (ISU) Part 1 10 / 181
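As a quick numerical illustration of this closed form, the following Python sketch (not from the slides; all names are illustrative) simulates right-censored exponential data and evaluates θ̂ = Σ_i δ_i / Σ_i y_i, the maximizer of L_obs(θ).

import numpy as np

rng = np.random.default_rng(0)
theta_true, c, n = 2.0, 0.7, 10_000
t = rng.exponential(scale=1.0 / theta_true, size=n)   # latent failure times
delta = (t <= c).astype(int)                          # 1 = observed, 0 = censored at c
y = np.where(delta == 1, t, c)                        # observed data (y_i, delta_i)
theta_hat = delta.sum() / y.sum()                     # MLE maximizing L_obs(theta)
print(theta_hat)                                      # close to theta_true for large n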

2 Observed Likelihood Definition: Missing At Random (MAR) P(δ | y) is the density of the conditional distribution of δ given y. Let y_obs = y_obs(y, δ), where y_i,obs = y_i if δ_i = 1 and y_i,obs is unobserved if δ_i = 0. The response mechanism is MAR if P(δ | y_1) = P(δ | y_2) {or P(δ | y) = P(δ | y_obs)} for all y_1 and y_2 satisfying y_obs(y_1, δ) = y_obs(y_2, δ). MAR: the response mechanism P(δ | y) depends on y only through y_obs. Let y = (y_obs, y_mis). By Bayes' theorem, P(y_mis | y_obs, δ) = {P(δ | y_mis, y_obs) / P(δ | y_obs)} P(y_mis | y_obs). MAR: P(y_mis | y_obs, δ) = P(y_mis | y_obs), that is, y_mis ⊥ δ | y_obs. MAR: conditional independence of δ and y_mis given y_obs. Jae-Kwang Kim (ISU) Part 1 11 / 181

2 Observed Likelihood Remark MCAR (Missing Completely At Random): P(δ | y) does not depend on y. MAR (Missing At Random): P(δ | y) = P(δ | y_obs). NMAR (Not Missing At Random): P(δ | y) ≠ P(δ | y_obs). Thus, MCAR is a special case of MAR. Jae-Kwang Kim (ISU) Part 1 12 / 181

2 Observed Likelihood Theorem 1: Likelihood factorization (Rubin, 1976) [Theorem 2.4 of KS] Let P_φ(δ | y) be the conditional density of δ given y and f_θ(y) the joint density of y. Under the conditions that 1 the parameters θ and φ are distinct and 2 the MAR condition holds, the observed likelihood can be written as L_obs(θ, φ) = L_1(θ) L_2(φ), and the MLE of θ can be obtained by maximizing L_1(θ). Thus, we do not have to specify a model for the response mechanism. The response mechanism is called ignorable if the above likelihood factorization holds. Jae-Kwang Kim (ISU) Part 1 13 / 181

2 Observed Likelihood Example 2 [Example 2.4 of KS] Bivariate data (x_i, y_i) with pdf f(x, y) = f_1(y | x) f_2(x). x_i is always observed and y_i is subject to missingness. Assume that the response status variable δ_i of y_i satisfies P(δ_i = 1 | x_i, y_i) = Λ_1(φ_0 + φ_1 x_i + φ_2 y_i) for some function Λ_1(·) of known form. Let θ be the parameter of interest in the regression model f_1(y | x; θ). Let α be the parameter in the marginal distribution of x, denoted by f_2(x_i; α). Define Λ_0(x) = 1 − Λ_1(x). Three parameters: θ: parameter of interest; α and φ: nuisance parameters. Jae-Kwang Kim (ISU) Part 1 14 / 181

2 Observed Likelihood Example 2 (Cont'd) Observed likelihood L_obs(θ, α, φ) = ∏_{δ_i=1} f_1(y_i | x_i; θ) f_2(x_i; α) Λ_1(φ_0 + φ_1 x_i + φ_2 y_i) × ∏_{δ_i=0} ∫ f_1(y | x_i; θ) f_2(x_i; α) Λ_0(φ_0 + φ_1 x_i + φ_2 y) dy = L_1(θ, φ) L_2(α), where L_2(α) = ∏_{i=1}^n f_2(x_i; α). Thus, we can safely ignore the marginal distribution of x if x is completely observed. Jae-Kwang Kim (ISU) Part 1 15 / 181

2 Observed Likelihood Example 2 (Cont'd) If φ_2 = 0, then MAR holds and L_1(θ, φ) = L_1a(θ) L_1b(φ), where L_1a(θ) = ∏_{δ_i=1} f_1(y_i | x_i; θ) and L_1b(φ) = ∏_{δ_i=1} Λ_1(φ_0 + φ_1 x_i) ∏_{δ_i=0} Λ_0(φ_0 + φ_1 x_i). Thus, under MAR, the MLE of θ can be obtained by maximizing L_1a(θ), which is obtained by ignoring the missing part of the data. Jae-Kwang Kim (ISU) Part 1 16 / 181

2 Observed Likelihood Example 2 (Cont'd) Instead of y_i being subject to missingness, if x_i is subject to missingness, then the observed likelihood becomes L_obs(θ, φ, α) = ∏_{δ_i=1} f_1(y_i | x_i; θ) f_2(x_i; α) Λ_1(φ_0 + φ_1 x_i + φ_2 y_i) × ∏_{δ_i=0} ∫ f_1(y_i | x; θ) f_2(x; α) Λ_0(φ_0 + φ_1 x + φ_2 y_i) dx. If φ_1 = 0, then MAR holds and the observed likelihood factors as L_obs(θ, α, φ) = L_1(θ, α) L_2(φ), not as L_1(θ, φ) L_2(α). Although we are not interested in the marginal distribution of x, we still need to specify a model for the marginal distribution of x. Jae-Kwang Kim (ISU) Part 1 17 / 181

3 Mean Score Approach The observed likelihood is the marginal density of (y_obs, δ): L_obs(η) = ∫_{R(y_obs, δ)} f(y; θ) P(δ | y; φ) dμ(y) = ∫ f(y; θ) P(δ | y; φ) dμ(y_mis), where y_mis is the missing part of y and η = (θ, φ). Observed score equation: S_obs(η) ≡ ∂ log L_obs(η)/∂η = 0. Computing the observed score function can be computationally challenging because the observed likelihood is in integral form. Jae-Kwang Kim (ISU) Part 1 18 / 181

3 Mean Score Approach Theorem 2: Mean Score Theorem (Fisher, 1922) [Theorem 2.5 of KS] Under some regularity conditions, the observed score function equals the mean score function. That is, S_obs(η) = S̄(η), where S̄(η) = E{S_com(η) | y_obs, δ}, S_com(η) = ∂ log f(y, δ; η)/∂η, and f(y, δ; η) = f(y; θ) P(δ | y; φ). The mean score function is computed by taking the conditional expectation of the complete-sample score function given the observations. The mean score function is easier to compute than the observed score function. Jae-Kwang Kim (ISU) Part 1 19 / 181

3 Mean Score Approach Proof of Theorem 2 Since L_obs(η) = f(y, δ; η) / f(y, δ | y_obs, δ; η), we have ∂/∂η ln L_obs(η) = ∂/∂η ln f(y, δ; η) − ∂/∂η ln f(y, δ | y_obs, δ; η). Taking the conditional expectation of the above equation over the conditional distribution of (y, δ) given (y_obs, δ), we have ∂/∂η ln L_obs(η) = E{ ∂/∂η ln L_obs(η) | y_obs, δ } = E{S_com(η) | y_obs, δ} − E{ ∂/∂η ln f(y, δ | y_obs, δ; η) | y_obs, δ }. Here, the first equality holds because L_obs(η) is a function of (y_obs, δ) only. The last term is equal to zero by Lemma 1, which states that the expected value of the score function is zero; the reference distribution in this case is the conditional distribution of (y, δ) given (y_obs, δ). Jae-Kwang Kim (ISU) Part 1 20 / 181

3 Mean Score Approach Example 3 [Example 2.6 of KS] 1 Suppose that the study variable y follows a normal distribution with mean x′β and variance σ². The score equations for β and σ² under complete response are S_1(β, σ²) = Σ_{i=1}^n (y_i − x_i′β) x_i / σ² = 0 and S_2(β, σ²) = −n/(2σ²) + Σ_{i=1}^n (y_i − x_i′β)² / (2σ⁴) = 0. 2 Assume that the y_i are observed only for the first r elements and the MAR assumption holds. In this case, the mean score functions reduce to S̄_1(β, σ²) = Σ_{i=1}^r (y_i − x_i′β) x_i / σ² and S̄_2(β, σ²) = −n/(2σ²) + Σ_{i=1}^r (y_i − x_i′β)² / (2σ⁴) + (n − r)/(2σ²). Jae-Kwang Kim (ISU) Part 1 21 / 181

3 Mean Score Approach Example 3 (Cont'd) 3 The maximum likelihood estimator obtained by solving the mean score equations is β̂ = ( Σ_{i=1}^r x_i x_i′ )^{-1} Σ_{i=1}^r x_i y_i and σ̂² = (1/r) Σ_{i=1}^r (y_i − x_i′β̂)². Thus, the resulting estimators can also be obtained by simply ignoring the missing part of the sample, which is consistent with the result in Example 2 (for φ_2 = 0). Jae-Kwang Kim (ISU) Part 1 22 / 181
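A minimal sketch of these estimators (function and variable names are mine, assuming X, y, delta are NumPy arrays):

import numpy as np

def mle_mar_regression(X, y, delta):
    # X: (n, p) design matrix; y: (n,) response; delta: (n,) response indicator (1 = observed).
    Xr, yr = X[delta == 1], y[delta == 1]
    beta_hat = np.linalg.solve(Xr.T @ Xr, Xr.T @ yr)
    resid = yr - Xr @ beta_hat
    sigma2_hat = np.mean(resid ** 2)      # divisor r, matching the mean score solution
    return beta_hat, sigma2_hat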

3 Mean Score Approach Discussion of Example 3 We are interested in estimating θ for the conditional density f(y | x; θ). Under MAR, the observed likelihood for θ is L_obs(θ) = ∏_{i=1}^r f(y_i | x_i; θ) ∏_{i=r+1}^n ∫ f(y | x_i; θ) dμ(y) = ∏_{i=1}^r f(y_i | x_i; θ). The same conclusion follows from the mean score theorem. Under MAR, the mean score function is S̄(θ) = Σ_{i=1}^r S(θ; x_i, y_i) + Σ_{i=r+1}^n E{S(θ; x_i, Y) | x_i} = Σ_{i=1}^r S(θ; x_i, y_i), where S(θ; x, y) is the score function for θ and the second equality follows from Lemma 1. Jae-Kwang Kim (ISU) Part 1 23 / 181

3 Mean Score Approach Example 4 [Example 2.5 of KS] 1 Suppose that the study variable y is distributed as a Bernoulli random variable with probability of success p_i, where p_i = p_i(β) = exp(x_i′β) / {1 + exp(x_i′β)} for some unknown parameter β, and x_i is a vector of covariates in the logistic regression model for y_i. We assume that 1 is in the column space of x_i. 2 Under complete response, the score function for β is S_1(β) = Σ_{i=1}^n {y_i − p_i(β)} x_i. Jae-Kwang Kim (ISU) Part 1 24 / 181

3 Mean Score Approach Example 4 (Cont'd) 3 Let δ_i be the response indicator function for y_i with distribution Bernoulli(π_i), where π_i = exp(x_i′φ_0 + y_i φ_1) / {1 + exp(x_i′φ_0 + y_i φ_1)}. We assume that x_i is always observed, but y_i is missing if δ_i = 0. 4 Under missing data, the mean score function for β is S̄_1(β, φ) = Σ_{δ_i=1} {y_i − p_i(β)} x_i + Σ_{δ_i=0} Σ_{y=0}^1 w_i(y; β, φ) {y − p_i(β)} x_i, where w_i(y; β, φ) is the conditional probability of y_i = y given x_i and δ_i = 0: w_i(y; β, φ) = P_β(y_i = y | x_i) P_φ(δ_i = 0 | y_i = y, x_i) / Σ_{z=0}^1 P_β(y_i = z | x_i) P_φ(δ_i = 0 | y_i = z, x_i). Thus, S̄_1(β, φ) is also a function of φ. Jae-Kwang Kim (ISU) Part 1 25 / 181

3 Mean Score Approach Example 4 (Cont'd) 5 If the response mechanism is MAR, so that φ_1 = 0, then w_i(y; β, φ) = P_β(y_i = y | x_i) / Σ_{z=0}^1 P_β(y_i = z | x_i) = P_β(y_i = y | x_i), and so S̄_1(β, φ) = Σ_{δ_i=1} {y_i − p_i(β)} x_i = S_1(β). 6 If MAR does not hold, then (β̂, φ̂) can be obtained by solving S̄_1(β, φ) = 0 and S̄_2(β, φ) = 0 jointly, where S̄_2(β, φ) = Σ_{δ_i=1} {δ_i − π(φ; x_i, y_i)} (x_i′, y_i)′ + Σ_{δ_i=0} Σ_{y=0}^1 w_i(y; β, φ) {δ_i − π(φ; x_i, y)} (x_i′, y)′. Jae-Kwang Kim (ISU) Part 1 26 / 181

3 Mean Score Approach Discussion of Example 4 We may not have a unique solution to S̄(η) = 0, where S̄(η) = [S̄_1(β, φ), S̄_2(β, φ)], when MAR does not hold, because of the non-identifiability problem associated with nonignorable missingness. To avoid this problem, a reduced model is often used for the response model: Pr(δ = 1 | x, y) = Pr(δ = 1 | u, y), where x = (u, z). The reduced response model introduces a smaller set of parameters, and the over-identified situation can be resolved. (More discussion will be given in the Part 4 lecture.) Computing the solution to S̄(η) = 0 is also difficult. The EM algorithm, which will be presented soon, is a useful computational tool. Jae-Kwang Kim (ISU) Part 1 27 / 181

4 Observed information Definition 1 Observed score function: S_obs(η) = ∂ log L_obs(η)/∂η. 2 Fisher information from the observed likelihood: I_obs(η) = −∂² log L_obs(η)/∂η∂η′. 3 Expected (Fisher) information from the observed likelihood: 𝓘_obs(η) = E_η{I_obs(η)}. Lemma 2 [Theorem 2.6 of KS] Under regularity conditions, E{S_obs(η)} = 0 and V{S_obs(η)} = 𝓘_obs(η), where 𝓘_obs(η) = E_η{I_obs(η)} is the expected information from the observed likelihood. Jae-Kwang Kim (ISU) Part 1 28 / 181

4 Observed information Under missing data, the MLE η̂ is the solution to S̄(η) = 0. Under some regularity conditions, η̂ converges in probability to η_0 and has asymptotic variance {𝓘_obs(η_0)}^{-1}, with 𝓘_obs(η) = E{ −∂S_obs(η)/∂η′ } = E{ S_obs(η)^{⊗2} } = E{ S̄(η)^{⊗2} }, where B^{⊗2} = BB′. For variance estimation of η̂, one may use {I_obs(η̂)}^{-1}. IID setup: a variance estimator of η̂ based on the empirical (Fisher) information is {Ĥ(η̂)}^{-1} = { Σ_{i=1}^n S̄_i(η̂)^{⊗2} }^{-1}, where S̄_i(η) = E{S_i(η) | y_i,obs, δ_i} (Redner and Walker, 1984). In general, I_obs(η̂) is preferred to 𝓘_obs(η̂) for variance estimation of η̂. Jae-Kwang Kim (ISU) Part 1 29 / 181

4 Observed information Return to Example 1 Observed score function: S_obs(θ) = Σ_{i=1}^n δ_i / θ − Σ_{i=1}^n y_i. MLE for θ: θ̂ = Σ_{i=1}^n δ_i / Σ_{i=1}^n y_i. Fisher information: I_obs(θ) = Σ_{i=1}^n δ_i / θ². Expected information: 𝓘_obs(θ) = Σ_{i=1}^n (1 − e^{−θc}) / θ² = n(1 − e^{−θc}) / θ². Which one do you prefer? Jae-Kwang Kim (ISU) Part 1 30 / 181
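Continuing the small numerical sketch from Example 1 (assumed variable names; self-contained), the two information quantities can be compared at the MLE as follows:

import numpy as np

def example1_information(y, delta, c):
    # y: observed times (censored at c); delta: censoring indicators; c: censoring time
    theta_hat = delta.sum() / y.sum()                       # MLE from the observed score
    i_obs = delta.sum() / theta_hat ** 2                    # observed information I_obs(theta_hat)
    ei_obs = len(y) * (1 - np.exp(-theta_hat * c)) / theta_hat ** 2   # expected information
    # 1/i_obs and 1/ei_obs are the two competing variance estimates for theta_hat
    return theta_hat, i_obs, ei_obs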

4 Observed information Motivation L_com(η) = f(y, δ; η): complete-sample likelihood with no missing data. Fisher information associated with L_com(η): I_com(η) = −∂S_com(η)/∂η′ = −∂² log L_com(η)/∂η∂η′. L_obs(η): the observed likelihood. Fisher information associated with L_obs(η): I_obs(η) = −∂S_obs(η)/∂η′ = −∂² log L_obs(η)/∂η∂η′. How can we express I_obs(η) in terms of I_com(η) and S_com(η)? Jae-Kwang Kim (ISU) Part 1 31 / 181

4 Observed information Theorem 3 (Louis, 1982; Oakes, 1999) [Theorem 2.7 of KS] Under regularity conditions allowing the exchange of the order of integration and differentiation, I_obs(η) = E{I_com(η) | y_obs, δ} − [ E{S_com(η)^{⊗2} | y_obs, δ} − S̄(η)^{⊗2} ] = E{I_com(η) | y_obs, δ} − V{S_com(η) | y_obs, δ}, where S̄(η) = E{S_com(η) | y_obs, δ}. Jae-Kwang Kim (ISU) Part 1 32 / 181

Proof of Theorem 3 By Theorem 2, the observed information associated with L_obs(η) can be expressed as I_obs(η) = −∂S̄(η)/∂η′, where S̄(η) = E{S_com(η) | y_obs, δ; η}. Thus, we have ∂S̄(η)/∂η′ = ∂/∂η′ ∫ S_com(η; y) f(y, δ | y_obs, δ; η) dμ(y) = ∫ {∂S_com(η; y)/∂η′} f(y, δ | y_obs, δ; η) dμ(y) + ∫ S_com(η; y) {∂f(y, δ | y_obs, δ; η)/∂η′} dμ(y) = E{∂S_com(η)/∂η′ | y_obs, δ} + ∫ S_com(η; y) {∂ log f(y, δ | y_obs, δ; η)/∂η′} f(y, δ | y_obs, δ; η) dμ(y). The first term is equal to −E{I_com(η) | y_obs, δ} and the second term is equal to E{S_com(η) S_mis(η)′ | y_obs, δ} = E[{S̄(η) + S_mis(η)} S_mis(η)′ | y_obs, δ] = E{S_mis(η) S_mis(η)′ | y_obs, δ}, because E{S̄(η) S_mis(η)′ | y_obs, δ} = 0. Jae-Kwang Kim (ISU) Part 1 33 / 181

4. Observed information S_mis(η): the score function associated with the conditional density f(y, δ | y_obs, δ). Expected missing information: 𝓘_mis(η) = E{ −∂S_mis(η)/∂η′ }, satisfying 𝓘_mis(η) = E{ S_mis(η)^{⊗2} }. Missing information principle (Orchard and Woodbury, 1972): 𝓘_mis(η) = 𝓘_com(η) − 𝓘_obs(η), where 𝓘_com(η) = E{ −∂S_com(η)/∂η′ } is the expected information from the complete-sample likelihood. An alternative expression of the missing information principle is V{S_mis(η)} = V{S_com(η)} − V{S̄(η)}. Note that V{S_com(η)} = 𝓘_com(η) and V{S_obs(η)} = 𝓘_obs(η). Jae-Kwang Kim (ISU) Part 1 34 / 181

4. Observed information Example 5 1 Consider the following bivariate normal distribution: (y_1i, y_2i)′ ~ N( (μ_1, μ_2)′, Σ ), Σ = (σ_11 σ_12; σ_12 σ_22), for i = 1, 2, ..., n. Assume for simplicity that σ_11, σ_12 and σ_22 are known constants, and let μ = (μ_1, μ_2)′ be the parameter of interest. 2 The complete-sample score function for μ is S_com(μ) = Σ_{i=1}^n S_com^(i)(μ) = Σ_{i=1}^n Σ^{-1} (y_1i − μ_1, y_2i − μ_2)′. The information matrix of μ based on the complete sample is I_com(μ) = n Σ^{-1}. Jae-Kwang Kim (ISU) Part 1 35 / 181

4. Observed information Example 5 (Cont'd) 3 Suppose that there are some missing values in y_1i and y_2i, and the original sample is partitioned into four sets: H = both y_1 and y_2 respond; K = only y_1 is observed; L = only y_2 is observed; M = both y_1 and y_2 are missing. Let n_H, n_K, n_L, n_M denote the sizes of H, K, L, M, respectively. 4 Assume that the response mechanism does not depend on the value of (y_1, y_2), so it is MAR. In this case, the observed score function of μ based on a single observation in set K is E{ S_com^(i)(μ) | y_1i, i ∈ K } = Σ^{-1} (y_1i − μ_1, E(y_2i | y_1i) − μ_2)′ = ( σ_11^{-1}(y_1i − μ_1), 0 )′. Jae-Kwang Kim (ISU) Part 1 36 / 181

4. Observed information Example 5 (Cont'd) 5 Similarly, we have E{ S_com^(i)(μ) | y_2i, i ∈ L } = ( 0, σ_22^{-1}(y_2i − μ_2) )′. 6 Therefore, the observed information matrix of μ is I_obs(μ) = n_H Σ^{-1} + n_K ( σ_11^{-1} 0; 0 0 ) + n_L ( 0 0; 0 σ_22^{-1} ), and the asymptotic variance of the MLE of μ can be obtained as the inverse of I_obs(μ). Jae-Kwang Kim (ISU) Part 1 37 / 181
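A short sketch of this formula (names are mine; Σ is assumed known), which also gives the asymptotic variance of μ̂ as the inverse:

import numpy as np

def obs_info_mu(Sigma, n_H, n_K, n_L):
    # Sigma: known 2x2 covariance matrix; n_H, n_K, n_L: counts of the response patterns.
    I = n_H * np.linalg.inv(Sigma)
    I = I + n_K * np.array([[1.0 / Sigma[0, 0], 0.0], [0.0, 0.0]])
    I = I + n_L * np.array([[0.0, 0.0], [0.0, 1.0 / Sigma[1, 1]]])
    return I          # asymptotic variance of the MLE of mu is np.linalg.inv(I)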

5. EM algorithm We are interested in finding η̂ that maximizes L_obs(η). The MLE can be obtained by solving S_obs(η) = 0, which is equivalent to solving S̄(η) = 0 by Theorem 2. Computing the solution of S̄(η) = 0 can be challenging because it often involves computing I_obs(η) = −∂S̄(η)/∂η in order to apply Newton's method: η̂^(t+1) = η̂^(t) + { I_obs(η̂^(t)) }^{-1} S̄(η̂^(t)). We may rely on Louis' formula (Theorem 3) to compute I_obs(η). The EM algorithm provides an alternative method of solving S̄(η) = 0 by writing S̄(η) = E{S_com(η) | y_obs, δ; η} and using the following iterative method: η̂^(t+1) ← solution of E{ S_com(η) | y_obs, δ; η̂^(t) } = 0. Jae-Kwang Kim (ISU) Part 1 38 / 181

5. EM algorithm Definition Let η^(t) be the current value of the parameter estimate of η. The EM algorithm is defined by iteratively carrying out the following E-step and M-step: E-step: Compute Q(η | η^(t)) = E{ ln f(y, δ; η) | y_obs, δ, η^(t) }. M-step: Find η^(t+1) that maximizes Q(η | η^(t)) with respect to η. Theorem 4 (Dempster et al., 1977) [Theorem 3.2] Let L_obs(η) = ∫_{R(y_obs, δ)} f(y, δ; η) dμ(y) be the observed likelihood of η. If Q(η^(t+1) | η^(t)) ≥ Q(η^(t) | η^(t)), then L_obs(η^(t+1)) ≥ L_obs(η^(t)). Jae-Kwang Kim (ISU) Part 1 39 / 181

5. EM algorithm Remark 1 Convergence of the EM algorithm is linear. It can be shown that η^(t+1) − η^(t) ≅ J_mis (η^(t) − η^(t−1)), where J_mis = I_com^{-1} I_mis is called the fraction of missing information. 2 Under MAR and for an exponential family of the form f(y; θ) = b(y) exp{ θ′T(y) − A(θ) }, the M-step computes θ^(t+1) as the solution to E{ T(y) | y_obs, θ^(t) } = E{ T(y); θ }. Jae-Kwang Kim (ISU) Part 1 40 / 181

5. EM algorithm Figure: Illustration of the EM algorithm for the exponential family (h_1(θ) = E{T(y) | y_obs, θ}, h_2(θ) = E{T(y); θ}). Jae-Kwang Kim (ISU) Part 1 41 / 181

5. EM algorithm Return to Example 4 E-step: S̄_1(β | β^(t), φ^(t)) = Σ_{δ_i=1} {y_i − p_i(β)} x_i + Σ_{δ_i=0} Σ_{j=0}^1 w_{ij(t)} {j − p_i(β)} x_i, where w_{ij(t)} = Pr(Y_i = j | x_i, δ_i = 0; β^(t), φ^(t)) = Pr(Y_i = j | x_i; β^(t)) Pr(δ_i = 0 | x_i, j; φ^(t)) / Σ_{y=0}^1 Pr(Y_i = y | x_i; β^(t)) Pr(δ_i = 0 | x_i, y; φ^(t)), and S̄_2(φ | β^(t), φ^(t)) = Σ_{δ_i=1} {δ_i − π(x_i, y_i; φ)} (x_i′, y_i)′ + Σ_{δ_i=0} Σ_{j=0}^1 w_{ij(t)} {δ_i − π_i(x_i, j; φ)} (x_i′, j)′. Jae-Kwang Kim (ISU) Part 1 42 / 181

5. EM algorithm Return to Example 4 (Cont'd) M-step: The parameter estimates are updated by solving [ S̄_1(β | β^(t), φ^(t)), S̄_2(φ | β^(t), φ^(t)) ] = (0, 0) for β and φ. For categorical missing data, the conditional expectation in the E-step can be computed using a weighted mean with weights w_{ij(t)}. Ibrahim (1990) called this method EM by weighting. The observed information matrix can also be obtained by Louis' formula (in Theorem 3) using the weighted mean in the E-step. Jae-Kwang Kim (ISU) Part 1 43 / 181
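The following Python sketch illustrates one possible implementation of EM by weighting for this example (all function and variable names are mine; the identifiability caveats of the earlier discussion still apply under NMAR). Each nonrespondent is expanded into the two candidate values y = 0, 1 with fractional weights, and the M-step solves the weighted score equations by Newton-Raphson.

import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def wlogit(X, y, w, n_iter=25):
    # Weighted logistic regression via Newton-Raphson: solves sum_i w_i (y_i - p_i) x_i = 0.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        H = X.T @ ((w * p * (1 - p))[:, None] * X)
        beta = beta + np.linalg.solve(H, X.T @ (w * (y - p)))
    return beta

def em_by_weighting(X, y, delta, n_iter=50):
    # X: (n, p) covariates incl. intercept; y: (n,) binary outcome (ignored where delta == 0);
    # delta: (n,) response indicator.
    n, p = X.shape
    beta, phi = np.zeros(p), np.zeros(p + 1)     # phi parametrizes pi(delta = 1 | x, y)
    for _ in range(n_iter):
        # E-step: expand each nonrespondent into y* = 0 and y* = 1 with weights w_ij(t)
        rows, ystar, w = [], [], []
        for i in range(n):
            if delta[i] == 1:
                rows.append(i); ystar.append(float(y[i])); w.append(1.0)
            else:
                py1 = expit(X[i] @ beta)
                num = []
                for j in (0.0, 1.0):
                    pj = py1 if j == 1.0 else 1.0 - py1
                    pr_miss = 1.0 - expit(np.r_[X[i], j] @ phi)
                    num.append(pj * pr_miss)
                tot = num[0] + num[1]
                for j, nj in zip((0.0, 1.0), num):
                    rows.append(i); ystar.append(j); w.append(nj / tot)
        rows, ystar, w = np.array(rows), np.array(ystar), np.array(w)
        Xe = X[rows]
        # M-step: weighted score equations for beta (outcome model) and phi (response model)
        beta = wlogit(Xe, ystar, w)
        phi = wlogit(np.column_stack([Xe, ystar]), delta[rows].astype(float), w)
    return beta, phi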

5. EM algorithm Example 6. Mixture model [Example 3.8 of KS] Observation: Y_i = (1 − W_i) Z_1i + W_i Z_2i, i = 1, 2, ..., n, where Z_1i ~ N(μ_1, σ_1²), Z_2i ~ N(μ_2, σ_2²), and W_i ~ Bernoulli(π). Parameter of interest: θ = (μ_1, μ_2, σ_1², σ_2², π). Jae-Kwang Kim (ISU) Part 1 44 / 181

5. EM algorithm Example 6 (Cont'd) Observed likelihood L_obs(θ) = ∏_{i=1}^n pdf(y_i; θ), where pdf(y; θ) = (1 − π) φ(y; μ_1, σ_1²) + π φ(y; μ_2, σ_2²) and φ(y; μ, σ²) = (2πσ²)^{-1/2} exp[ −(y − μ)²/(2σ²) ]. Jae-Kwang Kim (ISU) Part 1 45 / 181

5. EM algorithm Example 6 (Cont'd) Full-sample likelihood L_full(θ) = ∏_{i=1}^n pdf(y_i, w_i; θ), where pdf(y, w; θ) = [φ(y; μ_1, σ_1²)]^{1−w} [φ(y; μ_2, σ_2²)]^w π^w (1 − π)^{1−w}. Thus, ln L_full(θ) = Σ_{i=1}^n [ (1 − w_i) ln φ(y_i; μ_1, σ_1²) + w_i ln φ(y_i; μ_2, σ_2²) ] + Σ_{i=1}^n { w_i ln(π) + (1 − w_i) ln(1 − π) }. Jae-Kwang Kim (ISU) Part 1 46 / 181

5. EM algorithm Example 6 (Cont'd) EM algorithm [E-step] Q(θ | θ^(t)) = Σ_{i=1}^n [ (1 − r_i^(t)) ln φ(y_i; μ_1, σ_1²) + r_i^(t) ln φ(y_i; μ_2, σ_2²) ] + Σ_{i=1}^n { r_i^(t) ln(π) + (1 − r_i^(t)) ln(1 − π) }, where r_i^(t) = E(w_i | y_i, θ^(t)) with E(w_i | y_i, θ) = π φ(y_i; μ_2, σ_2²) / { (1 − π) φ(y_i; μ_1, σ_1²) + π φ(y_i; μ_2, σ_2²) }. [M-step] Solve ∂Q(θ | θ^(t))/∂θ = 0. Jae-Kwang Kim (ISU) Part 1 47 / 181
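A compact Python sketch of this E-step/M-step pair for the two-component normal mixture (names and starting values are mine):

import numpy as np
from scipy.stats import norm

def em_mixture(y, n_iter=200):
    mu1, mu2 = y.min(), y.max()            # crude starting values
    s1 = s2 = y.var()
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibilities r_i = E(w_i | y_i, theta)
        f1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
        f2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
        r = f2 / (f1 + f2)
        # M-step: weighted means/variances and mixing proportion solve dQ/dtheta = 0
        mu1 = np.sum((1 - r) * y) / np.sum(1 - r)
        mu2 = np.sum(r * y) / np.sum(r)
        s1 = np.sum((1 - r) * (y - mu1) ** 2) / np.sum(1 - r)
        s2 = np.sum(r * (y - mu2) ** 2) / np.sum(r)
        pi = r.mean()
    return mu1, mu2, s1, s2, pi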

5. EM algorithm Example 7. (Robust regression) [Example 3.12 of KS] Model: y_i = β_0 + β_1 x_i + σ e_i with e_i ~ t(ν), ν known. Missing data setup: e_i = u_i / √w_i, where u_i ~ N(0, 1) and w_i ~ χ²_ν / ν. (x_i, y_i, w_i): complete data (x_i, y_i always observed, w_i always missing). y_i | (x_i, w_i) ~ N(β_0 + β_1 x_i, σ²/w_i), and x_i is fixed (thus independent of w_i). The EM algorithm can be used to estimate θ = (β_0, β_1, σ²). Jae-Kwang Kim (ISU) Part 1 48 / 181

5. EM algorithm Example 7 (Cont'd) E-step: Find the conditional distribution of w_i given (x_i, y_i). By Bayes' theorem, f(w_i | x_i, y_i) ∝ f(w_i) f(y_i | w_i, x_i) ~ Gamma( (ν + 1)/2, 2/(ν + d_i²) ), where d_i = (y_i − β_0 − β_1 x_i)/σ. Thus, E(w_i | x_i, y_i, θ^(t)) = (ν + 1) / (ν + (d_i^(t))²), where d_i^(t) = (y_i − β_0^(t) − β_1^(t) x_i)/σ^(t). Jae-Kwang Kim (ISU) July 5th, 2014 49 / 181

5. EM algorithm Example 7 (Cont'd) M-step: (μ_x^(t), μ_y^(t)) = Σ_{i=1}^n w_i^(t) (x_i, y_i) / Σ_{i=1}^n w_i^(t), β_1^(t+1) = Σ_{i=1}^n w_i^(t) (x_i − μ_x^(t))(y_i − μ_y^(t)) / Σ_{i=1}^n w_i^(t) (x_i − μ_x^(t))², β_0^(t+1) = μ_y^(t) − β_1^(t+1) μ_x^(t), σ²^(t+1) = (1/n) Σ_{i=1}^n w_i^(t) ( y_i − β_0^(t+1) − β_1^(t+1) x_i )², where w_i^(t) = E(w_i | x_i, y_i, θ^(t)). Jae-Kwang Kim (ISU) July 5th, 2014 50 / 181
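A minimal Python sketch of this EM iteration for the t-error regression model (names are mine; ν is assumed known):

import numpy as np

def em_t_regression(x, y, nu, n_iter=100):
    b0, b1, s2 = 0.0, 0.0, np.var(y)
    for _ in range(n_iter):
        # E-step: w_i^(t) = E(w_i | x_i, y_i) = (nu + 1) / (nu + d_i^2)
        d2 = (y - b0 - b1 * x) ** 2 / s2
        w = (nu + 1) / (nu + d2)
        # M-step: weighted least squares and weighted residual variance
        mx, my = np.average(x, weights=w), np.average(y, weights=w)
        b1 = np.sum(w * (x - mx) * (y - my)) / np.sum(w * (x - mx) ** 2)
        b0 = my - b1 * mx
        s2 = np.mean(w * (y - b0 - b1 * x) ** 2)
    return b0, b1, s2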

6. Summary Interested in finding the MLE that maximizes the observed likelihood function. Under MAR, the model specification of the response mechanism is not necessary. Mean score equation can be used to compute the MLE. EM algorithm is a useful computational tool for solving the mean score equation. The E-step of the EM algorithm may require some computational tool (see Part 2). The asymptotic variance of the MLE can be computed by the inverse of the observed information matrix, which can be computed using Louis formula. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181

REFERENCES Cheng, P. E. (1994), Nonparametric estimation of mean functionals with data missing at random, Journal of the American Statistical Association 89, 81–87. Dempster, A. P., N. M. Laird and D. B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B 39, 1–37. Fisher, R. A. (1922), On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society of London A 222, 309–368. Fuller, W. A., M. M. Loughin and H. D. Baker (1994), Regression weighting in the presence of nonresponse with application to the 1987-1988 Nationwide Food Consumption Survey, Survey Methodology 20, 75–85. Hirano, K., G. Imbens and G. Ridder (2003), Efficient estimation of average treatment effects using the estimated propensity score, Econometrica 71, 1161–1189. Ibrahim, J. G. (1990), Incomplete data in generalized linear models, Journal of the American Statistical Association 85, 765–769. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181

Kim, J. K. (2011), Parametric fractional imputation for missing data analysis, Biometrika 98, 119–132. Kim, J. K. and C. L. Yu (2011), A semi-parametric estimation of mean functionals with non-ignorable missing data, Journal of the American Statistical Association 106, 157–165. Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), On the bias of the multiple imputation variance estimator in survey sampling, Journal of the Royal Statistical Society: Series B 68, 509–521. Kim, J. K. and M. K. Riddles (2012), Some theory for propensity-score-adjustment estimators in survey sampling, Survey Methodology 38, 157–165. Kott, P. S. and T. Chang (2010), Using calibration weighting to adjust for nonignorable unit nonresponse, Journal of the American Statistical Association 105, 1265–1275. Louis, T. A. (1982), Finding the observed information matrix when using the EM algorithm, Journal of the Royal Statistical Society: Series B 44, 226–233. Meng, X. L. (1994), Multiple-imputation inferences with uncongenial sources of input (with discussion), Statistical Science 9, 538–573. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181

Oakes, D. (1999), Direct calculation of the information matrix via the EM algorithm, Journal of the Royal Statistical Society: Series B 61, 479–482. Orchard, T. and M. A. Woodbury (1972), A missing information principle: theory and applications, in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, California, pp. 695–715. Redner, R. A. and H. F. Walker (1984), Mixture densities, maximum likelihood and the EM algorithm, SIAM Review 26, 195–239. Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), Estimation of regression coefficients when some regressors are not always observed, Journal of the American Statistical Association 89, 846–866. Robins, J. M. and N. Wang (2000), Inference for imputation estimators, Biometrika 87, 113–124. Rubin, D. B. (1976), Inference and missing data, Biometrika 63, 581–590. Tanner, M. A. and W. H. Wong (1987), The calculation of posterior distributions by data augmentation, Journal of the American Statistical Association 82, 528–540. Jae-Kwang Kim (ISU) July 5th, 2014 51 / 181

Wang, N. and J. M. Robins (1998), Large-sample theory for parametric multiple imputation procedures, Biometrika 85, 935–948. Wang, S., J. Shao and J. K. Kim (2014), Identifiability and estimation in problems with nonignorable nonresponse, Statistica Sinica 24, 1097–1116. Wei, G. C. and M. A. Tanner (1990), A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms, Journal of the American Statistical Association 85, 699–704. Zhou, M. and J. K. Kim (2012), An efficient method of estimation for longitudinal surveys with monotone missing data, Biometrika 99, 631–648.

Statistical Methods for Handling Missing Data Part 2: Imputation Jae-Kwang Kim Department of Statistics, Iowa State University Jae-Kwang Kim (ISU) Part 2 52 / 181

Introduction Basic setup Y: a vector of random variables with distribution F(y; θ). y_1, ..., y_n are n independent realizations of Y. We are interested in estimating ψ, which is implicitly defined by E{U(ψ; Y)} = 0. Under complete observation, a consistent estimator ψ̂_n of ψ can be obtained by solving the estimating equation for ψ: Σ_{i=1}^n U(ψ; y_i) = 0. A special case of an estimating function is the score function; in this case, ψ = θ. The sandwich variance estimator is often used to estimate the variance of ψ̂_n: V̂(ψ̂_n) = τ̂_u^{-1} V̂(U) τ̂_u^{-1′}, where τ_u = E{−∂U(ψ; y)/∂ψ′}. Jae-Kwang Kim (ISU) Part 2 53 / 181

1. Introduction Missing data setup Suppose that y_i is not fully observed. y_i = (y_obs,i, y_mis,i): (observed, missing) part of y_i. δ_i: response indicator functions for y_i. Under the existence of missing data, we can use the following estimator: ψ̂: solution to Σ_{i=1}^n E{U(ψ; y_i) | y_obs,i, δ_i} = 0. (1) The equation in (1) is often called the expected estimating equation. Jae-Kwang Kim (ISU) Part 2 54 / 181

1. Introduction Motivation (for imputation) Computing the conditional expectation in (1) can be a challenging problem. 1 The conditional expectation depends on unknown parameter values. That is, E{U(ψ; y_i) | y_obs,i, δ_i} = E{U(ψ; y_i) | y_obs,i, δ_i; θ, φ}, where θ is the parameter in f(y; θ) and φ is the parameter in p(δ | y; φ). 2 Even if we know η = (θ, φ), computing the conditional expectation is numerically difficult. Jae-Kwang Kim (ISU) Part 2 55 / 181

1. Introduction Imputation Imputation: Monte Carlo approximation of the conditional expectation (given the observed data): E{U(ψ; y_i) | y_obs,i, δ_i} ≅ (1/m) Σ_{j=1}^m U(ψ; y_obs,i, y_mis,i^(j)). 1 Bayesian approach: generate y_mis,i^* from f(y_mis,i | y_obs, δ) = ∫ f(y_mis,i | y_obs, δ; η) p(η | y_obs, δ) dη. 2 Frequentist approach: generate y_mis,i^* from f(y_mis,i | y_obs,i, δ; η̂), where η̂ is a consistent estimator. Jae-Kwang Kim (ISU) Part 2 56 / 181

1. Introduction Imputation Questions 1 How to generate the Monte Carlo samples (or the imputed values)? 2 What is the asymptotic distribution of ψ̂_I, which solves (1/m) Σ_{j=1}^m U(ψ; y_obs,i, y_mis,i^(j)) = 0, where y_mis,i^(j) ~ f(y_mis,i | y_obs,i, δ; η̂_p) for some η̂_p? 3 How to estimate the variance of ψ̂_I? Jae-Kwang Kim (ISU) Part 2 57 / 181

2. Basic Theory for Imputation Basic Setup (for Case 1: ψ = η) y = (y_1, ..., y_n) ~ f(y; θ), δ = (δ_1, ..., δ_n) ~ P(δ | y; φ), y = (y_obs, y_mis): (observed, missing) part of y. y_mis^(1), ..., y_mis^(m): m imputed values of y_mis generated from f(y_mis | y_obs, δ; η̂_p) = f(y; θ̂_p) P(δ | y; φ̂_p) / ∫ f(y; θ̂_p) P(δ | y; φ̂_p) dμ(y_mis), where η̂_p = (θ̂_p, φ̂_p) is a preliminary estimator of η = (θ, φ). Using the m imputed values, the imputed score function is computed as S̄_imp,m(η | η̂_p) ≡ (1/m) Σ_{j=1}^m S_com(η; y_obs, y_mis^(j), δ), where S_com(η; y, δ) is the score function of η = (θ, φ) under complete response. Jae-Kwang Kim (ISU) Part 2 58 / 181

2. Basic Theory for Imputation Lemma 1 (Asymptotic results for m = ∞) Assume that η̂_p converges in probability to η. Let η̂_I,m be the solution to (1/m) Σ_{j=1}^m S_com(η; y_obs, y_mis^(j), δ) = 0, where y_mis^(1), ..., y_mis^(m) are the imputed values generated from f(y_mis | y_obs, δ; η̂_p). Then, under some regularity conditions, for m = ∞, η̂_imp,∞ = η̂_MLE + J_mis (η̂_p − η̂_MLE) (2) and V(η̂_imp,∞) ≅ I_obs^{-1} + J_mis {V(η̂_p) − V(η̂_MLE)} J_mis′, where J_mis = I_com^{-1} I_mis is the fraction of missing information. Jae-Kwang Kim (ISU) Part 2 59 / 181

2. Basic Theory for Imputation Remark For m = ∞, the imputed score equation becomes the mean score equation. Equation (2) means that η̂_imp,∞ = (I − J_mis) η̂_MLE + J_mis η̂_p. (3) That is, η̂_imp,∞ is a convex combination of η̂_MLE and η̂_p. Note that η̂_imp,∞ is the one-step EM update with initial estimate η̂_p. Let η̂^(t) be the t-th EM update of η, computed by solving S̄(η | η̂^(t−1)) = 0 with η̂^(0) = η̂_p. Equation (3) implies that η̂^(t) = (I − J_mis) η̂_MLE + J_mis η̂^(t−1). Thus, we can obtain η̂^(t) = η̂_MLE + (J_mis)^t (η̂^(0) − η̂_MLE), which justifies lim_{t→∞} η̂^(t) = η̂_MLE. Jae-Kwang Kim (ISU) Part 2 60 / 181

2. Basic Theory for Imputation Theorem 1 (Asymptotic results for m < ∞) [Theorem 4.1 of KS] Let η̂_p be a preliminary √n-consistent estimator of η with variance V_p. Under some regularity conditions, the solution η̂_imp,m to S̄_m(η | η̂_p) ≡ (1/m) Σ_{j=1}^m S_com(η; y_obs, y_mis^(j), δ) = 0 has mean η_0 and asymptotic variance equal to V(η̂_imp,m) ≅ I_obs^{-1} + J_mis { V_p − I_obs^{-1} } J_mis′ + m^{-1} I_com^{-1} I_mis I_com^{-1}, (4) where J_mis = I_com^{-1} I_mis. This theorem was originally presented by Wang and Robins (1998). Jae-Kwang Kim (ISU) Part 2 61 / 181

2. Basic Theory for Imputation Remark If we use η̂_p = η̂_MLE, then the asymptotic variance in (4) is V(η̂_imp,m) ≅ I_obs^{-1} + m^{-1} I_com^{-1} I_mis I_com^{-1}. In Bayesian imputation (or multiple imputation), the posterior values of η are independently generated from η* ~ N(η̂_MLE, I_obs^{-1}), which implies that we can use V_p = I_obs^{-1} + m^{-1} I_obs^{-1}. Thus, the asymptotic variance in (4) for multiple imputation is V(η̂_imp,m) ≅ I_obs^{-1} + m^{-1} J_mis I_obs^{-1} J_mis′ + m^{-1} I_com^{-1} I_mis I_com^{-1}. The second term is the additional price we pay for generating the posterior values rather than using η̂_MLE directly. Jae-Kwang Kim (ISU) Part 2 62 / 181

2. Basic Theory for Imputation Basic Setup (for Case 2: ψ ≠ η) Parameter ψ defined by E{U(ψ; y)} = 0. Under complete response, a consistent estimator of ψ can be obtained by solving U(ψ; y) = 0. Assume that some part of y, denoted by y_mis, is not observed and m imputed values, say y_mis^(1), ..., y_mis^(m), are generated from f(y_mis | y_obs, δ; η̂_MLE), where η̂_MLE is the MLE of η_0. The imputed estimating function using the m imputed values is computed as Ū_imp,m(ψ | η̂_MLE) = (1/m) Σ_{j=1}^m U(ψ; y^(j)), (5) where y^(j) = (y_obs, y_mis^(j)). Let ψ̂_imp,m be the solution to Ū_imp,m(ψ | η̂_MLE) = 0. We are interested in the asymptotic properties of ψ̂_imp,m. Jae-Kwang Kim (ISU) Part 2 63 / 181

2. Basic Theory for Imputation Theorem 2 [Theorem 4.2 of KS] Suppose that the parameter of interest ψ_0 is estimated by solving U(ψ) = 0 under complete response. Then, under some regularity conditions, the solution to E{U(ψ) | y_obs, δ; η̂_MLE} = 0 (6) has mean ψ_0 and asymptotic variance τ^{-1} Ω τ^{-1′}, where τ = E{−∂U(ψ_0)/∂ψ′}, Ω = V{ Ū(ψ_0 | η_0) + κ S_obs(η_0) }, and κ = E{U(ψ_0) S_mis′(η_0)} I_obs^{-1}. This theorem was originally presented by Robins and Wang (2000). Jae-Kwang Kim (ISU) Part 2 64 / 181

2. Basic Theory for Imputation Remark Writing Ū(ψ | η) ≡ E{U(ψ) | y_obs, δ; η}, the solution to (6) can be treated as the solution to the joint estimating equation U(ψ, η) = [ U_1(ψ, η), U_2(η) ] = 0, where U_1(ψ, η) = Ū(ψ | η) and U_2(η) = S_obs(η). We can apply a Taylor expansion to get (ψ̂ − ψ_0, η̂ − η_0)′ ≅ −( B_11 B_12; B_21 B_22 )^{-1} ( U_1(ψ_0, η_0), U_2(η_0) )′, where ( B_11 B_12; B_21 B_22 ) = ( E(∂U_1/∂ψ′) E(∂U_1/∂η′); E(∂U_2/∂ψ′) E(∂U_2/∂η′) ). Thus, as B_21 = 0, ψ̂ ≅ ψ_0 − B_11^{-1} { U_1(ψ_0, η_0) − B_12 B_22^{-1} U_2(η_0) }. In Theorem 2, B_11 = −τ, B_12 = E(U S_mis′), and B_22 = −I_obs. Jae-Kwang Kim (ISU) Part 2 65 / 181

3. Monte Carlo EM Motivation: Monte Carlo samples in the EM algorithm can be used as imputed values. Monte Carlo EM 1 In the EM algorithm defined by [E-step] Compute Q(η | η^(t)) = E{ ln f(y, δ; η) | y_obs, δ; η^(t) } and [M-step] Find η^(t+1) that maximizes Q(η | η^(t)), the E-step is computationally cumbersome because it involves an integral. 2 Wei and Tanner (1990): In the E-step, first draw y_mis^(1), ..., y_mis^(m) ~ f(y_mis | y_obs, δ; η^(t)) and approximate Q(η | η^(t)) ≅ (1/m) Σ_{j=1}^m ln f(y_obs, y_mis^(j), δ; η). Jae-Kwang Kim (ISU) Part 2 66 / 181

3. Monte Carlo EM Example 1 [Example 3.15 of KS] Suppose that y_i ~ f(y_i | x_i; θ). Assume that x_i is always observed but we observe y_i only when δ_i = 1, where δ_i ~ Bernoulli[π_i(φ)] and π_i(φ) = exp(φ_0 + φ_1 x_i + φ_2 y_i) / {1 + exp(φ_0 + φ_1 x_i + φ_2 y_i)}. To implement the MCEM method, in the E-step we need to generate samples from f(y_i | x_i, δ_i = 0; θ̂, φ̂) = f(y_i | x_i; θ̂) {1 − π_i(φ̂)} / ∫ f(y_i | x_i; θ̂) {1 − π_i(φ̂)} dy_i. Jae-Kwang Kim (ISU) Part 2 67 / 181

3. Monte Carlo EM Example 1 (Cont'd) We can use the following rejection method to generate samples from f(y_i | x_i, δ_i = 0; θ̂, φ̂): 1 Generate y_i* from f(y_i | x_i; θ̂). 2 Using y_i*, compute π_i*(φ̂) = exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i*) / {1 + exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i*)} and accept y_i* with probability 1 − π_i*(φ̂). 3 If y_i* is not accepted, then go to Step 1. Jae-Kwang Kim (ISU) Part 2 68 / 181
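A small Python sketch of this rejection step (names are mine; for illustration, f(y | x; θ) is taken to be a normal regression model, which is an assumption, not part of the slide):

import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def draw_missing_y(x_i, theta_hat, phi_hat, rng):
    # Assumption for illustration: f(y | x; theta) = N(theta[0] + theta[1]*x, theta[2]).
    mean, var = theta_hat[0] + theta_hat[1] * x_i, theta_hat[2]
    while True:
        y_star = rng.normal(mean, np.sqrt(var))                        # Step 1
        pi_star = expit(phi_hat[0] + phi_hat[1] * x_i + phi_hat[2] * y_star)
        if rng.uniform() < 1.0 - pi_star:                              # Step 2: accept w.p. 1 - pi
            return y_star                                              # Step 3: otherwise retry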

3. Monte Carlo EM Example 1 (Cont'd) Using the m imputed values of y_i, denoted by y_i^(1), ..., y_i^(m), the M-step can be implemented by solving Σ_{i=1}^n Σ_{j=1}^m S(θ; x_i, y_i^(j)) = 0 and Σ_{i=1}^n Σ_{j=1}^m { δ_i − π(φ; x_i, y_i^(j)) } (1, x_i, y_i^(j)) = 0, where S(θ; x_i, y_i) = ∂ log f(y_i | x_i; θ)/∂θ. Jae-Kwang Kim (ISU) Part 2 69 / 181

3. Monte Carlo EM Example 2 (GLMM) [Example 3.18] Basic Setup: Let y_ij be a binary random variable (taking 0 or 1) with probability p_ij = Pr(y_ij = 1 | x_ij, a_i), and assume that logit(p_ij) = x_ij′β + a_i, where x_ij is a p-dimensional covariate associated with the j-th repetition of unit i, β is the parameter of interest that can represent the treatment effect due to x, and a_i is the random effect associated with unit i. We assume that the a_i are iid N(0, σ²). Missing data: a_i. Observed likelihood: L_obs(β, σ²) = ∏_i ∫ { ∏_j p(x_ij, a_i; β)^{y_ij} [1 − p(x_ij, a_i; β)]^{1−y_ij} } (1/σ) φ(a_i/σ) da_i, where φ(·) is the pdf of the standard normal distribution. Jae-Kwang Kim (ISU) Part 2 70 / 181

3. Monte Carlo EM Example 2 (Cont'd) MCEM approach: generate a_i* by the Metropolis-Hastings algorithm: 1 Generate a_i* from f_2(a_i; σ̂). 2 Set a_i^(t) = a_i* with probability ρ(a_i^(t−1), a_i*) and a_i^(t) = a_i^(t−1) with probability 1 − ρ(a_i^(t−1), a_i*), where ρ(a_i^(t−1), a_i*) = min{ f_1(y_i | x_i, a_i*; β̂) / f_1(y_i | x_i, a_i^(t−1); β̂), 1 } and f(a_i | x_i, y_i; β̂, σ̂) ∝ f_1(y_i | x_i, a_i; β̂) f_2(a_i; σ̂). Jae-Kwang Kim (ISU) Part 2 71 / 181
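A sketch of this independence Metropolis-Hastings update for a single a_i (names are mine); because the proposal is the prior f_2(a; σ̂), the acceptance ratio reduces to the likelihood ratio shown on the slide:

import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_f1(y_i, X_i, a, beta):
    # Bernoulli log-likelihood of unit i's responses given random effect a
    p = np.clip(expit(X_i @ beta + a), 1e-12, 1 - 1e-12)
    return np.sum(y_i * np.log(p) + (1 - y_i) * np.log(1 - p))

def mh_update_ai(a_prev, y_i, X_i, beta_hat, sigma_hat, rng):
    a_star = rng.normal(0.0, sigma_hat)          # proposal drawn from f2(a; sigma_hat)
    log_rho = log_f1(y_i, X_i, a_star, beta_hat) - log_f1(y_i, X_i, a_prev, beta_hat)
    # Accept with probability rho = min{likelihood ratio, 1}
    return a_star if np.log(rng.uniform()) < min(log_rho, 0.0) else a_prev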

3. Monte Carlo EM Remark Monte Carlo EM can be used as a frequentist approach to imputation. Convergence is not guaranteed (for fixed m). E-step can be computationally heavy. (May use MCMC method). Jae-Kwang Kim (ISU) Part 2 72 / 181

4. Parametric Fractional Imputation Parametric fractional imputation 1 More than one (say m) imputed values of y_mis,i: y_mis,i^(1), ..., y_mis,i^(m), generated from some (initial) density h(y_mis,i). 2 Create a weighted data set {(w_ij*, y_ij*); j = 1, 2, ..., m; i = 1, 2, ..., n}, where Σ_{j=1}^m w_ij* = 1, y_ij* = (y_obs,i, y_mis,i^(j)), w_ij* ∝ f(y_ij*, δ_i; η̂)/h(y_mis,i^(j)), η̂ is the maximum likelihood estimator of η, and f(y, δ; η) is the joint density of (y, δ). 3 The weights w_ij* are the normalized importance weights and can be called fractional weights. If y_mis,i is categorical, then simply use all possible values of y_mis,i as the imputed values and assign their conditional probabilities as the fractional weights. Jae-Kwang Kim (ISU) Part 2 73 / 181

4. Parametric Fractional Imputation Remark Importance sampling idea: For sufficiently large m, Σ_{j=1}^m w_ij* g(y_ij*) ≅ ∫ g(y_i) {f(y_i, δ_i; η̂)/h(y_mis,i)} h(y_mis,i) dy_mis,i / ∫ {f(y_i, δ_i; η̂)/h(y_mis,i)} h(y_mis,i) dy_mis,i = E{g(y_i) | y_obs,i, δ_i; η̂} for any g such that the expectation exists. In the importance sampling literature, h(·) is called the proposal distribution and f(·) the target distribution. We do not need to compute the conditional distribution f(y_mis,i | y_obs,i, δ_i; η); only the joint distribution f(y_obs,i, y_mis,i, δ_i; η) is needed, because {f(y_obs,i, y_mis,i^(j), δ_i; η̂)/h(y_i,mis^(j))} / Σ_{k=1}^m {f(y_obs,i, y_mis,i^(k), δ_i; η̂)/h(y_i,mis^(k))} = {f(y_mis,i^(j) | y_obs,i, δ_i; η̂)/h(y_i,mis^(j))} / Σ_{k=1}^m {f(y_mis,i^(k) | y_obs,i, δ_i; η̂)/h(y_i,mis^(k))}. Jae-Kwang Kim (ISU) Part 2 74 / 181

4. Parametric Fractional Imputation EM algorithm by fractional imputation 1 Imputation step: generate y_mis,i^(j) ~ h(y_i,mis). 2 Weighting step: compute w_ij(t)* ∝ f(y_ij*, δ_i; η̂^(t))/h(y_i,mis^(j)), where Σ_{j=1}^m w_ij(t)* = 1. 3 M-step: update η̂^(t+1) as the solution to Σ_{i=1}^n Σ_{j=1}^m w_ij(t)* S(η; y_ij*, δ_i) = 0. 4 Repeat Step 2 and Step 3 until convergence. Imputation step + Weighting step = E-step. We may add an optional step that checks whether w_ij(t)* is too large for some j; in this case, h(y_i,mis) needs to be changed. Jae-Kwang Kim (ISU) Part 2 75 / 181
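The weighting step amounts to self-normalized importance weights; a minimal sketch (names are mine), working on the log scale for numerical stability:

import numpy as np

def fractional_weights(log_f_joint, log_h):
    # log_f_joint[i, j] = log f(y*_ij, delta_i; eta_hat(t)); log_h[i, j] = log h(y_mis,i^(j)).
    logw = log_f_joint - log_h
    logw = logw - logw.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)         # each row sums to one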

4. Parametric Fractional imputation The imputed values are not changed at each EM iteration; only the fractional weights are changed. 1 Computationally efficient (because we use importance sampling only once). 2 Convergence is achieved (because the imputed values are not changed). For sufficiently large t, η̂^(t) → η̂*. Also, for sufficiently large m, η̂* → η̂_MLE. For estimation of ψ in E{U(ψ; Y)} = 0, simply solve (1/n) Σ_{i=1}^n Σ_{j=1}^m w_ij* U(ψ; y_ij*) = 0. Linearization variance estimation (using the result of Theorem 2) is discussed in Kim (2011). Jae-Kwang Kim (ISU) Part 2 76 / 181

4. Parametric Fractional imputation Return to Example 1 Fractional imputation 1 Imputation step: Generate y_i^(1), ..., y_i^(m) from f(y_i | x_i; θ̂^(0)). 2 Weighting step: Using the m imputed values generated from Step 1, compute the fractional weights as w_ij(t)* ∝ f(y_i^(j) | x_i; θ̂^(t)) {1 − π(x_i, y_i^(j); φ̂^(t))} / f(y_i^(j) | x_i; θ̂^(0)), where π(x_i, y_i; φ̂) = exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i) / {1 + exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i)}. Jae-Kwang Kim (ISU) Part 2 77 / 181

4. Parametric Fractional imputation Return to Example 1 (Cont'd) Using the imputed data and the fractional weights, the M-step can be implemented by solving Σ_{i=1}^n Σ_{j=1}^m w_ij(t)* S(θ; x_i, y_i^(j)) = 0 and Σ_{i=1}^n Σ_{j=1}^m w_ij(t)* { δ_i − π(φ; x_i, y_i^(j)) } (1, x_i, y_i^(j)) = 0, (7) where S(θ; x_i, y_i) = ∂ log f(y_i | x_i; θ)/∂θ. Jae-Kwang Kim (ISU) Part 2 78 / 181

4. Parametric Fractional imputation Example 3: Categorical missing data Original data
i   Y1i  Y2i  Xi
1   1    1    3
2   1    ?    4
3   ?    0    5
4   0    1    6
5   0    0    7
6   ?    ?    8
Jae-Kwang Kim (ISU) Part 2 79 / 181

4. Parametric Fractional imputation Example 3 (Cont'd) Y_1 and Y_2 are dichotomous and X is continuous. Model: Pr(Y_1 = 1 | X) = Λ(α_0 + α_1 X) and Pr(Y_2 = 1 | X, Y_1) = Λ(β_0 + β_1 X + β_2 Y_1), where Λ(x) = {1 + exp(−x)}^{-1}. Assume MAR. Jae-Kwang Kim (ISU) Part 2 80 / 181

4. Parametric Fractional imputation Example 3 (Cont'd) Imputed data
i   Fractional Weight                  Y1i  Y2i  Xi
1   1                                  1    1    3
2   Pr(Y2 = 0 | Y1 = 1, X = 4)         1    0    4
    Pr(Y2 = 1 | Y1 = 1, X = 4)         1    1    4
3   Pr(Y1 = 0 | Y2 = 0, X = 5)         0    0    5
    Pr(Y1 = 1 | Y2 = 0, X = 5)         1    0    5
4   1                                  0    1    6
5   1                                  0    0    7
6   Pr(Y1 = 0, Y2 = 0 | X = 8)         0    0    8
    Pr(Y1 = 0, Y2 = 1 | X = 8)         0    1    8
    Pr(Y1 = 1, Y2 = 0 | X = 8)         1    0    8
    Pr(Y1 = 1, Y2 = 1 | X = 8)         1    1    8
Jae-Kwang Kim (ISU) Part 2 81 / 181

4. Parametric Fractional imputation Example 3 (Cont'd) Implementation of the EM algorithm using fractional imputation E-step: compute the mean score functions using the fractional weights. M-step: solve the mean score equations. Because we have a completed data set with weights, we can also estimate other parameters such as θ = Pr(Y_2 = 1 | X > 5). Jae-Kwang Kim (ISU) Part 2 82 / 181

4. Parametric Fractional imputation Example 4: Measurement error model [Example 4.13 of KS] Interested in estimating θ in f(y | x; θ). Instead of observing x, we observe z, which can be highly correlated with x. Thus, z is an instrumental variable for x: f(y | x, z) = f(y | x) and f(y | z = a) ≠ f(y | z = b) for a ≠ b. In addition to the original sample, we have a separate calibration sample that observes (x_i, z_i). Jae-Kwang Kim (ISU) Part 2 83 / 181

4. Parametric Fractional imputation Table: Data Structure
                    Z    X    Y
Calibration Sample  o    o
Original Sample     o         o
Jae-Kwang Kim (ISU) Part 2 84 / 181

4. Parametric Fractional imputation Example 4 (Cont'd) The goal is to generate x_i in the original sample from f(x_i | z_i, y_i) ∝ f(x_i | z_i) f(y_i | x_i, z_i) = f(x_i | z_i) f(y_i | x_i). Obtain a consistent estimator f̂(x | z) from the calibration sample. E-step: 1 Generate x_i^(1), ..., x_i^(m) from f̂(x_i | z_i). 2 Compute the fractional weights associated with x_i^(j) by w_ij* ∝ f(y_i | x_i^(j); θ̂). M-step: Solve the weighted score equation for θ. Jae-Kwang Kim (ISU) Part 2 85 / 181

5. Multiple imputation Features 1 Imputed values are generated from y_i,mis^(j) ~ f(y_i,mis | y_i,obs, δ_i; η*), where η* is generated from the posterior distribution π(η | y_obs). 2 The variance estimation formula is simple (Rubin's formula): V̂_MI(ψ̄_m) = W_m + (1 + 1/m) B_m, where W_m = m^{-1} Σ_{k=1}^m V̂_I^(k), B_m = (m − 1)^{-1} Σ_{k=1}^m (ψ̂^(k) − ψ̄_m)², ψ̄_m = m^{-1} Σ_{k=1}^m ψ̂^(k) is the average of the m imputed estimators, and V̂_I^(k) is the imputed version of the variance estimator of ψ̂ under complete response. Jae-Kwang Kim (ISU) Part 2 86 / 181
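A minimal sketch of Rubin's combining rules for a scalar parameter (names are mine):

import numpy as np

def rubin_combine(psi_hat, var_hat):
    # psi_hat: m point estimates psi_hat^(k); var_hat: m complete-data variance estimates V_I^(k).
    psi_hat, var_hat = np.asarray(psi_hat, float), np.asarray(var_hat, float)
    m = len(psi_hat)
    psi_bar = psi_hat.mean()
    W = var_hat.mean()                 # within-imputation variance W_m
    B = psi_hat.var(ddof=1)            # between-imputation variance B_m
    return psi_bar, W + (1 + 1 / m) * B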

5. Multiple imputation The computation for Bayesian imputation can be implemented by the data augmentation technique (Tanner and Wong, 1987), which is a special application of Gibbs sampling: 1 I-step: Generate y_mis* ~ f(y_mis | y_obs, δ; η*). 2 P-step: Generate η* ~ π(η | y_obs, y_mis*, δ). Needs some tools for checking the convergence to a stable distribution. Consistency of the variance estimator is questionable (Kim et al., 2006). Jae-Kwang Kim (ISU) Part 2 87 / 181

6. Simulation Study Simulation 1 Bivariate data (x_i, y_i) of size n = 200 with x_i ~ N(3, 1) and y_i | x_i ~ N(−2 + x_i, 1). x_i always observed, y_i subject to missingness. MCAR (δ_i ~ Bernoulli(0.6)). Parameters of interest: 1 θ_1 = E(Y), 2 θ_2 = Pr(Y < 1). Multiple imputation (MI) and fractional imputation (FI) are applied with m = 50. For estimation of θ_2, the following method-of-moments estimator is used: θ̂_2,MME = n^{-1} Σ_{i=1}^n I(y_i < 1). Jae-Kwang Kim (ISU) Part 2 88 / 181

6. Simulation Study Table 1: Monte Carlo bias and variance of the point estimators.
Parameter  Estimator        Bias  Variance  Std Var
θ1         Complete sample  0.00  0.0100    100
           MI               0.00  0.0134    134
           FI               0.00  0.0133    133
θ2         Complete sample  0.00  0.00129   100
           MI               0.00  0.00137   106
           FI               0.00  0.00137   106
Table 2: Monte Carlo relative bias of the variance estimator.
Parameter  Imputation  Relative bias (%)
V(θ̂1)      MI          -0.24
           FI          1.21
V(θ̂2)      MI          23.08
           FI          2.05
Jae-Kwang Kim (ISU) Part 2 89 / 181

6. Simulation study Rubin's formula is based on the following decomposition: V(θ̂_MI) = V(θ̂_n) + V(θ̂_MI − θ̂_n), where θ̂_n is the complete-sample estimator of θ. Basically, the W_m term estimates V(θ̂_n) and the (1 + m^{-1})B_m term estimates V(θ̂_MI − θ̂_n). For the general case, we have V(θ̂_MI) = V(θ̂_n) + V(θ̂_MI − θ̂_n) + 2Cov(θ̂_MI − θ̂_n, θ̂_n), and Rubin's variance estimator ignores the covariance term. Thus, a sufficient condition for the validity of unbiased variance estimation is Cov(θ̂_MI − θ̂_n, θ̂_n) = 0. Meng (1994) called this condition the congeniality of θ̂_n. Congeniality holds when θ̂_n is the MLE of θ. Jae-Kwang Kim (ISU) Part 2 90 / 181

6. Simulation study For example, there are two estimators of θ = P(Y < 1) when Y follows N(μ, σ²): 1 Maximum likelihood method: θ̂_MLE = ∫_{−∞}^1 φ(z; μ̂, σ̂²) dz. 2 Method of moments: θ̂_MME = n^{-1} Σ_{i=1}^n I(y_i < 1). In the simulation setup, the imputed estimator of θ_2 can be expressed as θ̂_2,I = n^{-1} Σ_{i=1}^n [ δ_i I(y_i < 1) + (1 − δ_i) E{I(y_i < 1) | x_i; μ̂, σ̂} ]. Thus, the imputed estimator of θ_2 borrows strength by making use of the extra information associated with f(y | x). Hence, when the congeniality condition does not hold, the imputed estimator improves efficiency (due to the imputation model that uses extra information), but the variance estimator does not recognize this improvement. Jae-Kwang Kim (ISU) Part 2 91 / 181

6. Simulation Study Simulation 2 Bivariate data (x_i, y_i) of size n = 100 with Y_i = β_0 + β_1 x_i + β_2 (x_i² − 1) + e_i, (8) where (β_0, β_1, β_2) = (0, 0.9, 0.06), x_i ~ N(0, 1), e_i ~ N(0, 0.16), and x_i and e_i are independent. The variable x_i is always observed, but the probability that y_i responds is 0.5. In MI, the imputer's model is Y_i = β_0 + β_1 x_i + e_i; that is, the imputer's model uses the extra information that β_2 = 0. From the imputed data, we fit model (8) and compute the power of a test of H_0: β_2 = 0 at the 0.05 significance level. In addition, we also consider the Complete-Case (CC) method, which simply uses the complete cases only for the regression analysis. Jae-Kwang Kim (ISU) Part 2 92 / 181

6. Simulation Study Table 3: Simulation results for the Monte Carlo experiment based on 10,000 Monte Carlo samples.
Method  E(θ̂)   V(θ̂)     R.B. (V̂)  Power
MI      0.028  0.00056  1.81      0.044
FI      0.046  0.00146  0.02      0.314
CC      0.060  0.00234  -0.01     0.285
Table 3 shows that MI provides a more efficient point estimator than the CC method, but its variance estimation is very conservative (more than 100% overestimation). Because of the serious positive bias of the MI variance estimator, the statistical power of the test based on MI is actually lower than that of the CC method. Jae-Kwang Kim (ISU) Part 2 93 / 181

7. Summary Imputation can be viewed as a Monte Carlo tool for computing the conditional expectation. Monte Carlo EM is very popular but the E-step can be computationally heavy. Parametric fractional imputation is a useful tool for frequentist imputation. Multiple imputation is motivated from a Bayesian framework. The frequentist validity of multiple imputation requires the condition of congeniality. Uncongeniality may lead to overestimation of variance which can seriously increase type-2 errors. Jae-Kwang Kim (ISU) Part 2 94 / 181
