Statistical Methods for Handling Missing Data


1 Statistical Methods for Handling Missing Data Jae-Kwang Kim Department of Statistics, Iowa State University July 5th, 2014

2 Outline Textbook : Statistical Methods for handling incomplete data by Kim and Shao (2013) Part 1: Basic Theory (Chapter 2-3) Part 2: Imputation (Chapter 4) Part 3: Propensity score approach (Chapter 5) Part 4: Nonignorable missing (Chapter 6) Jae-Kwang Kim (ISU) July 5th, / 181

3 Statistical Methods for Handling Missing Data Part 1: Basic Theory Jae-Kwang Kim Department of Statistics, Iowa State University

4 1 Introduction Definitions for likelihood theory
The likelihood function of θ is defined as L(θ) = f(y; θ), where f(y; θ) is the joint pdf of y.
Let θ̂ be the maximum likelihood estimator (MLE) of θ_0, i.e., L(θ̂) = max_{θ∈Θ} L(θ).
A parametric family of densities, P = {f(y; θ); θ ∈ Θ}, is identifiable if, for all y, f(y; θ_1) ≠ f(y; θ_2) for every θ_1 ≠ θ_2.

5 1 Introduction - Fisher information Definition
1. Score function: S(θ) = ∂ log L(θ)/∂θ
2. Fisher information (curvature of the log-likelihood): I(θ) = −∂² log L(θ)/∂θ∂θᵀ = −∂S(θ)/∂θᵀ
3. Observed (Fisher) information: I(θ̂_n), where θ̂_n is the MLE.
4. Expected (Fisher) information: 𝓘(θ) = E_θ{I(θ)}
The observed information is always positive. The observed information applies to a single dataset.
The expected information is meaningful as a function of θ across the admissible values of θ; it is an average quantity over all possible datasets.
I(θ̂) = 𝓘(θ̂) for the exponential family.

6 1 Introduction - Fisher information Lemma 1. Properties of score functions [Theorem 2.3 of KS] Under regularity conditions allowing the exchange of the order of integration and differentiation, E θ {S(θ)} = 0 and V θ {S(θ)} = I(θ). Jae-Kwang Kim (ISU) Part 1 6 / 181

7 1 Introduction - Fisher information Remark
Under some regularity conditions, the MLE θ̂ converges in probability to the true parameter θ_0. Thus, we can apply a Taylor linearization to S(θ̂) = 0 to get
θ̂ − θ_0 ≅ {𝓘(θ_0)}⁻¹ S(θ_0).
Here, we use the fact that I(θ) = −∂S(θ)/∂θᵀ converges in probability to 𝓘(θ). Thus, the (asymptotic) variance of the MLE is
V(θ̂) ≅ {𝓘(θ_0)}⁻¹ V{S(θ_0)} {𝓘(θ_0)}⁻¹ = {𝓘(θ_0)}⁻¹,
where the last equality follows from Lemma 1.

8 2 Observed Likelihood Basic Setup
Let y = (y_1, ..., y_p) be a p-dimensional random vector with probability density function f(y; θ) whose dominating measure is µ.
Let δ_ij be the response indicator of y_ij, with δ_ij = 1 if y_ij is observed and δ_ij = 0 otherwise.
δ_i = (δ_i1, ..., δ_ip): p-dimensional random vector with density P(δ | y), assuming P(δ | y) = P(δ | y; φ) for some φ.
Let (y_i,obs, y_i,mis) be the observed part and missing part of y_i, respectively.
Let R(y_obs, δ) = {y; y_obs(y_i, δ_i) = y_i,obs, i = 1, ..., n} be the set of all possible values of y with the same realized value of y_obs for given δ, where y_obs(y_i, δ_i) is a function that gives the value of y_ij for δ_ij = 1.

9 2 Observed Likelihood Definition: Observed likelihood
Under the above setup, the observed likelihood of (θ, φ) is
L_obs(θ, φ) = ∫_{R(y_obs, δ)} f(y; θ) P(δ | y; φ) dµ(y).
Under the IID setup, the observed likelihood is
L_obs(θ, φ) = ∏_{i=1}^n [ ∫ f(y_i; θ) P(δ_i | y_i; φ) dµ(y_i,mis) ],
where it is understood that, if y_i = y_i,obs and y_i,mis is empty, then there is nothing to integrate out.
In the special case of scalar y, the observed likelihood is
L_obs(θ, φ) = ∏_{δ_i=1} [ f(y_i; θ) π(y_i; φ) ] ∏_{δ_i=0} [ ∫ f(y; θ) {1 − π(y; φ)} dy ],
where π(y; φ) = P(δ = 1 | y; φ).

10 2 Observed Likelihood Example 1 [Example 2.3 of KS]
Let t_1, t_2, ..., t_n be an IID sample from a distribution with density f_θ(t) = θ e^{−θt} I(t > 0). Instead of observing t_i, we observe (y_i, δ_i), where
y_i = t_i if δ_i = 1 and y_i = c if δ_i = 0, with δ_i = 1 if t_i ≤ c and δ_i = 0 if t_i > c,
where c is a known censoring time. The observed likelihood for θ can be derived as
L_obs(θ) = ∏_{i=1}^n {f_θ(y_i)}^{δ_i} {P(t_i > c)}^{1−δ_i} = θ^{∑_{i=1}^n δ_i} exp(−θ ∑_{i=1}^n y_i).
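The closed form of L_obs(θ) makes this example easy to check numerically. Below is a minimal Python sketch (not from the slides; the simulation settings are illustrative) that generates censored exponential data and maximizes the observed log-likelihood; the numerical maximizer agrees with the closed-form MLE θ̂ = ∑δ_i / ∑y_i used later in Part 1.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, theta0, c = 1000, 2.0, 0.5          # illustrative true rate and known censoring time
t = rng.exponential(scale=1/theta0, size=n)
delta = (t <= c).astype(float)          # response indicator
y = np.where(delta == 1, t, c)          # observed value: t_i or c

def neg_loglik(theta):
    # log L_obs(theta) = (sum delta_i) log(theta) - theta * sum y_i
    return -(delta.sum() * np.log(theta) - theta * y.sum())

theta_hat = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x
print(theta_hat, delta.sum() / y.sum())  # numerical MLE agrees with the closed form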

11 2 Observed Likelihood Definition: Missing At Random (MAR)
P(δ | y) is the density of the conditional distribution of δ given y. Let y_obs = y_obs(y, δ), where y_i,obs = y_i if δ_i = 1 and y_i,obs is missing if δ_i = 0.
The response mechanism is MAR if
P(δ | y_1) = P(δ | y_2)   {or P(δ | y) = P(δ | y_obs)}
for all y_1 and y_2 satisfying y_obs(y_1, δ) = y_obs(y_2, δ).
MAR: the response mechanism P(δ | y) depends on y only through y_obs.
Let y = (y_obs, y_mis). By Bayes' theorem,
P(y_mis | y_obs, δ) = {P(δ | y_mis, y_obs) / P(δ | y_obs)} P(y_mis | y_obs).
MAR: P(y_mis | y_obs, δ) = P(y_mis | y_obs). That is, y_mis ⊥ δ | y_obs.
MAR: the conditional independence of δ and y_mis given y_obs.

12 2 Observed Likelihood Remark
MCAR (Missing Completely At Random): P(δ | y) does not depend on y.
MAR (Missing At Random): P(δ | y) = P(δ | y_obs)
NMAR (Not Missing At Random): P(δ | y) ≠ P(δ | y_obs)
Thus, MCAR is a special case of MAR.

13 2 Observed Likelihood Theorem 1: Likelihood factorization (Rubin, 1976) [Theorem 2.4 of KS] P φ (δ y) is the joint density of δ given y and f θ (y) is the joint density of y. Under conditions 1 the parameters θ and φ are distinct and 2 MAR condition holds, the observed likelihood can be written as L obs (θ, φ) = L 1(θ)L 2(φ), and the MLE of θ can be obtained by maximizing L 1(θ). Thus, we do not have to specify the model for response mechanism. The response mechanism is called ignorable if the above likelihood factorization holds. Jae-Kwang Kim (ISU) Part 1 13 / 181

14 2 Observed Likelihood Example 2 [Example 2.4 of KS]
Bivariate data (x_i, y_i) with pdf f(x, y) = f_1(y | x) f_2(x).
x_i is always observed and y_i is subject to missingness.
Assume that the response status variable δ_i of y_i satisfies
P(δ_i = 1 | x_i, y_i) = Λ_1(φ_0 + φ_1 x_i + φ_2 y_i)
for some function Λ_1(·) of known form.
Let θ be the parameter of interest in the regression model f_1(y | x; θ). Let α be the parameter in the marginal distribution of x, denoted by f_2(x_i; α). Define Λ_0(x) = 1 − Λ_1(x).
Three parameters: θ, the parameter of interest; α and φ, nuisance parameters.

15 2 Observed Likelihood Example 2 (Cont'd)
Observed likelihood
L_obs(θ, α, φ) = ∏_{δ_i=1} f_1(y_i | x_i; θ) f_2(x_i; α) Λ_1(φ_0 + φ_1 x_i + φ_2 y_i) × ∏_{δ_i=0} ∫ f_1(y | x_i; θ) f_2(x_i; α) Λ_0(φ_0 + φ_1 x_i + φ_2 y) dy
= L_1(θ, φ) × L_2(α),
where L_2(α) = ∏_{i=1}^n f_2(x_i; α).
Thus, we can safely ignore the marginal distribution of x if x is completely observed.

16 2 Observed Likelihood Example 2 (Cont'd)
If φ_2 = 0, then MAR holds and L_1(θ, φ) = L_1a(θ) L_1b(φ), where
L_1a(θ) = ∏_{δ_i=1} f_1(y_i | x_i; θ)
and
L_1b(φ) = ∏_{δ_i=1} Λ_1(φ_0 + φ_1 x_i) ∏_{δ_i=0} Λ_0(φ_0 + φ_1 x_i).
Thus, under MAR, the MLE of θ can be obtained by maximizing L_1a(θ), which is obtained by ignoring the missing part of the data.

17 2 Observed Likelihood Example 2 (Cont'd)
Instead of y_i subject to missingness, if x_i is subject to missingness, then the observed likelihood becomes
L_obs(θ, φ, α) = ∏_{δ_i=1} f_1(y_i | x_i; θ) f_2(x_i; α) Λ_1(φ_0 + φ_1 x_i + φ_2 y_i) × ∏_{δ_i=0} ∫ f_1(y_i | x; θ) f_2(x; α) Λ_0(φ_0 + φ_1 x + φ_2 y_i) dx,
which no longer factorizes into L_1(θ, φ) L_2(α).
If φ_1 = 0, then L_obs(θ, α, φ) = L_1(θ, α) L_2(φ) and MAR holds.
Although we are not interested in the marginal distribution of x, we still need to specify the model for the marginal distribution of x.

18 3 Mean Score Approach
The observed likelihood is the marginal density of (y_obs, δ):
L_obs(η) = ∫_{R(y_obs, δ)} f(y; θ) P(δ | y; φ) dµ(y) = ∫ f(y; θ) P(δ | y; φ) dµ(y_mis),
where y_mis is the missing part of y and η = (θ, φ).
Observed score equation: S_obs(η) ≡ ∂ log L_obs(η)/∂η = 0
Computing the observed score function can be computationally challenging because the observed likelihood is in integral form.

19 3 Mean Score Approach Theorem 2: Mean Score Theorem (Fisher, 1922) [Theorem 2.5 of KS]
Under some regularity conditions, the observed score function equals the mean score function. That is, S_obs(η) = S̄(η), where
S̄(η) = E{S_com(η) | y_obs, δ},
S_com(η) = ∂ log f(y, δ; η)/∂η,  f(y, δ; η) = f(y; θ) P(δ | y; φ).
The mean score function is computed by taking the conditional expectation of the complete-sample score function given the observations.
The mean score function is easier to compute than the observed score function.

20 3 Mean Score Approach Proof of Theorem 2
Since L_obs(η) = f(y, δ; η) / f(y, δ | y_obs, δ; η), we have
∂ ln L_obs(η)/∂η = ∂ ln f(y, δ; η)/∂η − ∂ ln f(y, δ | y_obs, δ; η)/∂η.
Taking the conditional expectation of the above equation over the conditional distribution of (y, δ) given (y_obs, δ), we have
∂ ln L_obs(η)/∂η = E{∂ ln L_obs(η)/∂η | y_obs, δ} = E{S_com(η) | y_obs, δ} − E{∂ ln f(y, δ | y_obs, δ; η)/∂η | y_obs, δ}.
Here, the first equality holds because L_obs(η) is a function of (y_obs, δ) only. The last term is equal to zero by Lemma 1, which states that the expected value of the score function is zero; the reference distribution in this case is the conditional distribution of (y, δ) given (y_obs, δ).

21 3 Mean Score Approach Example 3 [Example 2.6 of KS]
1. Suppose that the study variable y follows a normal distribution with mean x′β and variance σ². The score equations for β and σ² under complete response are
S_1(β, σ²) = ∑_{i=1}^n (y_i − x_i′β) x_i / σ² = 0
and
S_2(β, σ²) = −n/(2σ²) + ∑_{i=1}^n (y_i − x_i′β)² / (2σ⁴) = 0.
2. Assume that the y_i are observed only for the first r elements and the MAR assumption holds. In this case, the mean score functions reduce to
S̄_1(β, σ²) = ∑_{i=1}^r (y_i − x_i′β) x_i / σ²
and
S̄_2(β, σ²) = −n/(2σ²) + ∑_{i=1}^r (y_i − x_i′β)² / (2σ⁴) + (n − r)/(2σ²).

22 3 Mean Score Approach Example 3 (Cont'd)
3. The maximum likelihood estimator obtained by solving the mean score equations is
β̂ = ( ∑_{i=1}^r x_i x_i′ )⁻¹ ∑_{i=1}^r x_i y_i  and  σ̂² = (1/r) ∑_{i=1}^r (y_i − x_i′β̂)².
Thus, the resulting estimators can also be obtained by simply ignoring the missing part of the sample, which is consistent with the result in Example 2 (for φ_2 = 0).
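As a quick illustration of the result in Example 3, the following Python sketch (illustrative and not from the slides; MCAR response is used as a special case of MAR) computes β̂ and σ̂² from the respondents only.

import numpy as np

rng = np.random.default_rng(1)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])    # design matrix with intercept
beta_true = np.array([1.0, 2.0])
y = x @ beta_true + rng.normal(size=n)
delta = rng.binomial(1, 0.7, size=n).astype(bool)         # MCAR response indicator

xr, yr = x[delta], y[delta]                               # respondents only
beta_hat = np.linalg.solve(xr.T @ xr, xr.T @ yr)          # (sum x_i x_i')^{-1} sum x_i y_i
sigma2_hat = np.mean((yr - xr @ beta_hat) ** 2)           # divides by r, as on the slide
print(beta_hat, sigma2_hat)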

23 3 Mean Score Approach Discussion of Example 3
We are interested in estimating θ for the conditional density f(y | x; θ). Under MAR, the observed likelihood for θ is
L_obs(θ) = ∏_{i=1}^r f(y_i | x_i; θ) ∏_{i=r+1}^n ∫ f(y | x_i; θ) dµ(y) = ∏_{i=1}^r f(y_i | x_i; θ).
The same conclusion follows from the mean score theorem. Under MAR, the mean score function is
S̄(θ) = ∑_{i=1}^r S(θ; x_i, y_i) + ∑_{i=r+1}^n E{S(θ; x_i, Y) | x_i} = ∑_{i=1}^r S(θ; x_i, y_i),
where S(θ; x, y) is the score function for θ and the second equality follows from Lemma 1.

24 3 Mean Score Approach Example 4 [Example 2.5 of KS]
1. Suppose that the study variable y is distributed as a Bernoulli random variable with probability of success p_i, where
p_i = p_i(β) = exp(x_i′β) / {1 + exp(x_i′β)}
for some unknown parameter β, and x_i is the vector of covariates in the logistic regression model for y_i. We assume that 1 is in the column space of x_i.
2. Under complete response, the score function for β is
S_1(β) = ∑_{i=1}^n {y_i − p_i(β)} x_i.

25 3 Mean Score Approach Example 4 (Cont'd)
3. Let δ_i be the response indicator for y_i with distribution Bernoulli(π_i), where
π_i = exp(x_i′φ_0 + y_i φ_1) / {1 + exp(x_i′φ_0 + y_i φ_1)}.
We assume that x_i is always observed, but y_i is missing if δ_i = 0.
4. Under missing data, the mean score function for β is
S̄_1(β, φ) = ∑_{δ_i=1} {y_i − p_i(β)} x_i + ∑_{δ_i=0} ∑_{y=0}^1 w_i(y; β, φ) {y − p_i(β)} x_i,
where w_i(y; β, φ) is the conditional probability of y_i = y given x_i and δ_i = 0:
w_i(y; β, φ) = P_β(y_i = y | x_i) P_φ(δ_i = 0 | y_i = y, x_i) / ∑_{z=0}^1 P_β(y_i = z | x_i) P_φ(δ_i = 0 | y_i = z, x_i).
Thus, S̄_1(β, φ) is also a function of φ.

26 3 Mean Score Approach Example 4 (Cont'd)
5. If the response mechanism is MAR so that φ_1 = 0, then
w_i(y; β, φ) = P_β(y_i = y | x_i) / ∑_{z=0}^1 P_β(y_i = z | x_i) = P_β(y_i = y | x_i),
and so
S̄_1(β, φ) = ∑_{δ_i=1} {y_i − p_i(β)} x_i,
the complete-case score function, which is free of φ.
6. If MAR does not hold, then (β̂, φ̂) can be obtained by solving S̄_1(β, φ) = 0 and S̄_2(β, φ) = 0 jointly, where
S̄_2(β, φ) = ∑_{δ_i=1} {δ_i − π(φ; x_i, y_i)} (x_i′, y_i)′ + ∑_{δ_i=0} ∑_{y=0}^1 w_i(y; β, φ) {δ_i − π(φ; x_i, y)} (x_i′, y)′.

27 3 Mean Score Approach Discussion of Example 4
We may not have a unique solution to S̄(η) = 0, where S̄(η) = [S̄_1(β, φ), S̄_2(β, φ)], when MAR does not hold, because of the non-identifiability problem associated with nonignorable missingness.
To avoid this problem, a reduced model is often used for the response mechanism:
Pr(δ = 1 | x, y) = Pr(δ = 1 | u, y), where x = (u, z).
The reduced response model introduces a smaller set of parameters, so the identifiability problem can be resolved. (More discussion will be given in the Part 4 lecture.)
Computing the solution of S̄(η) = 0 is also difficult. The EM algorithm, which will be presented soon, is a useful computational tool.

28 4 Observed information Definition
1. Observed score function: S_obs(η) = ∂ log L_obs(η)/∂η
2. Fisher information from the observed likelihood: I_obs(η) = −∂² log L_obs(η)/∂η∂ηᵀ
3. Expected (Fisher) information from the observed likelihood: 𝓘_obs(η) = E_η{I_obs(η)}.
Lemma 2 [Theorem 2.6 of KS]
Under regularity conditions,
E{S_obs(η)} = 0 and V{S_obs(η)} = 𝓘_obs(η),
where 𝓘_obs(η) = E_η{I_obs(η)} is the expected information from the observed likelihood.

29 4 Observed information
Under missing data, the MLE η̂ is the solution to S̄(η) = 0. Under some regularity conditions, η̂ converges in probability to η_0 and has asymptotic variance {𝓘_obs(η_0)}⁻¹, with
𝓘_obs(η) = E{−∂S_obs(η)/∂ηᵀ} = E{S_obs(η)^{⊗2}} = E{S̄(η)^{⊗2}},  where B^{⊗2} = BBᵀ.
For variance estimation of η̂, one may use {I_obs(η̂)}⁻¹.
IID setup: The empirical (Fisher) information estimator of the variance of η̂ is
{Ĥ(η̂)}⁻¹ = { ∑_{i=1}^n S̄_i(η̂)^{⊗2} }⁻¹, where S̄_i(η) = E{S_i(η) | y_i,obs, δ_i} (Redner and Walker, 1984).
In general, I_obs(η̂) is preferred to 𝓘_obs(η̂) for variance estimation of η̂.

30 4 Observed information Return to Example 1
Observed score function: S_obs(θ) = ∑_{i=1}^n δ_i / θ − ∑_{i=1}^n y_i
MLE for θ: θ̂ = ∑_{i=1}^n δ_i / ∑_{i=1}^n y_i
Fisher information: I_obs(θ) = ∑_{i=1}^n δ_i / θ²
Expected information: 𝓘_obs(θ) = ∑_{i=1}^n (1 − e^{−θc}) / θ² = n(1 − e^{−θc}) / θ².
Which one do you prefer?
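A small numerical check of this comparison (a sketch with illustrative settings, not from the slides): both information quantities are evaluated at the MLE and inverted to give variance estimates.

import numpy as np

rng = np.random.default_rng(2)
n, theta0, c = 1000, 2.0, 0.5
t = rng.exponential(1/theta0, n)
delta = (t <= c).astype(float)
y = np.where(delta == 1, t, c)

theta_hat = delta.sum() / y.sum()                          # MLE from the slide
I_obs = delta.sum() / theta_hat**2                         # observed information at theta_hat
EI_obs = n * (1 - np.exp(-theta_hat * c)) / theta_hat**2   # expected information at theta_hat
print(theta_hat, 1/I_obs, 1/EI_obs)                        # the two variance estimates are close here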

31 4 Observed information Motivation
L_com(η) = f(y, δ; η): complete-sample likelihood with no missing data
Fisher information associated with L_com(η): I_com(η) = −∂S_com(η)/∂ηᵀ = −∂² log L_com(η)/∂η∂ηᵀ
L_obs(η): the observed likelihood
Fisher information associated with L_obs(η): I_obs(η) = −∂S_obs(η)/∂ηᵀ = −∂² log L_obs(η)/∂η∂ηᵀ
How can I_obs(η) be expressed in terms of I_com(η) and S_com(η)?

32 4 Observed information Theorem 3 (Louis, 1982; Oakes, 1999) [Theorem 2.7 of KS]
Under regularity conditions allowing the exchange of the order of integration and differentiation,
I_obs(η) = E{I_com(η) | y_obs, δ} − [ E{S_com(η)^{⊗2} | y_obs, δ} − S̄(η)^{⊗2} ]
= E{I_com(η) | y_obs, δ} − V{S_com(η) | y_obs, δ},
where S̄(η) = E{S_com(η) | y_obs, δ}.

33 Proof of Theorem 3
By Theorem 2, the observed information associated with L_obs(η) can be expressed as I_obs(η) = −∂S̄(η)/∂ηᵀ, where S̄(η) = E{S_com(η) | y_obs, δ; η}. Thus, we have
−∂S̄(η)/∂ηᵀ = −∂/∂ηᵀ ∫ S_com(η; y) f(y, δ | y_obs, δ; η) dµ(y)
= −∫ {∂S_com(η; y)/∂ηᵀ} f(y, δ | y_obs, δ; η) dµ(y) − ∫ S_com(η; y) {∂f(y, δ | y_obs, δ; η)/∂ηᵀ} dµ(y)
= E{−∂S_com(η)/∂ηᵀ | y_obs, δ} − ∫ S_com(η; y) {∂ log f(y, δ | y_obs, δ; η)/∂ηᵀ} f(y, δ | y_obs, δ; η) dµ(y).
The first term is equal to E{I_com(η) | y_obs, δ} and the second term is equal to
E{S_com(η) S_mis(η)ᵀ | y_obs, δ} = E[{S̄(η) + S_mis(η)} S_mis(η)ᵀ | y_obs, δ] = E{S_mis(η) S_mis(η)ᵀ | y_obs, δ},
because E{S̄(η) S_mis(η)ᵀ | y_obs, δ} = 0.

34 4. Observed information
S_mis(η): the score function associated with the conditional density f(y, δ | y_obs, δ).
Expected missing information: 𝓘_mis(η) = E{−∂S_mis(η)/∂ηᵀ}, satisfying 𝓘_mis(η) = E{S_mis(η)^{⊗2}}.
Missing information principle (Orchard and Woodbury, 1972):
𝓘_mis(η) = 𝓘_com(η) − 𝓘_obs(η),
where 𝓘_com(η) = E{−∂S_com(η)/∂ηᵀ} is the expected information from the complete-sample likelihood.
An alternative expression of the missing information principle is
V{S_mis(η)} = V{S_com(η)} − V{S̄(η)}.
Note that V{S_com(η)} = 𝓘_com(η) and V{S_obs(η)} = 𝓘_obs(η).

35 4. Observed information Example 5
1. Consider the following bivariate normal distribution:
(y_1i, y_2i)′ ~ N( (µ_1, µ_2)′, [σ_11 σ_12; σ_12 σ_22] ),  i = 1, 2, ..., n.
Assume for simplicity that σ_11, σ_12 and σ_22 are known constants, and let µ = (µ_1, µ_2)′ be the parameter of interest.
2. The complete-sample score function for µ is
S_com(µ) = ∑_{i=1}^n S_com^(i)(µ) = ∑_{i=1}^n [σ_11 σ_12; σ_12 σ_22]⁻¹ (y_1i − µ_1, y_2i − µ_2)′.
The information matrix of µ based on the complete sample is
I_com(µ) = n [σ_11 σ_12; σ_12 σ_22]⁻¹.

36 4. Observed information Example 5 (Cont'd)
3. Suppose that there are some missing values in y_1i and y_2i, and the original sample is partitioned into four sets:
H = both y_1 and y_2 respond; K = only y_1 is observed; L = only y_2 is observed; M = both y_1 and y_2 are missing.
Let n_H, n_K, n_L, n_M represent the sizes of H, K, L, M, respectively.
4. Assume that the response mechanism does not depend on the value of (y_1, y_2), so it is MAR. In this case, the observed score function of µ based on a single observation in set K is
E{S_com^(i)(µ) | y_1i, i ∈ K} = [σ_11 σ_12; σ_12 σ_22]⁻¹ (y_1i − µ_1, E(y_2i | y_1i) − µ_2)′ = (σ_11⁻¹ (y_1i − µ_1), 0)′.

37 4. Observed information Example 5 (Cont'd)
5. Similarly, we have
E{S_com^(i)(µ) | y_2i, i ∈ L} = (0, σ_22⁻¹ (y_2i − µ_2))′.
6. Therefore, the observed information matrix of µ is
I_obs(µ) = n_H [σ_11 σ_12; σ_12 σ_22]⁻¹ + n_K [σ_11⁻¹ 0; 0 0] + n_L [0 0; 0 σ_22⁻¹],
and the asymptotic variance of the MLE of µ can be obtained as the inverse of I_obs(µ).
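The observed information in Example 5 is easy to assemble numerically. A minimal sketch, assuming illustrative values for the known covariance matrix and the pattern counts (n_H, n_K, n_L, n_M):

import numpy as np

# counts of the four response patterns and the known covariance matrix (illustrative values)
n_H, n_K, n_L, n_M = 120, 40, 30, 10
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

Sigma_inv = np.linalg.inv(Sigma)
I_obs = (n_H * Sigma_inv
         + n_K * np.diag([1 / Sigma[0, 0], 0.0])    # only y1 observed
         + n_L * np.diag([0.0, 1 / Sigma[1, 1]]))   # only y2 observed
V_mu_hat = np.linalg.inv(I_obs)                     # asymptotic variance of the MLE of mu
print(V_mu_hat)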

38 5. EM algorithm
Interested in finding η̂ that maximizes L_obs(η).
The MLE can be obtained by solving S_obs(η) = 0, which is equivalent to solving S̄(η) = 0 by Theorem 2.
Computing the solution of S̄(η) = 0 can be challenging because it often involves computing I_obs(η) = −∂S̄(η)/∂ηᵀ in order to apply the Newton method:
η̂^(t+1) = η̂^(t) + {I_obs(η̂^(t))}⁻¹ S̄(η̂^(t)).
We may rely on Louis' formula (Theorem 3) to compute I_obs(η).
The EM algorithm provides an alternative method of solving S̄(η) = 0 by writing S̄(η) = E{S_com(η) | y_obs, δ; η} and using the following iterative method:
η̂^(t+1) ← solve E{S_com(η) | y_obs, δ; η̂^(t)} = 0.

39 5. EM algorithm Definition
Let η^(t) be the current value of the parameter estimate of η. The EM algorithm iteratively carries out the following E-step and M-step:
E-step: Compute Q(η | η^(t)) = E{ln f(y, δ; η) | y_obs, δ, η^(t)}
M-step: Find η^(t+1) that maximizes Q(η | η^(t)) w.r.t. η.
Theorem 4 (Dempster et al., 1977) [Theorem 3.2]
Let L_obs(η) = ∫_{R(y_obs, δ)} f(y, δ; η) dµ(y) be the observed likelihood of η. If Q(η^(t+1) | η^(t)) ≥ Q(η^(t) | η^(t)), then L_obs(η^(t+1)) ≥ L_obs(η^(t)).

40 5. EM algorithm Remark
1. Convergence of the EM algorithm is linear. It can be shown that
η^(t+1) − η^(t) ≅ J_mis (η^(t) − η^(t−1)),
where J_mis = I_com⁻¹ I_mis is called the fraction of missing information.
2. Under MAR and for an exponential family of the form f(y; θ) = b(y) exp{θ′T(y) − A(θ)}, the M-step computes θ^(t+1) as the solution to
E{T(y) | y_obs, θ^(t)} = E{T(y); θ}.

41 5. EM algorithm
Figure: Illustration of the EM algorithm for the exponential family (h_1(θ) = E{T(y) | y_obs, θ}, h_2(θ) = E{T(y); θ}).

42 5. EM algorithm Return to Example 4
E-step:
S̄_1(β | β^(t), φ^(t)) = ∑_{δ_i=1} {y_i − p_i(β)} x_i + ∑_{δ_i=0} ∑_{j=0}^1 w_{ij(t)} {j − p_i(β)} x_i,
where
w_{ij(t)} = Pr(Y_i = j | x_i, δ_i = 0; β^(t), φ^(t)) = Pr(Y_i = j | x_i; β^(t)) Pr(δ_i = 0 | x_i, j; φ^(t)) / ∑_{y=0}^1 Pr(Y_i = y | x_i; β^(t)) Pr(δ_i = 0 | x_i, y; φ^(t)),
and
S̄_2(φ | β^(t), φ^(t)) = ∑_{δ_i=1} {δ_i − π(x_i, y_i; φ)} (x_i′, y_i)′ + ∑_{δ_i=0} ∑_{j=0}^1 w_{ij(t)} {δ_i − π(x_i, j; φ)} (x_i′, j)′.

43 5. EM algorithm Return to Example 4 (Cont d) M-step: The parameter estimates are updated by solving [ S 1 (β β (t), φ (t)), S 2 ( φ β (t), φ (t))] = (0, 0) for β and φ. For categorical missing data, the conditional expectation in the E-step can be computed using the weighted mean with weights w ij(t). Ibrahim (1990) called this method EM by weighting. Observed information matrix can also be obtained by the Louis formula (in Theorem 3) using the weighted mean in the E-step. Jae-Kwang Kim (ISU) Part 1 43 / 181
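The E-step weights w_{ij(t)} are the heart of "EM by weighting." The sketch below is illustrative (the parameterization follows Example 4, but the function and argument names are hypothetical); it computes the fractional weights for one nonrespondent with binary y.

import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def estep_weights(x, beta_t, phi_t):
    """E-step weights for one unit with missing binary y.

    Returns the conditional probabilities of y = 0, 1 given x and delta = 0 under the
    outcome model Pr(y=1|x) = expit(x'beta) and the response model
    Pr(delta=1|x,y) = expit(x'phi[:-1] + y*phi[-1]).
    """
    p1 = expit(x @ beta_t)                       # Pr(y = 1 | x)
    py = np.array([1 - p1, p1])                  # Pr(y = j | x), j = 0, 1
    pmiss = np.array([1 - expit(x @ phi_t[:-1] + j * phi_t[-1]) for j in (0, 1)])
    w = py * pmiss                               # numerator: Pr(y=j|x) Pr(delta=0|x,j)
    return w / w.sum()                           # normalized fractional weights

# illustrative call for one nonrespondent with covariate vector (1, x)
print(estep_weights(np.array([1.0, 0.3]), beta_t=np.array([0.2, 0.5]),
                    phi_t=np.array([0.1, -0.4, 0.8])))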

44 5. EM algorithm Example 6. Mixture model [Example 3.8 of KS]
Observation: Y_i = (1 − W_i) Z_1i + W_i Z_2i, i = 1, 2, ..., n,
where Z_1i ~ N(µ_1, σ_1²), Z_2i ~ N(µ_2, σ_2²), W_i ~ Bernoulli(π).
Parameter of interest: θ = (µ_1, µ_2, σ_1², σ_2², π).

45 5. EM algorithm Example 6 (Cont'd)
Observed likelihood
L_obs(θ) = ∏_{i=1}^n pdf(y_i | θ),
where
pdf(y | θ) = (1 − π) φ(y | µ_1, σ_1²) + π φ(y | µ_2, σ_2²)
and
φ(y | µ, σ²) = (2πσ²)^{−1/2} exp{ −(y − µ)² / (2σ²) }.

46 5. EM algorithm Example 6 (Cont'd)
Full-sample likelihood
L_full(θ) = ∏_{i=1}^n pdf(y_i, w_i | θ),
where
pdf(y, w | θ) = {φ(y | µ_1, σ_1²)}^{1−w} {φ(y | µ_2, σ_2²)}^{w} π^{w} (1 − π)^{1−w}.
Thus,
ln L_full(θ) = ∑_{i=1}^n [(1 − w_i) ln φ(y_i | µ_1, σ_1²) + w_i ln φ(y_i | µ_2, σ_2²)] + ∑_{i=1}^n {w_i ln(π) + (1 − w_i) ln(1 − π)}.

47 5. EM algorithm Example 6 (Cont'd)
EM algorithm
[E-step]
Q(θ | θ^(t)) = ∑_{i=1}^n [(1 − r_i^(t)) ln φ(y_i | µ_1, σ_1²) + r_i^(t) ln φ(y_i | µ_2, σ_2²)] + ∑_{i=1}^n {r_i^(t) ln(π) + (1 − r_i^(t)) ln(1 − π)},
where r_i^(t) = E(w_i | y_i, θ^(t)) with
E(w_i | y_i, θ) = π φ(y_i | µ_2, σ_2²) / {(1 − π) φ(y_i | µ_1, σ_1²) + π φ(y_i | µ_2, σ_2²)}.
[M-step] Solve ∂Q(θ | θ^(t))/∂θ = 0.
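A compact Python implementation of this E-step/M-step pair for Example 6 (a sketch with illustrative data; the simulated mixture is not from the slides):

import numpy as np
from scipy.stats import norm

def em_mixture(y, theta, n_iter=200):
    """EM for the two-component normal mixture of Example 6.
    theta = (mu1, mu2, sig2_1, sig2_2, pi)."""
    mu1, mu2, s1, s2, pi = theta
    for _ in range(n_iter):
        # E-step: r_i = E(w_i | y_i, theta) = posterior probability of component 2
        f1 = norm.pdf(y, mu1, np.sqrt(s1))
        f2 = norm.pdf(y, mu2, np.sqrt(s2))
        r = pi * f2 / ((1 - pi) * f1 + pi * f2)
        # M-step: weighted means/variances and the mixing proportion
        mu1 = np.sum((1 - r) * y) / np.sum(1 - r)
        mu2 = np.sum(r * y) / np.sum(r)
        s1 = np.sum((1 - r) * (y - mu1) ** 2) / np.sum(1 - r)
        s2 = np.sum(r * (y - mu2) ** 2) / np.sum(r)
        pi = np.mean(r)
    return mu1, mu2, s1, s2, pi

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 700), rng.normal(3, 1, 300)])
print(em_mixture(y, theta=(-1.0, 1.0, 1.0, 1.0, 0.5)))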

48 5. EM algorithm Example 7. (Robust regression) [Example 3.12 of KS]
Model: y_i = β_0 + β_1 x_i + σ e_i with e_i ~ t(ν), ν known.
Missing data setup: e_i = u_i / √w_i, where u_i ~ N(0, 1) and w_i ~ χ²_ν / ν.
(x_i, y_i, w_i): complete data (x_i, y_i always observed, w_i always missing)
y_i | (x_i, w_i) ~ N(β_0 + β_1 x_i, σ²/w_i), and x_i is fixed (thus independent of w_i).
The EM algorithm can be used to estimate θ = (β_0, β_1, σ²).

49 5. EM algorithm Example 7 (Cont'd)
E-step: Find the conditional distribution of w_i given (x_i, y_i). By Bayes' theorem,
f(w_i | x_i, y_i) ∝ f(w_i) f(y_i | w_i, x_i) ~ Gamma{ (ν + 1)/2, 2/(ν + d_i²) },
where d_i = (y_i − β_0 − β_1 x_i)/σ. Thus,
E(w_i | x_i, y_i, θ^(t)) = (ν + 1) / (ν + d_i^(t)²),
where d_i^(t) = (y_i − β_0^(t) − β_1^(t) x_i)/σ^(t).

50 5. EM algorithm Example 7 (Cont'd)
M-step:
(µ_x^(t), µ_y^(t)) = ∑_{i=1}^n w_i^(t) (x_i, y_i) / ∑_{i=1}^n w_i^(t)
β_1^(t+1) = ∑_{i=1}^n w_i^(t) (x_i − µ_x^(t)) (y_i − µ_y^(t)) / ∑_{i=1}^n w_i^(t) (x_i − µ_x^(t))²
β_0^(t+1) = µ_y^(t) − β_1^(t+1) µ_x^(t)
σ^{2(t+1)} = (1/n) ∑_{i=1}^n w_i^(t) (y_i − β_0^(t) − β_1^(t) x_i)²,
where w_i^(t) = E(w_i | x_i, y_i, θ^(t)).
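The whole iteration of Example 7 fits in a few lines. A minimal sketch, assuming illustrative simulated data and a fixed number of iterations rather than a formal convergence check:

import numpy as np

def em_t_regression(x, y, nu, n_iter=100):
    """EM for Example 7: y_i = b0 + b1*x_i + sigma*e_i with e_i ~ t(nu), nu known."""
    b0, b1, s2 = 0.0, 0.0, np.var(y)
    for _ in range(n_iter):
        # E-step: w_i = E(w_i | x_i, y_i) = (nu + 1) / (nu + d_i^2)
        d2 = (y - b0 - b1 * x) ** 2 / s2
        w = (nu + 1) / (nu + d2)
        # M-step: weighted least squares with weights w_i
        mx = np.sum(w * x) / np.sum(w)
        my = np.sum(w * y) / np.sum(w)
        b1 = np.sum(w * (x - mx) * (y - my)) / np.sum(w * (x - mx) ** 2)
        b0 = my - b1 * mx
        s2 = np.mean(w * (y - b0 - b1 * x) ** 2)
    return b0, b1, s2

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + 0.5 * rng.standard_t(df=4, size=500)
print(em_t_regression(x, y, nu=4))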

51 6. Summary Interested in finding the MLE that maximizes the observed likelihood function. Under MAR, the model specification of the response mechanism is not necessary. Mean score equation can be used to compute the MLE. EM algorithm is a useful computational tool for solving the mean score equation. The E-step of the EM algorithm may require some computational tool (see Part 2). The asymptotic variance of the MLE can be computed by the inverse of the observed information matrix, which can be computed using Louis formula. Jae-Kwang Kim (ISU) July 5th, / 181

52 REFERENCES Cheng, P. E. (1994), Nonparametric estimation of mean functionals with data missing at random, Journal of the American Statistical Association 89, Dempster, A. P., N. M. Laird and D. B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B 39, Fisher, R. A. (1922), On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society of London A 222, Fuller, W. A., M. M. Loughin and H. D. Baker (1994), Regression weighting in the presence of nonresponse with application to the Nationwide Food Consumption Survey, Survey Methodology 20, Hirano, K., G. Imbens and G. Ridder (2003), Efficient estimation of average treatment effects using the estimated propensity score, Econometrica 71, Ibrahim, J. G. (1990), Incomplete data in generalized linear models, Journal of the American Statistical Association 85, Jae-Kwang Kim (ISU) July 5th, / 181

53 Kim, J. K. (2011), Parametric fractional imputation for missing data analysis, Biometrika 98, Kim, J. K. and C. L. Yu (2011), A semi-parametric estimation of mean functionals with non-ignorable missing data, Journal of the American Statistical Association 106, Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), On the bias of the multiple imputation variance estimator in survey sampling, Journal of the Royal Statistical Society: Series B 68, Kim, J. K. and M. K. Riddles (2012), Some theory for propensity-score-adjustment estimators in survey sampling, Survey Methodology 38, Kott, P. S. and T. Chang (2010), Using calibration weighting to adjust for nonignorable unit nonresponse, Journal of the American Statistical Association 105, Louis, T. A. (1982), Finding the observed information matrix when using the EM algorithm, Journal of the Royal Statistical Society: Series B 44, Meng, X. L. (1994), Multiple-imputation inferences with uncongenial sources of input (with discussion), Statistical Science 9, Jae-Kwang Kim (ISU) July 5th, / 181

54 Oakes, D. (1999), Direct calculation of the information matrix via the em algorithm, Journal of the Royal Statistical Society: Series B 61, Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and applications, in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, California, pp Redner, R. A. and H. F. Walker (1984), Mixture densities, maximum likelihood and the EM algorithm, SIAM Review 26, Robins, J. M., A. Rotnitzky and L. P. Zhao (1994), Estimation of regression coefficients when some regressors are not always observed, Journal of the American Statistical Association 89, Robins, J. M. and N. Wang (2000), Inference for imputation estimators, Biometrika 87, Rubin, D. B. (1976), Inference and missing data, Biometrika 63, Tanner, M. A. and W. H. Wong (1987), The calculation of posterior distribution by data augmentation, Journal of the American Statistical Association 82, Jae-Kwang Kim (ISU) July 5th, / 181

55 Wang, N. and J. M. Robins (1998), Large-sample theory for parametric multiple imputation procedures, Biometrika 85, Wang, S., J. Shao and J. K. Kim (2014), Identifiability and estimation in problems with nonignorable nonresponse, Statistica Sinica 24, Wei, G. C. and M. A. Tanner (1990), A Monte Carlo implementation of the EM algorithm and the poor man s data augmentation algorithms, Journal of the American Statistical Association 85, Zhou, M. and J. K. Kim (2012), An efficient method of estimation for longitudinal surveys with monotone missing data, Biometrika 99,

56 Statistical Methods for Handling Missing Data Part 2: Imputation Jae-Kwang Kim Department of Statistics, Iowa State University Jae-Kwang Kim (ISU) Part 2 52 / 181

57 Introduction Basic setup
Y: a vector of random variables with distribution F(y; θ).
y_1, ..., y_n are n independent realizations of Y.
We are interested in estimating ψ, which is implicitly defined by E{U(ψ; Y)} = 0.
Under complete observation, a consistent estimator ψ̂_n of ψ can be obtained by solving the estimating equation for ψ:
∑_{i=1}^n U(ψ; y_i) = 0.
A special case of an estimating function is the score function; in that case, ψ = θ.
The sandwich variance estimator is often used to estimate the variance of ψ̂_n:
V̂(ψ̂_n) = τ̂_u⁻¹ V̂(U) τ̂_u⁻¹′, where τ_u = E{∂U(ψ; y)/∂ψ′}.
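For a scalar ψ the sandwich formula reduces to one line. A minimal sketch (illustrative and not from the slides), using ψ = Pr(Y < 1) with U(ψ; y) = I(y < 1) − ψ, in which case the solution of the estimating equation is the sample proportion and |τ_u| = 1:

import numpy as np

def sandwich_variance(u, tau_hat):
    """Sandwich variance estimate tau^{-1} Vhat(U) tau^{-1} for a scalar psi.

    u:       values U(psi_hat; y_i) of the estimating function at the solution
    tau_hat: the tau_u factor; it enters squared, so its sign is immaterial here
    """
    n = len(u)
    v_u = np.mean(u ** 2)               # Vhat(U): variance of the estimating function
    return v_u / (n * tau_hat ** 2)     # variance estimate for psi_hat

rng = np.random.default_rng(5)
y = rng.normal(1.0, 1.0, size=1000)
psi_hat = np.mean(y < 1)
print(psi_hat, sandwich_variance((y < 1) - psi_hat, tau_hat=1.0))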

58 1. Introduction Missing data setup
Suppose that y_i is not fully observed.
y_i = (y_obs,i, y_mis,i): (observed, missing) part of y_i
δ_i: response indicator for y_i.
In the presence of missing data, we can use the following estimator:
ψ̂: solution to ∑_{i=1}^n E{U(ψ; y_i) | y_obs,i, δ_i} = 0.   (1)
The equation in (1) is often called the expected estimating equation.

59 1. Introduction Motivation (for imputation) Computing the conditional expectation in (1) can be a challenging problem. 1 The conditional expectation depends on unknown parameter values. That is, E {U (ψ; y i ) y obs,i, δ i } = E {U (ψ; y i ) y obs,i, δ i ; θ, φ}, where θ is the parameter in f (y; θ) and φ is the parameter in p(δ y; φ). 2 Even if we know η = (θ, φ), computing the conditional expectation is numerically difficult. Jae-Kwang Kim (ISU) Part 2 55 / 181

60 1. Introduction Imputation
Imputation: a Monte Carlo approximation of the conditional expectation (given the observed data):
E{U(ψ; y_i) | y_obs,i, δ_i} ≅ (1/m) ∑_{j=1}^m U(ψ; y_obs,i, y_mis,i^(j)).
1. Bayesian approach: generate y_mis,i* from
f(y_mis,i | y_obs, δ) = ∫ f(y_mis,i | y_obs, δ; η) p(η | y_obs, δ) dη.
2. Frequentist approach: generate y_mis,i* from f(y_mis,i | y_obs,i, δ; η̂), where η̂ is a consistent estimator.

61 1. Introduction Imputation Questions
1. How to generate the Monte Carlo samples (or the imputed values)?
2. What is the asymptotic distribution of ψ̂_I, which solves
∑_{i=1}^n (1/m) ∑_{j=1}^m U(ψ; y_obs,i, y_mis,i^(j)) = 0,
where y_mis,i^(j) ~ f(y_mis,i | y_obs,i, δ; η̂_p) for some η̂_p?
3. How to estimate the variance of ψ̂_I?

62 2. Basic Theory for Imputation Basic Setup (for Case 1: ψ = η)
y = (y_1, ..., y_n) ~ f(y; θ)
δ = (δ_1, ..., δ_n) ~ P(δ | y; φ)
y = (y_obs, y_mis): (observed, missing) part of y.
y_mis^{*(1)}, ..., y_mis^{*(m)}: m imputed values of y_mis generated from
f(y_mis | y_obs, δ; η̂_p) = f(y; θ̂_p) P(δ | y; φ̂_p) / ∫ f(y; θ̂_p) P(δ | y; φ̂_p) dµ(y_mis),
where η̂_p = (θ̂_p, φ̂_p) is a preliminary estimator of η = (θ, φ).
Using the m imputed values, the imputed score function is computed as
S̄_imp,m(η | η̂_p) ≡ (1/m) ∑_{j=1}^m S_com(η; y_obs, y_mis^{*(j)}, δ),
where S_com(η; y) is the score function of η = (θ, φ) under complete response.

63 2. Basic Theory for Imputation Lemma 1 (Asymptotic results for m = ∞)
Assume that η̂_p converges in probability to η. Let η̂_I,m be the solution to
(1/m) ∑_{j=1}^m S_com(η; y_obs, y_mis^{*(j)}, δ) = 0,
where y_mis^{*(1)}, ..., y_mis^{*(m)} are the imputed values generated from f(y_mis | y_obs, δ; η̂_p). Then, under some regularity conditions, for m → ∞,
η̂_imp,∞ = η̂_MLE + J_mis (η̂_p − η̂_MLE)   (2)
and
V(η̂_imp,∞) ≅ I_obs⁻¹ + J_mis {V(η̂_p) − V(η̂_MLE)} J_mis′,
where J_mis = I_com⁻¹ I_mis is the fraction of missing information.

64 2. Basic Theory for Imputation Remark
For m = ∞, the imputed score equation becomes the mean score equation.
Equation (2) means that
η̂_imp,∞ = (I − J_mis) η̂_MLE + J_mis η̂_p.   (3)
That is, η̂_imp,∞ is a convex combination of η̂_MLE and η̂_p.
Note that η̂_imp,∞ is a one-step EM update with initial estimate η̂_p. Let η̂^(t) be the t-th EM update of η, computed by solving S̄(η | η̂^(t−1)) = 0 with η̂^(0) = η̂_p. Equation (3) implies that
η̂^(t) = (I − J_mis) η̂_MLE + J_mis η̂^(t−1).
Thus, we obtain η̂^(t) = η̂_MLE + (J_mis)^t (η̂^(0) − η̂_MLE), which justifies lim_{t→∞} η̂^(t) = η̂_MLE.

65 2. Basic Theory for Imputation Theorem 1 (Asymptotic results for m < ∞) [Theorem 4.1 of KS]
Let η̂_p be a preliminary √n-consistent estimator of η with variance V_p. Under some regularity conditions, the solution η̂_imp,m to
S̄_m(η | η̂_p) ≡ (1/m) ∑_{j=1}^m S_com(η; y_obs, y_mis^{*(j)}, δ) = 0
has mean η_0 and asymptotic variance equal to
V(η̂_imp,m) ≅ I_obs⁻¹ + J_mis {V_p − I_obs⁻¹} J_mis′ + m⁻¹ I_com⁻¹ I_mis I_com⁻¹,   (4)
where J_mis = I_com⁻¹ I_mis. This theorem was originally presented by Wang and Robins (1998).

66 2. Basic Theory for Imputation Remark
If we use η̂_p = η̂_MLE, then the asymptotic variance in (4) is
V(η̂_imp,m) ≅ I_obs⁻¹ + m⁻¹ I_com⁻¹ I_mis I_com⁻¹.
In Bayesian imputation (or multiple imputation), the posterior values of η are independently generated from η* ~ N(η̂_MLE, I_obs⁻¹), which implies that we can use V_p = I_obs⁻¹ + m⁻¹ I_obs⁻¹. Thus, the asymptotic variance in (4) for multiple imputation is
V(η̂_imp,m) ≅ I_obs⁻¹ + m⁻¹ J_mis I_obs⁻¹ J_mis′ + m⁻¹ I_com⁻¹ I_mis I_com⁻¹.
The second term is the additional price we pay for generating the posterior values rather than using η̂_MLE directly.

67 2. Basic Theory for Imputation Basic Setup (for Case 2: ψ ≠ η)
Parameter ψ defined by E{U(ψ; y)} = 0.
Under complete response, a consistent estimator of ψ can be obtained by solving U(ψ; y) = 0.
Assume that some part of y, denoted by y_mis, is not observed, and m imputed values, say y_mis^{*(1)}, ..., y_mis^{*(m)}, are generated from f(y_mis | y_obs, δ; η̂_MLE), where η̂_MLE is the MLE of η_0.
The imputed estimating function using the m imputed values is computed as
Ū_imp,m(ψ | η̂_MLE) = (1/m) ∑_{j=1}^m U(ψ; y^{*(j)}),   (5)
where y^{*(j)} = (y_obs, y_mis^{*(j)}).
Let ψ̂_imp,m be the solution to Ū_imp,m(ψ | η̂_MLE) = 0. We are interested in the asymptotic properties of ψ̂_imp,m.

68 2. Basic Theory for Imputation Theorem 2 [Theorem 4.2 of KS]
Suppose that the parameter of interest ψ_0 is estimated by solving U(ψ) = 0 under complete response. Then, under some regularity conditions, the solution to
E{U(ψ) | y_obs, δ; η̂_MLE} = 0   (6)
has mean ψ_0 and asymptotic variance τ⁻¹ Ω τ⁻¹′, where
τ = E{∂U(ψ_0)/∂ψ′},
Ω = V{Ū(ψ_0 | η_0) + κ S_obs(η_0)},
and κ = E{U(ψ_0) S_mis(η_0)′} I_obs⁻¹.
This theorem was originally presented by Robins and Wang (2000).

69 2. Basic Theory for Imputation Remark
Writing Ū(ψ | η) ≡ E{U(ψ) | y_obs, δ; η}, the solution to (6) can be treated as the solution to the joint estimating equation
U*(ψ, η) = [ U_1(ψ, η), U_2(η) ]′ = 0,
where U_1(ψ, η) = Ū(ψ | η) and U_2(η) = S_obs(η). We can apply a Taylor expansion to get
( ψ̂ − ψ_0, η̂ − η_0 )′ ≅ − [ B_11 B_12; B_21 B_22 ]⁻¹ ( U_1(ψ_0, η_0), U_2(η_0) )′,
where
[ B_11 B_12; B_21 B_22 ] = [ E(∂U_1/∂ψ′) E(∂U_1/∂η′); E(∂U_2/∂ψ′) E(∂U_2/∂η′) ].
Thus, as B_21 = 0,
ψ̂ ≅ ψ_0 − B_11⁻¹ { U_1(ψ_0, η_0) − B_12 B_22⁻¹ U_2(η_0) }.
In Theorem 2, B_11 = τ, B_12 = E(U S_mis′), and B_22 = −I_obs.

70 3. Monte Carlo EM
Motivation: Monte Carlo samples in the EM algorithm can be used as imputed values.
Monte Carlo EM
1. In the EM algorithm defined by
[E-step] Compute Q(η | η^(t)) = E{ln f(y, δ; η) | y_obs, δ; η^(t)}
[M-step] Find η^(t+1) that maximizes Q(η | η^(t)),
the E-step is computationally cumbersome because it involves an integral.
2. Wei and Tanner (1990): In the E-step, first draw
y_mis^{*(1)}, ..., y_mis^{*(m)} ~ f(y_mis | y_obs, δ; η^(t))
and approximate
Q(η | η^(t)) ≅ (1/m) ∑_{j=1}^m ln f(y_obs, y_mis^{*(j)}, δ; η).

71 3. Monte Carlo EM Example 1 [Example 3.15 of KS]
Suppose that y_i ~ f(y_i | x_i; θ).
Assume that x_i is always observed but we observe y_i only when δ_i = 1, where δ_i ~ Bernoulli[π_i(φ)] and
π_i(φ) = exp(φ_0 + φ_1 x_i + φ_2 y_i) / {1 + exp(φ_0 + φ_1 x_i + φ_2 y_i)}.
To implement the MCEM method, in the E-step we need to generate samples from
f(y_i | x_i, δ_i = 0; θ̂, φ̂) = f(y_i | x_i; θ̂) {1 − π_i(φ̂)} / ∫ f(y_i | x_i; θ̂) {1 − π_i(φ̂)} dy_i.

72 3. Monte Carlo EM Example 1 (Cont'd)
We can use the following rejection method to generate samples from f(y_i | x_i, δ_i = 0; θ̂, φ̂):
1. Generate y_i* from f(y_i | x_i; θ̂).
2. Using y_i*, compute
π_i*(φ̂) = exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i*) / {1 + exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i*)}.
Accept y_i* with probability 1 − π_i*(φ̂).
3. If y_i* is not accepted, then go to Step 1.
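A direct transcription of this rejection method into Python (a sketch; the outcome model y | x ~ N(θ_0 + θ_1 x, 1) is an assumed example, since the slides leave f(y | x; θ) generic):

import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def sample_nonrespondent_y(x_i, theta_hat, phi_hat, rng, max_tries=10000):
    """Rejection sampler for f(y | x, delta = 0) under an assumed normal outcome model
    and the logistic response model pi(x, y; phi) = expit(phi0 + phi1*x + phi2*y)."""
    for _ in range(max_tries):
        y_star = rng.normal(theta_hat[0] + theta_hat[1] * x_i, 1.0)   # step 1: draw from f(y|x; theta)
        pi_star = expit(phi_hat[0] + phi_hat[1] * x_i + phi_hat[2] * y_star)
        if rng.uniform() < 1.0 - pi_star:                             # step 2: accept w.p. 1 - pi
            return y_star
    raise RuntimeError("acceptance probability too small")

rng = np.random.default_rng(6)
print(sample_nonrespondent_y(0.5, theta_hat=(0.0, 1.0), phi_hat=(-0.5, 0.3, 0.4), rng=rng))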

73 3. Monte Carlo EM Example 1 (Cont'd)
Using the m imputed values of y_i, denoted by y_i^{*(1)}, ..., y_i^{*(m)}, the M-step can be implemented by solving
∑_{i=1}^n ∑_{j=1}^m S(θ; x_i, y_i^{*(j)}) = 0
and
∑_{i=1}^n ∑_{j=1}^m {δ_i − π(φ; x_i, y_i^{*(j)})} (1, x_i, y_i^{*(j)}) = 0,
where S(θ; x_i, y_i) = ∂ log f(y_i | x_i; θ)/∂θ.

74 3. Monte Carlo EM Example 2 (GLMM) [Example 3.18]
Basic setup: Let y_ij be a binary random variable (taking values 0 or 1) with probability p_ij = Pr(y_ij = 1 | x_ij, a_i), and assume that
logit(p_ij) = x_ij′β + a_i,
where x_ij is a p-dimensional covariate associated with the j-th repetition of unit i, β is the parameter of interest that can represent the treatment effect due to x, and a_i represents the random effect associated with unit i. We assume that the a_i are iid N(0, σ²).
Missing data: a_i
Observed likelihood:
L_obs(β, σ²) = ∏_i ∫ [ ∏_j p(x_ij, a_i; β)^{y_ij} {1 − p(x_ij, a_i; β)}^{1−y_ij} ] (1/σ) φ(a_i/σ) da_i,
where φ(·) is the pdf of the standard normal distribution.

75 3. Monte Carlo EM Example 2 (Cont'd)
MCEM approach: generate a_i from the Metropolis-Hastings algorithm. Note that
f(a_i | x_i, y_i; β̂, σ̂) ∝ f_1(y_i | x_i, a_i; β̂) f_2(a_i; σ̂).
1. Generate a_i* from f_2(a_i; σ̂).
2. Set a_i^(t) = a_i* with probability ρ(a_i^(t−1), a_i*) and a_i^(t) = a_i^(t−1) with probability 1 − ρ(a_i^(t−1), a_i*), where
ρ(a_i^(t−1), a_i*) = min{ f_1(y_i | x_i, a_i*; β̂) / f_1(y_i | x_i, a_i^(t−1); β̂), 1 }.

76 3. Monte Carlo EM Remark Monte Carlo EM can be used as a frequentist approach to imputation. Convergence is not guaranteed (for fixed m). E-step can be computationally heavy. (May use MCMC method). Jae-Kwang Kim (ISU) Part 2 72 / 181

77 4. Parametric Fractional Imputation Parametric fractional imputation
1. More than one (say m) imputed values of y_mis,i: y_mis,i^{*(1)}, ..., y_mis,i^{*(m)} are generated from some (initial) density h(y_mis,i).
2. Create a weighted data set
{(w_ij*, y_ij*); j = 1, 2, ..., m; i = 1, 2, ..., n},
where ∑_{j=1}^m w_ij* = 1, y_ij* = (y_obs,i, y_mis,i^{*(j)}), w_ij* ∝ f(y_ij*, δ_i; η̂)/h(y_mis,i^{*(j)}), η̂ is the maximum likelihood estimator of η, and f(y, δ; η) is the joint density of (y, δ).
3. The weights w_ij* are the normalized importance weights and can be called fractional weights.
If y_mis,i is categorical, then simply use all possible values of y_mis,i as the imputed values and assign their conditional probabilities as the fractional weights.

78 4. Parametric Fractional Imputation Remark
Importance sampling idea: For sufficiently large m,
∑_{j=1}^m w_ij* g(y_ij*) ≅ ∫ g(y_i) {f(y_i, δ_i; η̂)/h(y_mis,i)} h(y_mis,i) dy_mis,i / ∫ {f(y_i, δ_i; η̂)/h(y_mis,i)} h(y_mis,i) dy_mis,i = E{g(y_i) | y_obs,i, δ_i; η̂}
for any g such that the expectation exists. In the importance sampling literature, h(·) is called the proposal distribution and f(·) is called the target distribution.
We do not need to compute the conditional distribution f(y_mis,i | y_obs,i, δ_i; η). Only the joint distribution f(y_obs,i, y_mis,i, δ_i; η) is needed, because
f(y_obs,i, y_mis,i^{*(j)}, δ_i; η̂)/h(y_i,mis^{*(j)}) / ∑_{k=1}^m f(y_obs,i, y_mis,i^{*(k)}, δ_i; η̂)/h(y_i,mis^{*(k)}) = f(y_mis,i^{*(j)} | y_obs,i, δ_i; η̂)/h(y_i,mis^{*(j)}) / ∑_{k=1}^m f(y_mis,i^{*(k)} | y_obs,i, δ_i; η̂)/h(y_i,mis^{*(k)}).

79 4. Parametric Fractional Imputation EM algorithm by fractional imputation
1. Imputation step: generate y_mis,i^{*(j)} ~ h(y_i,mis).
2. Weighting step: compute
w_ij(t)* ∝ f(y_ij*, δ_i; η̂^(t))/h(y_i,mis^{*(j)}),
where ∑_{j=1}^m w_ij(t)* = 1.
3. M-step: update
η̂^(t+1): solution to ∑_{i=1}^n ∑_{j=1}^m w_ij(t)* S(η; y_ij*, δ_i) = 0.
4. Repeat Steps 2 and 3 until convergence.
Imputation step + Weighting step = E-step.
We may add an optional step that checks whether w_ij(t)* is too large for some j. In this case, h(y_i,mis) needs to be changed.
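The weighting step is just normalized importance weighting. A minimal sketch for a single unit (the function and argument names are illustrative, not from the slides):

import numpy as np

def fractional_weights(joint_density_at_eta_t, proposal_density):
    """Weighting step of parametric fractional imputation for one unit.

    joint_density_at_eta_t[j]: f(y*_ij, delta_i; eta^(t)) evaluated at the j-th imputed value
    proposal_density[j]:       h(y*_mis,i^(j)), the density the imputed values were drawn from
    """
    w = np.asarray(joint_density_at_eta_t) / np.asarray(proposal_density)
    return w / w.sum()    # normalized so the fractional weights add to one

# illustrative call with m = 4 imputed values
print(fractional_weights([0.12, 0.40, 0.05, 0.22], [0.25, 0.25, 0.25, 0.25]))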

80 4. Parametric Fractional imputation
The imputed values are not changed across EM iterations; only the fractional weights are changed.
1. Computationally efficient (because we use importance sampling only once).
2. Convergence is achieved (because the imputed values are not changed).
For sufficiently large t, η̂^(t) → η̂. Also, for sufficiently large m, η̂ → η̂_MLE.
For estimation of ψ in E{U(ψ; Y)} = 0, simply use
n⁻¹ ∑_{i=1}^n ∑_{j=1}^m w_ij* U(ψ; y_ij*) = 0.
Linearization variance estimation (using the result of Theorem 2) is discussed in Kim (2011).

81 4. Parametric Fractional imputation Return to Example 1
Fractional imputation
1. Imputation step: Generate y_i^{*(1)}, ..., y_i^{*(m)} from f(y_i | x_i; θ̂^(0)).
2. Weighting step: Using the m imputed values generated from Step 1, compute the fractional weights by
w_ij(t)* ∝ f(y_i^{*(j)} | x_i; θ̂^(t)) {1 − π(x_i, y_i^{*(j)}; φ̂^(t))} / f(y_i^{*(j)} | x_i; θ̂^(0)),
where
π(x_i, y_i; φ̂) = exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i) / {1 + exp(φ̂_0 + φ̂_1 x_i + φ̂_2 y_i)}.

82 4. Parametric Fractional imputation Return to Example 1 (Cont'd)
Using the imputed data and the fractional weights, the M-step can be implemented by solving
∑_{i=1}^n ∑_{j=1}^m w_ij(t)* S(θ; x_i, y_i^{*(j)}) = 0
and
∑_{i=1}^n ∑_{j=1}^m w_ij(t)* {δ_i − π(φ; x_i, y_i^{*(j)})} (1, x_i, y_i^{*(j)}) = 0,   (7)
where S(θ; x_i, y_i) = ∂ log f(y_i | x_i; θ)/∂θ.

83 4. Parametric Fractional imputation Example 3: Categorical missing data
[Table: original data with columns i, Y_1i, Y_2i, X_i; "?" denotes a missing value. Three units have missing entries: one with Y_1 = 1 observed, Y_2 missing, X = 4; one with Y_1 missing, Y_2 = 0 observed, X = 5; and one with both Y_1 and Y_2 missing, X = 8.]

84 4. Parametric Fractional imputation Example 3 (Cont'd)
Y_1 and Y_2 are dichotomous and X is continuous.
Model:
Pr(Y_1 = 1 | X) = Λ(α_0 + α_1 X)
Pr(Y_2 = 1 | X, Y_1) = Λ(β_0 + β_1 X + β_2 Y_1),
where Λ(x) = {1 + exp(−x)}⁻¹.
Assume MAR.

85 4. Parametric Fractional imputation Example 3 (Cont'd)
Imputed data (columns: i, fractional weight, Y_1i, Y_2i, X_i):
Unit with Y_1 = 1 observed and Y_2 missing (X = 4): two imputed rows (Y_2 = 0, 1) with fractional weights Pr(Y_2 = 0 | Y_1 = 1, X = 4) and Pr(Y_2 = 1 | Y_1 = 1, X = 4).
Unit with Y_2 = 0 observed and Y_1 missing (X = 5): two imputed rows (Y_1 = 0, 1) with fractional weights Pr(Y_1 = 0 | Y_2 = 0, X = 5) and Pr(Y_1 = 1 | Y_2 = 0, X = 5).
Unit with both Y_1 and Y_2 missing (X = 8): four imputed rows with fractional weights Pr(Y_1 = 0, Y_2 = 0 | X = 8), Pr(Y_1 = 0, Y_2 = 1 | X = 8), Pr(Y_1 = 1, Y_2 = 0 | X = 8), and Pr(Y_1 = 1, Y_2 = 1 | X = 8).

86 4. Parametric Fractional imputation Example 3 (Cont'd)
Implementation of the EM algorithm using fractional imputation:
E-step: compute the mean score functions using the fractional weights.
M-step: solve the mean score equations.
Because we have a completed data set with weights, we can also estimate other parameters such as θ = Pr(Y_2 = 1 | X > 5).

87 4. Parametric Fractional imputation Example 4: Measurement error model [Example 4.13 of KS]
Interested in estimating θ in f(y | x; θ).
Instead of observing x, we observe z, which can be highly correlated with x. Thus, z is an instrumental variable for x:
f(y | x, z) = f(y | x)  and  f(y | z = a) ≠ f(y | z = b) for a ≠ b.
In addition to the original sample, we have a separate calibration sample that observes (x_i, z_i).

88 4. Parametric Fractional imputation
Table: Data Structure
                    Z   X   Y
Calibration Sample  o   o
Original Sample     o       o

89 4. Parametric Fractional imputation Example 4 (Cont'd)
The goal is to generate x in the original sample from
f(x_i | z_i, y_i) ∝ f(x_i | z_i) f(y_i | x_i, z_i) = f(x_i | z_i) f(y_i | x_i).
Obtain a consistent estimator f̂(x | z) from the calibration sample.
E-step:
1. Generate x_i^{*(1)}, ..., x_i^{*(m)} from f̂(x_i | z_i).
2. Compute the fractional weights associated with x_i^{*(j)} by
w_ij* ∝ f(y_i | x_i^{*(j)}; θ̂).
M-step: Solve the weighted score equation for θ.

90 5. Multiple imputation Features
1. Imputed values are generated from
y_i,mis^{*(j)} ~ f(y_i,mis | y_i,obs, δ_i; η*),
where η* is generated from the posterior distribution π(η | y_obs).
2. The variance estimation formula is simple (Rubin's formula):
V̂_MI(ψ̄_m) = W_m + (1 + m⁻¹) B_m,
where W_m = m⁻¹ ∑_{k=1}^m V̂_I^(k), B_m = (m − 1)⁻¹ ∑_{k=1}^m (ψ̂^(k) − ψ̄_m)², ψ̄_m = m⁻¹ ∑_{k=1}^m ψ̂^(k) is the average of the m imputed estimators, and V̂_I^(k) is the imputed version of the variance estimator of ψ̂ under complete response.
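Rubin's combining rule in code (a minimal sketch; the inputs are illustrative numbers, not results from the slides):

import numpy as np

def rubin_combine(psi_hats, var_hats):
    """Rubin's combining rule for m multiply-imputed estimates.

    psi_hats: psi^(k), the point estimate from each completed data set
    var_hats: V_I^(k), the complete-data variance estimate from each completed data set
    """
    psi_hats, var_hats = np.asarray(psi_hats), np.asarray(var_hats)
    m = len(psi_hats)
    psi_bar = psi_hats.mean()
    w_m = var_hats.mean()                              # within-imputation variance
    b_m = psi_hats.var(ddof=1)                         # between-imputation variance
    return psi_bar, w_m + (1 + 1 / m) * b_m            # point estimate and Rubin's variance

print(rubin_combine([1.02, 0.97, 1.05, 1.00, 0.99], [0.010, 0.011, 0.009, 0.010, 0.012]))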

91 5. Multiple imputation
The computation for Bayesian imputation can be implemented by the data augmentation technique (Tanner and Wong, 1987), which is a special application of the Gibbs sampling method:
1. I-step: Generate y_mis* ~ f(y_mis | y_obs, δ; η*)
2. P-step: Generate η* ~ π(η | y_obs, y_mis*, δ)
Some tools are needed for checking convergence to a stable distribution.
Consistency of the variance estimator is questionable (Kim et al., 2006).

92 6. Simulation Study Simulation 1
Bivariate data (x_i, y_i) of size n = 200 with
x_i ~ N(3, 1), y_i ~ N(−2 + x_i, 1).
x_i always observed, y_i subject to missingness. MCAR (δ ~ Bernoulli(0.6)).
Parameters of interest:
1. θ_1 = E(Y)
2. θ_2 = Pr(Y < 1)
Multiple imputation (MI) and fractional imputation (FI) are applied with m = 50.
For estimation of θ_2, the following method-of-moments estimator is used:
θ̂_2,MME = n⁻¹ ∑_{i=1}^n I(y_i < 1).

93 6. Simulation Study
Table 1: Monte Carlo bias and variance of the point estimators of θ_1 and θ_2 for the complete sample, MI, and FI.
Table 2: Monte Carlo relative bias (%) of the variance estimators of V(θ̂_1) and V(θ̂_2) under MI and FI.

94 6. Simulation study
Rubin's formula is based on the following decomposition:
V(θ̂_MI) = V(θ̂_n) + V(θ̂_MI − θ̂_n),
where θ̂_n is the complete-sample estimator of θ. Basically, the W_m term estimates V(θ̂_n) and the (1 + m⁻¹)B_m term estimates V(θ̂_MI − θ̂_n).
In the general case, we have
V(θ̂_MI) = V(θ̂_n) + V(θ̂_MI − θ̂_n) + 2 Cov(θ̂_MI − θ̂_n, θ̂_n),
and Rubin's variance estimator ignores the covariance term. Thus, a sufficient condition for the validity (unbiasedness) of the variance estimator is Cov(θ̂_MI − θ̂_n, θ̂_n) = 0.
Meng (1994) called this condition the congeniality of θ̂_n. Congeniality holds when θ̂_n is the MLE of θ.

95 6. Simulation study
For example, there are two estimators of θ = P(Y < 1) when Y follows N(µ, σ²):
1. Maximum likelihood method: θ̂_MLE = ∫_{−∞}^1 φ(z; µ̂, σ̂²) dz
2. Method of moments: θ̂_MME = n⁻¹ ∑_{i=1}^n I(y_i < 1)
In the simulation setup, the imputed estimator of θ_2 can be expressed as
θ̂_2,I = n⁻¹ ∑_{i=1}^n [δ_i I(y_i < 1) + (1 − δ_i) E{I(y_i < 1) | x_i; µ̂, σ̂}].
Thus, the imputed estimator of θ_2 borrows strength by making use of the extra information in f(y | x). Thus, when the congeniality condition does not hold, the imputed estimator improves efficiency (due to the imputation model that uses extra information), but the variance estimator does not recognize this improvement.

96 6. Simulation Study Simulation 2
Bivariate data (x_i, y_i) of size n = 100 with
Y_i = β_0 + β_1 x_i + β_2 x_i² + e_i,   (8)
where (β_0, β_1, β_2) = (0, 0.9, 0.06), x_i ~ N(0, 1), e_i ~ N(0, 0.16), and x_i and e_i are independent.
The variable x_i is always observed, but the probability that y_i responds is 0.5.
In MI, the imputer's model is Y_i = β_0 + β_1 x_i + e_i. That is, the imputer's model uses the extra information that β_2 = 0.
From the imputed data, we fit model (8) and compute the power of the test of H_0: β_2 = 0 at the 0.05 significance level.
In addition, we also consider the Complete-Case (CC) method, which simply uses the complete cases only for the regression analysis.

97 6. Simulation Study
Table 3: Simulation results for the Monte Carlo experiment based on 10,000 Monte Carlo samples, reporting E(θ̂), V(θ̂), the relative bias of V̂, and the power for MI, FI, and CC.
Table 3 shows that MI provides a more efficient point estimator than the CC method, but its variance estimation is very conservative (more than 100% overestimation).
Because of the serious positive bias of the MI variance estimator, the statistical power of the test based on MI is actually lower than that of the CC method.

98 7. Summary Imputation can be viewed as a Monte Carlo tool for computing the conditional expectation. Monte Carlo EM is very popular but the E-step can be computationally heavy. Parametric fractional imputation is a useful tool for frequentist imputation. Multiple imputation is motivated from a Bayesian framework. The frequentist validity of multiple imputation requires the condition of congeniality. Uncongeniality may lead to overestimation of variance which can seriously increase type-2 errors. Jae-Kwang Kim (ISU) Part 2 94 / 181

99 REFERENCES Cheng, P. E. (1994), Nonparametric estimation of mean functionals with data missing at random, Journal of the American Statistical Association 89, Dempster, A. P., N. M. Laird and D. B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B 39, Fisher, R. A. (1922), On the mathematical foundations of theoretical statistics, Philosophical Transactions of the Royal Society of London A 222, Fuller, W. A., M. M. Loughin and H. D. Baker (1994), Regression weighting in the presence of nonresponse with application to the Nationwide Food Consumption Survey, Survey Methodology 20, Hirano, K., G. Imbens and G. Ridder (2003), Efficient estimation of average treatment effects using the estimated propensity score, Econometrica 71, Ibrahim, J. G. (1990), Incomplete data in generalized linear models, Journal of the American Statistical Association 85, Kim, J. K. (2011), Parametric fractional imputation for missing data analysis, Biometrika 98, Kim, J. K. and C. L. Yu (2011), A semi-parametric estimation of mean functionals with non-ignorable missing data, Journal of the American Statistical Association 106, Jae-Kwang Kim (ISU) Part 2 94 / 181


MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

Statistical Methods. Missing Data  snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23 1 / 23 Statistical Methods Missing Data http://www.stats.ox.ac.uk/ snijders/sm.htm Tom A.B. Snijders University of Oxford November, 2011 2 / 23 Literature: Joseph L. Schafer and John W. Graham, Missing

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

AN INSTRUMENTAL VARIABLE APPROACH FOR IDENTIFICATION AND ESTIMATION WITH NONIGNORABLE NONRESPONSE

AN INSTRUMENTAL VARIABLE APPROACH FOR IDENTIFICATION AND ESTIMATION WITH NONIGNORABLE NONRESPONSE Statistica Sinica 24 (2014), 1097-1116 doi:http://dx.doi.org/10.5705/ss.2012.074 AN INSTRUMENTAL VARIABLE APPROACH FOR IDENTIFICATION AND ESTIMATION WITH NONIGNORABLE NONRESPONSE Sheng Wang 1, Jun Shao

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that

More information

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Institute of Statistics

More information

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017 Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017 Put your solution to each problem on a separate sheet of paper. Problem 1. (5106) Let X 1, X 2,, X n be a sequence of i.i.d. observations from a

More information

Propensity score adjusted method for missing data

Propensity score adjusted method for missing data Graduate Theses and Dissertations Graduate College 2013 Propensity score adjusted method for missing data Minsun Kim Riddles Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/etd

More information

Basics of Modern Missing Data Analysis

Basics of Modern Missing Data Analysis Basics of Modern Missing Data Analysis Kyle M. Lang Center for Research Methods and Data Analysis University of Kansas March 8, 2013 Topics to be Covered An introduction to the missing data problem Missing

More information

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous

More information

arxiv:math/ v1 [math.st] 23 Jun 2004

arxiv:math/ v1 [math.st] 23 Jun 2004 The Annals of Statistics 2004, Vol. 32, No. 2, 766 783 DOI: 10.1214/009053604000000175 c Institute of Mathematical Statistics, 2004 arxiv:math/0406453v1 [math.st] 23 Jun 2004 FINITE SAMPLE PROPERTIES OF

More information

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random

Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Econometrics Workshop UNC

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Fractional imputation method of handling missing data and spatial statistics

Fractional imputation method of handling missing data and spatial statistics Graduate Theses and Dissertations Graduate College 2014 Fractional imputation method of handling missing data and spatial statistics Shu Yang Iowa State University Follow this and additional works at:

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level

Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level Streamlining Missing Data Analysis by Aggregating Multiple Imputations at the Data Level A Monte Carlo Simulation to Test the Tenability of the SuperMatrix Approach Kyle M Lang Quantitative Psychology

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Introduction to Estimation Methods for Time Series models Lecture 2

Introduction to Estimation Methods for Time Series models Lecture 2 Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:

More information

Data Integration for Big Data Analysis for finite population inference

Data Integration for Big Data Analysis for finite population inference for Big Data Analysis for finite population inference Jae-kwang Kim ISU January 23, 2018 1 / 36 What is big data? 2 / 36 Data do not speak for themselves Knowledge Reproducibility Information Intepretation

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton EM Algorithm Last lecture 1/35 General optimization problems Newton Raphson Fisher scoring Quasi Newton Nonlinear regression models Gauss-Newton Generalized linear models Iteratively reweighted least squares

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

A measurement error model approach to small area estimation

A measurement error model approach to small area estimation A measurement error model approach to small area estimation Jae-kwang Kim 1 Spring, 2015 1 Joint work with Seunghwan Park and Seoyoung Kim Ouline Introduction Basic Theory Application to Korean LFS Discussion

More information

Introduction to Survey Data Integration

Introduction to Survey Data Integration Introduction to Survey Data Integration Jae-Kwang Kim Iowa State University May 20, 2014 Outline 1 Introduction 2 Survey Integration Examples 3 Basic Theory for Survey Integration 4 NASS application 5

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter

More information

Bootstrap inference for the finite population total under complex sampling designs

Bootstrap inference for the finite population total under complex sampling designs Bootstrap inference for the finite population total under complex sampling designs Zhonglei Wang (Joint work with Dr. Jae Kwang Kim) Center for Survey Statistics and Methodology Iowa State University Jan.

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

STATISTICAL INFERENCE WITH DATA AUGMENTATION AND PARAMETER EXPANSION

STATISTICAL INFERENCE WITH DATA AUGMENTATION AND PARAMETER EXPANSION STATISTICAL INFERENCE WITH arxiv:1512.00847v1 [math.st] 2 Dec 2015 DATA AUGMENTATION AND PARAMETER EXPANSION Yannis G. Yatracos Faculty of Communication and Media Studies Cyprus University of Technology

More information

Combining data from two independent surveys: model-assisted approach

Combining data from two independent surveys: model-assisted approach Combining data from two independent surveys: model-assisted approach Jae Kwang Kim 1 Iowa State University January 20, 2012 1 Joint work with J.N.K. Rao, Carleton University Reference Kim, J.K. and Rao,

More information

Estimation, Inference, and Hypothesis Testing

Estimation, Inference, and Hypothesis Testing Chapter 2 Estimation, Inference, and Hypothesis Testing Note: The primary reference for these notes is Ch. 7 and 8 of Casella & Berger 2. This text may be challenging if new to this topic and Ch. 7 of

More information

Generalized Linear Models. Kurt Hornik

Generalized Linear Models. Kurt Hornik Generalized Linear Models Kurt Hornik Motivation Assuming normality, the linear model y = Xβ + e has y = β + ε, ε N(0, σ 2 ) such that y N(μ, σ 2 ), E(y ) = μ = β. Various generalizations, including general

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

Covariate Balancing Propensity Score for General Treatment Regimes

Covariate Balancing Propensity Score for General Treatment Regimes Covariate Balancing Propensity Score for General Treatment Regimes Kosuke Imai Princeton University October 14, 2014 Talk at the Department of Psychiatry, Columbia University Joint work with Christian

More information

Statistical Estimation

Statistical Estimation Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

The propensity score with continuous treatments

The propensity score with continuous treatments 7 The propensity score with continuous treatments Keisuke Hirano and Guido W. Imbens 1 7.1 Introduction Much of the work on propensity score analysis has focused on the case in which the treatment is binary.

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses Outline Marginal model Examples of marginal model GEE1 Augmented GEE GEE1.5 GEE2 Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association

More information

Weighting Methods. Harvard University STAT186/GOV2002 CAUSAL INFERENCE. Fall Kosuke Imai

Weighting Methods. Harvard University STAT186/GOV2002 CAUSAL INFERENCE. Fall Kosuke Imai Weighting Methods Kosuke Imai Harvard University STAT186/GOV2002 CAUSAL INFERENCE Fall 2018 Kosuke Imai (Harvard) Weighting Methods Stat186/Gov2002 Fall 2018 1 / 13 Motivation Matching methods for improving

More information

Graybill Conference Poster Session Introductions

Graybill Conference Poster Session Introductions Graybill Conference Poster Session Introductions 2013 Graybill Conference in Modern Survey Statistics Colorado State University Fort Collins, CO June 10, 2013 Small Area Estimation with Incomplete Auxiliary

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

Inferences on missing information under multiple imputation and two-stage multiple imputation

Inferences on missing information under multiple imputation and two-stage multiple imputation p. 1/4 Inferences on missing information under multiple imputation and two-stage multiple imputation Ofer Harel Department of Statistics University of Connecticut Prepared for the Missing Data Approaches

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

Estimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1

Estimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1 Estimation and Model Selection in Mixed Effects Models Part I Adeline Samson 1 1 University Paris Descartes Summer school 2009 - Lipari, Italy These slides are based on Marc Lavielle s slides Outline 1

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Likelihood inference in the presence of nuisance parameters

Likelihood inference in the presence of nuisance parameters Likelihood inference in the presence of nuisance parameters Nancy Reid, University of Toronto www.utstat.utoronto.ca/reid/research 1. Notation, Fisher information, orthogonal parameters 2. Likelihood inference

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Plausible Values for Latent Variables Using Mplus

Plausible Values for Latent Variables Using Mplus Plausible Values for Latent Variables Using Mplus Tihomir Asparouhov and Bengt Muthén August 21, 2010 1 1 Introduction Plausible values are imputed values for latent variables. All latent variables can

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Likelihood based Statistical Inference. Dottorato in Economia e Finanza Dipartimento di Scienze Economiche Univ. di Verona

Likelihood based Statistical Inference. Dottorato in Economia e Finanza Dipartimento di Scienze Economiche Univ. di Verona Likelihood based Statistical Inference Dottorato in Economia e Finanza Dipartimento di Scienze Economiche Univ. di Verona L. Pace, A. Salvan, N. Sartori Udine, April 2008 Likelihood: observed quantities,

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Some methods for handling missing values in outcome variables. Roderick J. Little

Some methods for handling missing values in outcome variables. Roderick J. Little Some methods for handling missing values in outcome variables Roderick J. Little Missing data principles Likelihood methods Outline ML, Bayes, Multiple Imputation (MI) Robust MAR methods Predictive mean

More information

Reconstruction of individual patient data for meta analysis via Bayesian approach

Reconstruction of individual patient data for meta analysis via Bayesian approach Reconstruction of individual patient data for meta analysis via Bayesian approach Yusuke Yamaguchi, Wataru Sakamoto and Shingo Shirahata Graduate School of Engineering Science, Osaka University Masashi

More information

1. Fisher Information

1. Fisher Information 1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score

More information

STAT 730 Chapter 4: Estimation

STAT 730 Chapter 4: Estimation STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum

More information

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari

MS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari MS&E 226: Small Data Lecture 11: Maximum likelihood (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 18 The likelihood function 2 / 18 Estimating the parameter This lecture develops the methodology behind

More information

Applied Asymptotics Case studies in higher order inference

Applied Asymptotics Case studies in higher order inference Applied Asymptotics Case studies in higher order inference Nancy Reid May 18, 2006 A.C. Davison, A. R. Brazzale, A. M. Staicu Introduction likelihood-based inference in parametric models higher order approximations

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models John M. Neuhaus Charles E. McCulloch Division of Biostatistics University of California, San

More information

Bayesian inference for multivariate extreme value distributions

Bayesian inference for multivariate extreme value distributions Bayesian inference for multivariate extreme value distributions Sebastian Engelke Clément Dombry, Marco Oesting Toronto, Fields Institute, May 4th, 2016 Main motivation For a parametric model Z F θ of

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties

More information

Label Switching and Its Simple Solutions for Frequentist Mixture Models

Label Switching and Its Simple Solutions for Frequentist Mixture Models Label Switching and Its Simple Solutions for Frequentist Mixture Models Weixin Yao Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu Abstract The label switching

More information