Chapter 4: Imputation
Jae-Kwang Kim
Department of Statistics, Iowa State University
Outline

1 Introduction
2 Basic theory for imputation
3 Variance estimation after imputation
4 Replication variance estimation
5 Multiple imputation
6 Fractional imputation

Ch 4 2 / 79
Introduction

Basic setup:
- $Y$: a vector of random variables with distribution $F(y; \theta)$.
- $y_1, \ldots, y_n$ are $n$ independent realizations of $Y$.
- We are interested in estimating $\psi$, which is implicitly defined by $E\{U(\psi; Y)\} = 0$.
- Under complete observation, a consistent estimator $\hat{\psi}_n$ of $\psi$ can be obtained by solving the estimating equation for $\psi$:
$$\sum_{i=1}^n U(\psi; y_i) = 0.$$
- A special case of the estimating function is the score function. In this case, $\psi = \theta$.
- The sandwich variance estimator is often used to estimate the variance of $\hat{\psi}_n$:
$$\hat{V}(\hat{\psi}_n) = \hat{\tau}_u^{-1} \hat{V}(U)\, \hat{\tau}_u^{-1\prime},$$
where $\tau_u = E\{\partial U(\psi; y)/\partial \psi'\}$.
1. Introduction

Missing data setup:
- Suppose that $y_i$ is not fully observed.
- $y_i = (y_{\mathrm{obs},i}, y_{\mathrm{mis},i})$: (observed, missing) part of $y_i$.
- $\delta_i$: response indicator function for $y_i$.
- Under the existence of missing data, we can use the following estimator: $\hat{\psi}$, the solution to
$$\sum_{i=1}^n E\{U(\psi; y_i) \mid y_{\mathrm{obs},i}, \delta_i\} = 0. \quad (1)$$
- Equation (1) is often called the expected estimating equation.
1. Introduction

Motivation (for imputation): computing the conditional expectation in (1) can be a challenging problem.
1. The conditional expectation depends on unknown parameter values. That is,
$$E\{U(\psi; y_i) \mid y_{\mathrm{obs},i}, \delta_i\} = E\{U(\psi; y_i) \mid y_{\mathrm{obs},i}, \delta_i; \theta, \phi\},$$
where $\theta$ is the parameter in $f(y; \theta)$ and $\phi$ is the parameter in $p(\delta \mid y; \phi)$.
2. Even if we know $\eta = (\theta, \phi)$, computing the conditional expectation is numerically difficult.
1. Introduction

Imputation: a Monte Carlo approximation of the conditional expectation (given the observed data):
$$E\{U(\psi; y_i) \mid y_{\mathrm{obs},i}, \delta_i\} \cong \frac{1}{m} \sum_{j=1}^m U\left(\psi; y_{\mathrm{obs},i}, y_{\mathrm{mis},i}^{(j)}\right).$$
1. Bayesian approach: generate $y_{\mathrm{mis},i}^*$ from
$$f(y_{\mathrm{mis},i} \mid y_{\mathrm{obs}}, \delta) = \int f(y_{\mathrm{mis},i} \mid y_{\mathrm{obs}}, \delta; \eta)\, p(\eta \mid y_{\mathrm{obs}}, \delta)\, d\eta.$$
2. Frequentist approach: generate $y_{\mathrm{mis},i}^*$ from $f(y_{\mathrm{mis},i} \mid y_{\mathrm{obs},i}, \delta; \hat{\eta})$, where $\hat{\eta}$ is a consistent estimator.
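As a toy illustration of the frequentist version, the Monte Carlo average below approximates $E\{U(\psi; y) \mid y_{\mathrm{obs}}\}$ under an assumed imputation model $y_{\mathrm{mis}} \mid y_{\mathrm{obs}} \sim N(y_{\mathrm{obs}}, 1)$ with $U(\psi; y) = y_{\mathrm{mis}} - \psi$; both choices are illustrative assumptions, made so the conditional expectation has the closed form $y_{\mathrm{obs}} - \psi$ that the approximation can be checked against.

```python
import numpy as np

rng = np.random.default_rng(0)

y_obs = 1.5          # observed part of one unit
psi = 0.7
m = 200_000          # number of imputed draws

# Frequentist-style imputation: draw y_mis from f(y_mis | y_obs; eta_hat),
# taken here to be N(y_obs, 1) purely for illustration.
y_mis = rng.normal(loc=y_obs, scale=1.0, size=m)

# Monte Carlo approximation of E{U(psi; y) | y_obs}
U_bar = np.mean(y_mis - psi)

exact = y_obs - psi   # closed form for this toy model
print(U_bar, exact)
```

The approximation error is of order $m^{-1/2}$, which is why a large $m$ is used here.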
Example 4.1

Basic setup:
- Let $(x, y)$ be a vector of bivariate random variables.
- Assume that the $x_i$ are always observed and the $y_i$ are subject to missingness in the sample, and that the probability of missingness does not depend on the value of $y_i$.
- In this case, an imputed estimator of $\theta = E(Y)$ based on single imputation can be computed as
$$\hat{\theta}_I = \frac{1}{n} \sum_{i=1}^n \{\delta_i y_i + (1 - \delta_i)\, y_i^*\}, \quad (2)$$
where $y_i^*$ is an imputed value for $y_i$.
- Imputation model: $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma_e^2)$ for some $(\beta_0, \beta_1, \sigma_e^2)$.
Example 4.1 (Cont'd)

Deterministic imputation: use $y_i^* = \hat{\beta}_0 + \hat{\beta}_1 x_i$, where
$$(\hat{\beta}_0, \hat{\beta}_1) = \left(\bar{y}_r - \hat{\beta}_1 \bar{x}_r,\; S_{xxr}^{-1} S_{xyr}\right).$$
Note that $E(\hat{\theta}_I - \theta) \cong 0$ and
$$V(\hat{\theta}_I) \cong \frac{1}{n}\sigma_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\sigma_e^2 = \frac{\sigma_y^2}{r}\left\{1 - \left(1 - \frac{r}{n}\right)\rho^2\right\}.$$
Stochastic imputation: use $y_i^* = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i^*$, where $e_i^* \sim (0, \hat{\sigma}_e^2)$. The imputed estimator under stochastic imputation satisfies
$$V(\hat{\theta}_I) \cong \frac{1}{n}\sigma_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\sigma_e^2 + \frac{n-r}{n^2}\sigma_e^2,$$
where the third term represents the additional variance due to stochastic imputation.
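A small simulation can be used to check the two variance formulas. The setup below is an illustrative assumption (β0 = 1, β1 = 2, σe = 1, so σy² = 5 and ρ² = 4/5, with a 60% MCAR response rate); the stochastic-imputation variance should exceed the deterministic one by roughly (n − r)σe²/n².

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 200, 2000
beta0, beta1, sig_e = 1.0, 2.0, 1.0      # sigma_y^2 = 5, rho^2 = 4/5

est_det, est_sto = [], []
for _ in range(B):
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(scale=sig_e, size=n)
    delta = rng.random(n) < 0.6          # MCAR response indicator
    xr, yr = x[delta], y[delta]
    # OLS fit on the respondents
    b1 = np.sum((xr - xr.mean()) * (yr - yr.mean())) / np.sum((xr - xr.mean()) ** 2)
    b0 = yr.mean() - b1 * xr.mean()
    yhat = b0 + b1 * x
    est_det.append(np.where(delta, y, yhat).mean())
    # stochastic imputation: add residuals drawn from the respondents
    e_star = rng.choice(yr - yhat[delta], size=n, replace=True)
    est_sto.append(np.where(delta, y, yhat + e_star).mean())

v_det, v_sto = np.var(est_det, ddof=1), np.var(est_sto, ddof=1)
# theoretical deterministic variance, using the expected respondent count r = 0.6 n
r_bar = 0.6 * n
v_theory = 5.0 / n + (1.0 / r_bar - 1.0 / n) * sig_e ** 2
print(v_det, v_sto, v_theory)
```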
Remark

- Deterministic imputation is unbiased for estimating the mean but may not be unbiased for estimating a proportion. For example, if $\theta = \Pr(Y < c) = E\{I(Y < c)\}$, the imputed estimator $\hat{\theta} = n^{-1}\sum_{i=1}^n \{\delta_i I(y_i < c) + (1 - \delta_i) I(y_i^* < c)\}$ is unbiased if $E\{I(Y < c)\} = E\{I(Y^* < c)\}$, which holds only when the marginal distribution of $y^*$ is the same as the marginal distribution of $y$.
- In general, under deterministic imputation, we have $E(y) = E(y^*)$ but $V(y) > V(y^*)$. For regression imputation, $V(y^*) = \rho^2 \sigma_y^2 < \sigma_y^2 = V(y)$.
- Imputation increases the variance (of the imputed estimator) because the imputed values are positively correlated.
- Variance estimation is complicated because of the correlation between the imputed values.
2. Basic Theory for Imputation

Lemma 4.1. Let $\hat{\theta}$ be the solution to $\hat{U}(\theta) = 0$, where $\hat{U}(\theta)$ is a function of the complete observations $y_1, \ldots, y_n$ and the parameter $\theta$. Let $\theta_0$ be the solution to $E\{\hat{U}(\theta)\} = 0$. Then, under some regularity conditions,
$$\hat{\theta} - \theta_0 \cong -\left[E\{\dot{U}(\theta_0)\}\right]^{-1} \hat{U}(\theta_0),$$
where $\dot{U}(\theta) = \partial \hat{U}(\theta)/\partial \theta'$ and the notation $A_n \cong B_n$ means that $A_n B_n^{-1} = 1 + R_n$ for some $R_n$ which converges to zero in probability.
Remark (about Lemma 4.1)

- The proof is based on Taylor linearization:
$$0 = \hat{U}(\hat{\theta}) \cong \hat{U}(\theta_0) + \dot{U}(\theta_0)(\hat{\theta} - \theta_0) \cong \hat{U}(\theta_0) + E\{\dot{U}(\theta_0)\}(\hat{\theta} - \theta_0),$$
where the second (approximate) equality follows from $\dot{U}(\theta_0) = E\{\dot{U}(\theta_0)\} + o_p(1)$ and $\hat{\theta} = \theta_0 + o_p(1)$.
- We need to assume that $E\{\dot{U}(\theta_0)\}$ is nonsingular.
- Also, we need conditions for $\hat{\theta} \stackrel{p}{\to} \theta_0$.
- Lemma 4.1 can be used to establish the asymptotic normality of $\hat{\theta}$ (use Slutsky's theorem).
2. Basic Theory for Imputation

Basic setup (Case 1: $\psi = \eta$):
- $\mathbf{y} = (y_1, \ldots, y_n) \sim f(\mathbf{y}; \theta)$, $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_n) \sim P(\boldsymbol{\delta} \mid \mathbf{y}; \phi)$.
- $\mathbf{y} = (y_{\mathrm{obs}}, y_{\mathrm{mis}})$: (observed, missing) part of $\mathbf{y}$.
- $y_{\mathrm{mis}}^{(1)}, \ldots, y_{\mathrm{mis}}^{(m)}$: $m$ imputed values of $y_{\mathrm{mis}}$ generated from
$$f(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, \delta; \hat{\eta}_p) = \frac{f(\mathbf{y}; \hat{\theta}_p)\, P(\delta \mid \mathbf{y}; \hat{\phi}_p)}{\int f(\mathbf{y}; \hat{\theta}_p)\, P(\delta \mid \mathbf{y}; \hat{\phi}_p)\, d\mu(y_{\mathrm{mis}})},$$
where $\hat{\eta}_p = (\hat{\theta}_p, \hat{\phi}_p)$ is a preliminary estimator of $\eta = (\theta, \phi)$.
- Using the $m$ imputed values, the imputed score function is computed as
$$\bar{S}_{\mathrm{imp},m}(\eta \mid \hat{\eta}_p) \equiv \frac{1}{m} \sum_{j=1}^m S_{\mathrm{com}}\left(\eta; y_{\mathrm{obs}}, y_{\mathrm{mis}}^{(j)}, \delta\right),$$
where $S_{\mathrm{com}}(\eta; \mathbf{y})$ is the score function of $\eta = (\theta, \phi)$ under complete response.
2. Basic Theory for Imputation

Lemma 4.2 (Asymptotic results for $m = \infty$). Assume that $\hat{\eta}_p$ converges in probability to $\eta$. Let $\hat{\eta}_{\mathrm{imp},m}$ be the solution to
$$\frac{1}{m} \sum_{j=1}^m S_{\mathrm{com}}\left(\eta; y_{\mathrm{obs}}, y_{\mathrm{mis}}^{(j)}, \delta\right) = 0,$$
where $y_{\mathrm{mis}}^{(1)}, \ldots, y_{\mathrm{mis}}^{(m)}$ are the imputed values generated from $f(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, \delta; \hat{\eta}_p)$. Then, under some regularity conditions, for $m \to \infty$,
$$\hat{\eta}_{\mathrm{imp},\infty} \cong \hat{\eta}_{\mathrm{MLE}} + J_{\mathrm{mis}}\left(\hat{\eta}_p - \hat{\eta}_{\mathrm{MLE}}\right) \quad (3)$$
and
$$V(\hat{\eta}_{\mathrm{imp},\infty}) \doteq I_{\mathrm{obs}}^{-1} + J_{\mathrm{mis}}\left\{V(\hat{\eta}_p) - V(\hat{\eta}_{\mathrm{MLE}})\right\} J_{\mathrm{mis}}',$$
where $J_{\mathrm{mis}} = I_{\mathrm{com}}^{-1} I_{\mathrm{mis}}$ is the fraction of missing information.
Remark

- Equation (3) means that
$$\hat{\eta}_{\mathrm{imp},\infty} \cong (I - J_{\mathrm{mis}})\, \hat{\eta}_{\mathrm{MLE}} + J_{\mathrm{mis}}\, \hat{\eta}_p. \quad (4)$$
That is, $\hat{\eta}_{\mathrm{imp},\infty}$ is a convex combination of $\hat{\eta}_{\mathrm{MLE}}$ and $\hat{\eta}_p$.
- Note that $\hat{\eta}_{\mathrm{imp},\infty}$ is a one-step EM update with initial estimate $\hat{\eta}_p$. Let $\hat{\eta}^{(t)}$ be the $t$-th EM update of $\eta$, computed by solving $\bar{S}\left(\eta \mid \hat{\eta}^{(t-1)}\right) = 0$ with $\hat{\eta}^{(0)} = \hat{\eta}_p$. Equation (4) implies that
$$\hat{\eta}^{(t)} = (I - J_{\mathrm{mis}})\, \hat{\eta}_{\mathrm{MLE}} + J_{\mathrm{mis}}\, \hat{\eta}^{(t-1)}.$$
Thus, we can obtain
$$\hat{\eta}^{(t)} = \hat{\eta}_{\mathrm{MLE}} + (J_{\mathrm{mis}})^t \left(\hat{\eta}^{(0)} - \hat{\eta}_{\mathrm{MLE}}\right),$$
which justifies $\lim_{t \to \infty} \hat{\eta}^{(t)} = \hat{\eta}_{\mathrm{MLE}}$.
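The contraction above can be seen in a one-dimensional sketch; the scalar fraction of missing information and the numerical values below are illustrative assumptions.

```python
J_mis = 0.4        # scalar "fraction of missing information" (illustrative)
eta_mle = 2.5      # MLE that the iteration should converge to
eta0 = 0.0         # preliminary estimator eta^(0)

eta = eta0
for t in range(50):
    # one EM update: eta^(t) = (1 - J_mis) * eta_MLE + J_mis * eta^(t-1)
    eta = (1 - J_mis) * eta_mle + J_mis * eta

# closed form: eta^(t) = eta_MLE + J_mis^t * (eta^(0) - eta_MLE)
closed = eta_mle + J_mis ** 50 * (eta0 - eta_mle)
print(eta, closed)
```

Because |J_mis| < 1, each step shrinks the distance to the MLE by the same factor, which is exactly the geometric convergence claimed above.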
Proof for Lemma 4.2
Wang and Robins (1998)

Theorem 4.1 (Asymptotic results for $m < \infty$). Let $\hat{\eta}_p$ be a preliminary $\sqrt{n}$-consistent estimator of $\eta$ with variance $V_p$. Under some regularity conditions, the solution $\hat{\eta}_{\mathrm{imp},m}$ to
$$\bar{S}_m(\eta \mid \hat{\eta}_p) \equiv \frac{1}{m} \sum_{j=1}^m S_{\mathrm{com}}\left(\eta; y_{\mathrm{obs}}, y_{\mathrm{mis}}^{(j)}, \delta\right) = 0$$
has mean $\eta_0$ and asymptotic variance
$$V(\hat{\eta}_{\mathrm{imp},m}) \doteq I_{\mathrm{obs}}^{-1} + J_{\mathrm{mis}}\left\{V_p - I_{\mathrm{obs}}^{-1}\right\} J_{\mathrm{mis}}' + m^{-1} I_{\mathrm{com}}^{-1} I_{\mathrm{mis}} I_{\mathrm{com}}^{-1}, \quad (5)$$
where $J_{\mathrm{mis}} = I_{\mathrm{com}}^{-1} I_{\mathrm{mis}}$.
2. Basic Theory for Imputation

Remark

- If we use $\hat{\eta}_p = \hat{\eta}_{\mathrm{MLE}}$, then the asymptotic variance in (5) is
$$V(\hat{\eta}_{\mathrm{imp},m}) \doteq I_{\mathrm{obs}}^{-1} + m^{-1} I_{\mathrm{com}}^{-1} I_{\mathrm{mis}} I_{\mathrm{com}}^{-1}.$$
- In Bayesian imputation (or multiple imputation), the posterior values of $\eta$ are independently generated from $\eta \sim N(\hat{\eta}_{\mathrm{MLE}}, I_{\mathrm{obs}}^{-1})$, which implies that we can use $V_p = I_{\mathrm{obs}}^{-1} + m^{-1} I_{\mathrm{obs}}^{-1}$. Thus, the asymptotic variance in (5) for multiple imputation is
$$V(\hat{\eta}_{\mathrm{imp},m}) \doteq I_{\mathrm{obs}}^{-1} + m^{-1} J_{\mathrm{mis}} I_{\mathrm{obs}}^{-1} J_{\mathrm{mis}}' + m^{-1} I_{\mathrm{com}}^{-1} I_{\mathrm{mis}} I_{\mathrm{com}}^{-1}.$$
- The second term is the additional price we pay for generating the posterior values, rather than using $\hat{\eta}_{\mathrm{MLE}}$ directly.
Remark (about Theorem 4.1)

The variance in (5) has three components. Writing
$$\hat{\eta}_{\mathrm{imp},m} = \hat{\eta}_{\mathrm{MLE}} + (\hat{\eta}_{\mathrm{imp},\infty} - \hat{\eta}_{\mathrm{MLE}}) + (\hat{\eta}_{\mathrm{imp},m} - \hat{\eta}_{\mathrm{imp},\infty}),$$
we can establish that the three terms are independent and satisfy
$$V(\hat{\eta}_{\mathrm{MLE}}) = I_{\mathrm{obs}}^{-1},$$
$$V(\hat{\eta}_{\mathrm{imp},\infty} - \hat{\eta}_{\mathrm{MLE}}) = J_{\mathrm{mis}}\left\{V_p - I_{\mathrm{obs}}^{-1}\right\} J_{\mathrm{mis}}',$$
$$V(\hat{\eta}_{\mathrm{imp},m} - \hat{\eta}_{\mathrm{imp},\infty}) = m^{-1} I_{\mathrm{com}}^{-1} I_{\mathrm{mis}} I_{\mathrm{com}}^{-1}.$$
Sketched proof for Theorem 4.1

By Lemma 4.1 applied to $\bar{S}(\eta \mid \hat{\eta}_p) = 0$, we have
$$\hat{\eta}_{\mathrm{imp},\infty} - \eta_0 \cong I_{\mathrm{com}}^{-1}\, \bar{S}(\eta \mid \hat{\eta}_p).$$
Similarly, we can write
$$\hat{\eta}_{\mathrm{imp},m} - \eta_0 \cong I_{\mathrm{com}}^{-1}\, \bar{S}_m(\eta \mid \hat{\eta}_p).$$
Thus,
$$V(\hat{\eta}_{\mathrm{imp},m} - \hat{\eta}_{\mathrm{imp},\infty}) \cong I_{\mathrm{com}}^{-1}\, V\left\{\bar{S}_m(\eta \mid \hat{\eta}_p) - \bar{S}(\eta \mid \hat{\eta}_p)\right\} I_{\mathrm{com}}^{-1}$$
and
$$V\left\{\bar{S}_m(\eta \mid \hat{\eta}_p) - \bar{S}(\eta \mid \hat{\eta}_p)\right\} = \frac{1}{m} V\{S_{\mathrm{com}}(\eta) \mid y_{\mathrm{obs}}, \delta; \hat{\eta}_p\} = \frac{1}{m} V\{S_{\mathrm{mis}}(\eta) \mid y_{\mathrm{obs}}, \delta; \hat{\eta}_p\}$$
$$= \frac{1}{m} E\left\{S_{\mathrm{mis}}(\eta)^{\otimes 2} \mid y_{\mathrm{obs}}, \delta; \hat{\eta}_p\right\} \cong \frac{1}{m} E\left\{S_{\mathrm{mis}}(\eta)^{\otimes 2} \mid y_{\mathrm{obs}}, \delta; \eta\right\}.$$
2. Basic Theory for Imputation

Basic setup (Case 2: $\psi \neq \eta$):
- The parameter $\psi$ is defined by $E\{U(\psi; y)\} = 0$. Under complete response, a consistent estimator of $\psi$ can be obtained by solving $U(\psi; \mathbf{y}) = 0$.
- Assume that some part of $\mathbf{y}$, denoted by $y_{\mathrm{mis}}$, is not observed, and that $m$ imputed values, say $y_{\mathrm{mis}}^{(1)}, \ldots, y_{\mathrm{mis}}^{(m)}$, are generated from $f(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, \delta; \hat{\eta}_{\mathrm{MLE}})$, where $\hat{\eta}_{\mathrm{MLE}}$ is the MLE of $\eta_0$.
- The imputed estimating function using the $m$ imputed values is computed as
$$\bar{U}_{\mathrm{imp},m}(\psi \mid \hat{\eta}_{\mathrm{MLE}}) = \frac{1}{m} \sum_{j=1}^m U(\psi; \mathbf{y}^{(j)}), \quad (6)$$
where $\mathbf{y}^{(j)} = (y_{\mathrm{obs}}, y_{\mathrm{mis}}^{(j)})$.
- Let $\hat{\psi}_{\mathrm{imp},m}$ be the solution to $\bar{U}_{\mathrm{imp},m}(\psi \mid \hat{\eta}_{\mathrm{MLE}}) = 0$. We are interested in the asymptotic properties of $\hat{\psi}_{\mathrm{imp},m}$.
Theorem 4.2

Suppose that the parameter of interest $\psi_0$ is estimated by solving $U(\psi) = 0$ under complete response. Then, under some regularity conditions, the solution to
$$E\{U(\psi) \mid y_{\mathrm{obs}}, \delta; \hat{\eta}_{\mathrm{MLE}}\} = 0 \quad (7)$$
has mean $\psi_0$ and asymptotic variance $\tau^{-1} \Omega\, \tau^{-1\prime}$, where
$$\tau = E\{\partial U(\psi_0)/\partial \psi'\}, \qquad \Omega = V\left\{\bar{U}(\psi_0 \mid \eta_0) + \kappa\, S_{\mathrm{obs}}(\eta_0)\right\},$$
and $\kappa = E\{U(\psi_0)\, S_{\mathrm{mis}}(\eta_0)'\}\, I_{\mathrm{obs}}^{-1}$.
Sketched proof

Writing $\bar{U}(\psi \mid \eta) \equiv E\{U(\psi) \mid y_{\mathrm{obs}}, \delta; \eta\}$, the solution to (7) can be treated as the solution to the joint estimating equation
$$U^*(\psi, \eta) \equiv \begin{bmatrix} U_1(\psi, \eta) \\ U_2(\eta) \end{bmatrix} = 0,$$
where $U_1(\psi, \eta) = \bar{U}(\psi \mid \eta)$ and $U_2(\eta) = S_{\mathrm{obs}}(\eta)$. We can apply a Taylor expansion to get
$$\begin{pmatrix} \hat{\psi} \\ \hat{\eta} \end{pmatrix} \cong \begin{pmatrix} \psi_0 \\ \eta_0 \end{pmatrix} - \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}^{-1} \begin{bmatrix} U_1(\psi_0, \eta_0) \\ U_2(\eta_0) \end{bmatrix},$$
where
$$\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{bmatrix} E(\partial U_1/\partial \psi') & E(\partial U_1/\partial \eta') \\ E(\partial U_2/\partial \psi') & E(\partial U_2/\partial \eta') \end{bmatrix}.$$
Sketched proof (Cont'd)

Note that
$$B_{11} = E\{\partial U(\psi)/\partial \psi'\}, \quad B_{12} = E\{U(\psi)\, S_{\mathrm{mis}}(\eta_0)'\}, \quad B_{21} = 0, \quad B_{22} = -I_{\mathrm{obs}}.$$
Thus,
$$\hat{\psi} \cong \psi_0 - B_{11}^{-1}\left\{U_1(\psi_0, \eta_0) - B_{12} B_{22}^{-1} U_2(\eta_0)\right\} = \psi_0 - B_{11}^{-1}\left\{U_1(\psi_0, \eta_0) + B_{12} I_{\mathrm{obs}}^{-1} U_2(\eta_0)\right\}.$$
Alternative approach

Use Randles' (1982) theorem: an estimator $\hat{\theta}(\hat{\beta})$ is asymptotically equivalent to $\hat{\theta}(\beta)$ if
$$E\left\{\frac{\partial}{\partial \beta}\, \hat{\theta}(\beta)\right\} = 0.$$
Thus, writing $\hat{\theta}_k(\beta) = \hat{\theta}(\beta) + k' S_{\mathrm{obs}}(\beta)$, we have:
1. $\hat{\theta}(\hat{\beta}) = \hat{\theta}_k(\hat{\beta})$ holds for any $k$, since $S_{\mathrm{obs}}(\hat{\beta}) = 0$.
2. If we can find $k = k^*$ such that $\hat{\theta}_{k^*}(\beta)$ satisfies Randles' condition, then we can safely ignore the effect of the sampling variability of $\hat{\beta}$ and assume that $\hat{\theta}_{k^*}(\hat{\beta}) \cong \hat{\theta}_{k^*}(\beta)$.
Example 4.2

Under the setup of Example 4.1, we are interested in obtaining the asymptotic variance of the regression imputation estimator
$$\hat{\theta}_{Id} = \frac{1}{n} \sum_{i=1}^n \left\{\delta_i y_i + (1 - \delta_i)\left(\hat{\beta}_0 + \hat{\beta}_1 x_i\right)\right\},$$
where $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)$ is the solution to
$$S_{\mathrm{obs}}(\beta) = \frac{1}{\sigma_e^2} \sum_{i=1}^n \delta_i (y_i - \beta_0 - \beta_1 x_i)(1, x_i)' = 0.$$
The imputed estimator is the solution to
$$E\{U(\theta) \mid y_{\mathrm{obs}}, \delta; \hat{\beta}\} = 0,$$
where $U(\theta) = \sum_{i=1}^n (y_i - \theta)/\sigma_e^2$.
Example 4.2 (Cont'd)

We have
$$E\{U(\theta) \mid y_{\mathrm{obs}}, \delta; \beta\} \equiv \bar{U}(\theta \mid \beta) = \sum_{i=1}^n \{\delta_i y_i + (1 - \delta_i)(\beta_0 + \beta_1 x_i) - \theta\}/\sigma_e^2.$$
Thus, using the linearization formula in Theorem 4.2, we have
$$\bar{U}_l(\theta \mid \beta) = \bar{U}(\theta \mid \beta) + (\kappa_0, \kappa_1)\, S_{\mathrm{obs}}(\beta), \quad (8)$$
where
$$(\kappa_0, \kappa_1)' = I_{\mathrm{obs}}^{-1}\, E\{S_{\mathrm{mis}}(\beta)\, U(\theta)\}. \quad (9)$$
Example 4.2 (Cont'd)

In this example, we have
$$\begin{pmatrix} \kappa_0 \\ \kappa_1 \end{pmatrix} = \left[E\left\{\sum_{i=1}^n \delta_i (1, x_i)'(1, x_i)\right\}\right]^{-1} E\left\{\sum_{i=1}^n (1 - \delta_i)(1, x_i)'\right\} = E\left\{\left(-1 + (n/r)(1 - g\bar{x}_r),\; (n/r)\, g\right)'\right\},$$
where $g = (\bar{x}_n - \bar{x}_r) \big/ \left\{\sum_{i=1}^n \delta_i (x_i - \bar{x}_r)^2 / r\right\}$. Thus,
$$\bar{U}_l(\theta \mid \beta)\, \sigma_e^2 = \sum_{i=1}^n \left[\delta_i (y_i - \theta) + (1 - \delta_i)(\beta_0 + \beta_1 x_i - \theta) + \delta_i (y_i - \beta_0 - \beta_1 x_i)(\kappa_0 + \kappa_1 x_i)\right].$$
Example 4.2 (Cont'd)

Note that the solution to $\bar{U}_l(\theta \mid \beta) = 0$ leads to
$$\hat{\theta}_{Id,l} = \frac{1}{n} \sum_{i=1}^n \left\{\beta_0 + \beta_1 x_i + \delta_i (1 + \kappa_0 + \kappa_1 x_i)(y_i - \beta_0 - \beta_1 x_i)\right\} = \frac{1}{n} \sum_{i=1}^n d_i,$$
where $1 + \kappa_0 + \kappa_1 x_i = (n/r)\{1 + g(x_i - \bar{x}_r)\}$. Thus, $\hat{\theta}_{Id}$ is asymptotically equivalent to $\hat{\theta}_{Id,l}$, which is the sample mean of the $d_i$, the influence function of unit $i$ on $\hat{\theta}_{Id}$. Under a uniform response mechanism, $1 + \kappa_0 + \kappa_1 x_i \cong n/r$ and the asymptotic variance of $\hat{\theta}_{Id,l}$ is
$$\frac{1}{n}\beta_1^2 \sigma_x^2 + \frac{1}{r}\sigma_e^2 = \frac{1}{n}\sigma_y^2 + \left(\frac{1}{r} - \frac{1}{n}\right)\sigma_e^2.$$
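The identity between the imputed estimator and the sample mean of the influence functions evaluated at $\hat{\beta}$ can be verified numerically; it holds exactly because the OLS normal equations make the respondent residual terms vanish. The simulated data below are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
delta = rng.random(n) < 0.7               # uniform (MCAR) response

xr, yr = x[delta], y[delta]
r = delta.sum()
# OLS on respondents
b1 = np.sum((xr - xr.mean()) * (yr - yr.mean())) / np.sum((xr - xr.mean()) ** 2)
b0 = yr.mean() - b1 * xr.mean()

# regression imputation estimator
theta_id = np.where(delta, y, b0 + b1 * x).mean()

# influence-function form: d_i = b0 + b1*x_i + delta_i (n/r){1 + g(x_i - xbar_r)} e_i
g = (x.mean() - xr.mean()) / (np.sum((xr - xr.mean()) ** 2) / r)
d = (b0 + b1 * x) + delta * (n / r) * (1 + g * (x - xr.mean())) * (y - b0 - b1 * x)
print(theta_id, d.mean())
```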
Remark

Let $X_1, \ldots, X_n$ be an IID sample from $f(x; \theta_0)$, $\theta_0 \in \Theta$, and suppose we are interested in estimating $\gamma_0 = \gamma(\theta_0)$, where $\gamma(\cdot): \Theta \to \mathbb{R}^k$. An estimator $\hat{\gamma} = \hat{\gamma}_n$ is called asymptotically linear if there exists a random vector $\psi(X)$ such that
$$\sqrt{n}\,(\hat{\gamma}_n - \gamma_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i) + o_p(1) \quad (10)$$
with $E_{\theta_0}\{\psi(X)\} = 0$ and $E_{\theta_0}\{\psi(X)\psi(X)'\}$ finite and nonsingular. Here, $Z_n = o_p(1)$ means that $Z_n$ converges to zero in probability.
Remark

- The function $\psi(x)$ is referred to as an influence function. The phrase "influence function" was used by Hampel (JASA, 1974) and is motivated by the fact that, to first order, $\psi(x)$ is the influence of a single observation on the estimator $\hat{\gamma} = \hat{\gamma}(X_1, \ldots, X_n)$.
- The asymptotic properties of an asymptotically linear estimator $\hat{\gamma}_n$ can be summarized by considering only its influence function. Since $\psi(X)$ has zero mean, the CLT tells us that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i) \stackrel{L}{\to} N\left[0, E_{\theta_0}\{\psi(X)\psi(X)'\}\right]. \quad (11)$$
Thus, combining (10) with (11) and applying Slutsky's theorem, we have
$$\sqrt{n}\,(\hat{\gamma}_n - \gamma_0) \stackrel{L}{\to} N\left[0, E_{\theta_0}\{\psi(X)\psi(X)'\}\right].$$
3. Variance Estimation After Imputation
Deterministic Imputation

- In Example 4.2, the imputed estimator can be written as $\hat{\theta}_{Id}(\hat{\beta})$. Note that we can write the deterministic imputed estimator as
$$\hat{\theta}_{Id} = n^{-1} \sum_{i=1}^n \hat{y}_i(\hat{\beta}),$$
where $\hat{y}_i(\hat{\beta}) = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
- In general, the asymptotic variance of $\hat{\theta}_{Id} = \hat{\theta}(\hat{\beta})$ is different from the asymptotic variance of $\hat{\theta}(\beta)$.
As in Example 4.2, if we can find $d_i = d_i(\beta)$ such that
$$\hat{\theta}_{Id}(\hat{\beta}) = n^{-1} \sum_{i=1}^n d_i(\hat{\beta}) \cong n^{-1} \sum_{i=1}^n d_i(\beta),$$
then the asymptotic variance of $\hat{\theta}_{Id}$ is equal to the asymptotic variance of $\bar{d}_n = n^{-1} \sum_{i=1}^n d_i(\beta)$. Note that, if the $(x_i, y_i, \delta_i)$ are IID, then the $d_i = d(x_i, y_i, \delta_i)$ are also IID. Thus, the variance of $\bar{d}_n = n^{-1} \sum_{i=1}^n d_i$ is unbiasedly estimated by
$$\hat{V}(\bar{d}_n) = \frac{1}{n} \frac{1}{n-1} \sum_{i=1}^n \left(d_i - \bar{d}_n\right)^2. \quad (12)$$
Unfortunately, we cannot compute $\hat{V}(\bar{d}_n)$ in (12), since $d_i = d_i(\beta)$ is a function of unknown parameters. Thus, we use $\hat{d}_i = d_i(\hat{\beta})$ in (12) to get a consistent variance estimator for the imputed estimator.
Stochastic Imputation

Instead of deterministic imputation, suppose that a stochastic imputation is used, such that
$$\hat{\theta}_I = n^{-1} \sum_{i=1}^n \left\{\delta_i y_i + (1 - \delta_i)\left(\hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{e}_i^*\right)\right\},$$
where the $\hat{e}_i^*$ are the additional noise terms in the stochastic imputation. Often the $\hat{e}_i^*$ are randomly selected from the empirical distribution of the sample residuals among the respondents. The variance of the imputed estimator can be decomposed into two parts:
$$V(\hat{\theta}_I) = V(\hat{\theta}_{Id}) + V(\hat{\theta}_I - \hat{\theta}_{Id}), \quad (13)$$
where the first part is the deterministic part and the second part is the additional variance due to stochastic imputation. The first part can be estimated by the linearization method discussed above. The second part is called the imputation variance.
If we require the imputation mechanism to satisfy
$$\sum_{i=1}^n (1 - \delta_i)\, \hat{e}_i^* = 0,$$
then the imputation variance is equal to zero. Often the variance of $\hat{\theta}_I - \hat{\theta}_{Id} = n^{-1} \sum_{i=1}^n (1 - \delta_i)\hat{e}_i^*$ can be computed under the known imputation mechanism. For example, if simple random sampling without replacement is used, then
$$V(\hat{\theta}_I - \hat{\theta}_{Id}) = E\left\{V(\hat{\theta}_I \mid y_{\mathrm{obs}}, \delta)\right\} = n^{-2}\, m\, (1 - m/r)\,(r-1)^{-1} \sum_{i=1}^n \delta_i \hat{e}_i^2,$$
where $m = n - r$.
Imputed estimator for general parameters

We now discuss the general case of parameter estimation, where the parameter of interest $\psi$ is estimated by the solution $\hat{\psi}_n$ to
$$\sum_{i=1}^n U(\psi; y_i) = 0 \quad (14)$$
under complete response of $y_1, \ldots, y_n$. Under the existence of missing data, we can use the imputed estimating equation
$$\bar{U}_m(\psi) \equiv m^{-1} \sum_{i=1}^n \sum_{j=1}^m U(\psi; y_i^{(j)}) = 0, \quad (15)$$
where $y_i^{(j)} = (y_{i,\mathrm{obs}}, y_{i,\mathrm{mis}}^{(j)})$ and the $y_{i,\mathrm{mis}}^{(j)}$ are randomly generated from the conditional distribution $h(y_{i,\mathrm{mis}} \mid y_{i,\mathrm{obs}}, \delta_i; \hat{\eta}_p)$, with $\hat{\eta}_p$ estimated by solving
$$\hat{U}_p(\eta) \equiv \sum_{i=1}^n U_p(\eta; y_{i,\mathrm{obs}}) = 0. \quad (16)$$
To apply the linearization method, we first compute the conditional expectation of $U(\psi; y_i)$ given $(y_{i,\mathrm{obs}}, \delta_i)$, evaluated at $\hat{\eta}_p$. That is, compute
$$\bar{U}(\psi \mid \hat{\eta}_p) = \sum_{i=1}^n \bar{U}_i(\psi \mid \hat{\eta}_p) = \sum_{i=1}^n E\{U(\psi; y_i) \mid y_{i,\mathrm{obs}}, \delta_i; \hat{\eta}_p\}. \quad (17)$$
Let $\hat{\psi}_R$ be the solution to $\bar{U}(\psi \mid \hat{\eta}_p) = 0$. Using the linearization technique, we have
$$\bar{U}(\psi \mid \hat{\eta}_p) \cong \bar{U}(\psi \mid \eta_0) + E\left\{\frac{\partial}{\partial \eta'} \bar{U}(\psi \mid \eta_0)\right\} (\hat{\eta}_p - \eta_0) \quad (18)$$
and
$$0 = \hat{U}_p(\hat{\eta}_p) \cong \hat{U}_p(\eta_0) + E\left\{\frac{\partial}{\partial \eta'} \hat{U}_p(\eta_0)\right\} (\hat{\eta}_p - \eta_0). \quad (19)$$
Thus, combining (18) and (19), we have
$$\bar{U}(\psi \mid \hat{\eta}_p) \cong \bar{U}(\psi \mid \eta_0) + \kappa(\psi)\, \hat{U}_p(\eta_0), \quad (20)$$
where
$$\kappa(\psi) = -E\left\{\frac{\partial}{\partial \eta'} \bar{U}(\psi \mid \eta_0)\right\} \left[E\left\{\frac{\partial}{\partial \eta'} \hat{U}_p(\eta_0)\right\}\right]^{-1}.$$
Thus, writing
$$\bar{U}_l(\psi \mid \eta_0) = \sum_{i=1}^n \left\{\bar{U}_i(\psi \mid \eta_0) + \kappa(\psi)\, U_p(\eta_0; y_{i,\mathrm{obs}})\right\} = \sum_{i=1}^n q_i(\psi \mid \eta_0),$$
with $q_i(\psi \mid \eta_0) = \bar{U}_i(\psi \mid \eta_0) + \kappa(\psi)\, U_p(\eta_0; y_{i,\mathrm{obs}})$, the variance of $\bar{U}(\psi \mid \hat{\eta}_p)$ is asymptotically equal to the variance of $\bar{U}_l(\psi \mid \eta_0)$.
Thus, the sandwich-type variance estimator for $\hat{\psi}_R$ is
$$\hat{V}(\hat{\psi}_R) = \hat{\tau}_q^{-1}\, \hat{\Omega}_q\, \hat{\tau}_q^{-1\prime}, \quad (21)$$
where
$$\hat{\tau}_q = n^{-1} \sum_{i=1}^n \dot{q}_i(\hat{\psi}_R \mid \hat{\eta}_p), \qquad \hat{\Omega}_q = n^{-1}(n-1)^{-1} \sum_{i=1}^n (\hat{q}_i - \bar{q}_n)^{\otimes 2},$$
$\dot{q}_i(\psi \mid \eta) = \partial q_i(\psi \mid \eta)/\partial \psi'$, $\bar{q}_n = n^{-1} \sum_{i=1}^n \hat{q}_i$, and $\hat{q}_i = q_i(\hat{\psi}_R \mid \hat{\eta}_p)$. Note that
$$\hat{\tau}_q = n^{-1} \sum_{i=1}^n \dot{q}_i(\hat{\psi}_R \mid \hat{\eta}_p) = n^{-1} \sum_{i=1}^n \frac{\partial}{\partial \psi'} E\{U(\hat{\psi}_R; y_i) \mid y_{i,\mathrm{obs}}, \delta_i; \hat{\eta}_p\},$$
because $\hat{\eta}_p$ is the solution to (16).
Example 4.3

Assume that the original sample is decomposed into $G$ disjoint groups (often called imputation cells) and that the sample observations are IID within the same cell. That is,
$$y_i \mid i \in S_g \;\stackrel{\mathrm{i.i.d.}}{\sim}\; (\mu_g, \sigma_g^2), \quad (22)$$
where $S_g$ is the set of sample indices in cell $g$. Assume there are $n_g$ sample elements in cell $g$, of which $r_g$ are observed. For deterministic imputation, let $\hat{\mu}_g = r_g^{-1} \sum_{i \in S_g} \delta_i y_i$ be the $g$-th cell mean of $y$ among the respondents. The (deterministically) imputed estimator of $\theta = E(Y)$ is, under MAR,
$$\hat{\theta}_{Id} = n^{-1} \sum_{g=1}^G \sum_{i \in S_g} \{\delta_i y_i + (1 - \delta_i)\hat{\mu}_g\} = n^{-1} \sum_{g=1}^G n_g \hat{\mu}_g. \quad (23)$$
Example 4.3 (Cont'd)

Using the linearization technique in (20), the imputed estimator can be expressed as
$$\hat{\theta}_{Id} \cong n^{-1} \sum_{g=1}^G \sum_{i \in S_g} \left\{\mu_g + \frac{n_g}{r_g}\, \delta_i (y_i - \mu_g)\right\}, \quad (24)$$
and the plug-in variance estimator can be expressed as
$$\hat{V}(\hat{\theta}_{Id}) = \frac{1}{n} \frac{1}{n-1} \sum_{i=1}^n \left(\hat{d}_i - \bar{d}_n\right)^2, \quad (25)$$
where $\hat{d}_i = \hat{\mu}_g + (n_g/r_g)\,\delta_i(y_i - \hat{\mu}_g)$ for $i \in S_g$ and $\bar{d}_n = n^{-1} \sum_{i=1}^n \hat{d}_i$.
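A numerical sketch of (23)-(25), with four illustrative cells whose sizes, means, and response rate are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
G, n_g = 4, 50
n = G * n_g
cell = np.repeat(np.arange(G), n_g)
y = np.array([0.0, 1.0, 2.0, 3.0])[cell] + rng.normal(size=n)
delta = rng.random(n) < 0.7                          # response within cells

r_g = np.array([delta[cell == g].sum() for g in range(G)])
mu_hat = np.array([y[(cell == g) & delta].mean() for g in range(G)])

# imputed estimator (23): algebraically identical to n^{-1} sum_g n_g mu_hat_g
theta_id = np.where(delta, y, mu_hat[cell]).mean()
theta_alt = (n_g * mu_hat).sum() / n

# plug-in variance estimator (25)
d_hat = mu_hat[cell] + (n_g / r_g)[cell] * delta * (y - mu_hat[cell])
v_hat = np.sum((d_hat - d_hat.mean()) ** 2) / (n * (n - 1))
print(theta_id, theta_alt, v_hat)
```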
Example 4.3 (Cont'd)

If a stochastic imputation is used, where each imputed value is randomly selected from the set of respondents in the same cell, then we can write
$$\hat{\theta}_{Is} = n^{-1} \sum_{g=1}^G \sum_{i \in S_g} \{\delta_i y_i + (1 - \delta_i) y_i^*\}. \quad (26)$$
Writing
$$\hat{\theta}_{Is} = \hat{\theta}_{Id} + n^{-1} \sum_{g=1}^G \sum_{i \in S_g} (1 - \delta_i)(y_i^* - \hat{\mu}_g),$$
the variance of the second term can be estimated by
$$n^{-2} \sum_{g=1}^G \sum_{i \in S_g} (1 - \delta_i)(y_i^* - \hat{\mu}_g)^2$$
if the imputed values are generated independently, conditional on the respondents.
4. Replication Variance Estimation
Replication variance estimation (under complete response)

Let $\hat{\theta}_n$ be the complete-sample estimator of $\theta$. The replication variance estimator of $\hat{\theta}_n$ takes the form
$$\hat{V}_{\mathrm{rep}}(\hat{\theta}_n) = \sum_{k=1}^L c_k \left(\hat{\theta}_n^{(k)} - \hat{\theta}_n\right)^2, \quad (27)$$
where $L$ is the number of replicates, $c_k$ is the replication factor associated with replication $k$, and $\hat{\theta}_n^{(k)}$ is the $k$-th replicate of $\hat{\theta}_n$. If $\hat{\theta}_n = \sum_{i=1}^n y_i/n$, then we can write $\hat{\theta}_n^{(k)} = \sum_{i=1}^n w_i^{(k)} y_i$ for some replication weights $w_1^{(k)}, w_2^{(k)}, \ldots, w_n^{(k)}$.
Replication variance estimation (under complete response)

For example, in the jackknife method, we have $L = n$, $c_k = (n-1)/n$, and
$$w_i^{(k)} = \begin{cases} (n-1)^{-1} & \text{if } i \neq k \\ 0 & \text{if } i = k. \end{cases}$$
If we apply this jackknife method to $\hat{\theta}_n = \sum_{i=1}^n y_i/n$, the resulting jackknife estimator in (27) is algebraically equivalent to
$$\frac{1}{n} \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y}_n)^2.$$
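The stated algebraic equivalence is easy to confirm numerically (toy data, sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
y = rng.normal(size=n)

theta = y.mean()
# delete-one jackknife replicates of the sample mean
reps = np.array([np.delete(y, k).mean() for k in range(n)])
v_jk = (n - 1) / n * np.sum((reps - theta) ** 2)

# textbook identity: equals {n(n-1)}^{-1} * sum (y_i - ybar)^2
v_lin = np.sum((y - theta) ** 2) / (n * (n - 1))
print(v_jk, v_lin)
```

The identity follows because deleting unit $k$ shifts the mean by $(\bar{y} - y_k)/(n-1)$, so the squared deviations of the replicates are just rescaled squared residuals.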
Replication variance estimation (under complete response)

Under some regularity conditions, for $\hat{\theta}_n = g(\bar{y}_n)$, the replication variance estimator of $\hat{\theta}_n$, defined by
$$\hat{V}_{\mathrm{rep}}(\hat{\theta}_n) = \sum_{k=1}^L c_k \left(\hat{\theta}_n^{(k)} - \hat{\theta}_n\right)^2 \quad (28)$$
with $\hat{\theta}_n^{(k)} = g(\bar{y}_n^{(k)})$, satisfies
$$\hat{V}_{\mathrm{rep}}(\hat{\theta}_n) \cong \left\{g'(\bar{y}_n)\right\}^2 \hat{V}_{\mathrm{rep}}(\bar{y}_n).$$
Thus, the replication variance estimator (28) is asymptotically equivalent to the linearized variance estimator.
Remark

If the parameter of interest, denoted by $\psi$, is estimated by $\hat{\psi}$, obtained by solving an estimating equation $\sum_{i=1}^n U(\psi; y_i) = 0$, then a consistent variance estimator can be obtained by the sandwich formula. The complete-sample variance estimator of $\hat{\psi}$ is
$$\hat{V}(\hat{\psi}) = \hat{\tau}_u^{-1}\, \hat{\Omega}_u\, \hat{\tau}_u^{-1\prime}, \quad (29)$$
where
$$\hat{\tau}_u = n^{-1} \sum_{i=1}^n \dot{U}(\hat{\psi}; y_i), \qquad \hat{\Omega}_u = n^{-1}(n-1)^{-1} \sum_{i=1}^n (\hat{u}_i - \bar{u}_n)^{\otimes 2},$$
$\dot{U}(\psi; y) = \partial U(\psi; y)/\partial \psi'$, $\bar{u}_n = n^{-1} \sum_{i=1}^n \hat{u}_i$, and $\hat{u}_i = U(\hat{\psi}; y_i)$.
If we want to use a replication method of the form (27), we can construct the replication variance estimator of $\hat{\psi}$ as
$$\hat{V}_{\mathrm{rep}}(\hat{\psi}) = \sum_{k=1}^L c_k \left(\hat{\psi}^{(k)} - \hat{\psi}\right)^2, \quad (30)$$
where $\hat{\psi}^{(k)}$ is computed from
$$\hat{U}^{(k)}(\psi) \equiv \sum_{i=1}^n w_i^{(k)} U(\psi; y_i) = 0. \quad (31)$$
A one-step approximation of $\hat{\psi}^{(k)}$ is to use
$$\hat{\psi}_1^{(k)} = \hat{\psi} - \left\{\dot{U}^{(k)}(\hat{\psi})\right\}^{-1} \hat{U}^{(k)}(\hat{\psi}) \quad (32)$$
or, even more simply, to use
$$\hat{\psi}_1^{(k)} = \hat{\psi} - \left\{\dot{U}(\hat{\psi})\right\}^{-1} \hat{U}^{(k)}(\hat{\psi}), \quad (33)$$
where $\dot{U}^{(k)}(\psi) = \partial \hat{U}^{(k)}(\psi)/\partial \psi'$. The replication variance estimator using (33) is algebraically equivalent to
$$\left\{\dot{U}(\hat{\psi})\right\}^{-1} \left[\sum_{k=1}^L c_k \left\{\hat{U}^{(k)}(\hat{\psi}) - \hat{U}(\hat{\psi})\right\}^{\otimes 2}\right] \left\{\dot{U}(\hat{\psi})'\right\}^{-1},$$
which is very close to the sandwich variance formula in (29).
Back to Example 4.1

For the regression imputation in Example 4.1,
$$\hat{\theta}_{Id} = \frac{1}{n} \sum_{i=1}^n \left\{\delta_i y_i + (1 - \delta_i)\left(\hat{\beta}_0 + \hat{\beta}_1 x_i\right)\right\}.$$
The replication variance estimator of $\hat{\theta}_{Id}$ is computed as
$$\hat{V}_{\mathrm{rep}}(\hat{\theta}_{Id}) = \sum_{k=1}^L c_k \left(\hat{\theta}_{Id}^{(k)} - \hat{\theta}_{Id}\right)^2, \quad (34)$$
where
$$\hat{\theta}_{Id}^{(k)} = \sum_{i=1}^n w_i^{(k)} \left\{\delta_i y_i + (1 - \delta_i)\left(\hat{\beta}_0^{(k)} + \hat{\beta}_1^{(k)} x_i\right)\right\}$$
and $(\hat{\beta}_0^{(k)}, \hat{\beta}_1^{(k)})$ is the solution to
$$\sum_{i=1}^n w_i^{(k)} \delta_i (y_i - \beta_0 - \beta_1 x_i)(1, x_i) = (0, 0).$$
Example 4.4

We now return to the setup of an earlier example with binary $y$. In this case, the deterministically imputed estimator of $\theta = E(Y)$ is constructed as
$$\hat{\theta}_{Id} = n^{-1} \sum_{i=1}^n \{\delta_i y_i + (1 - \delta_i)\hat{p}_{0i}\}, \quad (35)$$
where $\hat{p}_{0i}$ is the predictor of $y_i$ given $x_i$ and $\delta_i = 0$. That is,
$$\hat{p}_{0i} = \frac{p(x_i; \hat{\beta})\{1 - \pi(x_i, 1; \hat{\phi})\}}{\{1 - p(x_i; \hat{\beta})\}\{1 - \pi(x_i, 0; \hat{\phi})\} + p(x_i; \hat{\beta})\{1 - \pi(x_i, 1; \hat{\phi})\}},$$
where $\hat{\beta}$ and $\hat{\phi}$ are jointly estimated by the EM algorithm.
Example 4.4 (Cont'd)

E-step:
$$\bar{S}_1\left(\beta \mid \beta^{(t)}, \phi^{(t)}\right) = \sum_{\delta_i = 1} \{y_i - p_i(\beta)\}\, x_i + \sum_{\delta_i = 0} \sum_{j=0}^1 w_{ij(t)}^* \{j - p_i(\beta)\}\, x_i,$$
where
$$w_{ij(t)}^* = \Pr\left(Y_i = j \mid x_i, \delta_i = 0; \beta^{(t)}, \phi^{(t)}\right) = \frac{\Pr\left(Y_i = j \mid x_i; \beta^{(t)}\right) \Pr\left(\delta_i = 0 \mid x_i, j; \phi^{(t)}\right)}{\sum_{y=0}^1 \Pr\left(Y_i = y \mid x_i; \beta^{(t)}\right) \Pr\left(\delta_i = 0 \mid x_i, y; \phi^{(t)}\right)},$$
and
$$\bar{S}_2\left(\phi \mid \beta^{(t)}, \phi^{(t)}\right) = \sum_{\delta_i = 1} \{\delta_i - \pi(x_i, y_i; \phi)\}\,(x_i', y_i)' + \sum_{\delta_i = 0} \sum_{j=0}^1 w_{ij(t)}^* \{\delta_i - \pi(x_i, j; \phi)\}\,(x_i', j)'.$$
Example 4.4 (Cont'd)

M-step: the parameter estimates are updated by solving
$$\left[\bar{S}_1\left(\beta \mid \beta^{(t)}, \phi^{(t)}\right),\; \bar{S}_2\left(\phi \mid \beta^{(t)}, \phi^{(t)}\right)\right] = (0, 0)$$
for $\beta$ and $\phi$.
Example 4.4 (Cont'd)

For replication variance estimation, we can use (34) with
$$\hat{\theta}_{Id}^{(k)} = \sum_{i=1}^n w_i^{(k)} \left\{\delta_i y_i + (1 - \delta_i)\hat{p}_{0i}^{(k)}\right\}, \quad (36)$$
where
$$\hat{p}_{0i}^{(k)} = \frac{p(x_i; \hat{\beta}^{(k)})\{1 - \pi(x_i, 1; \hat{\phi}^{(k)})\}}{\{1 - p(x_i; \hat{\beta}^{(k)})\}\{1 - \pi(x_i, 0; \hat{\phi}^{(k)})\} + p(x_i; \hat{\beta}^{(k)})\{1 - \pi(x_i, 1; \hat{\phi}^{(k)})\}}$$
and $(\hat{\beta}^{(k)}, \hat{\phi}^{(k)})$ is obtained by solving the mean score equations with the weights replaced by the replication weights $w_i^{(k)}$.
Example 4.4 (Cont'd)

That is, $(\hat{\beta}^{(k)}, \hat{\phi}^{(k)})$ is the solution to
$$\bar{S}_1^{(k)}(\beta, \phi) \equiv \sum_{\delta_i = 1} w_i^{(k)} \{y_i - p(x_i; \beta)\}\, x_i + \sum_{\delta_i = 0} w_i^{(k)} \sum_{y=0}^1 w_{iy}^*(\beta, \phi)\{y - p(x_i; \beta)\}\, x_i = 0,$$
$$\bar{S}_2^{(k)}(\beta, \phi) \equiv \sum_{\delta_i = 1} w_i^{(k)} \{\delta_i - \pi(x_i, y_i; \phi)\}\,(x_i', y_i)' + \sum_{\delta_i = 0} w_i^{(k)} \sum_{y=0}^1 w_{iy}^*(\beta, \phi)\{\delta_i - \pi(x_i, y; \phi)\}\,(x_i', y)' = 0,$$
where
$$w_{iy}^*(\beta, \phi) = \frac{\Pr(Y_i = y \mid x_i; \beta)\{1 - \pi(x_i, y; \phi)\}}{\sum_{z=0}^1 \Pr(Y_i = z \mid x_i; \beta)\{1 - \pi(x_i, z; \phi)\}}.$$
6. Fractional Imputation
Monte Carlo EM

Remark:
- Monte Carlo EM can be used as a frequentist approach to imputation.
- Convergence is not guaranteed (for fixed $m$).
- The E-step can be computationally heavy (one may use an MCMC method).
Parametric Fractional Imputation (Kim, 2011)

Parametric fractional imputation:
1. Generate more than one (say $m$) imputed values of $y_{\mathrm{mis},i}$: $y_{\mathrm{mis},i}^{*(1)}, \ldots, y_{\mathrm{mis},i}^{*(m)}$, from some (initial) density $h(y_{\mathrm{mis},i})$.
2. Create a weighted data set
$$\left\{\left(w_{ij}^*, y_{ij}^*\right);\; j = 1, 2, \ldots, m;\; i = 1, 2, \ldots, n\right\},$$
where $\sum_{j=1}^m w_{ij}^* = 1$, $y_{ij}^* = (y_{\mathrm{obs},i}, y_{\mathrm{mis},i}^{*(j)})$, $w_{ij}^* \propto f(y_{ij}^*, \delta_i; \hat{\eta})/h(y_{\mathrm{mis},i}^{*(j)})$, $\hat{\eta}$ is the maximum likelihood estimator of $\eta$, and $f(y, \delta; \eta)$ is the joint density of $(y, \delta)$.
3. The weights $w_{ij}^*$ are the normalized importance weights and can be called fractional weights.

If $y_{\mathrm{mis},i}$ is categorical, then simply use all possible values of $y_{\mathrm{mis},i}$ as the imputed values.
Remark

- Importance sampling idea: for sufficiently large $m$,
$$\sum_{j=1}^m w_{ij}^*\, g\left(y_{ij}^*\right) \cong \frac{\int g(y_i)\, \dfrac{f(y_i, \delta_i; \hat{\eta})}{h(y_{\mathrm{mis},i})}\, h(y_{\mathrm{mis},i})\, dy_{\mathrm{mis},i}}{\int \dfrac{f(y_i, \delta_i; \hat{\eta})}{h(y_{\mathrm{mis},i})}\, h(y_{\mathrm{mis},i})\, dy_{\mathrm{mis},i}} = E\{g(y_i) \mid y_{\mathrm{obs},i}, \delta_i; \hat{\eta}\}$$
for any $g$ such that the expectation exists.
- In the importance sampling literature, $h(\cdot)$ is called the proposal distribution and $f(\cdot)$ is called the target distribution.
- We do not need to compute the conditional distribution $f(y_{\mathrm{mis},i} \mid y_{\mathrm{obs},i}, \delta_i; \eta)$. Only the joint distribution $f(y_{\mathrm{obs},i}, y_{\mathrm{mis},i}, \delta_i; \eta)$ is needed, because
$$\frac{f(y_{\mathrm{obs},i}, y_{\mathrm{mis},i}^{*(j)}, \delta_i; \hat{\eta})/h(y_{\mathrm{mis},i}^{*(j)})}{\sum_{k=1}^m f(y_{\mathrm{obs},i}, y_{\mathrm{mis},i}^{*(k)}, \delta_i; \hat{\eta})/h(y_{\mathrm{mis},i}^{*(k)})} = \frac{f(y_{\mathrm{mis},i}^{*(j)} \mid y_{\mathrm{obs},i}, \delta_i; \hat{\eta})/h(y_{\mathrm{mis},i}^{*(j)})}{\sum_{k=1}^m f(y_{\mathrm{mis},i}^{*(k)} \mid y_{\mathrm{obs},i}, \delta_i; \hat{\eta})/h(y_{\mathrm{mis},i}^{*(k)})}.$$
EM algorithm by fractional imputation

1. Imputation step: generate $y_{i,\mathrm{mis}}^{*(j)} \sim h(y_{i,\mathrm{mis}})$.
2. Weighting step: compute
$$w_{ij(t)}^* \propto f\left(y_{ij}^*, \delta_i; \hat{\eta}_{(t)}\right)/h\left(y_{i,\mathrm{mis}}^{*(j)}\right),$$
where $\sum_{j=1}^m w_{ij(t)}^* = 1$.
3. M-step: update
$$\hat{\eta}_{(t+1)} = \arg\max_{\eta} \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \log f\left(\eta; y_{ij}^*, \delta_i\right).$$
4. (Optional) Check whether $w_{ij(t)}^*$ is too large for some $j$. If so, set $h(y_{i,\mathrm{mis}}) = f(y_{i,\mathrm{mis}} \mid y_{i,\mathrm{obs}}; \hat{\eta}_{(t)})$ and go to Step 1.
5. Repeat Steps 2-4 until convergence.
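The steps above can be sketched for a toy problem: estimating the mean $\mu$ of $N(\mu, 1)$ when 40 of 100 values are missing completely at random. The proposal $h = N(0, 3^2)$ and all numerical constants are assumptions made for illustration; the fixed point should approximate the respondent mean (the MLE here), up to importance-sampling error. Note the draws are made only once, and only the fractional weights change across iterations.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, m = 100, 60, 4000
y_obs = rng.normal(loc=2.0, scale=1.0, size=r)     # respondents

# Step 1 (imputation, done once): draw from proposal h = N(0, 3^2)
y_imp = rng.normal(loc=0.0, scale=3.0, size=(n - r, m))
h_dens = np.exp(-0.5 * (y_imp / 3.0) ** 2) / 3.0   # density up to a shared constant

mu = 0.0                                           # initial value of eta = mu
for _ in range(100):
    # Step 2 (weighting): w*_ij proportional to f(y*_ij; mu) / h(y*_ij), normalized over j
    w = np.exp(-0.5 * (y_imp - mu) ** 2) / h_dens
    w /= w.sum(axis=1, keepdims=True)
    # Step 3 (M-step): weighted complete-data MLE of mu
    mu = (y_obs.sum() + (w * y_imp).sum()) / n

print(mu, y_obs.mean())
```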
Remark

- Imputation step + weighting step = E-step.
- The imputed values are not changed across the EM iterations; only the fractional weights are changed.
  1. Computationally efficient (because we use importance sampling only once).
  2. Convergence is achieved (because the imputed values are not changed); see Theorem 4.5.
- For sufficiently large $t$, $\hat{\eta}_{(t)} \to \hat{\eta}^*$. Also, for sufficiently large $m$, $\hat{\eta}^* \to \hat{\eta}_{\mathrm{MLE}}$.
- For estimation of $\psi$ in $E\{U(\psi; Y)\} = 0$, simply solve
$$\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*\, U(\psi; y_{ij}^*) = 0,$$
where $w_{ij}^* = w_{ij}^*(\hat{\eta})$ and $\hat{\eta}$ is obtained from the above EM algorithm.
Theorem 4.5 (Theorem 1 of Kim (2011))

Let
$$Q^*\left(\eta \mid \hat{\eta}_{(t)}\right) = \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \log f\left(\eta; y_{ij}^*, \delta_i\right).$$
If
$$Q^*\left(\hat{\eta}_{(t+1)} \mid \hat{\eta}_{(t)}\right) \geq Q^*\left(\hat{\eta}_{(t)} \mid \hat{\eta}_{(t)}\right), \quad (37)$$
then
$$l_{\mathrm{obs}}^*\left(\hat{\eta}_{(t+1)}\right) \geq l_{\mathrm{obs}}^*\left(\hat{\eta}_{(t)}\right), \quad (38)$$
where $l_{\mathrm{obs}}^*(\eta) = \sum_{i=1}^n \ln\{f_{\mathrm{obs}(i)}^*(y_{i,\mathrm{obs}}, \delta_i; \eta)\}$ and
$$f_{\mathrm{obs}(i)}^*(y_{i,\mathrm{obs}}, \delta_i; \eta) = \frac{1}{m} \sum_{j=1}^m f\left(y_{ij}^*, \delta_i; \eta\right)/h_m\left(y_{i,\mathrm{mis}}^{*(j)}\right).$$
Proof

By Jensen's inequality,
$$l_{\mathrm{obs}}^*\left(\hat{\eta}_{(t+1)}\right) - l_{\mathrm{obs}}^*\left(\hat{\eta}_{(t)}\right) = \sum_{i=1}^n \ln\left\{\sum_{j=1}^m w_{ij(t)}^* \frac{f(y_{ij}^*, \delta_i; \hat{\eta}_{(t+1)})}{f(y_{ij}^*, \delta_i; \hat{\eta}_{(t)})}\right\} \geq \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \ln\left\{\frac{f(y_{ij}^*, \delta_i; \hat{\eta}_{(t+1)})}{f(y_{ij}^*, \delta_i; \hat{\eta}_{(t)})}\right\} = Q^*\left(\hat{\eta}_{(t+1)} \mid \hat{\eta}_{(t)}\right) - Q^*\left(\hat{\eta}_{(t)} \mid \hat{\eta}_{(t)}\right).$$
Therefore, (37) implies (38).
Example 4.11: Return to Example 3.15

Fractional imputation:
1. Imputation step: generate $y_i^{*(1)}, \ldots, y_i^{*(m)}$ from $f(y_i \mid x_i; \hat{\theta}_{(0)})$.
2. Weighting step: using the $m$ imputed values generated from Step 1, compute the fractional weights as
$$w_{ij(t)}^* \propto \frac{f\left(y_i^{*(j)} \mid x_i; \hat{\theta}_{(t)}\right)\left\{1 - \pi\left(x_i, y_i^{*(j)}; \hat{\phi}_{(t)}\right)\right\}}{f\left(y_i^{*(j)} \mid x_i; \hat{\theta}_{(0)}\right)},$$
where
$$\pi(x_i, y_i; \hat{\phi}) = \frac{\exp\left(\hat{\phi}_0 + \hat{\phi}_1 x_i + \hat{\phi}_2 y_i\right)}{1 + \exp\left(\hat{\phi}_0 + \hat{\phi}_1 x_i + \hat{\phi}_2 y_i\right)}.$$
Example 4.11 (Cont'd)

Using the imputed data and the fractional weights, the M-step can be implemented by solving
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^*\, S\left(\theta; x_i, y_i^{*(j)}\right) = 0$$
and
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \left\{\delta_i - \pi\left(\phi; x_i, y_i^{*(j)}\right)\right\} \left(1, x_i, y_i^{*(j)}\right) = 0, \quad (39)$$
where $S(\theta; x_i, y_i) = \partial \log f(y_i \mid x_i; \theta)/\partial \theta$.
Example 4.12: Back to Example 3.18 (GLMM)

- Level-1 model: $y_{ij} \sim f_1(y_{ij} \mid x_{ij}, a_i; \theta_1)$ for some fixed $\theta_1$, with $a_i$ random.
- Level-2 model: $a_i \sim f_2(a_i; \theta_2)$.
- Latent variable: $a_i$.
- We are interested in generating $a_i$ from
$$p(a_i \mid x_i, y_i; \theta_1, \theta_2) \propto \left\{\prod_{j=1}^{n_i} f_1(y_{ij} \mid x_{ij}, a_i; \theta_1)\right\} f_2(a_i; \theta_2).$$
Example 4.12 (Cont'd)

E-step:
1. Imputation step: generate $a_i^{*(1)}, \ldots, a_i^{*(m)}$ from $f_2(a_i; \hat{\theta}_2^{(t)})$.
2. Weighting step: using the $m$ imputed values generated from Step 1, compute the fractional weights as
$$w_{ij(t)}^* \propto g_1\left(y_i \mid x_i, a_i^{*(j)}; \hat{\theta}_1^{(t)}\right),$$
where $g_1(y_i \mid x_i, a_i; \theta_1) = \prod_{j=1}^{n_i} f_1(y_{ij} \mid x_{ij}, a_i; \theta_1)$.

M-step: update the parameters by solving
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^*\, S_1\left(\theta_1; x_i, y_i, a_i^{*(j)}\right) = 0, \qquad \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^*\, S_2\left(\theta_2; a_i^{*(j)}\right) = 0.$$
Example 4.13: Measurement error model

- We are interested in estimating $\theta$ in $f(y \mid x; \theta)$.
- Instead of observing $x$, we observe $z$, which can be highly correlated with $x$.
- Thus, $z$ is an instrumental variable for $x$:
$$f(y \mid x, z) = f(y \mid x) \quad \text{and} \quad f(y \mid z = a) \neq f(y \mid z = b) \text{ for } a \neq b.$$
- In addition to the original sample, we have a separate calibration sample that observes $(x_i, z_i)$.
Example 4.13 (Cont'd)

Table: Data structure

                     Z   X   Y
Calibration sample   o   o
Original sample      o       o

The goal is to generate $x^*$ in the original sample from
$$f(x_i \mid z_i, y_i) \propto f(x_i \mid z_i)\, f(y_i \mid x_i, z_i) = f(x_i \mid z_i)\, f(y_i \mid x_i).$$
Obtain a consistent estimator $\hat{f}(x \mid z)$ from the calibration sample.

E-step:
1. Generate $x_i^{*(1)}, \ldots, x_i^{*(m)}$ from $\hat{f}(x_i \mid z_i)$.
2. Compute the fractional weights associated with $x_i^{*(j)}$:
$$w_{ij}^* \propto f\left(y_i \mid x_i^{*(j)}; \hat{\theta}\right).$$

M-step: solve the weighted score equation for $\theta$.
Remarks for computation

Recall that, writing
$$\bar{S}^*(\eta \mid \tilde{\eta}) = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\tilde{\eta})\, S\left(\eta; y_{ij}^*, \delta_i\right),$$
where $w_{ij}^*(\eta)$ is the fractional weight associated with $y_{ij}^*$, given by
$$w_{ij}^*(\eta) = \frac{f\left(y_{ij}^*, \delta_i; \eta\right)/h_m\left(y_{i,\mathrm{mis}}^{*(j)}\right)}{\sum_{k=1}^m f\left(y_{ik}^*, \delta_i; \eta\right)/h_m\left(y_{i,\mathrm{mis}}^{*(k)}\right)}, \quad (40)$$
and $S(\eta; y, \delta) = \partial \log f(y, \delta; \eta)/\partial \eta$, the EM algorithm for fractional imputation can be expressed as
$$\hat{\eta}_{(t+1)} \leftarrow \text{solve } \bar{S}^*\left(\eta \mid \hat{\eta}_{(t)}\right) = 0.$$
Remarks for computation

Instead of the EM algorithm, a Newton-type algorithm can also be used. The Newton-type algorithm for computing the MLE from the fractionally imputed data is
$$\hat{\eta}_{(t+1)} = \hat{\eta}_{(t)} + \left\{I_{\mathrm{obs}}^*\left(\hat{\eta}_{(t)}\right)\right\}^{-1} \bar{S}^*\left(\hat{\eta}_{(t)} \mid \hat{\eta}_{(t)}\right),$$
where
$$I_{\mathrm{obs}}^*(\eta) = -\sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\eta)\, \dot{S}\left(\eta; y_{ij}^*, \delta_i\right) - \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\eta)\left\{S\left(\eta; y_{ij}^*, \delta_i\right) - \bar{S}_i^*(\eta)\right\}^{\otimes 2},$$
$\dot{S}(\eta; y, \delta) = \partial S(\eta; y, \delta)/\partial \eta'$, and $\bar{S}_i^*(\eta) = \sum_{j=1}^m w_{ij}^*(\eta)\, S\left(\eta; y_{ij}^*, \delta_i\right)$.
Estimation of a general parameter

- The parameter $\Psi$ is defined through $E\{U(\Psi; Y)\} = 0$.
- The FI estimator of $\Psi$ is computed by solving
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\hat{\eta})\, U\left(\Psi; y_{ij}^*\right) = 0. \quad (41)$$
- Note that $\hat{\eta}$ is the solution to
$$\sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\hat{\eta})\, S\left(\hat{\eta}; y_{ij}^*\right) = 0.$$
Estimation of a general parameter

We can use either the linearization method or a replication method for variance estimation. For the linearization method, using Theorem 4.2, we can use the sandwich formula
$$\hat{V}(\hat{\Psi}) = \hat{\tau}_q^{-1}\, \hat{\Omega}_q\, \hat{\tau}_q^{-1\prime}, \quad (42)$$
where
$$\hat{\tau}_q = n^{-1} \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*\, \dot{U}\left(\hat{\Psi}; y_{ij}^*\right), \qquad \hat{\Omega}_q = n^{-1}(n-1)^{-1} \sum_{i=1}^n \left(\hat{q}_i - \bar{q}_n^*\right)^{\otimes 2},$$
with $\hat{q}_i = \bar{U}_i^* + \hat{\kappa}\, \bar{S}_i^*$, where $(\bar{U}_i^*, \bar{S}_i^*) = \sum_{j=1}^m w_{ij}^*\left(U_{ij}^*, S_{ij}^*\right)$, $U_{ij}^* = U(\hat{\Psi}; y_{ij}^*)$, $S_{ij}^* = S(\hat{\eta}; y_{ij}^*)$, and
$$\hat{\kappa} = \sum_{i=1}^n \sum_{j=1}^m w_{ij}^*(\hat{\eta})\left(U_{ij}^* - \bar{U}_i^*\right) S_{ij}^{*\prime} \left\{I_{\mathrm{obs}}^*(\hat{\eta})\right\}^{-1}.$$
Estimation of a general parameter

For the replication method, we first obtain the $k$-th replicate $\hat{\eta}^{(k)}$ of $\hat{\eta}$ by solving
$$\sum_{i=1}^n w_i^{(k)} \sum_{j=1}^m w_{ij}^*(\eta)\, S\left(\eta; y_{ij}^*\right) = 0.$$
Once $\hat{\eta}^{(k)}$ is obtained, the $k$-th replicate $\hat{\Psi}^{(k)}$ of $\hat{\Psi}$ is obtained by solving
$$\sum_{i=1}^n w_i^{(k)} \sum_{j=1}^m w_{ij}^*\left(\hat{\eta}^{(k)}\right) U\left(\Psi; y_{ij}^*\right) = 0$$
for $\Psi$. The replication variance estimator of $\hat{\Psi}$ from (41) is
$$\hat{V}_{\mathrm{rep}}(\hat{\Psi}) = \sum_{k=1}^L c_k \left(\hat{\Psi}^{(k)} - \hat{\Psi}\right)^2.$$
Example 4.15: Nonparametric Fractional Imputation

- Bivariate data $(x_i, y_i)$: the $x_i$ are completely observed but $y_i$ is subject to missingness.
- The joint distribution of $(x, y)$ is completely unspecified.
- Assume MAR in the sense that $P(\delta = 1 \mid x, y)$ does not depend on $y$.
- Without loss of generality, assume that $\delta_i = 1$ for $i = 1, \ldots, r$ and $\delta_i = 0$ for $i = r+1, \ldots, n$.
- We are only interested in estimating $\theta = E(Y)$.
Example 4.15 (Cont'd)

Let $K_h(x_i, x_j) = K((x_i - x_j)/h)$ be a kernel function with bandwidth $h$ such that
$$K(x) \geq 0, \quad \int K(x)\, dx = 1, \quad \int x K(x)\, dx = 0, \quad \sigma_K^2 \equiv \int x^2 K(x)\, dx > 0.$$
Examples include the following:
- Boxcar kernel: $K(x) = \frac{1}{2} I(x)$
- Gaussian kernel: $K(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right)$
- Epanechnikov kernel: $K(x) = \frac{3}{4}(1 - x^2)\, I(x)$
- Tricube kernel: $K(x) = (1 - |x|^3)^3\, I(x)$
where
$$I(x) = \begin{cases} 1 & \text{if } |x| \leq 1 \\ 0 & \text{if } |x| > 1. \end{cases}$$
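The kernel conditions can be checked numerically with Riemann sums on a fine grid; note that the tricube kernel as listed needs a normalizing constant of 70/81 to integrate to exactly 1, which is included in the sketch below.

```python
import numpy as np

def boxcar(x):       return 0.5 * (np.abs(x) <= 1)
def gaussian(x):     return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
def epanechnikov(x): return 0.75 * (1 - x ** 2) * (np.abs(x) <= 1)
def tricube(x):      return (70 / 81) * (1 - np.abs(x) ** 3) ** 3 * (np.abs(x) <= 1)

grid = np.linspace(-6, 6, 200_001)
dx = grid[1] - grid[0]
for K in (boxcar, gaussian, epanechnikov, tricube):
    mass = np.sum(K(grid)) * dx          # integral of K: should be ~1
    mean = np.sum(grid * K(grid)) * dx   # first moment: should be ~0 by symmetry
    print(K.__name__, round(mass, 5), round(mean, 5))
```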
Example 4.15 (Cont'd)

Nonparametric regression estimator of $m(x) = E(Y \mid x)$:
$$\hat{m}(x) = \sum_{i=1}^r l_i(x)\, y_i, \quad (43)$$
where
$$l_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^r K\left(\frac{x - x_j}{h}\right)}.$$
The estimator in (43) is often called the Nadaraya-Watson kernel estimator. Under some regularity conditions and the optimal choice of $h$ (with $h = O(n^{-1/5})$), it can be shown that
$$E\left[\{\hat{m}(x) - m(x)\}^2\right] = O(n^{-4/5}).$$
Thus, its convergence rate is slower than that of a parametric estimator.
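A minimal Nadaraya-Watson implementation follows; the Gaussian kernel, the noiseless linear trend, and the evaluation point are illustrative assumptions. With a symmetric design around the interior point x = 0.5, the estimate there is exact by symmetry.

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    # Gaussian-kernel Nadaraya-Watson estimator m_hat(x0) = sum_i l_i(x0) y_i;
    # the 1/sqrt(2*pi) constant cancels in the weight ratio
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

x = np.linspace(0.0, 1.0, 201)
y = 2.0 * x                       # noiseless linear trend, so m(x) = 2x
m_hat = nadaraya_watson(0.5, x, y, h=0.05)
print(m_hat)
```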
Example 4.15 (Cont'd)

However, the imputed estimator of $\theta$ using (43) can achieve $\sqrt{n}$-consistency. That is,
$$\hat{\theta}_{NP} = \frac{1}{n}\left\{\sum_{i=1}^r y_i + \sum_{i=r+1}^n \hat{m}(x_i)\right\} \quad (44)$$
achieves
$$\sqrt{n}\left(\hat{\theta}_{NP} - \theta\right) \to N(0, \sigma^2), \quad (45)$$
where $\sigma^2 = E\{v(x)/\pi(x)\} + V\{m(x)\}$, $m(x) = E(y \mid x)$, $v(x) = V(y \mid x)$, and $\pi(x) = E(\delta \mid x)$. Result (45) was first proved by Cheng (1994).
79 Example 4.15 (Cont'd) We can express ˆθ_NP in (44) as a nonparametric fractional imputation (NFI) estimator of the form ˆθ_NFI = n^{−1} { ∑_{i=1}^r y_i + ∑_{i=r+1}^n ∑_{j=1}^r w*_ij y_i^{*(j)} } where w*_ij = l_j(x_i), with l_j(·) defined after (43), and y_i^{*(j)} = y_j. Variance estimation can be implemented by a resampling method, such as the bootstrap. Ch 4 79 / 79
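Putting the pieces together, the following sketch computes the NFI estimator with Gaussian-kernel fractional weights and a naive bootstrap variance. The simulation design (linear m(x), logistic response model) and bandwidth are illustrative assumptions, not part of the text.

```python
import numpy as np

def theta_nfi(x, y, delta, h=0.1):
    """NFI estimate of E(Y): respondents keep y_i; each nonrespondent i
    receives every respondent value y_j as an imputed value y_i^{*(j)} = y_j
    with fractional weight w_ij* = l_j(x_i)."""
    xr, yr = x[delta == 1], y[delta == 1]    # donors (respondents)
    xm = x[delta == 0]                       # recipients (nonrespondents)
    K = np.exp(-0.5 * ((xm[:, None] - xr[None, :]) / h) ** 2)
    w = K / K.sum(axis=1, keepdims=True)     # fractional weights, rows sum to 1
    return (yr.sum() + (w @ yr).sum()) / len(x)

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
y = 1 + x + rng.normal(size=n)               # theta = E(Y) = 1
p = 1 / (1 + np.exp(-x))                     # MAR: response depends on x only
delta = (rng.uniform(size=n) < p).astype(int)

est = theta_nfi(x, y, delta)

# Naive bootstrap variance: resample (x, y, delta) rows and recompute
boot = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    boot.append(theta_nfi(x[idx], y[idx], delta[idx]))
print(est, np.var(boot, ddof=1))
```

Here the respondent mean alone would be biased upward (response probability increases with x, and m(x) is increasing), while the NFI estimate should land near θ = 1 under MAR.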