1 / 33  An Approximated Expectation-Maximization Algorithm for Analysis of Data with Missing Values

Gong Tang
Department of Biostatistics, GSPH, University of Pittsburgh
NISS Workshop on Nonignorable Nonresponse, November 12-13, 2015
2 / 33  Outline

1 Introduction
  Background & current approaches
  Expectation-Maximization (EM) algorithm for regression analysis of data with nonresponse
2 An approximated EM algorithm
  The algorithm
  Application in contingency table analysis
  Application in regression analyses with a continuous outcome
3 Simulation studies
4 Discussion
3 / 33  Regression Analysis of Data with Nonresponse

Consider bivariate data $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ where the $x_i$'s are fully observed and the $y_i$'s are observed only for $i = 1, \ldots, m$. Denote by $R_i$ the missing-data indicator: $R_i = 1$ if $y_i$ is observed and $R_i = 0$ otherwise. Assume that
\[ [y_i \mid x_i] \sim g(y_i \mid x_i; \theta) \propto \exp\{\theta S(x_i, y_i) + a(\theta)\}, \]
\[ \mathrm{pr}[R_i = 1 \mid x_i, y_i] = w(x_i, y_i; \psi), \]
and the parameter of interest is $\theta$.
Goal: to avoid modeling $w(x_i, y_i; \psi)$.
4 / 33  Likelihood-based Inference (I)

The observed data are $D_{\mathrm{obs}} = \{x_i, R_i, R_i y_i;\ i = 1, \ldots, n\}$. When $w(x_i, y_i; \psi)$ is parametrically modeled, the likelihood function is
\[
L(\theta, \psi; D_{\mathrm{obs}}) = \prod_{i=1}^{n} p(R_i, R_i y_i \mid x_i; \theta, \psi)
= \prod_{i=1}^{m} g(y_i \mid x_i; \theta)\, w(x_i, y_i; \psi) \prod_{i=m+1}^{n} \int g(y_i \mid x_i; \theta)\{1 - w(x_i, y_i; \psi)\}\, dy_i.
\]
When $w(x_i, y_i; \psi) = w(x_i; \psi)$, the data are called missing at random (MAR) and $L(\theta, \psi; D_{\mathrm{obs}}) \propto L(\theta; D_{\mathrm{obs}})\, L(\psi; D_{\mathrm{obs}})$ (Rubin, 1976).
5 / 33  Likelihood-based Inference (II)

When the data are MAR and, in addition, $\theta$ and $\psi$ are distinct, modeling $w(x, y; \psi)$ is not necessary for likelihood-based inference on $\theta$.
When the data are not MAR, the missingness has to be modeled and inference on $\theta$ and $\psi$ is made jointly.
Misspecification of the missing-data model $w(x, y; \psi)$ often leads to biased estimates of $\theta$.
6 / 33  A conditional likelihood

Assume that
\[ [y_i \mid x_i] \sim g(y_i \mid x_i; \theta) \quad \text{(a parametric regression)}, \]
\[ \mathrm{pr}[R_i = 1 \mid x_i, y_i] = w(x_i, y_i; \psi) = w(y_i; \psi). \]
Then $[X \mid Y, R = 1] = [X \mid Y, R = 0] = [X \mid Y]$. Consider the following conditional likelihood:
\[
CL(\theta) = \prod_{R_i = 1} p(x_i \mid y_i; \theta, F_X) \quad (F_X \text{ is the CDF of } X)
= \prod_{R_i = 1} \frac{g(y_i \mid x_i; \theta)\, p(x_i)}{p(y_i; \theta, F_X)} \quad \text{(Bayes formula)}
\propto \prod_{R_i = 1} \frac{g(y_i \mid x_i; \theta)}{\int g(y_i \mid x; \theta)\, dF_X(x)}.
\]
Requires knowing $F_X(\cdot)$!
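A minimal numerical sketch of this conditional likelihood, under illustrative assumptions not stated on the slide: $y \mid x \sim N(\theta x, 1)$, $F_X$ known and taken as $N(0,1)$, and the denominator integral approximated by Monte Carlo draws from $F_X$. The sample sizes, grid, and `cond_loglik` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_loglik(theta, x_obs, y_obs, fx_draws):
    # log CL(theta) for y|x ~ N(theta*x, 1): numerator is log g(y_i|x_i; theta)
    # up to an additive constant; the denominator integral over the known F_X
    # is approximated by averaging g(y_i|x; theta) over draws from F_X
    # (the same constant cancels between numerator and denominator).
    num = -0.5 * (y_obs - theta * x_obs) ** 2
    dens = np.exp(-0.5 * (y_obs[:, None] - theta * fx_draws[None, :]) ** 2)
    return float(np.sum(num - np.log(dens.mean(axis=1))))

# illustrative respondent data; F_X taken as the known N(0,1)
x_obs = rng.normal(size=200)
y_obs = 1.5 * x_obs + rng.normal(size=200)
fx_draws = rng.normal(size=5000)

# maximize CL(theta) over a grid; the maximizer should sit near the truth 1.5
grid = np.linspace(0.5, 2.5, 81)
theta_hat = grid[np.argmax([cond_loglik(t, x_obs, y_obs, fx_draws) for t in grid])]
```

A grid search keeps the sketch short; in practice one would use a proper optimizer.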
7 / 33  A pseudolikelihood method (Tang, Little & Raghunathan, 2003)

Alternatively, we may either model $X \sim f(x; \alpha)$, obtain $\hat{\alpha} = \arg\max_{\alpha} \prod_{i=1}^{n} f(x_i; \alpha)$, and consider the pseudolikelihood function
\[
PL_1(\theta) = \prod_{R_i = 1} \frac{g(y_i \mid x_i; \theta)}{\int g(y_i \mid x; \theta)\, dF_X(x; \hat{\alpha})},
\]
or substitute the empirical distribution $F_n(\cdot)$ for $F_X(\cdot)$:
\[
PL_2(\theta) = \prod_{R_i = 1} \frac{p(y_i \mid x_i; \theta)}{\int p(y_i \mid x; \theta)\, dF_n(x)}
= \prod_{R_i = 1} \frac{p(y_i \mid x_i; \theta)}{\frac{1}{n} \sum_{j=1}^{n} p(y_i \mid x_j; \theta)}.
\]
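The $PL_2$ form is convenient to code because the denominator is just an average over all $n$ observed covariates. A sketch under assumed specifics (model $y \mid x \sim N(\theta x, 1)$, a hypothetical logistic response depending on $y$ only, as the method requires; sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def pl2_loglik(theta, x_all, x_obs, y_obs):
    # log PL_2(theta) for y|x ~ N(theta*x, 1): the integral over F_X is
    # replaced by the empirical average over all n observed covariates,
    # so neither a model for X nor knowledge of F_X is needed.
    num = -0.5 * (y_obs - theta * x_obs) ** 2
    dens = np.exp(-0.5 * (y_obs[:, None] - theta * x_all[None, :]) ** 2)
    return float(np.sum(num - np.log(dens.mean(axis=1))))

# nonignorable response that depends on y only (illustrative mechanism)
n = 1000
x = rng.normal(size=n)
y = x + rng.normal(size=n)                       # true theta = 1
r = rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 + y)))

grid = np.linspace(0.0, 2.0, 81)
theta_hat = grid[np.argmax([pl2_loglik(t, x, x[r], y[r]) for t in grid])]
```

Even though the response probability depends on the outcome, maximizing $PL_2$ recovers the slope, which is the point of the method.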
8 / 33  Exponential tilting (Kim & Yu, 2011)

Consider the following semiparametric logistic regression model:
\[ \mathrm{pr}[R_i = 1 \mid x_i, y_i] = \mathrm{logit}^{-1}\{h(x_i) - \psi y_i\}, \]
where $h(\cdot)$ is unspecified and $\psi$ is either known or estimated from an external dataset with the missing values recovered. It follows that
\[ p(y \mid x, r = 0) = p(y \mid x, r = 1)\, \frac{\exp(\psi y)}{E[\exp(\psi Y) \mid x, r = 1]}. \]
With both $p(y \mid x, r = 1)$ and $E[\exp(\psi Y) \mid x, r = 1]$ estimated empirically, one obtains a density estimator for $p(y \mid x, r = 0)$ and estimates $\theta = E[H(X, Y)]$ via
\[ \hat{\theta} = \frac{1}{n} \Big\{ \sum_{r_i = 1} H(x_i, y_i) + \sum_{r_i = 0} \hat{E}[H(x_i, Y) \mid x_i, r_i = 0] \Big\}. \]
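The tilting identity can be checked numerically: reweighting respondents by $\exp(\psi y)$ should reproduce nonrespondent moments. A sketch with a single $x$-cell, so the conditioning on $x$ is implicit (all values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

def tilted_mean(y_resp, psi):
    # E[Y | r=0] via exponential tilting: reweight respondents' y-values
    # by exp(psi*y); within one x-cell this is the Kim-Yu density ratio
    # applied to the mean functional.
    w = np.exp(psi * y_resp)
    return float(np.sum(w * y_resp) / np.sum(w))

# simulate with logit pr[r=1|y] = h - psi*y, matching the tilting identity
n, psi, h = 200_000, 0.7, 0.3
y = rng.normal(size=n)
r = rng.uniform(size=n) < 1 / (1 + np.exp(-(h - psi * y)))
est = tilted_mean(y[r], psi)    # tilted respondent mean
direct = y[~r].mean()           # direct nonrespondent mean (oracle check)
```

Here `direct` is available only because the simulation generates the "missing" values; in a real analysis only `est` is computable.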
9 / 33  An instrumental variable approach (Wang, Shao & Kim, 2014)

Assume that $X = (Z, U)$ and
\[ E(Y \mid X) = \beta_0 + \beta_1 u + \beta_2 z, \]
\[ \mathrm{pr}(R = 1 \mid Z, U, Y) = \mathrm{pr}(R = 1 \mid U, Y) = \pi(\psi; U, Y). \]
One may use the following estimating equations to estimate $\psi$:
\[ \sum_{i=1}^{n} \Big\{ \frac{r_i}{\pi(\psi; u_i, y_i)} - 1 \Big\} (1, u_i, z_i)^{T} = 0. \]
Then estimate the $\beta$'s via
\[ \sum_{i=1}^{n} \frac{r_i}{\pi(\hat{\psi}; u_i, y_i)} (y_i - \beta_0 - \beta_1 u_i - \beta_2 z_i)(1, u_i, z_i)^{T} = 0. \]
10 / 33  The likelihood function

From the following representation:
\[
L(\theta, \psi; D_{\mathrm{obs}}) = \prod_{i=1}^{n} p(R_i, R_i y_i \mid x_i; \theta, \psi)
= \prod_{i=1}^{n} \frac{p(R_i, R_i y_i, (1 - R_i) y_i \mid x_i; \theta, \psi)}{p((1 - R_i) y_i \mid R_i, R_i y_i, x_i; \theta, \psi)}
= \prod_{i=1}^{n} \frac{p(R_i, y_i \mid x_i; \theta, \psi)}{p((1 - R_i) y_i \mid R_i, R_i y_i, x_i; \theta, \psi)}
= \prod_{i=1}^{n} \frac{p(y_i \mid x_i; \theta)\, p(R_i \mid x_i, y_i; \psi)}{p((1 - R_i) y_i \mid R_i, R_i y_i, x_i; \theta, \psi)},
\]
11 / 33  The log-likelihood function

we have
\[
\ell(\theta, \psi; y_{\mathrm{obs}}) = \log L(\theta, \psi; y_{\mathrm{obs}})
= \sum_{i=1}^{n} \{\log p(y_i \mid x_i; \theta) + \log p(R_i \mid x_i, y_i; \psi)\}
- \sum_{i=1}^{n} \log p((1 - R_i) y_i \mid R_i, R_i y_i, x_i; \theta, \psi).
\]
Let
\[
Q(\theta, \psi; \tilde{\theta}, \tilde{\psi}) = \sum_{i=1}^{n} E[\log p(y_i \mid x_i; \theta) + \log p(R_i \mid x_i, y_i; \psi) \mid R_i, R_i y_i, x_i; \tilde{\theta}, \tilde{\psi}],
\]
\[
H(\theta, \psi; \tilde{\theta}, \tilde{\psi}) = \sum_{i=1}^{n} E[\log p((1 - R_i) y_i \mid R_i, R_i y_i, x_i; \theta, \psi) \mid R_i, R_i y_i, x_i; \tilde{\theta}, \tilde{\psi}].
\]
Then $\ell(\theta, \psi; y_{\mathrm{obs}}) = Q(\theta, \psi; \tilde{\theta}, \tilde{\psi}) - H(\theta, \psi; \tilde{\theta}, \tilde{\psi})$, for any $(\tilde{\theta}, \tilde{\psi})$.
12 / 33  The Expectation-Maximization (EM) Algorithm (Dempster, Laird and Rubin, 1977)

By Jensen's inequality, $H(\theta, \psi; \tilde{\theta}, \tilde{\psi}) \le H(\tilde{\theta}, \tilde{\psi}; \tilde{\theta}, \tilde{\psi})$.
The EM algorithm is an iterative algorithm that finds a sequence $\{(\theta^{(t)}, \psi^{(t)}),\ t = 0, 1, 2, \ldots\}$ such that
\[ (\theta^{(t+1)}, \psi^{(t+1)}) = \arg\max_{(\theta, \psi)} Q(\theta, \psi; \theta^{(t)}, \psi^{(t)}). \]
This assures that $\ell(\theta^{(t)}, \psi^{(t)}; y_{\mathrm{obs}}) \le \ell(\theta^{(t+1)}, \psi^{(t+1)}; y_{\mathrm{obs}})$.
Typically, logistic or probit regression models are used for $w(x, y; \psi)$. Numerical integration or Monte Carlo methods are often necessary in the implementation.
13 / 33  When $\psi = \psi_0$ is known

Now consider the case where the true value of $\psi$, $\psi_0$, is known:
\[
\ell(\theta, \psi_0; y_{\mathrm{obs}}) = \sum_{i=1}^{n} \{\log p(y_i \mid x_i; \theta) + \log p(R_i \mid x_i, y_i; \psi_0) - \log p((1 - R_i) y_i \mid D_{i,\mathrm{obs}}; \theta, \psi_0)\}.
\]
With the current estimate $\theta^{(t)}$, the Q-function becomes
\[
Q(\theta; \theta^{(t)}) = \sum_{i=1}^{n} E[\log p(y_i \mid x_i; \theta) + \log p(R_i \mid x_i, y_i; \psi_0) \mid R_i, R_i y_i, x_i; \theta^{(t)}, \psi_0],
\]
which, up to a term not involving $\theta$, equals $\sum_{i=1}^{n} E[\log p(y_i \mid x_i; \theta) \mid R_i, R_i y_i, x_i; \theta^{(t)}, \psi_0]$, because the second term does not involve $\theta$.
14 / 33  Another view of the likelihood

Without loss of generality, assume that the regression model follows a canonical exponential family. Then
\[
Q(\theta; \theta^{(t)}) = \sum_{i=1}^{m} \log g(y_i \mid x_i; \theta) + \sum_{i=m+1}^{n} E[\log g(y_i \mid x_i; \theta) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0]
= \sum_{i=1}^{m} \log g(y_i \mid x_i; \theta) + \sum_{i=m+1}^{n} \{\theta\, E[S(x_i, y_i) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0] + a(\theta)\},
\]
so we need to obtain $E[S(x_i, y_i) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0]$ in order to carry out the EM algorithm. With $w(x_i, y_i; \psi_0)$ known,
\[
E[S(x_i, y_i) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0] = \frac{\int S(x_i, y_i)\, g(y_i \mid x_i; \theta^{(t)})\{1 - w(x_i, y_i; \psi_0)\}\, dy_i}{\int g(y_i \mid x_i; \theta^{(t)})\{1 - w(x_i, y_i; \psi_0)\}\, dy_i}.
\]
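The slides leave $w$ generic. A quadrature sketch of this ratio of integrals under assumed specifics: $S(x, y) = y$, $y \mid x \sim N(\theta^{(t)} x, 1)$, and a hypothetical logistic response model $w(y) = \mathrm{logit}^{-1}(a + by)$ with $\psi_0 = (a, b)$:

```python
import numpy as np

def estep_mean(x, theta_t, psi0):
    # E[Y | x, R=0; theta_t, psi0] for y|x ~ N(theta_t * x, 1), computed as a
    # ratio of Riemann sums over a fine grid; the Gaussian normalizing
    # constant cancels between numerator and denominator.
    a, b = psi0
    mu = theta_t * x
    y, dy = np.linspace(mu - 8.0, mu + 8.0, 4001, retstep=True)
    g = np.exp(-0.5 * (y - mu) ** 2)                 # N(mu, 1), up to a constant
    one_minus_w = 1 - 1 / (1 + np.exp(-(a + b * y)))  # 1 - w(y; psi0)
    num = np.sum(y * g * one_minus_w) * dy
    den = np.sum(g * one_minus_w) * dy
    return float(num / den)
```

Two sanity checks follow from the formula itself: if $b = 0$ the constant $1 - w$ cancels and the conditional mean is just $\mu$; if the response probability increases in $y$, nonrespondents concentrate below $\mu$, and conversely.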
15 / 33  An alternative look at the E-step

On the other hand,
\[
E[S(x_i, y_i) \mid x_i; \theta^{(t)}] = E\big[E[S(x_i, y_i) \mid R_i, x_i] \mid x_i\big]
= E[S(x_i, y_i) \mid x_i, R_i = 1; \theta^{(t)}, \psi_0]\, \mathrm{pr}[R_i = 1 \mid x_i; \theta^{(t)}, \psi_0]
+ E[S(x_i, y_i) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0]\, \mathrm{pr}[R_i = 0 \mid x_i; \theta^{(t)}, \psi_0].
\]
We would therefore have
\[
E[S(x_i, y_i) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0]
= \frac{E[S(x_i, y_i) \mid x_i; \theta^{(t)}] - E[S(x_i, y_i) \mid x_i, R_i = 1; \theta^{(t)}, \psi_0]\, \mathrm{pr}[R_i = 1 \mid x_i; \theta^{(t)}, \psi_0]}{\mathrm{pr}[R_i = 0 \mid x_i; \theta^{(t)}, \psi_0]}.
\]
16 / 33  Empirical replacements

Note that if we replace $E[S(x_i, y_i) \mid x_i, R_i = 1; \theta^{(t)}, \psi_0]$ and $\mathrm{pr}[R_i = 1 \mid x_i; \theta^{(t)}, \psi_0]$ by empirical estimates, we can carry out the following iterative algorithm.
At the E-step:
\[
\hat{E}[S(x_i, y_i) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0]
= \frac{E[S(x_i, y_i) \mid x_i; \theta^{(t)}] - \hat{E}[S(x_i, y_i) \mid x_i, R_i = 1]\, \hat{\mathrm{pr}}[R_i = 1 \mid x_i]}{1 - \hat{\mathrm{pr}}[R_i = 1 \mid x_i]}.
\]
At the M-step, find $\theta = \theta^{(t+1)}$ that solves
\[
\theta^{(t+1)} = \arg\max_{\theta} Q^{*}(\theta; \theta^{(t)})
:= \arg\max_{\theta} \Big[ \theta \Big\{ \sum_{i=1}^{m} S(x_i, y_i) + \sum_{i=m+1}^{n} \hat{E}[S(x_i, Y) \mid x_i, R_i = 0; \theta^{(t)}] \Big\} + n\, a(\theta) \Big].
\]
17 / 33  An important observation

In a usual EM algorithm, $\theta^{(t)}$ may be far away from the truth $\theta_0$. However, in order for the empirical estimates $\hat{\mathrm{pr}}[R_i = 1 \mid x_i]$ and $\hat{E}[S(x_i, y_i) \mid x_i, R_i = 1]$ to be self-consistent with the corresponding terms under $(\theta^{(t)}, \psi_0)$, $\theta^{(t)}$ must be a consistent estimate of $\theta$. Therefore we should start with a consistent initial estimate of $\theta$ and carry out this iterative algorithm to obtain another consistent estimate of $\theta$ at convergence.
18 / 33  A short summary of the modified EM

1 Choose the initial value $\theta^{(0)}$ from either a complete dataset from subjects with similar characteristics or a completed subset recovered from recalls.
2 Compute the Nadaraya-Watson estimates or local polynomial estimates of $\mathrm{pr}[R = 1 \mid x_i]$ and $E[S(x_i, Y) \mid x_i, R = 1]$, for each $i = m + 1, m + 2, \ldots, n$.
3 At the E-step of the $t$-th iteration, calculate $\hat{E}[S(x_i, y_i) \mid x_i, R_i = 0; \theta^{(t)}, \psi_0]$.
4 At the M-step of the $t$-th iteration, find $\theta = \theta^{(t+1)}$ that solves
\[
\sum_{i=1}^{n} E[S(x_i, y_i) \mid x_i; \theta] = \sum_{i=1}^{m} S(x_i, y_i) + \sum_{i=m+1}^{n} \hat{E}[S(x_i, Y) \mid x_i, R_i = 0; \theta^{(t)}].
\]
With a complete external dataset $\{x_i, y_i;\ i = n + 1, \ldots, n + n_E\}$, we would instead solve $\theta = \theta^{(t+1)}$ from
\[
\sum_{i=1}^{n} E[S(x_i, y_i) \mid x_i; \theta] + \sum_{i=n+1}^{n+n_E} E[S(x_i, y_i) \mid x_i; \theta]
= \sum_{i=1}^{m} S(x_i, y_i) + \sum_{i=m+1}^{n} \hat{E}[S(x_i, Y) \mid x_i, R_i = 0; \theta^{(t)}] + \sum_{i=n+1}^{n+n_E} S(x_i, y_i).
\]
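The four steps above can be sketched end-to-end for a toy version of the problem. This is a minimal illustration under stated assumptions, not the talk's implementation: $y \mid x \sim N(\beta_0 + \beta_1 x, 1)$ with $\sigma^2$ fixed at 1, a Gaussian kernel (in place of the Epanechnikov kernel used in the talk, for numerical stability at sparse $x$), a hypothetical logistic response model only to generate data, and the `approx_em` helper and all constants are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)

def approx_em(x, y, r, x_ext, y_ext, h=0.4, n_iter=5):
    # Step 1: initial value theta^(0) = OLS on the external complete data.
    X_ext = np.column_stack([np.ones_like(x_ext), x_ext])
    beta = np.linalg.lstsq(X_ext, y_ext, rcond=None)[0]

    # Step 2: kernel estimates of pr[R=1|x] and E[Y|x, R=1] at each
    # nonrespondent's x (Gaussian kernel; respondents only for the mean).
    x_mis = x[~r]
    K = np.exp(-0.5 * ((x_mis[:, None] - x[None, :]) / h) ** 2)
    p_hat = np.clip((K * r).sum(axis=1) / K.sum(axis=1), 0.1, 0.9)
    m1_hat = (K * np.where(r, y, 0.0)).sum(axis=1) / (K * r).sum(axis=1)

    X = np.column_stack([np.ones_like(x), x])
    y_work = np.where(r, y, 0.0)
    for _ in range(n_iter):
        # Step 3 (E-step): empirical replacement for E[Y | x_i, R_i=0].
        mu_mis = beta[0] + beta[1] * x_mis
        y_work[~r] = (mu_mis - m1_hat * p_hat) / (1 - p_hat)
        # Step 4 (M-step): for the linear mean, matching the sufficient
        # statistics reduces to least squares on the filled-in outcomes.
        beta = np.linalg.lstsq(X, y_work, rcond=None)[0]
    return beta

# illustrative NMAR data: response probability decreasing in y
n, n_ext = 2000, 500
x_all = rng.normal(size=n + n_ext)
y_all = 1.0 + 1.0 * x_all + rng.normal(size=n + n_ext)   # truth (1, 1)
x, y = x_all[:n], y_all[:n]
x_ext, y_ext = x_all[n:], y_all[n:]
r = rng.uniform(size=n) < 1 / (1 + np.exp(-(1.5 - y)))

b0, b1 = approx_em(x, y, r, x_ext, y_ext)
```

Consistent with the "important observation" slide, the iteration starts from a consistent estimate (the external-data OLS) and refines it; it is not expected to recover from a badly biased start.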
19 / 33

Here we design an iterative algorithm with a sequence $\{\theta^{(t)},\ t = 0, 1, 2, \ldots\}$ such that
\[
\theta^{(t+1)} = \arg\max_{\theta} Q^{*}(\theta; \theta^{(t)}) = \arg\max_{\theta} \{Q(\theta; \theta^{(t)}) + o(1)\}.
\]
This algorithm will yield $\ell(\theta^{(t+1)}, \psi_0; y_{\mathrm{obs}}) \ge \ell(\theta^{(t)}, \psi_0; y_{\mathrm{obs}}) + o(1)$.
20 / 33  An example in contingency table analysis

Consider discrete data $\{(x_i, y_i, z_i),\ i = 1, \ldots, n\}$: the $x_i$'s and $y_i$'s are fully observed; $z_i$ is observed for $i = 1, \ldots, c$ and missing for $i = c + 1, \ldots, n$. Let $m = n - c$.
Essentially, the observed data comprise the fully classified table $\{c_{jkl}\}$ and the partially classified table $\{m_{jk+}\}$.
Complete external data $D_E = \{(x_i, y_i, z_i),\ i = n + 1, \ldots, n + n_E\}$ are fully observed.
A log-linear model is assumed with parameter $\theta$, which leads to
\[ \pi_{jkl} = \mathrm{pr}[X = j, Y = k, Z = l], \quad j = 1, \ldots, J;\ k = 1, \ldots, K;\ l = 1, \ldots, L. \]
Initial estimates $\{\pi^{(0)}_{jkl}\}$ are derived from the external complete data $D_E$.
21 / 33  Implementation under the discrete setting

Initial estimates:
\[
\hat{\mathrm{pr}}[Z = l \mid x = j, y = k, R = 1] = \frac{\sum_{i=1}^{c} I\{x_i = j, y_i = k, z_i = l\}}{\sum_{i=1}^{c} I\{x_i = j, y_i = k\}} = \frac{c_{jkl}}{c_{jk+}},
\]
\[
\hat{\mathrm{pr}}[R = 1 \mid x = j, y = k] = \frac{\sum_{i=1}^{c} I\{x_i = j, y_i = k\}}{\sum_{i=1}^{n} I\{x_i = j, y_i = k\}} = \frac{c_{jk+}}{n_{jk+}}.
\]
At the $t$-th step, for each $(j, k, l)$, we update
\[
S_{jkl} = c_{jkl} + \sum_{i=c+1}^{n} I\{x_i = j, y_i = k\}\, \hat{\mathrm{pr}}[Z = l \mid x = j, y = k, R = 0]
= c_{jkl} + m_{jk+}\, \frac{\mathrm{pr}[Z = l \mid x = j, y = k; \theta^{(t)}] - \frac{c_{jkl}}{c_{jk+}} \cdot \frac{c_{jk+}}{n_{jk+}}}{1 - \frac{c_{jk+}}{n_{jk+}}}
= c_{jkl} + m_{jk+} \Big( \frac{\pi^{(t)}_{jkl}}{\pi^{(t)}_{jk+}} - \frac{c_{jkl}}{n_{jk+}} \Big) \Big/ \frac{m_{jk+}}{n_{jk+}}
= n_{jk+}\, \frac{\pi^{(t)}_{jkl}}{\pi^{(t)}_{jk+}}.
\]
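The algebraic collapse of the long empirical-replacement form to $S_{jkl} = n_{jk+}\, \pi^{(t)}_{jkl} / \pi^{(t)}_{jk+}$ can be verified numerically on a random table; the `estep_counts` helper and the table dimensions are invented for the check:

```python
import numpy as np

rng = np.random.default_rng(3)

def estep_counts(c_jkl, m_jk, pi_t):
    # E-step update of the cell counts for the partially classified table:
    # compute both the empirical-replacement form and its closed-form
    # simplification S_jkl = n_jk+ * pi_jkl / pi_jk+; they should agree.
    c_jk = c_jkl.sum(axis=2, keepdims=True)          # c_{jk+}
    n_jk = c_jk + m_jk[..., None]                    # n_{jk+} = c_{jk+} + m_{jk+}
    pi_jk = pi_t.sum(axis=2, keepdims=True)          # pi^(t)_{jk+}
    pr_mis = (pi_t / pi_jk - (c_jkl / c_jk) * (c_jk / n_jk)) / (1 - c_jk / n_jk)
    long_form = c_jkl + m_jk[..., None] * pr_mis
    short_form = n_jk * pi_t / pi_jk
    return long_form, short_form

# random 2 x 3 x 2 table with positive counts and a normalized pi^(t)
c = rng.integers(5, 30, size=(2, 3, 2)).astype(float)
m = rng.integers(3, 15, size=(2, 3)).astype(float)
pi = rng.uniform(0.5, 2.0, size=(2, 3, 2))
pi /= pi.sum()
long_form, short_form = estep_counts(c, m, pi)
```

A side effect worth noting: summing the update over $l$ returns exactly $n_{jk+}$, i.e., the E-step redistributes each $(j,k)$ margin across $l$ according to the current model, which is the "imputing all $z_i$'s" point of the next slide.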
22 / 33  Impression on the discrete setting

In the E-step, we end up imputing all the $z_i$'s based on $(x_i, y_i)$, including the observed ones.
If we include the external data $D_E$ in the algorithm and update the sufficient statistics through
\[ S^{*}_{jkl} = n^{E}_{jkl} + S_{jkl} = n^{E}_{jkl} + n_{jk+}\, \frac{\pi^{(t)}_{jkl}}{\pi^{(t)}_{jk+}}, \]
the modified EM algorithm is equivalent to running a regular EM algorithm on the fully classified table $\{n^{E}_{jkl}\}$ and the partially classified table $\{n_{jk+}\}$; that is, the observed $z_i$'s among the complete cases (which are not part of the external complete data) are effectively discarded.
23 / 33  Implementation under the continuous setting

Consider the regression analysis of $[Y \mid X]$ where $Y$ is continuous and subject to nonresponse.
If $X$ is discrete, the empirical approximations are
\[
\hat{E}[S(x_i, Y) \mid x_i = k, r_i = 1] = \frac{\sum_{j:\ r_j = 1,\ x_j = x_i = k} S(x_i, y_j)}{\#\{j : r_j = 1, x_j = x_i = k\}},
\]
\[
\hat{\mathrm{pr}}[r_i = 1 \mid x_i = k] = \frac{\#\{j : r_j = 1, x_j = x_i = k\}}{\#\{j : x_j = x_i = k\}}.
\]
If $X$ is continuous, we use either the Nadaraya-Watson estimate or a local polynomial estimate for $\hat{E}[S(x_i, Y) \mid x_i = k, r_i = 1]$, and a kernel estimator for $\hat{\mathrm{pr}}[r_i = 1 \mid x_i = k]$.
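For continuous $X$, both nonparametric pieces come from the same kernel weights. A sketch with an Epanechnikov kernel and $S(x, y) = y$; the data-generating model and bandwidth are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def nw_at(x0, x, r, y, h=0.3):
    # Kernel estimates at a point x0 with an Epanechnikov kernel:
    #   p_hat ~ pr[r=1 | x0]   (kernel-smoothed response rate, all subjects)
    #   m_hat ~ E[Y | x0, r=1] (Nadaraya-Watson regression, respondents only;
    #                           nonrespondents' y are masked, never used)
    u = (x - x0) / h
    k = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
    p_hat = np.sum(k * r) / np.sum(k)
    kr = k * r
    m_hat = np.sum(kr * np.where(r, y, 0.0)) / np.sum(kr)
    return float(p_hat), float(m_hat)

# sanity check on data where the truth at x0 = 0 is known:
# E[Y|x=0] = sin(0) = 0 and pr[r=1] = 0.7 (MCAR here, just to test)
n = 5000
x = rng.uniform(-1, 1, size=n)
y = np.sin(x) + 0.3 * rng.normal(size=n)
r = rng.uniform(size=n) < 0.7
p_hat, m_hat = nw_at(0.0, x, r, y)
```

In the algorithm these two quantities are computed once, at each nonrespondent's $x_i$, before iterating (step 2 of the summary).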
24 / 33  Simulation settings when data are NMAR

We simulated bivariate data $\{(x_i, y_i),\ i = 1, 2, \ldots, N\}$ as follows:
(i) $x_i \sim N(0, 1)$ or $x_i \sim \mathrm{Bin}(5, 0.3)$;
(ii) $[y_i \mid x_i] \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, where $\theta = (\beta_0, \beta_1, \sigma^2) = (1, 1, 1)$.
Keep $n_E = 200$ of the above observations as the external dataset for obtaining the initial value of $\theta$.
Simulate missing $y_i$'s for the remaining $n = 1000$ subjects with
\[ \mathrm{pr}[R_i = 1 \mid x_i, y_i] = \Phi(\psi_0 + \psi_1 x_i + \psi_2 y_i), \]
where $\Phi(\cdot)$ is the CDF of the standard Gaussian distribution.
Compare the complete-case estimate ($\hat{\theta}_{cc}$) based on the 1000 observations with missing data, the MLE from the external data ($\hat{\theta}_E$), and the proposed approximated EM estimates ($\hat{\theta}_{AEM}$).
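One replicate of this design can be generated without evaluating $\Phi$ explicitly, via the latent probit representation $R_i = 1\{\varepsilon_i < \psi_0 + \psi_1 x_i + \psi_2 y_i\}$, $\varepsilon_i \sim N(0,1)$. The $\psi$ values below are illustrative; the slides do not report the ones used:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_nmar(n=1000, n_ext=200, beta=(1.0, 1.0), sigma2=1.0,
                  psi=(0.5, 0.2, -0.5)):
    # x ~ N(0,1), y|x ~ N(beta0 + beta1*x, sigma2), probit response
    # pr[R=1|x,y] = Phi(psi0 + psi1*x + psi2*y), plus an external complete
    # dataset of size n_ext for the initial estimate.
    N = n + n_ext
    x = rng.normal(size=N)
    y = beta[0] + beta[1] * x + np.sqrt(sigma2) * rng.normal(size=N)
    x_ext, y_ext = x[n:], y[n:]          # external complete data
    x, y = x[:n], y[:n]
    lin = psi[0] + psi[1] * x + psi[2] * y
    r = rng.normal(size=n) < lin         # equivalent to pr[R_i=1] = Phi(lin_i)
    return x, y, r, x_ext, y_ext

x, y, r, x_ext, y_ext = simulate_nmar()
```

With $\psi_2 < 0$, large outcomes are less likely to be observed, so the complete-case mean of $y$ is biased downward, which is what drives the poor complete-case coverage in the next slide.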
25 / 33  Simulation results with $X \sim N(0, 1)$

Coverage of the 95% CI ($\beta_0$, $\beta_1$, $\sigma^2$):
Complete-case analysis: 0%, 8.8%, 79.5%
MLEs from external subset: 96.1%, 93.8%, 92.6%
Approximated EM**: 95.0%, 95.2%, 93.4%

The proportion of nonresponse was about 40%.
** The Epanechnikov kernel was used in the approximated EM. Bootstrap was used to derive the standard errors of $\hat{\theta}_{AEM}$.
26 / 33  Simulation results with $X \sim \mathrm{Bin}(5, 0.3)$

Coverage of the 95% CI ($\beta_0$, $\beta_1$, $\sigma^2$):
Complete-case analysis: 94.6%, 11.6%, 86.5%
MLEs from external subset: 94.7%, 94.2%, 92.5%
Approximated EM**: 94.7%, 93.9%, 93.6%

The proportion of nonresponse was about 40%.
** The empirical averages were used in the approximated EM. Bootstrap was used to derive the standard errors of $\hat{\theta}_{AEM}$.
27 / 33  Simulation settings when data are MAR

We simulated bivariate data $\{(x_i, y_i),\ i = 1, 2, \ldots, N\}$ as follows:
(i) $x_i \sim N(0, 1)$;
(ii) $[y_i \mid x_i] \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, where $\theta = (\beta_0, \beta_1, \sigma^2) = (1, 1, 1)$.
Keep $n_E = 200$ of the above observations as the external dataset for obtaining the initial value of $\theta$.
Simulate missing $y_i$'s for the remaining $n = 1000$ subjects with
\[ \mathrm{pr}[R_i = 1 \mid x_i, y_i] = \Phi(\psi_0 + \psi_1 x_i), \]
where $\Phi(\cdot)$ is the CDF of the standard Gaussian distribution.
Compare the complete-case estimate ($\hat{\theta}_{cc}$) based on the 1000 observations with missing data, the MLE from the external data ($\hat{\theta}_E$), and the proposed approximated EM estimates ($\hat{\theta}_{AEM}$).
28 / 33  Simulation results with $X \sim N(0, 1)$

Coverage of the 95% CI ($\beta_0$, $\beta_1$, $\sigma^2$):
Complete-case analysis: 94%, 94.5%, 93.4%
MLEs from external subset: 96.1%, 93.8%, 92.6%
Approximated EM**: 93.9%, 94.5%, 93.7%

The proportion of nonresponse was about 40%.
** The Epanechnikov kernel was used in the modified EM. Bootstrap was used to derive the standard errors of $\hat{\theta}_{AEM}$.
29 / 33  Without an external complete dataset...

If one can obtain a reliable initial estimate $\theta^{(0)}$ and an associated variance-covariance matrix estimate $V$, then we can use it as the initial value and perform the M-steps via
\[
\theta^{(t+1)} = \arg\max_{\theta} \Big\{ -(\theta - \theta^{(0)})^{T} V^{-1} (\theta - \theta^{(0)}) + \frac{1}{n} Q^{*}(\theta; \theta^{(t)}) \Big\}.
\]
30 / 33  Without an external complete dataset...

Consider missing-data mechanisms such as
\[ \mathrm{pr}[R_i = 1 \mid x_i, y_i] = w(x_i, y_i; \psi) = w(y_i; \psi). \]
There are two approaches for implementing the approximated EM algorithm for data with outcome-dependent nonresponse:
1 Use the pseudolikelihood estimate as the initial estimate and run the approximated EM.
2 Use the pseudolikelihood estimate as the initial estimate and incorporate the pseudolikelihood function as a component in each M-step:
\[ \theta^{(t+1)} = \arg\max_{\theta} \{ \ell_{pl}(\theta) + Q^{*}(\theta; \theta^{(t)}) \}. \]
The second approach is more computationally intensive.
31 / 33  Future work

Need to monitor $\{\ell(\theta^{(t)}; \psi_0),\ Q^{*}(\theta^{(t+1)}; \theta^{(t)}) - Q(\theta^{(t+1)}; \theta^{(t)});\ t = 0, 1, 2, \ldots\}$.
Search for a target function $\ell^{*}(\theta; P_n^{(X,Y,R)}, \theta^{(0)})$ so that $\hat{\theta}_{AEM}$ is a stationary point of $\ell^{*}(\theta; P_n^{(X,Y,R)}, \theta^{(0)})$.
Variance estimation.
Link to integrative data analysis.
32 / 33  Acknowledgment

Megan Hunt Olson, University of Wisconsin-Green Bay.
Yang Zhang, Amgen Inc.
33 / 33  References

1. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, 1-38.
2. Tang, G., Little, R. J. A. and Raghunathan, T. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika 90.
3. Kim, J. K. and Yu, C. L. (2011). A semi-parametric estimation of mean functionals with non-ignorable missing data. Journal of the American Statistical Association 106.
4. Wang, S., Shao, J. and Kim, J. K. (2014). Identifiability and estimation in problems with nonignorable nonresponse. Statistica Sinica 24, 1097-1116.
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationLOCAL LINEAR REGRESSION FOR GENERALIZED LINEAR MODELS WITH MISSING DATA
The Annals of Statistics 1998, Vol. 26, No. 3, 1028 1050 LOCAL LINEAR REGRESSION FOR GENERALIZED LINEAR MODELS WITH MISSING DATA By C. Y. Wang, 1 Suojin Wang, 2 Roberto G. Gutierrez and R. J. Carroll 3
More informationLikelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University
Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters: P(x q) Likelihood: estimate
More informationA STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES. by Jia Li B.S., Beijing Normal University, 1998
A STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES by Jia Li B.S., Beijing Normal University, 1998 Submitted to the Graduate Faculty of the Graduate School of Public
More informationEstimating the parameters of hidden binomial trials by the EM algorithm
Hacettepe Journal of Mathematics and Statistics Volume 43 (5) (2014), 885 890 Estimating the parameters of hidden binomial trials by the EM algorithm Degang Zhu Received 02 : 09 : 2013 : Accepted 02 :
More informationRobustness to Parametric Assumptions in Missing Data Models
Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice
More informationCalibration Estimation of Semiparametric Copula Models with Data Missing at Random
Calibration Estimation of Semiparametric Copula Models with Data Missing at Random Shigeyuki Hamori 1 Kaiji Motegi 1 Zheng Zhang 2 1 Kobe University 2 Renmin University of China Institute of Statistics
More informationEM Algorithm. Expectation-maximization (EM) algorithm.
EM Algorithm Outline: Expectation-maximization (EM) algorithm. Examples. Reading: A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc.,
More informationComputing the MLE and the EM Algorithm
ECE 830 Fall 0 Statistical Signal Processing instructor: R. Nowak Computing the MLE and the EM Algorithm If X p(x θ), θ Θ, then the MLE is the solution to the equations logp(x θ) θ 0. Sometimes these equations
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationSTATISTICAL ANALYSIS WITH MISSING DATA
STATISTICAL ANALYSIS WITH MISSING DATA SECOND EDITION Roderick J.A. Little & Donald B. Rubin WILEY SERIES IN PROBABILITY AND STATISTICS Statistical Analysis with Missing Data Second Edition WILEY SERIES
More informationAccelerating the EM Algorithm for Mixture Density Estimation
Accelerating the EM Algorithm ICERM Workshop September 4, 2015 Slide 1/18 Accelerating the EM Algorithm for Mixture Density Estimation Homer Walker Mathematical Sciences Department Worcester Polytechnic
More informationComparison between conditional and marginal maximum likelihood for a class of item response models
(1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia
More informationCausal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk
Causal Inference in Observational Studies with Non-Binary reatments Statistics Section, Imperial College London Joint work with Shandong Zhao and Kosuke Imai Cass Business School, October 2013 Outline
More informationLecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys. Tom Rosenström University of Helsinki May 14, 2014
Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys Tom Rosenström University of Helsinki May 14, 2014 1 Contents 1 Preface 3 2 Definitions 3 3 Different ways to handle MAR data 4 4
More information6. Fractional Imputation in Survey Sampling
6. Fractional Imputation in Survey Sampling 1 Introduction Consider a finite population of N units identified by a set of indices U = {1, 2,, N} with N known. Associated with each unit i in the population
More informationChap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University
Chap 2. Linear Classifiers (FTH, 4.1-4.4) Yongdai Kim Seoul National University Linear methods for classification 1. Linear classifiers For simplicity, we only consider two-class classification problems
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationMultistate Modeling and Applications
Multistate Modeling and Applications Yang Yang Department of Statistics University of Michigan, Ann Arbor IBM Research Graduate Student Workshop: Statistics for a Smarter Planet Yang Yang (UM, Ann Arbor)
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationDiscussion of Maximization by Parts in Likelihood Inference
Discussion of Maximization by Parts in Likelihood Inference David Ruppert School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 4853 email: dr24@cornell.edu
More informationExponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm. by Korbinian Schwinger
Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm by Korbinian Schwinger Overview Exponential Family Maximum Likelihood The EM Algorithm Gaussian Mixture Models Exponential
More informationEmpirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design
1 / 32 Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design Changbao Wu Department of Statistics and Actuarial Science University of Waterloo (Joint work with Min Chen and Mary
More informationMachine Learning: A Statistics and Optimization Perspective
Machine Learning: A Statistics and Optimization Perspective Nan Ye Mathematical Sciences School Queensland University of Technology 1 / 109 What is Machine Learning? 2 / 109 Machine Learning Machine learning
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013
Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationMultiple Imputation for Missing Data in Repeated Measurements Using MCMC and Copulas
Multiple Imputation for Missing Data in epeated Measurements Using MCMC and Copulas Lily Ingsrisawang and Duangporn Potawee Abstract This paper presents two imputation methods: Marov Chain Monte Carlo
More informationNew mixture models and algorithms in the mixtools package
New mixture models and algorithms in the mixtools package Didier Chauveau MAPMO - UMR 6628 - Université d Orléans 1ères Rencontres BoRdeaux Juillet 2012 Outline 1 Mixture models and EM algorithm-ology
More informationSubsample ignorable likelihood for regression analysis with missing data
Appl. Statist. (2011) 60, Part 4, pp. 591 605 Subsample ignorable likelihood for regression analysis with missing data Roderick J. Little and Nanhua Zhang University of Michigan, Ann Arbor, USA [Received
More information12 - Nonparametric Density Estimation
ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6
More informationSemiparametric Gaussian Copula Models: Progress and Problems
Semiparametric Gaussian Copula Models: Progress and Problems Jon A. Wellner University of Washington, Seattle 2015 IMS China, Kunming July 1-4, 2015 2015 IMS China Meeting, Kunming Based on joint work
More informationGeneralized Linear Models Introduction
Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,
More informationColumbia University. Columbia University Biostatistics Technical Report Series. Handling Missing Data by Deleting Completely Observed Records
Columbia University Columbia University Biostatistics Technical Report Series Year 2006 Paper 13 Handling Missing Data by Deleting Completely Observed Records Cuiling Wang Myunghee Cho Paik Columbia University,
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationGeneralized Linear Models
Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.
More informationStatistical Estimation
Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from
More information