EM Algorithm II. September 11, 2018

1 EM Algorithm II September 11, 2018

2 Review EM 1/27
(Y_obs, Y_mis) ~ f(y_obs, y_mis | θ); we observe Y_obs but not Y_mis.
Complete-data log-likelihood: l_C(θ | Y_obs, Y_mis) = log f(Y_obs, Y_mis | θ)
Observed-data log-likelihood: l_O(θ | Y_obs) = log ∫ f(Y_obs, y_mis | θ) dy_mis
EM algorithm:
E step: h^(k)(θ) = E{ l_C(θ | Y_obs, Y_mis) | Y_obs, θ^(k) }. Note, the integration is with respect to Y_mis | Y_obs, so
  E{ l_C(θ | Y_obs, Y_mis) | Y_obs, θ^(k) } = ∫ l_C(θ | Y_obs, y_mis) f(y_mis | Y_obs, θ^(k)) dy_mis.
M step: θ^(k+1) = arg max_θ h^(k)(θ)
Ascent property: l_O(θ^(k) | Y_obs) is non-decreasing in k. If you can calculate it, it is a good idea to monitor it for debugging purposes.
Issues: 1. EM can get trapped in a local maximum. 2. Convergence can be slow.
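To make the notation concrete, here is a minimal generic EM driver in Python. It is only a sketch: the `e_step`, `m_step`, and `obs_loglik` callables are hypothetical placeholders that must be supplied for each specific model. The optional `obs_loglik` monitors l_O(θ^(k) | Y_obs) so the ascent property can be used for debugging, as suggested above.

```python
import numpy as np

def em(theta0, e_step, m_step, obs_loglik=None, max_iter=500, tol=1e-8):
    """Generic EM driver (a sketch; e_step/m_step are problem-specific).

    e_step(theta)     -> sufficient statistics / ingredients of h^(k)
    m_step(stats)     -> updated theta, the maximizer of h^(k)
    obs_loglik(theta) -> optional l_O(theta | Y_obs), monitored for the ascent property
    """
    theta = np.asarray(theta0, dtype=float)
    prev_ll = -np.inf
    for k in range(max_iter):
        stats = e_step(theta)                      # E step: E{l_C | Y_obs, theta^(k)}
        theta_new = np.asarray(m_step(stats), dtype=float)   # M step
        if obs_loglik is not None:
            ll = obs_loglik(theta_new)
            # Ascent property: l_O should never decrease (up to round-off).
            assert ll >= prev_ll - 1e-10, "observed-data log-likelihood decreased"
            prev_ll = ll
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new, k + 1
        theta = theta_new
    return theta, max_iter
```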

3 Standard errors for the EM estimates 2/27
Numerical approximation of the Hessian matrix. Note: l(θ) = observed-data log-likelihood.
We estimate the gradient using
  ∇l(θ)_i = ∂l(θ)/∂θ_i ≈ { l(θ + δ_i e_i) − l(θ − δ_i e_i) } / (2 δ_i),
where e_i is a unit vector with 1 for the ith element and 0 otherwise.
In calculating derivatives using this formula, generally start with some medium-sized δ and then repeatedly halve it until the estimated derivative stabilizes.
We can estimate the Hessian by applying the above formula twice:
  ∇²l(θ)_{ij} ≈ { l(θ + δ_i e_i + δ_j e_j) − l(θ + δ_i e_i − δ_j e_j) − l(θ − δ_i e_i + δ_j e_j) + l(θ − δ_i e_i − δ_j e_j) } / (4 δ_i δ_j).
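A direct implementation of these central-difference formulas, as a sketch: the `loglik` argument is the user-supplied observed-data log-likelihood, and a single common δ is used here, whereas in practice one would start with a moderate δ and halve it until the estimates stabilize, as noted above.

```python
import numpy as np

def num_grad(loglik, theta, delta=1e-4):
    """Central-difference gradient of the observed-data log-likelihood."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = delta
        g[i] = (loglik(theta + e) - loglik(theta - e)) / (2 * delta)
    return g

def num_hessian(loglik, theta, delta=1e-4):
    """Apply the central-difference formula twice to estimate the Hessian."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p); ei[i] = delta
            ej = np.zeros(p); ej[j] = delta
            H[i, j] = (loglik(theta + ei + ej) - loglik(theta + ei - ej)
                       - loglik(theta - ei + ej) + loglik(theta - ei - ej)) / (4 * delta**2)
    return H

# Standard errors: square roots of the diagonal of the inverse negative Hessian at the MLE,
# e.g. se = np.sqrt(np.diag(np.linalg.inv(-num_hessian(loglik, theta_hat))))
```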

4 Louis estimator (1982) 3/27
l_C(θ | Y_obs, Y_mis) = log f(Y_obs, Y_mis | θ)
l_O(θ | Y_obs) = log ∫ f(Y_obs, y_mis | θ) dy_mis
∇l_C(θ | Y_obs, Y_mis), ∇l_O(θ | Y_obs): gradients of l_C, l_O
∇²l_C(θ | Y_obs, Y_mis), ∇²l_O(θ | Y_obs): second derivatives of l_C, l_O
We can prove that
(5) ∇l_O(θ | Y_obs) = E{ ∇l_C(θ | Y_obs, Y_mis) | Y_obs }
(6) ∇²l_O(θ | Y_obs) = E{ ∇²l_C(θ | Y_obs, Y_mis) | Y_obs } + E{ [∇l_C(θ | Y_obs, Y_mis)]² | Y_obs } − [∇l_O(θ | Y_obs)]²
MLE: θ̂ = arg max_θ l_O(θ | Y_obs)
Louis variance estimator: { −∇²l_O(θ | Y_obs) }^{−1} evaluated at θ = θ̂
Note: All of the conditional expectations can be computed within the EM algorithm using only ∇l_C and ∇²l_C, the first and second derivatives of the complete-data log-likelihood. The Louis estimator should be evaluated at the last step of EM.

5 Proof of (5) 4/27
Proof: By the definition of l_O(θ | Y_obs),
  ∇l_O(θ | Y_obs) = ∂/∂θ log{ ∫ f(Y_obs, y_mis | θ) dy_mis }
                  = [ ∂/∂θ ∫ f(Y_obs, y_mis | θ) dy_mis ] / ∫ f(Y_obs, y_mis | θ) dy_mis
                  = ∫ [∂f(Y_obs, y_mis | θ)/∂θ] dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis.   (7)
Multiplying and dividing the integrand of the numerator by f(Y_obs, y_mis | θ) gives (5):
  ∇l_O(θ | Y_obs) = ∫ { [∂f(Y_obs, y_mis | θ)/∂θ] / f(Y_obs, y_mis | θ) } f(Y_obs, y_mis | θ) dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis
                  = ∫ [∂ log f(Y_obs, y_mis | θ)/∂θ] f(Y_obs, y_mis | θ) dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis
                  = ∫ ∇l_C(θ | Y_obs, y_mis) f(y_mis | Y_obs, θ) dy_mis
                  = E{ ∇l_C(θ | Y_obs, Y_mis) | Y_obs }.

6 Proof of (6) 5/27
Proof: We take an additional derivative of ∇l_O(θ | Y_obs) in expression (7) to obtain
  ∇²l_O(θ | Y_obs) = ∫ [∂²f(Y_obs, y_mis | θ)/∂θ∂θ'] dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis
                     − { ∫ [∂f(Y_obs, y_mis | θ)/∂θ] dy_mis / ∫ f(Y_obs, y_mis | θ) dy_mis }²
                   = ∫ [∂²f(Y_obs, y_mis | θ)/∂θ∂θ'] dy_mis / f(Y_obs | θ) − { ∇l_O(θ | Y_obs) }².   (8)
To see how the first term breaks down, we take an additional derivative of both sides of
  ∫ [∂f(Y_obs, y_mis | θ)/∂θ] dy_mis = ∫ [∂ log f(Y_obs, y_mis | θ)/∂θ] f(Y_obs, y_mis | θ) dy_mis

7 to obtain
  ∫ [∂²f(Y_obs, y_mis | θ)/∂θ∂θ'] dy_mis = ∫ [∂² log f(Y_obs, y_mis | θ)/∂θ∂θ'] f(Y_obs, y_mis | θ) dy_mis
                                            + ∫ [∂ log f(Y_obs, y_mis | θ)/∂θ]² f(Y_obs, y_mis | θ) dy_mis.
Thus the first term in equation (8) equals
  E{ ∇²l_C(θ | Y_obs, Y_mis) | Y_obs } + E{ [∇l_C(θ | Y_obs, Y_mis)]² | Y_obs }.

8 Convergence rate of EM 7/27
Let I_C(θ) and I_O(θ) denote the complete information and observed information, respectively. One can show that when EM converges, the linear convergence rate, defined as (θ^(k+1) − θ̂)/(θ^(k) − θ̂), is approximately 1 − I_O(θ̂)/I_C(θ̂). (Shown later.) This means that:
  When the fraction of missing information is small, EM converges quickly.
  Otherwise, EM converges slowly.

9 Towards SEM algorithm 8/27
The EM algorithm does not generate the asymptotic covariance matrix (standard errors) of the parameters as a byproduct.
The asymptotic covariance matrix for θ̂, denoted V, can be found as { −∇²l_O(θ̂ | Y_obs) }^{−1}. However, the derivatives of l_O can be difficult to evaluate directly.
In contrast, ∇²l_C(θ̂ | Y_obs, Y_mis), and hence I_OC ≡ E{ −∇²l_C(θ̂ | Y_obs, Y_mis) | Y_obs }, is relatively easy to evaluate.
The Louis estimator of the covariance matrix, i.e. the inverse of
  −E{ ∇²l_C(θ̂ | Y_obs, Y_mis) | Y_obs } − E{ [∇l_C(θ̂ | Y_obs, Y_mis)]² | Y_obs } + [∇l_O(θ̂ | Y_obs)]²,
requires calculation of the conditional expectation of the square of the complete-data score function, which is specific to each problem.
The Supplemented EM (SEM) algorithm (Meng & Rubin, 1991) obtains the covariance matrix using only the code for computing the complete-data covariance matrix, the code for EM itself, and code for standard matrix operations.

10 Towards SEM algorithm (continued) 9/27
EM defines a mapping M: θ^(k+1) = M(θ^(k)), where M(θ) = (M_1(θ), ..., M_p(θ)).
Let {DM}_{ij} = ( ∂M_j(θ)/∂θ_i ) evaluated at θ = θ̂, which is a p × p matrix.
We can show that θ^(k+1) − θ̂ ≈ DM (θ^(k) − θ̂), which means DM is the rate of convergence of EM.
Proof: Because θ^(k+1) = M(θ^(k)) and θ̂ = M(θ̂), we have θ^(k+1) − θ̂ = M(θ^(k)) − M(θ̂). A Taylor series expansion of the right-hand side gives θ^(k+1) − θ̂ ≈ DM (θ^(k) − θ̂).
It has been shown that V = I_OC^{−1} (I − DM)^{−1}, which means the observed-data asymptotic variance can be obtained by inflating the complete-data asymptotic variance I_OC^{−1} by the factor (I − DM)^{−1}.

11 SEM algorithm 10/27
SEM consists of three parts:
1. The evaluation of I_OC
2. The evaluation of DM
3. The evaluation of V
Evaluation of I_OC ≡ E{ −∇²l_C(θ̂ | Y_obs, Y_mis) | Y_obs }:
Example 1 (grouped multinomial): l_C(θ | X) = (x_1 + y_4) log θ + (y_2 + y_3) log(1 − θ).
Example 2 (normal mixtures): l_C(µ, σ, p | x, y) = Σ_i Σ_j y_ij { log p_j + log φ(x_i | µ_j, σ_j) }.
f(Y_obs, Y_mis) in the exponential family: l_C(θ | X) = S(X)' η(θ) − B(θ).
In these cases l_C, ∇l_C and ∇²l_C are linear in x_1, Σ_i y_ij and S(X) (the sufficient statistics). Recall that we evaluate E(sufficient statistics | Y_obs, θ^(k)) at every E step. We easily obtain I_OC by plugging in E(sufficient statistics | Y_obs, θ̂) at the last E step.

12 SEM algorithm (continued) 11/27
Evaluation of DM = {r_ij}:
For a scalar θ, we can use the sequence θ^(k) to obtain DM. For a vector θ, we cannot do so, because θ_i^(k+1) − θ̂_i ≈ Σ_j DM_ij (θ_j^(k) − θ̂_j).
Each DM_ij is the component-wise rate of convergence of the following "forced EM":
1. Run EM to get the MLE θ̂.
2. Pick a starting point θ^(0) some small distance from θ̂, but not equal to θ̂ in any component.
3. Repeat the following until each r_ij^(k) is stable:
  (a) Calculate θ^(k) = M(θ^(k−1)) using one step of EM.
  (b) For each i = 1, ..., p:
    i. Let θ^(k)(i) = (θ̂_1, ..., θ̂_{i−1}, θ_i^(k), θ̂_{i+1}, ..., θ̂_p) (replace the ith element of θ̂ with the ith element of θ^(k)).
    ii. Perform one step of EM on θ^(k)(i) to obtain M[θ^(k)(i)].
    iii. Obtain r_ij^(k) = { M_j[θ^(k)(i)] − θ̂_j } / { θ_i^(k) − θ̂_i } for j = 1, ..., p.

13 SEM algorithm (continued) 12/27
Note:
  The MLE θ̂ should be obtained at a very low tolerance ε.
  The final r_ij is taken to be the first value of r_ij^(k) satisfying |r_ij^(k) − r_ij^(k−1)| < ε, where k can be different for different (i, j).
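Below is a minimal sketch of this forced-EM computation of DM, assuming a user-supplied `em_step` function that performs one full E plus M update M(θ); the stopping rule and starting value follow the recipe on the two slides above.

```python
import numpy as np

def sem_dm(em_step, theta_hat, theta0, tol=1e-6, max_iter=200):
    """Numerically estimate DM for the SEM algorithm (a sketch).

    em_step(theta) -> one full E+M update, i.e. the EM mapping M(theta)
    theta_hat      -> converged MLE from a low-tolerance EM run
    theta0         -> starting value differing from theta_hat in every component
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta = np.asarray(theta0, dtype=float)
    p = theta_hat.size
    R_prev = np.full((p, p), np.inf)
    DM = np.full((p, p), np.nan)
    done = np.zeros((p, p), dtype=bool)
    for _ in range(max_iter):
        theta = np.asarray(em_step(theta), dtype=float)    # theta^(k) = M(theta^(k-1))
        R = np.empty((p, p))
        for i in range(p):
            t_i = theta_hat.copy()
            t_i[i] = theta[i]                              # theta^(k)(i)
            M_ti = np.asarray(em_step(t_i), dtype=float)   # one EM step from theta^(k)(i)
            R[i, :] = (M_ti - theta_hat) / (theta[i] - theta_hat[i])
        newly = (~done) & (np.abs(R - R_prev) < tol)
        DM[newly] = R[newly]                               # freeze r_ij at first stable value
        done |= newly
        if done.all():
            break
        R_prev = R
    return DM

# Then V = np.linalg.inv(I_OC) @ np.linalg.inv(np.eye(p) - DM), as on slide 10.
```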

14 MCEM algorithm 13/27
EM algorithm: the analytical integration required for the E-step can be difficult.
The Monte Carlo EM algorithm (Wei & Tanner, 1990) replaces the analytical integration in the E-step by a Monte Carlo integration procedure, with samples drawn by MCMC techniques such as the Gibbs sampler or the Metropolis–Hastings algorithm.
MCE step: Simulate a sample Y_mis,1, ..., Y_mis,m from f(Y_mis | Y_obs, θ^(k)) and calculate
  h^(k)(θ) = m^{−1} Σ_{j=1}^m l_C(θ | Y_obs, Y_mis,j).
M step: θ^(k+1) = arg max_θ h^(k)(θ).
Choose m to guarantee convergence. Wei & Tanner recommend starting with a small value of m and then increasing m as θ^(k) moves closer to the maximizer.
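A schematic MCEM loop for a scalar parameter might look as follows. This is a sketch only: `sample_mis` and `comp_loglik` are hypothetical user-supplied callables (the sampler would typically be a Gibbs or Metropolis–Hastings routine), and the schedule for growing m is one simple choice among many.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mcem(theta0, sample_mis, comp_loglik, m0=50, max_iter=100):
    """Monte Carlo EM for a scalar parameter (a sketch).

    sample_mis(theta, m)       -> m draws from f(Y_mis | Y_obs, theta)
    comp_loglik(theta, y_mis)  -> complete-data log-likelihood l_C(theta | Y_obs, y_mis)
    """
    theta, m = float(theta0), m0
    for k in range(max_iter):
        draws = sample_mis(theta, m)                        # MCE step
        h = lambda t: np.mean([comp_loglik(t, y) for y in draws])
        theta = minimize_scalar(lambda t: -h(t)).x          # M step
        m = int(m * 1.2)                                    # increase m as theta^(k) stabilizes
    return theta
```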

15 Bayesian EM 14/27
Suppose we have a prior π(θ) and wish to find the mode of the log posterior = l_O(θ | Y_obs) + log π(θ).
E step: h^(k)(θ) = E{ l_C(θ | Y_obs, Y_mis) + log π(θ) | Y_obs, θ^(k) }
               = E{ l_C(θ | Y_obs, Y_mis) | Y_obs, θ^(k) } + log π(θ)
M step: θ^(k+1) = arg max_θ h^(k)(θ)

16 Bayesian EM: Example 15/27
Observed data: (y_1, y_2, y_3) ~ multinomial(n; 1/2 + θ/4, (1 − θ)/2, θ/4)
Complete data: (x_0, x_1, y_2, y_3) ~ multinomial(n; 1/2, θ/4, (1 − θ)/2, θ/4), where x_0 + x_1 = y_1.
Prior: θ ~ Beta(ν_1, ν_2), i.e. π(θ) = [Γ(ν_1 + ν_2) / (Γ(ν_1) Γ(ν_2))] θ^{ν_1 − 1} (1 − θ)^{ν_2 − 1}.
No prior:
  l_C(θ | x_0, x_1, y_2, y_3) ∝ (x_1 + y_3) log θ + y_2 log(1 − θ)
  E step: ω_12^(k) = E(x_1 | θ^(k), y_1) = θ^(k) y_1 / (θ^(k) + 2)
  M step: θ^(k+1) = (ω_12^(k) + y_3) / (ω_12^(k) + y_2 + y_3)
With prior θ ~ Beta(ν_1, ν_2):
  l_C(θ | x_0, x_1, y_2, y_3) + log π(θ) ∝ (x_1 + y_3 + ν_1 − 1) log θ + (y_2 + ν_2 − 1) log(1 − θ)
  E step: same as in the no-prior case
  M step: θ^(k+1) = (ω_12^(k) + y_3 + ν_1 − 1) / (ω_12^(k) + y_2 + y_3 + ν_1 + ν_2 − 2)
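The two M-step formulas differ only through the prior pseudo-counts, so a single update function covers both cases. The sketch below uses hypothetical illustrative counts (not taken from the slides); setting ν_1 = ν_2 = 1 (a flat prior) recovers the no-prior update.

```python
def em_update(theta, y1, y2, y3, nu1=1.0, nu2=1.0):
    """One Bayesian EM step for the multinomial example; nu1 = nu2 = 1 gives the
    no-prior (maximum likelihood) update."""
    w12 = theta * y1 / (theta + 2)                                   # E step: E(x1 | theta, y1)
    return (w12 + y3 + nu1 - 1) / (w12 + y2 + y3 + nu1 + nu2 - 2)    # M step

# Illustrative run with hypothetical counts:
theta = 0.5
for _ in range(100):
    theta = em_update(theta, y1=125, y2=18, y3=20, nu1=2, nu2=2)
print(round(theta, 4))
```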

17 Generalized EM algorithm 16/27
Generalized EM (GEM):
E step: evaluate h^(k)(θ) as before.
M step: Choose θ^(k+1) such that h^(k)(θ^(k+1)) ≥ h^(k)(θ^(k)) (do not necessarily maximize h^(k)(θ), just increase it).
Note: This retains the ascent property of EM.
EM gradient algorithm (Lange, 1995): a class of GEM.
M step: Do one step of Newton–Raphson:
  θ^(k+1) = θ^(k) + α^(k) d^(k), where d^(k) = −{ ∂²h^(k)(θ)/∂θ∂θ' }^{−1} ∂h^(k)(θ)/∂θ evaluated at θ = θ^(k).
Start with α^(k) = 1 and do step-halving until h^(k)(θ^(k+1)) ≥ h^(k)(θ^(k)).
Lange pointed out that one step of Newton–Raphson saves us from performing iterations within iterations and yet displays the same local rate of convergence as a full EM algorithm that maximizes h^(k)(θ) at each iteration.
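The M step of the EM gradient algorithm can be sketched as below; `h`, `grad_h`, and `hess_h` are hypothetical callables giving the value, gradient, and Hessian of h^(k)(θ) after the current E step, and the step-halving loop enforces the GEM ascent condition.

```python
import numpy as np

def em_gradient_mstep(theta, h, grad_h, hess_h):
    """One GEM/M step of Lange's EM gradient algorithm (a sketch)."""
    theta = np.asarray(theta, dtype=float)
    d = -np.linalg.solve(hess_h(theta), grad_h(theta))   # one Newton-Raphson direction
    alpha, h0 = 1.0, h(theta)
    while h(theta + alpha * d) < h0:                     # step-halving keeps h^(k) non-decreasing
        alpha /= 2
        if alpha < 1e-10:
            return theta                                 # give up; keep the current theta
    return theta + alpha * d
```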

18 ECM algorithm 17/27
EM is unattractive if maximizing the expected complete-data log-likelihood h^(k)(θ) is complicated. In many cases, maximizing h^(k)(θ) is relatively simple when conditioning on some of the parameters being estimated.
The Expectation Conditional Maximization (ECM) algorithm (Meng & Rubin, 1993) replaces a (complicated) M step with a sequence of conditional maximization (CM) steps.
E step: evaluate h^(k)(θ) as before.
CM steps: Partition θ into T parts: θ = (θ_1, ..., θ_T). For t = 1, ..., T, obtain
  θ_t^(k+1) = arg max_{θ_t} h^(k)(θ_1^(k+1), ..., θ_{t−1}^(k+1), θ_t, θ_{t+1}^(k), ..., θ_T^(k)).
Note:
  ECM shares all the appealing convergence properties of EM, such as the ascent property.
  It typically needs more E- and CM-iterations, but can be faster in total computer time.

19 ECM algorithm: example 18/27
Complete data: y_1, ..., y_n ~ gamma(α, β) with density f(y | α, β) = y^{α−1} e^{−y/β} / { β^α Γ(α) }.
Observed data: y_O = a censored version of the complete data.
Complete-data log-likelihood:
  l_C(α, β | y_1, ..., y_n) = (α − 1) Σ_i log y_i − Σ_i y_i / β − n { α log β + log Γ(α) }.
Define ȳ = n^{−1} Σ_i y_i, ḡ = n^{−1} Σ_i log y_i, and ψ(α) = Γ'(α)/Γ(α).
E step:
  ω^(k) = E(ȳ | y_O, α^(k), β^(k))
  τ^(k) = E(ḡ | y_O, α^(k), β^(k))
CM steps:
  Given α^(k): β^(k+1) = ω^(k) / α^(k)
  Given β^(k+1): α^(k+1) = ψ^{−1}( τ^(k) − log β^(k+1) )
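The two CM steps can be coded directly. The sketch below assumes the E-step expectations ω and τ have already been computed (they depend on the specific censoring mechanism), and it inverts the digamma function ψ numerically with a root finder.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def ecm_cm_steps(alpha, omega, tau):
    """CM steps for the censored-gamma example (a sketch).

    omega, tau: E-step quantities E(ybar | y_O, alpha, beta) and E(gbar | y_O, alpha, beta).
    """
    beta_new = omega / alpha                               # CM step 1: given alpha
    # CM step 2: given beta_new, solve psi(alpha) = tau - log(beta_new)
    target = tau - np.log(beta_new)
    alpha_new = brentq(lambda a: digamma(a) - target, 1e-6, 1e6)
    return alpha_new, beta_new
```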

20 ECME algorithm 19/27
The ECM Either (ECME) algorithm (Liu & Rubin, 1994) is a generalization of the ECM algorithm. It replaces some CM-steps of ECM, which maximize the constrained expected complete-data log-likelihood, with steps that maximize the correspondingly constrained observed-data log-likelihood.
E step: evaluate h^(k)(θ) as before.
CM steps: Partition θ into T parts: θ = (θ_1, ..., θ_T). For t = 1, ..., T, obtain either
  θ_t^(k+1) = arg max_{θ_t} h^(k)(θ_1^(k+1), ..., θ_{t−1}^(k+1), θ_t, θ_{t+1}^(k), ..., θ_T^(k))
or
  θ_t^(k+1) = arg max_{θ_t} l_O(θ_1^(k+1), ..., θ_{t−1}^(k+1), θ_t, θ_{t+1}^(k), ..., θ_T^(k) | Y_obs).
Note:
  ECME shares with both EM and ECM their stable monotone convergence and simplicity of implementation.
  It can converge substantially faster than either EM or ECM, measured by either the number of iterations or actual computer time.

21 Example: Mixed-effects model 20/27
For a longitudinal dataset of i = 1, ..., N subjects, each with t = 1, ..., n_i measurements of the response, a simple linear mixed-effects model is given by
  Y_it = X_it β + b_i + ε_it,  b_i ~ N(0, σ_b²),  ε_i ~ N_{n_i}(0, σ_ε² I_{n_i}),  b_i, ε_i independent.
Observed-data log-likelihood:
  l(β, σ_b², σ_ε² | Y_1, ..., Y_N) = Σ_i { −(1/2) (Y_i − X_i β)' Σ_i^{−1} (Y_i − X_i β) − (1/2) log |Σ_i| },
where {Σ_i}_{tt} = σ_b² + σ_ε² and {Σ_i}_{tt'} = σ_b² for t ≠ t'.
In fact, this likelihood can be directly maximized over (β, σ_b², σ_ε²) by using Newton–Raphson or Fisher scoring.
Note: Given (σ_b², σ_ε²), and hence Σ_i, we obtain the β that maximizes the likelihood by solving
  ∂l(β, σ_b², σ_ε² | Y_1, ..., Y_N)/∂β = Σ_i X_i' Σ_i^{−1} (Y_i − X_i β) = 0,
  β = ( Σ_{i=1}^N X_i' Σ_i^{−1} X_i )^{−1} Σ_{i=1}^N X_i' Σ_i^{−1} Y_i.
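This profile step for β is plain generalized least squares and is easy to code; a minimal sketch, assuming the X_i, Y_i, and Σ_i are supplied as lists of NumPy arrays:

```python
import numpy as np

def gls_beta(X_list, Y_list, Sigma_list):
    """Closed-form beta given the Sigma_i (generalized least squares)."""
    p = X_list[0].shape[1]
    A = np.zeros((p, p))
    b = np.zeros(p)
    for X, Y, S in zip(X_list, Y_list, Sigma_list):
        Si = np.linalg.inv(S)
        A += X.T @ Si @ X          # sum_i X_i' Sigma_i^{-1} X_i
        b += X.T @ Si @ Y          # sum_i X_i' Sigma_i^{-1} Y_i
    return np.linalg.solve(A, b)
```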

22 Example: Mixed-effects model (continued) 21/27
Complete-data log-likelihood: the b_i are treated as missing data.
Let ε_i = Y_i − X_i β − b_i. We know that
  (b_i, ε_i)' ~ N( 0, diag(σ_b², σ_ε² I_{n_i}) ),
so
  l_C(β, σ_b², σ_ε² | ε_1, ..., ε_N, b_1, ..., b_N) = Σ_i { −b_i²/(2σ_b²) − (1/2) log σ_b² − ε_i'ε_i/(2σ_ε²) − (n_i/2) log σ_ε² }.
Given the complete data, the parameters that maximize l_C are
  σ_b² = N^{−1} Σ_{i=1}^N b_i²
  σ_ε² = ( Σ_{i=1}^N n_i )^{−1} Σ_{i=1}^N ε_i' ε_i
  β = ( Σ_{i=1}^N X_i' X_i )^{−1} Σ_{i=1}^N X_i' (Y_i − b_i).

23 Example: Mixed-effects model (continued) 22/27
E step: we need to evaluate
  E( b_i² | Y_i, β^(k), σ_b^{2(k)}, σ_ε^{2(k)} ),
  E( ε_i'ε_i | Y_i, β^(k), σ_b^{2(k)}, σ_ε^{2(k)} ),
  E( b_i | Y_i, β^(k), σ_b^{2(k)}, σ_ε^{2(k)} ).
We use the relationship E(b_i² | Y_i) = { E(b_i | Y_i) }² + Var(b_i | Y_i). Thus we need to calculate E(b_i | Y_i) and Var(b_i | Y_i).
Recall the conditional distribution for multivariate normal variables:
  (Y_i, b_i)' ~ N( (X_i β, 0)', [ σ_b² e_{n_i} e_{n_i}' + σ_ε² I_{n_i},  σ_b² e_{n_i} ;  σ_b² e_{n_i}',  σ_b² ] ),  e_{n_i} = (1, 1, ..., 1)'.
Let Σ_i = σ_b² e_{n_i} e_{n_i}' + σ_ε² I_{n_i}. We know that
  E(b_i | Y_i) = 0 + σ_b² e_{n_i}' Σ_i^{−1} (Y_i − X_i β)
  Var(b_i | Y_i) = σ_b² − σ_b² e_{n_i}' Σ_i^{−1} e_{n_i} σ_b².

24 Example: Mixed-effects model (continued) 23/27
Similarly, we use the relationship E(ε_i ε_i' | Y_i) = E(ε_i | Y_i) E(ε_i | Y_i)' + Var(ε_i | Y_i).
We can derive
  (Y_i, ε_i)' ~ N( (X_i β, 0)', [ σ_b² e_{n_i} e_{n_i}' + σ_ε² I_{n_i},  σ_ε² I_{n_i} ;  σ_ε² I_{n_i},  σ_ε² I_{n_i} ] ).
Let Σ_i = σ_b² e_{n_i} e_{n_i}' + σ_ε² I_{n_i}. Then we have
  E(ε_i | Y_i) = 0 + σ_ε² Σ_i^{−1} (Y_i − X_i β)
  Var(ε_i | Y_i) = σ_ε² I_{n_i} − σ_ε⁴ Σ_i^{−1}.
M step of the standard EM algorithm:
  σ_b^{2(k+1)} = N^{−1} Σ_{i=1}^N E( b_i² | Y_i, β^(k), σ_b^{2(k)}, σ_ε^{2(k)} )   (1)
  σ_ε^{2(k+1)} = ( Σ_{i=1}^N n_i )^{−1} Σ_{i=1}^N E( ε_i'ε_i | Y_i, β^(k), σ_b^{2(k)}, σ_ε^{2(k)} )   (2)
  β^(k+1) = ( Σ_{i=1}^N X_i' X_i )^{−1} Σ_{i=1}^N X_i' E( Y_i − b_i | Y_i, β^(k), σ_b^{2(k)}, σ_ε^{2(k)} )   (3)
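Putting the E step (conditional means and variances of b_i and ε_i) and the M step (1)-(3) together, one EM iteration for this random-intercept model can be sketched as follows. This is a minimal illustration assuming the X_i and Y_i are NumPy arrays, and it uses E(ε_i'ε_i | Y_i) = E(ε_i | Y_i)'E(ε_i | Y_i) + tr Var(ε_i | Y_i).

```python
import numpy as np

def em_step_lmm(X_list, Y_list, beta, sb2, se2):
    """One EM step for the random-intercept model (a sketch of updates (1)-(3))."""
    N = len(Y_list)
    p = X_list[0].shape[1]
    sum_b2, sum_ee, sum_n = 0.0, 0.0, 0
    A = np.zeros((p, p))
    b_vec = np.zeros(p)
    for X, Y in zip(X_list, Y_list):
        n_i = len(Y)
        e = np.ones(n_i)
        Sigma = sb2 * np.outer(e, e) + se2 * np.eye(n_i)
        Si = np.linalg.inv(Sigma)
        r = Y - X @ beta
        Eb = sb2 * e @ Si @ r                      # E(b_i | Y_i)
        Vb = sb2 - sb2 * e @ Si @ e * sb2          # Var(b_i | Y_i)
        Ee = se2 * Si @ r                          # E(eps_i | Y_i)
        Ve = se2 * np.eye(n_i) - se2**2 * Si       # Var(eps_i | Y_i)
        sum_b2 += Eb**2 + Vb                       # E(b_i^2 | Y_i)
        sum_ee += Ee @ Ee + np.trace(Ve)           # E(eps_i' eps_i | Y_i)
        sum_n += n_i
        A += X.T @ X
        b_vec += X.T @ (Y - Eb)                    # X_i' E(Y_i - b_i | Y_i)
    return np.linalg.solve(A, b_vec), sum_b2 / N, sum_ee / sum_n
```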

25 ECME algorithm: Mixed-effects model (continued) 24/27
M step of the ECME algorithm:
  Partition the parameter vector (β, σ_b², σ_ε²) into β and (σ_b², σ_ε²).
  First maximize the expected complete-data log-likelihood over (σ_b², σ_ε²), using (1) and (2).
  Given (σ_b^{2(k+1)}, σ_ε^{2(k+1)}), calculate Σ_i^(k+1) = σ_b^{2(k+1)} e_{n_i} e_{n_i}' + σ_ε^{2(k+1)} I_{n_i} and obtain the β that maximizes the observed-data log-likelihood:
  β^(k+1) = ( Σ_{i=1}^N X_i' {Σ_i^(k+1)}^{−1} X_i )^{−1} Σ_{i=1}^N X_i' {Σ_i^(k+1)}^{−1} Y_i.
Convergence is accelerated: in some of ECME's M steps the observed-data likelihood is being conditionally maximized, rather than an approximation to it (the expected complete-data log-likelihood).

26 Accelerated EM 25/27
Aitken's acceleration method (Louis, 1982)
Suppose θ^(k) → θ̂ as k → ∞. Then
  θ̂ = θ^(k) + Σ_{h=0}^∞ [ θ^(k+h+1) − θ^(k+h) ].
Now
  θ^(k+h+2) − θ^(k+h+1) = M(θ^(k+h+1)) − M(θ^(k+h))
                        ≈ J(θ^(k+h)) [ θ^(k+h+1) − θ^(k+h) ]
                        ≈ J(θ^(k)) [ θ^(k+h+1) − θ^(k+h) ]
                        ≈ { J(θ^(k)) }^{h+1} [ θ^(k+1) − θ^(k) ],
where M is the mapping defined by EM and J is the Jacobian of M. Thus
  θ̂ ≈ θ^(k) + Σ_{h=0}^∞ { J(θ^(k)) }^h [ θ^(k+1) − θ^(k) ]
     = θ^(k) + { I − J(θ^(k)) }^{−1} [ θ^(k+1) − θ^(k) ],
by which we can produce the effect of an infinite number of iterations with the following algorithm.

27 Accelerated EM 26/27
The algorithm:
1. From θ^(k), produce θ^(k+1) using EM.
2. Estimate { I − J(θ^(k)) }^{−1} by ( I − Ĵ )^{−1} (see below).
3. Compute the accelerated update θ̃^(k+1) = θ^(k) + ( I − Ĵ )^{−1} [ θ^(k+1) − θ^(k) ].
4. Use θ̃^(k+1) in place of θ^(k) in step 1.
Louis (1982) showed that ( I − Ĵ )^{−1} = I_OC (I_O)^{−1}, where I_OC = E{ −∇²l_C(θ̂ | Y_obs, Y_mis) | Y_obs } and I_O can be obtained by the Louis formula.
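A sketch of this accelerated iteration in code, assuming an `em_step` function for the EM mapping M and a `jacobian_est` callable that supplies an estimate Ĵ of J(θ^(k)) (for example by finite differences of M, or via I_OC and the Louis information as on this slide):

```python
import numpy as np

def aitken_accelerate(em_step, theta0, jacobian_est, max_iter=100, tol=1e-10):
    """Aitken-accelerated EM (a sketch)."""
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    for _ in range(max_iter):
        theta_em = np.asarray(em_step(theta), dtype=float)     # step 1: one EM update
        J = jacobian_est(theta)                                # step 2: estimate J(theta^(k))
        step = np.linalg.solve(np.eye(p) - J, theta_em - theta)
        theta_new = theta + step                               # step 3: accelerated update
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new                                      # step 4: feed back into step 1
    return theta
```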

28 References 27/27
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39.
Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society, Series B, 57.
Liu, C. and Rubin, D. B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika, 81.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44.
Meng, X.-L. and Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86.
Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80.
Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85.
