EM for ML Estimation


1 Overview

EM for ML Estimation: an algorithm for Maximum Likelihood (ML) estimation from incomplete data (Dempster, Laird, and Rubin, 1977).

1. Formulate the complete data so that complete-data ML estimation is easy.
2. Estimate the missing data: Expectation (E) step.
3. Analyze the imputed complete data: Maximization (M) step.

Simple to implement, and stable: the likelihood increases monotonically.

Extensions:
1. Simplicity. M-step: Expectation/Conditional Maximization (ECM); E-step: Monte Carlo EM.
2. Efficiency. Numerical techniques: Newton steps, Aitken's acceleration (DLR, 1977), conjugate-gradient EM, ...; statistical considerations: ECME (conditionally maximize either Q or L), Parameter-eXpanded EM (PX-EM).

2 The EM Algorithm: ML Estimation

The problem: given the observed data $X_{obs}$ with observed-data model $f(x_{obs} \mid \theta)$, $\theta \in \Theta$, find
$$\hat\theta = \arg\max_{\theta \in \Theta} f(x_{obs} \mid \theta) = \arg\max_{\theta \in \Theta} \ln f(x_{obs} \mid \theta).$$

The idea: formulate complete data $X_{com}$ by creating missing data $X_{mis}$ so that there exists a one-to-one mapping $(X_{obs}, X_{mis}) \leftrightarrow X_{com}$. The complete-data model satisfies
$$f(x_{obs} \mid \theta) = \int f(x_{obs}, x_{mis} \mid \theta)\, dx_{mis}.$$
The mapping $X_{com} \to X_{obs}$ is many-to-one. The complete data are chosen so that it is easy to compute $E[\ln f(X_{com} \mid \theta') \mid X_{obs}, \theta]$ and easy to obtain $\arg\max_{\theta} \ln f(x_{com} \mid \theta)$.

3 The EM Algorithm: ML Estimation

The algorithm: set a starting value $\theta^{(0)}$; for $t = 1, 2, \ldots$ iterate between the following E and M steps.

E-step. Compute $Q(\theta \mid \theta^{(t-1)}) = E[\ln f(X_{com} \mid \theta) \mid X_{obs}, \theta^{(t-1)}]$.
M-step. Obtain $\theta^{(t)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t-1)})$.

An important special case: for the exponential family
$$f(x_{com} \mid \theta) = h(x_{com})\, c(\theta) \exp\{\theta\, s^T(x_{com})\},$$
let $\hat\theta(s(X_{com})) \equiv \arg\max_{\theta} \big(\ln c(\theta) + \theta\, s^T(X_{com})\big)$. Then:

E-step. Estimate the complete-data sufficient statistics: $s^{(t)} = E[s(X_{com}) \mid X_{obs}, \theta^{(t-1)}]$.
M-step. Obtain $\theta^{(t)} = \hat\theta(s^{(t)})$.
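The E/M alternation translates directly into a small driver loop. The following Python sketch is purely illustrative (the function names e_step, m_step, and log_lik are placeholders for model-specific pieces, not anything defined in the lecture):

```python
import numpy as np

def em(theta0, e_step, m_step, log_lik, max_iter=500, tol=1e-10):
    """Generic EM driver: alternate E and M steps until the observed-data
    log-likelihood stops increasing by more than `tol`.

    e_step(theta)  -> expected complete-data sufficient statistics s
    m_step(s)      -> theta maximizing the expected complete-data log-likelihood
    log_lik(theta) -> observed-data log-likelihood L(theta), used only for monitoring
    """
    theta = theta0
    ll_old = log_lik(theta)
    for t in range(1, max_iter + 1):
        s = e_step(theta)          # E-step: s^(t) = E[s(X_com) | X_obs, theta^(t-1)]
        theta = m_step(s)          # M-step: theta^(t) = theta_hat(s^(t))
        ll_new = log_lik(theta)
        if ll_new - ll_old < tol:  # monotone increase is guaranteed by EM theory
            break
        ll_old = ll_new
    return theta, ll_new, t
```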

4 The EM Algorithm: ML Estimation

Example: grouped multinomial data. 197 animals are distributed into four categories with observed counts $X_{obs} = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$. A genetic model for the population specifies the cell probabilities
$$\left(\tfrac{1}{2} + \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right).$$
Thus
$$f(x_{obs} \mid \theta) = \frac{(\sum_i y_i)!}{\prod_i y_i!} \left(\tfrac{1}{2} + \tfrac{\theta}{4}\right)^{y_1} \left(\tfrac{1-\theta}{4}\right)^{y_2} \left(\tfrac{1-\theta}{4}\right)^{y_3} \left(\tfrac{\theta}{4}\right)^{y_4}
= \frac{(\sum_i y_i)!}{\prod_i y_i!} \left(\tfrac{1}{2} + \tfrac{\theta}{4}\right)^{y_1} \left(\tfrac{1-\theta}{4}\right)^{y_2 + y_3} \left(\tfrac{\theta}{4}\right)^{y_4}.$$
Split the first category into two categories with cell probabilities $(1/2,\ \theta/4)$.

5 The EM Algorithm: ML Estimation

Complete data: $X_{com} = (x_0, x_1, x_2, x_3, x_4)$ with the mapping $y_1 = x_0 + x_1$, $y_2 = x_2$, $y_3 = x_3$, $y_4 = x_4$.

Complete-data model: $X_{com} \sim \text{Multinomial}\left(\tfrac{1}{2},\ \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right)$.

Complete-data log-likelihood:
$$\ln f(x_{com} \mid \theta) = (x_1 + x_4) \ln\theta + (x_2 + x_3) \ln(1 - \theta) + h(x_{com}).$$

E-step: compute $x_1^{(t)} = y_1 \dfrac{\theta^{(t-1)}/4}{1/2 + \theta^{(t-1)}/4}$.

M-step: compute $\theta^{(t)} = \dfrac{x_1^{(t)} + y_4}{x_1^{(t)} + y_4 + y_2 + y_3}$.
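As a concrete check of these formulas, here is a short Python sketch for the observed counts (125, 18, 20, 34); it simply alternates the two updates above and converges to $\hat\theta \approx 0.6268$, the positive root of $197\theta^2 - 15\theta - 68 = 0$. The code is illustrative, not part of the original slides.

```python
y1, y2, y3, y4 = 125, 18, 20, 34   # observed counts, n = 197

def em_multinomial(theta0=0.5, max_iter=100, tol=1e-12):
    theta = theta0
    for _ in range(max_iter):
        # E-step: expected count in the theta/4 part of the first cell
        x1 = y1 * (theta / 4) / (1 / 2 + theta / 4)
        # M-step: complete-data MLE of theta
        theta_new = (x1 + y4) / (x1 + y4 + y2 + y3)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

print(em_multinomial())   # approx 0.6268
```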

6 The EM Algorithm: Convergence of Likelihood Sequences

Notation:
$$f(x_{mis} \mid X_{obs}, \theta') = \frac{f(x_{obs}, x_{mis} \mid \theta')}{f(x_{obs} \mid \theta')},$$
$$H(\theta' \mid \theta) = E[\ln f(X_{mis} \mid X_{obs}, \theta') \mid X_{obs}, \theta], \qquad
Q(\theta' \mid \theta) = E[\ln f(X_{obs}, X_{mis} \mid \theta') \mid X_{obs}, \theta],$$
$$L(\theta) = \ln f(x_{obs} \mid \theta).$$
Then $L(\theta') = Q(\theta' \mid \theta) - H(\theta' \mid \theta)$.

Lemma (Jensen's inequality): $H(\theta' \mid \theta) \le H(\theta \mid \theta)$.

GEM: a Generalized EM algorithm replaces $\theta^{(t)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t-1)})$ in the M-step with any $\theta^{(t)}$ such that $Q(\theta^{(t)} \mid \theta^{(t-1)}) \ge Q(\theta^{(t-1)} \mid \theta^{(t-1)})$.

Monotonicity: each iteration of GEM increases the actual likelihood, i.e., $L(\theta^{(t)}) \ge L(\theta^{(t-1)})$, $t = 1, 2, \ldots$

Q: Will the log-likelihood sequence $\{L(\theta^{(t)})\}$ converge?
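The monotonicity property is easy to verify numerically by tracking $L(\theta^{(t)})$ along the iterations of the multinomial example above; every increment is nonnegative. A minimal sketch (reusing the same counts; illustrative only):

```python
import numpy as np

y1, y2, y3, y4 = 125, 18, 20, 34

def log_lik(theta):
    # observed-data log-likelihood, up to an additive constant
    return (y1 * np.log(1/2 + theta/4) + (y2 + y3) * np.log((1 - theta)/4)
            + y4 * np.log(theta/4))

theta, lls = 0.1, []
for _ in range(20):
    x1 = y1 * (theta/4) / (1/2 + theta/4)        # E-step
    theta = (x1 + y4) / (x1 + y4 + y2 + y3)      # M-step
    lls.append(log_lik(theta))

# every difference should be >= 0: L(theta^(t)) is non-decreasing
print(np.diff(lls) >= -1e-12)
```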

7 The EM Algorithm: Convergence of EM Sequences

How about the convergence of the EM sequences $\{\theta^{(t)}\}$ themselves?

Convergence of EM sequences: Wu (1983).

Note: the proof of convergence of EM sequences (Theorem 2) in DLR contains an incorrect use of the triangle inequality, and the theorem itself is questionable; Boyles (1983) provided a counterexample (for GEM, not EM).

8 The EM Algorithm: Rate of Convergence

GEM mapping: $\theta^{(t)} = M(\theta^{(t-1)})$, a map from $\Theta$ to $\Theta$.

Convergence rate (DLR). Suppose that $\{\theta^{(t)}\}$ is a GEM sequence such that (1) $\theta^{(t)}$ converges to $\theta^*$ in the closure of $\Theta$, (2) $D^{10} Q(\theta^{(t)} \mid \theta^{(t-1)}) = 0$, and (3) $D^{20} Q(\theta^{(t)} \mid \theta^{(t-1)})$ is negative definite with eigenvalues bounded away from zero. Then $DL(\theta^*) = 0$, $D^{20} Q(\theta^* \mid \theta^*)$ is negative definite, and
$$DM(\theta^*) = D^{20} H(\theta^* \mid \theta^*)\, \big[D^{20} Q(\theta^* \mid \theta^*)\big]^{-1}.$$

Note that $-D^2 L(\theta^*)$ is a measure of the information in the observed data about $\theta$.
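Near convergence $\theta^{(t+1)} - \theta^* \approx DM(\theta^*)\,(\theta^{(t)} - \theta^*)$, so the rate can be estimated directly from successive iterates. A sketch for the scalar multinomial example (illustrative; in one dimension $DM(\theta^*)$ is simply the ratio $D^{20}H / D^{20}Q$, the fraction of missing information):

```python
y1, y2, y3, y4 = 125, 18, 20, 34

def em_step(theta):
    x1 = y1 * (theta/4) / (1/2 + theta/4)        # E-step
    return (x1 + y4) / (x1 + y4 + y2 + y3)       # M-step: one application of M

# run to (near) convergence to obtain theta*
theta_star = 0.5
for _ in range(200):
    theta_star = em_step(theta_star)

# ratios of successive errors estimate DM(theta*)
theta, rates = 0.2, []
for _ in range(10):
    theta_new = em_step(theta)
    rates.append((theta_new - theta_star) / (theta - theta_star))
    theta = theta_new

print(rates[-1])   # stabilizes at the linear rate, about 0.13 for these counts
```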

9 Applications of the EM Algorithm: Missing Data

Multinomial sampling: the numerical example discussed in class.

Normal linear model: create a complete, special design matrix.

Multivariate normal sampling: missing data.

Grouping, censoring, and truncation: consider the example of Liu and Sun (2000, Technometrics).

Finite mixtures: consider ML estimation from a random sample from $f(x \mid \theta) = \sum_{j=1}^{k} \alpha_j f_j(x \mid \theta)$, where $\alpha_j > 0$, $\sum_j \alpha_j = 1$, and $f_j(x \mid \theta)$ is the density function of $N(\mu_j, \sigma_j^2)$. Suppose $x_1, \ldots, x_n$ is a sample from $f(x \mid \theta)$. Introduce the missing data $z_1, \ldots, z_n$ and consider the hierarchical model: $z_i \mid \theta \sim \text{Multinomial}(\alpha_1, \ldots, \alpha_k)$ and $x_i \mid (\theta, z_i = j) \sim f_j(x \mid \theta)$ (see the sketch below).
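For the finite-mixture application, the E-step computes the posterior probabilities of the labels $z_i$ and the M-step is a weighted complete-data MLE. A minimal univariate sketch (assuming SciPy is available; illustrative, not the course's code):

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, k, n_iter=200, rng=np.random.default_rng(0)):
    """EM for a k-component univariate Gaussian mixture.
    Missing data: the component labels z_1, ..., z_n."""
    n = len(x)
    alpha = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    sigma = np.full(k, np.std(x))
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = P(z_i = j | x_i, theta)
        dens = norm.pdf(x[:, None], loc=mu, scale=sigma) * alpha
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted complete-data MLEs
        nj = w.sum(axis=0)
        alpha = nj / n
        mu = (w * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    return alpha, mu, sigma

# example: two well-separated components
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 700)])
print(em_gaussian_mixture(x, k=2))
```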

10 Applications of the EM Algorithm

Mixed-effects models (Laird and Ware, 1982, Biometrics):
$$y_i = X_i \beta + Z_i b_i + e_i,$$
where $y_i$ is an $(n_i \times 1)$ observed vector, $X_i$ and $Z_i$ are $(n_i \times p)$ and $(n_i \times q)$ design matrices, respectively, $b_i \sim N_q(0, \Psi)$, and $e_i \sim N_{n_i}(0, \sigma^2 I)$.

Factor analysis models:
$$y_i = \mu + \beta z_i + e_i,$$
where $y_i$ is a $(p \times 1)$ observed vector, $\beta$ is the factor-loading matrix, $z_i \sim N_q(0, I)$ with $q < p$, and $e_i \sim N_p(0, \sigma^2 I)$.
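For the factor analysis model with isotropic noise $\sigma^2 I$, the E-step only needs the conditional mean and second moment of $z_i$ given $y_i$, and the M-step has closed-form updates for the loadings and the noise variance. A compact sketch under those assumptions (variable names are illustrative, not from the slides):

```python
import numpy as np

def em_factor_isotropic(Y, q, n_iter=200, rng=np.random.default_rng(0)):
    """EM for y_i = mu + B z_i + e_i with z_i ~ N_q(0, I), e_i ~ N_p(0, sigma2*I).
    Y is an (n x p) data matrix; returns mu, B (p x q), sigma2."""
    n, p = Y.shape
    mu = Y.mean(axis=0)                      # MLE of mu is the sample mean
    Yc = Y - mu                              # centered data
    B = rng.normal(size=(p, q))
    sigma2 = Yc.var()
    for _ in range(n_iter):
        # E-step: conditional moments of the latent factors
        M = B.T @ B + sigma2 * np.eye(q)     # q x q
        Minv = np.linalg.inv(M)
        Ez = Yc @ B @ Minv                   # rows are E[z_i | y_i]
        Ezz = n * sigma2 * Minv + Ez.T @ Ez  # sum_i E[z_i z_i^T | y_i]
        # M-step: closed-form updates
        B = (Yc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Yc ** 2) - np.trace(Ezz @ B.T @ B)) / (n * p)
    return mu, B, sigma2
```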

11 Applications of the EM Algorithm

Positron Emission Tomography (Vardi, Shepp, and Kaufman, 1985, JASA, p. 8-19).

Network Tomography (Vardi, 1996, JASA).

ML estimation of discrete distributions with simplex constraints (Liu, 2000, JASA).

12 Applications of the EM Algorithm

Local multiple biopolymer sequence alignment (Lawrence and Reilly, 1990, Proteins: Structure, Function, and Genetics, 7, 41-51).

Linear biopolymer sequence data:

Sequence   Observed data
1          y_{1,1}  y_{1,2}  ...  y_{1,n_1}
2          y_{2,1}  y_{2,2}  ...  y_{2,n_2}
...
m          y_{m,1}  y_{m,2}  ...  y_{m,n_m}

where $y_{i,j} \in \{s_k : k = 1, \ldots, K\}$ for all $i, j$, with $K = 4$ for DNA sequences and $K = 20$ for proteins (sequences of amino acids).

13 Applications of the EM Algorithm

Assumptions:
1. Each observed sequence contains one complete motif (subsequence) of a common type.
2. The location of the motif in sequence $i$, denoted by $x_i$, is unknown for all $i = 1, \ldots, m$.
3. The motif length, $J$, is fixed.
4. The common type is specified as follows: $y_{i, x_i}, y_{i, x_i+1}, \ldots, y_{i, x_i+J-1}$ are independent; for $j = 1, \ldots, J$, $y_{i, x_i+(j-1)} \sim \text{multinomial}(\theta_{j,1}, \ldots, \theta_{j,K})$, i.e., $\text{Prob}(y_{i, x_i+(j-1)} = s_k) = \theta_{j,k}$; $\theta_j \equiv (\theta_{j,1}, \ldots, \theta_{j,K})$ is unknown.
5. All $y_{i,j}$ are independent. For $y_{i,j}$ in the background (not in the motif of sequence $i$), $y_{i,j} \sim \text{multinomial}(\theta_{0,1}, \ldots, \theta_{0,K})$, i.e., $\text{Prob}(y_{i,j} = s_k) = \theta_{0,k}$; $\theta_0 \equiv (\theta_{0,1}, \ldots, \theta_{0,K})$ is unknown.
6. For the unknown motif locations, $\text{Prob}(x_i = l) = \dfrac{1}{n_i - J + 1}$, $l = 1, \ldots, n_i - J + 1$.

14 Applications of the EM Algorithm

The complete-data likelihood is proportional to
$$\prod_{k=1}^{K} \prod_{i=1}^{m} \theta_{0,k}^{N_{i,k}(x_i)} \; \prod_{j=1}^{J} \prod_{k=1}^{K} \prod_{i=1}^{m} \theta_{j,k}^{M_{i,j,k}(x_i)},$$
where
$$N_{i,k}(x_i) = \#\{y_{i,t} : t = 1, \ldots, x_i - 1,\ x_i + J, \ldots, n_i;\ y_{i,t} = s_k\}, \qquad
M_{i,j,k}(x_i) = \#\{y_{i,t} : t = x_i + (j-1);\ y_{i,t} = s_k\}.$$

The complete-data log-likelihood function:
$$\sum_{k=1}^{K}\left[\sum_{i=1}^{m} N_{i,k}(x_i)\right] \ln \theta_{0,k} \;+\; \sum_{j=1}^{J}\sum_{k=1}^{K}\left[\sum_{i=1}^{m} M_{i,j,k}(x_i)\right] \ln \theta_{j,k}.$$

The complete-data ML estimate of $\theta$:
$$\hat\theta_{0,k} = \frac{\sum_{i=1}^{m} N_{i,k}(x_i)}{\sum_{i=1}^{m} (n_i - J)} \qquad \text{and} \qquad \hat\theta_{j,k} = \frac{\sum_{i=1}^{m} M_{i,j,k}(x_i)}{m},$$
for $j = 1, \ldots, J$ and $k = 1, \ldots, K$.

15 Applications of the EM Algorithm

E-step: fix $\theta$ at its current estimate and compute $\hat N_{i,k} \equiv E(N_{i,k}(x_i) \mid Y, \theta)$ and $\hat M_{i,j,k} \equiv E(M_{i,j,k}(x_i) \mid Y, \theta)$ by evaluating
$$\text{Prob}(x_i = l \mid Y, \theta) \;\propto\; \prod_{k=1}^{K} \theta_{0,k}^{N_{i,k}(l)} \prod_{j=1}^{J} \prod_{k=1}^{K} \theta_{j,k}^{M_{i,j,k}(l)}
\;\propto\; \prod_{j=1}^{J} \frac{\theta_{j,\, k(y_{i,l+j-1})}}{\theta_{0,\, k(y_{i,l+j-1})}}
\;=\; \prod_{j=1}^{J} \prod_{k=1}^{K} \left(\frac{\theta_{j,k}}{\theta_{0,k}}\right)^{M_{i,j,k}(l)},$$
normalized over $l = 1, \ldots, n_i - J + 1$, for $i = 1, \ldots, m$, where $k(y)$ is the index of $y$ in $\{s_1, s_2, \ldots, s_K\}$.

16 Applications of the EM Algorithm

M-step: update the estimate of $\theta$:
$$\hat\theta_{0,k} = \frac{\sum_{i=1}^{m} \hat N_{i,k}}{\sum_{i=1}^{m} (n_i - J)} \qquad \text{and} \qquad \hat\theta_{j,k} = \frac{\sum_{i=1}^{m} \hat M_{i,j,k}}{m},$$
for $j = 1, \ldots, J$ and $k = 1, \ldots, K$.
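Combining the E and M steps, a compact Python sketch of this motif-finding EM might look as follows; the integer encoding of the letters (values in {0, ..., K-1}) and the small pseudocount in the M-step are implementation choices, not part of the original formulation:

```python
import numpy as np

def motif_em(seqs, K, J, n_iter=100, rng=np.random.default_rng(0)):
    """EM for the one-motif-per-sequence model.
    seqs: list of 1-D integer arrays with entries in {0, ..., K-1}."""
    m = len(seqs)
    theta0 = np.full(K, 1.0 / K)                   # background frequencies theta_0
    theta = rng.dirichlet(np.ones(K), size=J)      # J x K motif frequencies theta_j
    for _ in range(n_iter):
        N_hat = np.zeros(K)                        # expected background counts
        M_hat = np.zeros((J, K))                   # expected motif counts
        for y in seqs:
            L = len(y) - J + 1                     # possible motif locations
            # log Prob(x_i = l | Y, theta), up to a constant:
            #   sum_j log( theta[j, y[l+j]] / theta0[y[l+j]] )
            logw = np.array([np.sum(np.log(theta[np.arange(J), y[l:l + J]])
                                    - np.log(theta0[y[l:l + J]]))
                             for l in range(L)])
            w = np.exp(logw - logw.max())
            w /= w.sum()
            tot = np.bincount(y, minlength=K)      # letter counts for the whole sequence
            for l, wl in enumerate(w):
                window = y[l:l + J]
                M_hat[np.arange(J), window] += wl
                N_hat += wl * (tot - np.bincount(window, minlength=K))
        # M-step (a tiny pseudocount keeps the probabilities strictly positive)
        theta0 = (N_hat + 1e-8) / (N_hat.sum() + K * 1e-8)
        theta = (M_hat + 1e-8) / (M_hat.sum(axis=1, keepdims=True) + K * 1e-8)
    return theta0, theta
```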

17 EM Supplements and Extensions: Observed Information Matrix

Louis (1982, JRSS-B):
$$I_{obs}(\theta) = E[I_{com}(\theta) \mid X_{obs}, \theta] - E\big[G(X_{com} \mid \theta)\, G^T(X_{com} \mid \theta) \mid X_{obs}, \theta\big] + G^*(X_{obs} \mid \theta)\, G^{*T}(X_{obs} \mid \theta),$$
where $G(X_{com} \mid \theta)$ and $G^*(X_{obs} \mid \theta)$ are the gradient vectors of the Q and L functions.

Meng and Rubin (1991, JASA):
$$I^{-1}(\theta \mid X_{obs}) = \big(E[I(\theta \mid X_{com}) \mid X_{obs}, \theta]\big)^{-1} + \big(E[I(\theta \mid X_{com}) \mid X_{obs}, \theta]\big)^{-1} DM\, (I - DM)^{-1},$$
where $\theta^{(t+1)} - \theta^* \approx (\theta^{(t)} - \theta^*)\, DM$, and $DM$ is approximated numerically using the computer code for the E and M steps.

Liu (1998, Biometrika): computing the observed information matrix from conditional information.
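Louis's identity can be checked on the grouped-multinomial example: given the observed data, only the split count $x_1$ is random, with $x_1 \mid y_1 \sim \text{Binomial}(y_1,\ (\theta/4)/(1/2 + \theta/4))$, and the observed-data score vanishes at the MLE, so the last term drops. The sketch below (illustrative, not from the slides) compares the result with minus the second derivative of $L(\theta)$ computed directly:

```python
y1, y2, y3, y4 = 125, 18, 20, 34

# run the EM iterations from the earlier example to obtain the MLE
theta = 0.5
for _ in range(200):
    x1 = y1 * (theta / 4) / (1 / 2 + theta / 4)
    theta = (x1 + y4) / (x1 + y4 + y2 + y3)        # theta -> about 0.6268

# conditional distribution of the split count: x1 | y1 ~ Binomial(y1, p)
p = (theta / 4) / (1 / 2 + theta / 4)
Ex1, Vx1 = y1 * p, y1 * p * (1 - p)

# Louis (1982): I_obs = E[I_com | X_obs] - Var[complete-data score | X_obs]
# (the observed-data score term is zero at the MLE);
# complete-data log-likelihood: (x1 + x4) ln(theta) + (x2 + x3) ln(1 - theta)
E_I_com = (Ex1 + y4) / theta**2 + (y2 + y3) / (1 - theta)**2
V_score = Vx1 / theta**2
I_louis = E_I_com - V_score

# direct calculation: minus the second derivative of L(theta)
I_direct = y1 / (2 + theta)**2 + (y2 + y3) / (1 - theta)**2 + y4 / theta**2

print(I_louis, I_direct)   # the two agree (about 377.5)
```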

18 EM Supplements and Extensions

ECM (Meng and Rubin, 1993, Biometrika)
ECME (Liu and Rubin, 1994, Biometrika)
PX-EM (Liu, Rubin, and Wu, 1998, Biometrika)
