Aula 7. Optimization Methods II. EM algorithms. Anatoli Iambartsev, IME-USP
Aula 7. Optimization Methods II. 1 [RC] Missing-data models. Demarginalization. The term EM algorithm has been around for a long time; it goes back to [DLR]. Consider the case where the density of the observations can be expressed as the marginal
\[ g_\theta(x) = \int f_\theta(x, z)\, dz \]
of a joint density $f_\theta(x, z)$ of the observed data $x$ and missing data $z$. This representation occurs in many statistical settings, including censoring models, mixtures, and latent variable models (tobit, probit, ARCH, stochastic volatility, etc.).
Aula 7. Optimization Methods II. 2 [RC] Example 5.12. The mixture model of Example 5.2 (see previous lecture),
\[ 0.25\, N(\mu_1, 1) + 0.75\, N(\mu_2, 1), \]
can be expressed as a missing-data model even though the (observed) likelihood can be computed in a manageable time. Indeed, if we introduce a vector $z = (z_1, \ldots, z_n) \in \{1, 2\}^n$ in addition to the sample $x = (x_1, \ldots, x_n)$ such that
\[ P_\theta(Z_i = 1) = 1 - P_\theta(Z_i = 2) = 0.25, \qquad X_i \mid Z_i = z \sim N(\mu_z, 1), \]
we recover the mixture model of Example 5.2 as the marginal distribution of $X_i$. The (observed) likelihood is then obtained as $E\{H(x, Z)\}$ for
\[ H(x, z) = \prod_{i:\, z_i = 1} \frac{1}{4} \exp\left\{ -\frac{(x_i - \mu_1)^2}{2} \right\} \prod_{i:\, z_i = 2} \frac{3}{4} \exp\left\{ -\frac{(x_i - \mu_2)^2}{2} \right\}. \]
Aula 7. Optimization Methods II. 3 [RC] Example 5.12. The (observed) likelihood is then obtained as $E\{H(x, Z)\}$ for
\[ H(x, z) = \prod_{i:\, z_i = 1} \frac{1}{4} \exp\left\{ -\frac{(x_i - \mu_1)^2}{2} \right\} \prod_{i:\, z_i = 2} \frac{3}{4} \exp\left\{ -\frac{(x_i - \mu_2)^2}{2} \right\}. \]
Indeed,
\[ g_{\mu_1, \mu_2}(x) = \int f_{\mu_1, \mu_2}(x, z)\, dz = \sum_{z \in \{1,2\}^n} \frac{H(x, z)}{(\sqrt{2\pi})^n} = \prod_{i=1}^n \left( \frac{1}{4} \frac{e^{-(x_i - \mu_1)^2/2}}{\sqrt{2\pi}} + \frac{3}{4} \frac{e^{-(x_i - \mu_2)^2/2}}{\sqrt{2\pi}} \right). \]
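As a quick numerical illustration of this demarginalization identity (the code and the toy sample below are my own, not from [RC]), one can compare the product form of $g_{\mu_1,\mu_2}(x)$ with the brute-force sum of $H(x,z)/(\sqrt{2\pi})^n$ over all $z \in \{1,2\}^n$ for a small sample:

    ## Demarginalization check for the 0.25 N(mu1,1) + 0.75 N(mu2,1) mixture
    set.seed(1)
    n  <- 5                                    # small n, so the 2^n-term sum is feasible
    mu <- c(0, 3)                              # illustrative values of (mu1, mu2)
    x  <- c(rnorm(2, mu[1]), rnorm(3, mu[2]))  # a small artificial sample

    ## product (marginal) form of the observed likelihood
    g.direct <- prod(0.25 * dnorm(x, mu[1]) + 0.75 * dnorm(x, mu[2]))

    ## brute-force sum of H(x, z) / (2*pi)^(n/2) over all z in {1,2}^n
    zgrid <- as.matrix(expand.grid(rep(list(1:2), n)))
    H <- apply(zgrid, 1, function(z)
      prod(ifelse(z == 1, 0.25, 0.75) * exp(-(x - mu[z])^2 / 2)))
    g.sum <- sum(H) / (2 * pi)^(n / 2)

    c(g.direct, g.sum)                         # the two values coincide

Both quantities are the same observed likelihood; the sum form only makes the role of the missing labels $z$ explicit.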
Aula 7. Optimization Methods II. 4 [RC] Example 5.13. Censored data may come from experiments where some potential observations are replaced with a lower bound because they take too long to observe. Suppose that we observe $Y_1, \ldots, Y_m$, iid from $f(y|\theta)$, and that the remaining $n - m$ observations $(Y_{m+1}, \ldots, Y_n)$ are censored at the threshold $a$. The corresponding likelihood function is then
\[ L(\theta | y) = \big( 1 - F(a | \theta) \big)^{n-m} \prod_{i=1}^m f(y_i | \theta), \]
where $F$ is the cdf associated with $f$ and $y = (y_1, \ldots, y_m)$.
Aula 7. Optimization Methods II. 5 [RC] Example 5.13 (continued).
\[ L(\theta | y) = \big( 1 - F(a | \theta) \big)^{n-m} \prod_{i=1}^m f(y_i | \theta). \]
If we had observed the last $n - m$ values, say $z = (z_{m+1}, \ldots, z_n)$ with $z_i \geq a$ $(i = m+1, \ldots, n)$, we could have constructed the (complete-data) likelihood
\[ L^c(\theta | y, z) = \prod_{i=1}^m f(y_i | \theta) \prod_{i=m+1}^n f(z_i | \theta). \]
Note that $L(\theta | y)$ is recovered from $L^c$ by demarginalization,
\[ L(\theta | y) = \int_{(a, +\infty)^{n-m}} L^c(\theta | y, z)\, dz, \]
and that the density of the missing data conditional on the observed data is
\[ f(z | y, \theta) = \prod_{i=m+1}^n \frac{f(z_i | \theta)}{1 - F(a | \theta)}, \]
i.e., $f(z | \theta)$ restricted to $(a, +\infty)$.
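As a quick illustration (my own sketch, assuming the $N(\theta, 1)$ model of Example 5.14 below for $f$), the censored-data log-likelihood can be evaluated in R directly with dnorm and pnorm:

    ## log L(theta | y) for f = N(theta, 1) density, censoring threshold a,
    ## m observed values y and n - m censored values
    loglik.cens <- function(theta, y, a, n) {
      m <- length(y)
      sum(dnorm(y, mean = theta, sd = 1, log = TRUE)) +
        (n - m) * pnorm(a, mean = theta, sd = 1, lower.tail = FALSE, log.p = TRUE)
    }

    ## illustrative data: n = 40 draws from N(2, 1), censored at a = 2.5
    set.seed(2)
    n <- 40; a <- 2.5
    ysim <- rnorm(n, mean = 2)
    y <- ysim[ysim < a]                 # the observed (uncensored) part of the sample
    loglik.cens(2.0, y, a, n)           # censored log-likelihood at theta = 2
    loglik.cens(2.5, y, a, n)           # and at theta = 2.5

This is the function that the EM iterations below increase at every step.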
Aula 7. Optimization Methods II. 6 Main Idea of EM algorithms. Demarginalization: the observed density $g_\theta(x)$ is completed into a joint density $f_\theta(x, z)$ with
\[ g_\theta(x) = \int f_\theta(x, z)\, dz. \]
Values of $z$ can be generated from the conditional distribution
\[ k_\theta(z | x) = \frac{f_\theta(x, z)}{g_\theta(x)}. \]
Taking logarithms,
\[ \log g_\theta(x) = \log f_\theta(x, z) - \log k_\theta(z | x). \]
In likelihood notation,
\[ \log L(\theta | x) = \log L^c(\theta | x, z) - \log k_\theta(z | x), \]
where $L^c$ stands for the complete-data likelihood function.
Aula 7. Optimization Methods II. 7 Main Idea of EM algorithms.
\[ \log L(\theta | x) = \log L^c(\theta | x, z) - \log k_\theta(z | x). \]
Fix a value $\theta_0$ and take the expectation with respect to the distribution $k_{\theta_0}(z | x)$:
\[ \log L(\theta | x) = E_{k, \theta_0}\big[ \log L^c(\theta | x, Z) \big] - E_{k, \theta_0}\big[ \log k_\theta(Z | x) \big] =: Q(\theta | \theta_0, x) - H(\theta | \theta_0, x). \]
Theorem. Let $\theta_1$ maximize $Q$, i.e.,
\[ Q(\theta_1 | \theta_0, x) = \max_\theta Q(\theta | \theta_0, x). \]
Then
\[ \log L(\theta_1 | x) \geq \log L(\theta_0 | x). \]
Aula 7. Optimization Methods II. 8 Main Idea of EM algorithms. Proof. Suppose $Q(\theta_1 | \theta_0, x) = \max_\theta Q(\theta | \theta_0, x)$; we show that $\log L(\theta_1 | x) \geq \log L(\theta_0 | x)$. Write
\[ \log L(\theta_1 | x) - \log L(\theta_0 | x) = \big( Q(\theta_1 | \theta_0, x) - Q(\theta_0 | \theta_0, x) \big) - \big( H(\theta_1 | \theta_0, x) - H(\theta_0 | \theta_0, x) \big). \]
Note that, by the definition of $\theta_1$,
\[ Q(\theta_1 | \theta_0, x) - Q(\theta_0 | \theta_0, x) \geq 0. \]
Aula 7. Optimization Methods II. 9 Main Idea of EM algorithms. Proof (continued). Recall that
\[ \log L(\theta_1 | x) - \log L(\theta_0 | x) = \big( Q(\theta_1 | \theta_0, x) - Q(\theta_0 | \theta_0, x) \big) - \big( H(\theta_1 | \theta_0, x) - H(\theta_0 | \theta_0, x) \big), \]
and, for the second difference,
\[ H(\theta_1 | \theta_0, x) - H(\theta_0 | \theta_0, x) = E_{k, \theta_0}\big[ \log k_{\theta_1}(Z | x) \big] - E_{k, \theta_0}\big[ \log k_{\theta_0}(Z | x) \big] = E_{k, \theta_0}\left[ \log \frac{k_{\theta_1}(Z | x)}{k_{\theta_0}(Z | x)} \right] \leq \log E_{k, \theta_0}\left[ \frac{k_{\theta_1}(Z | x)}{k_{\theta_0}(Z | x)} \right] = \log 1 = 0, \]
where Jensen's inequality, $E \log \xi \leq \log E \xi$, was used; the last equality holds because the ratio integrates to $\int k_{\theta_1}(z | x)\, dz = 1$ under $k_{\theta_0}$.
Aula 7. Optimization Methods II. 10 Main Idea of EM algorithms. Proof (end). Combining
\[ \log L(\theta_1 | x) - \log L(\theta_0 | x) = \big( Q(\theta_1 | \theta_0, x) - Q(\theta_0 | \theta_0, x) \big) - \big( H(\theta_1 | \theta_0, x) - H(\theta_0 | \theta_0, x) \big) \]
with
\[ Q(\theta_1 | \theta_0, x) - Q(\theta_0 | \theta_0, x) \geq 0, \qquad H(\theta_1 | \theta_0, x) - H(\theta_0 | \theta_0, x) \leq 0, \]
we obtain $\log L(\theta_1 | x) - \log L(\theta_0 | x) \geq 0$. This completes the proof of the theorem.
Aula 7. Optimization Methods II. 11 Main Idea of EM algorithms. Each iteration of the EM algorithm maximizes a function $Q$. Let $(\theta_t)$ be the sequence obtained recursively by
\[ Q(\theta_{t+1} | \theta_t, x) = \max_\theta Q(\theta | \theta_t, x). \]
This recurrent scheme consists of two steps, expectation and maximization, which give the scheme its name: the EM algorithm.
Aula 7. Optimization Methods II. 12 EM algorithm. Choose an initial parameter $\theta_0$, set $t = 0$, and repeat:
E-step. Compute the expectation
\[ Q(\theta | \theta_t, x) = E_{k, \theta_t}\big[ \log L^c(\theta | x, Z) \big] \]
with respect to the distribution $k_{\theta_t}(z | x)$.
M-step. Maximize $Q(\theta | \theta_t, x)$ in $\theta$ and determine the next value
\[ \theta_{t+1} = \arg\max_\theta Q(\theta | \theta_t, x); \]
set $t = t + 1$ and return to the E-step.
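A minimal generic sketch of this scheme in R (the function and argument names are my own, not from [RC]) separates the two steps: the E-step returns the function $\theta \mapsto Q(\theta | \theta_t, x)$ and the M-step maximizes it:

    ## generic EM skeleton: e.step(theta_t) must return the function
    ## theta -> Q(theta | theta_t, x); m.step(Q) must return its maximizer
    em <- function(theta0, e.step, m.step, maxit = 100, tol = 1e-8) {
      theta <- theta0
      for (t in seq_len(maxit)) {
        Q         <- e.step(theta)      # E-step: build Q(. | theta_t, x)
        theta.new <- m.step(Q)          # M-step: maximize Q in theta
        if (max(abs(theta.new - theta)) < tol) break
        theta <- theta.new
      }
      theta
    }

Example 5.14 below fits this template with a closed-form M-step.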
Aula 7. Optimization Methods II. 13 [RC] Example 5.14 (continuation of Example 5.13). Let again $Y_1, \ldots, Y_m$ be iid with density $f(y | \theta)$ and let $Y_{m+1}, \ldots, Y_n$ be censored at the level $a$. The likelihood function is
\[ L(\theta | y) = \big( 1 - F(a | \theta) \big)^{n-m} \prod_{i=1}^m f(y_i | \theta), \]
where $F(a | \theta) = P(Y_i \leq a)$. If we had observed the last $n - m$ values, say $z = (z_{m+1}, \ldots, z_n)$ with $z_i \geq a$ $(i = m+1, \ldots, n)$, we could have constructed the (complete-data) likelihood
\[ L^c(\theta | y, z) = \prod_{i=1}^m f(y_i | \theta) \prod_{i=m+1}^n f(z_i | \theta), \]
and
\[ k_\theta(z | y) = \prod_{i=m+1}^n \frac{f(z_i | \theta)}{1 - F(a | \theta)}. \]
Aula 7. Optimization Methods II. 14 [RC] Example 5.14 (continuation of Example 5.13). Suppose that $f(y | \theta)$ corresponds to the $N(\theta, 1)$ distribution. The complete-data likelihood is
\[ L^c(\theta | y, z) \propto \prod_{i=1}^m e^{-(y_i - \theta)^2/2} \prod_{i=m+1}^n e^{-(z_i - \theta)^2/2}, \]
resulting in the expected complete-data log-likelihood
\[ Q(\theta | \theta_0, y) = -\frac{1}{2} \sum_{i=1}^m (y_i - \theta)^2 - \frac{1}{2} \sum_{i=m+1}^n E_{k, \theta_0}\big[ (Z_i - \theta)^2 \big], \]
where the missing observations $Z_i$ are distributed from a normal $N(\theta_0, 1)$ distribution truncated at $a$. Doing the M-step (i.e., differentiating the function $Q(\theta | \theta_0, y)$ in $\theta$ and setting the derivative equal to 0) then leads to the EM update
\[ \hat{\theta} = \frac{m \bar{y} + (n - m)\, E_{k, \theta_0}(Z_1)}{n}. \]
Aula 7. Optimization Methods II. 15 [RC] Example 5.14 (continuation of Example 5.13). Doing the M-step gives the EM update
\[ \hat{\theta} = \frac{m \bar{y} + (n - m)\, E_{k, \theta_0}(Z_1)}{n}. \]
Since
\[ E_{k, \theta_0}(Z_1) = \theta_0 + \frac{\varphi(a - \theta_0)}{1 - \Phi(a - \theta_0)}, \]
where $\varphi$ and $\Phi$ are the standard normal pdf and cdf, respectively, the EM sequence is
\[ \theta_{t+1} = \frac{m}{n} \bar{y} + \frac{n - m}{n} \left( \theta_t + \frac{\varphi(a - \theta_t)}{1 - \Phi(a - \theta_t)} \right). \]
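A minimal R implementation of this fixed-point iteration (the simulated data, starting value, and number of iterations are illustrative choices of mine, not from [RC]):

    ## EM for the censored N(theta, 1) model:
    ##   theta_{t+1} = (m/n)*ybar + ((n-m)/n)*(theta_t + phi(a - theta_t)/(1 - Phi(a - theta_t)))
    set.seed(3)
    n <- 100; a <- 3; theta.true <- 2.5
    ysim <- rnorm(n, mean = theta.true)
    y <- ysim[ysim < a]                     # observed (uncensored) values
    m <- length(y); ybar <- mean(y)

    theta <- ybar                           # a convenient starting value
    for (t in 1:200) {
      ez    <- theta + dnorm(a - theta) / (1 - pnorm(a - theta))   # E-step: E[Z_1] under k_{theta_t}
      theta <- (m * ybar + (n - m) * ez) / n                       # M-step: closed-form update
    }
    theta                                   # EM estimate of theta

Each pass of the loop is one E-step (the expectation of a censored value under the current $\theta_t$) followed by one closed-form M-step.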
Aula 7. Optimization Methods II. 16 Principle of missing information (informally). From
\[ \log L(\theta | x) = \log L^c(\theta | x, z) - \log k_\theta(z | x) \]
we obtain
\[ -\frac{\partial^2 \log L(\theta | x)}{\partial \theta^2} = -\frac{\partial^2 \log L^c(\theta | x, z)}{\partial \theta^2} + \frac{\partial^2 \log k_\theta(z | x)}{\partial \theta^2}, \]
i.e., observed information = complete information - missing information. Informally, about the convergence rate of EM algorithms: the larger the proportion of missing information, the slower the convergence of the algorithm.
Aula 7. Optimization Methods II. 17 EM algorithms for exponential family models. It is easier to implement the algorithm when the complete data $(z, x)$ has a distribution of exponential family type in canonical form:
\[ p(z, x | \theta) = b(z, x) \exp\big( \theta^T s(z, x) \big) / a(\theta). \]
Let $y = (z, x)$ denote the complete data. We have
\[ \log p(y | \theta) = \log b(y) + \theta^T s(y) - \log a(\theta), \qquad \frac{\partial}{\partial \theta} \log p(y | \theta) = s(y) - \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta}. \]
Remember that
\[ a(\theta) = \int b(y) \exp\big( \theta^T s(y) \big)\, dy. \]
Aula 7. Optimization Methods II. 18 EM algorithms for exponential family models. From
\[ \frac{\partial}{\partial \theta} \log p(y | \theta) = s(y) - \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta}, \qquad a(\theta) = \int b(y) \exp\big( \theta^T s(y) \big)\, dy, \]
we have
\[ \frac{\partial \log a(\theta)}{\partial \theta} = \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta} = \frac{1}{a(\theta)} \int b(y) \frac{\partial \exp\big( \theta^T s(y) \big)}{\partial \theta}\, dy = \int s(y)\, \frac{b(y) \exp\big( \theta^T s(y) \big)}{a(\theta)}\, dy = E\big( s(Y) | \theta \big). \]
Thus,
\[ \frac{\partial}{\partial \theta} \log p(y | \theta) = 0 \iff s(y) = E\big( s(Y) | \theta \big). \]
Aula 7. Optimization Methods II. 19 EM algorithms for exponential family models. Implementation of the EM algorithm. E-step:
\[ Q(\theta | \theta_t) = \int \log p(z, x | \theta)\, p(z | \theta_t, x)\, dz = \int \log b(z, x)\, p(z | \theta_t, x)\, dz + \theta^T \int s(z, x)\, p(z | \theta_t, x)\, dz - \log a(\theta). \]
Note that the first term does not participate in the M-step. M-step: we find the extreme point from
\[ \frac{\partial Q(\theta | \theta_t)}{\partial \theta} = \int s(z, x)\, p(z | \theta_t, x)\, dz - \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta} = E\big( s(z, x) | \theta_t, x \big) - E\big( s(z, x) | \theta \big) = 0. \]
Remember that $E\big( s(z, x) | \theta \big) = \frac{1}{a(\theta)} \frac{\partial a(\theta)}{\partial \theta}$.
Aula 7. Optimization Methods II. 20 EM algorithms for exponential family models. Thus the maximization of $Q(\theta | \theta_t, x)$ in the M-step is equivalent to solving the equation
\[ E\big( s(z, x) | \theta_t, x \big) = E\big( s(z, x) | \theta \big) \]
in $\theta$. If a solution exists, then it is unique.
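For instance (a worked check, not a slide of the original deck): in the censored $N(\theta, 1)$ model of Example 5.14 the complete-data sufficient statistic is $s(y, z) = \sum_{i=1}^m y_i + \sum_{i=m+1}^n z_i$, with $E\big( s(y, z) | \theta \big) = n\theta$, so the moment equation reads
\[ m \bar{y} + (n - m)\, E_{k, \theta_t}(Z_1) = n\theta, \]
which is exactly the EM update $\theta_{t+1} = \big( m \bar{y} + (n - m) E_{k, \theta_t}(Z_1) \big)/n$ obtained before.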
Aula 7. Optimization Methods II. 21 Monte Carlo for the E-step. Given $\theta_t$, we need to compute
\[ Q(\theta | \theta_t, x) = E_{k, \theta_t}\big[ \log L^c(\theta | Z, x) \big]. \]
When this expectation is difficult to compute explicitly, it can be approximated by Monte Carlo:
1. generate $z_1, \ldots, z_M$ according to $k_{\theta_t}(z | x)$;
2. compute
\[ \hat{Q}(\theta | \theta_t, x) = \frac{1}{M} \sum_{i=1}^M \log L^c(\theta | z_i, x). \]
In the M-step one maximizes $\hat{Q}$ in $\theta$ to obtain $\theta_{t+1}$.
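A sketch combining this Monte Carlo E-step with Example 5.14 (the truncated-normal sampler, the numerical M-step via optimize, and all constants are illustrative choices of mine, not prescribed by [RC]):

    ## Monte Carlo EM for the censored N(theta, 1) model: the E-step expectation
    ## is replaced by an average over simulated truncated-normal missing values,
    ## and the M-step maximizes the resulting Q-hat numerically.
    set.seed(4)
    n <- 100; a <- 3; theta.true <- 2.5
    ysim <- rnorm(n, mean = theta.true)
    y <- ysim[ysim < a]; m <- length(y)

    rtrunc <- function(M, theta, a) {       # draws from N(theta, 1) truncated to (a, Inf)
      u <- runif(M)
      theta + qnorm(pnorm(a - theta) + u * (1 - pnorm(a - theta)))
    }

    M <- 5000                               # Monte Carlo sample size per E-step
    theta <- mean(y)
    for (t in 1:50) {
      z <- matrix(rtrunc(M * (n - m), theta, a), nrow = M)   # M copies of the n - m missing values
      Qhat <- function(th) -0.5 * sum((y - th)^2) - 0.5 * mean(rowSums((z - th)^2))
      theta <- optimize(Qhat, interval = c(0, 6), maximum = TRUE)$maximum
    }
    theta                                   # Monte Carlo EM estimate of theta

For this particular model the maximizer of $\hat{Q}$ is also available in closed form, so the numerical M-step only illustrates the general recipe.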
Aula 7. Optimization Methods II. 22 EM algorithm. [H] Hunter, D.R., On the Geometry of EM algorithms: this paper demonstrates how the geometry of EM algorithms can help explain how their rate of convergence is related to the proportion of missing data. In a footnote, [H] writes: "In a footnote, [DLR] refer to the comment of a referee, who noted that the use of the word algorithm may be criticized since EM is not, strictly speaking, an algorithm. However, EM is a recipe for creating algorithms, and thus we consider the set of EM algorithms to consist of all algorithms baked according to the EM recipe."
Aula 7. Optimization Methods II. 23 References.
[DLR] Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39, 1-38, 1977.
[H] Hunter, D.R. On the Geometry of EM algorithms. Technical Report 0303, Department of Statistics, Penn State University, February 2003.
[RC] Robert, Christian P. and Casella, George. Introducing Monte Carlo Methods with R. Use R! Series. Springer, 2010.