Computing the MLE and the EM Algorithm


Elijah Johnston
6 years ago

ECE 830 Fall 0 Statistical Signal Processing, instructor: R. Nowak

If $X \sim p(x \mid \theta)$, $\theta \in \Theta$, then the MLE is the solution to the equations

$$\frac{\partial \log p(x \mid \theta)}{\partial \theta} = 0.$$

Sometimes these equations have a simple closed-form solution, and other times they do not and we must use computational methods to find $\hat{\theta}$.

Example 1. In some cases, the MLE is computed by taking a simple average. Suppose $X_i \overset{\text{i.i.d.}}{\sim} \text{Poisson}(\lambda)$, $i = 1, \dots, n$. Then the MLE is

$$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$

Example 2. The MLE sometimes requires solving a system of linear equations. Suppose that $X \sim \mathcal{N}(H\theta, I)$, where $H$ is $n \times k$ and known and $\theta$ is $k \times 1$ and unknown. Then the MLE is

$$\hat{\theta} = (H^T H)^{-1} H^T X.$$

Example 3. The MLE can also be the solution to a nonlinear system of equations. Suppose that

$$X_i \overset{\text{i.i.d.}}{\sim} p\,\mathcal{N}(\mu_0, \sigma_0^2) + (1-p)\,\mathcal{N}(\mu_1, \sigma_1^2), \quad i = 1, \dots, n,$$

and let $\theta = [p \;\; \mu_0 \;\; \sigma_0^2 \;\; \mu_1 \;\; \sigma_1^2]^T$. Then

$$p(x_i \mid \theta) = \frac{p}{\sqrt{2\pi\sigma_0^2}}\, e^{-\frac{(x_i - \mu_0)^2}{2\sigma_0^2}} + \frac{1-p}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}}.$$

Figure 1: Two-dimensional Gaussian mixture density.

The likelihood is a complicated nonlinear function. Moreover, it is non-convex in $\theta$:

$$p(x \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \quad \text{a product of sums of exponentials.}$$

Taking the logarithm doesn't simplify things:

$$\log p(x \mid \theta) = \text{a sum of logs of a sum of exponentials.}$$
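To make the "sum of logs of a sum of exponentials" concrete, here is a minimal sketch that evaluates the log-likelihood of Example 3. The function names and the data values are made up for illustration and are not part of the lecture. The second evaluation uses the midpoint of two label-swapped parameter vectors that have the same log-likelihood; a concave function could not be lower at that midpoint, so the comparison gives a small numerical hint that the log-likelihood is not concave.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_log_likelihood(xs, p, mu0, var0, mu1, var1):
    """log p(x | theta) for the two-component mixture of Example 3."""
    total = 0.0
    for x in xs:
        # Each term is the log of a sum of two exponentials.
        total += math.log(p * normal_pdf(x, mu0, var0)
                          + (1.0 - p) * normal_pdf(x, mu1, var1))
    return total

# Hypothetical data: two clusters near -2 and +2.
xs = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.0]

ll_a = mixture_log_likelihood(xs, 0.5, -2.0, 1.0, 2.0, 1.0)
ll_b = mixture_log_likelihood(xs, 0.5, 2.0, 1.0, -2.0, 1.0)   # labels swapped
ll_mid = mixture_log_likelihood(xs, 0.5, 0.0, 1.0, 0.0, 1.0)  # midpoint of the two

print(ll_a, ll_b, ll_mid)
```

The two label-swapped parameter vectors give the same value, while their midpoint gives a strictly smaller one.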

Also recall that the sufficient statistic in this case is the whole set of data $(X_1, X_2, \dots, X_n)$; i.e., there is no small sufficient statistic that summarizes them. What can we do in such situations? We need a computational method to maximize the likelihood function. There are two common approaches:

1. Gradient/Newton methods:
$$\theta^{(t+1)} = \theta^{(t)} + \gamma \left. \frac{\partial \log p(x \mid \theta)}{\partial \theta} \right|_{\theta = \theta^{(t)}},$$
where $\gamma > 0$ is a step size.
2. Expectation-Maximization algorithm (EM algorithm).

Gradient ascent methods should be familiar to most readers. The EM algorithm is a specialized approach designed for MLE problems, and it has some attractive properties: it doesn't require specification of a step size, and under mild conditions it is guaranteed to converge to a stationary point (typically a local maximum) of the likelihood function. If the log-likelihood function is concave (i.e., the negative log-likelihood is convex), then convergence to a global maximum likelihood point is possible using gradient methods or EM. The rest of the lecture discusses the EM algorithm.

The EM Algorithm

In many problems, the MLE based on observed data $X$ would be greatly simplified if we had additionally observed another piece of data $Y$. $Y$ is called the hidden or latent data.

Example 4. $X \sim \mathcal{N}(H\theta, I)$ can be modeled as

$$Y_{k \times 1} = \theta + W_1, \qquad X_{n \times 1} = H_{n \times k} Y + W_2,$$

such that $H W_1 + W_2 \sim \mathcal{N}(0, I)$.

If we just have $X$, then we must solve a system of equations to obtain the MLE. If the dimension is large, then computing the MLE is quite expensive (i.e., forming and inverting $H^T H$ costs at least $O(\max(nk^2, k^3))$). But if we also have $Y$, then the MLE can be computed in $O(k)$ time, since $\hat{\theta} = Y$.

Example 5. Let

$$x_i \sim p\,\mathcal{N}(\mu_0, \sigma_0^2) + (1-p)\,\mathcal{N}(\mu_1, \sigma_1^2),$$

and introduce a hidden label $y_i \in \{0, 1\}$ such that

$$x_i \mid y_i = l \sim \mathcal{N}(\mu_l, \sigma_l^2), \qquad y_i \sim \text{Bernoulli}, \quad P(y_i) = p^{1 - y_i} (1-p)^{y_i},$$

so that $P(y_i = 0) = p$, matching the mixture weights above. Given $\{(x_i, y_i)\}_{i=1}^{n}$, we have

$$\hat{\mu}_l = \frac{1}{\sum_{i} \mathbf{1}_{\{y_i = l\}}} \sum_{i:\, y_i = l} x_i, \qquad \hat{\sigma}_l^2 = \frac{1}{\sum_{i} \mathbf{1}_{\{y_i = l\}}} \sum_{i:\, y_i = l} (x_i - \hat{\mu}_l)^2, \qquad \hat{p} = \frac{1}{n} \sum_{i} \mathbf{1}_{\{y_i = 0\}}.$$

The MLEs are easy to compute here. However, if we only have $\{x_i\}_{i=1}^{n}$, computing the MLE is a complicated, non-convex optimization problem, and we can apply the EM algorithm to it. The application of the EM algorithm in this situation is shown in Example 7.
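The complete-data estimates of Example 5 amount to per-label averages. The sketch below assumes labels in {0, 1} with P(y_i = 0) = p; the function name and the toy data are hypothetical and chosen only so the arithmetic can be checked by hand.

```python
def complete_data_mle(xs, ys):
    """MLE of (p, mu_l, sigma_l^2) when the labels ys are observed.

    Assumes every label in {0, 1} appears at least once in ys.
    """
    n = len(xs)
    estimates = {}
    for label in (0, 1):
        group = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(group) / len(group)
        # The MLE of the variance divides by the group size, not size - 1.
        var = sum((x - mean) ** 2 for x in group) / len(group)
        estimates[label] = (mean, var)
    p_hat = sum(1 for y in ys if y == 0) / n   # fraction of points with label 0
    return p_hat, estimates

# Toy labelled data, for illustration only.
xs = [1.0, 3.0, 10.0, 12.0, 14.0]
ys = [0, 0, 1, 1, 1]

p_hat, estimates = complete_data_mle(xs, ys)
print(p_hat, estimates)
```

With the labels observed, each parameter is estimated in a single pass over the data; no iteration is needed.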

Main Idea

Let $L(\theta) = \log p(x \mid \theta)$ and also define the complete-data log-likelihood:

$$L_c(\theta) = \log p(x, y \mid \theta) = \log \big[ p(y \mid x, \theta)\, p(x \mid \theta) \big] = \log p(y \mid x, \theta) + \log p(x \mid \theta) = \log p(y \mid x, \theta) + L(\theta).$$

Suppose our current guess of $\theta$ is $\theta^{(t)}$ and that we would like to improve this guess. Consider

$$L(\theta) - L(\theta^{(t)}) = L_c(\theta) - L_c(\theta^{(t)}) + \log \frac{p(y \mid x, \theta^{(t)})}{p(y \mid x, \theta)}.$$

Now take the expectation of both sides with respect to $y \sim p(y \mid x, \theta^{(t)})$. The left-hand side does not depend on $y$, so we have

$$L(\theta) - L(\theta^{(t)}) = E_y[L_c(\theta)] - E_y[L_c(\theta^{(t)})] + D\big(p(y \mid x, \theta^{(t)}) \,\|\, p(y \mid x, \theta)\big).$$

Since the KL divergence satisfies $D\big(p(y \mid x, \theta^{(t)}) \,\|\, p(y \mid x, \theta)\big) \ge 0$, we have the following inequality:

$$L(\theta) - L(\theta^{(t)}) \ge E_y[L_c(\theta)] - E_y[L_c(\theta^{(t)})] = Q(\theta, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}),$$

where $Q(\theta, \theta') := E_{p(y \mid x, \theta')}[\log p(x, y \mid \theta)]$ is the expectation of the complete-data log-likelihood. We choose $\theta^{(t+1)}$ as the solution of the following optimization problem:

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)}).$$

The EM algorithm is an attractive option if the $Q$ function is easily computed and optimized. The relationship between $\log p(x \mid \theta)$, $Q(\theta, \theta^{(t)})$, $\theta^{(t)}$ and $\theta^{(t+1)}$ is depicted in the following figure:

Figure 2: Graphical illustration of the EM algorithm.

The EM algorithm proceeds as follows:

Init: $t = 0$, $\theta^{(0)} = 0$ or a random value.
For $t = 0, 1, 2, \dots$
  E step: $Q(\theta, \theta^{(t)}) = E_{p(y \mid x, \theta^{(t)})}[\log p(x, y \mid \theta)]$
  M step: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})$

The E-step and M-step repeat until convergence. The two key properties of the EM algorithm are:

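1. $\log p(x \mid \theta^{(0)}) \le \log p(x \mid \theta^{(1)}) \le \cdots$
2. It converges to a stationary point (e.g., a local maximum).

Now let's look at a few applications of the EM algorithm. The EM algorithm is especially attractive in cases where the $Q$ function is easy to compute and optimize. There is a bit of art involved in the choice of the hidden or latent data $Y$, and this needs to be worked out on a case-by-case basis.

Example 6. Original model: $X = H\theta + W$, with $W \sim \mathcal{N}(0, I_{n \times n})$. Complete model:

$$Y = \theta + W_1, \qquad W_1 \sim \mathcal{N}(0, \alpha^2 I_{k \times k}),$$
$$X = H_{n \times k} Y + W_2, \qquad W_2 \sim \mathcal{N}(0, I_{n \times n} - \alpha^2 H H^T),$$

where $\alpha$ must be small enough that $I_{n \times n} - \alpha^2 H H^T$ is positive semidefinite. Then we construct the complete-data log-likelihood. Since $p(x \mid y, \theta)$ does not depend on $\theta$,

$$\log p(x, y \mid \theta) = \log p(x \mid y, \theta) + \log p(y \mid \theta) = -\frac{\|y - \theta\|^2}{2\alpha^2} + \text{constant}$$
$$= \frac{1}{2\alpha^2} \left( 2\theta^T y - \theta^T \theta - y^T y \right) + \text{constant} = \frac{1}{2\alpha^2} \left( 2\theta^T y - \theta^T \theta \right) + \text{constant}.$$

The part that remains after dropping the constant is linear in $y$, so we only need to calculate $E_{p(y \mid x, \theta^{(t)})}[y]$.

Introduce $Z_1 = Y$ and $Z_2 = X - HY$. Their joint distribution is

$$\begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \theta \\ 0 \end{bmatrix}, \begin{bmatrix} \alpha^2 I_{k \times k} & 0 \\ 0 & I_{n \times n} - \alpha^2 H H^T \end{bmatrix} \right).$$

Since

$$\begin{bmatrix} Y \\ X \end{bmatrix} = \begin{bmatrix} I_{k \times k} & 0 \\ H & I_{n \times n} \end{bmatrix} \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix},$$

we have

$$\begin{bmatrix} Y \\ X \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \theta \\ H\theta \end{bmatrix}, \begin{bmatrix} \alpha^2 I_{k \times k} & \alpha^2 H^T \\ \alpha^2 H & I_{n \times n} \end{bmatrix} \right).$$

Applying a linear transformation that decorrelates the two blocks, we have

$$\begin{bmatrix} X \\ Y - \alpha^2 H^T X \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} H\theta \\ \theta - \alpha^2 H^T H \theta \end{bmatrix}, \begin{bmatrix} I_{n \times n} & 0 \\ 0 & \alpha^2 I_{k \times k} - \alpha^4 H^T H \end{bmatrix} \right).$$

Because $Y - \alpha^2 H^T X$ is independent of $X$, we have

$$E_{p(y \mid x, \theta^{(t)})}[y] = \alpha^2 H^T x + \theta^{(t)} - \alpha^2 H^T H \theta^{(t)} =: y^{(t)}.$$

As $Q(\theta, \theta^{(t)}) = \frac{1}{2\alpha^2} \left( 2\theta^T y^{(t)} - \theta^T \theta \right) + \text{constant}$, setting $\frac{\partial Q}{\partial \theta} = 0$ gives

$$\theta^{(t+1)} = y^{(t)}.$$

The stationary point of this iteration is easy to calculate. Setting $\theta^{(t+1)} = \theta^{(t)}$ gives $\alpha^2 H^T x = \alpha^2 H^T H \theta$, so

$$\theta_{\text{stationary}} = (H^T H)^{-1} H^T x,$$

which is the answer we are familiar with.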
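The iteration of Example 6 can be checked numerically. The sketch below runs $\theta^{(t+1)} = \theta^{(t)} + \alpha^2 H^T (x - H\theta^{(t)})$ on a small hand-picked problem. The choice $\alpha^2 = 1 / \|H\|_F^2$ is an assumption made here, not something the lecture specifies; it is one simple way to keep $I - \alpha^2 H H^T$ positive semidefinite. The helper names and data are hypothetical.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def em_linear_model(H, x, num_iters=200):
    """EM iteration of Example 6: theta <- E[y | x, theta]."""
    k = len(H[0])
    # Assumed step: alpha^2 = 1 / (squared Frobenius norm of H), which is
    # at most 1 / lambda_max(H^T H), so I - alpha^2 H H^T stays PSD.
    alpha_sq = 1.0 / sum(h * h for row in H for h in row)
    Ht = transpose(H)
    theta = [0.0] * k
    for _ in range(num_iters):
        residual = [xi - pi for xi, pi in zip(x, matvec(H, theta))]
        correction = matvec(Ht, residual)
        # E-step gives y_t = theta_t + alpha^2 H^T (x - H theta_t);
        # M-step sets theta_{t+1} = y_t.
        theta = [th + alpha_sq * c for th, c in zip(theta, correction)]
    return theta

# Toy problem: H^T H = [[2, 1], [1, 2]] and H^T x = [5, 6],
# so the least-squares solution is [4/3, 7/3].
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0, 4.0]

theta_em = em_linear_model(H, x)
print(theta_em)
```

No matrix inversion appears in the loop; each iteration costs only two matrix-vector products, at the price of needing many iterations when $H^T H$ is poorly conditioned.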

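Example 7. Suppose:

$$X_1, X_2, \dots, X_n \sim \sum_{j=1}^{m} p_j\, \mathcal{N}(\mu_j, \sigma_j^2).$$

With hidden labels $y_i \in \{1, \dots, m\}$, we have:

$$p(x, y \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{m} \left( \frac{p_j}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}} \right)^{\mathbf{1}_{\{y_i = j\}}}.$$

Thus,

$$\log p(x, y \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbf{1}_{\{y_i = j\}} \log \left( \frac{p_j}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}} \right),$$

$$E_{p(y \mid x, \theta^{(t)})}[\log p(x, y \mid \theta)] = \sum_{i=1}^{n} \sum_{j=1}^{m} \log \left( \frac{p_j}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}} \right) E_{p(y \mid x, \theta^{(t)})}\big[\mathbf{1}_{\{y_i = j\}}\big],$$

where

$$E_{p(y \mid x, \theta^{(t)})}\big[\mathbf{1}_{\{y_i = j\}}\big] = \frac{p_j^{(t)}\, \mathcal{N}\big(x_i; \mu_j^{(t)}, (\sigma_j^{(t)})^2\big)}{\sum_{l=1}^{m} p_l^{(t)}\, \mathcal{N}\big(x_i; \mu_l^{(t)}, (\sigma_l^{(t)})^2\big)}.$$

Denote

$$p^{(t)}(y_i = j) = \frac{p_j^{(t)}\, \mathcal{N}\big(x_i; \mu_j^{(t)}, (\sigma_j^{(t)})^2\big)}{\sum_{l=1}^{m} p_l^{(t)}\, \mathcal{N}\big(x_i; \mu_l^{(t)}, (\sigma_l^{(t)})^2\big)}.$$

Then we have the expression of $Q(\theta, \theta^{(t)})$:

$$Q(\theta, \theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{m} p^{(t)}(y_i = j) \log\big( p_j\, \mathcal{N}(x_i; \mu_j, \sigma_j^2) \big) = \sum_{i=1}^{n} \sum_{j=1}^{m} p^{(t)}(y_i = j) \log \mathcal{N}(x_i; \mu_j, \sigma_j^2) + \text{constant},$$

where the constant collects the terms that do not depend on $(\mu_j, \sigma_j^2)$. Setting $\frac{\partial Q}{\partial \theta} = 0$, we have:

$$\mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} p^{(t)}(y_i = j)\, x_i}{\sum_{i=1}^{n} p^{(t)}(y_i = j)}, \qquad (\sigma_j^{(t+1)})^2 = \frac{\sum_{i=1}^{n} \big(x_i - \mu_j^{(t+1)}\big)^2\, p^{(t)}(y_i = j)}{\sum_{i=1}^{n} p^{(t)}(y_i = j)}.$$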
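The updates of Example 7 translate almost line by line into code. The following is a minimal sketch, not a production implementation. The function names, toy data, and initial values are hypothetical. Two pieces go beyond what the example writes out and are assumptions of this sketch: the weight update $p_j^{(t+1)} = \frac{1}{n}\sum_i p^{(t)}(y_i = j)$, which is the standard M-step for the mixing weights, and a small variance floor added as a numerical safeguard. The log-likelihood is recorded at every iteration so that the monotonicity property can be inspected.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, p, mu, var, num_iters=50, var_floor=1e-9):
    """EM for a 1-D Gaussian mixture; returns (p, mu, var, log_liks)."""
    n, m = len(xs), len(p)
    p, mu, var = list(p), list(mu), list(var)
    log_liks = []
    for _ in range(num_iters):
        # E-step: resp[i][j] = p^(t)(y_i = j), the posterior label probability.
        resp = []
        log_lik = 0.0
        for x in xs:
            joint = [p[j] * normal_pdf(x, mu[j], var[j]) for j in range(m)]
            total = sum(joint)
            log_lik += math.log(total)   # log p(x_i | theta^(t))
            resp.append([w / total for w in joint])
        log_liks.append(log_lik)
        # M-step: responsibility-weighted mean and variance per component.
        for j in range(m):
            nj = sum(resp[i][j] for i in range(n))
            mu[j] = sum(resp[i][j] * xs[i] for i in range(n)) / nj
            var_j = sum(resp[i][j] * (xs[i] - mu[j]) ** 2 for i in range(n)) / nj
            var[j] = max(var_j, var_floor)   # safeguard, not part of the example
            p[j] = nj / n                    # standard weight update (assumed here)
    return p, mu, var, log_liks

# Toy data with two well-separated clusters, for illustration only.
xs = [-2.2, -2.0, -1.8, -2.1, -1.9, 2.9, 3.0, 3.1, 3.2, 2.8]

p_hat, mu_hat, var_hat, log_liks = em_gmm_1d(
    xs, p=[0.5, 0.5], mu=[-1.0, 1.0], var=[1.0, 1.0])
print(mu_hat, var_hat, p_hat)
```

On data this well separated the responsibilities become nearly 0 or 1 after a couple of iterations, so the estimates approach the per-cluster sample means and variances. On overlapping clusters the result depends on the initial guess, since EM only finds a stationary point.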
