CS229 Lecture notes
Andrew Ng

Part IX
The EM algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. We begin our discussion with a very useful result called Jensen's inequality.

1 Jensen's inequality

Let f be a function whose domain is the set of real numbers. Recall that f is a convex function if f''(x) ≥ 0 (for all x ∈ ℝ). In the case of f taking vector-valued inputs, this is generalized to the condition that its Hessian H is positive semi-definite (H ≥ 0). If f''(x) > 0 for all x, then we say f is strictly convex (in the vector-valued case, the corresponding statement is that H must be positive definite, written H > 0). Jensen's inequality can then be stated as follows:

Theorem. Let f be a convex function, and let X be a random variable. Then:

    E[f(X)] \geq f(E[X]).

Moreover, if f is strictly convex, then E[f(X)] = f(E[X]) holds true if and only if X = E[X] with probability 1 (i.e., if X is a constant).

Recall our convention of occasionally dropping the parentheses when writing expectations, so in the theorem above, f(EX) = f(E[X]).

For an interpretation of the theorem, consider the figure below.
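As a quick numerical illustration of the theorem (added here for concreteness; not part of the original notes), we can evaluate both sides of the inequality for the convex function f(x) = x² and a two-point random variable of the kind shown in the figure:

```python
import math

# Illustration (not from the notes): check Jensen's inequality numerically for
# the convex function f(x) = x**2, with X taking the values a and b with
# probability 0.5 each.

def f(x):
    return x ** 2

a, b = 1.0, 5.0
E_X = 0.5 * a + 0.5 * b            # E[X] = 3.0
E_fX = 0.5 * f(a) + 0.5 * f(b)     # E[f(X)] = 13.0
assert E_fX >= f(E_X)              # convex f: E[f(X)] >= f(E[X])

# For a concave function such as log, the inequality reverses.
E_logX = 0.5 * math.log(a) + 0.5 * math.log(b)
assert E_logX <= math.log(E_X)     # concave f: E[f(X)] <= f(E[X])
```

Here E[f(X)] = 13 while f(E[X]) = 9, so the gap between the two sides is visible directly.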
[Figure: a convex function f (solid line), with a and b marked on the x-axis and f(a), f(b), f(E[X]), and E[f(X)] marked on the y-axis.]

Here, f is a convex function shown by the solid line. Also, X is a random variable that has a 0.5 chance of taking the value a, and a 0.5 chance of taking the value b (indicated on the x-axis). Thus, the expected value of X is given by the midpoint between a and b.

We also see the values f(a), f(b) and f(E[X]) indicated on the y-axis. Moreover, the value E[f(X)] is the midpoint on the y-axis between f(a) and f(b). From our example, we see that because f is convex, it must be the case that E[f(X)] ≥ f(E[X]).

Incidentally, quite a lot of people have trouble remembering which way the inequality goes, and remembering a picture like this is a good way to quickly figure out the answer.

Remark. Recall that f is [strictly] concave if and only if −f is [strictly] convex (i.e., f''(x) ≤ 0 or H ≤ 0). Jensen's inequality also holds for concave functions f, but with the direction of all the inequalities reversed (E[f(X)] ≤ f(E[X]), etc.).

2 The EM algorithm

Suppose we have an estimation problem in which we have a training set {x^(1), ..., x^(m)} consisting of m independent examples. We wish to fit the parameters of a model p(x, z) to the data, where the log-likelihood is given by

    \ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta)
                 = \sum_{i=1}^m \log \sum_{z} p(x^{(i)}, z; \theta).
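To make the double sum concrete, here is a minimal sketch of evaluating ℓ(θ) by marginalizing the joint p(x^(i), z; θ) over the latent z. The one-dimensional data, two-component mixture, and all parameter values are assumptions chosen for illustration, not prescribed by the notes:

```python
import math

# Sketch (illustrative values only): the log-likelihood of a latent-variable
# model, l(theta) = sum_i log sum_z p(x^(i), z; theta), for a 1-D mixture of
# two Gaussians where z in {0, 1} picks the component.

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def log_likelihood(xs, phi, mus, sigma2s):
    ll = 0.0
    for x in xs:
        # p(x; theta) = sum_z p(z; phi) p(x | z; mu, sigma2): marginalize out z
        px = sum(phi[z] * gaussian_pdf(x, mus[z], sigma2s[z])
                 for z in range(len(phi)))
        ll += math.log(px)
    return ll

xs = [-1.0, 0.2, 3.1, 4.0]
ll = log_likelihood(xs, phi=[0.5, 0.5], mus=[0.0, 3.5], sigma2s=[1.0, 1.0])
```

Parameters that place the component means near the data yield a higher ℓ(θ) than parameters that do not, which is exactly the quantity maximum likelihood asks us to maximize.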
But explicitly finding the maximum likelihood estimates of the parameters θ may be hard. Here, the z^(i)'s are the latent random variables; and it is often the case that if the z^(i)'s were observed, then maximum likelihood estimation would be easy.

In such a setting, the EM algorithm gives an efficient method for maximum likelihood estimation. Maximizing ℓ(θ) explicitly might be difficult, and our strategy will be to instead repeatedly construct a lower-bound on ℓ (E-step), and then optimize that lower-bound (M-step).

For each i, let Q_i be some distribution over the z's (\sum_z Q_i(z) = 1, Q_i(z) ≥ 0).¹ Consider the following:

    \sum_i \log p(x^{(i)}; \theta)
      = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)                                          (1)
      = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}        (2)
      \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}     (3)

The last step of this derivation used Jensen's inequality. Specifically, f(x) = log x is a concave function, since f''(x) = −1/x² < 0 over its domain x ∈ ℝ⁺. Also, the term

    \sum_{z^{(i)}} Q_i(z^{(i)}) \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]

in the summation is just an expectation of the quantity [p(x^(i), z^(i); θ)/Q_i(z^(i))] with respect to z^(i) drawn according to the distribution given by Q_i. By Jensen's inequality, we have

    f\left( E_{z^{(i)} \sim Q_i}\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \right)
      \geq E_{z^{(i)} \sim Q_i}\left[ f\left( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right) \right],

where the "z^(i) ∼ Q_i" subscripts indicate that the expectations are with respect to z^(i) drawn from Q_i. This allowed us to go from Equation (2) to Equation (3).

Now, for any set of distributions Q_i, the formula (3) gives a lower-bound on ℓ(θ). There are many possible choices for the Q_i's. Which should we choose? Well, if we have some current guess θ of the parameters, it seems

¹ If z were continuous, then Q_i would be a density, and the summations over z in our discussion are replaced with integrals over z.
natural to try to make the lower-bound tight at that value of θ. I.e., we will make the inequality above hold with equality at our particular value of θ. (We will see later how this enables us to prove that ℓ(θ) increases monotonically with successive iterations of EM.)

To make the bound tight for a particular value of θ, we need the step involving Jensen's inequality in our derivation above to hold with equality. For this to be true, we know it is sufficient that the expectation be taken over a "constant"-valued random variable. I.e., we require that

    \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c

for some constant c that does not depend on z^(i). This is easily accomplished by choosing

    Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta).

Actually, since we know \sum_z Q_i(z) = 1 (because it is a distribution), this further tells us that

    Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)}
                 = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)}
                 = p(z^{(i)} \mid x^{(i)}; \theta).

Thus, we simply set the Q_i's to be the posterior distribution of the z^(i)'s given x^(i) and the setting of the parameters θ.

Now, for this choice of the Q_i's, Equation (3) gives a lower-bound on the log-likelihood ℓ that we are trying to maximize. This is the E-step. In the M-step of the algorithm, we then maximize our formula in Equation (3) with respect to the parameters to obtain a new setting of the θ's. Repeatedly carrying out these two steps gives us the EM algorithm, which is as follows:

Repeat until convergence {

    (E-step) For each i, set

        Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta).

    (M-step) Set

        \theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.
}

How do we know if this algorithm will converge? Well, suppose θ^(t) and θ^(t+1) are the parameters from two successive iterations of EM. We will now prove that ℓ(θ^(t)) ≤ ℓ(θ^(t+1)), which shows that EM always monotonically improves the log-likelihood. The key to showing this result lies in our choice of the Q_i's. Specifically, on the iteration of EM in which the parameters had started out as θ^(t), we would have chosen Q_i^(t)(z^(i)) := p(z^(i) | x^(i); θ^(t)). We saw earlier that this choice ensures that Jensen's inequality, as applied to get Equation (3), holds with equality, and hence

    \ell(\theta^{(t)}) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}.

The parameters θ^(t+1) are then obtained by maximizing the right hand side of the equation above. Thus,

    \ell(\theta^{(t+1)}) \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}    (4)
                         \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}     (5)
                         = \ell(\theta^{(t)})                                                                                                (6)

The first inequality comes from the fact that

    \ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

holds for any values of Q_i and θ, and in particular holds for Q_i = Q_i^(t), θ = θ^(t+1). To get Equation (5), we used the fact that θ^(t+1) is chosen explicitly to be

    \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i^{(t)}(z^{(i)})},

and thus this formula evaluated at θ^(t+1) must be equal to or larger than the same formula evaluated at θ^(t). Finally, the step used to get (6) was shown earlier, and follows from Q_i^(t) having been chosen to make Jensen's inequality hold with equality at θ^(t).
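Both facts used in this argument, that the lower bound holds for an arbitrary Q_i and that it is tight when Q_i is the posterior, are easy to verify numerically. Below is a minimal sketch for a single example x; the toy two-component mixture and all parameter values are assumptions made for illustration:

```python
import math

# Check, for one example x, that sum_z Q(z) log(p(x,z;theta)/Q(z)) <= log p(x;theta)
# for any distribution Q, with equality when Q(z) = p(z | x; theta).
# The mixture parameters below are illustrative assumptions.

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

phi, mus, sigma2s = [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]
x = 1.0

joint = [phi[z] * gaussian_pdf(x, mus[z], sigma2s[z]) for z in (0, 1)]  # p(x, z; theta)
log_px = math.log(sum(joint))                                           # log p(x; theta)

def lower_bound(Q):
    return sum(Q[z] * math.log(joint[z] / Q[z]) for z in (0, 1))

assert lower_bound([0.5, 0.5]) <= log_px              # any Q gives a lower bound
posterior = [pz / sum(joint) for pz in joint]         # Q(z) = p(z | x; theta)
assert abs(lower_bound(posterior) - log_px) < 1e-12   # the bound is tight here
```

Trying other distributions for Q (say, [0.9, 0.1]) gives strictly smaller values of the bound, matching the equality condition in Jensen's inequality.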
Hence, EM causes the likelihood to converge monotonically. In our description of the EM algorithm, we said we'd run it until convergence. Given the result we just showed, one reasonable convergence test would be to check if the increase in ℓ(θ) between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving ℓ(θ) too slowly.

Remark. If we define

    J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})},

then we know ℓ(θ) ≥ J(Q, θ) from our previous derivation. EM can also be viewed as coordinate ascent on J, in which the E-step maximizes it with respect to Q (check this yourself), and the M-step maximizes it with respect to θ.

3 Mixture of Gaussians revisited

Armed with our general definition of the EM algorithm, let's go back to our old example of fitting the parameters φ, μ and Σ in a mixture of Gaussians. For the sake of brevity, we carry out the derivations for the M-step updates only for φ and μ, and leave the updates for Σ as an exercise for the reader.

The E-step is easy. Following our algorithm derivation above, we simply calculate

    w_j^{(i)} = Q_i(z^{(i)} = j) = P(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma).

Here, "Q_i(z^(i) = j)" denotes the probability of z^(i) taking the value j under the distribution Q_i.

Next, in the M-step, we need to maximize, with respect to our parameters φ, μ, Σ, the quantity

    \sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})}
      = \sum_{i=1}^m \sum_{j=1}^k Q_i(z^{(i)} = j) \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)}
      = \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) \cdot \phi_j}{w_j^{(i)}}.
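In code, the E-step is just Bayes' rule applied per example. Here is a sketch for the one-dimensional case (n = 1, so each Σ_j reduces to a scalar variance); the data and parameter values are assumptions for illustration:

```python
import math

# E-step sketch: w[i][j] = Q_i(z^(i) = j) = P(z^(i) = j | x^(i); phi, mu, Sigma),
# computed by Bayes' rule. 1-D case; all numbers are illustrative.

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def e_step(xs, phi, mus, sigma2s):
    w = []
    for x in xs:
        joint = [phi[j] * gaussian_pdf(x, mus[j], sigma2s[j])
                 for j in range(len(phi))]           # p(x, z = j; theta)
        total = sum(joint)                           # p(x; theta)
        w.append([pj / total for pj in joint])       # p(z = j | x; theta)
    return w

xs = [-2.0, 0.1, 3.9, 4.2]
w = e_step(xs, phi=[0.5, 0.5], mus=[0.0, 4.0], sigma2s=[1.0, 1.0])
# Each row of w is a distribution over the k components.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in w)
```

Points near a component's mean receive responsibility close to 1 for that component, which is what makes the weighted M-step updates behave like per-cluster maximum likelihood.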
Let's maximize this with respect to μ_l. If we take the derivative with respect to μ_l, we find

    \nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) \cdot \phi_j}{w_j^{(i)}}
      = -\nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)
      = \frac{1}{2} \sum_{i=1}^m w_l^{(i)} \nabla_{\mu_l} \left( 2\mu_l^T \Sigma_l^{-1} x^{(i)} - \mu_l^T \Sigma_l^{-1} \mu_l \right)
      = \sum_{i=1}^m w_l^{(i)} \left( \Sigma_l^{-1} x^{(i)} - \Sigma_l^{-1} \mu_l \right)

Setting this to zero and solving for μ_l therefore yields the update rule

    \mu_l := \frac{\sum_{i=1}^m w_l^{(i)} x^{(i)}}{\sum_{i=1}^m w_l^{(i)}},

which was what we had in the previous set of notes.

Let's do one more example, and derive the M-step update for the parameters φ_j. Grouping together only the terms that depend on φ_j, we find that we need to maximize

    \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \phi_j.

However, there is an additional constraint that the φ_j's sum to 1, since they represent the probabilities φ_j = p(z^(i) = j; φ). To deal with the constraint that \sum_{j=1}^k \phi_j = 1, we construct the Lagrangian

    \mathcal{L}(\phi) = \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \phi_j + \beta \left( \sum_{j=1}^k \phi_j - 1 \right),

where β is the Lagrange multiplier.² Taking derivatives, we find

    \frac{\partial}{\partial \phi_j} \mathcal{L}(\phi) = \sum_{i=1}^m \frac{w_j^{(i)}}{\phi_j} + \beta.

² We don't need to worry about the constraint that φ_j ≥ 0, because as we'll shortly see, the solution we find from this derivation will automatically satisfy that anyway.
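The update just derived, μ_l := Σ_i w_l^(i) x^(i) / Σ_i w_l^(i), is a responsibility-weighted average of the data. A minimal sketch of it in the one-dimensional case (the hard 0/1 responsibilities here are an assumption chosen so the averages are easy to check by hand; in the full algorithm they come from the E-step):

```python
# M-step sketch for mu_l := sum_i w_l^(i) x^(i) / sum_i w_l^(i) (1-D case).
# w[i][l] would normally come from the E-step; hard 0/1 responsibilities are
# used here so the weighted averages can be verified by hand.

def m_step_mu(xs, w, l):
    num = sum(w[i][l] * xs[i] for i in range(len(xs)))
    den = sum(w[i][l] for i in range(len(xs)))
    return num / den

xs = [0.0, 2.0, 10.0]
w = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0]]
mu_0 = m_step_mu(xs, w, 0)   # (0.0 + 2.0) / 2 = 1.0
mu_1 = m_step_mu(xs, w, 1)   # 10.0 / 1.0 = 10.0
```

With hard responsibilities the update reduces to the per-cluster sample mean, exactly as one would compute if the z^(i)'s were observed.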
Setting this to zero and solving, we get

    \phi_j = \frac{\sum_{i=1}^m w_j^{(i)}}{-\beta}.

I.e., φ_j ∝ \sum_{i=1}^m w_j^{(i)}. Using the constraint that \sum_j \phi_j = 1, we easily find that

    -\beta = \sum_{j=1}^k \sum_{i=1}^m w_j^{(i)} = \sum_{i=1}^m 1 = m.

(This used the fact that w_j^{(i)} = Q_i(z^{(i)} = j), and since probabilities sum to 1, \sum_j w_j^{(i)} = 1.) We therefore have our M-step update for the parameters φ_j:

    \phi_j := \frac{1}{m} \sum_{i=1}^m w_j^{(i)}.

The derivation for the M-step update to the Σ_j's is also entirely straightforward.
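Putting the E-step together with the φ_j and μ_j updates derived above (plus a responsibility-weighted variance for the Σ_j update the notes leave as an exercise, stated here without derivation), a complete one-dimensional EM loop might look like the following sketch; the data and initial parameter values are illustrative assumptions:

```python
import math

# Complete EM sketch for a 1-D mixture of two Gaussians. The phi_j and mu_j
# updates are the ones derived above; the sigma2 line uses the standard
# responsibility-weighted variance (the Sigma_j exercise, 1-D case).

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def log_likelihood(xs, phi, mus, sigma2s):
    return sum(math.log(sum(phi[j] * gaussian_pdf(x, mus[j], sigma2s[j])
                            for j in range(2))) for x in xs)

def em_step(xs, phi, mus, sigma2s):
    m = len(xs)
    # E-step: w[i][j] = P(z^(i) = j | x^(i); phi, mu, Sigma) by Bayes' rule.
    w = []
    for x in xs:
        joint = [phi[j] * gaussian_pdf(x, mus[j], sigma2s[j]) for j in range(2)]
        total = sum(joint)
        w.append([pj / total for pj in joint])
    # M-step: phi_j, mu_j as derived above; weighted variance for Sigma_j.
    N = [sum(w[i][j] for i in range(m)) for j in range(2)]
    phi = [N[j] / m for j in range(2)]
    mus = [sum(w[i][j] * xs[i] for i in range(m)) / N[j] for j in range(2)]
    sigma2s = [sum(w[i][j] * (xs[i] - mus[j]) ** 2 for i in range(m)) / N[j]
               for j in range(2)]
    return phi, mus, sigma2s

xs = [-1.2, -0.8, 0.1, 3.6, 4.1, 4.4]
phi, mus, sigma2s = [0.5, 0.5], [-1.0, 5.0], [1.0, 1.0]
lls = []
for _ in range(20):
    lls.append(log_likelihood(xs, phi, mus, sigma2s))
    phi, mus, sigma2s = em_step(xs, phi, mus, sigma2s)
# As proved in Section 2, the log-likelihood never decreases across iterations.
assert all(lls[t + 1] >= lls[t] - 1e-9 for t in range(len(lls) - 1))
```

Tracking ℓ(θ) across iterations, as done here, doubles as the convergence test suggested earlier: stop once the increase falls below a tolerance.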