An Introduction to Expectation-Maximization

Dahua Lin

Abstract. This note reviews the basics of the Expectation-Maximization (EM) algorithm, a popular approach to model estimation for generative models with latent variables. We first describe the E-steps and M-steps, and then use the finite mixture model as an example to illustrate this procedure in practice. Finally, we discuss the algorithm's intrinsic relation to an optimization problem, which reveals the nature of E-M.

1 The Expectation and Maximization

Consider a generative model with parameter $\theta$. The model generates a set of data $D$, which comprises two parts: (1) the observed data $X$, and (2) the latent data $Z$, which is not observed. With this model, the complete likelihood of $D$ is given by
\[ p(D \mid \theta) = p(X, Z \mid \theta), \tag{1} \]
and the marginal likelihood of the observed data $X$ is given by
\[ p(X \mid \theta) = \sum_{z} p(X, z \mid \theta). \tag{2} \]
Here, we sum over all possible assignments $z$ to the variable $Z$. If $Z$ is continuous, we replace the sum with an integral over the domain of $Z$.

The maximum likelihood estimation (MLE) of $\theta$ given $X$ is to find the parameter $\theta \in \Theta$ that maximizes the marginal likelihood:
\[ \hat{\theta} = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \, p(X \mid \theta) = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \, \log p(X \mid \theta). \tag{3} \]
Here, $\Theta$ is the parameter domain, i.e. the set of all valid parameters. In practice, it is usually easier to work with the log-likelihood than with the likelihood itself.

For many problems, including all the examples that we shall see later, the size of the domain of $Z$ grows exponentially as the problem scale increases, making it computationally intractable to exactly evaluate (or even optimize) the marginal likelihood as above. The expectation-maximization (E-M) algorithm was developed to address this issue; it provides an iterative approach to perform MLE. The E-M algorithm, as described below, alternates between E-steps and M-steps until convergence.

1. Initialize $\hat{\theta}^{(0)}$. One can simply set $\hat{\theta}^{(0)}$ to some random value in $\Theta$, or employ some problem-specific heuristic to derive an initial guess.

2. For $t = 1, 2, \ldots$, repeat:

   a. E-step. Compute the posterior distribution of $Z$ given $X$ and $\hat{\theta}^{(t-1)}$, as
   \[ q^{(t)}(z) = p(z \mid X; \hat{\theta}^{(t-1)}). \tag{4} \]
   Here, we use $q^{(t)}$ to denote the posterior distribution of $Z$ obtained at the $t$-th iteration.

   b. M-step. Solve for the optimal $\hat{\theta}^{(t)}$ by maximizing the expectation of the complete log-likelihood with respect to $q^{(t)}$, as
   \[ \hat{\theta}^{(t)} = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \, E_{q^{(t)}}\big[\log p(X, Z \mid \theta)\big] = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \sum_{z} q^{(t)}(z) \log p(X, z \mid \theta). \tag{5} \]
   The computation of this sum can often be greatly simplified by taking advantage of the independence between variables, as we shall see in the next section.

3. The iterations stop when some convergence criterion is met. For example, they can be stopped when the difference between $\hat{\theta}^{(t)}$ and $\hat{\theta}^{(t-1)}$ falls below some threshold.
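This alternation is straightforward to express as a small driver loop. Below is a minimal Python sketch (not part of the original note); the names `run_em`, `e_step`, `m_step`, and `distance` are illustrative assumptions, with the latter three to be supplied by a concrete model, and the stopping rule is the parameter-change criterion from step 3.

```python
def run_em(theta0, e_step, m_step, distance, tol=1e-6, max_iter=1000):
    """Generic E-M driver.

    theta0   -- initial parameter guess (step 1 above)
    e_step   -- maps theta to the posterior q(Z) = p(Z | X; theta)   (Eq. 4)
    m_step   -- maps q to argmax_theta E_q[log p(X, Z | theta)]      (Eq. 5)
    distance -- measures the change between successive parameters
    """
    theta = theta0
    for t in range(1, max_iter + 1):
        q = e_step(theta)         # E-step: posterior of the latent variables
        theta_new = m_step(q)     # M-step: maximize the expected complete log-likelihood
        if distance(theta_new, theta) < tol:   # convergence criterion (step 3)
            return theta_new, t
        theta = theta_new
    return theta, max_iter
```

Keeping the control flow generic like this separates the convergence logic from the model-specific E-step and M-step computations derived in the next section.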
2 E-M in Practice: Estimation of Finite Mixture Models

One of the most important applications of E-M is the maximum likelihood estimation of finite mixture models. Generally, a finite mixture model is composed of $K$ component models, whose parameters are respectively $\theta_1, \ldots, \theta_K$, and a prior distribution over these components, denoted by $\pi$. The probability density function (pdf) of this finite mixture model is given by
\[ p(x \mid M) = \sum_{k=1}^{K} \pi_k \, p_c(x \mid \theta_k). \tag{6} \]
Here, we use $M = (\pi, \theta_1, \ldots, \theta_K)$ to denote all parameters of the mixture model. $p_c$ is the pdf of a component model, which has a parameter $\theta_k$. When each component model is a Gaussian model, this finite mixture model is called a Gaussian mixture model.

Generating a sample $x$ from a finite mixture model as described above can be done in two steps: (1) draw an indicator variable $z$ from the prior distribution $\pi$, as $z \sim \pi = (\pi_1, \ldots, \pi_K)$. Here, $z$ can take an integer value in $\{1, \ldots, K\}$; this value indicates which particular component is going to be used to generate $x$. (2) Given $z$, draw $x$ from the $z$-th component model. Thus, we can consider this as a process that generates two values: $z$ and $x$.

Given a set of samples $X = \{x_1, \ldots, x_n\}$, we can treat the associated set of indicators $Z = \{z_1, \ldots, z_n\}$ as latent variables. To estimate the model parameters, we can use the E-M algorithm. To derive the E-steps and the M-steps, we go through three steps as follows.

1. Write down the complete log-likelihood. The complete log-likelihood is the log of the joint pdf of both the observed and latent variables, given the model parameters. In a finite mixture model, it is given by
\[ \log p(X, Z \mid M) = \sum_{i=1}^{n} \log\big( p(x_i \mid z_i; M) \, p(z_i \mid M) \big). \tag{7} \]
Here, $p(x_i \mid z_i; M) = p_c(x_i \mid \theta_{z_i})$ and $p(z_i \mid M) = \pi_{z_i}$. Therefore, substituting these into the formula above results in
\[ \log p(X, Z \mid M) = \sum_{i=1}^{n} \big( \log \pi_{z_i} + \log p_c(x_i \mid \theta_{z_i}) \big). \tag{8} \]
This form is not easy to work with when deriving the M-steps. The general trick is to introduce an indicator function $\delta_k$, which is defined by
\[ \delta_k(z) = \begin{cases} 1 & (z = k) \\ 0 & (z \neq k). \end{cases} \tag{9} \]
With this notation, we can further rewrite the complete log-likelihood as
\[ \log p(X, Z \mid M) = \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big). \tag{10} \]

2. Derive the E-steps. Given the model parameters $M$ and $x_i$, the indicator $z_i$ is independent of the other samples, and as a result, we have
\[ p(Z \mid X; M) = \prod_{i=1}^{n} p(z_i \mid x_i; M). \tag{11} \]
Following Bayes' rule, we get
\[ p(z_i = k \mid x_i; M) = \frac{\pi_k \, p_c(x_i \mid \theta_k)}{\sum_{l=1}^{K} \pi_l \, p_c(x_i \mid \theta_l)}. \tag{12} \]
The model parameters $M$ are updated at each iteration $t$. To simplify notation, we use $M^{(t)}$ to denote the model parameters obtained at iteration $t$, and $q^{(t)}$ to denote the posterior distribution obtained based on $M^{(t-1)}$. Then, we have
\[ q^{(t)}(Z) = \prod_{i=1}^{n} q_i^{(t)}(z_i), \quad \text{with } q_i^{(t)}(z_i) = p(z_i \mid x_i; M^{(t-1)}). \tag{13} \]
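Eq. (12) is all the E-step needs. As one hedged sketch (the function name, argument layout, and the `log_pdf` callable are my own assumptions, not from the note), the responsibilities $q_i^{(t)}(k)$ for all samples can be computed in log space to avoid numerical underflow:

```python
import numpy as np

def e_step_responsibilities(X, pi, thetas, log_pdf):
    """Posterior q_i(k) = pi_k p_c(x_i | theta_k) / sum_l pi_l p_c(x_i | theta_l)  (Eq. 12).

    X       -- (n,) or (n, d) array of samples
    pi      -- (K,) mixing weights
    thetas  -- list of K component parameters
    log_pdf -- hypothetical callable returning log p_c(x_i | theta_k) for all i at once
    """
    # log(pi_k p_c(x_i | theta_k)) for every sample i and component k
    log_joint = np.stack([np.log(pi[k]) + log_pdf(X, thetas[k])
                          for k in range(len(pi))], axis=1)        # (n, K)
    # normalize each row in log space (log-sum-exp) for numerical stability
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return np.exp(log_joint - log_norm)                            # rows sum to 1
```

Working with $\log \pi_k + \log p_c(x_i \mid \theta_k)$ and normalizing via log-sum-exp is safer than multiplying raw densities, which can underflow for points far from a component.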
3. Derive the M-steps. With respect to $q^{(t)}$, the expectation of the complete log-likelihood is given by
\[ E_{q^{(t)}}\Big[ \sum_{i=1}^{n} \sum_{k=1}^{K} \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big) \Big] = \sum_{i=1}^{n} \sum_{k=1}^{K} E_{q^{(t)}}\Big[ \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big) \Big]. \tag{14} \]
Here, each term has
\[ E_{q^{(t)}}\big[ \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big) \big] = E_{q^{(t)}}\big[ \delta_k(z_i) \big] \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big), \tag{15} \]
and since the value of $\delta_k(z_i)$ can only be either 0 or 1, we have
\[ E_{q^{(t)}}\big[ \delta_k(z_i) \big] = \Pr\big( \delta_k(z_i) = 1 \big) = q_i^{(t)}(k). \tag{16} \]
Consequently, the expected complete log-likelihood can be written as
\[ \sum_{i=1}^{n} \sum_{k=1}^{K} q_i^{(t)}(k) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big). \tag{17} \]
Hence, the optimal value of $\theta_k$ at time $t$, denoted by $\hat{\theta}_k^{(t)}$, can be obtained by
\[ \hat{\theta}_k^{(t)} = \mathop{\mathrm{argmax}}_{\theta_k} \sum_{i=1}^{n} q_i^{(t)}(k) \log p_c(x_i \mid \theta_k). \tag{18} \]
This has a similar form to the ordinary MLE problem, except that the term for each sample is modulated by a coefficient $q_i^{(t)}(k)$. In addition, we have to estimate the value of $\pi$, which can be done by solving the following problem:
\[ \hat{\pi}^{(t)} = \mathop{\mathrm{argmax}}_{\pi} \sum_{i=1}^{n} \sum_{k=1}^{K} q_i^{(t)}(k) \log \pi_k, \quad \text{s.t. } \sum_{k=1}^{K} \pi_k = 1, \; \pi_k \geq 0, \; k = 1, \ldots, K. \tag{19} \]
It is not difficult to derive (with Lagrange multipliers) that the optimal solution to this problem is given by
\[ \hat{\pi}_k^{(t)} = \frac{1}{n} \sum_{i=1}^{n} q_i^{(t)}(k). \tag{20} \]

Summary: the E-step is to calculate the posterior probabilities of $z_i$, as
\[ q_i^{(t)}(z_i = k) = p(z_i = k \mid x_i; M^{(t-1)}) = \frac{\pi_k \, p_c(x_i \mid \theta_k)}{\sum_{l=1}^{K} \pi_l \, p_c(x_i \mid \theta_l)}. \tag{21} \]
The M-step is to optimize the model parameters, as
\[ \hat{\theta}_k^{(t)} = \mathop{\mathrm{argmax}}_{\theta_k} \sum_{i=1}^{n} q_i^{(t)}(k) \log p_c(x_i \mid \theta_k), \tag{22} \]
and
\[ \hat{\pi}_k^{(t)} = \frac{1}{n} \sum_{i=1}^{n} q_i^{(t)}(k). \tag{23} \]
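Putting the E-step (21) and the M-step (22)-(23) together gives a complete algorithm once a component family is chosen. The following is a minimal end-to-end sketch for univariate Gaussian components, whose closed-form updates are the ones derived as Eqs. (24)-(25) in example 1 of the list that follows; the function name, initialization scheme, and fixed iteration count are illustrative choices, not prescriptions from the note.

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """E-M for a univariate Gaussian mixture, following Eqs. (21)-(25).
    A minimal sketch: initialization and stopping are kept deliberately simple."""
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=K, replace=False)   # crude initial guess for the means
    var = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step (Eq. 21): responsibilities q[i, k], computed in log space
        log_p = (-0.5 * np.log(2 * np.pi * var)
                 - (x[:, None] - mu) ** 2 / (2 * var))             # (n, K)
        log_q = np.log(pi) + log_p
        q = np.exp(log_q - np.logaddexp.reduce(log_q, axis=1, keepdims=True))
        # M-step (Eqs. 23 and 25)
        nk = q.sum(axis=0)                        # equals n * pi_hat_k
        pi = nk / n                               # Eq. 23
        mu = (q * x[:, None]).sum(axis=0) / nk    # Eq. 25, mean update
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk   # Eq. 25, variance update
    return pi, mu, var
```

For instance, on data drawn from two well-separated Gaussians, `em_gmm_1d(x, K=2)` should recover mixing weights, means, and variances close to the generating values, up to the usual permutation of component labels.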
In general, Eq. (21) and Eq. (23) are the same for all finite mixture models, while Eq. (22) depends on the particular form of the component model. Here are some examples:

1. Univariate Gaussian distribution. Here, $\theta_k = (\mu_k, \sigma_k^2)$, and the pdf is given by
\[ p_c(x \mid \theta_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\Big( -\frac{(x - \mu_k)^2}{2\sigma_k^2} \Big). \tag{24} \]
The optimal solution in the M-step is given by
\[ \hat{\mu}_k^{(t)} = \frac{1}{n \hat{\pi}_k^{(t)}} \sum_{i=1}^{n} q_i^{(t)}(k) \, x_i, \quad \text{and} \quad \big( \hat{\sigma}_k^{(t)} \big)^2 = \frac{1}{n \hat{\pi}_k^{(t)}} \sum_{i=1}^{n} q_i^{(t)}(k) \big( x_i - \hat{\mu}_k^{(t)} \big)^2. \tag{25} \]
Here, $\hat{\pi}_k^{(t)} = \frac{1}{n} \sum_{i=1}^{n} q_i^{(t)}(k)$.

2. Multivariate Gaussian distribution over $\mathbb{R}^d$. Here, $\theta_k = (\mu_k, \Sigma_k)$, and the pdf is given by
\[ p_c(x \mid \theta_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\Big( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Big). \tag{26} \]
The optimal solution in the M-step is given by
\[ \hat{\mu}_k^{(t)} = \frac{1}{n \hat{\pi}_k^{(t)}} \sum_{i=1}^{n} q_i^{(t)}(k) \, x_i, \quad \text{and} \quad \hat{\Sigma}_k^{(t)} = \frac{1}{n \hat{\pi}_k^{(t)}} \sum_{i=1}^{n} q_i^{(t)}(k) \big( x_i - \hat{\mu}_k^{(t)} \big) \big( x_i - \hat{\mu}_k^{(t)} \big)^T. \tag{27} \]

3. Simple topic model. Suppose each component model is a topic characterized by a word distribution $\theta_k$, with $\theta_k(w)$ being the probability of generating the word $w$ from the $k$-th topic, and each sample $x$ is a bag of words, with $x(w)$ being the number of times the word $w$ appears in the bag. Then
\[ p_c(x \mid \theta_k) = \prod_{w \in W} \theta_k(w)^{x(w)}. \tag{28} \]
Here, $W$ is the vocabulary, i.e. the set of all possible words. Under this model, the optimal solution in the M-step (see the sketch after this list) is given by
\[ \hat{\theta}_k^{(t)}(w) = \frac{1}{N_k} \sum_{i=1}^{n} q_i^{(t)}(k) \, x_i(w). \tag{29} \]
Here, $N_k = \sum_{i=1}^{n} q_i^{(t)}(k) \sum_{w \in W} x_i(w)$ is the weighted total number of words in all observed bags, which ensures that $\hat{\theta}_k^{(t)}$ sums to one.
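As a concrete instance of Eq. (29), referenced in example 3 above, here is a short sketch of the topic-model M-step; the name `m_step_topics` and the dense count-matrix layout are illustrative assumptions rather than anything prescribed by the note.

```python
import numpy as np

def m_step_topics(X, q):
    """Topic-model M-step (Eq. 29).

    X -- (n, |W|) matrix of word counts; row i is the bag x_i
    q -- (n, K) responsibilities q_i(k) from the E-step
    Returns a (K, |W|) matrix whose k-th row is the updated theta_k.
    """
    weighted = q.T @ X                            # (K, |W|): sum_i q_i(k) x_i(w)
    N_k = weighted.sum(axis=1, keepdims=True)     # weighted word totals per topic
    return weighted / N_k                         # each row now sums to one
```

Dividing by the per-topic total $N_k$ makes each row a proper distribution over the vocabulary, mirroring the normalization in Eq. (29).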
3 An Optimization View: Relating E-M and MLE

In this section, we show that the E-M algorithm is actually maximizing the following objective function via coordinate ascent:
\[ Q(\theta; q) = E_q\big[ \log p(X, Z \mid \theta) \big] + H(q) = E_q\big[ \log p(X, Z \mid \theta) \big] - E_q\big[ \log q(Z) \big]. \tag{30} \]
Here, $q$ is a distribution over $Z$, and $\theta$ is the model parameter. $H(q)$ is called the entropy of the distribution $q$, which is defined to be
\[ H(q) \triangleq -E_q\big[ \log q(Z) \big] = -\sum_{z} q(z) \log q(z). \tag{31} \]
The complete log-likelihood here can be decomposed into
\[ \log p(X, Z \mid \theta) = \log p(X \mid \theta) + \log p(Z \mid X; \theta). \tag{32} \]
Substituting Eq. (32) into Eq. (30) leads to
\[ Q(\theta; q) = E_q\big[ \log p(X \mid \theta) \big] + E_q\big[ \log p(Z \mid X; \theta) \big] - E_q\big[ \log q(Z) \big] = \log p(X \mid \theta) - \mathrm{KL}\big( q(Z) \,\|\, p(Z \mid X; \theta) \big). \tag{33} \]
Therefore, with $\theta$ fixed (so that the term $\log p(X \mid \theta)$ is a constant independent of $q$), we have
\[ \hat{q} = \mathop{\mathrm{argmax}}_{q} Q(\theta; q) = \mathop{\mathrm{argmax}}_{q} \Big( E_q\big[ \log p(Z \mid X; \theta) \big] - E_q\big[ \log q(Z) \big] \Big). \tag{34} \]
Here,
\[ E_q\big[ \log p(Z \mid X; \theta) \big] - E_q\big[ \log q(Z) \big] = E_q\big[ \log p(Z \mid X; \theta) - \log q(Z) \big] = -\mathrm{KL}\big( q(Z) \,\|\, p(Z \mid X; \theta) \big). \tag{35} \]
Therefore, maximizing $Q(\theta; q)$ with $\theta$ fixed is equivalent to minimizing the Kullback-Leibler divergence between $q(Z)$ and the posterior distribution $p(Z \mid X; \theta)$, which attains its minimum (zero) when $q(Z) = p(Z \mid X; \theta)$. In other words, the optimal solution $q$ to this problem is given by the posterior distribution $p(Z \mid X; \theta)$. Recall that the E-step is to compute this posterior distribution. In this sense, the E-step can be considered as the step that maximizes $Q(\theta; q)$ with respect to $q$, with $\theta$ fixed.

From Eq. (30), we can see that with $q$ fixed, the optimal $\theta$ is given by
\[ \hat{\theta} = \mathop{\mathrm{argmax}}_{\theta} Q(\theta; q) = \mathop{\mathrm{argmax}}_{\theta} E_q\big[ \log p(X, Z \mid \theta) \big]. \tag{36} \]
This is to maximize the expectation of the complete log-likelihood, which is exactly what the M-step does.

From the analysis above, we can see that E-M is actually a coordinate ascent procedure to maximize $Q(\theta; q)$. In particular, the E-step optimizes $q$ with $\theta$ fixed, while the M-step optimizes $\theta$ with $q$ fixed. Consequently, the value of $Q(\theta; q)$ keeps increasing during this process, until it reaches a local optimum.

To gain further insight, we revisit Eq. (33) and see that $Q(\theta; q)$ equals the marginal log-likelihood of $X$ with respect to $\theta$ minus the KL divergence between $q(Z)$ and $p(Z \mid X; \theta)$. Since the KL divergence is always non-negative, $Q(\theta; q)$ is a lower bound of $\log p(X \mid \theta)$, and E-M maximizes this lower bound. Furthermore, it is worth noting that at each E-step, when we obtain the optimal $q$ by minimizing the KL divergence to zero, we close the gap between the lower bound and the true marginal log-likelihood with respect to $\hat{\theta}^{(t-1)}$. To make it explicit,
\[ Q\big( \hat{\theta}^{(t-1)}; \hat{q}^{(t)} \big) = \log p\big( X \mid \hat{\theta}^{(t-1)} \big). \tag{37} \]
Since $Q( \hat{\theta}^{(t-1)}; \hat{q}^{(t)} ) \leq Q( \hat{\theta}^{(t)}; \hat{q}^{(t)} ) \leq Q( \hat{\theta}^{(t)}; \hat{q}^{(t+1)} ) = \log p( X \mid \hat{\theta}^{(t)} )$, we have
\[ \log p\big( X \mid \hat{\theta}^{(t-1)} \big) \leq \log p\big( X \mid \hat{\theta}^{(t)} \big). \tag{38} \]
This shows that the actual value of the marginal log-likelihood of $X$ increases as $t$ increases, implying that the E-M algorithm is actually maximizing the marginal log-likelihood of $X$, i.e. $\log p(X \mid \theta)$, the goal of MLE.
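The lower-bound relation in Eq. (33) and the gap-closing property in Eq. (37) are easy to verify numerically. The toy check below (the parameters, the observation, and the helper name `Q` are my own illustrative choices) builds a two-component univariate Gaussian mixture with a single observation, evaluates $Q(\theta; q)$ for a distribution $q$ over the single indicator, and confirms that $Q \leq \log p(x \mid \theta)$ with equality exactly when $q$ is the posterior.

```python
import numpy as np

# Toy setup: a two-component univariate Gaussian mixture and one observation x.
pi = np.array([0.4, 0.6])                         # mixing weights
mu, sd = np.array([-1.0, 2.0]), np.array([1.0, 1.0])
x = 0.5

# p(x, z = k | theta) = pi_k * N(x; mu_k, sd_k^2)
joint = pi * np.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / np.sqrt(2 * np.pi * sd ** 2)
log_marginal = np.log(joint.sum())                # log p(x | theta)
posterior = joint / joint.sum()                   # the E-step output, Eq. (21)

def Q(q):
    """Q(theta; q) = E_q[log p(x, Z | theta)] + H(q), as in Eq. (30)."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

q_arbitrary = np.array([0.9, 0.1])                # some q other than the posterior
assert Q(q_arbitrary) < log_marginal              # strict gap: KL(q || posterior) > 0
assert np.isclose(Q(posterior), log_marginal)     # gap closes at the posterior, Eq. (37)
```

Scanning $q$ over the simplex would trace the bound from below, peaking at the posterior, which is exactly the coordinate-ascent picture described above.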