Mixture Models and Expectation-Maximization


David M. Blei
March 9, 2012

EM for mixtures of multinomials

The graphical model for a mixture of multinomials: \pi \to z_d \to x_{dn}, with plates over n = 1, \dots, N and d = 1, \dots, D, and component parameters \theta_k for k = 1, \dots, K.

How should we fit the parameters? Maximum likelihood. There is an issue:

    \log p(x_{1:D} \mid \pi, \theta_{1:K}) = \sum_{d=1}^{D} \log p(x_d \mid \pi, \theta_{1:K})    (1)
                                           = \sum_{d=1}^{D} \log \sum_{z_d=1}^{K} p(z_d \mid \pi)\, p(x_d \mid z_d, \theta_{1:K})    (2)

In the second line, the log cannot help simplify the summation.

Why is that summation there? Because z_d is a hidden variable. If z_d were observed, what happens? Recall classification. The likelihood would decompose.

In general, fitting hidden variable models with maximum likelihood is hard. The EM algorithm can help. (But there are exceptions in both ways: some hidden variable models don't require EM, and for others EM is not enough.)
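To make the issue in (2) concrete, here is a minimal numerical sketch, not from the notes, of the log marginal likelihood of a mixture of multinomials. It assumes documents are stored as a D x V count matrix; the function and variable names are my own.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(X, log_pi, log_theta):
    """Log marginal likelihood of a mixture of multinomials.

    X         : (D, V) array of word counts n_v(x_d)
    log_pi    : (K,) log mixture proportions
    log_theta : (K, V) log per-cluster term distributions
    """
    # log p(x_d | z_d = k) = sum_v n_v(x_d) log theta_{k,v}
    # (dropping the multinomial coefficient, which does not depend on the parameters)
    log_px_given_z = X @ log_theta.T                        # (D, K)
    # log p(x_d) = log sum_k pi_k p(x_d | z_d = k): the sum is stuck inside the log
    log_px = logsumexp(log_pi + log_px_given_z, axis=1)     # (D,)
    return log_px.sum()
```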

The expectation-maximization algorithm is a general-purpose technique for fitting parameters in models with hidden variables. Here is the algorithm for mixtures (in English):

REPEAT:
1. For each document d, compute the conditional distribution of its cluster assignment z_d given the current setting of the parameters \pi^{(t)} and \theta^{(t)}_{1:K}.
2. Obtain \pi^{(t+1)} and \theta^{(t+1)}_{1:K} by computing MLEs for each cluster, but weighting each data point by its posterior probability of being in that cluster.

Notice that this resembles k-means. What are the differences?

This strategy simply works. You can apply it in all the settings that we saw at the end of the slides. But we need to be precise about what we mean by "weighting."

REPEAT:
1. E-step: For each document d, compute

       \lambda_d = p(z_d \mid x_d, \pi^{(t)}, \theta^{(t)}_{1:K}).

2. M-step: For each cluster label's distribution over terms,

       \theta^{(t+1)}_{k,v} = \frac{\sum_{d=1}^{D} \lambda_{d,k}\, n_v(x_d)}{\sum_{d=1}^{D} \lambda_{d,k}\, n(x_d)}    (3)

   and for the cluster proportions,

       \pi^{(t+1)}_{k} = \frac{\sum_{d=1}^{D} \lambda_{d,k}}{D}.    (4)

Do we know how to do the E-step?

    p(z_d = k \mid x_d, \pi^{(t)}, \theta^{(t)}_{1:K}) \propto \pi_k \prod_{v=1}^{V} \theta_{k,v}^{n_v(x_d)}.    (5)

This is precisely the prediction computation from classification (e.g., from "is this email Spam or Ham?" to "is this email class 1, 2, ..., 10?"). We emphasize that the parameters (in the E-step) are fixed.

How about the M-step for the cluster-conditional distributions \theta_k? The numerator is the expected number of times I saw word v in class k. What is the expectation taken with respect to? The posterior distribution of the hidden variable, the variable that determines which cluster parameter each word was generated from.
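A minimal sketch of one EM iteration implementing (3)-(5); this code is not part of the notes. The names are hypothetical, and the small smoothing constant is my addition to keep the next E-step's log of \theta finite.

```python
import numpy as np
from scipy.special import logsumexp

def em_step(X, pi, theta, smooth=1e-10):
    """One EM iteration for a mixture of multinomials.

    X     : (D, V) word-count matrix, one row per document
    pi    : (K,) current cluster proportions
    theta : (K, V) current per-cluster term distributions
    """
    # E-step (eq. 5): lambda_{d,k} proportional to pi_k * prod_v theta_{k,v}^{n_v(x_d)}
    log_post = np.log(pi) + X @ np.log(theta).T                  # (D, K), unnormalized
    lam = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))

    # M-step (eq. 3): expected word counts per cluster, then normalize over terms
    expected_counts = lam.T @ X + smooth                         # (K, V)
    theta_new = expected_counts / expected_counts.sum(axis=1, keepdims=True)

    # M-step (eq. 4): expected number of documents per cluster, divided by D
    pi_new = lam.sum(axis=0) / X.shape[0]
    return pi_new, theta_new, lam
```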

The denominator is the expected number of words (total) that came from class k.

How about the M-step for the cluster proportions \pi? The numerator is the expected number of times I saw a document from cluster k. The denominator is the number of documents.

These are like traditional MLEs, but with expected counts. This is how EM works. It is actually maximizing the likelihood. Let's see why.

Mixtures of Gaussians

The mixture of Gaussians model is equivalent, except x_d is now a Gaussian observation. The mixture components are means \mu_{1:K} and variances \sigma^2_{1:K}. The proportions are \pi.

Mixtures of Gaussians can express different kinds of distributions:
- Multimodal
- Heavy-tailed

As a density estimator, arbitrarily complex densities can be approximated with a mixture of Gaussians (given enough components).

We fit these mixtures with EM.

The E-step is conceptually the same. With the current setting of the parameters, compute the conditional distribution of the hidden variable z_d for each data point x_d,

    p(z_d \mid x_d, \pi, \mu_{1:K}, \sigma^2_{1:K}) \propto p(z_d \mid \pi)\, p(x_d \mid z_d, \mu_{1:K}, \sigma^2_{1:K}).    (6)

This is just like for the mixtures of multinomials, only the data are now Gaussian.

The M-step computes the empirical mean and variance for each component:
- In \mu_k, each point is weighted by its posterior probability of being in the cluster.
- In \sigma^2_k, weight (x_d - \hat{\mu}_k)^2 by the same posterior probability.
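As an illustration, here is a one-dimensional sketch of this EM loop (my own code, not the notes'); the initialization at random data points and the fixed iteration count are arbitrary choices.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """EM for a one-dimensional mixture of Gaussians (means mu_k, variances sigma2_k)."""
    rng = np.random.default_rng(seed)
    D = len(x)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)      # initialize at random data points
    sigma2 = np.full(K, np.var(x))

    for _ in range(n_iter):
        # E-step (eq. 6): posterior over the hidden assignment z_d for each x_d
        log_post = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=np.sqrt(sigma2))
        lam = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))

        # M-step: weighted proportions, weighted empirical means and variances
        Nk = lam.sum(axis=0)                       # expected points per component
        pi = Nk / D
        mu = lam.T @ x / Nk
        sigma2 = (lam * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, sigma2
```

Running this on data drawn from two well-separated Gaussians should recover two bumps; different seeds can land in different local optima, which is the sensitivity to starting points discussed below.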

The complex density estimate comes out of the predictive distribution

    p(x \mid \pi, \mu_{1:K}, \sigma^2_{1:K}) = \sum_{z} p(z \mid \pi)\, p(x \mid z, \mu_{1:K}, \sigma^2_{1:K})    (7)
                                             = \sum_{k=1}^{K} \pi_k\, p(x \mid \mu_k, \sigma^2_k).    (8)

This is K Gaussian bumps, each weighted by \pi_k.

Why does this work? EM maximizes the likelihood.

Generic model

- The hidden variables are z.
- The observations are x.
- The parameters are \theta.
- The joint distribution is p(z, x \mid \theta).

The complete log likelihood is

    \ell_c(\theta; x, z) = \log p(z, x \mid \theta).    (9)

For a distribution q(z), the expected complete log likelihood is E_q[\log p(z, x \mid \theta)].

Jensen's inequality: for a concave function f and \lambda \in (0, 1),

    f(\lambda x + (1 - \lambda) y) \geq \lambda f(x) + (1 - \lambda) f(y).    (10)

Draw a picture with
- x
- y
- \lambda x + (1 - \lambda) y
- f(\lambda x + (1 - \lambda) y)
- \lambda f(x) + (1 - \lambda) f(y)

This generalizes to probability distributions. For concave f,

    f(E[X]) \geq E[f(X)].    (11)
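A quick numeric check of (11) with f = log, the concave case that EM uses below; the particular distribution q and the values of X are arbitrary choices of mine.

```python
import numpy as np

# A discrete distribution q over a few positive values of X (arbitrary choices)
x = np.array([0.5, 1.0, 2.0, 4.0])
q = np.array([0.1, 0.4, 0.3, 0.2])

lhs = np.log(np.sum(q * x))     # f(E[X]) with f = log, which is concave
rhs = np.sum(q * np.log(x))     # E[f(X)]
print(lhs, rhs, lhs >= rhs)     # prints ... True, as eq. (11) requires
```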

Jensen's bound and \mathcal{L}(q, \theta):

    \ell(\theta; x) = \log p(x \mid \theta)    (12)
                    = \log \sum_{z} p(x, z \mid \theta)    (13)
                    = \log \sum_{z} q(z)\, \frac{p(x, z \mid \theta)}{q(z)}    (14)
                    \geq \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}    (15)
                    \triangleq \mathcal{L}(q, \theta).    (16)

This bound on the log likelihood depends on a distribution over the hidden variables q(z) and on the parameters \theta.

EM is a coordinate ascent algorithm. We optimize with respect to each argument, holding the other fixed:

    q^{(t+1)}(z) = \arg\max_{q} \mathcal{L}(q, \theta^{(t)})    (17)
    \theta^{(t+1)} = \arg\max_{\theta} \mathcal{L}(q^{(t+1)}, \theta).    (18)

In the E-step, we find q^*(z). This equals the posterior p(z \mid x, \theta):

    \mathcal{L}(p(z \mid x, \theta), \theta) = \sum_{z} p(z \mid x, \theta) \log \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)}    (19)
                                             = \sum_{z} p(z \mid x, \theta) \log p(x \mid \theta)    (20)
                                             = \log p(x \mid \theta)    (21)
                                             = \ell(\theta; x).    (22)

In the M-step, we only need to worry about the expected complete log likelihood because

    \mathcal{L}(q, \theta) = \sum_{z} q(z) \log p(x, z \mid \theta) - \sum_{z} q(z) \log q(z).    (23)

The second term does not depend on the parameter \theta. The first term is the expected complete log likelihood.

Perspective: optimize a bound, then tighten the bound.

This finds a local optimum and is sensitive to starting points.

Test for convergence: after the E-step, we can compute the likelihood at the point \theta^{(t)}.
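A sketch of the bound in (23) for the multinomial mixture, reusing the responsibilities \lambda returned by the hypothetical em_step sketch earlier; the eps term guarding against log 0 is my addition.

```python
import numpy as np
from scipy.special import logsumexp

def bound(X, pi, theta, lam, eps=1e-12):
    """The bound L(q, theta) of eq. (23) for the multinomial mixture,
    where q(z_d) is given by the responsibilities lam, shape (D, K)."""
    log_joint = np.log(pi) + X @ np.log(theta).T      # log p(x_d, z_d = k | theta)
    expected_complete_ll = np.sum(lam * log_joint)    # sum_z q(z) log p(x, z | theta)
    entropy = -np.sum(lam * np.log(lam + eps))        # - sum_z q(z) log q(z)
    return expected_complete_ll + entropy

# If lam is the exact posterior (the E-step), the bound is tight: it equals the
# log marginal likelihood logsumexp(np.log(pi) + X @ np.log(theta).T, axis=1).sum(),
# which gives a simple convergence check after each E-step.
```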

Big picture

With EM, we can fit all kinds of hidden variable models with maximum likelihood.

EM has had a huge impact on statistics, machine learning, signal processing, computer vision, natural language processing, speech recognition, genomics, and other fields. Missing data was no longer a block. It opened up the following idea: imagine I can observe everything; then I can fit the MLE. But I cannot observe everything. With EM, I can still try to fit the MLE. For example:

- Filling in missing pixels in tomography.
- Modeling nonresponses in sample surveys.
- Many applications in genetics, treating ancestral variables and various unknown rates as unobserved random variables.
- Time series models (we'll see these).
- Financial models.
- An EM application at Amazon: users and items are clustered; a user's rating on an item has to do with their cluster combination.

Closely related to:

- Variational methods for approximate posterior inference
- Gibbs sampling

Both let us have the same flexibility in modeling, but in the setting where we cannot compute the real posterior. (This often occurs in Bayesian models.)

History of the EM algorithm

- Dempster et al. (1977), "Maximum likelihood from incomplete data via the EM algorithm."
- Hartley: "I feel like the old minstrel who has been singing his song for 18 years and now finds, with considerable satisfaction, that his folklore is the theme of an overpowering symphony."
- Hartley (1958), "Maximum likelihood procedures for incomplete data."
- McKendrick (1926), "Applications of mathematics to medical problems."