Gaussian Mixture Models
Pradeep Ravikumar
Co-instructor: Manuela Veloso
Machine Learning 10-701
Some slides courtesy of Eric Xing, Carlos Guestrin
(One) bad case for K-means
- Clusters may overlap
- Some clusters may be wider than others
- Clusters may not be linearly separable
Partitioning Algorithms
- K-means: hard assignment, each object belongs to only one cluster
- Mixture modeling: soft assignment, probability that an object belongs to a cluster
- Generative approach: think of each cluster as a component distribution; any data point is drawn from a mixture of multiple component distributions
Gaussian Mixture Model
Mixture of K Gaussian distributions (a multi-modal distribution):
p(x | y=i) ~ N(µ_i, σ²I)   (mixture component)
p(x) = Σ_i p(x | y=i) P(y=i)   (mixture proportion P(y=i))
[Figure: three spherical Gaussian components with means µ_1, µ_2, µ_3]
General GMM
GMM = Gaussian Mixture Model (a multi-modal distribution):
p(x | y=i) ~ N(µ_i, Σ_i)   (mixture component)
p(x) = Σ_i p(x | y=i) P(y=i)   (mixture proportion P(y=i))
[Figure: three Gaussian components with means µ_1, µ_2, µ_3 and general covariances]
General GMM
There are k components. Component i has an associated mean vector µ_i and generates data from a Gaussian with mean µ_i and covariance matrix Σ_i.
Each data point is generated according to the following recipe:
1) Pick a component at random: choose component i with probability P(y=i)
2) Draw a data point x ~ N(µ_i, Σ_i)
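A minimal sketch of this two-step recipe in Python; the means, covariances, and mixing proportions below are made-up illustrative values, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D GMM with k = 3 components (made-up parameters, for demonstration only).
means = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])                    # mu_i
covs = np.array([np.eye(2), [[2.0, 0.5], [0.5, 1.0]], 0.5 * np.eye(2)])   # Sigma_i
weights = np.array([0.5, 0.3, 0.2])                                       # P(y = i)

def sample_gmm(n):
    # 1) Pick a component at random: choose component i with probability P(y = i)
    y = rng.choice(len(weights), size=n, p=weights)
    # 2) Draw x ~ N(mu_i, Sigma_i) from the chosen component
    x = np.stack([rng.multivariate_normal(means[i], covs[i]) for i in y])
    return x, y

X, labels = sample_gmm(500)   # X is a sample from the mixture density p(x)
```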
General GMM
p(x | y=i) ~ N(µ_i, Σ_i)
Gaussian Bayes classifier:
log [P(y=i | x) / P(y=j | x)] = log [p(x | y=i) P(y=i) / (p(x | y=j) P(y=j))] = x^T W x + w^T x + const
Quadratic decision boundary: the second-order terms don't cancel out.
The boundaries depend on µ_1, µ_2, ..., µ_K, Σ_1, Σ_2, ..., Σ_K, P(y=1), ..., P(y=K).
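Expanding the two Gaussian log-densities makes the quadratic boundary explicit (a standard derivation, stated here for completeness):

```latex
\log \frac{P(y=i \mid x)}{P(y=j \mid x)}
  = -\tfrac{1}{2}\, x^{\top}\!\left(\Sigma_i^{-1} - \Sigma_j^{-1}\right) x
    + \left(\Sigma_i^{-1}\mu_i - \Sigma_j^{-1}\mu_j\right)^{\top} x
    + \mathrm{const}
```

So W = −½(Σ_i^{-1} − Σ_j^{-1}) and w = Σ_i^{-1}µ_i − Σ_j^{-1}µ_j; the quadratic term vanishes only when Σ_i = Σ_j, which is why shared covariances give a linear boundary.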
Learning General GMM
Given data x_1, ..., x_m drawn from the Gaussian mixture model
p(x) = Σ_{i=1}^k p(x | Y=i) P(Y=i),
with mixture components p(x | Y=i) ~ N(µ_i, Σ_i) and mixture proportions p_i = P(Y=i).
Parameters: {p_i, µ_i, Σ_i}_{i=1}^k.
How to estimate the parameters? Maximum likelihood. But we don't know the labels Y (recall the Gaussian Bayes classifier).
[Figure: three Gaussian components with means µ_1, µ_2, µ_3]
Learning General GMM
Maximize the marginal likelihood:
argmax ∏_j P(x_j) = argmax ∏_j Σ_{i=1}^K P(y_j=i, x_j)
= argmax ∏_j Σ_{i=1}^K P(y_j=i) p(x_j | y_j=i)   (marginalizing over y_j; P(y_j=i) = P(y=i), i.e. mixture component i is chosen with probability P(y=i))
= argmax ∏_{j=1}^m Σ_{i=1}^K P(y=i) (1/√(det(2πΣ_i))) exp(−½ (x_j − µ_i)^T Σ_i^{-1} (x_j − µ_i))
How do we find the µ_i's and P(y=i)'s which give the maximum marginal likelihood?
- Set the gradient of the log-likelihood with respect to the µ_i's to 0 and solve: non-linear, not analytically solvable.
- Use gradient descent: doable, but often slow.
(A sketch of the objective is given below.)
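As a concrete sketch of the objective (the function and parameter names are illustrative), the marginal log-likelihood can be evaluated for any candidate parameters, but its maximizer has no closed form:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """log prod_j P(x_j) = sum_j log sum_i P(y=i) N(x_j; mu_i, Sigma_i)."""
    m, k = X.shape[0], len(weights)
    log_terms = np.empty((m, k))
    for i in range(k):
        # log of [ P(y=i) * N(x_j; mu_i, Sigma_i) ] for every data point x_j
        log_terms[:, i] = np.log(weights[i]) + multivariate_normal(means[i], covs[i]).logpdf(X)
    # log-sum-exp over components i gives log P(x_j); summing over j gives the objective
    return logsumexp(log_terms, axis=1).sum()
```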
GMM vs. k-means
GMM maximizes the marginal likelihood:
argmax ∏_j P(x_j) = argmax ∏_j Σ_{i=1}^K P(y_j=i, x_j) = argmax ∏_j Σ_{i=1}^K P(y_j=i) p(x_j | y_j=i)
K-means instead maximizes over hard assignments C(j) (for spherical Gaussians with fixed variance σ²):
argmax ∏_j p(x_j | y_j=C(j)) = argmax ∏_j exp(−||x_j − µ_{C(j)}||² / (2σ²)) = argmin Σ_j ||x_j − µ_{C(j)}||²
Same as k-means!
Expectation-Maximization (EM)
A general algorithm to deal with hidden data; we will study it first in the context of unsupervised learning (hidden labels).
No need to choose a step size as in gradient methods.
EM is an iterative algorithm with two linked steps:
- E-step: fill in the hidden data (Y) using inference
- M-step: apply standard MLE/MAP methods to estimate the parameters {p_i, µ_i, Σ_i}_{i=1}^k
We will see that this procedure monotonically improves the likelihood (or leaves it unchanged), so it always converges to a local optimum of the likelihood. (A generic skeleton of the procedure is sketched below.)
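A generic sketch of the two linked steps; `e_step` and `m_step` are placeholder callbacks, to be filled in with the model-specific updates derived on the next slides.

```python
def em(X, theta, e_step, m_step, n_iter=100):
    """Generic EM skeleton: alternate between filling in hidden data and re-fitting parameters."""
    for _ in range(n_iter):
        Q = e_step(X, theta)    # E-step: infer (soft) hidden labels Y given current parameters
        theta = m_step(X, Q)    # M-step: standard MLE/MAP of the parameters given the filled-in data
    return theta
```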
EM for spherical, same-variance GMMs
E-step: compute the expected classes of all data points for each class:
P(y=i | x_j, µ_1, ..., µ_k) ∝ exp(−||x_j − µ_i||² / (2σ²)) P(y=i)
(In the K-means "E-step" we do a hard assignment; EM does a soft assignment.)
M-step: compute the MLE for µ given our data's class membership distributions (weights):
µ_i = Σ_{j=1}^m P(y=i | x_j) x_j / Σ_{j=1}^m P(y=i | x_j)
Similar to K-means, but with weighted data. Iterate.
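A sketch of one such iteration, assuming a known shared variance sigma2 and fixed class priors (only the means are re-estimated, matching this slide):

```python
import numpy as np

def em_step_spherical(X, mu, priors, sigma2):
    """One EM iteration for a spherical, shared-variance GMM (only the means are updated)."""
    # E-step: P(y=i | x_j) proportional to exp(-||x_j - mu_i||^2 / (2 sigma^2)) * P(y=i)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # shape (m, k)
    log_r = -sq_dist / (2.0 * sigma2) + np.log(priors)
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)                                  # soft assignments (weights)
    # M-step: weighted MLE of each mean, i.e. a weighted version of the k-means update
    mu_new = (r[:, :, None] * X[:, None, :]).sum(axis=0) / r.sum(axis=0)[:, None]
    return mu_new, r
```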
EM for general GMMs
Iterate. On iteration t, let our estimates be
λ_t = {µ_1^(t), ..., µ_k^(t), Σ_1^(t), ..., Σ_k^(t), p_1^(t), ..., p_k^(t)},
where p_i^(t) is shorthand for the estimate of P(y=i) on the t-th iteration.
E-step: compute the expected classes of all data points for each class:
P(y=i | x_j, λ_t) ∝ p_i^(t) p(x_j | µ_i^(t), Σ_i^(t))   (just evaluate a Gaussian at x_j)
M-step: compute the MLEs given our data's class membership distributions (weights):
µ_i^(t+1) = Σ_j P(y=i | x_j, λ_t) x_j / Σ_j P(y=i | x_j, λ_t)
Σ_i^(t+1) = Σ_j P(y=i | x_j, λ_t) (x_j − µ_i^(t+1))(x_j − µ_i^(t+1))^T / Σ_j P(y=i | x_j, λ_t)
p_i^(t+1) = Σ_j P(y=i | x_j, λ_t) / m,   where m = #data points
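Putting the E- and M-step updates together, a sketch of the full loop; initialization and convergence checks are omitted, and scipy is used only to evaluate the Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, mu, cov, p, n_iter=100):
    """EM for a general GMM; mu (k,d), cov (k,d,d) and p (k,) are initial guesses."""
    m, k = X.shape[0], len(p)
    for _ in range(n_iter):
        # E-step: responsibilities P(y=i | x_j, lambda_t), proportional to p_i * N(x_j; mu_i, Sigma_i)
        r = np.column_stack([p[i] * multivariate_normal(mu[i], cov[i]).pdf(X) for i in range(k)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLEs of the means, covariances, and mixture proportions
        Nk = r.sum(axis=0)                         # effective number of points per component
        mu = (r.T @ X) / Nk[:, None]
        for i in range(k):
            d = X - mu[i]
            cov[i] = (r[:, i, None] * d).T @ d / Nk[i]
        p = Nk / m
    return mu, cov, p, r
```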
EM for general GMMs: Example
[Figure: data with three Gaussian components (means µ_1, µ_2, µ_3, covariances Σ_1, Σ_2, Σ_3) and the responsibilities P(y=i | x_j, µ_1, µ_2, µ_3, Σ_1, Σ_2, Σ_3, p_1, p_2, p_3).]
[Figure sequence: the fitted mixture after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th EM iterations.]
Example: GMM clustering
General GMM
GMM = Gaussian Mixture Model (a multi-modal distribution):
p(x) = Σ_i p(x | y=i) P(y=i), with mixture components p(x | y=i) ~ N(µ_i, Σ_i) and mixture proportions P(y=i)
[Figure: three Gaussian components with means µ_1, µ_2, µ_3]
Resulting Density Estimator
General EM algorithm
Marginal likelihood: x is observed, z is missing:
log ∏_j P(x_j | θ) = Σ_j log P(x_j | θ) = Σ_j log Σ_z P(x_j, z | θ)
How do we maximize the marginal likelihood using EM?
Lower-bound on marginal likelihood
Variational approach. Jensen's inequality: for a distribution P(z) and a function f(z),
log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z),
since log is a concave function: log(a x + (1−a) y) ≥ a log(x) + (1−a) log(y).
[Figure: concavity of the log function]
Lower-bound on marginal likelihood
Apply Jensen's inequality with P(z) = Q(z | x_j) and f(z) = P(x_j, z | θ) / Q(z | x_j):
Σ_j log P(x_j | θ) = Σ_j log Σ_z Q(z | x_j) [P(x_j, z | θ) / Q(z | x_j)]
≥ Σ_j Σ_z Q(z | x_j) log [P(x_j, z | θ) / Q(z | x_j)] =: F(θ, Q)
EM as Coordinate Ascent
log P(x | θ) ≥ F(θ, Q), so maximize the lower bound F by coordinate ascent:
E-step: fix θ, maximize F over Q:  Q^(t+1) = argmax_Q F(θ^t, Q)
M-step: fix Q, maximize F over θ:  θ^(t+1) = argmax_θ F(θ, Q^(t+1))
Convergence of EM
[Figure: the marginal likelihood function and the sequence of EM lower-bound F-functions F(θ, Q^(t+1)).]
EM monotonically converges to a local maximum of the likelihood! (The chain of inequalities is sketched below.)
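The monotonicity follows from chaining the lower bound with the two coordinate updates (using the E-step property, shown on the E-step slides below, that the bound is tight after the E-step):

```latex
\log P(x \mid \theta^{t+1})
\;\ge\; F(\theta^{t+1}, Q^{t+1})   % F is a lower bound on the marginal likelihood
\;\ge\; F(\theta^{t},   Q^{t+1})   % M-step: theta^{t+1} maximizes F over theta
\;=\;   \log P(x \mid \theta^{t})  % E-step: the bound is tight at theta^t
```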
EM & Local Maxima Typical likelihood function Different sequence of EM lower bound F-functions depending on initialization Use multiple, randomized initializations in practice
E-step
E-step: fix θ, maximize F over Q.
F(θ, Q) = Σ_j Σ_z Q(z | x_j) log [P(x_j, z | θ) / Q(z | x_j)]
= Σ_j Σ_z Q(z | x_j) log [P(z | x_j, θ) P(x_j | θ) / Q(z | x_j)]
= Σ_j log P(x_j | θ) − Σ_j KL( Q(z | x_j) || P(z | x_j, θ) )
(the second term is the KL divergence between two distributions)
E-step
F(θ, Q) = Σ_j log P(x_j | θ) − Σ_j KL( Q(z | x_j) || P(z | x_j, θ) )
Since KL ≥ 0, the above expression is maximized when the KL divergence is 0, and KL(Q, P) = 0 iff Q = P.
Therefore, E-step:  Q^(t+1)(z | x_j) = P(z | x_j, θ^t)
E-step
E-step:  Q^(t+1)(z | x_j) = P(z | x_j, θ^t)
⇒ compute the probability of the missing data z given the current choice of θ.
This re-aligns F with the marginal likelihood: F(θ^t, Q^(t+1)) = Σ_j log P(x_j | θ^t).
M-step
M-step: fix Q, maximize F over θ.
F(θ, Q^(t+1)) = Σ_{j=1}^m Σ_z Q^(t+1)(z | x_j) log [P(z, x_j | θ) / Q^(t+1)(z | x_j)]
= Σ_{j=1}^m Σ_z Q^(t+1)(z | x_j) log P(z, x_j | θ) + Σ_{j=1}^m H(Q^(t+1)(z | x_j))
The first term is the expected log-likelihood with respect to Q (the log-likelihood if z were known); the second term is fixed (independent of θ).
M-step
M-step: fix Q, maximize F over θ:
θ^(t+1) = argmax_θ Σ_{j=1}^m Σ_z Q^(t+1)(z | x_j) log P(z, x_j | θ)
i.e., maximize the expected log-likelihood with respect to Q^(t+1); the entropy term Σ_{j=1}^m H(Q^(t+1)(z | x_j)) is fixed (independent of θ).
EM as Coordinate Ascent
E-step: fix θ, maximize F over Q:  Q^(t+1) = argmax_Q F(θ^t, Q)
→ compute the probability of the missing data given the current choice of θ, e.g. P(y=i | x_j, µ^t)
M-step: fix Q, maximize F over θ:  θ^(t+1) = argmax_θ F(θ, Q^(t+1))
→ compute the estimate of θ by maximizing the lower bound F (equivalently, the expected log-likelihood) using Q^(t+1)(z | x_j)
Summary: EM Algorithm
A way of maximizing the likelihood function for hidden-variable models. Finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some missing or unobserved data from the observed data and the current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.
Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
1. E-step:  Q^(t+1) = argmax_Q F(θ^t, Q)
2. M-step:  θ^(t+1) = argmax_θ F(θ, Q^(t+1))
In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound equal to the likelihood.
EM performs coordinate ascent on F and can get stuck in local optima of the likelihood. BUT it is extremely popular in practice.