Mixtures of Gaussians
Sargur Srihari, srihari@cedar.buffalo.edu

9. Mixture Models and EM
0. Mixture Models Overview
1. K-Means Clustering
2. Mixtures of Gaussians
3. An Alternative View of EM
4. The EM Algorithm in General

Topics in Mixtures of Gaussians
Goal of Gaussian Mixture Modeling
Latent Variables
Maximum Likelihood
EM for Gaussian Mixtures

Goal of Gaussian Mixture Modeling
A Gaussian mixture is a linear superposition of Gaussians of the form
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Goal of modeling: find the maximum likelihood parameters $\pi_k$, $\mu_k$, $\Sigma_k$.
Examples of data sets and models: 1-D data with K=2 subclasses; 2-D data with K=3.
Parameters of the 1-D example:
k    1     2
π    0.4   0.6
µ    28    1.86
σ    0.48  0.88
Each data point is associated with a subclass k with probability $\pi_k$.
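The mixture density above can be evaluated directly. The following is a minimal sketch for the 1-D case using only the Python standard library; the function names and the parameter values are illustrative assumptions, not taken from the slides.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, pis, mus, sigmas):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

# Hypothetical parameters, chosen only for illustration.
pis, mus, sigmas = [0.4, 0.6], [0.0, 3.0], [0.5, 1.0]

# Crude Riemann sum over [-10, 10] as a sanity check that p(x) integrates to about 1.
step = 0.01
total = sum(gmm_pdf(-10.0 + i * step, pis, mus, sigmas) * step for i in range(2001))
```

Because the mixing coefficients sum to one and each component is a normalized density, the mixture is itself a normalized density, which is what the numerical check illustrates.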

GMMs and Latent Variables
A GMM is a linear superposition of Gaussian components.
It provides a richer class of density models than a single Gaussian.
We formulate a GMM in terms of discrete latent variables.
This provides deeper insight into the distribution and serves to motivate the EM algorithm, which gives a maximum likelihood solution for the mixing coefficients of the components and their means and covariances.

Latent Variable Representation
Linear superposition of K Gaussians:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Introduce a K-dimensional binary variable z that uses a 1-of-K representation (one-hot vector).
Let $z = (z_1, \ldots, z_K)$, whose elements satisfy $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$.
There are K possible states of z, corresponding to the K components.
For K=2:
k     1     2
z     10    01
π_k   0.4   0.6
µ_k   28    1.86
σ_k   0.48  0.88
For K=3:
k   1    2    3
z   100  010  001

Joint Distribution
Define the joint distribution of the latent variable and the observed variable:
$$p(x, z) = p(x \mid z) \, p(z)$$
x is the observed variable; z is the hidden or missing variable.
$p(z)$ is the marginal distribution and $p(x \mid z)$ is the conditional distribution.

Graphical Representation of Mixture Model
The joint distribution $p(x, z)$ is represented in the form $p(z) \, p(x \mid z)$.
The latent variable $z = [z_1, \ldots, z_K]$ represents the subclass; x is the observed variable.
We now specify the marginal $p(z)$ and the conditional $p(x \mid z)$.
Using them we specify $p(x)$ in terms of observed and latent variables.

Specifying the marginal p(z)
Associate a probability with each component $z_k$.
Denote $p(z_k = 1) = \pi_k$, where the parameters $\{\pi_k\}$ satisfy $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$.
Because z uses the 1-of-K representation, it follows that
$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$$
since $z_k \in \{0, 1\}$ and the components of z are mutually exclusive: exactly one $z_k$ equals 1, so every other factor in the product equals 1.
With one component: $p(z_1) = \pi_1^{z_1}$.
With two components: $p(z_1, z_2) = \pi_1^{z_1} \pi_2^{z_2}$.

Specifying the Conditional p(x|z)
For a particular component (value of z):
$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Thus $p(x \mid z)$ can be written in the form
$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$$
Due to the exponent $z_k$, all product terms except one are equal to one.
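
Marginal distribution p(x)
The joint distribution $p(x, z)$ is given by $p(z) \, p(x \mid z)$.
Thus the marginal distribution of x is obtained by summing over all possible states of z:
$$p(x) = \sum_{z} p(z) \, p(x \mid z) = \sum_{z} \prod_{k=1}^{K} \left( \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \right)^{z_k} = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
The last step holds since $z_k \in \{0, 1\}$ and z has exactly K one-hot states.
This is the standard form of a Gaussian mixture.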
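The marginalization step can be checked numerically. The sketch below, for the 1-D case with hypothetical parameter values and function names, enumerates the K one-hot states of z, sums the joint p(x, z) over them, and compares the result with the mixture formula.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def one_hot_states(K):
    """All K states of a 1-of-K binary vector z."""
    return [tuple(1 if i == k else 0 for i in range(K)) for k in range(K)]

def joint(x, z, pis, mus, sigmas):
    """p(x, z) = prod_k (pi_k * N(x | mu_k, sigma_k^2)) ** z_k."""
    value = 1.0
    for z_k, p, m, s in zip(z, pis, mus, sigmas):
        value *= (p * normal_pdf(x, m, s)) ** z_k
    return value

def marginal_by_sum_over_z(x, pis, mus, sigmas):
    """p(x) obtained by summing the joint over every state of z."""
    return sum(joint(x, z, pis, mus, sigmas) for z in one_hot_states(len(pis)))

def mixture(x, pis, mus, sigmas):
    """p(x) written directly as sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

# Hypothetical three-component parameters for illustration.
pis, mus, sigmas = [0.5, 0.3, 0.2], [-2.0, 0.0, 4.0], [1.0, 0.5, 2.0]
difference = abs(marginal_by_sum_over_z(1.3, pis, mus, sigmas) - mixture(1.3, pis, mus, sigmas))
```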

Value of Introducing Latent Variable
If we have observations $x_1, \ldots, x_N$, then because the marginal distribution is of the form $p(x) = \sum_{z} p(x, z)$, it follows that for every observed data point $x_n$ there is a corresponding latent vector $z_n$, i.e., its subclass.
Thus we have found a formulation of the Gaussian mixture involving an explicit latent variable.
We are now able to work with the joint distribution $p(x, z)$ instead of the marginal $p(x)$.
This leads to significant simplification through the introduction of expectation maximization.

Another conditional probability (Responsibility)
In EM the posterior $p(z \mid x)$ plays a role.
The probability $p(z_k = 1 \mid x)$ is denoted $\gamma(z_k)$. From Bayes' theorem, using $p(x, z) = p(x \mid z) \, p(z)$:
$$\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1) \, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1) \, p(x \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
View $p(z_k = 1) = \pi_k$ as the prior probability of component k, and $\gamma(z_k) = p(z_k = 1 \mid x)$ as the posterior probability once x is observed.
$\gamma(z_k)$ is also the responsibility that component k takes for explaining the observation x.
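The responsibility formula is a direct application of Bayes' theorem and can be sketched in a few lines. This is a 1-D illustration with hypothetical parameter values; the function names are assumptions made for the example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(x, pis, mus, sigmas):
    """gamma(z_k) = pi_k N(x | mu_k, sigma_k^2) / sum_j pi_j N(x | mu_j, sigma_j^2)."""
    weighted = [p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(weighted)  # this is the mixture density p(x)
    return [w / total for w in weighted]

# Hypothetical parameters: priors 0.4 and 0.6, means 0 and 3.
pis, mus, sigmas = [0.4, 0.6], [0.0, 3.0], [0.5, 1.0]
gamma_near_first = responsibilities(0.0, pis, mus, sigmas)
gamma_near_second = responsibilities(3.0, pis, mus, sigmas)
```

A point close to the first mean gets most of its responsibility from the first component even though that component has the smaller prior, which shows how the posterior combines the prior with the likelihood.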

Plan of Discussion
Next we look at:
1. How to get data from a mixture model synthetically, and then
2. Given a data set $\{x_1, \ldots, x_N\}$, how to model the data using a mixture of Gaussians

Synthesizing data from mixture
Use ancestral sampling: start with the lowest-numbered node and draw a sample, then move to the successor node and draw a sample given the parent value, and so on.
For the mixture model: first generate a sample of z, called $\hat{z}$, from $p(z)$; then generate a value for x from the conditional $p(x \mid \hat{z})$.
Samples from $p(x, z)$ are plotted according to the value of x and colored with the value of z (complete data set).
Samples from the marginal $p(x)$ are obtained by ignoring the values of z (incomplete data set).
Figure: 500 points from three Gaussians, shown as the complete data set and the incomplete data set.
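Ancestral sampling for a mixture amounts to two draws per point. The following is a minimal 1-D sketch using the standard library `random` module; the component parameters, the seed, and the function name are assumptions chosen for illustration.

```python
import random

def sample_gmm(n, pis, mus, sigmas, seed=0):
    """Ancestral sampling: draw z from p(z), then x from p(x | z)."""
    rng = random.Random(seed)
    complete = []
    for _ in range(n):
        # Parent node: pick component index k with probability pi_k.
        k = rng.choices(range(len(pis)), weights=pis)[0]
        # Child node: draw x from the Gaussian of the chosen component.
        x = rng.gauss(mus[k], sigmas[k])
        complete.append((x, k))
    return complete

# Hypothetical three-component mixture, 500 points as in the figure.
complete = sample_gmm(500, [0.5, 0.3, 0.2], [-4.0, 0.0, 5.0], [1.0, 0.5, 1.5])

# Dropping the labels gives samples from the marginal p(x).
incomplete = [x for x, _ in complete]
```

Here `complete` plays the role of the complete data set (x together with its label) and `incomplete` the role of the incomplete data set.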

Illustration of responsibilities
Evaluate, for every data point, the posterior probability of each component.
The responsibility $\gamma(z_{nk})$ is associated with data point $x_n$.
Color each point using proportions of red, blue and green ink.
If for a data point $\gamma(z_{n1}) = 1$, it is colored red.
If for another point $\gamma(z_{n2}) = \gamma(z_{n3}) = 0.5$, it has equal blue and green and will appear as cyan.

Maximum Likelihood for GMM
We wish to model the data set $\{x_1, \ldots, x_N\}$ using a mixture of Gaussians (N items, each of dimension D).
Represent the data by an N × D matrix X whose n-th row is $x_n^T$.
Represent the latent variables by an N × K matrix Z whose n-th row is $z_n^T$.
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad Z = \begin{pmatrix} z_1^T \\ z_2^T \\ \vdots \\ z_N^T \end{pmatrix}$$
Goal is to state the likelihood function so as to estimate the three sets of parameters by maximizing the likelihood.

Graphical representation of GMM
For a set of i.i.d. data points $\{x_n\}$ with corresponding latent points $\{z_n\}$, where $n = 1, \ldots, N$.
Figure: Bayesian network for $p(x, z)$ using plate notation, with the N × D matrix X and the N × K matrix Z.
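
Likelihood Function for GMM
The mixture density function is
$$p(x) = \sum_{z} p(z) \, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
since z has values $\{z_k\}$ with probabilities $\{\pi_k\}$.
Therefore the likelihood function is
$$p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
where the product is over the N i.i.d. samples.
Therefore the log-likelihood function is
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
which we wish to maximize. This is a more difficult problem than for a single Gaussian.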
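The log-likelihood can be computed as written, but the inner sum of densities may underflow for points far from every component. The sketch below, for 1-D data, evaluates the same quantity in log space with the log-sum-exp trick; that trick is an implementation choice added here and is not discussed in the slides, and the function names are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """ln N(x | mu, sigma^2) for a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def log_likelihood(xs, pis, mus, sigmas):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    total = 0.0
    for x in xs:
        # ln(pi_k) + ln N(x | mu_k, sigma_k^2) for each component k.
        terms = [math.log(p) + log_normal_pdf(x, m, s)
                 for p, m, s in zip(pis, mus, sigmas)]
        # log-sum-exp: factor out the largest term before exponentiating.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total

# With a single component the result reduces to the ordinary Gaussian log-likelihood.
single = log_likelihood([0.0], [1.0], [0.0], [1.0])

# A far-away point still yields a finite value instead of ln(0).
far = log_likelihood([1000.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
```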

Maximization of Log-Likelihood
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
Goal is to estimate the three sets of parameters $\pi_k$, $\mu_k$, $\Sigma_k$ by taking derivatives with respect to each in turn while keeping the others constant.
But there are no closed-form solutions.
The task is not straightforward because the summation over components appears inside the logarithm, so the logarithm does not act directly on the Gaussian.
While gradient-based optimization is possible, we consider the iterative EM algorithm.

Some issues with GMM m.l.e.
Before proceeding with the m.l.e., we briefly mention two technical issues:
1. Problem of singularities with Gaussian mixtures
2. Problem of identifiability of mixtures
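
Problem of Singularities with Gaussian mixtures
Consider a Gaussian mixture whose components have covariance matrices $\Sigma_k = \sigma_k^2 I$.
A data point that falls on a mean, $\mu_j = x_n$, contributes to the likelihood function the term
$$\mathcal{N}(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi)^{1/2}} \frac{1}{\sigma_j}$$
since the exponential factor equals 1 when $x_n - \mu_j = 0$.
As $\sigma_j \to 0$ this term goes to infinity.
Therefore maximization of the log-likelihood
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
is not well-posed.
In a mixture, one component can assign finite values to all the data points while another collapses onto a single point and contributes an ever larger value.
This does not happen with a single Gaussian: the multiplicative factors from the other data points go to zero and take the likelihood to zero.
It does not happen in the Bayesian approach.
In practice the problem is avoided using heuristics, such as resetting the mean or covariance of a collapsing component.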
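The singularity can be observed numerically. In this 1-D sketch (data values and parameters are made up for illustration), one component stays broad while the other is centered exactly on a data point and its standard deviation is shrunk; the log-likelihood keeps growing.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(xs, pis, mus, sigmas):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)))
        for x in xs
    )

# Hypothetical data set; the second component sits exactly on the point 0.0.
xs = [-1.0, 0.0, 0.5, 2.0]
shrinking_sigmas = (1.0, 1e-2, 1e-4, 1e-6)
values = [
    log_likelihood(xs, [0.5, 0.5], [0.5, 0.0], [1.0, s])
    for s in shrinking_sigmas
]
# values increases without bound as the second sigma shrinks toward zero.
```

The broad first component keeps every other point at a finite density, so nothing offsets the diverging term from the collapsed component. With a single Gaussian that offset would exist, because the same shrinking variance would drive the density of all other points to zero.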

Problem of Identifiability
A density $p(x \mid \theta)$ is identifiable if, whenever $\theta \ne \theta'$, there is an x for which $p(x \mid \theta) \ne p(x \mid \theta')$.
A K-component mixture will have a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components.
E.g., for K=3, K!=6: 123, 132, 213, 231, 312, 321.
For any given point in the space of parameter values there will be a further K!-1 additional points, all giving exactly the same distribution.
However, any of the equivalent solutions is as good as any other.
Figure: two ways of labeling three Gaussian subclasses (A B C and B A C).
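The K! equivalent solutions are easy to verify: relabeling the components permutes the parameter sets but leaves the density unchanged. A small 1-D sketch with hypothetical parameters for K=3:

```python
import math
from itertools import permutations

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, params):
    """params is a list of (pi_k, mu_k, sigma_k) triples, one per component."""
    return sum(p * normal_pdf(x, m, s) for p, m, s in params)

# Hypothetical parameter sets for three components.
params = [(0.5, -2.0, 1.0), (0.3, 0.0, 0.5), (0.2, 4.0, 2.0)]

# Every relabeling of the components is a different point in parameter space.
relabelings = list(permutations(params))

# Largest deviation of any relabeled density from the original, over a few test points.
test_points = [-3.0, -0.5, 0.0, 1.7, 5.0]
max_gap = max(
    abs(gmm_pdf(x, list(perm)) - gmm_pdf(x, params))
    for perm in relabelings
    for x in test_points
)
```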

EM for Gaussian Mixtures
EM is a method for finding maximum likelihood solutions for models with latent variables.
Begin with the log-likelihood function
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
We wish to find $\pi, \mu, \Sigma$ that maximize this quantity.
The task is not straightforward because the summation over components appears inside the logarithm, so the logarithm does not act directly on the Gaussian.
Take derivatives in turn with respect to the means $\mu_k$, the covariance matrices $\Sigma_k$ and the mixing coefficients $\pi_k$, and set each to zero.
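
EM for GMM: Derivative wrt $\mu_k$
Begin with the log-likelihood function
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
Take the derivative with respect to the means $\mu_k$ and set it to zero, making use of the exponential form of the Gaussian.
Use the formulas $\frac{d}{dx} \ln u = \frac{u'}{u}$ and $\frac{d}{dx} e^{u} = e^{u} u'$.
We get
$$0 = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k)$$
The fraction is $\gamma(z_{nk})$, the posterior probabilities, and $\Sigma_k^{-1}$ is the inverse of the covariance matrix.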

M.L.E. solution for Means
Multiplying by $\Sigma_k$ (assuming non-singularity) gives
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$
where we have defined
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
which is the effective number of points assigned to cluster k.
The mean of the k-th Gaussian component is the weighted mean of all the points in the data set, where data point $x_n$ is weighted by the posterior probability that component k was responsible for generating $x_n$.

M.L.E. solution for Covariance
Set the derivative with respect to $\Sigma_k$ to zero, making use of the m.l.e. solution for the covariance matrix of a single Gaussian:
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$$
This is similar to the result for a single Gaussian fitted to the data set, but each data point is weighted by the corresponding posterior probability.
The denominator is the effective number of points in the component.
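
M.L.E. solution for Mixing Coefficients
Maximize $\ln p(X \mid \pi, \mu, \Sigma)$ with respect to $\pi_k$.
We must take into account that the mixing coefficients sum to one.
This is achieved using a Lagrange multiplier and maximizing
$$\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$
Setting the derivative with respect to $\pi_k$ to zero and solving gives
$$\pi_k = \frac{N_k}{N}$$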
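The intermediate steps of the Lagrange multiplier argument are not shown on the slide. One way to fill them in, using the definitions of $\gamma(z_{nk})$ and $N_k$ given earlier, is sketched below.

```latex
\begin{align}
0 &= \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
        {\sum_{j} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda
  && \text{derivative with respect to } \pi_k \\
0 &= \sum_{n=1}^{N} \gamma(z_{nk}) + \lambda \pi_k = N_k + \lambda \pi_k
  && \text{multiply both sides by } \pi_k \\
0 &= N + \lambda \quad\Rightarrow\quad \lambda = -N
  && \text{sum over } k \text{, using } \textstyle\sum_k N_k = N,\ \sum_k \pi_k = 1 \\
\pi_k &= \frac{N_k}{N}
  && \text{substitute } \lambda = -N
\end{align}
```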

Summary of m.l.e. expressions
GMM maximum likelihood parameter estimates:
Means:
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$
Covariance matrices:
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$$
Mixing coefficients:
$$\pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
All three are in terms of the responsibilities, and so we have not completely solved the problem.
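Given a table of responsibilities, the three estimates are simple weighted averages. A minimal 1-D sketch follows, in which the covariance matrix reduces to a scalar variance; the function name and the data layout (`gammas[n][k]` holding the responsibility of component k for point n) are assumptions made for the example.

```python
def m_step(xs, gammas):
    """Re-estimate (pis, mus, variances) from responsibilities gammas[n][k]."""
    N = len(xs)
    K = len(gammas[0])
    # N_k: effective number of points assigned to component k.
    Nk = [sum(g[k] for g in gammas) for k in range(K)]
    # mu_k: responsibility-weighted mean of the data.
    mus = [sum(g[k] * x for g, x in zip(gammas, xs)) / Nk[k] for k in range(K)]
    # sigma_k^2: responsibility-weighted variance around the new mean.
    variances = [
        sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gammas, xs)) / Nk[k]
        for k in range(K)
    ]
    # pi_k: fraction of the effective counts.
    pis = [Nk[k] / N for k in range(K)]
    return pis, mus, variances

# With hard (0/1) responsibilities the estimates reduce to per-cluster statistics.
xs = [0.0, 2.0, 10.0, 12.0]
gammas = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pis, mus, variances = m_step(xs, gammas)
```

In this toy case the first two points belong entirely to component 1 and the last two to component 2, so each component's mean and variance are those of its own two points.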

EM Formulation
The results for $\mu_k$, $\Sigma_k$, $\pi_k$ are not closed-form solutions for the parameters, since the responsibilities $\gamma(z_{nk})$ depend on those parameters in a complex way.
The results suggest an iterative solution: an instance of the EM algorithm for the particular case of the GMM.

Informal EM for GMM
First choose initial values for the means, covariances and mixing coefficients.
Then alternate between the following two updates, called the E step and the M step.
In the E step, use the current values of the parameters to evaluate the posterior probabilities, or responsibilities.
In the M step, use these posterior probabilities to re-estimate the means, covariances and mixing coefficients.

EM using Old Faithful
Figure panels: data points and initial mixture model; initial E step (determine responsibilities); after first M step (re-evaluate parameters); after 2 cycles; after 5 cycles; after 20 cycles.

Comparison with K-Means
Figure panels: K-means result; EM result.

Animation of EM for Old Faithful Data
http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
Code in R:
    # initial parameter estimates (chosen to be deliberately bad)
    theta <- list(
      tau    = c(0.5, 0.5),                            # mixing proportions
      mu1    = c(2.8, 75),                             # mean of component 1
      mu2    = c(3.6, 58),                             # mean of component 2
      sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),     # covariance of component 1
      sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2)      # covariance of component 2
    )

Practical Issues with EM
EM takes many more iterations than K-means, and each cycle requires significantly more computation.
It is common to run K-means first in order to find a suitable initialization.
Covariance matrices can be initialized to the covariances of the clusters found by K-means.
EM is not guaranteed to find the global maximum of the log-likelihood function.

Summary of EM for GMM
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (means, covariances and mixing coefficients).
Step 1: Initialize the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$, and evaluate the initial value of the log-likelihood.
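
EM continued
Step 2 (E step): Evaluate the responsibilities using the current parameter values:
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
Step 3 (M step): Re-estimate the parameters using the current responsibilities:
$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$
$$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T$$
$$\pi_k^{new} = \frac{N_k}{N} \quad \text{where} \quad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$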

EM continued
Step 4: Evaluate the log-likelihood
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
and check for convergence of either the parameters or the log-likelihood.
If the convergence criterion is not satisfied, return to Step 2.
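Steps 1 to 4 can be put together in a short loop. The following is a minimal sketch for 1-D data using only the standard library; the synthetic data, the initialization at the data extremes, the tolerance, and all function names are assumptions made for illustration, and the sketch has no safeguard against the singularities discussed earlier.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(xs, pis, mus, sigmas, max_iter=200, tol=1e-8):
    """Run EM from the given initial parameters (Step 1 is done by the caller)."""
    N, K = len(xs), len(pis)
    history = []  # log-likelihood of the parameters used in each E step
    for _ in range(max_iter):
        # Step 2 (E step): responsibilities under the current parameters.
        # The log-likelihood (Step 4) is accumulated in the same pass.
        gammas, log_lik = [], 0.0
        for x in xs:
            weighted = [pis[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(K)]
            total = sum(weighted)
            gammas.append([w / total for w in weighted])
            log_lik += math.log(total)
        history.append(log_lik)
        # Convergence check on the change in log-likelihood.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # Step 3 (M step): re-estimate parameters from the responsibilities.
        Nk = [sum(g[k] for g in gammas) for k in range(K)]
        mus = [sum(g[k] * x for g, x in zip(gammas, xs)) / Nk[k] for k in range(K)]
        sigmas = [
            math.sqrt(sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gammas, xs)) / Nk[k])
            for k in range(K)
        ]
        pis = [Nk[k] / N for k in range(K)]
    return pis, mus, sigmas, history

# Synthetic data: two well-separated clusters (hypothetical example).
rng = random.Random(0)
xs = [rng.gauss(-5.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]

# Step 1: crude initialization with the means at the data extremes.
pis, mus, sigmas, history = em_gmm_1d(xs, [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0])
```

The recorded `history` should never decrease from one iteration to the next, which is a useful check when debugging an EM implementation. In practice the initialization would more commonly come from K-means, as noted under practical issues.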