Gaussian Mixtures and the EM algorithm


Charles Rose
5 years ago


[Figure: four panels. Top row: sigma = 1.0; bottom row: sigma = 0.2. The right-hand panel of each row is titled "Responsibilities".]

Details of figure

Left panels: two Gaussian densities $g_0(x)$ and $g_1(x)$ (blue and orange) on the real line, and a single data point (green dot) at $x = 0.5$. The colored squares are plotted at $x = -1.0$ and $x = 1.0$, the means of each density.

Right panels: the relative densities $g_0(x)/(g_0(x)+g_1(x))$ and $g_1(x)/(g_0(x)+g_1(x))$, called the responsibilities of each cluster, for this data point. In the top panels the Gaussian standard deviation is $\sigma = 1.0$; in the bottom panels $\sigma = 0.2$.

The EM algorithm uses these responsibilities to make a soft assignment of each data point to each of the two clusters. When $\sigma$ is fairly large, the responsibilities can be near 0.5 (they are 0.36 and 0.64 in the top right panel). As $\sigma \to 0$, the responsibilities tend to 1 for the cluster center closest to the target point, and to 0 for all other clusters. This hard assignment is seen in the bottom right panel.
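The move from soft to hard assignment described in the figure details can be reproduced numerically. The sketch below is illustrative only: it assumes equal mixing weights and means at -1.0 and 1.0, and the helper names `normal_pdf` and `responsibilities` are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(x, mu0, mu1, sigma):
    """Relative densities g0/(g0+g1) and g1/(g0+g1) at the point x."""
    g0 = normal_pdf(x, mu0, sigma)
    g1 = normal_pdf(x, mu1, sigma)
    total = g0 + g1
    return g0 / total, g1 / total

# Assumed setup: data point at 0.5, component means at -1.0 and 1.0.
for sigma in (1.0, 0.5, 0.2):
    r0, r1 = responsibilities(0.5, -1.0, 1.0, sigma)
    print(f"sigma={sigma}: r0={r0:.4f}, r1={r1:.4f}")
```

The two responsibilities always sum to one, and the one belonging to the nearer mean grows toward 1 as sigma shrinks. The exact values depend on the assumed means and on the equal weighting, so they need not match the numbers read off the figure.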

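Model for density estimation

$$g_Y(y) = (1-\pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y), \qquad \theta_j = (\mu_j, \sigma_j^2)$$

- $\phi_\theta$ is the normal density $N(\mu, \sigma^2)$
- $\pi$ is the mixing proportion

Goal: given data $y_1, y_2, \ldots, y_N$, estimate $(\{\mu_j, \sigma_j^2\}, \pi)$ by maximum likelihood. The log-likelihood (messy) is

$$l(\{\mu_j, \sigma_j^2\}, \pi) = \sum_{i=1}^{N} \log g_Y(y_i)$$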

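EM algorithm

Consider latent variables

$$\Delta_i = \begin{cases} 1 & \text{if } y_i \sim \phi_{\theta_2} \\ 0 & \text{otherwise} \end{cases}$$

If we could observe $\Delta_1, \ldots, \Delta_N$, the log-likelihood would be

$$l_0 = \sum_{i=1}^{N} \left[ (1-\Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i \log\phi_{\theta_2}(y_i) \right]$$

Let $\gamma_i(\theta) = E(\Delta_i \mid \theta, \text{data}) = \Pr(\Delta_i = 1 \mid \theta, \text{data})$, the responsibility of model 2 for observation $i$. The EM algorithm maximizes the expected log-likelihood

$$E\,l_0 = \sum_{i=1}^{N} \left[ (1-\gamma_i(\theta))\log\phi_{\theta_1}(y_i) + \gamma_i(\theta)\log\phi_{\theta_2}(y_i) \right]$$

where the expectation is conditional on the data and the current estimates $\hat\theta$.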

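Alternating algorithm

E-step: given estimates for $(\mu_j, \sigma_j^2, \pi)$, compute the responsibilities

$$\hat\gamma_i = \Pr(\Delta_i = 1 \mid \hat\theta, \text{data}) = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1-\hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, 2, \ldots, N$$

M-step: given $\hat\gamma_i$, estimate $\mu_j, \sigma_j^2$ by weighted means and variances

$$\hat\mu_1 = \frac{\sum_{i=1}^{N} (1-\hat\gamma_i)\, y_i}{\sum_{i=1}^{N} (1-\hat\gamma_i)}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^{N} (1-\hat\gamma_i)(y_i - \hat\mu_1)^2}{\sum_{i=1}^{N} (1-\hat\gamma_i)}$$

$$\hat\mu_2 = \frac{\sum_{i=1}^{N} \hat\gamma_i\, y_i}{\sum_{i=1}^{N} \hat\gamma_i}, \qquad \hat\sigma_2^2 = \frac{\sum_{i=1}^{N} \hat\gamma_i (y_i - \hat\mu_2)^2}{\sum_{i=1}^{N} \hat\gamma_i}, \qquad \hat\pi = \sum_{i=1}^{N} \hat\gamma_i / N$$

Iterate the E and M steps until convergence.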
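The E-step and M-step above translate almost line for line into NumPy. The following is a minimal sketch, not a reference implementation: the function name `em_two_gaussians`, the initialisation (extreme data values as means, pooled variance, mixing proportion 0.5), the variance floor, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(y, mu, var):
    """Density of N(mu, var) evaluated elementwise at y."""
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_two_gaussians(y, n_iter=200, tol=1e-8, var_floor=1e-6):
    """EM for a two-component univariate Gaussian mixture (illustrative)."""
    y = np.asarray(y, dtype=float)
    # Assumed initialisation, not prescribed by the slides.
    mu1, mu2 = y.min(), y.max()
    var1 = var2 = y.var()
    pi = 0.5
    loglik_path = []
    for _ in range(n_iter):
        # E-step: responsibility of component 2 for each observation.
        p1 = (1.0 - pi) * normal_pdf(y, mu1, var1)
        p2 = pi * normal_pdf(y, mu2, var2)
        gamma = p2 / (p1 + p2)
        # Observed-data log-likelihood at the current parameters.
        loglik_path.append(np.sum(np.log(p1 + p2)))
        # M-step: weighted means, weighted variances, mixing proportion.
        w1, w2 = 1.0 - gamma, gamma
        mu1 = np.sum(w1 * y) / np.sum(w1)
        mu2 = np.sum(w2 * y) / np.sum(w2)
        var1 = max(np.sum(w1 * (y - mu1) ** 2) / np.sum(w1), var_floor)
        var2 = max(np.sum(w2 * (y - mu2) ** 2) / np.sum(w2), var_floor)
        pi = gamma.mean()
        # Stop once the log-likelihood has essentially stopped improving.
        if len(loglik_path) > 1 and loglik_path[-1] - loglik_path[-2] < tol:
            break
    return mu1, var1, mu2, var2, pi, loglik_path

# Synthetic data: 300 draws from N(1, 1) and 200 draws from N(5, 1).
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(1.0, 1.0, 300), rng.normal(5.0, 1.0, 200)])
mu1, var1, mu2, var2, pi, loglik_path = em_two_gaussians(y)
print(mu1, var1, mu2, var2, pi)
```

The recorded `loglik_path` gives a quick check on the implementation, since the log-likelihood should not decrease from one iteration to the next. The variance floor is one simple way to keep the variance estimates away from zero.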

The global MLE has $l = +\infty$, attained with $\hat\mu_j = y_i$ for any $i, j$ and $\hat\sigma_j^2 = 0$. This is not a useful solution. We want a non-pathological local maximum of the likelihood. Usual strategy: start with, and keep, $\hat\sigma_j^2$ away from zero.

Relationship between K-means and Gaussian mixtures

If we restrict $\hat\sigma_1^2 = \hat\sigma_2^2 = \hat\sigma^2$ and let $\hat\sigma^2 \to 0$, then EM $\to$ K-means. EM is a soft version of K-means.
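The degenerate solution is easy to see numerically: place one component's mean exactly on a data point and shrink its standard deviation. The sketch below uses a small made-up data set and a hypothetical helper `mixture_loglik`; it only illustrates the unbounded likelihood.

```python
import numpy as np

def mixture_loglik(y, mu1, sigma1, mu2, sigma2, pi):
    """Log-likelihood of a two-component univariate Gaussian mixture."""
    def pdf(v, mu, sigma):
        return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.sum(np.log((1.0 - pi) * pdf(y, mu1, sigma1) + pi * pdf(y, mu2, sigma2)))

# Made-up data for illustration only.
y = np.array([-1.2, 0.3, 0.8, 2.1, 4.0])

# Component 1 covers all the data; component 2 sits exactly on y[0].
sigmas = [1e-2, 1e-4, 1e-8]
logliks = [mixture_loglik(y, y.mean(), y.std(), y[0], s, 0.5) for s in sigmas]
for s, ll in zip(sigmas, logliks):
    print(f"sigma2={s:g}: log-likelihood={ll:.3f}")
```

Each reduction of the second component's standard deviation raises the log-likelihood, with no upper limit. This is why implementations typically bound the variances away from zero.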

[Figure: negative log-likelihood versus number of clusters k, with training and test curves, for Gaussian mixtures with mu1 = 1, mu2 = 5, sigma = 1.]

EM algorithm in general (Text pg )

- Observed data $Z$, with log-likelihood $l(\theta; Z)$ that we want to maximize
- We have (or we introduce) latent (missing) data $Z^m$
- Complete data $T = (Z, Z^m)$

$$\Pr(Z^m \mid Z, \theta') = \frac{\Pr(Z^m, Z \mid \theta')}{\Pr(Z \mid \theta')}$$

so

$$\Pr(Z \mid \theta') = \frac{\Pr(T \mid \theta')}{\Pr(Z^m \mid Z, \theta')}$$

or

$$l(\theta'; Z) = l_0(\theta'; T) - l_1(\theta'; Z^m \mid Z)$$

- $l(\theta'; Z)$: observed data log-likelihood
- $l_0(\theta'; T)$: complete data log-likelihood
- $l_1(\theta'; Z^m \mid Z)$: conditional log-likelihood

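Take conditional expectations with respect to the distribution of $T \mid Z$, governed by parameter $\theta$:

$$l(\theta'; Z) = E\left(l_0(\theta'; T) \mid Z, \theta\right) - E\left(l_1(\theta'; Z^m \mid Z) \mid Z, \theta\right) \equiv Q(\theta', \theta) - R(\theta', \theta)$$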

EM algorithm in general

1. Start with initial guesses for the parameters, $\hat\theta^{(0)}$.
2. Expectation step: at the $j$th step, compute $Q(\theta', \hat\theta^{(j)}) = E\left(l_0(\theta'; T) \mid Z, \hat\theta^{(j)}\right)$ as a function of the dummy argument $\theta'$.
3. Maximization step: set $\hat\theta^{(j+1)}$ as the maximizer of $Q(\theta', \hat\theta^{(j)})$ with respect to $\theta'$.
4. Iterate steps 2 and 3 until convergence.

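Claim: maximizing $Q(\theta', \theta)$ over $\theta'$ never decreases the log-likelihood $l(\theta'; Z)$.

By Jensen's inequality, $R(\theta', \theta)$ is maximized as a function of $\theta'$ when $\theta' = \theta$. So

$$l(\theta'; Z) - l(\theta; Z) = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right] \ge 0 \quad \text{if } Q(\theta', \theta) \ge Q(\theta, \theta)$$

Hence, as long as we go uphill at each M-step (we don't need to maximize), we increase the log-likelihood.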
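For completeness, the Jensen step can be written out explicitly. In this sketch $\theta'$ is the free argument, $\theta$ is the current parameter value, and $R(\theta', \theta)$ is the conditional expectation of the conditional log-likelihood $\log \Pr(Z^m \mid Z, \theta')$ given $Z$ under $\theta$. The integral notation assumes $Z^m$ has a density; for discrete latent data replace it with a sum.

```latex
\begin{aligned}
R(\theta', \theta) - R(\theta, \theta)
  &= E\!\left[ \log \frac{\Pr(Z^m \mid Z, \theta')}{\Pr(Z^m \mid Z, \theta)} \,\middle|\, Z, \theta \right] \\
  &\le \log E\!\left[ \frac{\Pr(Z^m \mid Z, \theta')}{\Pr(Z^m \mid Z, \theta)} \,\middle|\, Z, \theta \right]
     \qquad \text{(Jensen, since $\log$ is concave)} \\
  &= \log \int \frac{\Pr(z^m \mid Z, \theta')}{\Pr(z^m \mid Z, \theta)} \, \Pr(z^m \mid Z, \theta) \, dz^m \\
  &= \log \int \Pr(z^m \mid Z, \theta') \, dz^m = \log 1 = 0 .
\end{aligned}
```

So $R(\theta', \theta) \le R(\theta, \theta)$ for every $\theta'$, which is the fact used in the inequality for $l(\theta'; Z) - l(\theta; Z)$.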

Some properties of EM

- Each iteration increases the log-likelihood.
- In curved exponential families, one can show that any limit point of EM is a stationary point of the log-likelihood.
- Convergence speed is linear, versus quadratic for Newton methods.

Model-based clustering

Fraley + Raftery 1998; Banfield + Raftery 1994

Gaussian mixture model for the density of the data:

$$L_m(\Theta_1, \ldots, \Theta_G, \pi_1, \ldots, \pi_G \mid x) = \prod_{i=1}^{n} \sum_{k=1}^{G} \pi_k f_k(x_i \mid \Theta_k)$$

- $G$ components (groups)
- $\pi_k$ = probability of an observation falling in group $k$
- $\Theta_k = (\mu_k, \Sigma_k)$
- $f_k(x_i \mid \mu_k, \Sigma_k) = \phi(\mu_k, \Sigma_k)$ (multivariate normal)

Novel idea: parametrize

$$\Sigma_k = \lambda_k D_k A_k D_k^T$$

and normalize so that $\det A_k = 1$, giving $\lambda_k = |\Sigma_k|^{1/p}$.

$$\Sigma_k = \lambda_k D_k A_k D_k^T$$

- $\lambda_k$: determines the volume of the ellipsoid
- $D_k$: orthogonal matrix determining the orientation of the ellipsoid
- $A_k$: diagonal matrix determining the shape of the ellipsoid

$\Sigma_k$                      Distribution   Volume     Orientation   Shape
$\lambda I$                     Spherical      Equal      NA            Equal
$\lambda_k I$                   Spherical      Variable   NA            Equal
$\lambda D A D^T$               Ellipsoidal    Equal      Equal         Equal
$\lambda_k D_k A_k D_k^T$       Ellipsoidal    Variable   Variable      Variable
$\lambda D_k A D_k^T$           Ellipsoidal    Equal      Variable      Equal
$\lambda_k D_k A D_k^T$         Ellipsoidal    Variable   Variable      Equal
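The volume, orientation and shape factors of a given covariance matrix can be recovered from its eigendecomposition. The sketch below is illustrative: the function name `decompose_covariance` and the example matrix are assumptions, and the eigenvalues are sorted in decreasing order purely as a convention.

```python
import numpy as np

def decompose_covariance(sigma):
    """Split a symmetric positive definite matrix into (lam, D, A) with
    sigma = lam * D @ A @ D.T, D orthogonal, A diagonal with det(A) = 1."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, D = eigvals[order], eigvecs[:, order]
    lam = np.prod(eigvals) ** (1.0 / p)        # |Sigma|^(1/p): the volume factor
    A = np.diag(eigvals / lam)                 # shape matrix, determinant 1
    return lam, D, A

# Example covariance matrix, chosen only for illustration.
sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
lam, D, A = decompose_covariance(sigma)
print("lambda:", lam)
print("D:\n", D)
print("A:\n", A)
```

Constraining which of `lam`, `D` and `A` are shared across clusters gives the rows of the table: for example, sharing `D` and `A` while letting `lam` vary changes only the volume of each cluster.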

Estimate parameters via the EM algorithm

Estimates of the components of $\Sigma_k$ depend on the assumed form.

They use the Bayesian Information Criterion (BIC) to choose the model parametrization and the number of clusters:

$$2 \log p(x \mid M) + \text{const} \approx 2\, l_M(x, \widehat{\text{par}}) - m_M \log n \equiv \text{BIC}$$

- $m_M$ = number of independent parameters in the model
- $n$ = sample size

Useful in low dimensions (say 4), but covariances are difficult to estimate in high-dimensional feature spaces.
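With the sign convention used here, a larger BIC is better. The sketch below shows the arithmetic for a univariate mixture; the helper names are hypothetical, and the log-likelihood values are made-up placeholders rather than results from any fitted model.

```python
import math

def bic(loglik, n_params, n_samples):
    """BIC as defined above: 2 * log-likelihood - (#parameters) * log(n)."""
    return 2.0 * loglik - n_params * math.log(n_samples)

def n_params_univariate_mixture(n_components):
    """Free parameters of a univariate Gaussian mixture: one mean and one
    variance per component, plus (G - 1) free mixing proportions."""
    return 3 * n_components - 1

# Placeholder maximized log-likelihoods for G = 1, 2, 3 (illustrative only).
n_samples = 500
candidate_logliks = {1: -1050.0, 2: -1010.0, 3: -1008.5}

scores = {
    g: bic(ll, n_params_univariate_mixture(g), n_samples)
    for g, ll in candidate_logliks.items()
}
best_g = max(scores, key=scores.get)
print(scores, "-> chosen number of components:", best_g)
```

In this made-up example the third component improves the log-likelihood only slightly, so the penalty term outweighs the gain and the two-component model is selected.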

Computing the MLE and the EM Algorithm

ECE 830 Fall 0 Statistical Signal Processing, instructor: R. Nowak

If $X \sim p(x \mid \theta)$, $\theta \in \Theta$, then the MLE is the solution to the equations $\partial \log p(x \mid \theta) / \partial \theta = 0$. Sometimes these equations