Cheng Soon Ong & Christian Walder. Canberra February June 2017

1 Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February June 2017 (Many figures from C. M. Bishop, "Pattern Recognition and ")

2 Part XIX: Mixture Models and EM

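3 EM for Gaussian Mixtures

Starting point is the log of the likelihood function

$$\ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$$

Critical point of $\ln p(\mathbf{X} \mid \pi, \mu, \Sigma)$ w.r.t. $\mu_k$:

$$0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \, \Sigma_k^{-1} (x_n - \mu_k)$$

Therefore

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$

where the effective number of points assigned to Gaussian $k$ is $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.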

4 EM for Gaussian Mixtures

Maximum of the log of the likelihood function for the means:

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$

Similarly for the covariance matrix

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T,$$

and for the mixing coefficients $\pi_k$ (using a Lagrange multiplier, as $\sum_k \pi_k = 1$)

$$\pi_k = \frac{N_k}{N}.$$

This is not a closed form solution because the responsibilities $\gamma(z_{nk})$ depend on $\pi, \mu, \Sigma$.
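The Lagrange multiplier step for the mixing coefficients is only mentioned on the slide. A sketch of how it can be worked out is given here; the multiplier symbol $\lambda$ is an assumption of this note and does not appear in the original.

```latex
% Maximise the log likelihood plus the constraint term
%   ln p(X | pi, mu, Sigma) + lambda * (sum_k pi_k - 1)
% Setting the derivative w.r.t. pi_k to zero:
\[
0 = \sum_{n=1}^{N}
    \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
    + \lambda
\]
% Multiply by pi_k, so that the fraction becomes gamma(z_nk):
\[
0 = \sum_{n=1}^{N} \gamma(z_{nk}) + \lambda \pi_k = N_k + \lambda \pi_k
\]
% Sum over k, using sum_k pi_k = 1 and sum_k N_k = N:
\[
0 = N + \lambda \quad\Longrightarrow\quad \lambda = -N
\]
% Substitute back:
\[
\pi_k = \frac{N_k}{N}
\]
```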

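5 EM for Gaussian Mixtures

Given a Gaussian mixture and data $\mathbf{X}$, maximise the log likelihood w.r.t. the parameters $(\pi, \mu, \Sigma)$.

1. Initialise the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$. Evaluate the log likelihood function.

2. E step: Evaluate the $\gamma(z_{nk})$ using the current parameters

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

3. M step: Re-estimate the parameters using the current $\gamma(z_{nk})$

$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$

$$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T$$

$$\pi_k^{new} = \frac{N_k}{N}$$

4. Evaluate the log likelihood

$$\ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k^{new} \, \mathcal{N}(x_n \mid \mu_k^{new}, \Sigma_k^{new}) \right\}$$

If not converged, go to step 2.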
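The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecturers' code: it assumes one-dimensional data (so each covariance is a scalar variance), uses only the standard library, and the function names, the synthetic data and the stopping tolerance are all hypothetical choices.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, pis, means, variances):
    """ln p(X | pi, mu, Sigma) for a 1-D Gaussian mixture."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, means, variances)))
        for x in xs
    )


def em_gmm_1d(xs, pis, means, variances, max_iter=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture; returns parameters and the log-likelihood trace."""
    n = len(xs)
    ks = range(len(pis))
    history = [log_likelihood(xs, pis, means, variances)]  # step 1
    for _ in range(max_iter):
        # Step 2 (E step): responsibilities gamma[n][k] under the current parameters.
        gammas = []
        for x in xs:
            weighted = [pis[k] * normal_pdf(x, means[k], variances[k]) for k in ks]
            total = sum(weighted)
            gammas.append([w / total for w in weighted])
        # Step 3 (M step): re-estimate means, variances and mixing coefficients.
        n_k = [sum(g[k] for g in gammas) for k in ks]
        means = [sum(g[k] * x for g, x in zip(gammas, xs)) / n_k[k] for k in ks]
        variances = [
            sum(g[k] * (x - means[k]) ** 2 for g, x in zip(gammas, xs)) / n_k[k]
            for k in ks
        ]
        pis = [n_k[k] / n for k in ks]
        # Step 4: evaluate the log likelihood and stop once it no longer improves.
        history.append(log_likelihood(xs, pis, means, variances))
        if history[-1] - history[-2] < tol:
            break
    return pis, means, variances, history


# Synthetic data (an assumption for illustration): two well-separated clusters.
random.seed(0)
data = [random.gauss(-2.0, 0.5) for _ in range(150)]
data += [random.gauss(3.0, 1.0) for _ in range(150)]

pis, means, variances, history = em_gmm_1d(
    data, pis=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0]
)
print("mixing coefficients:", pis)
print("means:", means)
print("variances:", variances)
```

The `history` list makes it easy to check that the log likelihood does not decrease from one iteration to the next. The sketch does not guard against a component collapsing onto a single point, which a practical implementation would need to handle.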

6 EM for Gaussian Mixtures - Example

[Figure: panels (a) to (f) of an EM example for Gaussian mixtures, each panel labelled with a value of L.]

7 EM for Gaussian Mixtures - Relation to K-Means

Assume a Gaussian mixture model. Covariance matrices given by $\epsilon I$, where $\epsilon$ is shared by all components. Then

$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{D/2}} \exp\left\{ -\frac{1}{2\epsilon} \|x - \mu_k\|^2 \right\}.$$

Keep $\epsilon$ fixed, do not re-estimate. Responsibilities

$$\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\|x_n - \mu_k\|^2 / 2\epsilon \}}{\sum_j \pi_j \exp\{ -\|x_n - \mu_j\|^2 / 2\epsilon \}}$$

Taking the limit $\epsilon \to 0$, the term in the denominator for which $\|x_n - \mu_j\|^2$ is the smallest will go to zero most slowly.

8 EM for Gaussian Mixtures - Relation to K-Means

Assume a Gaussian mixture model.

$$\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\|x_n - \mu_k\|^2 / 2\epsilon \}}{\sum_j \pi_j \exp\{ -\|x_n - \mu_j\|^2 / 2\epsilon \}}$$

Therefore

$$\lim_{\epsilon \to 0} \gamma(z_{nk}) = \begin{cases} 1 & \text{if } \|x_n - \mu_k\| < \|x_n - \mu_j\| \quad \forall j \neq k \\ 0 & \text{otherwise} \end{cases} \; = r_{nk}$$

Holds independent of $\pi_k$ as long as none are zero. Hard assignment to exactly one cluster: K-means.
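The limit can be observed numerically. The sketch below evaluates the responsibilities for a single point as $\epsilon$ shrinks; the point, the two means and the deliberately unbalanced mixing coefficients are made-up values chosen for illustration. Subtracting the smallest squared distance before exponentiating is a numerical convenience that cancels in the ratio.

```python
import math


def responsibilities(x, means, pis, eps):
    """gamma(z_nk) for one point x when every covariance is eps * I."""
    d2 = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
    d2_min = min(d2)  # shifting by the minimum cancels in the ratio
    weights = [p * math.exp(-(d - d2_min) / (2.0 * eps)) for p, d in zip(pis, d2)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical setup: x is closer to the first mean, but the second
# component has a much larger mixing coefficient.
x = (1.0, 0.0)
means = [(0.0, 0.0), (3.0, 0.0)]
pis = [0.1, 0.9]

for eps in (10.0, 1.0, 0.1, 0.001):
    print(eps, responsibilities(x, means, pis, eps))

soft = responsibilities(x, means, pis, 10.0)
hard = responsibilities(x, means, pis, 1e-6)
```

For large $\epsilon$ the mixing coefficients dominate and the farther component receives most of the responsibility; as $\epsilon$ shrinks, the nearest mean takes all of it regardless of $\pi_k$, which is the K-means assignment $r_{nk}$.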

9 Mixture of Bernoulli Distributions

Set of $D$ binary variables $x_i$, $i = 1, \dots, D$. Each governed by a Bernoulli distribution with parameter $\mu_i$. Therefore

$$p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}$$

Expectation and covariance

$$\mathbb{E}[x] = \mu, \qquad \operatorname{cov}[x] = \operatorname{diag}\{ \mu_i (1 - \mu_i) \}$$

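10 Mixture of Bernoulli Distributions

Mixture

$$p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \mu_k)$$

with

$$p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$

Similar calculation as with the mixture of Gaussians:

$$\gamma(z_{nk}) = \frac{\pi_k \, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid \mu_j)}$$

$$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$

$$\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$

$$\mu_k = \bar{x}_k, \qquad \pi_k = \frac{N_k}{N}$$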
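A minimal sketch of these updates for a mixture of Bernoulli distributions follows. The synthetic data set, the two prototype vectors, the initial values and the fixed iteration count are assumptions made for illustration; the initial means are chosen asymmetric because identical components would never separate.

```python
import random


def bernoulli_prob(x, mu):
    """p(x | mu_k) = prod_i mu_ki^x_i * (1 - mu_ki)^(1 - x_i) for a binary vector x."""
    p = 1.0
    for xi, mi in zip(x, mu):
        p *= mi if xi == 1 else (1.0 - mi)
    return p


def em_bernoulli_mixture(data, pis, mus, n_iter=50):
    """Run a fixed number of EM iterations for a mixture of Bernoullis."""
    n, d = len(data), len(data[0])
    ks = range(len(pis))
    for _ in range(n_iter):
        # E step: responsibilities gamma[n][k].
        gammas = []
        for x in data:
            weighted = [pis[k] * bernoulli_prob(x, mus[k]) for k in ks]
            total = sum(weighted)
            gammas.append([w / total for w in weighted])
        # M step: mu_k is the responsibility-weighted mean, pi_k = N_k / N.
        n_k = [sum(g[k] for g in gammas) for k in ks]
        mus = [
            [sum(g[k] * x[i] for g, x in zip(gammas, data)) / n_k[k] for i in range(d)]
            for k in ks
        ]
        pis = [n_k[k] / n for k in ks]
    return pis, mus


# Hypothetical data: binary vectors drawn alternately from two prototypes.
random.seed(1)
prototypes = [
    [0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.9, 0.9, 0.9],
]
data = []
for n in range(400):
    proto = prototypes[n % 2]
    data.append([1 if random.random() < p else 0 for p in proto])

pis, mus = em_bernoulli_mixture(
    data,
    pis=[0.5, 0.5],
    mus=[[0.6, 0.6, 0.6, 0.4, 0.4, 0.4], [0.4, 0.4, 0.4, 0.6, 0.6, 0.6]],
)
print("mixing coefficients:", pis)
print("component means:", mus)
```

For high-dimensional inputs such as images, the product in `bernoulli_prob` would underflow, so a practical version would work with log probabilities instead.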

11 EM for Mixture of Bernoulli Distributions - Digits

Examples from a digits data set, each pixel taking only binary values. Parameters $\mu_{ki}$ for each component in the mixture. Fit to one multivariate Bernoulli distribution.

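12 The Role of Latent Variables

EM finds the maximum likelihood solution for models with latent variables. Two kinds of variables:

Observed variables $\mathbf{X}$
Latent variables $\mathbf{Z}$

plus model parameters $\theta$. Log likelihood is then

$$\ln p(\mathbf{X} \mid \theta) = \ln \left\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \theta) \right\}$$

The optimisation is difficult because the sum over $\mathbf{Z}$ sits inside the logarithm (log-sum). Assume maximisation of the distribution $p(\mathbf{X}, \mathbf{Z} \mid \theta)$ over the complete data set $\{\mathbf{X}, \mathbf{Z}\}$ is straightforward. But we only have the incomplete data set $\{\mathbf{X}\}$ and the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \theta)$.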

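13 EM - Key Idea

Key idea of EM: As $\mathbf{Z}$ is not observed, work with an averaged version $Q(\theta, \theta^{old})$ of the complete log-likelihood $\ln p(\mathbf{X}, \mathbf{Z} \mid \theta)$, averaged over all states of $\mathbf{Z}$.

$$Q(\theta, \theta^{old}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{old}) \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)$$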

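14 EM Algorithm

1. Choose an initial setting for the parameters $\theta^{old}$.

2. E step: Evaluate $p(\mathbf{Z} \mid \mathbf{X}, \theta^{old})$.

3. M step: Evaluate $\theta^{new}$ given by

$$\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$$

where

$$Q(\theta, \theta^{old}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{old}) \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)$$

4. Check for convergence of the log likelihood or the parameter values. If not yet converged, set $\theta^{old} = \theta^{new}$ and go to step 2.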

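15 EM Algorithm - Convergence

Start with the product rule for the observed variables $\mathbf{X}$, the unobserved variables $\mathbf{Z}$, and the parameters $\theta$:

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \theta) = \ln p(\mathbf{Z} \mid \mathbf{X}, \theta) + \ln p(\mathbf{X} \mid \theta).$$

Apply $\sum_{\mathbf{Z}} q(\mathbf{Z})$ with arbitrary $q(\mathbf{Z})$ to the formula. Since $\sum_{\mathbf{Z}} q(\mathbf{Z}) = 1$, the last term is unchanged:

$$\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X}, \mathbf{Z} \mid \theta) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{Z} \mid \mathbf{X}, \theta) + \ln p(\mathbf{X} \mid \theta).$$

Subtract $\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln q(\mathbf{Z})$ from both sums and rewrite as

$$\ln p(\mathbf{X} \mid \theta) = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}}_{\mathcal{L}(q, \theta)} \; \underbrace{- \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \theta)}{q(\mathbf{Z})}}_{\mathrm{KL}(q \| p)}$$

$\mathrm{KL}(q \| p)$ is the Kullback-Leibler divergence.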
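The decomposition can be checked numerically on a toy model. In the sketch below the joint probabilities for one observed value and three latent states are invented numbers, and `q` is an arbitrary distribution over the latent states; none of these values come from the slides.

```python
import math

# Hypothetical joint p(X = x_obs, Z = z | theta) for a fixed observation
# and three latent states z. The entries need not sum to one.
joint = [0.10, 0.25, 0.05]

evidence = sum(joint)                       # p(X | theta)
posterior = [j / evidence for j in joint]   # p(Z | X, theta)


def lower_bound(q):
    """L(q, theta) = sum_Z q(Z) ln( p(X, Z | theta) / q(Z) )."""
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint) if qz > 0)


def kl(q, p):
    """KL(q || p) = -sum_Z q(Z) ln( p(Z) / q(Z) )."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)


q = [0.6, 0.3, 0.1]  # an arbitrary distribution over Z
lhs = math.log(evidence)
rhs = lower_bound(q) + kl(q, posterior)

print("ln p(X | theta)      =", lhs)
print("L(q, theta) + KL     =", rhs)
print("L at q = posterior   =", lower_bound(posterior))
```

Choosing `q` equal to the posterior makes the KL term vanish, so the lower bound touches the log likelihood.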

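16 Kullback-Leibler Divergence

A measure of the dissimilarity between two distributions $p(y)$ and $q(y)$ (often called a distance, although it is not symmetric).

Discrete case:

$$\mathrm{KL}(q \| p) = \sum_{y} q(y) \ln \frac{q(y)}{p(y)} = - \sum_{y} q(y) \ln \frac{p(y)}{q(y)}$$

Continuous case:

$$\mathrm{KL}(q \| p) = \int q(y) \ln \frac{q(y)}{p(y)} \, dy = - \int q(y) \ln \frac{p(y)}{q(y)} \, dy$$

Properties:

$\mathrm{KL}(q \| p) \geq 0$
not symmetric: $\mathrm{KL}(q \| p) \neq \mathrm{KL}(p \| q)$
$\mathrm{KL}(q \| p) = 0$ iff $q = p$
invariant under parameter transformations

Example: Kullback-Leibler divergence between two normal distributions $q(x) = \mathcal{N}(x \mid \mu_1, \sigma_1)$ and $p(x) = \mathcal{N}(x \mid \mu_2, \sigma_2)$, where $\sigma_1, \sigma_2$ are standard deviations:

$$\mathrm{KL}(q \| p) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}$$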
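As a sanity check, the standard closed form for the divergence between two univariate normal distributions can be compared with a direct numerical integration of the definition. The parameter values, the integration range and the midpoint rule below are choices made for this sketch; $\sigma$ denotes a standard deviation.

```python
import math


def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL(q || p) for q = N(mu1, s1^2) and p = N(mu2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5


def log_normal_pdf(x, mu, s):
    """Log density of a univariate normal with standard deviation s."""
    return -0.5 * ((x - mu) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)


def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of the integral of q(y) ln(q(y) / p(y))."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        log_q = log_normal_pdf(y, mu1, s1)
        log_p = log_normal_pdf(y, mu2, s2)
        total += math.exp(log_q) * (log_q - log_p) * h
    return total


forward = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL(q || p)
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL(p || q)
print("closed form KL(q || p):", forward)
print("numeric     KL(q || p):", kl_numeric(0.0, 1.0, 1.0, 2.0))
print("closed form KL(p || q):", reverse)
```

Swapping the arguments gives a different value, which illustrates the lack of symmetry listed among the properties.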

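17 EM Algorithm - Convergence

The two parts of $\ln p(\mathbf{X} \mid \theta)$:

$$\ln p(\mathbf{X} \mid \theta) = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}}_{\mathcal{L}(q, \theta)} \; \underbrace{- \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \theta)}{q(\mathbf{Z})}}_{\mathrm{KL}(q \| p)}$$

[Figure: $\ln p(\mathbf{X} \mid \theta)$ split into the lower bound $\mathcal{L}(q, \theta)$ and the gap $\mathrm{KL}(q \| p)$.]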

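18 EM Algorithm - E Step

Hold $\theta^{old}$ fixed. Maximise the lower bound $\mathcal{L}(q, \theta^{old})$ with respect to $q(\cdot)$.

$\mathcal{L}(q, \theta^{old})$ is a functional.
$\ln p(\mathbf{X} \mid \theta)$ does NOT depend on $q(\cdot)$.
Maximum for $\mathcal{L}(q, \theta^{old})$ will occur when the Kullback-Leibler divergence vanishes.
Therefore, choose $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \theta^{old})$.

$$\ln p(\mathbf{X} \mid \theta) = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}}_{\mathcal{L}(q, \theta)} \; \underbrace{- \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \theta)}{q(\mathbf{Z})}}_{\mathrm{KL}(q \| p)}$$

[Figure: after the E step, $\mathrm{KL}(q \| p) = 0$ and $\mathcal{L}(q, \theta^{old}) = \ln p(\mathbf{X} \mid \theta^{old})$.]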

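19 EM Algorithm - M Step

Hold $q(\cdot) = p(\mathbf{Z} \mid \mathbf{X}, \theta^{old})$ fixed. Maximise the lower bound $\mathcal{L}(q, \theta)$ with respect to $\theta$:

$$\theta^{new} = \arg\max_{\theta} \mathcal{L}(q, \theta) = \arg\max_{\theta} \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)$$

$\mathcal{L}(q, \theta^{new}) > \mathcal{L}(q, \theta^{old})$ unless the maximum has already been reached.

As $q(\cdot) = p(\mathbf{Z} \mid \mathbf{X}, \theta^{old})$ is fixed, $p(\mathbf{Z} \mid \mathbf{X}, \theta^{new})$ will not be equal to $q(\cdot)$, and therefore the Kullback-Leibler divergence will be greater than zero (unless converged).

$$\ln p(\mathbf{X} \mid \theta) = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}}_{\mathcal{L}(q, \theta)} \; \underbrace{- \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \theta)}{q(\mathbf{Z})}}_{\mathrm{KL}(q \| p)}$$

[Figure: after the M step, $\ln p(\mathbf{X} \mid \theta^{new})$ split into $\mathcal{L}(q, \theta^{new})$ and a non-zero $\mathrm{KL}(q \| p)$.]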
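Putting the E step and the M step together gives the usual argument that one EM cycle cannot decrease the log likelihood. The chain below is a sketch of that reasoning, using the decomposition from the convergence slides with $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \theta^{old})$.

```latex
% One EM cycle, with q(Z) = p(Z | X, theta_old) held fixed during the M step.
\begin{align*}
\ln p(\mathbf{X} \mid \theta^{new})
  &= \mathcal{L}(q, \theta^{new})
     + \mathrm{KL}\big(q \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \theta^{new})\big)
     && \text{decomposition at } \theta^{new} \\
  &\geq \mathcal{L}(q, \theta^{new})
     && \text{since } \mathrm{KL} \geq 0 \\
  &\geq \mathcal{L}(q, \theta^{old})
     && \text{M step maximises } \mathcal{L}(q, \theta) \text{ over } \theta \\
  &= \ln p(\mathbf{X} \mid \theta^{old})
     && \text{E step made } \mathrm{KL} = 0 \text{ at } \theta^{old}
\end{align*}
```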

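20 EM Algorithm - Parameter View

[Figure: $\ln p(\mathbf{X} \mid \theta)$ and the lower bound $\mathcal{L}(q, \theta)$ plotted against $\theta$, with $\theta^{old}$ and $\theta^{new}$ marked.]

Red curve: incomplete data likelihood.
Blue curve: after E step.
Green curve: after M step.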