Unsupervised Learning 2001


1 Unsupervised Learning 2001
Lecture 3: The EM Algorithm
Zoubin Ghahramani
Carl Edward Rasmussen
Gatsby Computational Neuroscience Unit
MSc Intelligent Systems, Computer Science

2 The Expectation Maximization (EM) algorithm

Given a set of observed (visible) variables V, a set of unobserved (hidden) variables H, and model parameters θ, optimize the log likelihood:

$$ L(\theta) = \log p(V|\theta) = \log \int p(H, V|\theta)\, dH, \qquad (1) $$

where we have written the marginal for the visibles in terms of an integral over the joint distribution for hidden and visible variables. Using Jensen's inequality, for any distribution of hidden states q(H) we have:

$$ L(\theta) = \log \int q(H)\, \frac{p(H, V|\theta)}{q(H)}\, dH \;\geq\; \int q(H) \log \frac{p(H, V|\theta)}{q(H)}\, dH = F(q, \theta), \qquad (2) $$

defining the functional F(q, θ), which is a lower bound on the log likelihood. In the EM algorithm, we alternately optimize F(q, θ) wrt q and θ, and we can prove that this will never decrease L.
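As a quick numerical sanity check of the bound in (2), the sketch below (not part of the original notes) uses a toy two-component, one-dimensional Gaussian mixture with made-up parameter values and an arbitrary q over the hidden component; numpy is assumed throughout these sketches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: hidden H in {0, 1} with prior pi_k, visible v | H=k ~ N(mu_k, 1).
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
v = 0.5                                      # a single observed data point

log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (v - mu) ** 2  # log p(H=k, v | theta)
L = np.log(np.exp(log_joint).sum())          # log p(v | theta), eq. (1)

q = rng.dirichlet([1.0, 1.0])                # an arbitrary distribution over H
F = np.sum(q * (log_joint - np.log(q)))      # eq. (2): sum_H q(H) log[p(H, v | theta) / q(H)]

print(f"L = {L:.4f}   F = {F:.4f}   F <= L: {bool(F <= L + 1e-12)}")
```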

3 The E and M steps of EM

The lower bound on the log likelihood:

$$ F(q, \theta) = \int q(H) \log \frac{p(H, V|\theta)}{q(H)}\, dH = \int q(H) \log p(H, V|\theta)\, dH + H(q), \qquad (3) $$

where H(q) = -∫ q(H) log q(H) dH is the entropy of q. We iteratively alternate:

E step: optimize F(q, θ) wrt the distribution over hidden variables given the parameters:

$$ q^{(k)}(H) := \operatorname*{argmax}_{q(H)} F\big(q(H), \theta^{(k-1)}\big). \qquad (4) $$

M step: maximize F(q, θ) wrt the parameters given the hidden distribution:

$$ \theta^{(k)} := \operatorname*{argmax}_{\theta} F\big(q^{(k)}(H), \theta\big) = \operatorname*{argmax}_{\theta} \int q^{(k)}(H) \log p(H, V|\theta)\, dH, \qquad (5) $$

which is equivalent to optimizing the expected complete-data log likelihood, since the entropy of q(H) does not depend on θ.
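The M step in (5) can drop the entropy term because it is constant in θ. The sketch below (same toy mixture as above, hypothetical numbers) holds q fixed and varies θ, showing that F(q, θ) and the expected complete-data log likelihood always differ by exactly H(q).

```python
import numpy as np

# Toy mixture as above; hold q fixed and vary theta = (mu_1, mu_2).
pi = np.array([0.3, 0.7])
v = 0.5
q = np.array([0.4, 0.6])              # any fixed distribution over the hidden component
H_q = -np.sum(q * np.log(q))          # entropy H(q), which does not depend on theta

for mu in ([-1.0, 2.0], [0.0, 1.0], [3.0, -2.0]):
    mu = np.array(mu)
    log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (v - mu) ** 2
    expected_complete = np.sum(q * log_joint)     # int q(H) log p(H, V | theta) dH
    F = expected_complete + H_q                   # eq. (3)
    print(f"mu = {mu}   F = {F:.4f}   E_q[log p(H,V|theta)] = {expected_complete:.4f}   "
          f"difference = {F - expected_complete:.4f}")
```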

4 EM as Coordinate Ascent in F

5 The EM algorithm never decreases the log likelihood

The difference between the cost functions:

$$ L(\theta) - F(q, \theta) = \log p(V|\theta) - \int q(H) \log \frac{p(H, V|\theta)}{q(H)}\, dH $$
$$ = \log p(V|\theta) - \int q(H) \log \frac{p(H|V, \theta)\, p(V|\theta)}{q(H)}\, dH $$
$$ = -\int q(H) \log \frac{p(H|V, \theta)}{q(H)}\, dH = \mathrm{KL}\big(q(H), p(H|V, \theta)\big), \qquad (6) $$

is called the Kullback-Leibler divergence; it is non-negative and zero if and only if q(H) = p(H|V, θ) (thus this is what the E step achieves). Although we are working with the "wrong" cost function, the likelihood is still increased at every iteration:

$$ L\big(\theta^{(k-1)}\big) = F\big(q^{(k)}, \theta^{(k-1)}\big) \leq F\big(q^{(k)}, \theta^{(k)}\big) \leq L\big(\theta^{(k)}\big), \qquad (7) $$

where the first equality holds because of the E step, the first inequality comes from the M step and the final inequality from Jensen. Usually EM converges to a local optimum of L (although there are exceptions).
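A minimal sketch verifying identity (6) on the same toy mixture (numpy assumed, values made up): L(θ) − F(q, θ) matches the KL divergence between q and the exact posterior, and choosing q equal to the posterior makes the bound tight.

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
v = 0.5

log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi) - 0.5 * (v - mu) ** 2  # log p(H=k, v)
L = np.log(np.exp(log_joint).sum())                                     # log p(v | theta)
posterior = np.exp(log_joint - L)                                       # p(H | v, theta)

q = rng.dirichlet([1.0, 1.0])                         # an arbitrary q(H)
F = np.sum(q * (log_joint - np.log(q)))
KL = np.sum(q * np.log(q / posterior))

print(f"L - F = {L - F:.6f}   KL(q, p(H|v)) = {KL:.6f}")      # identical, eq. (6)
F_at_posterior = np.sum(posterior * (log_joint - np.log(posterior)))
print(f"with q set to the posterior (E step): F = {F_at_posterior:.6f} = L = {L:.6f}")
```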

6 The KL(p(x), q(x)) is non-negative and zero iff p(x) = q(x) for all x

First let's consider discrete distributions; the Kullback-Leibler divergence is:

$$ \mathrm{KL}(p, q) = \sum_i q_i \log \frac{q_i}{p_i}. \qquad (8) $$

To find the distribution q which minimizes KL(p, q) we add a Lagrange multiplier to enforce the normalization:

$$ E = \mathrm{KL}(p, q) + \lambda\Big(1 - \sum_i q_i\Big) = \sum_i q_i \log \frac{q_i}{p_i} + \lambda\Big(1 - \sum_i q_i\Big). \qquad (9) $$

We then take partial derivatives and set them to zero:

$$ \frac{\partial E}{\partial q_i} = \log q_i - \log p_i + 1 - \lambda = 0 \;\Rightarrow\; q_i = p_i \exp(\lambda - 1), \qquad
\frac{\partial E}{\partial \lambda} = 1 - \sum_i q_i = 0 \;\Rightarrow\; \sum_i q_i = 1 \;\Rightarrow\; q_i = p_i. \qquad (10) $$
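A short numerical illustration of (8) with hypothetical five-state distributions (numpy assumed): the divergence is non-negative for arbitrary q and vanishes at q = p.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    # KL(p, q) = sum_i q_i log(q_i / p_i), the form used in eq. (8)
    return np.sum(q * np.log(q / p))

p = rng.dirichlet(np.ones(5))
for _ in range(4):
    q = rng.dirichlet(np.ones(5))
    print(f"KL = {kl(p, q):.4f}")      # always non-negative
print(f"KL at q = p: {kl(p, p):.4f}")  # zero at the minimum q = p
```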

7 Why KL(p, q) is ...

Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:

$$ \frac{\partial^2 E}{\partial q_i\, \partial q_i} = \frac{1}{q_i} > 0, \qquad \frac{\partial^2 E}{\partial q_i\, \partial q_j} = 0 \;(i \neq j), \qquad (11) $$

showing that q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL(p, p) = 0. A similar proof can be done for continuous distributions, with the partial derivatives substituted by functional derivatives.
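A finite-difference sketch of the curvature claim in (11), using made-up three-state distributions (numpy assumed, not part of the original notes): the numerical Hessian of E is approximately diag(1/q_i).

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])          # hypothetical target distribution
q0 = np.array([0.25, 0.35, 0.40])      # point at which to probe the curvature

def E(q):
    # the KL part of the Lagrangian; the multiplier term is linear, so it has no curvature
    return np.sum(q * np.log(q / p))

eps = 1e-4
hess = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        def shifted(si, sj):
            q = q0.copy()
            q[i] += si * eps
            q[j] += sj * eps
            return E(q)
        hess[i, j] = (shifted(+1, +1) - shifted(+1, -1)
                      - shifted(-1, +1) + shifted(-1, -1)) / (4 * eps ** 2)

print(np.round(hess, 3))               # approximately diag(1 / q_i): positive definite
print(np.round(1 / q0, 3))
```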

8 The Gaussian mixture model (E-step)

In the Gaussian mixture density model, the densities are given by:

$$ p(x|\theta) \propto \sum_{k=1}^{K} \frac{\pi_k}{\sigma_k} \exp\Big(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Big), \qquad (12) $$

where θ is the collection of parameters: means μ_k, variances σ_k² and mixing proportions π_k (which must be positive and sum to one). There are (binary) hidden variables H_k^(c), indicating which component observation x^(c) belongs to. The conditional likelihood and priors are:

$$ p(x|H, \theta) = \prod_{k=1}^{K} \Big[\sigma_k^{-1} \exp\Big(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Big)\Big]^{H_k}, \qquad p(H_k = 1|\theta) = \pi_k. \qquad (13) $$

In the E-step, compute the posterior for the hidden states given the current parameters:

$$ q(H) = p(H|x, \theta) \propto p(x|H, \theta)\, p(H|\theta), \qquad
q(H_k^{(c)}) \propto \frac{\pi_k}{\sigma_k} \exp\Big(-\frac{1}{2\sigma_k^2}(x^{(c)} - \mu_k)^2\Big), \qquad (14) $$

with the normalization being $q(H_k^{(c)}) \big/ \sum_{k'} q(H_{k'}^{(c)})$.
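A sketch of the E step in (14) for a one-dimensional mixture with made-up parameter values and observations (numpy assumed): compute the unnormalized responsibilities and normalize over the components.

```python
import numpy as np

# hypothetical current parameters of a K = 2 component, one-dimensional mixture
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 1.5])
sigma = np.array([1.0, 0.5])

x = np.array([-1.8, 0.2, 1.4, 3.0])            # observations x^(c)

# unnormalized q(H_k^(c)) proportional to (pi_k / sigma_k) exp(-(x^(c) - mu_k)^2 / (2 sigma_k^2))
unnorm = pi / sigma * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
resp = unnorm / unnorm.sum(axis=1, keepdims=True)   # normalize over the components k

print(np.round(resp, 3))                       # each row sums to one
```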

9 The Gaussian mixture model (M-step)

In the M-step we optimize the sum (since H is discrete):

$$ E = \sum_H q(H) \log\big[p(x|H, \theta)\, p(H|\theta)\big] = \sum_{c,k} q(H_k^{(c)}) \Big[\log \pi_k - \log \sigma_k - \frac{1}{2\sigma_k^2}(x^{(c)} - \mu_k)^2\Big]. \qquad (15) $$

Optimization wrt the parameters is done by setting the partial derivatives of E to zero:

$$ \frac{\partial E}{\partial \mu_k} = \sum_c q(H_k^{(c)})\, \frac{x^{(c)} - \mu_k}{\sigma_k^2} = 0 \;\Rightarrow\; \mu_k = \frac{\sum_c q(H_k^{(c)})\, x^{(c)}}{\sum_c q(H_k^{(c)})}, $$

$$ \frac{\partial E}{\partial \sigma_k} = \sum_c q(H_k^{(c)}) \Big[-\frac{1}{\sigma_k} + \frac{(x^{(c)} - \mu_k)^2}{\sigma_k^3}\Big] = 0 \;\Rightarrow\; \sigma_k^2 = \frac{\sum_c q(H_k^{(c)})\, (x^{(c)} - \mu_k)^2}{\sum_c q(H_k^{(c)})}, $$

$$ \frac{\partial E}{\partial \pi_k} = \frac{1}{\pi_k} \sum_c q(H_k^{(c)}), \qquad \frac{\partial E}{\partial \pi_k} - \lambda = 0 \;\Rightarrow\; \pi_k = \frac{1}{n} \sum_c q(H_k^{(c)}), \qquad (16) $$

where λ is a Lagrange multiplier ensuring that the mixing proportions sum to unity and n is the number of observations.
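Putting the E step above together with the M-step updates in (16) gives a complete EM loop for the one-dimensional mixture. A minimal sketch on synthetic data (numpy assumed, all numbers made up), monitoring the log likelihood to confirm it never decreases:

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic one-dimensional data drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(1.5, 0.5, 300)])
n, K = len(x), 2

# crude initial parameter guesses
pi, mu, sigma = np.full(K, 1.0 / K), np.array([-1.0, 1.0]), np.ones(K)

for it in range(50):
    # E step (eq. 14): responsibilities q(H_k^(c))
    unnorm = pi / sigma * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
    loglik = np.sum(np.log(unnorm.sum(axis=1) / np.sqrt(2 * np.pi)))
    resp = unnorm / unnorm.sum(axis=1, keepdims=True)

    # M step (eq. 16)
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / n

    if it % 10 == 0:
        print(f"iteration {it:2d}   log likelihood {loglik:8.2f}")

print("mu:", np.round(mu, 2), " sigma:", np.round(sigma, 2), " pi:", np.round(pi, 2))
```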

10 Factor Analysis

[Graphical model: latent factors X_1, ..., X_K connected through the loading matrix Λ to observations Y_1, Y_2, ..., Y_D.]

Linear generative model:

$$ y_d = \sum_{k=1}^{K} \Lambda_{dk}\, x_k + \epsilon_d $$

- x_k are independent N(0, 1) Gaussian factors
- ε_d are independent N(0, Ψ_dd) Gaussian noise
- K < D

So, y is Gaussian with:

$$ P(y) = \int P(x)\, P(y|x)\, dx = \mathcal{N}(0, \Lambda\Lambda^\top + \Psi), $$

where Λ is a D × K matrix, and Ψ is diagonal.

Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
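A sketch of the generative model with hypothetical dimensions (D = 5 observed, K = 2 factors; numpy assumed): sample y = Λx + ε and check that the sample covariance approaches ΛΛᵀ + Ψ.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K, N = 5, 2, 100_000                       # hypothetical dimensions and sample size

Lam = rng.normal(size=(D, K))                 # factor loading matrix Lambda (D x K)
Psi = np.diag(rng.uniform(0.1, 0.5, size=D))  # diagonal sensor noise covariance

x = rng.normal(size=(N, K))                   # independent N(0, 1) factors
eps = rng.normal(size=(N, D)) * np.sqrt(np.diag(Psi))
y = x @ Lam.T + eps                           # y_d = sum_k Lambda_dk x_k + eps_d

print(np.round(np.cov(y.T), 2))               # empirical covariance of y
print(np.round(Lam @ Lam.T + Psi, 2))         # model covariance Lambda Lambda' + Psi
```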

11 EM for Factor Analysis

[Same graphical model as above: factors X_1, ..., X_K, loadings Λ, observations Y_1, Y_2, ..., Y_D.]

The model for y:

$$ P(y|\theta) = \int P(x|\theta)\, P(y|x, \theta)\, dx = \mathcal{N}(0, \Lambda\Lambda^\top + \Psi) $$

Model parameters: θ = {Λ, Ψ}.

E step: For each data point y_n, compute the posterior distribution of hidden factors given the observed data: Q_n(x) = P(x|y_n, θ_t).

M step: Find the θ_{t+1} that maximises F(Q, θ):

$$ F(Q, \theta) = \sum_n \int Q_n(x)\big[\log P(x|\theta) + \log P(y_n|x, \theta) - \log Q_n(x)\big]\, dx
 = \sum_n \int Q_n(x)\big[\log P(x|\theta) + \log P(y_n|x, \theta)\big]\, dx + \text{const}. $$

12 The E step for Factor Analysis

E step: For each data point y_n, compute the posterior distribution of hidden factors given the observed data:

$$ Q_n(x) = P(x|y_n, \theta_t) = P(x, y_n|\theta)\,/\,P(y_n|\theta) $$

Tactic: write P(x, y_n|θ), consider y_n to be fixed. What is this as a function of x?

$$ P(x, y_n) = P(x)\, P(y_n|x)
 = (2\pi)^{-K/2} \exp\{-\tfrac{1}{2} x^\top x\}\; |2\pi\Psi|^{-1/2} \exp\{-\tfrac{1}{2}(y_n - \Lambda x)^\top \Psi^{-1}(y_n - \Lambda x)\} $$
$$ = \text{const} \times \exp\{-\tfrac{1}{2}\big[x^\top x + (y_n - \Lambda x)^\top \Psi^{-1}(y_n - \Lambda x)\big]\} $$
$$ = \text{const} \times \exp\{-\tfrac{1}{2}\big[x^\top (I + \Lambda^\top \Psi^{-1}\Lambda)\, x - 2 x^\top \Lambda^\top \Psi^{-1} y_n\big]\} $$
$$ = \text{const} \times \exp\{-\tfrac{1}{2}\big[x^\top \Sigma^{-1} x - 2 x^\top \Sigma^{-1}\mu_n + \mu_n^\top \Sigma^{-1}\mu_n\big]\} $$

So Σ = (I + ΛᵀΨ⁻¹Λ)⁻¹ = I − βΛ and μ_n = ΣΛᵀΨ⁻¹y_n = βy_n, where β ≡ ΣΛᵀΨ⁻¹. Note that μ_n is a linear function of y_n and Σ does not depend on y_n.
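A sketch of these E-step formulas with hypothetical Λ, Ψ and a single data point (numpy assumed), cross-checked against the standard Gaussian conditioning formulas (the two agree by the matrix inversion lemma of slide 16):

```python
import numpy as np

rng = np.random.default_rng(5)
D, K = 5, 2
Lam = rng.normal(size=(D, K))                 # hypothetical loadings Lambda
Psi_diag = rng.uniform(0.1, 0.5, size=D)      # diagonal of Psi
y_n = rng.normal(size=D)                      # one observed data point

Psi_inv = np.diag(1.0 / Psi_diag)
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)   # posterior covariance (same for every n)
beta = Sigma @ Lam.T @ Psi_inv                              # mu_n = beta y_n
mu_n = beta @ y_n                                           # posterior mean, linear in y_n

# cross-check against conditioning the joint Gaussian with marginal N(0, Lambda Lambda' + Psi)
C = Lam @ Lam.T + np.diag(Psi_diag)
mu_check = Lam.T @ np.linalg.solve(C, y_n)
Sigma_check = np.eye(K) - Lam.T @ np.linalg.solve(C, Lam)

print("max |mu_n - mu_check|     =", np.max(np.abs(mu_n - mu_check)))
print("max |Sigma - Sigma_check| =", np.max(np.abs(Sigma - Sigma_check)))
```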

13 The M step for Factor Analysis

M step: Find θ_{t+1} maximising

$$ F = \sum_n \int Q_n(x)\big[\log P(x|\theta) + \log P(y_n|x, \theta)\big]\, dx + \text{const}. $$

$$ \log P(x|\theta) + \log P(y_n|x, \theta) = \text{const} - \tfrac{1}{2} x^\top x - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}(y_n - \Lambda x)^\top \Psi^{-1}(y_n - \Lambda x) $$
$$ = \text{const} - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}\big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda x + x^\top \Lambda^\top \Psi^{-1}\Lambda x\big] $$
$$ = \text{const} - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}\big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda x + \mathrm{tr}(\Lambda^\top \Psi^{-1}\Lambda\, x x^\top)\big] $$

Taking expectations over Q_n(x)...

$$ = \text{const} - \tfrac{1}{2}\log|\Psi| - \tfrac{1}{2}\big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda \mu_n + \mathrm{tr}\big(\Lambda^\top \Psi^{-1}\Lambda\,(\mu_n \mu_n^\top + \Sigma)\big)\big] $$

14 The M step for Factor Analysis (cont.)

$$ F = \text{const} - \frac{N}{2}\log|\Psi| - \frac{1}{2}\sum_n \big[y_n^\top \Psi^{-1} y_n - 2\, y_n^\top \Psi^{-1}\Lambda \mu_n + \mathrm{tr}\big(\Lambda^\top \Psi^{-1}\Lambda\,(\mu_n\mu_n^\top + \Sigma)\big)\big] $$

Taking derivatives with respect to Λ and Ψ⁻¹, using ∂tr(AB)/∂B = Aᵀ and ∂log|A|/∂A = A⁻ᵀ:

$$ \frac{\partial F}{\partial \Lambda} = \Psi^{-1}\sum_n y_n \mu_n^\top - \Psi^{-1}\Lambda\Big(N\Sigma + \sum_n \mu_n\mu_n^\top\Big) = 0
 \;\Rightarrow\; \hat{\Lambda} = \Big(\sum_n y_n\mu_n^\top\Big)\Big(N\Sigma + \sum_n \mu_n\mu_n^\top\Big)^{-1} $$

$$ \frac{\partial F}{\partial \Psi^{-1}} = \frac{N}{2}\Psi - \frac{1}{2}\sum_n\big[y_n y_n^\top - \Lambda\mu_n y_n^\top - y_n\mu_n^\top \Lambda^\top + \Lambda(\mu_n\mu_n^\top + \Sigma)\Lambda^\top\big] $$

$$ \hat{\Psi} = \frac{1}{N}\sum_n\big[y_n y_n^\top - \Lambda\mu_n y_n^\top - y_n\mu_n^\top \Lambda^\top + \Lambda(\mu_n\mu_n^\top + \Sigma)\Lambda^\top\big]
 = \Lambda\Sigma\Lambda^\top + \frac{1}{N}\sum_n (y_n - \Lambda\mu_n)(y_n - \Lambda\mu_n)^\top \quad \text{(squared residuals)} $$

When Σ → 0 these become the equations for linear regression!
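Combining the E step of slide 12 with the Λ̂ and Ψ̂ updates above gives a complete EM iteration for factor analysis. A minimal sketch on synthetic data (numpy assumed, hypothetical dimensions); the Ψ update uses the simplified form that holds when Λ is the freshly updated value:

```python
import numpy as np

rng = np.random.default_rng(6)
D, K, N = 8, 2, 2000                                   # hypothetical dimensions

# synthetic data from a "true" factor analysis model
Lam_true = rng.normal(size=(D, K))
Psi_true = rng.uniform(0.1, 0.5, size=D)
Y = rng.normal(size=(N, K)) @ Lam_true.T + rng.normal(size=(N, D)) * np.sqrt(Psi_true)

Lam = rng.normal(size=(D, K))                          # initial guesses
Psi = np.ones(D)                                       # stored as the diagonal of Psi
S = Y.T @ Y / N                                        # data second-moment matrix

for it in range(100):
    # E step: Sigma = (I + Lam' Psi^-1 Lam)^-1, mu_n = Sigma Lam' Psi^-1 y_n
    PsiInvLam = Lam / Psi[:, None]
    Sigma = np.linalg.inv(np.eye(K) + Lam.T @ PsiInvLam)
    Mu = Y @ PsiInvLam @ Sigma                         # N x K matrix whose rows are mu_n'

    # M step: Lam_hat = (sum_n y_n mu_n')(N Sigma + sum_n mu_n mu_n')^-1
    YM = Y.T @ Mu                                      # sum_n y_n mu_n'
    Lam = YM @ np.linalg.inv(N * Sigma + Mu.T @ Mu)
    # Psi_hat = diag{ (1/N) sum_n y_n y_n' - Lam_hat (1/N) sum_n mu_n y_n' }
    Psi = np.diag(S - Lam @ YM.T / N)

    if it % 20 == 0:
        C = Lam @ Lam.T + np.diag(Psi)                 # model covariance
        loglik = -0.5 * N * (D * np.log(2 * np.pi)
                             + np.linalg.slogdet(C)[1]
                             + np.trace(np.linalg.solve(C, S)))
        print(f"iteration {it:3d}   log likelihood {loglik:10.1f}")
```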

15 Mixtures of Factor Analysers

Simultaneous clustering and dimensionality reduction.

$$ P(y|\theta) = \sum_k \pi_k\, \mathcal{N}(\mu_k, \Lambda_k\Lambda_k^\top + \Psi) $$

where π_k is the mixing proportion for FA k, μ_k is its centre, Λ_k is its factor loading matrix, and Ψ is a common sensor noise model. θ = {{π_k, μ_k, Λ_k}_{k=1...K}, Ψ}

We can think of this model as having two sets of hidden latent variables:
- a discrete indicator variable s_n ∈ {1, ..., K}
- for each factor analyzer, a continuous factor vector x_{n,k} ∈ R^{D_k}

$$ P(y_n|\theta) = \sum_{s_n=1}^{K} P(s_n|\theta) \int P(x|s_n, \theta)\, P(y_n|x, s_n, \theta)\, dx $$

As before, an EM algorithm can be derived for this model:
E step: Infer the joint distribution of the latent variables, P(x, s_n|y_n, θ)
M step: Maximize F with respect to θ.
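A sketch evaluating the mixture-of-factor-analysers density for made-up parameters (numpy assumed): each component contributes π_k N(y; μ_k, Λ_kΛ_kᵀ + Ψ), and normalizing the per-component terms gives the posterior over the discrete indicator, which is part of the E step.

```python
import numpy as np

rng = np.random.default_rng(7)
D, K_fact, K_comp = 4, 2, 3                            # hypothetical sizes

# made-up MFA parameters
pis = rng.dirichlet(np.ones(K_comp))
mus = rng.normal(size=(K_comp, D))
Lams = rng.normal(size=(K_comp, D, K_fact))
Psi = np.diag(rng.uniform(0.1, 0.5, size=D))           # shared sensor noise

def log_gauss(y, m, C):
    d = y - m
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

y = rng.normal(size=D)                                 # a query point
log_terms = np.array([np.log(pis[k]) + log_gauss(y, mus[k], Lams[k] @ Lams[k].T + Psi)
                      for k in range(K_comp)])
log_py = np.logaddexp.reduce(log_terms)                # log p(y | theta)

print(f"log p(y) = {log_py:.4f}")
print("posterior over the indicator s:", np.round(np.exp(log_terms - log_py), 3))
```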

16 Proof of the Matrix Inversion Lemma

$$ (A + XBX^\top)^{-1} = A^{-1} - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1} $$

Need to prove:

$$ \big(A^{-1} - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}\big)(A + XBX^\top) = I $$

Expand:

$$ I + A^{-1}XBX^\top - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top - A^{-1}X(B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top $$

Regroup:

$$ = I + A^{-1}X\big(BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top\big) $$
$$ = I + A^{-1}X\big(BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}B^{-1}BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}X^\top A^{-1}XBX^\top\big) $$
$$ = I + A^{-1}X\big(BX^\top - (B^{-1} + X^\top A^{-1}X)^{-1}(B^{-1} + X^\top A^{-1}X)BX^\top\big) $$
$$ = I + A^{-1}X(BX^\top - BX^\top) = I $$
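A quick numerical confirmation of the lemma with random matrices of hypothetical sizes (numpy assumed, A and B made symmetric positive definite so both inverses exist):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 6, 2
A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)   # symmetric positive definite
B = rng.normal(size=(k, k)); B = B @ B.T + k * np.eye(k)
X = rng.normal(size=(n, k))

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
lhs = np.linalg.inv(A + X @ B @ X.T)
rhs = Ainv - Ainv @ X @ np.linalg.inv(Binv + X.T @ Ainv @ X) @ X.T @ Ainv

print("max absolute difference:", np.max(np.abs(lhs - rhs)))    # ~1e-15
```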

17 Readings

David MacKay's textbook, chapter 21, draft 2.2.4, August 31, 2001.

Ghahramani, Z. and Hinton, G.E. (1996) The EM Algorithm for Mixtures of Factor Analyzers. University of Toronto Technical Report CRG-TR-96-1. zoubin/papers/tr-96-1.ps.gz

Minka, T. Tutorial on linear algebra. tpminka/papers/matrix.html

Roweis, S.T. and Ghahramani, Z. (1999) A Unifying Review of Linear Gaussian Models. Neural Computation 11(2). See also Appendix A.1-A.2. zoubin/papers/lds.ps.gz

Welling, M. (2000) Linear models. Class notes. zoubin/course01/pca.ps
