EM and Structure Learning

1 EM and Structure Learning
Le Song
Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

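2 Partially observed graphical models
Mixture models
[Figure: plate model with latent Z and observed X, repeated N times; mixture components $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$.]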

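3 Gaussian mixture model
Consider a mixture of K Gaussians:
$$p(x) = \sum_k \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the mixture proportion and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the mixture component.
Learn $\pi_k, \mu_k, \Sigma_k$.
This corresponds to the latent-variable model
$$p(x) = \sum_z p(x \mid z; \mu_z, \Sigma_z) \, P(z; \pi)$$
Can be used for unsupervised clustering.
[Figure: plate model with latent Z and observed X, repeated N times; mixture components $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$.]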
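The mixture density above can be evaluated directly. The following is a minimal sketch restricted to one-dimensional Gaussians so that it needs only the standard library; the function names and the example parameter values are illustrative assumptions, not part of the lecture.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, pis, mus, variances):
    """p(x) = sum_k pi_k N(x | mu_k, var_k) for a 1-D mixture."""
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances))

# Hypothetical two-component mixture (values chosen only for illustration).
pis, mus, variances = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]
density_at_zero = mixture_pdf(0.0, pis, mus, variances)
```

Because the proportions sum to one and each component integrates to one, the mixture is itself a valid density.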

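4 Why is learning hard?
In fully observed iid settings, the log-likelihood decomposes into a sum of local terms:
$$l(\theta; x, z) = \log p(x, z \mid \theta) = \log P(z \mid \theta_1) + \log p(x \mid z, \theta_2)$$
With latent variables, all the parameters become coupled together via marginalization:
$$l(\theta; x) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(x \mid z, \theta_2) \, P(z \mid \theta_1)$$
[Figure: two plate models over Z and X with plate size N, one for each setting.]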

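5 EM algorithm
EM: Expectation-maximization for finding $\theta$
$$l(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(x \mid z, \theta_2) \, P(z \mid \theta_1)$$
Iterate between E-step and M-step until convergence.
Expectation step (E-step):
$$f(\theta) = E_{q(z)}\left[\log p(x, z \mid \theta)\right], \quad \text{where } q(z) = P(z \mid x, \theta^t)$$
Maximization step (M-step):
$$\theta^{t+1} = \arg\max_\theta f(\theta)$$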

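6 Example: Gaussian mixture model
A mixture of K Gaussians:
Z is a latent class indicator vector:
$$P(Z \mid \theta) = \theta_1^{Z_1} \theta_2^{Z_2} \cdots \theta_K^{Z_K}$$
X is a conditional Gaussian variable:
$$P(X \mid Z_k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)\right\}$$
The likelihood of a sample:
$$P(x \mid \theta, \mu, \Sigma) = \sum_k P(z_k = 1 \mid \theta) \, P(x \mid z_k = 1, \mu, \Sigma) = \sum_k \theta_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
The complete log-likelihood:
$$\log P(x, z \mid \theta, \mu, \Sigma) = \log P(z \mid \theta) + \log P(x \mid z, \mu, \Sigma) = \sum_k z_k \log \theta_k + \sum_k z_k \log \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
[Figure: plate model with latent Z and observed X, repeated N times.]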

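7 Expected complete log-likelihood
$$l_c(\{x, z\}; \theta, \mu, \Sigma) = \log P(x, z \mid \theta, \mu, \Sigma) = \log P(z \mid \theta) + \log P(x \mid z, \mu, \Sigma)$$
Taking the expectation under $P(Z \mid x)$:
$$\langle l_c(\{x, z\}; \theta, \mu, \Sigma) \rangle_{P(Z \mid x)} = \langle \log P(z \mid \theta) \rangle_{P(Z \mid x)} + \langle \log p(x \mid z, \mu, \Sigma) \rangle_{P(Z \mid x)}$$
$$= \sum_k \langle z_k \rangle_{P(Z \mid x)} \log \theta_k - \frac{1}{2} \sum_k \langle z_k \rangle_{P(Z \mid x)} \left( (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log |\Sigma_k| + C \right)$$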

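8 E-step
Expectation step: compute the expected value of the sufficient statistics of the hidden variables (z) given the current estimate of the parameters ($\theta, \mu, \Sigma$):
$$\tau_k = \langle z_k \rangle_{P(Z \mid x)} = P(z_k = 1 \mid x, \mu, \Sigma) = \frac{\theta_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{k'} \theta_{k'} \, \mathcal{N}(x \mid \mu_{k'}, \Sigma_{k'})}$$
We are essentially doing inference.
$$\langle l_c(\{x, z\}; \theta, \mu, \Sigma) \rangle_{P(Z \mid x)} = \sum_k \tau_k \log \theta_k - \frac{1}{2} \sum_k \tau_k \left( (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log |\Sigma_k| + C \right)$$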
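The E-step is a direct application of Bayes' rule for a single data point. A minimal sketch follows, assuming one-dimensional components so that only the standard library is needed; `responsibilities` and `normal_pdf` are illustrative names.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, thetas, mus, variances):
    """tau_k = theta_k N(x | mu_k, var_k) / sum_k' theta_k' N(x | mu_k', var_k')."""
    weighted = [t * normal_pdf(x, m, v) for t, m, v in zip(thetas, mus, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]

# A point midway between two equally weighted, equal-variance components
# is shared evenly between them.
taus = responsibilities(1.0, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```

In higher dimensions, or with widely separated components, the same computation is usually done in log space to avoid underflow; that refinement is omitted here.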

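9 M-step
Maximization step: compute the parameters that maximize the current expected complete log-likelihood. Below, $\tau_k^n$ denotes $\tau_k$ evaluated at data point $x_n$, and sums over $n$ run over the $N$ data points.
$$\theta_k^* = \arg\max_{\theta_k} \langle l_c(\{x, z\}; \theta, \mu, \Sigma) \rangle_{P(Z \mid x)} \ \text{ s.t. } \sum_k \theta_k = 1 \quad \Rightarrow \quad \theta_k^* = \frac{\sum_n \tau_k^n}{N}$$
$$\mu_k^* = \arg\max_{\mu_k} \langle l_c(\{x, z\}; \theta, \mu, \Sigma) \rangle_{P(Z \mid x)} \quad \Rightarrow \quad \mu_k^* = \frac{\sum_n \tau_k^n x_n}{\sum_n \tau_k^n}$$
$$\Sigma_k^* = \arg\max_{\Sigma_k} \langle l_c(\{x, z\}; \theta, \mu, \Sigma) \rangle_{P(Z \mid x)} \quad \Rightarrow \quad \Sigma_k^* = \frac{\sum_n \tau_k^n (x_n - \mu_k^*)(x_n - \mu_k^*)^T}{\sum_n \tau_k^n}$$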
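The M-step updates are responsibility-weighted averages over the data. The sketch below assumes one-dimensional data (so each covariance is a scalar variance) and takes the responsibilities as given; the name `m_step` and the toy inputs are illustrative assumptions.

```python
def m_step(xs, taus):
    """Weighted maximum-likelihood updates for a 1-D Gaussian mixture.

    xs   : list of N floats
    taus : taus[n][k] is the responsibility of component k for xs[n]
    Returns (thetas, mus, variances).
    """
    n_points, n_components = len(xs), len(taus[0])
    thetas, mus, variances = [], [], []
    for k in range(n_components):
        weight = sum(taus[n][k] for n in range(n_points))  # sum_n tau_k^n
        mu = sum(taus[n][k] * xs[n] for n in range(n_points)) / weight
        var = sum(taus[n][k] * (xs[n] - mu) ** 2 for n in range(n_points)) / weight
        thetas.append(weight / n_points)
        mus.append(mu)
        variances.append(var)
    return thetas, mus, variances

# With 0/1 responsibilities the updates reduce to per-cluster sample statistics.
xs = [0.0, 2.0, 10.0, 12.0]
taus = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
thetas, mus, variances = m_step(xs, taus)
```

The sketch does not guard against a component whose total responsibility is zero; a practical implementation would need to handle that case.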

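10 K-means vs EM for Gaussian mixture
The EM algorithm for a mixture of Gaussians is like a soft clustering algorithm.
K-means:
E-step: we do hard assignment, picking the closest cluster:
$$z_n = \arg\min_k \, (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)$$
M-step: we update the mean and covariance of each cluster using the maximum likelihood estimate:
$$\mu_k = \frac{\sum_n \delta(z_n, k) \, x_n}{\sum_n \delta(z_n, k)}$$
$$\Sigma_k = \frac{\sum_n \delta(z_n, k) \, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n \delta(z_n, k)}$$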
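For contrast with the soft E-step, one hard-assignment iteration can be sketched as follows. This is a simplified illustration, assuming one-dimensional data and unit variances, so that the closest cluster is simply the nearest mean and only the means are updated; `kmeans_step` is an illustrative name.

```python
def kmeans_step(xs, mus):
    """One K-means iteration on 1-D data: hard assignment, then mean update."""
    # Hard "E-step": each point goes to the cluster with the nearest mean.
    assignments = [min(range(len(mus)), key=lambda k: (x - mus[k]) ** 2) for x in xs]
    # "M-step": each mean becomes the average of its assigned points.
    new_mus = []
    for k in range(len(mus)):
        members = [x for x, z in zip(xs, assignments) if z == k]
        # An empty cluster keeps its previous mean (a choice made for this sketch).
        new_mus.append(sum(members) / len(members) if members else mus[k])
    return assignments, new_mus

assignments, new_mus = kmeans_step([0.0, 2.0, 10.0, 12.0], [0.0, 12.0])
```

Replacing the hard assignment with posterior responsibilities recovers the EM updates for the mixture model.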

11 Expectation-Maximization iterations

12 Theory underlying EM
Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data:
$$l(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(x \mid z, \theta_2) \, P(z \mid \theta_1)$$
But we are iterating these:
Expectation step (E-step):
$$f(\theta) = E_{q(z)}\left[\log p(x, z \mid \theta)\right], \quad \text{where } q(z) = P(z \mid x, \theta^t)$$
Maximization step (M-step):
$$\theta^{t+1} = \arg\max_\theta f(\theta)$$
Does maximizing this surrogate yield a maximizer of the likelihood?

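13 Jensen's inequality
For a concave function $f(x)$:
$$f\Big(\sum_i \alpha_i x_i\Big) \ge \sum_i \alpha_i f(x_i), \quad \text{where } \sum_i \alpha_i = 1, \ \alpha_i \ge 0$$
[Figure: concave curve $f(x)$ with points $x_1$ and $x_2$, comparing $f(\tfrac{1}{3} x_1 + \tfrac{2}{3} x_2)$ with $\tfrac{1}{3} f(x_1) + \tfrac{2}{3} f(x_2)$.]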
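A quick numerical check of the inequality, using the concave function log and weights 1/3 and 2/3; the helper name `jensen_gap` and the chosen points are assumptions made for this sketch.

```python
import math

def jensen_gap(f, xs, alphas):
    """Return f(sum_i a_i x_i) - sum_i a_i f(x_i).

    For a concave f and weights a_i >= 0 summing to 1, the result is >= 0.
    """
    if abs(sum(alphas) - 1.0) > 1e-12 or any(a < 0 for a in alphas):
        raise ValueError("alphas must be nonnegative and sum to 1")
    mean_x = sum(a * x for a, x in zip(alphas, xs))
    return f(mean_x) - sum(a * f(x) for a, x in zip(alphas, xs))

gap = jensen_gap(math.log, [1.0, 4.0], [1.0 / 3.0, 2.0 / 3.0])
```

The gap vanishes when all the points coincide, which is the equality case exploited later when choosing q.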

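14 Lower bound of log-likelihood
Log-likelihood:
$$l(x; \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}$$
$$\ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \quad \text{(Jensen's inequality)}$$
$$= \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)$$
$$= \langle \log p(x, z \mid \theta) \rangle_{q(z \mid x)} + H(q(z \mid x))$$
$$= \langle l(x, z; \theta) \rangle_{q(z \mid x)} + H(q(z \mid x))$$
What q to use?
$$l(x; \theta) \ge \langle l(x, z; \theta) \rangle_{q(z \mid x)} + H(q(z \mid x))$$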

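15 What attains equality?
$p(z \mid x, \theta^t)$ attains the equality at $\theta^t$.
Let
$$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \le l(x; \theta) = \log \sum_z p(x, z \mid \theta)$$
Then
$$F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)}$$
$$= \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t)$$
$$= \log p(x \mid \theta^t) = \log \sum_z p(x, z \mid \theta^t)$$
So $l(x; \theta^t) = F(p(z \mid x, \theta^t), \theta^t)$, while $F(p(z \mid x, \theta^t), \theta) \le l(x; \theta)$ for every $\theta$.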
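The equality can be checked numerically for a single observation and a discrete latent variable. The sketch below assumes a two-component, one-dimensional Gaussian mixture with made-up parameter values; `free_energy` and `joint` are illustrative names.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def joint(x, thetas, mus, variances):
    """List of p(x, z = k | theta) over the components k."""
    return [t * normal_pdf(x, m, v) for t, m, v in zip(thetas, mus, variances)]

def free_energy(q, joint_probs):
    """F(q, theta) = sum_z q(z) log(p(x, z | theta) / q(z)); terms with q(z) = 0 are skipped."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint_probs) if qz > 0)

joint_probs = joint(0.5, [0.4, 0.6], [0.0, 2.0], [1.0, 1.0])
log_lik = math.log(sum(joint_probs))                         # l(x; theta)
posterior = [p / sum(joint_probs) for p in joint_probs]      # p(z | x, theta)
f_posterior = free_energy(posterior, joint_probs)            # matches log_lik
f_uniform = free_energy([0.5, 0.5], joint_probs)             # a strictly looser bound here
```

Any other choice of q leaves a gap equal to the KL divergence between q and the posterior, which is why the posterior is the maximizing q for fixed parameters.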

16 Lower bound of log-likelihood
For fixed data x, define a functional called the free energy:
$$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \le l(x; \theta)$$
EM is coordinate ascent on F:
E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
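Because each step maximizes F in one coordinate and the E-step makes the bound tight, the log-likelihood should never decrease across iterations. The sketch below illustrates this on a tiny one-dimensional data set; the data, the initial parameters, and the function names are assumptions for illustration, and no safeguards against degenerate components are included.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, thetas, mus, variances):
    """sum_n log sum_k theta_k N(x_n | mu_k, var_k)."""
    return sum(
        math.log(sum(t * normal_pdf(x, m, v) for t, m, v in zip(thetas, mus, variances)))
        for x in xs
    )

def em_step(xs, thetas, mus, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: q(z_n) is the posterior over components for each point.
    taus = []
    for x in xs:
        weighted = [t * normal_pdf(x, m, v) for t, m, v in zip(thetas, mus, variances)]
        total = sum(weighted)
        taus.append([w / total for w in weighted])
    # M-step: responsibility-weighted maximum-likelihood updates.
    new_thetas, new_mus, new_vars = [], [], []
    for k in range(len(thetas)):
        weight = sum(tau[k] for tau in taus)
        mu = sum(tau[k] * x for tau, x in zip(taus, xs)) / weight
        var = sum(tau[k] * (x - mu) ** 2 for tau, x in zip(taus, xs)) / weight
        new_thetas.append(weight / len(xs))
        new_mus.append(mu)
        new_vars.append(var)
    return new_thetas, new_mus, new_vars

xs = [-2.1, -1.9, -2.4, -1.6, 2.8, 3.1, 3.3, 2.9]   # made-up data, two clear groups
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])       # made-up initial guess
history = [log_likelihood(xs, *params)]
for _ in range(20):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
```

Monotonicity says nothing about reaching the global maximum; different initial guesses can converge to different local optima.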

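17 EM algorithm
EM: Expectation-maximization for finding $\theta$
$$l(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(x \mid z, \theta_2) \, P(z \mid \theta_1)$$
Iterate between E-step and M-step until convergence.
Expectation step (E-step):
$$f(\theta) = E_{q(z)}\left[\log p(x, z \mid \theta)\right], \quad \text{where } q(z) = P(z \mid x, \theta^t)$$
Maximization step (M-step):
$$\theta^{t+1} = \arg\max_\theta f(\theta)$$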

18 Where are we now?
Graphical model representation
  Bayesian networks (directed graphical models)
  Markov networks (undirected graphical models)
  Conditional independence statements + factorization of joint distribution
Inference in graphical models
  Variable elimination, message passing on trees and junction trees
  Sampling (rejection, importance and Gibbs sampling)
Learning graphical model parameters (given structure)
  Maximum likelihood estimation (just counts for discrete BN)
  Bayesian learning (posterior)
  EM for models with latent variables

19 Structure Learning
The goal: given a set of independent samples (assignments of random variables), find the best (the most likely) graphical model structure.
Samples:
  (A,F,S,N,H) = (T,F,F,T,F)
  (A,F,S,N,H) = (T,F,T,T,F)
  (A,F,S,N,H) = (F,T,T,T,T)
Score structures: maximum likelihood; Bayesian score; margin.
[Figure: candidate structures over the variables A, F, S, N, H.]

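20 Mutual information
Mutual information:
$$I(X_i, X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i) P(x_j)}$$
This is the KL-divergence between the joint distribution $P(X_i, X_j)$ and the product of marginal distributions $P(X_i) P(X_j)$:
$$KL\big(P(X_i, X_j) \,\|\, P(X_i) P(X_j)\big)$$
It measures a distance away from independence: $X_i$ and $X_j$ are independent if and only if $I(X_i, X_j) = 0$.
Given M iid data points $D = \{x^l\}$, estimate $P(x_i, x_j) = \frac{\#(x_i, x_j)}{M}$.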
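The count-based estimate can be computed in a few lines. The sketch below uses the three (A,F,S,N,H) assignments shown on the structure learning slide as toy data; the function name and the use of natural logarithms are choices made for this example.

```python
import math
from collections import Counter

def mutual_information(samples, i, j):
    """Empirical I(X_i, X_j) in nats, with P estimated as count / M."""
    m = len(samples)
    joint = Counter((s[i], s[j]) for s in samples)
    count_i = Counter(s[i] for s in samples)
    count_j = Counter(s[j] for s in samples)
    total = 0.0
    for (a, b), n_ab in joint.items():
        # P(a, b) * log[ P(a, b) / (P(a) P(b)) ], written in terms of counts
        total += (n_ab / m) * math.log(n_ab * m / (count_i[a] * count_j[b]))
    return total

# Columns are (A, F, S, N, H).
samples = [
    (True, False, False, True, False),
    (True, False, True, True, False),
    (False, True, True, True, True),
]
mi_a_f = mutual_information(samples, 0, 1)   # F is the negation of A in this data
mi_a_n = mutual_information(samples, 0, 3)   # N is constant, so it carries no information
```

With only three samples these estimates are of course very noisy; the point is only to show the computation.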

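21 Scoring a tree model
Given M iid data points $D = \{x^l\}$.
Given a tree G, suppose we have learned the model $p(X \mid G) = \prod_i p(X_i \mid X_{\pi_i})$, where $\pi_i$ is the parent of node $i$.
The log-likelihood of the data according to the tree model:
$$l(D, G) = \log p(D \mid G) = \log \prod_l \prod_i p(x_i^l \mid x_{\pi_i}^l) = \sum_l \sum_i \log p(x_i^l \mid x_{\pi_i}^l)$$
$$= M \sum_i \sum_{x_i, x_{\pi_i}} p(x_i, x_{\pi_i}) \log p(x_i \mid x_{\pi_i})$$
$$= M \sum_i \sum_{x_i, x_{\pi_i}} p(x_i, x_{\pi_i}) \log \frac{p(x_i, x_{\pi_i})}{p(x_{\pi_i})}$$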

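22 Decomposable score for tree models
The log-likelihood of the data according to the tree model is
$$l(D, G) = \log p(D \mid G) = M \sum_i \sum_{x_i, x_{\pi_i}} p(x_i, x_{\pi_i}) \log \frac{p(x_i, x_{\pi_i})}{p(x_{\pi_i})}$$
$$= M \sum_i \sum_{x_i, x_{\pi_i}} p(x_i, x_{\pi_i}) \log \frac{p(x_i, x_{\pi_i}) \, p(x_i)}{p(x_{\pi_i}) \, p(x_i)}$$
$$= M \sum_i \sum_{x_i, x_{\pi_i}} p(x_i, x_{\pi_i}) \log \frac{p(x_i, x_{\pi_i})}{p(x_i) \, p(x_{\pi_i})} + M \sum_i \sum_{x_i, x_{\pi_i}} p(x_i, x_{\pi_i}) \log p(x_i)$$
$$= M \sum_i I(x_i, x_{\pi_i}) - M \sum_i H(x_i)$$
The score decomposes according to the edges in the tree!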
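The decomposition can be verified numerically: fit the conditional tables of a given tree by counting, compute the log-likelihood directly, and compare it with M times the sum of edge mutual informations minus M times the sum of node entropies. The sketch below assumes discrete data, a tree encoded as a parent list (with `None` for the root), and a made-up data set; all names are illustrative.

```python
import math
from collections import Counter

def entropy(samples, i):
    """Empirical entropy H(X_i) in nats."""
    m = len(samples)
    return -sum((c / m) * math.log(c / m) for c in Counter(s[i] for s in samples).values())

def mutual_information(samples, i, j):
    """Empirical mutual information I(X_i, X_j) in nats."""
    m = len(samples)
    joint = Counter((s[i], s[j]) for s in samples)
    ci, cj = Counter(s[i] for s in samples), Counter(s[j] for s in samples)
    return sum((n / m) * math.log(n * m / (ci[a] * cj[b])) for (a, b), n in joint.items())

def tree_log_likelihood(samples, parent):
    """Log-likelihood under the tree with maximum-likelihood (count-based) tables."""
    total = 0.0
    for i, pa in enumerate(parent):
        if pa is None:
            # Root node: sum_l log p(x_i^l), with p(x_i) = count / M.
            counts = Counter(s[i] for s in samples)
            total += sum(c * math.log(c / len(samples)) for c in counts.values())
        else:
            # Child node: sum_l log p(x_i^l | x_pa^l), with p = joint count / parent count.
            joint = Counter((s[i], s[pa]) for s in samples)
            pa_counts = Counter(s[pa] for s in samples)
            total += sum(c * math.log(c / pa_counts[b]) for (a, b), c in joint.items())
    return total

def decomposed_score(samples, parent):
    """M * sum over edges of I(x_i, x_parent) - M * sum over nodes of H(x_i)."""
    m = len(samples)
    mi = sum(mutual_information(samples, i, pa) for i, pa in enumerate(parent) if pa is not None)
    return m * mi - m * sum(entropy(samples, i) for i in range(len(parent)))

samples = ([(0, 0, 0)] * 3 + [(0, 0, 1), (0, 1, 1)] +
           [(1, 1, 1)] * 3 + [(1, 1, 0), (1, 0, 0)])
chain = [None, 0, 1]   # X0 -> X1 -> X2
direct = tree_log_likelihood(samples, chain)
decomposed = decomposed_score(samples, chain)
```

The entropy term does not depend on the tree, so comparing structures only requires the mutual information of the chosen edges.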

23 Structural search space
How many graphs over n nodes? $O(2^{n^2})$
How many trees over n nodes? $O(2^{n \log n})$
But it turns out that we can find the exact solution of an optimal tree under MLE.
Trick: in a tree each node has only one parent!
Chow-Liu algorithm
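One way to recover these two counts is sketched below. It assumes the graphs are counted as directed graphs on labeled nodes and the trees as labeled undirected trees, for which Cayley's formula applies; the slide itself does not state which convention it uses.

```latex
% Each of the n^2 ordered pairs of nodes either carries an edge or not:
\[
\#\{\text{graphs on } n \text{ nodes}\} \le 2^{n^2}
\]
% Cayley's formula for labeled trees, rewritten as a power of two:
\[
\#\{\text{labeled trees on } n \text{ nodes}\} = n^{\,n-2} = 2^{(n-2)\log_2 n} = O\!\left(2^{\,n \log n}\right)
\]
```

Either count is far too large to enumerate, which is why the one-parent-per-node property matters.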

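24 Equivalent trees
$$l(D, G) = \log p(D \mid G) = M \sum_i I(x_i, x_{\pi_i}) - M \sum_i H(x_i)$$
[Figure: four directed trees over the variables A, F, S, N, H that share the same skeleton.]
Trees with the same skeleton have the same score, so the score can be written over undirected edges:
$$l(D, G) = \log p(D \mid G) = M \sum_{(i,j) \in T} I(x_i, x_j) - M \sum_i H(x_i)$$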

25 Chow-Liu algorithm
$$T^* = \arg\max_T \; M \sum_{(i,j) \in T} I(x_i, x_j) - M \sum_i H(x_i)$$
Chow-Liu algorithm:
For each pair of variables $X_i, X_j$, compute their empirical mutual information $I(x_i, x_j)$.
Now you have a complete graph connecting the variable nodes, with edge weights equal to $I(x_i, x_j)$.
Run a maximum spanning tree algorithm.
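The three steps can be sketched as follows. The lecture does not say which maximum spanning tree algorithm to run; this sketch assumes Kruskal's algorithm with a small union-find, and the function names and toy data are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(samples, i, j):
    """Empirical mutual information I(X_i, X_j) in nats."""
    m = len(samples)
    joint = Counter((s[i], s[j]) for s in samples)
    ci, cj = Counter(s[i] for s in samples), Counter(s[j] for s in samples)
    return sum((n / m) * math.log(n * m / (ci[a] * cj[b])) for (a, b), n in joint.items())

def chow_liu_edges(samples):
    """Return the undirected edges (i, j), i < j, of a maximum-MI spanning tree."""
    n_vars = len(samples[0])
    # Steps 1 and 2: complete graph weighted by pairwise mutual information.
    weighted = sorted(
        ((mutual_information(samples, i, j), i, j) for i, j in combinations(range(n_vars), 2)),
        reverse=True,
    )
    # Step 3: Kruskal's algorithm, taking the heaviest edges first.
    root = list(range(n_vars))

    def find(u):
        while root[u] != u:
            root[u] = root[root[u]]   # path halving
            u = root[u]
        return u

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:                  # the edge joins two components, so no cycle
            root[ri] = rj
            edges.append((i, j))
    return edges

# Toy data in which X0 and X1 agree often, X1 and X2 agree often, X0 and X2 less so.
samples = ([(0, 0, 0)] * 3 + [(0, 0, 1), (0, 1, 1)] +
           [(1, 1, 1)] * 3 + [(1, 1, 0), (1, 0, 0)])
edges = chow_liu_edges(samples)
```

The result is an undirected skeleton; choosing any node as the root and directing edges away from it gives a Bayesian network with the same likelihood.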

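26 Extension of Chow-Liu
Tree augmented naïve Bayes (TAN)
The naïve Bayes model overcounts evidence, because correlation between features is not considered.
[Figure: tree-augmented model over the features, with class variable C.]
Same as Chow-Liu, but score edges with conditional mutual information given the class C. With empirical estimates $P(x_i, x_j, c) = \frac{\#(x_i, x_j, c)}{M}$ and $P(x_i, x_j \mid c) = \frac{\#(x_i, x_j, c)}{\#(c)}$:
$$I(X_i, X_j \mid C) = \sum_c \sum_{x_i, x_j} P(x_i, x_j, c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c) \, P(x_j \mid c)}$$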
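The conditional score can be estimated from counts in the same way as the unconditional one. In the sketch below each term is weighted by the joint frequency of (x_i, x_j, c) and uses class-conditional frequencies inside the logarithm; the function name and the toy data are illustrative assumptions.

```python
import math
from collections import Counter

def conditional_mutual_information(samples, i, j, c):
    """Empirical I(X_i, X_j | C) in nats; i, j, c are column indices."""
    m = len(samples)
    n_ijc = Counter((s[i], s[j], s[c]) for s in samples)
    n_ic = Counter((s[i], s[c]) for s in samples)
    n_jc = Counter((s[j], s[c]) for s in samples)
    n_c = Counter(s[c] for s in samples)
    total = 0.0
    for (a, b, k), n in n_ijc.items():
        # P(a, b, k) * log[ P(a, b | k) / (P(a | k) P(b | k)) ], written with counts
        total += (n / m) * math.log(n * n_c[k] / (n_ic[(a, k)] * n_jc[(b, k)]))
    return total

# Columns are (x_i, x_j, class). Within each class the two features are identical.
tied = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
cmi_tied = conditional_mutual_information(tied, 0, 1, 2)

# Within each class the two features are independent.
free = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
cmi_free = conditional_mutual_information(free, 0, 1, 2)
```

Using these values as edge weights in the spanning tree step yields the feature tree of a TAN classifier.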
