Probabilistic Graphical Models
Brown University CSCI 2950-P, Spring 2013
Prof. Erik Sudderth
Lecture 9: Expectation Maximization (EM) Algorithm, Learning in Undirected Graphical Models
Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models
Expectation Maximization (EM)

Supervised training: observe pairs (x_i, y_i), i = 1, ..., N. Supervised testing: predict y_t from a new input x_t. Unsupervised learning: observe only x_i, i = 1, ..., N, with parameters \theta shared across observations and hidden data z_1, ..., z_N unique to particular instances.

• Initialization: Randomly select starting parameters \theta^{(0)}
• E-Step: Given parameters, find the posterior of the hidden data (equivalent to test-time inference of the full posterior distribution)
• M-Step: Given posterior distributions, find likely parameters (distinct from supervised ML/MAP estimation, but often still tractable)
• Iteration: Alternate E-step & M-step until convergence
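To make the alternation concrete, here is a minimal sketch of the EM loop for a two-component 1-D Gaussian mixture. This example and all names in it (em_gmm, resp, etc.) are illustrative additions, not from the lecture.

import numpy as np
from scipy.stats import norm

def em_gmm(x, n_iters=50):
    # Initialization: randomly select starting parameters theta^(0)
    rng = np.random.default_rng(0)
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2)
    sigma = np.array([1.0, 1.0])
    for _ in range(n_iters):
        # E-step: posterior responsibilities q(z_i = k) under current theta
        lik = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
        resp = lik / lik.sum(axis=0)                       # shape (2, N)
        # M-step: maximize the expected complete-data log-likelihood
        Nk = resp.sum(axis=1)
        pi = Nk / len(x)
        mu = (resp @ x) / Nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    return pi, mu, sigma

data = np.concatenate([np.random.default_rng(1).normal(-2.0, 1.0, 200),
                       np.random.default_rng(2).normal(3.0, 1.0, 300)])
print(em_gmm(data))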
EM as Lower Bound Maximization

\ln p(x \mid \theta) = \ln \sum_z p(x, z \mid \theta)

\ln p(x \mid \theta) \geq \sum_z q(z) \ln \frac{p(x, z \mid \theta)}{q(z)}

\ln p(x \mid \theta) \geq \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) \triangleq \mathcal{L}(q, \theta)

The bound holds for any distribution q(z), by Jensen's inequality and the concavity of the logarithm.

• Initialization: Randomly select starting parameters \theta^{(0)}
• E-Step: Given parameters, find the posterior of the hidden data: q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)})
• M-Step: Given posterior distributions, find likely parameters: \theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta)
• Iteration: Alternate E-step & M-step until convergence
Lower Bounds on Marginal Likelihood

\ln p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \| p)

E-Step: setting q(z) = p(z \mid x, \theta) makes \mathrm{KL}(q \| p) = 0, so the bound \mathcal{L}(q, \theta) touches \ln p(x \mid \theta).

[Figure: decomposition of the marginal likelihood into \mathcal{L}(q, \theta) and \mathrm{KL}(q \| p).]
C. Bishop, Pattern Recognition & Machine Learning
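Written out, the decomposition the figure illustrates is the following identity (standard, following Bishop), valid for any distribution q(z):

\mathrm{KL}(q \| p)
  = -\sum_z q(z) \ln \frac{p(z \mid x, \theta)}{q(z)}
  = -\sum_z q(z) \ln \frac{p(x, z \mid \theta)}{q(z)\, p(x \mid \theta)}
  = \ln p(x \mid \theta) - \mathcal{L}(q, \theta)

Since \mathrm{KL}(q \| p) \geq 0, we have \mathcal{L}(q, \theta) \leq \ln p(x \mid \theta), with equality if and only if q(z) = p(z \mid x, \theta), which is exactly the E-step choice.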
EM: Expectation Step

q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)})

General solution, for any probabilistic model: q^{(t)}(z) = p(z \mid x, \theta^{(t-1)}), the posterior distribution given the current parameters.

For a directed graphical model, x fixes the conditional distributions of every child node, given its parents: observed nodes carry the training data, unobserved nodes the hidden data.

Inference: Find the summary statistics of the posterior needed for the following M-step.

[Figure: directed model with hidden nodes z_1, z_2, z_3 and observed nodes x_1, x_2, x_3.]
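As a concrete instance of the general solution, the sketch below computes mixture-model responsibilities q(z_i = k) in log space to avoid underflow; the function and argument names are illustrative, not from the lecture.

import numpy as np

def e_step(log_prior, log_lik):
    # log_prior: (K,) log mixture weights; log_lik: (N, K) log p(x_i | z_i = k)
    log_joint = log_lik + log_prior                # log p(x_i, z_i = k | theta)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return np.exp(log_joint - log_evidence)        # posterior q(z_i = k)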
EM: Maximization Step

\theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta) = \arg\max_\theta \sum_z q(z) \ln p(x, z \mid \theta)

Recall the directed graphical model factorization, with y = \{x, z\} and \pi(i) denoting the parents of node i:

\log p(y \mid \theta) = \sum_{n=1}^N \log p(y_n \mid \theta) = \sum_{i \in V} \sum_{n=1}^N \log p(y_{i,n} \mid y_{\pi(i),n}, \theta_i)

E_q[\log p(y \mid \theta)] = \sum_{i \in V} \sum_{n=1}^N E_{q_i}[\log p(y_{i,n} \mid y_{\pi(i),n}, \theta_i)]

q_i(y_i, y_{\pi(i)}) = p(y_i, y_{\pi(i)} \mid \theta^{\mathrm{old}}, x)

From the E-step, we only require the posterior marginal distributions of each node and its parents, given the observations (which have probability one) for that training instance.

[Figure: directed model with hidden nodes z_1, z_2, z_3 and observed nodes x_1, x_2, x_3.]
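For discrete directed models, this means the M-step for each local conditional reduces to normalizing expected counts. A sketch, assuming the E-step has already produced the pairwise child-parent marginals for each training instance (the array layout here is a hypothetical choice for illustration):

import numpy as np

def m_step_cpt(pairwise_marginals):
    # pairwise_marginals: (N, C, P) array, q_n(y_child = c, y_parent = p)
    expected_counts = pairwise_marginals.sum(axis=0)    # (C, P) soft counts
    # column p holds the updated conditional p(y_child | y_parent = p)
    return expected_counts / expected_counts.sum(axis=0, keepdims=True)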
EM: A Sequence of Lower Bounds

[Figure: successive lower bounds \mathcal{L}(q, \theta) on \ln p(x \mid \theta), tangent at \theta^{\mathrm{old}} and maximized at \theta^{\mathrm{new}}.]
C. Bishop, Pattern Recognition & Machine Learning
Expectation Maximization Algorithm

E Step: Optimize the distribution on hidden variables given the parameters. Inferring marginals sets \mathrm{KL}(q \| p) = 0, so \mathcal{L}(q, \theta^{\mathrm{old}}) = \ln p(x \mid \theta^{\mathrm{old}}).

M Step: Optimize the parameters given the distribution on hidden variables: \mathcal{L}(q, \theta^{\mathrm{new}}) \leq \ln p(x \mid \theta^{\mathrm{new}}), with a new gap \mathrm{KL}(q \| p) > 0 until the marginals are re-inferred.

C. Bishop, Pattern Recognition & Machine Learning
M-Step for Exponential Families

Exponential family: p(x, z \mid \theta) = \exp\{\theta^T \phi(x, z) - A(\theta)\}

E-step produces: q(z_i), i = 1, \ldots, N, with \tau_{ik} \triangleq q(z_i = k)

M-step objective:

\mathcal{L}(\theta) = \sum_{i=1}^N E_{q_i}[\log p(x_i, z_i \mid \theta)] = \sum_{i=1}^N E_{q_i}[\theta^T \phi(x_i, z_i) - A(\theta)]

Taking the gradient shows that the optimal parameters satisfy the moment-matching condition

E_{\hat\theta}[\phi(x, z)] = \frac{1}{N} \sum_{i=1}^N \sum_k \tau_{ik} \phi(x_i, k)

As in basic mixture models, the solution always matches moments:
• For observed variables, the empirical distribution of the data
• For hidden variables, the weighted distribution from the E-step

In directed graphical models, apply this to each local conditional.
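A sketch of this moment matching for a mixture model, assuming responsibilities tau from the E-step and observed features x; the names and array shapes are illustrative:

import numpy as np

def match_moments(tau, x):
    # tau: (N, K) responsibilities tau_ik = q(z_i = k); x: (N, D) observations
    weights = tau.mean(axis=0)                      # hidden z: weighted E-step moments
    means = (tau.T @ x) / tau.sum(axis=0)[:, None]  # observed x: weighted empirical moments
    return weights, means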
EM for MAP Estimation

Up to a constant independent of \theta,

\ln p(\theta \mid x) = \ln p(\theta) + \ln p(x \mid \theta) = \ln p(\theta) + \ln \sum_z p(x, z \mid \theta)

\geq \ln p(\theta) + \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) \triangleq \mathcal{L}(q, \theta)

• Initialization: Randomly select starting parameters \theta^{(0)}
• E-Step: Given parameters, find the posterior of the hidden data: q^{(t)}(z) = p(z \mid x, \theta^{(t-1)}), i.e. q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)})
• M-Step: Given posterior distributions, find likely parameters: \theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta)
• Iteration: Alternate E-step & M-step until convergence

The M-step objective is the sum of the log prior and the weighted log-likelihood.
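For example, with a Dirichlet prior on mixture weights, the log prior simply adds pseudo-counts to the E-step's soft counts. A minimal sketch, assuming alpha_k >= 1 so the posterior mode is well defined; names are illustrative:

import numpy as np

def map_m_step_weights(tau, alpha):
    # tau: (N, K) responsibilities; alpha: (K,) Dirichlet hyperparameters
    # MAP mode: pi_k proportional to (soft count) + alpha_k - 1
    counts = tau.sum(axis=0) + alpha - 1.0
    return counts / counts.sum()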
Undirected Graphical Models

p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{f \in F} \psi_f(x_f \mid \theta_f), \qquad Z(\theta) = \sum_x \prod_{f \in F} \psi_f(x_f \mid \theta_f)

V: set of N nodes or vertices, \{1, 2, \ldots, N\}
F: set of hyperedges linking subsets of nodes, f \subseteq V

Assume an exponential family representation of each factor:

\psi_f(x_f \mid \theta_f) = \exp\{\theta_f^T \phi_f(x_f)\}

p(x \mid \theta) = \exp\Big\{\sum_{f \in F} \theta_f^T \phi_f(x_f) - A(\theta)\Big\}, \qquad A(\theta) = \log Z(\theta)

The partition function globally couples the local factor parameters.
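To see the global coupling concretely, here is a brute-force computation of A(\theta) = \log Z(\theta) for a tiny binary pairwise model: every factor's parameters enter the single sum over joint configurations. This is feasible only for a handful of nodes, and all names are illustrative.

import itertools
import numpy as np

def log_partition(unary, pairwise, edges):
    # unary: length-N field on x_i in {0, 1}; pairwise: dict edge -> coupling
    N = len(unary)
    scores = []
    for x in itertools.product([0, 1], repeat=N):
        s = sum(unary[i] * x[i] for i in range(N))                 # node factors
        s += sum(pairwise[e] * x[e[0]] * x[e[1]] for e in edges)   # edge factors
        scores.append(s)
    return np.logaddexp.reduce(scores)   # A(theta): log-sum over all 2^N states

# 3-node chain: one sum over 2^3 joint states couples every factor's parameters
print(log_partition([0.5, -0.2, 0.1], {(0, 1): 1.0, (1, 2): -0.5}, [(0, 1), (1, 2)]))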
Learning for Undirected Models

The undirected graph encodes dependencies within a single training example:

p(D \mid \theta) = \prod_{n=1}^N \frac{1}{Z(\theta)} \prod_{f \in F} \psi_f(x_{f,n} \mid \theta_f)

Given N independent, identically distributed, completely observed samples D = \{x_{V,1}, \ldots, x_{V,N}\}:

\log p(D \mid \theta) = \sum_{n=1}^N \sum_{f \in F} \theta_f^T \phi_f(x_{f,n}) - N A(\theta), \qquad A(\theta) = \log Z(\theta)

The partition function globally couples the local factor parameters.
Learning for Undirected Models

For the same N independent, identically distributed, completely observed samples D = \{x_{V,1}, \ldots, x_{V,N}\}, take the gradient of \log p(D \mid \theta) with respect to the parameters of a single factor:

\nabla_{\theta_f} \log p(D \mid \theta) = \sum_{n=1}^N \phi_f(x_{f,n}) - N E_\theta[\phi_f(x_f)]

Setting the gradient to zero shows that, at the ML solution, the model's expected features match the empirical feature averages. We must be able to compute marginal distributions for the factors in the current model:
• Tractable for tree-structured factor graphs via the sum-product algorithm
• What about general factor graphs or undirected graphs?
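A sketch of gradient ascent on this objective for a model small enough that E_\theta[\phi_f] can be computed by exhaustive enumeration; in realistic models this expectation would come from sum-product or approximate inference. All names are illustrative.

import itertools
import numpy as np

def features(x, edges):
    # phi(x): one feature per edge, phi_f(x) = x_i * x_j for f = (i, j)
    return np.array([x[i] * x[j] for (i, j) in edges], dtype=float)

def model_expectations(theta, edges, n_nodes):
    # Exact E_theta[phi(x)] by enumerating all 2^n joint configurations
    configs = list(itertools.product([0, 1], repeat=n_nodes))
    phis = np.array([features(x, edges) for x in configs])
    logp = phis @ theta
    p = np.exp(logp - np.logaddexp.reduce(logp))    # exact model distribution
    return p @ phis

def fit(data, edges, n_nodes, lr=0.1, n_iters=200):
    theta = np.zeros(len(edges))
    emp = np.mean([features(x, edges) for x in data], axis=0)  # data moments
    for _ in range(n_iters):
        # (1/N) grad log p(D | theta) = empirical moments - model moments
        theta += lr * (emp - model_expectations(theta, edges, n_nodes))
    return theta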