Probabilistic Graphical Models

Probabilistic Graphical Models
Brown University CSCI 2950-P, Spring 2013
Prof. Erik Sudderth
Lecture 9: Expectation Maximization (EM) Algorithm; Learning in Undirected Graphical Models
Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models

Expectation Maximization (EM)
[Figure: plate diagrams contrasting supervised training (y_i, x_i), supervised testing (y_t, x_t), and unsupervised learning (z_i, x_i).]
θ: parameters (shared across observations); z_1, ..., z_N: hidden data (unique to particular instances).
Initialization: Randomly select starting parameters θ^(0).
E-Step: Given parameters, find the posterior of the hidden data. Equivalent to test-time inference of the full posterior distribution.
M-Step: Given posterior distributions, find likely parameters. Distinct from supervised ML/MAP, but often still tractable.
Iteration: Alternate E-step and M-step until convergence.
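
The alternation above can be written as a short driver loop. The sketch below is illustrative only and not from the lecture; `run_em`, `e_step`, and `m_step` are hypothetical placeholders for model-specific routines.

```python
import numpy as np

def run_em(x, init_params, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM driver: alternate E- and M-steps until the lower bound stalls.

    e_step(x, params) -> (q, bound): posterior over hidden data and L(q, params)
    m_step(x, q)      -> params:     parameters maximizing the expected log-likelihood
    """
    params = init_params
    prev_bound = -np.inf
    for _ in range(max_iters):
        q, bound = e_step(x, params)      # E-step: q^(t)(z) = p(z | x, params)
        params = m_step(x, q)             # M-step: maximize L(q^(t), params) over params
        if bound - prev_bound < tol:      # bound is non-decreasing; stop when it stalls
            break
        prev_bound = bound
    return params, q
```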

EM as Lower Bound Maximization
ln p(x|θ) = ln Σ_z p(x, z|θ)
ln p(x|θ) ≥ Σ_z q(z) ln [ p(x, z|θ) / q(z) ]
ln p(x|θ) ≥ Σ_z q(z) ln p(x, z|θ) − Σ_z q(z) ln q(z) ≜ L(q, θ)
Initialization: Randomly select starting parameters θ^(0).
E-Step: Given parameters, find the posterior of the hidden data: q^(t) = arg max_q L(q, θ^(t−1)).
M-Step: Given posterior distributions, find likely parameters: θ^(t) = arg max_θ L(q^(t), θ).
Iteration: Alternate E-step and M-step until convergence.
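
As a concrete instance of this bound-maximization view (an illustrative example of mine, not from the lecture), the sketch below runs EM for a two-component 1-D Gaussian mixture and prints ln p(x|θ) after each M-step; the value never decreases.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Initialization: theta^(0) = (mixing weights, means, standard deviations)
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for t in range(50):
    # E-step: q(z_i = k) = posterior responsibility of component k for x_i
    lik = pi * norm.pdf(x[:, None], mu, sigma)          # shape (N, 2)
    resp = lik / lik.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates for each component
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

    # Marginal log-likelihood ln p(x | theta): monotonically non-decreasing
    new_lik = pi * norm.pdf(x[:, None], mu, sigma)
    print(t, np.log(new_lik.sum(axis=1)).sum())
```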

Lower Bounds on Marginal Likelihood
[Figure: decomposition of ln p(x|θ) into L(q, θ) plus KL(q ‖ p). From C. Bishop, Pattern Recognition & Machine Learning.]
E-Step: setting q(z) = p(z|x, θ) makes KL(q ‖ p) zero, so the bound L(q, θ) touches ln p(x|θ).
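
For completeness, a short derivation of the decomposition pictured above (standard, following Bishop Ch. 9; not transcribed from the slide):

```latex
\ln p(x \mid \theta)
  = \sum_z q(z) \ln p(x \mid \theta)
  = \sum_z q(z) \ln \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)}
  = \underbrace{\sum_z q(z) \ln \frac{p(x, z \mid \theta)}{q(z)}}_{\mathcal{L}(q, \theta)}
  + \underbrace{\sum_z q(z) \ln \frac{q(z)}{p(z \mid x, \theta)}}_{\mathrm{KL}(q \,\|\, p)}
```

Since KL(q ‖ p) ≥ 0, we have L(q, θ) ≤ ln p(x|θ), with equality exactly when q(z) = p(z|x, θ), which is the E-step choice above.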

EM: Expectation Step
ln p(x|θ) ≥ Σ_z q(z) ln p(x, z|θ) − Σ_z q(z) ln q(z) ≜ L(q, θ)
E-Step: q^(t) = arg max_q L(q, θ^(t−1)).
General solution, for any probabilistic model: q^(t)(z) = p(z | x, θ^(t−1)), the posterior distribution given the current parameters.
For a directed graphical model: θ fixes the conditional distribution of every child node given its parents; the observed nodes x are the training data, the unobserved nodes z are the hidden data.
Inference: Find the summary statistics of the posterior needed for the following M-step.
[Figure: directed graphical model with hidden nodes z_1, z_2, z_3, observed nodes x_1, x_2, x_3, and parameters θ.]
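
To make the "general solution" concrete, here is an illustrative E-step (my own toy example, not from the slides) for a tiny discrete directed model z → x: the posterior q(z) = p(z | x, θ) is obtained by pointwise multiplication and normalization.

```python
import numpy as np

# Toy directed model: p(z, x | theta) = p(z) p(x | z), with z in {0,1}, x in {0,1,2}
p_z = np.array([0.6, 0.4])                      # p(z)
p_x_given_z = np.array([[0.7, 0.2, 0.1],        # p(x | z=0)
                        [0.1, 0.3, 0.6]])       # p(x | z=1)

def e_step_posterior(x_obs):
    """E-step for one observation: q(z) = p(z | x, theta) via Bayes' rule."""
    joint = p_z * p_x_given_z[:, x_obs]         # p(z, x_obs | theta), shape (2,)
    return joint / joint.sum()                  # normalize over z

print(e_step_posterior(2))                      # posterior responsibilities q(z | x=2)
```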

EM: Maximization Step
ln p(x|θ) ≥ Σ_z q(z) ln p(x, z|θ) − Σ_z q(z) ln q(z) ≜ L(q, θ)
M-Step: θ^(t) = arg max_θ L(q^(t), θ) = arg max_θ Σ_z q(z) ln p(x, z|θ).
Recall the directed graphical model factorization, with y = {x, z}:
log p(y|θ) = Σ_{n=1}^N Σ_{i∈V} log p(y_{i,n} | y_{pa(i),n}, θ_i) = Σ_{i∈V} [ Σ_{n=1}^N log p(y_{i,n} | y_{pa(i),n}, θ_i) ]
E_q[log p(y|θ)] = Σ_{i∈V} [ Σ_{n=1}^N E_{q_i}[ log p(y_{i,n} | y_{pa(i),n}, θ_i) ] ],   q_i(y_i, y_{pa(i)}) = p(y_i, y_{pa(i)} | θ^old, x)
From the E-step, we only require the posterior marginal distributions of each node and its parents, given the observations (which have probability one) for that training instance.
[Figure: directed graphical model with hidden nodes z_1, z_2, z_3 and observed nodes x_1, x_2, x_3.]
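
Continuing the toy z → x example from the E-step sketch above (again illustrative, not from the lecture), the M-step for a discrete directed model reduces to normalizing expected counts for each local conditional:

```python
import numpy as np

# Posterior responsibilities q_n(z) from the E-step, one row per training case x_n
x_data = np.array([0, 2, 1, 2, 0])                   # observed values of x
resp = np.array([[0.9, 0.1],                         # q_n(z) for each n (toy numbers)
                 [0.2, 0.8],
                 [0.5, 0.5],
                 [0.2, 0.8],
                 [0.9, 0.1]])

# M-step for p(z): expected counts of z, normalized
p_z_new = resp.sum(axis=0) / resp.sum()

# M-step for p(x | z): expected joint counts of (z, x), normalized over x for each z
counts = np.zeros((2, 3))
for n, x_n in enumerate(x_data):
    counts[:, x_n] += resp[n]                        # weight each observation by q_n(z)
p_x_given_z_new = counts / counts.sum(axis=1, keepdims=True)

print(p_z_new)
print(p_x_given_z_new)
```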

EM: A Sequence of Lower Bounds
[Figure: each EM iteration builds a lower bound L(q, θ) that touches ln p(x|θ) at θ^old and is then maximized to give θ^new. From C. Bishop, Pattern Recognition & Machine Learning.]

Expectation Maximization Algorithm
[Figure: the E-step (infer marginals) sets KL(q ‖ p) = 0 so that L(q, θ^old) = ln p(x|θ^old); the M-step (optimize parameters) gives L(q, θ^new) ≤ ln p(x|θ^new); the marginals are then re-inferred. From C. Bishop, Pattern Recognition & Machine Learning.]
E-Step: Optimize the distribution on the hidden variables given the parameters.
M-Step: Optimize the parameters given the distribution on the hidden variables.

M-Step for Exponential Families
Exponential Family: p(x, z|θ) = exp{ θ^T φ(x, z) − A(θ) }.
E-step produces: q(z_i), i = 1, ..., N, with γ_ik ≜ q(z_i = k).
M-step objective: L(θ) = Σ_{i=1}^N E_{q_i}[ log p(x_i, z_i|θ) ] = Σ_{i=1}^N E_{q_i}[ θ^T φ(x_i, z_i) − A(θ) ].
Taking the gradient shows that the optimal parameters satisfy E_θ̂[φ(x, z)] = (1/N) Σ_{i=1}^N Σ_k γ_ik φ(x_i, k).
As in basic mixture models, the solution always matches moments:
Ø For observed variables, the empirical distribution of the data.
Ø For hidden variables, the weighted distribution from the E-step.
In directed graphical models, apply this to each local conditional.
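
To see the "optimum matches moments" claim numerically, consider a single weighted Bernoulli factor in natural parameterization (an illustrative example of mine, not from the slide): gradient ascent on the M-step objective converges to the parameter whose model moment equals the responsibility-weighted empirical moment.

```python
import numpy as np

# Weighted Bernoulli in natural parameterization: p(x|theta) = exp(theta*x - A(theta)),
# with A(theta) = log(1 + exp(theta)), so E_theta[x] = sigmoid(theta).
x = np.array([1., 0., 1., 1., 0., 1.])
w = np.array([0.9, 0.2, 0.5, 0.8, 0.1, 0.7])       # E-step weights (responsibilities)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.0
for _ in range(2000):                              # simple gradient ascent on L(theta)
    grad = np.sum(w * (x - sigmoid(theta)))        # d/dtheta of sum_i w_i [theta x_i - A(theta)]
    theta += 0.05 * grad

print(sigmoid(theta))                              # model moment E_theta[x]
print(np.sum(w * x) / np.sum(w))                   # weighted empirical moment: they match
```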

EM for MAP Estimation
Up to a constant independent of θ,
ln p(θ|x) = ln p(θ) + ln p(x|θ) = ln p(θ) + ln Σ_z p(x, z|θ)
ln p(θ|x) ≥ ln p(θ) + Σ_z q(z) ln p(x, z|θ) − Σ_z q(z) ln q(z) ≜ L(q, θ)
Initialization: Randomly select starting parameters θ^(0).
E-Step: Given parameters, find the posterior of the hidden data: q^(t)(z) = p(z | x, θ^(t−1)), i.e. q^(t) = arg max_q L(q, θ^(t−1)).
M-Step: Given posterior distributions, find likely parameters: θ^(t) = arg max_θ L(q^(t), θ).
Iteration: Alternate E-step and M-step until convergence.
The M-step objective is the sum of the log-prior and the weighted (expected complete-data) log-likelihood.
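
As one illustrative case (my own choice of prior, not from the slide): with a Dirichlet(α) prior on the mixing weights of a K-component mixture, the only change to the M-step is that the expected counts are shifted by the prior's pseudo-counts.

```python
import numpy as np

# Expected component counts N_k = sum_i gamma_ik from the E-step (toy numbers)
Nk = np.array([12.4, 30.1, 7.5])
N = Nk.sum()
alpha = np.array([2.0, 2.0, 2.0])     # Dirichlet hyperparameters (assumed prior, alpha_k >= 1)

# ML M-step: maximize the weighted log-likelihood alone
pi_ml = Nk / N

# MAP M-step: maximize log-prior + weighted log-likelihood
pi_map = (Nk + alpha - 1) / (N + alpha.sum() - len(alpha))

print(pi_ml, pi_map)
```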

Undirected Graphical Models
p(x|θ) = (1/Z(θ)) ∏_{f∈F} ψ_f(x_f | θ_f),   Z(θ) = Σ_x ∏_{f∈F} ψ_f(x_f | θ_f)
F: set of hyperedges linking subsets of nodes, f ⊆ V.
V: set of N nodes or vertices, {1, 2, ..., N}.
Assume an exponential family representation of each factor: ψ_f(x_f | θ_f) = exp{ θ_f^T φ_f(x_f) }, so that
p(x|θ) = exp{ Σ_{f∈F} θ_f^T φ_f(x_f) − A(θ) },   A(θ) = log Z(θ).
The partition function globally couples the local factor parameters.
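
For intuition (an illustrative toy of mine, not from the lecture): for a three-node binary chain with pairwise agreement factors, the partition function Z(θ) can be computed by brute-force enumeration over all joint states, which makes the global coupling explicit.

```python
import numpy as np
from itertools import product

# Binary chain x1 - x2 - x3 with pairwise log-potentials theta_f * phi_f(x_f),
# where phi_f(x_f) = 1[x_i = x_j] (agreement feature) for each edge f.
edges = [(0, 1), (1, 2)]
theta = np.array([1.2, 0.8])                    # one parameter per edge factor

def log_score(x):
    """Unnormalized log-probability: sum_f theta_f * phi_f(x_f)."""
    return sum(t * float(x[i] == x[j]) for t, (i, j) in zip(theta, edges))

# Partition function Z(theta) = sum_x exp{ sum_f theta_f phi_f(x_f) }
states = list(product([0, 1], repeat=3))
logZ = np.log(sum(np.exp(log_score(x)) for x in states))

# Normalized probability of one joint configuration
x = (1, 1, 0)
print(np.exp(log_score(x) - logZ))
```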

Learning for Undirected Models
The undirected graph encodes dependencies within a single training example:
p(x|θ) = exp{ Σ_{f∈F} θ_f^T φ_f(x_f) − A(θ) },   ψ_f(x_f | θ_f) = exp{ θ_f^T φ_f(x_f) },   A(θ) = log Z(θ).
Given N independent, identically distributed, completely observed samples D = {x_{V,1}, ..., x_{V,N}}:
p(D|θ) = ∏_{n=1}^N (1/Z(θ)) ∏_{f∈F} ψ_f(x_{f,n} | θ_f)
log p(D|θ) = Σ_{n=1}^N Σ_{f∈F} θ_f^T φ_f(x_{f,n}) − N A(θ)
The partition function globally couples the local factor parameters.

Learning for Undirected Models
The undirected graph encodes dependencies within a single training example:
p(D|θ) = ∏_{n=1}^N (1/Z(θ)) ∏_{f∈F} ψ_f(x_{f,n} | θ_f),   D = {x_{V,1}, ..., x_{V,N}}
Given N independent, identically distributed, completely observed samples:
log p(D|θ) = Σ_{n=1}^N Σ_{f∈F} θ_f^T φ_f(x_{f,n}) − N A(θ)
Take the gradient with respect to the parameters of a single factor:
∇_{θ_f} log p(D|θ) = Σ_{n=1}^N φ_f(x_{f,n}) − N E_θ[φ_f(x_f)]
We must be able to compute marginal distributions for the factors in the current model:
Ø Tractable for tree-structured factor graphs via sum-product.
Ø What about general factor graphs or undirected graphs?
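
Continuing the toy three-node chain from above (illustrative only), the gradient for each edge factor is the empirical count of its agreement feature minus N times the model expectation of that feature, here computed by brute-force enumeration; for larger tree-structured graphs the expectation would instead come from sum-product.

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2)]
theta = np.array([0.5, -0.3])
data = np.array([[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 1]])   # N = 4 observed samples

def features(x):
    """phi_f(x_f) = 1[x_i = x_j] for each edge f."""
    return np.array([float(x[i] == x[j]) for i, j in edges])

states = np.array(list(product([0, 1], repeat=3)))

def model_expectation(theta):
    """E_theta[phi_f(x_f)] by enumerating all joint states of the chain."""
    scores = np.array([theta @ features(x) for x in states])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ np.array([features(x) for x in states])

# Gradient of log p(D | theta): empirical feature counts minus N * model expectations
empirical = sum(features(x) for x in data)
grad = empirical - len(data) * model_expectation(theta)

theta = theta + 0.1 * grad          # one gradient-ascent step on the log-likelihood
print(grad, theta)
```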