Probabilistic Graphical Models

Probabilistic Graphical Models, Brown University CSCI 2950-P, Spring 2013. Prof. Erik Sudderth. Lecture 11: Inference & Learning Overview, Gaussian Graphical Models. Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models.

Graphical Models, Inference, Learning
- Graphical Model: A factorized probability representation
  - Directed: Sequential, causal structure for the generative process
  - Undirected: Associate features with edges, cliques, or factors
- Inference: Given the model, find marginals of hidden variables
  - Standardize: Convert directed models to an equivalent undirected form
  - Sum-product BP: Exact for any tree-structured graph
  - Junction tree: Convert a loopy graph to a consistent clique tree

Undirected Inference Algorithms
- Tree: one marginal via elimination applied recursively to the leaves of the tree; all marginals via belief propagation (the sum-product algorithm)
- Graph: one marginal via the elimination algorithm; all marginals via the junction tree algorithm (belief propagation on a junction tree)

A junction tree is a clique tree with special properties:
- Consistency: clique nodes corresponding to any variable from the original model form a connected subtree
- Construction: triangulations and elimination orderings
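
To make the one-marginal column concrete, here is a minimal sketch (not from the lecture) of variable elimination on a small discrete chain with tabular potentials; the names `node_pot` and `edge_pot` and all numerical values are illustrative.

```python
import numpy as np

# Minimal variable-elimination sketch for a discrete chain x1 - x2 - x3 - x4.
# Potentials are illustrative tables; K is the number of states per variable.
K = 3
rng = np.random.default_rng(0)
node_pot = [rng.random(K) + 0.1 for _ in range(4)]          # psi_s(x_s)
edge_pot = [rng.random((K, K)) + 0.1 for _ in range(3)]     # psi_{s,s+1}(x_s, x_{s+1})

# Eliminate x1, x2, x3 in order to obtain the (unnormalized) marginal of x4.
message = np.ones(K)
for s in range(3):
    # Multiply the incoming message into the node potential, then sum out variable s.
    message = (message * node_pot[s]) @ edge_pot[s]

unnormalized = message * node_pot[3]
marginal_x4 = unnormalized / unnormalized.sum()   # normalizing also yields Z implicitly
print(marginal_x4)
```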

Graphical Models, Inference, Learning
- Graphical Model: A factorized probability representation
  - Directed: Sequential, causal structure for the generative process
  - Undirected: Associate features with edges, cliques, or factors
- Inference: Given the model, find marginals of hidden variables
  - Standardize: Convert directed models to an equivalent undirected form
  - Sum-product BP: Exact for any tree-structured graph
  - Junction tree: Convert a loopy graph to a consistent clique tree
- Learning: Given a set of complete observations of all variables
  - Directed: Decomposes into independent learning problems: predict the distribution of each child given its parents
  - Undirected: Global normalization globally couples the parameters: gradients are computable by inferring clique/factor marginals
- Learning: Given a set of partial observations of some variables
  - E-Step: Infer marginal distributions of the hidden variables
  - M-Step: Optimize parameters to match the E-step and data statistics

Learning for Undirected Models

An undirected graph encodes dependencies within a single training example:

$$p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_f \mid \theta_f), \qquad \psi_f(x_f \mid \theta_f) = \exp\{\theta_f^T \phi_f(x_f)\}$$

Given $N$ independent, identically distributed, completely observed samples $\mathcal{D} = \{x_{\mathcal{V},1}, \ldots, x_{\mathcal{V},N}\}$:

$$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_{f,n} \mid \theta_f)$$

$$\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_{f,n}) - N A(\theta), \qquad A(\theta) = \log Z(\theta)$$

Learning for Undirected Models

Taking the gradient of $\log p(\mathcal{D} \mid \theta)$ with respect to the parameters of a single factor:

$$\nabla_{\theta_f} \log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \phi_f(x_{f,n}) - N \, \mathbb{E}_{\theta}[\phi_f(x_f)]$$

We must be able to compute marginal distributions for the factors in the current model:
- Tractable for tree-structured factor graphs via the sum-product algorithm
- For general graphs, use the junction tree algorithm
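
As a concrete illustration of this gradient, the following sketch (not from the lecture) evaluates it for a toy fully observed binary model with two pairwise factors and indicator features, computing the model expectation $\mathbb{E}_{\theta}[\phi_f(x_f)]$ by brute-force enumeration rather than sum-product or junction tree; all names and data are illustrative.

```python
import itertools
import numpy as np

# Toy fully observed binary model with two pairwise factors: (x0, x1) and (x1, x2).
# Indicator features, so theta_f^T phi_f(x_f) is just a table lookup theta[f][x_f].
factors = [(0, 1), (1, 2)]
theta = {f: np.zeros((2, 2)) for f in factors}

def log_unnormalized(x, theta):
    return sum(theta[f][x[f[0]], x[f[1]]] for f in factors)

def factor_marginals(theta):
    """Brute-force enumeration of p(x_f | theta) -- feasible only for toy problems."""
    states = list(itertools.product([0, 1], repeat=3))
    weights = np.array([np.exp(log_unnormalized(x, theta)) for x in states])
    probs = weights / weights.sum()
    marg = {f: np.zeros((2, 2)) for f in factors}
    for p, x in zip(probs, states):
        for f in factors:
            marg[f][x[f[0]], x[f[1]]] += p
    return marg

def gradient(data, theta):
    """grad_f = sum_n phi_f(x_{f,n}) - N * E_theta[phi_f(x_f)], as on the slide."""
    N = len(data)
    marg = factor_marginals(theta)
    grad = {}
    for f in factors:
        counts = np.zeros((2, 2))
        for x in data:
            counts[x[f[0]], x[f[1]]] += 1.0
        grad[f] = counts - N * marg[f]
    return grad

data = [(0, 0, 1), (1, 1, 1), (0, 1, 1)]   # illustrative observations
print(gradient(data, theta))
```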

Undirected Optimization Strategies

$$\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_{f,n}) - N A(\theta), \qquad \nabla_{\theta_f} \log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \phi_f(x_{f,n}) - N \, \mathbb{E}_{\theta}[\phi_f(x_f)]$$

Gradient ascent: quasi-Newton methods such as PCG and L-BFGS
- Gradients: difference between the statistics of the observed data and the inferred statistics of the model at the current iteration
- Objective: explicitly compute the log-normalization $A(\theta)$ (a variant of BP)

Coordinate ascent: maximize the objective with respect to the parameters of a single factor, keeping all other factors fixed
- Simple closed form depending on the ratio between the factor marginal under the current model and the empirical marginal from the data
- Iterative proportional fitting (IPF) and generalized iterative scaling algorithms:

$$\psi_f^{(t+1)}(x_f) = \psi_f^{(t)}(x_f) \, \frac{\hat{p}(x_f)}{p_f^{(t)}(x_f)}$$

where $\hat{p}(x_f)$ is the empirical marginal and $p_f^{(t)}(x_f)$ is the factor marginal under the current model.
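
A minimal sketch (not from the lecture) of the IPF update above on a toy binary pairwise model, again obtaining the current model marginals by brute-force enumeration; all names and data are illustrative.

```python
import itertools
import numpy as np

# IPF sweeps on a toy binary model with pairwise potentials psi[f], f = (s, t).
# Update: psi_f <- psi_f * (empirical marginal / current model marginal).
factors = [(0, 1), (1, 2)]
psi = {f: np.ones((2, 2)) for f in factors}
data = [(0, 0, 1), (1, 1, 1), (0, 1, 1)]          # illustrative observations

def model_marginals(psi):
    """Exact factor marginals by enumeration (toy-sized problems only)."""
    states = list(itertools.product([0, 1], repeat=3))
    w = np.array([np.prod([psi[f][x[f[0]], x[f[1]]] for f in factors]) for x in states])
    p = w / w.sum()
    marg = {f: np.zeros((2, 2)) for f in factors}
    for pi, x in zip(p, states):
        for f in factors:
            marg[f][x[f[0]], x[f[1]]] += pi
    return marg

def empirical_marginal(f, data):
    counts = np.zeros((2, 2))
    for x in data:
        counts[x[f[0]], x[f[1]]] += 1.0
    return counts / len(data)

for _ in range(50):                                # IPF iterations
    for f in factors:                              # coordinate ascent over factors
        marg = model_marginals(psi)[f]
        psi[f] = psi[f] * empirical_marginal(f, data) / np.maximum(marg, 1e-12)

print(model_marginals(psi))                        # approaches the empirical marginals
```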

Advanced Topics on the Horizon

Graph structure learning: since $\psi_f(x_f \mid \theta_f) = \exp\{\theta_f^T \phi_f(x_f)\}$, setting a factor's parameters to zero implicitly removes that factor from the model
- Feature selection: search-based, sparsity-inducing priors, ...
- Topologies: tree-structured, directed, bounded treewidth, ...

Approximate inference: what if the junction tree is intractable?
- Simulation-based (Monte Carlo) approximations
- Optimization-based (variational) approximations
- Inner loop of algorithms for approximate learning

Alternative objectives
- Max-product: global MAP configuration of the hidden variables
- Discriminative learning: CRFs, max-margin Markov networks, ...

Inference with continuous variables
- Gaussian: closed-form mean and covariance recursions
- Non-Gaussian: variational and Monte Carlo approximations

Pairwise Markov Random Fields

$$p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t) \prod_{s \in \mathcal{V}} \psi_s(x_s)$$

- $\mathcal{V}$: set of $N$ nodes or vertices, $\{1, 2, \ldots, N\}$
- $\mathcal{E}$: set of undirected edges $(s,t)$ linking pairs of nodes
- $Z$: normalization constant (partition function)

- Simple parameterization, but still expressive and widely used in practice
- Guaranteed Markov with respect to the graph
- Any jointly Gaussian distribution can be represented by only pairwise potentials

Inference in Undirected Trees

For a tree, the maximal cliques are always pairs of nodes:

$$p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t) \prod_{s \in \mathcal{V}} \psi_s(x_s)$$

Belief Propagation (Integral-Product)

BELIEFS: posterior marginals

$$\hat{p}_t(x_t) \propto \psi_t(x_t) \prod_{u \in \Gamma(t)} m_{ut}(x_t)$$

MESSAGES: sufficient statistics

$$m_{ts}(x_s) \propto \int_{x_t} \psi_{st}(x_s, x_t) \, \psi_t(x_t) \prod_{u \in \Gamma(t) \setminus s} m_{ut}(x_t) \, dx_t$$

$\Gamma(t)$: neighborhood of node $t$ (adjacent nodes)

I) Message product    II) Message propagation
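
For discrete variables the integral becomes a sum, and the message and belief equations can be run directly on tables. The sketch below (not from the lecture) does this for a three-node chain; potentials and names are illustrative.

```python
import numpy as np

# Discrete sum-product on a 3-node chain x0 - x1 - x2 (a tree), K states per node.
K, rng = 2, np.random.default_rng(1)
node_pot = [rng.random(K) + 0.1 for _ in range(3)]                    # psi_t(x_t)
edge_pot = {(0, 1): rng.random((K, K)) + 0.1,                         # psi_{st}(x_s, x_t)
            (1, 2): rng.random((K, K)) + 0.1}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def psi(s, t):
    # Return psi_{st} as a table indexed [x_s, x_t].
    return edge_pot[(s, t)] if (s, t) in edge_pot else edge_pot[(t, s)].T

# m_{ts}(x_s) ∝ sum_{x_t} psi_{st}(x_s, x_t) psi_t(x_t) prod_{u in nbrs(t) \ s} m_{ut}(x_t)
msgs = {(t, s): np.ones(K) for s in nbrs for t in nbrs[s]}
for _ in range(5):                                  # enough sweeps for a 3-node tree
    new = {}
    for (t, s) in msgs:
        prod = node_pot[t].copy()
        for u in nbrs[t]:
            if u != s:
                prod *= msgs[(u, t)]
        m = psi(s, t) @ prod                        # sum over x_t
        new[(t, s)] = m / m.sum()
    msgs = new

# Belief: p_hat_t(x_t) ∝ psi_t(x_t) prod_{u in nbrs(t)} m_{ut}(x_t)
for t in range(3):
    b = node_pot[t].copy()
    for u in nbrs[t]:
        b *= msgs[(u, t)]
    print(t, b / b.sum())
```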

BP for Continuous Variables
- Is there a finitely parameterized, closed form for the message and marginal functions?
- Is there an analytic formula for the message integral, phrased as an update of these parameters?

Covariance and Correlation

- Covariance: $\Sigma \in \mathbb{R}^{d \times d}$
  - Always positive semidefinite: $u^T \Sigma u \ge 0$ for any $u \in \mathbb{R}^d$
  - Often positive definite: $u^T \Sigma u > 0$ for any $u \in \mathbb{R}^d$, $u \ne 0$
- Correlation: covariance normalized by the marginal standard deviations, $\rho_{st} = \Sigma_{st} / \sqrt{\Sigma_{ss}\Sigma_{tt}}$
- Independence: independent variables are uncorrelated (zero covariance); the converse does not hold in general

Gaussian Distributions

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

- Simplest joint distribution that can capture an arbitrary mean & covariance
- Justifications come from the central limit theorem and the maximum entropy criterion
- The probability density above assumes the covariance is positive definite
- ML parameter estimates are the sample mean & sample covariance
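
A short numpy sketch (not from the lecture) of the last bullet: the ML estimates are the sample mean and the $1/N$-normalized sample covariance, and the density can then be evaluated whenever the estimated covariance is positive definite; the data here are synthetic.

```python
import numpy as np

# Synthetic data from a 2-D Gaussian; the ML estimates are the sample mean and
# the (1/N-normalized) sample covariance.
rng = np.random.default_rng(0)
true_mu, true_Sigma = np.array([1.0, -2.0]), np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)        # shape (N, d)

mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / X.shape[0]             # ML uses 1/N, not 1/(N-1)

def gaussian_logpdf(x, mu, Sigma):
    """log N(x | mu, Sigma); assumes Sigma is positive definite."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

print(mu_hat, Sigma_hat, gaussian_logpdf(np.zeros(2), mu_hat, Sigma_hat))
```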

Two-Dimensional Gaussians

[Figure: plots of two-dimensional Gaussian densities with spherical, diagonal, and full covariance matrices.]

Gaussian Geometry

Eigenvalues and eigenvectors: $\Sigma u_i = \lambda_i u_i$, $i = 1, \ldots, d$

For a symmetric matrix: $\lambda_i \in \mathbb{R}$, $u_i^T u_i = 1$, $u_i^T u_j = 0$ for $i \ne j$, and

$$\Sigma = U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T$$

For a positive semidefinite matrix: $\lambda_i \ge 0$. For a positive definite matrix: $\lambda_i > 0$, and

$$\Sigma^{-1} = U \Lambda^{-1} U^T = \sum_{i=1}^{d} \frac{1}{\lambda_i} u_i u_i^T$$

Rotated coordinates: $y_i = u_i^T (x - \mu)$
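
These identities are easy to check numerically; the sketch below (not from the lecture) uses np.linalg.eigh to decompose an illustrative covariance matrix and verify the reconstruction, inverse, and rotated coordinates.

```python
import numpy as np

# Eigendecomposition of a symmetric positive definite covariance matrix.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
lam, U = np.linalg.eigh(Sigma)          # eigenvalues lam_i (real), orthonormal columns u_i

# Reconstruction Sigma = U diag(lam) U^T = sum_i lam_i u_i u_i^T
assert np.allclose(Sigma, (U * lam) @ U.T)

# Positive definite => all eigenvalues strictly positive, and
# Sigma^{-1} = U diag(1/lam) U^T.
assert np.all(lam > 0)
assert np.allclose(np.linalg.inv(Sigma), (U / lam) @ U.T)

# Rotated coordinates y = U^T (x - mu) decorrelate the Gaussian.
mu = np.array([1.0, -2.0])
x = np.array([0.5, 0.0])
y = U.T @ (x - mu)
print(lam, y)
```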

Probabilistic PCA & Factor Analysis

Both models: data is a linear function of low-dimensional latent coordinates, plus Gaussian noise:

$$p(z_i) = \mathcal{N}(z_i \mid 0, I) \qquad p(x_i \mid z_i, \theta) = \mathcal{N}(x_i \mid W z_i + \mu, \Psi) \qquad p(x_i \mid \theta) = \mathcal{N}(x_i \mid \mu, W W^T + \Psi)$$

The marginal likelihood has a low-rank covariance parameterization.

- Factor analysis: $\Psi$ is a general diagonal matrix
- Probabilistic PCA: $\Psi$ is a multiple of the identity matrix, $\Psi = \sigma^2 I$

[Figure: geometric illustration of the latent-variable model, from C. Bishop, Pattern Recognition & Machine Learning.]
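
A short sketch (not from the lecture) checking the marginal covariance $W W^T + \Psi$ for probabilistic PCA ($\Psi = \sigma^2 I$) by sampling from the generative model; $W$, $\mu$, and $\sigma$ are arbitrary illustrative values.

```python
import numpy as np

# Probabilistic PCA generative model: z ~ N(0, I_k), x = W z + mu + eps,
# eps ~ N(0, sigma^2 I_d). Marginally, x ~ N(mu, W W^T + sigma^2 I_d).
rng = np.random.default_rng(0)
d, k, sigma = 3, 1, 0.5
W = rng.normal(size=(d, k))
mu = np.array([1.0, 0.0, -1.0])

N = 200_000
Z = rng.normal(size=(N, k))
X = Z @ W.T + mu + sigma * rng.normal(size=(N, d))

empirical_cov = np.cov(X, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(d)         # low-rank plus spherical noise
print(np.round(empirical_cov, 2))
print(np.round(model_cov, 2))                      # the two should roughly agree
```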

Gaussian Graphical Models

$$x \sim \mathcal{N}(\mu, \Sigma), \qquad J = \Sigma^{-1}$$

where $J$ is the precision (information) matrix.
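
A small sketch (not from the lecture) of the information form for a Gaussian Markov chain; that the sparsity pattern of $J$ matches the graph edges is a standard Gaussian graphical model fact, assumed here rather than taken from the slide.

```python
import numpy as np

# Gaussian graphical model in information form: J = Sigma^{-1}, h = J mu.
# For a Gaussian Markov chain x1 - x2 - x3, the precision matrix J is tridiagonal:
# J[s, t] = 0 exactly when (s, t) is not an edge of the graph (a standard GGM fact,
# not spelled out on the slide above).
J = np.array([[ 2.0, -0.9,  0.0],
              [-0.9,  2.0, -0.9],
              [ 0.0, -0.9,  2.0]])
mu = np.array([0.0, 1.0, 2.0])
h = J @ mu                                   # potential vector
Sigma = np.linalg.inv(J)

# The covariance is generally dense even though J is sparse.
print(np.round(Sigma, 3))
print(np.round(Sigma @ h, 3))                # recovers mu
```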

Gaussian Potentials