Probabilistic Graphical Models
Brown University CSCI 295-P, Spring 2013
Prof. Erik Sudderth
Lecture 11: Inference & Learning Overview, Gaussian Graphical Models
Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models
Graphical Models, Inference, Learning
Graphical Model: A factorized probability representation
• Directed: Sequential, causal structure for generative process
• Undirected: Associate features with edges, cliques, or factors
Inference: Given the model, find marginals of hidden variables
• Standardize: Convert directed to equivalent undirected form
• Sum-product BP: Exact for any tree-structured graph
• Junction tree: Convert loopy graph to consistent clique tree
Undirected Inference Algorithms
• One marginal, tree: elimination applied recursively to the leaves of the tree
• All marginals, tree: belief propagation (sum-product) algorithm
• One marginal, general graph: graph elimination algorithm
• All marginals, general graph: junction tree algorithm (belief propagation on a junction tree)
A junction tree is a clique tree with special properties:
• Consistency: Clique nodes corresponding to any variable from the original model form a connected subtree
• Construction: Triangulations and elimination orderings
Graphical Models, Inference, Learning
Learning: Given a set of complete observations of all variables
• Directed: Decomposes into independent learning problems: predict the distribution of each child given its parents
• Undirected: Global normalization globally couples parameters; gradients computable by inferring clique/factor marginals
Learning: Given a set of partial observations of some variables
• E-Step: Infer marginal distributions of hidden variables
• M-Step: Optimize parameters to match E-step and data statistics
Learning for Undirected Models
Undirected graph encodes dependencies within a single training example, with exponential family factors
\psi_f(x_f \mid \theta_f) = \exp\{\theta_f^T \phi_f(x_f)\}
Given N independent, identically distributed, completely observed samples \mathcal{D} = \{x_{V,1}, \ldots, x_{V,N}\}:
p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_{f,n} \mid \theta_f)
\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_{f,n}) - N A(\theta), \qquad A(\theta) = \log Z(\theta)
Learning for Undirected Models
Undirected graph encodes dependencies within a single training example:
p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_{f,n} \mid \theta_f), \qquad \mathcal{D} = \{x_{V,1}, \ldots, x_{V,N}\}
Given N independent, identically distributed, completely observed samples:
\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_{f,n}) - N A(\theta)
Take the gradient with respect to the parameters of a single factor:
\nabla_{\theta_f} \log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \phi_f(x_{f,n}) - N \, \mathbb{E}_\theta[\phi_f(x_f)]
Must be able to compute marginal distributions for factors in the current model:
• Tractable for tree-structured factor graphs via sum-product
• For general graphs, use the junction tree algorithm
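A minimal NumPy sketch of this gradient for a single discrete factor whose features are stored as a table (one row per factor state); the array layout and the name factor_gradient are illustrative assumptions, not notation from the slides.

```python
import numpy as np

def factor_gradient(phi_f, samples_f, model_marginal):
    """Gradient of the log-likelihood w.r.t. theta_f for one discrete factor.

    phi_f          : (n_states, n_features) array, row k = phi_f(x_f = k)
    samples_f      : (N,) integer array of observed factor states x_{f,n}
    model_marginal : (n_states,) current model marginal for this factor
    Returns sum_n phi_f(x_{f,n}) - N * E_theta[phi_f(x_f)].
    """
    N = samples_f.shape[0]
    empirical = phi_f[samples_f].sum(axis=0)   # sum of observed feature vectors
    expected = N * (model_marginal @ phi_f)    # N times the model expectation
    return empirical - expected
```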
Undirected Optimization Strategies
\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_{f,n}) - N A(\theta)
\nabla_{\theta_f} \log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \phi_f(x_{f,n}) - N \, \mathbb{E}_\theta[\phi_f(x_f)]
Gradient Ascent: Quasi-Newton methods such as PCG and L-BFGS
• Gradients: Difference between statistics of the observed data and inferred statistics of the model at the current iteration
• Objective: Explicitly compute the log-normalization A(θ) (via a variant of BP)
Coordinate Ascent: Maximize the objective with respect to the parameters of a single factor, keeping all other factors fixed
• Simple closed form depending on the ratio between the factor marginal under the current model and the empirical marginal from the data
• Iterative proportional fitting (IPF) and generalized iterative scaling algorithms (see the sketch below):
\psi_f^{(t+1)}(x_f) = \psi_f^{(t)}(x_f) \, \frac{\hat{p}(x_f)}{p_f^{(t)}(x_f)}
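A minimal sketch of one IPF step for a single discrete factor, assuming the model's factor marginal has already been computed (e.g., by sum-product or the junction tree algorithm); the array names, shapes, and the toy binary example are illustrative.

```python
import numpy as np

def ipf_update(psi_f, empirical_marginal, model_marginal, eps=1e-12):
    """One iterative proportional fitting step for a single discrete factor.

    psi_f              : current factor table psi_f^{(t)}(x_f)
    empirical_marginal : empirical marginal of x_f from the training data
    model_marginal     : factor marginal of x_f under the current model
    All three arrays share the same shape (one entry per joint factor state).
    """
    # Rescale the factor so its model marginal matches the empirical marginal.
    return psi_f * empirical_marginal / (model_marginal + eps)

# Illustrative usage for a single pairwise factor over two binary variables.
psi = np.ones((2, 2))                        # initial factor table
emp = np.array([[0.4, 0.1], [0.2, 0.3]])     # empirical pairwise marginal
mod = np.full((2, 2), 0.25)                  # current model's pairwise marginal
psi = ipf_update(psi, emp, mod)
```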
Advanced Topics on the Horizon
Graph Structure Learning
• Since \psi_f(x_f \mid \theta_f) = \exp\{\theta_f^T \phi_f(x_f)\}, setting factor parameters to zero implicitly removes the factor from the model
• Feature selection: search-based methods, sparsity-inducing priors, ...
• Topologies: tree-structured, directed, bounded treewidth, ...
Approximate Inference: What if the junction tree is intractable?
• Simulation-based (Monte Carlo) approximations
• Optimization-based (variational) approximations
• Inner loop of algorithms for approximate learning
Alternative Objectives
• Max-Product: Global MAP configuration of hidden variables
• Discriminative learning: CRF, max-margin Markov network, ...
Inference with Continuous Variables
• Gaussian: Closed form mean and covariance recursions
• Non-Gaussian: Variational and Monte Carlo approximations
Pairwise Markov Random Fields
p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t) \prod_{s \in \mathcal{V}} \psi_s(x_s)
• \mathcal{V}: set of N nodes or vertices, \{1, 2, \ldots, N\}
• \mathcal{E}: set of undirected edges (s,t) linking pairs of nodes
• Z: normalization constant (partition function)
• Simple parameterization, but still expressive and widely used in practice
• Guaranteed Markov with respect to the graph
• Any jointly Gaussian distribution can be represented by only pairwise potentials
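To make the pairwise factorization concrete, a small sketch that evaluates the unnormalized probability of one joint configuration; the dict-of-tables data structures are an illustrative choice, and computing Z itself would require summing this quantity over all configurations.

```python
import numpy as np

def unnormalized_prob(x, edges, edge_pots, node_pots):
    """Unnormalized probability of a joint configuration x in a pairwise MRF.

    x         : mapping from node s to its discrete state x_s
    edges     : list of (s, t) pairs in the edge set E
    edge_pots : dict mapping (s, t) to a 2D table psi_st(x_s, x_t)
    node_pots : dict mapping s to a 1D table psi_s(x_s)
    """
    p = 1.0
    for s, pot in node_pots.items():
        p *= pot[x[s]]
    for (s, t) in edges:
        p *= edge_pots[(s, t)][x[s], x[t]]
    return p
```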
Inference in Undirected Trees
For a tree, the maximal cliques are always pairs of nodes:
p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t) \prod_{s \in \mathcal{V}} \psi_s(x_s)
Belief Propagation (Integral-Product)
MESSAGES: Sufficient statistics
m_{ts}(x_s) \propto \int_{x_t} \psi_{st}(x_s, x_t) \, \psi_t(x_t) \prod_{u \in \Gamma(t) \setminus s} m_{ut}(x_t) \, dx_t
BELIEFS: Posterior marginals
\hat{p}_t(x_t) \propto \psi_t(x_t) \prod_{u \in \Gamma(t)} m_{ut}(x_t)
\Gamma(t): neighborhood of node t (adjacent nodes)
Two steps: I) message product, II) message propagation.
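A minimal sketch of one sum-product update for discrete variables, where the integral above becomes a sum over x_t; representing potentials and messages as NumPy arrays is an illustrative assumption.

```python
import numpy as np

def bp_message(psi_st, psi_t, incoming_msgs):
    """Compute the message m_{ts}(x_s) in a discrete pairwise MRF.

    psi_st        : (K_s, K_t) array, pairwise potential psi_st(x_s, x_t)
    psi_t         : (K_t,) array, node potential psi_t(x_t)
    incoming_msgs : list of (K_t,) arrays m_{ut}(x_t) for u in Gamma(t) \ s
    """
    prod = psi_t.copy()
    for m in incoming_msgs:      # I) message product over neighbors other than s
        prod *= m
    msg = psi_st @ prod          # II) message propagation: sum over x_t
    return msg / msg.sum()       # normalize for numerical stability

def bp_belief(psi_t, incoming_msgs):
    """Belief (marginal, exact on trees) at node t from all incoming messages."""
    belief = psi_t.copy()
    for m in incoming_msgs:
        belief *= m
    return belief / belief.sum()
```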
BP for Continuous Variables
Is there a finitely parameterized, closed form for the message and marginal functions?
Is there an analytic formula for the message integral, phrased as an update of these parameters?
Covariance and Correlation
Covariance: \Sigma = \mathbb{E}[(x - \mu)(x - \mu)^T] \in \mathbb{R}^{d \times d}
• Always positive semidefinite: u^T \Sigma u \geq 0 for any u \in \mathbb{R}^{d \times 1}, u \neq 0
• Often positive definite: u^T \Sigma u > 0 for any u \in \mathbb{R}^{d \times 1}, u \neq 0
Correlation: \rho_{ij} = \Sigma_{ij} / \sqrt{\Sigma_{ii} \Sigma_{jj}}
Independence: independent variables are uncorrelated, but uncorrelated variables need not be independent
Gaussian Distributions
N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}
• Simplest joint distribution that can capture arbitrary mean & covariance
• Justifications from the central limit theorem and the maximum entropy criterion
• The probability density above assumes the covariance is positive definite
• ML parameter estimates are the sample mean & sample covariance
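A short NumPy sketch of the ML estimates and log-density evaluation, assuming a positive definite covariance; the function names are illustrative.

```python
import numpy as np

def gaussian_mle(X):
    """ML estimates for a multivariate Gaussian from an (N, d) data matrix X."""
    mu = X.mean(axis=0)                          # sample mean
    centered = X - mu
    Sigma = centered.T @ centered / X.shape[0]   # sample covariance (MLE uses 1/N)
    return mu, Sigma

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of N(x | mu, Sigma); assumes Sigma is positive definite."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
```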
Two-Dimensional Gaussians
[Figures: contour and surface plots of two-dimensional Gaussian densities with spherical, diagonal, and full covariance matrices]
Gaussian Geometry
Eigenvalues and eigenvectors: \Sigma u_i = \lambda_i u_i, \quad i = 1, \ldots, d
For a symmetric matrix: \lambda_i \in \mathbb{R}, \quad u_i^T u_i = 1, \quad u_i^T u_j = 0 \text{ for } i \neq j
\Sigma = U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T
For a positive semidefinite matrix: \lambda_i \geq 0
For a positive definite matrix: \lambda_i > 0, \qquad \Sigma^{-1} = U \Lambda^{-1} U^T = \sum_{i=1}^{d} \lambda_i^{-1} u_i u_i^T
Change of coordinates: y_i = u_i^T (x - \mu)
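A brief sketch of this eigendecomposition view: rotating centered data into the eigenvector coordinates y_i = u_i^T (x − μ); it assumes Σ is symmetric and uses NumPy's eigh, and the function name is illustrative.

```python
import numpy as np

def gaussian_eigencoords(X, mu, Sigma):
    """Project data rows into the eigenvector coordinates y_i = u_i^T (x - mu).

    Returns the eigenvalues (variances along each principal axis) and the
    rotated coordinates; assumes Sigma is symmetric positive semidefinite.
    """
    lam, U = np.linalg.eigh(Sigma)   # lam: eigenvalues, U: orthonormal eigenvectors
    Y = (X - mu) @ U                 # each row becomes U^T (x - mu)
    return lam, Y
```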
Probabilistic PCA & Factor Analysis
Both models: data is a linear function of low-dimensional latent coordinates, plus Gaussian noise:
p(z_i) = N(z_i \mid 0, I)
p(x_i \mid z_i, \theta) = N(x_i \mid W z_i + \mu, \Psi)
p(x_i \mid \theta) = N(x_i \mid \mu, W W^T + \Psi) \quad (a low-rank covariance parameterization)
Factor analysis: \Psi is a general diagonal matrix
Probabilistic PCA: \Psi is a multiple of the identity matrix, \Psi = \sigma^2 I
[Figures: latent prior p(z), conditional p(x | \hat{z}), and marginal p(x); from C. Bishop, Pattern Recognition & Machine Learning]
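An illustrative sketch of ancestral sampling from the probabilistic PCA model together with the low-rank marginal covariance it implies; swapping the isotropic noise for a general diagonal Ψ would give factor analysis. Shapes and names are assumptions for illustration.

```python
import numpy as np

def ppca_sample(W, mu, sigma2, n_samples, rng=None):
    """Draw samples from a probabilistic PCA model x = W z + mu + noise.

    W      : (d, k) loading matrix
    mu     : (d,) mean vector
    sigma2 : isotropic noise variance (Psi = sigma2 * I)
    """
    rng = np.random.default_rng() if rng is None else rng
    d, k = W.shape
    Z = rng.standard_normal((n_samples, k))                  # z ~ N(0, I)
    noise = np.sqrt(sigma2) * rng.standard_normal((n_samples, d))
    X = Z @ W.T + mu + noise                                 # x | z ~ N(W z + mu, sigma2 I)
    marginal_cov = W @ W.T + sigma2 * np.eye(d)              # implied covariance of p(x)
    return X, marginal_cov
```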
Gaussian Graphical Models
x \sim N(\mu, \Sigma), \qquad J = \Sigma^{-1}
where J is the precision (information) matrix.
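A small sketch of converting between the moment parameters (μ, Σ) and the information form (h, J), with J = Σ^{-1} and h = Jμ; the potential vector h is standard in the Gaussian graphical model literature but is an addition here, not something stated on this slide.

```python
import numpy as np

def moment_to_information(mu, Sigma):
    """Convert moment parameters (mu, Sigma) to information form (h, J),
    where J = inv(Sigma) and h = J @ mu; assumes Sigma is positive definite."""
    J = np.linalg.inv(Sigma)
    h = J @ mu
    return h, J

def information_to_moment(h, J):
    """Convert information parameters (h, J) back to (mu, Sigma)."""
    Sigma = np.linalg.inv(J)
    mu = Sigma @ h
    return mu, Sigma
```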
Gaussian Potentials