Probabilistic Graphical Models

School of Computer Science, Probabilistic Graphical Models. Variational Inference IV: Variational Principle II. Junming Yin, Lecture 17, March 21, 2012. Reading:

Recap: Variational Inference
Variational formulation: A(θ) = sup_{µ ∈ M} { θ^T µ − A*(µ) }
M: the marginal polytope, difficult to characterize
A*: the negative entropy function, no explicit form
Mean field method: non-convex inner bound and exact form of entropy
Bethe approximation and loopy belief propagation: polyhedral outer bound and non-convex Bethe approximation

Mean Field Approximation

Tractable Subgraphs
Definition: A subgraph F of the graph G is tractable if it is feasible to perform exact inference.
Example:

Mean Field Methods
For an exponential family with sufficient statistics φ defined on graph G, the set of realizable mean parameters is
M(G; φ) := {µ ∈ R^d | ∃ p s.t. E_p[φ(x)] = µ}
For a given tractable subgraph F, a subset of mean parameters of interest is
M(F; φ) := {τ ∈ R^d | τ = E_θ[φ(x)] for some θ ∈ Ω(F)}
Inner approximation: M(F; φ)° ⊆ M(G; φ)°
Mean field solves the relaxed problem
max_{τ ∈ M_F(G)} { ⟨τ, θ⟩ − A*_F(τ) }
where A*_F = A*|_{M_F(G)} is the exact dual function restricted to the set M_F(G).

Example: Naïve Mean Field for Ising Model
Ising model in {0,1} representation:
p(x) ∝ exp{ Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t }
Mean parameters:
µ_s = E_p[X_s] = P[X_s = 1] for all s ∈ V, and
µ_st = E_p[X_s X_t] = P[(X_s, X_t) = (1,1)] for all (s,t) ∈ E
For the fully disconnected graph F,
M_F(G) := {τ ∈ R^{|V|+|E|} | 0 ≤ τ_s ≤ 1, ∀s ∈ V; τ_st = τ_s τ_t, ∀(s,t) ∈ E}
The dual decomposes into a sum of terms, one for each node:
A*_F(τ) = Σ_{s∈V} [ τ_s log τ_s + (1 − τ_s) log(1 − τ_s) ]

Example: Naïve Mean Field for Ising Model
Mean field problem:
A(θ) ≥ max_{(τ_1,...,τ_m) ∈ [0,1]^m} { Σ_{s∈V} θ_s τ_s + Σ_{(s,t)∈E} θ_st τ_s τ_t − A*_F(τ) }
The same objective function as in the free-energy-based approach.
The naïve mean field update equations:
τ_s ← σ( θ_s + Σ_{t∈N(s)} θ_st τ_t )
Also yields a lower bound on the log partition function.
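To make the update concrete, here is a minimal sketch (not part of the lecture) of the naïve mean field coordinate-ascent updates for a small Ising model; the chain graph, parameter values, and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def naive_mean_field(theta_s, theta_st, n_sweeps=50):
    """Coordinate ascent tau_s <- sigma(theta_s + sum_{t in N(s)} theta_st * tau_t)
    for an Ising model in the {0,1} representation (illustrative sketch)."""
    n = len(theta_s)
    # Build neighbor lists with edge weights from the edge-parameter dictionary.
    nbrs = {s: [] for s in range(n)}
    for (s, t), w in theta_st.items():
        nbrs[s].append((t, w))
        nbrs[t].append((s, w))
    tau = np.full(n, 0.5)  # initial guess for the marginals P[X_s = 1]
    for _ in range(n_sweeps):
        for s in range(n):
            tau[s] = sigmoid(theta_s[s] + sum(w * tau[t] for t, w in nbrs[s]))
    return tau

# Toy 3-node chain X1 - X2 - X3 with illustrative parameters.
theta_s = np.array([0.5, -0.2, 0.3])
theta_st = {(0, 1): 1.0, (1, 2): -0.5}
print(naive_mean_field(theta_s, theta_st))
```

Each sweep applies the update τ_s ← σ(θ_s + Σ_{t∈N(s)} θ_st τ_t) to every node in turn; on this toy chain the approximate marginals settle after a few sweeps.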

Geometry of Mean Field
Mean field optimization is always non-convex for any exponential family in which the state space X^m is finite.
Recall the marginal polytope is a convex hull: M(G) = conv{φ(e); e ∈ X^m}
M_F(G) contains all the extreme points φ(e); if it is a strict subset of M(G), then it must be non-convex.
Example: two-node Ising model
M_F(G) = {0 ≤ τ_1 ≤ 1, 0 ≤ τ_2 ≤ 1, τ_12 = τ_1 τ_2}
It has a parabolic cross-section along τ_1 = τ_2, hence it is non-convex.

Bethe Approximation and Sum-Product

Historical Information
Bethe (1935): a physicist who first developed the ideas behind loopy belief propagation in the form of the Bethe approximation; not fully appreciated outside the physics community until recently.
Gallager (1963): an electrical engineer who explored loopy belief propagation in his work on LDPC (Low Density Parity Check) codes.
Yedidia (2001): a physicist who made an explicit connection between loopy belief propagation and the Bethe approximation, and further developed the generalized belief propagation algorithm.

Error Correcting Codes
Graphical model for the (7,4) Hamming code.
Potential functions with hard constraints:
ψ_stu(x_s, x_t, x_u) := 1 if x_s ⊕ x_t ⊕ x_u = 1, and 0 otherwise.
Marginal probabilities = posterior bit probabilities.

Example of LDPC Decoding (figure: parity bits)

Example of LDPC Decoding (figure: received bits)

Sum-Product/Belief Propagation Algorithm
Message passing rule:
M_ts(x_s) ← κ Σ_{x'_t} { ψ_st(x_s, x'_t) ψ_t(x'_t) Π_{u∈N(t)\s} M_ut(x'_t) }
where κ denotes a normalization constant.
Marginals:
µ_s(x_s) = κ ψ_s(x_s) Π_{t∈N(s)} M_ts(x_s)
Exact for trees, but approximate for loopy graphs (so-called loopy belief propagation).
Questions: How is the algorithm on trees related to the variational principle? What is the algorithm doing for graphs with cycles?
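As an illustration (not from the slides), the following sketch runs the sum-product updates on a toy three-node chain and checks the resulting marginals against brute-force enumeration; the potentials and variable names are made up for the example.

```python
import numpy as np
from itertools import product

# Toy chain 0 - 1 - 2 with binary variables; node and edge potentials are illustrative.
psi_node = [np.array([1.0, 2.0]), np.array([1.5, 0.5]), np.array([1.0, 1.0])]
psi_edge = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),   # indexed [x_s, x_t]
            (1, 2): np.array([[1.0, 3.0], [3.0, 1.0]])}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def edge_pot(s, t):
    # Return psi_st indexed as [x_s, x_t], whichever way the edge is stored.
    return psi_edge[(s, t)] if (s, t) in psi_edge else psi_edge[(t, s)].T

# M[t][s] is the message from t to s; initialize to uniform.
M = {t: {s: np.ones(2) for s in nbrs[t]} for t in nbrs}
for _ in range(5):                                          # a few sweeps suffice on a 3-node tree
    for t in nbrs:
        for s in nbrs[t]:
            incoming = np.ones(2)
            for u in nbrs[t]:
                if u != s:
                    incoming = incoming * M[u][t]
            m = edge_pot(s, t) @ (psi_node[t] * incoming)   # sum over x_t
            M[t][s] = m / m.sum()                           # normalize

# Node marginals from messages: mu_s(x_s) proportional to psi_s(x_s) * prod_t M_ts(x_s)
for s in nbrs:
    b = psi_node[s] * np.prod([M[t][s] for t in nbrs[s]], axis=0)
    print("BP marginal", s, b / b.sum())

# Brute-force check (exact because the graph is a tree)
joint = np.zeros((2, 2, 2))
for x in product([0, 1], repeat=3):
    joint[x] = (psi_node[0][x[0]] * psi_node[1][x[1]] * psi_node[2][x[2]]
                * psi_edge[(0, 1)][x[0], x[1]] * psi_edge[(1, 2)][x[1], x[2]])
joint /= joint.sum()
print(joint.sum(axis=(1, 2)), joint.sum(axis=(0, 2)), joint.sum(axis=(0, 1)))
```

Because the graph is a tree, the message-based marginals match the exact ones; on a graph with cycles the same updates would only produce the Bethe approximation discussed below.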

Tree Graphical Models
Discrete variables X_s ∈ {0, 1, ..., m_s − 1} on a tree T = (V, E).
Sufficient statistics: the indicators I_j(x_s) for s = 1, ..., n, j ∈ X_s, and I_jk(x_s, x_t) for (s,t) ∈ E, (j,k) ∈ X_s × X_t.
Exponential representation of the distribution:
p(x; θ) ∝ exp{ Σ_{s∈V} θ_s(x_s) + Σ_{(s,t)∈E} θ_st(x_s, x_t) }
where θ_s(x_s) := Σ_{j∈X_s} θ_{s;j} I_j(x_s) (and similarly for θ_st(x_s, x_t)).
Why is this representation useful? How is it related to inference?
Mean parameters are marginal probabilities:
µ_{s;j} = E_p[I_j(X_s)] = P[X_s = j] for all j ∈ X_s, so µ_s(x_s) = Σ_{j∈X_s} µ_{s;j} I_j(x_s) = P(X_s = x_s)
µ_{st;jk} = E_p[I_{st;jk}(X_s, X_t)] = P[X_s = j, X_t = k] for all (j,k) ∈ X_s × X_t, so µ_st(x_s, x_t) = Σ_{(j,k)} µ_{st;jk} I_jk(x_s, x_t) = P(X_s = x_s, X_t = x_t)

Marginal Polytope for Trees
Recall the marginal polytope for general graphs:
M(G) = {µ ∈ R^d | ∃ p with marginals µ_{s;j}, µ_{st;jk}}
By the junction tree theorem (see Prop. 2.1 & Prop. 4.1),
M(T) = {µ ≥ 0 | Σ_{x_s} µ_s(x_s) = 1, Σ_{x_t} µ_st(x_s, x_t) = µ_s(x_s)}
In particular, if µ ∈ M(T), then
p_µ(x) := Π_{s∈V} µ_s(x_s) Π_{(s,t)∈E} µ_st(x_s, x_t) / (µ_s(x_s) µ_t(x_t))
has the corresponding marginals (with the convention 0/0 := 0 in case of zero elements).
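A quick numerical sanity check of the tree factorization (illustrative numbers, not from the slides): given locally consistent node and edge marginals on a three-node chain, the product formula p_µ reproduces exactly those marginals.

```python
import numpy as np

# Locally consistent marginals on the chain tree 0 - 1 - 2 (illustrative numbers)
mu_01 = np.array([[0.3, 0.1],
                  [0.2, 0.4]])        # indexed [x0, x1]
mu_12 = np.array([[0.25, 0.25],
                  [0.30, 0.20]])      # indexed [x1, x2]; row sums match mu_01's column sums
mu0, mu1, mu2 = mu_01.sum(axis=1), mu_01.sum(axis=0), mu_12.sum(axis=0)

# Tree factorization p_mu(x) = prod_s mu_s(x_s) * prod_(s,t) mu_st(x_s,x_t) / (mu_s(x_s) mu_t(x_t))
p = np.zeros((2, 2, 2))
for x0 in range(2):
    for x1 in range(2):
        for x2 in range(2):
            p[x0, x1, x2] = (mu0[x0] * mu1[x1] * mu2[x2]
                             * mu_01[x0, x1] / (mu0[x0] * mu1[x1])
                             * mu_12[x1, x2] / (mu1[x1] * mu2[x2]))

# p_mu is normalized and reproduces the specified edge marginals
print(np.isclose(p.sum(), 1.0))
print(np.allclose(p.sum(axis=2), mu_01), np.allclose(p.sum(axis=0), mu_12))
```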

Decomposition of Entropy for Trees
For trees, the entropy decomposes as
H(p(x; µ)) = − Σ_x p(x; µ) log p(x; µ)
= Σ_{s∈V} ( − Σ_{x_s} µ_s(x_s) log µ_s(x_s) ) − Σ_{(s,t)∈E} Σ_{x_s, x_t} µ_st(x_s, x_t) log [ µ_st(x_s, x_t) / (µ_s(x_s) µ_t(x_t)) ]
= Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E} I_st(µ_st)
where H_s(µ_s) is the singleton entropy and I_st(µ_st) is the mutual information (a KL divergence).
The dual function has an explicit form: A*(µ) = − H(p(x; µ))
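This identity is easy to verify numerically; the short sketch below (illustrative numbers) checks H = Σ_s H_s − Σ_(s,t) I_st on a two-node tree.

```python
import numpy as np

# Arbitrary joint distribution mu_st for a two-node tree (a single edge)
mu_st = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
mu_s = mu_st.sum(axis=1)          # marginal of node s
mu_t = mu_st.sum(axis=0)          # marginal of node t

H = lambda p: -np.sum(p * np.log(p))                         # entropy (all entries > 0 here)
I_st = np.sum(mu_st * np.log(mu_st / np.outer(mu_s, mu_t)))  # mutual information

joint_entropy = H(mu_st)                                     # direct computation
tree_decomposition = H(mu_s) + H(mu_t) - I_st                # H_s + H_t - I_st
print(joint_entropy, tree_decomposition)                     # the two agree
```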

Exact Variational Principle for Trees
Variational formulation:
A(θ) = max_{µ ∈ M(T)} { ⟨θ, µ⟩ + Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E} I_st(µ_st) }
Assign a Lagrange multiplier λ_ss for the normalization constraint C_ss(µ) := 1 − Σ_{x_s} µ_s(x_s) = 0, and λ_ts(x_s) for each marginalization constraint C_ts(x_s; µ) := µ_s(x_s) − Σ_{x_t} µ_st(x_s, x_t) = 0.
The Lagrangian has the form
L(µ, λ) = ⟨θ, µ⟩ + Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E} I_st(µ_st) + Σ_{s∈V} λ_ss C_ss(µ) + Σ_{(s,t)∈E} [ Σ_{x_t} λ_st(x_t) C_st(x_t; µ) + Σ_{x_s} λ_ts(x_s) C_ts(x_s; µ) ]

Lagrangian Derivation
Taking the derivatives of the Lagrangian w.r.t. µ_s(x_s) and µ_st(x_s, x_t):
∂L/∂µ_s(x_s) = θ_s(x_s) − log µ_s(x_s) + Σ_{t∈N(s)} λ_ts(x_s) + C
∂L/∂µ_st(x_s, x_t) = θ_st(x_s, x_t) − log [ µ_st(x_s, x_t) / (µ_s(x_s) µ_t(x_t)) ] − λ_ts(x_s) − λ_st(x_t) + C
Setting them to zero yields
µ_s(x_s) ∝ exp{θ_s(x_s)} Π_{t∈N(s)} exp{λ_ts(x_s)}, where the message is defined as M_ts(x_s) := exp{λ_ts(x_s)}
µ_st(x_s, x_t) ∝ exp{ θ_s(x_s) + θ_t(x_t) + θ_st(x_s, x_t) } Π_{u∈N(s)\t} exp{λ_us(x_s)} Π_{v∈N(t)\s} exp{λ_vt(x_t)}

Lagrangian Derivation (continued)
Adjusting the Lagrange multipliers (messages) to enforce C_ts(x_s; µ) := µ_s(x_s) − Σ_{x_t} µ_st(x_s, x_t) = 0 yields
M_ts(x_s) ∝ Σ_{x_t} exp{ θ_t(x_t) + θ_st(x_s, x_t) } Π_{u∈N(t)\s} M_ut(x_t)
Conclusion: the message-passing updates are a Lagrangian method for solving the stationary conditions of the variational formulation.

BP on Arbitrary Graphs
Two main difficulties of the variational formulation A(θ) = sup_{µ ∈ M} { θ^T µ − A*(µ) }:
The marginal polytope M is hard to characterize, so let's use the tree-based outer bound
L(G) = {τ ≥ 0 | Σ_{x_s} τ_s(x_s) = 1, Σ_{x_t} τ_st(x_s, x_t) = τ_s(x_s)}
These locally consistent vectors τ are called pseudo-marginals.
The exact entropy −A*(µ) lacks an explicit form, so let's approximate it by the exact expression for trees:
−A*(τ) ≈ H_Bethe(τ) := Σ_{s∈V} H_s(τ_s) − Σ_{(s,t)∈E} I_st(τ_st)
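To make the outer bound concrete, here is a small sketch (with made-up numbers) that checks whether given singleton and pairwise pseudo-marginals satisfy the local consistency constraints defining L(G).

```python
import numpy as np

def in_local_polytope(tau_s, tau_st, edges, tol=1e-9):
    """Check nonnegativity, normalization, and marginalization constraints of L(G).
    tau_s: dict node -> vector; tau_st: dict edge -> matrix indexed [x_s, x_t]."""
    for s, t_s in tau_s.items():
        if np.any(t_s < -tol) or abs(t_s.sum() - 1.0) > tol:
            return False
    for (s, t) in edges:
        T = tau_st[(s, t)]
        if np.any(T < -tol):
            return False
        # Summing out x_t must give tau_s; summing out x_s must give tau_t.
        if not np.allclose(T.sum(axis=1), tau_s[s], atol=tol):
            return False
        if not np.allclose(T.sum(axis=0), tau_s[t], atol=tol):
            return False
    return True

# Illustrative pseudo-marginals on a 3-cycle: they pass the local checks even though
# such pseudo-marginals need not correspond to any true joint distribution.
nodes = {s: np.array([0.5, 0.5]) for s in (1, 2, 3)}
pair = np.array([[0.1, 0.4],
                 [0.4, 0.1]])
edges = [(1, 2), (2, 3), (1, 3)]
print(in_local_polytope(nodes, {e: pair for e in edges}, edges))  # True
```

That such locally consistent but globally unrealizable assignments exist on cycles is exactly the gap between L(G) and M(G) discussed on the following slides.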

Bethe Variational Problem (BVP)
Combining these two ingredients leads to the Bethe variational problem (BVP):
max_{τ ∈ L(G)} { ⟨θ, τ⟩ + Σ_{s∈V} H_s(τ_s) − Σ_{(s,t)∈E} I_st(τ_st) }
A simple structured problem (differentiable, and the constraint set is a simple convex polytope).
Loopy BP can be derived as an iterative method for solving a Lagrangian formulation of the BVP (Theorem 4.2); the proof is similar to that for tree graphs.

Geometry of BP
Consider the assignment of pseudo-marginals on a 3-node cycle shown in the figure. One can easily verify that τ ∈ L(G); however, τ ∉ M(G) (this needs a bit more work).
For any graph, L(G) ⊇ M(G) (the tree-based outer bound), and equality holds if and only if the graph is a tree.
Question: does the solution to the BVP ever fall into the gap?
Yes: for any element of the outer bound L(G), it is possible to construct a distribution with it as a BP fixed point (Wainwright et al. 2003).

Inexactness of Bethe Entropy Approximation
Consider a fully connected graph on four nodes with
µ_s(x_s) = [0.5, 0.5] for s = 1, 2, 3, 4, and
µ_st(x_s, x_t) = [[0.5, 0], [0, 0.5]] for all (s,t) ∈ E.
It is globally valid: µ ∈ M(G), realized by the distribution that places mass 1/2 on each of the configurations (0,0,0,0) and (1,1,1,1).
H_Bethe(µ) = 4 log 2 − 6 log 2 = −2 log 2 < 0, whereas the true entropy −A*(µ) = log 2 > 0.
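The numbers on this slide can be reproduced directly; a short sketch using the standard entropy and mutual-information formulas:

```python
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]                                 # 0 log 0 := 0
    return -np.sum(p * np.log(p))

# Fully connected graph on 4 nodes (6 edges), as in the slide's example
mu_s = np.array([0.5, 0.5])
mu_st = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
n_nodes, n_edges = 4, 6

I_st = H(mu_s) + H(mu_s) - H(mu_st)              # mutual information = H_s + H_t - H_st = log 2
H_bethe = n_nodes * H(mu_s) - n_edges * I_st     # 4 log 2 - 6 log 2 = -2 log 2 < 0
H_true = np.log(2)                               # mass 1/2 on (0,0,0,0) and (1,1,1,1)
print(H_bethe, -2 * np.log(2))                   # these agree
print(H_true)                                    # log 2 > 0
```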

Discussions
This connection provides a principled basis for applying the sum-product algorithm to loopy graphs.
However: although loopy BP always has a fixed point, there are no guarantees on the convergence of the algorithm on loopy graphs; the Bethe variational problem is usually non-convex, so there are no guarantees of reaching the global optimum; and in general there is no guarantee that A_Bethe(θ) is a lower bound on the true value A(θ).
Nevertheless, the connection and understanding suggest a number of avenues for improving upon the ordinary sum-product algorithm, via progressively better approximations to the entropy function and outer bounds on the marginal polytope (Kikuchi clustering).

Summary
Variational methods in general turn inference into an optimization problem via exponential families and convex duality.
The exact variational principle is intractable to solve; there are two distinct components for approximations:
Either an inner or an outer bound to the marginal polytope
Various approximations to the entropy function
Mean field: non-convex inner bound and exact form of entropy.
BP: polyhedral outer bound and non-convex Bethe approximation.
Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (Yedidia et al. 2002).

Summary
Off-the-shelf solution to the inference problem?
Mean field: yields a lower bound on the log partition function (likelihood function); widely used as an approximate E-step in the EM algorithm.
Sum-product: works well if the graph is locally tree-like and typically performs better than mean field; successfully used in the error-correcting coding and low-level vision communities.