Graphical models and message-passing Part II: Marginals and likelihoods


Transcription:

Graphical models and message-passing Part II: Marginals and likelihoods
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
Tutorial materials (slides, monograph, lecture notes) available at: www.eecs.berkeley.edu/~wainwrig/kyoto12
September 3, 2012

Graphs and factorization

[Figure: undirected graph on vertices 1 through 7, with compatibility functions $\psi_{47}$ and $\psi_{456}$ attached to two of its cliques]

- clique $C$ is a fully connected subset of vertices
- compatibility function $\psi_C$ defined on variables $x_C = \{x_s, s \in C\}$
- factorization over all cliques:
  $p(x_1, \dots, x_N) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$

Core computational challenges

Given an undirected graphical model (Markov random field)
$p(x_1, \dots, x_N) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$,
how to efficiently compute:

- most probable configuration (MAP estimate):
  Maximize: $\hat{x} = \arg\max_{x \in \mathcal{X}^N} p(x_1, \dots, x_N) = \arg\max_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
- the data likelihood or normalization constant:
  Sum/integrate: $Z = \sum_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
- marginal distributions at single sites, or subsets:
  Sum/integrate: $p(X_s = x_s) = \frac{1}{Z} \sum_{x_t,\, t \neq s} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
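
These three quantities are easy to state but costly to compute. The following minimal sketch (not from the tutorial) makes them concrete by brute-force enumeration on a hypothetical pairwise model over four binary variables; the enumeration is exponential in $N$, which is exactly the cost that the message-passing machinery below is designed to avoid.

```python
import itertools
import numpy as np

# Hypothetical pairwise MRF on N = 4 binary variables, chosen only for illustration.
N, m = 4, 2
rng = np.random.default_rng(0)
theta_s = rng.normal(size=(N, m))                  # node potentials theta_s(x_s)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]           # a single cycle
theta_st = {e: rng.normal(size=(m, m)) for e in edges}

def score(x):
    """Unnormalized log-probability: sum of node and edge potentials."""
    val = sum(theta_s[s, x[s]] for s in range(N))
    val += sum(theta_st[(s, t)][x[s], x[t]] for (s, t) in edges)
    return val

configs = list(itertools.product(range(m), repeat=N))
weights = np.array([np.exp(score(x)) for x in configs])

Z = weights.sum()                                  # normalization constant
x_map = configs[int(np.argmax(weights))]           # MAP configuration
p1 = np.zeros(m)                                   # marginal at node 1
for x, w in zip(configs, weights):
    p1[x[1]] += w / Z

print("log Z =", np.log(Z), "MAP =", x_map, "p(x_1) =", p1)
```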

1. Sum-product message-passing on trees

Goal: Compute marginal distribution at node u on a tree.

[Figure: three-node chain 1 - 2 - 3, with messages $M_{12}$ and $M_{32}$ passed toward node 2]

For a distribution of the form $p(x) \propto \prod_{s \in V} \exp(\theta_s(x_s)) \prod_{(s,t) \in E} \exp(\theta_{st}(x_s, x_t))$, summing out $x_1$ and $x_3$ on this chain gives the marginal at node 2:
$\sum_{x_1, x_3} p(x) \propto \exp(\theta_2(x_2)) \prod_{t \in \{1,3\}} \Big\{ \sum_{x_t} \exp[\theta_t(x_t) + \theta_{2t}(x_2, x_t)] \Big\}$

Putting together the pieces

Sum-product is an exact algorithm for any tree.

[Figure: node t with neighboring subtrees $T_u$, $T_v$, $T_w$; node t sends message $M_{ts}$ to node s]

$M_{ts}$: message from node t to s; $N(t)$: neighbors of node t

Update:
$M_{ts}(x_s) \leftarrow \sum_{x_t \in \mathcal{X}_t} \Big\{ \exp\big[\theta_{st}(x_s, x_t) + \theta_t(x_t)\big] \prod_{v \in N(t) \setminus s} M_{vt}(x_t) \Big\}$

Sum-marginals:
$p_s(x_s; \theta) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} M_{ts}(x_s)$

Summary: sum-product on trees

- converges in at most graph diameter # of iterations
- updating a single message is an $O(m^2)$ operation
- overall algorithm requires $O(N m^2)$ operations
- upon convergence, yields the exact node and edge marginals:
  $p_s(x_s) \propto e^{\theta_s(x_s)} \prod_{u \in N(s)} M_{us}(x_s)$
  $p_{st}(x_s, x_t) \propto e^{\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)} \prod_{u \in N(s) \setminus t} M_{us}(x_s) \prod_{u \in N(t) \setminus s} M_{ut}(x_t)$
- messages can also be used to compute the partition function
  $Z = \sum_{x_1, \dots, x_N} \prod_{s \in V} e^{\theta_s(x_s)} \prod_{(s,t) \in E} e^{\theta_{st}(x_s, x_t)}$
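
As an illustration of the update and marginal formulas above, here is a short sketch (my own code, not the tutorial's) that runs sum-product on a hypothetical 4-node chain with random potentials and checks the node marginals against brute-force enumeration.

```python
import itertools
import numpy as np

# Hypothetical 4-node chain 0-1-2-3 with m = 3 states; potentials chosen only for illustration.
N, m = 4, 3
rng = np.random.default_rng(1)
theta = rng.normal(size=(N, m))                        # theta_s(x_s)
edges = [(0, 1), (1, 2), (2, 3)]
theta_e = {e: rng.normal(size=(m, m)) for e in edges}  # theta_st(x_s, x_t)
nbrs = {s: set() for s in range(N)}
for s, t in edges:
    nbrs[s].add(t)
    nbrs[t].add(s)

def edge_pot(s, t):
    """theta_st as an (m, m) array indexed [x_s, x_t], whichever way the edge is stored."""
    return theta_e[(s, t)] if (s, t) in theta_e else theta_e[(t, s)].T

# M[(t, s)] is the message from t to s, a function of x_s.
M = {(t, s): np.ones(m) for s in range(N) for t in nbrs[s]}
for _ in range(N):                                     # >= diameter sweeps suffice on a tree
    for (t, s) in list(M):
        others = nbrs[t] - {s}
        incoming = np.prod([M[(v, t)] for v in others], axis=0) if others else np.ones(m)
        msg = np.exp(edge_pot(s, t) + theta[t]) @ incoming   # sum over x_t
        M[(t, s)] = msg / msg.sum()                    # normalization does not change marginals

# Node marginals: p_s(x_s) proportional to exp(theta_s(x_s)) * prod_{t in N(s)} M_ts(x_s).
marg = np.zeros((N, m))
for s in range(N):
    b = np.exp(theta[s]) * np.prod([M[(t, s)] for t in nbrs[s]], axis=0)
    marg[s] = b / b.sum()

# Brute-force check against exact enumeration.
exact = np.zeros((N, m))
for x in itertools.product(range(m), repeat=N):
    w = np.exp(sum(theta[s, x[s]] for s in range(N)) +
               sum(theta_e[e][x[e[0]], x[e[1]]] for e in edges))
    for s in range(N):
        exact[s, x[s]] += w
exact /= exact.sum(axis=1, keepdims=True)
print("max marginal error:", np.abs(marg - exact).max())
```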

2. Sum-product on graph with cycles

As with max-product, a widely used heuristic with a long history:
- error-control coding: Gallager, 1963
- artificial intelligence: Pearl, 1988
- turbo decoding: Berrou et al., 1993
- etc.

Some concerns with sum-product with cycles:
- no convergence guarantees
- can have multiple fixed points
- final estimate of Z is not a lower/upper bound

As before, can consider a broader class of reweighted sum-product algorithms.

Tree-reweighted sum-product algorithms

Message update from node t to node s:
$M_{ts}(x_s) \leftarrow \kappa \sum_{x_t \in \mathcal{X}_t} \exp\Big[ \frac{\theta_{st}(x_s, x_t)}{\rho_{st}} + \theta_t(x_t) \Big] \cdot \frac{\prod_{v \in N(t) \setminus s} \big[M_{vt}(x_t)\big]^{\rho_{vt}}}{\big[M_{st}(x_t)\big]^{(1 - \rho_{ts})}}$

Here the edge potential is reweighted by $1/\rho_{st}$, the incoming messages carry the exponents $\rho_{vt}$, and the opposite-direction message $M_{st}$ appears in the denominator.

Properties:
1. Modified updates remain distributed and purely local over the graph. Messages are reweighted with $\rho_{st} \in [0,1]$.
2. Key differences: the potential on edge (s,t) is rescaled by $\rho_{st} \in [0,1]$, and the update involves the reverse-direction message.
3. The choice $\rho_{st} = 1$ for all edges (s,t) recovers the standard update.
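
A minimal sketch of a single reweighted update, assuming tabular potentials stored as numpy arrays; all names are hypothetical, and setting rho = 1 recovers the ordinary sum-product update.

```python
import numpy as np

def trw_message_update(theta_t, theta_st, rho, incoming_pows, M_st):
    """One tree-reweighted update for M_ts(x_s).

    theta_t       : (m_t,) node potential at t
    theta_st      : (m_s, m_t) edge potential, indexed [x_s, x_t]
    rho           : appearance probability of the undirected edge {s, t}
    incoming_pows : prod over v in N(t)\\{s} of M_vt(x_t)**rho_vt, shape (m_t,)
    M_st          : current opposite-direction message M_st(x_t), shape (m_t,)
    """
    kernel = np.exp(theta_st / rho + theta_t)                  # reweighted edge + node term
    weights = incoming_pows / np.maximum(M_st, 1e-300) ** (1.0 - rho)
    msg = kernel @ weights                                     # sum over x_t
    return msg / msg.sum()                                     # kappa: normalize
```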

Bethe entropy approximation

Define local marginal distributions (e.g., for m = 3 states):
$\mu_s(\cdot) = \begin{bmatrix} \mu_s(0) & \mu_s(1) & \mu_s(2) \end{bmatrix}, \qquad \mu_{st}(\cdot,\cdot) = \begin{bmatrix} \mu_{st}(0,0) & \mu_{st}(0,1) & \mu_{st}(0,2) \\ \mu_{st}(1,0) & \mu_{st}(1,1) & \mu_{st}(1,2) \\ \mu_{st}(2,0) & \mu_{st}(2,1) & \mu_{st}(2,2) \end{bmatrix}$

Define node-based entropy and edge-based mutual information:
Node-based entropy: $H_s(\mu_s) = -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s)$
Mutual information: $I_{st}(\mu_{st}) = \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\,\mu_t(x_t)}$

$\rho$-reweighted Bethe entropy:
$H_{\mathrm{Bethe}}(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} \rho_{st}\, I_{st}(\mu_{st})$
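
The reweighted Bethe entropy is straightforward to evaluate once pseudomarginals are in hand; the sketch below (variable names are my own, not the tutorial's) computes $H_s$, $I_{st}$, and $H_{\mathrm{Bethe}}$ from dictionaries of node and edge pseudomarginals.

```python
import numpy as np

def node_entropy(mu_s):
    """H_s(mu_s) = - sum_x mu_s(x) log mu_s(x)."""
    p = mu_s[mu_s > 0]
    return -np.sum(p * np.log(p))

def mutual_information(mu_st):
    """I_st(mu_st) = sum mu_st log[ mu_st / (mu_s mu_t) ], with mu_s, mu_t its margins."""
    mu_s = mu_st.sum(axis=1, keepdims=True)
    mu_t = mu_st.sum(axis=0, keepdims=True)
    mask = mu_st > 0
    return np.sum(mu_st[mask] * np.log(mu_st[mask] / (mu_s * mu_t)[mask]))

def bethe_entropy(node_marg, edge_marg, rho):
    """H_Bethe(mu) = sum_s H_s(mu_s) - sum_(s,t) rho_st I_st(mu_st)."""
    H = sum(node_entropy(mu) for mu in node_marg.values())
    H -= sum(rho[e] * mutual_information(mu) for e, mu in edge_marg.items())
    return H
```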

Bethe entropy is exact for trees

Exact for trees, using the factorization:
$p(x; \theta) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\,\mu_t(x_t)}$

Reweighted sum-product and Bethe variational principle

Define the local constraint set
$L(G) = \big\{ \tau_s, \tau_{st} \;\big|\; \tau \ge 0, \; \sum_{x_s} \tau_s(x_s) = 1, \; \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \big\}$

Theorem. For any choice of positive edge weights $\rho_{st} > 0$:
(a) Fixed points of reweighted sum-product are stationary points of the Lagrangian associated with
$A_{\mathrm{Bethe}}(\theta; \rho) := \max_{\tau \in L(G)} \Big\{ \sum_{s \in V} \langle \tau_s, \theta_s \rangle + \sum_{(s,t) \in E} \langle \tau_{st}, \theta_{st} \rangle + H_{\mathrm{Bethe}}(\tau; \rho) \Big\}$.
(b) For valid choices of edge weights $\{\rho_{st}\}$, the fixed points are unique and moreover $\log Z(\theta) \le A_{\mathrm{Bethe}}(\theta; \rho)$. In addition, reweighted sum-product converges with appropriate scheduling.

Lagrangian derivation of ordinary sum-product

- let's try to solve this problem by a (partial) Lagrangian formulation
- assign a Lagrange multiplier $\lambda_{ts}(x_s)$ for each constraint $C_{ts}(x_s) := \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t) = 0$
- will enforce the normalization ($\sum_{x_s} \tau_s(x_s) = 1$) and non-negativity constraints explicitly
- the Lagrangian takes the form:
  $L(\tau; \lambda) = \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) + \sum_{(s,t) \in E(G)} \Big[ \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t) + \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s) \Big]$

Lagrangian derivation (part II)

Taking derivatives of the Lagrangian w.r.t. $\tau_s$ and $\tau_{st}$ yields
$\frac{\partial L}{\partial \tau_s(x_s)} = \theta_s(x_s) - \log \tau_s(x_s) + \sum_{t \in N(s)} \lambda_{ts}(x_s) + C$
$\frac{\partial L}{\partial \tau_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s)\tau_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C$

Setting these partial derivatives to zero and simplifying:
$\tau_s(x_s) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} \exp\{\lambda_{ts}(x_s)\}$
$\tau_{st}(x_s, x_t) \propto \exp\{\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)\} \prod_{u \in N(s) \setminus t} \exp\{\lambda_{us}(x_s)\} \prod_{v \in N(t) \setminus s} \exp\{\lambda_{vt}(x_t)\}$

Enforcing the constraint $C_{ts}(x_s) = 0$ on these representations yields the familiar update rule for the messages $M_{ts}(x_s) = \exp(\lambda_{ts}(x_s))$:
$M_{ts}(x_s) \propto \sum_{x_t} \exp\{\theta_t(x_t) + \theta_{st}(x_s, x_t)\} \prod_{u \in N(t) \setminus s} M_{ut}(x_t)$

Convex combinations of trees

Idea: Upper bound $A(\theta) := \log Z(\theta)$ with a convex combination of tree-structured problems:
$\theta = \rho(T_1)\,\theta(T_1) + \rho(T_2)\,\theta(T_2) + \rho(T_3)\,\theta(T_3)$
$A(\theta) \le \rho(T_1)\,A(\theta(T_1)) + \rho(T_2)\,A(\theta(T_2)) + \rho(T_3)\,A(\theta(T_3))$

$\rho = \{\rho(T)\}$: probability distribution over spanning trees
$\theta(T)$: tree-structured parameter vector
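
The bound follows from convexity of $A$ in $\theta$. The sketch below (a hypothetical binary 3-cycle example, not from the tutorial) verifies it numerically: each spanning tree drops one edge, the trees get weight 1/3 so every edge has appearance probability 2/3, and the edge parameters on each tree are rescaled by $1/\rho_e = 3/2$ so that the convex combination of tree parameters reproduces $\theta$.

```python
import itertools
import numpy as np

def log_Z(theta_node, theta_edge, edges, m):
    """Brute-force A(theta) = log Z(theta) for a small pairwise model."""
    N = len(theta_node)
    vals = []
    for x in itertools.product(range(m), repeat=N):
        s = sum(theta_node[i][x[i]] for i in range(N))
        s += sum(theta_edge[e][x[e[0]], x[e[1]]] for e in edges)
        vals.append(s)
    return np.log(np.exp(np.array(vals)).sum())

# Hypothetical binary 3-cycle; each spanning tree drops one edge, rho(T) = 1/3,
# so every edge appears in 2 of the 3 trees (rho_e = 2/3).
m = 2
rng = np.random.default_rng(2)
theta_node = [rng.normal(size=m) for _ in range(3)]
all_edges = [(0, 1), (1, 2), (0, 2)]
theta_edge = {e: rng.normal(size=(m, m)) for e in all_edges}

A = log_Z(theta_node, theta_edge, all_edges, m)

# theta(T): keep node terms, rescale edge terms by 1/rho_e = 3/2 on the edges in T,
# so that sum_T rho(T) theta(T) = theta.
bound = 0.0
for drop in all_edges:
    tree_edges = [e for e in all_edges if e != drop]
    tree_theta_edge = {e: theta_edge[e] / (2.0 / 3.0) for e in tree_edges}
    bound += (1.0 / 3.0) * log_Z(theta_node, tree_theta_edge, tree_edges, m)

print("A(theta) =", A, "<= convex-combination bound =", bound)
```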

Finding the tightest upper bound

Observation: For each fixed distribution $\rho$ over spanning trees, there are many such upper bounds.

Goal: Find the tightest such upper bound over all trees.

Challenge: Number of spanning trees grows rapidly in graph size.

Example: On the 2-D lattice:

Grid size | # trees
9         | 192
16        | 100352
36        | 3.26 x 10^13
100       | 5.69 x 10^42

By a suitable dual reformulation, this problem can be avoided. Key duality relation:
$\min_{\sum_T \rho(T)\theta(T) = \theta} \sum_T \rho(T)\, A(\theta(T)) = \max_{\mu \in L(G)} \big\{ \langle \mu, \theta \rangle + H_{\mathrm{Bethe}}(\mu; \rho_{st}) \big\}$

Edge appearance probabilities

Experiment: What is the probability $\rho_e$ that a given edge $e \in E$ belongs to a tree $T$ drawn randomly under $\rho$?

[Figure: (a) original graph with edges labeled b, e, f; (b) spanning tree $T_1$ with $\rho(T_1) = 1/3$; (c) spanning tree $T_2$ with $\rho(T_2) = 1/3$; (d) spanning tree $T_3$ with $\rho(T_3) = 1/3$]

In this example: $\rho_b = 1$; $\rho_e = 2/3$; $\rho_f = 1/3$.

The vector $\rho_e = \{ \rho_e \mid e \in E \}$ must belong to the spanning tree polytope. (Edmonds, 1971)
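
Given an explicit list of spanning trees and their weights, the edge appearance probabilities are just weighted counts. The sketch below uses a hypothetical 4-node graph whose edges play the roles of the slide's b, e, f edges (appearing in 3, 2, and 1 of the three trees, respectively); it is not the slide's exact graph.

```python
from collections import defaultdict

def edge_appearance_probs(trees, weights):
    """rho_e = sum of rho(T) over the trees T containing edge e; trees are lists of edges."""
    rho = defaultdict(float)
    for tree, w in zip(trees, weights):
        for e in tree:
            rho[frozenset(e)] += w
    return dict(rho)

# Three spanning trees of a hypothetical 4-node graph, each with weight 1/3:
# edge (1,2) appears in all three (rho = 1), (2,3) and (3,4) in two (rho = 2/3),
# (2,4) and (1,3) in one (rho = 1/3).
trees = [[("1", "2"), ("2", "3"), ("3", "4")],
         [("1", "2"), ("2", "3"), ("2", "4")],
         [("1", "2"), ("1", "3"), ("3", "4")]]
print(edge_appearance_probs(trees, [1/3, 1/3, 1/3]))
```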

Why does entropy arise in the duality?

Due to a deep correspondence between two problems:

Maximum entropy density estimation:
Maximize entropy $H(p) = -\sum_x p(x_1, \dots, x_N) \log p(x_1, \dots, x_N)$
subject to expectation constraints of the form $\sum_x p(x)\,\phi_\alpha(x) = \mu_\alpha$.

Maximum likelihood in exponential family:
Maximize likelihood of parameterized densities
$p(x_1, \dots, x_N; \theta) = \exp\big\{ \sum_\alpha \theta_\alpha \phi_\alpha(x) - A(\theta) \big\}$.

Conjugate dual functions

- conjugate duality is a fertile source of variational representations
- any function $f$ can be used to define another function $f^*$ as follows:
  $f^*(v) := \sup_{u \in \mathbb{R}^n} \big\{ \langle v, u \rangle - f(u) \big\}$
- easy to show that $f^*$ is always a convex function
- how about taking the dual of the dual? I.e., what is $(f^*)^*$?
- when $f$ is well-behaved (convex and lower semi-continuous), we have $(f^*)^* = f$, or alternatively stated:
  $f(u) = \sup_{v \in \mathbb{R}^n} \big\{ \langle u, v \rangle - f^*(v) \big\}$

Geometric view: Supporting hyperplanes

Question: Given all hyperplanes in $\mathbb{R}^n \times \mathbb{R}$ with normal $(v, -1)$, what is the intercept of the one that supports epi(f)?

[Figure: graph of $f(u)$ with two hyperplanes of normal $(v, -1)$ and intercepts $-c_a$, $-c_b$; only the optimal intercept gives a supporting hyperplane]

Epigraph of f: $\mathrm{epi}(f) := \{(u, \beta) \in \mathbb{R}^{n+1} \mid f(u) \le \beta\}$.

Analytically, we require the smallest $c \in \mathbb{R}$ such that:
$\langle v, u \rangle - c \le f(u)$ for all $u \in \mathbb{R}^n$

By re-arranging, we find that this optimal $c^*$ is the dual value:
$c^* = \sup_{u \in \mathbb{R}^n} \big\{ \langle v, u \rangle - f(u) \big\}$.

Example: Single Bernoulli

Random variable $X \in \{0,1\}$ yields exponential family of the form:
$p(x; \theta) \propto \exp\{\theta x\}$ with $A(\theta) = \log[1 + \exp(\theta)]$.

Let's compute the dual:
$A^*(\mu) := \sup_{\theta \in \mathbb{R}} \big\{ \mu\theta - \log[1 + \exp(\theta)] \big\}$.

(Possible) stationary point: $\mu = \exp(\theta)/[1 + \exp(\theta)]$.

[Figure: (a) for $\mu \in (0,1)$ the epigraph of $A$ is supported by a line of slope $\mu$; (b) for $\mu$ outside $[0,1]$ the epigraph cannot be supported]

We find that:
$A^*(\mu) = \begin{cases} \mu \log \mu + (1 - \mu)\log(1 - \mu) & \text{if } \mu \in [0,1] \\ +\infty & \text{otherwise.} \end{cases}$

Leads to the variational representation: $A(\theta) = \max_{\mu \in [0,1]} \big\{ \mu\,\theta - A^*(\mu) \big\}$.
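
A quick numerical check of this variational representation (a sketch, not part of the slides): maximize $\mu\theta - A^*(\mu)$ over a grid of $\mu$ values and compare with $A(\theta)$ evaluated directly; the maximizer is the mean parameter $\mu = \exp(\theta)/[1+\exp(\theta)]$.

```python
import numpy as np

def A(theta):
    """Log partition function of the Bernoulli family: A(theta) = log(1 + exp(theta))."""
    return np.log1p(np.exp(theta))

def A_star(mu):
    """Conjugate dual on [0, 1]: the negative entropy mu*log(mu) + (1-mu)*log(1-mu)."""
    if mu < 0.0 or mu > 1.0:
        return np.inf
    if mu == 0.0 or mu == 1.0:
        return 0.0            # limit of mu*log(mu) as mu -> 0 (or 1)
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)

theta = 1.3
mus = np.linspace(1e-6, 1 - 1e-6, 100001)
variational = max(mu * theta - A_star(mu) for mu in mus)
print(A(theta), "~=", variational)   # maximum attained near mu = exp(theta)/(1+exp(theta))
```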

Geometry of Bethe variational problem

[Figure: marginal polytope $M(G)$ contained in the outer bound $L(G)$; $\mu_{\mathrm{int}} \in M(G)$, while $\mu_{\mathrm{frac}} \in L(G) \setminus M(G)$]

- belief propagation uses a polyhedral outer approximation to $M(G)$: for any graph, $L(G) \supseteq M(G)$
- equality holds $\iff$ $G$ is a tree

Natural question: Do BP fixed points ever fall outside of the marginal polytope $M(G)$?

Illustration: Globally inconsistent BP fixed points

Consider the following assignment of pseudomarginals $\tau_s, \tau_{st}$:

[Figure: a three-node cycle (nodes 1, 2, 3) with locally consistent singleton and pairwise pseudomarginals whose entries are 1/4 and 1/2]

- one can verify that $\tau \in L(G)$, and that $\tau$ is a fixed point of belief propagation (with all constant messages)
- however, $\tau$ is globally inconsistent

Note: More generally, for any $\tau$ in the interior of $L(G)$, one can construct a distribution with $\tau$ as a BP fixed point.
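
The slide's exact numbers are hard to recover from the transcription, but the phenomenon itself is easy to reproduce with a standard construction (the numbers below are my own choice, not necessarily the slide's): on a binary 3-cycle, take uniform node pseudomarginals and perfectly anti-correlated pairwise pseudomarginals. Every constraint defining $L(G)$ is satisfied, yet no joint distribution has these marginals, since three pairwise perfect disagreements cannot hold simultaneously on an odd cycle.

```python
import numpy as np

# 3-cycle with binary variables; hypothetical pseudomarginals chosen for illustration.
tau_node = {s: np.array([0.5, 0.5]) for s in range(3)}
tau_edge = {e: np.array([[0.0, 0.5],
                         [0.5, 0.0]]) for e in [(0, 1), (1, 2), (0, 2)]}

def in_local_polytope(tau_node, tau_edge, tol=1e-12):
    """Check the L(G) constraints: nonnegativity, normalization, marginalization."""
    ok = all(np.all(t >= -tol) and abs(t.sum() - 1) < tol for t in tau_node.values())
    for (s, t), m in tau_edge.items():
        ok &= bool(np.all(m >= -tol))
        ok &= np.allclose(m.sum(axis=1), tau_node[s], atol=tol)   # sum over x_t
        ok &= np.allclose(m.sum(axis=0), tau_node[t], atol=tol)   # sum over x_s
    return bool(ok)

print("tau in L(G):", in_local_polytope(tau_node, tau_edge))
# Yet no distribution on {0,1}^3 can make all three pairs perfectly anti-correlated,
# so tau lies outside the marginal polytope M(G).
```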

High-level perspective: A broad class of methods

- message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of an exact variational principle in exponential families
- there are two distinct components to the approximations:
  (a) can use either inner or outer bounds to $M$
  (b) various approximations to the entropy function $A^*(\mu)$

Refining one or both components yields better approximations:
- BP: polyhedral outer bound and non-convex Bethe approximation
- Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)
- Expectation-propagation: better outer bounds and Bethe-like entropy approximations (Minka, 2002)