Graphical models and message-passing
Part II: Marginals and likelihoods

Martin Wainwright
UC Berkeley, Departments of Statistics and EECS

Tutorial materials (slides, monograph, lecture notes) available at:
www.eecs.berkeley.edu/~wainwrig/kyoto12

September 3, 2012
Graphs and factorization

[Figure: a 7-node graph with a pairwise clique potential ψ_47 and a third-order clique potential ψ_456]

- a clique C is a fully connected subset of vertices
- a compatibility function ψ_C is defined on the variables x_C = {x_s, s ∈ C}
- factorization over all cliques:

    p(x_1, \dots, x_N) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C).
Core computational challenges

Given an undirected graphical model (Markov random field)

    p(x_1, x_2, \dots, x_N) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),

how to efficiently compute:

- the most probable configuration (MAP estimate), a maximization:

    \widehat{x} = \arg\max_{x \in \mathcal{X}^N} p(x_1, \dots, x_N) = \arg\max_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- the data likelihood or normalization constant, a sum/integral:

    Z = \sum_{x \in \mathcal{X}^N} \prod_{C \in \mathcal{C}} \psi_C(x_C)

- marginal distributions at single sites, or subsets, also a sum/integral:

    p(X_s = x_s) = \frac{1}{Z} \sum_{x_t,\, t \neq s} \prod_{C \in \mathcal{C}} \psi_C(x_C)
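All three quantities are well defined by brute-force enumeration, but at exponential cost in N, which is what motivates message-passing. A minimal sketch (Python; the names edges, node_pot, edge_pot are illustrative assumptions, not from the slides) for a small pairwise MRF:

```python
# Brute-force computation of Z, the MAP configuration, and node marginals
# for a pairwise MRF with m states per node, by enumerating all m^N states.
import itertools
import numpy as np

def brute_force(N, m, edges, node_pot, edge_pot):
    Z, best, best_score = 0.0, None, -np.inf
    marg = np.zeros((N, m))
    for x in itertools.product(range(m), repeat=N):
        # unnormalized probability: product of clique compatibilities
        score = np.prod([node_pot[s][x[s]] for s in range(N)]) * \
                np.prod([edge_pot[(s, t)][x[s], x[t]] for (s, t) in edges])
        Z += score
        marg[range(N), x] += score      # accumulate unnormalized marginals
        if score > best_score:
            best, best_score = x, score
    return Z, best, marg / Z
```

The loop visits m^N configurations, so this is only feasible for toy problems; the tree algorithms below reduce the cost to O(N m^2).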
1. Sum-product message-passing on trees

Goal: compute the marginal distribution at a single node of a tree-structured model

    p(x) \propto \prod_{s \in V} \exp(\theta_s(x_s)) \prod_{(s,t) \in E} \exp(\theta_{st}(x_s, x_t)).

Example (chain 1 — 2 — 3, with messages M_12 and M_32 entering node 2): summing out x_1 and x_3 gives

    \sum_{x_1, x_3} p(x) \propto \exp(\theta_2(x_2)) \prod_{t \in \{1,3\}} \Big[ \sum_{x_t} \exp[\theta_t(x_t) + \theta_{2t}(x_2, x_t)] \Big],

so the marginal at node 2 is a product of local messages M_12(x_2) and M_32(x_2).
Putting together the pieces

Sum-product is an exact algorithm for any tree.

[Figure: node t with neighboring subtrees T_u, T_v, T_w; messages M_ut, M_vt, M_wt flow into t, and M_ts flows from t to s]

- M_ts: message from node t to node s
- N(t): neighbors of node t

Update:

    M_{ts}(x_s) \leftarrow \sum_{x_t \in \mathcal{X}_t} \Big\{ \exp\big[\theta_{st}(x_s, x_t) + \theta_t(x_t)\big] \prod_{v \in N(t) \setminus s} M_{vt}(x_t) \Big\}

Sum-marginals:

    p_s(x_s; \theta) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} M_{ts}(x_s).
Summary: sum-product on trees

- converges in at most graph-diameter many iterations
- updating a single message is an O(m^2) operation
- overall algorithm requires O(N m^2) operations
- upon convergence, yields the exact node and edge marginals:

    p_s(x_s) \propto e^{\theta_s(x_s)} \prod_{u \in N(s)} M_{us}(x_s)

    p_{st}(x_s, x_t) \propto e^{\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)} \prod_{u \in N(s) \setminus t} M_{us}(x_s) \prod_{u \in N(t) \setminus s} M_{ut}(x_t)

- messages can also be used to compute the partition function

    Z = \sum_{x_1, \dots, x_N} \prod_{s \in V} e^{\theta_s(x_s)} \prod_{(s,t) \in E} e^{\theta_{st}(x_s, x_t)}.
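As a concrete instance, here is a minimal sketch (illustrative, not the slides' code) of sum-product on a chain, a special case of a tree; the array conventions node_pot/edge_pot are assumptions:

```python
# Sum-product on a chain 0 - 1 - ... - (N-1). node_pot[s] holds theta_s as a
# length-m array and edge_pot[s] holds theta_{s,s+1} as an (m, m) array,
# both in log space. Returns exact node marginals and log Z.
import numpy as np

def sum_product_chain(node_pot, edge_pot):
    N, m = node_pot.shape
    fwd = [np.ones(m)]                        # trivial message into node 0
    for t in range(N - 1):                    # forward pass: M_{t -> t+1}
        psi = np.exp(node_pot[t][:, None] + edge_pot[t])
        fwd.append(psi.T @ fwd[t])            # sum over x_t, O(m^2)
    bwd = [np.ones(m) for _ in range(N)]      # backward pass: M_{t+1 -> t}
    for t in range(N - 2, -1, -1):
        psi = np.exp(node_pot[t + 1][None, :] + edge_pot[t])
        bwd[t] = psi @ bwd[t + 1]
    beliefs = np.exp(node_pot) * np.array(fwd) * np.array(bwd)
    logZ = np.log(beliefs[0].sum())           # same value at every node
    return beliefs / beliefs.sum(axis=1, keepdims=True), logZ
```

Each message update is the O(m^2) matrix-vector product above, and there are O(N) of them, matching the operation counts on this slide.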
2. Sum-product on graphs with cycles

As with max-product, a widely used heuristic with a long history:
- error-control coding: Gallager, 1963
- artificial intelligence: Pearl, 1988
- turbo decoding: Berrou et al., 1993
- etc.

Some concerns with sum-product on graphs with cycles:
- no convergence guarantees
- can have multiple fixed points
- the final estimate of Z is neither a lower nor an upper bound

As before, one can consider a broader class of reweighted sum-product algorithms.
Tree-reweighted sum-product algorithms

Message update from node t to node s:

    M_{ts}(x_s) \leftarrow \kappa \sum_{x_t \in \mathcal{X}_t} \exp\Big[ \underbrace{\tfrac{\theta_{st}(x_s, x_t)}{\rho_{st}}}_{\text{reweighted edge}} + \theta_t(x_t) \Big] \; \frac{\overbrace{\prod_{v \in N(t) \setminus s} \big[M_{vt}(x_t)\big]^{\rho_{vt}}}^{\text{reweighted messages}}}{\underbrace{\big[M_{st}(x_t)\big]^{(1 - \rho_{ts})}}_{\text{opposite message}}}.

Properties:
1. Modified updates remain distributed and purely local over the graph; messages are reweighted with ρ_st ∈ [0, 1].
2. Key differences: the potential on edge (s,t) is rescaled by ρ_st ∈ [0, 1], and the update involves the message in the reverse direction along the edge.
3. The choice ρ_st = 1 for all edges (s,t) recovers the standard update.
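A sketch of one such update in Python (assumption-labeled, not the slides' reference code): messages are stored as arrays keyed by ordered pairs, and theta_edge[(s, t)] is oriented as [x_s, x_t]; these conventions, and the names, are mine.

```python
# One tree-reweighted update of the message M_ts. rho[(s, t)] are edge
# appearance probabilities (stored under both orderings), msgs[(v, t)] the
# current messages, theta_node/theta_edge the log potentials.
import numpy as np

def trw_update(s, t, theta_node, theta_edge, rho, msgs, neighbors):
    # exp[ theta_st(x_s, x_t) / rho_st + theta_t(x_t) ], an (m, m) array
    weight = np.exp(theta_edge[(s, t)] / rho[(s, t)] + theta_node[t][None, :])
    incoming = np.ones_like(theta_node[t])
    for v in neighbors[t]:
        if v != s:
            incoming = incoming * msgs[(v, t)] ** rho[(v, t)]  # reweighted messages
    incoming = incoming / msgs[(s, t)] ** (1.0 - rho[(s, t)])  # opposite message
    new_msg = weight @ incoming                                # sum over x_t
    return new_msg / new_msg.sum()   # kappa: normalize for numerical stability
```

Setting every rho entry to 1 makes the exponents trivial and drops the opposite-message factor, recovering the standard sum-product update.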
Bethe entropy approximation

Define local marginal distributions (e.g., for m = 3 states):

    \mu_s(x_s) = \begin{bmatrix} \mu_s(0) \\ \mu_s(1) \\ \mu_s(2) \end{bmatrix}, \qquad
    \mu_{st}(x_s, x_t) = \begin{bmatrix} \mu_{st}(0,0) & \mu_{st}(0,1) & \mu_{st}(0,2) \\ \mu_{st}(1,0) & \mu_{st}(1,1) & \mu_{st}(1,2) \\ \mu_{st}(2,0) & \mu_{st}(2,1) & \mu_{st}(2,2) \end{bmatrix}

Define node-based entropy and edge-based mutual information:

    Node-based entropy: H_s(\mu_s) = -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s)

    Mutual information: I_{st}(\mu_{st}) = \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\,\mu_t(x_t)}.

ρ-reweighted Bethe entropy:

    H_{\mathrm{Bethe}}(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} \rho_{st} I_{st}(\mu_{st})
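These two quantities are straightforward to evaluate from the local marginals; a small sketch (illustrative names, not from the slides):

```python
# Evaluate the rho-reweighted Bethe entropy from node marginals mu_node
# (list of length-m arrays) and edge marginals mu_edge (dict of (m, m)
# arrays keyed by edge pairs); rho is a dict of edge weights.
import numpy as np

def bethe_entropy(mu_node, mu_edge, edges, rho):
    H = 0.0
    for mu_s in mu_node:                        # sum of node entropies H_s
        H -= np.sum(mu_s * np.log(mu_s))
    for (s, t) in edges:                        # minus weighted mutual informations
        joint = mu_edge[(s, t)]
        indep = np.outer(mu_node[s], mu_node[t])
        H -= rho[(s, t)] * np.sum(joint * np.log(joint / indep))
    return H
```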
Bethe entropy is exact for trees

Exact for trees, using the factorization

    p(x; \theta) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\,\mu_t(x_t)}
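To see why, take logs in this factorization and average over p; a one-step derivation (this is the ρ_st = 1 case of the reweighted entropy above):

```latex
H(p) = -\mathbb{E}_p\Big[\sum_{s \in V} \log \mu_s(X_s)
       + \sum_{(s,t) \in E} \log \frac{\mu_{st}(X_s, X_t)}{\mu_s(X_s)\,\mu_t(X_t)}\Big]
     = \sum_{s \in V} H_s(\mu_s) \;-\; \sum_{(s,t) \in E} I_{st}(\mu_{st}).
```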
Reweighted sum-product and Bethe variational principle

Define the local constraint set

    \mathbb{L}(G) = \Big\{ \tau_s, \tau_{st} \;\Big|\; \tau \geq 0, \ \sum_{x_s} \tau_s(x_s) = 1, \ \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \Big\}

Theorem. For any choice of positive edge weights ρ_st > 0:

(a) Fixed points of reweighted sum-product are stationary points of the Lagrangian associated with

    A_{\mathrm{Bethe}}(\theta; \rho) := \max_{\tau \in \mathbb{L}(G)} \Big\{ \sum_{s \in V} \langle \tau_s, \theta_s \rangle + \sum_{(s,t) \in E} \langle \tau_{st}, \theta_{st} \rangle + H_{\mathrm{Bethe}}(\tau; \rho) \Big\}.

(b) For valid choices of edge weights {ρ_st}, the fixed point is unique and moreover log Z(θ) ≤ A_Bethe(θ; ρ). In addition, reweighted sum-product converges with appropriate scheduling.
Lagrangian derivation of ordinary sum-product

- let's try to solve this problem by a (partial) Lagrangian formulation
- assign a Lagrange multiplier λ_ts(x_s) for each constraint C_ts(x_s) := τ_s(x_s) - \sum_{x_t} τ_st(x_s, x_t) = 0
- enforce the normalization (\sum_{x_s} τ_s(x_s) = 1) and non-negativity constraints explicitly
- the Lagrangian takes the form:

    L(\tau; \lambda) = \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) + \sum_{(s,t) \in E(G)} \Big[ \sum_{x_t} \lambda_{st}(x_t)\, C_{st}(x_t) + \sum_{x_s} \lambda_{ts}(x_s)\, C_{ts}(x_s) \Big]
Lagrangian derivation (part II)

- taking derivatives of the Lagrangian w.r.t. τ_s and τ_st yields

    \frac{\partial L}{\partial \tau_s(x_s)} = \theta_s(x_s) - \log \tau_s(x_s) + \sum_{t \in N(s)} \lambda_{ts}(x_s) + C

    \frac{\partial L}{\partial \tau_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s)\,\tau_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C

- setting these partial derivatives to zero and simplifying:

    \tau_s(x_s) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} \exp\{\lambda_{ts}(x_s)\}

    \tau_{st}(x_s, x_t) \propto \exp\{\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)\} \prod_{u \in N(s) \setminus t} \exp\{\lambda_{us}(x_s)\} \prod_{v \in N(t) \setminus s} \exp\{\lambda_{vt}(x_t)\}

- enforcing the constraint C_ts(x_s) = 0 on these representations yields the familiar update rule for the messages M_ts(x_s) = exp(λ_ts(x_s)):

    M_{ts}(x_s) \propto \sum_{x_t} \exp\{\theta_t(x_t) + \theta_{st}(x_s, x_t)\} \prod_{u \in N(t) \setminus s} M_{ut}(x_t)
Convex combinations of trees

Idea: upper bound A(θ) := log Z(θ) with a convex combination of tree-structured problems:

    \theta = \rho(T_1)\,\theta(T_1) + \rho(T_2)\,\theta(T_2) + \rho(T_3)\,\theta(T_3)

    A(\theta) \leq \rho(T_1)\,A(\theta(T_1)) + \rho(T_2)\,A(\theta(T_2)) + \rho(T_3)\,A(\theta(T_3))

- ρ = {ρ(T)}: probability distribution over spanning trees
- θ(T): tree-structured parameter vector
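The upper bound is just convexity of the log-partition function A applied to this decomposition of θ:

```latex
% A is convex, so Jensen's inequality gives, for any distribution \rho over
% spanning trees satisfying \sum_T \rho(T)\,\theta(T) = \theta:
A(\theta) \;=\; A\Big(\sum_{T} \rho(T)\,\theta(T)\Big)
          \;\le\; \sum_{T} \rho(T)\, A\big(\theta(T)\big).
```

Each term A(θ(T)) on the right is tractable, since T is a tree and sum-product computes it exactly.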
Finding the tightest upper bound

Observation: for each fixed distribution ρ over spanning trees, there are many such upper bounds.

Goal: find the tightest such upper bound over all trees.

Challenge: the number of spanning trees grows rapidly with graph size.

Example: on the 2-D lattice:

    Grid size | # trees
    --------- | -------------
    9         | 192
    16        | 100352
    36        | 3.26 × 10^13
    100       | 5.69 × 10^42
By a suitable dual reformulation, this problem can be avoided.

Key duality relation:

    \min_{\sum_T \rho(T)\,\theta(T) = \theta} \; \sum_T \rho(T)\, A(\theta(T)) \;=\; \max_{\mu \in \mathbb{L}(G)} \big\{ \langle \mu, \theta \rangle + H_{\mathrm{Bethe}}(\mu; \rho_{st}) \big\}.
Edge appearance probabilities

Experiment: what is the probability ρ_e that a given edge e ∈ E belongs to a tree T drawn randomly under ρ?

[Figure: (a) original graph with edges b, e, f; (b)-(d) three spanning trees T_1, T_2, T_3, each with ρ(T_i) = 1/3]

In this example: ρ_b = 1; ρ_e = 2/3; ρ_f = 1/3.

The vector ρ_e = {ρ_e | e ∈ E} must belong to the spanning tree polytope. (Edmonds, 1971)
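The computation is just a weighted count of which trees contain each edge. A tiny sketch (the tree edge sets below are hypothetical reconstructions chosen to be consistent with the slide's stated values, not read off the figure):

```python
# Edge appearance probabilities: given spanning trees as edge sets and a
# distribution over them, accumulate rho_e for each edge of the base graph.
from collections import defaultdict

def edge_appearance(trees, weights):
    """trees: list of sets of edges; weights: matching list of probabilities."""
    rho_e = defaultdict(float)
    for T, w in zip(trees, weights):
        for e in T:
            rho_e[e] += w
    return dict(rho_e)

trees = [{"b", "e"}, {"b", "e"}, {"b", "f"}]    # hypothetical tree edge sets
print(edge_appearance(trees, [1/3, 1/3, 1/3]))  # rho_b = 1, rho_e = 2/3, rho_f = 1/3
```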
Why does entropy arise in the duality?

Due to a deep correspondence between two problems:

Maximum entropy density estimation: maximize the entropy

    H(p) = -\sum_{x} p(x_1, \dots, x_N) \log p(x_1, \dots, x_N)

subject to expectation constraints of the form

    \sum_{x} p(x)\,\phi_\alpha(x) = \mu_\alpha.

Maximum likelihood in an exponential family: maximize the likelihood of parameterized densities

    p(x_1, \dots, x_N; \theta) = \exp\Big\{ \sum_\alpha \theta_\alpha \phi_\alpha(x) - A(\theta) \Big\}.
Conjugate dual functions

- conjugate duality is a fertile source of variational representations
- any function f can be used to define another function f* as follows:

    f^*(v) := \sup_{u \in \mathbb{R}^n} \big\{ \langle v, u \rangle - f(u) \big\}.

- it is easy to show that f* is always a convex function
- how about taking the dual of the dual? I.e., what is (f*)*?
- when f is well-behaved (convex and lower semi-continuous), we have (f*)* = f, or alternatively stated:

    f(u) = \sup_{v \in \mathbb{R}^n} \big\{ \langle u, v \rangle - f^*(v) \big\}
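The definition is directly computable numerically. A quick sketch (illustrative, not from the slides) approximating f*(v) on a grid for f(u) = u², whose conjugate is known in closed form, f*(v) = v²/4:

```python
# Grid-based approximation of the conjugate dual f*(v) = sup_u { v*u - f(u) }
# in one dimension, checked against the closed form for f(u) = u^2.
import numpy as np

def conjugate(f, v, grid):
    return np.max(v * grid - f(grid))    # sup over u, approximated on the grid

grid = np.linspace(-10, 10, 100001)
f = lambda u: u ** 2
for v in [-2.0, 0.0, 3.0]:
    print(v, conjugate(f, v, grid), v ** 2 / 4)  # the two values should agree
```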
Geometric view: supporting hyperplanes

Question: given all hyperplanes in R^n × R with normal (v, -1), what is the intercept of the one that supports epi(f)?

Epigraph of f:

    \mathrm{epi}(f) := \{ (u, \beta) \in \mathbb{R}^{n+1} \mid \beta \geq f(u) \}.

[Figure: the epigraph of f supported from below by hyperplanes with normal (v, -1) and intercepts -c_a, -c_b]

Analytically, we require the smallest c ∈ R such that

    \langle v, u \rangle - c \leq f(u) \quad \text{for all } u \in \mathbb{R}^n.

By re-arranging, we find that this optimal c is the dual value:

    c^* = \sup_{u \in \mathbb{R}^n} \big\{ \langle v, u \rangle - f(u) \big\}.
Example: single Bernoulli

A random variable X ∈ {0,1} yields an exponential family of the form:

    p(x; \theta) \propto \exp\{\theta x\} \quad \text{with} \quad A(\theta) = \log\big[1 + \exp(\theta)\big].

Let's compute the dual:

    A^*(\mu) := \sup_{\theta \in \mathbb{R}} \big\{ \mu\theta - \log[1 + \exp(\theta)] \big\}.

(Possible) stationary point: μ = exp(θ)/[1 + exp(θ)].

[Figure: (a) for μ ∈ (0,1), the epigraph of A is supported by a line of slope μ; (b) otherwise the epigraph cannot be supported]

We find that:

    A^*(\mu) = \begin{cases} \mu \log \mu + (1-\mu)\log(1-\mu) & \text{if } \mu \in [0,1] \\ +\infty & \text{otherwise}. \end{cases}

Leads to the variational representation:

    A(\theta) = \max_{\mu \in [0,1]} \big\{ \mu\theta - A^*(\mu) \big\}.
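This variational representation is easy to verify numerically; a short sketch (illustrative, with the boundary convention 0 log 0 := 0):

```python
# Numeric check of A(theta) = max_{mu in [0,1]} { mu*theta - A*(mu) } for the
# single Bernoulli family, where A*(mu) is the negative binary entropy.
import numpy as np

def A(theta):
    return np.log1p(np.exp(theta))           # log(1 + e^theta)

def A_star(mu):
    with np.errstate(divide="ignore", invalid="ignore"):
        v = mu * np.log(mu) + (1 - mu) * np.log(1 - mu)
    return np.nan_to_num(v)                   # 0 log 0 := 0 at the endpoints

mu = np.linspace(0.0, 1.0, 100001)
for theta in [-2.0, 0.0, 1.5]:
    lhs = A(theta)
    rhs = np.max(mu * theta - A_star(mu))     # grid maximization over mu
    print(theta, lhs, rhs)                    # the two columns should agree
```

The maximizing μ is the mean parameter exp(θ)/[1 + exp(θ)], matching the stationary point above.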
Geometry of Bethe variational problem

[Figure: the marginal polytope M(G) inside its outer bound L(G), with an interior point μ ∈ int M(G) and a fractional vertex μ_frac ∈ L(G)]

- belief propagation uses a polyhedral outer approximation to M(G): for any graph, L(G) ⊇ M(G)
- equality holds ⟺ G is a tree

Natural question: do BP fixed points ever fall outside of the marginal polytope M(G)?
Illustration: globally inconsistent BP fixed points

Consider the following assignment of pseudomarginals τ_s, τ_st:

[Figure: locally consistent (pseudo)marginals on a 3-cycle with nodes 1, 2, 3; each node carries a singleton pseudomarginal with entries ½, and each edge a pairwise pseudomarginal matrix with entries ¼ and ½]

- one can verify that τ ∈ L(G), and that τ is a fixed point of belief propagation (with all constant messages)
- however, τ is globally inconsistent

Note: more generally, for any τ in the interior of L(G), one can construct a distribution with τ as a BP fixed point.
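Global inconsistency can be certified by a linear-program feasibility check over joint distributions. A sketch (the numbers below are an illustrative locally-consistent-but-globally-impossible instance, perfectly correlated on two edges and anti-correlated on the third, not necessarily the slide's exact values; requires scipy):

```python
# Ask an LP whether any joint distribution p over {0,1}^3 has the prescribed
# pairwise marginals; infeasibility certifies that tau lies outside M(G).
import itertools
import numpy as np
from scipy.optimize import linprog

tau = {(0, 1): np.array([[0.5, 0.0], [0.0, 0.5]]),   # x0 = x1 w.p. 1
       (1, 2): np.array([[0.5, 0.0], [0.0, 0.5]]),   # x1 = x2 w.p. 1
       (0, 2): np.array([[0.0, 0.5], [0.5, 0.0]])}   # x0 != x2 w.p. 1

states = list(itertools.product([0, 1], repeat=3))   # the 8 joint configurations
A_eq, b_eq = [], []
for (s, t), mat in tau.items():                      # edge-marginal constraints
    for a in (0, 1):
        for b in (0, 1):
            A_eq.append([1.0 if (x[s], x[t]) == (a, b) else 0.0 for x in states])
            b_eq.append(mat[a, b])
A_eq.append([1.0] * len(states))                     # normalization
b_eq.append(1.0)

res = linprog(c=np.zeros(len(states)), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * len(states))
print("globally consistent?", res.success)           # expect False
```

Each single edge's constraints are satisfiable (τ ∈ L(G)), but jointly they force x0 = x1 = x2 and x0 ≠ x2, so no global distribution exists.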
High-level perspective: a broad class of methods

- message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of an exact variational principle in exponential families
- there are two distinct components to the approximations:
  (a) can use either inner or outer bounds to M
  (b) various approximations to the entropy function -A*(μ)

Refining one or both components yields better approximations:
- BP: polyhedral outer bound and non-convex Bethe approximation
- Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)
- Expectation-propagation: better outer bounds and Bethe-like entropy approximations (Minka, 2002)