School of Computer Science
Probabilistic Graphical Models
Variational Inference IV: Variational Principle II
Junming Yin
Lecture 17, March 21, 2012
Reading:
Recap: Variational Inference

Variational formulation:

$A(\theta) = \sup_{\mu \in \mathcal{M}} \{\theta^T \mu - A^*(\mu)\}$

- $\mathcal{M}$: the marginal polytope, difficult to characterize
- $A^*$: the negative entropy function, no explicit form
- Mean field method: non-convex inner bound and exact form of entropy
- Bethe approximation and loopy belief propagation: polyhedral outer bound and non-convex Bethe approximation
Mean Field Approximation
Tractable Subgraphs

Definition: A subgraph F of the graph G is tractable if it is feasible to perform exact inference on F.

Example: [figure showing tractable subgraphs of G]
Mean Field Methods

For an exponential family with sufficient statistics $\phi$ defined on graph G, the set of realizable mean parameters is

$\mathcal{M}(G; \phi) := \{\mu \in \mathbb{R}^d \mid \exists p \text{ s.t. } \mathbb{E}_p[\phi(X)] = \mu\}$

For a given tractable subgraph F, consider the subset of mean parameters

$\mathcal{M}_F(G) := \mathcal{M}(F; \phi) := \{\tau \in \mathbb{R}^d \mid \tau = \mathbb{E}_\theta[\phi(X)] \text{ for some } \theta \in \Omega(F)\}$

This is an inner approximation: $\mathcal{M}(F; \phi)^o \subseteq \mathcal{M}(G; \phi)^o$.

Mean field solves the relaxed problem

$\max_{\tau \in \mathcal{M}_F(G)} \{\langle \tau, \theta \rangle - A_F^*(\tau)\}$

where $A_F^* = A^*\big|_{\mathcal{M}_F(G)}$ is the exact dual function restricted to $\mathcal{M}_F(G)$.
Example: Naïve Mean Field for Ising Model

Ising model in {0,1} representation:

$p(x) \propto \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big)$

The associated mean parameters correspond to particular marginal probabilities:

$\mu_s = \mathbb{E}_p[X_s] = \mathbb{P}[X_s = 1]$ for all $s \in V$, and
$\mu_{st} = \mathbb{E}_p[X_s X_t] = \mathbb{P}[(X_s, X_t) = (1,1)]$ for all $(s,t) \in E$.

For the fully disconnected subgraph F,

$\mathcal{M}_F(G) := \{\tau \in \mathbb{R}^{|V|+|E|} \mid 0 \le \tau_s \le 1 \ \forall s \in V,\ \tau_{st} = \tau_s \tau_t \ \forall (s,t) \in E\}$

The dual decomposes into a sum, with one term for each node:

$A_F^*(\tau) = \sum_{s \in V} [\tau_s \log \tau_s + (1 - \tau_s) \log(1 - \tau_s)]$
Example: Naïve Mean Field for Ising Model (continued)

Mean field problem:

$A(\theta) \ge \max_{(\tau_1, \ldots, \tau_m) \in [0,1]^m} \Big\{ \sum_{s \in V} \theta_s \tau_s + \sum_{(s,t) \in E} \theta_{st} \tau_s \tau_t - A_F^*(\tau) \Big\}$

This is the same objective function as in the free-energy-based approach.

The naïve mean field update equations:

$\tau_s \leftarrow \sigma\Big( \theta_s + \sum_{t \in N(s)} \theta_{st} \tau_t \Big)$

This also yields a lower bound on the log partition function.
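The update above can be sketched as a small coordinate-ascent routine. This is an illustrative implementation, not from the lecture: the function names and the toy 3-node chain are assumptions made here.

```python
import itertools

import numpy as np


def naive_mean_field(theta_s, edges, theta_st, iters=200):
    """Coordinate ascent for naive mean field on an Ising model in {0,1}
    representation: tau_s <- sigma(theta_s + sum_{t in N(s)} theta_st tau_t)."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    n = len(theta_s)
    nbrs = {s: [] for s in range(n)}
    for (s, t), w in zip(edges, theta_st):
        nbrs[s].append((t, w))
        nbrs[t].append((s, w))
    tau = np.full(n, 0.5)
    for _ in range(iters):
        for s in range(n):
            tau[s] = sigma(theta_s[s] + sum(w * tau[t] for t, w in nbrs[s]))
    return tau


def mf_lower_bound(theta_s, edges, theta_st, tau):
    """Mean field objective <theta, tau> - A_F*(tau): a lower bound on A(theta)."""
    entropy = -np.sum(tau * np.log(tau) + (1 - tau) * np.log(1 - tau))
    energy = np.dot(theta_s, tau) + sum(
        w * tau[s] * tau[t] for (s, t), w in zip(edges, theta_st))
    return energy + entropy


theta_s = np.array([0.5, -0.2, 0.1])
edges, theta_st = [(0, 1), (1, 2)], [0.3, -0.4]
tau = naive_mean_field(theta_s, edges, theta_st)
lb = mf_lower_bound(theta_s, edges, theta_st, tau)

# exact log partition function by brute-force enumeration
A = np.log(sum(
    np.exp(theta_s @ np.array(x) + sum(w * x[s] * x[t]
                                       for (s, t), w in zip(edges, theta_st)))
    for x in itertools.product([0, 1], repeat=3)))
print(lb <= A)  # the mean field value never exceeds A(theta)
```

On such a small model the bound can also be checked against brute-force enumeration, as above; the gap between `lb` and `A` is exactly the KL divergence between the product approximation and the true distribution.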
Geometry of Mean Field

Mean field optimization is always non-convex for any exponential family in which the state space $\mathcal{X}^m$ is finite.

Recall the marginal polytope is a convex hull:

$\mathcal{M}(G) = \text{conv}\{\phi(e);\ e \in \mathcal{X}^m\}$

$\mathcal{M}_F(G)$ contains all the extreme points $\phi(e)$. If it is a strict subset of $\mathcal{M}(G)$, then it must be non-convex: were it convex, it would contain the convex hull of the extreme points, i.e., all of $\mathcal{M}(G)$.

Example: two-node Ising model,

$\mathcal{M}_F(G) = \{0 \le \tau_1 \le 1,\ 0 \le \tau_2 \le 1,\ \tau_{12} = \tau_1 \tau_2\}$

It has a parabolic cross section along $\tau_1 = \tau_2$, hence is non-convex.
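A two-line numerical check of this non-convexity (the specific numbers are chosen here for illustration): two points satisfying $\tau_{12} = \tau_1 \tau_2$ whose midpoint violates the constraint.

```python
# two points in M_F(G) for the two-node Ising model: tau_12 = tau_1 * tau_2
p = (0.1, 0.1, 0.1 * 0.1)
q = (0.9, 0.9, 0.9 * 0.9)

# their midpoint leaves the set: tau_12 = 0.41 but tau_1 * tau_2 = 0.25
mid = tuple((a + b) / 2.0 for a, b in zip(p, q))
print(mid[2], mid[0] * mid[1])
```

Since a convex set must contain the segment between any two of its points, this single violated midpoint already certifies that $\mathcal{M}_F(G)$ is non-convex.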
Bethe Approximation and Sum-Product
Historical Information

- Bethe (1935): a physicist who first developed the ideas behind the Bethe approximation that underlie loopy belief propagation; not fully appreciated outside the physics community until recently
- Gallager (1963): an electrical engineer who explored loopy belief propagation in his work on LDPC (Low-Density Parity-Check) codes
- Yedidia (2001): a physicist who made an explicit connection between loopy belief propagation and the Bethe approximation, and further developed the generalized belief propagation algorithm
Error Correcting Codes

Graphical model for the (7,4) Hamming code. The potential functions encode a hard parity constraint:

$\psi_{stu}(x_s, x_t, x_u) := \begin{cases} 1 & \text{if } x_s \oplus x_t \oplus x_u = 1 \\ 0 & \text{otherwise} \end{cases}$

Marginal probabilities = posterior bit probabilities.
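The hard-constraint potential can be written down directly. A minimal sketch (the function name is chosen here; the odd-parity convention `= 1` follows the slide):

```python
def psi_parity(xs: int, xt: int, xu: int) -> float:
    """Hard-constraint parity-check potential: the configuration is allowed
    (potential 1) iff x_s XOR x_t XOR x_u = 1, and forbidden otherwise."""
    return 1.0 if (xs ^ xt ^ xu) == 1 else 0.0


import itertools

# exactly half of the 8 binary configurations satisfy the parity constraint
allowed = [x for x in itertools.product([0, 1], repeat=3) if psi_parity(*x) == 1.0]
print(len(allowed))
```

Because each check potential is 0/1-valued, multiplying it into the joint distribution simply zeroes out the invalid codewords; the marginals of the resulting distribution are the posterior bit probabilities used for decoding.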
Example of LDPC Decoding

[figure: factor graph of an LDPC code with parity bits]
Example of LDPC Decoding

received: [figure showing the received codeword and iterative decoding]
Sum-Product/Belief Propagation Algorithm

Message passing rule:

$M_{ts}(x_s) \leftarrow \kappa \sum_{x_t'} \Big\{ \psi_{st}(x_s, x_t')\, \psi_t(x_t') \prod_{u \in N(t) \setminus s} M_{ut}(x_t') \Big\}$

Marginals:

$\mu_s(x_s) = \kappa\, \psi_s(x_s) \prod_{t \in N(s)} M_{ts}(x_s)$

where $\kappa$ again denotes a normalization constant.

Exact for trees, but approximate for loopy graphs (so-called loopy belief propagation).

Questions: How is the algorithm on trees related to the variational principle? What is the algorithm doing for graphs with cycles?
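These updates can be sketched on a chain (a tree), where the computed marginals can be checked against brute-force enumeration. All names and the binary-chain setting are assumptions made here for illustration:

```python
import itertools

import numpy as np


def sum_product_chain(psi_s, psi_st):
    """Sum-product on a binary chain x_0 - x_1 - ... - x_{n-1}.
    psi_s[t]: length-2 node potential; psi_st[t]: 2x2 edge potential
    indexed [x_t, x_{t+1}]. Returns exact node marginals."""
    n = len(psi_s)
    fwd = [np.ones(2)]                          # message into node t from the left
    for t in range(1, n):
        m = psi_st[t - 1].T @ (psi_s[t - 1] * fwd[t - 1])
        fwd.append(m / m.sum())                 # normalize for stability
    bwd = [np.ones(2) for _ in range(n)]        # message into node t from the right
    for t in range(n - 2, -1, -1):
        bwd[t] = psi_st[t] @ (psi_s[t + 1] * bwd[t + 1])
        bwd[t] = bwd[t] / bwd[t].sum()
    marg = []
    for t in range(n):
        b = psi_s[t] * fwd[t] * bwd[t]          # belief = psi_s * incoming messages
        marg.append(b / b.sum())
    return marg


rng = np.random.default_rng(0)
psi_s = [rng.uniform(0.5, 2.0, size=2) for _ in range(3)]
psi_st = [rng.uniform(0.5, 2.0, size=(2, 2)) for _ in range(2)]
marg = sum_product_chain(psi_s, psi_st)

# brute force: p(x) proportional to the product of all potentials
P = np.zeros((2, 2, 2))
for x in itertools.product([0, 1], repeat=3):
    P[x] = (psi_s[0][x[0]] * psi_s[1][x[1]] * psi_s[2][x[2]]
            * psi_st[0][x[0], x[1]] * psi_st[1][x[1], x[2]])
P /= P.sum()
print(np.allclose(marg[1], P.sum(axis=(0, 2))))  # exact on a tree
```

On a chain the two message passes are exactly the forward-backward algorithm; on a general tree the same updates run from the leaves inward and back out.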
Tree Graphical Models

Discrete variables $X_s \in \{0, 1, \ldots, m_s - 1\}$ on a tree $T = (V, E)$.

Sufficient statistics (indicators):

$\mathbb{I}_j(x_s)$ for $s = 1, \ldots, n$ and $j \in \mathcal{X}_s$; $\quad \mathbb{I}_{jk}(x_s, x_t)$ for $(s,t) \in E$ and $(j,k) \in \mathcal{X}_s \times \mathcal{X}_t$

Exponential representation of the distribution:

$p(x; \theta) \propto \exp\Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}$

where $\theta_s(x_s) := \sum_{j \in \mathcal{X}_s} \theta_{s;j} \mathbb{I}_j(x_s)$ (and similarly for $\theta_{st}(x_s, x_t)$).

Why is this representation useful? How is it related to inference?

Mean parameters are marginal probabilities:

$\mu_{s;j} = \mathbb{E}_p[\mathbb{I}_j(X_s)] = \mathbb{P}[X_s = j]$ for $j \in \mathcal{X}_s$, so $\mu_s(x_s) = \sum_{j \in \mathcal{X}_s} \mu_{s;j} \mathbb{I}_j(x_s) = \mathbb{P}(X_s = x_s)$

$\mu_{st;jk} = \mathbb{E}_p[\mathbb{I}_{jk}(X_s, X_t)] = \mathbb{P}[X_s = j, X_t = k]$ for $(j,k) \in \mathcal{X}_s \times \mathcal{X}_t$, so $\mu_{st}(x_s, x_t) = \sum_{(j,k)} \mu_{st;jk} \mathbb{I}_{jk}(x_s, x_t) = \mathbb{P}(X_s = x_s, X_t = x_t)$
Marginal Polytope for Trees

Recall the marginal polytope for general graphs:

$\mathcal{M}(G) = \{\mu \in \mathbb{R}^d \mid \exists p \text{ with marginals } \mu_{s;j}, \mu_{st;jk}\}$

By the junction tree theorem (see Prop. 2.1 & Prop. 4.1),

$\mathcal{M}(T) = \Big\{ \mu \ge 0 \;\Big|\; \sum_{x_s} \mu_s(x_s) = 1,\ \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \Big\}$

In particular, if $\mu \in \mathcal{M}(T)$, then

$p_\mu(x) := \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\,\mu_t(x_t)}$

has the corresponding marginals (with the convention $0/0 := 0$ in case of zeros in the elements of $\mu$).
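This factorization can be verified numerically on a small chain (a tree); the random 3-node setup below is an assumption chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# a chain-structured (tree) distribution p(x) proportional to psi01 * psi12
psi01 = rng.uniform(0.5, 2.0, (2, 2))
psi12 = rng.uniform(0.5, 2.0, (2, 2))
P = np.einsum('ab,bc->abc', psi01, psi12)
P /= P.sum()

# node and edge marginals of p
mu0, mu1, mu2 = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
mu01, mu12 = P.sum(2), P.sum(0)

# tree factorization: p_mu = prod_s mu_s * prod_(s,t) mu_st / (mu_s mu_t)
Q = np.einsum('a,b,c,ab,bc->abc', mu0, mu1, mu2,
              mu01 / np.outer(mu0, mu1), mu12 / np.outer(mu1, mu2))
print(np.max(np.abs(P - Q)))  # numerically zero: p_mu recovers p on a tree
```

The check succeeds precisely because the graph is a tree; on a loopy graph the analogous product of pseudo-marginals generally does not reproduce any joint distribution.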
Decomposition of Entropy for Trees

For trees, the entropy decomposes as

$H(p(x; \mu)) = -\sum_x p(x; \mu) \log p(x; \mu)$

$= \sum_{s \in V} \underbrace{\Big( -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s) \Big)}_{H_s(\mu_s)} \;-\; \sum_{(s,t) \in E} \underbrace{\Big( \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\,\mu_t(x_t)} \Big)}_{I_{st}(\mu_{st}),\ \text{a KL divergence}}$

$= \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st})$

The dual function has an explicit form: $A^*(\mu) = -H(p(x; \mu))$.
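The identity can be checked numerically on a small tree; the random chain below is an illustrative setup chosen here, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
# a chain-structured (tree) distribution on three binary variables
psi01 = rng.uniform(0.5, 2.0, (2, 2))
psi12 = rng.uniform(0.5, 2.0, (2, 2))
P = np.einsum('ab,bc->abc', psi01, psi12)
P /= P.sum()

H = -np.sum(P * np.log(P))                    # joint entropy

mu0, mu1, mu2 = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
mu01, mu12 = P.sum(2), P.sum(0)

def node_entropy(m):
    return -np.sum(m * np.log(m))             # H_s

def mutual_info(mab, ma, mb):
    return np.sum(mab * np.log(mab / np.outer(ma, mb)))  # I_st

rhs = (node_entropy(mu0) + node_entropy(mu1) + node_entropy(mu2)
       - mutual_info(mu01, mu0, mu1) - mutual_info(mu12, mu1, mu2))
print(abs(H - rhs))  # numerically zero on a tree
```

On a loopy graph the same right-hand side is exactly the Bethe entropy approximation, which no longer equals the true entropy (see the later slide on its inexactness).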
Exact Variational Principle for Trees

Variational formulation:

$A(\theta) = \max_{\mu \in \mathcal{M}(T)} \Big\{ \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}) \Big\}$

Assign a Lagrange multiplier $\lambda_{ss}$ for the normalization constraint $C_{ss}(\mu) := 1 - \sum_{x_s} \mu_s(x_s) = 0$, and $\lambda_{ts}(x_s)$ for each marginalization constraint $C_{ts}(x_s; \mu) := \mu_s(x_s) - \sum_{x_t} \mu_{st}(x_s, x_t) = 0$.

The Lagrangian has the form

$\mathcal{L}(\mu, \lambda) = \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}) + \sum_{s \in V} \lambda_{ss} C_{ss}(\mu) + \sum_{(s,t) \in E} \Big[ \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t; \mu) + \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s; \mu) \Big]$
Lagrangian Derivation

Taking derivatives of the Lagrangian w.r.t. $\mu_s(x_s)$ and $\mu_{st}(x_s, x_t)$:

$\frac{\partial \mathcal{L}}{\partial \mu_s(x_s)} = \theta_s(x_s) - \log \mu_s(x_s) + \sum_{t \in N(s)} \lambda_{ts}(x_s) + C$

$\frac{\partial \mathcal{L}}{\partial \mu_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\,\mu_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C$

Setting them to zero yields

$\mu_s(x_s) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} \underbrace{\exp\{\lambda_{ts}(x_s)\}}_{M_{ts}(x_s)}$

$\mu_{st}(x_s, x_t) \propto \exp\{\theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t)\} \prod_{u \in N(s) \setminus t} \exp\{\lambda_{us}(x_s)\} \prod_{v \in N(t) \setminus s} \exp\{\lambda_{vt}(x_t)\}$
Lagrangian Derivation (continued)

Adjusting the Lagrange multipliers (messages) to enforce $C_{ts}(x_s; \mu) := \mu_s(x_s) - \sum_{x_t} \mu_{st}(x_s, x_t) = 0$ yields

$M_{ts}(x_s) \propto \sum_{x_t} \exp\{\theta_t(x_t) + \theta_{st}(x_s, x_t)\} \prod_{u \in N(t) \setminus s} M_{ut}(x_t)$

Conclusion: the message passing updates are a Lagrangian method for solving the stationarity condition of the variational formulation.
BP on Arbitrary Graphs

Two main difficulties of the variational formulation $A(\theta) = \sup_{\mu \in \mathcal{M}} \{\theta^T \mu - A^*(\mu)\}$:

1. The marginal polytope $\mathcal{M}$ is hard to characterize, so let's use the tree-based outer bound

$\mathcal{L}(G) = \Big\{ \tau \ge 0 \;\Big|\; \sum_{x_s} \tau_s(x_s) = 1,\ \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \Big\}$

These locally consistent vectors $\tau$ are called pseudo-marginals.

2. The exact entropy $-A^*(\mu)$ lacks an explicit form, so let's approximate it by the exact expression for trees:

$-A^*(\tau) \approx H_{\text{Bethe}}(\tau) := \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st})$
Bethe Variational Problem (BVP)

Combining these two ingredients leads to the Bethe variational problem (BVP):

$\max_{\tau \in \mathcal{L}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \Big\}$

A simple structured problem (differentiable, and the constraint set is a simple convex polytope).

Loopy BP can be derived as an iterative method for solving a Lagrangian formulation of the BVP (Theorem 4.2); the proof is similar to that for tree graphs.
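Loopy BP as an iterative scheme can be sketched for binary pairwise models as below. The function name and the parallel-update schedule are choices made here, not the lecture's; on a tree the scheme reduces to exact sum-product, which the usage check exploits.

```python
import numpy as np


def loopy_bp(psi_s, edges, psi_st, iters=100):
    """Parallel sum-product updates on a binary pairwise model (possibly
    with cycles). Returns node pseudo-marginals tau_s after `iters` sweeps."""
    n = len(psi_s)
    pair = {}
    for (a, b), m in zip(edges, psi_st):
        pair[(a, b)], pair[(b, a)] = m, m.T      # pair[(t, s)] indexed [x_t, x_s]
    nbrs = {s: [] for s in range(n)}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    msgs = {key: np.ones(2) for key in pair}     # message M_ts, a function of x_s
    for _ in range(iters):
        new = {}
        for (t, s) in msgs:
            prod = psi_s[t].copy()               # psi_t times incoming messages
            for u in nbrs[t]:
                if u != s:
                    prod = prod * msgs[(u, t)]
            m = pair[(t, s)].T @ prod            # sum over x_t
            new[(t, s)] = m / m.sum()            # normalize
        msgs = new
    tau = []
    for s in range(n):
        b = psi_s[s].copy()
        for t in nbrs[s]:
            b = b * msgs[(t, s)]
        tau.append(b / b.sum())
    return tau
```

On a loopy graph the same routine runs unchanged, but the returned `tau` are only pseudo-marginals: a stationary point of the BVP rather than the true marginals.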
Geometry of BP

Consider an assignment of pseudo-marginals on the 3-node cycle [figure]. One can easily verify that $\tau \in \mathcal{L}(G)$; however, $\tau \notin \mathcal{M}(G)$ (this needs a bit more work).

For any graph, $\mathcal{L}(G) \supseteq \mathcal{M}(G)$ (the tree-based outer bound), and equality holds if and only if the graph is a tree.

Question: does the solution to the BVP ever fall into the gap? Yes: for any element of the outer bound $\mathcal{L}(G)$, it is possible to construct a distribution with it as a BP fixed point (Wainwright et al. 2003).
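A standard concrete instance of this gap (the lecture's figure shows a similar triangle example; the exact numbers here are an assumption): uniform node pseudo-marginals with perfect disagreement on every edge of a 3-cycle.

```python
import itertools

# On the 3-cycle, take tau_s = (0.5, 0.5) at every node and a perfectly
# anti-correlated pairwise pseudo-marginal on every edge:
tau_s = (0.5, 0.5)
tau_st = [[0.0, 0.5], [0.5, 0.0]]

# Local consistency: each row/column of tau_st sums to tau_s, so tau is in L(G).
assert all(abs(sum(row) - 0.5) < 1e-12 for row in tau_st)

# Global inconsistency: tau_st puts zero mass on agreeing pairs, so any joint
# distribution realizing it could only place mass on configurations where all
# three adjacent pairs disagree -- and no such binary configuration exists.
support = [x for x in itertools.product([0, 1], repeat=3)
           if x[0] != x[1] and x[1] != x[2] and x[0] != x[2]]
print(support)  # [] => tau is in L(G) but not in M(G)
```

The emptiness of `support` is the parity obstruction on an odd cycle: no 2-coloring of a triangle exists, so no distribution has these pseudo-marginals as its true marginals.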
Inexactness of Bethe Entropy Approximation

Consider a fully connected graph on 4 nodes with

$\mu_s(x_s) = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}$ for $s = 1, 2, 3, 4$, and $\mu_{st}(x_s, x_t) = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$ for all $(s,t) \in E$.

This $\mu$ is globally valid: $\tau \in \mathcal{M}(G)$, realized by the distribution that places mass 1/2 on each of the configurations (0,0,0,0) and (1,1,1,1).

$H_{\text{Bethe}}(\mu) = 4 \log 2 - 6 \log 2 = -2 \log 2 < 0$

whereas the true entropy is $-A^*(\mu) = \log 2 > 0$.
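The arithmetic on this slide can be reproduced directly:

```python
import numpy as np

log2 = np.log(2.0)
mu_s = np.array([0.5, 0.5])
mu_st = np.array([[0.5, 0.0], [0.0, 0.5]])

# node entropy H_s and edge mutual information I_st (sum over nonzero entries)
H_s = -np.sum(mu_s * np.log(mu_s))            # = log 2 per node
nz = mu_st > 0
I_st = np.sum(mu_st[nz] * np.log(mu_st[nz] / np.outer(mu_s, mu_s)[nz]))  # = log 2

# K4 has 4 nodes and 6 edges
H_bethe = 4 * H_s - 6 * I_st
print(H_bethe / log2)  # about -2: negative, so not a valid entropy

# true entropy of the realizing distribution (mass 1/2 on two configurations)
H_true = -2 * (0.5 * np.log(0.5))             # = log 2 > 0
```

Each of the six edges contributes a full $\log 2$ of mutual information, so the pairwise corrections overcount the shared information among the four perfectly coupled nodes; that overcounting is what drives the Bethe entropy negative.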
Discussions

This connection provides a principled basis for applying the sum-product algorithm to loopy graphs. However:

- Although a fixed point of loopy BP always exists, there are no guarantees on the convergence of the algorithm on loopy graphs
- The Bethe variational problem is usually non-convex, so there are no guarantees of reaching the global optimum
- In general, there is no guarantee that $A_{\text{Bethe}}(\theta)$ is a lower bound on $A(\theta)$

Nevertheless, the connection and understanding suggest a number of avenues for improving upon the ordinary sum-product algorithm, via progressively better approximations to the entropy function and outer bounds on the marginal polytope (e.g., Kikuchi clustering).
Summary

- Variational methods in general turn inference into an optimization problem via exponential families and convex duality
- The exact variational principle is intractable to solve; there are two distinct components for approximations:
  - Either an inner or an outer bound to the marginal polytope
  - Various approximations to the entropy function
- Mean field: non-convex inner bound and exact form of entropy
- BP: polyhedral outer bound and non-convex Bethe approximation
- Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (Yedidia et al. 2002)
Summary (continued)

Off-the-shelf solution to the inference problem?

- Mean field: yields a lower bound on the log partition function (likelihood); widely used as an approximate E-step in the EM algorithm
- Sum-product: works well if the graph is locally tree-like and typically performs better than mean field; successfully used in error-correcting codes and the low-level vision community