Probabilistic Graphical Models. Theory of Variational Inference: Inner and Outer Approximation. Lecture 15, March 4, 2013


School of Computer Science
Probabilistic Graphical Models
Theory of Variational Inference: Inner and Outer Approximation
Junming Yin
Lecture 15, March 4, 2013
Reading: W & J Book Chapters

Roadmap
Two families of approximate inference algorithms:
- Loopy belief propagation (sum-product)
- Mean-field approximation
Are there connections between these two approaches?
We will re-examine them from a unified point of view based on the variational principle:
- Loopy BP: outer approximation
- Mean field: inner approximation

Variational Methods
"Variational": a fancy name for optimization-based formulations, i.e.,
- represent the quantity of interest as the solution to an optimization problem
- approximate the desired solution by relaxing/approximating the intractable optimization problem
Examples:
- Courant-Fischer for eigenvalues: $\lambda_{\max}(A) = \max_{\|x\|_2 = 1} x^T A x$
- Linear system of equations: $Ax = b$, $A \succ 0$, $x^* = A^{-1} b$
  - variational formulation: $x^* = \arg\min_x \{ \tfrac{1}{2} x^T A x - b^T x \}$
  - for large systems, apply the conjugate gradient method

Inference Problems in Graphical Models
Undirected graphical model (MRF): $p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$
The quantities of interest:
- marginal distributions: $p(x_i) = \sum_{x_j,\, j \neq i} p(x)$
- normalization constant (partition function): $Z$
Question: how to represent these quantities in a variational form?
Use tools from (1) exponential families; (2) convex analysis
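The linear-system example on the Variational Methods slide can be sketched in a few lines. The following is only an illustrative sketch in plain Python: the 2x2 matrix `A`, the vector `b`, and all function names are assumptions chosen for the example, not part of the lecture. It minimizes $\tfrac{1}{2} x^T A x - b^T x$ with the conjugate gradient method and recovers the solution of $Ax = b$.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quadratic(A, b, x):
    """Variational objective 0.5 * x^T A x - b^T x."""
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)

def conjugate_gradient(A, b, tol=1e-24, max_iter=100):
    """Minimize the quadratic objective for symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x at the starting point x = 0
    p = list(r)            # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:       # squared residual norm is small enough
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Hypothetical symmetric positive definite system, for illustration only.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = conjugate_gradient(A, b)
print(x_star, quadratic(A, b, x_star))
```

The minimizer of the quadratic coincides with $A^{-1} b$, which is why an optimization routine can stand in for a direct solve.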

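Exponential Families
Canonical parameterization: $p(x; \theta) = \exp\{\langle \theta, \phi(x) \rangle - A(\theta)\}$
- $\theta$: canonical parameters
- $\phi(x)$: sufficient statistics
- $A(\theta)$: log partition function
Log normalization constant: $A(\theta) = \log \int \exp\{\theta^T \phi(x)\}\, dx$
- it is a convex function (Prop 3.1)
Effective canonical parameters: $\Omega := \{\theta \in \mathbb{R}^d \mid A(\theta) < +\infty\}$

Graphical Models as Exponential Families
Undirected graphical model (MRF): $p(x; \theta) = \frac{1}{Z(\theta)} \prod_{C \in \mathcal{C}} \psi(x_C; \theta_C)$
MRF in an exponential form: $p(x; \theta) = \exp\{ \sum_{C \in \mathcal{C}} \log \psi(x_C; \theta_C) - \log Z(\theta) \}$
- $\log \psi(x_C; \theta_C)$ can be written in a linear form after some parameterization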

Example: Gaussian MRF
Consider a zero-mean multivariate Gaussian distribution that respects the Markov property of a graph
- Hammersley-Clifford theorem states that the precision matrix $\Lambda = \Sigma^{-1}$ also respects the graph structure
- Gaussian MRF in the exponential form: $p(x) = \exp\{ \tfrac{1}{2} \langle \Theta, x x^T \rangle - A(\Theta) \}$, where $\Theta = -\Lambda$
- Sufficient statistics are $\{x_s^2, s \in V;\; x_s x_t, (s,t) \in E\}$

Example: Discrete MRF
[Figure: nodes s and t with node potentials $\theta_s(x_s)$, $\theta_t(x_t)$ and edge potential $\theta_{st}(x_s, x_t)$]
Indicators: $I_j(x_s) = 1$ if $x_s = j$, and $0$ otherwise
Parameters: $\theta_s = \{\theta_{s;j}, j \in \mathcal{X}_s\}$, $\theta_{st} = \{\theta_{st;jk}, (j,k) \in \mathcal{X}_s \times \mathcal{X}_t\}$
In exponential form:
  $p(x; \theta) \propto \exp\{ \sum_{s \in V} \sum_{j} \theta_{s;j} I_j(x_s) + \sum_{(s,t) \in E} \sum_{j,k} \theta_{st;jk} I_j(x_s) I_k(x_t) \}$

Why Exponential Families?
Computing the expectation of sufficient statistics (mean parameters) given the canonical parameters yields the marginals:
  $\mu_{s;j} = E_p[I_j(X_s)] = P[X_s = j]$ for all $j \in \mathcal{X}_s$
  $\mu_{st;jk} = E_p[I_{st;jk}(X_s, X_t)] = P[X_s = j, X_t = k]$ for all $(j,k) \in \mathcal{X}_s \times \mathcal{X}_t$
Computing the normalizer yields the log partition function: $\log Z(\theta) = A(\theta)$

Computing Mean Parameter: Bernoulli
A single Bernoulli random variable X:
  $p(x; \theta) = \exp\{\theta x - A(\theta)\}$, $x \in \{0, 1\}$, $A(\theta) = \log(1 + e^{\theta})$
Inference = computing the mean parameter:
  $\mu(\theta) = E_\theta[X] = 1 \cdot p(X = 1; \theta) + 0 \cdot p(X = 0; \theta) = \frac{e^{\theta}}{1 + e^{\theta}}$
Want to do it in a variational manner: cast the procedure of computing the mean (summation) in an optimization-based formulation

Conjugate Dual Function
Given any function $f(\theta)$, its conjugate dual function is:
  $f^*(\mu) = \sup_{\theta} \{ \langle \theta, \mu \rangle - f(\theta) \}$
[Figure: graph of f with a supporting line of slope $\mu$ whose intercept is $(0, -f^*(\mu))$]
The conjugate dual is always a convex function: it is a point-wise supremum of a class of linear functions

Dual of the Dual is the Original
Under some technical conditions on f (convex and lower semicontinuous), the dual of the dual is f itself:
  $f = (f^*)^*$, i.e. $f(\theta) = \sup_{\mu} \{ \langle \theta, \mu \rangle - f^*(\mu) \}$
For the log partition function:
  $A(\theta) = \sup_{\mu} \{ \langle \theta, \mu \rangle - A^*(\mu) \}$, $\theta \in \Omega$
The dual variable $\mu$ has a natural interpretation as the mean parameters

Computing Mean Parameter: Bernoulli
The conjugate: $A^*(\mu) := \sup_{\theta \in \mathbb{R}} \{ \mu\theta - \log[1 + \exp(\theta)] \}$
Stationary condition: $\mu = \frac{e^{\theta}}{1 + e^{\theta}}$ (i.e. $\mu = \nabla A(\theta)$)
- If $\mu \in (0, 1)$: $\theta(\mu) = \log \frac{\mu}{1 - \mu}$ and $A^*(\mu) = \mu \log \mu + (1 - \mu) \log(1 - \mu)$
- If $\mu \notin [0, 1]$: $A^*(\mu) = +\infty$
We have
  $A^*(\mu) = \mu \log \mu + (1 - \mu) \log(1 - \mu)$ if $\mu \in [0, 1]$, and $+\infty$ otherwise
The variational form: $A(\theta) = \max_{\mu \in [0,1]} \{ \mu\theta - A^*(\mu) \}$
The optimum is achieved at $\mu(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}$. This is the mean!

Remark
The last few identities are not coincidental but rely on a deep theory for general exponential families:
- The dual function is the negative entropy function
- The mean parameter is restricted
- Solving the optimization returns the mean parameter and the log partition function
Next step: develop this framework for general exponential families/graphical models. However,
- Computing the conjugate dual (entropy) is in general intractable
- The constraint set of mean parameters is hard to characterize
- Hence we need approximation
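The Bernoulli variational form above can be checked numerically. The sketch below is an illustration under assumptions: the value `theta = 0.7`, the helper names, and the ternary-search maximizer are all choices made for this example, not part of the lecture. It maximizes $\mu\theta - A^*(\mu)$ over $[0, 1]$ and compares the maximizer with the logistic mean and the optimal value with $A(\theta) = \log(1 + e^{\theta})$.

```python
import math

def log_partition(theta):
    """A(theta) = log(1 + e^theta) for a Bernoulli variable."""
    return math.log1p(math.exp(theta))

def neg_entropy(mu):
    """A*(mu) = mu log mu + (1 - mu) log(1 - mu), with 0 log 0 := 0."""
    def xlogx(p):
        return p * math.log(p) if p > 0.0 else 0.0
    return xlogx(mu) + xlogx(1.0 - mu)

def maximize_concave(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

theta = 0.7  # arbitrary canonical parameter, chosen for illustration
mu_hat = maximize_concave(lambda mu: mu * theta - neg_entropy(mu), 0.0, 1.0)
optimal_value = mu_hat * theta - neg_entropy(mu_hat)
logistic_mean = math.exp(theta) / (1.0 + math.exp(theta))

print(mu_hat, logistic_mean)                 # maximizer vs. the mean parameter
print(optimal_value, log_partition(theta))   # optimal value vs. A(theta)
```

The optimizer never evaluates the sum over states; the mean and the log partition function both come out of the optimization problem.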

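Computation of Conjugate Dual
Given an exponential family
  $p(x_1, \ldots, x_m; \theta) = \exp\{ \sum_{i=1}^{d} \theta_i \phi_i(x) - A(\theta) \}$
The dual function: $A^*(\mu) := \sup_{\theta \in \Omega} \{ \langle \mu, \theta \rangle - A(\theta) \}$
The stationary condition: $\mu - \nabla A(\theta) = 0$
Derivatives of A yield mean parameters:
  $\frac{\partial A}{\partial \theta_i}(\theta) = E_\theta[\phi_i(X)] = \int \phi_i(x)\, p(x; \theta)\, dx$
The stationary condition becomes $\mu = E_\theta[\phi(X)]$
Question: for which $\mu \in \mathbb{R}^d$ does it have a solution $\theta(\mu)$?

Computation of Conjugate Dual
Let's assume there is a solution $\theta(\mu)$ such that $\mu = E_{\theta(\mu)}[\phi(X)]$
The dual has the form
  $A^*(\mu) = \langle \theta(\mu), \mu \rangle - A(\theta(\mu))$
  $= E_{\theta(\mu)}[ \langle \theta(\mu), \phi(X) \rangle - A(\theta(\mu)) ]$
  $= E_{\theta(\mu)}[ \log p(X; \theta(\mu)) ]$
The entropy is defined as $H(p(x)) = -\int p(x) \log p(x)\, dx$
So the dual is $A^*(\mu) = -H(p(x; \theta(\mu)))$ when there is a solution $\theta(\mu)$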
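Both identities on these two slides can be verified on a toy family. The sketch below is a hedged illustration: the three-state variable, the sufficient statistics $\phi(x) = (x, x^2)$, the parameter values, and every function name are assumptions made for the example. It compares a finite-difference gradient of $A$ with the mean parameters, and checks that $\langle \theta, \mu \rangle - A(\theta)$ equals the negative entropy at $\mu = E_\theta[\phi(X)]$.

```python
import math

STATES = [0, 1, 2]  # hypothetical finite state space

def phi(x):
    """Hypothetical sufficient statistics phi(x) = (x, x^2)."""
    return (float(x), float(x * x))

def inner(theta, x):
    return sum(t * f for t, f in zip(theta, phi(x)))

def log_partition(theta):
    return math.log(sum(math.exp(inner(theta, x)) for x in STATES))

def probabilities(theta):
    a = log_partition(theta)
    return [math.exp(inner(theta, x) - a) for x in STATES]

def mean_parameters(theta):
    p = probabilities(theta)
    return [sum(px * phi(x)[i] for px, x in zip(p, STATES)) for i in range(2)]

def numerical_gradient(theta, h=1e-6):
    """Central finite differences of the log partition function."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += h
        down = list(theta)
        down[i] -= h
        grad.append((log_partition(up) - log_partition(down)) / (2.0 * h))
    return grad

theta = [0.3, -0.5]
mu = mean_parameters(theta)
grad = numerical_gradient(theta)
entropy = -sum(p * math.log(p) for p in probabilities(theta))
dual_value = sum(t * m for t, m in zip(theta, mu)) - log_partition(theta)

print(grad, mu)              # gradient of A vs. mean parameters
print(dual_value, -entropy)  # A*(mu) vs. negative entropy
```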

Complexity of Computing Conjugate Dual
The dual function is implicitly defined:
- Solving the inverse mapping $\mu = E_\theta[\phi(X)]$ for the canonical parameters $\theta(\mu)$ is nontrivial
- Evaluating the negative entropy requires high-dimensional integration (summation)
Question: for which $\mu \in \mathbb{R}^d$ does it have a solution $\theta(\mu)$? i.e., what is the domain of $A^*(\mu)$?
- The ones in the marginal polytope!

Marginal Polytope
For any distribution $p(x)$ and a set of sufficient statistics $\phi(x)$, define a vector of mean parameters
  $\mu_i = E_p[\phi_i(X)] = \int \phi_i(x)\, p(x)\, dx$
- $p(x)$ is not necessarily an exponential family
The set of all realizable mean parameters:
  $M := \{ \mu \in \mathbb{R}^d \mid \exists\, p \text{ s.t. } E_p[\phi(X)] = \mu \}$
- It is a convex set
- For discrete exponential families, this is called the marginal polytope

Convex Polytope
- Convex hull representation
- Half-plane representation
- Minkowski-Weyl Theorem: any non-empty convex polytope can be characterized by a finite collection of linear inequality constraints

Example: Two-node Ising Model
Sufficient statistics: $\phi(x) := (x_s, s \in V;\; x_s x_t, (s,t) \in E) \in \mathbb{R}^{|V| + |E|}$
Mean parameters:
  $\mu_s = E_p[X_s] = P[X_s = 1]$ for all $s \in V$
  $\mu_{st} = E_p[X_s X_t] = P[(X_s, X_t) = (1, 1)]$ for all $(s,t) \in E$
Two-node Ising model ($X_1$ - $X_2$):
- Convex hull representation: $\mathrm{conv}\{(0,0,0), (1,0,0), (0,1,0), (1,1,1)\}$
- Half-plane representation: $\mu_1 \geq \mu_{12}$, $\mu_2 \geq \mu_{12}$, $\mu_{12} \geq 0$, $1 + \mu_{12} \geq \mu_1 + \mu_2$
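The two representations of the two-node Ising marginal polytope can be compared directly. The following sketch is illustrative only: the function names, the tolerance, and the test points are assumptions for this example. It checks that every vertex of the convex hull, and a convex combination of them, satisfies the four half-plane constraints, while a point with $\mu_{12} > \mu_1$ does not.

```python
# Vertices phi(x1, x2) = (x1, x2, x1 * x2) for the configurations
# (0,0), (1,0), (0,1), (1,1), in that order.
VERTICES = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]

def in_marginal_polytope(mu1, mu2, mu12, tol=1e-12):
    """Half-plane test for the two-node Ising marginal polytope."""
    return (mu12 >= -tol
            and mu1 - mu12 >= -tol
            and mu2 - mu12 >= -tol
            and 1.0 + mu12 - mu1 - mu2 >= -tol)

def mean_parameters(weights):
    """Mean parameters of the distribution with the given configuration weights."""
    return tuple(sum(w * v[i] for w, v in zip(weights, VERTICES)) for i in range(3))

inside = mean_parameters([0.1, 0.2, 0.3, 0.4])  # a valid distribution
outside = (0.5, 0.5, 0.6)                       # violates mu1 >= mu12

print(all(in_marginal_polytope(*v) for v in VERTICES))
print(inside, in_marginal_polytope(*inside))
print(outside, in_marginal_polytope(*outside))
```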


Marginal Polytope for General Graphs
- Still doable for connected binary graphs with 3 nodes: 16 constraints
- For tree graphical models, the number of half-planes (facet complexity) grows only linearly in the graph size
- General graphs? It is extremely hard to characterize the marginal polytope

Variational Principle (Theorem 3.4)
The dual function takes the form
  $A^*(\mu) = -H(p_{\theta(\mu)})$ if $\mu \in M$, and $+\infty$ if $\mu \notin M$
- $\theta(\mu)$ satisfies $\mu = E_{\theta(\mu)}[\phi(X)]$
The log partition function has the variational form
  $A(\theta) = \sup_{\mu \in M} \{ \theta^T \mu - A^*(\mu) \}$
- For all $\theta \in \Omega$, the supremum in the above optimization problem is attained uniquely at $\mu(\theta) \in M^{\circ}$ that satisfies $\mu(\theta) = E_\theta[\phi(X)]$

Example: Two-node Ising Model ($X_1$ - $X_2$)
The distribution: $p(x; \theta) \propto \exp\{ \theta_1 x_1 + \theta_2 x_2 + \theta_{12} x_1 x_2 \}$
Sufficient statistics: $\{x_1, x_2, x_1 x_2\}$
The marginal polytope is characterized by
  $\mu_1 \geq \mu_{12}$, $\mu_2 \geq \mu_{12}$, $\mu_{12} \geq 0$, $1 + \mu_{12} \geq \mu_1 + \mu_2$
The dual has an explicit form
  $A^*(\mu) = \mu_{12} \log \mu_{12} + (\mu_1 - \mu_{12}) \log(\mu_1 - \mu_{12}) + (\mu_2 - \mu_{12}) \log(\mu_2 - \mu_{12}) + (1 + \mu_{12} - \mu_1 - \mu_2) \log(1 + \mu_{12} - \mu_1 - \mu_2)$
The variational problem
  $A(\theta) = \max_{(\mu_1, \mu_2, \mu_{12}) \in M} \{ \theta_1 \mu_1 + \theta_2 \mu_2 + \theta_{12} \mu_{12} - A^*(\mu) \}$
The optimum is attained at
  $\mu_1(\theta) = \frac{\exp\{\theta_1\} + \exp\{\theta_1 + \theta_2 + \theta_{12}\}}{1 + \exp\{\theta_1\} + \exp\{\theta_2\} + \exp\{\theta_1 + \theta_2 + \theta_{12}\}}$

Variational Principle
Exact variational formulation: $A(\theta) = \sup_{\mu \in M} \{ \theta^T \mu - A^*(\mu) \}$
- $M$: the marginal polytope, difficult to characterize
- $A^*$: the negative entropy function, no explicit form
Mean field method: non-convex inner bound and exact form of entropy
Bethe approximation and loopy belief propagation: polyhedral outer bound and non-convex Bethe approximation
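For the two-node Ising model everything in the variational principle is explicit, so it can be checked by brute force. The sketch below assumes arbitrary parameter values and helper names chosen for illustration. It enumerates the four configurations, evaluates the objective $\theta^T \mu - A^*(\mu)$ at the exact mean parameters, and compares it with $A(\theta)$ and with the closed form for $\mu_1(\theta)$.

```python
import math

def exact_inference(theta1, theta2, theta12):
    """Log partition function and mean parameters by enumeration."""
    weights = {(x1, x2): math.exp(theta1 * x1 + theta2 * x2 + theta12 * x1 * x2)
               for x1 in (0, 1) for x2 in (0, 1)}
    z = sum(weights.values())
    mu1 = (weights[(1, 0)] + weights[(1, 1)]) / z
    mu2 = (weights[(0, 1)] + weights[(1, 1)]) / z
    mu12 = weights[(1, 1)] / z
    return math.log(z), (mu1, mu2, mu12)

def dual(mu1, mu2, mu12):
    """Explicit negative entropy A*(mu), with 0 log 0 := 0."""
    terms = [mu12, mu1 - mu12, mu2 - mu12, 1.0 + mu12 - mu1 - mu2]
    return sum(t * math.log(t) for t in terms if t > 0.0)

def objective(theta, mu):
    return sum(t * m for t, m in zip(theta, mu)) - dual(*mu)

theta = (0.5, -0.3, 1.2)  # arbitrary parameters, for illustration only
log_z, mu = exact_inference(*theta)
at_optimum = objective(theta, mu)
elsewhere = objective(theta, (0.5, 0.5, 0.25))  # another point of the polytope

t1, t2, t12 = theta
closed_form_mu1 = ((math.exp(t1) + math.exp(t1 + t2 + t12))
                   / (1.0 + math.exp(t1) + math.exp(t2) + math.exp(t1 + t2 + t12)))

print(log_z, at_optimum, elsewhere)
print(mu[0], closed_form_mu1)
```

Any other point of the marginal polytope gives a strictly smaller objective value, which is what makes the formulation a maximization problem.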

Mean Field Approximation

Tractable Subgraphs
Definition: A subgraph F of the graph G is tractable if it is feasible to perform exact inference on it
Example: [figure]

Mean Field Methods
For an exponential family with sufficient statistics $\phi$ defined on graph G, the set of realizable mean parameters is
  $M(G; \phi) := \{ \mu \in \mathbb{R}^d \mid \exists\, p \text{ s.t. } E_p[\phi(X)] = \mu \}$
For a given tractable subgraph F, a subset of mean parameters of interest:
  $M(F; \phi) := \{ \tau \in \mathbb{R}^d \mid \tau = E_\theta[\phi(X)] \text{ for some } \theta \in \Omega(F) \}$
Inner approximation: $M(F; \phi)^{\circ} \subseteq M(G; \phi)^{\circ}$
Mean field solves the relaxed problem
  $\max_{\tau \in M_F(G)} \{ \langle \tau, \theta \rangle - A^*_F(\tau) \}$
- $A^*_F = A^*|_{M_F(G)}$ is the exact dual function restricted to $M_F(G)$

Example: Naïve Mean Field for Ising Model
Ising model in {0,1} representation:
  $p(x) \propto \exp\{ \sum_{s \in V} x_s \theta_s + \sum_{(s,t) \in E} x_s x_t \theta_{st} \}$
Mean parameters:
  $\mu_s = E_p[X_s] = P[X_s = 1]$ for all $s \in V$
  $\mu_{st} = E_p[X_s X_t] = P[(X_s, X_t) = (1, 1)]$ for all $(s,t) \in E$
For the fully disconnected graph F:
  $M_F(G) := \{ \tau \in \mathbb{R}^{|V| + |E|} \mid 0 \leq \tau_s \leq 1\ \forall s \in V,\ \tau_{st} = \tau_s \tau_t\ \forall (s,t) \in E \}$
The dual decomposes into a sum, one term for each node:
  $A^*_F(\tau) = \sum_{s \in V} [ \tau_s \log \tau_s + (1 - \tau_s) \log(1 - \tau_s) ]$

Example: Naïve Mean Field for Ising Model
Mean field problem:
  $A(\theta) \geq \max_{(\tau_1, \ldots, \tau_m) \in [0,1]^m} \{ \sum_{s \in V} \theta_s \tau_s + \sum_{(s,t) \in E} \theta_{st} \tau_s \tau_t - A^*_F(\tau) \}$
- The same objective function as in the free energy based approach
The naïve mean field update equations:
  $\tau_s \leftarrow \sigma( \theta_s + \sum_{t \in N(s)} \theta_{st} \tau_t )$, where $\sigma$ is the logistic function
- Also yields a lower bound on the log partition function

Geometry of Mean Field
Mean field optimization is always non-convex for any exponential family in which the state space $\mathcal{X}^m$ is finite
- Recall that the marginal polytope is a convex hull: $M(G) = \mathrm{conv}\{ \phi(e);\ e \in \mathcal{X}^m \}$
- $M_F(G)$ contains all the extreme points $\phi(e)$
- If it is a strict subset, then it must be non-convex
Example: two-node Ising model
  $M_F(G) = \{ 0 \leq \tau_1 \leq 1,\ 0 \leq \tau_2 \leq 1,\ \tau_{12} = \tau_1 \tau_2 \}$
- It has a parabolic cross section along $\tau_1 = \tau_2$, hence it is non-convex
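The naïve mean field updates can be tried on a small model where the exact answer is available by enumeration. The sketch below is only an illustration: the 4-cycle graph, the parameter values, the number of sweeps, and all names are assumptions made for this example. It runs coordinate updates $\tau_s \leftarrow \sigma(\theta_s + \sum_{t \in N(s)} \theta_{st} \tau_t)$ and compares the resulting objective value with the exact log partition function.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def exact_log_partition(theta_node, theta_edge):
    """Brute-force A(theta) over all binary configurations."""
    n = len(theta_node)
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        score = sum(theta_node[s] * x[s] for s in range(n))
        score += sum(w * x[s] * x[t] for (s, t), w in theta_edge.items())
        total += math.exp(score)
    return math.log(total)

def binary_entropy(t):
    return -sum(p * math.log(p) for p in (t, 1.0 - t) if p > 0.0)

def mean_field_objective(tau, theta_node, theta_edge):
    """sum_s theta_s tau_s + sum_st theta_st tau_s tau_t - A*_F(tau)."""
    value = sum(theta_node[s] * tau[s] for s in range(len(tau)))
    value += sum(w * tau[s] * tau[t] for (s, t), w in theta_edge.items())
    return value + sum(binary_entropy(t) for t in tau)

def naive_mean_field(theta_node, theta_edge, sweeps=200):
    """Coordinate updates of the naive mean field fixed-point equations."""
    n = len(theta_node)
    neighbors = {s: [] for s in range(n)}
    for (s, t), w in theta_edge.items():
        neighbors[s].append((t, w))
        neighbors[t].append((s, w))
    tau = [0.5] * n
    for _ in range(sweeps):
        for s in range(n):
            tau[s] = sigmoid(theta_node[s] + sum(w * tau[t] for t, w in neighbors[s]))
    return tau

# Hypothetical Ising model on a 4-cycle.
THETA_NODE = [0.2, -0.4, 0.1, 0.3]
THETA_EDGE = {(0, 1): 0.5, (1, 2): -0.3, (2, 3): 0.8, (0, 3): 0.4}

tau = naive_mean_field(THETA_NODE, THETA_EDGE)
lower_bound = mean_field_objective(tau, THETA_NODE, THETA_EDGE)
log_z = exact_log_partition(THETA_NODE, THETA_EDGE)
print(tau)
print(lower_bound, log_z)  # the mean field value does not exceed A(theta)
```

Because the problem is non-convex, the updates are only guaranteed to reach a local optimum; the lower-bound property holds at every $\tau \in [0,1]^m$ regardless.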

Bethe Approximation and Sum-Product

Sum-Product/Belief Propagation Algorithm
Message passing rule:
  $M_{ts}(x_s) \leftarrow \kappa \sum_{x'_t} \{ \psi_{st}(x_s, x'_t)\, \psi_t(x'_t) \prod_{u \in N(t) \setminus s} M_{ut}(x'_t) \}$
  where $\kappa > 0$ again denotes a normalization constant
Marginals:
  $\mu_s(x_s) = \kappa\, \psi_s(x_s) \prod_{t \in N(s)} M^*_{ts}(x_s)$
Exact for trees, but approximate for loopy graphs (so-called loopy belief propagation)
Questions:
- How is the algorithm on trees related to the variational principle?
- What is the algorithm doing for graphs with cycles?

Tree Graphical Models
Discrete variables $X_s \in \{0, 1, \ldots, m_s - 1\}$ on a tree $T = (V, E)$
Sufficient statistics:
  $I_j(x_s)$ for $s = 1, \ldots, n$, $j \in \mathcal{X}_s$
  $I_{jk}(x_s, x_t)$ for $(s,t) \in E$, $(j,k) \in \mathcal{X}_s \times \mathcal{X}_t$
Exponential representation of the distribution:
  $p(x; \theta) \propto \exp\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \}$
  where $\theta_s(x_s) := \sum_{j \in \mathcal{X}_s} \theta_{s;j} I_j(x_s)$ (and similarly for $\theta_{st}(x_s, x_t)$)
Mean parameters are marginal probabilities:
  $\mu_{s;j} = E_p[I_j(X_s)] = P[X_s = j]$ for all $j \in \mathcal{X}_s$
  $\mu_s(x_s) = \sum_{j \in \mathcal{X}_s} \mu_{s;j} I_j(x_s) = P(X_s = x_s)$
  $\mu_{st;jk} = E_p[I_{st;jk}(X_s, X_t)] = P[X_s = j, X_t = k]$ for all $(j,k) \in \mathcal{X}_s \times \mathcal{X}_t$
  $\mu_{st}(x_s, x_t) = \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \mu_{st;jk} I_{jk}(x_s, x_t) = P(X_s = x_s, X_t = x_t)$

Marginal Polytope for Trees
Recall the marginal polytope for general graphs:
  $M(G) = \{ \mu \in \mathbb{R}^d \mid \exists\, p \text{ with marginals } \mu_{s;j}, \mu_{st;jk} \}$
By the junction tree theorem (see Prop. 2.1 & Prop. 4.1):
  $M(T) = \{ \mu \geq 0 \mid \sum_{x_s} \mu_s(x_s) = 1,\ \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \}$
In particular, if $\mu \in M(T)$, then
  $p_\mu(x) := \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}$
has the corresponding marginals (taking $0/0 := 0$ in cases of zeros in the elements of $\mu$)

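Decomposition of Entropy for Trees
For trees, the entropy decomposes as
  $H(p(x; \mu)) = -\sum_{x} p(x; \mu) \log p(x; \mu)$
  $= \sum_{s \in V} \left( -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s) \right) - \sum_{(s,t) \in E} \left( \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)} \right)$
  $= \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st})$
  where the first bracket is the node entropy $H_s(\mu_s)$ and the second is the mutual information $I_{st}(\mu_{st})$ (a KL divergence)
The dual function has an explicit form: $A^*(\mu) = -H(p(x; \mu))$

Exact Variational Principle for Trees
Variational formulation:
  $A(\theta) = \max_{\mu \in M(T)} \{ \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}) \}$
Assign Lagrange multiplier $\lambda_{ss}$ for the normalization constraint $C_{ss}(\mu) := 1 - \sum_{x_s} \mu_s(x_s) = 0$, and $\lambda_{ts}(x_s)$ for each marginalization constraint $C_{ts}(x_s; \mu) := \mu_s(x_s) - \sum_{x_t} \mu_{st}(x_s, x_t) = 0$
The Lagrangian has the form
  $L(\mu, \lambda) = \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}) + \sum_{s \in V} \lambda_{ss} C_{ss}(\mu)$
  $+ \sum_{(s,t) \in E} \left[ \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t) + \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s) \right]$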
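The entropy decomposition for trees can be confirmed on a small chain. In the sketch below the 3-node chain, the potentials, and the variable names are assumptions chosen for illustration. The joint distribution is built by enumeration, its node and edge marginals are computed, and the joint entropy is compared with $\sum_s H_s(\mu_s) - \sum_{(s,t)} I_{st}(\mu_{st})$.

```python
import itertools
import math

# Hypothetical binary chain 0 - 1 - 2 with arbitrary log-potentials.
EDGES = [(0, 1), (1, 2)]
THETA_NODE = [[0.0, 0.4], [0.0, -0.2], [0.0, 0.7]]
THETA_EDGE = {(0, 1): [[0.0, 0.3], [0.3, 1.0]],
              (1, 2): [[0.5, 0.0], [0.0, 0.9]]}

def joint_distribution():
    weights = {}
    for x in itertools.product((0, 1), repeat=3):
        score = sum(THETA_NODE[s][x[s]] for s in range(3))
        score += sum(THETA_EDGE[(s, t)][x[s]][x[t]] for s, t in EDGES)
        weights[x] = math.exp(score)
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

p = joint_distribution()

# Node and edge marginals obtained by summing the joint.
mu_node = [[sum(v for x, v in p.items() if x[s] == j) for j in (0, 1)]
           for s in range(3)]
mu_edge = {(s, t): [[sum(v for x, v in p.items() if x[s] == j and x[t] == k)
                     for k in (0, 1)] for j in (0, 1)]
           for s, t in EDGES}

joint_entropy = -sum(v * math.log(v) for v in p.values())
node_entropies = sum(-sum(m * math.log(m) for m in mu_node[s]) for s in range(3))
mutual_informations = sum(
    mu_edge[(s, t)][j][k] * math.log(mu_edge[(s, t)][j][k] / (mu_node[s][j] * mu_node[t][k]))
    for s, t in EDGES for j in (0, 1) for k in (0, 1))

print(joint_entropy, node_entropies - mutual_informations)
```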

Lagrangian Derivation
Taking the derivatives of the Lagrangian w.r.t. $\mu_s$ and $\mu_{st}$ (C and C' denote constants):
  $\frac{\partial L}{\partial \mu_s(x_s)} = \theta_s(x_s) - \log \mu_s(x_s) + \sum_{t \in N(s)} \lambda_{ts}(x_s) + C$
  $\frac{\partial L}{\partial \mu_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C'$
Setting them to zero yields
  $\mu_s(x_s) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} \exp\{\lambda_{ts}(x_s)\}$, with messages $M_{ts}(x_s) := \exp\{\lambda_{ts}(x_s)\}$
  $\mu_{st}(x_s, x_t) \propto \exp\{ \theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in N(s) \setminus t} \exp\{\lambda_{us}(x_s)\} \prod_{v \in N(t) \setminus s} \exp\{\lambda_{vt}(x_t)\}$

Lagrangian Derivation (continued)
Adjusting the Lagrange multipliers or messages to enforce
  $C_{ts}(x_s; \mu) := \mu_s(x_s) - \sum_{x_t} \mu_{st}(x_s, x_t) = 0$
yields
  $M_{ts}(x_s) \propto \sum_{x_t} \exp\{ \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in N(t) \setminus s} M_{ut}(x_t)$
Conclusion: the message passing updates are a Lagrangian method for solving the stationary condition of the variational formulation
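The message update derived above can be run on a small tree and compared with brute-force marginals. The sketch below is a hedged illustration: the 4-node star-shaped tree, the log-potentials, the parallel update schedule, and every name are assumptions made for this example. Messages follow $M_{ts}(x_s) \propto \sum_{x_t} \exp\{\theta_t(x_t) + \theta_{st}(x_s, x_t)\} \prod_{u \in N(t) \setminus s} M_{ut}(x_t)$ and node marginals follow $\mu_s(x_s) \propto \exp\{\theta_s(x_s)\} \prod_{t \in N(s)} M_{ts}(x_s)$.

```python
import itertools
import math

# Hypothetical binary tree: node 1 is connected to nodes 0, 2 and 3.
EDGES = [(0, 1), (1, 2), (1, 3)]
THETA_NODE = [[0.0, 0.5], [0.0, -0.3], [0.0, 0.2], [0.0, 0.8]]
THETA_EDGE = {(0, 1): [[0.6, 0.0], [0.0, 0.6]],
              (1, 2): [[0.0, 0.4], [0.4, 0.0]],
              (1, 3): [[0.9, 0.0], [0.0, 0.1]]}

NEIGHBORS = {s: [] for s in range(4)}
for s, t in EDGES:
    NEIGHBORS[s].append(t)
    NEIGHBORS[t].append(s)

def edge_theta(s, t, xs, xt):
    """theta_st(x_s, x_t), independent of the order in which the edge is stored."""
    if (s, t) in THETA_EDGE:
        return THETA_EDGE[(s, t)][xs][xt]
    return THETA_EDGE[(t, s)][xt][xs]

def sum_product(iterations=10):
    # msgs[(t, s)][x_s] is the message from node t to node s.
    msgs = {(t, s): [1.0, 1.0] for s in NEIGHBORS for t in NEIGHBORS[s]}
    for _ in range(iterations):
        new_msgs = {}
        for (t, s) in msgs:
            values = []
            for xs in (0, 1):
                total = 0.0
                for xt in (0, 1):
                    incoming = 1.0
                    for u in NEIGHBORS[t]:
                        if u != s:
                            incoming *= msgs[(u, t)][xt]
                    total += math.exp(THETA_NODE[t][xt] + edge_theta(s, t, xs, xt)) * incoming
                values.append(total)
            norm = sum(values)
            new_msgs[(t, s)] = [v / norm for v in values]
        msgs = new_msgs
    marginals = []
    for s in range(4):
        belief = [math.exp(THETA_NODE[s][xs])
                  * math.prod(msgs[(t, s)][xs] for t in NEIGHBORS[s])
                  for xs in (0, 1)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals

def brute_force_marginals():
    weights = {}
    for x in itertools.product((0, 1), repeat=4):
        score = sum(THETA_NODE[s][x[s]] for s in range(4))
        score += sum(THETA_EDGE[(s, t)][x[s]][x[t]] for s, t in EDGES)
        weights[x] = math.exp(score)
    z = sum(weights.values())
    return [[sum(w for x, w in weights.items() if x[s] == j) / z for j in (0, 1)]
            for s in range(4)]

bp_marginals = sum_product()
exact_marginals = brute_force_marginals()
print(bp_marginals)
print(exact_marginals)
```

On a tree the parallel schedule settles after a number of iterations equal to the diameter, so the fixed iteration count here is more than enough for this example.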

BP on Arbitrary Graphs
Two main difficulties of the variational formulation $A(\theta) = \sup_{\mu \in M} \{ \theta^T \mu - A^*(\mu) \}$:
- The marginal polytope M is hard to characterize, so let's use the tree-based outer bound
    $L(G) = \{ \tau \geq 0 \mid \sum_{x_s} \tau_s(x_s) = 1,\ \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \}$
  These locally consistent vectors $\tau$ are called pseudo-marginals.
- The exact entropy $-A^*(\mu)$ lacks an explicit form, so let's approximate it by the exact expression for trees
    $-A^*(\tau) \approx H_{\mathrm{Bethe}}(\tau) := \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st})$

Bethe Variational Problem (BVP)
Combining these two ingredients leads to the Bethe variational problem (BVP):
  $\max_{\tau \in L(G)} \{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \}$
- A simply structured problem (differentiable, and the constraint set is a simple convex polytope)
- Loopy BP can be derived as an iterative method for solving a Lagrangian formulation of the BVP (Theorem 4.2); the proof is similar to the one for tree graphs

Geometry of BP
Consider the following assignment of pseudo-marginals
[Figure: pseudo-marginal assignment on a 3-node cycle with nodes 1, 2, 3]
- Can easily verify $\tau \in L(G)$
- However, $\tau \notin M(G)$ (need a bit more work)
Tree-based outer bound
- For any graph, $M(G) \subseteq L(G)$
- Equality holds if and only if the graph is a tree
Question: does the solution to the BVP ever fall into the gap?
- Yes, for any element of the outer bound $L(G)$, it is possible to construct a distribution with it as a BP fixed point (Wainwright et al. 2003)

Inexactness of Bethe Entropy Approximation
Consider a fully connected graph on four nodes with
  $\mu_s(x_s) = [0.5 \;\; 0.5]$ for $s = 1, 2, 3, 4$
  $\mu_{st}(x_s, x_t) = [0.5 \;\; 0;\; 0 \;\; 0.5]$ for all $(s,t) \in E$
- It is globally valid: $\mu \in M(G)$; it is realized by the distribution that places mass 1/2 on each of the configurations (0,0,0,0) and (1,1,1,1)
- $H_{\mathrm{Bethe}}(\mu) = 4 \log 2 - 6 \log 2 = -2 \log 2 < 0$, whereas $-A^*(\mu) = \log 2 > 0$
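The arithmetic behind this counterexample is short enough to reproduce. The sketch below is illustrative, with helper names chosen for this example; the marginals are the ones implied by the distribution that places mass 1/2 on each of (0,0,0,0) and (1,1,1,1), namely uniform node marginals and edge marginals with mass 1/2 on (0,0) and on (1,1). It evaluates the Bethe entropy on the fully connected four-node graph and the true entropy of that distribution.

```python
import itertools
import math

NODES = range(4)
EDGES = list(itertools.combinations(NODES, 2))  # the 6 edges of the complete graph

# Marginals of the distribution with mass 1/2 on all-zeros and on all-ones.
MU_NODE = [0.5, 0.5]
MU_EDGE = [[0.5, 0.0], [0.0, 0.5]]

def node_entropy(mu):
    return -sum(p * math.log(p) for p in mu if p > 0.0)

def mutual_information(mu_st, mu_s, mu_t):
    """I_st = sum mu_st log(mu_st / (mu_s mu_t)), skipping zero entries."""
    return sum(mu_st[j][k] * math.log(mu_st[j][k] / (mu_s[j] * mu_t[k]))
               for j in (0, 1) for k in (0, 1) if mu_st[j][k] > 0.0)

bethe_entropy = (sum(node_entropy(MU_NODE) for _ in NODES)
                 - sum(mutual_information(MU_EDGE, MU_NODE, MU_NODE) for _ in EDGES))
true_entropy = -sum(p * math.log(p) for p in (0.5, 0.5))

print(bethe_entropy, -2.0 * math.log(2.0))
print(true_entropy, math.log(2.0))
```

A negative value can never be the entropy of a discrete distribution, so the Bethe expression is not an entropy at all on graphs with cycles, only an approximation to one.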

Remark
This connection provides a principled basis for applying the sum-product algorithm to loopy graphs
However,
- Although there is always a fixed point of loopy BP, there are no guarantees on the convergence of the algorithm on loopy graphs
- The Bethe variational problem is usually non-convex. Therefore, there are no guarantees on reaching the global optimum
- Generally, there are no guarantees that $A_{\mathrm{Bethe}}(\theta)$ is a lower bound of $A(\theta)$
Nevertheless,
- The connection and understanding suggest a number of avenues for improving upon the ordinary sum-product algorithm, via progressively better approximations to the entropy function and outer bounds on the marginal polytope (Kikuchi clustering)

Summary
Variational methods in general turn inference into an optimization problem via exponential families and convex duality
The exact variational principle is intractable to solve; there are two distinct components for approximations:
- Either an inner or an outer bound to the marginal polytope
- Various approximations to the entropy function
Mean field: non-convex inner bound and exact form of entropy
BP: polyhedral outer bound and non-convex Bethe approximation
Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (Yedidia et al. 2002)
