Probabilistic Graphical Models

1 School of Computer Science
Probabilistic Graphical Models
Variational Inference III: Variational Principle I
Junming Yin
Lecture 16, March 19, 2012
Reading:
[Figure: example graphical models over nodes X1, X2, X3, X4]

2 What have we learned so far
Free energy based approaches:
  Direct approximation of the Gibbs free energy: Bethe free energy and loopy BP
  Restricting the family of approximating distributions: mean field method
Convex duality based approaches:
  Exponential families
  Indicator sufficient statistics
  Convex duality
  Inference: computing mean parameters
  Variational principle

3 Computing Mean Parameter: Bernoulli
A single Bernoulli random variable $X$ with natural parameter $\theta$:
$$p(x; \theta) = \exp\{\theta x - A(\theta)\}, \quad x \in \{0, 1\}, \quad A(\theta) = \log(1 + e^{\theta})$$
Inference = computing the mean parameter:
$$\mu(\theta) = E_\theta[X] = p(x = 1; \theta) = \frac{e^{\theta}}{1 + e^{\theta}}$$
Want to do it in a variational manner: cast the procedure of computing the mean (a summation) in an optimization-based formulation.
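The two routes to the Bernoulli mean on this slide can be compared numerically. The sketch below is only an illustration: the function names are made up here, and it assumes nothing beyond the formulas for $p(x; \theta)$ and $A(\theta)$ given above.

```python
import math

def log_partition(theta):
    """A(theta) = log(1 + e^theta) for a single Bernoulli variable."""
    return math.log1p(math.exp(theta))

def mean_by_summation(theta):
    """E_theta[X] computed directly as a sum over x in {0, 1}."""
    return sum(x * math.exp(theta * x - log_partition(theta)) for x in (0, 1))

def mean_closed_form(theta):
    """mu(theta) = e^theta / (1 + e^theta)."""
    return math.exp(theta) / (1.0 + math.exp(theta))

if __name__ == "__main__":
    theta = 0.7  # arbitrary illustrative value
    print(mean_by_summation(theta), mean_closed_form(theta))
```

Both functions should agree up to floating-point error; the variational formulation developed on the following slides recovers the same number by optimization rather than summation.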

4 Conjugate Dual Function
Given any function $f(\theta)$, its conjugate dual function is:
$$f^*(\mu) = \sup_{\theta} \{\langle \theta, \mu \rangle - f(\theta)\}$$
[Figure: geometric illustration of the conjugate dual]
The conjugate dual is always a convex function: it is a pointwise supremum of a class of linear functions.
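To make the definition concrete, here is a rough numerical sketch that approximates the supremum by a maximum over a finite grid of $\theta$ values. The helper names and the test function $f(\theta) = \theta^2 / 2$ are choices made for this illustration, not part of the lecture; that function is used because its conjugate is known to be $f^*(\mu) = \mu^2 / 2$.

```python
def conjugate_on_grid(f, mu, thetas):
    """Approximate f*(mu) = sup_theta {theta * mu - f(theta)} over a finite grid."""
    return max(theta * mu - f(theta) for theta in thetas)

def half_square(theta):
    """Illustrative test function f(theta) = theta^2 / 2."""
    return 0.5 * theta * theta

# Grid on [-5, 5] with spacing 0.001 (an assumption; the true sup is over all reals).
THETAS = [k / 1000.0 for k in range(-5000, 5001)]

if __name__ == "__main__":
    for mu in (-1.0, 0.0, 1.5):
        print(mu, conjugate_on_grid(half_square, mu, THETAS), 0.5 * mu * mu)
```

The grid approximation is only reliable when the maximizing $\theta$ lies inside the grid, which holds for the values of $\mu$ printed here.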

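5 Dual of the Dual is the Original
Under some technical conditions on $f$ (convex and lower semicontinuous), the dual of the dual is the function itself, $f = (f^*)^*$:
$$f(\theta) = \sup_{\mu} \{\langle \theta, \mu \rangle - f^*(\mu)\}$$
For the log partition function:
$$A(\theta) = \sup_{\mu} \{\langle \theta, \mu \rangle - A^*(\mu)\}, \quad \theta \in \Omega$$
The dual variable $\mu$ has a natural interpretation as the mean parameters.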

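6 Computing Mean Parameter: Bernoulli
The conjugate:
$$A^*(\mu) := \sup_{\theta \in \mathbb{R}} \{\mu\theta - \log[1 + \exp(\theta)]\}$$
Stationary condition:
$$\mu = \frac{e^{\theta}}{1 + e^{\theta}} \quad (\mu = \nabla A(\theta))$$
If $\mu \in (0, 1)$: $\theta(\mu) = \log\frac{\mu}{1 - \mu}$ and $A^*(\mu) = \mu \log \mu + (1 - \mu) \log(1 - \mu)$.
If $\mu \notin [0, 1]$: $A^*(\mu) = +\infty$.
We have
$$A^*(\mu) = \begin{cases} \mu \log \mu + (1 - \mu) \log(1 - \mu) & \text{if } \mu \in [0, 1] \\ +\infty & \text{otherwise.} \end{cases}$$
The variational form:
$$A(\theta) = \max_{\mu \in [0, 1]} \{\mu\theta - A^*(\mu)\}$$
The optimum is achieved at $\mu(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}$. This is the mean!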
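The variational form for the Bernoulli case can be checked with a brute-force search. The sketch below maximizes $\mu\theta - A^*(\mu)$ over a uniform grid on $[0, 1]$; the grid search and the function names are assumptions of this illustration (any one-dimensional concave maximizer would do), and the convention $0 \log 0 = 0$ is used at the endpoints.

```python
import math

def bernoulli_dual(mu):
    """A*(mu) = mu log mu + (1 - mu) log(1 - mu) on [0, 1], with 0 log 0 = 0."""
    total = 0.0
    for p in (mu, 1.0 - mu):
        if p > 0.0:
            total += p * math.log(p)
    return total

def variational_bernoulli(theta, grid_size=100001):
    """Maximize mu * theta - A*(mu) over a uniform grid on [0, 1].

    Returns (argmax mu, maximum value). Accuracy is limited by the grid spacing.
    """
    best_mu, best_val = 0.0, -math.inf
    for k in range(grid_size):
        mu = k / (grid_size - 1)
        val = mu * theta - bernoulli_dual(mu)
        if val > best_val:
            best_mu, best_val = mu, val
    return best_mu, best_val

if __name__ == "__main__":
    theta = 1.3  # arbitrary illustrative value
    mu_hat, a_hat = variational_bernoulli(theta)
    print(mu_hat, math.exp(theta) / (1.0 + math.exp(theta)))  # maximizer vs. mean
    print(a_hat, math.log1p(math.exp(theta)))                 # optimum vs. A(theta)
```

Up to grid resolution, the maximizer should match the mean $e^{\theta} / (1 + e^{\theta})$ and the maximum value should match $A(\theta)$.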

7 Remark
The last few identities are not coincidental: they rely on a deep theory for general exponential families.
  The dual function is the negative entropy function.
  The mean parameter is restricted (in the Bernoulli case, to $[0, 1]$).
  Solving the optimization returns the mean parameter.
Next step: develop this framework for general exponential families and graphical models.

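8 Computation of Conjugate Dual
Given an exponential family
$$p(x_1, \ldots, x_m; \theta) = \exp\Big\{\sum_{i=1}^{d} \theta_i \phi_i(x) - A(\theta)\Big\}$$
The dual function:
$$A^*(\mu) := \sup_{\theta \in \Omega} \{\langle \mu, \theta \rangle - A(\theta)\}$$
The stationary condition:
$$\mu - \nabla A(\theta) = 0$$
Derivatives of $A$ yield the mean parameters:
$$\frac{\partial A}{\partial \theta_i}(\theta) = E_\theta[\phi_i(X)] = \int \phi_i(x)\, p(x; \theta)\, dx$$
The stationary condition becomes
$$\mu = E_\theta[\phi(X)]$$
Question: for which $\mu \in \mathbb{R}^d$ does it have a solution $\theta(\mu)$?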

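9 Computation of Conjugate Dual
Let's assume there is a solution $\theta(\mu)$ such that $\mu = E_{\theta(\mu)}[\phi(X)]$.
The dual has the form
$$A^*(\mu) = \langle \theta(\mu), \mu \rangle - A(\theta(\mu)) = E_{\theta(\mu)}\big[\langle \theta(\mu), \phi(X) \rangle - A(\theta(\mu))\big] = E_{\theta(\mu)}[\log p(X; \theta(\mu))]$$
The entropy is defined as
$$H(p(x)) = -\int p(x) \log p(x)\, dx$$
So the dual is $A^*(\mu) = -H(p(x; \theta(\mu)))$ when there is a solution $\theta(\mu)$.
Question: for which $\mu \in \mathbb{R}^d$ does it have a solution $\theta(\mu)$?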
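The identity $A^*(\mu) = -H(p(x; \theta(\mu)))$ can be checked on a small discrete family, where the integrals become finite sums. The sketch below assumes a toy model chosen for this illustration: two binary variables with sufficient statistics $\phi(x) = (x_1, x_2, x_1 x_2)$. Starting from a given $\theta$, it computes $\mu = E_\theta[\phi(X)]$, so that $\theta$ is a solution $\theta(\mu)$ by construction, and then evaluates both sides.

```python
import itertools
import math

def phi(x):
    """Sufficient statistics of the assumed toy model: (x1, x2, x1 * x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def dual_two_ways(theta):
    """Return (<theta, mu> - A(theta), sum_x p(x) log p(x)) with mu = E_theta[phi(X)]."""
    states = list(itertools.product((0, 1), repeat=2))
    scores = [sum(t * f for t, f in zip(theta, phi(x))) for x in states]
    log_z = math.log(sum(math.exp(s) for s in scores))  # A(theta)
    probs = [math.exp(s - log_z) for s in scores]        # p(x; theta)
    mu = [sum(p * phi(x)[i] for p, x in zip(probs, states)) for i in range(3)]
    dual_value = sum(t * m for t, m in zip(theta, mu)) - log_z
    negative_entropy = sum(p * math.log(p) for p in probs)
    return dual_value, negative_entropy

if __name__ == "__main__":
    print(dual_two_ways((0.5, -1.0, 2.0)))  # the two numbers should coincide
```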

10 Marginal Polytope
For any distribution $p(x)$ and a set of sufficient statistics $\phi(x)$, define a vector of mean parameters
$$\mu_i = E_p[\phi_i(X)] = \int \phi_i(x)\, p(x)\, dx$$
Here $p(x)$ is not necessarily a member of an exponential family.
The set of all realizable mean parameters:
$$\mathcal{M} := \{\mu \in \mathbb{R}^d \mid \exists\, p \text{ s.t. } E_p[\phi(X)] = \mu\}$$
It is a convex set.
For discrete exponential families, this set is called the marginal polytope.

11 Convex Polytope
Convex hull representation
Half-plane representation
Minkowski-Weyl Theorem: any non-empty convex polytope can be characterized by a finite collection of linear inequality constraints.

12 Example: Ising Model
Sufficient statistics:
$$\phi(x) := \big(x_s,\ s \in V;\ x_s x_t,\ (s, t) \in E\big) \in \mathbb{R}^{|V| + |E|}$$
Mean parameters:
$$\mu_s = E_p[X_s] = P[X_s = 1] \quad \text{for all } s \in V$$
$$\mu_{st} = E_p[X_s X_t] = P[(X_s, X_t) = (1, 1)] \quad \text{for all } (s, t) \in E$$
Two-node Ising model ($X_1$, $X_2$):
Convex hull representation:
$$\mathrm{conv}\{(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)\}$$
Half-plane representation:
$$\mu_1 \ge \mu_{12}, \quad \mu_2 \ge \mu_{12}, \quad \mu_{12} \ge 0, \quad 1 + \mu_{12} \ge \mu_1 + \mu_2$$
Exercise: three-node Ising model.
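The two representations of the two-node marginal polytope can be cross-checked with a few lines of code. In this sketch (function names are hypothetical), `mean_params` maps a distribution over $(x_1, x_2)$ to $(\mu_1, \mu_2, \mu_{12})$, and `in_marginal_polytope` tests the four half-plane inequalities listed above, with a small tolerance for floating-point error.

```python
import itertools

STATES = list(itertools.product((0, 1), repeat=2))

def mean_params(p):
    """Map a distribution p (dict from (x1, x2) to probability) to (mu1, mu2, mu12)."""
    mu1 = sum(q * x[0] for x, q in p.items())
    mu2 = sum(q * x[1] for x, q in p.items())
    mu12 = sum(q * x[0] * x[1] for x, q in p.items())
    return (mu1, mu2, mu12)

def in_marginal_polytope(mu, tol=1e-12):
    """Check the half-plane representation of the two-node Ising marginal polytope."""
    mu1, mu2, mu12 = mu
    return (
        mu12 >= -tol
        and mu1 - mu12 >= -tol
        and mu2 - mu12 >= -tol
        and 1.0 + mu12 - mu1 - mu2 >= -tol
    )

def point_mass(state):
    """Distribution putting all probability on one configuration."""
    return {x: (1.0 if x == state else 0.0) for x in STATES}

if __name__ == "__main__":
    # Point masses map to the vertices of the convex hull representation.
    for state in STATES:
        print(state, mean_params(point_mass(state)))
    print(in_marginal_polytope((0.5, 0.5, 0.25)))  # realizable, e.g. by the uniform distribution
    print(in_marginal_polytope((0.5, 0.5, 0.6)))   # violates mu1 >= mu12
```

The four inequalities are exactly the non-negativity constraints on the four joint probabilities $P[(X_1, X_2) = (a, b)]$ written in terms of $\mu$.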

13 Example: Discrete MRF
Sufficient statistics:
$$\mathbb{I}_j(x_s) \quad \text{for } s = 1, \ldots, n,\ j \in \mathcal{X}_s$$
$$\mathbb{I}_{jk}(x_s, x_t) \quad \text{for } (s, t) \in E,\ (j, k) \in \mathcal{X}_s \times \mathcal{X}_t$$
Mean parameters are marginal probabilities:
$$\mu_{s;j} = E_p[\mathbb{I}_j(X_s)] = P[X_s = j] \quad \forall j \in \mathcal{X}_s$$
$$\mu_{st;jk} = E_p[\mathbb{I}_{jk}(X_s, X_t)] = P[X_s = j, X_t = k] \quad \forall (j, k) \in \mathcal{X}_s \times \mathcal{X}_t$$
Marginal polytope:
$$\mathcal{M}(G) = \{\mu \in \mathbb{R}^d \mid \exists\, p \text{ with marginals } \mu_{s;j},\ \mu_{st;jk}\}$$
For tree graphical models, the number of half-planes (facet complexity) grows only linearly in the graph size.
For general graphs, it is extremely difficult to characterize the marginal polytope.

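14 Variational Principle (Theorem 3.4)
The dual function takes the form
$$A^*(\mu) = \begin{cases} -H(p_{\theta(\mu)}) & \text{if } \mu \in \mathcal{M} \\ +\infty & \text{if } \mu \notin \mathcal{M} \end{cases}$$
where $\theta(\mu)$ satisfies $\mu = E_{\theta(\mu)}[\phi(X)]$.
The log partition function has the variational form
$$A(\theta) = \sup_{\mu \in \mathcal{M}} \{\theta^T \mu - A^*(\mu)\}$$
For all $\theta \in \Omega$, the supremum in the above optimization problem is attained uniquely at the $\mu(\theta) \in \mathcal{M}^\circ$ that satisfies $\mu(\theta) = E_\theta[\phi(X)]$.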

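15 Example: Two-node Ising Model
The distribution ($X_1$, $X_2$), with $x_{12} := x_1 x_2$:
$$p(x; \theta) \propto \exp\{\theta_1 x_1 + \theta_2 x_2 + \theta_{12} x_{12}\}$$
The marginal polytope is characterized by
$$\mu_1 \ge \mu_{12}, \quad \mu_2 \ge \mu_{12}, \quad \mu_{12} \ge 0, \quad 1 + \mu_{12} \ge \mu_1 + \mu_2$$
The dual has an explicit form:
$$A^*(\mu) = \mu_{12} \log \mu_{12} + (\mu_1 - \mu_{12}) \log(\mu_1 - \mu_{12}) + (\mu_2 - \mu_{12}) \log(\mu_2 - \mu_{12}) + (1 + \mu_{12} - \mu_1 - \mu_2) \log(1 + \mu_{12} - \mu_1 - \mu_2)$$
The variational problem:
$$A(\theta) = \max_{(\mu_1, \mu_2, \mu_{12}) \in \mathcal{M}} \{\theta_1 \mu_1 + \theta_2 \mu_2 + \theta_{12} \mu_{12} - A^*(\mu)\}$$
The optimum is attained at
$$\mu_1(\theta) = \frac{\exp\{\theta_1\} + \exp\{\theta_1 + \theta_2 + \theta_{12}\}}{1 + \exp\{\theta_1\} + \exp\{\theta_2\} + \exp\{\theta_1 + \theta_2 + \theta_{12}\}}$$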
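A quick numerical check of this example is sketched below. The function names are hypothetical. The expression for $\mu_1(\theta)$ is the one stated on the slide; the analogous expressions for $\mu_2(\theta)$ and $\mu_{12}(\theta)$ are not on the slide and are obtained here by the same direct summation over the four configurations. The check confirms that the variational objective evaluated at $\mu(\theta)$ equals $A(\theta)$, and that another feasible point gives a value no larger.

```python
import math

def log_partition(theta):
    """A(theta) for the two-node Ising model, summing over the four configurations."""
    t1, t2, t12 = theta
    return math.log(1.0 + math.exp(t1) + math.exp(t2) + math.exp(t1 + t2 + t12))

def exact_mean(theta):
    """(mu1, mu2, mu12) = E_theta[(X1, X2, X1 * X2)] by direct summation."""
    t1, t2, t12 = theta
    e1, e2, e12 = math.exp(t1), math.exp(t2), math.exp(t1 + t2 + t12)
    z = 1.0 + e1 + e2 + e12
    return ((e1 + e12) / z, (e2 + e12) / z, e12 / z)

def dual(mu):
    """Explicit A*(mu): negative entropy of the joint implied by mu, +inf outside M."""
    mu1, mu2, mu12 = mu
    cells = (mu12, mu1 - mu12, mu2 - mu12, 1.0 + mu12 - mu1 - mu2)
    if any(c < 0.0 for c in cells):
        return math.inf
    return sum(c * math.log(c) for c in cells if c > 0.0)  # 0 log 0 = 0

def objective(theta, mu):
    """Variational objective <theta, mu> - A*(mu)."""
    return sum(t * m for t, m in zip(theta, mu)) - dual(mu)

if __name__ == "__main__":
    theta = (0.3, -0.8, 1.5)  # arbitrary illustrative parameters
    mu_star = exact_mean(theta)
    print(objective(theta, mu_star), log_partition(theta))  # should coincide
    print(objective(theta, (0.5, 0.5, 0.25)))               # a feasible but suboptimal point
```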

16 Challenges
In general graphical models, the marginal polytope can be very difficult to characterize explicitly.
The dual function is implicitly defined:
  The inverse mapping from $\mu$ to $\theta(\mu)$ is nontrivial.
  Evaluating the entropy requires high-dimensional integration (summation).

17 Variational Inference
Variational formulation:
$$A(\theta) = \sup_{\mu \in \mathcal{M}} \{\theta^T \mu - A^*(\mu)\}$$
General idea of variational inference for graphical models:
  Approximate the function to be optimized, i.e., the entropy term (Bethe-Kikuchi, sum-product).
  Restrict the set over which the optimization takes place to a subset, i.e., the marginal polytope (mean field methods).
