CS 6347 Lectures 8 & 9: Lagrange Multipliers & Variational Bounds
General Optimization

$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad h_i(x) = 0, \; i = 1, \dots, p$$

- $f_0$ is not necessarily convex
- The constraints can be arbitrary functions
Lagrangian

$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)$$

- Incorporates the constraints into a new objective function
- $\lambda \ge 0$ and $\nu$ are vectors of Lagrange multipliers
- The Lagrange multipliers can be thought of as soft constraints
Duality

Construct a dual function by minimizing the Lagrangian over the primal variables:

$$g(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu)$$

- $g(\lambda, \nu) = -\infty$ whenever the Lagrangian is not bounded from below for a fixed $\lambda$ and $\nu$
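A worked one-dimensional example may help here (my own illustration, not from the slides): minimize $x^2$ subject to $1 - x \le 0$.

$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad \frac{\partial L}{\partial x} = 2x - \lambda = 0 \;\Longrightarrow\; x = \frac{\lambda}{2}$$

$$g(\lambda) = \frac{\lambda^2}{4} + \lambda\left(1 - \frac{\lambda}{2}\right) = \lambda - \frac{\lambda^2}{4}$$

Maximizing over $\lambda \ge 0$ gives $\lambda^* = 2$ and $g(2) = 1$, which equals the primal optimum $f_0(1) = 1$, so there is no duality gap in this example.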
The Primal Problem

$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad h_i(x) = 0, \; i = 1, \dots, p$$

Equivalently,

$$\inf_{x} \sup_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$$

(If $x$ violates any constraint, the inner supremum is $+\infty$; if $x$ is feasible, it equals $f_0(x)$.)
The Dual Problem

$$\sup_{\lambda \ge 0, \nu} g(\lambda, \nu) \quad \text{or equivalently} \quad \sup_{\lambda \ge 0, \nu} \inf_{x} L(x, \lambda, \nu)$$

- The dual problem is always concave, even if the primal problem is not convex ($g$ is a pointwise infimum of functions affine in $(\lambda, \nu)$)
Primal vs. Dual

$$\sup_{\lambda \ge 0, \nu} \inf_{x} L(x, \lambda, \nu) \;\le\; \inf_{x} \sup_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$$

Why?

- $g(\lambda, \nu) \le L(x, \lambda, \nu)$ for all $x$
- $L(x, \lambda, \nu) \le f_0(x)$ for any feasible $x$ and $\lambda \ge 0$ ($x$ is feasible if it satisfies all of the constraints)
- Let $x^*$ be the optimal solution to the primal problem; then for any $\lambda \ge 0$ and $\nu$, $g(\lambda, \nu) \le L(x^*, \lambda, \nu) \le f_0(x^*)$
Duality

Under certain conditions, the two optimization problems are equivalent:

$$\sup_{\lambda \ge 0, \nu} \inf_{x} L(x, \lambda, \nu) \;=\; \inf_{x} \sup_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$$

- This is called strong duality
- If the inequality is strict, then we say that there is a duality gap
- The size of the gap is measured by the difference between the two sides of the inequality
Slater's Condition

For any optimization problem of the form

$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad Ax = b$$

where $f_0, \dots, f_m$ are convex functions, strong duality holds if there exists an $x$ such that $f_i(x) < 0$ for all $i = 1, \dots, m$ and $Ax = b$.
Some Examples

- Minimize $x^2 + y^2$ subject to $x + y \ge 2$
- Maximize $-x \log x - y \log y - z \log z$ subject to $x, y, z \ge 0$ and $x + y + z = 1$
- Minimize $x$ subject to $x + y \ge 1$
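A sketch of the second example, worked out here via the Lagrangian (the slides only pose it); the nonnegativity constraints turn out to be inactive, so their multipliers vanish:

$$L(x, y, z, \nu) = -x \log x - y \log y - z \log z + \nu(x + y + z - 1)$$

$$\frac{\partial L}{\partial x} = -\log x - 1 + \nu = 0 \;\Longrightarrow\; x = e^{\nu - 1}$$

and symmetrically for $y$ and $z$, so $x = y = z$; the constraint $x + y + z = 1$ then gives $x = y = z = 1/3$. The maximum entropy distribution over three outcomes is the uniform distribution.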
Approximate Marginal Inference

Last week: approximate MAP inference
- Reparameterizations
- Linear programming over the local marginal polytope

Approximate marginal inference (e.g., computing $p(y_i)$)
- Sampling methods (MCMC, etc.)
- Variational methods (loopy belief propagation, TRW, etc.)
KL Divergence

- In order to perform approximate marginal inference, we will try to find distributions that approximate the true distribution
- Ideally, the marginals of the approximating distribution should be easy to compute
- For this, we need a notion of closeness of distributions
KL Divergence

$$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

- Called the Kullback-Leibler divergence
- $D(p \| q) \ge 0$, with equality if and only if $p = q$
- Not symmetric: $D(p \| q) \ne D(q \| p)$ in general
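A minimal numeric check of these properties (a sketch, not from the slides; the distribution values are made up for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # nonnegative
print(kl_divergence(q, p))  # generally different: D(p||q) != D(q||p)
print(kl_divergence(p, p))  # exactly 0 when p = q
```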
Jensen's Inequality

Let $f(x)$ be a convex function and $a_i \ge 0$ such that $\sum_i a_i = 1$. Then

$$\sum_i a_i f(x_i) \ge f\left(\sum_i a_i x_i\right)$$

- A useful inequality when dealing with convex/concave functions
- When does equality hold?
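This inequality is exactly what gives $D(p \| q) \ge 0$ on the previous slide; the one-line derivation (filled in here, since the slides leave it implicit) uses the convex function $f(t) = -\log t$ with weights $a_x = p(x)$:

$$D(p \| q) = \sum_x p(x) \left( -\log \frac{q(x)}{p(x)} \right) \ge -\log \sum_x p(x) \frac{q(x)}{p(x)} = -\log \sum_x q(x) = 0$$

Equality holds exactly when $q(x)/p(x)$ is constant, i.e., when $p = q$.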
KL Divergence

$$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

Suppose that we want to approximate the distribution $p$ with some other distribution $q$ in some family of distributions $Q$. We could minimize the KL divergence in one of two ways:

- $\arg\min_{q \in Q} D(p \| q)$: called the M-projection; as hard as the original inference problem
- $\arg\min_{q \in Q} D(q \| p)$: called the I-projection; potentially easier
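To make the asymmetry concrete, here is a small numeric sketch (my own toy example; the target distribution and the brute-force grid search are not from the lecture). We approximate a correlated distribution over two binary variables with a fully factored $q$:

```python
import numpy as np

# A correlated "true" distribution p over (x1, x2) in {0,1}^2 (made-up numbers)
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])  # p[x1, x2]

def kl(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def product_dist(a, b):
    """q(x1, x2) = q1(x1) * q2(x2) with q1(1) = a, q2(1) = b."""
    return np.outer([1 - a, a], [1 - b, b])

grid = np.linspace(0.01, 0.99, 99)
m_best, i_best, m_val, i_val = None, None, np.inf, np.inf
for a in grid:
    for b in grid:
        q = product_dist(a, b)
        m = kl(p, q)  # M-projection objective: D(p || q)
        i = kl(q, p)  # I-projection objective: D(q || p)
        if m < m_val: m_val, m_best = m, (a, b)
        if i < i_val: i_val, i_best = i, (a, b)

print("M-projection:", m_best)  # ~ (0.5, 0.5): matches the marginals of p
print("I-projection:", i_best)  # breaks symmetry toward one of the two modes
```

The M-projection matches the single-variable marginals of $p$ exactly, while the I-projection trades marginal accuracy for placing its mass where $p$ is large.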
Variational Inference

Let $p(x) = \frac{1}{Z} \prod_{c} \psi_c(x_c)$ be the distribution that we want to approximate with a distribution $q$.

$$\begin{aligned} D(q \| p) &= \sum_x q(x) \log \frac{q(x)}{p(x)} \\ &= \sum_x q(x) \log q(x) - \sum_x q(x) \log p(x) \\ &= -H(q) - \sum_x q(x) \log p(x) \\ &= -H(q) + \log Z - \sum_x q(x) \sum_{c \in C} \log \psi_c(x_c) \\ &= -H(q) + \log Z - \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{aligned}$$

(where $q_c$ denotes the marginal of $q$ on $x_c$)

Where have we seen this before?
MAP Integer Program

$$\max_{\tau} \; \sum_{i \in V} \sum_{x_i} \tau_i(x_i) \log \phi_i(x_i) + \sum_{(i,j) \in E} \sum_{x_i, x_j} \tau_{ij}(x_i, x_j) \log \psi_{ij}(x_i, x_j)$$

such that

- $\sum_{x_i} \tau_i(x_i) = 1$ for all $i \in V$
- $\sum_{x_j} \tau_{ij}(x_i, x_j) = \tau_i(x_i)$ for all $(i,j) \in E$, $x_i$
- $\tau_i(x_i) \in \{0, 1\}$ for all $i \in V$, $x_i$
- $\tau_{ij}(x_i, x_j) \in \{0, 1\}$ for all $(i,j) \in E$, $x_i, x_j$
Variational Inference

Let $p(x) = \frac{1}{Z} \prod_c \psi_c(x_c)$ be the distribution that we want to approximate with a distribution $q$:

$$D(q \| p) = -H(q) + \log Z - \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

Using the observation that the KL divergence is non-negative,

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

- This lower bound holds for any $q$
- Maximizing it over $q$ gives equality (the maximum is attained at $q = p$, where the KL divergence vanishes)
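A quick numeric sanity check of this bound (a sketch with made-up potentials, computed by brute-force enumeration on a tiny chain):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# A tiny pairwise model on a 3-node binary chain (made-up potentials)
psi01 = rng.uniform(0.5, 2.0, size=(2, 2))  # psi(x0, x1)
psi12 = rng.uniform(0.5, 2.0, size=(2, 2))  # psi(x1, x2)

states = list(itertools.product([0, 1], repeat=3))
p_tilde = np.array([psi01[x0, x1] * psi12[x1, x2] for x0, x1, x2 in states])
Z = p_tilde.sum()
p = p_tilde / Z

def lower_bound(q):
    """H(q) + sum_x q(x) log p_tilde(x), which equals log Z - D(q || p)."""
    mask = q > 0
    entropy = -np.sum(q[mask] * np.log(q[mask]))
    return entropy + np.sum(q * np.log(p_tilde))

# The bound holds for an arbitrary q ...
q_rand = rng.uniform(size=len(states))
q_rand /= q_rand.sum()
print(lower_bound(q_rand), "<=", np.log(Z))

# ... and is tight at q = p
print(lower_bound(p), "==", np.log(Z))
```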
Variational Inference

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

The right-hand side is a concave function of $q$. Despite that, this optimization problem is hard! (Surprised?)

- The optimization is over distributions $q$ with exponentially many entries; we need a more compact way to express them
- Computing the entropy is non-trivial
Variational Inference

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

Two kinds of methods are used to deal with these difficulties:

- Mean-field methods: assume that the approximating distribution factorizes as $q(x) = \prod_{i \in V} q_i(x_i)$; similar in spirit to naïve Bayes (see the sketch after this list)
- Relaxation-based methods: replace hard pieces of the optimization with easier optimization problems; similar to the MAP IP to MAP LP relaxation
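A minimal sketch of naive mean-field coordinate ascent on a toy chain (my own illustration, assuming a pairwise model with log-potentials $\theta_{ij}$ and the standard update $q_i(x_i) \propto \exp(\sum_{j} \mathbb{E}_{q_j}[\theta_{ij}(x_i, x_j)])$; a chain is used only so the exact answer is cheap to compute):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Pairwise MRF on a 3-node binary chain; log-potentials are made up
theta = {(0, 1): rng.normal(size=(2, 2)), (1, 2): rng.normal(size=(2, 2))}
edges = list(theta)
neighbors = {0: [(0, 1)], 1: [(0, 1), (1, 2)], 2: [(1, 2)]}

# Naive mean field: q(x) = prod_i q_i(x_i), coordinate ascent on the bound
q = [np.full(2, 0.5) for _ in range(3)]
for _ in range(100):
    for i in range(3):
        log_qi = np.zeros(2)
        for (a, b) in neighbors[i]:
            t = theta[(a, b)]
            if i == a:
                log_qi += t @ q[b]    # E_{q_b}[theta(x_i, x_b)]
            else:
                log_qi += t.T @ q[a]  # E_{q_a}[theta(x_a, x_i)]
        log_qi -= log_qi.max()        # numerical stability
        q[i] = np.exp(log_qi) / np.exp(log_qi).sum()

# Exact marginals by brute force, for comparison
states = list(itertools.product([0, 1], repeat=3))
p = np.array([np.exp(sum(theta[e][s[e[0]], s[e[1]]] for e in edges)) for s in states])
p /= p.sum()
exact = [np.array([p[[s[i] == v for s in states]].sum() for v in (0, 1)]) for i in range(3)]

for i in range(3):
    print(f"node {i}: mean-field {q[i].round(3)}, exact {exact[i].round(3)}")
```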
Relaxation Approach

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

To handle the representation problem, we can use the same LP relaxation trick that we did before. For each $\tau$ in the marginal polytope, we can rewrite the right-hand side as

$$\log Z \ge H(\tau) + \sum_{c \in C} \sum_{x_c} \tau_c(x_c) \log \psi_c(x_c)$$

where $H(\tau)$ is the maximum entropy over all distributions with marginals $\tau$.
Relaxation Approach

$$\max_{\tau \in M} \; H(\tau) + \sum_{c \in C} \sum_{x_c} \tau_c(x_c) \log \psi_c(x_c)$$

The marginal polytope $M$ is intractable to optimize over. Use the local polytope $T$ instead: the set of $\tau \ge 0$ such that

$$\sum_{x_C \setminus x_i} \tau_C(x_C) = \tau_i(x_i) \quad \text{for all } C, \; i \in C, \; x_i \qquad \text{and} \qquad \sum_{x_i} \tau_i(x_i) = 1 \quad \text{for all } i \in V$$
Relaxation Approach

$$\max_{\tau \in T} \; H(\tau) + \sum_{c \in C} \sum_{x_c} \tau_c(x_c) \log \psi_c(x_c)$$

- Even with the polytope relaxation, the optimization problem remains challenging, as computing the entropy remains nontrivial
- We will need to approximate the entropy as well
- For which distributions is it easy to compute the entropy?
Tree Reparameterization

On a tree, the joint distribution factorizes in a special way:

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{i \in V} p_i(x_i) \prod_{(i,j) \in E} \frac{p_{ij}(x_i, x_j)}{p_i(x_i) \, p_j(x_j)}$$

- $p_i$ is the marginal distribution of the $i$th variable and $p_{ij}$ is the pairwise marginal distribution for the edge $(i,j) \in E$
- This applies to clique trees as well (i.e., when the factor graph is a tree):

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{i \in V} p_i(x_i) \prod_{C} \frac{p_C(x_C)}{\prod_{i \in C} p_i(x_i)}$$
Entropy of a Tree

Given this factorization, we can easily compute the entropy of a tree-structured distribution:

$$H_{tree} = -\sum_{i \in V} \sum_{x_i} p_i(x_i) \log p_i(x_i) - \sum_{C} \sum_{x_C} p_C(x_C) \log \frac{p_C(x_C)}{\prod_{i \in C} p_i(x_i)}$$

- This only depends on the marginals
- Use this as an approximation for general distributions!
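A brute-force check of this identity on a small chain (a sketch with made-up potentials; on a tree the two numbers agree exactly):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# A tree (3-node binary chain) with random pairwise potentials
psi01 = rng.uniform(0.5, 2.0, size=(2, 2))
psi12 = rng.uniform(0.5, 2.0, size=(2, 2))

states = list(itertools.product([0, 1], repeat=3))
p = np.array([psi01[x0, x1] * psi12[x1, x2] for x0, x1, x2 in states])
p /= p.sum()
P = p.reshape(2, 2, 2)  # P[x0, x1, x2]

# Exact joint entropy
H_exact = -np.sum(p * np.log(p))

# Entropy computed from single-node and edge marginals only
p0, p1, p2 = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
p01, p12 = P.sum(2), P.sum(0)
H_singles = sum(-np.sum(m * np.log(m)) for m in (p0, p1, p2))
H_edges = (np.sum(p01 * np.log(p01 / np.outer(p0, p1)))
           + np.sum(p12 * np.log(p12 / np.outer(p1, p2))))
H_tree = H_singles - H_edges  # matches H_exact on a tree

print(H_exact, H_tree)
```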
Bethe Free Energy

Combining these two approximations gives us the so-called Bethe free energy approximation:

$$\max_{\tau \in T} \; H_B(\tau) + \sum_{C} \sum_{x_C} \tau_C(x_C) \log \psi_C(x_C)$$

where

$$H_B(\tau) = -\sum_{i \in V} \sum_{x_i} \tau_i(x_i) \log \tau_i(x_i) - \sum_{C} \sum_{x_C} \tau_C(x_C) \log \frac{\tau_C(x_C)}{\prod_{i \in C} \tau_i(x_i)}$$
Bethe Free Energy

$$\max_{\tau \in T} \; H_B(\tau) + \sum_{C} \sum_{x_C} \tau_C(x_C) \log \psi_C(x_C)$$

- This is not a concave optimization problem for general graphs
- It is still difficult to maximize
- However, fixed points of loopy belief propagation correspond to saddle points of this objective over the local marginal polytope