CS 6347 Lectures 8 & 9: Lagrange Multipliers & Variational Bounds
General Optimization

$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad h_i(x) = 0, \; i = 1, \dots, p$$

- $f_0$ is not necessarily convex
- The constraints can be arbitrary functions
Lagrangian

$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)$$

- Incorporates the constraints into a new objective function
- $\lambda \ge 0$ and $\nu$ are vectors of Lagrange multipliers
- The Lagrange multipliers can be thought of as soft constraints
Duality

Construct a dual function by minimizing the Lagrangian over the primal variables:

$$g(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu)$$

- $g(\lambda, \nu) = -\infty$ whenever the Lagrangian is not bounded from below for a fixed $\lambda$ and $\nu$
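A worked one-dimensional example may help here (my own illustration, not from the slides): minimize $x^2$ subject to $1 - x \le 0$.

$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad \frac{\partial L}{\partial x} = 2x - \lambda = 0 \;\Longrightarrow\; x = \frac{\lambda}{2}$$

$$g(\lambda) = \frac{\lambda^2}{4} + \lambda\left(1 - \frac{\lambda}{2}\right) = \lambda - \frac{\lambda^2}{4}$$

Maximizing over $\lambda \ge 0$ gives $\lambda^* = 2$ and $g(2) = 1$, which equals the primal optimum $f_0(1) = 1$, so there is no duality gap in this example.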
The Primal Problem

$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad h_i(x) = 0, \; i = 1, \dots, p$$

Equivalently,

$$\inf_{x} \sup_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$$

(If $x$ violates any constraint, the inner supremum is $+\infty$; if $x$ is feasible, it equals $f_0(x)$.)
The Dual Problem

$$\sup_{\lambda \ge 0, \nu} g(\lambda, \nu) \quad \text{or equivalently} \quad \sup_{\lambda \ge 0, \nu} \inf_{x} L(x, \lambda, \nu)$$

- The dual problem is always concave, even if the primal problem is not convex ($g$ is a pointwise infimum of functions affine in $(\lambda, \nu)$)
Primal vs. Dual

$$\sup_{\lambda \ge 0, \nu} \inf_{x} L(x, \lambda, \nu) \;\le\; \inf_{x} \sup_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$$

Why?

- $g(\lambda, \nu) \le L(x, \lambda, \nu)$ for all $x$
- $L(x, \lambda, \nu) \le f_0(x)$ for any feasible $x$ and $\lambda \ge 0$ ($x$ is feasible if it satisfies all of the constraints)
- Let $x^*$ be the optimal solution to the primal problem; then for any $\lambda \ge 0$ and $\nu$, $g(\lambda, \nu) \le L(x^*, \lambda, \nu) \le f_0(x^*)$
Duality

Under certain conditions, the two optimization problems are equivalent:

$$\sup_{\lambda \ge 0, \nu} \inf_{x} L(x, \lambda, \nu) \;=\; \inf_{x} \sup_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$$

- This is called strong duality
- If the inequality is strict, then we say that there is a duality gap
- The size of the gap is measured by the difference between the two sides of the inequality
Slater's Condition

For any optimization problem of the form

$$\min_{x \in \mathbb{R}^n} f_0(x) \quad \text{subject to} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad Ax = b$$

where $f_0, \dots, f_m$ are convex functions, strong duality holds if there exists an $x$ such that $f_i(x) < 0$ for all $i = 1, \dots, m$ and $Ax = b$.
Some Examples

- Minimize $x^2 + y^2$ subject to $x + y \ge 2$
- Maximize $-x \log x - y \log y - z \log z$ subject to $x, y, z \ge 0$ and $x + y + z = 1$
- Minimize $x$ subject to $x + y \ge 1$
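A sketch of the second example, worked out here via the Lagrangian (the slides only pose it); the nonnegativity constraints turn out to be inactive, so their multipliers vanish:

$$L(x, y, z, \nu) = -x \log x - y \log y - z \log z + \nu(x + y + z - 1)$$

$$\frac{\partial L}{\partial x} = -\log x - 1 + \nu = 0 \;\Longrightarrow\; x = e^{\nu - 1}$$

and symmetrically for $y$ and $z$, so $x = y = z$; the constraint $x + y + z = 1$ then gives $x = y = z = 1/3$. The maximum entropy distribution over three outcomes is the uniform distribution.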
Approximate Marginal Inference

Last week: approximate MAP inference
- Reparameterizations
- Linear programming over the local marginal polytope

Approximate marginal inference (e.g., computing $p(y_i)$)
- Sampling methods (MCMC, etc.)
- Variational methods (loopy belief propagation, TRW, etc.)
KL Divergence

- In order to perform approximate marginal inference, we will try to find distributions that approximate the true distribution
- Ideally, the marginals of the approximating distribution should be easy to compute
- For this, we need a notion of closeness of distributions
KL Divergence

$$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

- Called the Kullback-Leibler divergence
- $D(p \| q) \ge 0$, with equality if and only if $p = q$
- Not symmetric: $D(p \| q) \ne D(q \| p)$ in general
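A minimal numeric check of these properties (a sketch, not from the slides; the distribution values are made up for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # nonnegative
print(kl_divergence(q, p))  # generally different: D(p||q) != D(q||p)
print(kl_divergence(p, p))  # exactly 0 when p = q
```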
Jensen's Inequality

Let $f(x)$ be a convex function and $a_i \ge 0$ such that $\sum_i a_i = 1$. Then

$$\sum_i a_i f(x_i) \ge f\left(\sum_i a_i x_i\right)$$

- A useful inequality when dealing with convex/concave functions
- When does equality hold?
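This inequality is exactly what gives $D(p \| q) \ge 0$ on the previous slide; the one-line derivation (filled in here, since the slides leave it implicit) uses the convex function $f(t) = -\log t$ with weights $a_x = p(x)$:

$$D(p \| q) = \sum_x p(x) \left( -\log \frac{q(x)}{p(x)} \right) \ge -\log \sum_x p(x) \frac{q(x)}{p(x)} = -\log \sum_x q(x) = 0$$

Equality holds exactly when $q(x)/p(x)$ is constant, i.e., when $p = q$.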
KL Divergence

$$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

Suppose that we want to approximate the distribution $p$ with some other distribution $q$ in some family of distributions $Q$. We could minimize the KL divergence in one of two ways:

- $\arg\min_{q \in Q} D(p \| q)$: called the M-projection; as hard as the original inference problem
- $\arg\min_{q \in Q} D(q \| p)$: called the I-projection; potentially easier
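To make the asymmetry concrete, here is a small numeric sketch (my own toy example; the target distribution and the brute-force grid search are not from the lecture). We approximate a correlated distribution over two binary variables with a fully factored $q$:

```python
import numpy as np

# A correlated "true" distribution p over (x1, x2) in {0,1}^2 (made-up numbers)
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])  # p[x1, x2]

def kl(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def product_dist(a, b):
    """q(x1, x2) = q1(x1) * q2(x2) with q1(1) = a, q2(1) = b."""
    return np.outer([1 - a, a], [1 - b, b])

grid = np.linspace(0.01, 0.99, 99)
m_best, i_best, m_val, i_val = None, None, np.inf, np.inf
for a in grid:
    for b in grid:
        q = product_dist(a, b)
        m = kl(p, q)  # M-projection objective: D(p || q)
        i = kl(q, p)  # I-projection objective: D(q || p)
        if m < m_val: m_val, m_best = m, (a, b)
        if i < i_val: i_val, i_best = i, (a, b)

print("M-projection:", m_best)  # ~ (0.5, 0.5): matches the marginals of p
print("I-projection:", i_best)  # breaks symmetry toward one of the two modes
```

The M-projection matches the single-variable marginals of $p$ exactly, while the I-projection trades marginal accuracy for placing its mass where $p$ is large.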
Variational Inference

Let $p(x) = \frac{1}{Z} \prod_{c} \psi_c(x_c)$ be the distribution that we want to approximate with a distribution $q$.

$$\begin{aligned} D(q \| p) &= \sum_x q(x) \log \frac{q(x)}{p(x)} \\ &= \sum_x q(x) \log q(x) - \sum_x q(x) \log p(x) \\ &= -H(q) - \sum_x q(x) \log p(x) \\ &= -H(q) + \log Z - \sum_x q(x) \sum_{c \in C} \log \psi_c(x_c) \\ &= -H(q) + \log Z - \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{aligned}$$

(where $q_c$ denotes the marginal of $q$ on $x_c$)

Where have we seen this before?
MAP Integer Program

$$\max_{\tau} \; \sum_{i \in V} \sum_{x_i} \tau_i(x_i) \log \phi_i(x_i) + \sum_{(i,j) \in E} \sum_{x_i, x_j} \tau_{ij}(x_i, x_j) \log \psi_{ij}(x_i, x_j)$$

such that

- $\sum_{x_i} \tau_i(x_i) = 1$ for all $i \in V$
- $\sum_{x_j} \tau_{ij}(x_i, x_j) = \tau_i(x_i)$ for all $(i,j) \in E$, $x_i$
- $\tau_i(x_i) \in \{0, 1\}$ for all $i \in V$, $x_i$
- $\tau_{ij}(x_i, x_j) \in \{0, 1\}$ for all $(i,j) \in E$, $x_i, x_j$
Variational Inference

Let $p(x) = \frac{1}{Z} \prod_c \psi_c(x_c)$ be the distribution that we want to approximate with a distribution $q$:

$$D(q \| p) = -H(q) + \log Z - \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

Using the observation that the KL divergence is non-negative,

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

- This lower bound holds for any $q$
- Maximizing it over $q$ gives equality (the maximum is attained at $q = p$, where the KL divergence vanishes)
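A quick numeric sanity check of this bound (a sketch with made-up potentials, computed by brute-force enumeration on a tiny chain):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# A tiny pairwise model on a 3-node binary chain (made-up potentials)
psi01 = rng.uniform(0.5, 2.0, size=(2, 2))  # psi(x0, x1)
psi12 = rng.uniform(0.5, 2.0, size=(2, 2))  # psi(x1, x2)

states = list(itertools.product([0, 1], repeat=3))
p_tilde = np.array([psi01[x0, x1] * psi12[x1, x2] for x0, x1, x2 in states])
Z = p_tilde.sum()
p = p_tilde / Z

def lower_bound(q):
    """H(q) + sum_x q(x) log p_tilde(x), which equals log Z - D(q || p)."""
    mask = q > 0
    entropy = -np.sum(q[mask] * np.log(q[mask]))
    return entropy + np.sum(q * np.log(p_tilde))

# The bound holds for an arbitrary q ...
q_rand = rng.uniform(size=len(states))
q_rand /= q_rand.sum()
print(lower_bound(q_rand), "<=", np.log(Z))

# ... and is tight at q = p
print(lower_bound(p), "==", np.log(Z))
```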
Variational Inference

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

The right-hand side is a concave function of $q$. Despite that, this optimization problem is hard! (Surprised?)

- The optimization is over distributions $q$ with exponentially many entries; we need a more compact way to express them
- Computing the entropy is non-trivial
Variational Inference

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

Two kinds of methods are used to deal with these difficulties:

- Mean-field methods: assume that the approximating distribution factorizes as $q(x) = \prod_{i \in V} q_i(x_i)$; similar in spirit to naïve Bayes (see the sketch after this list)
- Relaxation-based methods: replace hard pieces of the optimization with easier optimization problems; similar to the MAP IP to MAP LP relaxation
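A minimal sketch of naive mean-field coordinate ascent on a toy chain (my own illustration, assuming a pairwise model with log-potentials $\theta_{ij}$ and the standard update $q_i(x_i) \propto \exp(\sum_{j} \mathbb{E}_{q_j}[\theta_{ij}(x_i, x_j)])$; a chain is used only so the exact answer is cheap to compute):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Pairwise MRF on a 3-node binary chain; log-potentials are made up
theta = {(0, 1): rng.normal(size=(2, 2)), (1, 2): rng.normal(size=(2, 2))}
edges = list(theta)
neighbors = {0: [(0, 1)], 1: [(0, 1), (1, 2)], 2: [(1, 2)]}

# Naive mean field: q(x) = prod_i q_i(x_i), coordinate ascent on the bound
q = [np.full(2, 0.5) for _ in range(3)]
for _ in range(100):
    for i in range(3):
        log_qi = np.zeros(2)
        for (a, b) in neighbors[i]:
            t = theta[(a, b)]
            if i == a:
                log_qi += t @ q[b]    # E_{q_b}[theta(x_i, x_b)]
            else:
                log_qi += t.T @ q[a]  # E_{q_a}[theta(x_a, x_i)]
        log_qi -= log_qi.max()        # numerical stability
        q[i] = np.exp(log_qi) / np.exp(log_qi).sum()

# Exact marginals by brute force, for comparison
states = list(itertools.product([0, 1], repeat=3))
p = np.array([np.exp(sum(theta[e][s[e[0]], s[e[1]]] for e in edges)) for s in states])
p /= p.sum()
exact = [np.array([p[[s[i] == v for s in states]].sum() for v in (0, 1)]) for i in range(3)]

for i in range(3):
    print(f"node {i}: mean-field {q[i].round(3)}, exact {exact[i].round(3)}")
```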
Relaxation Approach

$$\log Z \ge H(q) + \sum_{c \in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$$

To handle the representation problem, we can use the same LP relaxation trick that we did before. For each $\tau$ in the marginal polytope, we can rewrite the right-hand side as

$$\log Z \ge H(\tau) + \sum_{c \in C} \sum_{x_c} \tau_c(x_c) \log \psi_c(x_c)$$

where $H(\tau)$ is the maximum entropy over all distributions with marginals $\tau$.
Relaxation Approach

$$\max_{\tau \in M} \; H(\tau) + \sum_{c \in C} \sum_{x_c} \tau_c(x_c) \log \psi_c(x_c)$$

The marginal polytope $M$ is intractable to optimize over. Use the local polytope $T$ instead: the set of $\tau \ge 0$ such that

$$\sum_{x_C \setminus x_i} \tau_C(x_C) = \tau_i(x_i) \quad \text{for all } C, \; i \in C, \; x_i \qquad \text{and} \qquad \sum_{x_i} \tau_i(x_i) = 1 \quad \text{for all } i \in V$$
Relaxation Approach

$$\max_{\tau \in T} \; H(\tau) + \sum_{c \in C} \sum_{x_c} \tau_c(x_c) \log \psi_c(x_c)$$

- Even with the polytope relaxation, the optimization problem remains challenging, as computing the entropy remains nontrivial
- We will need to approximate the entropy as well
- For which distributions is it easy to compute the entropy?
Tree Reparameterization

On a tree, the joint distribution factorizes in a special way:

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{i \in V} p_i(x_i) \prod_{(i,j) \in E} \frac{p_{ij}(x_i, x_j)}{p_i(x_i) \, p_j(x_j)}$$

- $p_i$ is the marginal distribution of the $i$th variable and $p_{ij}$ is the pairwise marginal distribution for the edge $(i,j) \in E$
- This applies to clique trees as well (i.e., when the factor graph is a tree):

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{i \in V} p_i(x_i) \prod_{C} \frac{p_C(x_C)}{\prod_{i \in C} p_i(x_i)}$$
Entropy of a Tree

Given this factorization, we can easily compute the entropy of a tree-structured distribution:

$$H_{tree} = -\sum_{i \in V} \sum_{x_i} p_i(x_i) \log p_i(x_i) - \sum_{C} \sum_{x_C} p_C(x_C) \log \frac{p_C(x_C)}{\prod_{i \in C} p_i(x_i)}$$

- This only depends on the marginals
- Use this as an approximation for general distributions!
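A brute-force check of this identity on a small chain (a sketch with made-up potentials; on a tree the two numbers agree exactly):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# A tree (3-node binary chain) with random pairwise potentials
psi01 = rng.uniform(0.5, 2.0, size=(2, 2))
psi12 = rng.uniform(0.5, 2.0, size=(2, 2))

states = list(itertools.product([0, 1], repeat=3))
p = np.array([psi01[x0, x1] * psi12[x1, x2] for x0, x1, x2 in states])
p /= p.sum()
P = p.reshape(2, 2, 2)  # P[x0, x1, x2]

# Exact joint entropy
H_exact = -np.sum(p * np.log(p))

# Entropy computed from single-node and edge marginals only
p0, p1, p2 = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
p01, p12 = P.sum(2), P.sum(0)
H_singles = sum(-np.sum(m * np.log(m)) for m in (p0, p1, p2))
H_edges = (np.sum(p01 * np.log(p01 / np.outer(p0, p1)))
           + np.sum(p12 * np.log(p12 / np.outer(p1, p2))))
H_tree = H_singles - H_edges  # matches H_exact on a tree

print(H_exact, H_tree)
```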
Bethe Free Energy

Combining these two approximations gives us the so-called Bethe free energy approximation:

$$\max_{\tau \in T} \; H_B(\tau) + \sum_{C} \sum_{x_C} \tau_C(x_C) \log \psi_C(x_C)$$

where

$$H_B(\tau) = -\sum_{i \in V} \sum_{x_i} \tau_i(x_i) \log \tau_i(x_i) - \sum_{C} \sum_{x_C} \tau_C(x_C) \log \frac{\tau_C(x_C)}{\prod_{i \in C} \tau_i(x_i)}$$
Bethe Free Energy

$$\max_{\tau \in T} \; H_B(\tau) + \sum_{C} \sum_{x_C} \tau_C(x_C) \log \psi_C(x_C)$$

- This is not a concave optimization problem for general graphs
- It is still difficult to maximize
- However, fixed points of loopy belief propagation correspond to saddle points of this objective over the local marginal polytope