CS 6347 Lecture 8 & 9: Lagrange Multipliers & Variational Bounds


General Optimization

min_{x ∈ ℝⁿ} f_0(x)
subject to: f_i(x) ≤ 0, i = 1, …, m
            h_i(x) = 0, i = 1, …, p

f_0 is not necessarily convex, and the constraints can be arbitrary functions.

Lagrangian

L(x, λ, ν) = f_0(x) + Σ_{i=1}^{m} λ_i f_i(x) + Σ_{i=1}^{p} ν_i h_i(x)

Incorporate the constraints into a new objective function. Here λ ≥ 0 and ν are vectors of Lagrange multipliers; the Lagrange multipliers can be thought of as soft constraints.

Duality

Construct a dual function by minimizing the Lagrangian over the primal variables:

g(λ, ν) = inf_x L(x, λ, ν)

with g(λ, ν) = −∞ whenever the Lagrangian is not bounded from below for the fixed λ and ν.
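A quick worked example: for min_x x² subject to x ≥ 1 (i.e., f_1(x) = 1 − x ≤ 0), the Lagrangian is L(x, λ) = x² + λ(1 − x). Minimizing over x gives x = λ/2, so the dual function is g(λ) = λ − λ²/4. It is maximized at λ = 2 with value 1, which is exactly the primal optimum attained at x = 1.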

The Primal Problem

min_{x ∈ ℝⁿ} f_0(x)
subject to: f_i(x) ≤ 0, i = 1, …, m
            h_i(x) = 0, i = 1, …, p

Equivalently, inf_x sup_{λ ≥ 0, ν} L(x, λ, ν): the inner sup equals f_0(x) when x is feasible and +∞ otherwise.

The Dual Problem

sup_{λ ≥ 0, ν} g(λ, ν)

Equivalently, sup_{λ ≥ 0, ν} inf_x L(x, λ, ν).

The dual problem is always a concave maximization problem (g is a pointwise infimum of functions that are affine in (λ, ν)), even if the primal problem is not convex.

Primal vs. Dual

Weak duality: sup_{λ ≥ 0, ν} inf_x L(x, λ, ν) ≤ inf_x sup_{λ ≥ 0, ν} L(x, λ, ν)

Why? For any feasible x (x is feasible if it satisfies all of the constraints) and any λ ≥ 0, we have L(x, λ, ν) ≤ f_0(x). So, letting x* be the optimal solution to the primal problem,

g(λ, ν) ≤ L(x*, λ, ν) ≤ f_0(x*)

for every λ ≥ 0 and ν, and taking the sup over (λ, ν) gives the inequality.

Duality

Under certain conditions, the two optimization problems are equivalent:

sup_{λ ≥ 0, ν} inf_x L(x, λ, ν) = inf_x sup_{λ ≥ 0, ν} L(x, λ, ν)

This is called strong duality. If the inequality is strict, we say that there is a duality gap; the size of the gap is the difference between the two sides of the inequality.

Slater's Condition

For any optimization problem of the form

min_{x ∈ ℝⁿ} f_0(x)
subject to: f_i(x) ≤ 0, i = 1, …, m
            Ax = b

where f_0, …, f_m are convex functions, strong duality holds if there exists an x such that f_i(x) < 0 for i = 1, …, m and Ax = b.

Some Examples

- Minimize x² + y² subject to x + y ≥ 2
- Maximize −x log x − y log y − z log z subject to x, y, z ≥ 0 and x + y + z = 1
- Minimize x subject to x + y ≥ 1
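As a numerical sanity check of the first example, here is a minimal Python sketch (not from the lecture; the closed-form dual g(λ) = 2λ − λ²/2 is derived by hand for this specific problem) comparing the primal and dual optima:

```python
# Toy check: minimize x^2 + y^2 subject to x + y >= 2.
# Lagrangian: L(x, y, λ) = x^2 + y^2 + λ(2 - x - y); minimizing over (x, y)
# gives x = y = λ/2, so the dual function is g(λ) = 2λ - λ^2/2.
import numpy as np
from scipy.optimize import minimize

primal = minimize(
    lambda v: v[0] ** 2 + v[1] ** 2,
    x0=np.array([2.0, 2.0]),
    constraints=[{"type": "ineq", "fun": lambda v: v[0] + v[1] - 2}],
)

lam = np.linspace(0, 4, 401)
dual_values = 2 * lam - lam ** 2 / 2      # g(λ) evaluated on a grid

print("primal optimum:", primal.fun)      # ≈ 2.0, attained at (x, y) = (1, 1)
print("dual optimum:  ", dual_values.max())  # = 2.0, so there is no duality gap
```

Slater's condition holds here (for instance, (x, y) = (2, 2) is strictly feasible), which is why the gap is zero.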

Approximate Marginal Inference

Last week: approximate MAP inference
- Reparameterizations
- Linear programming over the local marginal polytope

Approximate marginal inference (e.g., p(y_i)):
- Sampling methods (MCMC, etc.)
- Variational methods (loopy belief propagation, TRW, etc.)

KL Divergence

In order to perform approximate marginal inference, we will try to find distributions that approximate the true distribution. Ideally, the marginals of the approximating distribution should be easy to compute. For this, we need a notion of closeness between distributions.

KL Divergence

D(p‖q) = Σ_x p(x) log [ p(x) / q(x) ]

This is called the Kullback-Leibler divergence. D(p‖q) ≥ 0, with equality if and only if p = q. It is not symmetric: in general D(p‖q) ≠ D(q‖p).
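A minimal sketch of the definition in code (the function name and example distributions are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    # D(p || q) = Σ_x p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))                                    # >= 0, and 0 only when p == q
print(np.isclose(kl_divergence(p, q), kl_divergence(q, p)))   # False: KL is not symmetric
```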

Jensen's Inequality

Let f(x) be a convex function and let a_i ≥ 0 satisfy Σ_i a_i = 1. Then

Σ_i a_i f(x_i) ≥ f( Σ_i a_i x_i )

This is a useful inequality when dealing with convex/concave functions. When does equality hold?
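One place Jensen's inequality shows up immediately is the non-negativity of the KL divergence: since log is concave, −D(p‖q) = Σ_x p(x) log [ q(x)/p(x) ] ≤ log Σ_x p(x) · q(x)/p(x) = log 1 = 0, with equality exactly when q(x)/p(x) is constant, i.e., p = q.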

KL Divergence

D(p‖q) = Σ_x p(x) log [ p(x) / q(x) ]

Suppose that we want to approximate the distribution p with some other distribution q from a family of distributions Q. We could minimize the KL divergence in one of two ways:

- arg min_{q ∈ Q} D(p‖q), called the M-projection: as hard as the original inference problem
- arg min_{q ∈ Q} D(q‖p), called the I-projection: potentially easier

Variational Inference

Let p(x) = (1/Z) Π_C ψ_C(x_C) be the distribution that we want to approximate with a distribution q. Then

D(q‖p) = Σ_x q(x) log [ q(x) / p(x) ]
       = Σ_x q(x) log q(x) − Σ_x q(x) log p(x)
       = −H(q) − Σ_x q(x) log p(x)
       = −H(q) + log Z − Σ_x q(x) Σ_C log ψ_C(x_C)
       = −H(q) + log Z − Σ_C Σ_{x_C} q_C(x_C) log ψ_C(x_C)

where q_C denotes the marginal of q over the variables in clique C. Where have we seen this before?

MAP Integer Program

max_τ Σ_{i∈V} Σ_{x_i} τ_i(x_i) log φ_i(x_i) + Σ_{(i,j)∈E} Σ_{x_i, x_j} τ_ij(x_i, x_j) log ψ_ij(x_i, x_j)

such that
Σ_{x_i} τ_i(x_i) = 1, for all i ∈ V
Σ_{x_j} τ_ij(x_i, x_j) = τ_i(x_i), for all (i,j) ∈ E and all x_i
τ_i(x_i) ∈ {0,1}, for all i ∈ V and all x_i
τ_ij(x_i, x_j) ∈ {0,1}, for all (i,j) ∈ E and all x_i, x_j

Note that the last term in the KL decomposition above, Σ_C Σ_{x_C} q_C(x_C) log ψ_C(x_C), is a linear function of the marginals of q with log-potential coefficients, just like this objective.

Variational Inference

Let p(x) = (1/Z) Π_C ψ_C(x_C) be the distribution that we want to approximate with distribution q:

D(q‖p) = −H(q) + log Z − Σ_C Σ_{x_C} q_C(x_C) log ψ_C(x_C)

Using the observation that the KL divergence is non-negative,

log Z ≥ H(q) + Σ_C Σ_{x_C} q_C(x_C) log ψ_C(x_C)

This lower bound holds for any q, and maximizing it over q gives equality (the maximum is attained at q = p).
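A tiny numerical illustration of the bound (a sketch with an assumed toy potential, not from the lecture) on a two-variable model with a single pairwise potential and a fully factorized q:

```python
import numpy as np

# Two-variable model p(x1, x2) = ψ(x1, x2) / Z with a single pairwise potential.
psi = np.array([[4.0, 1.0],
                [1.0, 3.0]])
logZ = np.log(psi.sum())                 # exact log Z, by enumeration

def lower_bound(q1, q2):
    # H(q) + Σ_x q(x) log ψ(x) for the fully factorized q(x1, x2) = q1(x1) q2(x2)
    q = np.outer(q1, q2)
    return -np.sum(q * np.log(q)) + np.sum(q * np.log(psi))

for p1 in [0.3, 0.5, 0.7]:
    q1 = np.array([p1, 1 - p1])
    q2 = np.array([0.6, 0.4])
    print(round(lower_bound(q1, q2), 4), "<=", round(logZ, 4))  # holds for every q
```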

Variational Inference

log Z ≥ H(q) + Σ_C Σ_{x_C} q_C(x_C) log ψ_C(x_C)

The right-hand side is a concave function of q. Despite that, this optimization problem is hard (surprised?):
- q is a distribution over exponentially many assignments x, so we need a more compact way to express it
- Computing the entropy H(q) is non-trivial

Variational Inference

log Z ≥ H(q) + Σ_C Σ_{x_C} q_C(x_C) log ψ_C(x_C)

Two kinds of methods are used to deal with these difficulties:
- Mean-field methods: assume that the approximating distribution factorizes as q(x) = Π_{i∈V} q_i(x_i), a similar idea to naïve Bayes (see the coordinate-ascent sketch after this list)
- Relaxation-based methods: replace hard pieces of the optimization with easier optimization problems, similar to the MAP IP → MAP LP relaxation
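A minimal naive mean-field sketch for a pairwise model (the toy chain, its potentials, and the variable names are assumptions for illustration; the coordinate update is the standard one obtained by maximizing the bound over a single q_i while holding the others fixed):

```python
import numpy as np

# Toy pairwise MRF on a 3-node chain 0 - 1 - 2 with binary variables.
phi = [np.array([1.0, 2.0]), np.array([1.0, 1.0]), np.array([2.0, 1.0])]
psi = {(0, 1): np.array([[3.0, 1.0], [1.0, 3.0]]),
       (1, 2): np.array([[3.0, 1.0], [1.0, 3.0]])}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# Fully factorized approximation q(x) = q0(x0) q1(x1) q2(x2), initialized uniformly.
q = [np.full(2, 0.5) for _ in range(3)]

def edge_logpsi(i, j):
    # log ψ_ij as a matrix indexed by (x_i, x_j)
    return np.log(psi[(i, j)]) if (i, j) in psi else np.log(psi[(j, i)]).T

# Coordinate ascent: q_i(x_i) ∝ exp( log φ_i(x_i) + Σ_{j∈N(i)} Σ_{x_j} q_j(x_j) log ψ_ij(x_i, x_j) )
for _ in range(50):
    for i in range(3):
        log_qi = np.log(phi[i])
        for j in neighbors[i]:
            log_qi = log_qi + edge_logpsi(i, j) @ q[j]   # expected log-potential
        log_qi -= log_qi.max()                           # numerical stability
        q[i] = np.exp(log_qi) / np.exp(log_qi).sum()

print([qi.round(3) for qi in q])                         # approximate marginals q_i(x_i)
```

Each coordinate update can only increase the lower bound, so the sweeps converge to a local optimum of the mean-field objective.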

Relaxation Approach

log Z ≥ H(q) + Σ_C Σ_{x_C} q_C(x_C) log ψ_C(x_C)

To handle the representation problem, we can use the same LP relaxation trick that we did before. For each τ in the marginal polytope, we can rewrite the right-hand side to obtain

log Z ≥ H(τ) + Σ_C Σ_{x_C} τ_C(x_C) log ψ_C(x_C)

where H(τ) is the maximum entropy over all distributions with these marginals.

Relaxation Approach

max_{τ ∈ M} H(τ) + Σ_C Σ_{x_C} τ_C(x_C) log ψ_C(x_C)

The marginal polytope, M, is intractable to optimize over. Use the local polytope, T, instead:

Σ_{x_C \ x_i} τ_C(x_C) = τ_i(x_i), for all cliques C, all i ∈ C, and all x_i
Σ_{x_i} τ_i(x_i) = 1, for all i ∈ V

Relaxation Approach

max_{τ ∈ T} H(τ) + Σ_C Σ_{x_C} τ_C(x_C) log ψ_C(x_C)

Even with the polytope relaxation, the optimization problem remains challenging because computing the entropy is still nontrivial. We will need to approximate the entropy as well. For which distributions is it easy to compute the entropy?

Tree Reparameterization

On a tree, the joint distribution factorizes in a special way:

p(x_1, …, x_n) = (1/Z) Π_{i∈V} p_i(x_i) Π_{(i,j)∈E} [ p_ij(x_i, x_j) / ( p_i(x_i) p_j(x_j) ) ]

where p_i is the marginal distribution of the i-th variable and p_ij is the pairwise marginal distribution for the edge (i,j) ∈ E.

This applies to clique trees as well (i.e., when the factor graph is a tree):

p(x_1, …, x_n) = (1/Z) Π_{i∈V} p_i(x_i) Π_C [ p_C(x_C) / Π_{i∈C} p_i(x_i) ]

Entropy of a Tree

Given this factorization, we can easily compute the entropy of a tree-structured distribution:

H_tree = − Σ_{i∈V} Σ_{x_i} p_i(x_i) log p_i(x_i) − Σ_C Σ_{x_C} p_C(x_C) log [ p_C(x_C) / Π_{i∈C} p_i(x_i) ]

This only depends on the marginals. Use this as an approximation for general distributions!
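A quick numerical check of both the tree factorization and the entropy formula on an assumed toy chain (the potentials and names are illustrative, not from the lecture):

```python
import numpy as np

# Toy 3-node binary chain 0 - 1 - 2 with joint p(x) ∝ ψ01(x0, x1) ψ12(x1, x2).
psi01 = np.array([[3.0, 1.0], [1.0, 2.0]])
psi12 = np.array([[2.0, 1.0], [1.0, 4.0]])

# Exact joint by enumeration: p[x0, x1, x2]
p = psi01[:, :, None] * psi12[None, :, :]
p /= p.sum()

# Node and edge marginals
p0, p1, p2 = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
p01, p12 = p.sum(2), p.sum(0)

# Tree factorization: p = Π_i p_i * Π_edges p_ij / (p_i p_j)   (here Z = 1)
recon = (p0[:, None, None] * p1[None, :, None] * p2[None, None, :]
         * (p01 / np.outer(p0, p1))[:, :, None]
         * (p12 / np.outer(p1, p2))[None, :, :])
print(np.allclose(p, recon))          # True on a tree

# Entropy from marginals vs. exact joint entropy
def H(m):
    return -np.sum(m * np.log(m))

H_exact = H(p)
H_tree = (H(p0) + H(p1) + H(p2)
          - np.sum(p01 * np.log(p01 / np.outer(p0, p1)))
          - np.sum(p12 * np.log(p12 / np.outer(p1, p2))))
print(np.allclose(H_exact, H_tree))   # True: the marginals determine the entropy
```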

Bethe Free Energy

Combining these two approximations gives us the so-called Bethe free energy approximation:

max_{τ ∈ T} H_B(τ) + Σ_C Σ_{x_C} τ_C(x_C) log ψ_C(x_C)

where

H_B(τ) = − Σ_{i∈V} Σ_{x_i} τ_i(x_i) log τ_i(x_i) − Σ_C Σ_{x_C} τ_C(x_C) log [ τ_C(x_C) / Π_{i∈C} τ_i(x_i) ]

Bethe Free Energy

max_{τ ∈ T} H_B(τ) + Σ_C Σ_{x_C} τ_C(x_C) log ψ_C(x_C)

This is not a concave optimization problem for general graphs, and it is still difficult to maximize. However, fixed points of loopy belief propagation correspond to saddle points of this objective over the local marginal polytope.