Probabilistic Graphical Models
David Sontag
New York University
Lecture 6, March 7, 2013
Today's lecture
1. Dual decomposition
2. MAP inference as an integer linear program
3. Linear programming relaxations for MAP inference
MAP inference

Recall the MAP inference task, $\arg\max_x p(x)$, where
$$p(x) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)$$
(we assume any evidence has been subsumed into the potentials, as discussed in the last lecture).

Since the normalization term is simply a constant, this is equivalent to
$$\arg\max_x \prod_{c \in C} \phi_c(x_c)$$
(called the max-product inference task).

Furthermore, since log is monotonic, letting $\theta_c(x_c) = \log \phi_c(x_c)$, this is equivalent to
$$\arg\max_x \sum_{c \in C} \theta_c(x_c)$$
(called max-sum).
Exactly solving MAP, beyond trees

MAP as a discrete optimization problem is
$$\arg\max_x \sum_{i \in V} \theta_i(x_i) + \sum_{c \in C} \theta_c(x_c)$$

This is a very general discrete optimization problem: many hard combinatorial optimization problems can be written in this form (e.g., 3-SAT).

Studied in operations research, theoretical computer science, AI (constraint satisfaction, weighted SAT), etc.

A very fast-moving field, both for theory and for heuristics.
Motivating application: protein side-chain placement

Find the minimum-energy conformation of amino acid side-chains along a fixed carbon backbone: given the desired 3D backbone structure, choose the amino-acid orientations giving the most stable folding (Yanover, Meltzer, Weiss '06).

- Orientations of the side-chains are represented by discretized angles called rotamers.
- Rotamer choices for nearby amino acids are energetically coupled (attractive and repulsive forces), giving a potential function for each edge, e.g. $\theta_{12}(x_1, x_2)$, $\theta_{13}(x_1, x_3)$, $\theta_{34}(x_3, x_4)$.
- Key problems: find marginals (involves the partition function); find the most likely assignment (MAP) — the focus of this lecture.

[Figure: protein backbone with side-chain variables $X_1, \ldots, X_4$, one per amino acid, joined by pairwise potentials; the joint distribution over the variables is given by these potentials.]
Motivating application: dependency parsing

Given a sentence, predict the dependency tree that relates the words:
- An arc goes from each head word to the words that modify it; a valid dependency parse is a directed tree.
- The parse may be non-projective: a word and its descendants may not form a contiguous subsequence.

[Figure 1.1: example of dependency parsing for a sentence in English; the red arc demonstrates a non-projective dependency.]

For $m$ words, introduce $m(m-1)$ binary arc selection variables $x_{ij} \in \{0, 1\}$, and let $x_i = \{x_{ij}\}_{j \neq i}$ (all outgoing edges of word $i$). Predict with:
$$\max_x \; \theta_T(x) + \sum_{ij} \theta_{ij}(x_{ij}) + \sum_i \theta_i(x_i)$$

Without additional restrictions on the choice of potential functions, or on which edges to include, the problem is known to be NP-hard. Using the dual decomposition approach, we will break the problem into much simpler subproblems, involving maximizations of each single-node potential $\theta_i(x_i)$ and each edge potential $\theta_{ij}(x_i, x_j)$ independently from the other terms.
Dual decomposition

Consider the MAP problem for pairwise Markov random fields:
$$\mathrm{MAP}(\theta) = \max_x \sum_{i \in V} \theta_i(x_i) + \sum_{ij \in E} \theta_{ij}(x_i, x_j)$$

If we push the maximizations inside the sums, the value can only increase:
$$\mathrm{MAP}(\theta) \le \sum_{i \in V} \max_{x_i} \theta_i(x_i) + \sum_{ij \in E} \max_{x_i, x_j} \theta_{ij}(x_i, x_j)$$

Note that the right-hand side can be easily evaluated.

In PS3, problem 2, you showed that one can always reparameterize a distribution by operations like
$$\theta_i^{\text{new}}(x_i) = \theta_i^{\text{old}}(x_i) + f(x_i), \qquad \theta_{ij}^{\text{new}}(x_i, x_j) = \theta_{ij}^{\text{old}}(x_i, x_j) - f(x_i)$$
for any function $f(x_i)$, without changing the distribution/energy.
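To make this bound concrete, here is a minimal Python sketch (not from the lecture) on a made-up three-variable binary chain MRF: the decomposed bound is trivial to evaluate, term by term, and always upper-bounds the exact MAP value found by enumeration.

```python
import itertools

# Toy chain MRF on 3 binary variables, edges 0-1 and 1-2.
# All potential values are made-up numbers for illustration.
theta_i = {0: [0.0, 1.0], 1: [0.5, 0.0], 2: [0.0, 0.2]}
theta_ij = {(0, 1): [[2.0, -1.0], [-1.0, 2.0]],
            (1, 2): [[1.0, 0.0], [0.0, 1.0]]}

def score(x):
    """Total score of a joint assignment x = (x_0, x_1, x_2)."""
    s = sum(theta_i[i][x[i]] for i in theta_i)
    s += sum(theta_ij[e][x[e[0]]][x[e[1]]] for e in theta_ij)
    return s

# Exact MAP value by brute-force enumeration (8 assignments).
map_val = max(score(x) for x in itertools.product((0, 1), repeat=3))

# Pushing the maximizations inside the sums: each term is maximized
# independently, so the result can only be >= MAP(theta).
bound = sum(max(theta_i[i]) for i in theta_i)
bound += sum(max(max(row) for row in theta_ij[e]) for e in theta_ij)

print(map_val, bound)  # the bound is loose here: 4.2 vs 4.7
assert bound >= map_val
```

The gap between the two numbers is exactly what the dual variables $\delta$ introduced next are designed to close.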
Dual decomposition

[Figure 1.2: Illustration of the dual decomposition objective. Left: the original pairwise model, consisting of four factors $f(x_1, x_2)$, $g(x_1, x_3)$, $h(x_2, x_4)$, $k(x_3, x_4)$ over variables $x_1, \ldots, x_4$. Right: the maximization problems corresponding to the objective $L(\delta)$; each blue ellipse contains one factor to be maximized over, with per-factor copies of the variables (e.g. $f_1(x_1)$, $f_2(x_2)$, $g_1(x_1)$, $g_3(x_3)$, $h_2(x_2)$, $h_4(x_4)$, $k_3(x_3)$, $k_4(x_4)$).]
Dual decomposition

Define:
$$\bar\theta_i(x_i) = \theta_i(x_i) + \sum_{ij \in E} \delta_{j \to i}(x_i)$$
$$\bar\theta_{ij}(x_i, x_j) = \theta_{ij}(x_i, x_j) - \delta_{j \to i}(x_i) - \delta_{i \to j}(x_j)$$

It is easy to verify that, for all $x$,
$$\sum_i \bar\theta_i(x_i) + \sum_{ij \in E} \bar\theta_{ij}(x_i, x_j) = \sum_i \theta_i(x_i) + \sum_{ij \in E} \theta_{ij}(x_i, x_j)$$

Thus, we have that:
$$\mathrm{MAP}(\theta) = \mathrm{MAP}(\bar\theta) \le \sum_{i \in V} \max_{x_i} \bar\theta_i(x_i) + \sum_{ij \in E} \max_{x_i, x_j} \bar\theta_{ij}(x_i, x_j)$$

Every value of $\delta$ gives a different upper bound on the value of the MAP! The tightest upper bound can be obtained by minimizing the right-hand side with respect to $\delta$!
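The invariance claim is easy to check numerically. Here is a quick sketch (not from the lecture) on a made-up two-variable model with arbitrary messages $\delta$: the reparameterized potentials assign exactly the same total score to every joint assignment.

```python
import itertools
import random

# Two binary variables, one edge; arbitrary made-up potentials.
theta_1 = [0.3, -0.5]
theta_2 = [1.0, 0.0]
theta_12 = [[0.2, 0.7], [-0.1, 0.4]]

random.seed(0)
# Arbitrary messages delta_{2->1}(x_1) and delta_{1->2}(x_2).
d21 = [random.uniform(-1, 1) for _ in range(2)]
d12 = [random.uniform(-1, 1) for _ in range(2)]

# Reparameterized potentials: the singletons absorb the messages,
# the edge subtracts them back out.
tb_1 = [theta_1[x] + d21[x] for x in range(2)]
tb_2 = [theta_2[x] + d12[x] for x in range(2)]
tb_12 = [[theta_12[a][b] - d21[a] - d12[b] for b in range(2)]
         for a in range(2)]

# The total score of every joint assignment is unchanged.
for x1, x2 in itertools.product(range(2), repeat=2):
    orig = theta_1[x1] + theta_2[x2] + theta_12[x1][x2]
    repar = tb_1[x1] + tb_2[x2] + tb_12[x1][x2]
    assert abs(orig - repar) < 1e-12
print("reparameterization leaves all scores unchanged")
```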
Dual decomposition

We obtain the following dual objective:
$$L(\delta) = \sum_{i \in V} \max_{x_i} \Big( \theta_i(x_i) + \sum_{ij \in E} \delta_{j \to i}(x_i) \Big) + \sum_{ij \in E} \max_{x_i, x_j} \Big( \theta_{ij}(x_i, x_j) - \delta_{j \to i}(x_i) - \delta_{i \to j}(x_j) \Big)$$
$$\mathrm{DUAL\text{-}LP}(\theta) = \min_\delta L(\delta)$$

This provides an upper bound on the value of the MAP assignment:
$$\mathrm{MAP}(\theta) \le \mathrm{DUAL\text{-}LP}(\theta) \le L(\delta)$$

How can we find a $\delta$ that gives tight bounds?
Solving the dual efficiently

There are many ways to solve the dual linear program, i.e., to minimize with respect to $\delta$:
$$L(\delta) = \sum_{i \in V} \max_{x_i} \Big( \theta_i(x_i) + \sum_{ij \in E} \delta_{j \to i}(x_i) \Big) + \sum_{ij \in E} \max_{x_i, x_j} \Big( \theta_{ij}(x_i, x_j) - \delta_{j \to i}(x_i) - \delta_{i \to j}(x_j) \Big)$$

One option is to use the subgradient method. Can also solve using block coordinate descent, which gives algorithms that look very much like max-sum belief propagation:
Max-product linear programming (MPLP) algorithm

Input: a set of factors $\theta_i(x_i)$, $\theta_{ij}(x_i, x_j)$.
Output: an assignment $x_1, \ldots, x_n$ that approximates the MAP.

Algorithm:
- Initialize $\delta_{i \to j}(x_j) = 0$ and $\delta_{j \to i}(x_i) = 0$, for all $ij \in E$, $x_i$, $x_j$.
- Iterate until small enough change in $L(\delta)$: for each edge $ij \in E$ (sequentially), perform the updates
$$\delta_{j \to i}(x_i) = -\frac{1}{2}\, \delta_i^{-j}(x_i) + \frac{1}{2} \max_{x_j} \big[ \theta_{ij}(x_i, x_j) + \delta_j^{-i}(x_j) \big]$$
$$\delta_{i \to j}(x_j) = -\frac{1}{2}\, \delta_j^{-i}(x_j) + \frac{1}{2} \max_{x_i} \big[ \theta_{ij}(x_i, x_j) + \delta_i^{-j}(x_i) \big]$$
  where $\delta_i^{-j}(x_i) = \theta_i(x_i) + \sum_{ik \in E,\, k \neq j} \delta_{k \to i}(x_i)$.
- Return $x_i \in \arg\max_{\hat x_i} \bar\theta_i^\delta(\hat x_i)$.
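The updates above can be sketched directly in Python. The following is an illustrative implementation on a made-up three-variable binary chain (the numbers are not from the lecture); since the graph is a tree, the dual becomes tight, so $L(\delta)$ matches the exact MAP value and decoding recovers an optimal assignment.

```python
import itertools

# Toy binary chain MRF: nodes 0-1-2, made-up potentials.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
theta_i = {0: [0.0, 1.0], 1: [0.5, 0.0], 2: [0.0, 0.2]}
theta_ij = {(0, 1): [[2.0, -1.0], [-1.0, 2.0]],
            (1, 2): [[1.0, 0.0], [0.0, 1.0]]}

def edge_pot(i, j, xi, xj):
    """theta_ij, regardless of the edge's orientation in E."""
    if (i, j) in theta_ij:
        return theta_ij[(i, j)][xi][xj]
    return theta_ij[(j, i)][xj][xi]

# delta[(j, i)][x_i] holds the message delta_{j->i}(x_i).
delta = {}
for (i, j) in E:
    delta[(i, j)] = [0.0, 0.0]
    delta[(j, i)] = [0.0, 0.0]

def d_minus(i, j, xi):
    """delta_i^{-j}(x_i): theta_i plus all incoming messages except j's."""
    return theta_i[i][xi] + sum(delta[(k, i)][xi]
                                for k in nbrs[i] if k != j)

def L():
    """Dual objective L(delta) for the current messages."""
    val = sum(max(theta_i[i][x] + sum(delta[(k, i)][x] for k in nbrs[i])
                  for x in (0, 1)) for i in V)
    val += sum(max(edge_pot(i, j, a, b)
                   - delta[(j, i)][a] - delta[(i, j)][b]
                   for a in (0, 1) for b in (0, 1)) for (i, j) in E)
    return val

# MPLP block coordinate descent over the edges.
for _ in range(50):
    for (i, j) in E:
        for a in (0, 1):
            delta[(j, i)][a] = (-0.5 * d_minus(i, j, a)
                + 0.5 * max(edge_pot(i, j, a, b) + d_minus(j, i, b)
                            for b in (0, 1)))
        for b in (0, 1):
            delta[(i, j)][b] = (-0.5 * d_minus(j, i, b)
                + 0.5 * max(edge_pot(i, j, a, b) + d_minus(i, j, a)
                            for a in (0, 1)))

# Decode from the reparameterized single-node terms.
x_hat = tuple(max((0, 1), key=lambda s: theta_i[i][s]
              + sum(delta[(k, i)][s] for k in nbrs[i])) for i in V)

def score(x):
    return (sum(theta_i[i][x[i]] for i in V)
            + sum(theta_ij[(i, j)][x[i]][x[j]] for (i, j) in E))

map_val = max(score(x) for x in itertools.product((0, 1), repeat=3))
print("dual:", L(), "decoded:", score(x_hat), "exact MAP:", map_val)
```

Note that the per-edge update only reads messages from *other* edges (via $\delta^{-j}$ and $\delta^{-i}$), so the two directions of an edge can be updated in either order.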
Generalization to arbitrary factor graphs

Inputs: a set of factors $\theta_i(x_i)$, $\theta_f(x_f)$.
Output: an assignment $x_1, \ldots, x_n$ that approximates the MAP.

Algorithm:
- Initialize $\delta_{f \to i}(x_i) = 0$, for all $f \in F$, $i \in f$, $x_i$.
- Iterate until small enough change in $L(\delta)$ (see Eq. 1.2): for each $f \in F$, perform the updates
$$\delta_{f \to i}(x_i) = -\delta_i^{-f}(x_i) + \frac{1}{|f|} \max_{x_{f \setminus i}} \Big[ \theta_f(x_f) + \sum_{\hat i \in f} \delta_{\hat i}^{-f}(x_{\hat i}) \Big] \qquad (1.16)$$
  simultaneously for all $i \in f$ and $x_i$, where $\delta_i^{-f}(x_i) = \theta_i(x_i) + \sum_{\hat f \neq f} \delta_{\hat f \to i}(x_i)$.
- Return $x_i \in \arg\max_{\hat x_i} \bar\theta_i^\delta(\hat x_i)$ (see Eq. 1.6).

[Figure 1.4: description of the MPLP block coordinate descent algorithm for minimizing the dual $L(\delta)$ (see Section 1.5.2). Similar algorithms can be devised for different choices of coordinate blocks.]
Experimental results

Comparison of four coordinate descent algorithms (MSD, MSD++, MPLP, Star) on a 10×10 two-dimensional Ising grid:

[Figure 1.5: the dual objective $L(\delta)$ as a function of the iteration number, ranging from roughly 52 down to 46 over 60 iterations, for MSD, MSD++, MPLP, and Star.]
Experimental results

Performance on a stereo vision inference task:

[Figure: objective vs. iteration, showing the dual objective (upper bound) decreasing and the value of the decoded assignment increasing until the duality gap closes, at which point the problem is solved optimally.]
Today's lecture
1. Dual decomposition
2. MAP inference as an integer linear program
3. Linear programming relaxations for MAP inference
MAP as an integer linear program (ILP)

MAP as a discrete optimization problem is
$$\arg\max_x \sum_{i \in V} \theta_i(x_i) + \sum_{ij \in E} \theta_{ij}(x_i, x_j)$$

To turn this into an integer linear program, we introduce indicator variables:
1. $\mu_i(x_i)$, one for each $i \in V$ and state $x_i$
2. $\mu_{ij}(x_i, x_j)$, one for each edge $ij \in E$ and pair of states $x_i, x_j$

The objective function is then
$$\max_\mu \sum_{i \in V} \sum_{x_i} \theta_i(x_i)\, \mu_i(x_i) + \sum_{ij \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\, \mu_{ij}(x_i, x_j)$$

What is the dimension of $\mu$, if the variables are binary?
Visualization of feasible $\mu$ vectors

[Figure 2-1: Illustration of the marginal polytope (Wainwright & Jordan, '03) for a Markov random field with three nodes that have states in $\{0, 1\}$. The vertices correspond one-to-one with global assignments: each vertex $\mu$ stacks the indicator vectors of the assignments to $X_1$, $X_2$, $X_3$ and of the edge assignments for $X_1X_2$, $X_1X_3$, $X_2X_3$. Convex combinations of vertices, such as $\mu = \frac{1}{2}\mu_1 + \frac{1}{2}\mu_2$, are valid marginal probabilities.]
What are the constraints?

Force every cluster of variables to choose a local assignment:
$$\mu_i(x_i) \in \{0, 1\} \quad \forall i \in V,\, x_i \qquad\qquad \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V$$
$$\mu_{ij}(x_i, x_j) \in \{0, 1\} \quad \forall ij \in E,\, x_i, x_j \qquad\qquad \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) = 1 \quad \forall ij \in E$$

Enforce that these local assignments are globally consistent:
$$\mu_i(x_i) = \sum_{x_j} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_i$$
$$\mu_j(x_j) = \sum_{x_i} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_j$$
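As a sanity check (a sketch, not part of the lecture), the indicator vector $\mu$ built from any integral assignment satisfies all of these constraints. Here is a quick verification on a made-up three-variable chain:

```python
import itertools

# Three binary variables, chain edges; check every global assignment.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]

def mu_of_assignment(x):
    """Integral indicator mu for the joint assignment x."""
    mu_i = {i: [1 if x[i] == s else 0 for s in (0, 1)] for i in V}
    mu_ij = {e: [[1 if (x[e[0]], x[e[1]]) == (a, b) else 0
                  for b in (0, 1)] for a in (0, 1)] for e in E}
    return mu_i, mu_ij

for x in itertools.product((0, 1), repeat=3):
    mu_i, mu_ij = mu_of_assignment(x)
    for i in V:
        assert sum(mu_i[i]) == 1                    # node normalization
    for (i, j) in E:
        assert sum(map(sum, mu_ij[(i, j)])) == 1    # edge normalization
        for a in (0, 1):                            # consistency with mu_i
            assert sum(mu_ij[(i, j)][a]) == mu_i[i][a]
        for b in (0, 1):                            # consistency with mu_j
            assert sum(mu_ij[(i, j)][a][b] for a in (0, 1)) == mu_i[j][b]
print("all integral indicator vectors are locally consistent")
```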
MAP as an integer linear program (ILP)
$$\mathrm{MAP}(\theta) = \max_\mu \sum_{i \in V} \sum_{x_i} \theta_i(x_i)\, \mu_i(x_i) + \sum_{ij \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\, \mu_{ij}(x_i, x_j)$$
subject to:
$$\mu_i(x_i) \in \{0, 1\} \quad \forall i \in V,\, x_i \qquad\qquad \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V$$
$$\mu_i(x_i) = \sum_{x_j} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_i \qquad\qquad \mu_j(x_j) = \sum_{x_i} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_j$$

There are many extremely good off-the-shelf solvers, such as CPLEX and Gurobi.
Linear programming relaxation for MAP

The integer linear program was:
$$\mathrm{MAP}(\theta) = \max_\mu \sum_{i \in V} \sum_{x_i} \theta_i(x_i)\, \mu_i(x_i) + \sum_{ij \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\, \mu_{ij}(x_i, x_j)$$
subject to
$$\mu_i(x_i) \in \{0, 1\} \quad \forall i \in V,\, x_i \qquad\qquad \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V$$
$$\mu_i(x_i) = \sum_{x_j} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_i \qquad\qquad \mu_j(x_j) = \sum_{x_i} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_j$$

Relax the integrality constraints, allowing the variables to be between 0 and 1:
$$\mu_i(x_i) \in [0, 1] \quad \forall i \in V,\, x_i$$
Linear programming relaxation for MAP

The linear programming relaxation is:
$$\mathrm{LP}(\theta) = \max_\mu \sum_{i \in V} \sum_{x_i} \theta_i(x_i)\, \mu_i(x_i) + \sum_{ij \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\, \mu_{ij}(x_i, x_j)$$
subject to
$$\mu_i(x_i) \in [0, 1] \quad \forall i \in V,\, x_i \qquad\qquad \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V$$
$$\mu_i(x_i) = \sum_{x_j} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_i \qquad\qquad \mu_j(x_j) = \sum_{x_i} \mu_{ij}(x_i, x_j) \quad \forall ij \in E,\, x_j$$

Linear programs can be solved efficiently: simplex method, interior point, ellipsoid algorithm.

Since the LP relaxation maximizes over a larger set of solutions, its value can only be higher:
$$\mathrm{MAP}(\theta) \le \mathrm{LP}(\theta)$$

The LP relaxation is tight for tree-structured MRFs.
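This LP is small enough to write out explicitly. The sketch below (assuming SciPy is available; the model and its numbers are made up for illustration) builds the relaxation for a three-variable chain and solves it with `scipy.optimize.linprog`. Since the graph is a tree, the LP value should equal the brute-force MAP value.

```python
from itertools import product
from scipy.optimize import linprog  # assumes SciPy is installed

# Toy chain MRF (a tree), made-up potentials.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
theta_i = {0: [0.0, 1.0], 1: [0.5, 0.0], 2: [0.0, 0.2]}
theta_ij = {(0, 1): [[2.0, -1.0], [-1.0, 2.0]],
            (1, 2): [[1.0, 0.0], [0.0, 1.0]]}

# Variable ordering: all mu_i(x_i) first, then all mu_ij(x_i, x_j).
node_idx = {(i, s): k for k, (i, s) in enumerate(product(V, (0, 1)))}
edge_idx = {(e, a, b): len(node_idx) + k
            for k, (e, a, b) in enumerate(product(E, (0, 1), (0, 1)))}
n = len(node_idx) + len(edge_idx)

# linprog minimizes, so negate the objective coefficients.
c = [0.0] * n
for (i, s), k in node_idx.items():
    c[k] = -theta_i[i][s]
for (e, a, b), k in edge_idx.items():
    c[k] = -theta_ij[e][a][b]

A_eq, b_eq = [], []
for i in V:                                  # sum_s mu_i(s) = 1
    row = [0.0] * n
    for s in (0, 1):
        row[node_idx[(i, s)]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for (i, j) in E:
    for a in (0, 1):                         # sum_b mu_ij(a, b) = mu_i(a)
        row = [0.0] * n
        for b in (0, 1):
            row[edge_idx[((i, j), a, b)]] = 1.0
        row[node_idx[(i, a)]] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for b in (0, 1):                         # sum_a mu_ij(a, b) = mu_j(b)
        row = [0.0] * n
        for a in (0, 1):
            row[edge_idx[((i, j), a, b)]] = 1.0
        row[node_idx[(j, b)]] = -1.0
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * n, method="highs")
lp_val = -res.fun

# Brute-force MAP for comparison.
map_val = max(
    sum(theta_i[i][x[i]] for i in V)
    + sum(theta_ij[e][x[e[0]]][x[e[1]]] for e in E)
    for x in product((0, 1), repeat=3))
print("LP value:", lp_val, "exact MAP:", map_val)
```

On a graph with cycles the same code runs unchanged, but `lp_val` may strictly exceed `map_val` at a fractional vertex.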
Dual decomposition = LP relaxation

Recall we obtained the following dual linear program:
$$L(\delta) = \sum_{i \in V} \max_{x_i} \Big( \theta_i(x_i) + \sum_{ij \in E} \delta_{j \to i}(x_i) \Big) + \sum_{ij \in E} \max_{x_i, x_j} \Big( \theta_{ij}(x_i, x_j) - \delta_{j \to i}(x_i) - \delta_{i \to j}(x_j) \Big)$$
$$\mathrm{DUAL\text{-}LP}(\theta) = \min_\delta L(\delta)$$

We showed two ways of upper-bounding the value of the MAP assignment:
$$\mathrm{MAP}(\theta) \le \mathrm{LP}(\theta) \qquad (1)$$
$$\mathrm{MAP}(\theta) \le \mathrm{DUAL\text{-}LP}(\theta) \le L(\delta) \qquad (2)$$

Although we derived these linear programs in seemingly very different ways, it turns out that:
$$\mathrm{LP}(\theta) = \mathrm{DUAL\text{-}LP}(\theta)$$

The dual LP allows us to upper-bound the value of the MAP assignment without solving an LP to optimality.
Linear programming duality

[Figure: the marginal polytope, with the optimal assignment $x^*$ at a vertex; the (primal) LP relaxation maximizes $\theta \cdot \mu$ over the polytope, while the (dual) LP relaxation bounds it from above.]
$$\mathrm{MAP}(\theta) \le \mathrm{LP}(\theta) = \mathrm{DUAL\text{-}LP}(\theta) \le L(\delta)$$
Other approaches to solve MAP

Graph cuts.

Local search: start from an arbitrary assignment (e.g., random); iterate: choose a variable, then choose a new state for this variable to maximize the value of the resulting assignment.

Branch-and-bound: exhaustive search over the space of assignments, pruning branches that can be provably shown not to contain a MAP assignment. Can use the LP relaxation or its dual to obtain upper bounds; a lower bound is obtained from the value of any assignment found.

Branch-and-cut (the most powerful method; used by CPLEX & Gurobi): same as branch-and-bound, but spends more time getting tighter bounds; adds cutting planes to cut off fractional solutions of the LP relaxation, making the upper bound tighter.
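The local-search idea above can be sketched in a few lines. This is an illustrative coordinate-wise hill climber (an ICM-style scheme, not from the lecture) on a made-up chain MRF; it is only guaranteed to return a local optimum under single-variable moves, not the global MAP.

```python
import random

# Toy chain MRF, made-up potentials.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
theta_i = {0: [0.0, 1.0], 1: [0.5, 0.0], 2: [0.0, 0.2]}
theta_ij = {(0, 1): [[2.0, -1.0], [-1.0, 2.0]],
            (1, 2): [[1.0, 0.0], [0.0, 1.0]]}

def score(x):
    return (sum(theta_i[i][x[i]] for i in V)
            + sum(theta_ij[(i, j)][x[i]][x[j]] for (i, j) in E))

def local_search(x, max_sweeps=100):
    """Greedy single-variable moves until no flip strictly improves."""
    x = list(x)
    for _ in range(max_sweeps):
        improved = False
        for i in V:
            cur = score(x)
            for s in (0, 1):
                if s != x[i]:
                    y = x[:]
                    y[i] = s
                    if score(y) > cur + 1e-12:  # strict improvement only
                        x, improved = y, True
        if not improved:        # local optimum reached
            break
    return tuple(x)

random.seed(1)
x0 = [random.randint(0, 1) for _ in V]
x_hat = local_search(x0)
print("start:", x0, "-> local optimum:", x_hat, "score:", score(x_hat))
```

At the returned assignment, no single-variable change can increase the score, which is exactly the stopping condition of the sweep loop.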