Probabilistic Graphical Models
David Sontag, New York University
Lecture 6, March 7, 2013

Today's lecture
1. Dual decomposition
2. MAP inference as an integer linear program
3. Linear programming relaxations for MAP inference

MAP inference

Recall the MAP inference task,

    arg max_x p(x),    where p(x) = (1/Z) Π_{c∈C} φ_c(x_c)

(we assume any evidence has been subsumed into the potentials, as discussed in the last lecture). Since the normalization term is simply a constant, this is equivalent to

    arg max_x Π_{c∈C} φ_c(x_c)

(called the max-product inference task). Furthermore, since log is monotonic, letting θ_c(x_c) = log φ_c(x_c), this is in turn equivalent to

    arg max_x Σ_{c∈C} θ_c(x_c)

(called max-sum).

Exactly solving MAP, beyond trees

MAP as a discrete optimization problem is

    arg max_x Σ_{i∈V} θ_i(x_i) + Σ_{c∈C} θ_c(x_c)

This is a very general discrete optimization problem: many hard combinatorial optimization problems can be written in this form (e.g., 3-SAT). It is studied in the operations research community, theoretical computer science, and AI (constraint satisfaction, weighted SAT), and is a fast-moving field, both for theory and for heuristics.
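
The sketch below is not part of the lecture: it sets up a tiny made-up pairwise model and finds its MAP assignment by brute-force enumeration, and the later sketches reuse this toy model as a sanity check. All names and potential values are illustrative assumptions.

```python
# Exact MAP by brute-force enumeration on a tiny, made-up 3-node chain.
import itertools
import numpy as np

k = 2                     # states per variable
V = [0, 1, 2]             # nodes
E = [(0, 1), (1, 2)]      # edges of a small chain

rng = np.random.default_rng(0)
theta_i = {i: rng.normal(size=k) for i in V}         # singleton potentials
theta_ij = {e: rng.normal(size=(k, k)) for e in E}   # pairwise potentials

def score(x):
    """Objective: sum_i theta_i(x_i) + sum_{ij in E} theta_ij(x_i, x_j)."""
    s = sum(theta_i[i][x[i]] for i in V)
    s += sum(theta_ij[(i, j)][x[i], x[j]] for (i, j) in E)
    return s

best = max(itertools.product(range(k), repeat=len(V)), key=score)
print("MAP assignment:", best, "value:", score(best))
```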

Motivating application: protein side-chain placement

Find the minimum-energy conformation of amino-acid side-chains along a fixed carbon backbone: given the desired 3D backbone structure, choose the amino-acid side-chain orientations that give the most stable folding (Yanover, Meltzer, Weiss '06).

(Figure: a protein backbone with side-chain variables X_1, ..., X_4, each corresponding to one amino acid; each edge carries a potential function such as θ_12(x_1, x_2), θ_13(x_1, x_3), θ_34(x_3, x_4), and the joint distribution over the variables is given by the product of these potentials.)

Orientations of the side-chains are represented by discretized angles called rotamers. Rotamer choices for nearby amino acids are energetically coupled (attractive and repulsive forces). The key problems are to find marginals (and the partition function) or to find the most likely assignment (MAP); the latter is the focus of this lecture.

Motivating application: dependency parsing

Given a sentence, predict the dependency tree that relates the words: an arc goes from the head word of each phrase to the words that modify it. Every word has exactly one parent, i.e. a valid dependency parse is a directed tree. The parse may be non-projective: a word and its descendants need not form a contiguous subsequence of the sentence (Figure 1.1 shows an example parse of an English sentence, with a red arc marking a non-projective dependency).

With m words there are m(m-1) binary arc-selection variables x_ij ∈ {0, 1}. Let x_i = {x_ij}_{j≠i} (all outgoing edges of word i). Predict with:

    max_x θ_T(x) + Σ_{ij} θ_ij(x_ij) + Σ_i θ_i(x_i)

Without additional restrictions on the choice of potential functions, or on which edges to include, the problem is known to be NP-hard. Using the dual decomposition approach, we will break the problem into much simpler subproblems involving maximizations of each single-node potential θ_i(x_i) and each edge potential θ_ij(x_i, x_j) independently of the other terms.

Dual decomposition

Consider the MAP problem for pairwise Markov random fields:

    MAP(θ) = max_x Σ_{i∈V} θ_i(x_i) + Σ_{ij∈E} θ_ij(x_i, x_j)

If we push the maximizations inside the sums, the value can only increase:

    MAP(θ) ≤ Σ_{i∈V} max_{x_i} θ_i(x_i) + Σ_{ij∈E} max_{x_i,x_j} θ_ij(x_i, x_j)

Note that the right-hand side can be easily evaluated.

In PS3, problem 2, you showed that one can always reparameterize a distribution by operations like

    θ_i^new(x_i) = θ_i^old(x_i) + f(x_i)
    θ_ij^new(x_i, x_j) = θ_ij^old(x_i, x_j) - f(x_i)

for any function f(x_i), without changing the distribution/energy.
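
A minimal sketch (not from the lecture) of the "push the max inside" bound, reusing the made-up toy model (V, E, theta_i, theta_ij, score, best) from the enumeration sketch above:

```python
# Naive decomposed upper bound: sum of independent node and edge maxima.
def naive_upper_bound(theta_i, theta_ij, V, E):
    # sum_i max_{x_i} theta_i(x_i) + sum_{ij} max_{x_i,x_j} theta_ij(x_i, x_j)
    return (sum(theta_i[i].max() for i in V)
            + sum(theta_ij[e].max() for e in E))

print("decomposed upper bound:", naive_upper_bound(theta_i, theta_ij, V, E))
print("exact MAP value       :", score(best))   # the bound is always >= this
```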

Dual decomposition (illustration)

(Figure 1.2 from "Introduction to Dual Decomposition for Inference": illustration of the dual decomposition objective. Left: the original pairwise model, consisting of four factors f(x_1, x_2), g(x_1, x_3), h(x_2, x_4), k(x_3, x_4) over variables x_1, ..., x_4. Right: the maximization subproblems corresponding to the objective L(δ), with messages such as δ_f1(x_1), δ_f2(x_2), δ_g1(x_1), δ_g3(x_3), δ_h2(x_2), δ_h4(x_4), δ_k3(x_3), δ_k4(x_4); each blue ellipse contains the factor to be maximized over.)

Dual decomposition

Define:

    θ_i^δ(x_i) = θ_i(x_i) + Σ_{j: ij∈E} δ_{j→i}(x_i)
    θ_ij^δ(x_i, x_j) = θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j)

It is easy to verify that, for every assignment x,

    Σ_i θ_i^δ(x_i) + Σ_{ij∈E} θ_ij^δ(x_i, x_j) = Σ_i θ_i(x_i) + Σ_{ij∈E} θ_ij(x_i, x_j)

Thus, we have that

    MAP(θ) = MAP(θ^δ) ≤ Σ_{i∈V} max_{x_i} θ_i^δ(x_i) + Σ_{ij∈E} max_{x_i,x_j} θ_ij^δ(x_i, x_j)

Every value of δ gives a different upper bound on the value of the MAP! The tightest upper bound is obtained by minimizing the right-hand side with respect to δ.

Dual decomposition

We obtain the following dual objective:

    L(δ) = Σ_{i∈V} max_{x_i} ( θ_i(x_i) + Σ_{j: ij∈E} δ_{j→i}(x_i) )
         + Σ_{ij∈E} max_{x_i,x_j} ( θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j) )

    DUAL-LP(θ) = min_δ L(δ)

This provides an upper bound on the value of the MAP assignment:

    MAP(θ) ≤ DUAL-LP(θ) ≤ L(δ)

How can we find a δ that gives tight bounds?
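
A minimal sketch (not from the lecture) of evaluating L(δ) on the toy chain model from the earlier sketches. The data structure is an assumption: delta[(e, i)] is a length-k vector holding δ_{j→i}(x_i) for edge e = (i, j), and delta[(e, j)] holds δ_{i→j}(x_j).

```python
# Evaluate the dual objective L(delta): independent node and edge subproblems.
def dual_objective(delta):
    val = 0.0
    for i in V:                           # node subproblems
        t = theta_i[i].copy()
        for e in E:
            if i in e:
                t += delta[(e, i)]        # add delta_{j->i}(x_i)
        val += t.max()
    for e in E:                           # edge subproblems
        i, j = e
        t = (theta_ij[e]
             - delta[(e, i)][:, None]     # subtract delta_{j->i}(x_i)
             - delta[(e, j)][None, :])    # subtract delta_{i->j}(x_j)
        val += t.max()
    return val

delta0 = {(e, v): np.zeros(k) for e in E for v in e}
print("L(delta=0):", dual_objective(delta0))   # equals the naive bound above
print("MAP value :", score(best))              # L(delta) >= MAP for every delta
```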

Solving the dual efficiently

Many ways to solve the dual linear program, i.e. to minimize L(δ) with respect to δ:

    L(δ) = Σ_{i∈V} max_{x_i} ( θ_i(x_i) + Σ_{j: ij∈E} δ_{j→i}(x_i) )
         + Σ_{ij∈E} max_{x_i,x_j} ( θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j) )

One option is to use the subgradient method (a small sketch follows below). One can also use block coordinate descent, which gives algorithms that look very much like max-sum belief propagation; the MPLP algorithm on the next slide is one such algorithm.
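
A minimal subgradient sketch (an illustrative assumption, not the lecture's code), reusing the toy model and dual_objective from the sketches above. The subgradient of L with respect to δ_{j→i}(x_i) is the indicator that the node subproblem chose x_i minus the indicator that the edge subproblem chose x_i.

```python
# One subgradient step on L(delta), followed by a short diminishing-step run.
def subgradient_step(delta, step_size=0.1):
    new = {key: v.copy() for key, v in delta.items()}
    # Maximizers of the node subproblems.
    x_node = {}
    for i in V:
        t = theta_i[i].copy()
        for e in E:
            if i in e:
                t += delta[(e, i)]
        x_node[i] = int(np.argmax(t))
    # Maximizers of the edge subproblems, then the update delta <- delta - step * g.
    for e in E:
        i, j = e
        t = theta_ij[e] - delta[(e, i)][:, None] - delta[(e, j)][None, :]
        xi_e, xj_e = np.unravel_index(np.argmax(t), t.shape)
        for node, x_n, x_e in [(i, x_node[i], xi_e), (j, x_node[j], xj_e)]:
            g = np.zeros(k)
            g[x_n] += 1.0            # node subproblem's choice
            g[x_e] -= 1.0            # edge subproblem's choice
            new[(e, node)] = delta[(e, node)] - step_size * g
    return new

delta_t = {(e, v): np.zeros(k) for e in E for v in e}
for t in range(50):
    delta_t = subgradient_step(delta_t, step_size=0.5 / (t + 1))
print("L(delta) after subgradient:", dual_objective(delta_t))
```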

Max-product linear programming (MPLP) algorithm

Input: a set of factors θ_i(x_i), θ_ij(x_i, x_j)
Output: an assignment x_1, ..., x_n that approximates the MAP

Algorithm:
    Initialize δ_{i→j}(x_j) = 0 and δ_{j→i}(x_i) = 0 for all ij ∈ E, x_i, x_j.
    Iterate until the change in L(δ) is small enough:
        For each edge ij ∈ E (sequentially), perform the updates

            δ_{j→i}(x_i) = -(1/2) δ_i^{-j}(x_i) + (1/2) max_{x_j} [ θ_ij(x_i, x_j) + δ_j^{-i}(x_j) ]
            δ_{i→j}(x_j) = -(1/2) δ_j^{-i}(x_j) + (1/2) max_{x_i} [ θ_ij(x_i, x_j) + δ_i^{-j}(x_i) ]

        simultaneously for all x_i and x_j, where δ_i^{-j}(x_i) = θ_i(x_i) + Σ_{ik∈E, k≠j} δ_{k→i}(x_i).
    Return x_i ∈ arg max_{x̂_i} θ_i^δ(x̂_i).
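
A minimal sketch (an illustration, not the lecture's code) of these pairwise MPLP updates, reusing the toy chain model and dual_objective from the sketches above:

```python
# Pairwise MPLP: block coordinate descent on L(delta), one edge at a time.
def delta_minus(delta, i, skip_edge):
    """delta_i^{-j}(x_i): theta_i plus messages from all incident edges except skip_edge."""
    t = theta_i[i].copy()
    for e in E:
        if i in e and e != skip_edge:
            t += delta[(e, i)]
    return t

def mplp_pass(delta):
    for e in E:                               # process edges sequentially
        i, j = e
        d_i = delta_minus(delta, i, e)        # delta_i^{-j}(x_i)
        d_j = delta_minus(delta, j, e)        # delta_j^{-i}(x_j)
        delta[(e, i)] = -0.5 * d_i + 0.5 * (theta_ij[e] + d_j[None, :]).max(axis=1)
        delta[(e, j)] = -0.5 * d_j + 0.5 * (theta_ij[e] + d_i[:, None]).max(axis=0)
    return delta

delta_m = {(e, v): np.zeros(k) for e in E for v in e}
for _ in range(20):
    delta_m = mplp_pass(delta_m)

# Decode: x_i in argmax of the reparameterized singleton theta_i^delta.
x_hat = tuple(int(np.argmax(delta_minus(delta_m, i, skip_edge=None))) for i in V)
print("L(delta) after MPLP:", dual_objective(delta_m))
print("decoded assignment :", x_hat, "value:", score(x_hat))
```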

Generalization to arbitrary factor graphs

(Figure 1.4 from "Introduction to Dual Decomposition for Inference": the MPLP block coordinate descent algorithm for minimizing the dual L(δ); similar algorithms can be devised for different choices of coordinate blocks.)

Inputs: a set of factors θ_i(x_i), θ_f(x_f).
Output: an assignment x_1, ..., x_n that approximates the MAP.

Algorithm:
    Initialize δ_{f→i}(x_i) = 0 for all f ∈ F, i ∈ f, x_i.
    Iterate until the change in L(δ) is small enough (see Eq. 1.2):
        For each f ∈ F, perform the updates

            δ_{f→i}(x_i) = -δ_i^{-f}(x_i) + (1/|f|) max_{x_{f\i}} [ θ_f(x_f) + Σ_{î∈f} δ_î^{-f}(x_î) ]        (1.16)

        simultaneously for all i ∈ f and x_i, where δ_i^{-f}(x_i) = θ_i(x_i) + Σ_{f̂≠f} δ_{f̂→i}(x_i).
    Return x_i ∈ arg max_{x̂_i} θ_i^δ(x̂_i) (see Eq. 1.6).

Experimental results

Comparison of four coordinate descent algorithms (MSD, MSD++, MPLP, and a star update) on a 10×10 two-dimensional Ising grid. (Figure 1.5 from "Introduction to Dual Decomposition for Inference": the dual objective L(δ) plotted as a function of the iteration number.)

Experimental results

Performance on a stereo vision inference task. (Figure: the dual objective and the objective of the decoded assignment plotted as a function of iteration; the duality gap between them shrinks until the instance is solved optimally.)

Today's lecture
1. Dual decomposition
2. MAP inference as an integer linear program
3. Linear programming relaxations for MAP inference

MAP as an integer linear program (ILP)

MAP as a discrete optimization problem is

    arg max_x Σ_{i∈V} θ_i(x_i) + Σ_{ij∈E} θ_ij(x_i, x_j)

To turn this into an integer linear program, we introduce indicator variables:
1. μ_i(x_i), one for each i ∈ V and state x_i
2. μ_ij(x_i, x_j), one for each edge ij ∈ E and pair of states x_i, x_j

The objective function is then

    max_μ Σ_{i∈V} Σ_{x_i} θ_i(x_i) μ_i(x_i) + Σ_{ij∈E} Σ_{x_i,x_j} θ_ij(x_i, x_j) μ_ij(x_i, x_j)

What is the dimension of μ if all variables are binary?

Visualization of feasible μ vectors

(Figure 2-1: illustration of the marginal polytope (Wainwright & Jordan '03) for a Markov random field with three nodes whose states are in {0, 1}. The vertices correspond one-to-one with global assignments to (X_1, X_2, X_3): each vertex μ stacks the node indicator assignments for X_1, X_2, X_3 and the edge indicator assignments for X_1X_2, X_1X_3, X_2X_3. A convex combination of vertices, such as μ = ½ μ' + ½ μ'', is a vector of valid marginal probabilities.)

What are the constraints?

Force every cluster of variables to choose a local assignment:

    μ_i(x_i) ∈ {0, 1}                 ∀ i ∈ V, x_i
    Σ_{x_i} μ_i(x_i) = 1              ∀ i ∈ V
    μ_ij(x_i, x_j) ∈ {0, 1}           ∀ ij ∈ E, x_i, x_j
    Σ_{x_i,x_j} μ_ij(x_i, x_j) = 1    ∀ ij ∈ E

Enforce that these local assignments are globally consistent:

    μ_i(x_i) = Σ_{x_j} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_i
    μ_j(x_j) = Σ_{x_i} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_j

MAP as an integer linear program (ILP)

    MAP(θ) = max_μ Σ_{i∈V} Σ_{x_i} θ_i(x_i) μ_i(x_i) + Σ_{ij∈E} Σ_{x_i,x_j} θ_ij(x_i, x_j) μ_ij(x_i, x_j)

subject to:

    μ_i(x_i) ∈ {0, 1}                    ∀ i ∈ V, x_i
    Σ_{x_i} μ_i(x_i) = 1                 ∀ i ∈ V
    μ_i(x_i) = Σ_{x_j} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_i
    μ_j(x_j) = Σ_{x_i} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_j

There are many extremely good off-the-shelf ILP solvers, such as CPLEX and Gurobi.
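
A minimal sketch (not from the lecture) that builds this ILP explicitly for the toy chain model from the earlier sketches and solves it with SciPy's HiGHS backend. It assumes SciPy ≥ 1.9 (for the `integrality` argument of `linprog`), and all index/variable names are illustrative.

```python
# MAP as an ILP over node indicators mu_i(x_i) and edge indicators mu_ij(x_i,x_j).
import numpy as np
from scipy.optimize import linprog

n = len(V)
n_node = n * k                       # one mu_i(x_i) per node/state
n_var = n_node + len(E) * k * k      # plus one mu_ij(x_i,x_j) per edge/state pair

def node_idx(i, xi):
    return i * k + xi

def edge_idx(e, xi, xj):
    return n_node + e * k * k + xi * k + xj

# Objective (linprog minimizes, so negate to maximize).
c = np.zeros(n_var)
for i in V:
    for xi in range(k):
        c[node_idx(i, xi)] = -theta_i[i][xi]
for e, (i, j) in enumerate(E):
    for xi in range(k):
        for xj in range(k):
            c[edge_idx(e, xi, xj)] = -theta_ij[(i, j)][xi, xj]

# Equality constraints: normalization and local consistency.
rows, rhs = [], []
for i in V:                                   # sum_{x_i} mu_i(x_i) = 1
    r = np.zeros(n_var)
    r[[node_idx(i, xi) for xi in range(k)]] = 1.0
    rows.append(r); rhs.append(1.0)
for e, (i, j) in enumerate(E):
    for xi in range(k):                       # mu_i(x_i) = sum_{x_j} mu_ij(x_i,x_j)
        r = np.zeros(n_var)
        r[node_idx(i, xi)] = 1.0
        for xj in range(k):
            r[edge_idx(e, xi, xj)] = -1.0
        rows.append(r); rhs.append(0.0)
    for xj in range(k):                       # mu_j(x_j) = sum_{x_i} mu_ij(x_i,x_j)
        r = np.zeros(n_var)
        r[node_idx(j, xj)] = 1.0
        for xi in range(k):
            r[edge_idx(e, xi, xj)] = -1.0
        rows.append(r); rhs.append(0.0)
A_eq, b_eq = np.array(rows), np.array(rhs)

ilp = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1),
              integrality=np.ones(n_var), method="highs")
print("ILP value (= MAP value):", -ilp.fun)
```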

Linear programming relaxation for MAP

The integer linear program was:

    MAP(θ) = max_μ Σ_{i∈V} Σ_{x_i} θ_i(x_i) μ_i(x_i) + Σ_{ij∈E} Σ_{x_i,x_j} θ_ij(x_i, x_j) μ_ij(x_i, x_j)

subject to

    μ_i(x_i) ∈ {0, 1}                    ∀ i ∈ V, x_i
    Σ_{x_i} μ_i(x_i) = 1                 ∀ i ∈ V
    μ_i(x_i) = Σ_{x_j} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_i
    μ_j(x_j) = Σ_{x_i} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_j

Relax the integrality constraints, allowing the variables to take values between 0 and 1:

    μ_i(x_i) ∈ [0, 1]    ∀ i ∈ V, x_i

Linear programming relaxation for MAP

The linear programming relaxation is:

    LP(θ) = max_μ Σ_{i∈V} Σ_{x_i} θ_i(x_i) μ_i(x_i) + Σ_{ij∈E} Σ_{x_i,x_j} θ_ij(x_i, x_j) μ_ij(x_i, x_j)

subject to

    μ_i(x_i) ∈ [0, 1]                    ∀ i ∈ V, x_i
    Σ_{x_i} μ_i(x_i) = 1                 ∀ i ∈ V
    μ_i(x_i) = Σ_{x_j} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_i
    μ_j(x_j) = Σ_{x_i} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_j

Linear programs can be solved efficiently (simplex method, interior point, ellipsoid algorithm). Since the LP relaxation maximizes over a larger set of solutions, its value can only be higher:

    MAP(θ) ≤ LP(θ)

The LP relaxation is tight for tree-structured MRFs.
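
The LP relaxation is the same program with the integrality requirement dropped; a short sketch, reusing c, A_eq, b_eq, and the ILP result from the previous code block:

```python
# Drop integrality to get the LP relaxation; its value upper-bounds the MAP value.
lp = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
print("LP relaxation value:", -lp.fun)   # >= MAP value
print("ILP (MAP) value    :", -ilp.fun)
# For a tree-structured model such as this toy chain, the two values coincide.
```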

Dual decomposition = LP relaxation

Recall that we obtained the following dual linear program:

    L(δ) = Σ_{i∈V} max_{x_i} ( θ_i(x_i) + Σ_{j: ij∈E} δ_{j→i}(x_i) )
         + Σ_{ij∈E} max_{x_i,x_j} ( θ_ij(x_i, x_j) - δ_{j→i}(x_i) - δ_{i→j}(x_j) )

    DUAL-LP(θ) = min_δ L(δ)

We showed two ways of upper bounding the value of the MAP assignment:

    MAP(θ) ≤ LP(θ)                         (1)
    MAP(θ) ≤ DUAL-LP(θ) ≤ L(δ)             (2)

Although we derived these linear programs in seemingly very different ways, it turns out that

    LP(θ) = DUAL-LP(θ)

The dual LP allows us to upper bound the value of the MAP assignment without solving an LP to optimality.
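
A small illustrative check (not from the lecture), reusing the pieces above: on the toy chain the LP relaxation value, the converged dual bound from MPLP, and the exact MAP value should (approximately) coincide, since the chain is a tree.

```python
# Compare the primal LP bound, the dual bound, and the exact MAP on the toy chain.
print("LP(theta)          :", -lp.fun)
print("L(delta) after MPLP:", dual_objective(delta_m))
print("MAP(theta), exact  :", score(best))
```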

Linear programming duality

(Figure: the marginal polytope, the optimal assignment μ_{x*} at one of its vertices, the objective direction θ, the (primal) LP relaxation, and the (dual) LP relaxation.)

    MAP(θ) ≤ LP(θ) = DUAL-LP(θ) ≤ L(δ)

Other approaches to solve MAP

Graph cuts.

Local search: start from an arbitrary assignment (e.g., random), then iterate: choose a variable and change its state so as to maximize the value of the resulting assignment.

Branch-and-bound: exhaustive search over the space of assignments, pruning branches that can provably be shown not to contain a MAP assignment. The LP relaxation or its dual can be used to obtain upper bounds; a lower bound is obtained from the value of any assignment found.

Branch-and-cut (the most powerful method, used by CPLEX and Gurobi): the same as branch-and-bound, but spends more time getting tighter bounds by adding cutting planes that cut off fractional solutions of the LP relaxation, making the upper bound tighter.
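
A minimal sketch (not from the lecture) of the local-search idea (coordinate-wise hill climbing, similar in spirit to iterated conditional modes), reusing the toy model and score() from the brute-force sketch above:

```python
# Greedy local search: repeatedly re-set one variable to its best state.
def local_search(x0, max_sweeps=20):
    x = list(x0)
    for _ in range(max_sweeps):
        improved = False
        for i in V:                              # choose a variable
            cur = score(x)
            for s in range(k):                   # try every state for it
                old = x[i]
                x[i] = s
                if score(x) > cur:
                    cur, improved = score(x), True
                else:
                    x[i] = old
        if not improved:
            break                                # local optimum reached
    return tuple(x)

x_ls = local_search([0] * len(V))
print("local search:", x_ls, "value:", score(x_ls))
print("exact MAP   :", best, "value:", score(best))
```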