Dual Decomposition for Inference

Dual Decomposition for Inference. Yunshu Liu, ASPITRG Research Group, 2014-05-06.

References:
[1] D. Sontag, A. Globerson and T. Jaakkola. Introduction to Dual Decomposition for Inference. In Optimization for Machine Learning, MIT Press, 2011.
[2] A. Globerson and T. Jaakkola. Tutorial: Linear Programming Relaxations for Graphical Models. NIPS, 2011.
[3] C. Yanover, T. Meltzer and Y. Weiss. Linear Programming Relaxations and Belief Propagation: An Empirical Study. JMLR, 7:1887-1907, September 2006.
[4] A. M. Rush and M. Collins. A Tutorial on Dual Decomposition and Lagrangian Relaxation for Inference in Natural Language Processing. JAIR, 45(1):305-362, September 2012.
[5] D. Batra, A. C. Gallagher, D. Parikh and T. Chen. Beyond Trees: MRF Inference via Outer-Planar Decomposition. CVPR, 2010, pages 2496-2503.

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Subgradient algorithms Block coordinate descent algorithms Discussion

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Subgradient algorithms Block coordinate descent algorithms Discussion

The MAP inference problem: Markov random fields
In the undirected case, the probability distribution factorizes according to functions defined on the cliques of the graph. A clique is a subset of nodes in a graph such that there exists a link between all pairs of nodes in the subset. A maximal clique is a clique such that it is not possible to include any other node of the graph in the set without it ceasing to be a clique. Denote a clique by C and the set of variables in that clique by x_C; the factors in the decomposition of the joint distribution are functions of the variables in the cliques. In fact, we can restrict attention to maximal cliques without loss of generality, because other cliques must be subsets of maximal cliques: if {x_1, x_2, x_3} is a maximal clique and we define a factor over this clique, then including another factor defined over a subset of these variables would be redundant.
(Figure: an undirected graph over x_1, ..., x_4 with a clique and a maximal clique outlined.)
Example of cliques: {x_1, x_2}, {x_2, x_3}, {x_3, x_4}, {x_2, x_4}, {x_1, x_3}, {x_1, x_2, x_3}, {x_2, x_3, x_4}. Maximal cliques: {x_1, x_2, x_3}, {x_2, x_3, x_4}. Note that {x_1, x_2, x_3, x_4} is not a clique because of the missing link from x_1 to x_4.

The MAP inference problem: Markov random fields
Markov random fields: definition. Denote by C a clique, by x_C the set of variables in clique C, and by ψ_C(x_C) a nonnegative potential function associated with clique C. A Markov random field is a collection of distributions that factorize as a product of potential functions ψ_C(x_C) over the maximal cliques of the graph:
p(x) = (1/Z) ∏_C ψ_C(x_C),
where the normalization constant Z = Σ_x ∏_C ψ_C(x_C) is sometimes called the partition function.

The MAP inference problem: Markov random fields
Factorization of undirected graphs. Question: how do we write the joint distribution for the undirected graph with edges x_1 - x_2 and x_1 - x_3? Answer:
p(x) = (1/Z) ψ_12(x_1, x_2) ψ_13(x_1, x_3),
where ψ_12(x_1, x_2) and ψ_13(x_1, x_3) are the potential functions and Z is the partition function that makes sure p(x) satisfies the conditions for a probability distribution.
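
As a concrete illustration (a minimal sketch, not from the slides; the binary potential tables below are made-up example values), the partition function and the normalized distribution of this two-potential model can be computed by brute force:

```python
# Brute-force computation of Z and p(x) for p(x) = (1/Z) * psi12(x1,x2) * psi13(x1,x3).
# The potential tables are illustrative example values, not from the slides.
from itertools import product

psi12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}  # psi_12(x1, x2)
psi13 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.5}  # psi_13(x1, x3)

# Partition function: sum the unnormalized product over all joint assignments.
Z = sum(psi12[(x1, x2)] * psi13[(x1, x3)]
        for x1, x2, x3 in product([0, 1], repeat=3))

def p(x1, x2, x3):
    """Normalized probability of one assignment."""
    return psi12[(x1, x2)] * psi13[(x1, x3)] / Z

print(Z, p(0, 0, 0))
```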

The MAP inference problem
MAP: find the maximum probability assignment
arg max_x p(x), where p(x) = (1/Z) ∏_C ψ_C(x_C).
Converting to a max-product problem (since Z is a constant):
arg max_x ∏_C ψ_C(x_C).
Converting to a max-sum inference task by taking the log and letting θ_C(x_C) = log ψ_C(x_C):
arg max_x Σ_C θ_C(x_C).

The MAP inference problem
MAP: find the maximum probability assignment. Consider a set of n discrete variables x_1, ..., x_n and a set F of subsets of these variables (i.e., each f ∈ F is a subset of V = {1, ..., n}), where each subset corresponds to the domain of one of the factors. The MAP inference task is to find an assignment x = (x_1, ..., x_n) that maximizes the sum of the factors:
MAP(θ) = max_x ( Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f) ).
Note we also include a term θ_i(x_i) on each of the individual variables, which is useful in some applications.

The MAP inference problem
An example of MAP inference. Consider three binary variables x_1, x_2, x_3 and two factors {1, 2} and {2, 3}, for which we know the values of the factors:

  assignment      00    01    10    11
  θ_12(x_12):     2     100   3     0.1
  θ_23(x_23):     10    5     200   0

The goal is MAP(θ) = max_x ( θ_12(x_12) + θ_23(x_23) ).
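
A brute-force check of this example, as a minimal sketch; the factor values follow the table above, and the dictionary encoding is an illustrative choice:

```python
# Enumerate all 2^3 assignments and pick the one maximizing theta_12 + theta_23.
from itertools import product

theta12 = {(0, 0): 2, (0, 1): 100, (1, 0): 3, (1, 1): 0.1}   # theta_12(x1, x2)
theta23 = {(0, 0): 10, (0, 1): 5, (1, 0): 200, (1, 1): 0}    # theta_23(x2, x3)

best = max((theta12[(x1, x2)] + theta23[(x2, x3)], (x1, x2, x3))
           for x1, x2, x3 in product([0, 1], repeat=3))
print(best)   # with these values: (300, (0, 1, 0))
```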

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Subgradient algorithms Block coordinate descent algorithms Discussion

Motivating Applications
Protein side-chain placement (see reference [3] for details). For a fixed protein backbone (main chain), find the maximum probability assignment of the amino acid side-chains. Orientations of the side-chains are represented by discretized angles called rotamers; rotamer choices for nearby amino acids are energetically coupled (attractive or repulsive forces). (Figure: amino acids x_1, ..., x_5 along the backbone with pairwise potentials such as θ_12(x_1, x_2), θ_34(x_3, x_4), θ_45(x_4, x_5).)

Motivating Applications
Dependency parsing (see reference [4] for details). Given a sentence, predict the dependency tree that relates the words in it:
root  He watched a basketball game last Sunday
An arc is drawn from the head word of each phrase to the words that modify it. We use θ_ij(x_ij) to denote the weight on the arc between words i and j, and θ_i(x_i) for higher-order interactions between the arc selections for a given word i, where x_ij ∈ {0, 1} indicates the existence of an arc from i to j and x_i = {x_ij}_{j≠i} is the coupled arc selection. The task is to maximize a function of the form Σ_ij θ_ij(x_ij) + Σ_i θ_i(x_i).

Motivating Applications
Other applications:
- Computational biology, e.g. protein design (see reference [3] for details).
- Natural language processing, e.g. combining a weighted context-free grammar with a finite-state tagger (see reference [4] for details).
- Computer vision, e.g. image segmentation, object labeling (see reference [5] for details).
- Graphical models, e.g. structure learning of Bayesian networks and Markov networks.

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Subgradient algorithms Block coordinate descent algorithms Discussion

Dual decomposition and Lagrangian relaxation: Approximate inference for the MAP problem
MAP: find the maximum probability assignment. The MAP inference task is to find an assignment x = (x_1, ..., x_n) that maximizes the sum of the factors:
MAP(θ) = max_x ( Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f) ).
MAP inference is hard; even for pairwise MRFs the problem is NP-hard. Dynamic programming provides exact inference for some special cases (e.g. tree-structured graphs). In general, approximate inference algorithms are used: by relaxing some constraints we can divide the problem into easily solved subproblems.
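
As an aside on the tree-structured special case, here is a minimal max-sum (Viterbi-style) dynamic programming sketch for a chain MRF; the chain length, state space and potentials are illustrative assumptions, not from the slides:

```python
# Exact MAP on a chain MRF x1 - x2 - ... - xn by max-sum dynamic programming.
import numpy as np

def chain_map(node_pot, edge_pot):
    """node_pot: list of n arrays of shape (k,), theta_i(x_i).
       edge_pot: list of n-1 arrays of shape (k,k), theta_{i,i+1}(x_i, x_{i+1}).
       Returns (map_value, map_assignment)."""
    n = len(node_pot)
    msg = node_pot[0].copy()          # best score of a prefix ending in each state
    backptr = []
    for t in range(1, n):
        # score[a, b]: best prefix ending with x_{t-1}=a plus edge and node terms for x_t=b
        score = msg[:, None] + edge_pot[t - 1] + node_pot[t][None, :]
        backptr.append(np.argmax(score, axis=0))
        msg = np.max(score, axis=0)
    x = [int(np.argmax(msg))]
    for bp in reversed(backptr):      # trace the argmax chain backwards
        x.append(int(bp[x[-1]]))
    return float(np.max(msg)), list(reversed(x))

theta_i = [np.array([0.0, 1.0]), np.array([0.5, 0.0]), np.array([0.0, 0.0])]
theta_e = [np.array([[2.0, 0.0], [0.0, 2.0]]), np.array([[0.0, 3.0], [1.0, 0.0]])]
print(chain_map(theta_i, theta_e))    # (5.5, [0, 0, 1]) for these example potentials
```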

Dual decomposition and Lagrangian relaxation
Introduction to dual decomposition for inference. For the MAP inference problem
MAP(θ) = max_x ( Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f) ),
for any given factor f we make a copy of each variable x_i appearing in the factor and denote it by x_i^f.
(Figure 1.3 of reference [1]: illustration of the derivation of the dual decomposition objective by creating copies of variables. The pairwise model with factors θ_f(x_1, x_2), θ_g(x_1, x_3), θ_h(x_2, x_4), θ_k(x_3, x_4) is split into subproblems over the copies x_i^f.)

Dual decomposition and Lagrangian relaxation
MAP: find the maximum probability assignment. The original MAP inference problem is equivalent to
max_{x, x^F} Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f^f)   s.t. x_i^f = x_i for all f, i ∈ f,
since we add the constraints x_i^f = x_i for all f, i ∈ f.
Lagrangian. Introduce Lagrange multipliers δ = {δ_fi(x_i) : f ∈ F, i ∈ f, x_i}, then define the Lagrangian
L(δ, x, x^F) = Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f^f) + Σ_{f∈F} Σ_{i∈f} Σ_{x̂_i} δ_fi(x̂_i) ( 1[x_i = x̂_i] − 1[x_i^f = x̂_i] ).

Dual decomposition and Lagrangian relaxation
Lagrangian. The original MAP inference problem is equivalent to
max_{x, x^F} L(δ, x, x^F)   s.t. x_i^f = x_i for all f, i ∈ f,
since if the constraints x_i^f = x_i for all f, i ∈ f hold, the last term in L(δ, x, x^F) is zero.
Meaning of the Lagrange multipliers δ_fi(x_i). Consider δ_fi(x_i) as a message that factor f sends to variable i about its state x_i; iteratively updating δ_fi(x_i) is then similar to message passing between factors and variables.

Dual decomposition and Lagrangian relaxation
Lagrangian relaxation. Now we discard the constraints x_i^f = x_i for all f, i ∈ f and consider the relaxed dual function
L(δ) = max_{x, x^F} L(δ, x, x^F) = Σ_{i∈V} max_{x_i} ( θ_i(x_i) + Σ_{f: i∈f} δ_fi(x_i) ) + Σ_{f∈F} max_{x_f^f} ( θ_f(x_f^f) − Σ_{i∈f} δ_fi(x_i^f) ).
Notice that without the constraints x_i^f = x_i the above problem is no longer equivalent to the original one, since we moved the maximization inside the sums. Each subproblem involves maximization over only a small set of variables; we use δ_fi(x_i) to help achieve consensus among the subproblems.

Dual decomposition and Lagrangian relaxation
The dual problem. Without the constraints x_i^f = x_i for all f, i ∈ f, L(δ) is a maximization over a larger space, so MAP(θ) ≤ L(δ). The goal is then to find the tightest upper bound by solving min_δ L(δ), where
L(δ) = max_{x, x^F} L(δ, x, x^F) = Σ_{i∈V} max_{x_i} ( θ_i(x_i) + Σ_{f: i∈f} δ_fi(x_i) ) + Σ_{f∈F} max_{x_f} ( θ_f(x_f) − Σ_{i∈f} δ_fi(x_i) ).
Note that the maximizations in this optimization problem are independent, so in the updating process we do not need to distinguish between x_i and x_i^f.

Dual decomposition and Lagrangian relaxation
Reparameterization interpretation of the dual. Given a set of dual variables δ, define
θ_i^δ(x_i) = θ_i(x_i) + Σ_{f: i∈f} δ_fi(x_i)   and   θ_f^δ(x_f) = θ_f(x_f) − Σ_{i∈f} δ_fi(x_i).
For all assignments x we have
Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f) = Σ_{i∈V} θ_i^δ(x_i) + Σ_{f∈F} θ_f^δ(x_f).
We call θ^δ a reparameterization of the original parameters θ. Observe that
L(δ) = Σ_{i∈V} max_{x_i} θ_i^δ(x_i) + Σ_{f∈F} max_{x_f} θ_f^δ(x_f).
Thus the dual problem can be viewed as searching for the reparameterization θ^δ that minimizes L(δ).
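
A minimal sketch of evaluating the dual bound L(δ) through the reparameterized terms, reusing the small two-factor example from earlier; the model encoding and helper names are assumptions made for illustration:

```python
# Evaluate L(delta) = sum_i max theta_i^delta + sum_f max theta_f^delta and
# compare it against the exact MAP value (L(delta) >= MAP(theta) for every delta).
import numpy as np
from itertools import product

k = 2                                        # binary variables
theta_i = {1: np.zeros(k), 2: np.zeros(k), 3: np.zeros(k)}
theta_f = {(1, 2): np.array([[2., 100.], [3., 0.1]]),
           (2, 3): np.array([[10., 5.], [200., 0.]])}

def dual_bound(delta):
    # delta[(f, i)] is a length-k vector delta_{fi}(x_i)
    bound = 0.0
    for i, th in theta_i.items():            # reparameterized single-node terms
        bound += np.max(th + sum(delta[(f, i)] for f in theta_f if i in f))
    for f, th in theta_f.items():            # reparameterized factor terms
        rep = th - delta[(f, f[0])][:, None] - delta[(f, f[1])][None, :]
        bound += np.max(rep)
    return bound

def exact_map():
    best = -np.inf
    for x in product(range(k), repeat=3):
        val = sum(theta_i[i][x[i - 1]] for i in (1, 2, 3))
        val += sum(th[x[f[0] - 1], x[f[1] - 1]] for f, th in theta_f.items())
        best = max(best, val)
    return best

zero = {(f, i): np.zeros(k) for f in theta_f for i in f}
rng = np.random.default_rng(0)
rand = {key: rng.normal(size=k) for key in zero}
print(exact_map(), dual_bound(zero), dual_bound(rand))   # both bounds are >= MAP(theta)
```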

Dual decomposition and Lagrangian relaxation
Reparameterization interpretation of the dual.
(Figure 1.2 of reference [1]: illustration of the dual decomposition objective. Left: the original pairwise model consisting of four factors θ_f(x_1, x_2), θ_g(x_1, x_3), θ_h(x_2, x_4), θ_k(x_3, x_4). Right: the independent maximization problems corresponding to the objective L(δ), one per factor and one per variable.)

Dual decomposition and Lagrangian relaxation
Existence of a duality gap. The difference between the primal optimal value and the dual optimal value is called the duality gap. For our problem,
MAP(θ) = max_x ( Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f) ) ≤ min_δ L(δ),
and we do not necessarily have strong duality (i.e., equality in the above, hence no duality gap). Strong duality holds only when the subproblems agree on a maximizing assignment, that is, when there exist δ and x* such that x_i* ∈ arg max_{x_i} θ_i^δ(x_i) and x_f* ∈ arg max_{x_f} θ_f^δ(x_f), which is not guaranteed. However, in real-world problems exact solutions are frequently found using dual decomposition.

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Subgradient algorithms Block coordinate descent algorithms Discussion

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Subgradient algorithms Block coordinate descent algorithms Discussion

Subgradient algorithms
Gradient descent / steepest descent. If f(x) is differentiable in a neighborhood of a point a, then f(x) decreases fastest if one moves from a in the direction of the negative gradient of f at a, −∇f(a). This gives the gradient descent algorithm
x_{t+1} = x_t − α_t ∇f(x_t),
where α_t is the step size; recall the different ways to choose the step size: fixed, decreasing, adaptive. (Note: the figure is adapted from Wikipedia.)

Subgradient algorithms
Limitations of gradient descent: for ill-conditioned convex problems, gradient descent zigzags, and the function must be differentiable.
Our problem of minimizing L(δ):
L(δ) = Σ_{i∈V} max_{x_i} θ_i^δ(x_i) + Σ_{f∈F} max_{x_f} θ_f^δ(x_f).
L(δ) is continuous and convex, but not differentiable at points δ where θ_i^δ(x_i) or θ_f^δ(x_f) have multiple optima.

Subgradient algorithms
Subgradient algorithms. A subgradient of a convex function L(δ) at δ is a vector g_δ such that for all δ′,
L(δ′) ≥ L(δ) + g_δ · (δ′ − δ).
Suppose the dual variables at iteration t are given by δ^t; the update rule is
δ_fi^{t+1}(x_i) = δ_fi^t(x_i) − α_t g_fi^t(x_i),
where g^t is a subgradient of L(δ) at δ^t.

Subgradient algorithms
Calculating the subgradient of L(δ). Given the current dual variables δ^t, first choose a maximizing assignment for each subproblem (recall L(δ, x, x^F) = Σ_{i∈V} θ_i(x_i) + Σ_{f∈F} θ_f(x_f^f) + Σ_{f∈F} Σ_{i∈f} Σ_{x̂_i} δ_fi(x̂_i)(1[x_i = x̂_i] − 1[x_i^f = x̂_i])): let x_i^s be a maximizing assignment of θ_i^{δ^t}(x_i), and let x^f be a maximizing assignment of θ_f^{δ^t}(x_f). The subgradient of L(δ) at δ^t is then given by the following pseudocode:
  g_fi^t(x_i) = 0 for all f, i ∈ f, x_i
  For f ∈ F and i ∈ f:
    If x_i^f ≠ x_i^s:
      g_fi^t(x_i^s) = +1
      g_fi^t(x_i^f) = −1
Thus, each time x_i^f ≠ x_i^s, the subgradient update decreases the value of θ_i^δ(x_i^s) and increases the value of θ_i^δ(x_i^f). (Note: the algorithm is adapted from reference [1].)

Subgradient algorithms
Updating with the subgradient g^δ. Recall
δ_fi^{t+1}(x_i) = δ_fi^t(x_i) − α_t g_fi^t(x_i),
θ_i^δ(x_i) = θ_i(x_i) + Σ_{f: i∈f} δ_fi(x_i),
θ_f^δ(x_f) = θ_f(x_f) − Σ_{i∈f} δ_fi(x_i).
In the updating process, whenever x_i^f ≠ x_i^s, θ_i^{δ^t}(x_i^s) decreases and θ_i^{δ^t}(x_i^f) increases; similarly, θ_f^{δ^t}(x_i^f, x_{f\i}) decreases and θ_f^{δ^t}(x_i^s, x_{f\i}) increases, pushing the subproblems toward agreement.

Subgradient algorithms
Notes on the subgradient algorithm. If at any iteration t we find g^t = 0, then x_i^f = x_i^s for all f ∈ F, i ∈ f; there is no duality gap and we have solved the problem exactly. The subgradient algorithm solves the dual to optimality whenever the step sizes are chosen such that lim_{t→∞} α_t = 0 and Σ_{t=0}^∞ α_t = ∞, e.g. α_t = 1/t.
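
Putting the pieces together, a minimal self-contained subgradient sketch on the same small illustrative model as before; the step-size schedule α_t = 1/t follows the conditions above, and all names and model values are assumptions for illustration:

```python
# Subgradient method for min_delta L(delta) on a tiny pairwise model with binary variables.
import numpy as np

theta_i = {1: np.zeros(2), 2: np.zeros(2), 3: np.zeros(2)}
theta_f = {(1, 2): np.array([[2., 100.], [3., 0.1]]),
           (2, 3): np.array([[10., 5.], [200., 0.]])}
delta = {(f, i): np.zeros(2) for f in theta_f for i in f}

def node_term(i):     # theta_i^delta(x_i)
    return theta_i[i] + sum(delta[(f, i)] for f in theta_f if i in f)

def factor_term(f):   # theta_f^delta(x_f)
    return theta_f[f] - delta[(f, f[0])][:, None] - delta[(f, f[1])][None, :]

for t in range(1, 101):
    xs = {i: int(np.argmax(node_term(i))) for i in theta_i}           # x_i^s
    xf = {f: np.unravel_index(np.argmax(factor_term(f)), (2, 2))      # x^f
          for f in theta_f}
    step = 1.0 / t                      # alpha_t -> 0 and sum_t alpha_t = infinity
    agree = True
    for f in theta_f:
        for pos, i in enumerate(f):
            if xf[f][pos] != xs[i]:     # disagreement -> nonzero subgradient entry
                agree = False
                delta[(f, i)][xs[i]] -= step       # g_{fi}(x_i^s) = +1, delta decreases
                delta[(f, i)][xf[f][pos]] += step  # g_{fi}(x_i^f) = -1, delta increases
    if agree:                           # g^t = 0: subproblems agree, no duality gap
        break

L = sum(np.max(node_term(i)) for i in theta_i) + \
    sum(np.max(factor_term(f)) for f in theta_f)
print(t, L)   # for this model the subproblems agree quickly and L equals MAP(theta)
```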

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Subgradient algorithms Block coordinate descent algorithms Discussion

Block coordinate descent algorithms
Coordinate descent: fix all other coordinates, do a line search along the chosen coordinate direction, then cycle through the coordinates throughout the procedure.

Block coordinate descent algorithms
Block coordinate descent: the coordinates are divided into blocks, each containing a subset of the coordinates, and the update rule is to update one block at a time while keeping the other blocks fixed. (Figure: alternating "updated" and "fixed" coordinate blocks.)
Difficulties in designing block coordinate descent algorithms: how to choose the coordinate block to update, and how to design the update rule for a given coordinate block.

Block coordinate descent algorithms
Block coordinate descent for MAP inference. The update rule of the subgradient method involves only the x_i^s and x^f corresponding to the maximizing assignments, whereas in block coordinate descent algorithms we update δ_fi(x_i) for all configurations of x_i. Two common choices:
Fix f and i, update δ_fi(x_i) for all x_i: Max-Sum Diffusion (MSD) (Schlesinger 76, Werner 07).
Fix f, update δ_fi(x_i) for all i ∈ f and x_i: Max-Product Linear Programming (MPLP) (Globerson & Jaakkola 07).

Block coordinate descent algorithms
The Max-Sum Diffusion algorithm:
δ_fi(x_i) = −(1/2) δ_i^{−f}(x_i) + (1/2) max_{x_{f\i}} { θ_f(x_f) − Σ_{î∈f\i} δ_fî(x_î) },
where δ_i^{−f}(x_i) = θ_i(x_i) + Σ_{f̂≠f} δ_f̂i(x_i). For fixed f and i, Σ_{î∈f\i} δ_fî(x_î) involves the messages from f to all of f's children except i (i.e. f\i); Σ_{f̂≠f} δ_f̂i(x_i) is the message from all of i's parents other than f to i. (Figure: factor graph indicating which messages are updated and which are used in the update.)

Block coordinate descent algorithms
The Max-Product Linear Programming algorithm:
δ_fi(x_i) = −δ_i^{−f}(x_i) + (1/|f|) max_{x_{f\i}} { θ_f(x_f) + Σ_{î∈f} δ_î^{−f}(x_î) },
where δ_i^{−f}(x_i) = θ_i(x_i) + Σ_{f̂≠f} δ_f̂i(x_i). For fixed f, Σ_{î∈f} δ_î^{−f}(x_î) collects the messages from f's co-parents to all their children; Σ_{f̂≠f} δ_f̂i(x_i) is the message from all of i's parents other than f to i. (Figure: factor graph indicating which messages are updated and which are used in the update.)
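
A minimal sketch of the MPLP block update for pairwise factors, following the update form stated above on the same illustrative model (names and values are assumptions, not the authors' code):

```python
# MPLP block coordinate descent on a tiny pairwise model with binary variables.
import numpy as np

theta_i = {1: np.zeros(2), 2: np.zeros(2), 3: np.zeros(2)}
theta_f = {(1, 2): np.array([[2., 100.], [3., 0.1]]),
           (2, 3): np.array([[10., 5.], [200., 0.]])}
delta = {(f, i): np.zeros(2) for f in theta_f for i in f}

def delta_minus(i, f):
    """delta_i^{-f}(x_i): node potential plus messages from all factors except f."""
    return theta_i[i] + sum(delta[(g, i)] for g in theta_f if i in g and g != f)

for _ in range(50):                      # sweeps of block coordinate descent
    for f in theta_f:                    # block: all delta_{fi}(x_i) for i in f
        a, b = f
        da, db = delta_minus(a, f), delta_minus(b, f)
        aug = theta_f[f] + da[:, None] + db[None, :]   # theta_f plus incoming messages
        delta[(f, a)] = -da + 0.5 * aug.max(axis=1)    # maximize over x_b (|f| = 2)
        delta[(f, b)] = -db + 0.5 * aug.max(axis=0)    # maximize over x_a

# Dual objective and locally decoded assignment after the sweeps
node = {i: theta_i[i] + sum(delta[(f, i)] for f in theta_f if i in f) for i in theta_i}
fac = {f: theta_f[f] - delta[(f, f[0])][:, None] - delta[(f, f[1])][None, :]
       for f in theta_f}
L = sum(v.max() for v in node.values()) + sum(v.max() for v in fac.values())
x_decoded = {i: int(np.argmax(node[i])) for i in theta_i}
print(L, x_decoded)   # on this tree-structured example MPLP reaches the MAP value
```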

Block coordinate descent algorithms
Comparison of subgradient, MSD and MPLP. In the subgradient algorithm, only the x_i^s and x^f corresponding to the maximizing assignments are used. In MSD, we fix f and i and update δ_fi for all x_i simultaneously. In MPLP, we fix f and update δ_fi for all i ∈ f and x_i simultaneously. Larger blocks are used in MPLP, and more communication between subproblems is involved in MPLP.

Block coordinate descent algorithms
Limitations of coordinate descent: for non-smooth problems, coordinate descent may get stuck at corners!

Block coordinate descent algorithms
Possible solution: use the soft-max (convex and differentiable) instead of the max,
max_i f_i ≈ ε log Σ_i exp(f_i / ε),
which converges to the max as ε → 0. (Figure: simulation result of using the soft-max on an example model where MPLP gets stuck; the plot shows the objective of the subgradient method, MPLP and soft-MPLP against the optimum over iterations. Adapted from reference [2].)
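
A tiny numeric check of the soft-max approximation (the values of f are illustrative):

```python
# The soft-max eps*log(sum_i exp(f_i/eps)) upper-bounds the max and converges to it
# as eps -> 0; computed in a numerically stable way by factoring out the max.
import numpy as np

f = np.array([1.0, 3.0, 2.5])
for eps in (1.0, 0.1, 0.01):
    m = f.max()
    softmax = m + eps * np.log(np.sum(np.exp((f - m) / eps)))
    print(eps, softmax, m)
```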

Outline The MAP inference problem Motivating Applications Dual decomposition and Lagrangian relaxation Algorithms for solving the dual problem Discussion

Discussion
Decoding the MAP assignment. Locally decode the single-node reparameterizations:
x_i* ∈ arg max_{x̂_i} θ_i^δ(x̂_i).
If the node terms θ_i^δ(x_i) have unique maxima, we say that δ is locally decodable to x*. For subgradient algorithms, when the dual decomposition has a unique solution and it is integral, the MAP assignment is guaranteed to be found from the dual solution.

Discussion
Decoding the MAP assignment. For coordinate descent algorithms, even when the dual decomposition has a unique solution and it is integral:
For binary pairwise MRFs, fixed points of the coordinate descent algorithms are locally decodable to the true MAP assignment.
For pairwise tree-structured MRFs and MRFs with a single cycle, fixed points of the coordinate descent algorithms are locally decodable to the true MAP assignment.
There exist non-binary pairwise MRFs with cycles for which dual-optimal fixed points of the coordinate descent algorithms are not locally decodable.

Discussion
Combining subgradient and coordinate descent. The subgradient method is slow but guaranteed to converge; coordinate descent is faster but may not converge to the optimum. We can combine the two algorithms: first use coordinate descent and, once it is close to convergence, switch to the subgradient method to reach the global optimum; or alternate between subgradient and coordinate descent steps.

Thanks! Questions!