Quadratic Programming Relaxations for Metric Labeling and Markov Random Field MAP Estimation


Pradeep Ravikumar    John Lafferty
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

Quadratic program relaxations are proposed as an alternative to linear program relaxations and tree-reweighted belief propagation for the metric labeling or MAP estimation problem. An additional convex relaxation of the quadratic approximation is shown to have additive approximation guarantees that apply even when the graph weights have mixed sign or do not come from a metric. The approximations are extended in a manner that allows tight variational relaxations of the MAP problem, although they generally involve non-convex optimization. Experiments carried out on synthetic data show that the quadratic approximations can be more accurate and computationally efficient than the linear programming and propagation based alternatives.

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

1. Introduction

Undirected graphical models, or Markov random fields (MRFs), are natural tools in many domains, from image processing to social network modeling. A key inference problem for MRFs is to compute the maximum a posteriori (MAP) configuration, the most probable labeling, which is used in multiple applications such as image denoising, protein folding and error control coding. For arbitrary graphs and parameter settings this problem is NP-hard, but various approximate techniques have been proposed that have enabled the application of MRFs to a range of practical problems.

For tree-structured distributions, the MAP estimate for random fields can be computed efficiently by dynamic programming. It can also be computed in polynomial time using graph cuts (Greig et al., 1989) when the parameter settings yield a submodular energy function. In the general setting, a widely used approximation technique is max-product belief propagation (Pearl, 1988). The algorithm is convergent on trees, and its fixed point configuration upon convergence can be shown to be locally optimal with respect to a large set of moves (Weiss & Freeman, 2001). A similar message passing algorithm, tree-reweighted max product (Wainwright et al., 2005), has stronger correctness and convergence guarantees. Boykov et al. (2001) have proposed graph-cut based algorithms that efficiently find a local energy minimum with respect to two types of large moves.

A different direction has been taken in recent work on linear program relaxations for the MAP problem in the specific setting of metric labeling. In the metric labeling formulation, a weighted graph and a metric on labels specify the energy or cost of different labelings of a set of objects, and the goal is to find a minimum cost labeling. Casting this as an integer linear program, Kleinberg and Tardos (1999) proposed linear relaxations for specific metrics. Chekuri et al. (2005) recently extended these techniques using the natural linear relaxation of the metric labeling task, and obtained stronger approximation guarantees.

In this paper, we propose a quadratic programming (QP) relaxation to the MAP or metric labeling problem. While the linear relaxations have O(|E| k^2) variables, where |E| is the number of edges in the graph and k is the number of labels, in our QP formulation there are kn variables, and yet we show that the quadratic objective function more accurately represents the energy in the graphical model.
In particular, we show that the QP formulation computes the MAP solution exactly. Under certain conditions, however, the relaxation results in a non-convex problem, which requires an intractable search over local minima. This motivates an additional convex approximation to the relaxation, which we show satisfies an additive approximation guarantee. We also extend the relaxation to general variational inner polytope relaxations, which we also show to compute the MAP exactly. Experiments indicate that our quadratic relaxation with the convex approximation outperforms or is comparable to existing methods under many settings.

In the following section, we establish some notation and recall the relevant background. In Section 2.1 we review linear relaxations for MAP estimation. In Section 3, we describe the quadratic relaxation, prove that it is tight, and detail its convex approximation. In Section 4, we then extend the above relaxation to show the tightness of various variational inner polytope relaxations. Finally, in Section 5 and Section 6, we present our experimental results and conclusions.

2. Notation and Background

Consider a graph G = (V, E), where V denotes the set of nodes and E denotes the set of edges. Let X_s be a random variable associated with node s, for s ∈ V, yielding a random vector X = {X_1, ..., X_n}, and let φ = {φ_α, α ∈ I} denote the set of potential functions (or sufficient statistics) for a set I of cliques in G. Associated with φ is a vector of parameters θ = {θ_α, α ∈ I}. With this notation, the exponential family of distributions of X associated with φ and G is given by

    p(x; θ) = exp( Σ_α θ_α φ_α(x) − Ψ(θ) ).

As discussed in (Yedidia et al., 2001), at the expense of increasing the state space one can assume without loss of generality that the graphical model is a pairwise Markov random field, i.e., the set of cliques I is the set of edges {(s,t) ∈ E}, so that

    p(x; θ) ∝ exp( Σ_{s ∈ V} θ_s φ_s(x_s) + Σ_{(s,t) ∈ E} θ_st φ_st(x_s, x_t) ).

If each X_s takes values in a discrete set X_s, we can represent any potential function as a linear combination of indicator functions, φ_s(x_s) = Σ_j φ_s(j) I_j(x_s) and φ_st(x_s, x_t) = Σ_{j,k} φ_st(j,k) I_{j,k}(x_s, x_t), where

    I_j(x_s) = 1 if x_s = j, and 0 otherwise,
    I_{j,k}(x_s, x_t) = 1 if x_s = j and x_t = k, and 0 otherwise.

We thus consider pairwise MRFs with indicator potential functions,

    p(x | θ) ∝ exp( Σ_{s,j} θ_{s;j} I_j(x_s) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} I_{j,k}(x_s, x_t) ).

The MAP problem is then given by

    x* = argmax_x Σ_{s,j} θ_{s;j} I_j(x_s) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} I_{j,k}(x_s, x_t).

2.1. Linear Relaxations

MAP estimation in the discrete case is essentially a combinatorial optimization problem, and it can be cast as an integer program. Recent work has studied approximate MAP estimation using linear relaxations (Bertsimas & Tsitsiklis, 1997). Letting variables μ_1(s;j) and μ_2(s,j;t,k) correspond to the indicator variables I_j(x_s) and I_{j,k}(x_s, x_t), we obtain the following integer linear program (ILP):

    max  Σ_{s,j} θ_{s;j} μ_1(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ_2(s,j;t,k)        (1)
    such that  Σ_k μ_2(s,j;t,k) = μ_1(s;j)
               Σ_j μ_1(s;j) = 1
               μ_1(s;j) ∈ {0,1},  μ_2(s,j;t,k) ∈ {0,1}.

This ILP can then be relaxed to the following linear program (LP):

    max  Σ_{s,j} θ_{s;j} μ_1(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ_2(s,j;t,k)        (2)
    such that  Σ_k μ_2(s,j;t,k) = μ_1(s;j)
               Σ_j μ_1(s;j) = 1
               0 ≤ μ_1(s;j) ≤ 1,  0 ≤ μ_2(s,j;t,k) ≤ 1.

Chekuri et al. (2005) propose the above LP relaxation as an approximation algorithm for the metric labeling task, which is the MAP problem with spatially homogeneous MRF parameters; thus θ_{s,j;t,k} = w_st d(j,k), where w_st is a non-negative edge weight and d is a metric that is the same for all the edges. Kleinberg and Tardos (1999) proposed related linear relaxations for specific metrics. The above LP relaxation was also proposed for the general pairwise graphical model setting by Wainwright and Jordan (2003). Letting θ and φ(x) denote the vectors of parameters and potential functions, respectively, and letting ⟨θ, φ(x)⟩ denote the inner product

    ⟨θ, φ(x)⟩ = Σ_{s,j} θ_{s;j} I_j(x_s) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} I_{j,k}(x_s, x_t),

the MAP problem is then given by

    x* = argmax_x ⟨θ, φ(x)⟩,   with optimal value   max_x ⟨θ, φ(x)⟩ = sup_{μ ∈ M} ⟨θ, μ⟩,

where M is the set of moment parameters

    M = { μ : Σ_x p(x) φ(x) = μ for some distribution p }.

The polytope M can be seen to be upper-bounded by the set LOCAL(G) of all singleton and pairwise vectors μ_1 and μ_2 that satisfy the local consistency constraints

    Σ_k μ_2(s,j;t,k) = μ_1(s;j),   Σ_j μ_1(s;j) = 1,   0 ≤ μ_1(s;j) ≤ 1,   0 ≤ μ_2(s,j;t,k) ≤ 1.

Wainwright and Jordan (2003) thus proposed the upper-bounding relaxation obtained by using LOCAL(G) as an outer bound for the polytope M,

    sup_{μ ∈ LOCAL(G)} ⟨θ, μ⟩,        (3)

which is the same LP formulation as in equation (2). Furthermore, Wainwright et al. (2005) show that under certain conditions, the tree-reweighted belief propagation updates solve the dual of the LP in equation (3); since strong duality holds, the tree updates also give the optimal primal value for the LP.
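The LP relaxation (2) can be handed directly to an off-the-shelf solver. The following sketch is not from the paper: the tiny chain graph, the random potentials, and the indexing helpers are illustrative assumptions, and it imposes both marginalization directions of the local consistency constraints, as is standard for LOCAL(G).

```python
# A minimal sketch (assumed toy setup) of the LP relaxation (2) using scipy.
import numpy as np
from scipy.optimize import linprog

n, k = 3, 2                          # nodes and labels
edges = [(0, 1), (1, 2)]             # a small chain graph
rng = np.random.default_rng(0)
theta_node = rng.uniform(-1, 1, size=(n, k))                       # theta_{s;j}
theta_edge = {e: rng.uniform(-1, 1, size=(k, k)) for e in edges}   # theta_{s,j;t,k}

# Variable vector: all mu1(s;j) first, then mu2(s,j;t,k) for each edge.
def i1(s, j): return s * k + j
def i2(e, j, m): return n * k + edges.index(e) * k * k + j * k + m
nvar = n * k + len(edges) * k * k

# Objective: linprog minimizes, so negate theta to maximize the energy.
c = np.zeros(nvar)
for s in range(n):
    for j in range(k):
        c[i1(s, j)] = -theta_node[s, j]
for e in edges:
    for j in range(k):
        for m in range(k):
            c[i2(e, j, m)] = -theta_edge[e][j, m]

# Equality constraints: node marginals sum to one, and edge marginals are
# consistent with the node marginals at both endpoints.
A_eq, b_eq = [], []
for s in range(n):
    row = np.zeros(nvar)
    row[[i1(s, j) for j in range(k)]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for e in edges:
    s, t = e
    for j in range(k):                        # sum_m mu2(s,j;t,m) = mu1(s;j)
        row = np.zeros(nvar); row[i1(s, j)] = -1.0
        for m in range(k): row[i2(e, j, m)] = 1.0
        A_eq.append(row); b_eq.append(0.0)
    for m in range(k):                        # sum_j mu2(s,j;t,m) = mu1(t;m)
        row = np.zeros(nvar); row[i1(t, m)] = -1.0
        for j in range(k): row[i2(e, j, m)] = 1.0
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
print("LP relaxation value:", -res.fun)
```

On a tree such as this chain the LP is tight, so the reported value coincides with the true MAP energy; on graphs with cycles it is only an upper bound.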

3. Quadratic Relaxation

In the linear relaxation of equation (2), the variables μ_2(s,j;t,k) are relaxations of the indicator variables I_{j,k}(x_s, x_t), with a value of one indicating that for edge (s,t) ∈ E, variable x_s is labeled j and variable x_t is labeled k. These pairwise variables are constrained by demanding that they be consistent with the corresponding marginal variables μ_1(s;j). Note, however, that the binary indicator variables satisfy the additional independence constraint I_{j,k}(x_s, x_t) = I_j(x_s) I_k(x_t). This then suggests that constraining the relaxation variables in a similar manner, μ_2(s,j;t,k) = μ_1(s;j) μ_1(t;k), might yield a tighter relaxation. This leads to the following quadratic program:

    max  Σ_{s,j} θ_{s;j} μ(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ(s;j) μ(t;k)        (4)
    subject to  Σ_j μ(s;j) = 1
                0 ≤ μ(s;j) ≤ 1.

The following result shows that the relaxation is in fact tight; the proof uses the probabilistic method.

Theorem 3.1. The optimal value of problem (4) is equal to the optimal value of the MAP problem (1).

Proof. Let the optimal MAP energy be

    e_MAP = max_x Σ_{s,j} θ_{s;j} I_j(x_s) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} I_{j,k}(x_s, x_t)

and let the optimal value of the relaxed problem be

    e* = max_μ Σ_{s,j} θ_{s;j} μ(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ(s;j) μ(t;k),

where Σ_j μ(s;j) = 1 and μ(s;j) ∈ [0,1]. Clearly, e* ≥ e_MAP, since problem (4) is a relaxation of problem (1). We now show that e_MAP ≥ e*. Let μ* be an optimal solution of problem (4), and consider the following randomized rounding scheme. For each node s, assign it value j with probability μ*(s;j). The expected energy of such a rounding is

    e_R = Σ_{s,j} θ_{s;j} μ*(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ*(s;j) μ*(t;k) = e*.

But there has to exist some discrete assignment y whose energy is at least the expected energy, e(y) ≥ e*. Since e_MAP is the energy of the optimal configuration, e_MAP ≥ e(y), which thus gives e_MAP ≥ e*.

Note that the randomization in the proof just shows the existence of a discrete solution with the same energy as that of the optimal real relaxation. The problem of obtaining such a discrete solution efficiently is considered next.
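The randomized rounding used in the proof is easy to simulate. The sketch below is not the authors' code; the small graph, random potentials, and the arbitrary fractional μ are assumptions. It simply checks numerically that the average energy of independently rounded labelings approaches the QP objective, so some rounding must do at least as well.

```python
# Minimal sketch of the randomized rounding from the proof of Theorem 3.1
# (assumed toy model): sample each node's label from mu(s;.) independently.
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3
edges = [(0, 1), (1, 2), (2, 3)]
theta_node = rng.uniform(-1, 1, size=(n, k))                       # theta_{s;j}
theta_edge = {e: rng.uniform(-1, 1, size=(k, k)) for e in edges}   # theta_{s,j;t,k}

def energy(x):
    """MAP objective of a discrete labeling x."""
    val = sum(theta_node[s, x[s]] for s in range(n))
    return val + sum(theta_edge[(s, t)][x[s], x[t]] for s, t in edges)

def qp_objective(mu):
    """QP objective (4) at fractional node marginals mu (an n-by-k array)."""
    val = np.sum(theta_node * mu)
    return val + sum(mu[s] @ theta_edge[(s, t)] @ mu[t] for s, t in edges)

# Any feasible fractional point: rows on the probability simplex.
mu = rng.random((n, k))
mu /= mu.sum(axis=1, keepdims=True)

samples = [energy([rng.choice(k, p=mu[s]) for s in range(n)]) for _ in range(20000)]
print("QP objective:       ", qp_objective(mu))
print("mean rounded energy:", np.mean(samples))   # approaches the QP objective
print("best rounded energy:", max(samples))       # at least the QP objective
```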

Theorem 3.2. Any solution of the MAP problem (1) efficiently yields a solution of the relaxation (4) and vice versa. Thus the relaxation (4) is equivalent to the MAP problem (1).

Proof. From Theorem 3.1, the optimal values of problems (1) and (4) are equal; let e* denote this maximum energy. Let x̂ be an optimal solution of the MAP problem (1). As problem (4) is a relaxation of the MAP problem, μ(s;j) = I_j(x̂_s) is also a feasible and optimal solution for (4). For the converse, let μ* be an optimal solution of problem (4). Its energy is given by

    e* = Σ_{s,j} θ_{s;j} μ*(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ*(s;j) μ*(t;k).        (5)

If each μ*(s;j) is integer valued, that is, in {0,1}, then we can use μ* itself as the feasible optimal solution for the MAP problem (1). Otherwise, consider μ* to be real valued; we (efficiently) construct a labeling y with the maximum energy e*. Consider an unlabeled node s. Assign it the label

    y_s = argmax_j  θ_{s;j} + Σ_{t:(s,t) ∈ E; k} θ_{s,j;t,k} μ*(t;k).

Now, set μ*(s; y_s) = 1 and μ*(s;k) = 0 for k ≠ y_s. Continue with this labeling process until all nodes are labeled. It can be shown that the energy of this assignment y is equal to the energy e* of the optimal MAP assignment. In particular, each time we take up an unlabeled node, we select a labeling that does not decrease the expected energy of the unlabeled nodes given the labelings of the labeled nodes. Given that the initial expected energy of all unlabeled nodes was e*, the energy at the end of the process, that is, of the assignment y, is thus at least e*.

3.1. Convex Approximation

The previous section showed that the relaxation in equation (4), while a simple extension of the LP in equation (2), is actually equivalent to the MAP problem. This yields the interesting result that the MAP problem is solvable in polynomial time if the edge parameter matrix Θ = [θ_{s,j;t,k}] is negative definite, since in this case the QP (4) is a convex program. Note also that the quadratic program has a simple set of constraints (only linear and box constraints), which are also small in number, and is thus a simple problem instance of general convex optimization. It should also be stressed that for an n node graph, the QP has only kn variables while the LP has O(k^2 |E|) variables.

The case where the edge parameter matrix Θ is not negative definite yields a non-convex program. While one could carry out an iterative search procedure up to a local maximum as in the max-product algorithm, we now describe a convex approximation that provides a polynomial time solution with additive bound guarantees. Consider the quadratic integer program (QIP) corresponding to the QP, given by

    max  Σ_{s,j} θ_{s;j} μ(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ(s;j) μ(t;k)        (6)
    subject to  Σ_j μ(s;j) = 1
                μ(s;j) ∈ {0,1}.

This is clearly equivalent to the MAP problem in equation (1). Let Θ = [θ_{s,j;t,k}] be a parameter matrix that is not negative semi-definite. Let d(s;i) be the (positive) diagonal terms that need to be subtracted from the matrix to make it negative semi-definite. An upper bound for d is d(s;i) ≤ Σ_{(t,k)} |θ_{s,i;t,k}|, since the negative of a diagonally dominant matrix is negative semi-definite. Let Θ̂ = Θ − diag{d(s;i)} be the negative semi-definite matrix obtained by subtracting off the diagonal elements d(s;i). Also, let

    θ̂_{s;i} = θ_{s;i} + d(s;i).        (7)

Now, for binary μ(s;i) ∈ {0,1}, we have that μ(s;i)^2 = μ(s;i); in particular, d(s;i) μ(s;i) − d(s;i) μ(s;i)^2 = 0. We thus get that the following QIP, with parameters (θ̂, Θ̂), is equivalent to the QIP (6):

    max  Σ_{s,j} θ̂_{s;j} μ(s;j) + Σ_{s,j;t,k} θ̂_{s,j;t,k} μ(s;j) μ(t;k)
    such that  Σ_j μ(s;j) = 1
               μ(s;j) ∈ {0,1}.

Relaxing this QIP as before, we obtain the following optimization problem:

    max  Σ_{s,j} θ̂_{s;j} μ(s;j) + Σ_{s,j;t,k} θ̂_{s,j;t,k} μ(s;j) μ(t;k)        (8)
    such that  Σ_j μ(s;j) = 1
               μ(s;j) ∈ [0,1].

This is a convex program solvable in polynomial time. The optimality results of the previous section do not follow, however, and the relaxation (8) is not always tight. But as shown next, we can get an additive approximation bound for the discrete solution obtained using the rounding procedure described in the previous section.

Theorem 3.3. Let μ* be the optimal solution for the convex QP (8), and let e* be the optimal MAP energy. Then there is a discrete configuration y (from μ*) with energy E(y) satisfying

    E(y) ≥ e* − Σ_{s,i} d(s;i) μ*(s;i)(1 − μ*(s;i)) ≥ e* − (1/4) Σ_{s,i} d(s;i).

This result shows that if either Θ is close to negative definite, so that Σ_{s,i} d(s;i) is small, or if the solution μ* is close to integral, so that μ*(s;i) is close to zero or one, then the convex relaxation achieves a solution that is close to the optimal MAP solution.
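Before turning to the proof of Theorem 3.3, here is a minimal sketch of the diagonal shift behind the convex approximation. It is not the authors' implementation; the small graph and random potentials are assumptions. It builds the edge parameter matrix Θ over (node, label) pairs, subtracts the diagonal dominance bound d(s;i) = Σ_{t,k} |θ_{s,i;t,k}| from the diagonal, verifies that the result is negative semi-definite, and reports the additive slack (1/4) Σ_{s,i} d(s;i).

```python
# Minimal sketch of the diagonal shift used by the convex approximation
# (assumed toy model, not the paper's code).
import numpy as np

rng = np.random.default_rng(1)
n, k = 4, 3
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Symmetric Theta with the blocks theta_{s,j;t,k} on the edges, zero elsewhere.
Theta = np.zeros((n * k, n * k))
for s, t in edges:
    block = rng.uniform(-1, 1, size=(k, k))
    Theta[s * k:(s + 1) * k, t * k:(t + 1) * k] = block
    Theta[t * k:(t + 1) * k, s * k:(s + 1) * k] = block.T

# Diagonal terms d(s;i) <= sum_{t,k} |theta_{s,i;t,k}| (diagonal dominance bound).
d = np.abs(Theta).sum(axis=1)
Theta_hat = Theta - np.diag(d)

# The shifted matrix is negative semi-definite (Gershgorin), so the relaxation
# with adjusted node parameters theta_hat_{s;i} = theta_{s;i} + d(s;i) is convex.
print("largest eigenvalue of Theta_hat:", np.linalg.eigvalsh(Theta_hat).max())
print("additive slack (1/4) sum d(s;i):", 0.25 * d.sum())
```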

Proof. Just as in the proof of Theorem 3.2, given μ*, the optimal solution to the convex relaxation, we can efficiently obtain a discrete solution y whose energy satisfies

    E(y) ≥ Σ_{s,j} θ_{s;j} μ*(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ*(s;j) μ*(t;k).

On the other hand, the optimal value of the convex QP is given by

    e_CQP = Σ_{s,i} (θ_{s;i} + d(s;i)) μ*(s;i) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ*(s;j) μ*(t;k) − Σ_{s,i} d(s;i) μ*(s;i)^2.

We then have that

    E(y) ≥ e_CQP − Σ_{s,i} d(s;i) μ*(s;i)[1 − μ*(s;i)]
         ≥ e* − Σ_{s,i} d(s;i) μ*(s;i)(1 − μ*(s;i))
         ≥ e* − (1/4) Σ_{s,i} d(s;i),

and the result follows.

3.2. Iterative Update Procedure

Just as tree-reweighted max product gives a set of iterative updates for solving the LP in equation (3), we might ask if there is an iterative update counterpart for the QP. Max-product is a coordinate ascent algorithm in the dual (Lagrangian) space for the LP; however, since the dual space of the QP (4) is more complicated, we look at a set of fixed point coordinate ascent updates in its primal space. The QP is given by

    max  Σ_{s,j} θ_{s;j} μ(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ(s;j) μ(t;k)
    subject to  Σ_j μ(s;j) = 1,  μ(s;j) ∈ [0,1].

Consider node s, and suppose that the values μ(t;·) are fixed for the other nodes t ≠ s. Then the optimal parameter values μ(s;·) for node s are given by

    μ(s;·) = argmax_{μ(s;·)}  Σ_j θ_{s;j} μ(s;j) + Σ_{t; j,k} θ_{s,j;t,k} μ(s;j) μ(t;k)
    subject to  Σ_j μ(s;j) = 1.

This is easily seen to be solved by taking

    j*(s) = argmax_j  θ_{s;j} + Σ_{t:(s,t) ∈ E; k} θ_{s,j;t,k} μ(t;k)

and setting μ(s;j) = I_{j*(s)}(j). This is essentially the iterative conditional modes algorithm (Besag, 1986), which iteratively updates each node with a labeling that most increases the energy, holding fixed the labels of the other nodes. A better iterative procedure, with stronger and faster convergence properties, albeit for convex programs, is projected conjugate gradient ascent (Axelsson & Barker, 2001). Thus, another advantage of our convex approximation is that we can use conjugate gradient ascent as a simple iterative procedure that is guaranteed to converge (unlike coordinate ascent for max product). This makes the convex approximation to the QP applicable to large scale problems.
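The coordinate ascent update just described reduces to iterative conditional modes, which is straightforward to implement. The sketch below is an illustrative assumption (toy chain graph, random potentials), not the paper's experimental code: each sweep gives every node the label that maximizes its local contribution given the current labels of its neighbors, and the procedure stops at a local maximum of the energy.

```python
# Minimal ICM sketch (assumed toy model) for the primal coordinate ascent update.
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 4
edges = [(i, i + 1) for i in range(n - 1)]       # a chain, for illustration
theta_node = rng.uniform(-1, 1, size=(n, k))
theta_edge = {e: rng.uniform(-1, 1, size=(k, k)) for e in edges}

def icm(x, sweeps=50):
    x = list(x)
    for _ in range(sweeps):
        changed = False
        for s in range(n):
            # Score of each candidate label at node s with neighbors held fixed.
            score = theta_node[s].copy()
            for (a, b), th in theta_edge.items():
                if a == s:
                    score += th[:, x[b]]
                elif b == s:
                    score += th[x[a], :]
            j = int(np.argmax(score))
            if j != x[s]:
                x[s], changed = j, True
        if not changed:                           # local maximum reached
            break
    return x

x0 = rng.integers(0, k, size=n).tolist()
print("ICM labeling:", icm(x0))
```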

4. Inner Polytope Relaxations

In the previous section, we obtained a quadratic relaxation by imposing an independence constraint on the parameters μ(s,j;t,k) in equation (4). We also showed that this relaxation is actually tight, and is equivalent to the MAP problem. In this section, we show how one can think of this relaxation as the counterpart of mean field for MAP, and how any of the corresponding relaxation counterparts of structured mean field are also tight.

Consider Wainwright and Jordan (2003)'s polytope formulation of MAP in equation (3), given by

    max_x ⟨θ, φ(x)⟩ = sup_{μ ∈ M} ⟨θ, μ⟩,

where M is the convex hull of all configuration potentials φ(x). This equality follows from the fact that in a linear program, the optimum occurs at an extremal point φ(x*). Thus, if M_I ⊆ M is any subset of the marginal polytope that includes all of the vertices, then the equality

    max_x ⟨θ, φ(x)⟩ = sup_{μ ∈ M_I} ⟨θ, μ⟩

still holds. In other words, any relaxation of the indicator variables to μ(s,j;t,k) ∈ M_I would lead to a tight relaxation, as long as M_I contains all vertices. In contrast, tree-reweighted max product is not tight, since the domain set for its relaxed parameters is LOCAL(G) ⊇ M; see Section 2.1.

As described in (Wainwright & Jordan, 2003), one can think of structured mean field methods as inner polytope approximations. For the given graph G and a subgraph H, let

    E(H) = { θ : θ_st = θ_st · 1[(s,t) ∈ H] },

where θ_st is the vector of natural parameters associated with edge (s,t). Thus, the parameters not included in the subgraph are set to zero. Now for the subgraph H, we can define the following set of moment parameters:

    M(G;H) = { μ : μ = E_θ[φ(x)] for some θ ∈ E(H) }.

In essence, the moment parameters in M(G;H) must be realizable by a distribution that respects the structure of H. For any H ⊆ G, the relation M(G;H) ⊆ M(G) thus always holds, and M(G;H) is an inner polytope approximation to M. In particular, taking H to be the completely disconnected graph (i.e., no edges) H_0, we have

    M(G;H_0) = { μ(s;j), μ(s,j;t,k) : 0 ≤ μ(s;j) ≤ 1, μ(s,j;t,k) = μ(s;j) μ(t;k) },

which can be seen to be equal to the feasible set of the QP relaxation (4). For this subgraph H = H_0, the mean field relaxation thus becomes

    sup_{μ ∈ M(G;H_0)} ⟨θ, μ⟩ = sup_{μ ∈ M(G;H_0)}  Σ_{s,j} θ_{s;j} μ(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ(s,j;t,k)
                              = sup_{μ ∈ M(G;H_0)}  Σ_{s,j} θ_{s;j} μ(s;j) + Σ_{(s,t) ∈ E; j,k} θ_{s,j;t,k} μ(s;j) μ(t;k),

which is equivalent to the quadratic relaxation in equation (4). Thus, we can, in principle, use any structured mean field relaxation of the form sup_{μ ∈ M(G;H)} ⟨θ, μ⟩ to solve the MAP exactly. The caveat is that this problem, like structured mean field, is a non-convex problem. However, while structured mean field only solves for an approximate value of the log-partition function, the results from Section 3 show that its counterpart for the MAP problem is exact, if the global optimum can be found.

5. Experiments

The quadratic relaxation with the convex approximation was evaluated by comparing it against three competing methods: the linear programming relaxation (Chekuri et al., 2005), the tree-reweighted max product algorithm (Wainwright et al., 2005), and iterative conditional modes (ICM) (Besag, 1986). For tree-reweighted max product, we use the sequential update variant detailed in (Kolmogorov, 2005), which has better convergence properties than the originally proposed algorithm.

The approximate MAP algorithms were compared on different potential functions and coupling types for 2D nearest neighbor grid graphs with 100 nodes and a label set of size four. The node potentials were generated uniformly U(−1, 1), while the edge potentials were generated as a product of an edge weight and a distance function on labels. For different settings of an edge coupling-strength parameter d_coup, the edge weight was selected from U(−d_coup, d_coup) for the mixed coupling, from U(0, 2 d_coup) for the positive coupling, and from U(−2 d_coup, 0) for the negative coupling. The following four commonly used distances were used for the distance function: Ising, φ(l,m) = lm; uniform, or Potts, φ(l,m) = I(l = m); quadratic, φ(l,m) = (l − m)^2; linear, φ(l,m) = |l − m|.
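As a rough guide to what these synthetic models look like, the sketch below generates a grid model of the kind described above. It is not the authors' experimental code; the exact signs, scalings, and random seeds behind the published plots are unknown here, and the choices below are assumptions.

```python
# Minimal sketch (assumed settings) of the synthetic grid models described above.
import numpy as np

def make_grid_model(side=10, k=4, d_coup=1.0, coupling="mixed",
                    distance="potts", seed=0):
    rng = np.random.default_rng(seed)
    node = lambda r, c: r * side + c
    edges = [(node(r, c), node(r, c + 1)) for r in range(side) for c in range(side - 1)]
    edges += [(node(r, c), node(r + 1, c)) for r in range(side - 1) for c in range(side)]

    labels = np.arange(k)
    dist = {
        "ising":     np.outer(labels, labels) * 1.0,                # phi(l,m) = l*m
        "potts":     (labels[:, None] == labels[None, :]) * 1.0,    # I(l = m)
        "quadratic": (labels[:, None] - labels[None, :]) ** 2.0,
        "linear":    np.abs(labels[:, None] - labels[None, :]) * 1.0,
    }[distance]

    lo, hi = {"mixed":    (-d_coup, d_coup),
              "positive": (0.0, 2 * d_coup),
              "negative": (-2 * d_coup, 0.0)}[coupling]

    theta_node = rng.uniform(-1, 1, size=(side * side, k))          # U(-1, 1)
    theta_edge = {e: rng.uniform(lo, hi) * dist for e in edges}     # weight * distance
    return theta_node, theta_edge, edges

theta_node, theta_edge, edges = make_grid_model()
print(len(edges), "edges on a 10x10 grid")
```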
Figures 1 and 2 show plots of the value (energy) of the MAP estimates using the different algorithms for a range of model types. For any given setting of parameters and potential functions, a higher value is closer to the MAP estimate and is thus better. As the plots show, the quadratic relaxation slightly outperforms tree-reweighted max product for mixed and positive couplings, and is comparable or slightly worse for negative coupling. The quadratic approximation typically beats both ICM and the linear relaxation.

Figure 3. Comparison of ICM and tree-reweighted max product with the QP relaxation on larger graphs, using Ising potentials with mixed coupling. The right plot shows the relative differences (e − e_QP)/e for the two competing methods.

In Figure 3 we compare the MAP estimates from different algorithms on larger graphs, using the Ising potential function with mixed coupling. The quadratic relaxation is seen to outperform ICM and tree-reweighted max product, even as the number of nodes increases.

We note that since the convex approximation to the QP is a convex program, it can be solved (in polynomial time) using standard QP solvers for small problems, and for larger-scale problems one can use iterative projected conjugate gradient, which provides a fast iterative method that is guaranteed to converge. In our experiments, the computation time for the QP method was comparable to that required by tree-reweighted max product, which in turn required much less time to solve than the linear programming relaxation. This is due primarily to the fact that the linear program has |E| k^2 variables, while the convex quadratic relaxation has only nk variables, where n is the number of nodes in the graph, |E| is the number of edges, and k is the number of labels.

Figure 1. Comparison of the linear relaxation (LP), iterative conditional modes (ICM), tree-reweighted max product, and the quadratic programming relaxation (QP) on 10 × 10 grid graphs using Ising potentials (top row) and uniform potentials (bottom row) with mixed (left), positive (center) and negative (right) couplings. A better MAP estimate has a higher value.

Figure 2. Comparison of the linear relaxation (LP), iterative conditional modes (ICM), tree-reweighted max product, and the quadratic programming relaxation (QP) on 10 × 10 grid graphs using linear potentials (top row) and quadratic potentials (bottom row) with mixed (left), positive (center) and negative (right) couplings. A better MAP estimate has a higher value.

6. Conclusions

This paper has proposed a quadratic programming relaxation to the MAP problem for random fields, or the metric labeling problem. The quadratic objective function more accurately represents the energy in the graphical model while using fewer variables than the linear relaxation. It was shown that the QP relaxation is tight and computes the MAP solution exactly. However, under certain conditions the relaxation results in a non-convex problem, which requires an intractable search over local minima. This led to an additional convex approximation to the relaxation, for which there is an additive approximation guarantee. The quadratic programming approximation was also extended to general variational inner polytope relaxations that also compute the MAP exactly. Experiments demonstrated that the quadratic relaxation, with the convex approximation, can outperform existing methods under many settings, while also being computationally attractive.

Acknowledgments

This research was supported in part by NSF grants IIS and IIS. The authors thank Ramin Zabih and Vladimir Kolmogorov for helpful comments.

References

Axelsson, O., & Barker, V. A. (2001). Finite element solution of boundary value problems: theory and computation. Society for Industrial and Applied Mathematics.

Bertsimas, D., & Tsitsiklis, J. (1997). Introduction to linear optimization. Athena Scientific.

Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B.

Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell., 23.

Chekuri, C., Khanna, S., Naor, J., & Zosin, L. (2005). A linear programming formulation and approximation algorithms for the metric labeling problem. SIAM Journal on Discrete Mathematics, 18.

Greig, D., Porteous, B., & Seheult, A. (1989). Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B, 51.

Kleinberg, J., & Tardos, E. (1999). Approximation algorithms for classification problems with pairwise relationships: Metric partitioning and Markov random fields. IEEE Symposium on the Foundations of Computer Science.

Kolmogorov, V. (2005). Convergent tree-reweighted message passing for energy minimization. AISTATS.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc.

Wainwright, M., Jaakkola, T., & Willsky, A. (2005). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory, 51.

Wainwright, M. J., & Jordan, M. I. (2003). Variational inference in graphical models: The view from the marginal polytope. Allerton Conference on Communication, Control, and Computing.

Weiss, Y., & Freeman, W. T. (2001). On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001). Understanding belief propagation and its generalizations. IJCAI 2001 Distinguished Lecture track.
