An Alternating Direction Method for Dual MAP LP Relaxation

Ofer Meshi and Amir Globerson

The School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel

Abstract. Maximum a-posteriori (MAP) estimation is an important task in many applications of probabilistic graphical models. Although finding an exact solution is generally intractable, approximations based on linear programming (LP) relaxation often provide good approximate solutions. In this paper we present an algorithm for solving the LP relaxation optimization problem. In order to overcome the lack of strict convexity, we apply an augmented Lagrangian method to the dual LP. The algorithm, based on the alternating direction method of multipliers (ADMM), is guaranteed to converge to the global optimum of the LP relaxation objective. Our experimental results show that this algorithm is competitive with other state-of-the-art algorithms for approximate MAP estimation.

Keywords: Graphical Models, Maximum a-posteriori, Approximate Inference, LP Relaxation, Augmented Lagrangian Methods

1 Introduction

Graphical models are widely used to describe multivariate statistics for discrete variables, and have found widespread applications in numerous domains. One of the basic inference tasks in such models is to find the maximum a-posteriori (MAP) assignment. Unfortunately, this is typically a hard computational problem which cannot be solved exactly for many problems of interest. It has turned out that linear programming (LP) relaxations provide effective approximations to the MAP problem in many cases (e.g., see [15, 21, 24]).

Despite the theoretical computational tractability of MAP-LP relaxations, solving them in practice is a challenge for real world problems. Using off-the-shelf LP solvers is typically inadequate for large models since the resulting LPs have too many constraints and variables [29]. This has led researchers to seek optimization algorithms that are tailored to the specific structure of the MAP-LP [7, 13, 14, 16, 20, 28]. The advantage of such methods is that they work with very simple local updates and are therefore easy to implement in the large scale setting.

The suggested algorithms fall into several classes, depending on their approach to the problem. The TRW-S [14], MSD [28] and MPLP [7] algorithms

employ coordinate descent in the dual of the LP. While these methods typically show good empirical behavior, they are not guaranteed to reach the global optimum of the LP relaxation. This is a result of the non-strict convexity of the dual LP and the fact that block coordinate descent might get stuck in suboptimal points under these conditions. One way to avoid this problem is to use a soft-max function, which is smooth and strictly convex; this results in globally convergent algorithms [6, 10, 12]. Another class of algorithms [13, 16] uses the same dual objective, but employs variants of subgradient descent on it. While these methods are guaranteed to converge globally, they are typically slower in practice than the coordinate descent ones (e.g., see [13] for a comparison). Finally, there are also algorithms that optimize the primal LP directly. One example is the proximal point method of Ravikumar et al. [20]. While also globally convergent, it has the disadvantage of using a double loop scheme where every update involves an iterative algorithm for projecting onto the local polytope.

More recently, Martins et al. [17] proposed a globally convergent algorithm for MAP-LP based on the alternating direction method of multipliers (ADMM) [8, 5, 4, 2]. This method proceeds by iteratively updating primal and dual variables in order to find a saddle point of an augmented Lagrangian for the problem. They suggest to use an augmented Lagrangian of the primal MAP-LP problem. However, their formulation is restricted to binary pairwise factors and several specific global factors.

In this work, we propose an algorithm that is based on the same key idea of ADMM; however, it stems from augmenting the Lagrangian of the dual MAP-LP problem instead. An important advantage of our approach is that the resulting algorithm can be applied to models with general local factors (non-pairwise, non-binary). We also show that in practice our algorithm converges much faster than the primal ADMM algorithm and that it compares favorably with other state-of-the-art methods for MAP-LP optimization.

2 MAP and LP Relaxation

Markov Random Fields (MRFs) are probabilistic graphical models that encode the joint distribution of a set of discrete random variables $X = \{X_1, \ldots, X_n\}$. The joint probability is defined by combining a set $C$ of local functions $\theta_c(x_c)$, termed factors. The factors depend only on (small) subsets of the variables ($X_c \subseteq X$) and model the direct interactions between them (to simplify notation we drop the variable name in $X_c = x_c$; see [27]). The joint distribution is then given by: $P(x) \propto \exp\left(\sum_i \theta_i(x_i) + \sum_{c \in C} \theta_c(x_c)\right)$, where we have also included singleton factors over individual variables [27].

In many applications of MRFs we are interested in finding the maximum probability assignment (MAP assignment). This yields the optimization problem:

$$\arg\max_x \; \sum_i \theta_i(x_i) + \sum_{c \in C} \theta_c(x_c)$$

Due to its combinatorial nature, this problem is NP-hard for general graphical models, and tractable only in isolated cases such as tree structured graphs. This has motivated research on approximation algorithms.
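To make the combinatorial objective concrete, here is a minimal brute-force sketch (an illustration of ours, not part of the paper's implementation) that enumerates all assignments of a toy 3-variable model and scores each by $\sum_i \theta_i(x_i) + \sum_c \theta_c(x_c)$. Exhaustive enumeration is exponential in the number of variables, which is exactly why the relaxations discussed next are needed.

```python
import itertools
import numpy as np

# Toy model: 3 binary variables with singleton scores theta_i and two
# pairwise factors theta_c over (X0, X1) and (X1, X2).
theta_i = [np.array([0.1, 0.5]), np.array([0.0, 0.2]), np.array([0.3, 0.0])]
theta_c = {(0, 1): np.random.randn(2, 2), (1, 2): np.random.randn(2, 2)}

def map_score(x):
    """Evaluate sum_i theta_i(x_i) + sum_c theta_c(x_c) for assignment x."""
    s = sum(th[x[i]] for i, th in enumerate(theta_i))
    s += sum(th[tuple(x[i] for i in c)] for c, th in theta_c.items())
    return s

# Exhaustive search over all 2^3 assignments; exponential in general.
best = max(itertools.product(range(2), repeat=3), key=map_score)
print(best, map_score(best))
```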

One of the most successful approximation schemes has been to use LP relaxations of the MAP problem. In this approach the original combinatorial problem is posed as an LP and then some of the constraints are relaxed to obtain a tractable LP problem that approximates the original one. In our case, the resulting MAP-LP relaxation problem is:

$$\max_{\mu \in L(G)} \; \sum_i \sum_{x_i} \mu_i(x_i)\theta_i(x_i) + \sum_c \sum_{x_c} \mu_c(x_c)\theta_c(x_c) \qquad (1)$$

where $\mu$ are auxiliary variables that correspond to (pseudo) marginal distributions, and $L(G)$ is the reduced set of constraints called the local polytope [27], defined by:

$$L(G) = \left\{ \mu \ge 0 \;\middle|\; \begin{array}{ll} \sum_{x_{c \setminus i}} \mu_c(x_{c \setminus i}, x_i) = \mu_i(x_i) & \forall c,\; i : i \in c,\; x_i \\ \sum_{x_i} \mu_i(x_i) = 1 & \forall i \end{array} \right\}$$

In this paper we use the dual problem of Eq. (1), which takes the form:

$$\min_\delta \; \sum_i \max_{x_i} \left( \theta_i(x_i) + \sum_{c : i \in c} \delta_{ci}(x_i) \right) + \sum_c \max_{x_c} \left( \theta_c(x_c) - \sum_{i : i \in c} \delta_{ci}(x_i) \right) \qquad (2)$$

where $\delta$ are dual variables corresponding to the marginalization constraints in $L(G)$ (see [22, 28, 23]).¹ This formulation offers several advantages. First, it minimizes an upper bound on the true MAP value. Second, it provides an optimality certificate through the duality gap w.r.t. a decoded primal solution [23]. Third, the resulting problem is unconstrained, which facilitates its optimization.

Indeed, several algorithms have been proposed for optimizing this dual problem. The two main approaches are block coordinate descent [14, 28, 7] and subgradient descent [16], each with its advantages and disadvantages. In particular, coordinate descent algorithms are typically much faster at minimizing the dual, while the subgradient method is guaranteed to converge to the global optimum (see [23] for an in-depth discussion). Recently, Jojic et al. [13] presented an accelerated dual decomposition algorithm which stems from adding strongly convex smoothing terms to the subproblems in the dual function of Eq. (2). Their method achieves a better convergence rate than the standard subgradient method ($O(1/\epsilon)$ vs. $O(1/\epsilon^2)$). An alternative approach, which is also globally convergent, has been recently suggested by Martins et al. [17]. Their approach is based on an augmented Lagrangian method, which we discuss next.

¹ An equivalent optimization problem can be derived via a dual decomposition approach [23].
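Before moving on, here is a minimal sketch (ours, under the same toy representation as above: singleton score vectors and factor score arrays) of evaluating the upper bound of Eq. (2) for a given set of messages $\delta$. With $\delta = 0$ it reduces to the trivial bound $\sum_i \max_{x_i}\theta_i(x_i) + \sum_c \max_{x_c}\theta_c(x_c)$.

```python
import numpy as np

def dual_bound(theta_i, theta_c, delta):
    """Evaluate the objective of Eq. (2), an upper bound on the MAP value.

    theta_i: list of singleton score vectors.
    theta_c: dict mapping a factor's variable tuple c to its score array.
    delta:   dict mapping (c, i) to the message vector delta_ci over x_i.
    """
    bound = 0.0
    # Singleton terms: max_{x_i} ( theta_i(x_i) + sum_{c: i in c} delta_ci(x_i) )
    for i, th in enumerate(theta_i):
        reparam = th + sum(delta[(c, i)] for c in theta_c if i in c)
        bound += reparam.max()
    # Factor terms: max_{x_c} ( theta_c(x_c) - sum_{i in c} delta_ci(x_i) )
    for c, th in theta_c.items():
        rep = th.copy()
        for axis, i in enumerate(c):
            shape = [1] * th.ndim
            shape[axis] = -1
            rep -= delta[(c, i)].reshape(shape)  # broadcast over x_{c \ i}
        bound += rep.max()
    return bound

# Zero messages give the trivial bound:
# delta = {(c, i): np.zeros(len(theta_i[i])) for c in theta_c for i in c}
```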

3 The Alternating Direction Method of Multipliers

We now briefly review ADMM for convex optimization [8, 5, 4, 2]. Consider the following optimization problem:

$$\text{minimize } f(x) + g(z) \quad \text{s.t. } Ax = z \qquad (3)$$

where $f$ and $g$ are convex functions. The ADMM approach begins by adding the function $\frac{\rho}{2}\|Ax - z\|^2$ to the above objective, where $\rho > 0$ is a penalty parameter. This results in the optimization problem:

$$\text{minimize } f(x) + g(z) + \frac{\rho}{2}\|Ax - z\|^2 \quad \text{s.t. } Ax = z \qquad (4)$$

Clearly the above has the same optimum as Eq. (3), since when the constraints $Ax = z$ are satisfied, the added quadratic term equals zero. The Lagrangian of the augmented problem of Eq. (4) is given by:

$$L_\rho(x, z, \nu) = f(x) + g(z) + \nu^\top(Ax - z) + \frac{\rho}{2}\|Ax - z\|^2 \qquad (5)$$

where $\nu$ is a vector of Lagrange multipliers. The solution to the problem of Eq. (4) is given by $\max_\nu \min_{x,z} L_\rho(x, z, \nu)$. The ADMM method provides an elegant algorithm for finding this saddle point. The idea is to combine gradient ascent over $\nu$ with coordinate descent over the $x$ and $z$ variables. The method applies the following iterations:

$$\begin{aligned} x^{t+1} &= \arg\min_x \; L_\rho(x, z^t, \nu^t) \\ z^{t+1} &= \arg\min_z \; L_\rho(x^{t+1}, z, \nu^t) \\ \nu^{t+1} &= \nu^t + \rho\left(Ax^{t+1} - z^{t+1}\right) \end{aligned} \qquad (6)$$

The algorithm consists of primal and dual updates, where the primal update is executed sequentially, minimizing first over $x$ and then over $z$. This split retains the decomposability of the objective, which would otherwise be lost due to the addition of the quadratic term.

The algorithm is run either until the number of iterations exceeds a predefined limit, or until some termination criterion is met. A commonly used stopping criterion is: $\|Ax - z\| \le \epsilon$ and $\|z^{t+1} - z^t\| \le \epsilon$. These two conditions can serve to bound the suboptimality of the solution.

The ADMM algorithm is guaranteed to converge to the global optimum of Eq. (3) under rather mild conditions [2]. However, in terms of convergence rate, the worst case complexity of ADMM is $O(1/\epsilon^2)$. Despite this potential caveat, ADMM has been shown to work well in practice (e.g., [1, 26]). Recently, accelerated variants of the basic alternating direction method have been proposed [9]. These faster algorithms are based on linearization and come with an improved convergence rate of $O(1/\epsilon)$, achieving the theoretical lower bound for first-order methods [19]. In this paper we focus on the basic ADMM formulation and leave derivation of accelerated variants to future work.
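For concreteness, here is a minimal sketch of the iterations in Eq. (6) on a standard toy instance, the lasso, with $f(x) = \frac{1}{2}\|Cx - d\|^2$, $g(z) = \tau\|z\|_1$ and the constraint $x = z$ (so $A = I$). This is a textbook example, not the algorithm of this paper, and all names in it are ours.

```python
import numpy as np

def admm_lasso(C, d, tau, rho=1.0, iters=200):
    """ADMM iterations of Eq. (6) for min 0.5*||Cx - d||^2 + tau*||z||_1
    subject to x = z (i.e., A = I)."""
    n = C.shape[1]
    z = np.zeros(n)
    nu = np.zeros(n)                       # Lagrange multipliers
    H = C.T @ C + rho * np.eye(n)          # x-update system matrix (fixed)
    for _ in range(iters):
        # x-update: minimize f(x) + nu^T (x - z) + (rho/2)||x - z||^2
        x = np.linalg.solve(H, C.T @ d - nu + rho * z)
        # z-update: prox of tau*||.||_1 at (x + nu/rho), i.e. soft-thresholding
        w = x + nu / rho
        z = np.sign(w) * np.maximum(np.abs(w) - tau / rho, 0.0)
        # multiplier update: nu <- nu + rho*(x - z)
        nu += rho * (x - z)
    return x, z

# Example: sparse recovery from a random design.
rng = np.random.default_rng(0)
C = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [1.0, -2.0, 0.5]
x, z = admm_lasso(C, C @ x_true, tau=0.1)
```

Note how each step is cheap because the quadratic penalty couples $x$ and $z$ only through their difference; the same principle drives the algorithm of the next section.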

4 The Augmented Dual LP Algorithm

In this section we derive our algorithm by applying ADMM to the dual MAP-LP problem of Eq. (2). The challenge is to design the constraints in a way that facilitates efficient closed-form solutions for all updates. To this end, we duplicate the dual variables $\delta$ and denote the second copy by $\bar{\delta}$. We then introduce additional variables $\lambda_c$ corresponding to the summation of the $\bar{\delta}$'s pertaining to factor $c$. These agreement constraints are enforced through $\bar{\delta}$, and thus we have a constraint $\delta_{ci}(x_i) = \bar{\delta}_{ci}(x_i)$ for all $c$, $i : i \in c$, $x_i$, and $\lambda_c(x_c) = \sum_{i : i \in c} \bar{\delta}_{ci}(x_i)$ for all $c$, $x_c$. Following the ADMM framework, we add quadratic terms and obtain the augmented Lagrangian for the dual MAP-LP problem of Eq. (2):

$$\begin{aligned} L_\rho(\delta, \lambda, \bar{\delta}, \gamma, \mu) ={}& \sum_i \max_{x_i} \left( \theta_i(x_i) + \sum_{c : i \in c} \delta_{ci}(x_i) \right) + \sum_c \max_{x_c} \left( \theta_c(x_c) - \lambda_c(x_c) \right) \\ &+ \sum_c \sum_{i : i \in c} \sum_{x_i} \gamma_{ci}(x_i) \left( \delta_{ci}(x_i) - \bar{\delta}_{ci}(x_i) \right) + \frac{\rho}{2} \sum_c \sum_{i : i \in c} \sum_{x_i} \left( \delta_{ci}(x_i) - \bar{\delta}_{ci}(x_i) \right)^2 \\ &+ \sum_c \sum_{x_c} \mu_c(x_c) \left( \lambda_c(x_c) - \sum_{i : i \in c} \bar{\delta}_{ci}(x_i) \right) + \frac{\rho}{2} \sum_c \sum_{x_c} \left( \lambda_c(x_c) - \sum_{i : i \in c} \bar{\delta}_{ci}(x_i) \right)^2 \end{aligned}$$

To see the relation of this formulation to Eq. (5), notice that $(\delta, \lambda)$ subsume the role of $x$, $\bar{\delta}$ subsumes the role of $z$ (with $g(z) = 0$), and the multipliers $(\gamma, \mu)$ correspond to $\nu$.

The updates of our algorithm, which stem from Eq. (6), are summarized in Alg. 1 (a detailed derivation appears in Appendix A). In Alg. 1 we define $N(i) = \{c : i \in c\}$, and the subroutine $w = \mathrm{TRIM}(v, d)$ that serves to clip the values in the vector $v$ at some threshold $t$ (i.e., $w = \min\{v, t\}$) such that the sum of the removed parts equals $d > 0$ (i.e., $\sum(v - w) = d$). This can be carried out efficiently in linear time (in expectation) by partitioning [3].

Notice that all updates can be computed efficiently, so the cost of each iteration is similar to that of message passing algorithms like MPLP [7] or MSD [28], and to that of dual decomposition [13, 16]. Furthermore, a significant speedup is attained by caching some results for future iterations. In particular, the threshold in the TRIM subroutine (the new maximum) can serve as a good initial guess at the next iteration, especially at later iterations where the change in variable values is quite small. Finally, many of the updates can be executed in parallel. In particular, the $\delta$ update can be carried out simultaneously for all variables $i$, and likewise all factors $c$ can be updated simultaneously in the $\lambda$ and $\bar{\delta}$ updates. In addition, $\delta$ and $\lambda$ can be optimized independently, since they appear in different parts of the objective. This may result in a considerable reduction in runtime when executed on a parallel architecture. In our experiments we used sequential updates.
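The TRIM subroutine admits a simple sort-based implementation. The sketch below is our own illustration (the expected-linear-time partitioning variant of [3] is omitted for clarity): it finds the threshold $t$ such that clipping $v$ at $t$ removes exactly $d$ units of mass.

```python
import numpy as np

def trim(v, d):
    """w = TRIM(v, d): clip v at a threshold t (w = min(v, t)) such that the
    removed mass sums to d, i.e. (v - w).sum() == d. Assumes d > 0.
    Sort-based O(n log n); [3] describes an expected-linear-time variant."""
    u = np.sort(v)[::-1]                   # values in descending order
    csum = np.cumsum(u)
    for k in range(1, len(u) + 1):
        t = (csum[k - 1] - d) / k          # threshold if exactly k entries are clipped
        next_val = u[k] if k < len(u) else -np.inf
        if next_val <= t <= u[k - 1]:
            return np.minimum(v, t)
    raise ValueError("no valid threshold found (need d > 0)")

# Example: remove total mass 1.5 from the largest coordinates of v.
v = np.array([3.0, 1.0, 2.0])
w = trim(v, 1.5)                           # -> [1.75, 1.0, 1.75]
assert np.isclose((v - w).sum(), 1.5)
```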

Algorithm 1 The Augmented Dual LP Algorithm (ADLP)

for t = 1 to T do
  Update δ: for all i = 1, ..., n
    Set $\bar{\theta}_i = \theta_i + \sum_{c : i \in c} \left( \bar{\delta}_{ci} - \frac{1}{\rho}\gamma_{ci} \right)$
    $\hat{\theta}_i = \mathrm{TRIM}\left(\bar{\theta}_i, |N(i)|/\rho\right)$
    $q_i = \left(\bar{\theta}_i - \hat{\theta}_i\right)/|N(i)|$
    Update $\delta_{ci} = \bar{\delta}_{ci} - \frac{1}{\rho}\gamma_{ci} - q_i$ for all $c : i \in c$
  Update λ: for all c ∈ C
    Set $\bar{\theta}_c = \theta_c - \sum_{i : i \in c} \bar{\delta}_{ci} + \frac{1}{\rho}\mu_c$
    $\hat{\theta}_c = \mathrm{TRIM}\left(\bar{\theta}_c, 1/\rho\right)$
    Update $\lambda_c = \theta_c - \hat{\theta}_c$
  Update δ̄: for all c ∈ C, i : i ∈ c, $x_i$
    Set $v_{ci}(x_i) = \delta_{ci}(x_i) + \frac{1}{\rho}\gamma_{ci}(x_i) + \sum_{x_{c\setminus i}} \lambda_c(x_{c\setminus i}, x_i) + \frac{1}{\rho}\sum_{x_{c\setminus i}} \mu_c(x_{c\setminus i}, x_i)$
    $\bar{v}_c = \frac{1}{1 + \sum_{k : k \in c} |X_{c\setminus k}|} \sum_{k : k \in c} |X_{c\setminus k}| \sum_{x_k} v_{ck}(x_k)$
    Update $\bar{\delta}_{ci}(x_i) = \frac{1}{1 + |X_{c\setminus i}|} \left[ v_{ci}(x_i) - \sum_{j : j \in c, j \neq i} |X_{c\setminus\{i,j\}}| \left( \sum_{x_j} v_{cj}(x_j) - \bar{v}_c \right) \right]$
  Update the multipliers:
    $\gamma_{ci}(x_i) \leftarrow \gamma_{ci}(x_i) + \rho\left(\delta_{ci}(x_i) - \bar{\delta}_{ci}(x_i)\right)$ for all $c \in C$, $i : i \in c$, $x_i$
    $\mu_c(x_c) \leftarrow \mu_c(x_c) + \rho\left(\lambda_c(x_c) - \sum_{i : i \in c} \bar{\delta}_{ci}(x_i)\right)$ for all $c \in C$, $x_c$
end for

5 Experimental Results

To evaluate our augmented dual LP (ADLP) algorithm (Alg. 1) we compare it to two other algorithms for finding an approximate MAP solution. The first is MPLP of Globerson and Jaakkola [7], which minimizes the dual LP of Eq. (2) via block coordinate descent steps (cast as message passing). The second is the accelerated dual decomposition (ADD) algorithm of Jojic et al. [13].³

We conduct experiments on protein design problems from the dataset of Yanover et al. [29]. In these problems we are given a 3D structure and the goal is to find a sequence of amino-acids that is the most stable for that structure. The problems are modeled by singleton and pairwise factors and can be posed as finding a MAP assignment for the given model. This is a demanding setting in which each problem may have hundreds of variables with 100 possible states on average [29, 24].

Figure 1 shows two typical examples of protein design problems. It plots the objective of Eq. (2) (computed using the δ variables only) as a function of the execution time for all algorithms. First, in Figure 1 (left) we observe that the coordinate descent algorithm (MPLP) converges faster than the other algorithms,

³ For both algorithms we used the same C++ implementation used by Jojic et al. [13]. Our own algorithm was implemented as an extension of their package.

however it tends to stop prematurely and yield suboptimal solutions. In contrast, ADD and ADLP take longer to converge but achieve the globally optimal solution of the approximate objective. Second, it can be seen that the convergence times of ADD and ADLP are very close, with a slight advantage to ADD.

[Figure 1: two panels (problems labeled jo8 and ycc) plotting the dual objective against runtime (secs); curves: MPLP, ADD (ε=1), ADD (ε=10), ADLP (ρ=0.01), ADLP (ρ=0.05).]

Fig. 1. Comparison of three algorithms for approximate MAP estimation: our augmented dual LP algorithm (ADLP), the accelerated dual decomposition (ADD) algorithm of Jojic et al. [13], and the dual coordinate descent MPLP algorithm [7]. The figure shows two examples of protein design problems; for each, the dual objective of Eq. (2) is plotted as a function of execution time. Dashed lines denote the value of the best decoded primal solution.

The dashed lines in Figure 1 show the value of the decoded primal solution (assignment) [23]. We see that there is generally a correlation between the quality of the dual objective and the decoded primal solution, namely the decoded primal solution improves as the dual solution approaches optimality. Nevertheless, we note that there is no dominant algorithm in terms of decoding (here we show examples where our decoding is superior). In many cases MPLP yields better decoded solutions despite being suboptimal in terms of the dual objective (not shown; this is also noted in [13]).

We also conduct experiments to study the effect of the penalty parameter ρ. Our algorithm is guaranteed to converge globally for all ρ > 0, but its choice affects the actual rate of convergence. In Figure 1 (right) we compare two values of the penalty parameter, ρ = 0.01 and ρ = 0.05. It shows that setting ρ = 0.01 results in somewhat slower convergence to the optimum; however, in this case the final primal solution (dashed line) is better than that of the other algorithms. In practice, in order to choose an appropriate ρ, one can run a few iterations of ADLP with several values and see which one achieves the best objective [17].

We mention in passing that ADD employs an accuracy parameter ε which determines the desired suboptimality of the final solution [13]. Setting ε to a large value results in faster convergence to a lower accuracy solution. On the one hand, this trade-off can be viewed as a merit of ADD, which allows one to obtain coarser approximations at reduced cost. On the other hand, an advantage of our method is that the choice of the penalty ρ affects only the rate of convergence and does not impose an additional reduction in solution accuracy over that of the LP relaxation. In Figure 1 (left) we use ε = 1, as in Jojic et al., while in Figure 1 (right) we compare two values, ε = 1 and ε = 10, to demonstrate the effect of this accuracy parameter.

[Figure 2: two panels (labeled a8 and jo8) plotting the dual objective against runtime (secs); left curves: MPLP, ADD (ε=1), ADLP (ρ=0.05); right curves: ADLP, APLP.]

Fig. 2. (Left) Comparison for a side chain prediction problem, similar to Figure 1 (left). (Right) Comparison of our augmented dual LP algorithm (ADLP) and a generalized variant (APLP) of the ADMM algorithm by Martins et al. [17] on a protein design problem. The dual objective of Eq. (2) is plotted as a function of execution time. Dashed lines denote the value of the best decoded primal solution.

We next compare performance of the algorithms on a side chain prediction problem [29]. This problem is the inverse of the protein design problem, and involves finding the 3D configuration of rotamers given the backbone structure of a protein. Figure 2 (left) shows a comparison of MPLP, ADD and ADLP on one of the largest proteins in the dataset (812 variables with 12 states on average). As in the protein design problems, MPLP converges fast to a suboptimal solution. We observe that here ADLP converges somewhat faster than ADD, possibly because the smaller state space results in faster ADLP updates.

As noted earlier, Martins et al. [17] recently presented an approach that applies ADMM to the primal LP (i.e., Eq. (1)). Although their method is limited to binary pairwise factors (and several global factors), it can be modified to handle non-binary higher-order factors, as the derivation in Appendix B shows. We denote this variant by APLP. As in ADLP, in the APLP algorithm all updates are computed analytically and executed efficiently. Figure 2 (right) shows a comparison of ADLP and APLP on a protein design problem. It illustrates that ADLP converges significantly faster than APLP (similar results, not shown here, are obtained for the other proteins).

6 Discussion

Approximate MAP inference methods based on LP relaxation have drawn much attention lately due to their practical success and attractive properties. In this paper we presented a novel globally convergent algorithm for approximate MAP estimation via LP relaxation. Our algorithm is based on the augmented Lagrangian method for convex optimization, which overcomes the lack of strict convexity by adding a quadratic term to smooth the objective. Importantly, our algorithm proceeds by applying simple-to-implement closed-form updates, and

it is highly scalable and parallelizable. We have shown empirically that our algorithm compares favorably with other state-of-the-art algorithms for approximate MAP estimation in terms of accuracy and convergence time.

Several existing globally convergent algorithms for MAP-LP relaxation rely on adding local entropy terms in order to smooth the objective [6, 10, 12, 13]. Those methods must specify a temperature control parameter which affects the quality of the solution. Specifically, solving the optimization subproblems at high temperature reduces solution accuracy, while solving them at low temperature might raise numerical issues. In contrast, our algorithm is quite insensitive to the choice of such control parameters. In fact, the penalty parameter ρ affects the rate of convergence but not the accuracy or numerical stability of the algorithm. Moreover, despite the lack of fast convergence rate guarantees, in practice the algorithm has similar or better convergence times compared to other globally convergent methods in various settings. Note that [17] also show an advantage of their primal-based ADMM method over several baselines.

Several improvements over our basic algorithm can be considered. One such improvement is to use smart initialization of the variables. For example, since MPLP achieves a larger decrease in objective at early iterations, it is possible to run it for a limited number of steps and then take the resulting variables δ for the initialization of ADLP. Notice, however, that for this scheme to work well, the Lagrange multipliers γ and µ should also be initialized accordingly. Another potential improvement is to use an adaptive penalty parameter $\rho_t$ (e.g., [11]). This may improve convergence in practice, as well as reduce sensitivity to the initial choice of ρ. On the downside, the theoretical convergence guarantees of ADMM no longer hold in this case.

Martins et al. [17] show that the ADMM framework is also suitable for handling certain types of global factors, which include a large number of variables in their scope (e.g., an XOR factor). Using an appropriate formulation, it is possible to incorporate such factors in our dual LP framework as well.⁴ Finally, it is likely that our method can be further improved by using recently introduced accelerated variants of ADMM [9]. Since these variants achieve an asymptotically better convergence rate, the application of such methods to MAP-LP, similar to the one we presented here, will likely result in faster algorithms for approximate MAP estimation.

In this paper, we assumed that the model parameters were given. However, in many cases one wishes to learn these from data, for example by minimizing a prediction loss (e.g., the hinge loss [25]). We have recently shown how to incorporate dual relaxation algorithms into such learning problems [18]. It will be interesting to apply our ADMM approach in this setting to yield an efficient learning algorithm for structured prediction problems.

Acknowledgments. We thank Ami Wiesel and Elad Eban for useful discussions and comments on this manuscript. We thank Stephen Gould for his SVL code. Ofer Meshi is a recipient of the Google European Fellowship in Machine Learning, and this research is supported in part by this Google Fellowship.

⁴ The auxiliary variables $\lambda_c$ are not used in this case.

A Derivation of the Augmented Dual LP Algorithm

In this section we derive the ADMM updates for the augmented Lagrangian of the dual MAP-LP, which we restate here for convenience:

$$\begin{aligned} L_\rho(\delta, \lambda, \bar{\delta}, \gamma, \mu) ={}& \sum_i \max_{x_i} \left( \theta_i(x_i) + \sum_{c : i \in c} \delta_{ci}(x_i) \right) + \sum_c \max_{x_c} \left( \theta_c(x_c) - \lambda_c(x_c) \right) \\ &+ \sum_c \sum_{i : i \in c} \sum_{x_i} \gamma_{ci}(x_i) \left( \delta_{ci}(x_i) - \bar{\delta}_{ci}(x_i) \right) + \frac{\rho}{2} \sum_c \sum_{i : i \in c} \sum_{x_i} \left( \delta_{ci}(x_i) - \bar{\delta}_{ci}(x_i) \right)^2 \\ &+ \sum_c \sum_{x_c} \mu_c(x_c) \left( \lambda_c(x_c) - \sum_{i : i \in c} \bar{\delta}_{ci}(x_i) \right) + \frac{\rho}{2} \sum_c \sum_{x_c} \left( \lambda_c(x_c) - \sum_{i : i \in c} \bar{\delta}_{ci}(x_i) \right)^2 \end{aligned}$$

Updates:

The δ update: For each variable i = 1, ..., n consider a block $\delta_i$ which consists of $\delta_{ci}$ for all $c : i \in c$. For this block we need to minimize the following function:

$$\max_{x_i}\left(\theta_i(x_i) + \sum_{c:i\in c}\delta_{ci}(x_i)\right) + \sum_{c:i\in c}\sum_{x_i}\gamma_{ci}(x_i)\delta_{ci}(x_i) + \frac{\rho}{2}\sum_{c:i\in c}\sum_{x_i}\left(\delta_{ci}(x_i) - \bar{\delta}_{ci}(x_i)\right)^2$$

Equivalently, this can be written more compactly in vector notation as:

$$\min_{\delta_i} \; \frac{1}{2}\left\|\delta_i - \left(\bar{\delta}_i - \frac{1}{\rho}\gamma_i\right)\right\|^2 + \frac{1}{\rho}\max_{x_i}\left(\theta_i(x_i) + \sum_{c:i\in c}\delta_{ci}(x_i)\right)$$

where $\bar{\delta}_i$ and $\gamma_i$ are defined analogously to $\delta_i$. The closed-form solution to this QP is given by the update in Alg. 1. It is obtained by inspecting the KKT conditions and exploiting the structure of the summation inside the max (for a similar derivation see [23]).

The λ update: For each factor c ∈ C we seek to minimize the function:

$$\max_{x_c}\left(\theta_c(x_c) - \lambda_c(x_c)\right) + \sum_{x_c}\mu_c(x_c)\lambda_c(x_c) + \frac{\rho}{2}\sum_{x_c}\left(\lambda_c(x_c) - \sum_{i:i\in c}\bar{\delta}_{ci}(x_i)\right)^2$$

In equivalent vector notation we have the problem:

$$\min_{\lambda_c} \; \frac{1}{2}\left\|\lambda_c - \left(\sum_{i:i\in c}\bar{\delta}_{ci} - \frac{1}{\rho}\mu_c\right)\right\|^2 + \frac{1}{\rho}\max_{x_c}\left(\theta_c(x_c) - \lambda_c(x_c)\right)$$

This QP is very similar to that of the δ update and can be solved using the same technique. The resulting closed-form update is given in Alg. 1.

The δ̄ update: For each c ∈ C we consider a block which consists of $\bar{\delta}_{ci}$ for all $i : i \in c$. We seek a minimizer of the function:

$$-\sum_{i:i\in c}\sum_{x_i}\gamma_{ci}(x_i)\bar{\delta}_{ci}(x_i) + \frac{\rho}{2}\sum_{i:i\in c}\sum_{x_i}\left(\delta_{ci}(x_i) - \bar{\delta}_{ci}(x_i)\right)^2 - \sum_{x_c}\mu_c(x_c)\sum_{i:i\in c}\bar{\delta}_{ci}(x_i) + \frac{\rho}{2}\sum_{x_c}\left(\lambda_c(x_c) - \sum_{i:i\in c}\bar{\delta}_{ci}(x_i)\right)^2$$

Taking the partial derivative w.r.t. $\bar{\delta}_{ci}(x_i)$ and setting it to 0 yields:

$$\bar{\delta}_{ci}(x_i) = \frac{1}{1+|X_{c\setminus i}|}\left( v_{ci}(x_i) - \sum_{j:j\in c,\, j\neq i} |X_{c\setminus\{i,j\}}| \sum_{x_j} \bar{\delta}_{cj}(x_j) \right)$$

where:

$$v_{ci}(x_i) = \delta_{ci}(x_i) + \frac{1}{\rho}\gamma_{ci}(x_i) + \sum_{x_{c\setminus i}}\lambda_c(x_{c\setminus i}, x_i) + \frac{1}{\rho}\sum_{x_{c\setminus i}}\mu_c(x_{c\setminus i}, x_i)$$

Summing this over $x_i$ and $i : i \in c$ and plugging back in, we get the update in Alg. 1. Finally, the multipliers update is straightforward.

B Derivation of the Augmented Primal LP Algorithm

We next derive the algorithm for optimizing Eq. (1) with general local factors. Consider the following formulation, which is equivalent to the primal MAP-LP problem of Eq. (1). Define:

$$f_i(\mu_i) = \begin{cases} \sum_{x_i} \mu_i(x_i)\theta_i(x_i) & \mu_i \ge 0 \text{ and } \sum_{x_i}\mu_i(x_i) = 1 \\ -\infty & \text{otherwise} \end{cases}$$

$$f_c(\mu_c) = \begin{cases} \sum_{x_c} \mu_c(x_c)\theta_c(x_c) & \mu_c \ge 0 \text{ and } \sum_{x_c}\mu_c(x_c) = 1 \\ -\infty & \text{otherwise} \end{cases}$$

Here $f$ accounts for the non-negativity and normalization constraints in $L(G)$. We add the marginalization constraints via copies of $\mu_c$, one for each $i \in c$, denoted by $\bar{\mu}_{ci}$. Thus we get the augmented Lagrangian:

$$\begin{aligned} L_\rho(\mu, \bar{\mu}, \delta, \beta) ={}& \sum_i f_i(\mu_i) + \sum_c f_c(\mu_c) \\ &- \sum_c \sum_{i:i\in c}\sum_{x_i} \delta_{ci}(x_i)\left(\bar{\mu}_{ci}(x_i) - \mu_i(x_i)\right) - \frac{\rho}{2}\sum_c\sum_{i:i\in c}\sum_{x_i}\left(\bar{\mu}_{ci}(x_i) - \mu_i(x_i)\right)^2 \\ &- \sum_c \sum_{i:i\in c}\sum_{x_c} \beta_{ci}(x_c)\left(\bar{\mu}_{ci}(x_c) - \mu_c(x_c)\right) - \frac{\rho}{2}\sum_c\sum_{i:i\in c}\sum_{x_c}\left(\bar{\mu}_{ci}(x_c) - \mu_c(x_c)\right)^2 \end{aligned}$$

where $\bar{\mu}_{ci}(x_i) = \sum_{x_{c\setminus i}} \bar{\mu}_{ci}(x_{c\setminus i}, x_i)$. To draw the connection with Eq. (5), in this formulation $\mu$ subsumes the role of $x$, $\bar{\mu}$ subsumes the role of $z$ (with $g(z) = 0$), and the multipliers $(\delta, \beta)$ correspond to $\nu$. We next show the updates which result from applying Eq. (6) to this formulation.

Update $\mu_i$ for all i = 1, ..., n:

$$\mu_i \leftarrow \arg\max_{\mu_i} \; \mu_i^\top\left(\theta_i + \sum_{c:i\in c}\left(\delta_{ci} + \rho M_i\bar{\mu}_{ci}\right)\right) - \frac{1}{2}\mu_i^\top\left(\rho|N(i)| \cdot I\right)\mu_i$$

where $M_i\bar{\mu}_{ci} = \sum_{x_{c\setminus i}}\bar{\mu}_{ci}(x_{c\setminus i}, \cdot)$. We have to maximize this QP under simplex constraints on $\mu_i$. Notice that the objective matrix is diagonal, so this can be solved in closed form by shifting the target vector and then truncating at 0 such that the sum of the positive elements equals 1 (see [3]). The solution can be computed in linear time (in expectation) by partitioning [3].

Update $\mu_c$ for all c ∈ C:

$$\mu_c \leftarrow \arg\max_{\mu_c} \; \mu_c^\top\left(\theta_c + \sum_{i:i\in c}\left(\beta_{ci} + \rho\bar{\mu}_{ci}\right)\right) - \frac{1}{2}\mu_c^\top\left(\rho|N(c)| \cdot I\right)\mu_c$$

where $N(c) = \{i : i \in c\}$. Again we have a projection onto the simplex with a diagonal objective matrix, which can be done efficiently.

Update $\bar{\mu}_{ci}$ for all c ∈ C, i : i ∈ c:

$$\bar{\mu}_{ci} \leftarrow \arg\max_{\bar{\mu}_{ci}} \; \bar{\mu}_{ci}^\top\left(M_i^\top\left(\rho\mu_i - \delta_{ci}\right) - \beta_{ci} + \rho\mu_c\right) - \frac{\rho}{2}\bar{\mu}_{ci}^\top\left(M_i^\top M_i + I\right)\bar{\mu}_{ci}$$

Here we have an unconstrained QP, so the solution is obtained by $H^{-1}v$. Further notice that the inverse $H^{-1}$ can be computed in closed form. To see how, observe that $M_i^\top M_i$ is a block-diagonal matrix with all-ones blocks of size $|X_{c\setminus i}|$. Therefore, $H = \rho\left(M_i^\top M_i + I\right)$ is also block-diagonal. It follows that the inverse $H^{-1}$ is a block-diagonal matrix where each block is the inverse of the corresponding block in $H$. Finally, it is easy to verify that the inverse of a block $\rho\left(\mathbf{1}_{|X_{c\setminus i}|} + I_{|X_{c\setminus i}|}\right)$ is given by $\frac{1}{\rho}\left(I_{|X_{c\setminus i}|} - \frac{1}{|X_{c\setminus i}|+1}\mathbf{1}_{|X_{c\setminus i}|}\right)$, where $\mathbf{1}_n$ denotes the $n \times n$ all-ones matrix.

Update the multipliers:

$$\delta_{ci}(x_i) \leftarrow \delta_{ci}(x_i) + \rho\left(\bar{\mu}_{ci}(x_i) - \mu_i(x_i)\right) \quad \text{for all } c \in C,\; i : i \in c,\; x_i$$

$$\beta_{ci}(x_c) \leftarrow \beta_{ci}(x_c) + \rho\left(\bar{\mu}_{ci}(x_c) - \mu_c(x_c)\right) \quad \text{for all } c \in C,\; i : i \in c,\; x_c$$
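The $\mu_i$ and $\mu_c$ updates above are diagonal QPs over the simplex, i.e., Euclidean projections of a shifted target vector onto the simplex. A sort-based sketch of this projection follows (our own illustration in the style of [3]; the paper's expected-linear-time partitioning variant is omitted).

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto {p : p >= 0, sum p = 1}: shift by a
    scalar t and truncate at zero so the positive part sums to one (cf. [3])."""
    u = np.sort(y)[::-1]
    csum = np.cumsum(u) - 1.0
    ks = np.arange(1, y.size + 1)
    k = ks[u - csum / ks > 0][-1]          # number of positive entries in the result
    t = csum[k - 1] / k
    return np.maximum(y - t, 0.0)

# E.g. the mu_i update reduces to project_simplex(a / (rho * n_i)), where a is
# the shifted target vector theta_i + sum_{c: i in c} (delta_ci + rho*M_i mu_bar_ci)
# and n_i = |N(i)|, since the quadratic term's matrix is diagonal.
```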

Bibliography

[1] M. Afonso, J. Bioucas-Dias, and M. Figueiredo. Fast image recovery using variable splitting and constrained optimization. IEEE Transactions on Image Processing, 19(9):2345-2356, Sept. 2010.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2003.
[3] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272-279, 2008.
[4] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55:293-318, June 1992.
[5] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers and Mathematics with Applications, 2:17-40, 1976.
[6] K. Gimpel and N. A. Smith. Softmax-margin CRFs: training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 733-736, 2010.
[7] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems, pages 553-560, 2008.
[8] R. Glowinski and A. Marrocco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique, et Recherche Opérationnelle, 9(R-2):41-76, 1975.
[9] D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Technical report, UCLA CAM, 2010.
[10] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12):6294-6316, Dec. 2010.
[11] B. S. He, H. Yang, and S. L. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106:337-356, 2000.
[12] J. Johnson. Convex Relaxation Methods for Graphical Models: Lagrangian and Maximum Entropy Approaches. PhD thesis, EECS, MIT, 2008.
[13] V. Jojic, S. Gould, and D. Koller. Fast and smooth: Accelerated dual decomposition for MAP inference. In Proceedings of the International Conference on Machine Learning, 2010.
[14] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568-1583, 2006.

[15] N. Komodakis and N. Paragios. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In 10th European Conference on Computer Vision, pages 806-820, 2008.
[16] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):531-552, March 2011.
[17] A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. An augmented Lagrangian approach to constrained MAP inference. In International Conference on Machine Learning, June 2011.
[18] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning efficiently with approximate inference via dual losses. In Proceedings of the 27th International Conference on Machine Learning, pages 783-790, 2010.
[19] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127-152, 2005.
[20] P. Ravikumar, A. Agarwal, and M. Wainwright. Message-passing for graph-structured linear programs: proximal projections, convergence and rounding schemes. In Proceedings of the 25th International Conference on Machine Learning, pages 800-807, 2008.
[21] A. M. Rush, D. Sontag, M. Collins, and T. Jaakkola. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2010.
[22] M. I. Schlesinger. Syntactic analysis of two-dimensional visual signals in noisy conditions. Kibernetika, 4:113-130, 1976.
[23] D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning. MIT Press, 2011.
[24] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence, pages 503-510, 2008.
[25] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25-32. MIT Press, Cambridge, MA, 2004.
[26] S. Tosserams, L. Etman, P. Papalambros, and J. Rooda. An augmented Lagrangian relaxation for analytical target cascading using the alternating direction method of multipliers. Structural and Multidisciplinary Optimization, 31(3):176-189, 2006.
[27] M. J. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[28] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1165-1179, 2007.
[29] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation - an empirical study. Journal of Machine Learning Research, 7:1887-1907, 2006.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Lecture 17: Lee-Sidford Barrier

Lecture 17: Lee-Sidford Barrier CSE 599: Interplay between Convex Optmzaton and Geometry Wnter 2018 Lecturer: Yn Tat Lee Lecture 17: Lee-Sdford Barrer Dsclamer: Please tell me any mstake you notced. In ths lecture, we talk about the

More information

CS : Algorithms and Uncertainty Lecture 14 Date: October 17, 2016

CS : Algorithms and Uncertainty Lecture 14 Date: October 17, 2016 CS 294-128: Algorthms and Uncertanty Lecture 14 Date: October 17, 2016 Instructor: Nkhl Bansal Scrbe: Antares Chen 1 Introducton In ths lecture, we revew results regardng follow the regularzed leader (FTRL.

More information

Power law and dimension of the maximum value for belief distribution with the max Deng entropy

Power law and dimension of the maximum value for belief distribution with the max Deng entropy Power law and dmenson of the maxmum value for belef dstrbuton wth the max Deng entropy Bngy Kang a, a College of Informaton Engneerng, Northwest A&F Unversty, Yanglng, Shaanx, 712100, Chna. Abstract Deng

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

Report on Image warping

Report on Image warping Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpenCourseWare http://ocw.mt.edu 6.854J / 18.415J Advanced Algorthms Fall 2008 For nformaton about ctng these materals or our Terms of Use, vst: http://ocw.mt.edu/terms. 18.415/6.854 Advanced Algorthms

More information

Appendix B: Resampling Algorithms

Appendix B: Resampling Algorithms 407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles

More information

Convergent Propagation Algorithms via Oriented Trees

Convergent Propagation Algorithms via Oriented Trees Convergent Propagaton Algorthms va Orented Trees Amr Globerson CSAIL Massachusetts Insttute of Technology Cambrdge, MA 02139 Tomm Jaakkola CSAIL Massachusetts Insttute of Technology Cambrdge, MA 02139

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Suppose that there s a measured wndow of data fff k () ; :::; ff k g of a sze w, measured dscretely wth varable dscretzaton step. It s convenent to pl

Suppose that there s a measured wndow of data fff k () ; :::; ff k g of a sze w, measured dscretely wth varable dscretzaton step. It s convenent to pl RECURSIVE SPLINE INTERPOLATION METHOD FOR REAL TIME ENGINE CONTROL APPLICATIONS A. Stotsky Volvo Car Corporaton Engne Desgn and Development Dept. 97542, HA1N, SE- 405 31 Gothenburg Sweden. Emal: astotsky@volvocars.com

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information