Convergent Propagation Algorithms via Oriented Trees


Amir Globerson
CSAIL, Massachusetts Institute of Technology, Cambridge, MA

Tommi Jaakkola
CSAIL, Massachusetts Institute of Technology, Cambridge, MA

Abstract

Inference problems in graphical models are often approximated by casting them as constrained optimization problems. Message passing algorithms, such as belief propagation, have previously been suggested as methods for solving these optimization problems. However, there are few convergence guarantees for such algorithms, and the algorithms are therefore not guaranteed to solve the corresponding optimization problem. Here we present an oriented tree decomposition algorithm that is guaranteed to converge to the global optimum of the Tree-Reweighted (TRW) variational problem. Our algorithm performs local updates in the convex dual of the TRW problem, an unconstrained generalized geometric program. Primal updates, also local, correspond to oriented reparametrization operations that leave the distribution intact.

1 Introduction

The problem of probabilistic inference in graphical models refers to the task of calculating marginal distributions or the most likely assignment of variables. Both of these problems are generally NP hard, requiring approximate methods. Many approximate inference methods, including message passing algorithms, can be viewed as trying to solve a variational formulation of the inference problem. The idea in variational approaches is to cast approximate inference as a constrained minimization of a free energy function (see [14] for a recent review). Two key questions arise in this context. The first is how to choose the free energy, and the second is how to design efficient algorithms that minimize it. When the Bethe free energy is used, it has been shown [16] that fixed points of the belief propagation (BP) algorithm correspond to local minima of the free energy. However, BP is not generally guaranteed to converge to a fixed point. Although there do exist algorithms that are guaranteed to converge to a local minimum of the Bethe free energy [15, 17], its global minimization is still a hard non-convex problem for which no efficient algorithms are known.

The difficulties with the Bethe free energy derive from its non-convexity and the corresponding local minima problem. To avoid this difficulty, several authors have recently studied convex free energies [6, 7, 13]. The associated convex optimization problems can in principle be solved using generic convex optimization procedures [1] with guarantees of finding the global optimum in polynomial time. Although this presents a significant improvement over the non-convex case, the generic optimization route may be very costly in large practical problems. For example, when using a generic convex solver, every update of the variables has complexity O(n), where n is the number of variables. In contrast, the optimization using message passing algorithms can be reduced to local updates with O(1) operations. Interestingly, even in the convex setting, the convergence of these message passing algorithms is typically not guaranteed, and damping heuristics are required to ensure convergence in practice [13]. A prominent exception is [7], where the author provides a provably convergent message passing algorithm for free energies whose entropy term is a non-negative combination of joint entropies.

Here we provide a provably convergent message passing algorithm for a specific variational setup, namely the Tree-Reweighted (TRW) optimization problem of Wainwright et al. [13]. The algorithm we propose is guaranteed to converge to the global optimum of the free energy, and does not require additional parameters such as a damping ratio. A key step in obtaining the updates is deriving the convex dual of TRW, which we show to be an unconstrained instance of a generalized geometric program (GP) [3].

We derive a message passing algorithm, which we call TRW Geometric Programming (TRW-GP), that yields a monotone improvement of the dual GP. We demonstrate the utility of our TRW-GP algorithm by providing an example where the TRW message passing algorithm in [13] does not converge, but TRW-GP does.

2 The Tree-Reweighting Formulation

We consider pairwise Markov random fields (MRFs) over a set of variables x = x_1, ..., x_n. Given a graph G with n vertices V and a set of edges E, an MRF is a distribution over x defined by

p(x; θ) = (1/Z(θ)) e^{Σ_{ij∈E} θ_{ij}(x_i, x_j) + Σ_{i∈V} θ_i(x_i)}   (1)

where θ_{ij}(x_i, x_j) and θ_i(x_i) are parameters, θ denotes all the parameters, and Z(θ) is the partition function. Our focus here is on approximating singleton marginals of p(x; θ), namely p(x_i; θ). This problem is closely related to that of evaluating the partition function Z(θ). We focus on the TRW variational problem, which yields an upper bound on Z(θ) as well as a set of approximate marginals obtained from the minimizing solution. We begin by briefly reviewing the TRW formalism.

Consider a set of k spanning trees on G denoted by T_1, ..., T_k, and a distribution ρ over these trees where ρ_i ≥ 0 and Σ_i ρ_i = 1. To avoid overloading notation in the subsequent analysis, we assume here that the trees are directed, so that the same tree structure may appear multiple times with different edge orientations. This differs from the presentation in [13], though the distinction is immaterial in the remainder of this section. We also introduce the notion of pseudomarginals, defined as the singleton and pairwise marginals µ_i(x_i), µ_{ij}(x_i, x_j) associated with the nodes and edges of G. We use µ to denote the set of all these marginals and C(G) the set of µ's that are pairwise consistent:

Σ_{x_j} µ_{ij}(x_i, x_j) = µ_i(x_i),   Σ_{x_i} µ_{ij}(x_i, x_j) = µ_j(x_j),
Σ_{x_i} µ_i(x_i) = 1,   µ_{ij}(x_i, x_j) ≥ 0.

For a given tree T and µ ∈ C(G), define the entropy H(µ; T) to be the entropy of an MRF on the tree T with marginals given by µ. Note that only a subset of the pairwise distributions in µ will be used for each tree, namely the µ_{ij}(x_i, x_j) such that ij is an edge in T. The tree entropy may be written in closed form as (cf. [13])

H(µ; T) = Σ_i H(X_i) − Σ_{ij∈T} I(X_i; X_j)   (2)

where H(X_i) is the entropy of µ_i(x_i) and I(X_i; X_j) is the mutual information calculated from µ_{ij}(x_i, x_j). Note that this expression is independent of the direction of the edges in the tree. We will make use of the directed edges in the next section. Define the following variational free energy function F(µ; ρ, θ):

F(µ; ρ, θ) = −µ · θ − Σ_{i=1}^k ρ_i H(µ; T_i).   (3)

In [13] it is shown that minimizing F(µ; ρ, θ) results in an upper bound on the log-partition function:

log Z(θ) ≤ −min_{µ∈C(G)} F(µ; ρ, θ).   (4)

The minimization also results in an optimal (minimizing) µ, which is used to approximate the marginals of p(x; θ). Empirical results in [13] show that TRW usually performs as well as, and often better than, the standard Bethe free energy approximations, especially in regimes where BP fails to converge.

3 Conditional Entropies and Directed Edge Probabilities

Our goal is to use convex duality to obtain the dual problem of Eq. (4). To achieve this, we first seek a representation of F(µ; ρ, θ) that is a convex function of µ for all values of µ, and not just within the consistent set µ ∈ C(G). For example, the entropy term in Eq. (2) is concave only for µ ∈ C(G) but not for a general µ. We therefore seek an alternative expression for the tree entropy. Let r(T) be the root node of T (recall that the trees are directed). We write the entropy associated with the tree as

H(µ; T) = H(X_{r(T)}) + Σ_{ij∈T} H(X_i | X_j)   (5)

where ij ∈ T implies that there is a directed edge from vertex j to vertex i in the directed tree T. The conditional entropy H(X_i | X_j) is assumed to be calculated only on the basis of the joint marginal µ_{ij}(x_i, x_j), and does not involve µ_i(x_i). The entropy H(X_{r(T)}) is calculated via the singleton marginal µ_{r(T)}(x_{r(T)}). The expressions in Eq. (5) and Eq. (2) will agree whenever µ ∈ C(G). However, they will yield different results when µ ∉ C(G). The advantage of Eq. (5) is that H(µ; T) is now a concave function of the set of marginals µ. The concavity follows immediately from the concavity of H(X_i) as a function of µ_i(x_i) and the concavity of the conditional entropy H(X_i | X_j) as a function of µ_{ij}(x_i, x_j) [6].
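To make Eqs. (2) and (5) concrete, the following is a minimal numeric sketch (Python with NumPy; the toy joint marginal is a hypothetical example, not from the paper) checking that the two tree-entropy expressions agree for a pairwise-consistent µ:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a distribution given as a flat array."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Toy directed tree 1 -> 2 over binary variables, with a joint marginal
# mu_12 whose row/column sums define consistent singletons (mu in C(G)).
mu12 = np.array([[0.3, 0.2],
                 [0.1, 0.4]])          # mu_12(x_1, x_2)
mu1, mu2 = mu12.sum(axis=1), mu12.sum(axis=0)

# Eq. (2): sum of singleton entropies minus mutual information on tree edges.
I12 = entropy(mu1) + entropy(mu2) - entropy(mu12.ravel())
H_eq2 = entropy(mu1) + entropy(mu2) - I12

# Eq. (5): root entropy plus conditional entropies along directed edges.
H_2_given_1 = entropy(mu12.ravel()) - entropy(mu1)   # H(X_2 | X_1)
H_eq5 = entropy(mu1) + H_2_given_1

assert np.isclose(H_eq2, H_eq5)   # identical on consistent marginals
```

For an inconsistent µ (e.g., a µ_1 that is not the marginal of µ_12) the two values differ, which is precisely why Eq. (5) is the form that stays concave everywhere.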

The function F(µ; ρ, θ) involves a summation over a potentially large number of tree entropies. To express this compactly while maintaining directionality, we define ρ_{ij} as the probability that the directed edge ij is present in a tree drawn according to the distribution ρ over trees. Similarly, we define ρ_i as the probability that node i appears as a root. We note that it is possible to find such edge probabilities for distributions (e.g., uniform) over the set of all spanning trees by employing a variant of the matrix tree theorem for directed trees (see [12] p. 141 and [11]). The function F(µ; ρ, θ) can now be written as

F(µ; ρ, θ) = −µ · θ − Σ_{i∈V} ρ_i H(X_i) − Σ_{ij∈Ē} ρ_{ij} H(X_i | X_j)   (6)

where the edge set Ē contains edges in both directions. In other words, if ij ∈ Ē then ji is also in Ē. The new function F(µ; ρ, θ) is convex in µ without assuming consistency of the marginals.

4 The TRW Convex Dual

The TRW primal problem is given by

min_{µ∈C(G)} F(µ; ρ, θ).   (7)

Since the function F(µ; ρ, θ) is now convex for all µ and the set of constraints is linear, this optimization problem is convex and thus has an equivalent convex dual [1] (strict duality follows from Slater's conditions, which are satisfied in this case). However, it is not immediately clear how to derive this dual in closed form. The main difficulty is that two terms in the objective F(µ; ρ, θ) depend on µ_{ij}(x_i, x_j), namely H(X_i | X_j) and H(X_j | X_i). To get around this problem we introduce additional variables into the primal problem. Specifically, we replace µ_{ij}(x_i, x_j) by two copies, which we denote by µ_{ij}(x_i, x_j) and µ̄_{ij}(x_i, x_j), and require that these two copies be identical. The entropy H(X_i | X_j) is then evaluated via the copy µ_{ij}(x_i, x_j). We shall also find it convenient to replace the consistency constraints in C(G) by the following equivalent directed consistency constraints:

µ_{ij}(x_i, x_j) = µ̄_{ij}(x_i, x_j),
Σ_{x_j} µ_{ij}(x_i, x_j) = µ_i(x_i),   Σ_{x_i} µ̄_{ij}(x_i, x_j) = µ_j(x_j),
Σ_{x_i} µ_i(x_i) = 1,   µ_{ij}(x_i, x_j) ≥ 0,   µ̄_{ij}(x_i, x_j) ≥ 0.

For simplicity we will continue to denote the new extended variable set by µ (as we will be using it from now on) and refer to the consistency constraints by C(G). The TRW primal problem is then

PTRW:  min_{µ∈C(G)} F(µ; ρ, θ).   (8)

The convex dual of PTRW is derived in App. A, and is in fact a convex unconstrained minimization problem. In what follows we describe this dual. The dual variables will be denoted by β_{ij}(x_i, x_j) for ij ∈ E, and are not constrained (note that the β variables are not directed, i.e., there is one variable β_{ij} per edge). The dual objective is given by

F_D(β; ρ, θ) = Σ_{i∈V} ρ_i log Σ_{x_i} e^{ρ_i^{-1}(θ_i(x_i) + Σ_{k∈N(i)} λ_{ki}(x_i; β))}

where λ_{ji}(x_i; β) is a function of the β variables:

λ_{ji}(x_i; β) = ρ_{ji} log Σ_{x_j} e^{ρ_{ji}^{-1}(θ_{ij}(x_i, x_j) + δ_{ij} β_{ij}(x_i, x_j))}

and δ_{ij} is defined as δ_{ij} = 1 if ij ∈ E and δ_{ij} = −1 if ji ∈ E. The dual TRW optimization problem is then

DTRW:  min_β F_D(β; ρ, θ).   (9)

We re-emphasize the fact that DTRW is an unconstrained minimization of a function of β. The variables λ_{ji}(x_i; β) are introduced merely for the purpose of notational convenience. The mapping between dual and primal variables can be shown to be

µ_i(x_i) ∝ e^{ρ_i^{-1}(θ_i(x_i) + Σ_{k∈N(i)} λ_{ki}(x_i; β))}
µ_{ij}(x_j | x_i) ∝ e^{ρ_{ji}^{-1}(θ_{ij}(x_i, x_j) + δ_{ij} β_{ij}(x_i, x_j))}.   (10)

This relation maps the optimal β to the optimal µ, but we shall also use it for non-optimal values. The dual objective F_D(β; ρ, θ) is a convex function (see App. B) and therefore has no local minima.

5 Dual Gradient and Optimum

The DTRW problem presented above is unconstrained and can thus be solved using a variety of gradient based algorithms, such as conjugate gradient or BFGS [10]. The gradient of F_D(β; ρ, θ) w.r.t. β is

∂F_D(β; ρ, θ) / ∂β_{ij}(x_i, x_j) = µ_{ij}(x_i | x_j) µ_j(x_j) − µ_{ij}(x_j | x_i) µ_i(x_i)

where the distributions are given by the dual-to-primal mapping in Eq. (10). The gradient is thus a measure of the discrepancy between two ways of calculating the joint pairwise marginal, based on the two different orientations of the edge ij.
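As a sketch of how the mapping of Eq. (10) and the gradient can be computed locally, consider a hypothetical single-edge model: two binary variables, uniform root and directed-edge probabilities, and illustrative θ values. The helper names, the specific numbers, and the choice of which orientation gets δ = +1 are our assumptions, not the paper's:

```python
import numpy as np

theta1 = np.array([0.5, -0.5])          # theta_1(x_1)
theta2 = np.array([0.2, 0.1])           # theta_2(x_2)
theta12 = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])       # theta_12(x_1, x_2)
rho = {"1": 0.5, "2": 0.5, "12": 0.5, "21": 0.5}
beta = np.zeros((2, 2))                 # one unconstrained beta per edge

def primal_from_dual(beta):
    """Eq. (10): singleton marginals and edge conditionals from (theta, beta)."""
    a = np.exp((theta12 + beta) / rho["21"])       # orientation with delta = +1
    b = np.exp((theta12 - beta) / rho["12"])       # orientation with delta = -1
    cond_2g1 = a / a.sum(axis=1, keepdims=True)    # mu(x_2 | x_1)
    cond_1g2 = b / b.sum(axis=0, keepdims=True)    # mu(x_1 | x_2)
    # lambda messages (log-normalizers of the edge terms) enter the singletons.
    lam1 = rho["21"] * np.log(a.sum(axis=1))
    lam2 = rho["12"] * np.log(b.sum(axis=0))
    m1 = np.exp((theta1 + lam1) / rho["1"]); m1 /= m1.sum()
    m2 = np.exp((theta2 + lam2) / rho["2"]); m2 /= m2.sum()
    return m1, m2, cond_1g2, cond_2g1

mu1, mu2, cond_1g2, cond_2g1 = primal_from_dual(beta)
# Gradient of F_D at beta: the discrepancy between the two orientation-based
# estimates of the joint pairwise marginal (sign fixed by the delta convention).
grad = cond_2g1 * mu1[:, None] - cond_1g2 * mu2[None, :]
```

Each entry of the gradient touches only the variables of one edge and its endpoints, which is what makes the O(1) local updates possible.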

To characterize the optimum of DTRW we set the gradient to zero, yielding the following simple dual optimality criterion:

µ_{ij}(x_i | x_j) µ_j(x_j) = µ_{ij}(x_j | x_i) µ_i(x_i).   (11)

Thus at the optimum the two alternative ways of estimating µ_{ij}(x_i, x_j) will yield the same result. Calculating the gradient w.r.t. a given β_{ij}(x_i, x_j) has complexity O(1), and relies only on the β variables for edges containing i or j. Thus the gradient can be calculated locally, and gradient descent algorithms can be implemented efficiently. One drawback of gradient based algorithms is their reliance on line-search modules for finding a step size that decreases the objective. In the next section we consider updates that are parameter-free.

6 Local Marginal Updates

The gradient updates described in the previous section use the difference between two joint distributions. We will now focus on updates relying on the ratio between these distributions. Consider

β^{t+1}_{ij}(x_i, x_j) = β^t_{ij}(x_i, x_j) + ε log [ µ^t_{ij}(x_j | x_i) µ^t_i(x_i) / ( µ^t_{ij}(x_i | x_j) µ^t_j(x_j) ) ]   (12)

where µ^t_{ij}(x_j | x_i) and µ^t_i(x_i) are functions of β as in Eq. (10), and ε is a step size whose value will be discussed in the next section. As a ratio of two expected values, the update is reminiscent of Generalized Iterative Scaling [5]. We shall assume for simplicity that only one edge is updated at each time step t. The update in Eq. (12) is performed on the β variables. An equivalent, and somewhat simpler, update may be derived in terms of the variables µ^t_{ij}(x_i | x_j) and µ^t_j(x_j). The resulting updates and algorithm are described in Figure 1. We call the resulting algorithm TRW-GP (TRW Geometric Programming).

6.1 Convergence Proof

To analyze the convergence of the update in the previous section, we need to consider the resulting change in the objective F_D(β; ρ, θ), namely F_D(β^t; ρ, θ) − F_D(β^{t+1}; ρ, θ). It can be shown (see App. D) that this difference depends only on the µ variables in the TRW-GP algorithm, and thus we denote it by Δ_D(µ^t). Since F_D(β^t; ρ, θ) should be minimized, this difference needs to be non-negative. This is indeed guaranteed by the following lemma (see App. D):

Lemma 6.1: For 0 < ε < min(ρ_i, ρ_j, ρ_{ij}, ρ_{ji}) the dual objective is decreased at every iteration, so that Δ_D(µ^t) ≥ 0 for all t. Furthermore, Δ_D(µ^t) = 0 holds if and only if the optimality condition of Eq. (11) is satisfied.

Any choice of ε smaller than min(ρ_i, ρ_j, ρ_{ij}, ρ_{ji}) will result in a monotone improvement of the objective. In the current implementation we use ε = ½ min(ρ_i, ρ_j, ρ_{ij}, ρ_{ji}). This value turns out to minimize a first order approximation of the improvement in the objective, and was found to work well in practice. Convergence to the global optimum now follows from Lemma 6.1.

Lemma 6.2: The updates in Eq. (12) with ε as in Lemma 6.1 converge to the joint optimum of PTRW and DTRW.

Proof: Denote the mapping from µ^t to µ^{t+1} by R(µ^t) = µ^{t+1}. The mapping is clearly continuous. By Lemma 6.1 the sequence F_D(β^t; ρ, θ) is monotonically decreasing. It is also bounded, since F_D(β; ρ, θ) is bounded, and thus the difference series Δ_D(µ^t) converges to zero. Taking t to infinity then implies that µ^t has a convergent subsequence that converges to some µ*. This µ* will then satisfy F_D(µ*; ρ, θ) = F_D(R(µ*); ρ, θ). We know from Lemma 6.1 that such a point necessarily satisfies the zero gradient condition in Eq. (11), and thus µ* (or, more precisely, the corresponding β) minimizes the dual objective. (Footnote: to carefully account for the possibility that some of the converging marginals would involve zero probabilities, the updates in the primal form, along with the objective, can be written in a form without any ratios.)
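Continuing the hypothetical single-edge sketch above, the parameter-free update of Eq. (12) moves β along the log-ratio of the two joint-marginal estimates, with the step size of Lemma 6.1; iterating it drives the optimality criterion of Eq. (11) toward equality:

```python
eps = 0.5 * min(rho.values())           # Lemma 6.1: eps below the smallest rho

for t in range(500):
    mu1, mu2, cond_1g2, cond_2g1 = primal_from_dual(beta)
    num = cond_1g2 * mu2[None, :]       # joint estimated via one orientation
    den = cond_2g1 * mu1[:, None]       # joint estimated via the other
    beta = beta + eps * np.log(num / den)   # Eq. (12), up to the delta convention

mu1, mu2, cond_1g2, cond_2g1 = primal_from_dual(beta)
gap = np.abs(cond_1g2 * mu2[None, :] - cond_2g1 * mu1[:, None]).max()
print(f"max violation of Eq. (11): {gap:.2e}")   # shrinks toward zero
```

Because the step size is fixed in advance by the ρ's, no line search is needed, which is the practical advantage over plain gradient descent noted above.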
7 Tree Re-parametrization View

The TRW problem can be interpreted in terms of iterating through different re-parametrizations of the distribution p(x; θ) [13]. Here we present a related view of our algorithm. We wish to show that the marginal variables obtained by the algorithm can always be used to recover the original distribution via

p(x; θ) = c_t ∏_i µ^t_i(x_i)^{ρ_i} ∏_{ij∈Ē} µ^t_{ij}(x_i | x_j)^{ρ_{ij}}.   (13)

For t = 0 this is clearly true. We proceed by induction. Assume that at iteration t we have a reparametrization with constant c_t. Substituting the update rule in Figure 1 and using simple algebra shows that we again have a reparametrization, only with

c_{t+1} = c_t e^{F_D(β^{t+1}; ρ, θ) − F_D(β^t; ρ, θ)} = c_t e^{−Δ_D(µ^t)}.

In other words, the multiplicative constant turns out to be related to the improvement in the dual function. This creates an interesting link between reparametrization and minimization, and may be used to study message passing algorithms for which a dual is more difficult to characterize.
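Figure 1 below expresses the algorithm directly on marginals; an equivalent way to implement TRW-GP on a general pairwise model is to sweep the β update of Eq. (12) over the edges, recomputing the local marginals through Eq. (10) as in the single-edge sketch. The following is a sketch under the same assumptions; the function signature, dictionary layout, and orientation convention are ours, not the paper's:

```python
import numpy as np

def trw_gp(theta_n, theta_e, rho_n, rho_d, n_sweeps=200):
    """Dual-form TRW-GP sketch.

    theta_n: {i: 1-D array}        singleton potentials theta_i(x_i)
    theta_e: {(i, j): 2-D array}   edge potentials theta_ij(x_i, x_j), i < j
    rho_n:   {i: float}            root-appearance probabilities
    rho_d:   {(i, j): float}       directed edge probabilities (i -> j)
    """
    beta = {e: np.zeros_like(t) for e, t in theta_e.items()}

    def edge_terms(e):
        # Eq. (10): conditionals of each endpoint given the other, plus the
        # lambda log-normalizers folded into the singleton marginals.
        i, j = e
        a = np.exp((theta_e[e] + beta[e]) / rho_d[(i, j)])   # delta = +1
        b = np.exp((theta_e[e] - beta[e]) / rho_d[(j, i)])   # delta = -1
        cond_j = a / a.sum(axis=1, keepdims=True)            # mu(x_j | x_i)
        cond_i = b / b.sum(axis=0, keepdims=True)            # mu(x_i | x_j)
        lam_i = rho_d[(i, j)] * np.log(a.sum(axis=1))        # message to i
        lam_j = rho_d[(j, i)] * np.log(b.sum(axis=0))        # message to j
        return cond_i, cond_j, lam_i, lam_j

    def singletons():
        s = {v: th.copy() for v, th in theta_n.items()}
        for e in theta_e:
            i, j = e
            _, _, lam_i, lam_j = edge_terms(e)
            s[i] += lam_i
            s[j] += lam_j
        mu = {}
        for v, sv in s.items():
            p = np.exp(sv / rho_n[v] - sv.max() / rho_n[v])  # stabilized
            mu[v] = p / p.sum()
        return mu

    for _ in range(n_sweeps):
        for e in theta_e:                                    # one edge at a time
            i, j = e
            eps = 0.5 * min(rho_n[i], rho_n[j], rho_d[(i, j)], rho_d[(j, i)])
            mu = singletons()
            cond_i, cond_j, _, _ = edge_terms(e)
            num = cond_i * mu[j][None, :]
            den = cond_j * mu[i][:, None]
            beta[e] += eps * np.log(num / den)               # Eq. (12)
    return singletons()
```

For a two-node model this reduces to the single-edge iteration above; for, say, a three-node chain one would pass theta_e = {(0, 1): ..., (1, 2): ...} together with directed-edge and root probabilities derived from a chosen distribution over rooted spanning trees.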

Inputs: A graph G = (V, E), a parameter vector θ on G, root probabilities ρ_i, and directed edge probabilities ρ_{ij}, ρ_{ji} for ij ∈ E.
Initialization: Set µ^0_i(x_i) ∝ e^{ρ_i^{-1} θ_i(x_i)} and µ^0_{ij}(x_i | x_j) ∝ e^{ρ_{ij}^{-1} θ_{ij}(x_i, x_j)}.
Algorithm: Iterate until the change in the marginals is small enough: set ε = ½ min(ρ_i, ρ_j, ρ_{ij}, ρ_{ji}), and update

µ^{t+1}_i(x_i) ∝ µ^t_i(x_i) [ Σ_{x_j} µ^t_{ij}(x_j | x_i) ( µ^t_{ij}(x_i | x_j) µ^t_j(x_j) / ( µ^t_{ij}(x_j | x_i) µ^t_i(x_i) ) )^{ε ρ_{ji}^{-1}} ]^{ρ_{ji} ρ_i^{-1}}

µ^{t+1}_{ij}(x_i | x_j) ∝ µ^t_{ij}(x_i | x_j)^{1 − ε ρ_{ij}^{-1}} ( µ^t_{ij}(x_j | x_i) µ^t_i(x_i) / µ^t_j(x_j) )^{ε ρ_{ij}^{-1}}

(and symmetrically for µ^{t+1}_j and µ^{t+1}_{ij}(x_j | x_i)).
Output: Final values of the marginals.

Figure 1: The TRW-GP algorithm expressed in terms of conditional and singleton marginals.

8 Relation to Previous Work

Heskes [7] recently presented a detailed study of convex free energies. When the entropy term is a positive combination of joint and singleton entropies (and is therefore concave), he provides a local update algorithm that is monotone in the convex dual and converges to the global optimum. He then discusses the application of the same algorithm to the case where the singleton entropies all have negative weight, and the overall entropy is convex over the set of constraints. (The discussion in [7] is in terms of general regions, not just pairs; we present his argument for the simpler pairwise case.) In this case the dual is generally not given in closed form, and it is not known whether the algorithm decreases it at every step. However, Heskes argues that with sufficient damping the algorithm can be shown to converge, although the exact form of damping is not given. Since the TRW entropy can be shown to decompose into positively weighted pairwise entropies and negatively weighted singleton entropies, it satisfies the above condition in Heskes' work.

Our analysis provides several advantages over the algorithm in [7]. First, we derive a closed form solution of the dual. Second, the dual is unconstrained, and thus allows unconstrained minimization methods to be applied. Third, unlike most belief propagation variants, our algorithm is shown to provide a monotone improvement of an objective function (as mentioned above, Heskes presents such an algorithm for positively weighted singleton and pairwise entropies; it is, however, not clear that such entropies are useful in practice), and thus diverges from the standard fixed point analysis used in message passing algorithms. Finally, another algorithm that is guaranteed to converge to a global minimum of convexified free energies is the double loop CCCP algorithm of Yuille [17]. The main disadvantage of CCCP is that each iteration requires solving an optimization problem. This usually results in slower convergence; furthermore, it is not clear what precision is required for the inner loop optimization, and how this affects the convergence guarantees. The algorithm we present here is essentially a single loop method, and is thus easier to analyze.

9 Empirical Demonstration

The original TRW message passing (TRW-MP) algorithm presented in [13] is not generally guaranteed to converge. However, we observed empirically that when damping of α = 0.5 is applied to the log-messages, convergence is always achieved. (This observation is in line with Heskes' argument that sufficiently damped messages will converge for the case of the TRW free energy.) To compare TRW-MP to TRW-GP, we use the pseudomarginals generated by TRW-MP (see Equations (58) and (59) in [13]) as marginals in the primal objective F(µ; ρ, θ) in Eq. (3).

[Figure 2 shows two panels, each plotting primal/dual objective value against iteration for TRW-GP, TRW-MP, and TRW-MP (damped).]

Figure 2: Illustration of the dual message passing algorithm for an Ising model. The TRW-GP curve shows the dual objective value F_D(β; ρ, θ) obtained by the TRW-GP algorithm. The TRW-MP curves show the primal objective values F(µ; ρ, θ) obtained by TRW message passing algorithms. The damped TRW-MP used a damping of 0.5 in the log domain. The MRF parameters were set as follows: α_F = 1, α_I = 9 for the left figure, and α_F = 1, α_I = 1 for the right figure.

This value is not expected to be an upper or lower bound on the optimum of F(µ; ρ, θ), since the TRW-MP pseudomarginals are not guaranteed to be pairwise consistent, except at the optimum. However, since the TRW-MP pseudomarginals converge to the optimal primal marginals, the value F(µ; ρ, θ) will converge to the primal optimum. The progress of TRW-GP may be monitored by evaluating F_D(β; ρ, θ) at every iteration. This value is guaranteed to decrease and converge to the optimum of F_D(β; ρ, θ), which is identical to the optimum of F(µ; ρ, θ). We can thus observe the rate at which the different algorithms converge to their joint optimum.

To study the convergence rate of the two algorithms, we used an Ising model on a grid with interaction parameters θ_{ij} drawn uniformly from [−α_I, α_I] and field parameters θ_i drawn uniformly from [−α_F, α_F]. The MRF is given by

p(x; θ) ∝ e^{Σ_{ij∈E} θ_{ij} x_i x_j + Σ_{i∈V} θ_i x_i}

where x_i ∈ {+1, −1}. We used a uniform distribution over directed spanning trees, calculated as in [11]. Figure 2 (left) shows an example run where the undamped TRW-MP algorithm does not converge, but TRW-GP and the damped TRW-MP do converge, and do so at roughly the same rate. Figure 2 (right) shows an example where both TRW-MP algorithms converge, and do so at a faster rate than TRW-GP. We experimented with various values of α_F and α_I and observed that at lower interaction levels (e.g., α_I ≤ 4 for α_F = 1) the TRW-MP algorithms outperform TRW-GP, whereas at higher interaction levels the undamped TRW-MP does not converge, but the damped version converges at roughly the same rate as TRW-GP. We also experimented with conjugate gradient minimization of F_D(β; ρ, θ), but this did not yield better rates than TRW-GP.
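A hypothetical version of this test setup can be written down compactly; the grid size, seed, and helper names below are our illustrative choices, with the exact log-partition (the quantity the TRW objective upper-bounds) computed by brute-force enumeration, which is feasible only at this toy scale:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, alpha_F, alpha_I = 3, 1.0, 9.0           # n x n grid (3x3 for brevity)
nodes = [(r, c) for r in range(n) for c in range(n)]
edges = [(u, v) for u in nodes for v in nodes
         if (v[0] - u[0], v[1] - u[1]) in [(0, 1), (1, 0)]]
field = {u: rng.uniform(-alpha_F, alpha_F) for u in nodes}   # theta_i
coup = {e: rng.uniform(-alpha_I, alpha_I) for e in edges}    # theta_ij

def log_score(x):
    """Unnormalized log p(x) for spins x[u] in {-1, +1}."""
    return (sum(coup[(u, v)] * x[u] * x[v] for (u, v) in edges)
            + sum(field[u] * x[u] for u in nodes))

# Exact log-partition by enumeration over all 2^9 spin configurations.
logZ = np.logaddexp.reduce([log_score(dict(zip(nodes, s)))
                            for s in itertools.product([-1, 1], repeat=len(nodes))])
```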
10 Conclusions

We presented a novel message passing algorithm whose updates yield a monotone improvement on the dual of the TRW free energy minimization problem. In order to obtain a closed form dual we used two tricks. The first was to decouple different entropies that depend on the same marginals by introducing multiple copies of these marginals. The second was to use uni-directional consistency constraints, so that every copy of a joint marginal appears in a single consistency constraint. Although we presented the method in the context of tree decompositions, the algorithm itself still applies as long as ρ_{ij} and ρ_i are non-negative (although the upper bound on the log partition function may not be guaranteed in this case).

The TRW-GP algorithm resolves the convergence problems of the undamped TRW-MP algorithm. However, we observed empirically that the damped TRW-MP algorithm always converges, and typically at a better rate than TRW-GP. Thus, the main contribution of the current paper is in introducing a dual framework for message passing algorithms, which could be used to analyze existing algorithms, and possibly to develop faster variants in the future.

Free energies may be defined using marginals of more than two variables [13, 16]. In a recent paper [6] we study the relation between such free energies and GP. It will be worthwhile to study generalizations of TRW-GP to this case. Another interesting extension is to the MAP problem, where the corresponding variational problem is a linear program. Global convergence results for MAP message passing algorithms such as max-product are also hard to obtain in the general case. It turns out that an approach similar to the one presented here may be used to obtain convergent algorithms that solve the MAP linear program. These algorithms will be presented elsewhere.

References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press.
[3] M. Chiang. Geometric programming for communication systems. Foundations and Trends in Communications and Information Theory, 2(1):1-154.
[4] M. Chiang and S. Boyd. Geometric programming duals of channel capacity and rate distortion. IEEE Trans. on Information Theory, 50(2).
[5] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Ann. Math. Statist., 43(5), 1972.

[6] A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. In AISTATS.
[7] T. Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26.
[8] R. J. Duffin, E. L. Peterson, and C. Zener. Geometric programming. Wiley.
[9] S. Boyd, S. J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optimization and Engineering.
[10] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proc. HLT-NAACL.
[11] T. Koo, A. Globerson, X. Carreras, and M. Collins. Structured prediction models via the matrix-tree theorem. In EMNLP.
[12] W. Tutte. Graph Theory. Addison-Wesley.
[13] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7).
[14] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics.
[15] M. Welling and Y. W. Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In Uncertainty in Artificial Intelligence.
[16] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7).
[17] A. L. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14(7).

A  Deriving the TRW Dual

Our goal is to show that the problems in Eq. (8) and Eq. (9) are convex duals of each other. First, we claim that the convex dual of the PTRW problem in Eq. (8) is given by

DTRWC:  min Σ_{i∈V} ρ_i log Σ_{x_i} e^{ρ_i^{-1}(θ_i(x_i) − Σ_{k∈N(i)} λ_{ki}(x_i))}
s.t.  Σ_{x_j} e^{ρ_{ji}^{-1}(θ_{ij}(x_i, x_j) + δ_{ij} β_{ij}(x_i, x_j) + λ_{ji}(x_i))} ≤ 1.

The variables in the above problem are λ_{ji}(x_i), λ_{ij}(x_j) and β_{ij}(x_i, x_j) for every edge ij ∈ E. The duality between PTRW and DTRWC results from the duality between conditional entropy maximization and geometric programs, and appears in several works in slightly different forms [3, 8]. A derivation of the duality result can be found in [2] (page 256) and in [4]. It is important to note that the dual can be found in this case because the objective is a sum of conditional entropies (and singleton entropies), as in Eq. (6). It is not clear how to derive a dual if the tree entropies are expressed via mutual information as in Eq. (2).

Due to the complementary slackness conditions, an inequality in the constraints of DTRWC will hold with equality at the optimum iff the optimal primal variables satisfy µ_i(x_i) > 0. In App. C we show that for the current objective this always happens, i.e., µ_i(x_i) > 0 for all i and x_i. We thus conclude that all the inequality constraints in DTRWC are satisfied as equalities at the optimum. We therefore lose nothing by replacing them with the equality constraints

Σ_{x_j} e^{ρ_{ji}^{-1}(θ_{ij}(x_i, x_j) + δ_{ij} β_{ij}(x_i, x_j) + λ_{ji}(x_i))} = 1.   (14)

Since each variable λ_{ji}(x_i) appears in only one constraint, we can eliminate it by solving the corresponding equality constraint, which expresses it through the β variables as λ_{ji}(x_i) = −λ_{ji}(x_i; β), with λ_{ji}(x_i; β) as defined in Section 4:

λ_{ji}(x_i; β) = ρ_{ji} log Σ_{x_j} e^{ρ_{ji}^{-1}(θ_{ij}(x_i, x_j) + δ_{ij} β_{ij}(x_i, x_j))}.

Since the λ_{ji}(x_i) variables have been eliminated and the equality constraints are satisfied, the optimization is now only over the β variables, yielding the DTRW problem of Eq. (9):

min_β Σ_{i∈V} ρ_i log Σ_{x_i} e^{ρ_i^{-1}(θ_i(x_i) + Σ_{k∈N(i)} λ_{ki}(x_i; β))}.
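The elimination step can be sanity-checked numerically: with λ chosen as minus the log-normalizer above, the constraint of Eq. (14) holds with equality identically in β. A minimal sketch; the edge dimensions and random values are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
rho_ji, delta = 0.5, 1.0
theta = rng.normal(size=(3, 4))    # theta_ij(x_i, x_j) on a hypothetical edge
beta = rng.normal(size=(3, 4))     # dual variables for that edge

# lambda_{ji}(x_i) = -rho * log sum_{x_j} exp((theta + delta*beta) / rho)
lam = -rho_ji * np.log(np.exp((theta + delta * beta) / rho_ji).sum(axis=1))

# Eq. (14): sum_{x_j} exp((theta + delta*beta + lam(x_i)) / rho) == 1
lhs = np.exp((theta + delta * beta + lam[:, None]) / rho_ji).sum(axis=1)
assert np.allclose(lhs, 1.0)
```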
B  Convexity of the Dual

Here we argue that the function F_D(β; ρ, θ) is a convex function of β. We first define the class of posynomial functions as functions of the form [9]

f(x_1, ..., x_n) = Σ_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} ⋯ x_n^{a_{nk}}   (15)

where c_k > 0. A function f(x_1, ..., x_n) is said to be a generalized posynomial if it is either a posynomial or can be formed from generalized posynomials using the operations of addition, multiplication, positive power, maximum, and composition. A key property of generalized posynomials is that they can be turned into convex functions by a simple change of variables: if f(x) is a generalized posynomial, then F(y) = log f(e^y) is a convex function of y [9].
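A quick numerical illustration of this change of variables on a toy posynomial; the function and test points are arbitrary choices of ours:

```python
import numpy as np

# f(x1, x2) = 2*x1**0.5*x2 + 3*x2**2/x1 is a posynomial (positive
# coefficients, arbitrary real exponents), so F(y) = log f(e^y) is convex.
def F(y):
    x1, x2 = np.exp(y)
    return np.log(2 * x1**0.5 * x2 + 3 * x2**2 / x1)

rng = np.random.default_rng(2)
for _ in range(1000):
    ya, yb = rng.normal(size=2), rng.normal(size=2)
    t = rng.uniform()
    # Convexity along random chords, up to floating-point slack.
    assert F(t * ya + (1 - t) * yb) <= t * F(ya) + (1 - t) * F(yb) + 1e-9
```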

It is easy to see that f_{ji}(x_i; e^β) = e^{λ_{ji}(x_i; β)} is a generalized posynomial in e^β (since it is a positive power of a posynomial). The function

g_i(e^β) = Σ_{x_i} e^{ρ_i^{-1}(θ_i(x_i) + Σ_{k∈N(i)} λ_{ki}(x_i; β))} = Σ_{x_i} e^{ρ_i^{-1} θ_i(x_i)} ∏_{k∈N(i)} f_{ki}(x_i; e^β)^{ρ_i^{-1}}

is then also a generalized posynomial. Therefore log g_i(e^β) is a convex function of β. Since F_D(β; ρ, θ) = Σ_i ρ_i log g_i(e^β), it follows that F_D(β; ρ, θ) is a convex function of β.

C  Strict Positivity of TRW Marginals

Here we want to show that the solution of the primal problem must satisfy µ_i(x_i) > 0. To do so, we employ an alternative formulation of TRW [13]. Assign a parameter vector θ^T to every tree in the set of trees, and denote by Z(θ^T) the partition function of an MRF on tree T with parameters θ^T. TRW can then be cast as

min Σ_T ρ_T log Z(θ^T)   s.t.   Σ_T ρ_T θ^T = θ.   (16)

At the optimum, all tree distributions p(x; θ^T) can be shown to have the same singleton marginals µ_i(x_i), and these correspond to the marginals that solve PTRW. The optimization above can be rewritten as

min Σ_T ρ_T D_KL[p(x; θ) ‖ p(x; θ^T)]   s.t.   Σ_T ρ_T θ^T = θ.   (17)

The objective in Eq. (16) can be obtained from that in Eq. (17) by expanding the D_KL and using the fact that the constraints hold. The two objectives then differ by a constant, log Z(θ).

Assume that at the optimum of PTRW there exist i and x_i such that µ_i(x_i) = 0. The above argument then implies that all tree distributions will have this zero marginal. In that case, there will be an assignment x such that p(x; θ^T) = 0. On the other hand, for any finite θ the true distribution p(x; θ) is strictly greater than zero. The D_KL will then be infinite, implying that the parameters are not optimal, resulting in a contradiction. (We note that the same argument can be applied to show that the optimal pairwise marginals µ_{ij}(x_i, x_j) are never zero.)

D  Monotonicity of the Updates

Assume we perform an update on the µ variables corresponding to an edge ij ∈ E (i.e., we update µ_{ij}(x_i, x_j), µ̄_{ij}(x_i, x_j), µ_i(x_i) and µ_j(x_j)), and that ε < min(ρ_i, ρ_j, ρ_{ij}, ρ_{ji}). The resulting difference in the objective value can be written as Δ_D(µ^t) = f_i + f_j where

f_i = −ρ_i log Σ_{x_i} µ^t_i(x_i) e^{−ρ_i^{-1}(λ^t_{ji}(x_i) − λ^{t+1}_{ji}(x_i))}.   (18)

The difference λ^t_{ji}(x_i) − λ^{t+1}_{ji}(x_i) can be written in terms of µ^t as

λ^t_{ji}(x_i) − λ^{t+1}_{ji}(x_i) = −ρ_{ji} log Σ_{x_j} µ^t_{ij}(x_j | x_i)^{1 − ε ρ_{ji}^{-1}} ( µ^t_{ij}(x_i | x_j) µ^t_j(x_j) / µ^t_i(x_i) )^{ε ρ_{ji}^{-1}}
 = −ρ_{ji} log Σ_{x_j} e^{(1 − ε ρ_{ji}^{-1}) log µ^t_{ij}(x_j | x_i) + ε ρ_{ji}^{-1} log ( µ^t_{ij}(x_i | x_j) µ^t_j(x_j) / µ^t_i(x_i) )}.

Since 0 < ε < ρ_{ji}, we can use the convexity of the log-sum-exp function and the fact that Σ_{x_j} µ^t_{ij}(x_j | x_i) = 1 to obtain

λ^t_{ji}(x_i) − λ^{t+1}_{ji}(x_i) ≥ −ε log Σ_{x_j} µ^t_{ij}(x_i | x_j) µ^t_j(x_j) / µ^t_i(x_i).

Note that since log-sum-exp is strictly convex, equality here is achieved if and only if µ^t_{ij}(x_j | x_i) = µ^t_{ij}(x_i | x_j) µ^t_j(x_j) / µ^t_i(x_i), which implies the optimality condition in Eq. (11) is satisfied. Substituting this in the expression for f_i in Eq. (18) and rearranging, we have

f_i ≥ −ρ_i log Σ_{x_i} µ^t_i(x_i)^{1 − ε/ρ_i} ( Σ_{x_j} µ^t_{ij}(x_i | x_j) µ^t_j(x_j) )^{ε/ρ_i}.

The above expression is of the form −log Σ_{x_i} p(x_i)^{1−η} q(x_i)^η where p(x_i), q(x_i) are distributions over x_i. Define a distribution r(x_i) ∝ p(x_i)^{1−η} q(x_i)^η. Simple algebra then yields

−log Σ_{x_i} p(x_i)^{1−η} q(x_i)^η = (1 − η) D_KL[r ‖ p] + η D_KL[r ‖ q]

where D_KL is the KL divergence, which is non-negative. Here η = ε/ρ_i, and thus 0 < η < 1 and the above weighted sum of the two D_KL divergences is always non-negative. It follows that f_i ≥ 0 with equality if and only if the condition in Eq. (11) is satisfied. A similar argument shows that f_j ≥ 0 with equality iff Eq. (11) is satisfied. The non-negativity of Δ_D(µ^t) then follows immediately.
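The key identity behind the lemma can be verified numerically; the distributions and η below are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
eta = 0.3

w = p**(1 - eta) * q**eta
r = w / w.sum()                      # r proportional to p^(1-eta) * q^eta
kl = lambda a, b: np.sum(a * np.log(a / b))

lhs = -np.log(w.sum())
rhs = (1 - eta) * kl(r, p) + eta * kl(r, q)
assert np.isclose(lhs, rhs) and lhs >= 0   # the weighted KL sum, non-negative
```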


Approximate Smallest Enclosing Balls Chapter 5 Approxmate Smallest Enclosng Balls 5. Boundng Volumes A boundng volume for a set S R d s a superset of S wth a smple shape, for example a box, a ball, or an ellpsod. Fgure 5.: Boundng boxes Q(P

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14 APPROXIMAE PRICES OF BASKE AND ASIAN OPIONS DUPON OLIVIER Prema 14 Contents Introducton 1 1. Framewor 1 1.1. Baset optons 1.. Asan optons. Computng the prce 3. Lower bound 3.1. Closed formula for the prce

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

Temperature. Chapter Heat Engine

Temperature. Chapter Heat Engine Chapter 3 Temperature In prevous chapters of these notes we ntroduced the Prncple of Maxmum ntropy as a technque for estmatng probablty dstrbutons consstent wth constrants. In Chapter 9 we dscussed the

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

On a direct solver for linear least squares problems

On a direct solver for linear least squares problems ISSN 2066-6594 Ann. Acad. Rom. Sc. Ser. Math. Appl. Vol. 8, No. 2/2016 On a drect solver for lnear least squares problems Constantn Popa Abstract The Null Space (NS) algorthm s a drect solver for lnear

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

One-sided finite-difference approximations suitable for use with Richardson extrapolation

One-sided finite-difference approximations suitable for use with Richardson extrapolation Journal of Computatonal Physcs 219 (2006) 13 20 Short note One-sded fnte-dfference approxmatons sutable for use wth Rchardson extrapolaton Kumar Rahul, S.N. Bhattacharyya * Department of Mechancal Engneerng,

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

1 Matrix representations of canonical matrices

1 Matrix representations of canonical matrices 1 Matrx representatons of canoncal matrces 2-d rotaton around the orgn: ( ) cos θ sn θ R 0 = sn θ cos θ 3-d rotaton around the x-axs: R x = 1 0 0 0 cos θ sn θ 0 sn θ cos θ 3-d rotaton around the y-axs:

More information

Chapter 8 Indicator Variables

Chapter 8 Indicator Variables Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n

More information