A Note on Structural Extensions of SVMs


Mark Schmidt

March 29, 2009

1 Introduction

This document is intended as an introduction to structural extensions of support vector machines for those who are familiar with logistic regression (binary and multinomial) and discrete-state probabilistic graphical models (in particular, conditional random fields). No prior knowledge about support vector machines is assumed. The outline is as follows. Section 2 motivates and outlines binary support vector machines. The contents of this section are standard, and the reader is referred to [Vapnik, 1995] for more details. However, we will follow a non-standard presentation; instead of motivating support vector machines from the point of view of optimal separating hyper-planes, we focus on the relationship between logistic regression and support vector machines through likelihood ratios, a viewpoint due to S.V.N. Vishwanathan. Section 3 discusses multi-class generalizations of binary support vector machines, focusing on the NK-slack formulation of [Weston and Watkins, 1999] and the N-slack formulation of [Crammer and Singer, 2001]. Section 4 discusses extensions of the N-slack multi-class support vector machine that can model data with structured output spaces. In particular, this section focuses on hidden Markov support vector machines [Altun et al., 2003, Joachims, 2003], max-margin Markov networks [Taskar et al., 2003], and structural support vector machines [Tsochantaridis et al., 2004]. Section 5 (in progress) discusses solving the optimization problems in Section 4. There are four main approaches: (1) sub-gradient methods [Collins, 2002, Altun et al., 2003, Zhang, 2004, Shalev-Shwartz et al., 2007], (2) cutting plane and bundle methods [Tsochantaridis et al., 2004, Joachims, 2006, Teo et al., 2007, Smola et al., 2008, Joachims et al., 2009], (3) polynomial-sized reformulations [Taskar et al., 2003, Bartlett et al., 2005, Collins et al., 2008], and (4) min-max formulations [Taskar et al., 2004, Taskar et al., 2006b, Taskar et al., 2006a]. Some of the things that will not be covered in this document are a discussion of optimal separating hyper-planes, deriving the Wolfe dual formulations, the kernel trick, and computational learning theory. [Lacoste-Julien, 2003] is an accessible introduction to max-margin Markov networks that discusses these topics, and also contains introductory material on graphical models and conditional random fields.

2 Support Vector Machines

We first look at binary classification with an N by P design matrix X, and an N by 1 vector of class labels y with $y_i \in \{-1, 1\}$. We will ignore the bias term to simplify the presentation, and consider probabilities for the class labels that are proportional to the exponential of a linear function of the data,

$p(y_i = 1 \mid w, x_i) \propto \exp(w^T x_i),$

$p(y_i = -1 \mid w, x_i) \propto \exp(-w^T x_i).$

Using these class probabilities is equivalent to using a logistic regression model, and we can estimate the parameters w by maximizing the likelihood of the training data, or equivalently minimizing the negative log-likelihood

$\min_w \sum_i -\log p(y_i \mid w, x_i).$

The solution to this problem is not necessarily unique (and the optimal parameters may be unbounded). To yield a unique solution (and avoid over-fitting), we typically add a penalty on the $\ell_2$-norm of the parameter vector and compute a penalized maximum likelihood estimate

$\min_w \sum_i -\log p(y_i \mid w, x_i) + \lambda \|w\|_2^2,$

where the scalar $\lambda$ controls the strength of the regularizer. With an estimate of the parameters w, we can classify a data point $x_i$ using

$\hat{y}_i = 1$ if $p(y_i = 1 \mid w, x_i) > p(y_i = -1 \mid w, x_i)$, and $\hat{y}_i = -1$ if $p(y_i = 1 \mid w, x_i) < p(y_i = -1 \mid w, x_i)$.

But what if our goal isn't to have a good model $p(y_i \mid w, x_i)$, but rather to make the right decision for all our training set points? We can express this in terms of the following condition on the likelihood ratios:

$\frac{p(y_i \mid w, x_i)}{p(-y_i \mid w, x_i)} \geq c,$

where $c > 1$. The exact choice of c is arbitrary, since if we can satisfy this for some $c > 1$, then we can also satisfy it for any other $c > 1$ by re-scaling w. Taking logarithms, we can re-write this condition as

$\log p(y_i \mid w, x_i) - \log p(-y_i \mid w, x_i) \geq \log c,$

and plugging in the definitions of $p(y_i \mid w, x_i)$ we can write this as

$2 y_i w^T x_i \geq \log c.$

Since c is an arbitrary constant greater than 1, we will pick c so that $(1/2)\log c = 1$, so that our conditions can be written in a very simple form:

$y_i w^T x_i \geq 1.$

This is a linear feasibility problem, and it can be solved using techniques from linear programming. However, one of two things will go wrong: (i) the solution may not be unique, or (ii) there may be no solution. As before, we can make the parameters identifiable by using an $\ell_2$-norm regularizer, which leads to a quadratic program

$\min_w \lambda \|w\|_2^2, \quad \text{s.t. } y_i w^T x_i \geq 1 \text{ for all } i.$

In this case, the choice of $\lambda$ is arbitrary since solving this quadratic program will yield the solution with smallest $\ell_2$-norm for any $\lambda > 0$. To address the case where the linear feasibility problem has no solution, we will introduce a non-negative slack variable $\xi_i$ for each training case that allows the constraint to be violated for that instance, but we will penalize the $\ell_1$-norm of $\xi$ to try and minimize the violation of the constraints. This yields the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } y_i w^T x_i \geq 1 - \xi_i, \ \xi_i \geq 0,$

which has P + N variables and is the primal form of support vector machines. Specifically, this is what is known as a soft-margin support vector machine [Vapnik, 1995].
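As an aside, this constrained quadratic program is equivalent to an unconstrained "hinge loss" problem (the equivalence is derived below), and that unconstrained objective is easy to evaluate and sub-differentiate directly. A minimal NumPy sketch, assuming X is the N-by-P design matrix, y the vector of +1/-1 labels, and lam the regularization strength (the function name and interface are illustrative, not from this note):

```python
import numpy as np

def binary_svm_objective(w, X, y, lam):
    """Hinge-loss form of the soft-margin SVM: sum_i (1 - y_i w^T x_i)_+ + lam * ||w||_2^2,
    together with one subgradient with respect to w."""
    margins = y * (X @ w)                       # y_i w^T x_i for each training case
    hinge = np.maximum(0.0, 1.0 - margins)      # slack needed by each training case
    obj = hinge.sum() + lam * (w @ w)
    active = (hinge > 0).astype(float)          # cases whose constraint is violated or tight
    grad = -X.T @ (active * y) + 2.0 * lam * w  # a subgradient of the objective
    return obj, grad
```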

The support vectors are those data points where the constraint $y_i w^T x_i \geq 1 - \xi_i$ holds with equality (to keep the presentation simple, we will not discuss the dual form or the "kernel trick"). By re-arranging the constraints, we have that for all i, $\xi_i \geq 0$ and $\xi_i \geq 1 - y_i w^T x_i$, and therefore we know that $\xi_i \geq \max\{0, 1 - y_i w^T x_i\}$. Further, since we are directly minimizing a linear function of $\xi$, we can show that the following minimization,

$\min_w \sum_i (1 - y_i w^T x_i)_+ + \lambda \|w\|_2^2,$

where we use $(z)_+$ to denote $\max\{0, z\}$, has the same solution as the primal form of support vector machines. This is an unconstrained but non-differentiable problem in P variables. (If we instead considered penalizing the squared $\ell_2$-norm of the slack variables, then we could re-write the problem as a differentiable unconstrained optimization; this smooth support vector machine is due to [Lee and Mangasarian, 2001].) The term $(1 - y_i w^T x_i)_+$ is known as the hinge loss, and is an upper bound on the number of classification errors. To summarize, $\ell_2$-regularized logistic regression and support vector machines are closely related. They employ a very similar model and use the same regularization, but differ in that logistic regression seeks to maximize the (penalized) likelihood (a log-concave smooth approximation to the number of classification errors), while support vector machines seek to minimize the (penalized) constraint violations in a set of conditions on the likelihood ratios (a convex piecewise-linear upper bound on the number of classification errors).

3 Multi-Class Support Vector Machines

Viewed from the optimal separating hyper-plane perspective, it is not immediately clear how to extend binary support vector machines to the case where we consider K possible class labels. Most attempts at developing multi-class support vector machines concentrate on producing a multi-class classifier by combining the decisions made by a set of independent binary classifiers; a popular approach of this type is to use error-correcting output codes [Dietterich and Bakiri, 1995]. However, the likelihood ratio perspective of support vector machines suggests an obvious generalization to the multi-class scenario, by analogy to the extensions of logistic regression to the multi-class scenario. A natural generalization of the binary logistic regression classifier to the case of a K-class problem is to use a weight vector $w_k$ for each class,

$p(y_i = k \mid w_k, x_i) \propto \exp(w_k^T x_i).$

Parameter estimation is performed as in the binary case, and we are led to the decision rule

$\hat{y}_i = \operatorname*{argmax}_k \ p(y_i = k \mid w_k, x_i).$

Viewed in terms of likelihood ratios, we would like our decision rule to satisfy the constraints

$\forall k \neq y_i, \quad \frac{p(y_i \mid w_{y_i}, x_i)}{p(y_i = k \mid w_k, x_i)} \geq c.$

Following the same steps as before, we arrive at a set of linear constraints of the form

$\forall k \neq y_i, \quad w_{y_i}^T x_i - w_k^T x_i \geq 1.$

We will again consider minimizing the $\ell_2$-norm of w (concatenating all K of the $w_k$ vectors into one large vector), and introduce slack variables to allow for constraint violation. However, there are different ways to introduce the slack variables. The original method of [Weston and Watkins, 1999] introduces one slack variable for each of these constraints:

$\min_{w,\xi} \sum_i \sum_{k \neq y_i} \xi_{i,k} + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall k \neq y_i: \ w_{y_i}^T x_i - w_k^T x_i \geq 1 - \xi_{i,k}, \ \xi_{i,k} \geq 0,$

a quadratic program with NK + PK variables. We will refer to this formulation as the NK-slack formulation. An equivalent unconstrained optimization problem where we eliminate the slack variables is

$\min_w \sum_i \sum_{k \neq y_i} (1 - w_{y_i}^T x_i + w_k^T x_i)_+ + \lambda \|w\|_2^2,$

an unconstrained non-differentiable problem in PK variables. (If we consider penalizing the squared $\ell_2$-norm in the NK-slack formulation, then we can write the optimization as a differentiable unconstrained optimization; a demo of this multi-class extension of smooth support vector machines is available online.) Later, [Crammer and Singer, 2001] proposed what we will call the N-slack multi-class formulation, which can be motivated by re-writing the likelihood ratio in the equivalent form

$\frac{p(y_i \mid w, x_i)}{\max_{k \neq y_i} p(y_i = k \mid w_k, x_i)} \geq c.$

Although this leads to an equivalent set of constraints, the formulations differ when we introduce slack variables on this set of constraints, since each training instance is now associated with only 1 slack variable:

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall k \neq y_i: \ w_{y_i}^T x_i - w_k^T x_i \geq 1 - \xi_i, \ \xi_i \geq 0,$

a quadratic program with N + PK variables. We will refer to this as the N-slack formulation. An equivalent unconstrained optimization problem where we eliminate the slack variables is

$\min_w \sum_i \max_{k \neq y_i} (1 - w_{y_i}^T x_i + w_k^T x_i)_+ + \lambda \|w\|_2^2,$

an unconstrained non-differentiable problem in PK variables. Comparing these two formulations, we see that the NK-slack formulation is a much more punishing penalty: it punishes violation of the constraints for every competing hypothesis. In contrast, the N-slack formulation only punishes based on the violation of the most likely competing hypothesis, even if there are many likely competing hypotheses. However, the advantage of the N-slack formulation is that it leads to efficient structural extensions.
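The difference between the two penalties is easy to see if we compute them directly. A small NumPy sketch of both multi-class hinge losses, assuming W stacks the K weight vectors as rows, X is the N-by-P design matrix, and y contains integer class labels (names and interface are illustrative):

```python
import numpy as np

def multiclass_hinge_losses(W, X, y):
    """Per-example N-slack (Crammer-Singer) and NK-slack (Weston-Watkins) hinge losses.
    W is K x P, X is N x P, y is a length-N integer vector with labels in {0,...,K-1}."""
    y = np.asarray(y)
    scores = X @ W.T                      # N x K matrix of w_k^T x_i
    idx = np.arange(X.shape[0])
    true_scores = scores[idx, y]          # w_{y_i}^T x_i
    margins = 1.0 - true_scores[:, None] + scores
    margins[idx, y] = 0.0                 # exclude the k = y_i term
    violations = np.maximum(0.0, margins)
    n_slack = violations.max(axis=1)      # penalize only the most violated competitor
    nk_slack = violations.sum(axis=1)     # penalize every violated competitor
    return n_slack, nk_slack
```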

4 Hidden Markov Support Vector Machines, Max-Margin Markov Networks, Structural Support Vector Machines

We now move to the case where we no longer have a single class label $y_i$ for each training instance, but instead have a set of labels $y_{i,j}$ indexed by j. If these labels are independent, then we can simply use the methods above to fit an independent support vector machine to each label. The more interesting case occurs when the elements of the output vector are dependent. For example, we might have a sequential dependency: the label $y_{i,j}$ depends on the previous label $y_{i,j-1}$. The structural extension of logistic regression to handle this type of dependency structure is known as a linear-chain conditional random field [Lafferty et al., 2001]. In the case of tied parameters, binary labels, Ising-like potentials, and sequences of labels with the same length S (removing any of these restrictions is trivial, but would require more notation), the probability of a vector of labels $Y_i$ is written as

$p(Y_i \mid w, X_i) \propto \exp\Big( \sum_{j=1}^{S} y_{i,j}\, w_n^T x_{i,j} + \sum_{j=1}^{S-1} y_{i,j}\, y_{i,j+1}\, w_e^T x_{i,j,j+1} \Big),$

where $w_n$ represents the node parameters, $w_e$ represents the edge/transition parameters, and we allow a feature vector $x_{i,j}$ for each node and $x_{i,j,j+1}$ for each transition. Although there are other possibilities, we consider the decision rule

$\hat{Y}_i = \operatorname*{argmax}_Y \ p(Y \mid w, X_i).$

Computing this maximum by considering all possible labels may no longer be feasible, since there are now $2^S$ possible labels, but the maximum can be found in O(S) by dynamic programming in the form of the Viterbi decoding algorithm. Moving to an interpretation in terms of likelihood ratios, we would like our classifier to satisfy the criteria

$\forall Y' \neq Y_i, \quad \frac{p(Y_i \mid w, X_i)}{p(Y' \mid w, X_i)} \geq c,$

leading to the set of constraints

$\forall Y' \neq Y_i, \quad \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq 1. \quad (1)$

Introducing the (squared) $\ell_2$-norm regularizer and the N-slack constraint violation penalty, we obtain the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y' \neq Y_i: \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq 1 - \xi_i, \ \xi_i \geq 0.$

This is known as a hidden Markov support vector machine [Altun et al., 2003, Joachims, 2003]. Because it is based on the N-slack formulation, the hidden Markov support vector machine penalizes all alternative hypotheses based only on the most promising alternative hypothesis. In particular, the number of label differences between $Y_i$ and an alternative configuration is never considered. Rather than selecting the constant to be 1 across all alternative configurations, we could consider setting it to $\Delta(Y_i, Y')$, the number of label differences between $Y_i$ and $Y'$:

$\forall Y' \neq Y_i, \quad \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq \Delta(Y_i, Y'). \quad (2)$

This encourages the likelihood ratio between the true label and an alternative configuration to be higher if the alternative configuration disagrees with $Y_i$ at many nodes. With our usual procedure we obtain the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y' \neq Y_i: \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq \Delta(Y_i, Y') - \xi_i, \ \xi_i \geq 0,$

which is known as a max-margin Markov network [Taskar et al., 2003], or a structural support vector machine with margin re-scaling [Tsochantaridis et al., 2004].
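As discussed below, the maximization inside the resulting training objectives can be carried out by the Viterbi algorithm run on suitably modified node scores. A minimal sketch of this loss-augmented Viterbi decoding for a chain, assuming node_scores[j, l] and edge_scores[j, l, l'] hold the (log-domain) node and transition scores, and taking the Hamming loss as the label loss (the interface is illustrative, not from this note):

```python
import numpy as np

def loss_augmented_viterbi(node_scores, edge_scores, y_true):
    """argmax_Y [ sum_j node_scores[j, Y_j] + sum_j edge_scores[j, Y_j, Y_{j+1}] + Hamming(Y, y_true) ].
    node_scores is S x L, edge_scores is (S-1) x L x L, y_true is a length-S integer array."""
    y_true = np.asarray(y_true)
    S, L = node_scores.shape
    # Margin re-scaling: absorb the Hamming loss by adding 1 to every disagreeing node score
    aug = node_scores + (np.arange(L)[None, :] != y_true[:, None]).astype(float)
    dp, back = aug[0].copy(), np.zeros((S, L), dtype=int)
    for j in range(1, S):
        cand = dp[:, None] + edge_scores[j - 1]   # cand[prev, cur]
        back[j] = cand.argmax(axis=0)
        dp = cand.max(axis=0) + aug[j]
    y_hat = np.zeros(S, dtype=int)
    y_hat[-1] = int(dp.argmax())
    for j in range(S - 1, 0, -1):                 # backtrack the best configuration
        y_hat[j - 1] = back[j, y_hat[j]]
    return y_hat, float(dp.max())
```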

Of course, this isn't the only way to change the constraints based on the agreement between $Y_i$ and $Y'$, and [Tsochantaridis et al., 2004] argues that this formulation has the disadvantage that it tries to make the difference between competing labels that differ in many positions very large. An alternative to changing the constant factor in the likelihood ratio is to change the scaling of the slack variables and solve the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y' \neq Y_i: \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq 1 - \xi_i / \Delta(Y_i, Y'), \ \xi_i \geq 0.$

Here, we need to add more slack for a competing hypothesis that differs in many positions. This last variant is known as a structural support vector machine with slack re-scaling [Tsochantaridis et al., 2004]. Equivalent unconstrained versions of these 3 structural extensions of support vector machines are

(HMSVM) $\min_w \sum_i \max_{Y' \neq Y_i} \big(1 - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i)\big)_+ + \lambda \|w\|_2^2,$

(MMMN) $\min_w \sum_i \max_{Y' \neq Y_i} \big(\Delta(Y_i, Y') - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i)\big)_+ + \lambda \|w\|_2^2,$

(SSVM) $\min_w \sum_i \max_{Y' \neq Y_i} \big(\Delta(Y_i, Y')\,(1 - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i))\big)_+ + \lambda \|w\|_2^2.$

(We could also consider penalizing the squared values of the slack variables [Tsochantaridis et al., 2004], but this does not lead to a differentiable minimization problem due to the potential for ties in the max functions.) In the (MMMN) and (SSVM) formulations, we can use $\Delta(Y_i, Y_i) = 0$ to simplify these expressions, giving

(MMMN) $\min_w \sum_i \Big[ \max_{Y'} \big(\Delta(Y_i, Y') + \log p(Y' \mid w, X_i)\big) - \log p(Y_i \mid w, X_i) \Big] + \lambda \|w\|_2^2,$

(SSVM) $\min_w \sum_i \max_{Y'} \big(\Delta(Y_i, Y')\,(1 - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i))\big) + \lambda \|w\|_2^2.$

N.B., in these latter two expressions we can maximize over all $Y'$ instead of over all $Y' \neq Y_i$; in these cases, we can compute the objective function using the Viterbi decoding algorithm (with a suitably modified input). In contrast, the HMSVM model requires solving a maximization problem over all configurations except $Y_i$, which may be much harder to solve. (We could also consider structural extensions based on the NK-slack formulation [Altun and Hofmann, 2003]. In this case, the max functions would be replaced by sums, but it is not obvious (to me, anyway) how you would compute these sums efficiently, nor how you would deal with the exponential number of variables and constraints in the quadratic program.) Finally, we note that while the structural hinge losses in all three formulations represent an upper bound on the number of sequence errors (i.e., the number of times at least one label in the sequence was wrong), the MMMN and SSVM formulations give an upper bound on the number of label errors while the HMSVM does not.

Although this section has focused on a chain-structured dependency structure, we only used this property in two places: (i) the definition of $p(Y \mid w, X_i)$, and (ii) in efficient methods to compute maximums over the space of possible configurations. We can apply the same methodology to derive MMMNs and SSVMs for a wide variety of alternative models. For example, we can consider the cases where conditional random fields can be applied exactly, such as low tree-width structured dependencies, and probabilistic context free grammars. We can also consider models with higher-order dependencies, rather than simply pairwise dependencies, as long as the graph structure or parameterization allows us to efficiently solve the associated maximization problems. As discussed in [Taskar et al., 2004], an advantage that MMMNs/SSVMs have over conditional random fields is that we can also consider completely general graph structures (such as lattices or fully connected graphs), if we enforce the additional constraint that the edge weights satisfy a sub-modularity condition [Kolmogorov and Zabih, 2004]. In these cases, the exact inference needed for training conditional random fields is intractable, but computing the maximums needed by MMMNs/SSVMs can be formulated as a linear program (that can be efficiently solved using min-cut techniques). [Taskar et al., 2004] proposes associative max-margin Markov networks, models that take advantage of this property to allow efficient training and decoding in models with an arbitrary graph structure and arbitrary clique sizes.

In these models, the potential that all variables in a clique take the state k based on a non-negative feature x is modeled with a parameter $\lambda_k \geq 1$, while the potential for any configuration where they don't all take the same state is fixed at 1. For binary variables, the decoding problem can be solved exactly as a linear program for any graph structure, while for multi-class problems the linear program yields a worst-case approximation ratio of 1/c (where c is the largest clique size), if an appropriate discretization strategy is used. [Taskar et al., 2006b] also gives the example of weighted bipartite graph matching, where decoding can be computed in polynomial time using linear programming, but the inference needed to train a conditional random field is #P-hard. For general graphs where we do not want the sub-modularity condition, we can also consider an approximation where we use an approximate decoding method, such as loopy belief propagation or a stochastic local search method [Hutter et al., 2005]. However, theoretical and empirical work in [Finley and Joachims, 2008] shows that such under-generating decoding methods (which may return sub-optimal solutions to the decoding problem) have several disadvantages compared to so-called over-generating decoding methods. A typical example of an over-generating decoding method is a linear programming relaxation of an integer programming formulation of decoding, where in the relaxation the objective is optimized exactly over an expanded space [Kumar et al., 2007], hence yielding a fractional decoding that is a strict upper bound on the optimal decoding.

5 Training

In this section, we discuss methods for estimating the parameters of structural extensions of SVMs. We will focus on MMMNs since this makes the notation slightly simpler, but most of the techniques can be applied with only minor modifications to the case of SSVMs.

5.1 Subgradient Methods

We first consider methods that address the unconstrained non-differentiable formulation of MMMNs by moving along sub-gradients of the objective function

$f(w) = \sum_i \Big[ \max_{Y'} \big(\Delta(Y_i, Y') + \log p(Y' \mid w, X_i)\big) - \log p(Y_i \mid w, X_i) \Big] + \lambda \|w\|_2^2.$

If we use $Y_i^*$ as an argmax of the max for training example i, then a sub-gradient of this objective function is

$g(w) = \sum_i \big[ \nabla_w \log p(Y_i^* \mid w, X_i) - \nabla_w \log p(Y_i \mid w, X_i) \big] + 2\lambda w.$

The function is differentiable, and this will be the gradient, whenever all of the argmax values are unique. The negative sub-gradient direction is a descent direction at all points where the function is differentiable, which means that $f(w - \eta g(w)) < f(w)$ for sufficiently small $\eta > 0$. In contrast, at non-differentiable points (where at least one argmax is not unique) negative sub-gradient directions may not be descent directions [Bertsekas, 1999, §6.3]. However, we can guarantee that for sufficiently small $\eta$ the distance to the optimal solution is reduced, even if the function value is not [Bertsekas, 1999, §6.3]. We can therefore consider a simple sub-gradient method that uses a sequence of small step sizes $\{\eta_k\}$, where the iterations take the form

$w^{k+1} = w^k - \eta_k g(w^k).$

Several strategies for choosing the step length that lead to convergence of the iterations are discussed in [Bertsekas, 1999, §6.3], with a common choice being a sequence $\{\eta_k\}$ of positive step sizes satisfying

$\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty.$
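Putting the pieces together, one iteration of this (batch) sub-gradient method only needs a loss-augmented decoder and the joint feature map, since the normalizing constants cancel in the difference of log-probabilities. A sketch, where loss_aug_decode and features are assumed helper functions (for chains, the decoder could be the loss-augmented Viterbi sketch given earlier) and examples is a list of (X_i, Y_i) pairs of NumPy arrays:

```python
def mmmn_subgradient_step(w, examples, loss_aug_decode, features, lam, eta):
    """One step w <- w - eta * g(w), where g(w) is a subgradient of
    sum_i [ max_Y (Delta(Y_i, Y) + log p(Y | w, X_i)) - log p(Y_i | w, X_i) ] + lam * ||w||^2."""
    g = 2.0 * lam * w
    for X, Y in examples:
        Y_star = loss_aug_decode(w, X, Y)             # an argmax of the loss-augmented decoding
        g = g + features(X, Y_star) - features(X, Y)  # subgradient of the structural hinge term
    return w - eta * g
```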

Convergence of the method can also be shown if each step uses the contribution to the sub-gradient of a suitably chosen individual training example [Kushner and Yin, 2003], resulting in steps of the form

$g_i(w) = \nabla_w \log p(Y_i^* \mid w, X_i) - \nabla_w \log p(Y_i \mid w, X_i) + (2/N)\lambda w,$

$w^{k+1} = w^k - \eta_k g_i(w^k).$

This is known as a stochastic sub-gradient descent method, and was presented in [Collins, 2002] (predating the formulation of structural SVMs) for the degenerate case of $\lambda = 0$ and $\Delta(Y_i, Y') = 0$ for all $Y_i, Y'$, and where a constant step size ($\eta_k = \eta$) was used instead of a convergent sequence. An interesting alternative method was also outlined in [Collins, 2002]. In this second method a constant step size was used, and a variant on the procedure was considered where the final estimator is the average of all the iterates:

$w^{k+1} = w^k - \eta g_i(w^k), \qquad \bar{w}^{k+1} = \frac{k-1}{k}\, \bar{w}^k + \frac{1}{k}\, w^{k+1}.$

Convergence of averaged estimators of the form $\bar{w}$ is discussed in [Kushner and Yin, 2003], where it is shown that under certain conditions these steps satisfy an asymptotic statistical efficiency property. In the special case of binary support vector machines (though easily extended to structural support vector machines), [Shalev-Shwartz et al., 2007] considers iterations of the form

$w^{k+1} = \pi(w^k - \eta_k g(w^k)),$

where $\pi$ is a projection onto an $\ell_2$-ball that is guaranteed to contain the solution (the projection of a vector w onto a set $\mathcal{W}$ is defined as $\operatorname*{argmin}_{w' \in \mathcal{W}} \|w - w'\|_2$). They showed that the number of iterations to obtain a solution with accuracy $\epsilon$ using this method is $O(1/\epsilon)$ (ignoring poly-logarithmic factors), which was an improvement over the $O(1/\epsilon^2)$ number of iterations provided by the previous analysis of the averaged stochastic gradient method [Zhang, 2004] and cutting plane methods [Joachims, 2006]. [Shalev-Shwartz et al., 2007] also consider the case of a projected stochastic gradient update, and show that the iterations achieve an accuracy of $\epsilon$ with probability $1 - \delta$ in $O(1/(\delta\epsilon))$ iterations. One could also consider combining averaging and projections, whose asymptotic statistical efficiency is considered in [Kushner and Yin, 2003]. Finally, [Ratliff et al., 2007] show that the (deterministic) sub-gradient method with a sufficiently small constant step size will converge linearly (i.e., with the faster rate of $O(\log(1/\epsilon))$) to a region containing the minimum whose radius is proportional to $C/\lambda$ (where C is a bound on the sub-differential).
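A sketch of one stochastic pass in this spirit, combining the per-example sub-gradient above with a decaying step size and a projection onto an $\ell_2$-ball; the step-size schedule and the radius R are illustrative choices, not the exact constants from [Shalev-Shwartz et al., 2007], and loss_aug_decode and features are the same assumed helpers as before:

```python
import numpy as np

def stochastic_projected_epoch(w, examples, loss_aug_decode, features, lam, k_start=1):
    """One pass of projected stochastic sub-gradient descent over the training examples."""
    N = len(examples)
    R = 1.0 / np.sqrt(lam)                    # assumed radius of a ball containing the solution
    k = k_start
    for i in np.random.permutation(N):
        X, Y = examples[i]
        Y_star = loss_aug_decode(w, X, Y)
        g = features(X, Y_star) - features(X, Y) + (2.0 / N) * lam * w
        w = w - (1.0 / (lam * k)) * g         # decaying step size eta_k = 1/(lam * k)
        norm = np.linalg.norm(w)
        if norm > R:                          # project back onto the l2-ball of radius R
            w = w * (R / norm)
        k += 1
    return w, k
```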

5.2 Cutting Plane and Bundle Methods

We now consider methods for solving the quadratic programming formulation of MMMNs,

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y': \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq \Delta(Y_i, Y') - \xi_i, \ \xi_i \geq 0.$

This formulation is problematic since we have an exponential number of constraints. However, it is conceivable that we do not need to enforce all of these constraints at the optimal solution. In particular, [Tsochantaridis et al., 2004] showed that there always exists a polynomial-sized subset of the constraints such that the corresponding solution satisfies all of the constraints with accuracy at least $\epsilon$. This makes cutting-plane methods for quadratic programming an appealing approach for solving this type of problem. Cutting plane methods have a very simple structure: we first find the unconstrained minimizer. If the minimizer does not satisfy all the constraints, we find a constraint that is violated and use it to construct our initial working set W. We then solve the problem with the constraints contained in W, and if this solution still does not satisfy all the constraints we add another violated constraint to W. This continues until we have accumulated enough constraints in W so that all constraints are satisfied at the solution.

[Tsochantaridis et al., 2004] suggests the following implementation of a cutting plane algorithm for MMMNs. We iterate over the training examples, and use Viterbi decoding to find the maximum structural hinge loss. If no constraint is violated by more than $\epsilon$ then we move on to the next training example. Otherwise, we add the constraint corresponding to this alternative configuration, and optimize the objective function subject to our current working set of constraints. The algorithm terminates when we complete a full pass through the training data without adding any constraints. The quadratic programs can be efficiently solved using the Wolfe dual formulation (see the next section), which is a relatively small problem since it has only one variable for each working constraint. Further, the sequence of quadratic programs can be solved efficiently since each quadratic program differs from the previous one by only a single constraint. The main concern with using this type of approach is that the number of constraints needed might be so large that the intermediate quadratic programs may be impractical to solve. [Tsochantaridis et al., 2004] show that at most $O(1/\epsilon^2)$ constraints are required (a closely related method and analysis for HMSVMs is given in [Joachims, 2003]).

This cutting plane algorithm for structural SVMs later inspired an efficient algorithm for binary support vector machines [Joachims, 2006]. In this method, the binary support vector machine problem is written in a so-called 1-slack formulation,

$\min_{w,\xi} N\xi + \lambda \|w\|_2^2, \quad \text{s.t. } \forall c \in \{0,1\}^N: \ \sum_i c_i\, y_i w^T x_i \geq \sum_i c_i - N\xi, \ \xi \geq 0.$

Here, we have an exponential number of constraints but only 1 slack variable. [Joachims, 2006] shows that this formulation has the same optimal parameters (with $\xi = (1/N)\sum_i \xi_i$), and outlines a cutting plane algorithm for binary SVMs that is implemented in the SVMperf software. This method was extended (back) to structural SVMs in [Joachims et al., 2009], where it was shown that at most $O(1/\epsilon)$ constraints are required. This is the algorithm implemented in the current version of the SVMstruct software.

An alternative to applying cutting plane methods for quadratic programming to a constrained formulation of the MMMN problem is to apply the related cutting plane methods for non-smooth optimization to the unconstrained non-differentiable MMMN problem. This is explored in [Teo et al., 2007, Smola et al., 2008]. In cutting plane methods for optimization of (non-smooth) convex functions, we use the definition of a sub-gradient at a point $w_0$ to yield a global lower bound on the function,

$f(w) \geq f(w_0) + (w - w_0)^T g(w_0),$

that is exact at $w_0$. Cutting plane methods for non-smooth optimization proceed by collecting a set of these lower-bounding hyper-planes, and at each iteration generate a new trial point by finding the minimum of these lower bounds. The function value and sub-gradient at the new trial point are then added to the working set, yielding a more accurate lower bound. A disadvantage of this procedure is that the minimum of the set of lower bounds may suggest a trial point that is far away from the previous point, even if the previous point was close to the optimal solution. This is known as instability [Bertsekas, 1999, §6.3]. A common way to address this is to minimize the lower bound subject to a regularizer $\|w^{k+1} - w^k\|_2^2$ on the difference between subsequent trial values. This is known as a bundle method. The method of [Teo et al., 2007, Smola et al., 2008] is closely related to bundle methods, but differs in the regularizer.
Rather than building a lower bound for the full objective function and using the regularizer $\|w^{k+1} - w^k\|_2^2$, they build a sequence of lower bounds for the hinge loss and use the exact regularizer $\lambda \|w\|_2^2$ already present in the objective function. In [Smola et al., 2008], it is shown that the method requires $O(1/\epsilon)$ iterations to converge ([Smola et al., 2008] also show that applying their method to train conditional random fields with $\ell_2$-regularization achieves the faster rate of $O(\log(1/\epsilon))$), and that the method offers a speed advantage over the method of [Shalev-Shwartz et al., 2007].
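A sketch of the working-set strategy described at the start of this subsection, in the spirit of [Tsochantaridis et al., 2004] rather than a faithful reproduction of their algorithm; solve_qp is an assumed helper that solves the quadratic program restricted to the constraints collected so far and returns (w, xi), delta is the label loss, and loss_aug_decode and features are as before:

```python
import numpy as np

def cutting_plane_training(examples, loss_aug_decode, features, delta, solve_qp, lam, eps):
    """Accumulate most-violated constraints until every constraint is satisfied to within eps."""
    working_set = [[] for _ in examples]             # one list of constraints per training example
    w, xi = solve_qp(working_set, lam)               # starts from the (essentially unconstrained) problem
    added = True
    while added:
        added = False
        for i, (X, Y) in enumerate(examples):
            Y_star = loss_aug_decode(w, X, Y)        # most violated configuration for example i
            violation = delta(Y, Y_star) + w @ features(X, Y_star) - w @ features(X, Y)
            if violation > xi[i] + eps:              # only add constraints violated by more than eps
                working_set[i].append(Y_star)
                w, xi = solve_qp(working_set, lam)   # re-optimize over the enlarged working set
                added = True
    return w
```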

5.3 Polynomial-Size Reformulations

In this section, we examine an alternative to cutting plane methods for dealing with the exponential-sized quadratic programs. In particular, we consider methods that take advantage of the sparse dependency structure in the underlying distribution to rewrite the exponential-sized quadratic program as a polynomial-sized problem. We will largely follow [Taskar et al., 2003] and will work with a dual formulation of the MMMN quadratic program. In this dual formulation, we will have a variable $\alpha_i(Y')$ for each possible configuration $Y'$ of training example i. We will also find it convenient to use the feature representation of the probability functions, $p(Y \mid w, X_i) \propto \exp(w^T F(X_i, Y))$, and use the notation $\Delta F_i(Y') \equiv F(X_i, Y_i) - F(X_i, Y')$. Using this notation, [Taskar et al., 2003] shows that the following problem is dual to the MMMN quadratic program:

$\max_{\alpha} \sum_i \sum_{Y'} \alpha_i(Y')\, \Delta(Y_i, Y') - \frac{1}{2} \sum_i \sum_j \sum_{Y'} \sum_{Y''} \alpha_i(Y')\, \alpha_j(Y'')\, \Delta F_i(Y')^T \Delta F_j(Y''),$

$\text{s.t. } \forall i: \ \sum_{Y'} \alpha_i(Y') = \frac{1}{2\lambda}, \quad \alpha_i(Y') \geq 0.$

Although this dual formulation has an exponential number of constraints and an exponential number of variables, note the simple form of the constraints. If $\lambda = 1/2$, then the constraints simply enforce that $\alpha_i(Y')$ is a valid probability distribution over all values of $Y'$ (for each training example i). For other values of $\lambda$, they enforce that $\alpha_i(Y')$ is an unnormalized distribution (with normalizing constant $1/(2\lambda)$). If our dependency structure factorizes over nodes and edges in a graph, then we can consider a parameterization in terms of expectations with respect to this distribution,

$\mu_i(y_j) = \sum_{Y' \sim [y_j]} \alpha_i(Y'), \qquad \mu_i(y_j, y_k) = \sum_{Y' \sim [y_j, y_k]} \alpha_i(Y'),$

where the notation $Y' \sim [y_j]$ is used to denote that a configuration $Y'$ is consistent with the label $y_j$. When we consider using these variables, we will require that the node parameters satisfy the constraints of the original problem,

$\forall j: \ \mu_i(y_j) \geq 0, \qquad \sum_{y_j} \mu_i(y_j) = \frac{1}{2\lambda}.$

To ensure that the edge variables form a legal density, they must belong to the marginal polytope. In the case of tree-structured graphs, this is equivalent to the requirement that [Taskar et al., 2003]

$\forall (j,k) \in E: \ \sum_{y_j} \mu_i(y_j, y_k) = \mu_i(y_k), \qquad \forall (j,k) \in E: \ \mu_i(y_j, y_k) \geq 0.$

Using these expectations, we can re-write the set of first terms in the dual formulation as

$\sum_{Y'} \alpha_i(Y')\, \Delta(Y_i, Y') = \sum_{Y'} \alpha_i(Y') \sum_j \Delta_j(y_{i,j}, y'_j) = \sum_j \sum_{y'_j} \Delta_j(y_{i,j}, y'_j)\, \mu_i(y'_j).$

This is now a sum over individual label assignments rather than over joint assignments.
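Under the Hamming loss used above, this factored first term is just the expected number of label errors under the node marginals. A small sketch evaluating it for a single training sequence, assuming node_marginals[j, l] stores $\mu_i(y_j = l)$ (possibly unnormalized) and the per-node loss $\Delta_j$ is the 0/1 disagreement:

```python
import numpy as np

def expected_hamming_term(node_marginals, y_true):
    """sum_j sum_{y'_j} Delta_j(y_{i,j}, y'_j) mu_i(y'_j) with Delta_j the 0/1 disagreement loss.
    node_marginals is S x L; y_true is a length-S integer array."""
    y_true = np.asarray(y_true)
    S = node_marginals.shape[0]
    mass_on_truth = node_marginals[np.arange(S), y_true]   # marginal mass on the correct label
    return float(node_marginals.sum() - mass_on_truth.sum())
```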

We can similarly re-write the second term in the dual in terms of edge expectations. This leads to what [Taskar et al., 2003] calls the factored dual,

$\max_{\mu} \sum_i \sum_j \sum_{y'_j} \Delta_j(y_{i,j}, y'_j)\, \mu_i(y'_j) - \frac{1}{2} \sum_{i,i'} \sum_{(j,k) \in E} \sum_{(j',k') \in E} \sum_{y'_j, y'_k} \sum_{y'_{j'}, y'_{k'}} \mu_i(y'_j, y'_k)\, \mu_{i'}(y'_{j'}, y'_{k'})\, F(X_i, y'_j, y'_k)^T F(X_{i'}, y'_{j'}, y'_{k'}),$

$\text{s.t. } \forall i, j: \ \mu_i(y_j) \geq 0, \ \sum_{y_j} \mu_i(y_j) = \frac{1}{2\lambda}, \qquad \forall i, (j,k) \in E: \ \sum_{y_j} \mu_i(y_j, y_k) = \mu_i(y_k), \ \mu_i(y_j, y_k) \geq 0.$

While this formulation is now polynomial-sized, the quadratic form has a very large number of terms (i.e., it is quadratic in the number of training examples and instances). [Taskar et al., 2003] proposes a method, known as the structural SMO method, for solving this optimization problem. In this method, violation of the constrained optimality conditions is used to select the two most violating variables, and their optimal value given all the other variables is computed in closed form to update the variables.

[Taskar et al., 2003] discusses extensions of the above method to graph structures that are not forests. For triangulated graphs, additional constraints can be imposed to ensure that the distribution remains in the marginal polytope, although the number of additional constraints is exponential in the worst case. If we have an untriangulated graph, or do not want to triangulate and add these extra constraints, the factored dual can be solved as an approximation (analogous to loopy belief propagation). It is also possible to derive factored duals when the original distribution factorizes over higher-order cliques. Finally, [Taskar et al., 2003] discusses a factored version of the primal problem, where we have an extended set of slack variables that mimic the dependency structure in the graph, and constrained dual variables enforcing the constraints of the marginal polytope.

An alternative method for optimizing the dual formulation using a polynomial-sized number of variables is outlined in [Bartlett et al., 2005, Collins et al., 2008]. In this method, the exponentiated gradient algorithm is used to (indirectly) optimize the variables $\alpha$ in the dual formulation. The exponentiated gradient algorithm [Kivinen and Warmuth, 1997] is a method to optimize a function $f(x)$, where x lies on the probability simplex, $x \geq 0$, $\sum_i x_i = 1$ ([Kivinen and Warmuth, 1997] also outline an extension that can handle non-negative variables subject to the $\ell_1$-norm of the vector being 1, and extensions where the constant 1 is replaced by another known constant C). The exponentiated gradient algorithm takes steps of the form

$x_i^{new} = \frac{x_i \exp(-\eta\, \nabla_i f(x))}{\sum_j x_j \exp(-\eta\, \nabla_j f(x))},$

where as before $\eta$ is a learning rate, and we can consider applying the algorithm using the contribution to the gradient of one training example at a time instead of all the training examples at once. [Bartlett et al., 2005, Collins et al., 2008] work with a slightly modified dual formulation where the constraints take the form of a probability simplex (instead of an unnormalized version), and apply the exponentiated gradient algorithm to this problem. To avoid dealing with the exponential number of variables $\alpha$, they use an implicit representation for $\alpha$ in terms of variables that are analogous to the $\mu$ variables above. Similar to the methods from previous sections, they show that the method requires $O(1/\epsilon)$ iterations to reach an accuracy of $\epsilon$, and that this applies whether the full gradient or the gradient contribution from an individual training example is used ([Collins et al., 2008] also show that the faster rate of $O(\log(1/\epsilon))$ is achieved when the method is applied to optimizing a dual formulation of training conditional random fields).
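The update itself is a one-liner; a sketch of a single exponentiated gradient step on a simplex-constrained variable, where alpha and grad are NumPy arrays and eta is the learning rate:

```python
import numpy as np

def exponentiated_gradient_step(alpha, grad, eta):
    """Multiplicative update followed by renormalization, so alpha stays on the probability simplex."""
    alpha = alpha * np.exp(-eta * grad)
    return alpha / alpha.sum()
```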

A disadvantage of this optimization procedure is that using the implicit representation of $\alpha$ requires inference in the underlying probabilistic model; for example, running belief propagation to compute marginals in tree-structured graphs. Since the inference problem is strictly harder than the decoding problem needed by sub-gradient/cutting-plane methods, the exponentiated gradient method seems to only apply to a subset of the problems where these other methods apply.

5.4 Min-Max Formulations

Finally, the last set of methods that we consider deals with the exponential number of constraints in the quadratic program by replacing them with one non-linear constraint for each training example:

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i: \ \log p(Y_i \mid w, X_i) + \xi_i \geq \max_{Y'} \big[ \log p(Y' \mid w, X_i) + \Delta(Y_i, Y') \big], \ \xi_i \geq 0.$

Above, we re-arranged the constraint so that the right-hand side of the inequality is the solution to the modified decoding problem. We first consider a method to solve this min-max formulation that is due to [Taskar et al., 2004], which applies in the special case where the optimal decoding can be formulated as a linear program. In this method, we write the linear program that solves the problem $\max_{Y'} [\log p(Y' \mid w, X_i) + \Delta(Y_i, Y')]$ in standard form,

$\max_{z_i} (wB_i) z_i \quad \text{s.t. } z_i \geq 0, \ A_i z_i \leq b_i,$

for suitably defined values of $A_i$, $B_i$, and $b_i$. We obtain an equivalent minimization problem by working with the dual of this linear program,

$\min_{z_i} b_i^T z_i \quad \text{s.t. } z_i \geq 0, \ A_i^T z_i \geq (wB_i)^T.$

We then substitute this dual problem into our original problem to eliminate the non-linear constraint, yielding the quadratic program

$\min_{w,\xi,z} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i: \ \log p(Y_i \mid w, X_i) + \xi_i \geq b_i^T z_i, \ \xi_i \geq 0, \ z_i \geq 0, \ A_i^T z_i \geq (wB_i)^T.$

Subsequently, we can solve this (reasonably-sized) quadratic program to exactly solve the original quadratic program whenever the optimal decoding can be formulated as a linear program. In cases where the linear programming relaxation is not exact, we can still use this formulation to compute an approximate solution, where the slack variables will be increased for any training examples where a fractional decoding yields a better configuration than the best integer decoding.

An alternative min-max formulation was explored in [Taskar et al., 2006b]. In this formulation, we plug a linear programming problem that solves the decoding problem into the unconstrained version of the MMMN problem, giving the saddle-point problem

$\min_{w \in \mathcal{W}} \max_{z \in \mathcal{Z}} \ \lambda \|w\|_2^2 + w^T F z + c^T z - \sum_i w^T F(X_i, Y_i),$

for appropriately defined $\mathcal{Z}$, $F$, and $c$. Subsequently, the extragradient algorithm is applied to find a value of w and z that minimizes over w the maximizing value of z. Using $L(w, z)$ to denote the objective function in the saddle-point problem, the iterations of the extragradient algorithm take the form of alternating projected gradient steps in w and z:

$w^p = \pi_{\mathcal{W}}(w - \eta \nabla_w L(w, z)), \qquad z^p = \pi_{\mathcal{Z}}(z + \eta \nabla_z L(w, z)),$

$w^c = \pi_{\mathcal{W}}(w - \eta \nabla_w L(w^p, z^p)), \qquad z^c = \pi_{\mathcal{Z}}(z + \eta \nabla_z L(w^p, z^p)).$

The step size $\eta$ is selected by a backtracking line search until a suitable condition on $w^p$ and $z^p$ is satisfied [Taskar et al., 2006b]. The iterations converge linearly to the solution, which corresponds to a rate of $O(\log(1/\epsilon))$. Since w is unconstrained, the projection $\pi_{\mathcal{W}}$ is simply the identity function (for associative max-margin Markov networks, we project the edge parameters onto the non-negative orthant), while the projection $\pi_{\mathcal{Z}}$ is the solution of a min-cost quadratic flow problem [Taskar et al., 2006b]. In [Taskar et al., 2006a], a dual extragradient method is considered, along with an extension that uses proximal steps with respect to a Bregman divergence rather than Euclidean projections.
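A sketch of one extragradient iteration in this form; grad_w, grad_z, project_W, and project_Z are assumed helpers (the last of these would hide the min-cost quadratic flow computation mentioned above), and the step size is taken as fixed rather than chosen by line search:

```python
def extragradient_step(w, z, grad_w, grad_z, project_W, project_Z, eta):
    """Prediction step followed by a correction step, each projected onto the feasible sets."""
    w_p = project_W(w - eta * grad_w(w, z))       # predict
    z_p = project_Z(z + eta * grad_z(w, z))
    w_c = project_W(w - eta * grad_w(w_p, z_p))   # correct, using gradients at the predicted point
    z_c = project_Z(z + eta * grad_z(w_p, z_p))
    return w_c, z_c
```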

5.5 Comments

As the discussion above indicates, efficiently solving the structural support vector machine optimization remains a challenging problem. Currently, the sub-gradient descent methods are one appealing choice because of their simplicity, while the cutting plane methods are another appealing option due to the availability of efficient implementations of these methods. However, the $O(1/\epsilon)$ (sub-linear) worst-case convergence rates of these methods are somewhat unappealing. While the $O(\log(1/\epsilon))$ (linear) convergence speed of the extragradient method is more appealing, it does not approach the $O(\log(\log(1/\epsilon)))$ convergence rates of standard methods for solving quadratic programs. We could consider applying interior-point methods (see [Boyd and Vandenberghe, 2004], for example) to one of the polynomial-sized reformulations or one of the min-max formulations, but the number of variables/constraints in these problems would likely make iterations of such a method impractical.

References

[Altun and Hofmann, 2003] Altun, Y. and Hofmann, T. (2003). Large margin methods for label sequence learning. In Eighth European Conference on Speech Communication and Technology. ISCA.

[Altun et al., 2003] Altun, Y., Tsochantaridis, I., Hofmann, T., et al. (2003). Hidden Markov support vector machines. In ICML, volume 20, page 3.

[Bartlett et al., 2005] Bartlett, P., Collins, M., Taskar, B., and McAllester, D. (2005). Exponentiated gradient algorithms for large-margin structured classification. In Advances in Neural Information Processing Systems 17: Proceedings of the 2004 Conference, page 113. MIT Press.

[Bertsekas, 1999] Bertsekas, D. (1999). Nonlinear Programming. Athena Scientific.

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization.

[Collins, 2002] Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 1-8. Association for Computational Linguistics, Morristown, NJ, USA.

[Collins et al., 2008] Collins, M., Globerson, A., Koo, T., Carreras, X., and Bartlett, P. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. The Journal of Machine Learning Research, 9.

[Crammer and Singer, 2001] Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2.

[Dietterich and Bakiri, 1995] Dietterich, T. and Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. arXiv preprint, cs.AI.

[Finley and Joachims, 2008] Finley, T. and Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning. ACM, New York, NY, USA.

[Hutter et al., 2005] Hutter, F., Hoos, H., and Stützle, T. (2005). Efficient stochastic local search for MPE solving. In International Joint Conference on Artificial Intelligence, volume 19, page 169. Lawrence Erlbaum Associates.

[Joachims, 2003] Joachims, T. (2003). Learning to align sequences: A maximum-margin approach. Online manuscript.

[Joachims, 2006] Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA.

[Joachims et al., 2009] Joachims, T., Finley, T., and Yu, C. (2009). Cutting plane training of structural SVMs. Machine Learning.

[Kivinen and Warmuth, 1997] Kivinen, J. and Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation.

[Kolmogorov and Zabih, 2004] Kolmogorov, V. and Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2).

[Kumar et al., 2007] Kumar, M., Kolmogorov, V., and Torr, P. (2007). An analysis of convex relaxations for MAP estimation. Advances in Neural Information Processing Systems, 20.

[Kushner and Yin, 2003] Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer.

[Lacoste-Julien, 2003] Lacoste-Julien, S. (2003). Combining SVM with graphical models for supervised classification: an introduction to Max-Margin Markov Networks.

[Lafferty et al., 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

[Lee and Mangasarian, 2001] Lee, Y. and Mangasarian, O. (2001). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20(1):5-22.

[Ratliff et al., 2007] Ratliff, N., Bagnell, J. A. D., and Zinkevich, M. (2007). (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats).

[Shalev-Shwartz et al., 2007] Shalev-Shwartz, S., Singer, Y., and Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning. ACM, New York, NY, USA.

[Smola et al., 2008] Smola, A., Vishwanathan, S., and Le, Q. (2008). Bundle methods for machine learning. Advances in Neural Information Processing Systems.

[Taskar et al., 2004] Taskar, B., Chatalbashev, V., and Koller, D. (2004). Learning associative Markov networks. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA.

[Taskar et al., 2003] Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. Advances in Neural Information Processing Systems, 16:51.

[Taskar et al., 2006a] Taskar, B., Lacoste-Julien, S., and Jordan, M. (2006a). Structured prediction, dual extragradient and Bregman projections. The Journal of Machine Learning Research, 7.

[Taskar et al., 2006b] Taskar, B., Lacoste-Julien, S., and Jordan, M. (2006b). Structured prediction via the extragradient method. Advances in Neural Information Processing Systems, 18:1345.

[Teo et al., 2007] Teo, C., Smola, A., Vishwanathan, S., and Le, Q. (2007). A scalable modular convex solver for regularized risk minimization. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA.

[Tsochantaridis et al., 2004] Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA.

[Vapnik, 1995] Vapnik, V. (1995). The Nature of Statistical Learning Theory.

[Weston and Watkins, 1999] Weston, J. and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks, volume 4.

[Zhang, 2004] Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA.


More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Estimation: Part 2. Chapter GREG estimation

Estimation: Part 2. Chapter GREG estimation Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Pattern Classification

Pattern Classification Pattern Classfcaton All materals n these sldes ere taken from Pattern Classfcaton (nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wley & Sons, 000 th the permsson of the authors and the publsher

More information

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF 10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

ECE559VV Project Report

ECE559VV Project Report ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate

More information

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016 CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng

More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012

Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012 Support Vector Machnes Je Tang Knowledge Engneerng Group Department of Computer Scence and Technology Tsnghua Unversty 2012 1 Outlne What s a Support Vector Machne? Solvng SVMs Kernel Trcks 2 What s a

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Introduction to Hidden Markov Models

Introduction to Hidden Markov Models Introducton to Hdden Markov Models Alperen Degrmenc Ths document contans dervatons and algorthms for mplementng Hdden Markov Models. The content presented here s a collecton of my notes and personal nsghts

More information

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

17 Support Vector Machines

17 Support Vector Machines 17 We now dscuss an nfluental and effectve classfcaton algorthm called (SVMs). In addton to ther successes n many classfcaton problems, SVMs are responsble for ntroducng and/or popularzng several mportant

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

CSE 546 Midterm Exam, Fall 2014(with Solution)

CSE 546 Midterm Exam, Fall 2014(with Solution) CSE 546 Mdterm Exam, Fall 014(wth Soluton) 1. Personal nfo: Name: UW NetID: Student ID:. There should be 14 numbered pages n ths exam (ncludng ths cover sheet). 3. You can use any materal you brought:

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

14 Lagrange Multipliers

14 Lagrange Multipliers Lagrange Multplers 14 Lagrange Multplers The Method of Lagrange Multplers s a powerful technque for constraned optmzaton. Whle t has applcatons far beyond machne learnng t was orgnally developed to solve

More information

Feature Selection in Multi-instance Learning

Feature Selection in Multi-instance Learning The Nnth Internatonal Symposum on Operatons Research and Its Applcatons (ISORA 10) Chengdu-Juzhagou, Chna, August 19 23, 2010 Copyrght 2010 ORSC & APORC, pp. 462 469 Feature Selecton n Mult-nstance Learnng

More information

Solutions to exam in SF1811 Optimization, Jan 14, 2015

Solutions to exam in SF1811 Optimization, Jan 14, 2015 Solutons to exam n SF8 Optmzaton, Jan 4, 25 3 3 O------O -4 \ / \ / The network: \/ where all lnks go from left to rght. /\ / \ / \ 6 O------O -5 2 4.(a) Let x = ( x 3, x 4, x 23, x 24 ) T, where the varable

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning Advanced Introducton to Machne Learnng 10715, Fall 2014 The Kernel Trck, Reproducng Kernel Hlbert Space, and the Representer Theorem Erc Xng Lecture 6, September 24, 2014 Readng: Erc Xng @ CMU, 2014 1

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution. Solutons HW #2 Dual of general LP. Fnd the dual functon of the LP mnmze subject to c T x Gx h Ax = b. Gve the dual problem, and make the mplct equalty constrants explct. Soluton. 1. The Lagrangan s L(x,

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information