Structural Extensions of Support Vector Machines. Mark Schmidt March 30, 2009


1 Structural Extensions of Support Vector Machines
Mark Schmidt, March 30, 2009

2 Outline
Formulation: Binary SVMs, Multiclass SVMs, Structural SVMs.
Training: Subgradients, Cutting Planes, Marginal Formulations, Min-Max Formulations.

3 Topics Not Covered
Optimal separating hyper-planes; generalization bounds; deriving Wolfe duals of quadratic programs; the kernel trick, Mercer/Representer theorems.

4 Logistic Regression
Model probabilities of binary labels as:
p(y_i = 1 | w, x_i) ∝ exp(w^T x_i), p(y_i = -1 | w, x_i) ∝ exp(-w^T x_i).
Train by maximizing the likelihood, or equivalently minimizing the negative log-likelihood:
min_w -Σ_i log p(y_i | w, x_i).
To make the solution unique, add an L2 penalty:
min_w -Σ_i log p(y_i | w, x_i) + λ||w||².
Make decisions using the rule:
ŷ_i = 1 if p(y_i = 1 | w, x_i) > p(y_i = -1 | w, x_i), and ŷ_i = -1 if p(y_i = 1 | w, x_i) < p(y_i = -1 | w, x_i).
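To make the objective concrete, here is a minimal sketch (my own illustration, not part of the slides) of the regularized negative log-likelihood and the decision rule above, using the ±1 label convention; the NumPy usage, synthetic data, and all names are assumptions.

import numpy as np

def logistic_nll(w, X, y, lam):
    """L2-regularized negative log-likelihood for labels y in {-1, +1}.
    Since p(y_i | w, x_i) = 1 / (1 + exp(-y_i w^T x_i)), each term is
    -log p(y_i | w, x_i) = log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * (w @ w)

def predict(w, X):
    """Decision rule: pick the label with the higher probability."""
    return np.where(X @ w > 0, 1, -1)

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
print(logistic_nll(np.zeros(3), X, y, lam=1.0))  # 20*log(2) at w = 0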

5 Linear Separability
If we just want to get the decisions right, then we require (for some arbitrary c > 1):
p(y_i | w, x_i) / p(-y_i | w, x_i) ≥ c.
Taking logarithms (the particular value of c is arbitrary, since if we can satisfy this for some c we can rescale w):
log p(y_i | w, x_i) - log p(-y_i | w, x_i) ≥ log c.
Plugging in the probabilities (the normalizing constants cancel):
2 y_i w^T x_i ≥ log c.
Choose c such that log(c)/2 = 1, so:
y_i w^T x_i ≥ 1.

6 Fixing the Feasibility Problem
We can solve this as a linear feasibility problem:
y_i w^T x_i ≥ 1.
This problem either has no solution, or an infinite number of solutions.
To make the solution unique when one exists, we add an L2 penalty:
min_w λ||w||², s.t. y_i w^T x_i ≥ 1.
To make a solution exist even when the data are not separable, we allow slack in the constraints, but penalize the L1-norm of this slack:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. y_i w^T x_i ≥ 1 - ξ_i, ξ_i ≥ 0.

7 Support Vector Machine
This is the primal form of the soft-margin SVM:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. y_i w^T x_i ≥ 1 - ξ_i, ξ_i ≥ 0.
We can also eliminate the slacks and write it as an unconstrained problem:
min_w Σ_i (1 - y_i w^T x_i)_+ + λ||w||².
The hinge loss is an upper bound on the number of classification errors (see the sketch below).
It is very similar to logistic regression with L2-regularization:
min_w -Σ_i log p(y_i | w, x_i) + λ||w||².
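A small sketch (illustrative, with assumed names) of the unconstrained objective, showing why the hinge loss upper-bounds the number of classification errors: a misclassified example has y_i w^T x_i ≤ 0, hence hinge loss at least 1.

import numpy as np

def svm_objective(w, X, y, lam):
    """Unconstrained soft-margin SVM: sum_i (1 - y_i w^T x_i)_+ + lam ||w||^2."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return hinge.sum() + lam * (w @ w)

def zero_one_errors(w, X, y):
    """Number of misclassifications; each contributes hinge loss >= 1."""
    return int(np.sum(y * (X @ w) <= 0))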

8 Outline
Formulation: Binary SVMs, Multiclass SVMs, Structural SVMs.
Training: Subgradients, Cutting Planes, Marginal Formulations, Min-Max Formulations.

9 Multinomial Logistic Regression
We extend binary logistic regression to multi-class data by giving each class k its own weight vector:
p(y_i = k | w_k, x_i) ∝ exp(w_k^T x_i).
Training is the same as before, and we make decisions using:
ŷ_i = argmax_k p(y_i = k | w_k, x_i).

10 NK-Slack Multiclass SVMs
Making the right decisions corresponds to satisfying:
∀ k ≠ y_i: p(y_i | w, x_i) / p(y = k | w_k, x_i) ≥ c.
Following the same steps as before, we can write this as:
∀ k ≠ y_i: w_{y_i}^T x_i - w_k^T x_i ≥ 1.
Adding slacks and L2-regularization yields the NK-slack multiclass SVM:
min_{w,ξ} Σ_i Σ_{k≠y_i} ξ_{i,k} + λ||w||², s.t. ∀ k ≠ y_i: w_{y_i}^T x_i - w_k^T x_i ≥ 1 - ξ_{i,k}, ξ_{i,k} ≥ 0.
We will refer to this formulation as the NK-slack SVM, since it has NK slack variables. It can also be written as:
min_w Σ_i Σ_{k≠y_i} (1 - w_{y_i}^T x_i + w_k^T x_i)_+ + λ||w||².

11 N-Slack Multiclass SVMs
If instead of writing the constraint on the decision rule as:
∀ k ≠ y_i: p(y_i | w, x_i) / p(y = k | w_k, x_i) ≥ c,
we wrote it as:
p(y_i | w, x_i) / max_{k≠y_i} p(y = k | w_k, x_i) ≥ c,
then following the same procedure we obtain the N-slack multiclass SVM:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. ∀ k ≠ y_i: w_{y_i}^T x_i - w_k^T x_i ≥ 1 - ξ_i, ξ_i ≥ 0.
We will refer to this as the N-slack SVM, since it has N slack variables. It can be written as the unconstrained optimization (contrasted with the NK-slack version in the sketch below):
min_w Σ_i max_{k≠y_i} (1 - w_{y_i}^T x_i + w_k^T x_i)_+ + λ||w||².
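To contrast the two penalties, a sketch (illustrative; labels in {0, ..., K-1} and the score-matrix layout are my conventions) computing the NK-slack and N-slack losses from the same margin violations:

import numpy as np

def multiclass_hinges(W, X, y):
    """Returns (nk_slack_loss, n_slack_loss) summed over examples.
    W: (K, d) matrix of class weight vectors; y: labels in {0..K-1}."""
    scores = X @ W.T                        # (N, K) values w_k^T x_i
    true = scores[np.arange(len(y)), y]     # w_{y_i}^T x_i
    viol = 1.0 - true[:, None] + scores     # 1 - w_{y_i}^T x_i + w_k^T x_i
    viol[np.arange(len(y)), y] = 0.0        # exclude k = y_i
    nk_slack = np.maximum(0.0, viol).sum()             # one slack per (i, k)
    n_slack = np.maximum(0.0, viol.max(axis=1)).sum()  # one slack per i
    return nk_slack, n_slack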

12 Outline
Formulation: Binary SVMs, Multiclass SVMs, Structural SVMs.
Training: Subgradients, Cutting Planes, Marginal Formulations, Min-Max Formulations.

13 Conditional Random Fields
The extension of logistic regression to data with multiple (dependent) labels is known as a conditional random field.
For example, a binary chain-CRF with Ising-like potentials and tied parameters could use:
p(Y_i | w, X_i) ∝ exp(Σ_{j=1}^S y_{i,j} w_n^T x_{i,j} + Σ_{j=1}^{S-1} y_{i,j} y_{i,j+1} w_e^T x_{i,j,j+1}).
A concise notation for the general case is:
p(Y_i | w, X_i) ∝ exp(w^T F(X_i, Y_i)).
One possible decision rule is:
Ŷ_i = argmax_Y p(Y | w, X_i).
Enumerating all possible labelings may be intractable, but in the case of chains this is Viterbi decoding (sketched below).
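Since Viterbi decoding recurs throughout what follows, here is a minimal decoder sketch (my own illustration; the log-potential encoding, tied edge potentials, and names are assumptions):

import numpy as np

def viterbi(node_pot, edge_pot):
    """Max-scoring labeling of a chain. node_pot: (S, L) log node
    potentials; edge_pot: (L, L) tied log edge potentials."""
    S, L = node_pot.shape
    score = node_pot[0].copy()
    back = np.zeros((S, L), dtype=int)
    for j in range(1, S):
        # cand[a, b]: best score with label a at j-1 and label b at j.
        cand = score[:, None] + edge_pot + node_pot[j][None, :]
        back[j] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    labels = [int(score.argmax())]
    for j in range(S - 1, 0, -1):            # backtrack
        labels.append(int(back[j, labels[-1]]))
    return labels[::-1]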

14 Hidden Markov SVMs
Making the right decisions with Viterbi decoding corresponds to satisfying:
∀ Y ≠ Y_i: p(Y_i | w, X_i) / p(Y | w, X_i) ≥ c.
This is equivalent to the set of constraints:
∀ Y ≠ Y_i: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ 1.
Adding the L2 (squared) regularizer and using the N-slack penalty on constraint violations gives the hidden Markov support vector machine:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. ∀ Y ≠ Y_i: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ 1 - ξ_i, ξ_i ≥ 0.

15 Max-Margin Markov Networks
The constraints in the HMSVM don't care about the number of differences between Y_i and Y:
∀ Y ≠ Y_i: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ 1.
We might want the log-likelihood ratio between the true label and an alternative to be higher when the difference in labels is higher:
∀ Y ≠ Y_i: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ Δ(Y_i, Y).
Leading to the QP:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. ∀ Y ≠ Y_i: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ Δ(Y_i, Y) - ξ_i, ξ_i ≥ 0.
This is known as a max-margin Markov network [Taskar et al., 2003], or a structural SVM with margin-rescaling.

16 Structural SVMs
Rescaling the constant might make us concentrate on being much better than sequences that differ at many positions:
∀ Y ≠ Y_i: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ Δ(Y_i, Y).
An alternative is to rescale the slacks based on the difference between sequences, leading to the QP:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. ∀ Y ≠ Y_i: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ 1 - ξ_i / Δ(Y_i, Y), ξ_i ≥ 0.
This is known as a structural SVM with slack-rescaling.

17 Summary
Unconstrained formulations of the structural extensions:
(HMSVM) min_w Σ_i max_{Y≠Y_i} (1 - log p(Y_i | w, X_i) + log p(Y | w, X_i))_+ + λ||w||²,
(MMMN) min_w Σ_i max_{Y≠Y_i} (Δ(Y_i, Y) - log p(Y_i | w, X_i) + log p(Y | w, X_i))_+ + λ||w||²,
(SSVM) min_w Σ_i max_{Y≠Y_i} (Δ(Y_i, Y)(1 - log p(Y_i | w, X_i) + log p(Y | w, X_i)))_+ + λ||w||².
Since Δ(Y_i, Y_i) = 0, we can simplify the MMMN and SSVM objectives:
(MMMN) min_w Σ_i [max_Y (Δ(Y_i, Y) + log p(Y | w, X_i)) - log p(Y_i | w, X_i)] + λ||w||²,
(SSVM) min_w Σ_i max_Y (Δ(Y_i, Y)(1 - log p(Y_i | w, X_i) + log p(Y | w, X_i))) + λ||w||².
This allows us to use Viterbi decoding with a modified input to compute the max values.
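The "modified input" trick is easiest to see for a Hamming-distance Δ: the loss decomposes over positions, so it can be absorbed into the node potentials. A hedged sketch (reusing the viterbi function sketched under slide 13; Hamming Δ and integer labelings are my assumptions):

import numpy as np

def loss_augmented_viterbi(node_pot, edge_pot, y_true):
    """Computes argmax_Y [Delta(y_true, Y) + score(Y)] for Hamming Delta:
    add 1 to the node potential of every label that disagrees with y_true."""
    aug = node_pot + 1.0                          # +1 for every label ...
    aug[np.arange(len(y_true)), y_true] -= 1.0    # ... except the true one
    return viterbi(aug, edge_pot)                 # viterbi() from slide 13's sketch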

18 Beyond Chains
We can compute the MMMN and SSVM objective values (previous slide) any time we can do decoding in the model:
- trees and low-treewidth graphs
- context-free grammars
- general graphs with sub-modular potentials (*)
- weighted bipartite matching (*)
(*): it is #P-hard to train the corresponding conditional random field.
We can also plug in an approximate decoding, or a convex relaxation of decoding.

19 Outline
Formulation: Binary SVMs, Multiclass SVMs, Structural SVMs.
Training: Subgradients, Cutting Planes, Marginal Formulations, Min-Max Formulations.

20 Subgradients
Our objective function is:
f(w) = Σ_i [max_Y (Δ(Y_i, Y) + log p(Y | w, X_i)) - log p(Y_i | w, X_i)] + λ||w||².
If Y* is an argmax of the max, a subgradient is:
g(w) = Σ_i [∇_w log p(Y* | w, X_i) - ∇_w log p(Y_i | w, X_i)] + 2λw.
Consider the step: w^{k+1} = w^k - η_k g(w^k).
For a small enough step size η_k, this will:
- always move us toward the optimal solution
- decrease the objective function when the argmax is unique

21 Subgradient Descent
We can therefore consider optimization algorithms of the form:
w^{k+1} = w^k - η_k g(w^k).
Common choices of step size are a constant, or a sequence satisfying:
Σ_{k=1}^∞ η_k = ∞, Σ_{k=1}^∞ η_k² < ∞.
Useful variants (see the sketch below):
- Update based on a single training example, using steps w^{k+1} = w^k - η g_i(w^k) with
  g_i(w) = ∇_w log p(Y* | w, X_i) - ∇_w log p(Y_i | w, X_i) + (2/N)λw.
- Average the iterates: w̄^{k+1} = ((k-1)/k) w̄^k + (1/k) w^{k+1}.
- Project onto a compact set containing the solution: w^{k+1} = π(w^k - η_k g(w^k)).
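As a concrete, runnable instance of this template (applied to the binary SVM of slide 7 rather than the structural objective, purely to keep it self-contained), a sketch with diminishing steps η_k = 1/(λk), which satisfy the two conditions above, plus iterate averaging; all names are illustrative.

import numpy as np

def svm_subgradient(w, X, y, lam):
    """A subgradient of sum_i (1 - y_i w^T x_i)_+ + lam ||w||^2."""
    active = (y * (X @ w)) < 1                    # examples with hinge > 0
    return -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w

def subgradient_descent(X, y, lam, iters=500):
    w = np.zeros(X.shape[1])
    w_avg = np.zeros_like(w)
    for k in range(1, iters + 1):
        w = w - (1.0 / (lam * k)) * svm_subgradient(w, X, y, lam)
        w_avg = ((k - 1) * w_avg + w) / k         # running average of iterates
    return w_avg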

22 Some Convergence Rates
- Projected batch SD (diminishing step sizes): O(1/ε)
- Averaged stochastic SD (constant step sizes): O(1/ε²), asymptotic variance
- Stochastic projected SD (diminishing step sizes): O(1/(δε)) with probability 1-δ
- Averaged stochastic projected SD (constant step sizes): ?, asymptotic variance
- Batch SD (constant step sizes): O(log(1/ε)) to get within a bounded region of the optimum (the bound depends on λ and a bound on the sub-differential)

23 Outline
Formulation: Binary SVMs, Multiclass SVMs, Structural SVMs.
Training: Subgradients, Cutting Planes, Marginal Formulations, Min-Max Formulations.

24 Cutting Plane Methods
The problem with the QP formulation is that it has an exponential number of constraints:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. ∀ Y: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ Δ(Y_i, Y) - ξ_i, ξ_i ≥ 0.
But there always exists a polynomial-sized subset that satisfies all constraints up to an accuracy of ε.
Basic idea behind cutting plane methods:
- use decoding to find out if all constraints are satisfied
- if not, greedily add a constraint

25 QP Cutting Plane Method
Cutting plane method:
- we maintain a working set of constraints
- iterate over training examples:
  - if decoding does not violate the constraints, continue
  - otherwise, add the constraint to the working set and solve the QP
- stop if no changes in the working set
Solving these QPs in the dual is efficient, as long as the working set is small.
At most O(1/ε) constraints are required.

26 Convex Cutting Plane
There also exist cutting plane methods for solving (non-smooth) convex optimization problems.
We can apply these to the unconstrained formulation:
f(w) = Σ_i [max_Y (Δ(Y_i, Y) + log p(Y | w, X_i)) - log p(Y_i | w, X_i)] + λ||w||².
Basic idea: any subgradient gives a lower-bounding hyperplane:
f(w) ≥ f(w_0) + (w - w_0)^T g(w_0).
Cutting plane for non-smooth optimization:
- find the minimum over these lower bounds
- use the minimum to make a better lower bound
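A tiny sketch (illustrative; names assumed) of the piecewise-linear model these methods build: each subgradient contributes one lower-bounding plane, and the model is their pointwise maximum.

import numpy as np

def bundle_lower_bound(w, bundle):
    """Evaluates max_j [f(w_j) + (w - w_j)^T g_j], a lower bound on f(w).
    bundle: list of (w_j, f_j, g_j) triples collected at past iterates."""
    return max(f_j + g_j @ (w - w_j) for (w_j, f_j, g_j) in bundle)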

27 Bundle Methods
Problem: the minimum of the lower bound may be far away from the current solution.
Bundle method: minimize the lower bound subject to an L2 penalty ||w^{k+1} - w^k||² on the distance from the current solution.
Combined cutting-plane/bundle method: use the L2 penalty already present in the objective, and build a lower bound on the hinge loss.
The combined method requires at most O(1/ε) iterations.

28 Outline
Formulation: Binary SVMs, Multiclass SVMs, Structural SVMs.
Training: Subgradients, Cutting Planes, Marginal Formulations, Min-Max Formulations.

29 Poly-Sized Formulations
The previous two strategies use the graph structure to allow efficient decoding.
An alternative strategy is to use the graph structure to reparameterize our quadratic program.
Although we can also do this for the primal, it can be shown more directly for the dual problem...

30 Dual MMMN
Solving the MMMN QP is equivalent to solving the following QP:
max_α Σ_i Σ_Y α_i(Y) Δ(Y_i, Y) - (1/2) Σ_{i,j} Σ_{Y,Y'} α_i(Y) α_j(Y') F_i(Y)^T F_j(Y'),
s.t. ∀ i: Σ_Y α_i(Y) = 1/(2λ), α_i(Y) ≥ 0.
Notes:
- this dual has an exponential number of constraints and variables
- the constraints say that each α_i is an unnormalized distribution over label configurations
- we are going to write this QP in terms of the marginals of this distribution

31 Marginal Representation
If the distribution factorizes into node and edge potentials, we can write the marginals of the distribution as:
μ_i(y_j) = Σ_{Y~[y_j]} α_i(Y), μ_i(y_j, y_k) = Σ_{Y~[y_j,y_k]} α_i(Y),
where we use Y~[y_j] to denote that the configuration Y is consistent with y_j.
The marginals must satisfy the constraints of the original problem:
μ_i(y_j) ≥ 0, Σ_{y_j} μ_i(y_j) = 1/(2λ).
They also need to form a legal density: the node and edge marginals must lie in the marginal polytope.
For chains/trees/forests, it is sufficient to enforce a local consistency condition:
∀ (j,k) ∈ E: Σ_{y_j} μ_i(y_j, y_k) = μ_i(y_k), μ_i(y_j, y_k) ≥ 0.
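To see what local consistency captures, here is an illustrative sketch (tiny chain, assumed names) that computes node and edge marginals from an explicit α and checks that summing an edge marginal over one endpoint recovers the neighbouring node marginal:

import numpy as np
from itertools import product

def marginals_from_alpha(alpha, S, L):
    """alpha: dict mapping length-S label tuples to nonnegative weights.
    Returns node marginals (S, L) and edge marginals (S-1, L, L)."""
    mu_node = np.zeros((S, L))
    mu_edge = np.zeros((S - 1, L, L))
    for Y, a in alpha.items():
        for j in range(S):
            mu_node[j, Y[j]] += a
        for j in range(S - 1):
            mu_edge[j, Y[j], Y[j + 1]] += a
    return mu_node, mu_edge

alpha = {Y: 1.0 for Y in product(range(2), repeat=3)}  # uniform over 2^3 labelings
mu_node, mu_edge = marginals_from_alpha(alpha, S=3, L=2)
assert np.allclose(mu_edge[0].sum(axis=0), mu_node[1])  # local consistency holds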

32 Polynomial-Sized Dual
We can re-write the first set of terms in the dual using these marginals (assuming Δ decomposes over positions):
Σ_Y α_i(Y) Δ(Y_i, Y) = Σ_Y α_i(Y) Σ_j Δ_j(y_{i,j}, y_j) = Σ_j Σ_{y_j} Δ_j(y_{i,j}, y_j) μ_i(y_j).
We can similarly write the second set of terms as:
(1/2) Σ_{i,i'} Σ_{(j,k)∈E} Σ_{(j',k')∈E} Σ_{y_j,y_k} Σ_{y_{j'},y_{k'}} μ_i(y_j, y_k) μ_{i'}(y_{j'}, y_{k'}) F(X_i, y_j, y_k)^T F(X_{i'}, y_{j'}, y_{k'}),
yielding a polynomial-sized version of the dual problem (the factored dual), subject to:
μ_i(y_j) ≥ 0, Σ_{y_j} μ_i(y_j) = 1/(2λ), ∀ (j,k) ∈ E: Σ_{y_j} μ_i(y_j, y_k) = μ_i(y_k), μ_i(y_j, y_k) ≥ 0.
While this formulation is now polynomial-sized, the quadratic form has a very large number of terms.
"Structural SMO": coordinate descent on this problem.

33 Exponentiated Gradient
An alternative to using an explicit formulation of the dual is to use an implicit formulation and apply the exponentiated gradient (EG) algorithm.
The EG algorithm solves optimization problems where the variables take the form of a distribution:
x_i ≥ 0, Σ_i x_i = 1.
EG steps take the form:
x_i ← x_i exp(-η ∇_i f(x)) / Σ_j x_j exp(-η ∇_j f(x)),
and we can consider applying this to the dual variables α (see the sketch below).
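A minimal sketch of one EG update on the simplex (x and the gradient are placeholders): the multiplicative form keeps the iterate non-negative, and the normalization keeps it summing to one.

import numpy as np

def eg_step(x, grad, eta):
    """One exponentiated gradient step: x_i <- x_i exp(-eta grad_i) / Z."""
    z = x * np.exp(-eta * grad)
    return z / z.sum()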

34 Exponentiated Gradient
It is possible to derive a dual where the variables α represent a normalized distribution. In this case, we can apply the batch or online EG algorithm.
To make the iterations efficient, an implicit representation of α that factorizes according to the graph is used.
The algorithm requires O(1/ε) iterations to reach an ε-accurate solution.
Performing the updates using this implicit representation requires inference, instead of just decoding (so it can't be applied in general).

35 Outline
Formulation: Binary SVMs, Multiclass SVMs, Structural SVMs.
Training: Subgradients, Cutting Planes, Marginal Formulations, Min-Max Formulations.

36 Min-Max Formulations
The QP formulation is problematic since we have an exponential number of linear constraints:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. ∀ Y: log p(Y_i | w, X_i) - log p(Y | w, X_i) ≥ Δ(Y_i, Y) - ξ_i, ξ_i ≥ 0.
Rather than dealing with these, we could use one non-linear constraint for each training example:
min_{w,ξ} Σ_i ξ_i + λ||w||², s.t. log p(Y_i | w, X_i) + ξ_i ≥ max_Y [log p(Y | w, X_i) + Δ(Y_i, Y)], ξ_i ≥ 0.
In this formulation, we have a constraint on the optimal decoding.

37 Linear Programming
The min-max formulation is useful when the optimal decoding can be formulated as a linear program:
max_z (wB)z, s.t. z ≥ 0, Az ≤ b.
In this case we can write out the dual of this problem:
min_{z'} b^T z', s.t. z' ≥ 0, A^T z' ≥ (wB)^T.
And plug it into the min-max formulation to eliminate the max:
min_{w,ξ,z'} Σ_i ξ_i + λ||w||², s.t. log p(Y_i | w, X_i) + ξ_i ≥ b^T z'_i, ξ_i ≥ 0, z'_i ≥ 0, A^T z'_i ≥ (wB)^T.
This is like changing the max over Z into a max over R; we can solve this (reasonably-sized) quadratic program to exactly solve the original problem.

38 Extragradient Method
An alternative to plugging the linear program into the QP formulation is to plug it into the unconstrained formulation:
min_{w∈W} max_{z∈Z} λ||w||² + w^T F z + c^T z - w^T F(X, Y).
This problem can be solved using the extragradient method (sketched below):
w^p = π_W(w - η ∇_w L(w, z)), z^p = π_Z(z + η ∇_z L(w, z)),
w^c = π_W(w - η ∇_w L(w^p, z^p)), z^c = π_Z(z + η ∇_z L(w^p, z^p)).
The projection onto Z can be formulated as a quadratic-cost network flow problem.
The step size is chosen by backtracking, and the algorithm has a linear convergence rate, O(log(1/ε)).
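A schematic sketch of one extragradient iteration (function names and the fixed step size are my assumptions; the slides use backtracking): the correction step re-evaluates the gradients at the predicted point.

def extragradient_step(w, z, grad_w, grad_z, proj_W, proj_Z, eta):
    """One prediction/correction step for min_w max_z L(w, z).
    grad_w, grad_z return the partial gradients of L at (w, z);
    proj_W, proj_Z are Euclidean projections onto the feasible sets."""
    # Prediction step from the current point.
    w_p = proj_W(w - eta * grad_w(w, z))
    z_p = proj_Z(z + eta * grad_z(w, z))
    # Correction step: gradients re-evaluated at the predicted point.
    w_c = proj_W(w - eta * grad_w(w_p, z_p))
    z_c = proj_Z(z + eta * grad_z(w_p, z_p))
    return w_c, z_c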

39 Comments on Rates of Convergence
- O(1/ε²) is incredibly, incredibly slow.
- O(1/ε) is still incredibly slow ("sub-linear convergence").
- O(log(1/ε)) can be fast, slow, or somewhere in between ("linear convergence").
- O(log log(1/ε)) is fast ("quadratic convergence").
Open question: can we get a practical O(log log(1/ε)) algorithm, or an O(log(1/ε)) algorithm with a provably nice constant in the rate of convergence?
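To put the rates in perspective, a back-of-the-envelope count of iterations needed at ε = 10⁻³, ignoring constants (so only indicative):

import math

eps = 1e-3
print(1 / eps**2)                   # O(1/eps^2):        ~1,000,000 iterations
print(1 / eps)                      # O(1/eps):          ~1,000 iterations
print(math.log(1 / eps))            # O(log(1/eps)):     ~7 iterations
print(math.log(math.log(1 / eps)))  # O(log log(1/eps)): ~2 iterations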
