A Note on Structural Extensions of SVMs


Mark Schmidt

March 29, 2009

1 Introduction

This document is intended as an introduction to structural extensions of support vector machines for those who are familiar with logistic regression (binary and multinomial) and discrete-state probabilistic graphical models (in particular, conditional random fields). No prior knowledge about support vector machines is assumed. The outline is as follows. Section 2 motivates and outlines binary support vector machines. The contents of this section are standard, and the reader is referred to [Vapnik, 1995] for more details. However, we will follow a non-standard presentation; instead of motivating support vector machines from the point of view of optimal separating hyper-planes, we focus on the relationship between logistic regression and support vector machines through likelihood ratios, a viewpoint due to S.V.N. Vishwanathan. Section 3 discusses multi-class generalizations of binary support vector machines, focusing on the NK-slack formulation of [Weston and Watkins, 1999] and the N-slack formulation of [Crammer and Singer, 2001]. Section 4 discusses extensions of the N-slack multi-class support vector machine that can model data with structured output spaces. In particular, this section focuses on hidden Markov support vector machines [Altun et al., 2003, Joachims, 2003], max-margin Markov networks [Taskar et al., 2003], and structural support vector machines [Tsochantaridis et al., 2004]. Section 5 (in progress) discusses solving the optimization problems in Section 4. There are four main approaches: (1) sub-gradient methods [Collins, 2002, Altun et al., 2003, Zhang, 2004, Shalev-Shwartz et al., 2007], (2) cutting plane and bundle methods [Tsochantaridis et al., 2004, Joachims, 2006, Teo et al., 2007, Smola et al., 2008, Joachims et al., 2009], (3) polynomial-sized reformulations [Taskar et al., 2003, Bartlett et al., 2005, Collins et al., 2008], and (4) min-max formulations [Taskar et al., 2004, Taskar et al., 2006b, Taskar et al., 2006a]. Some of the things that will not be covered in this document are a discussion of optimal separating hyper-planes, deriving the Wolfe dual formulations, the kernel trick, and computational learning theory. [Lacoste-Julien, 2003] is an accessible introduction to max-margin Markov networks that discusses these topics, and also contains introductory material on graphical models and conditional random fields.

2 Support Vector Machines

We first look at binary classification with an N by P design matrix X, and an N by 1 vector of class labels y with $y_i \in \{-1, 1\}$. We will ignore the bias term to simplify the presentation, and consider probabilities for the class labels that are proportional to the exponential of a linear function of the data,

$p(y_i = 1 \mid w, x_i) \propto \exp(w^T x_i),$

$p(y_i = -1 \mid w, x_i) \propto \exp(-w^T x_i).$

Using these class probabilities is equivalent to using a logistic regression model, and we can estimate the parameters w by maximizing the likelihood of the training data, or equivalently minimizing the negative log-likelihood

$\min_w \sum_i -\log p(y_i \mid w, x_i).$

The solution to this problem is not necessarily unique (and the optimal parameters may be unbounded). To yield a unique solution (and avoid over-fitting), we typically add a penalty on the $\ell_2$-norm of the parameter vector and compute a penalized maximum likelihood estimate

$\min_w \sum_i -\log p(y_i \mid w, x_i) + \lambda \|w\|_2^2,$

where the scalar $\lambda$ controls the strength of the regularizer. With an estimate of the parameters w, we can classify a data point $x_i$ using

$\hat{y}_i = 1$ if $p(y_i = 1 \mid w, x_i) > p(y_i = -1 \mid w, x_i)$, and $\hat{y}_i = -1$ if $p(y_i = 1 \mid w, x_i) < p(y_i = -1 \mid w, x_i)$.

But what if our goal isn't to have a good model $p(y_i \mid w, x_i)$, but rather to make the right decision for all our training set points? We can express this in terms of the following condition on the likelihood ratios:

$\frac{p(y_i \mid w, x_i)}{p(-y_i \mid w, x_i)} \geq c,$

where $c > 1$. The exact choice of c is arbitrary, since if we can satisfy this for some $c > 1$, then we can also satisfy it for any other $c > 1$ by re-scaling w. Taking logarithms, we can re-write this condition as

$\log p(y_i \mid w, x_i) - \log p(-y_i \mid w, x_i) \geq \log c,$

and plugging in the definitions of $p(y_i \mid w, x_i)$ we can write this as

$2 y_i w^T x_i \geq \log c.$

Since c is an arbitrary constant greater than 1, we will pick c so that $(1/2)\log c = 1$, so that our conditions can be written in a very simple form:

$y_i w^T x_i \geq 1.$

This is a linear feasibility problem, and it can be solved using techniques from linear programming. However, one of two things will go wrong: (i) the solution may not be unique, or (ii) there may be no solution. As before, we can make the parameters identifiable by using an $\ell_2$-norm regularizer, which leads to a quadratic program

$\min_w \lambda \|w\|_2^2, \quad \text{s.t. } y_i w^T x_i \geq 1 \text{ for all } i.$

In this case, the choice of $\lambda$ is arbitrary since solving this quadratic program will yield the solution with smallest $\ell_2$-norm for any $\lambda > 0$. To address the case where the linear feasibility problem has no solution, we will introduce a non-negative slack variable $\xi_i$ for each training case that allows the constraint to be violated for that instance, but we will penalize the $\ell_1$-norm of $\xi$ to try and minimize the violation of the constraints. This yields the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } y_i w^T x_i \geq 1 - \xi_i, \ \xi_i \geq 0,$

which has P + N variables and is the primal form of support vector machines. Specifically, this is what is known as a soft-margin support vector machine [Vapnik, 1995].
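As an aside, this constrained quadratic program is equivalent to an unconstrained "hinge loss" problem (the equivalence is derived below), and that unconstrained objective is easy to evaluate and sub-differentiate directly. A minimal NumPy sketch, assuming X is the N-by-P design matrix, y the vector of +1/-1 labels, and lam the regularization strength (the function name and interface are illustrative, not from this note):

```python
import numpy as np

def binary_svm_objective(w, X, y, lam):
    """Hinge-loss form of the soft-margin SVM: sum_i (1 - y_i w^T x_i)_+ + lam * ||w||_2^2,
    together with one subgradient with respect to w."""
    margins = y * (X @ w)                       # y_i w^T x_i for each training case
    hinge = np.maximum(0.0, 1.0 - margins)      # slack needed by each training case
    obj = hinge.sum() + lam * (w @ w)
    active = (hinge > 0).astype(float)          # cases whose constraint is violated or tight
    grad = -X.T @ (active * y) + 2.0 * lam * w  # a subgradient of the objective
    return obj, grad
```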

The support vectors are those data points where the constraint $y_i w^T x_i \geq 1 - \xi_i$ holds with equality (to keep the presentation simple, we will not discuss the dual form or the "kernel trick"). By re-arranging the constraints, we have that for all i, $\xi_i \geq 0$ and $\xi_i \geq 1 - y_i w^T x_i$, and therefore we know that $\xi_i \geq \max\{0, 1 - y_i w^T x_i\}$. Further, since we are directly minimizing a linear function of $\xi$, we can show that the following minimization,

$\min_w \sum_i (1 - y_i w^T x_i)_+ + \lambda \|w\|_2^2,$

where we use $(z)_+$ to denote $\max\{0, z\}$, has the same solution as the primal form of support vector machines. This is an unconstrained but non-differentiable problem in P variables. (If we instead considered penalizing the squared $\ell_2$-norm of the slack variables, then we could re-write the problem as a differentiable unconstrained optimization; this smooth support vector machine is due to [Lee and Mangasarian, 2001].) The term $(1 - y_i w^T x_i)_+$ is known as the hinge loss, and is an upper bound on the number of classification errors. To summarize, $\ell_2$-regularized logistic regression and support vector machines are closely related. They employ a very similar model and use the same regularization, but differ in that logistic regression seeks to maximize the (penalized) likelihood (a log-concave smooth approximation to the number of classification errors), while support vector machines seek to minimize the (penalized) constraint violations in a set of conditions on the likelihood ratios (a convex piecewise-linear upper bound on the number of classification errors).

3 Multi-Class Support Vector Machines

Viewed from the optimal separating hyper-plane perspective, it is not immediately clear how to extend binary support vector machines to the case where we consider K possible class labels. Most attempts at developing multi-class support vector machines concentrate on producing a multi-class classifier by combining the decisions made by a set of independent binary classifiers; a popular approach of this type is to use error-correcting output codes [Dietterich and Bakiri, 1995]. However, the likelihood ratio perspective of support vector machines suggests an obvious generalization to the multi-class scenario, by analogy to the extensions of logistic regression to the multi-class scenario. A natural generalization of the binary logistic regression classifier to the case of a K-class problem is to use a weight vector $w_k$ for each class,

$p(y_i = k \mid w_k, x_i) \propto \exp(w_k^T x_i).$

Parameter estimation is performed as in the binary case, and we are led to the decision rule

$\hat{y}_i = \operatorname*{argmax}_k \ p(y_i = k \mid w_k, x_i).$

Viewed in terms of likelihood ratios, we would like our decision rule to satisfy the constraints

$\forall k \neq y_i, \quad \frac{p(y_i \mid w_{y_i}, x_i)}{p(y_i = k \mid w_k, x_i)} \geq c.$

Following the same steps as before, we arrive at a set of linear constraints of the form

$\forall k \neq y_i, \quad w_{y_i}^T x_i - w_k^T x_i \geq 1.$

We will again consider minimizing the $\ell_2$-norm of w (concatenating all K of the $w_k$ vectors into one large vector), and introduce slack variables to allow for constraint violation. However, there are different ways to introduce the slack variables. The original method of [Weston and Watkins, 1999] introduces one slack variable for each of these constraints:

$\min_{w,\xi} \sum_i \sum_{k \neq y_i} \xi_{i,k} + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall k \neq y_i: \ w_{y_i}^T x_i - w_k^T x_i \geq 1 - \xi_{i,k}, \ \xi_{i,k} \geq 0,$

a quadratic program with NK + PK variables. We will refer to this formulation as the NK-slack formulation. An equivalent unconstrained optimization problem where we eliminate the slack variables is

$\min_w \sum_i \sum_{k \neq y_i} (1 - w_{y_i}^T x_i + w_k^T x_i)_+ + \lambda \|w\|_2^2,$

an unconstrained non-differentiable problem in PK variables. (If we consider penalizing the squared $\ell_2$-norm in the NK-slack formulation, then we can write the optimization as a differentiable unconstrained optimization; a demo of this multi-class extension of smooth support vector machines is available online.) Later, [Crammer and Singer, 2001] proposed what we will call the N-slack multi-class formulation, which can be motivated by re-writing the likelihood ratio in the equivalent form

$\frac{p(y_i \mid w, x_i)}{\max_{k \neq y_i} p(y_i = k \mid w_k, x_i)} \geq c.$

Although this leads to an equivalent set of constraints, the formulations differ when we introduce slack variables on this set of constraints, since each training instance is now associated with only 1 slack variable:

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall k \neq y_i: \ w_{y_i}^T x_i - w_k^T x_i \geq 1 - \xi_i, \ \xi_i \geq 0,$

a quadratic program with N + PK variables. We will refer to this as the N-slack formulation. An equivalent unconstrained optimization problem where we eliminate the slack variables is

$\min_w \sum_i \max_{k \neq y_i} (1 - w_{y_i}^T x_i + w_k^T x_i)_+ + \lambda \|w\|_2^2,$

an unconstrained non-differentiable problem in PK variables. Comparing these two formulations, we see that the NK-slack formulation is a much more punishing penalty: it punishes violation of the constraints for every competing hypothesis. In contrast, the N-slack formulation only punishes based on the violation of the most likely competing hypothesis, even if there are many likely competing hypotheses. However, the advantage of the N-slack formulation is that it leads to efficient structural extensions.
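The difference between the two penalties is easy to see if we compute them directly. A small NumPy sketch of both multi-class hinge losses, assuming W stacks the K weight vectors as rows, X is the N-by-P design matrix, and y contains integer class labels (names and interface are illustrative):

```python
import numpy as np

def multiclass_hinge_losses(W, X, y):
    """Per-example N-slack (Crammer-Singer) and NK-slack (Weston-Watkins) hinge losses.
    W is K x P, X is N x P, y is a length-N integer vector with labels in {0,...,K-1}."""
    y = np.asarray(y)
    scores = X @ W.T                      # N x K matrix of w_k^T x_i
    idx = np.arange(X.shape[0])
    true_scores = scores[idx, y]          # w_{y_i}^T x_i
    margins = 1.0 - true_scores[:, None] + scores
    margins[idx, y] = 0.0                 # exclude the k = y_i term
    violations = np.maximum(0.0, margins)
    n_slack = violations.max(axis=1)      # penalize only the most violated competitor
    nk_slack = violations.sum(axis=1)     # penalize every violated competitor
    return n_slack, nk_slack
```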

4 Hidden Markov Support Vector Machines, Max-Margin Markov Networks, Structural Support Vector Machines

We now move to the case where we no longer have a single class label $y_i$ for each training instance, but instead have a set of labels $y_{i,j}$ indexed by j. If these labels are independent, then we can simply use the methods above to fit an independent support vector machine to each label. The more interesting case occurs when the elements of the output vector are dependent. For example, we might have a sequential dependency: the label $y_{i,j}$ depends on the previous label $y_{i,j-1}$. The structural extension of logistic regression to handle this type of dependency structure is known as a linear-chain conditional random field [Lafferty et al., 2001]. In the case of tied parameters, binary labels, Ising-like potentials, and sequences of labels with the same length S (removing any of these restrictions is trivial, but would require more notation), the probability of a vector of labels $Y_i$ is written as

$p(Y_i \mid w, X_i) \propto \exp\Big( \sum_{j=1}^{S} y_{i,j}\, w_n^T x_{i,j} + \sum_{j=1}^{S-1} y_{i,j}\, y_{i,j+1}\, w_e^T x_{i,j,j+1} \Big),$

where $w_n$ represents the node parameters, $w_e$ represents the edge/transition parameters, and we allow a feature vector $x_{i,j}$ for each node and $x_{i,j,j+1}$ for each transition. Although there are other possibilities, we consider the decision rule

$\hat{Y}_i = \operatorname*{argmax}_Y \ p(Y \mid w, X_i).$

Computing this maximum by considering all possible labels may no longer be feasible, since there are now $2^S$ possible labels, but the maximum can be found in O(S) by dynamic programming in the form of the Viterbi decoding algorithm. Moving to an interpretation in terms of likelihood ratios, we would like our classifier to satisfy the criteria

$\forall Y' \neq Y_i, \quad \frac{p(Y_i \mid w, X_i)}{p(Y' \mid w, X_i)} \geq c,$

leading to the set of constraints

$\forall Y' \neq Y_i, \quad \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq 1. \quad (1)$

Introducing the (squared) $\ell_2$-norm regularizer and the N-slack constraint violation penalty, we obtain the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y' \neq Y_i: \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq 1 - \xi_i, \ \xi_i \geq 0.$

This is known as a hidden Markov support vector machine [Altun et al., 2003, Joachims, 2003]. Because it is based on the N-slack formulation, the hidden Markov support vector machine penalizes all alternative hypotheses based only on the most promising alternative hypothesis. In particular, the number of label differences between $Y_i$ and an alternative configuration is never considered. Rather than selecting the constant to be 1 across all alternative configurations, we could consider setting it to $\Delta(Y_i, Y')$, the number of label differences between $Y_i$ and $Y'$:

$\forall Y' \neq Y_i, \quad \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq \Delta(Y_i, Y'). \quad (2)$

This encourages the likelihood ratio between the true label and an alternative configuration to be higher if the alternative configuration disagrees with $Y_i$ at many nodes. With our usual procedure we obtain the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y' \neq Y_i: \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq \Delta(Y_i, Y') - \xi_i, \ \xi_i \geq 0,$

which is known as a max-margin Markov network [Taskar et al., 2003], or a structural support vector machine with margin re-scaling [Tsochantaridis et al., 2004].
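As discussed below, the maximization inside the resulting training objectives can be carried out by the Viterbi algorithm run on suitably modified node scores. A minimal sketch of this loss-augmented Viterbi decoding for a chain, assuming node_scores[j, l] and edge_scores[j, l, l'] hold the (log-domain) node and transition scores, and taking the Hamming loss as the label loss (the interface is illustrative, not from this note):

```python
import numpy as np

def loss_augmented_viterbi(node_scores, edge_scores, y_true):
    """argmax_Y [ sum_j node_scores[j, Y_j] + sum_j edge_scores[j, Y_j, Y_{j+1}] + Hamming(Y, y_true) ].
    node_scores is S x L, edge_scores is (S-1) x L x L, y_true is a length-S integer array."""
    y_true = np.asarray(y_true)
    S, L = node_scores.shape
    # Margin re-scaling: absorb the Hamming loss by adding 1 to every disagreeing node score
    aug = node_scores + (np.arange(L)[None, :] != y_true[:, None]).astype(float)
    dp, back = aug[0].copy(), np.zeros((S, L), dtype=int)
    for j in range(1, S):
        cand = dp[:, None] + edge_scores[j - 1]   # cand[prev, cur]
        back[j] = cand.argmax(axis=0)
        dp = cand.max(axis=0) + aug[j]
    y_hat = np.zeros(S, dtype=int)
    y_hat[-1] = int(dp.argmax())
    for j in range(S - 1, 0, -1):                 # backtrack the best configuration
        y_hat[j - 1] = back[j, y_hat[j]]
    return y_hat, float(dp.max())
```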

Of course, this isn't the only way to change the constraints based on the agreement between $Y_i$ and $Y'$, and [Tsochantaridis et al., 2004] argues that this formulation has the disadvantage that it tries to make the difference between competing labels that differ in many positions very large. An alternative to changing the constant factor in the likelihood ratio is to change the scaling of the slack variables and solve the quadratic program

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y' \neq Y_i: \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq 1 - \xi_i / \Delta(Y_i, Y'), \ \xi_i \geq 0.$

Here, we need to add more slack for a competing hypothesis that differs in many positions. This last variant is known as a structural support vector machine with slack re-scaling [Tsochantaridis et al., 2004]. Equivalent unconstrained versions of these 3 structural extensions of support vector machines are

(HMSVM) $\min_w \sum_i \max_{Y' \neq Y_i} \big(1 - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i)\big)_+ + \lambda \|w\|_2^2,$

(MMMN) $\min_w \sum_i \max_{Y' \neq Y_i} \big(\Delta(Y_i, Y') - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i)\big)_+ + \lambda \|w\|_2^2,$

(SSVM) $\min_w \sum_i \max_{Y' \neq Y_i} \big(\Delta(Y_i, Y')\,(1 - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i))\big)_+ + \lambda \|w\|_2^2.$

(We could also consider penalizing the squared values of the slack variables [Tsochantaridis et al., 2004], but this does not lead to a differentiable minimization problem due to the potential for ties in the max functions.) In the (MMMN) and (SSVM) formulations, we can use $\Delta(Y_i, Y_i) = 0$ to simplify these expressions, giving

(MMMN) $\min_w \sum_i \Big[ \max_{Y'} \big(\Delta(Y_i, Y') + \log p(Y' \mid w, X_i)\big) - \log p(Y_i \mid w, X_i) \Big] + \lambda \|w\|_2^2,$

(SSVM) $\min_w \sum_i \max_{Y'} \big(\Delta(Y_i, Y')\,(1 - \log p(Y_i \mid w, X_i) + \log p(Y' \mid w, X_i))\big) + \lambda \|w\|_2^2.$

N.B., in these latter two expressions we can maximize over all $Y'$ instead of over all $Y' \neq Y_i$; in these cases, we can compute the objective function using the Viterbi decoding algorithm (with a suitably modified input). In contrast, the HMSVM model requires solving a maximization problem over all configurations except $Y_i$, which may be much harder to solve. (We could also consider structural extensions based on the NK-slack formulation [Altun and Hofmann, 2003]. In this case, the max functions would be replaced by sums, but it is not obvious (to me, anyway) how you would compute these sums efficiently, nor how you would deal with the exponential number of variables and constraints in the quadratic program.) Finally, we note that while the structural hinge losses in all three formulations represent an upper bound on the number of sequence errors (i.e., the number of times at least one label in the sequence was wrong), the MMMN and SSVM formulations give an upper bound on the number of label errors while the HMSVM does not.

Although this section has focused on a chain-structured dependency structure, we only used this property in two places: (i) the definition of $p(Y \mid w, X_i)$, and (ii) in efficient methods to compute maximums over the space of possible configurations. We can apply the same methodology to derive MMMNs and SSVMs for a wide variety of alternative models. For example, we can consider the cases where conditional random fields can be applied exactly, such as low tree-width structured dependencies, and probabilistic context free grammars. We can also consider models with higher-order dependencies, rather than simply pairwise dependencies, as long as the graph structure or parameterization allows us to efficiently solve the associated maximization problems. As discussed in [Taskar et al., 2004], an advantage that MMMNs/SSVMs have over conditional random fields is that we can also consider completely general graph structures (such as lattices or fully connected graphs), if we enforce the additional constraint that the edge weights satisfy a sub-modularity condition [Kolmogorov and Zabih, 2004]. In these cases, the exact inference needed for training conditional random fields is intractable, but computing the maximums needed by MMMNs/SSVMs can be formulated as a linear program (that can be efficiently solved using min-cut techniques). [Taskar et al., 2004] proposes associative max-margin Markov networks, models that take advantage of this property to allow efficient training and decoding in models with an arbitrary graph structure and arbitrary clique sizes.

In these models, the potential that all variables in a clique take the state k based on a non-negative feature x is modeled with a parameter $\lambda_k \geq 1$, while the potential for any configuration where they don't all take the same state is fixed at 1. For binary variables, the decoding problem can be solved exactly as a linear program for any graph structure, while for multi-class problems the linear program yields a worst-case approximation ratio of 1/c (where c is the largest clique size), if an appropriate discretization strategy is used. [Taskar et al., 2006b] also gives the example of weighted bipartite graph matching, where decoding can be computed in polynomial time using linear programming, but the inference needed to train a conditional random field is #P-hard. For general graphs where we do not want the sub-modularity condition, we can also consider an approximation where we use an approximate decoding method, such as loopy belief propagation or a stochastic local search method [Hutter et al., 2005]. However, theoretical and empirical work in [Finley and Joachims, 2008] shows that such under-generating decoding methods (which may return sub-optimal solutions to the decoding problem) have several disadvantages compared to so-called over-generating decoding methods. A typical example of an over-generating decoding method is a linear programming relaxation of an integer programming formulation of decoding, where in the relaxation the objective is optimized exactly over an expanded space [Kumar et al., 2007], hence yielding a fractional decoding that is a strict upper bound on the optimal decoding.

5 Training

In this section, we discuss methods for estimating the parameters of structural extensions of SVMs. We will focus on MMMNs since this makes the notation slightly simpler, but most of the techniques can be applied with only minor modifications to the case of SSVMs.

5.1 Subgradient Methods

We first consider methods that address the unconstrained non-differentiable formulation of MMMNs by moving along sub-gradients of the objective function

$f(w) = \sum_i \Big[ \max_{Y'} \big(\Delta(Y_i, Y') + \log p(Y' \mid w, X_i)\big) - \log p(Y_i \mid w, X_i) \Big] + \lambda \|w\|_2^2.$

If we use $Y_i^*$ as an argmax of the max for training example i, then a sub-gradient of this objective function is

$g(w) = \sum_i \big[ \nabla_w \log p(Y_i^* \mid w, X_i) - \nabla_w \log p(Y_i \mid w, X_i) \big] + 2\lambda w.$

The function is differentiable, and this will be the gradient, whenever all of the argmax values are unique. The negative sub-gradient direction is a descent direction at all points where the function is differentiable, which means that $f(w - \eta g(w)) < f(w)$ for sufficiently small $\eta > 0$. In contrast, at non-differentiable points (where at least one argmax is not unique) negative sub-gradient directions may not be descent directions [Bertsekas, 1999, §6.3]. However, we can guarantee that for sufficiently small $\eta$ the distance to the optimal solution is reduced, even if the function value is not [Bertsekas, 1999, §6.3]. We can therefore consider a simple sub-gradient method that uses a sequence of small step sizes $\{\eta_k\}$, where the iterations take the form

$w^{k+1} = w^k - \eta_k g(w^k).$

Several strategies for choosing the step length that lead to convergence of the iterations are discussed in [Bertsekas, 1999, §6.3], with a common choice being a sequence $\{\eta_k\}$ of positive step sizes satisfying

$\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty.$
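Putting the pieces together, one iteration of this (batch) sub-gradient method only needs a loss-augmented decoder and the joint feature map, since the normalizing constants cancel in the difference of log-probabilities. A sketch, where loss_aug_decode and features are assumed helper functions (for chains, the decoder could be the loss-augmented Viterbi sketch given earlier) and examples is a list of (X_i, Y_i) pairs of NumPy arrays:

```python
def mmmn_subgradient_step(w, examples, loss_aug_decode, features, lam, eta):
    """One step w <- w - eta * g(w), where g(w) is a subgradient of
    sum_i [ max_Y (Delta(Y_i, Y) + log p(Y | w, X_i)) - log p(Y_i | w, X_i) ] + lam * ||w||^2."""
    g = 2.0 * lam * w
    for X, Y in examples:
        Y_star = loss_aug_decode(w, X, Y)             # an argmax of the loss-augmented decoding
        g = g + features(X, Y_star) - features(X, Y)  # subgradient of the structural hinge term
    return w - eta * g
```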

Convergence of the method can also be shown if each step uses the contribution to the sub-gradient of a suitably chosen individual training example [Kushner and Yin, 2003], resulting in steps of the form

$g_i(w) = \nabla_w \log p(Y_i^* \mid w, X_i) - \nabla_w \log p(Y_i \mid w, X_i) + (2/N)\lambda w,$

$w^{k+1} = w^k - \eta_k g_i(w^k).$

This is known as a stochastic sub-gradient descent method, and was presented in [Collins, 2002] (predating the formulation of structural SVMs) for the degenerate case of $\lambda = 0$ and $\Delta(Y_i, Y') = 0$ for all $Y_i, Y'$, and where a constant step size ($\eta_k = \eta$) was used instead of a convergent sequence. An interesting alternative method was also outlined in [Collins, 2002]. In this second method a constant step size was used, and a variant on the procedure was considered where the final estimator is the average of all the iterates:

$w^{k+1} = w^k - \eta g_i(w^k), \qquad \bar{w}^{k+1} = \frac{k-1}{k}\, \bar{w}^k + \frac{1}{k}\, w^{k+1}.$

Convergence of averaged estimators of the form $\bar{w}$ is discussed in [Kushner and Yin, 2003], where it is shown that under certain conditions these steps satisfy an asymptotic statistical efficiency property. In the special case of binary support vector machines (though easily extended to structural support vector machines), [Shalev-Shwartz et al., 2007] considers iterations of the form

$w^{k+1} = \pi(w^k - \eta_k g(w^k)),$

where $\pi$ is a projection onto an $\ell_2$-ball that is guaranteed to contain the solution (the projection of a vector w onto a set $\mathcal{W}$ is defined as $\operatorname*{argmin}_{w' \in \mathcal{W}} \|w - w'\|_2$). They showed that the number of iterations to obtain a solution with accuracy $\epsilon$ using this method is $O(1/\epsilon)$ (ignoring poly-logarithmic factors), which was an improvement over the $O(1/\epsilon^2)$ number of iterations provided by the previous analysis of the averaged stochastic gradient method [Zhang, 2004] and cutting plane methods [Joachims, 2006]. [Shalev-Shwartz et al., 2007] also consider the case of a projected stochastic gradient update, and show that the iterations achieve an accuracy of $\epsilon$ with probability $1 - \delta$ in $O(1/(\delta\epsilon))$ iterations. One could also consider combining averaging and projections, whose asymptotic statistical efficiency is considered in [Kushner and Yin, 2003]. Finally, [Ratliff et al., 2007] show that the (deterministic) sub-gradient method with a sufficiently small constant step size will converge linearly (i.e., with the faster rate of $O(\log(1/\epsilon))$) to a region containing the minimum whose radius is proportional to $C/\lambda$ (where C is a bound on the sub-differential).
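A sketch of one stochastic pass in this spirit, combining the per-example sub-gradient above with a decaying step size and a projection onto an $\ell_2$-ball; the step-size schedule and the radius R are illustrative choices, not the exact constants from [Shalev-Shwartz et al., 2007], and loss_aug_decode and features are the same assumed helpers as before:

```python
import numpy as np

def stochastic_projected_epoch(w, examples, loss_aug_decode, features, lam, k_start=1):
    """One pass of projected stochastic sub-gradient descent over the training examples."""
    N = len(examples)
    R = 1.0 / np.sqrt(lam)                    # assumed radius of a ball containing the solution
    k = k_start
    for i in np.random.permutation(N):
        X, Y = examples[i]
        Y_star = loss_aug_decode(w, X, Y)
        g = features(X, Y_star) - features(X, Y) + (2.0 / N) * lam * w
        w = w - (1.0 / (lam * k)) * g         # decaying step size eta_k = 1/(lam * k)
        norm = np.linalg.norm(w)
        if norm > R:                          # project back onto the l2-ball of radius R
            w = w * (R / norm)
        k += 1
    return w, k
```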

5.2 Cutting Plane and Bundle Methods

We now consider methods for solving the quadratic programming formulation of MMMNs,

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i, \forall Y': \ \log p(Y_i \mid w, X_i) - \log p(Y' \mid w, X_i) \geq \Delta(Y_i, Y') - \xi_i, \ \xi_i \geq 0.$

This formulation is problematic since we have an exponential number of constraints. However, it is conceivable that we do not need to enforce all of these constraints at the optimal solution. In particular, [Tsochantaridis et al., 2004] showed that there always exists a polynomial-sized subset of the constraints such that the corresponding solution satisfies all of the constraints with accuracy at least $\epsilon$. This makes cutting-plane methods for quadratic programming an appealing approach for solving this type of problem. Cutting plane methods have a very simple structure: we first find the unconstrained minimizer. If the minimizer does not satisfy all the constraints, we find a constraint that is violated and use it to construct our initial working set W. We then solve the problem with the constraints contained in W, and if this solution still does not satisfy all the constraints we add another violated constraint to W. This continues until we have accumulated enough constraints in W so that all constraints are satisfied at the solution.

[Tsochantaridis et al., 2004] suggests the following implementation of a cutting plane algorithm for MMMNs. We iterate over the training examples, and use Viterbi decoding to find the maximum structural hinge loss. If no constraint is violated by more than $\epsilon$ then we move on to the next training example. Otherwise, we add the constraint corresponding to this alternative configuration, and optimize the objective function subject to our current working set of constraints. The algorithm terminates when we complete a full pass through the training data without adding any constraints. The quadratic programs can be efficiently solved using the Wolfe dual formulation (see the next section), which is a relatively small problem since it has only one variable for each working constraint. Further, the sequence of quadratic programs can be solved efficiently since each quadratic program differs from the previous one by only a single constraint. The main concern with using this type of approach is that the number of constraints needed might be so large that the intermediate quadratic programs may be impractical to solve. [Tsochantaridis et al., 2004] show that at most $O(1/\epsilon^2)$ constraints are required (a closely related method and analysis for HMSVMs is given in [Joachims, 2003]).

This cutting plane algorithm for structural SVMs later inspired an efficient algorithm for binary support vector machines [Joachims, 2006]. In this method, the binary support vector machine problem is written in a so-called 1-slack formulation,

$\min_{w,\xi} N\xi + \lambda \|w\|_2^2, \quad \text{s.t. } \forall c \in \{0,1\}^N: \ \sum_i c_i\, y_i w^T x_i \geq \sum_i c_i - N\xi, \ \xi \geq 0.$

Here, we have an exponential number of constraints but only 1 slack variable. [Joachims, 2006] shows that this formulation has the same optimal parameters (with $\xi = (1/N)\sum_i \xi_i$), and outlines a cutting plane algorithm for binary SVMs that is implemented in the SVMperf software. This method was extended (back) to structural SVMs in [Joachims et al., 2009], where it was shown that at most $O(1/\epsilon)$ constraints are required. This is the algorithm implemented in the current version of the SVMstruct software.

An alternative to applying cutting plane methods for quadratic programming to a constrained formulation of the MMMN problem is to apply the related cutting plane methods for non-smooth optimization to the unconstrained non-differentiable MMMN problem. This is explored in [Teo et al., 2007, Smola et al., 2008]. In cutting plane methods for optimization of (non-smooth) convex functions, we use the definition of a sub-gradient at a point $w_0$ to yield a global lower bound on the function,

$f(w) \geq f(w_0) + (w - w_0)^T g(w_0),$

that is exact at $w_0$. Cutting plane methods for non-smooth optimization proceed by collecting a set of these lower-bounding hyper-planes, and at each iteration generate a new trial point by finding the minimum of these lower bounds. The function value and sub-gradient at the new trial point are then added to the working set, yielding a more accurate lower bound. A disadvantage of this procedure is that the minimum of the set of lower bounds may suggest a trial point that is far away from the previous point, even if the previous point was close to the optimal solution. This is known as instability [Bertsekas, 1999, §6.3]. A common way to address this is to minimize the lower bound subject to a regularizer $\|w^{k+1} - w^k\|_2^2$ on the difference between subsequent trial values. This is known as a bundle method. The method of [Teo et al., 2007, Smola et al., 2008] is closely related to bundle methods, but differs in the regularizer.
Rather than building a lower bound for the full objective function and using the regularizer $\|w^{k+1} - w^k\|_2^2$, they build a sequence of lower bounds for the hinge loss and use the exact regularizer $\lambda \|w\|_2^2$ already present in the objective function. In [Smola et al., 2008], it is shown that the method requires $O(1/\epsilon)$ iterations to converge ([Smola et al., 2008] also show that applying their method to train conditional random fields with $\ell_2$-regularization achieves the faster rate of $O(\log(1/\epsilon))$), and that the method offers a speed advantage over the method of [Shalev-Shwartz et al., 2007].
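A sketch of the working-set strategy described at the start of this subsection, in the spirit of [Tsochantaridis et al., 2004] rather than a faithful reproduction of their algorithm; solve_qp is an assumed helper that solves the quadratic program restricted to the constraints collected so far and returns (w, xi), delta is the label loss, and loss_aug_decode and features are as before:

```python
import numpy as np

def cutting_plane_training(examples, loss_aug_decode, features, delta, solve_qp, lam, eps):
    """Accumulate most-violated constraints until every constraint is satisfied to within eps."""
    working_set = [[] for _ in examples]             # one list of constraints per training example
    w, xi = solve_qp(working_set, lam)               # starts from the (essentially unconstrained) problem
    added = True
    while added:
        added = False
        for i, (X, Y) in enumerate(examples):
            Y_star = loss_aug_decode(w, X, Y)        # most violated configuration for example i
            violation = delta(Y, Y_star) + w @ features(X, Y_star) - w @ features(X, Y)
            if violation > xi[i] + eps:              # only add constraints violated by more than eps
                working_set[i].append(Y_star)
                w, xi = solve_qp(working_set, lam)   # re-optimize over the enlarged working set
                added = True
    return w
```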

5.3 Polynomial-Size Reformulations

In this section, we examine an alternative to cutting plane methods for dealing with the exponential-sized quadratic programs. In particular, we consider methods that take advantage of the sparse dependency structure in the underlying distribution to rewrite the exponential-sized quadratic program as a polynomial-sized problem. We will largely follow [Taskar et al., 2003] and will work with a dual formulation of the MMMN quadratic program. In this dual formulation, we will have a variable $\alpha_i(Y')$ for each possible configuration $Y'$ of training example i. We will also find it convenient to use the feature representation of the probability functions, $p(Y \mid w, X_i) \propto \exp(w^T F(X_i, Y))$, and use the notation $\Delta F_i(Y') \equiv F(X_i, Y_i) - F(X_i, Y')$. Using this notation, [Taskar et al., 2003] shows that the following problem is dual to the MMMN quadratic program:

$\max_{\alpha} \sum_i \sum_{Y'} \alpha_i(Y')\, \Delta(Y_i, Y') - \frac{1}{2} \sum_i \sum_j \sum_{Y'} \sum_{Y''} \alpha_i(Y')\, \alpha_j(Y'')\, \Delta F_i(Y')^T \Delta F_j(Y''),$

$\text{s.t. } \forall i: \ \sum_{Y'} \alpha_i(Y') = \frac{1}{2\lambda}, \quad \alpha_i(Y') \geq 0.$

Although this dual formulation has an exponential number of constraints and an exponential number of variables, note the simple form of the constraints. If $\lambda = 1/2$, then the constraints simply enforce that $\alpha_i(Y')$ is a valid probability distribution over all values of $Y'$ (for each training example i). For other values of $\lambda$, they enforce that $\alpha_i(Y')$ is an unnormalized distribution (with normalizing constant $1/(2\lambda)$). If our dependency structure factorizes over nodes and edges in a graph, then we can consider a parameterization in terms of expectations with respect to this distribution,

$\mu_i(y_j) = \sum_{Y' \sim [y_j]} \alpha_i(Y'), \qquad \mu_i(y_j, y_k) = \sum_{Y' \sim [y_j, y_k]} \alpha_i(Y'),$

where the notation $Y' \sim [y_j]$ is used to denote that a configuration $Y'$ is consistent with the label $y_j$. When we consider using these variables, we will require that the node parameters satisfy the constraints of the original problem,

$\forall j: \ \mu_i(y_j) \geq 0, \qquad \sum_{y_j} \mu_i(y_j) = \frac{1}{2\lambda}.$

To ensure that the edge variables form a legal density, they must belong to the marginal polytope. In the case of tree-structured graphs, this is equivalent to the requirement that [Taskar et al., 2003]

$\forall (j,k) \in E: \ \sum_{y_j} \mu_i(y_j, y_k) = \mu_i(y_k), \qquad \forall (j,k) \in E: \ \mu_i(y_j, y_k) \geq 0.$

Using these expectations, we can re-write the set of first terms in the dual formulation as

$\sum_{Y'} \alpha_i(Y')\, \Delta(Y_i, Y') = \sum_{Y'} \alpha_i(Y') \sum_j \Delta_j(y_{i,j}, y'_j) = \sum_j \sum_{y'_j} \Delta_j(y_{i,j}, y'_j)\, \mu_i(y'_j).$

This is now a sum over individual label assignments rather than over joint assignments.
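Under the Hamming loss used above, this factored first term is just the expected number of label errors under the node marginals. A small sketch evaluating it for a single training sequence, assuming node_marginals[j, l] stores $\mu_i(y_j = l)$ (possibly unnormalized) and the per-node loss $\Delta_j$ is the 0/1 disagreement:

```python
import numpy as np

def expected_hamming_term(node_marginals, y_true):
    """sum_j sum_{y'_j} Delta_j(y_{i,j}, y'_j) mu_i(y'_j) with Delta_j the 0/1 disagreement loss.
    node_marginals is S x L; y_true is a length-S integer array."""
    y_true = np.asarray(y_true)
    S = node_marginals.shape[0]
    mass_on_truth = node_marginals[np.arange(S), y_true]   # marginal mass on the correct label
    return float(node_marginals.sum() - mass_on_truth.sum())
```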

We can similarly re-write the second term in the dual in terms of edge expectations. This leads to what [Taskar et al., 2003] calls the factored dual,

$\max_{\mu} \sum_i \sum_j \sum_{y'_j} \Delta_j(y_{i,j}, y'_j)\, \mu_i(y'_j) - \frac{1}{2} \sum_{i,i'} \sum_{(j,k) \in E} \sum_{(j',k') \in E} \sum_{y'_j, y'_k} \sum_{y'_{j'}, y'_{k'}} \mu_i(y'_j, y'_k)\, \mu_{i'}(y'_{j'}, y'_{k'})\, F(X_i, y'_j, y'_k)^T F(X_{i'}, y'_{j'}, y'_{k'}),$

$\text{s.t. } \forall i, j: \ \mu_i(y_j) \geq 0, \ \sum_{y_j} \mu_i(y_j) = \frac{1}{2\lambda}, \qquad \forall i, (j,k) \in E: \ \sum_{y_j} \mu_i(y_j, y_k) = \mu_i(y_k), \ \mu_i(y_j, y_k) \geq 0.$

While this formulation is now polynomial-sized, the quadratic form has a very large number of terms (i.e., it is quadratic in the number of training examples and instances). [Taskar et al., 2003] proposes a method, known as the structural SMO method, for solving this optimization problem. In this method, violation of the constrained optimality conditions is used to select the two most violating variables, and their optimal value given all the other variables is computed in closed form to update the variables.

[Taskar et al., 2003] discusses extensions of the above method to graph structures that are not forests. For triangulated graphs, additional constraints can be imposed to ensure that the distribution remains in the marginal polytope, although the number of additional constraints is exponential in the worst case. If we have an untriangulated graph, or do not want to triangulate and add these extra constraints, the factored dual can be solved as an approximation (analogous to loopy belief propagation). It is also possible to derive factored duals when the original distribution factorizes over higher-order cliques. Finally, [Taskar et al., 2003] discusses a factored version of the primal problem, where we have an extended set of slack variables that mimic the dependency structure in the graph, and constrained dual variables enforcing the constraints of the marginal polytope.

An alternative method for optimizing the dual formulation using a polynomial-sized number of variables is outlined in [Bartlett et al., 2005, Collins et al., 2008]. In this method, the exponentiated gradient algorithm is used to (indirectly) optimize the variables $\alpha$ in the dual formulation. The exponentiated gradient algorithm [Kivinen and Warmuth, 1997] is a method to optimize a function $f(x)$, where x lies on the probability simplex, $x \geq 0$, $\sum_i x_i = 1$ ([Kivinen and Warmuth, 1997] also outline an extension that can handle non-negative variables subject to the $\ell_1$-norm of the vector being 1, and extensions where the constant 1 is replaced by another known constant C). The exponentiated gradient algorithm takes steps of the form

$x_i^{new} = \frac{x_i \exp(-\eta\, \nabla_i f(x))}{\sum_j x_j \exp(-\eta\, \nabla_j f(x))},$

where as before $\eta$ is a learning rate, and we can consider applying the algorithm using the contribution to the gradient of one training example at a time instead of all the training examples at once. [Bartlett et al., 2005, Collins et al., 2008] work with a slightly modified dual formulation where the constraints take the form of a probability simplex (instead of an unnormalized version), and apply the exponentiated gradient algorithm to this problem. To avoid dealing with the exponential number of variables $\alpha$, they use an implicit representation for $\alpha$ in terms of variables that are analogous to the $\mu$ variables above. Similar to the methods from previous sections, they show that the method requires $O(1/\epsilon)$ iterations to reach an accuracy of $\epsilon$, and that this applies whether the full gradient or the gradient contribution from an individual training example is used ([Collins et al., 2008] also show that the faster rate of $O(\log(1/\epsilon))$ is achieved when the method is applied to optimizing a dual formulation of training conditional random fields).
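The update itself is a one-liner; a sketch of a single exponentiated gradient step on a simplex-constrained variable, where alpha and grad are NumPy arrays and eta is the learning rate:

```python
import numpy as np

def exponentiated_gradient_step(alpha, grad, eta):
    """Multiplicative update followed by renormalization, so alpha stays on the probability simplex."""
    alpha = alpha * np.exp(-eta * grad)
    return alpha / alpha.sum()
```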

A disadvantage of this optimization procedure is that using the implicit representation of $\alpha$ requires inference in the underlying probabilistic model; for example, running belief propagation to compute marginals in tree-structured graphs. Since the inference problem is strictly harder than the decoding problem needed by sub-gradient/cutting-plane methods, the exponentiated gradient method seems to only apply to a subset of the problems where these other methods apply.

5.4 Min-Max Formulations

Finally, the last set of methods that we consider deals with the exponential number of constraints in the quadratic program by replacing them with one non-linear constraint for each training example:

$\min_{w,\xi} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i: \ \log p(Y_i \mid w, X_i) + \xi_i \geq \max_{Y'} \big[ \log p(Y' \mid w, X_i) + \Delta(Y_i, Y') \big], \ \xi_i \geq 0.$

Above, we re-arranged the constraint so that the right-hand side of the inequality is the solution to the modified decoding problem. We first consider a method to solve this min-max formulation that is due to [Taskar et al., 2004], which applies in the special case where the optimal decoding can be formulated as a linear program. In this method, we write the linear program that solves the problem $\max_{Y'} [\log p(Y' \mid w, X_i) + \Delta(Y_i, Y')]$ in standard form,

$\max_{z_i} (wB_i) z_i \quad \text{s.t. } z_i \geq 0, \ A_i z_i \leq b_i,$

for suitably defined values of $A_i$, $B_i$, and $b_i$. We obtain an equivalent minimization problem by working with the dual of this linear program,

$\min_{z_i} b_i^T z_i \quad \text{s.t. } z_i \geq 0, \ A_i^T z_i \geq (wB_i)^T.$

We then substitute this dual problem into our original problem to eliminate the non-linear constraint, yielding the quadratic program

$\min_{w,\xi,z} \sum_i \xi_i + \lambda \|w\|_2^2, \quad \text{s.t. } \forall i: \ \log p(Y_i \mid w, X_i) + \xi_i \geq b_i^T z_i, \ \xi_i \geq 0, \ z_i \geq 0, \ A_i^T z_i \geq (wB_i)^T.$

Subsequently, we can solve this (reasonably-sized) quadratic program to exactly solve the original quadratic program whenever the optimal decoding can be formulated as a linear program. In cases where the linear programming relaxation is not exact, we can still use this formulation to compute an approximate solution, where the slack variables will be increased for any training examples where a fractional decoding yields a better configuration than the best integer decoding.

An alternative min-max formulation was explored in [Taskar et al., 2006b]. In this formulation, we plug a linear programming problem that solves the decoding problem into the unconstrained version of the MMMN problem, giving the saddle-point problem

$\min_{w \in \mathcal{W}} \max_{z \in \mathcal{Z}} \ \lambda \|w\|_2^2 + w^T F z + c^T z - \sum_i w^T F(X_i, Y_i),$

for appropriately defined $\mathcal{Z}$, $F$, and $c$. Subsequently, the extragradient algorithm is applied to find a value of w and z that minimizes over w the maximizing value of z. Using $L(w, z)$ to denote the objective function in the saddle-point problem, the iterations of the extragradient algorithm take the form of alternating projected gradient steps in w and z:

$w^p = \pi_{\mathcal{W}}(w - \eta \nabla_w L(w, z)), \qquad z^p = \pi_{\mathcal{Z}}(z + \eta \nabla_z L(w, z)),$

$w^c = \pi_{\mathcal{W}}(w - \eta \nabla_w L(w^p, z^p)), \qquad z^c = \pi_{\mathcal{Z}}(z + \eta \nabla_z L(w^p, z^p)).$

The step size $\eta$ is selected by a backtracking line search until a suitable condition on $w^p$ and $z^p$ is satisfied [Taskar et al., 2006b]. The iterations converge linearly to the solution, which corresponds to a rate of $O(\log(1/\epsilon))$. Since w is unconstrained, the projection $\pi_{\mathcal{W}}$ is simply the identity function (for associative max-margin Markov networks, we project the edge parameters onto the non-negative orthant), while the projection $\pi_{\mathcal{Z}}$ is the solution of a min-cost quadratic flow problem [Taskar et al., 2006b]. In [Taskar et al., 2006a], a dual extragradient method is considered, along with an extension that uses proximal steps with respect to a Bregman divergence rather than Euclidean projections.
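A sketch of one extragradient iteration in this form; grad_w, grad_z, project_W, and project_Z are assumed helpers (the last of these would hide the min-cost quadratic flow computation mentioned above), and the step size is taken as fixed rather than chosen by line search:

```python
def extragradient_step(w, z, grad_w, grad_z, project_W, project_Z, eta):
    """Prediction step followed by a correction step, each projected onto the feasible sets."""
    w_p = project_W(w - eta * grad_w(w, z))       # predict
    z_p = project_Z(z + eta * grad_z(w, z))
    w_c = project_W(w - eta * grad_w(w_p, z_p))   # correct, using gradients at the predicted point
    z_c = project_Z(z + eta * grad_z(w_p, z_p))
    return w_c, z_c
```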

5.5 Comments

As the discussion above indicates, efficiently solving the structural support vector machine optimization remains a challenging problem. Currently, the sub-gradient descent methods are one appealing choice because of their simplicity, while the cutting plane methods are another appealing option due to the availability of efficient implementations of these methods. However, the $O(1/\epsilon)$ (sub-linear) worst-case convergence rates of these methods are somewhat unappealing. While the $O(\log(1/\epsilon))$ (linear) convergence speed of the extragradient method is more appealing, it does not approach the $O(\log(\log(1/\epsilon)))$ convergence rates of standard methods for solving quadratic programs. We could consider applying interior-point methods (see [Boyd and Vandenberghe, 2004], for example) to one of the polynomial-sized reformulations or one of the min-max formulations, but the number of variables/constraints in these problems would likely make iterations of such a method impractical.

References

[Altun and Hofmann, 2003] Altun, Y. and Hofmann, T. (2003). Large margin methods for label sequence learning. In Eighth European Conference on Speech Communication and Technology. ISCA.

[Altun et al., 2003] Altun, Y., Tsochantaridis, I., Hofmann, T., et al. (2003). Hidden Markov support vector machines. In ICML, volume 20, page 3.

[Bartlett et al., 2005] Bartlett, P., Collins, M., Taskar, B., and McAllester, D. (2005). Exponentiated gradient algorithms for large-margin structured classification. In Advances in Neural Information Processing Systems 17: Proceedings of the 2004 Conference, page 113. MIT Press.

[Bertsekas, 1999] Bertsekas, D. (1999). Nonlinear Programming. Athena Scientific.

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization.

[Collins, 2002] Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 1-8. Association for Computational Linguistics, Morristown, NJ, USA.

[Collins et al., 2008] Collins, M., Globerson, A., Koo, T., Carreras, X., and Bartlett, P. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. The Journal of Machine Learning Research, 9.

[Crammer and Singer, 2001] Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2.

[Dietterich and Bakiri, 1995] Dietterich, T. and Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. arXiv preprint, cs.AI.

[Finley and Joachims, 2008] Finley, T. and Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning. ACM, New York, NY, USA.

[Hutter et al., 2005] Hutter, F., Hoos, H., and Stützle, T. (2005). Efficient stochastic local search for MPE solving. In International Joint Conference on Artificial Intelligence, volume 19, page 169. Lawrence Erlbaum Associates.

[Joachims, 2003] Joachims, T. (2003). Learning to align sequences: A maximum-margin approach. Online manuscript.

[Joachims, 2006] Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA.

[Joachims et al., 2009] Joachims, T., Finley, T., and Yu, C. (2009). Cutting plane training of structural SVMs. Machine Learning.

[Kivinen and Warmuth, 1997] Kivinen, J. and Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation.

[Kolmogorov and Zabih, 2004] Kolmogorov, V. and Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2).

[Kumar et al., 2007] Kumar, M., Kolmogorov, V., and Torr, P. (2007). An analysis of convex relaxations for MAP estimation. Advances in Neural Information Processing Systems, 20.

[Kushner and Yin, 2003] Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer.

[Lacoste-Julien, 2003] Lacoste-Julien, S. (2003). Combining SVM with graphical models for supervised classification: an introduction to Max-Margin Markov Networks.

[Lafferty et al., 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

[Lee and Mangasarian, 2001] Lee, Y. and Mangasarian, O. (2001). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20(1):5-22.

[Ratliff et al., 2007] Ratliff, N., Bagnell, J. A. D., and Zinkevich, M. (2007). (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats).

[Shalev-Shwartz et al., 2007] Shalev-Shwartz, S., Singer, Y., and Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning. ACM, New York, NY, USA.

[Smola et al., 2008] Smola, A., Vishwanathan, S., and Le, Q. (2008). Bundle methods for machine learning. Advances in Neural Information Processing Systems.

[Taskar et al., 2004] Taskar, B., Chatalbashev, V., and Koller, D. (2004). Learning associative Markov networks. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA.

[Taskar et al., 2003] Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. Advances in Neural Information Processing Systems, 16:51.

[Taskar et al., 2006a] Taskar, B., Lacoste-Julien, S., and Jordan, M. (2006a). Structured prediction, dual extragradient and Bregman projections. The Journal of Machine Learning Research, 7.

[Taskar et al., 2006b] Taskar, B., Lacoste-Julien, S., and Jordan, M. (2006b). Structured prediction via the extragradient method. Advances in Neural Information Processing Systems, 18:1345.

[Teo et al., 2007] Teo, C., Smola, A., Vishwanathan, S., and Le, Q. (2007). A scalable modular convex solver for regularized risk minimization. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA.

[Tsochantaridis et al., 2004] Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA.

[Vapnik, 1995] Vapnik, V. (1995). The Nature of Statistical Learning Theory.

[Weston and Watkins, 1999] Weston, J. and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks, volume 4.

[Zhang, 2004] Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, New York, NY, USA.


More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Estimation: Part 2. Chapter GREG estimation

Estimation: Part 2. Chapter GREG estimation Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Pattern Classification

Pattern Classification Pattern Classfcaton All materals n these sldes ere taken from Pattern Classfcaton (nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wley & Sons, 000 th the permsson of the authors and the publsher

More information

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF 10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

ECE559VV Project Report

ECE559VV Project Report ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate

More information

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016

CS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016 CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng

More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012

Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012 Support Vector Machnes Je Tang Knowledge Engneerng Group Department of Computer Scence and Technology Tsnghua Unversty 2012 1 Outlne What s a Support Vector Machne? Solvng SVMs Kernel Trcks 2 What s a

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Introduction to Hidden Markov Models

Introduction to Hidden Markov Models Introducton to Hdden Markov Models Alperen Degrmenc Ths document contans dervatons and algorthms for mplementng Hdden Markov Models. The content presented here s a collecton of my notes and personal nsghts

More information

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

17 Support Vector Machines

17 Support Vector Machines 17 We now dscuss an nfluental and effectve classfcaton algorthm called (SVMs). In addton to ther successes n many classfcaton problems, SVMs are responsble for ntroducng and/or popularzng several mportant

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

CSE 546 Midterm Exam, Fall 2014(with Solution)

CSE 546 Midterm Exam, Fall 2014(with Solution) CSE 546 Mdterm Exam, Fall 014(wth Soluton) 1. Personal nfo: Name: UW NetID: Student ID:. There should be 14 numbered pages n ths exam (ncludng ths cover sheet). 3. You can use any materal you brought:

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

14 Lagrange Multipliers

14 Lagrange Multipliers Lagrange Multplers 14 Lagrange Multplers The Method of Lagrange Multplers s a powerful technque for constraned optmzaton. Whle t has applcatons far beyond machne learnng t was orgnally developed to solve

More information

Feature Selection in Multi-instance Learning

Feature Selection in Multi-instance Learning The Nnth Internatonal Symposum on Operatons Research and Its Applcatons (ISORA 10) Chengdu-Juzhagou, Chna, August 19 23, 2010 Copyrght 2010 ORSC & APORC, pp. 462 469 Feature Selecton n Mult-nstance Learnng

More information

Solutions to exam in SF1811 Optimization, Jan 14, 2015

Solutions to exam in SF1811 Optimization, Jan 14, 2015 Solutons to exam n SF8 Optmzaton, Jan 4, 25 3 3 O------O -4 \ / \ / The network: \/ where all lnks go from left to rght. /\ / \ / \ 6 O------O -5 2 4.(a) Let x = ( x 3, x 4, x 23, x 24 ) T, where the varable

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning Advanced Introducton to Machne Learnng 10715, Fall 2014 The Kernel Trck, Reproducng Kernel Hlbert Space, and the Representer Theorem Erc Xng Lecture 6, September 24, 2014 Readng: Erc Xng @ CMU, 2014 1

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution. Solutons HW #2 Dual of general LP. Fnd the dual functon of the LP mnmze subject to c T x Gx h Ax = b. Gve the dual problem, and make the mplct equalty constrants explct. Soluton. 1. The Lagrangan s L(x,

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information