11 KERNEL METHODS


Many who have never had an opportunity of knowing any more about mathematics confuse it with arithmetic, and consider it an arid science. In reality, however, it is a science which requires a great amount of imagination. -- Sofia Kovalevskaya

Learning Objectives:
- Explain how kernels generalize both feature combinations and basis functions.
- Contrast dot products with kernel products.
- Implement kernelized perceptron.
- Derive a kernelized version of regularized least squares regression.
- Implement a kernelized version of the perceptron.
- Derive the dual formulation of the support vector machine.

Dependencies:

Linear models are great because they are easy to understand and easy to optimize. They suffer because they can only learn very simple decision boundaries. Neural networks can learn more complex decision boundaries, but lose the nice convexity properties of many linear models. One way of getting a linear model to behave non-linearly is to transform the input, for instance by adding feature pairs as additional inputs. Learning a linear model on such a representation is convex, but is computationally prohibitive in all but very low dimensional spaces. You might ask: instead of explicitly expanding the feature space, is it possible to stay with our original data representation and do all the feature blow up implicitly? Surprisingly, the answer is often yes, and the family of techniques that makes this possible is known as kernel approaches.

11.1 From Feature Combinations to Kernels

In Section 5.4, you learned one method for increasing the expressive power of linear models: explode the feature space. For instance, a quadratic feature explosion might map a feature vector x = ⟨x_1, x_2, x_3, ..., x_D⟩ to an expanded version denoted φ(x):

φ(x) = ⟨1, √2 x_1, √2 x_2, √2 x_3, ..., √2 x_D,
        x_1², x_1 x_2, x_1 x_3, ..., x_1 x_D,
        x_2 x_1, x_2², x_2 x_3, ..., x_2 x_D,
        x_3 x_1, x_3 x_2, x_3², ..., x_3 x_D,
        ...,
        x_D x_1, x_D x_2, x_D x_3, ..., x_D²⟩   (11.1)

(Note that there are repetitions here, but hopefully most learning algorithms can deal well with redundant features; in particular, the √2 x_1 terms are due to collapsing some repetitions.)

You could then train a classifier on this expanded feature space. There are two primary concerns in doing so. The first is computational: if your learning algorithm scales linearly in the number of features, then you've just squared the amount of computation you need to perform; you've also squared the amount of memory you'll need. The second is statistical: if you go by the heuristic that you should have about two examples for every feature, then you will now need quadratically many training examples in order to avoid overfitting.

This chapter is all about dealing with the computational issue. It will turn out in Chapter 12 that you can also deal with the statistical issue: for now, you can just hope that regularization will be sufficient to attenuate overfitting.

The key insight in kernel-based learning is that you can rewrite many linear models in a way that doesn't require you to ever explicitly compute φ(x). To start with, you can think of this purely as a computational trick that enables you to use the power of a quadratic feature mapping without actually having to compute and store the mapped vectors. Later, you will see that it's actually quite a bit deeper. Most algorithms we discuss involve a product of the form w · φ(x), after performing the feature mapping. The goal is to rewrite these algorithms so that they only ever depend on dot products between two examples, say x and z; namely, they depend on φ(x) · φ(z). To understand why this is helpful, consider the quadratic expansion from above, and the dot-product between two vectors. You get:

φ(x) · φ(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + ... + 2x_D z_D + x_1² z_1² + x_1 x_2 z_1 z_2 + ... + x_1 x_D z_1 z_D + ... + x_D x_1 z_D z_1 + x_D x_2 z_D z_2 + ... + x_D² z_D²   (11.2)
            = 1 + 2 Σ_d x_d z_d + Σ_d Σ_e x_d x_e z_d z_e   (11.3)
            = 1 + 2 x·z + (x·z)²   (11.4)
            = (1 + x·z)²   (11.5)

Thus, you can compute φ(x) · φ(z) in exactly the same amount of time as you can compute x·z (plus the time it takes to perform an addition and a multiply, about 0.02 nanoseconds on a circa 2011 processor).

The rest of the practical challenge is to rewrite your algorithms so that they only depend on dot products between examples and not on any explicit weight vectors.
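To see the bookkeeping concretely, here is a quick numerical check of Eq (11.5), a minimal NumPy sketch (the function name phi and the toy data are our choices, not the book's): the explicit O(D²)-dimensional map and the implicit O(D) kernel evaluation agree.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map of Eq (11.1): 1, sqrt(2)*x_d, and all x_d*x_e."""
    D = len(x)
    feats = [1.0] + [np.sqrt(2.0) * xd for xd in x]
    feats += [x[d] * x[e] for d in range(D) for e in range(D)]
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(z)       # O(D^2) work through the expanded features
implicit = (1.0 + x @ z) ** 2    # O(D) work through the kernel, Eq (11.5)
print(np.isclose(explicit, implicit))   # True
```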

11.2 Kernelized Perceptron

Consider the original perceptron algorithm from Chapter 4, repeated in Algorithm 29 using linear algebra notation and using feature expansion notation φ(x). In this algorithm, there are two places where φ(x) is used explicitly. The first is in computing the activation (line 4) and the second is in updating the weights (line 6).

Algorithm 29 PerceptronTrain(D, MaxIter)
1:  w ← 0, b ← 0                       // initialize weights and bias
2:  for iter = 1 ... MaxIter do
3:    for all (x, y) ∈ D do
4:      a ← w · φ(x) + b               // compute activation for this example
5:      if ya ≤ 0 then
6:        w ← w + y φ(x)               // update weights
7:        b ← b + y                    // update bias
8:      end if
9:    end for
10: end for
11: return w, b

The goal is to remove the explicit dependence of this algorithm on φ and on the weight vector. To do so, you can observe that at any point in the algorithm, the weight vector w can be written as a linear combination of expanded training data. In particular, at any point, w = Σ_n α_n φ(x_n) for some parameters α_n. Initially, w = 0, so choosing α = 0 yields this. If the first update occurs on the nth training example, then the resulting weight vector is simply y_n φ(x_n), which is equivalent to setting α_n = y_n. If the second update occurs on the nth training example, then all you need to do is update α_n ← α_n + y_n. This is true, even if you make multiple passes over the data. This observation leads to the following representer theorem, which states that the weight vector of the perceptron lies in the span of the training data.

MATH REVIEW | SPANS: If U = {u_i}_{i=1}^I is a set of vectors in R^D, then the span of U is the set of vectors that can be written as linear combinations of the u_i's; namely: span(U) = { Σ_i a_i u_i : a_1 ∈ R, ..., a_I ∈ R }. If all of the u_i's are linearly independent, then the dimension of span(U) is I; in particular, if there are D-many linearly independent vectors then they span R^D.

Theorem 12 (Perceptron Representer Theorem). During a run of the perceptron algorithm, the weight vector w is always in the span of the (assumed non-empty) training data, φ(x_1), ..., φ(x_N).

Proof of Theorem 12. By induction. Base case: the span of any non-empty set contains the zero vector, which is the initial weight vector. Inductive case: suppose that the theorem is true before the kth update, and suppose that the kth update happens on example n. By the inductive hypothesis, you can write w = Σ_i α_i φ(x_i) before the update. The new weight vector is [Σ_i α_i φ(x_i)] + y_n φ(x_n) = Σ_i (α_i + y_n [i = n]) φ(x_i), which is still in the span of the training data.

Now that you know that you can always write w = Σ_n α_n φ(x_n) for some α_n's, you can additionally compute the activations (line 4) as:

w · φ(x) + b = ( Σ_n α_n φ(x_n) ) · φ(x) + b      definition of w   (11.6)
             = Σ_n α_n [ φ(x_n) · φ(x) ] + b      dot products are linear   (11.7)

This now depends only on dot-products between data points, and never explicitly requires a weight vector. You can now rewrite the entire perceptron algorithm so that it never refers explicitly to the weights and only ever depends on pairwise dot products between examples. This is shown in Algorithm 30.

Algorithm 30 KernelizedPerceptronTrain(D, MaxIter)
1:  α ← 0, b ← 0                               // initialize coefficients and bias
2:  for iter = 1 ... MaxIter do
3:    for all (x_n, y_n) ∈ D do
4:      a ← Σ_m α_m φ(x_m) · φ(x_n) + b        // compute activation for this example
5:      if y_n a ≤ 0 then
6:        α_n ← α_n + y_n                      // update coefficients
7:        b ← b + y_n                          // update bias
8:      end if
9:    end for
10: end for
11: return α, b

The advantage to this "kernelized" algorithm is that you can perform feature expansions like the quadratic feature expansion from the introduction for free. For example, for exactly the same cost as the quadratic features, you can use a cubic feature map, computed as φ(x) · φ(z) = (1 + x·z)³, which corresponds to three-way interactions between variables. (And, in general, you can do so for any polynomial degree p at the same computational complexity.)
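Algorithm 30 is short enough to transcribe directly. The following is a minimal sketch in NumPy (the function names and the Gram-matrix precomputation are our choices, not the book's); it assumes labels in {−1, +1} and any kernel function K(x, z).

```python
import numpy as np

def kernelized_perceptron_train(X, y, kernel, max_iter=100):
    """Kernelized perceptron training (a sketch of Algorithm 30).

    X: (N, D) array; y: (N,) labels in {-1, +1}; kernel: function K(x, z)."""
    N = len(X)
    alpha, b = np.zeros(N), 0.0
    # Precompute the Gram matrix so each activation is a single dot product.
    K = np.array([[kernel(xm, xn) for xn in X] for xm in X])
    for _ in range(max_iter):
        for n in range(N):
            a = alpha @ K[:, n] + b     # activation: sum_m alpha_m K(x_m, x_n) + b
            if y[n] * a <= 0:           # mistake on example n
                alpha[n] += y[n]        # update coefficient
                b += y[n]               # update bias
    return alpha, b

def kernelized_perceptron_predict(alpha, b, X_train, kernel, x):
    """Predict a label for x using the expansion w = sum_n alpha_n phi(x_n)."""
    return np.sign(sum(a * kernel(xn, x) for a, xn in zip(alpha, X_train)) + b)

# A cubic polynomial kernel: same cost to evaluate as a plain dot product.
cubic = lambda x, z: (1.0 + x @ z) ** 3
```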

11.3 Kernelized K-means

For a complete change of pace, consider the K-means algorithm from Section 3. This algorithm is for clustering, where there is no notion of training labels. Instead, you want to partition the data into coherent clusters. For data in R^D, it involves randomly initializing K-many cluster means µ^(1), ..., µ^(K). The algorithm then alternates between the following two steps until convergence, with x replaced by φ(x) since that is the eventual goal:

1. For each example n, set cluster label z_n = argmin_k ||φ(x_n) − µ^(k)||².
2. For each cluster k, update µ^(k) = (1/N_k) Σ_{n : z_n = k} φ(x_n), where N_k is the number of n with z_n = k.

The question is whether you can perform these steps without explicitly computing φ(x_n). The representer theorem is more straightforward here than in the perceptron. The mean of a set of data is, almost by definition, in the span of that data (choose the a_i's all to be equal to 1/N). Thus, so long as you initialize the means in the span of the data, you are guaranteed always to have the means in the span of the data. Given this, you know that you can write each mean as an expansion of the data; say that µ^(k) = Σ_n α_n^(k) φ(x_n) for some parameters α_n^(k) (there are N×K-many such parameters). Given this expansion, in order to execute step (1), you need to compute norms. This can be done as follows:

z_n = argmin_k ||φ(x_n) − µ^(k)||²                                            definition of z_n   (11.8)
    = argmin_k ||φ(x_n) − Σ_m α_m^(k) φ(x_m)||²                               definition of µ^(k)   (11.9)
    = argmin_k ||φ(x_n)||² + ||Σ_m α_m^(k) φ(x_m)||² − 2 φ(x_n) · [Σ_m α_m^(k) φ(x_m)]   expand quadratic term   (11.10)
    = argmin_k Σ_m Σ_{m'} α_m^(k) α_{m'}^(k) φ(x_m) · φ(x_{m'}) − 2 Σ_m α_m^(k) φ(x_n) · φ(x_m) + const   linearity and constant   (11.11)

This computation can replace the assignments in step (1) of K-means. The mean updates are more direct in step (2):

µ^(k) = (1/N_k) Σ_{n : z_n = k} φ(x_n)   ⟺   α_n^(k) = 1/N_k if z_n = k, and 0 otherwise   (11.12)
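The two steps translate directly into Gram-matrix operations. Below is a minimal vectorized sketch (our own rendering, with hypothetical function names): kernel_kmeans_assign implements Eq (11.11), dropping the ||φ(x_n)||² term since it is constant across clusters, and kernel_kmeans_update implements Eq (11.12).

```python
import numpy as np

def kernel_kmeans_assign(K, alpha):
    """One assignment step of kernelized K-means, per Eq (11.11).

    K: (N, N) Gram matrix; alpha: (C, N) mean-expansion coefficients.
    Returns z: (N,) cluster labels."""
    cross = alpha @ K                                  # cross[k, n] = sum_m alpha[k,m] K[m,n]
    quad = np.einsum('km,kn,mn->k', alpha, alpha, K)   # quad[k] = ||mu^(k)||^2 in feature space
    dists = quad[:, None] - 2.0 * cross                # K[n,n] is constant over k, so omitted
    return np.argmin(dists, axis=0)

def kernel_kmeans_update(z, n_clusters):
    """Mean update, per Eq (11.12): alpha[k, m] = 1/N_k if z_m == k, else 0."""
    alpha = np.zeros((n_clusters, len(z)))
    for k in range(n_clusters):
        members = (z == k)
        if members.any():
            alpha[k, members] = 1.0 / members.sum()
    return alpha
```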

11.4 What Makes a Kernel

A kernel is just a form of generalized dot product. You can also think of it as simply shorthand for φ(x) · φ(z), which is commonly written K_φ(x, z). Or, when φ is clear from context, simply K(x, z). This is often referred to as the kernel product between x and z (under the mapping φ).

In this view, what you've seen in the preceding two sections is that you can rewrite both the perceptron algorithm and the K-means algorithm so that they only ever depend on kernel products between data points, and never on the actual data points themselves. This is a very powerful notion, as it has enabled the development of a large number of non-linear algorithms essentially for free (by applying the so-called kernel trick, which you've just seen twice).

This raises an interesting question. If you have rewritten these algorithms so that they only depend on the data through a function K : X × X → R, can you stick any function K in these algorithms, or are there some K that are forbidden? In one sense, you could use any K, but the real question is: for what types of functions K do these algorithms retain the properties that we expect them to have (like convergence, optimality, etc.)?

One way to answer this question is to say that K(·, ·) is a valid kernel if it corresponds to the inner product between two vectors. That is, K is valid if there exists a function φ such that K(x, z) = φ(x) · φ(z). This is a direct definition and it should be clear that if K satisfies this, then the algorithms go through as expected (because this is how we derived them).

You've already seen the general class of polynomial kernels, which have the form:

K_d^(poly)(x, z) = (1 + x·z)^d   (11.13)

where d is a hyperparameter of the kernel. These kernels correspond to polynomial feature expansions.

There is an alternative characterization of a valid kernel function that is more mathematical. It states that K : X × X → R is a kernel if K is positive semi-definite (or, in shorthand, psd). This property is also sometimes called Mercer's condition. In this context, this means that for all functions f that are square integrable (i.e., ∫ f(x)² dx < ∞), other than the zero function, the following property holds:

∫∫ f(x) K(x, z) f(z) dx dz > 0   (11.14)

This likely seems like it came out of nowhere. Unfortunately, the connection is well beyond the scope of this book, but is covered well in external sources. For now, simply take it as a given that this is an equivalent requirement. (For those so inclined, the appendix of this book gives a proof, but it requires a bit of knowledge of function spaces to understand.)
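A useful finite-sample consequence of this condition: for a valid kernel, the N×N Gram matrix with entries K(x_n, x_m) over any dataset is positive semi-definite, so all of its eigenvalues are non-negative. This is easy to check numerically; the sketch below (our construction, not from the book) does so for polynomial and RBF kernels on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # 50 random points in R^4

def gram(kernel, X):
    """Gram matrix with entries K[n, m] = kernel(x_n, x_m)."""
    return np.array([[kernel(xn, xm) for xm in X] for xn in X])

poly = lambda x, z: (1.0 + x @ z) ** 3                    # Eq (11.13) with d = 3
rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))    # Eq (11.18) with gamma = 0.5

for name, k in [("poly", poly), ("rbf", rbf)]:
    eigs = np.linalg.eigvalsh(gram(k, X))
    # For a valid kernel, the smallest eigenvalue is nonnegative
    # up to floating-point error (measured relative to the largest).
    print(name, eigs.min() >= -1e-10 * eigs.max())   # True for both
```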

The question is: why is this alternative characterization useful? It is useful because it gives you an alternative way to construct kernel functions. For instance, using it you can easily prove the following, which would be difficult from the definition of kernels as inner products after feature mappings.

Theorem 13 (Kernel Addition). If K_1 and K_2 are kernels, then K defined by K(x, z) = K_1(x, z) + K_2(x, z) is also a kernel.

Proof of Theorem 13. You need to verify the positive semi-definite property on K. You can do this as follows:

∫∫ f(x) K(x, z) f(z) dx dz
  = ∫∫ f(x) [K_1(x, z) + K_2(x, z)] f(z) dx dz   (11.15)   definition of K
  = ∫∫ f(x) K_1(x, z) f(z) dx dz + ∫∫ f(x) K_2(x, z) f(z) dx dz   (11.16)   distributive rule
  > 0   (11.17)   K_1 and K_2 are psd

More generally, any positive linear combination of kernels is still a kernel. Specifically, if K_1, ..., K_M are all kernels, and α_1, ..., α_M ≥ 0, then K(x, z) = Σ_m α_m K_m(x, z) is also a kernel.

You can also use this property to show that the following Gaussian kernel (also called the RBF kernel) is also psd:

K_γ^(RBF)(x, z) = exp( −γ ||x − z||² )   (11.18)

Here γ is a hyperparameter that controls the width of these Gaussian-like bumps. To gain an intuition for what the RBF kernel is doing, consider what prediction looks like in the perceptron:

f(x̂) = Σ_n α_n K(x_n, x̂) + b   (11.19)
     = Σ_n α_n exp( −γ ||x_n − x̂||² ) + b   (11.20)

In this computation, each training example is getting to "vote" on the label of the test point x̂. The amount of "vote" that the nth training example gets is proportional to the negative exponential of the distance between the test point and itself. This is very much like an RBF neural network, in which there is a Gaussian bump at each training example, with variance 1/(2γ), and where the α_n's act as the weights connecting these RBF bumps to the output.

Showing that this kernel is positive definite is a bit of an exercise in analysis (particularly, integration by parts), but otherwise not difficult. Again, the proof is provided in the appendix.

So far, you have seen two basic classes of kernels: polynomial kernels (K(x, z) = (1 + x·z)^d), which include the linear kernel (K(x, z) = x·z), and RBF kernels (K(x, z) = exp[−γ ||x − z||²]). The former have a direct connection to feature expansion; the latter to RBF networks. You also know how to combine kernels to get new kernels by addition. In fact, you can do more than that: the product of two kernels is also a kernel.

As far as a library of kernels goes, there are many. Polynomial and RBF are by far the most popular. A commonly used, but technically invalid, kernel is the hyperbolic-tangent kernel, which mimics the behavior of a two-layer neural network. It is defined as:

K^(tanh)(x, z) = tanh(1 + x·z)    Warning: not psd   (11.21)

A final example, which is not very common, but is nonetheless interesting, is the all-subsets kernel. Suppose that your D features are all binary: all take values 0 or 1. Let A ⊆ {1, 2, ..., D} be a subset of features, and let f_A(x) = Π_{d∈A} x_d be the conjunction of all the features in A. Let φ(x) be a feature vector over all such A's, so that there are 2^D features in the vector φ. You can compute the kernel associated with this feature mapping as:

K^(subs)(x, z) = Π_d (1 + x_d z_d)   (11.22)

Verifying the relationship between this kernel and the all-subsets feature mapping is left as an exercise (but closely resembles the expansion for the quadratic kernel).
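The exercise can at least be checked numerically. The sketch below is our code (phi_subsets enumerates one conjunction feature per subset A, with the empty subset as the constant feature 1); it confirms Eq (11.22) against the explicit 2^D-dimensional map for small D.

```python
import itertools
import numpy as np

def phi_subsets(x):
    """Explicit all-subsets feature map: one conjunction f_A(x) per subset A
    of {0, ..., D-1}; the empty subset gives the constant feature 1."""
    D = len(x)
    feats = []
    for r in range(D + 1):
        for A in itertools.combinations(range(D), r):
            feats.append(np.prod([x[d] for d in A]) if A else 1.0)
    return np.array(feats)

def k_subsets(x, z):
    """Implicit version, Eq (11.22): a product of D terms instead of 2^D features."""
    return np.prod(1.0 + x * z)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=6).astype(float)    # binary feature vectors
z = rng.integers(0, 2, size=6).astype(float)
print(np.isclose(phi_subsets(x) @ phi_subsets(z), k_subsets(x, z)))   # True
```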

11.5 Support Vector Machines

Kernelization predated support vector machines, but SVMs are definitely the model that popularized the idea. Recall the definition of the soft-margin SVM from Chapter 8.7 and in particular the optimization problem (8.38), which attempts to balance a large margin (small ||w||²) with a small loss (small ξ_n's, where ξ_n is the slack on the nth training example). This problem is repeated below:

min_{w,b,ξ}  ½ ||w||² + C Σ_n ξ_n   (11.23)
subj. to  y_n (w · x_n + b) ≥ 1 − ξ_n   (∀n)
          ξ_n ≥ 0   (∀n)

Previously, you optimized this by explicitly computing the slack variables ξ_n, given a solution to the decision boundary, w and b. However, you are now an expert with using Lagrange multipliers to optimize constrained problems! The overall goal is going to be to rewrite the SVM optimization problem in a way that it no longer explicitly depends on the weights w and only depends on the examples x_n through kernel products.

There are 2N constraints in this optimization, one for each slack constraint and one for the requirement that the slacks are non-negative. Unlike the last time, these constraints are now inequalities, which require a slightly different solution. First, you rewrite all the inequalities so that they read as something ≥ 0 and then add corresponding Lagrange multipliers. The main difference is that the Lagrange multipliers are now constrained to be non-negative, and their sign in the augmented objective function matters. The second set of constraints is already in the proper form; the first set can be rewritten as y_n (w · x_n + b) − 1 + ξ_n ≥ 0. You're now ready to construct the Lagrangian, using multipliers α_n for the first set of constraints and β_n for the second set:

L(w, b, ξ, α, β) = ½ ||w||² + C Σ_n ξ_n − Σ_n β_n ξ_n   (11.24)
                   − Σ_n α_n [ y_n (w · x_n + b) − 1 + ξ_n ]   (11.25)

The new optimization problem is:

min_{w,b,ξ} max_{α≥0} max_{β≥0} L(w, b, ξ, α, β)   (11.26)

The intuition is exactly the same as before. If you are able to find a solution that satisfies the constraints (e.g., the slack ξ_n is properly non-negative), then the β_n's cannot do anything to hurt the solution. On the other hand, if ξ_n is negative, then the corresponding β_n can go to +∞, breaking the solution.

You can solve this problem by taking gradients. This is a bit tedious, but an important step to realize how everything fits together. Since your goal is to remove the dependence on w, the first step is to take a gradient with respect to w, set it equal to zero, and solve for w in terms of the other variables:

∇_w L = w − Σ_n α_n y_n x_n = 0   ⟺   w = Σ_n α_n y_n x_n   (11.27)

At this point, you should immediately recognize a similarity to the kernelized perceptron: the optimal weight vector takes exactly the same form in both algorithms.

You can now take this new expression for w and plug it back in to the expression for L, thus removing w from consideration.

To avoid subscript overloading, you should replace the index n in the expression for w with, say, m. This yields:

L(b, ξ, α, β) = ½ || Σ_m α_m y_m x_m ||² + C Σ_n ξ_n − Σ_n β_n ξ_n   (11.28)
                − Σ_n α_n [ y_n ( ( Σ_m α_m y_m x_m ) · x_n + b ) − 1 + ξ_n ]   (11.29)

At this point, it's convenient to rewrite these terms; be sure you understand where the following comes from:

L(b, ξ, α, β) = ½ Σ_m Σ_n α_m α_n y_m y_n x_m · x_n + Σ_n (C − β_n) ξ_n   (11.30)
                − Σ_m Σ_n α_m α_n y_m y_n x_m · x_n − Σ_n α_n (y_n b − 1 + ξ_n)   (11.31)
              = −½ Σ_m Σ_n α_m α_n y_m y_n x_m · x_n + Σ_n (C − β_n) ξ_n   (11.32)
                − b Σ_n α_n y_n − Σ_n α_n (ξ_n − 1)   (11.33)

Things are starting to look good: you've successfully removed the dependence on w, and everything is now written in terms of dot products between input vectors! This might still be a difficult problem to solve, so you need to continue and attempt to remove the remaining variables b and ξ. The derivative with respect to b is:

∂L/∂b = −Σ_n α_n y_n = 0   (11.34)

This doesn't allow you to substitute b with something (as you did with w), but it does mean that the term b Σ_n α_n y_n goes to zero at the optimum. The last of the original variables is ξ_n; the derivatives in this case look like:

∂L/∂ξ_n = C − β_n − α_n = 0   ⟺   C − β_n = α_n   (11.35)

Again, this doesn't allow you to substitute, but it does mean that you can rewrite the second term, which was Σ_n (C − β_n) ξ_n, as Σ_n α_n ξ_n. This then cancels with (most of) the final term. However, you need to be careful to remember something. When we optimize, both α_n and β_n are constrained to be non-negative. What this means is that since we are dropping β from the optimization, we need to ensure that α_n ≤ C, otherwise the corresponding β_n would need to be negative, which is not allowed.

You finally wind up with the following, where x_m · x_n has been replaced by K(x_m, x_n):

L(α) = Σ_n α_n − ½ Σ_m Σ_n α_m α_n y_m y_n K(x_m, x_n)   (11.36)

If you are comfortable with matrix notation, this has a very compact form. Let 1 denote the N-dimensional vector of all 1s, let y denote the vector of labels, and let G be the N×N matrix where G_{m,n} = y_m y_n K(x_m, x_n); then this has the following form:

L(α) = αᵀ1 − ½ αᵀ G α   (11.37)

The resulting optimization problem is to maximize L(α) as a function of α, subject to the constraint that the α_n's are all non-negative and less than C (because of the constraint added when removing the β variables). Thus, your problem is:

min_α  −L(α) = ½ Σ_m Σ_n α_m α_n y_m y_n K(x_m, x_n) − Σ_n α_n   (11.38)
subj. to  0 ≤ α_n ≤ C   (∀n)

One way to solve this problem is gradient descent on α. The only complication is making sure that the α's satisfy the constraints. In this case, you can use a projected gradient algorithm: after each gradient update, you adjust your parameters to satisfy the constraints by projecting them into the feasible region. In this case, the projection is trivial: if, after a gradient step, any α_n < 0, simply set it to 0; if any α_n > C, set it to C.
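Here is a minimal sketch of that projected gradient approach (our own code, with a hypothetical function name). Note it handles only the box constraint from (11.38), as the text describes; a production solver such as SMO would additionally enforce the equality constraint Σ_n α_n y_n = 0 that arose when eliminating b.

```python
import numpy as np

def svm_dual_projected_gd(K, y, C=1.0, lr=1e-3, iters=2000):
    """Projected gradient descent on the SVM dual, Eq (11.38).

    Minimizes 0.5 * a^T G a - 1^T a subject to 0 <= a_n <= C, where
    G[m, n] = y_m * y_n * K[m, n].  Only the box constraint is handled,
    matching the text's simplified problem."""
    G = np.outer(y, y) * K
    a = np.zeros(len(y))
    for _ in range(iters):
        grad = G @ a - 1.0        # gradient of the negated dual objective
        a = a - lr * grad         # unconstrained gradient step
        a = np.clip(a, 0.0, C)    # projection: clip into the feasible box
    return a
```

Prediction then uses the kernelized form from above: f(x̂) = sign(Σ_n α_n y_n K(x_n, x̂) + b).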

11.6 Understanding Support Vector Machines

The prior discussion involved quite a bit of math to derive a representation of the support vector machine in terms of the Lagrange variables. This mapping is actually sufficiently standard that everything in it has a name. The original problem variables (w, b, ξ) are called the primal variables; the Lagrange variables are called the dual variables. The optimization problem that results after removing all of the primal variables is called the dual problem.

A succinct way of saying what you've done is: you found that after converting the SVM into its dual, it is possible to kernelize.

To understand SVMs, a first step is to peek into the dual formulation, Eq (11.38). The objective has two terms: the first depends on the data, and the second depends only on the dual variables. The first thing to notice is that, because of the second term, the α's want to get as large as possible. The constraint ensures that they cannot exceed C, which means that the general tendency is for the α's to grow as close to C as possible.

To further understand the dual optimization problem, it is useful to think of the kernel as being a measure of similarity between two data points. This analogy is most clear in the case of RBF kernels, but even in the case of linear kernels, if your examples all have unit norm, then their dot product is still a measure of similarity. Since you can write the prediction function as f(x̂) = sign(Σ_n α_n y_n K(x_n, x̂) + b), it is natural to think of α_n as the importance of training example n, where α_n = 0 means that it is not used at all at test time.

Consider two data points that have the same label; namely, y_n = y_m. This means that y_n y_m = +1 and the objective function has a term that looks like α_n α_m K(x_n, x_m). Since the goal is to make this term small, then one of two things has to happen: either K has to be small, or α_n α_m has to be small. If K is already small, then this doesn't affect the setting of the corresponding α's. But if K is large, then this strongly encourages at least one of α_n or α_m to go to zero. So if you have two data points that are very similar and have the same label, at least one of the corresponding α's will be small. This makes intuitive sense: if you have two data points that are basically the same (both in the x and y sense) then you only need to keep one of them around.

Suppose that you have two data points with different labels: y_n y_m = −1. Again, if K(x_n, x_m) is small, nothing happens. But if it is large, then the corresponding α's are encouraged to be as large as possible. In other words, if you have two similar examples with different labels, you are strongly encouraged to keep the corresponding α's as large as C.
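You can watch this happen on data. Continuing from the svm_dual_projected_gd sketch above (the toy blobs and thresholds here are our choices), the learned α's typically split into the three regimes just described: exactly zero, strictly between 0 and C, and pinned at C.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two slightly overlapping blobs, so that some slack is actually used.
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

K = X @ X.T                             # linear kernel Gram matrix
C = 1.0
a = svm_dual_projected_gd(K, y, C=C)    # from the sketch above

tol = 1e-6
print("alpha = 0 (ignored at test time):", np.sum(a < tol))
print("0 < alpha < C (on the margin):   ", np.sum((a >= tol) & (a <= C - tol)))
print("alpha = C (margin violators):    ", np.sum(a > C - tol))
```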

An alternative way of understanding the SVM dual problem is geometric. Remember that the whole point of introducing the variable α_n was to ensure that the nth training example was correctly classified, modulo slack. More formally, the goal of α_n is to ensure that y_n (w · x_n + b) − 1 + ξ_n ≥ 0. Suppose that this constraint is not satisfied. There is an important result in optimization theory, called the Karush-Kuhn-Tucker conditions (or KKT conditions, for short), which states that at the optimum, the product of the Lagrange multiplier for a constraint and the value of that constraint will equal zero. In this case, this says that at the optimum, you have:

α_n [ y_n (w · x_n + b) − 1 + ξ_n ] = 0   (11.39)

In order for this to be true, it means that (at least) one of the following must be true:

α_n = 0   or   y_n (w · x_n + b) − 1 + ξ_n = 0   (11.40)

A reasonable question to ask is: under what circumstances will α_n be non-zero? From the KKT conditions, you can discern that α_n can be non-zero only when the constraint holds exactly; namely, that y_n (w · x_n + b) − 1 + ξ_n = 0. When does that constraint hold exactly? It holds exactly only for those points precisely on the margin of the hyperplane. In other words, the only training examples for which α_n ≠ 0 are those that lie precisely 1 unit away from the maximum margin decision boundary! (Or those that are moved there by the corresponding slack.) These points are called the support vectors because they "support" the decision boundary. In general, the number of support vectors is far smaller than the number of training examples, and therefore you naturally end up with a solution that only uses a subset of the training data.

From the first discussion, you know that the points that wind up being support vectors are exactly those that are confusable, in the sense that you have two examples that are nearby but have different labels. This is completely in line with the previous discussion. If you have a decision boundary, it will pass between these confusable points, and therefore they will end up being part of the set of support vectors.

11.7 Further Reading

TODO: further reading
