Statistical Machine Learning Notes 7
Kernel Methods and SVMs
Instructor: Justin Domke

Contents

1 Introduction 2
2 Kernel Ridge Regression 2
3 The Kernel Trick 5
4 Support Vector Machines 7
5 Examples 10
6 Kernel Theory 16
6.1 Kernel algebra ................................... 16
6.2 Understanding Polynomial Kernels via Kernel Algebra ............ 18
6.3 Mercer's Theorem ................................. 19
7 Our Story so Far 21
8 Discussion 22
8.1 SVMs as Template Methods ........................... 22
8.2 Theoretical Issues ................................. 23
1 Introduction

Support Vector Machines (SVMs) are a very successful and popular set of techniques for classification. Historically, SVMs emerged after the neural network boom of the 80s and early 90s. People were surprised to see that SVMs with little to no tweaking could compete with neural networks involving a great deal of manual engineering. It remains true today that SVMs are among the best off-the-shelf classification methods. If you want to get good results with a minimum of messing around, SVMs are a very good choice.

Unlike the other classification methods we discuss, it is not convenient to begin with a concise definition of SVMs, or even to say what exactly a support vector is. There is a set of ideas that must be understood first. Most of these you have already seen in the notes on linear methods, basis expansions, and template methods. The biggest remaining concept is known as the kernel trick. In fact, this idea is so fundamental that many people have advocated that SVMs be renamed Kernel Machines.

It is worth mentioning that the standard presentation of SVMs is based on the concept of margin. For lack of time, this perspective on SVMs will not be presented here. If you will be working seriously with SVMs, you should familiarize yourself with the margin perspective to enjoy a full understanding.

(Warning: These notes are probably the most technically challenging in this course, particularly if you don't have a strong background in linear algebra, Lagrange multipliers, and optimization. Kernel methods simply use more mathematical machinery than most of the other techniques we cover, so you should be prepared to put in some extra effort. Enjoy!)

2 Kernel Ridge Regression

We begin by not talking about SVMs, or even about classification. Instead, we revisit ridge regression, with a slight change of notation. Let the set of inputs be {(x_i, y_i)}, where i indexes the samples. The problem is to minimize

    Σ_i (x_i · w − y_i)² + λ w · w.

If we take the derivative with respect to w and set it to zero, we get
    0 = Σ_i 2 x_i (x_iᵀ w − y_i) + 2 λ w

    w = (Σ_i x_i x_iᵀ + λ I)⁻¹ Σ_i x_i y_i.

Now, let's consider a different derivation, making use of some Lagrange duality. If we introduce a new variable z_i, and constrain it to be the difference between w · x_i and y_i, we have

    min_{w,z}  ½ Σ_i z_i² + ½ λ w · w     (2.1)
    s.t.  z_i = x_i · w − y_i.

Using α_i to denote the Lagrange multipliers, this has the Lagrangian

    L = ½ Σ_i z_i² + ½ λ w · w + Σ_i α_i (x_i · w − y_i − z_i).

Recall our foray into Lagrange duality. We can solve the original problem by doing

    max_α min_{w,z} L(w, z, α).

To begin, we attack the inner minimization: for fixed α, we would like to solve for the minimizing w and z. We can do this by setting the derivatives of L with respect to z_i and w to be zero¹. Doing this, we can find

    z_i = α_i,   w = −(1/λ) Σ_i α_i x_i.     (2.2)

So, we can solve the problem by maximizing the Lagrangian (with respect to α), where we substitute the above expressions for z and w. Thus, we have an unconstrained maximization

    max_α L(w*(α), z*(α), α).

¹ 0 = dL/dz_i = z_i − α_i;  0 = dL/dw = λ w + Σ_i α_i x_i.
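The closed-form solution above is easy to sanity-check numerically. The following sketch (ours, in Python with NumPy; the variable names are not from the notes) solves the ridge problem directly and confirms that the gradient of the objective vanishes at the solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.5
X = rng.normal(size=(n, d))          # rows are the inputs x_i
y = rng.normal(size=n)

# Closed form: w = (sum_i x_i x_i^T + lam*I)^{-1} sum_i x_i y_i
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of sum_i (x_i.w - y_i)^2 + lam*w.w should vanish at the optimum
grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
print(np.abs(grad).max())  # numerically zero
```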
Before diving into the details of that, we can already notice something very interesting happening here: w is given by a weighted sum of the input vectors x_i, with weights −α_i/λ. If we were so inclined, we could avoid explicitly computing w, and predict a new point x directly from the data as

    f(x) = x · w = −(1/λ) Σ_i α_i x · x_i.

Now, let k(x, x_i) = x · x_i be the kernel function. For now, just think of this as a change of notation. Using this, we can again write the ridge regression predictions as

    f(x) = −(1/λ) Σ_i α_i k(x, x_i).

Thus, all we really need is the inner product of x with each of the training elements x_i. We will return to why this might be useful later. First, let's return to doing the maximization over the Lagrange multipliers α, to see if anything similar happens there. The math below looks really complicated. However, all we are doing is substituting the expressions for z and w from Eq. 2.2, then doing a lot of manipulation.

    max_α min_{w,z} L
    = max_α  ½ Σ_i α_i² + ½ λ ‖(1/λ) Σ_i α_i x_i‖² + Σ_i α_i (−x_i · (1/λ) Σ_j α_j x_j − y_i − α_i)
    = max_α  −½ Σ_i α_i² + (1/2λ) Σ_i Σ_j α_i α_j x_i · x_j − (1/λ) Σ_i Σ_j α_i α_j x_i · x_j − Σ_i α_i y_i
    = max_α  −½ Σ_i α_i² − (1/2λ) Σ_i Σ_j α_i α_j x_i · x_j − Σ_i α_i y_i
    = max_α  −½ Σ_i α_i² − (1/2λ) Σ_i Σ_j α_i α_j k(x_i, x_j) − Σ_i α_i y_i.

Again, we only need inner products. If we define the matrix K by K_ij = k(x_i, x_j), then we can rewrite this in punchier vector notation as
    max_α min_{w,z} L = max_α  −½ α · α − (1/2λ) αᵀ K α − α · y.

Here, we use a capital K to denote the matrix with entries K_ij and a lowercase k to denote the kernel function k(·, ·). Note that most literature on kernel machines mildly abuses notation by using the capital letter K for both.

The thing on the right is just a quadratic in α. As such, we can find the optimum as the solution of a linear system². What is important is the observation that, again, we only need the inner products of the data k(x_i, x_j) = x_i · x_j to do the optimization over α. Then, once we have solved for α, we can predict f(x) for new x, again using only inner products. If someone tells us all the inner products, we don't need the original data {x_i} at all!

3 The Kernel Trick

So we can work completely with inner products, rather than the vectors themselves. So what?

One way of looking at things is that we can implicitly use basis expansions. If we want to take x, and transform it into some fancy feature space φ(x), we can replace the kernel function by

    K_ij = k(x_i, x_j) = φ(x_i) · φ(x_j).

The point of talking about this is that for certain basis expansions, we can compute k very cheaply without ever explicitly forming φ(x_i) or φ(x_j). This can mean a huge computational savings. A nice example of this is the kernel function

    k(x, v) = (x · v)².

² It is easy to show (by taking the gradient) that the optimum is at α = −((1/λ) K + I)⁻¹ y.
We can see that

    k(x, v) = (Σ_i x_i v_i)²
            = (Σ_i x_i v_i)(Σ_j x_j v_j)
            = Σ_i Σ_j x_i x_j v_i v_j.

It is not hard to see that k(x, v) = φ(x) · φ(v), where φ is the quadratic basis expansion containing all pairwise products, φ_m(x) = x_i x_j. For example, in two dimensions,

    k(x, v) = (x_1 v_1 + x_2 v_2)² = x_1 x_1 v_1 v_1 + x_1 x_2 v_1 v_2 + x_2 x_1 v_2 v_1 + x_2 x_2 v_2 v_2,

while the basis expansions are

    φ(x) = (x_1 x_1, x_1 x_2, x_2 x_1, x_2 x_2),
    φ(v) = (v_1 v_1, v_1 v_2, v_2 v_1, v_2 v_2).

It is not hard to work out that k(x, v) = φ(x) · φ(v). However, notice that we can compute k(x, v) in time O(d), rather than the O(d²) time it would take to explicitly compute φ(x) · φ(v). This is the kernel trick: getting around the computational expense of computing large basis expansions by directly computing kernel functions.

Notice, however, that the kernel trick changes nothing, nada, zero about the statistical issues with huge basis expansions. We get exactly the same predictions as if we computed the basis expansion explicitly and used traditional linear methods. We just compute the predictions in a different way.

In fact, we can invent a new kernel function k(x, v) and, as long as it obeys certain rules, use it in the above algorithm, without explicitly thinking about basis expansions at all. Some common examples are:

    name                    k(x, v)
    Linear                  x · v
    Polynomial              (r + x · v)^d, for some r, d > 0
    Radial Basis Function   exp(−γ ‖x − v‖²), γ > 0
    Gaussian                exp(−‖x − v‖² / (2σ²))
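To make the trick concrete, here is a small numerical sketch (ours, not from the notes). It checks that the quadratic kernel k(x, v) = (x · v)² really equals φ(x) · φ(v) for the pairwise-product expansion, and that kernel ridge regression with this kernel, using the dual solution α = −((1/λ) K + I)⁻¹ y as reconstructed above, gives the same predictions as ordinary ridge on the explicit expansion:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # explicit quadratic expansion: all pairwise products x_i * x_j
    return np.outer(x, x).ravel()

x, v = rng.normal(size=4), rng.normal(size=4)
print(np.isclose((x @ v) ** 2, phi(x) @ phi(v)))    # True: the kernel trick

# Kernel ridge with k(x,v) = (x.v)^2 vs. ridge on the explicit expansion
n, lam = 30, 0.7
X = rng.normal(size=(n, 4))
y = rng.normal(size=n)
K = (X @ X.T) ** 2                                  # K_ij = k(x_i, x_j)
alpha = -np.linalg.solve(K / lam + np.eye(n), y)    # dual solution
Phi = np.array([phi(xi) for xi in X])               # explicit features
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(16), Phi.T @ y)

x_new = rng.normal(size=4)
f_dual = -(1 / lam) * alpha @ ((X @ x_new) ** 2)    # -(1/lam) sum_i alpha_i k(x_new, x_i)
f_primal = phi(x_new) @ w
print(f_dual, f_primal)                             # equal up to rounding
```

The dual route never forms the 16-dimensional features for prediction; it only evaluates kernels against the training points.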
We will return below to the question of what kernel functions are legal, meaning there is some feature space φ such that k(x, v) = φ(x) · φ(v).

Now, what exactly was it about ridge regression that let us get away with working entirely with inner products? How much could we change the problem, and preserve this? We really need two things to happen:

1. When we take dL/dw = 0, we need to be able to solve for w, and the solution needs to be a linear combination of the input vectors x_i.

2. When we substitute this solution back into the Lagrangian, we need to get a solution that simplifies down into inner products only.

Notice that this leaves us a great deal of flexibility. For example, we could replace the least-squares criterion Σ_i z_i² with an alternative (convex) measure. We could also change the way in which we measure errors from z_i = w · x_i − y_i to something else (although with some restrictions).

4 Support Vector Machines

Now, we turn to the binary classification problem. Support Vector Machines result from minimizing the hinge loss (1 − y_i w · x_i)₊ with ridge regularization:

    min_w Σ_i (1 − y_i w · x_i)₊ + λ ‖w‖².

This is equivalent to (for c = 1/(2λ))

    min_w c Σ_i (1 − y_i w · x_i)₊ + ½ ‖w‖².

Because the hinge loss is non-differentiable, we introduce new variables z_i, creating a constrained optimization

    min_{z,w} c Σ_i z_i + ½ ‖w‖²     (4.1)
    s.t.  z_i ≥ 1 − y_i w · x_i
          z_i ≥ 0.
Introducing new constraints to simplify an objective like this seems strange at first, but isn't too hard to understand. Notice the constraints are exactly equivalent to forcing that z_i ≥ (1 − y_i w · x_i)₊. But since we are minimizing the sum of all the z_i, the optimization will make each as small as possible, and so z_i will be the hinge loss for example i, no more, no less.

Introducing Lagrange multipliers α_i ≥ 0 to enforce that z_i ≥ 1 − y_i w · x_i, and µ_i ≥ 0 to enforce that z_i ≥ 0, we get the Lagrangian

    L = c Σ_i z_i + ½ ‖w‖² + Σ_i α_i (1 − y_i w · x_i − z_i) + Σ_i µ_i (−z_i).

A bunch of manipulation changes this to

    L = Σ_i z_i (c − µ_i − α_i) + ½ ‖w‖² + Σ_i α_i − w · Σ_i α_i y_i x_i.

As ever, Lagrangian duality states that we can solve our original problem by doing

    max_{α≥0, µ≥0} min_{z,w} L.

For now, we work on the inner minimization. For a particular α and µ, we want to minimize with respect to z and w. By setting dL/dw = 0, we find that

    w = Σ_i α_i y_i x_i.

Meanwhile, setting dL/dz_i = 0 gives that α_i = c − µ_i. If we substitute these expressions, we find that µ disappears. However, notice that since µ_i ≥ 0 we must have that α_i ≤ c.

    max_{α,µ} min_{z,w} L = max_{0 ≤ α ≤ c}  Σ_i α_i + ½ ‖Σ_i α_i y_i x_i‖² − Σ_i Σ_j α_i α_j y_i y_j x_i · x_j
                          = max_{0 ≤ α ≤ c}  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i · x_j.
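The mechanics of this dual can be illustrated with a deliberately naive solver. The sketch below (ours; real SVM packages use far better algorithms) maximizes Σ_i α_i − ½ Σ_ij α_i α_j y_i y_j x_i · x_j by projected gradient ascent, clipping α into the box [0, c] after every step. Note this formulation has no intercept term, so the usual equality constraint Σ_i α_i y_i = 0 does not appear:

```python
import numpy as np

def svm_dual(X, y, c, step=0.01, iters=5000):
    """Naive projected gradient ascent on the (bias-free) SVM dual."""
    K = X @ X.T                      # linear kernel matrix
    M = np.outer(y, y) * K           # M_ij = y_i y_j k(x_i, x_j)
    a = np.zeros(len(y))
    for _ in range(iters):
        grad = 1.0 - M @ a           # gradient of sum(a) - 0.5 a^T M a
        a = np.clip(a + step * grad, 0.0, c)   # project onto the box [0, c]
    return a

# A tiny linearly separable problem (separable through the origin)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = svm_dual(X, y, c=10.0)
w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i
print(np.sign(X @ w))                # matches y
```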
This is a maximization of a quadratic objective, under linear constraints. That is, this is a quadratic program. Historically, QP solvers were first used to solve SVM problems. However, as these scale poorly to large problems, a huge amount of effort has been devoted to faster solvers (often based on coordinate ascent and/or online optimization). This area is still evolving. However, software is widely available now for solvers that are quite fast in practice.

Now, as we saw above that w = Σ_i α_i y_i x_i, we can classify new points x by

    f(x) = Σ_i α_i y_i x · x_i.

Clearly, this can be kernelized. If we do so, we can compute the Lagrange multipliers by the SVM optimization

    max_{0 ≤ α ≤ c}  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j),     (4.2)

which is again a quadratic program. We can classify new points by the SVM classification rule

    f(x) = Σ_i α_i y_i k(x, x_i).     (4.3)

Since we have kernelized both the learning optimization and the classification rule, we are again free to replace k with any of the variety of kernel functions we saw before.

Now, finally, we can define what a support vector is. Notice that Eq. 4.2 is the maximization of a quadratic function of α, under the box constraints 0 ≤ α_i ≤ c. It often happens that α_i wants to be negative (in terms of the quadratic function), but is prevented from this by the constraints. Thus, α is often sparse: many of the α_i are exactly zero. The training points x_i with nonzero α_i are the support vectors.

This has some interesting consequences. First of all, clearly if α_i = 0, we don't need to include the corresponding term in Eq. 4.3. This is potentially a big savings. If all α_i are nonzero, then we would need to explicitly compute the kernel function with all inputs, and our time complexity is similar to a nearest-neighbor method. If we only have a few nonzero α_i, then we only have to compute a few kernel functions, and our complexity is similar to that of a normal linear method.

Another interesting property of the sparsity of α is that non-support vectors don't affect the solution. Let's see why. What does it mean if α_i = 0? Well, recall that the multiplier α_i is enforcing the constraint that

    z_i ≥ 1 − y_i w · x_i.     (4.4)
If α_i = 0 at the solution, then this means, informally speaking, that we didn't really need to enforce this constraint at all: if we threw it out of the optimization, it would still automatically be obeyed. How could this be? Recall that the original optimization in Eq. 4.1 is trying to minimize all the z_i. There are two things stopping each z_i from flying down to −∞: the constraint in Eq. 4.4 above, and the constraint that z_i ≥ 0. If the constraint above can be removed without changing the solution, then it must be that z_i = 0. Thus, α_i = 0 implies that 1 − y_i w · x_i ≤ 0, or, equivalently, that

    y_i w · x_i ≥ 1.

Thus non-support vectors are points that are very well classified, that are comfortably on the right side of the linear boundary.

Now, imagine we take some x_i with z_i = 0, and remove it from the training set. It is pretty easy to see that this is equivalent to taking the optimization

    min_{z,w} c Σ_j z_j + ½ ‖w‖²
    s.t.  z_j ≥ 1 − y_j w · x_j
          z_j ≥ 0,

and just dropping the constraint that z_i ≥ 1 − y_i w · x_i, meaning that z_i decouples from the other variables, and the optimization will pick z_i = 0. But, as we saw above, this has no effect. Thus, removing a non-support vector from the training set has no impact on the resulting classification rule.

5 Examples

In class, we saw some examples of running SVMs. Here are many more.
Kernel Methods and SVMs 11 Dataset A, c = 1, k(x,v) = x v. predctons 1 5 α 1 1 5 1 1 2 4 6 8 1 sorted ndces Dataset A, c = 1 3, k(x,v) = x v. predctons 1 5 α 1 1 5 1 1 2 4 6 8 1 sorted ndces Dataset A, c = 1 5, k(x,v) = x v. predctons 1 5 α 1 1 5 2 4 6 8 1 sorted ndces
Kernel Methods and SVMs 12 Dataset A, c = 1, k(x,v) = 1 + x v. predctons 1 5 α 1 1 5 1 1 1 15 2 4 6 8 1 sorted ndces Dataset A, c = 1 3, k(x,v) = 1 + x v. predctons 1 5 α 1 1 5 1 1 2 4 6 8 1 sorted ndces Dataset A, c = 1 5, k(x,v) = 1 + x v. predctons 1 5 α 1 1 5 1 1 2 4 6 8 1 sorted ndces
Kernel Methods and SVMs 13 Dataset B, c = 1 5, k(x,v) = 1 + x v. predctons 1 5 α 1 1 5 1 2 3 4 5 sorted ndces Dataset B, c = 1 5, k(x,v) = (1 + x v) 5. predctons 1 5 α 1 1 5 1 2 3 4 5 sorted ndces Dataset B, c = 1 5, k(x,v) = (1 + x v) 1. predctons 1 5 α 1 1 5 1 1 1 2 3 4 5 sorted ndces
Kernel Methods and SVMs 14 Dataset C (dataset B wth nose), c = 1 5, k(x,v) = 1 + x v. predctons 1 5 α 1 1 5 1 2 3 4 5 sorted ndces Dataset C, c = 1 5, k(x,v) = (1 + x v) 5. predctons 1 5 α 1 1 5 1 2 3 4 5 sorted ndces Dataset C, c = 1 5, k(x,v) = (1 + x v) 1. predctons 1 5 α 1 1 5 1 1 1 2 3 4 5 sorted ndces
Kernel Methods and SVMs 15 Dataset C (dataset B wth nose), c = 1 5, k(x,v) = exp ( 2 x v 2). predctons 1 5 α 1 1 5 1 1 1 2 3 4 5 sorted ndces Dataset C, c = 1 5, k(x,v) = exp ( 2 x v 2). predctons 1 5 α 1 1 5 1 1 1 2 3 4 5 sorted ndces Dataset C, c = 1 5, k(x,v) = exp ( 2 x v 2). predctons 1 5 α 1 1 5 1 1 1 2 3 4 5 sorted ndces
6 Kernel Theory

We now return to the issue of what makes a valid kernel k(x, v), where valid means there exists some feature space φ such that k(x, v) = φ(x) · φ(v).

6.1 Kernel algebra

We can construct complex kernel functions from simple ones, using an algebra of composition rules³. Interestingly, these rules can be understood from parallel compositions in feature space.

To take an example, suppose we have two valid kernel functions k_a and k_b. If we define a new kernel function by

    k(x, v) = k_a(x, v) + k_b(x, v),

k will be valid. To see why, consider the feature spaces φ_a and φ_b corresponding to k_a and k_b. If we define φ by just concatenating φ_a and φ_b,

    φ(x) = (φ_a(x), φ_b(x)),

then φ is the feature space corresponding to k. To see this, note

    φ(x) · φ(v) = (φ_a(x), φ_b(x)) · (φ_a(v), φ_b(v))
                = φ_a(x) · φ_a(v) + φ_b(x) · φ_b(v)
                = k_a(x, v) + k_b(x, v)
                = k(x, v).

We can make a table of kernel composition rules, along with the dual feature space composition rules.

    kernel composition                                 feature composition
    a) k(x, v) = k_a(x, v) + k_b(x, v)                 φ(x) = (φ_a(x), φ_b(x))
    b) k(x, v) = f k_a(x, v), f > 0                    φ(x) = √f φ_a(x)
    c) k(x, v) = k_a(x, v) k_b(x, v)                   φ_m(x) = φ_ai(x) φ_bj(x)
    d) k(x, v) = xᵀ A v, A positive semi-definite      φ(x) = Lᵀ x, where A = L Lᵀ
    e) k(x, v) = xᵀ Mᵀ M v, M arbitrary                φ(x) = M x

³ This material is based on class notes from Michael Jordan.
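Rule (d) is easy to check numerically. In this sketch (ours, not from the notes), we build a random positive definite A, factor it with a Cholesky decomposition, and confirm that xᵀAv equals the inner product of the features Lᵀx and Lᵀv:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 1e-6 * np.eye(4)      # symmetric positive definite
L = np.linalg.cholesky(A)           # A = L L^T

x, v = rng.normal(size=4), rng.normal(size=4)
k_direct = x @ A @ v                # k(x, v) = x^T A v
k_feature = (L.T @ x) @ (L.T @ v)   # phi(x) . phi(v) with phi(u) = L^T u
print(k_direct, k_feature)          # equal up to rounding
```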
We have already proven rule (a). Let's prove some of the others. Rule (b) is quite easy to understand:

    φ(x) · φ(v) = √f φ_a(x) · √f φ_a(v)
                = f φ_a(x) · φ_a(v)
                = f k_a(x, v)
                = k(x, v).

Rule (c) is more complex. It is important to understand the notation. If

    φ_a(x) = (φ_a1(x), φ_a2(x), φ_a3(x))
    φ_b(x) = (φ_b1(x), φ_b2(x)),

then φ contains all six pairs:

    φ(x) = (φ_a1(x)φ_b1(x), φ_a2(x)φ_b1(x), φ_a3(x)φ_b1(x),
            φ_a1(x)φ_b2(x), φ_a2(x)φ_b2(x), φ_a3(x)φ_b2(x)).

With that understanding, we can prove rule (c) via

    φ(x) · φ(v) = Σ_m φ_m(x) φ_m(v)
                = Σ_i Σ_j φ_ai(x) φ_bj(x) φ_ai(v) φ_bj(v)
                = (Σ_i φ_ai(x) φ_ai(v)) (Σ_j φ_bj(x) φ_bj(v))
                = (φ_a(x) · φ_a(v)) (φ_b(x) · φ_b(v))
                = k_a(x, v) k_b(x, v)
                = k(x, v).

Rule (d) follows from the well-known result in linear algebra that a symmetric positive semi-definite matrix A can be factored as A = L Lᵀ. With that known, clearly
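Rules (a) and (c) can also be verified numerically. In this sketch (the two feature maps are our own, chosen arbitrarily for illustration), the sum kernel matches the concatenated features, and the product kernel matches the table of all pairwise feature products:

```python
import numpy as np

def phi_a(u):  # an arbitrary 3-dimensional feature map
    return np.array([u[0] ** 2, u[1] + u[2], u[0] * u[2]])

def phi_b(u):  # an arbitrary 2-dimensional feature map
    return np.array([u[0] * u[1], u[2] + 1.0])

def k_a(s, t): return phi_a(s) @ phi_a(t)
def k_b(s, t): return phi_b(s) @ phi_b(t)

rng = np.random.default_rng(0)
x, v = rng.normal(size=3), rng.normal(size=3)

# rule (a): sum of kernels <-> concatenation of feature spaces
phi_sum = lambda u: np.concatenate([phi_a(u), phi_b(u)])
print(np.isclose(k_a(x, v) + k_b(x, v), phi_sum(x) @ phi_sum(v)))   # True

# rule (c): product of kernels <-> all pairs phi_ai * phi_bj
phi_prod = lambda u: np.outer(phi_a(u), phi_b(u)).ravel()
print(np.isclose(k_a(x, v) * k_b(x, v), phi_prod(x) @ phi_prod(v))) # True
```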
    φ(x) · φ(v) = (Lᵀ x) · (Lᵀ v)
                = xᵀ L Lᵀ v
                = xᵀ A v
                = k(x, v).

We can alternatively think of rule (d) as saying that k(x, v) = xᵀ Mᵀ M v corresponds to the basis expansion φ(x) = M x for any M. That gives rule (e).

6.2 Understanding Polynomial Kernels via Kernel Algebra

So, we have all these rules for combining kernels. What do they tell us? Rules (a), (b), and (c) essentially tell us that polynomial combinations of valid kernels are valid kernels. Using this, we can understand the meaning of polynomial kernels.

First off, for some scalar variable x, consider a polynomial kernel of the form k(x, v) = (xv)^d. To what basis expansion does this kernel correspond? We can build this up stage by stage:

    k(x, v) = xv        φ(x) = (x)
    k(x, v) = (xv)²     φ(x) = (x²)    by rule (c)
    k(x, v) = (xv)³     φ(x) = (x³)    by rule (c)
    ...

If we work with vectors, we find that k(x, v) = (x · v) corresponds to φ(x) = x, while (by rule (c)) k(x, v) = (x · v)² corresponds to a feature space with all pairwise terms

    φ_m(x) = x_i x_j,   1 ≤ i, j ≤ n.

Similarly, k(x, v) = (x · v)³ corresponds to a feature space with all triplets

    φ_m(x) = x_i x_j x_k,   1 ≤ i, j, k ≤ n.
More generally, k(x, v) = (x · v)^d corresponds to a feature space with terms

    φ_m(x) = x_{i1} x_{i2} ··· x_{id},   1 ≤ i1, ..., id ≤ n.     (6.1)

Thus, a polynomial kernel is equivalent to a polynomial basis expansion, with all terms of order d. This is pretty surprising, even though the word polynomial is in front of both of these terms!

Again, we should reiterate the computational savings here. In general, computing a polynomial basis expansion will take time O(n^d). However, computing a polynomial kernel only takes time O(n). Again, though, we have only defeated the computational issue with high-degree polynomial basis expansions. The statistical properties are unchanged.

Now, consider the kernel k(x, v) = (r + x · v)^d. What is the impact of adding the constant r? Notice that this is equivalent to simply taking the vectors x and v and prepending a constant of √r to them. Thus, this kernel corresponds to a polynomial expansion with constant terms added. One way to write this would be

    φ_m(x) = x_{i1} x_{i2} ··· x_{id},   0 ≤ i1, ..., id ≤ n,     (6.2)

where we consider x_0 to be equal to √r. Thus, this kernel is equivalent to a polynomial basis expansion with all terms of order less than or equal to d.

An interesting question is the impact of the constant r. Should we set it large or small? What is the impact of this choice? Notice that the lower-order terms in the basis expansion in Eq. 6.2 will have many factors of x_0, and so get multiplied by √r to a high power. Meanwhile, high-order terms will have few or no factors of x_0, and so get multiplied by √r to a low power. Thus, a large value of r has the effect of making the low-order terms larger, relative to high-order terms.

Recall that if we make the basis expansion larger, this has the effect of reducing the regularization penalty, since the same classification rule can be accomplished with a smaller weight. Thus, if we make part of a basis expansion larger, those parts of the basis expansion will tend to play a larger role in the final classification rule. Thus, using a larger constant r has the effect of making the low-order parts of the polynomial expansion in Eq.
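The claim that (r + x · v)^d matches the degree-d expansion of the augmented vector (√r, x_1, ..., x_n) can be verified directly for small n and d. This is a sketch of ours; note the explicit expansion has (n+1)^d entries, which is exactly the cost the kernel avoids:

```python
import numpy as np
from itertools import product

def poly_features(x, r, d):
    """All degree-d products of the entries of (sqrt(r), x_1, ..., x_n)."""
    xa = np.concatenate([[np.sqrt(r)], x])
    return np.array([np.prod(t) for t in product(xa, repeat=d)])

rng = np.random.default_rng(0)
x, v = rng.normal(size=3), rng.normal(size=3)
r, d = 2.0, 3

k = (r + x @ v) ** d                                          # O(n) kernel evaluation
k_explicit = poly_features(x, r, d) @ poly_features(v, r, d)  # O((n+1)^d) feature route
print(k, k_explicit)                                          # equal up to rounding
```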
6.2 tend to have more impact.

6.3 Mercer's Theorem

One thing we might worry about is whether the SVM optimization is convex in α. The concern is whether

    Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j)     (6.3)
is convex with respect to α. We can show that if k is a valid kernel function, then the kernel matrix K must be positive semi-definite:

    zᵀ K z = Σ_i Σ_j z_i K_ij z_j
           = Σ_i Σ_j z_i φ(x_i) · φ(x_j) z_j
           = (Σ_i z_i φ(x_i)) · (Σ_j z_j φ(x_j))
           = ‖Σ_i z_i φ(x_i)‖² ≥ 0.

We can also show that, if K is positive semi-definite, then the SVM optimization is concave. The thing to see is that

    Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j) = αᵀ diag(y) K diag(y) α = αᵀ M α,

where M = diag(y) K diag(y). It is not hard to show that M is positive semi-definite.

So this is very nice: if we use any valid kernel function, we can be assured that the optimization that we need to solve in order to recover the Lagrange multipliers α will be concave. (The equivalent of convex when we are doing a maximization instead of a minimization.)

Now, we still face the question: do there exist invalid kernel functions that also yield positive semi-definite kernel matrices? It turns out that the answer is no. This result is known as Mercer's theorem.

    A kernel function is valid if and only if the corresponding kernel matrix is positive semi-definite for all training sets {x_i}.

This is very convenient: the valid kernel functions are exactly those that yield optimization problems that we can reliably solve. However, notice that Mercer's theorem refers to all sets of points {x_i}. An invalid kernel can yield a positive semi-definite kernel matrix for some particular training set. All we know is that, for an invalid kernel, there is some training set that yields a non positive semi-definite kernel matrix.
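Mercer's theorem suggests a practical smoke test, sketched here (ours, not from the notes): draw a training set, build the kernel matrix, and look at its eigenvalues. A valid kernel such as the RBF always yields a positive semi-definite matrix, while a function like k(x, v) = −‖x − v‖ is exposed as invalid by a negative eigenvalue on this particular training set:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
# pairwise squared distances ||x_i - x_j||^2
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

K_rbf = np.exp(-1.0 * D2)    # valid kernel: eigenvalues >= 0
K_bad = -np.sqrt(D2)         # not a valid kernel (zero trace, nonzero matrix)

print(np.linalg.eigvalsh(K_rbf).min())   # >= 0, up to rounding
print(np.linalg.eigvalsh(K_bad).min())   # < 0: K_bad is not PSD
```

Since K_bad is symmetric with zero trace but is not the zero matrix, it must have at least one negative eigenvalue, so the check is guaranteed to fire here.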
7 Our Story so Far

There were a lot of technical details here. It is worth taking a look back to do a conceptual overview and make sure we haven't missed the big picture. The starting point for SVMs is minimizing the hinge loss with ridge regularization, i.e.

    w* = arg min_w c Σ_i (1 − y_i w · x_i)₊ + ½ ‖w‖².

Fundamentally, SVMs are just fitting an optimization like this. The difference is that they perform the optimization in a different way, and they allow you to work efficiently with powerful basis expansions / kernel functions.

Act 1. We proved that if w* is the vector of weights that results from this optimization, then we could alternatively calculate w* as

    w* = Σ_i α_i y_i x_i,

where the α_i are given by the optimization

    max_{0 ≤ α ≤ c}  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_i · x_j.     (7.1)

With that optimization solved, we can classify a new point x by

    f(x) = x · w* = Σ_i α_i y_i x · x_i.     (7.2)

Thus, if we want to, we can think of the variables α_i as being the main things that we are fitting, rather than the weights w.

Act 2. Next, we noticed that the above optimization (Eq. 7.1) and classifier (Eq. 7.2) only depend on the inner products of the data elements. Thus, we could replace the inner products in these expressions with kernel evaluations, giving the optimization and the classification rule
    max_{0 ≤ α ≤ c}  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j),     (7.3)

    f(x) = Σ_i α_i y_i k(x, x_i),     (7.4)

where k(x, v) = x · v.

Act 3. Now, imagine that instead of directly working with the data, we wanted to work with some basis expansion. This would be easy to accomplish just by switching the kernel function to be k(x, v) = φ(x) · φ(v). However, we also noticed that for some basis expansions, like polynomials, we could compute k(x, v) much more efficiently than explicitly forming the basis expansions and then taking the inner product. We called this computational trick the kernel trick.

Act 4. Finally, we developed a kernel algebra, which allowed us to understand how we can combine different kernel functions, and what this means in feature space. We also saw Mercer's theorem, which tells us which kernel functions are and are not legal. Happily, this corresponded with the SVM optimization problem being concave, and hence reliably solvable.

8 Discussion

8.1 SVMs as Template Methods

Regardless of what we say, at the end of the day, support vector machines make their predictions through the classification rule

    f(x) = Σ_i α_i y_i k(x, x_i).

Intuitively, k(x, x_i) measures how similar x is to training example x_i. This bears a strong resemblance to K-NN classification, where we use k as the measure of similarity, rather than something like the Euclidean distance. Thus, SVMs can be seen as glorified template methods, where the amount that each point x_i participates in predictions is reweighted in the learning stage. This is a view usually espoused by SVM skeptics, but a reasonable one. Remember, however, that there is absolutely nothing wrong with template methods.
8.2 Theoretical Issues

An advantage of SVMs is that rigorous theoretical guarantees can often be given for their performance. It is possible to use these theoretical bounds to do model selection, rather than, e.g., cross validation. However, at the moment, these theoretical guarantees are rather loose in practice, meaning that SVMs perform significantly better than the bounds can show. As such, one can often get better practical results by using more heuristic model selection procedures like cross validation. We will see this when we get to learning theory.