Support Vector Machines (CS434)
Linear Separators

Many linear separators exist that perfectly classify all training examples. Which of the linear separators is the best?

[Figure: a 2-D training set of positive and negative points with several candidate separating lines.]
Intuition of Margin

Consider points A, B, and C. We are quite confident in our prediction for A because it is far from the decision boundary. In contrast, we are not so confident in our prediction for C, because a slight change in the decision boundary may flip the decision.

[Figure: points A, B, and C relative to the boundary w·x + b = 0; the half-space w·x + b > 0 is classified positive and w·x + b < 0 negative.]

Given a training set, we would like to make all of our predictions correct and confident. This is captured by the concept of margin.
[Figure: the same training set with the best separator drawn.] The best choice is the boundary from which all points in the training set are reasonably far away. How do we capture this concept mathematically?
Margin

Given a linear decision boundary defined by w·x + b = 0, the functional margin of a point (x_i, y_i) is defined as:

    y_i (w·x_i + b)

For fixed w and b, the larger the functional margin, the more confidence we have in the prediction. A possible goal: find w and b so that all training points have a large (maximum) functional margin. This has the right intuition but a critical technical flaw.
Issue with the Functional Margin

Say a point (x_i, y_i) has a functional margin of 10, i.e., y_i (w·x_i + b) = 10, for the decision boundary defined by w·x + b = 0. What if we scale w and b proportionally by a large constant, say 1000? The new decision boundary 1000 w·x + 1000 b = 0 is exactly the same boundary, but the functional margin of (x_i, y_i) becomes 10 × 1000 = 10000. We can arbitrarily inflate the functional margin without changing the boundary at all.
What We Should Measure: the Geometric Margin

We define the geometric margin of a point (x_i, y_i) w.r.t. w·x + b = 0 to be:

    γ_i = y_i (w·x_i + b) / ||w||

It measures the geometric distance between the point and the decision boundary. It can be either positive or negative:
- Positive if the point is correctly classified
- Negative if the point is misclassified
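The contrast between the two margins can be checked numerically. Below is a minimal sketch (assuming NumPy is available; the boundary (w, b) and the point are made-up values) showing that rescaling (w, b) inflates the functional margin but leaves the geometric margin unchanged:

```python
import numpy as np

def functional_margin(w, b, x, y):
    # y * (w . x + b): sign tells correctness, magnitude is scale-dependent
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # Functional margin normalized by ||w||: the signed distance to the boundary
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), 1.0   # hypothetical boundary, ||w|| = 5
x, y = np.array([1.0, 2.0]), +1    # a correctly classified point

print(functional_margin(w, b, x, y))               # 12.0
print(geometric_margin(w, b, x, y))                # 2.4
# Rescaling (w, b) by 1000 inflates the functional margin ...
print(functional_margin(1000 * w, 1000 * b, x, y))  # 12000.0
# ... but the geometric margin is unchanged
print(geometric_margin(1000 * w, 1000 * b, x, y))   # 2.4
```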
Geometric Margin of a Decision Boundary

For a training set {(x_i, y_i) : i = 1, …, N} and a boundary w·x + b = 0, compute the geometric margin of every point:

    γ_i = y_i (w·x_i + b) / ||w||,   i = 1, …, N

Note: γ_i > 0 if point i is correctly classified. How can we tell whether a boundary is good? When the smallest γ_i is large. The geometric margin of a decision boundary for a training set is defined to be:

    γ = min_{i = 1, …, N} γ_i

Goal: we aim to find a decision boundary whose geometric margin w.r.t. the training set is maximized!
What We Have Done So Far

We have established that we want to find the linear decision boundary whose geometric margin is the largest. We have a new learning objective: given a linearly separable training set {(x_i, y_i) : i = 1, …, N} (this assumption will be relaxed later), find the linear classifier with the maximum geometric margin. How can we achieve this? Mathematically formulate it as an optimization problem and then solve it.
Maximum Margin Classifier

This can be represented as a constrained optimization problem:

    max_{w,b} γ   subject to   y_i (w·x_i + b) / ||w|| ≥ γ,   i = 1, …, N

This optimization problem is in a nasty form, so we need to do some rewriting. Let γ' = γ ||w|| be the functional margin; we can rewrite the problem as:

    max_{w,b} γ' / ||w||   subject to   y_i (w·x_i + b) ≥ γ',   i = 1, …, N
Maximum Margin Classifier

Note that we can arbitrarily rescale w and b to make the functional margin γ' large or small without changing the boundary, so we can rescale them such that γ' = 1:

    max_{w,b} 1 / ||w||   subject to   y_i (w·x_i + b) ≥ 1,   i = 1, …, N

or equivalently

    min_{w,b} ||w||² / 2   subject to   y_i (w·x_i + b) ≥ 1,   i = 1, …, N

Maximizing the geometric margin is equivalent to minimizing the magnitude of w subject to maintaining a functional margin of at least 1.
Solving the Optimization Problem

    min_{w,b} (1/2) w·w   subject to   y_i (w·x_i + b) ≥ 1,   i = 1, …, N

This is quadratic optimization with linear inequality constraints: a well-known class of mathematical programming problems, for which several (non-trivial) algorithms exist. In practice, one can regard the QP solver as a black box. It is also useful to formulate an equivalent dual optimization problem and solve that one instead.
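To make the primal QP concrete, here is a sketch that hands it to a generic constrained solver (SciPy's SLSQP, treated as the black box; the four-point data set is made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training set (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):                 # theta = (w1, w2, b)
    w = theta[:2]
    return 0.5 * np.dot(w, w)         # (1/2) w . w

# One inequality constraint per point: y_i (w . x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": (lambda t, i=i: y[i] * (np.dot(t[:2], X[i]) + t[2]) - 1)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
margins = y * (X @ w + b)
print(np.round(w, 3), round(b, 3))
print(margins.min())   # the smallest functional margin is 1, attained by support vectors
```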
The Solution

We cannot give you a closed-form solution into which you directly plug the numbers for an arbitrary data set. But the solution can always be written in the following form:

    w = Σ_{i=1}^{N} α_i y_i x_i,   s.t.   Σ_{i=1}^{N} α_i y_i = 0

This is the form of w; the value of b can be calculated accordingly using some additional steps. A few important things to note:
- The weight vector is a linear combination of all the training examples.
- Importantly, many of the α_i are zero.
- The points with non-zero α_i are the support vectors; these are the points with the smallest geometric margin.
Aside: Constrained Optimization

To solve the optimization problem

    min_x f(x)   subject to   g_i(x) ≤ 0,   i = 1, …, N

consider the following function, known as the Lagrangian:

    L(x, α) = f(x) + Σ_{i=1}^{N} α_i g_i(x),   with α_i ≥ 0

Under certain conditions it can be shown that a solution x to the above problem satisfies:

    Primal form: min_x max_{α ≥ 0} L(x, α)
    Dual form:   max_{α ≥ 0} min_x L(x, α)
Back to the Original Problem

    min_{w,b} (1/2) w·w   subject to   1 − y_i (w·x_i + b) ≤ 0,   i = 1, …, N

The Lagrangian is

    L(w, b, α) = (1/2) w·w + Σ_{i=1}^{N} α_i {1 − y_i (w·x_i + b)},   subject to α_i ≥ 0

We want to solve min_{w,b} max_{α ≥ 0} L(w, b, α). Setting the gradient of L w.r.t. w and b to zero, we have:

    w − Σ_{i=1}^{N} α_i y_i x_i = 0   ⟹   w = Σ_{i=1}^{N} α_i y_i x_i
    Σ_{i=1}^{N} α_i y_i = 0
The Dual Problem

If we substitute w = Σ_i α_i y_i x_i into L, we have

    L(α) = (1/2) w·w − Σ_i α_i {y_i (w·x_i + b) − 1}
         = (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) − Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) − b Σ_i α_i y_i + Σ_i α_i
         = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j)

Note that Σ_i α_i y_i = 0, so the term involving b vanishes. This is a function of α only.
The Dual Problem

The new objective function is in terms of α_i only. It is known as the dual problem: if we know all α_i, we know w. The original problem is known as the primal problem. The objective function of the dual problem needs to be maximized. The dual problem is therefore:

    max_α L(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j (x_i·x_j)
    subject to   α_i ≥ 0, i = 1, …, N,   and   Σ_{i=1}^{N} α_i y_i = 0

The constraints α_i ≥ 0 are the properties of the Lagrange multipliers when we introduce them; the constraint Σ_i α_i y_i = 0 is the result of differentiating the original Lagrangian w.r.t. b.
The Dual Problem

    max_α L(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j (x_i·x_j)
    subject to   α_i ≥ 0, i = 1, …, N,   and   Σ_{i=1}^{N} α_i y_i = 0

This is also a quadratic programming (QP) problem, and a global maximum of L(α) can always be found. w can be recovered by

    w = Σ_{i=1}^{N} α_i y_i x_i

b can be recovered as well (wait for a bit).
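The dual can be handed to a generic solver just like the primal. A sketch (SciPy assumed; the four-point data set is made up for illustration) that minimizes the negated dual objective under the box and equality constraints, then recovers w:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable training set
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Gram-style matrix G_ij = y_i y_j (x_i . x_j)
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):
    # Negate L(alpha) so that maximizing the dual becomes a minimization
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),                      # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
print(np.round(alpha, 3))   # most alpha_i are zero; only support vectors are nonzero
print(np.round(w, 3))
```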
Characteristics of the Solution

Many of the α_i are zero, so w is a linear combination of only a small number of data points. In fact, optimization theory requires the solution to satisfy the following KKT conditions:

    α_i ≥ 0,   i = 1, …, N
    α_i {y_i (w·x_i + b) − 1} = 0,   i = 1, …, N

Since the functional margin satisfies y_i (w·x_i + b) ≥ 1, α_i is nonzero only when the functional margin equals 1. Points x_i with non-zero α_i are called support vectors (SVs); the decision boundary is determined only by the SVs. Let t_j (j = 1, …, s) be the indices of the s support vectors. We can write

    w = Σ_{j=1}^{s} α_{t_j} y_{t_j} x_{t_j}
Solve for b

Note that for support vectors the functional margin equals 1. We can use this information to solve for b, using any support vector t_j:

    y_{t_j} (Σ_{k=1}^{s} α_{t_k} y_{t_k} (x_{t_k}·x_{t_j}) + b) = 1

A numerically more stable solution is to use all support vectors (details in the book).
Classifying New Examples

For classifying a new input z, compute

    w·z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j}·z) + b

and classify z as positive if the sum is positive, and negative otherwise. Note: w need not be formed explicitly; rather, we can classify z by taking a weighted sum of the inner products with the support vectors (useful when we generalize from inner products to kernel functions later).
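As a sketch of both steps (NumPy assumed; the four-point data set is hypothetical, and the α values hard-coded below are the dual solution for that particular set), b is recovered by averaging over the support vectors, and new points are classified using only inner products with them:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.4, 0.0, 0.0, 0.4])   # dual solution for this toy set

sv = np.where(alpha > 1e-8)[0]           # indices of the support vectors
# For each support vector s: y_s (sum_i alpha_i y_i (x_i . x_s) + b) = 1,
# i.e. b = y_s - sum_i alpha_i y_i (x_i . x_s); average over SVs for stability
b = np.mean([y[s] - sum(alpha[i] * y[i] * (X[i] @ X[s]) for i in sv)
             for s in sv])

def predict(z):
    # Weighted sum of inner products with the support vectors; w is never formed
    score = sum(alpha[i] * y[i] * (X[i] @ z) for i in sv) + b
    return 1 if score > 0 else -1

print(round(b, 3))                                          # -1.4
print(predict(np.array([3.0, 2.0])), predict(np.array([0.0, 1.0])))  # 1 -1
```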
The Quadratic Programming Problem

Many approaches have been proposed: LOQO, CPLEX, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html). Most are interior-point methods:
- Start with an initial solution that can violate the constraints.
- Improve this solution by optimizing the objective function and/or reducing the amount of constraint violation.

For SVMs, sequential minimal optimization (SMO) seems to be the most popular. A QP with two variables is trivial to solve; each iteration of SMO picks a pair (α_i, α_j) and solves the QP with these two variables, repeating until convergence. In practice, we can just regard the QP solver as a black box without bothering with how it works.
A Geometrical Interpretation

[Figure: Class 1 and Class 2 points separated by the max-margin boundary. Most points have α_i = 0; only the support vectors carry nonzero weights: α_1 = 0.8, α_6 = 1.4, α_8 = 0.6.]
[Figure: the same data, with the weight vector w drawn perpendicular to the decision boundary.]
A Few Important Notes on the Geometric Interpretation

- w·x + b = 0 gives the decision boundary.
- Positive support vectors lie on the line w·x + b = 1.
- Negative support vectors lie on the line w·x + b = −1.
- We can think of these two lines as defining a fat decision boundary; the support vectors exactly touch the two sides of the fat boundary.
- Learning involves adjusting the location and orientation of the decision boundary so that it can be as fat as possible without eating up the training examples.
Summary So Far

- We defined margin (functional, geometric).
- We demonstrated that we prefer linear classifiers with large geometric margin.
- We formulated the problem of finding the maximum margin linear classifier as a quadratic optimization problem, which can be solved using efficient QP algorithms that are available.
- The solutions for w and b are very nicely formed.

Do we have our perfect classifier yet?
Non-separable Data and Noise

What if the data is not linearly separable? The solution does not exist, i.e., the set of linear constraints is not satisfiable. But we should still be able to find a good decision boundary. What if we have noise in the data? The maximum margin classifier is not robust to noise!
Soft Margin

Allow functional margins to be less than 1. Originally, functional margins needed to satisfy:

    y_i (w·x_i + b) ≥ 1

Now we allow them to be less than 1:

    y_i (w·x_i + b) ≥ 1 − ξ_i,   with ξ_i ≥ 0

The objective changes to:

    min_{w,b} ||w||²/2 + c Σ_{i=1}^{N} ξ_i
Soft-Margin Maximization

Hard margin:

    min_{w,b} ||w||²/2   subject to   y_i (w·x_i + b) ≥ 1,   i = 1, …, N

Soft margin, with slack variables ξ_i:

    min_{w,b} ||w||²/2 + c Σ_{i=1}^{N} ξ_i
    subject to   y_i (w·x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, …, N

This allows some functional margins to be < 1 (they could even be < 0). The ξ_i can be viewed as the errors of our fat decision boundary; adding them to the objective function penalizes those errors. We have a tradeoff between making the decision boundary fat and minimizing the error. The parameter c controls the tradeoff:
- Large c: slacks incur a large penalty, so the optimal solution will try to avoid them.
- Small c: slacks are cheap, so we can sacrifice some training examples to obtain a larger classifier margin.
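The soft-margin primal can be sketched with the same generic solver as before, now with the slacks as extra variables (SciPy assumed; the data set, which plants one noisy positive point among the negatives so that no separating line exists, is made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Toy set: the third point is a noisy positive sitting among the negatives
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
n = len(y)

def train(C):
    # Variables: theta = (w1, w2, b, xi_1, ..., xi_n)
    def obj(t):
        w, xi = t[:2], t[3:]
        return 0.5 * (w @ w) + C * xi.sum()
    cons = [{"type": "ineq",            # y_i (w . x_i + b) - 1 + xi_i >= 0
             "fun": (lambda t, i=i: y[i] * (t[:2] @ X[i] + t[2]) - 1 + t[3 + i])}
            for i in range(n)]
    bounds = [(None, None)] * 3 + [(0, None)] * n   # xi_i >= 0
    res = minimize(obj, np.zeros(3 + n), method="SLSQP",
                   bounds=bounds, constraints=cons)
    return res.x[:2], res.x[2], res.x[3:]

w, b, xi = train(C=1.0)
print(np.round(xi, 2))   # since the data is not separable, some slack must be paid
```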
Solutions to Soft-Margin SVM

Without soft margin:

    w = Σ_{i=1}^{N} α_i y_i x_i,   s.t.   Σ_{i=1}^{N} α_i y_i = 0

With soft margin:

    w = Σ_{i=1}^{N} α_i y_i x_i,   s.t.   Σ_{i=1}^{N} α_i y_i = 0   and   0 ≤ α_i ≤ c

c effectively puts a box constraint on the α_i, the weights of the support vectors; it limits the influence of individual support vectors (which may be outliers). In practice, c is a parameter to be set, similar to k in k-nearest neighbors, and it can be set using cross-validation.
How to Make Predictions?

For classifying a new input z, compute

    w·z + b = (Σ_{j=1}^{s} α_{t_j} y_{t_j} x_{t_j})·z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j}·z) + b

and classify z as + if the result is positive, and − otherwise. Note: w need not be computed or stored explicitly; we can store the α_i and classify z using the above calculation, which involves taking the dot product between training examples and the new example z. In fact, in SVMs both learning and classification can be achieved using dot products between pairs of input points. This allows us to do the kernel trick, that is, to replace the dot product with something called a kernel function.
Non-linear SVMs

Soft margin works out great for datasets that are linearly separable with some noise. But what are we going to do if the dataset is just too hard (not linearly separable at all)?

[Figure: two 1-D data sets on the x axis; in the first, a single threshold separates the classes, while in the second the classes are interleaved and no threshold works.]
Mapping the input to a higher-dimensional space can solve the linearly non-separable cases.

[Figure: 1-D data that is not separable in x becomes linearly separable after the mapping x → (x, x²).]
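A minimal numeric sketch of this picture (NumPy assumed; the labels are made up so the negatives sit between the positives): no threshold on x alone separates the classes, but after mapping x → (x, x²) a line in the new space does.

```python
import numpy as np

# 1-D data: negatives in the middle, positives on both sides
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# No single threshold on x reproduces the labels ...
separable_1d = any(np.all((x > t) == (y > 0)) or np.all((x < t) == (y > 0))
                   for t in np.linspace(-4, 4, 200))
print(separable_1d)   # False

# ... but after x -> (x, x^2), the horizontal line x^2 = 2.5 separates them
phi = np.column_stack([x, x ** 2])
print(np.all((phi[:, 1] > 2.5) == (y > 0)))   # True
```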
Non-linear SVMs: Feature Spaces

General idea: for any data set, the original input space can always be mapped to some higher-dimensional feature space such that the data is linearly separable:

    x → Φ(x)
Example: Quadratic Feature Space

Assume m input dimensions: x = (x_1, x_2, …, x_m). The quadratic feature map contains a constant term, the linear terms, the squared terms, and the pairwise products, so the number of terms is 1 + m + m + m(m−1)/2. The number of dimensions increases rapidly! You may be wondering about the √2 factors; at least they won't hurt anything, and you will find out why they are there soon.
Dot Product in Quadratic Feature Space

Computing Φ(a)·Φ(b) directly requires touching every quadratic feature. Now let's just look at another interesting function of a·b:

    (a·b + 1)² = (Σ_{i=1}^{m} a_i b_i + 1)² = 1 + 2 Σ_{i=1}^{m} a_i b_i + Σ_{i=1}^{m} Σ_{j=1}^{m} a_i a_j b_i b_j

Expanding Φ(a)·Φ(b) term by term gives exactly the same expression: they are the same! And the latter only takes O(m) to compute.
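The identity can be verified numerically. A sketch (NumPy assumed; the vectors are arbitrary made-up values) that writes out the explicit quadratic map for m = 2, including the √2 factors, and compares it against the O(m) kernel form:

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for m = 2 inputs:
    # (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2)
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)    # dot product in the 6-dimensional feature space
kernel = (a @ b + 1) ** 2     # (a . b + 1)^2, computed in O(m)
print(explicit, kernel)       # both equal 4.0
```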
Kernel Functions

The linear classifier relies on inner products between vectors. If every data point is mapped into a high-dimensional space via some transformation x → Φ(x), the inner product becomes Φ(a)·Φ(b). Definition: a function k(a, b) is called a kernel function if k(a, b) = Φ(a)·Φ(b) for some Φ. We have seen the example

    k(a, b) = (a·b + 1)²

This is equivalent to mapping to the quadratic feature space!
Non-linear SVMs with Kernels

Dual formulation of the learning:

    max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j k(x_i, x_j)
    subject to   0 ≤ α_i ≤ c, i = 1, …, N,   and   Σ_{i=1}^{N} α_i y_i = 0

Dual formulation of prediction: classify z by the sign of

    Σ_{j=1}^{s} α_{t_j} y_{t_j} k(x_{t_j}, z) + b
More About Kernel Functions

In practice, we specify the kernel function without explicitly stating the transformation; given a kernel function, finding its corresponding transformation can be very cumbersome. A kernel function can be viewed as computing some similarity measure between objects. If you have a good similarity measure, can you use it as a kernel in an SVM?
Kernel Function or Not?

Consider a finite set of m points x_1, …, x_m. We define the kernel matrix K as the m × m matrix with entries K_ij = k(x_i, x_j):

    K = [ k(x_1, x_1)  k(x_1, x_2)  …  k(x_1, x_m)
          k(x_2, x_1)  k(x_2, x_2)  …  k(x_2, x_m)
          …
          k(x_m, x_1)  k(x_m, x_2)  …  k(x_m, x_m) ]

Kernel matrices are square and symmetric. Mercer's theorem: a function k is a kernel function iff for any finite sample x_1, …, x_m, its corresponding kernel matrix is positive semi-definite (non-negative eigenvalues).
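Mercer's condition is easy to check empirically for a given sample. A sketch (NumPy assumed; the random points and the bandwidth σ = 1 are arbitrary) that builds an RBF kernel matrix and inspects its symmetry and eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # 20 random points in 3-D

# RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), sigma = 1
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

# A valid kernel matrix must be symmetric positive semi-definite
print(np.allclose(K, K.T))                       # True
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # True: no negative eigenvalues
```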
More Kernel Functions

- Linear kernel: k(a, b) = a·b
- Polynomial kernel: k(a, b) = (a·b + 1)^d
- Radial-Basis-Function (RBF) kernel: k(a, b) = exp(−||a − b||² / (2σ²)); in this case, the corresponding mapping Φ(x) is infinite-dimensional!

Closure properties of kernels: if k_1 and k_2 are kernel functions, then the following are all kernel functions:

    k(a, b) = k_1(a, b) + k_2(a, b)
    k(a, b) = c k_1(a, b),  for c > 0
    k(a, b) = k_1(a, b) k_2(a, b)
Key Choices in Applying SVMs

Selecting the kernel function:
- In practice, we often try different kernel functions and use cross-validation to choose.
- Linear kernels, polynomial kernels (with low degrees), and RBF kernels are popular choices.
- One can also construct a kernel using linear combinations of different kernels and learn the combination parameters (kernel learning).

Selecting the kernel parameters and c:
- These have a very strong impact on performance.
- Often the optimal range is reasonably large.
- Use grid search with cross-validation.
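A sketch of that grid search (assuming scikit-learn is installed; the two-moons data set, the grid values, and cv=5 are illustrative choices, not prescriptions):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A non-linearly-separable toy problem
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Grid over the soft-margin tradeoff C and the RBF width parameter gamma,
# scored by 5-fold cross-validation
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```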
SVM Summary: Strengths vs. Weaknesses

Strengths:
- The solution is globally optimal and exact.
- Kernels allow for very flexible hypotheses.
- SVMs can handle non-traditional data like strings and trees, instead of the traditional fixed-length feature vectors. Why? Because as long as we can define a kernel (similarity) function for such inputs, we can apply an SVM.

Weaknesses:
- Need to specify a good kernel and tune its parameters.
- Training time can be long.