Support Vector Machines CS434



Linear Separators. Many linear separators exist that perfectly classify all training examples. Which of the linear separators is the best?

Intuition of Margin. Consider points A, B, and C. We are quite confident in our prediction for A because it is far from the decision boundary. In contrast, we are not so confident in our prediction for C because a slight change in the decision boundary may flip the decision. (Figure: the boundary w·x + b = 0 separates the region w·x + b > 0 from the region w·x + b < 0; A lies far from the boundary, C lies close to it.) Given a training set, we would like to make all of our predictions correct and confident! This can be captured by the concept of margin.

The best choice: all points in the training set are reasonably far away from it. How do we capture this concept mathematically?

Margin. Given a linear decision boundary defined by w·x + b = 0, the functional margin of a point (x_i, y_i) is defined as y_i(w·x_i + b). For a fixed w and b, the larger the functional margin, the more confident we are about the prediction. How about we set our goal to be: find w and b so that all training data points have a large (maximum) functional margin? This has the right intuition but a critical technical flaw.

Issue with Functional Margin. Let's say this purple point (x_i, y_i) has a functional margin of 10, i.e., y_i(w·x_i + b) = 10, with the decision boundary defined by w·x + b = 0. What if we change the values of w and b proportionally by multiplying by a large constant, say 1000? What is the new decision boundary? 1000w·x + 1000b = 0, which is the same boundary. The functional margin of the point (x_i, y_i) becomes y_i(1000w·x_i + 1000b) = 10 × 1000 = 10000. We can arbitrarily change the functional margin without changing the boundary at all.
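
As a quick sanity check of this scaling issue, here is a minimal NumPy sketch (the weight vector, bias, and points are made-up illustrative values, not from the slides): scaling (w, b) by a positive constant leaves every prediction unchanged but multiplies every functional margin by that constant.

```python
import numpy as np

# Made-up 2-D example: boundary w.x + b = 0 and a few labeled points.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[3.0, 1.0], [0.0, 4.0], [1.0, 1.0]])
y = np.array([1, -1, 1])

def functional_margin(w, b, X, y):
    """Functional margin y_i (w . x_i + b) for each point."""
    return y * (X @ w + b)

for c in (1.0, 1000.0):
    scores = X @ (c * w) + c * b
    print("scale", c,
          "predictions", np.sign(scores),                              # identical for any c > 0
          "functional margins", functional_margin(c * w, c * b, X, y)) # scaled by c
```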

What we really want. (Figure: points A, B, and C with their distances to the boundary.) Geometric margin: the distance between an example and the decision boundary.

Basic facts about lines. For the line w^T x + b = 0, the signed distance d from a point x to the line satisfies w^T x + b = ‖w‖d. To see this: let x_0 be the projection of x onto the line, so that w^T x_0 + b = 0. The vector x - x_0 (the red vector in the figure) points along w and has length |d|, i.e., x - x_0 = d·w/‖w‖. Multiplying both sides by w^T gives w^T x - w^T x_0 = d‖w‖, and since w^T x_0 = -b, we get d = (w^T x + b)/‖w‖. Note: d can be positive or negative depending on which side of the line x is on. We predict positive if w^T x + b ≥ 0 and negative otherwise.
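
The same fact in code: a small NumPy sketch (w, b, and the query point are made-up values) that computes the signed distance d = (w^T x + b)/‖w‖ and checks that x - d·w/‖w‖ lands back on the line.

```python
import numpy as np

w = np.array([3.0, 4.0])   # made-up normal vector, ||w|| = 5
b = -10.0
x = np.array([6.0, 7.0])   # made-up query point

d = (w @ x + b) / np.linalg.norm(w)       # signed distance to the line w.x + b = 0
x0 = x - d * w / np.linalg.norm(w)        # projection of x onto the line

print("signed distance d =", d)
print("w.x0 + b =", w @ x0 + b)           # ~0: x0 lies on the line
print("prediction:", "positive" if w @ x + b >= 0 else "negative")
```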

Geometric margin. We define the geometric margin of a point (x_i, y_i) w.r.t. the boundary w·x + b = 0 to be: γ_i = y_i(w·x_i + b)/‖w‖. It measures the geometric distance between the point and the decision boundary. It can be either positive or negative: positive if the point is correctly classified, negative if the point is misclassified.

Geometric margin of a decision boundary. For a training set {(x_i, y_i) : i = 1, ..., N} and boundary w·x + b = 0, compute the geometric margin of all points: γ_i = y_i(w·x_i + b)/‖w‖, i = 1, ..., N. Note: γ_i > 0 if point i is correctly classified. How can we tell if w·x + b = 0 is good? The smallest γ_i is large. The geometric margin of a decision boundary for a training set is defined to be γ = min_{i=1,...,N} γ_i. Goal: we aim to find a decision boundary whose geometric margin w.r.t. the training set is maximized!
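
A minimal sketch of this definition (the dataset and the boundary are made up for illustration): compute γ_i for every training point and take the minimum.

```python
import numpy as np

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 0.5], [4.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, 1, -1, -1])

def geometric_margin(w, b, X, y):
    """gamma_i = y_i (w . x_i + b) / ||w|| for each point; the boundary's
    margin w.r.t. the training set is the smallest gamma_i."""
    gammas = y * (X @ w + b) / np.linalg.norm(w)
    return gammas, gammas.min()

gammas, gamma = geometric_margin(np.array([1.0, 1.0]), -1.0, X, y)
print("per-point margins:", gammas)
print("margin of the boundary:", gamma)   # a negative value would mean a misclassified point
```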

What we have done so far. We have established that we want to find a linear decision boundary whose geometric margin is the largest. We have a new learning objective: given a linearly separable (this will be relaxed later) training set {(x_i, y_i) : i = 1, ..., N}, we would like to find a linear classifier (w, b) with the maximum geometric margin. How can we achieve this? Mathematically formulate this as an optimization problem and then solve it. In this class, we just need to know some basics about this process.

Maximum Margin Classifier. This can be represented as a constrained optimization problem: max_{w,b} γ subject to y_i(w·x_i + b)/‖w‖ ≥ γ, i = 1, ..., N. This optimization problem is in a nasty form, so we need to do some rewriting. Let γ' be the functional margin (we have γ' = γ‖w‖); we can rewrite this as: max_{w,b} γ'/‖w‖ subject to y_i(w·x_i + b) ≥ γ', i = 1, ..., N.

Maximum Margin Classifier. Note that we can arbitrarily rescale w and b to make the functional margin γ' large or small, so we can rescale them such that γ' = 1: max_{w,b} 1/‖w‖ (or equivalently min_{w,b} ‖w‖²/2) subject to y_i(w·x_i + b) ≥ 1, i = 1, ..., N. Maximizing the geometric margin is equivalent to minimizing the magnitude of w subject to maintaining a functional margin of at least 1.

Solving the Optimization Problem. min_{w,b} ‖w‖²/2 subject to y_i(w·x_i + b) ≥ 1, i = 1, ..., N. This results in a quadratic optimization problem with linear inequality constraints. This is a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist. In practice, we can just regard the QP solver as a black box without bothering about how it works. You'll be spared the excruciating details and jump straight to the solution.
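
To make the "QP solver as a black box" point concrete, here is a hedged sketch using the cvxopt package (an assumed library choice; any QP solver would do) on made-up toy data. It hands the solver the standard dual form of this QP over the multipliers α_i, which is the form whose solution appears on the next slide.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes the cvxopt package is installed

# Made-up linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0], [0.0, 0.0], [-1.0, 1.0], [0.5, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual QP: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b
K = X @ X.T
P = matrix(np.outer(y, y) * K)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))          # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))    # sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # support vectors: alpha_i > 0
b_val = np.mean(y[sv] - X[sv] @ w)           # b recovered from the support vectors
print("alpha:", np.round(alpha, 3))
print("w:", w, "b:", b_val)
```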

The solution. We cannot give you a closed-form solution that you can directly plug numbers into for an arbitrary data set. But the solution can always be written in the following form: w = Σ_{i=1}^{N} α_i y_i x_i, s.t. α_i ≥ 0 and Σ_{i=1}^{N} α_i y_i = 0. This is the form of w; the value of b can be calculated accordingly using some additional steps. A few important things to note: The weight vector is a linear combination of all the training examples. Importantly, many of the α_i's are zero. The points that have non-zero α_i's are the support vectors; these are the points that have the smallest geometric margin.
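
If an off-the-shelf SVM package is used instead, the same structure can be inspected directly. A minimal sketch with scikit-learn's SVC (an assumed library choice; linear kernel, very large C to approximate the hard margin, made-up data): only a few points end up with non-zero α_i, and w matches Σ_i α_i y_i x_i.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0], [0.0, 0.0], [-1.0, 1.0], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # huge C approximates the hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only.
print("support vector indices:", clf.support_)        # the points with alpha_i > 0
print("alpha_i * y_i:", clf.dual_coef_)
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_  # sum_i alpha_i y_i x_i
print("w from alphas:", w_from_alphas, " w from solver:", clf.coef_)
```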

A Geometrical Interpretation. (Figure: points from Class 1 and Class 2 with the maximum margin boundary; most points have α_i = 0, and only the support vectors on the margin have non-zero values, here α_1 = 0.8, α_6 = 1.4, and α_8 = 0.6.)

A few important notes regarding the geometric interpretation. w·x + b = 0 gives the decision boundary. Positive support vectors lie on the line w·x + b = 1; negative support vectors lie on the line w·x + b = -1. We can think of these two lines as defining a fat decision boundary. The support vectors exactly touch the two sides of the fat boundary. Learning involves adjusting the location and orientation of the decision boundary so that it can be as fat as possible without eating up the training examples.

Summary So Far. We defined margin (functional, geometric). We demonstrated that we prefer linear classifiers with a large geometric margin. We formulated the problem of finding the maximum margin linear classifier as a quadratic optimization problem. This problem can be solved using efficient QP algorithms that are available. The solutions for w and b are very nicely formed. Do we have our perfect classifier yet?

Non-separable Data and Noise. What if the data is not linearly separable? The solution does not exist, i.e., the set of linear constraints is not satisfiable. But we should still be able to find a good decision boundary. What if we have noise in the data? The maximum margin classifier is not robust to noise!

Soft Margin. Allow functional margins to be less than 1. Originally functional margins need to satisfy y_i(w·x_i + b) ≥ 1. Now we allow them to be less than 1: y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0. The objective changes to: min_{w,b} ‖w‖²/2 + c Σ_{i=1}^{N} ξ_i.

Soft-Margin Maximization. Hard margin: min_{w,b} ‖w‖²/2 subject to y_i(w·x_i + b) ≥ 1, i = 1, ..., N. With slack variables: min_{w,b} ‖w‖²/2 + c Σ_{i=1}^{N} ξ_i subject to y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0, i = 1, ..., N. This allows some functional margins < 1 (they could even be < 0). The ξ_i's can be viewed as the errors of our fat decision boundary. Adding the ξ_i's to the objective function aims to minimize the errors. We have a tradeoff between making the decision boundary fat and minimizing the error. Parameter c controls the tradeoff: Large c: ξ_i's incur a large penalty, so the optimal solution will try to avoid them. Small c: small cost for ξ_i's, so we can sacrifice some training examples to have a larger classifier margin.
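
A minimal illustration of this tradeoff (scikit-learn calls the parameter C; the noisy toy data below is made up): a small C keeps the boundary fat (small ‖w‖, wide margin, many support vectors), while a large C penalizes the ξ_i's harder and gives a thinner margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Made-up noisy, roughly separable data.
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)       # width of the "fat" boundary
    print(f"C={C:6}: margin width={margin:.3f}, "
          f"#support vectors={len(clf.support_)}, "
          f"training accuracy={clf.score(X, y):.2f}")
```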

Solutions to soft-margin SVM. No soft margin: w = Σ_{i=1}^{N} α_i y_i x_i, s.t. Σ_{i=1}^{N} α_i y_i = 0 and α_i ≥ 0. With soft margin: w = Σ_{i=1}^{N} α_i y_i x_i, s.t. Σ_{i=1}^{N} α_i y_i = 0 and 0 ≤ α_i ≤ c. c effectively puts a box constraint on α, the weights of the support vectors. It limits the influence of individual support vectors (which may be outliers). In practice, c is a parameter to be set, similar to k in k-nearest neighbors. It can be set using cross-validation.
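
Since the slide suggests cross-validation for c, here is one hedged way to do it with scikit-learn's GridSearchCV (the grid values and data are illustrative, not from the lecture).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 1.5, size=(60, 2)),
               rng.normal([-2, -2], 1.5, size=(60, 2))])
y = np.array([1] * 60 + [-1] * 60)

# 5-fold cross-validation over a logarithmic grid of C values.
search = GridSearchCV(SVC(kernel='linear'),
                      param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print("best C:", search.best_params_['C'],
      "cross-validated accuracy:", round(search.best_score_, 3))
```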

How to make predictions? For classifying a new input z, compute w·z + b = (Σ_{j=1}^{s} α_{t_j} y_{t_j} x_{t_j})·z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j}·z) + b, where t_1, ..., t_s are the indices of the s support vectors. Classify z as + if the value is positive, and - otherwise. Note: w need not be computed/stored explicitly; we can store the α's and classify z using the above calculation, which involves taking the dot product between training examples and the new example z. In fact, in SVM both learning and classification can be achieved using dot products between pairs of input points: this allows us to do the kernel trick, that is, to replace the dot product with something called a kernel function.
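
A small sketch of this prediction rule (using scikit-learn only to obtain the α's and b on made-up data): the decision value for a new z is computed purely from dot products with the stored support vectors, without ever forming w.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [3.5, 3.0], [0.0, 0.0], [-1.0, 1.0], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)

sv = clf.support_vectors_          # the x_{t_j}
coef = clf.dual_coef_.ravel()      # alpha_{t_j} * y_{t_j}
b = clf.intercept_[0]

def predict(z):
    """sum_j alpha_j y_j (x_j . z) + b, using only dot products with support vectors."""
    score = np.sum(coef * (sv @ z)) + b
    return 1 if score >= 0 else -1

z = np.array([1.5, 1.0])           # made-up new input
print("prediction for z:", predict(z))
print("matches sklearn:", predict(z) == int(clf.predict([z])[0]))
```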

Non-linear SVMs. Datasets that are linearly separable with some noise work out great. (Figure: 1-D points on the x-axis separated by a single threshold at 0.) But what are we going to do if the dataset is just too hard? (Figure: 1-D points on the x-axis that no single threshold can separate.)

Mapping the input to a higher-dimensional space can solve the linearly inseparable cases. (Figure: the 1-D input x, which is not separable by a threshold, is mapped to the 2-D points (x, x²), where a line separates the two classes.)
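
A hedged sketch of exactly this trick on made-up 1-D data (negatives near the origin, positives further out): no threshold on x separates the classes, but after mapping x → (x, x²) a linear SVM separates them perfectly.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data: class -1 near the origin, class +1 further away.
x = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0, -0.5, 0.0, 0.3, 0.8, -0.9, 0.6])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])

# Linear SVM on the raw 1-D input: cannot separate the two classes.
raw = SVC(kernel='linear', C=1e3).fit(x.reshape(-1, 1), y)
print("accuracy on raw x:", raw.score(x.reshape(-1, 1), y))

# Map x -> (x, x^2): the classes become linearly separable in 2-D.
Phi = np.column_stack([x, x ** 2])
mapped = SVC(kernel='linear', C=1e3).fit(Phi, y)
print("accuracy on (x, x^2):", mapped.score(Phi, y))
```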

Non-linear SVMs: Feature Spaces. General idea: for any data set, the original input space can always be mapped to some higher-dimensional feature space such that the data is linearly separable: x → Φ(x).

Example: Quadratic Feature Space. Assume m input dimensions: x = (x_1, x_2, ..., x_m). The quadratic feature map Φ(x) consists of a constant 1, the linear terms √2·x_i, the squares x_i², and the cross terms √2·x_i·x_j. Number of quadratic terms: 1 + m + m + m(m-1)/2 ≈ m². The number of dimensions increases rapidly! You may be wondering about the √2's. At least they won't hurt anything! You'll find out why they are there soon!

Dot product in quadratic feature space. Φ(a)·Φ(b) = 1 + 2 Σ_{i=1}^{m} a_i b_i + Σ_{i=1}^{m} a_i² b_i² + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} a_i a_j b_i b_j. Now let's just look at another interesting function of (a·b): (a·b + 1)² = (Σ_{i=1}^{m} a_i b_i + 1)² = Σ_{i=1}^{m} Σ_{j=1}^{m} a_i a_j b_i b_j + 2 Σ_{i=1}^{m} a_i b_i + 1. They are the same! And the latter only takes O(m) to compute!
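
A quick numerical check of this identity (the explicit quadratic map below, with its √2 factors, is the standard one the slide assumes; the random vectors are arbitrary): Φ(a)·Φ(b) and (a·b + 1)² agree, but the first costs O(m²) work and the second O(m).

```python
import numpy as np

def quad_features(x):
    """Explicit quadratic feature map: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i<j)."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

explicit = quad_features(a) @ quad_features(b)   # O(m^2) work
kernel = (a @ b + 1.0) ** 2                      # O(m) work
print(explicit, kernel)                          # the two numbers match
```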

Kernel Functions. If every data point is mapped into a high-dimensional space via some transformation x → Φ(x), the dot product that we need to compute for classifying a point x becomes ⟨Φ(x_i), Φ(x)⟩ for all support vectors x_i. A kernel function is a function that is equivalent to a dot product in some feature space: k(a, b) = ⟨Φ(a), Φ(b)⟩. We have seen the example k(a, b) = (a·b + 1)², which is equivalent to mapping to the quadratic space! A kernel function can often be viewed as measuring similarity.

More kernel functions. Linear kernel: k(a, b) = a·b. Polynomial kernel: k(a, b) = (a·b + 1)^d. Radial-Basis-Function kernel: k(a, b) = exp(-‖a - b‖²/(2σ²)). In this case, the corresponding mapping Φ(x) is infinite-dimensional! Lucky that we don't have to compute the mapping explicitly! Prediction for a new input z: Σ_{j=1}^{s} α_{t_j} y_{t_j} (Φ(x_{t_j})·Φ(z)) + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} K(x_{t_j}, z) + b. Note: We'll not get into the details, but the learning of the α's can be achieved by using kernel functions as well!
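
To see why the kernel choice matters, here is a hedged comparison on scikit-learn's make_circles toy data (the dataset and parameter values are illustrative): a linear kernel cannot separate concentric rings, while the polynomial and RBF kernels can.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel, degree=2, gamma='scale', C=1.0).fit(X, y)
    print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")
```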

Nonlinear SVM summary. Using kernel functions in place of the dot product, we, in effect, map the input space to a high-dimensional feature space and learn a linear decision boundary in the feature space. The decision boundary will be nonlinear in the original input space. There are many possible choices of kernel functions. How to choose? The most frequently used method: cross-validation.

Strength vs Weakness. Strengths: The solution is globally optimal. It scales well with high-dimensional data. It can handle non-traditional data like strings and trees, instead of the traditional fixed-length feature vectors. Why? Because as long as we can define a kernel (similarity) function for such input, we can apply SVM. Weaknesses: We need to specify a good kernel. Training time can be long if you use the wrong software package.