Chapter 6: Support Vector Machines (Séparateurs à Vaste Marge)
A binary classification method based on supervised learning, introduced by Vladimir Vapnik in 1995. It relies on the existence of a linear classifier and is efficient in both computation time and accuracy.
BASIC IDEA: find a linear classifier (a hyperplane) separating the data into two categories: class (+) for points with y > 0 and class (−) for points with y < 0, while maximizing the separation distance between the two classes.
Discriminant Function. It can be an arbitrary function of x, such as: Nearest Neighbor; Decision Tree; linear functions g(x) = w^T x + b; nonlinear functions.
Linear Discriminant Function. g(x) is a linear function: g(x) = w^T x + b, i.e. a hyperplane in the feature space. Its (unit-length) normal vector is n = w / ||w||. Points with w^T x + b > 0 lie on one side of the hyperplane, points with w^T x + b < 0 on the other.
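The decision rule above can be sketched in a few lines. This is a minimal illustration with a hand-picked, hypothetical hyperplane (w and b are not learned here):

```python
import numpy as np

def linear_discriminant(w, b, x):
    """Evaluate g(x) = w^T x + b for a single point x."""
    return np.dot(w, x) + b

def classify(w, b, x):
    """Assign +1 if g(x) > 0, otherwise -1."""
    return 1 if linear_discriminant(w, b, x) > 0 else -1

# Hypothetical hyperplane: w = (1, 1), b = -1, i.e. the line x1 + x2 = 1.
w, b = np.array([1.0, 1.0]), -1.0
print(classify(w, b, np.array([2.0, 2.0])))   # point above the line -> 1
print(classify(w, b, np.array([0.0, 0.0])))   # point below the line -> -1
```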
Linear Discriminant Function. How would you classify these points (labeled +1 and −1) using a linear discriminant function in order to minimize the error rate? There is an infinite number of answers — which one is the best?
Large Margin Linear Classifier. The linear discriminant function (classifier) with the maximum margin is the best. The margin is defined as the width by which the boundary could be increased before hitting a data point. Why is it the best? It is robust to outliers and thus has strong generalization ability.
Large Margin Linear Classifier. Given a set of data points {(x_i, y_i)}, i = 1, …, n, where y_i ∈ {+1, −1}:
for y_i = +1, w^T x_i + b > 0;
for y_i = −1, w^T x_i + b < 0.
With a scale transformation on both w and b, the above is equivalent to:
for y_i = +1, w^T x_i + b ≥ 1;
for y_i = −1, w^T x_i + b ≤ −1.
Large Margin Linear Classifier. Formulation: maximize the margin 2/||w|| such that
for y_i = +1, w^T x_i + b ≥ 1;
for y_i = −1, w^T x_i + b ≤ −1.
Large Margin Linear Classifier. Equivalent formulation: minimize (1/2)||w||² such that
for y_i = +1, w^T x_i + b ≥ 1;
for y_i = −1, w^T x_i + b ≤ −1.
Large Margin Linear Classifier. Compact formulation: minimize (1/2)||w||² such that y_i (w^T x_i + b) ≥ 1 for all i.
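The canonical constraints and the margin width 2/||w|| can be checked numerically. Below, the toy data and the hyperplane (w, b) are hypothetical values chosen by hand so that the closest point of each class satisfies y_i (w^T x_i + b) = 1 exactly:

```python
import numpy as np

# Hypothetical separable toy data: two points per class in 2-D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hand-chosen canonical hyperplane: the closest points (2,2) and (0,0)
# land exactly on the margin boundaries y_i (w^T x_i + b) = 1.
w = np.array([0.5, 0.5])
b = -1.0

margins = y * (X @ w + b)
print(margins)                 # every entry is >= 1, as the constraints require
print(2 / np.linalg.norm(w))   # margin width 2/||w||
```

Here the margin width is 2/||(0.5, 0.5)|| = 2√2, which matches the geometric distance between the two boundary lines.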
Solving the Optimization Problem. This is quadratic programming with linear constraints: minimize (1/2)||w||² s.t. y_i (w^T x_i + b) ≥ 1. Lagrangian function: minimize L_p(w, b, α) = (1/2)||w||² − Σ_{i=1}^n α_i [y_i (w^T x_i + b) − 1], s.t. α_i ≥ 0.
Solving the Optimization Problem. Setting the gradients of L_p to zero:
∂L_p/∂w = 0 ⇒ w = Σ_{i=1}^n α_i y_i x_i;
∂L_p/∂b = 0 ⇒ Σ_{i=1}^n α_i y_i = 0.
Solving the Optimization Problem. Lagrangian dual problem: maximize Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j, such that α_i ≥ 0 and Σ_{i=1}^n α_i y_i = 0.
Solving the Optimization Problem. From the KKT condition, we know: α_i [y_i (w^T x_i + b) − 1] = 0. Thus, only support vectors have α_i ≠ 0. The solution has the form: w = Σ_{i=1}^n α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i. Get b from y_i (w^T x_i + b) − 1 = 0, where x_i is any support vector.
Solving the Optimization Problem. The linear discriminant function is: g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b. Notice it relies on a dot product between the test point x and the support vectors x_i. Also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points.
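The two forms of the discriminant function — w^T x + b and the support-vector expansion — agree, which can be verified directly. The multipliers below are hypothetical values (consistent with Σ α_i y_i = 0), not the output of an actual solver:

```python
import numpy as np

# Hypothetical solved multipliers: only support vectors are nonzero.
X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.25, 0.25, 0.0])   # third point is not a support vector

w = (alpha * y) @ X                    # w = sum_i alpha_i y_i x_i
# b from any support vector: y_s (w^T x_s + b) = 1  =>  b = y_s - w^T x_s
s = 0
b = y[s] - w @ X[s]

def g(x):
    """Discriminant written with dot products against support vectors only."""
    sv = alpha > 0
    return np.sum(alpha[sv] * y[sv] * (X[sv] @ x)) + b

x_test = np.array([1.0, 3.0])
print(np.isclose(g(x_test), w @ x_test + b))   # both forms give the same value
```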
Solution of the optimization problem (α* denotes the optimal multipliers):
w* = Σ_{i=1}^m α_i* y_i x_i;
w_0* = y_s − w*^T x_s, with (x_s, y_s) being any support vector;
D(x) = w*^T x + w_0*.
Only the α_i corresponding to the closest points are nonzero; these points are called support vectors, and they determine the optimal hyperplane.
Geometric interpretation: in the plot of the two classes, only the multipliers of the closest points are nonzero (e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6); all the other α_i = 0. Those closest points are the support vectors.
Large Margin Linear Classifier. What if the data is not linearly separable (noisy data, outliers, etc.)? Slack variables ξ_i can be added to allow misclassification of difficult or noisy data points.
Large Margin Linear Classifier. Soft-margin formulation: minimize (1/2)||w||² + C Σ_{i=1}^n ξ_i such that y_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0. The parameter C can be viewed as a way to control over-fitting.
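For a fixed (w, b), the optimal slack of each point is the hinge value ξ_i = max(0, 1 − y_i (w^T x_i + b)), so the soft-margin objective can be evaluated directly. The data and hyperplane below are hypothetical, with one deliberately noisy point:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum(xi) with xi_i = max(0, 1 - y_i(w^T x_i + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * xi.sum(), xi

# One noisy point (the third) sits on the wrong side of its margin.
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.9, 1.9]])
y = np.array([1.0, -1.0, -1.0])
obj, xi = soft_margin_objective(np.array([0.5, 0.5]), -1.0, X, y, C=1.0)
print(xi)    # only the noisy third point needs a positive slack
```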
Large Margin Linear Classifier. Formulation (Lagrangian dual problem): maximize Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j such that 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0.
Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Non-linear SVMs: Feature Space. General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x). This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Nonlinear SVMs: The Kernel Trick. With this mapping, our discriminant function is now: g(x) = w^T φ(x) + b = Σ_{i∈SV} α_i y_i φ(x_i)^T φ(x) + b. There is no need to know this mapping explicitly, because we only use the dot product of feature vectors, in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(x_i, x_j) = φ(x_i)^T φ(x_j).
Nonlinear SVMs: The Kernel Trick. An example: 2-dimensional vectors x = [x_1, x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)². We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):
K(x_i, x_j) = (1 + x_i^T x_j)² = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
= [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]^T [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
= φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2].
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
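The identity above can be checked numerically: evaluating the kernel in the input space gives exactly the dot product of the explicit feature maps. The two test vectors are arbitrary:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x^T z)^2."""
    x1, x2 = x
    return np.array([1.0, x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
lhs = (1.0 + xi @ xj) ** 2     # kernel evaluated in the input space
rhs = phi(xi) @ phi(xj)        # dot product in the expanded feature space
print(np.isclose(lhs, rhs))    # the two are identical
```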
Nonlinear SVMs: The Kernel Trick. Examples of commonly-used kernel functions:
Linear kernel: K(x_i, x_j) = x_i^T x_j
Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p
Gaussian (Radial Basis Function, RBF) kernel: K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
Sigmoid: K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)
In general, functions that satisfy Mercer's condition can be kernel functions.
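The four kernels listed above translate directly into code. This is a minimal sketch; the default parameter values (p, σ, β_0, β_1) are arbitrary choices for illustration:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    # Note: the tanh kernel satisfies Mercer's condition only for some
    # parameter choices.
    return np.tanh(beta0 * (xi @ xj) + beta1)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(rbf_kernel(x, x))    # a point compared with itself gives 1.0
```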
Nonlinear SVM: Optimization. Formulation (Lagrangian dual problem): maximize Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j) such that 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0. The solution of the discriminant function is g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b. The optimization technique is the same.
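As a rough sketch of this dual optimization, the snippet below trains a simplified, bias-free kernel SVM (it drops the Σ α_i y_i = 0 constraint, so only the box constraint 0 ≤ α_i ≤ C remains) with projected gradient ascent, rather than a full QP solver. The XOR-style data, the RBF bandwidth, C, the step size, and the iteration count are all illustrative choices:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# XOR-style toy data: not linearly separable in the input space.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1., 1., -1., -1.])
n, C, eta = len(X), 10.0, 0.1

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# Projected gradient ascent on the bias-free dual:
# maximize sum(alpha) - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij,  0 <= alpha <= C
alpha = np.zeros(n)
for _ in range(500):
    grad = 1.0 - y * (K @ (alpha * y))       # partial derivative w.r.t. alpha_i
    alpha = np.clip(alpha + eta * grad, 0.0, C)

def g(x):
    """Kernel discriminant (no bias term in this simplified variant)."""
    return np.sum(alpha * y * np.array([rbf(xj, x) for xj in X]))

preds = np.sign([g(x) for x in X])
print(preds)   # the RBF kernel separates the XOR pattern correctly
```

Production code would instead hand the full dual (with the equality constraint and a bias) to a QP solver, as the slides describe.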
Support Vector Machine: Algorithm
1. Choose a kernel function.
2. Choose a value for C.
3. Solve the quadratic programming problem (many software packages are available).
4. Construct the discriminant function from the support vectors.
Some Issues.
Choice of kernel: the Gaussian or polynomial kernel is the default; if it is ineffective, more elaborate kernels are needed; domain experts can give assistance in formulating appropriate similarity measures.
Choice of kernel parameters: e.g. σ in the Gaussian kernel, typically on the order of the distance between the closest points with different classifications; in the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters.
Optimization criterion (hard margin vs. soft margin): usually a lengthy series of experiments in which various parameters are tested.
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Summary: Support Vector Machine
1. Large margin classifier: better generalization ability and less over-fitting.
2. The kernel trick: map data points to a higher-dimensional space in order to make them linearly separable. Since only the dot product is used, we do not need to represent the mapping explicitly.
Solution of the new optimization problem. The decision function then becomes: D(x) = Σ_{i=1}^{m_S} u_i K(x, x_i) + w_0, where m_S is the number of support vectors (and u_i = α_i* y_i).
OPERATING SCHEME of an SVM (architecture diagram). Output: sign(Σ_i u_i K(x, x_i) + w_0). Comparison layer: kernel evaluations K(x, x_i) against the stored examples x_1, x_2, x_3, …. Input: the vector x.
Architecture of SVMs. Nonlinear classifier (using a kernel): the decision function is computed as the solution of a quadratic program:
f(x) = sign(Σ_{i=1}^l v_i φ(x_i)^T φ(x) + b) = sign(Σ_{i=1}^l v_i k(x_i, x) + b),
substituting v_i = α_i y_i for each training example.
Matlab example

    load fisheriris
    data = [meas(:,1), meas(:,2)];
    % Extract the Setosa class
    groups = ismember(species,'setosa');
    % Randomly select training and test sets
    [train, test] = crossvalind('holdout',groups);
    % Use a linear support vector machine classifier
    svmStruct = svmtrain(data(train,:),groups(train),'showplot',true);
    classes = svmclassify(svmStruct,data(test,:),'showplot',true);
    % See how well the classifier performed
    cp = classperf(groups);
    classperf(cp,classes,test);
    cp.CorrectRate

Related metrics: sensitivity or true positive rate (TPR); specificity (SPC) or true negative rate (TNR).
[Figure: scatter plot of the two classes — training and classified points of classes 0 and 1 — with the support vectors highlighted, as produced by the 'showplot' option.]
Additional Resource: http://www.kernel-machines.org/
Demo of LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/