Support Vector Machines
Vibhav Gogate, The University of Texas at Dallas
What Have We Learned So Far?
1. Decision Trees
2. Naïve Bayes
3. Linear Regression
4. Logistic Regression
5. Perceptron
6. Neural Networks
7. K-Nearest Neighbors
Which of the above are linear and which are not?
o (1), (6) and (7) are non-linear
o (2) is linear under certain restrictions
Decision Surfaces
- Decision Tree
- Linear Functions: g(x) = w·x + b
- Nonlinear Functions (Neural nets)
Today: Support Vector Machine (SVM)
- A classifier derived from statistical learning theory by Vapnik et al. in 1992.
- SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task.
- Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
- Also used for regression (will not cover today).
Readings: Chapter 5.1, 5.2, 5.3, 5.11 (5.4*) in Bishop; SVM tutorial (start reading from Section 3).
V. Vapnik
Outline
- Linear Discriminant Function
- Large Margin Linear Classifier
- Nonlinear SVM: The Kernel Trick
- Demo of SVM
Linear Discriminant Function for a Linear Classifier
- Given data and two classes (denoted +1 and -1), learn a function of the form: g(x) = w·x + b
- This is a hyperplane in the feature space: w·x + b > 0 on one side, w·x + b < 0 on the other.
- Decide class = +1 if g(x) > 0 and class = -1 otherwise.
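As a minimal sketch (assuming NumPy; the function name is illustrative), the decision rule reads:

```python
import numpy as np

def classify(w, b, x):
    # decide class = +1 if g(x) = w.x + b > 0, and class = -1 otherwise
    return 1 if w @ x + b > 0 else -1
```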
Linear Discriminant Function
- How would you classify these points (classes +1 and -1) using a linear discriminant function in order to minimize the error rate?
- Infinite number of answers!
- Which one is the best?
Large Margin Linear Classifier
- The linear discriminant function (classifier) with the maximum margin is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point.
- Why is it the best? The larger the margin, the better the generalization, and the classifier is more robust to outliers.
Large Margin Linear Classifier
- Aim: learn a large margin classifier.
- Given a set of data points, define:
  For y_i = +1, w·x_i + b ≥ 1
  For y_i = -1, w·x_i + b ≤ -1
- Give an algebraic expression for the width of the margin.
Algebraic Expression for the Width of the Margin
- The margin is bounded by the two hyperplanes w·x + b = 1 and w·x + b = -1.
- For points x+ and x- lying on these two hyperplanes, w·(x+ - x-) = 2, so their separation along the direction w/‖w‖ is 2/‖w‖.
- Hence the width of the margin is 2/‖w‖.
Large Margin Linear Classifier
- Aim: learn a large margin classifier.
- Mathematical formulation:
  maximize 2/‖w‖
  such that
  for y_i = +1, w·x_i + b ≥ 1
  for y_i = -1, w·x_i + b ≤ -1
- Common theme in machine learning: LEARNING IS OPTIMIZATION
Large Margin Linear Classifier
- Equivalent formulation:
  minimize ½‖w‖²
  such that
  for y_i = +1, w·x_i + b ≥ 1
  for y_i = -1, w·x_i + b ≤ -1
Large Margin Linear Classifier
- Equivalent formulation, with the two constraints combined:
  minimize ½‖w‖²
  such that y_i (w·x_i + b) ≥ 1
Large Margin Linear Classifier
- Formulation:
  minimize ½‖w‖²
  such that y_i (w·x_i + b) ≥ 1
- This is a quadratic programming problem with linear constraints:
  o Off-the-shelf software can solve it.
- However, we will convert it to the Lagrangian dual in order to use the kernel trick!
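As an illustration of the off-the-shelf route, a minimal sketch using scikit-learn (assumed installed); the toy data is made up, and a very large C approximates the hard-margin formulation:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (illustrative values)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 0.5]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel='linear', C=1e10)        # very large C ~ hard margin
clf.fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]    # learned hyperplane g(x) = w.x + b
print(w, b, clf.support_vectors_)
```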
Solving the Optimization Problem
- Quadratic programming with linear constraints:
  minimize ½‖w‖²  s.t.  y_i (w·x_i + b) ≥ 1
- Lagrangian function:
  minimize L_p(w, b, α) = ½ w·w − Σ_{i=1}^n α_i [ y_i (w·x_i + b) − 1 ]
  s.t. α_i ≥ 0
Solving the Optimization Problem
- minimize L_p(w, b, α) = ½ w·w − Σ_{i=1}^n α_i [ y_i (w·x_i + b) − 1 ],  s.t. α_i ≥ 0
- Setting the gradients to zero:
  ∂L_p/∂w = 0  ⇒  w = Σ_{i=1}^n α_i y_i x_i
  ∂L_p/∂b = 0  ⇒  Σ_{i=1}^n α_i y_i = 0
Solving the Optimization Problem
- Substituting these back into L_p gives the Lagrangian dual problem:
  maximize Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)
  s.t. α_i ≥ 0 and Σ_{i=1}^n α_i y_i = 0
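A minimal sketch (assuming NumPy and the cvxopt QP solver are installed) of solving this dual with a generic QP package; cvxopt minimizes ½αᵀPα + qᵀα s.t. Gα ≤ h, Aα = b, so the dual is negated and rearranged into that form:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """X: (n, d) data matrix; y: (n,) labels in {-1, +1}."""
    n = X.shape[0]
    K = X @ X.T                                     # all pairwise dot products
    P = matrix((np.outer(y, y) * K).astype(float))  # P_ij = y_i y_j (x_i . x_j)
    q = matrix(-np.ones(n))                         # minimize -sum_i a_i + ...
    G = matrix(-np.eye(n))                          # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))      # sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    sv = alpha > 1e-6                               # support vectors: a_i > 0
    w = (alpha[sv] * y[sv]) @ X[sv]                 # w = sum_i a_i y_i x_i
    b0 = np.mean(y[sv] - X[sv] @ w)                 # from y_i (w.x_i + b) = 1
    return w, b0, alpha, sv
```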
Solving the Optimization Problem
- From these equations, we can prove the KKT conditions:
  α_i [ y_i (w·x_i + b) − 1 ] = 0
- Thus, only support vectors (points with y_i (w·x_i + b) = 1) have α_i ≠ 0.
- The solution has the form:
  w = Σ_{i=1}^n α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i
- Get b from y_i (w·x_i + b) − 1 = 0, where x_i is any support vector.
Solving the Optimization Problem
- The linear discriminant function is:
  g(x) = w·x + b = Σ_{i∈SV} α_i y_i (x_i · x) + b
- Notice it relies on a dot product between the test point x and the support vectors x_i.
- Also keep in mind that solving the optimization problem involved computing the dot products x_i · x_j between all pairs of training points.
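Continuing the QP sketch above (the array names are from that block), prediction needs only the support vectors and their dot products with the test point:

```python
import numpy as np

def predict(X_sv, y_sv, alpha_sv, b, x):
    # g(x) = sum_{i in SV} a_i y_i (x_i . x) + b
    return np.sign((alpha_sv * y_sv) @ (X_sv @ x) + b)
```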
Large Margin Linear Classifier
- What if the data is not linearly separable? (noisy data, outliers, etc.)
- Slack variables ξ_i can be added to allow misclassification of difficult or noisy data points.
Large Margin Linear Classifier
- Formulation with slack variables:
  minimize ½‖w‖² + C Σ_{i=1}^n ξ_i
  such that y_i (w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0
- Compare with the formulation without slack variables:
  minimize ½‖w‖²  s.t.  y_i (w·x_i + b) ≥ 1
- Parameter C can be viewed as a way to control over-fitting.
Large Margin Linear Classifier
- Formulation (Lagrangian dual problem):
  maximize Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)
  such that 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0
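In the dual QP sketch above, the only change for the soft margin is the extra upper bound α_i ≤ C. As a sketch (n is the number of points and C a chosen constant, both from the surrounding code):

```python
import numpy as np
from cvxopt import matrix

# Replace G and h in the hard-margin sketch to encode 0 <= a_i <= C:
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))        # -a_i <= 0 and a_i <= C
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
```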
Non-linear SVMs
- Datasets that are linearly separable with noise work out great.
- But what are we going to do if the dataset is just too hard? The Kernel Trick!!!
- SVM = Linear SVM + Kernel Trick
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Kernel Trick: Motivation
- Linear classifiers are well understood, widely used, and efficient.
- How can we use linear classifiers to build non-linear ones?
- Neural networks: construct non-linear classifiers by using a network of linear classifiers (perceptrons).
- Kernels:
  o Map the problem from the input space to a new higher-dimensional space (called the feature space) by doing a non-linear transformation using a special function called the kernel, then use a linear model in this new high-dimensional feature space.
  o The linear model in the feature space corresponds to a non-linear model in the input space.
Non-linear SVMs: Feature Space
- General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Nonlinear SVMs: The Kernel Trick
- With this mapping, our discriminant function is now:
  g(x) = w·φ(x) + b = Σ_{i∈SV} α_i y_i φ(x_i)·φ(x) + b
- No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
- A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
  K(x_i, x_j) = φ(x_i)·φ(x_j)
Nonlinear SVMs: The Kernel Trick
- An example: 2-dimensional vectors x = [x_1, x_2]; let K(x_i, x_j) = (1 + x_i·x_j)².
- Need to show that K(x_i, x_j) = φ(x_i)·φ(x_j):
  (1 + x_i·x_j)²
  = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
  = [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2] · [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
  = φ(x_i)·φ(x_j),  where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
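A quick numerical check of this identity (a sketch assuming NumPy; the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # explicit feature map for K(x, z) = (1 + x.z)^2 in 2-D
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1 + x @ z) ** 2)   # 4.0, computed directly in the 2-D input space
print(phi(x) @ phi(z))    # 4.0, the same dot product in the 6-D feature space
```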
Nonlinear SVMs: The Kernel Trick
Examples of commonly used kernel functions:
- Linear kernel: K(x_i, x_j) = x_i·x_j
- Polynomial kernel of degree p: K(x_i, x_j) = (1 + x_i·x_j)^p
- Gaussian (Radial Basis Function, RBF) kernel: K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
- Sigmoid: K(x_i, x_j) = tanh(β_0 x_i·x_j + β_1)
In general, functions that satisfy Mercer's condition can be kernel functions: the kernel matrix should be positive semidefinite.
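As one-line sketches of these kernels (σ, p, β_0, β_1 are hyperparameters; the default values shown are illustrative, not prescribed by the slides):

```python
import numpy as np

linear  = lambda x, z: x @ z
poly    = lambda x, z, p=2: (1 + x @ z) ** p
rbf     = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
sigmoid = lambda x, z, b0=1.0, b1=0.0: np.tanh(b0 * (x @ z) + b1)
```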
Nonlinear SVM: Optimization
- Formulation (Lagrangian dual problem):
  maximize Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)
  such that 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0
- The solution for the discriminant function is:
  g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b
- The optimization technique is the same.
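In the dual QP sketch from earlier, the only further change is building the Gram matrix from the chosen kernel instead of plain dot products (a sketch using the rbf helper defined above; sigma is a chosen hyperparameter):

```python
import numpy as np

# Kernelized Gram matrix: K_ij = K(x_i, x_j) replaces K = X @ X.T
K = np.array([[rbf(xi, xj, sigma=1.0) for xj in X] for xi in X])
# Note: w = sum_i a_i y_i phi(x_i) is never formed explicitly; prediction
# uses g(x) = sum_{i in SV} a_i y_i K(x_i, x) + b instead.
```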
Support Vector Machine: Algorithm
1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages available)
4. Construct the discriminant function from the support vectors
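An end-to-end sketch of these four steps with scikit-learn (assumed installed); the dataset, kernel, and parameter values are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # toy nonlinear data
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)   # steps 1-2: choose kernel and C
clf.fit(X, y)                               # step 3: solve the QP
print(clf.n_support_)                       # step 4: g(x) is built from these SVs
print(clf.predict(X[:5]))
```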
Some Issues
- Choice of kernel
  o Gaussian or polynomial kernel is the default
  o if ineffective, more elaborate kernels are needed
  o domain experts can give assistance in formulating appropriate similarity measures
- Choice of kernel parameters
  o e.g. σ in the Gaussian kernel
  o one heuristic: set σ to the distance between the closest points with different classifications
  o in the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters
- Optimization criterion: hard margin vs. soft margin
  o a lengthy series of experiments in which various parameters are tested
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
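A sketch of the cross-validation approach with scikit-learn's GridSearchCV (the parameter grid is illustrative, and X, y are assumed to be an already-loaded dataset):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
                    cv=5)                   # 5-fold cross-validation
grid.fit(X, y)
print(grid.best_params_)                    # kernel parameters chosen by CV
```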
Summary: Support Vector Machine
1. Large Margin Classifier
  o Better generalization ability & less over-fitting
2. The Kernel Trick
  o Map data points to a higher-dimensional space in order to make them linearly separable.
  o Since only the dot product is used, we do not need to represent the mapping explicitly.
Additional Resource
http://www.kernel-machines.org/