Support Vector Machines. Vibhav Gogate, The University of Texas at Dallas


1 Support Vector Machines. Vibhav Gogate, The University of Texas at Dallas

2 What Have We Learned So Far?
1. Decision Trees
2. Naïve Bayes
3. Linear Regression
4. Logistic Regression
5. Perceptron
6. Neural Networks
7. K-Nearest Neighbors
Which of the above are linear and which are not? (1), (6) and (7) are non-linear. (2) is linear under certain restrictions.

3 Decision Surfaces
Decision tree
Linear functions: $g(x) = w^T x + b$
Nonlinear functions (neural nets)

4 Today: Support Vector Machine (SVM)
A classifier derived from statistical learning theory by Vapnik et al. in 1992.
SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task.
Currently, SVM is widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
Also used for regression (will not be covered today).
Chapter 5.1, 5.2, 5.3, 5.11 (5.4*) in Bishop. SVM tutorial (start reading from Section 3).
[Photo: V. Vapnik]

5 Outline
Linear Discriminant Function
Large Margin Linear Classifier
Nonlinear SVM: The Kernel Trick
Demo of SVM

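6 Linear Discriminant Function (or a Linear Classifier)
Given data and two classes, learn a function of the form $g(x) = w^T x + b$.
The decision boundary $w^T x + b = 0$ is a hyperplane in the feature space.
Decide class = +1 if $g(x) > 0$ and class = -1 otherwise.
[Figure: points labeled +1 and -1 in the $(x_1, x_2)$ plane, separated by a line with $w^T x + b > 0$ on one side and $w^T x + b < 0$ on the other.]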
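The decision rule above can be sketched in a few lines of Python. The weight vector `w`, the offset `b`, and the helper names `g` and `classify` are hypothetical values chosen for illustration; they are not learned from data and do not come from the slides.

```python
def g(x, w, b):
    """Linear discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Return +1 if g(x) > 0 and -1 otherwise, as on the slide."""
    return 1 if g(x, w, b) > 0 else -1


# Hypothetical parameters, picked by hand for illustration only.
w = (1.0, -1.0)
b = 0.5

label_a = classify((2.0, 0.0), w, b)  # g = 2.0 + 0.5 = 2.5, positive side
label_b = classify((0.0, 2.0), w, b)  # g = -2.0 + 0.5 = -1.5, negative side
```

Any pair `(w, b)` defines such a classifier; the rest of the lecture is about which pair to prefer.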

7 Linear Discriminant Function
How would you classify these points using a linear discriminant function in order to minimize the error rate?
Infinite number of answers!
[Figure: points labeled +1 and -1 in the $(x_1, x_2)$ plane.]

10 Linear Discriminant Function
How would you classify these points using a linear discriminant function in order to minimize the error rate?
Infinite number of answers! Which one is the best?
[Figure: points labeled +1 and -1 in the $(x_1, x_2)$ plane.]

11 Large Margin Linear Classifier
The linear discriminant function (classifier) with the maximum margin is the best.
The margin is defined as the width by which the boundary could be increased before hitting a data point.
Why is it the best? The larger the margin, the better the generalization, and the classifier is robust to outliers.
[Figure: points labeled +1 and -1 in the $(x_1, x_2)$ plane, with the margin ("safe zone") around the separating line.]

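12 Large Margin Linear Classifier
Aim: learn a large margin classifier.
Given a set of data points $\{(x_i, y_i)\}$, $i = 1, \dots, n$, define:
For $y_i = +1$: $w^T x_i + b \ge 1$
For $y_i = -1$: $w^T x_i + b \le -1$
Give an algebraic expression for the width of the margin.
[Figure: points labeled +1 and -1 in the $(x_1, x_2)$ plane, with the margin ("safe zone") around the separating line.]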

13 Algebraic Expression for Width of a Margin
[Figure: the margin ("safe zone") around the separating line in the $(x_1, x_2)$ plane.]
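The transcript keeps only the title of this slide, so the following is a reconstruction of the standard argument rather than the slide's own text. It assumes $x^+$ and $x^-$ are points lying on the two margin boundaries, with $x^+ - x^-$ parallel to $w$, and writes $M$ for the margin width.

```latex
\begin{align*}
w^T x^+ + b &= 1, \qquad w^T x^- + b = -1 \\
\Rightarrow\quad w^T (x^+ - x^-) &= 2 \\
M &= \frac{w^T}{\|w\|}\,(x^+ - x^-) = \frac{2}{\|w\|}
\end{align*}
```

The width is the projection of $x^+ - x^-$ onto the unit normal $w / \|w\|$ of the hyperplane, so maximizing the margin amounts to making $\|w\|$ small.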

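14 Large Margin Linear Classifier
Aim: learn a large margin classifier.
Mathematical formulation:
maximize $\frac{2}{\|w\|}$
such that
For $y_i = +1$: $w^T x_i + b \ge 1$
For $y_i = -1$: $w^T x_i + b \le -1$
Common theme in machine learning: LEARNING IS OPTIMIZATION
[Figure: the margin in the $(x_1, x_2)$ plane, with points $x^+$ and $x^-$ marked.]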

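15 Large Margin Linear Classifier
Formulation:
minimize $\frac{1}{2}\|w\|^2$
such that
For $y_i = +1$: $w^T x_i + b \ge 1$
For $y_i = -1$: $w^T x_i + b \le -1$
[Figure: the margin in the $(x_1, x_2)$ plane, with points $x^+$ and $x^-$ marked.]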

16 Large Margin Linear Classifier
Formulation:
minimize $\frac{1}{2}\|w\|^2$
such that $y_i (w^T x_i + b) \ge 1$
[Figure: the margin in the $(x_1, x_2)$ plane, with points $x^+$ and $x^-$ marked.]

17 Large Margin Linear Classifier
Formulation:
minimize $\frac{1}{2}\|w\|^2$
such that $y_i (w^T x_i + b) \ge 1$
This is a quadratic programming problem with linear constraints, so off-the-shelf software can solve it.
However, we will convert it to the Lagrangian dual in order to use the kernel trick!

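18 Solving the Optimization Problem
Quadratic programming with linear constraints:
minimize $\frac{1}{2}\|w\|^2$ s.t. $y_i (w^T x_i + b) \ge 1$
Lagrangian function:
minimize $L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$ s.t. $\alpha_i \ge 0$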

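19 Solving the Optimization Problem
minimize $L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$ s.t. $\alpha_i \ge 0$
$\frac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i y_i x_i$
$\frac{\partial L_p}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0$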

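20 Solving the Optimization Problem
minimize $L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$ s.t. $\alpha_i \ge 0$
Lagrangian dual problem:
maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$
s.t. $\alpha_i \ge 0$, and $\sum_{i=1}^{n} \alpha_i y_i = 0$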
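The step from the Lagrangian to the dual is not spelled out in the transcript. A sketch of the substitution, using only the two stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$, might read:

```latex
\begin{align*}
L_p &= \tfrac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i y_i\, w^T x_i
       - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
    &= \tfrac{1}{2} w^T w - w^T w - 0 + \sum_{i=1}^{n} \alpha_i \\
    &= \sum_{i=1}^{n} \alpha_i
       - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
         \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\end{align*}
```

The second line uses $\sum_i \alpha_i y_i\, w^T x_i = w^T \sum_i \alpha_i y_i x_i = w^T w$, and the last line expands $w^T w$ as a double sum. The result depends on the data only through the dot products $x_i^T x_j$.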

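21 Solving the Optimization Problem
From the equations, we can prove that (KKT conditions): $\alpha_i \left( y_i (w^T x_i + b) - 1 \right) = 0$
Thus, only support vectors have $\alpha_i \ne 0$.
The solution has the form: $w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$
Get $b$ from $y_i (w^T x_i + b) - 1 = 0$, where $x_i$ is a support vector.
[Figure: support vectors $x^+$ and $x^-$ marked in the $(x_1, x_2)$ plane.]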
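As a small illustration of recovering $w$ and $b$ from the dual variables, the sketch below uses a three-point toy dataset whose multipliers were worked out by hand (an assumption of this example, not something produced by a solver or given on the slides). The helper names `dot` and `recover_primal` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover_primal(alphas, ys, xs, tol=1e-9):
    """Recover w = sum_i alpha_i y_i x_i and b from the support vectors."""
    dim = len(xs[0])
    w = [0.0] * dim
    for a, y, x in zip(alphas, ys, xs):
        for k in range(dim):
            w[k] += a * y * x[k]
    # Support vectors are the points with alpha_i > 0.
    sv = [i for i, a in enumerate(alphas) if a > tol]
    # For a support vector, y_i (w^T x_i + b) = 1; since y_i is +1 or -1,
    # this gives b = y_i - w^T x_i. Average over support vectors.
    b = sum(ys[i] - dot(w, xs[i]) for i in sv) / len(sv)
    return w, b, sv


# Toy data: two points on the margin and one point far from it.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
ys = [1, -1, 1]
# Hand-derived multipliers for this toy set (assumed, not solver output).
alphas = [0.25, 0.25, 0.0]

w, b, sv = recover_primal(alphas, ys, xs)
margins = [y * (dot(w, x) + b) for x, y in zip(xs, ys)]
```

Here the third point has $\alpha_i = 0$ and a margin value above 1, which is what the KKT condition requires of a non-support vector.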

22 Solving the Optimization Problem
The linear discriminant function is: $g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i x_i^T x + b$
Notice that it relies on a dot product between the test point $x$ and the support vectors $x_i$.
Also keep in mind that solving the optimization problem involved computing the dot products $x_i^T x_j$ between all pairs of training points.

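23 Large Margin Linear Classifier
What if the data is not linearly separable (noisy data, outliers, etc.)?
Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy data points.
[Figure: points labeled +1 and -1 in the $(x_1, x_2)$ plane, with slack $\xi_i$ marked for points inside the margin.]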

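24 Large Margin Linear Classifier
Formulation:
minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
such that $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
Without slack variables, the formulation was: minimize $\frac{1}{2}\|w\|^2$ s.t. $y_i (w^T x_i + b) \ge 1$
Parameter $C$ can be viewed as a way to control over-fitting.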
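To make the role of the slack variables concrete, the sketch below evaluates the soft-margin objective for a fixed, hand-picked $(w, b)$. For fixed parameters the smallest feasible slack is $\xi_i = \max(0,\, 1 - y_i (w^T x_i + b))$. The data, the parameter values, and the function names are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, xs, ys):
    """Smallest xi_i satisfying y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]


def soft_margin_objective(w, b, xs, ys, C):
    """0.5 * ||w||^2 + C * sum_i xi_i for a fixed (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, xs, ys))


# Hypothetical data: the last two points violate the margin.
xs = [(1.0, 1.0), (-1.0, -1.0), (0.0, 0.0), (-1.0, 0.0)]
ys = [1, -1, 1, 1]
w, b = (0.5, 0.5), 0.0

xi = slacks(w, b, xs, ys)                                # [0.0, 0.0, 1.0, 1.5]
objective = soft_margin_objective(w, b, xs, ys, C=1.0)   # 0.25 + 2.5
```

A slack between 0 and 1 means the point is inside the margin but still correctly classified; a slack above 1 means it is misclassified. Raising `C` makes each unit of slack more expensive relative to the margin term.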

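25 Large Margin Linear Classifier
Formulation (Lagrangian dual problem):
maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$
such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$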

26 Non-linear SVMs
Datasets that are linearly separable with noise work out great.
But what are we going to do if the dataset is just too hard? Kernel Trick!!!
SVM = Linear SVM + Kernel Trick
[Figure: data points plotted along the $x$ axis.]
This slide is courtesy of

27 Kernel Trick Motivation
Linear classifiers are well understood, widely used and efficient.
How can we use linear classifiers to build non-linear ones?
Neural networks: construct non-linear classifiers by using a network of linear classifiers (perceptrons).
Kernels:
- Map the problem from the input space to a new higher-dimensional space (called the feature space) by doing a non-linear transformation using a special function called the kernel.
- Then use a linear model in this new high-dimensional feature space. The linear model in the feature space corresponds to a non-linear model in the input space.

28 Non-linear SVMs: Feature Space
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: x \to \varphi(x)$
This slide is courtesy of

29 Nonlinear SVMs: The Kernel Trick
With this mapping, our discriminant function is now: $g(x) = w^T \varphi(x) + b = \sum_{i \in SV} \alpha_i y_i \varphi(x_i)^T \varphi(x) + b$
There is no need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

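30 Nonlinear SVMs: The Kernel Trick
An example: 2-dimensional vectors $x = [x_1, x_2]$; let $K(x_i, x_j) = (1 + x_i^T x_j)^2$.
Need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
$K(x_i, x_j) = (1 + x_i^T x_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
$= [1, x_{i1}^2, \sqrt{2} x_{i1} x_{i2}, x_{i2}^2, \sqrt{2} x_{i1}, \sqrt{2} x_{i2}]^T [1, x_{j1}^2, \sqrt{2} x_{j1} x_{j2}, x_{j2}^2, \sqrt{2} x_{j1}, \sqrt{2} x_{j2}]$
$= \varphi(x_i)^T \varphi(x_j)$, where $\varphi(x) = [1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2]$
This slide is courtesy of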
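The identity can also be checked numerically. The sketch below compares the kernel value computed directly in the 2-dimensional input space with the dot product computed in the 6-dimensional feature space; the two test vectors and the function names are arbitrary choices made for illustration.

```python
import math


def poly2_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2 for 2-dimensional inputs."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]


xi = (1.0, 2.0)
xj = (3.0, -1.0)

k_direct = poly2_kernel(xi, xj)  # (1 + 3 - 2)^2
k_feature = sum(a * b for a, b in zip(phi(xi), phi(xj)))
```

The direct computation needs one 2-dimensional dot product, while the explicit route builds two 6-dimensional vectors first. That gap grows quickly with the input dimension and the polynomial degree, which is the practical point of the kernel trick.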

31 Nonlinear SVMs: The Kernel Trick
Examples of commonly used kernel functions:
Linear kernel: $K(x_i, x_j) = x_i^T x_j$
Polynomial kernel: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
Gaussian (Radial Basis Function, RBF) kernel: $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$
Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^T x_j + \beta_1)$
In general, functions that satisfy Mercer's condition can be kernel functions: the kernel matrix should be positive semidefinite.
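The four kernels listed above might be written as plain Python functions along these lines. The function names, default parameter values, and the sample vectors are assumptions of this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, p=2):
    return (1.0 + dot(x, z)) ** p


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(x, z) + beta1)


x = (1.0, 2.0)
z = (3.0, 4.0)
values = {
    "linear": linear_kernel(x, z),
    "polynomial": polynomial_kernel(x, z, p=2),
    "rbf": rbf_kernel(x, z, sigma=1.0),
    "sigmoid": sigmoid_kernel(x, z, beta0=0.1, beta1=0.0),
}
```

Each function takes two input vectors and returns a scalar similarity, so any of them can stand in for the dot product $x_i^T x_j$ in the dual problem.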

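32 Nonlinear SVM: Optimization
Formulation (Lagrangian dual problem):
maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$
The solution of the discriminant function is $g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$
The optimization technique is the same.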

33 Support Vector Machine: Algorithm
1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages available)
4. Construct the discriminant function from the support vectors
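The four steps can be walked through on a tiny XOR-style dataset. This sketch does not use a quadratic programming package. As a stand-in for step 3 it runs projected gradient ascent on the dual, and it drops the equality constraint $\sum_i \alpha_i y_i = 0$ by folding the offset $b$ into the kernel (adding a constant 1 to every kernel value). That is a simplification that slightly changes the problem, since the offset is then regularized too. All names, the data, and the step size are assumptions.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def train_dual(xs, ys, kernel, C=10.0, lr=0.1, steps=2000):
    """Projected gradient ascent on the box-constrained dual (no offset term)."""
    n = len(xs)
    # Adding 1.0 to the kernel absorbs the offset b into the feature space.
    K = [[kernel(xs[i], xs[j]) + 1.0 for j in range(n)] for i in range(n)]
    alphas = [0.0] * n
    for _ in range(steps):
        # Gradient of the dual: 1 - y_i * sum_j alpha_j y_j K_ij
        grads = [
            1.0 - ys[i] * sum(alphas[j] * ys[j] * K[i][j] for j in range(n))
            for i in range(n)
        ]
        # Ascent step, then clip back into the box 0 <= alpha_i <= C.
        alphas = [min(C, max(0.0, a + lr * g)) for a, g in zip(alphas, grads)]
    return alphas


def decision(x, xs, ys, alphas, kernel):
    """g(x) = sum over support vectors of alpha_i y_i (K(x_i, x) + 1)."""
    return sum(
        a * y * (kernel(xi, x) + 1.0)
        for a, y, xi in zip(alphas, ys, xs)
        if a > 0.0
    )


# Step 1: kernel = rbf_kernel. Step 2: C = 10. XOR-style data (hypothetical).
xs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
ys = [-1, -1, 1, 1]

alphas = train_dual(xs, ys, rbf_kernel, C=10.0)                  # step 3
preds = [1 if decision(x, xs, ys, alphas, rbf_kernel) > 0 else -1 for x in xs]  # step 4
```

No line separates these four points in the input space, yet the kernelized discriminant labels all of them correctly. For real problems a dedicated QP or SMO-style solver that keeps the equality constraint would normally replace `train_dual`.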

34 Some Issues
Choice of kernel
- Gaussian or polynomial kernel is the default
- If ineffective, more elaborate kernels are needed
- Domain experts can give assistance in formulating appropriate similarity measures
Choice of kernel parameters
- e.g. σ in the Gaussian kernel
- σ is the distance between the closest points with different classifications
- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.
Optimization criterion: hard margin vs. soft margin
- A lengthy series of experiments in which various parameters are tested
This slide is courtesy of

35 Summary: Support Vector Machine
1. Large Margin Classifier
- Better generalization ability and less over-fitting
2. The Kernel Trick
- Map data points to a higher-dimensional space in order to make them linearly separable.
- Since only the dot product is used, we do not need to represent the mapping explicitly.

36 Additional Resource
