Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan
1 Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan. Slides by Pier Luigi Martelli, Systems and In Silico Biology, University of Bologna.
2 Non-linearly separable problems. AND, OR and NOT(x) can be computed by a perceptron; the XOR problem cannot be solved with a perceptron.
3 With NN: Multi-layer feed-forward neural networks. Neurons are organized into hierarchical layers; each layer receives its inputs from the previous one and transmits its output to the next one. Each unit computes z_j = g(Σ_i w_ji x_i), where g is the activation function.
4 XOR network: hidden unit 1 has weights w = 0.7, 0.7 and threshold 0.5; hidden unit 2 has weights w = 0.3, 0.3 and threshold 0.5; the output unit has weights w = 0.7, -0.7 and threshold 0.5. Input x1 = 0, x2 = 0: hidden a1 = -0.5, z1 = 0; a2 = -0.5, z2 = 0; output a = -0.5, z = 0.
5 XOR network, input x1 = 1, x2 = 0: hidden a1 = 0.2, z1 = 1; a2 = -0.2, z2 = 0; output a = 0.2, z = 1.
6 XOR network, input x1 = 0, x2 = 1: hidden a1 = 0.2, z1 = 1; a2 = -0.2, z2 = 0; output a = 0.2, z = 1.
7 XOR network, input x1 = 1, x2 = 1: hidden a1 = 0.9, z1 = 1; a2 = 0.1, z2 = 1; output a = -0.5, z = 0.
8 The hidden layer REMAPS the input into a new representation that is linearly separable (table: input, desired output, and activation of the hidden neurons for the four XOR cases).
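A minimal numpy sketch (not part of the original slides) of the XOR network described above, using the weights 0.7/0.7 and 0.3/0.3 for the hidden units, 0.7/-0.7 for the output unit and 0.5 for all thresholds; the printed hidden activations show how the input is remapped into a linearly separable representation:

```python
import numpy as np

# Two hidden threshold units feeding one output threshold unit (perceptron
# activations). Weights and thresholds are the ones reported on the slides.
def step(a):
    return (a >= 0).astype(int)

W_hidden = np.array([[0.7, 0.7],    # hidden unit 1 (acts like OR)
                     [0.3, 0.3]])   # hidden unit 2 (acts like AND)
theta_hidden = np.array([0.5, 0.5])
w_out = np.array([0.7, -0.7])
theta_out = 0.5

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    x = np.array(x)
    z = step(W_hidden @ x - theta_hidden)          # hidden layer remaps the input
    y = step(np.array([w_out @ z - theta_out]))[0]
    print(x, "-> hidden", z, "-> output", y)        # reproduces XOR: 0, 1, 1, 0
```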
9 Extension to Non-linear Decision Boundary. So far, we have only considered large-margin classifiers with a linear decision boundary. How can we generalize them to become nonlinear? Key idea: transform x_i to a higher-dimensional space to "make life easier". Input space: the space where the points x_i are located. Feature space: the space of f(x_i) after transformation. Why transform? A linear operation in the feature space is equivalent to a nonlinear operation in the input space, and classification can become easier with a proper transformation. In the XOR problem, for example, adding the new feature x1·x2 makes the problem linearly separable.
10 XOR: with features X and Y the problem is not linearly separable; with features X, Y and the product XY it becomes linearly separable.
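As a quick check of this claim, a small sketch (the weights here are chosen by hand for illustration, not taken from the slides):

```python
import numpy as np

# In (X, Y) the XOR labels cannot be separated by a line, but after adding the
# product feature XY the linear rule  X + Y - 2*XY - 0.5 > 0  classifies all
# four points correctly.
points = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
labels = np.array([0, 1, 1, 0])                                     # XOR truth table

features = np.column_stack([points, points[:, 0] * points[:, 1]])   # (X, Y, XY)
w, b = np.array([1.0, 1.0, -2.0]), -0.5
predictions = (features @ w + b > 0).astype(int)
print(predictions, (predictions == labels).all())                   # [0 1 1 0] True
```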
11 Find a feature space.
12 Transforming the Data. The input space is mapped by f(.) into the feature space. Note: in practice the feature space is of higher dimension than the input space. Computation in the feature space can be costly because it is high-dimensional; the feature space can even be infinite-dimensional! The kernel trick comes to the rescue.
13 The Kernel Trick. Recall the SVM optimization problem: the data points only appear as scalar products x_i · x_j. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by K(x_i, x_j) = f(x_i) · f(x_j).
14 An Example for f(.) and K(.,.). Suppose f(.) is given as follows: f([x1, x2]) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2). An inner product in the feature space is f(x) · f(y) = (1 + x1·y1 + x2·y2)². So, if we define the kernel function as K(x, y) = (1 + x1·y1 + x2·y2)², there is no need to carry out f(.) explicitly. This use of a kernel function to avoid carrying out f(.) explicitly is known as the kernel trick.
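A short numerical check of this identity (a sketch, assuming 2-dimensional real inputs): the explicit 6-D map and the closed-form kernel give the same value.

```python
import numpy as np

# phi(x) is the explicit degree-2 polynomial feature map; K is the kernel.
# The kernel trick rests on phi(x).phi(y) == (x.y + 1)**2 holding exactly.
def phi(x):
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, y):
    return (np.dot(x, y) + 1) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
print(np.allclose(phi(x) @ phi(y), K(x, y)))   # True
```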
15 Kernels. Given a mapping φ(x), a kernel is represented as the inner product K(x, y) = φ(x) · φ(y). A kernel must satisfy Mercer's condition: for every g(x) such that ∫ g(x)² dx is finite, ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0. This is analogous to positive-semidefinite matrices M, for which z^T M z ≥ 0 for every z ≠ 0.
16 Modification Due to Kernel Function. Change all inner products to kernel functions. For training, the original dual maximizes Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j); with a kernel function, x_i · x_j is replaced by K(x_i, x_j). The constraints are unchanged: C ≥ α_i ≥ 0 and Σ_i α_i y_i = 0.
17 Modification Due to Kernel Function. For testing, the new data z is classified as class 1 if f(z) > 0, and as class 2 if f(z) < 0. Original: f(z) = w · z + b = Σ_i α_i y_i (x_i · z) + b; with a kernel function: f(z) = Σ_i α_i y_i K(x_i, z) + b, where the sum runs over the support vectors.
18 More on Kernel Functions. Since the training of an SVM only requires the values of K(x_i, x_j), there is no restriction on the form of x_i and x_j: x_i can be a sequence or a tree, instead of a feature vector. K(x_i, x_j) is just a similarity measure comparing x_i and x_j. For a test object z, the discriminant function essentially is a weighted sum of the similarities between z and a preselected set of objects (the support vectors).
19 Example. Suppose we have 5 one-dimensional data points: x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y1 = 1, y2 = 1, y3 = -1, y4 = -1, y5 = 1.
20 Example (plot of the points on the line: class 1 at 1, 2 and 6; class 2 at 4 and 5).
21 Example. We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)². C is set to 100. We first find the α_i (i = 1, ..., 5) by maximizing Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 0.
22 Example. By using a QP solver, we get α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833. Note that the constraints are indeed satisfied. The support vectors are {x2 = 2, x4 = 5, x5 = 6}. The discriminant function is f(z) = Σ_i α_i y_i K(x_i, z) + b = 0.6667 z² − 5.333 z + b. b is recovered by solving f(2) = 1, or by f(5) = -1, or by f(6) = 1; all three give b = 9, so f(z) = 0.6667 z² − 5.333 z + 9.
23 Example. Value of the discriminant function f(z): f(z) > 0 for class 1 and f(z) < 0 for class 2 (plot of f(z) over the input line, crossing zero between the two classes).
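A sketch that reproduces this worked example numerically; the α values are copied from the slide (as given by a QP solver) rather than recomputed, and the small deviations in the printout come from their rounding:

```python
import numpy as np

# Plug the slide's alphas into the kernelized discriminant
# f(z) = sum_i alpha_i y_i K(x_i, z) + b and check that b is close to 9,
# that f is close to +/-1 on the support vectors, and that sign(f) recovers
# the two classes.
x = np.array([1, 2, 4, 5, 6], dtype=float)
y = np.array([1, 1, -1, -1, 1], dtype=float)
alpha = np.array([0, 2.5, 0, 7.333, 4.833])

K = lambda u, v: (u * v + 1) ** 2                # polynomial kernel, degree 2

def f(z, b=0.0):
    return np.sum(alpha * y * K(x, z)) + b

b = 1 - f(2.0)                                    # from f(x2) = y2 = 1
print(round(b, 2))                                # ~9
print([round(f(z, b), 2) for z in (2.0, 5.0, 6.0)])   # close to [1, -1, 1]
print([int(np.sign(f(z, b))) for z in x])         # [1, 1, -1, -1, 1]
```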
24 Kernel Functions. In practical use of SVM, the user specifies the kernel function; the transformation f(.) is not explicitly stated. Given a kernel function K(x_i, x_j), the transformation f(.) is given by its eigenfunctions (a concept in functional analysis). Eigenfunctions can be difficult to construct explicitly; this is why people only specify the kernel function without worrying about the exact transformation. Another view: the kernel function, being a scalar product, is really a similarity measure between the objects.
25 A kernel is associated to a transformation. Given a kernel, in principle the transformation of the feature space that originates it can be recovered. K(x, y) = (xy + 1)² = x²y² + 2xy + 1. If x and y are numbers, it corresponds to the transformation f(x) = (x², √2·x, 1). What if x and y are 2-dimensional vectors?
26 A kernel is associated to a transformation. For x = (x1, x2) and y = (y1, y2): K(x, y) = (x · y + 1)² = x1²y1² + x2²y2² + 2·x1·x2·y1·y2 + 2·x1·y1 + 2·x2·y2 + 1 = f(x) · f(y), with f(x) = (x1², x2², √2·x1·x2, √2·x1, √2·x2, 1)^T.
27 XOR. Simple example (XOR problem). Maximize L(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K(x_i, x_j). Input vectors and labels: [-1,-1] → -1, [-1,+1] → +1, [+1,-1] → +1, [+1,+1] → -1. Kernel: K(x_i, x_j) = (x_i · x_j + 1)². Kernel matrix (rows and columns ordered as above): 9 on the diagonal, 1 off the diagonal.
28 XOR. Expanding L(α) with this kernel matrix and setting the derivatives ∂L/∂α_i = 0 gives a linear system (for example 9α1 − α2 − α3 + α4 = 1, and its three symmetric counterparts), whose solution is α1 = α2 = α3 = α4 = 1/8. The four input vectors are all support vectors. w = Σ_i α_i y_i φ(x_i) = [0, 0, -1/√2, 0, 0, 0]^T.
29 XOR. With the explicit feature map f(x) = (1, x1², √2·x1·x2, x2², √2·x1, √2·x2)^T and w = Σ_i α_i y_i f(x_i) = [0, 0, -1/√2, 0, 0, 0]^T, the decision function is w · f(x) = -x1·x2. Input vectors, labels and outputs: [-1,-1] → -1, [-1,+1] → +1, [+1,-1] → +1, [+1,+1] → -1: the XOR problem is solved.
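A small sketch of this XOR kernel example: it builds the 4×4 kernel matrix, plugs in the α_i = 1/8 reported above, and checks that the resulting weight vector implements -x1·x2.

```python
import numpy as np

# Four XOR inputs with +/-1 coordinates, degree-2 polynomial kernel.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

K = (X @ X.T + 1) ** 2
print(K)                                   # 9 on the diagonal, 1 elsewhere

def phi(x):                                # explicit 6-D feature map of (x.y + 1)^2
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

alpha = np.full(4, 1 / 8)                               # value reported on the slides
w = sum(a * t * phi(x) for a, t, x in zip(alpha, y, X))
print(np.round(w, 3))                                   # only the x1*x2 component is nonzero
print([round(float(w @ phi(x)), 3) for x in X])         # [-1, 1, 1, -1] = -x1*x2
```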
30 Examples of Kernel Functions. Polynomial kernel of degree d: K(x, y) = (x · y)^d. Polynomial kernel up to degree d: K(x, y) = (x · y + 1)^d. Radial basis function kernel with width s: K(x, y) = exp(-||x − y||² / (2s²)). Sigmoid with parameters k and θ: K(x, y) = tanh(k x · y + θ); it does not satisfy the Mercer condition for all k and θ.
31 Polynomial kernel (figure). Bishop C, Pattern Recognition and Machine Learning, Springer.
33 Examples of Kernel Functions. Radial basis function (or Gaussian) kernel with width s: K(x, y) = exp(-||x − y||² / (2s²)) = exp(-||x||² / (2s²)) · exp(x · y / s²) · exp(-||y||² / (2s²)).
35 Examples of Kernel Functions. With 1-dimensional vectors: K(x, y) = exp(-(x − y)² / (2s²)) = exp(-x² / (2s²)) · exp(xy / s²) · exp(-y² / (2s²)). It corresponds to the scalar product in the infinite-dimensional feature space f(x) = exp(-x² / (2s²)) · (1, x/s, x²/(√(2!) s²), x³/(√(3!) s³), ..., xⁿ/(√(n!) sⁿ), ...)^T. For vectors in m dimensions the feature space is more complicated.
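A sketch for 1-D inputs (width s = 1, chosen for illustration) showing that a truncated version of this infinite-dimensional expansion already reproduces the Gaussian kernel to high accuracy:

```python
import numpy as np
from math import factorial

# phi_n(x) = exp(-x^2/(2 s^2)) * x^n / (sqrt(n!) * s^n); the full series gives
# exactly the RBF kernel, and 20 terms are already enough numerically.
s = 1.0

def rbf(x, y):
    return np.exp(-(x - y) ** 2 / (2 * s ** 2))

def phi(x, n_max=20):
    n = np.arange(n_max)
    return np.exp(-x ** 2 / (2 * s ** 2)) * x ** n / (
        np.sqrt([factorial(k) for k in n]) * s ** n)

x, y = 0.7, -0.4
print(rbf(x, y), phi(x) @ phi(y))   # nearly identical values
```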
36 Without slack variables (figure). Bishop C, Pattern Recognition and Machine Learning, Springer.
37 With slack variables (figure). Bishop C, Pattern Recognition and Machine Learning, Springer.
38 Gaussian RBF kernel (figure). Bishop C, Pattern Recognition and Machine Learning, Springer.
39 Building new kernels. If k1(x, y) and k2(x, y) are two valid kernels, then the following kernels are valid. Linear combination: k(x, y) = c1 k1(x, y) + c2 k2(x, y), with c1, c2 ≥ 0. Exponential: k(x, y) = exp(k1(x, y)). Product: k(x, y) = k1(x, y) · k2(x, y). Polynomial transformation: k(x, y) = Q(k1(x, y)) (Q: polynomial with non-negative coefficients). Function product: k(x, y) = f(x) k1(x, y) f(y) (f: any function).
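A sketch that exercises these closure rules on random points and checks, via the smallest eigenvalue of each Gram matrix, that the combined kernels stay (numerically) positive semidefinite; the base kernels and constants are illustrative choices:

```python
import numpy as np

# Start from an RBF kernel and a polynomial kernel, combine them with the
# closure rules above, and verify that each Gram matrix has no (significantly)
# negative eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

def gram(k):
    return np.array([[k(a, b) for b in X] for a in X])

k1 = lambda a, b: np.exp(-np.sum((a - b) ** 2))          # RBF
k2 = lambda a, b: (a @ b + 1) ** 2                        # polynomial

combos = {
    "2*k1 + 3*k2": lambda a, b: 2 * k1(a, b) + 3 * k2(a, b),   # linear combination
    "k1 * k2":     lambda a, b: k1(a, b) * k2(a, b),            # product
    "exp(k1)":     lambda a, b: np.exp(k1(a, b)),               # exponential
    "Q(k1)":       lambda a, b: 1 + 2 * k1(a, b) + k1(a, b) ** 2,  # polynomial, coeffs >= 0
}
for name, k in combos.items():
    eigmin = np.linalg.eigvalsh(gram(k)).min()
    print(name, "min eigenvalue:", round(float(eigmin), 6))   # all >= ~0
```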
40 Choosing the Kernel Function. Probably the most tricky part of using SVM. The kernel function is important because it creates the kernel matrix, which summarizes all the data. Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...). There is even research to estimate the kernel matrix from available information. In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try. Note that SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen by the SVM.
41 Kernels can be defined also for structures other than vectors. Computational biology often deals with structures different from vectors: sequences (DNA, RNA, proteins), trees (phylogenetic relationships), graphs (interaction networks), 3-D structures (proteins). Is it possible to build kernels for these structures? Either transform the data into a feature space made of n-dimensional real vectors and then compute the scalar product, or write a kernel without writing the feature space explicitly (but... what is a kernel?).
44 Defining kernels without defining the feature transformation. What does a kernel represent? Distance in the feature space.
45 Defining kernels without defining the feature transformation. What does a kernel represent? Distance in the feature space. A kernel is a SIMILARITY measure; moreover, it has to fulfill a «positivity» condition.
48 Spectral kernel for sequences. Given a DNA sequence x we can count the number of bases (4-D feature space): f1(x) = (n_A, n_C, n_G, n_T). Or the number of dimers (16-D space): f2(x) = (n_AA, n_AC, n_AG, n_AT, n_CA, n_CC, n_CG, n_CT, ...). Or l-mers (4^l-dimensional space). The spectral kernel is k_l(x, y) = f_l(x) · f_l(y).
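A minimal sketch of this l-mer (spectrum) kernel; the example sequences and helper names are illustrative, not from the slides:

```python
from itertools import product

# Map each DNA sequence to its vector of l-mer counts (a 4^l-dimensional
# feature space) and take the scalar product of the two count vectors.
def lmer_counts(seq, l):
    counts = {''.join(p): 0 for p in product("ACGT", repeat=l)}
    for i in range(len(seq) - l + 1):
        counts[seq[i:i + l]] += 1
    return counts

def spectrum_kernel(s1, s2, l=2):
    c1, c2 = lmer_counts(s1, l), lmer_counts(s2, l)
    return sum(c1[w] * c2[w] for w in c1)

print(spectrum_kernel("ACGTACGT", "ACGTTT", l=2))
print(spectrum_kernel("ACGTACGT", "ACGTTT", l=3))
```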
53 l is usually lower than ...
59 Kernel out of generative models. Given a generative model associating a probability p(x|θ) to a given input x, we can define K(x, y) = p(x|θ) p(y|θ). Fisher Kernel: with the Fisher score g(θ, x) = ∇_θ ln p(x|θ) and the Fisher information matrix F = E_x[g(θ, x) g(θ, x)^T] ≈ (1/N) Σ_i g(θ, x_i) g(θ, x_i)^T, the kernel is K(x, y) = g(θ, x)^T F^{-1} g(θ, y).
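A toy sketch of the Fisher kernel for the simplest possible generative model, a 1-D Gaussian with unknown mean (this specific model and its parameters are assumptions for illustration, not taken from the slides):

```python
import numpy as np

# For p(x|mu) = N(mu, sigma^2) with sigma fixed, the Fisher score is
# g(mu, x) = d/dmu ln p(x|mu) = (x - mu)/sigma^2 and the Fisher information is
# F = 1/sigma^2, so K(x, y) = g(x) * F^{-1} * g(y) = (x - mu)(y - mu)/sigma^2.
mu, sigma = 0.5, 2.0

def score(x):
    return (x - mu) / sigma ** 2

F = 1.0 / sigma ** 2

def fisher_kernel(x, y):
    return score(x) * (1.0 / F) * score(y)

print(fisher_kernel(1.2, -0.3))   # (1.2-0.5)*(-0.3-0.5)/sigma^2 = -0.14
```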
60 Other Aspects of SVM. How to use SVM for multi-class classification? One can change the QP formulation to become multi-class. More often, multiple binary classifiers are combined: one can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers intelligently.
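A sketch of the two combination strategies using scikit-learn (a library choice not mentioned in the slides): SVC combines pairwise (one-vs-one) classifiers internally, while OneVsRestClassifier trains one one-versus-all SVM per class.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class toy dataset; both strategies wrap the same binary SVM.
X, y = load_iris(return_X_y=True)

one_vs_one = SVC(kernel="rbf", C=1.0).fit(X, y)                     # pairwise voting
one_vs_rest = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)  # one-versus-all

print(one_vs_one.predict(X[:5]), one_vs_rest.predict(X[:5]))
```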
61 Other Aspects of SVM. How to interpret the SVM discriminant function value as a probability? By performing logistic regression on the SVM output of a set of data (validation set) that is not used for training. Some SVM software (like libsvm) have these features built-in.
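A sketch of this calibration idea with scikit-learn (the synthetic dataset and split sizes are illustrative): fit a logistic regression on the SVM decision values of a held-out validation set, Platt-style.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Train the SVM on one half of the data, then map its decision-function values
# on the other half to probabilities with a logistic regression.
# SVC(probability=True) performs a similar calibration internally.
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)
scores_val = svm.decision_function(X_val).reshape(-1, 1)
calibrator = LogisticRegression().fit(scores_val, y_val)

new_scores = svm.decision_function(X_val[:5]).reshape(-1, 1)
print(np.round(calibrator.predict_proba(new_scores)[:, 1], 2))   # P(class 1)
```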
62 Software. A list of SVM implementations can be found at ... Some implementations (such as LIBSVM) can handle multi-class classification. SVMlight is among the earliest implementations of SVM. Several Matlab toolboxes for SVM are also available.
63 Summary: Steps for Classification. Prepare the pattern matrix. Select the kernel function to use. Select the parameters of the kernel function and the value of C; you can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters. Execute the training algorithm and obtain the α_i. Unseen data can be classified using the α_i and the support vectors.
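A sketch of these steps with scikit-learn (dataset and parameter grid are illustrative choices); here the kernel parameters and C are selected by cross-validation rather than a single held-out validation set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Pattern matrix -> kernel choice (RBF) -> parameter selection -> training ->
# classification of unseen data with the learned support vectors.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]},
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
print("number of support vectors:", search.best_estimator_.n_support_.sum())
```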
64 Strengths and Weaknesses of SVM. Strengths: training is relatively easy (no local optima, unlike in neural networks); it scales relatively well to high-dimensional data; the tradeoff between classifier complexity and error can be controlled explicitly; non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors. Weaknesses: need to choose a good kernel function.
65 Other Types of Kernel Methods. A lesson learnt in SVM: a linear algorithm in the feature space is equivalent to a nonlinear algorithm in the input space. Standard linear algorithms can be generalized to their nonlinear versions by going to the feature space. Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, and 1-class SVM are some examples.