KERNEL MODELS AND SUPPORT VECTOR MACHINES

Kazushi Ikeda
Nara Institute of Science and Technology, Ikoma, Nara, Japan

Keywords: kernel, margin, statistical learning theory

Contents
1. Introduction
2. Kernel Function and Feature Space
3. Representer Theorem
4. Example
5. Pre-Image Problem
6. Properties of Kernel Methods
7. Statistical Learning Theory
8. Support Vector Machines
9. Variations of SVMs
10. Conclusions
Glossary
Bibliography
Biographical Sketches

Summary

Kernel models are a method for introducing nonlinear preprocessing into linear models. A kernel function determines a feature space called a reproducing kernel Hilbert space. The solution of an optimization problem with kernel models is expressed as a linear combination of the given examples thanks to the representer theorem. This means that no explicit feature vectors are required, and hence kernel models are applicable even to sequences or graphs. The first half of this chapter introduces the reader to the fundamentals of kernel models, such as the kernel trick and the representer theorem. The most successful application of kernel models is support vector machines (SVMs). SVMs are derived from statistical learning theory, which does not assume any statistical model beforehand and leads to margin maximization for higher generalization ability. Although an SVM is constructed so that only an upper bound on the generalization error in the worst case is minimized, it performs better than conventional methods on average. Another advantage of SVMs is convexity: an SVM has a unique solution, unlike other hierarchical learning models. The idea of margin maximization has been applied to regression, density estimation and other areas. The second half of this chapter introduces the reader to the derivation of SVMs, their properties, and several variations.

1. Introduction

The simplest model of a neuron is the Perceptron, which calculates the weighted sum of its input signals and outputs +1 when the sum exceeds a threshold and -1 otherwise (Minsky and Papert, 1969).

The model attracted much attention as a binary classifier, or dichotomy, with an adaptive algorithm for a given dataset, called the Perceptron learning rule, because it has good properties such as the convergence theorem, which assures a finite number of iterations for a linearly separable dataset, and the cycling theorem, which proves that the solutions are periodic otherwise.

One way to introduce nonlinear preprocessing to the Perceptron is the multi-layer Perceptron model (MLP), where the threshold function is replaced with a sigmoid function such as the hyperbolic tangent and the preprocessor is autonomously acquired through learning. This model is simple and works very well, but it suffers from many locally optimal solutions and plateaus in a gradient learning method when the cost function is given as the sum of squared errors.

An alternative to the MLP is a kernel model. The preprocessor of a kernel model must be fixed a priori, but it has a large variety of choices. In fact, we need only the inner product of preprocessed data, without describing the data in an explicit form. This enables the feature space to have even infinitely many dimensions, or to treat graphs directly, which leads to a wide range of applications. Nowadays, kernel models are applied not only to classifiers but also to regression problems and other linear signal processing fields.

From the mathematical viewpoint, a kernel model is based on the positive (or nonnegative) definiteness of its kernel. Positive definite kernels appeared in the literature early in the twentieth century, and their general theory was established as the reproducing kernel Hilbert space in the 1950s. However, the idea of kernel models became popular only after it was introduced as a nonlinearity for support vector machines, or SVMs, in the 1990s. An SVM is a linear dichotomy like the Perceptron, but it maximizes the margin, that is, the minimum distance of the examples from the discriminative boundary. The idea of margin maximization is based on Vapnik's statistical learning theory, which minimizes the structural risk in the framework of probably approximately correct (PAC) learning, introduced by Valiant in 1984.

One of the good properties of SVMs is structural risk minimization, or SRM. The prediction error of SVMs is theoretically bounded even in the worst case, and in practice it is often much lower than the bound or than that of MLPs. Another advantage is the uniqueness of the solution: an SVM results in a convex quadratic programming problem, which has a unique solution. Moreover, recent developments in computer science and engineering, such as interior-point methods, make the solvable problem size larger and larger. Thanks to such superiority, SVMs have attracted much attention, and their theoretical properties as well as their variants have been well studied.

2. Kernel Function and Feature Space

A kernel method maps an input vector $x \in X$ to the corresponding feature vector $f(x)$ in a high-dimensional feature space $F$ and processes it linearly, as in the Perceptron, linear regression or principal component analysis (PCA). Let us consider Perceptron learning here as a simple example.

Given the $t$-th example $(x(t), y(t))$, where $x(t) \in \{x_k\}_{k=1}^n$ is the $t$-th input vector and $y(t)$ is the true label for that input vector, the Perceptron learning rule in the feature space $F$ is formulated as

$$w(t) = \begin{cases} w(t-1) + y(t)\, f(x(t)) & \text{if } \operatorname{sgn}\!\left( w(t-1)^\top f(x(t)) \right) \neq y(t), \\ w(t-1) & \text{otherwise}, \end{cases} \qquad (1)$$

where $w(t)$ is the weight vector of the Perceptron at iteration $t$ and $\top$ denotes the transpose. Here, we assume that the weight vector can be expressed as a weighted sum of the inputs, that is,

$$w(t) = \sum_{k=1}^n \alpha_k(t)\, f(x_k). \qquad (2)$$

Then, the algorithm (1) is simply rewritten as

$$\alpha_k(t) = \begin{cases} \alpha_k(t-1) + y(t) & \text{if } y(t) \neq \operatorname{sgn}\!\left( w(t-1)^\top f(x(t)) \right) \text{ and } x(t) = x_k, \\ \alpha_k(t-1) & \text{otherwise}, \end{cases} \qquad (3)$$

for all $k = 1, \dots, n$. This is called the kernel Perceptron. Here, the inner product of $w(t)$ and $f(x)$ for an arbitrary $x \in X$ is calculated as

$$w(t)^\top f(x) = \sum_{k=1}^n \alpha_k(t)\, f(x_k)^\top f(x) = \sum_{k=1}^n \alpha_k(t)\, K(x_k, x), \qquad (4)$$

where $K(x_k, x) = f(x_k)^\top f(x)$ is called a kernel function. Hence, although we do not know the feature vector $f(x)$ itself, we can calculate the inner product. This allows the feature space to be very high-dimensional, or even infinite-dimensional, without any increase in computational complexity; this is called the kernel trick.

One example of a kernel function is the $p$-th order polynomial kernel $K(x, x') = (x^\top x' + 1)^p$. When $x \in \mathbb{R}^2$ and $p = 2$, a possible feature vector is

$$f(x) = \left( x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ 1 \right), \qquad (5)$$

and hence $w(t)^\top f(x)$ can express any polynomial function of order two or less.
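To make Eqs. (1)-(5) concrete, here is a minimal Python sketch of the kernel Perceptron (ours, not part of the original chapter). The function names and the XOR dataset are our own choices; XOR is linearly inseparable in the input space but separable in the feature space of the second-order polynomial kernel (5), so the convergence theorem applies in $F$.

```python
import numpy as np

def poly_kernel(x1, x2, p=2):
    """p-th order polynomial kernel K(x, x') = (x^T x' + 1)^p."""
    return (np.dot(x1, x2) + 1.0) ** p

def kernel_perceptron(X, y, kernel=poly_kernel, epochs=20):
    """Kernel Perceptron: one coefficient alpha_k per example, Eq. (3)."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        errors = 0
        for t in range(n):
            # w(t-1)^T f(x(t)) computed via the kernel trick, Eq. (4)
            s = sum(alpha[k] * kernel(X[k], X[t]) for k in range(n))
            if np.sign(s) != y[t]:   # misclassified: update its coefficient
                alpha[t] += y[t]
                errors += 1
        if errors == 0:              # converged (the dataset is separable in F)
            break
    return alpha

def predict(alpha, X, x_new, kernel=poly_kernel):
    """Classify x_new by the sign of sum_k alpha_k K(x_k, x_new)."""
    return np.sign(sum(a * kernel(xk, x_new) for a, xk in zip(alpha, X)))

# XOR: linearly inseparable in input space, separable under the kernel (5)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
alpha = kernel_perceptron(X, y)
print([predict(alpha, X, x) for x in X])   # [-1.0, 1.0, 1.0, -1.0]
```

Note that neither training nor prediction ever forms the feature vector $f(x)$ explicitly; only kernel evaluations are needed.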

Another popular kernel function is the Gaussian kernel $K(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$, whose feature vector is essentially the Fourier transform, as shown below. Suppose that the feature of $x$ indexed by $\omega \in \mathbb{R}$ is $\exp(i \omega x)$, where $\omega$ and $x$ are one-dimensional for simplicity. Then, the inner product of $\exp(i \omega x)$ and $\exp(i \omega x')$ is calculated as

$$\begin{aligned} K(x, x') &= \frac{1}{\sqrt{2\pi}} \int \exp(i\omega x)\, \overline{\exp(i\omega x')}\, \exp\!\left( -\frac{\omega^2}{2} \right) d\omega \\ &= \frac{1}{\sqrt{2\pi}} \int \exp\!\left( -\frac{(\omega - i(x - x'))^2}{2} \right) \exp\!\left( -\frac{(x - x')^2}{2} \right) d\omega \\ &= \exp\!\left( -\frac{(x - x')^2}{2} \right), \end{aligned} \qquad (6)$$

where $\frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{\omega^2}{2} \right)$ is a measure on the space of $\omega$.

An advantage of kernel models is that we need not consider the feature space of a kernel function explicitly. This property makes it possible to apply kernel models to spaces of statistical models, sequences, and even graphs. In fact, Mercer's theorem assures the existence of an inner product space $F$ for any positive semi-definite (or non-negative definite) kernel function $K(\cdot,\cdot)$, where a symmetric continuous function $K(\cdot,\cdot)$ is said to be positive semi-definite if and only if it satisfies

$$\sum_{k=1}^n \sum_{m=1}^n c_k c_m K(x_k, x_m) \geq 0 \qquad (7)$$

for any vectors $\{x_k\}_{k=1}^n$, any real numbers $\{c_k\}_{k=1}^n$ and any positive integer $n$. The Gaussian kernel, for example, has a feature space thanks to this property, although the space is difficult to describe explicitly.

From the practical viewpoint, however, proving that a kernel function is positive semi-definite is more difficult than constructing a feature space explicitly. Hence, popular kernel functions are simple ones, such as the polynomial kernel and the Gaussian kernel, or their combinations. In fact, when two functions $K_1$ and $K_2$ are positive semi-definite, so are the following functions:

$$c_1 K_1 + c_2 K_2 \ \ \text{for } c_1, c_2 \geq 0, \qquad K_1 K_2, \qquad \exp(K_1), \qquad g(x)\, K_1(x, x')\, g(x'). \qquad (8)$$
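Condition (7), restricted to a finite sample, says that the Gram matrix $K_{km} = K(x_k, x_m)$ must be positive semi-definite, that is, all of its eigenvalues must be non-negative. The following sketch (our illustration, not from the original text; the random data and the choice $g(x) = \sin(x_1)$ are arbitrary) checks this numerically for the four combinations listed in (8):

```python
import numpy as np

def gauss(x1, x2, sigma=1.0):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def poly(x1, x2, p=3):
    """Polynomial kernel K(x, x') = (x^T x' + 1)^p."""
    return (np.dot(x1, x2) + 1.0) ** p

def gram(kernel, X):
    """Gram matrix of the sample; Eq. (7) requires it to be PSD."""
    return np.array([[kernel(xk, xm) for xm in X] for xk in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
K1, K2 = gram(gauss, X), gram(poly, X)
g = np.sin(X[:, 0])                       # an arbitrary function g(x)

combinations = {                           # the four cases of Eq. (8)
    "c1*K1 + c2*K2": 0.5 * K1 + 2.0 * K2,
    "K1 * K2":       K1 * K2,              # elementwise product
    "exp(K1)":       np.exp(K1),           # elementwise exponential
    "g K1 g":        np.outer(g, g) * K1,
}
for name, K in combinations.items():
    lam_min = np.linalg.eigvalsh(K).min()
    print(f"{name}: min eigenvalue = {lam_min:.2e}")  # >= 0 up to rounding
```

A non-negative spectrum on a sample is only a necessary condition, but it is a cheap and useful sanity check when composing kernels in practice.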

The polynomial kernel as well as the Gaussian kernel can be proven positive semi-definite using (8). What kernel should we use? The answer strongly depends on the application under consideration, but an alternative approach is kernel learning, that is, the automatic selection of an optimal kernel from data. One typical example is to optimize the coefficients $\{c_i\}_{i=1}^n$ of a weighted sum of kernels, $\sum_{i=1}^n c_i K_i$, so that the performance of the learning machine becomes best.

3. Representer Theorem

In the previous section, we assumed that the weight vector could be expressed as a weighted sum of the inputs, that is,

$$w(x) = \sum_{k=1}^n \alpha_k K(x, x_k). \qquad (9)$$

A general theory justifying this assumption is the representer theorem, which says that the optimization problem below has a solution $w$ of the form (9):

$$\min_{w \in H,\ \{c_i\} \subset \mathbb{R}} \ \sum_{k=1}^n L\!\left( x_k, y_k, w(x_k) + \sum_i c_i h_i(x_k) \right) + \Psi\!\left( \|w\|_H \right), \qquad (10)$$

where $\{(x_k, y_k)\}_{k=1}^n$ is a dataset, $K$ is a kernel function of $x$, $H$ is the reproducing kernel Hilbert space of $K$, $\{h_i\}$ is a set of basis functions of $x$, and $\Psi(\cdot)$ is a monotonically increasing function (a regularization term). Intuitively speaking, any component of $w$ in the complement space $H_0^\perp$ of $H_0 = \operatorname{span}\!\left( \{K(\cdot, x_k)\}_{k=1}^n \right)$ does not change the cost $L$ but only increases the regularization term $\Psi$.

One simple example of $L$ is

$$\min_{w \in H} \ \sum_{k=1}^n \left( y_k - w(x_k) \right)^2 + \lambda \|w\|_H^2, \qquad (11)$$

and another example is the support vector machine, shown later. Thanks to the kernel trick, the computational complexity of kernel methods does not depend on the dimension of the feature space but only on the number of examples.
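For the regularized least-squares problem (11), the representer theorem can be made concrete: substituting $w(x) = \sum_k \alpha_k K(x, x_k)$ turns (11) into a finite-dimensional quadratic problem whose minimizer solves $(K + \lambda I)\alpha = y$, where $K$ is the Gram matrix. The sketch below (our illustration; the noisy-sine data and the function names are hypothetical) implements this, a method commonly known as kernel ridge regression:

```python
import numpy as np

def gauss(x1, x2, sigma=0.5):
    """One-dimensional Gaussian kernel."""
    return np.exp(-(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def fit_kernel_ridge(x, y, lam=0.1, kernel=gauss):
    """Solve (11): by the representer theorem w(x) = sum_k alpha_k K(x, x_k),
    and the optimal coefficients satisfy (K + lam I) alpha = y."""
    K = np.array([[kernel(xk, xm) for xm in x] for xk in x])
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict(alpha, x_train, x_new, kernel=gauss):
    """Evaluate w(x_new) = sum_k alpha_k K(x_k, x_new)."""
    return sum(a * kernel(xk, x_new) for a, xk in zip(alpha, x_train))

# Noisy sine data (hypothetical); note that fitting involves only the
# n x n Gram matrix, never an explicit feature vector
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 30)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)
alpha = fit_kernel_ridge(x, y)
print(predict(alpha, x, np.pi / 2.0))   # close to sin(pi/2) = 1
```

As stated above, the cost of the fit is governed by the number of examples $n$ (an $n \times n$ linear system), not by the dimension of the feature space, which is infinite for the Gaussian kernel.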

Glossary

Multi-layer Perceptron (MLP): A Perceptron with hidden layers in which the threshold function is replaced with a sigmoid function. It can approximate any continuous function with arbitrary precision if it has enough hidden units.
PAC learning: Probably Approximately Correct learning. It considers the condition under which the probability that the error of a learning algorithm exceeds a specific value is bounded by a specific value.
Perceptron: A linear dichotomy. It outputs the sign of the inner product between a weight vector and an input vector.
Principal Component Analysis (PCA): A dimensionality reduction method. It keeps the space spanned by the eigenvectors with large eigenvalues (principal components) and removes the components induced by noise.
Representer Theorem: A theorem that assures that the solution of an optimization problem is described as a weighted sum of the data. This is the theoretical basis of kernel models.
Support Vector Machine (SVM): The linear dichotomy that maximizes the margin to the given examples. Since such a dichotomy does not exist for a linearly inseparable dataset, there are variations that cope with this difficulty.
Support Vector Regression (SVR): The linear regression that employs the ε-insensitive error function. It is robust against outliers thanks to the L1-norm regularizer and has a small number of support vectors thanks to the insensitivity.

Bibliography

Minsky M.L., Papert S.A. (1969). Perceptrons, MIT Press, Cambridge, MA. [This is a comprehensive textbook on Perceptrons].
Rumelhart D.E., McClelland J.L. (1986). Parallel Distributed Processing, MIT Press, Cambridge, MA. [This presents a theory of multi-layer Perceptrons and their applications].
Fukumizu K. (2010). Introduction to Kernel Methods, Asakura-Shoten, Tokyo. [This presents the theoretical background of kernel methods for beginners. Note that it is written in Japanese].
Watanabe S. (2001). Algebraic Analysis for Nonidentifiable Learning Machines. Neural Computation, 13(4), 899-933. [This is an epoch-making paper introducing algebraic geometry to machine learning theory].
Vapnik V.N. (1995). The Nature of Statistical Learning Theory, Springer, Heidelberg. [This is the first book introducing support vector machines].
Valiant L.G. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142. [This paper presents the framework of PAC learning].
Kwok J.T., Tsang I.W. (2004). The Pre-Image Problem in Kernel Methods, IEEE Transactions on Neural Networks, 15(6), 1517-1525. [This gives a concrete method to cope with the pre-image problem of kernel methods].
Schoelkopf B., Smola A.J., Williamson R.C., Bartlett P.L. (2000). New Support Vector Algorithms, Neural Computation, 12, 1207-1245. [This paper introduces the framework of the ν-SVMs].
Bennett K.P., Bredensteiner E.J. (2000). Duality and Geometry in SVM Classifiers, Proc. ICML, 57-64. [This is the first paper to show another view of the ν-SVMs, namely the convex hulls of the data].
Ikeda K. (2004). An Asymptotic Statistical Theory of Polynomial Kernel Methods. Neural Computation, 16(8), 1705-1719. [This paper resolves the contradiction among theoretical results on SVMs].

Ikeda K., Aonishi T. (2005). An Asymptotic Statistical Analysis of Support Vector Machines with Soft Margins, Neural Computation, 18, 251-259. [This paper gives a theoretical analysis of the effect of soft margins].
Schoelkopf B., Platt J.C., Shawe-Taylor J., Smola A.J. (2001). Estimating the Support of a High-Dimensional Distribution, Neural Computation, 13, 1443-1471. [This paper applies the idea of support vector machines to density estimation by formulating it as a one-class SVM].

Biographical Sketch

Kazushi Ikeda received the B.E., M.E. and Ph.D. degrees in Mathematical Engineering and Information Physics from the University of Tokyo, Tokyo, Japan, in 1989, 1991 and 1994, respectively. He then joined the Department of Electrical and Computer Engineering, Kanazawa University, Kanazawa, Japan, as an Assistant Professor, and moved to the Department of Systems Science, Kyoto University, Kyoto, Japan, as an Associate Professor, in 1998. Since 2008, he has been a Professor in the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan.

His research interests include machine learning theory, such as support vector machines and information geometry, and its applications to adaptive systems and brain informatics. He is currently the Editor-in-Chief of the Journal of the Japanese Neural Network Society (JNNS), an Action Editor of Neural Networks (Elsevier), and an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems and IEICE Transactions on Information and Systems. He has served as a member of the Board of Governors of JNNS.