Kernel methods: Building classifiers in high dimensional space

Pierre Dupont
Pierre.Dupont@uclouvain.be

- Classifiers define decision surfaces in some feature space where the data is either initially represented or mapped to
- Representing or mapping the data in a high dimensional space may ease the separability between the classes (see Cover's theorem) but...
  - discrimination is not easier if the mapped points naturally lie on a lower dimensional manifold
  - the higher the dimension of the feature space, the more parameters may need to be estimated

Outline

- The curse of dimensionality
- The 3 core ideas of kernel methods, and specifically SVMs
- How to avoid the curse of dimensionality? Some results from Vapnik's Statistical Learning Theory
- Why SVMs are interesting techniques, but not the panacea
- Kernels and regularized risk

The curse of dimensionality

- If the number of parameters is too large with respect to the number of training samples, there is a risk of over-fitting the training data
- Over-fitting implies poor generalization, i.e. failure to correctly classify new data (a minimal numerical sketch of this effect follows below)
- Additionally, sensitivity to noise and computational complexity may increase with the dimension of the feature space
- This problem is known as the curse of dimensionality
- However...
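As an illustration of the over-fitting risk just described, here is a minimal sketch, assuming NumPy is available; the data, the target function and the polynomial degrees are illustrative choices, not part of the original slides. A model with as many parameters as training samples fits the training data (almost) perfectly while typically generalizing much worse:

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=8)                    # few training samples
y_train = x_train + 0.1 * rng.normal(size=8)            # noisy linear target
x_test = rng.uniform(-1, 1, size=200)                   # fresh test samples
y_test = x_test

for degree in (1, 7):                                   # degree 7: 8 parameters for 8 samples
    coef = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    # degree 7 interpolates the training set (train MSE ~ 0) but typically
    # oscillates between the samples, inflating the test MSE
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")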
The 3 core ideas of kernel methods

1. The so-called kernel trick allows one to define an implicit mapping to a higher dimensional feature space, with two interesting consequences:
   - there is no need to compute anything in the higher dimensional space
   - the number of parameters to be estimated becomes independent of the dimension of the feature space
2. The capacity of the class of discriminant functions considered matters more than the dimension of the space they lie in
   - Capacity is a measure of the complexity of a class of functions
   - The best known capacity concept is the Vapnik-Chervonenkis (VC) dimension
3. Controlling the capacity of linear discriminants can be done by maximizing the margin of the hyperplane with respect to the training samples

Kernels + capacity control through margin maximization ⇒ the blessing of dimensionality

Shattering

- A set of samples S is shattered by a function class F if and only if, for every possible +/− labeling of the samples in S, there exists some function in F which perfectly classifies the samples
- For instance, if the function class F is the set of hyperplanes in ℝ² (= lines) and we consider 3 samples, there are 2³ = 8 possible labelings; for any such labeling there is a hyperplane classifying the samples correctly (a brute-force check is sketched at the end of this section)

The Vapnik-Chervonenkis dimension

- The Vapnik-Chervonenkis dimension (VC-dim) of a function class F defined over an instance space X is the size of the largest subset of X shattered by F; if arbitrarily large finite sets of X can be shattered by F, then VC(F) = ∞
- We have seen a set of 3 points in ℝ² which can be shattered by hyperplanes, even though there are sets of 3 points which cannot be shattered (e.g. 3 collinear points with alternating labels)
- No set of 4 points can be shattered by a hyperplane in ℝ², no matter how the points are placed ⇒ the VC-dim of hyperplanes in ℝ² is 3
- More generally, the VC-dim of hyperplanes in ℝ^d is d + 1

Empirical Risk

- Let g(x) be a discriminant for a binary classification problem Ω = {ω₁, ω₂}
- The decision function f : X → {−1, 1} is defined as f(x) = sign(g(x))
- Let x₁, ..., xₙ be a training set of n samples with associated labels z₁, ..., zₙ; by definition zᵢ = 1 if xᵢ is labeled ω₁, and zᵢ = −1 otherwise
- The zero-one loss function ½ |f(x) − z| defines the correctness of the classification of any sample x: the loss is 0 if x is correctly classified, and 1 otherwise
- The average training error, or empirical risk, is defined as R_emp[f] = (1/n) Σᵢ₌₁ⁿ ½ |f(xᵢ) − zᵢ|
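The shattering claim above can be verified by brute force. The following is a minimal sketch, assuming NumPy and SciPy are available; the helper separable and the chosen point coordinates are illustrative. For each labeling, a feasibility linear program searches for a hyperplane w·x + b with z_i (w·x_i + b) ≥ 1 for all i:

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, z):
    """Feasibility LP: does some (w, b) satisfy z_i (w.x_i + b) >= 1 for all i?"""
    n, d = X.shape
    # rewrite as -z_i * (x_i, 1) . (w, b) <= -1, a zero-objective LP
    A_ub = -z[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])      # 3 non-collinear points
print(all(separable(X3, np.array(z))                     # True: all 2^3 = 8 labelings
          for z in itertools.product([-1, 1], repeat=3)))

X4 = np.vstack([X3, [1.0, 1.0]])                         # add a 4th point
print(all(separable(X4, np.array(z))                     # False: e.g. the XOR labeling
          for z in itertools.product([-1, 1], repeat=4)))

The second check confirms the claim that no set of 4 points in ℝ² is shattered: the XOR labeling of the four corners of a square is not linearly separable.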
Bounding the Risk

- The true risk, or probability of misclassification for any test sample drawn from P(x, z), the (unknown) joint distribution of samples and class labels, is defined as R[f] = ∫ ½ |f(x) − z| dP(x, z)
- Over-fitting occurs when a function f minimizing the empirical risk R_emp[f] does not minimize the true risk
- Fortunately, we can bound the risk if we know the VC-dim h of the function class F to which f belongs
- In particular, if h < n (the number of training samples), then for all functions f ∈ F, independently of the underlying distribution P, with probability at least 1 − δ:

  R[f] ≤ R_emp[f] + √( (h (ln(2n/h) + 1) + ln(4/δ)) / n )

Interpretation of the VC-bound

  R[f] ≤ R_emp[f] + √( (h (ln(2n/h) + 1) + ln(4/δ)) / n )

  where the square-root term is called the capacity or confidence term

- The result holds only with probability (at least) 1 − δ because the test data may be particularly difficult
- When the training set size n → ∞, the capacity term → 0 and R[f] → R_emp[f]
- Considering a function class F with low VC-dim h reduces the capacity term
- However, if the function class is too simple (too low a VC-dim), it will be difficult to minimize R_emp[f]

Practical use of the VC-bound?

- The above bound is not tight because it derives from a worst-case analysis and must hold for any distribution P(x, z)
- In practice, the distribution P(x, z) is unknown but not arbitrary
- The interest of this bound is not so much its practical use; rather, it motivates fitting the data with the simplest possible class of functions (the capacity term is evaluated numerically below)

Structural risk minimization

- Minimizing both R_emp[f] and the capacity term, by choosing the class of functions suitable for the amount of training data, is the core of structural risk minimization
- This property can be seen as another formulation of the classical bias-variance trade-off: there is an optimum to be found
- There is no curse of dimensionality, but there is a curse of capacity
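The capacity term of the bound above is easy to evaluate. A minimal sketch, assuming NumPy; the values of h, n and δ are illustrative. It shows the two behaviors discussed above: the term vanishes as n grows and increases with the VC-dim h:

import numpy as np

def capacity_term(h, n, delta=0.05):
    """Confidence term sqrt((h*(ln(2n/h) + 1) + ln(4/delta)) / n) of the VC bound."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) + np.log(4 / delta)) / n)

for n in (100, 1000, 100000):
    print(f"h=10, n={n}: capacity term = {capacity_term(10, n):.3f}")    # -> 0 as n grows
for h in (10, 100):
    print(f"h={h}, n=1000: capacity term = {capacity_term(h, 1000):.3f}")  # grows with h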
Support Vectors and Maximal Margin Hyperplane

- When the data is linearly separable in some appropriate feature space, the separating hyperplane is not unique
- The maximal margin hyperplane separates the data with the largest margin
- For each separating hyperplane, there is an associated set of support vectors

[Figure: several separating hyperplanes with their margins; the maximal margin hyperplane and its support vectors]

Maximizing the margin is a good idea

- Recall that the VC-dim of hyperplanes in ℝ^d is d + 1 ⇒ the capacity term increases with the dimension of the space
- Fortunately, for hyperplanes with margin ρ it was shown that the VC-dim h is bounded: h ≤ R²/ρ² + 1, where R is the radius of the smallest hypersphere containing the data
- The key advantage of this bound is that it is independent of the dimension d!
- Maximizing the margin is a way to control the curse of capacity while working in very high dimensional spaces
- Maximizing the margin is also a way to increase robustness to noise, since perturbations around the training points do not affect the decision boundary much

Discussion

- The above property defines the VC-dim of canonical hyperplanes relative to a dataset (not of all hyperplanes in ℝ^d)
- The maximal margin ρ needs to be defined a priori (not strictly equivalent to the SVM optimization problem)
- A similar and more general result holds for another capacity concept: the fat-shattering dimension
- Over-fitting is still possible depending on the kernel choice (see later...)

Mercer kernels

- A kernel k is a symmetric function with k(x, x′) = ⟨φ(x), φ(x′)⟩ = k(x′, x), where φ is a mapping from the original input space X to a feature space Y
- Mercer conditions: a symmetric function k : X × X → ℝ is a kernel if, for any finite subset {x₁, ..., xₙ} of X, the Gram matrix K = [k(xᵢ, xⱼ)]ᵢ,ⱼ₌₁..ₙ is positive semi-definite, i.e. has non-negative eigenvalues (a numerical check on a finite sample is sketched below)
- k(x, x′) can be thought of as a similarity measure between x and x′ which generalizes the simple dot product ⟨x, x′⟩
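The Mercer conditions can be probed numerically on a finite sample. A minimal sketch, assuming NumPy; the helper names, the sample and the kernel parameters are illustrative, and the two kernels used (Gaussian RBF and sigmoid) are the ones defined in the next section. The RBF Gram matrix passes the eigenvalue test, while the sigmoid kernel, which is known not to be positive semi-definite in general, typically fails it:

import numpy as np

def gram(k, X):
    """Gram matrix K = [k(x_i, x_j)] over a finite sample X."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])

def is_mercer_on_sample(K, tol=1e-10):
    """Check symmetry and non-negative eigenvalues (positive semi-definiteness)."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))         # Gaussian RBF, sigma = 1
sigmoid = lambda x, y: np.tanh(np.dot(x, y) - 1.0)       # kappa = 1, theta = -1

X = np.random.default_rng(0).normal(size=(30, 3))
print(is_mercer_on_sample(gram(rbf, X)))      # True: RBF is a Mercer kernel
print(is_mercer_on_sample(gram(sigmoid, X)))  # typically False: sigmoid is not PSD in general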
Implicit mapping induced by a kernel

- If k satisfies the Mercer conditions, there exists a mapping φ such that k(x, x′) = ⟨φ(x), φ(x′)⟩
- We can directly specify k rather than φ ⇒ there is an implicit mapping to a new feature space
- Linear kernel: k(x, x′) = ⟨x, x′⟩ (φ maps x to itself)
- Polynomial kernel: k(x, x′) = (⟨x, x′⟩ + c)^b with b ∈ ℕ, c ≥ 0
- Gaussian Radial Basis Function (RBF) kernel: k(x, x′) = exp(−‖x − x′‖² / σ²) with σ > 0
- Sigmoid kernel: k(x, x′) = tanh(κ⟨x, x′⟩ + ϑ) with κ > 0 and ϑ < 0
- The kernel trick: any learning algorithm that uses the data only via dot products can rely on this implicit mapping by replacing ⟨x, x′⟩ by k(x, x′)

Hard margin SVMs

- The SVM estimation problem (i.e. finding a maximal margin hyperplane in the feature space) may be formulated (in its dual form) as

  max_α W(α) = Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ,ⱼ₌₁ⁿ αᵢ αⱼ zᵢ zⱼ k(xᵢ, xⱼ)

  subject to αᵢ ≥ 0 and Σᵢ₌₁ⁿ αᵢ zᵢ = 0

- The number of parameters only depends on the number of training samples n, not on the dimension of the input or the feature space
- The decision function is defined as f(x) = sign( Σ_{i ∈ SV} αᵢ zᵢ k(xᵢ, x) + w₀ ), which only depends on the (so-called support) vectors xᵢ such that αᵢ ≠ 0 (a numerical sketch follows at the end of this section)

SVMs pros

- SVMs are theoretically motivated by Vapnik's statistical learning theory
- estimation is a convex optimization problem (no multiple local minima)
- the primal-dual formulation and the duality gap allow measuring the distance to the optimum
- sparse solution: only the support vectors matter in the decision function
- state of the art results on many different datasets
- the kernel trick allows building classifiers for structured data such as strings, trees, graphs, probability distributions, etc.
- relatively few meta-parameters: C (soft margin formulation), σ (RBF kernel), the kernel itself, ...

SVMs are interesting but not the panacea

- many aspects are not new:
  - the kernel trick is nearly a century old (Mercer 1909) but was used only much later to build a classifier (Boser, Guyon and Vapnik; COLT 92)
  - the Ho and Kashyap algorithm (1965) estimates a hyperplane with a large margin (with a minimum-squared-error criterion and without the kernel trick)
  - SVMs with RBF kernels are close to RBF networks (identical decision functions, different estimation procedures: centers found by k-means vs prototypes selected as support vectors)
- the computational cost of the training procedure becomes prohibitive for very large datasets (but chunking can help)
- the kernel choice is critical; this is a practical concern but also a theoretical issue: the VC-bound applies in the (implicit) feature space!
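A minimal sketch of the dual solution in practice, assuming scikit-learn is available (its SVC solves a dual of the above form; the data, γ and the large C used to approximate a hard margin are illustrative). The decision function is reconstructed directly from the support vectors: dual_coef_ stores the products αᵢzᵢ, and the RBF kernel is parameterized as exp(−γ‖x − x′‖²):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
z = np.where(X @ np.array([1.0, 1.0]) > 0, 1, -1)      # linearly separable labels

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1e6).fit(X, z)  # very large C ~ hard margin

def f(x):
    """f(x) = sign( sum_{i in SV} alpha_i z_i k(x_i, x) + w_0 )."""
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return int(np.sign(clf.dual_coef_[0] @ k + clf.intercept_[0]))

x_new = np.array([0.3, -0.1])
assert f(x_new) == clf.predict([x_new])[0]             # matches the fitted classifier
print(f"{len(clf.support_vectors_)} support vectors out of {len(X)} samples")

The printed count illustrates the sparsity claim above: only the support vectors enter the decision function, regardless of the (here implicit, infinite-dimensional) RBF feature space.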
Over-fitting induced by the kernel

- Consider an RBF kernel k(x, x′) = exp(−‖x − x′‖² / σ²) with σ > 0
- When σ → 0, the Gram matrix K = [k(xᵢ, xⱼ)]ᵢ,ⱼ₌₁..ₙ tends to the identity matrix I (a numerical illustration follows at the end of this section)
- In other words, training points are only considered (very) similar to themselves ⇒ fitting the training set is easy but generalization is likely to be poor
- The ideal kernel is such that any pair of points (x, x′) are considered similar if and only if they should be associated to the same class label z ⇒ the design of this kernel would require the knowledge of P(x, z) to minimize the true risk R[f]

Regularized risk

- True risk minimization can be approximated by minimizing a regularized risk R_reg[f] = R_emp[f] + λΩ[f], where Ω[f] penalizes the lack of smoothness of the function f and λ is a regularization constant
- Maximizing the margin of classification by a hyperplane in feature space is equivalent to minimizing Ω[f] = ‖w‖²
- This setting corresponds to soft-margin SVMs, with R_emp[f] approximated by a function of the slack variables ξᵢ:

  min_{w,ξ}  ½ ‖w‖²  +  (C/n) Σᵢ₌₁ⁿ ξᵢ

  where the first term corresponds to margin maximization and the second to the margin error

[Figure: soft-margin hyperplane with weight vector w and slack variables ξᵢ, ξⱼ for samples violating the margin]

Kernel choice is a regularization choice

- Representer theorem (see [Schölkopf and Smola, 2002], chap. 4): let H denote the feature space associated to a kernel k and {x₁, ..., xₙ} be a labeled data set; each minimizer f ∈ H of the regularized risk R_emp[f] + λΩ[f] admits a representation of the form f(x) = Σᵢ₌₁ⁿ αᵢ k(xᵢ, x)
- In other words, the kernel choice is a regularization choice
- The RBF kernel can be shown to penalize derivatives of all orders, and thus enforces more or less smoothness depending on σ

Take home message

- the curse of capacity matters more than the curse of dimensionality
- maximizing the margin is a good idea to control the capacity of the function class considered and to build classifiers robust to noise
- there is no free lunch in the kernel choice, but each kernel corresponds to a regularization operator
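Returning to the over-fitting example above: a minimal numerical sketch, assuming NumPy, of how the RBF Gram matrix degenerates as σ → 0 (the sample size and σ values are illustrative). The off-diagonal similarities vanish and K approaches the identity:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

def rbf_gram(X, sigma):
    """Gram matrix of k(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

for sigma in (10.0, 1.0, 0.3, 0.1):
    K = rbf_gram(X, sigma)
    print(f"sigma={sigma}: max |K - I| = {np.abs(K - np.eye(5)).max():.6f}")  # -> 0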
References

[Boser et al., 1992] Boser, B., Guyon, I., and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, USA.

[Cristianini and Shawe-Taylor, 2000] Cristianini, N. and Shawe-Taylor, J. (2000). Support Vector Machines and other kernel-based learning methods. Cambridge University Press.

[Ho and Kashyap, 1965] Ho, Y.-C. and Kashyap, R. (1965). An algorithm for linear inequalities and its applications. IEEE Transactions on Electronic Computers, EC-14:683-688.

[Mercer, 1909] Mercer, J. (1909). Functions of positive and negative type and their connection to the theory of integral equations. Philosophical Transactions of The Royal Society London, A 209:415-446.

[Schölkopf and Smola, 2002] Schölkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA.

[Shawe-Taylor and Cristianini, 2004] Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.

[Vapnik, 2000] Vapnik, V. (2000). The Nature of Statistical Learning Theory. Springer, 2nd edition.