On the Influence of the Kernel on the Consistency of Support Vector Machines


Journal of Machine Learning Research 2 (2001) 67-93. Submitted 08/01; Published 12/01

Ingo Steinwart
Mathematisches Institut, Friedrich-Schiller-Universität, Ernst-Abbe-Platz 1-4, 07743 Jena, Germany
steinwart@minet.uni-jena.de

Editor: Bernhard Schölkopf (for HTTP://WWW.KERNEL-MACHINES.ORG)

Abstract

In this article we study the generalization abilities of several classifiers of support vector machine (SVM) type using a certain class of kernels that we call universal. It is shown that the soft margin algorithms with universal kernels are consistent for a large class of classification problems including some kind of noisy tasks provided that the regularization parameter is chosen well. In particular we derive a simple sufficient condition for this parameter in the case of Gaussian RBF kernels. On the one hand our considerations are based on an investigation of an approximation property (the so-called universality) of the used kernels that ensures that all continuous functions can be approximated by certain kernel expressions. This approximation property also gives a new insight into the role of kernels in these and other algorithms. On the other hand the results are achieved by a precise study of the underlying optimization problems of the classifiers. Furthermore, we show consistency for the maximal margin classifier as well as for the soft margin SVMs in the presence of large margins. In this case it turns out that also constant regularization parameters ensure consistency for the soft margin SVMs. Finally we prove that even for simple, noise free classification problems SVMs with polynomial kernels can behave arbitrarily badly.

Keywords: Computational learning theory, pattern recognition, PAC model, support vector machines, kernel methods

1. Introduction

Support vector machines comprise a class of learning algorithms originally introduced for pattern recognition problems. Although their development was motivated by results of statistical learning theory the known bounds on their generalization performance are not fully satisfactory.
In particular, the influence of the chosen kernel is far from being completely understood. The aim of this paper is to give a new insight into the role of the kernels. Our considerations are mainly based on a certain approximation property of various standard kernels that generate function classes with infinite VC-dimension. Since in this case classical Vapnik-Chervonenkis theory fails to be applicable for support vector machines, other concepts such as data dependent structural risk minimization, e.g. in terms of the observed margin, were introduced (cf. Shawe-Taylor et al., 1998; Bartlett & Shawe-Taylor, 1999; Cristianini & Shawe-Taylor, 2000, Chap. 2). The latter usually needs large margins on the training sets to provide good bounds. It is, however, open which distributions and kernels guarantee this assumption. A systematical study of this question is the starting point of this paper. The resulting techniques allow us to show new bounds for the generalization performance of several standard support vector classifiers.

© 2001 Ingo Steinwart.

We begin with a description of the problem of pattern recognition (cf. Vapnik, 1998; Cristianini & Shawe-Taylor, 2000). Let $(X, d)$ be a compact metric space¹, $Y := \{-1, 1\}$ and $P$ be a probability measure on $X \times Y$, where $X$ is equipped with the Borel σ-algebra. By disintegration (cf. Dudley, 1989, Lem. 1.2.1.) there exists a map $x \mapsto P(\,\cdot\,|\,x)$ from $X$ into the set of all probability measures on $Y$ such that $P$ is the joint distribution of $(P(\,\cdot\,|\,x))_x$ and of the marginal distribution $P_X$ of $P$ on $X$. We call $P(\,\cdot\,|\,\cdot\,)$, which is in fact a regular conditional probability, the supervisor. A classifier is an algorithm that constructs to every training set $T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ a decision function $f_T : X \to Y$. In our context it is always assumed that $T$ is i.i.d. according to $P$, which itself is unknown. Then the decision function $f_T : X \to Y$ constructed by the classifier should guarantee a small probability for the misclassification of an example $(x, y)$ randomly generated according to $P$. Here, misclassification means $f_T(x) \neq y$. To make this precise, for a measurable function $f : X \to \{-1, 1\}$ we define the risk of $f$ by

$$R_P(f) := \int_{X \times Y} 1_{\{f(x) \neq y\}} \, P(dx, dy) = P\big(\{(x, y) : f(x) \neq y\}\big).$$

When considering noisy supervisors we cannot expect that we obtain zero risk. Indeed, for

$$B_1(P) := \{x \in X : P(y = 1 \,|\, x) > P(y = -1 \,|\, x)\}$$
$$B_{-1}(P) := \{x \in X : P(y = 1 \,|\, x) < P(y = -1 \,|\, x)\}$$
$$B_0(P) := \{x \in X : P(y = 1 \,|\, x) = P(y = -1 \,|\, x)\}$$

and a function $f_0 : X \to \{-1, 1\}$ with $f_0(x) = 1$ if $x \in B_1(P)$ and $f_0(x) = -1$ if $x \in B_{-1}(P)$ we have (cf. Devroye et al., 1997, Thm. 2.1.)

$$R_P(f_0) = \inf\{R_P(f) : f : X \to \{-1, 1\} \text{ measurable}\} = \int_X p(x) \, P_X(dx), \qquad (1)$$

where the noise level $p : X \to \mathbb{R}$ is defined by $p(x) := P(y = -1 \,|\, x)$ for $x \in B_1(P)$, $p(x) := P(y = 1 \,|\, x)$ for $x \in B_{-1}(P)$ and $p(x) = 1/2$ otherwise. Equation (1) shows that no function can yield less risk than $f_0$.
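To make the risk, the Bayes decision rule and Equation (1) concrete, the following is a minimal sketch for a distribution on a finite set $X$. The three-point distribution and all function names (`risk`, `bayes_rule`, `noise_level`) are hypothetical and chosen only for illustration; ties in $B_0(P)$ are arbitrarily sent to the label 1.

```python
def risk(f, p_x, eta):
    """R_P(f) = P({(x, y) : f(x) != y}) for a distribution on a finite X.

    p_x[x] is the marginal mass P_X({x}); eta[x] is P(y = 1 | x).
    """
    total = 0.0
    for x, mass in p_x.items():
        # f errs with probability P(y = -1 | x) if it predicts 1, else P(y = 1 | x)
        err = (1.0 - eta[x]) if f(x) == 1 else eta[x]
        total += mass * err
    return total


def bayes_rule(eta):
    """f_0(x) = 1 on B_1(P) and -1 on B_{-1}(P); ties (B_0(P)) are sent to 1 here."""
    return lambda x: 1 if eta[x] >= 0.5 else -1


def noise_level(eta):
    """p(x) is the smaller of P(y = 1 | x) and P(y = -1 | x)."""
    return {x: min(e, 1.0 - e) for x, e in eta.items()}


# Hypothetical toy distribution on X = {"a", "b", "c"}.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
eta = {"a": 0.9, "b": 0.2, "c": 0.5}

f0 = bayes_rule(eta)
p = noise_level(eta)
bayes_risk = risk(f0, p_x, eta)                     # left-hand side of (1)
integral_of_p = sum(p_x[x] * p[x] for x in p_x)     # right-hand side of (1)
```

In this toy case both sides of (1) are $0.5 \cdot 0.1 + 0.3 \cdot 0.2 + 0.2 \cdot 0.5 = 0.21$, and enumerating all eight decision functions on the three points gives no smaller risk.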
The function $f_0$ is called an optimal Bayes decision rule and we write $R_P := R_P(f_0)$. Now, a classifier $C$ should guarantee with high probability that $R_P(f_T)$ is close to $R_P$ provided that $n$ is large enough. Here, $f_T$ denotes the decision function constructed by $C$ on the basis of $T$. Asymptotically, this means that $R_P(f_T) \to R_P$ should hold in probability if $n \to \infty$. In this case the algorithm $C$ is called consistent for $P$ (cf. Devroye et al., 1997, Def. 6.1). If a classifier is consistent for all distributions on $X \times Y$ it is said to be universally consistent. Although several algorithms such as the $k_n$-nearest neighbour classifier for $k_n \to \infty$ and $k_n/n \to 0$ (cf. Devroye et al., 1997, Thm. 6.4) are universally consistent it is an open question whether support vector machines are universally consistent for a particular choice of the free parameters. In this article we show that at least for a restricted class of distributions SVMs are consistent provided that the parameters are chosen in a specific manner. In particular our results cover both the noise free case and the case of constant noise level, i.e. $p \equiv p^*$, $p^* \in [0, 1/2)$.

Our results are based on an investigation of a certain approximation property of kernels. Recall that the ansatz of SVMs is to solve specific optimization problems over the class of functions $\{\langle w, \Phi(\cdot)\rangle : w \in H\}$ or $\{\langle w, \Phi(\cdot)\rangle + b : w \in H, b \in \mathbb{R}\}$, where $\Phi : X \to H$ is a feature map of the used kernel. If this function class is dense in $C(X)$ we shall call the corresponding kernel universal (cf. Def. 4). Roughly speaking this notion enables us to approximate the Bayes decision rule in probability, a fact that is frequently used in our proofs of consistency. Using the approximation theorem of Stone-Weierstraß we show that kernels that can be expanded in certain types of Taylor or Fourier series are universal (cf. Cor. 10 and Cor. 11). In particular it turns out that the Gaussian RBF kernel is universal (cf. Ex. 1). Besides the importance of the notion of universality in the context of consistency it also turns out that this concept has strong implications for the geometric interpretation of the shape of the feature map (cf. Cor. 6 and the following remark).

Since the class of functions implemented by an SVM with universal kernel is very rich the problem of overfitting can always occur in the presence of noise. Thus it is very important to know how to choose the regularization parameter. The second part of this work is devoted to this question. Here, we show in particular that for a soft margin SVM with Gaussian RBF kernel on $X \subset \mathbb{R}^d$ the regularization sequence $c_n = n^{\beta - 1}$, $0 < \beta < 1/d$, yields consistency for all problems with constant noise level (cf. Cor. 17 and Cor. 23). These results are of special interest since they show for the first time that SVMs are able to learn noisy problems arbitrarily well. Moreover, we prove that for problems that ensure a large margin it suffices to use universal kernels and a regularization parameter that is independent of the training set size (cf. Thm. 18, Thm. 19 and Thm. 24). For this class of problems we also prove consistency for the maximal margin classifier (cf. Thm. 25). Finally, it turns out that even for these simple, noise free classification problems SVMs with polynomial kernels can behave arbitrarily badly (cf. Prop. 20).

This work is organized as follows: we introduce some mathematical notions in Section 2. In the third section we study the concept of universal kernels. The following sections are devoted to applications of these kernels to support vector machines. We begin with the 2-soft margin classifier in Section 4. Here, consistency results for both noisy distributions and problems ensuring a large margin are proved. In the fifth section we show that these results also hold for the 1-soft margin algorithm. In the last section we discuss the maximal margin classifier in the presence of large margins.

¹ For mathematical notions see Section 2.

2. Preliminaries

For a set $X$, a metric $d$ on $X$ is a function $d : X \times X \to [0, \infty)$ such that for all $x, y, z \in X$ we have $d(x, y) = d(y, x)$ and $d(x, y) \le d(x, z) + d(z, y)$ as well as $d(x, y) = 0$ if and only if $x = y$. We denote the closed ball with radius $\varepsilon$ and centre $x$ by $B_d(x, \varepsilon) := \{y \in X : d(x, y) \le \varepsilon\}$.

The covering numbers of $X$ are defined by

$$N\big((X, d), \varepsilon\big) := \min\Big\{n \in \mathbb{N} \cup \{\infty\} : \exists\, x_1, \dots, x_n \text{ with } X \subset \bigcup_{i=1}^n B_d(x_i, \varepsilon)\Big\}$$

for all $\varepsilon > 0$. The space $(X, d)$ is precompact if and only if $N((X, d), \varepsilon)$ is finite for all $\varepsilon > 0$. Moreover, $X$ is called compact if every open covering of $X$ has a finite subcovering. If the space $(X, d)$ is complete, i.e. every Cauchy sequence converges in $X$, then $X$ is compact if and only if $X$ is precompact. For given $A, B \subset X$ we denote the distance of $A$ and $B$ by

$$d(A, B) := \inf_{x \in A,\, y \in B} d(x, y).$$

We often use that if $A$ is closed, $B$ is compact and both sets are disjoint then $d(A, B) > 0$ holds.

A σ-algebra on a set $X$ is a set of subsets of $X$ that contains $\emptyset$ and is closed under elementary, countable set operations such as complements and countable intersections. For a metric space $(X, d)$ the Borel σ-algebra is the smallest σ-algebra that contains all open sets. Let $\mathcal{A}$ be a σ-algebra on $X$. A subset $A$ of $X$ is called measurable if $A \in \mathcal{A}$. We say that a function $f : X \to \mathbb{R}$ is measurable if the pre-image of every Borel measurable $B \subset \mathbb{R}$ is in $\mathcal{A}$. Basic examples are the functions $1_A$, where $A \in \mathcal{A}$ and $1_A(x) = 1$ if $x \in A$ and $1_A(x) = 0$ otherwise. A probability measure $P : \mathcal{A} \to \mathbb{R}^+$ is a σ-additive function with $P(\emptyset) = 0$ and $P(X) = 1$. If $\mathcal{A}$ is a Borel σ-algebra we call $P$ a Borel probability measure. In this case $P$ is said to be regular, if for all Borel measurable $B \subset X$ we have

$$P(B) = \sup\{P(K) : K \subset B,\ K \text{ compact}\}.$$

If $(X, d)$ is compact, then every Borel probability measure on $X$ is regular (cf. Dudley, 1989, p. 176).

In this paper $H$ always denotes a Hilbert space, i.e. a complete, normed vector space endowed with a dot product $\langle \cdot, \cdot \rangle$ giving rise to its norm via $\|x\| = \sqrt{\langle x, x \rangle}$. Let $B_H := \{x \in H : \|x\| \le 1\}$ be the closed unit ball of $H$ and $S_H := \{x \in H : \|x\| = 1\}$ be its sphere. Recall, that every separable Hilbert space is isometrically isomorphic to the space of all 2-summable sequences $\ell_2$. A commutative algebra $A$ is a vector space equipped with an additional associative and commutative multiplication $\cdot : A \times A \to A$ such that

$$x \cdot (y + z) = x \cdot y + x \cdot z$$
$$\lambda (x \cdot y) = (\lambda x) \cdot y$$

holds for all $x, y, z \in A$ and $\lambda \in \mathbb{R}$.
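The covering numbers introduced above can be bounded from above on a finite point set by a simple greedy pass. The sketch below is only an illustration under stated assumptions: the helper names `euclid` and `greedy_cover` and the grid example are hypothetical, and the greedy count is an upper bound for $N((X, d), \varepsilon)$ of the finite space, not necessarily the minimum.

```python
import math


def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))


def greedy_cover(points, dist, eps):
    """Pick centres so that every point lies in a closed eps-ball around a centre.

    The number of centres is an upper bound for the covering number of the
    finite metric space (points, dist); it need not be the minimum.
    """
    centres = []
    for x in points:
        # x becomes a new centre only if no existing closed ball contains it
        if all(dist(x, c) > eps for c in centres):
            centres.append(x)
    return centres


# Hypothetical example: an 11 x 11 grid in the unit square.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
covers = {eps: greedy_cover(grid, euclid, eps) for eps in (1.5, 0.5, 0.25)}
```

Because the chosen centres are pairwise more than $\varepsilon$ apart, the same pass also produces an $\varepsilon$-separated set, which is the usual way to obtain matching lower bounds.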
A classical example of an algebra is the space $C(X)$ of all continuous functions $f : X \to \mathbb{R}$ on the compact metric space $(X, d)$ endowed with the usual supremum norm

$$\|f\|_\infty := \sup_{x \in X} |f(x)|.$$

The following well-known approximation theorem of Stone-Weierstraß (cf. Pedersen, 1988, Cor. 4.3.5.) states that certain subalgebras of $C(X)$ generate the whole space. This result will be the key tool when considering approximation properties of kernels in the next section:

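Theorem 1 Let $(X, d)$ be a compact metric space and $A \subset C(X)$ be an algebra. Then $A$ is dense in $C(X)$ if both $A$ does not vanish, i.e. for all $x \in X$ there exists an $f \in A$ with $f(x) \neq 0$, and $A$ separates points, i.e. for all $x, y \in X$ with $x \neq y$ there exists an $f \in A$ with $f(x) \neq f(y)$.

3. Kernels

In the following let $(X, d)$ be a metric space. A function $k : X \times X \to \mathbb{R}$ is called a kernel on $X$ if there exists a Hilbert space $H$ and a map $\Phi : X \to H$ with $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$ for all $x, y \in X$. We call $\Phi$ a feature map and $H$ a feature space of $k$. Note, that both $H$ and $\Phi$ are far from being unique. However, for a given kernel there exists a canonical feature space with associated feature map, which is the so-called reproducing kernel Hilbert space (RKHS) (cf. Cristianini & Shawe-Taylor, 2000, Ch. 3). Since our ansatz is mainly based on specific series expansions of certain kernels we do not need to consider these spaces.

Let $k$ be a kernel on $X$ and $\Phi : X \to H$ be a feature map of $k$. A function $f : X \to \mathbb{R}$ is induced by $k$ if there exists an element $w \in H$ such that $f = \langle w, \Phi(\cdot) \rangle$. The next lemma shows that this definition is independent of $\Phi$ and $H$:

Lemma 2 Let $k : X \times X \to \mathbb{R}$ be a kernel and $\Phi_1 : X \to H_1$, $\Phi_2 : X \to H_2$ be two feature maps of $k$. Then for all $w_1 \in H_1$ there exists $w_2 \in H_2$ with $\|w_2\| \le \|w_1\|$ and $\langle w_1, \Phi_1(x) \rangle = \langle w_2, \Phi_2(x) \rangle$ for all $x \in X$.

Proof Let $H_1' := \overline{\operatorname{span}\, \Phi_1(X)}$ and $H_1''$ its orthogonal complement in $H_1$. Then $w_1 \in H_1$ can be written as $w_1 = w_1' + w_1''$ with $w_1' \in H_1'$ and $w_1'' \in H_1''$. Given an $x \in X$ we have $\langle w_1'', \Phi_1(x) \rangle = 0$ and therefore we obtain $\langle w_1, \Phi_1(x) \rangle = \langle w_1', \Phi_1(x) \rangle$ for all $x \in X$. Now by the definition of $H_1'$ there exists a sequence $(w_1^{(n)}) \subset \operatorname{span}\, \Phi_1(X)$ with $w_1^{(n)} = \sum_{m=1}^{m_n} \lambda_m^{(n)} \Phi_1(x_m^{(n)})$ and $w_1' = \sum_{n=1}^\infty w_1^{(n)}$. Then for $w_2^{(n)} := \sum_{m=1}^{m_n} \lambda_m^{(n)} \Phi_2(x_m^{(n)})$ and $l_2 \ge l_1 \ge 1$ we obtain

$$\Big\| \sum_{n=l_1}^{l_2} w_1^{(n)} \Big\|^2 = \sum_{n=l_1}^{l_2} \sum_{m=1}^{m_n} \sum_{i=l_1}^{l_2} \sum_{j=1}^{m_i} \lambda_m^{(n)} \lambda_j^{(i)} \big\langle \Phi_1(x_m^{(n)}), \Phi_1(x_j^{(i)}) \big\rangle = \sum_{n=l_1}^{l_2} \sum_{m=1}^{m_n} \sum_{i=l_1}^{l_2} \sum_{j=1}^{m_i} \lambda_m^{(n)} \lambda_j^{(i)} \big\langle \Phi_2(x_m^{(n)}), \Phi_2(x_j^{(i)}) \big\rangle = \Big\| \sum_{n=l_1}^{l_2} w_2^{(n)} \Big\|^2.$$

Therefore, $\big(\sum_{n=1}^m w_2^{(n)}\big)_{m \ge 1}$ is a Cauchy sequence and hence converges to $w_2 := \sum_{n=1}^\infty w_2^{(n)} \in H_2$. Clearly, we then have $\|w_2\| = \|w_1'\| \le \|w_1\|$. Moreover, an easy calculation similar to the consideration above shows $\langle w_1, \Phi_1(x) \rangle = \langle w_2, \Phi_2(x) \rangle$ for all $x \in X$.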
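A small numerical illustration of Lemma 2, under stated assumptions: the kernel $k(x, y) = \langle x, y \rangle^2$ on $\mathbb{R}^2$ and the two feature maps `phi1` (into $\mathbb{R}^3$) and `phi2` (into $\mathbb{R}^4$) are standard textbook choices, not taken from the text, and `transfer` is a hypothetical helper that writes down the vector $w_2$ explicitly for this particular pair of maps.

```python
import math


def k(x, y):
    """Homogeneous polynomial kernel of degree 2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi1(x):
    """Feature map into R^3."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def phi2(x):
    """A second feature map of the same kernel, into R^4."""
    return (x[0] ** 2, x[0] * x[1], x[1] * x[0], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def transfer(w1):
    """Map w1 in R^3 to a w2 in R^4 that induces the same function via phi2."""
    a, b, c = w1
    return (a, b / math.sqrt(2), b / math.sqrt(2), c)


w1 = (1.0, -2.0, 0.5)
w2 = transfer(w1)
```

For every $x$ both induced functions agree, $\langle w_1, \Phi_1(x) \rangle = \langle w_2, \Phi_2(x) \rangle$, and here $\|w_2\| = \|w_1\|$, consistent with the bound $\|w_2\| \le \|w_1\|$ in the lemma.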

In the following we only consider continuous kernels. The following lemma provides some useful properties of this class:

Lemma 3 Let $k$ be a kernel on the metric space $(X, d)$ and $\Phi : X \to H$ be a feature map of $k$. Then $k$ is continuous if and only if $\Phi$ is continuous. In this case $d_k(x, y) := \|\Phi(x) - \Phi(y)\|$ defines a semi-metric on $X$ such that the identity map $\mathrm{id} : (X, d) \to (X, d_k)$ is continuous. If $\Phi$ is injective $d_k$ is even a metric.

Proof Let us first suppose that $k$ is continuous. Since $d_k(x, y) = \sqrt{k(x, x) - 2k(x, y) + k(y, y)}$ we observe that $d_k(x, \cdot) : (X, d) \to \mathbb{R}$ is continuous for every $x \in X$. In particular, $\{y \in X : d_k(x, y) < \varepsilon\}$ is open with respect to $d$ and therefore $\mathrm{id} : (X, d) \to (X, d_k)$ is continuous. Furthermore, $\Phi : (X, d_k) \to H$ is continuous and hence $\Phi : (X, d) \to H$ is also continuous. Conversely, assume that $\Phi$ is continuous. Since for all $x, x', y, y' \in X$ we have

$$|k(x, y) - k(x', y')| \le |\langle \Phi(x), \Phi(y) - \Phi(y') \rangle| + |\langle \Phi(x) - \Phi(x'), \Phi(y') \rangle| \le \|\Phi(x)\| \cdot \|\Phi(y) - \Phi(y')\| + \|\Phi(y')\| \cdot \|\Phi(x) - \Phi(x')\|$$

it is easily verified that $k$ is also continuous.

The metric $d_k$ enjoys the property that every induced function $\langle w, \Phi(\cdot) \rangle$ is Lipschitz continuous with respect to $d_k$ and the Lipschitz constant is bounded from above by $\|w\|$. This fact turns out to be very important in the proof of Theorem 12 since it allows us to control the behaviour of solutions of SVMs on subsets of small diameters. From the last lemma we know in particular that for a continuous kernel every induced function is continuous. The following definition plays a central role throughout this paper:

Definition 4 A continuous kernel $k$ on a compact metric space $(X, d)$ is called universal if the space of all functions induced by $k$ is dense in $C(X)$, i.e. for every function $f \in C(X)$ and every $\varepsilon > 0$ there exists a function $g$ induced by $k$ with $\|f - g\|_\infty \le \varepsilon$.

We also need a weaker concept. Let $\Phi : X \to H$ be a feature map of $k$ and $A$, $B$ be disjoint subsets of $X$. We say that $k$ separates $A$ and $B$ with margin $\gamma \ge 0$ if $\Phi(A)$ and $\Phi(B)$ can be separated by a hyperplane with margin $\gamma$, i.e. if there exists a pair $(w, b) \in S_H \times \mathbb{R}$ such that $\langle w, \Phi(x) \rangle + b > \gamma$ for all $x \in A$ and $\langle w, \Phi(y) \rangle + b < -\gamma$ for all $y \in B$. If $\gamma = 0$ we simply say that $k$ separates $A$ and $B$. In this case the restriction $w \in S_H$ is superfluous. By Lemma 2 both definitions are independent of the feature map $\Phi$. We say that the kernel $k$ separates all finite, resp. compact subsets if it separates all finite, resp. compact disjoint subsets of $X$. Note, that if $k$ separates compact sets $A$ and $B$ then it automatically separates them with a suitable margin $\gamma > 0$. Moreover, it was shown in Steinwart (2001a, Ex. 3.13) that there exists a continuous kernel that separates all finite subsets but fails to separate all compact subsets.

Before we investigate which kernels are universal we collect some useful properties of these kernels. Firstly, let $(X, d)$ and $(X', d')$ be compact metric spaces, $k$ be a universal kernel on $X$ and $\iota : X' \to X$ be a continuous and injective map. Then one easily checks that $k(\iota(\cdot), \iota(\cdot))$ is a universal kernel on $X'$. Moreover, we have $k(x, x) > 0$ for all $x \in X$ since $k(y, y) = 0$ implies $g(y) = 0$ for all induced functions $g$. Since all feature maps of $k$ are continuous and $X$ is compact we may also restrict ourselves to separable feature spaces of $k$. The next proposition is fundamental for our considerations of support vector machines:

Proposition 5 Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$. Then for all compact and mutually disjoint subsets $K_1, \dots, K_n \subset X$, all $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ and all $\varepsilon > 0$ there exists a function $g$ induced by $k$ with $\|g\|_\infty \le \max_i |\alpha_i| + \varepsilon$ such that

$$\Big\| g_{|K} - \sum_{i=1}^n \alpha_i 1_{K_i} \Big\|_\infty \le \varepsilon,$$

where $K := \bigcup_{i=1}^n K_i$ and $g_{|K}$ denotes the restriction of $g$ to $K$.

Proof Since $d(K_i, K_j) > 0$ for all $i \neq j$ we obtain $\sum_{i=1}^n \alpha_i 1_{K_i} \in C(K)$. Since this function can be extended to a continuous function $f$ on $X$ with $\|f\|_\infty \le \max_i |\alpha_i|$ by the Lemma of Urysohn or by a direct construction with the help of $d$ the assertion follows.

Corollary 6 Every universal kernel separates all compact subsets.

Proof Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$ with feature map $\Phi : X \to H$. Given two compact and disjoint subsets $K_1$ and $K_{-1}$ of $X$ there exists an induced function $g = \langle w, \Phi(\cdot) \rangle$ with $\big\| g_{|K_1 \cup K_{-1}} - (1_{K_1} - 1_{K_{-1}}) \big\|_\infty < 1/2$. This implies that $(\|w\|^{-1} w, 0)$ separates $K_1$ and $K_{-1}$ with margin $\frac{1}{2\|w\|}$.
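For finite sets the separation property of Corollary 6 can be checked numerically. The sketch below is an illustration only and assumes a fact not stated in the text: for pairwise distinct points the Gram matrix of the Gaussian RBF kernel is strictly positive definite, so the linear system $K \alpha = y$ has a solution and $g = \sum_i \alpha_i k(x_i, \cdot)$ is an induced function with $g(x_i) = y_i$. The helper names (`gauss_kernel`, `solve`, `interpolant`), the width `sigma = 3.0` and the sample points are hypothetical.

```python
import math


def gauss_kernel(x, y, sigma=3.0):
    """Gaussian RBF kernel exp(-sigma^2 (x - y)^2) on the real line."""
    return math.exp(-sigma ** 2 * (x - y) ** 2)


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def interpolant(xs, ys, sigma=3.0):
    """Return g = sum_i alpha_i k(x_i, .) with g(x_i) = y_i."""
    gram = [[gauss_kernel(a, b, sigma) for b in xs] for a in xs]
    alpha = solve(gram, ys)
    return lambda x: sum(al * gauss_kernel(xi, x, sigma) for al, xi in zip(alpha, xs))


# Hypothetical sample with an arbitrary sequence of signs.
xs = [0.0, 0.3, 0.5, 0.9, 1.4]
ys = [1.0, -1.0, 1.0, 1.0, -1.0]
g = interpolant(xs, ys)
```

Since $g(x_i) = y_i$ for every $i$, the hyperplane given by the corresponding $w = \sum_i \alpha_i \Phi(x_i)$ and $b = 0$ classifies this sample correctly; replacing `ys` by any other sign sequence works in the same way.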
Although the previous corollary is an almost trivial consequence of the notion of universality it has surprising implications for the geometric interpretation of the shape of the feature map: let us suppose that we have a finite subset $\{x_1, \dots, x_n\}$ of $X$. Then the above corollary ensures that for every sequence of signs $y_1, \dots, y_n$ the corresponding training set can be correctly separated by a hyperplane in the feature space. Moreover, this can even be done by a hyperplane that has almost the same distance to every point of $\{x_1, \dots, x_n\}$. Therefore, any finite dimensional interpretation of the geometric situation in a feature space of a universal kernel must fail. In particular this holds for 2- or 3-dimensional drawings. Actually, the shape of the feature map is even more complicated since not only all finite subsets but every pair of compact disjoint subsets can be separated. The following corollary ensures in particular that the semi-metric $d_k$ induced by a universal kernel $k$ is in fact a metric:

Corollary 7 Every feature map of a universal kernel is injective.

Proof Finite subsets are compact and thus the assertion follows by the previous corollary.

Proposition 8 Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$. Then

$$k^*(x, y) := \frac{k(x, y)}{\sqrt{k(x, x)\, k(y, y)}}$$

defines a universal kernel on $X$.

Proof Let $\Phi : X \to H$ be a feature map of $k$ and $\alpha(x) := k(x, x)^{-1/2}$. Clearly, $\alpha \Phi : X \to H$ is a feature map of $k^*$ and thus $k^*$ is a kernel. To show that $k^*$ is universal we fix a function $f \in C(X)$ and $\varepsilon > 0$. For $a := \|\alpha\|_\infty$ we then get an induced function $g = \langle w, \Phi(\cdot) \rangle$ with $\|\alpha^{-1} f - g\|_\infty \le \varepsilon / a$. This yields

$$\|f - \langle w, \alpha \Phi(\cdot) \rangle\|_\infty \le \|\alpha\|_\infty \, \|\alpha^{-1} f - g\|_\infty \le \varepsilon$$

and thus the assertion is proved.

Up to now we do not know whether there exist universal kernels. To attack this question we begin with a simple criterion that makes it possible to check whether a given kernel is universal:

Theorem 9 Let $(X, d)$ be a compact metric space and $k$ be a continuous kernel on $X$ with $k(x, x) > 0$ for all $x \in X$. Suppose that we have an injective feature map $\Phi : X \to \ell_2$ of $k$ with $\Phi(x) = (\Phi_n(x))_{n \in \mathbb{N}}$. If $A := \operatorname{span}\{\Phi_n : n \in \mathbb{N}\}$ is an algebra then $k$ is universal.

Proof Because of $k(x, x) > 0$ for all $x \in X$ the algebra $A$ does not vanish. Since $k$ is continuous every $\Phi_n : X \to \mathbb{R}$ is continuous by Lemma 3 and hence $A \subset C(X)$. Moreover, $A$ is even dense in $C(X)$ since the injectivity of $\Phi$ implies that $A$ separates points and thus Theorem 1 can be applied. Now we fix $f \in C(X)$ and $\varepsilon > 0$. Then there exists a function

$$g = \sum_{j=1}^m \lambda_j \Phi_{n_j} \in A$$

such that $\|f - g\|_\infty \le \varepsilon$. However, if we define $w_n := \lambda_j$ for $n = n_j$ and $w_n := 0$ otherwise, we have $w := (w_n) \in \ell_2$ and $\langle w, \Phi(\cdot) \rangle = g$.

The following corollaries give various examples of universal kernels. We begin with kernels that can be expressed by a Taylor series:

Corollary 10 Let $0 < r \le \infty$ and $f : (-r, r) \to \mathbb{R}$ be a $C^\infty$-function that can be expanded into its Taylor series in 0, i.e.

$$f(x) = \sum_{n=0}^\infty a_n x^n \quad \text{for all } x \in (-r, r).$$

Let $X := \{x \in \mathbb{R}^d : \|x\|_2 < \sqrt{r}\}$. If we have $a_n > 0$ for all $n \ge 0$ then $k(x, y) := f(\langle x, y \rangle)$ defines a universal kernel on every compact subset of $X$.

Proof Since $|\langle x, y \rangle| \le \|x\|_2 \|y\|_2 < r$ for all $x, y \in X$ we see that $k$ is well-defined. We also have

$$k(x, y) = \sum_{n=0}^\infty a_n \Big( \sum_{k=1}^d x_k y_k \Big)^n = \sum_{n=0}^\infty a_n \sum_{\substack{k_1, \dots, k_d \ge 0 \\ k_1 + \dots + k_d = n}} c_{k_1, \dots, k_d} \prod_{i=1}^d (x_i y_i)^{k_i} = \sum_{k_1, \dots, k_d \ge 0} a_{k_1 + \dots + k_d}\, c_{k_1, \dots, k_d} \prod_{i=1}^d x_i^{k_i} \prod_{i=1}^d y_i^{k_i},$$

where $c_{k_1, \dots, k_d} := \big(\sum_{i=1}^d k_i\big)! \, \big(\prod_{i=1}^d k_i!\big)^{-1}$ (cf. also Poggio, 1975, Lem. 2.1). Note, that the series can be rearranged since it is absolutely summable. In particular, for $x = y$ we obtain that $\Phi : X \to \ell_2(\mathbb{N}_0^d)$ is well defined by

$$\Phi(x) := \Big( \sqrt{a_{k_1 + \dots + k_d}\, c_{k_1, \dots, k_d}} \prod_{i=1}^d x_i^{k_i} \Big)_{k_1, \dots, k_d \ge 0}.$$

The above equation also shows that $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$ holds for all $x, y \in X$ and hence $k$ is indeed a kernel. Moreover, $a_0 > 0$ implies $k(x, x) > 0$ for all $x \in X$ and trivially, $\Phi$ is injective. Since $A := \operatorname{span}\{\Phi_{k_1, \dots, k_d} : k_1, \dots, k_d \ge 0\}$ is an algebra we thus obtain by Theorem 9 that $k$ is universal.

Instead of Taylor series one can also consider Fourier expansions. The result reads as follows:

Corollary 11 Let $f : [0, 2\pi] \to \mathbb{R}$ be a continuous function that can be expanded in a pointwise absolutely convergent Fourier series of the form

$$f(t) = \sum_{n=0}^\infty a_n \cos(nt). \qquad (2)$$

If $a_n > 0$ holds for all $n \ge 0$ then $k(x, y) := \prod_{i=1}^d f(|x_i - y_i|)$ defines a universal kernel on every compact subset of $[0, 2\pi)^d$.

Recall, that every function $f : [0, 2\pi] \to \mathbb{R}$ that can be extended to a continuous, symmetric, periodic and piecewise continuously differentiable function on $\mathbb{R}$ has a Fourier series of the form (2).

Proof By induction and the Cauchy product of series we may restrict ourselves to $d = 1$. Then

$$k(x, y) = a_0 + \sum_{n=1}^\infty a_n \sin(nx) \sin(ny) + \sum_{n=1}^\infty a_n \cos(nx) \cos(ny)$$

holds for all $x, y \in [0, 2\pi)$ and hence $\Phi = (\Phi_n)_{n \ge 0}$ defined by $\Phi_0(x) := \sqrt{a_0}$ and $\Phi_{2n-1}(x) := \sqrt{a_n} \sin(nx)$, $\Phi_{2n}(x) := \sqrt{a_n} \cos(nx)$ for $n \ge 1$ is an injective feature map of $k$ with image in $\ell_2$. Moreover, $A := \operatorname{span}\big(\{\sqrt{a_n} \sin(n\,\cdot) : n \ge 1\} \cup \{\sqrt{a_n} \cos(n\,\cdot) : n \ge 0\}\big)$ is an algebra and since $a_0 > 0$ implies $k(x, x) > 0$ for all $x \in X$ we obtain that $k$ is universal.

The following examples show that various well-known kernels are universal:

Example 1 The kernels $\exp(-\sigma^2 \|\cdot - \cdot\|_2^2)$ and $\exp(\langle \cdot, \cdot \rangle)$ are universal on every compact subset of $\mathbb{R}^d$.

Proof The universality of $\exp(\langle \cdot, \cdot \rangle)$ is due to Corollary 10. Therefore, by Proposition 8 and

$$\exp(-\sigma^2 \|x - y\|_2^2) = \frac{\exp(\langle \sqrt{2}\sigma x, \sqrt{2}\sigma y \rangle)}{\exp(\|\sigma x\|_2^2)\, \exp(\|\sigma y\|_2^2)}$$

the assertion follows for the RBF kernel.

Example 2 Let $X := \{x \in \mathbb{R}^d : \|x\|_2 < 1\}$ and $\alpha > 0$. Then V. Vovk's (cf. Saunders et al., 1998, p. 15) infinite polynomial kernel $k(x, y) := (1 - \langle x, y \rangle)^{-\alpha}$, $x, y \in X$, is universal on every compact subset of $X$.

Proof To check the assertion we use that $(1 - t)^{-\alpha} = \sum_{n=0}^\infty \binom{-\alpha}{n} (-1)^n t^n$ holds for $|t| < 1$. Since $\binom{-\alpha}{n} (-1)^n > 0$ for all $n \ge 0$, the assertion then follows by Corollary 10.

Example 3 Let $0 < q < 1$ and $f(t) := (1 - q^2)/(2 - 4q \cos t + 2q^2)$, $t \in \mathbb{R}$. Then the stronger regularized Fourier kernel $k(x, y) := \prod_{i=1}^d f(x_i - y_i)$ considered by Vapnik (1998, p. 470) and Saunders et al. (1998, p. 15) is universal on every compact subset of $[0, 2\pi)^d$.

Proof The assertion can be seen using Corollary 11 and $f(t) = 1/2 + \sum_{n=1}^\infty q^n \cos(nt)$ (cf. Gradstein & Ryshik, 1981, p. 68).

Example 4 Let $0 < q < \infty$ and $f(t) := \pi \cosh\big((\pi - |t|)/q\big) / \big(2q \sinh(\pi/q)\big)$ for all $t$ with $-2\pi \le t \le 2\pi$. Then the weaker regularized Fourier kernel $k(x, y) := \prod_{i=1}^d f(x_i - y_i)$ considered by Vapnik (1998, p. 470/1) and Saunders et al. (1998, p. 15) is universal on every compact subset of $[0, 2\pi)^d$.

Proof To obtain the assertion we use $f(t) = 1/2 + \sum_{n=1}^\infty \cos(nt)/(1 + q^2 n^2)$ (cf. Gradstein & Ryshik, 1981, p. 68).

4. The 2-norm soft margin classifier

Let $k$ be a kernel on $X$ and $\Phi : X \to H$ be a feature map of $k$. For a training set $T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ and $c > 0$ we denote the unique solution of the optimization problem

$$\text{minimize } W(w, b, \xi) := \langle w, w \rangle + c \sum_{i=1}^n \xi_i^2 \ \text{ over } w, b, \xi \qquad \text{subject to } y_i \big(\langle w, \Phi(x_i) \rangle + b\big) \ge 1 - \xi_i, \quad i = 1, \dots, n \qquad (3)$$

by $(w_T^{2,k,c}, b_T^{2,k,c}) \in H \times \mathbb{R}$. An algorithm $C_k^{2,(c_n)}$ that provides the decision function

$$f_T^{2,k,c_n}(x) := \operatorname{sign}\big(\langle w_T^{2,k,c_n}, \Phi(x) \rangle + b_T^{2,k,c_n}\big), \quad x \in X$$

for every training set $T$ of length $n$ is called a 2-norm soft margin classifier (2-SMC) with kernel $k$ and parameter sequence $(c_n)$. Note, that in order to have a small set of free parameters one usually fixes $c_n := c$ for all $n \ge 1$. In this section it turns out that this is not suitable for problems that do not guarantee a large margin. Instead one should use sequences $c_n = c\, n^{\beta - 1}$ where $\beta > 0$ is a parameter a-priori determined by the kernel and $c$ is a new free parameter (cf. Cor. 17). Of course, for fixed training set sizes both parameterizations are equivalent, i.e. they can be transformed into each other.

By Lemma 2 the decision function is independent of the choice of the feature map $\Phi$. Moreover, $f_T^{2,k,c}$ can be expressed by

$$f_T^{2,k,c}(x) = \operatorname{sign}\Big( \sum_{i=1}^n y_i \alpha_i k(x_i, x) + b_T^{2,k,c} \Big),$$

where $\alpha_i \ge 0$ are suitable constants depending on $T$ and $b_T^{2,k,c}$ can also be computed with the help of the kernel (cf. Cristianini & Shawe-Taylor, 2000; Vapnik, 1998; Schölkopf et al., 2001). Note, that if $k$ is a kernel on $X$ which separates all finite sets and $X$ has infinitely many elements then the function class represented by the 2-SMC has infinite VC-dimension. For more information on this we refer to Vapnik (1998, Ch. 4), Cristianini & Shawe-Taylor (2000, Ch. 4) and van der Vaart & Wellner (1996, Ch. 2.6).

Given a Borel probability measure $P$ on $X \times Y$ with noise level $p$ we denote the non-deterministic part of the supervisor by $X^+ := \{x \in X : p(x) > 0\}$. If $P_X(X^+) > 0$ we write $q^* := \inf_{x \in X^+} p(x)$ and $p^* := \sup_{x \in X} p(x)$. Due to technical reasons we define $q^* := p^* := 1/4$ otherwise. We begin with a preliminary result:

Theorem 12 Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$.
Then for all Borel probability measures $P$ on $X \times Y$ with $q^*, p^* \in (0, 1/2)$ and all $\varepsilon > 0$ there exist $c^* > 0$ and $\delta^* > 0$ such that for all $c \ge c^*$, $0 < \delta \le \delta^*$ and all $n \ge 1$ we have

$$\Pr{}^*\Big( \Big\{ T \in (X \times Y)^n : R_P\big(f_T^{2,k,c/n}\big) \le R_P + 4\, \frac{p^* - q^*}{1 - 2q^*}\, P_X(X^+) + \varepsilon \Big\} \Big) \ge 1 - 3M e^{-2 (\delta/M)^2 n},$$

where $M := N\big((X, d_k), \delta/\sqrt{c}\big)$ is the covering number of $X$ with respect to the metric $d_k$ which is induced by the kernel $k$. Moreover, $\Pr^*$ denotes the outer probability measure of $P^n$.

Note, that in order to avoid the probably very difficult question whether the sets

$$\big\{ T \in (X \times Y)^n : R_P\big(f_T^{2,k,c/n}\big) \le \alpha \big\}$$

are measurable we consider the outer probability measure, only.

Since the proof of Theorem 12 is very technical we first explain its basic idea. Let us suppose that the supervisor has a constant level of noise $p \in [0, 1/2)$. Moreover, we assume that we have an induced function $\langle w^*, \Phi(\cdot) \rangle$ which has the constant values $1 - 2p$, resp. $-(1 - 2p)$ on $B_1(P)$, resp. $B_{-1}(P)$. Now let us take a representative training set $T$ of length $n$. Then one easily checks (cf. estimate (4)) that

$$\big\langle w_T^{2,k,c/n}, w_T^{2,k,c/n} \big\rangle + \frac{c}{n} \sum_{l=1}^n \xi_l^2 \lesssim \langle w^*, w^* \rangle + 4cp(1 - p)$$

holds. Here $\lesssim$ means that the relation only holds approximately. On the other hand, by the continuity of the function $\langle w_T^{2,k,c/n}, \Phi(\cdot) \rangle$ a misclassified (compared with the optimal Bayes decision rule) element $z$ forces the sum of those slack variables, which belong to samples in the neighbourhood of $z$, to be approximately greater than their cardinality (cf. inequality (6)). Conversely, for a correctly classified element the corresponding sum of the slack variables is approximately larger than $4p(1 - p)$ times their cardinality (cf. inequality (7)). Combining these considerations we obtain

$$c(1 - 2p)^2 P_X(E) + 4cp(1 - p) = c\big( P_X(E) + 4p(1 - p) P_X(X \setminus E) \big) \lesssim \big\langle w_T^{2,k,c/n}, w_T^{2,k,c/n} \big\rangle + \frac{c}{n} \sum_{l=1}^n \xi_l^2 \lesssim \langle w^*, w^* \rangle + 4cp(1 - p),$$

where $E$ denotes the set of misclassified (compared with the optimal Bayes decision rule) elements. Thus $P_X(E)$ must be small if we have chosen $c$ large enough. The difficulty of the proof below is firstly, to transfer the idea to the general case and secondly, to give exact formulations of "representative", "neighbourhood" and "approximately".

Proof of Theorem 12 Without loss of generality we may assume $\varepsilon \in (0, 1]$. Let $X^0 := X \setminus X^+$ be the deterministic part of the supervisor and $X_1^0 := X^0 \cap B_1(P)$, $X_{-1}^0 := X^0 \cap B_{-1}(P)$ be the parts of the classes $B_1(P)$, $B_{-1}(P)$ in $X^0$. Furthermore, let $X_1^+ := X^+ \cap B_1(P)$ and $X_{-1}^+ := X^+ \cap B_{-1}(P)$ be the parts of the classes $B_1(P)$ and $B_{-1}(P)$ in $X^+$. We define $\delta^* := \min\{2p^*, \frac{\varepsilon}{74}\, q^* (1 - 2q^*)\}$ and fix $\delta \in (0, \delta^*]$. Since $P_X$ is regular (cf. Dudley, 1989, p. 176) there exist compact subsets $K_i^j$ of $X_i^j$ with $P_X(X_i^j \setminus K_i^j) \le \delta/4$ for $i \in \{-1, 1\}$ and $j \in \{0, +\}$.
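Moreover, for a fixed feature map $\Phi : X \to H$ of $k$, Proposition 5 ensures the existence of an element $w^* \in H$ with

$$\langle w^*, \Phi(x)\rangle \in \begin{cases} [1,\, 1+\delta] & \text{if } x \in K_1^0 \\ [-(1+\delta),\, -1] & \text{if } x \in K_{-1}^0 \\ [1-2p-\delta,\, 1-2p] & \text{if } x \in K_1^+ \\ [-(1-2p),\, -(1-2p-\delta)] & \text{if } x \in K_{-1}^+ \\ [-(1+\delta),\, 1+\delta] & \text{otherwise.} \end{cases}$$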

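We define $c^* := \frac{2}{\varepsilon(1-2q)}\,\|w^*\|^2$ and for fixed $c \ge c^*$ let $\sigma := \delta/\sqrt{c}$. By Lemma 13 we then obtain partitions $P_i^j$ of $K_i^j$ with $\operatorname{diam}_{d_k}(A) \le \sigma$ for all $A \in P_i^j$ and

$$\sum_{i \in \{-1,1\}} \sum_{j \in \{0,+\}} |P_i^j| \le N\big((X, d_k), \sigma\big) = M.$$

Let $\tilde P_i^j := \{A \in P_i^j : P_X(A) \ge \frac{\delta}{qM}\}$ for $i \in \{-1, 1\}$ and $j \in \{0, +\}$. To construct representative training sets we define

$$F_{n,A} := \Big\{\big((x_1, y_1), \dots, (x_n, y_n)\big) : |\{l : x_l \in A\}| \ge \Big(P_X(A) - \frac{\delta}{M}\Big) n\Big\}$$

for all $A \in \tilde P_i^j$, $i \in \{-1, 1\}$ and $j \in \{0, +\}$. Moreover, for $A \in \tilde P_i^+$, $i \in \{-1, 1\}$ let

$$F_{n,A}^+ := \Big\{\big((x_1, y_1), \dots, (x_n, y_n)\big) : |\{l : x_l \in A \text{ and } y_l = i\}| \ge (1-p)\Big(P_X(A) - \frac{\delta}{M}\Big) n\Big\},$$

$$F_{n,A}^- := \Big\{\big((x_1, y_1), \dots, (x_n, y_n)\big) : |\{l : x_l \in A \text{ and } y_l \ne i\}| \ge q\Big(P_X(A) - \frac{\delta}{M}\Big) n\Big\},$$

$$F_n := \bigcap_{A \in \tilde P_1^0 \cup \tilde P_{-1}^0} F_{n,A} \;\cap \bigcap_{A \in \tilde P_1^+ \cup \tilde P_{-1}^+} \big(F_{n,A} \cap F_{n,A}^+ \cap F_{n,A}^-\big).$$

Lemma 14 yields $P^n(F_n) \ge 1 - 3Me^{-2(\delta/M)^2 n}$ and thus it suffices to show that

$$R_P\big(f_T^{2,k,c/n}\big) \le R_P + 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \varepsilon$$

holds for all $T \in F_n$. Therefore, let us fix a training set $T = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in F_n$. For $c_n := c/n$ we denote the solution of (3) by $(w, b, \xi)$.

Our first step is to estimate $W(w, b, \xi)$ from above by comparing it with $W(w^*, 0, \xi^*)$, where $\xi^*$ is an admissible slack vector of $(w^*, 0)$. Hence we have to construct $\xi^*$. For this let us first assume that we have a sample $(x_l, y_l)$ with $x_l \in K_1^0$. Then we observe that $y_l \langle w^*, \Phi(x_l)\rangle = \langle w^*, \Phi(x_l)\rangle \ge 1$ and thus we may define $\xi_l^* := 0$. Analogous considerations yield

$$\xi_l^* := \begin{cases} 0 & \text{if } x_l \in K_i^0 \\ 2p + \delta & \text{if } x_l \in K_i^+,\ y_l = i \\ 2 - 2p & \text{if } x_l \in K_i^+,\ y_l \ne i \\ 2 + \delta & \text{otherwise.} \end{cases}$$

Moreover, our construction of $F_n$ guarantees that there are at most

$$n - \sum_{i \in \{-1,1\}} \sum_{j \in \{0,+\}} \sum_{A \in \tilde P_i^j} \Big(P_X(A) - \frac{\delta}{M}\Big) n \;\le\; \Big(1 - \sum_{i \in \{-1,1\}} \sum_{j \in \{0,+\}} P_X(X_i^j) + \Big(2 + \frac{1}{q}\Big)\delta\Big) n \;\le\; \frac{2}{q}\,\delta n$$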

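samples which are not elements of a suitable $K_i^j$. Furthermore, since there are at most

$$n - \sum_{i \in \{-1,1\}} \sum_{A \in \tilde P_i^0} \Big(P_X(A) - \frac{\delta}{M}\Big) n \;\le\; \Big(P_X(X_+) + \frac{2}{q}\,\delta\Big) n$$

samples in $K^+ := K_1^+ \cup K_{-1}^+$, we also obtain that there are at most

$$\Big(P_X(X_+) + \frac{2}{q}\,\delta\Big) n - \sum_{i \in \{-1,1\}} \sum_{A \in \tilde P_i^+} (1-p)\Big(P_X(A) - \frac{\delta}{M}\Big) n \;\le\; \Big(p\,P_X(X_+) + \frac{4}{q}\,\delta\Big) n$$

samples in $K^+$ which have incorrect labels. Since these have larger slack variables with respect to $(w^*, 0)$ than the correctly labeled samples in $K^+$ we obtain

$$\begin{aligned} W(w, b, \xi) &\le W(w^*, 0, \xi^*) \\ &\le \langle w^*, w^*\rangle + \frac{c}{n} \sum_{i \in \{-1,1\}} \sum_{\substack{x_l \in K_i^+ \\ y_l = i}} (\xi_l^*)^2 + \frac{c}{n} \sum_{i \in \{-1,1\}} \sum_{\substack{x_l \in K_i^+ \\ y_l \ne i}} (\xi_l^*)^2 + 2\delta c\,(2+\delta)^2 \\ &\le \langle w^*, w^*\rangle + c\,(1-p)\,P_X(X_+)\,(2p+\delta)^2 + c\Big(p\,P_X(X_+) + \frac{4}{q}\,\delta\Big)(2-2p)^2 + 2\delta c\,(2+\delta)^2 \\ &\le \langle w^*, w^*\rangle + c\Big(4p(1-p)\,P_X(X_+) + \frac{27}{q}\,\delta\Big). \qquad (4) \end{aligned}$$

For later purposes we note that we also have $W(w, b, \xi) \le W\big(0, 0, (1, \dots, 1)\big) \le c$ and thus $\|w\|^2 \le c$.

In the second step of the proof we estimate $W(w, b, \xi)$ from below. For this let us denote the set of misclassified points in $X_i^j$ by $E_i^j := \{x \in X_i^j : f_T^{2,k,c/n}(x) \ne i\}$. For brevity's sake we also write $E^j := E_1^j \cup E_{-1}^j$ and $E := E^0 \cup E^+$. Let us first consider an $A \in \tilde P_i^0$ with $A \cap E \ne \emptyset$. Without loss of generality we may assume that $i = 1$. Then for $x_l \in A$ and $z \in A \cap E$ we obtain

$$1 - \xi_l \le y_l\big(\langle w, \Phi(x_l)\rangle + b\big) = \langle w, \Phi(x_l) - \Phi(z)\rangle + \langle w, \Phi(z)\rangle + b \le \|w\|\, d_k(x_l, z) \le \delta,$$

i.e. $\xi_l^2 \ge (1-\delta)^2 \ge 1 - 2\delta$. Since the same estimate holds in the case $i = -1$, our construction of $F_n$ implies

$$\frac{1}{n} \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^0 \\ A \cap E \ne \emptyset}} \sum_{x_l \in A} \xi_l^2 \;\ge\; \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^0 \\ A \cap E \ne \emptyset}} (1-2\delta)\Big(P_X(A) - \frac{\delta}{M}\Big) \;\ge\; P_X(E^0) - \frac{3}{q}\,\delta. \qquad (5)$$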

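Now let us consider an $A \in \tilde P_i^+$ with $A \cap E \ne \emptyset$. Without loss of generality we may assume that $i = 1$ again. Then for fixed $z \in A \cap E$ and $a := -\big(\langle w, \Phi(z)\rangle + b\big) \ge 0$ we obtain

$$\xi_l \ge \begin{cases} 1 - \delta + a & \text{for } x_l \in A \text{ with } y_l = 1 \\ \max\{0,\, 1 - \delta - a\} & \text{for } x_l \in A \text{ with } y_l = -1 \end{cases}$$

analogously to the above considerations. We first treat the case $1 - \delta - a \ge 0$. Since there are at least $(P_X(A) - \delta/M)\,n$ samples in $A$ and at least $(1-p)(P_X(A) - \delta/M)\,n$ correctly labeled samples in $A$ we get

$$\begin{aligned} \frac{1}{n} \sum_{x_l \in A} \xi_l^2 &\ge (1-\delta+a)^2 (1-p)\Big(P_X(A) - \frac{\delta}{M}\Big) + (1-\delta-a)^2\, p\,\Big(P_X(A) - \frac{\delta}{M}\Big) \\ &\ge P_X(A)\big((1-\delta+a)^2 (1-p) + (1-\delta-a)^2 p\big) - \frac{4\delta}{M}. \end{aligned}$$

Now an easy minimization with respect to $a \in [0, 1-\delta]$ yields

$$\frac{1}{n} \sum_{x_l \in A} \xi_l^2 \ge P_X(A)\big((1-\delta)^2 (1-p) + (1-\delta)^2 p\big) - \frac{4\delta}{M} \ge (1-2\delta)\,P_X(A) - \frac{4\delta}{M}.$$

On the other hand, if $1 - \delta - a < 0$ we have $1 - \delta + a > 2 - 2\delta$ and thus the same inequality follows:

$$\frac{1}{n} \sum_{x_l \in A} \xi_l^2 \ge (1-\delta+a)^2 (1-p)\Big(P_X(A) - \frac{\delta}{M}\Big) \ge (1-2\delta)\,P_X(A) - \frac{4\delta}{M}.$$

Therefore, we obtain

$$\frac{1}{n} \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E \ne \emptyset}} \sum_{x_l \in A} \xi_l^2 \;\ge\; \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E \ne \emptyset}} \Big((1-2\delta)\,P_X(A) - \frac{4\delta}{M}\Big). \qquad (6)$$

Finally, an analogous consideration yields

$$\frac{1}{n} \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E = \emptyset}} \sum_{x_l \in A} \xi_l^2 \;\ge\; \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E = \emptyset}} \Big((1-2\delta)\,4q(1-q)\,P_X(A) - \frac{4\delta}{M}\Big). \qquad (7)$$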

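Now, the estimates (6) and (7) imply

$$\begin{aligned} &\frac{1}{n} \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E \ne \emptyset}} \sum_{x_l \in A} \xi_l^2 + \frac{1}{n} \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E = \emptyset}} \sum_{x_l \in A} \xi_l^2 \\ &\quad\ge \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E \ne \emptyset}} \Big((1-2\delta)\,P_X(A) - \frac{4\delta}{M}\Big) + \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E = \emptyset}} \Big((1-2\delta)\,4q(1-q)\,P_X(A) - \frac{4\delta}{M}\Big) \\ &\quad= \sum_{i \in \{-1,1\}} \sum_{\substack{A \in \tilde P_i^+ \\ A \cap E \ne \emptyset}} (1-2\delta)(1-2q)^2\,P_X(A) + \sum_{i \in \{-1,1\}} \sum_{A \in \tilde P_i^+} \Big((1-2\delta)\,4q(1-q)\,P_X(A) - \frac{4\delta}{M}\Big) \\ &\quad\ge (1-2\delta)(1-2q)^2\,P_X(E^+) + (1-2\delta)\,4q(1-q)\,P_X(X_+) - 6\delta - \frac{2}{q}\,\delta \\ &\quad\ge (1-2q)^2\,P_X(E^+) + 4q(1-q)\,P_X(X_+) - \frac{7}{q}\,\delta. \end{aligned}$$

The latter inequality together with (5) yields

$$\frac{1}{n} \sum_{l=1}^{n} \xi_l^2 \ge P_X(E^0) + (1-2q)^2\,P_X(E^+) + 4q(1-q)\,P_X(X_+) - \frac{10}{q}\,\delta.$$

Combining this estimate with (4) we now obtain

$$\begin{aligned} \langle w^*, w^*\rangle &\ge c\Big(P_X(E^0) + (1-2q)^2\,P_X(E^+) + 4q(1-q)\,P_X(X_+) - 4p(1-p)\,P_X(X_+) - \frac{37}{q}\,\delta\Big) \\ &\ge c\Big(P_X(E^0) + (1-2q)^2\,P_X(E^+) - 4(p-q)\,P_X(X_+) - \frac{37}{q}\,\delta\Big). \qquad (8) \end{aligned}$$

Moreover, a simple calculation shows

$$R_P\big(f_T^{2,k,c/n}\big) = R_P + \int_E \big(1 - 2p(x)\big)\,dP_X(x) \le R_P + \frac{1}{1-2q}\Big(P_X(E^0) + (1-2q)^2\,P_X(E^+)\Big),$$

where $p(x)$ denotes the noise level at $x$, and thus $P_X(E^0) + (1-2q)^2 P_X(E^+) \ge (1-2q)\big(R_P(f_T^{2,k,c/n}) - R_P\big)$. With this and (8) we find

$$\frac{\varepsilon(1-2q)}{2} = \frac{\langle w^*, w^*\rangle}{c^*} \ge \frac{\langle w^*, w^*\rangle}{c} \ge (1-2q)\big(R_P(f_T^{2,k,c/n}) - R_P\big) - 4(p-q)\,P_X(X_+) - \frac{37}{q}\,\delta.$$

The latter inequality finally implies

$$R_P\big(f_T^{2,k,c/n}\big) - R_P \le \frac{\varepsilon}{2} + 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \frac{37}{(1-2q)\,q} \cdot \frac{\varepsilon(1-2q)\,q}{74} = 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \varepsilon.$$

Thus the assertion follows.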
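The two elementary identities behind the proof idea of Theorem 12 can be checked numerically. The following is a minimal sketch, assuming a constant noise level `p` and a reference function with values `+-(1 - 2p)`; the function names are hypothetical and not taken from the paper.

```python
def expected_squared_slack(p: float) -> float:
    """Mean squared slack of a reference function with values +-(1 - 2p).

    A correctly labelled sample (probability 1 - p) has slack 1 - (1 - 2p) = 2p,
    a mislabelled sample (probability p) has slack 1 + (1 - 2p) = 2 - 2p.
    """
    slack_correct = 1.0 - (1.0 - 2.0 * p)
    slack_wrong = 1.0 + (1.0 - 2.0 * p)
    return (1.0 - p) * slack_correct ** 2 + p * slack_wrong ** 2


def slack_lower_bound(p: float, mass_misclassified: float) -> float:
    """Approximate lower bound on the mean squared slack of the solution.

    Misclassified regions contribute about their full mass, correctly
    classified regions contribute about 4p(1 - p) times their mass.
    """
    return mass_misclassified + 4.0 * p * (1.0 - p) * (1.0 - mass_misclassified)


if __name__ == "__main__":
    p, mass = 0.2, 0.3
    print(expected_squared_slack(p), 4 * p * (1 - p))
    print(slack_lower_bound(p, mass), (1 - 2 * p) ** 2 * mass + 4 * p * (1 - p))
```

Both printed pairs agree, which is the cancellation of the term $4cp(1-p)$ used above: what remains on the left is $c(1-2p)^2 P_X(E)$, bounded by $\langle w^*, w^*\rangle$.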

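We have to prove the remaining lemmas now. We begin with the lemma that constructs the partitions $P_i^j$ of the above proof:

Lemma 13 Using the notations of the proof of Theorem 12 there exist partitions $P_i^j$ of $K_i^j$, $i \in \{-1, 1\}$, $j \in \{0, +\}$, such that $\operatorname{diam}_{d_k}(A) \le \sigma$ for all $A \in P_i^j$, $i \in \{-1, 1\}$, $j \in \{0, +\}$, and

$$\sum_{i \in \{-1,1\}} \sum_{j \in \{0,+\}} |P_i^j| \le N\big((X, d_k), \sigma\big). \qquad (9)$$

Proof By the definition of the covering numbers there exists a partition $P$ of $X$ with $\operatorname{diam}_{d_k}(A) \le \sigma$ for all $A \in P$ and $|P| \le N\big((X, d_k), \sigma\big)$. Let us define $\hat P_i^j := \{A \in P : A \cap K_i^j \ne \emptyset\}$ and $P_i^j := \{A \cap K_i^j : A \in \hat P_i^j\}$. Therefore, to prove (9) we have to show that the $\hat P_i^j$'s are mutually disjoint. Assume the converse, i.e. there exists $A \in \hat P_i^j \cap \hat P_{i'}^{j'}$ with $i \ne i'$ or $j \ne j'$. By the definition of the partitions there exist $z_1, z_2 \in A$ with $z_1 \in K_i^j$ and $z_2 \in K_{i'}^{j'}$. Now on the one hand, we obtain

$$|\langle w^*, \Phi(z_1) - \Phi(z_2)\rangle| \le \|w^*\|\, d_k(z_1, z_2) \le \|w^*\|\,\sigma \le \|w^*\|\,\frac{\delta}{\sqrt{c}} < \delta,$$

but on the other hand we also have

$$|\langle w^*, \Phi(z_1) - \Phi(z_2)\rangle| = |\langle w^*, \Phi(z_1)\rangle - \langle w^*, \Phi(z_2)\rangle| \ge 1 - (1-2p) \ge \delta^* \ge \delta.$$

Therefore the assertion follows.

Lemma 14 Using the notations of the proof of Theorem 12 we have $P^n(F_n) \ge 1 - 3Me^{-2(\delta/M)^2 n}$.

Proof Let us recall Hoeffding's inequality (cf. Devroye et al., 1997, Thm. 8.1) which in particular states that for all i.i.d. random variables $z_i : (\Omega, \mathcal{A}, Q) \to \{0, 1\}$ and all $\varepsilon > 0$, $n \ge 1$ we have

$$Q\Big(\frac{1}{n}\sum_{i=1}^{n} z_i \le q - \varepsilon\Big) \le e^{-2\varepsilon^2 n},$$

where $q := Q(z_i = 1)$. Thus for $A \in \tilde P_i^+$ we get

$$\begin{aligned} P^n\big(F_{n,A}^+\big) &\ge P^n\Big(\Big\{\big((x_1, y_1), \dots, (x_n, y_n)\big) : |\{l : x_l \in A,\ y_l = i\}| \ge \Big(\int_A (1 - p(x))\,dP_X(x) - \frac{\delta}{M}\Big) n\Big\}\Big) \\ &= 1 - P^n\Big(\Big\{\big((x_1, y_1), \dots, (x_n, y_n)\big) : |\{l : x_l \in A,\ y_l = i\}| < \Big(\int_A (1 - p(x))\,dP_X(x) - \frac{\delta}{M}\Big) n\Big\}\Big) \\ &\ge 1 - e^{-2(\delta/M)^2 n}, \end{aligned}$$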

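i.e. $P^n\big(Z^n \setminus F_{n,A}^+\big) \le e^{-2(\delta/M)^2 n}$, where $Z := X \times Y$. Analogously, we obtain $P^n\big(Z^n \setminus F_{n,A}^-\big) \le e^{-2(\delta/M)^2 n}$ for all $A \in \tilde P_i^+$ and $P^n\big(Z^n \setminus F_{n,A}\big) \le e^{-2(\delta/M)^2 n}$ for all $A \in \tilde P_i^j$. These estimates yield

$$\begin{aligned} P^n(F_n) &= P^n\Big(\bigcap_{A \in \tilde P_1^0 \cup \tilde P_{-1}^0} F_{n,A} \;\cap \bigcap_{A \in \tilde P_1^+ \cup \tilde P_{-1}^+} \big(F_{n,A} \cap F_{n,A}^+ \cap F_{n,A}^-\big)\Big) \\ &= 1 - P^n\Big(\bigcup_{A \in \tilde P_1^0 \cup \tilde P_{-1}^0} \big(Z^n \setminus F_{n,A}\big) \;\cup \bigcup_{A \in \tilde P_1^+ \cup \tilde P_{-1}^+} \big((Z^n \setminus F_{n,A}) \cup (Z^n \setminus F_{n,A}^+) \cup (Z^n \setminus F_{n,A}^-)\big)\Big) \\ &\ge 1 - \sum_{j \in \{0,+\}} \sum_{A \in \tilde P_1^j \cup \tilde P_{-1}^j} P^n\big(Z^n \setminus F_{n,A}\big) - \sum_{A \in \tilde P_1^+ \cup \tilde P_{-1}^+} P^n\big(Z^n \setminus F_{n,A}^+\big) - \sum_{A \in \tilde P_1^+ \cup \tilde P_{-1}^+} P^n\big(Z^n \setminus F_{n,A}^-\big) \\ &\ge 1 - 3Me^{-2(\delta/M)^2 n}. \end{aligned}$$

With the help of Theorem 12 we can now investigate how to choose the parameter sequence $(c_n)$ for a given universal kernel:

Theorem 15 Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$ such that the covering numbers of $(X, d_k)$ fulfill $N\big((X, d_k), \varepsilon\big) \in O(\varepsilon^{-\alpha})$ for some $\alpha > 0$. Suppose that we have a positive sequence $(c_n)$ with $c_n \in O(n^{\beta - 1})$ for some $0 < \beta < \frac{1}{\alpha}$ and $n c_n \to \infty$. Then for all Borel probability measures $P$ on $X \times Y$ with $q, p \in (0, 1/2)$ and all $\varepsilon > 0$ we have

$$\lim_{n \to \infty} \mathrm{Pr}^*\Big(\Big\{T \in (X \times Y)^n : R_P\big(f_T^{2,k,c_n}\big) \le R_P + 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \varepsilon\Big\}\Big) = 1.$$

Proof Let $\gamma := \frac{1 - \alpha\beta}{4(1+\alpha)} > 0$ and $\delta_n := n^{-\gamma}$. By Theorem 12 there exist $c^* > 0$ and $\delta^* > 0$ such that for all $c \ge c^*$, $0 < \delta \le \delta^*$ and all $n \ge 1$ we have

$$\mathrm{Pr}^*\Big(\Big\{T \in (X \times Y)^n : R_P\big(f_T^{2,k,c/n}\big) \le R_P + 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \varepsilon\Big\}\Big) \ge 1 - 3Me^{-2(\delta/M)^2 n},$$

where $M := N\big((X, d_k), \delta/\sqrt{c}\big)$. Since $\delta_n \to 0$ and $n c_n \to \infty$ we may fix an $n_0 \ge 1$ such that for all $n \ge n_0$ we have both $\delta_n \le \delta^*$ and $n c_n \ge c^*$. This yields

$$\mathrm{Pr}^*\Big(\Big\{T \in (X \times Y)^n : R_P\big(f_T^{2,k,c_n}\big) \le R_P + 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \varepsilon\Big\}\Big) \ge 1 - 3M_n e^{-2(\delta_n/M_n)^2 n},$$

where $M_n := N\big((X, d_k), \delta_n/\sqrt{n c_n}\big)$. This implies $M_n \in O\big(n^{\alpha(\gamma + \beta/2)}\big)$, and hence we easily check that $(\delta_n/M_n)^2\, n$ grows at least like a constant multiple of $n^{2(1+\alpha)\gamma}$, so that $3M_n e^{-2(\delta_n/M_n)^2 n} \to 0$. Thus the assertion follows.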
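The exponent bookkeeping in the proof of Theorem 15 can be sketched as follows. This is only an illustration under the assumption that the covering numbers satisfy $M_n = C\, n^{\alpha(\gamma + \beta/2)}$ exactly, with a hypothetical constant `cover_const`; the function names are not from the paper.

```python
import math


def gamma_exponent(alpha: float, beta: float) -> float:
    """gamma = (1 - alpha*beta) / (4 (1 + alpha)), as chosen in the proof."""
    if not (alpha > 0 and 0 < beta < 1.0 / alpha):
        raise ValueError("need alpha > 0 and 0 < beta < 1/alpha")
    return (1.0 - alpha * beta) / (4.0 * (1.0 + alpha))


def decay_exponent(alpha: float, beta: float) -> float:
    """Power of n in (delta_n / M_n)^2 * n for delta_n = n^-gamma."""
    g = gamma_exponent(alpha, beta)
    return 1.0 - 2.0 * g - 2.0 * alpha * (g + beta / 2.0)


def failure_bound(n: int, alpha: float, beta: float, cover_const: float = 1.0) -> float:
    """Evaluate 3 M_n exp(-2 (delta_n / M_n)^2 n) under the assumed M_n."""
    g = gamma_exponent(alpha, beta)
    delta_n = n ** (-g)
    m_n = cover_const * n ** (alpha * (g + beta / 2.0))
    return 3.0 * m_n * math.exp(-2.0 * (delta_n / m_n) ** 2 * n)


if __name__ == "__main__":
    for n in (10 ** 4, 10 ** 6):
        print(n, failure_bound(n, alpha=1.0, beta=0.5))
```

The value of `decay_exponent` coincides with $2(1+\alpha)\gamma > 0$, which is why the failure probability tends to zero even though $M_n$ grows polynomially.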

Corollary 16 Under the assumptions of Theorem 15 the 2-SMC with sequence $(c_n)$ is consistent for all Borel probability measures $P$ on $X \times Y$ with $q = p < 1/2$. In particular this holds for all Borel probability measures with constant noise level.

Corollary 17 Let $X \subset \mathbb{R}^d$ be compact and $k$ be a Gaussian RBF kernel on $X$. Let $0 < \beta < \frac{1}{d}$ and $(c_n)$ be a positive sequence with $c_n \in O(n^{\beta - 1})$ and $n c_n \to \infty$. Then the 2-SMC with kernel $k$ and sequence $(c_n)$ is consistent for all Borel probability measures $P$ on $X \times Y$ with $q = p < 1/2$.

Proof Let $\sigma > 0$ and $k(x, y) := \exp\big(-\sigma^2 \|x - y\|_2^2\big)$. Since $1 - e^{-t} \le t$ for all $t \ge 0$ we observe

$$d_k(x, y) = \sqrt{2 - 2\exp\big(-\sigma^2 \|x - y\|_2^2\big)} \le \sqrt{2}\,\sigma\,\|x - y\|_2.$$

This yields $N\big((X, d_k), \varepsilon\big) \le N\big((X, \|\cdot\|_2), \frac{\varepsilon}{\sqrt{2}\sigma}\big)$ and thus $N\big((X, d_k), \varepsilon\big) \in O(\varepsilon^{-d})$ (cf. Carl & Stephani, 1990, p. 9).

For the classification problems we have considered up to now we usually may not expect to obtain a large margin for sample sizes growing to infinity. In the following we restrict ourselves to distributions that guarantee a fixed and strictly positive margin for all training sets. Of course, these classification problems must have a deterministic supervisor, i.e. a noise level that vanishes almost everywhere. In general, additional assumptions are required. Using universal kernels these reduce to a simple geometric condition:

Theorem 18 Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$. Suppose that we have a Borel probability measure $P$ on $X \times Y$ with a deterministic supervisor and with classes $B_1(P)$, $B_{-1}(P)$ which have strictly positive distance, i.e. $d\big(B_1(P), B_{-1}(P)\big) > 0$. Then $k$ separates $B_1(P)$ and $B_{-1}(P)$ with margin $\gamma > 0$ and for all $c > 0$, $\varepsilon > 0$ and $n \ge mM$ we have

$$\mathrm{Pr}^*\Big(\big\{T \in (X \times Y)^n : R_{P,S}\big(f_T^{2,k,c}\big) \le \varepsilon\big\}\Big) \ge 1 - Me^{-\frac{\varepsilon n}{2M} + m}.$$

Here, $M := N\big((X, d_k), \gamma/2\big)$ is the covering number of $(X, d_k)$ and $m := \frac{4}{c\gamma^2} + 1$.

Proof Since $(X, d_k)$ is precompact and $d\big(B_1(P), B_{-1}(P)\big) > 0$ both $B_i(P)$ are compact, too. Thus they can be separated with margin $\gamma > 0$ by Corollary 6. In analogy to Lemma 13 we now construct partitions $P_i$ of $B_i(P)$, $i \in \{-1, 1\}$, such that $\operatorname{diam}_{d_k}(A) \le \gamma/2$ for all $A \in P_i$ and $|P_1| + |P_{-1}| \le M$. We define $\tilde P_i := \{A \in P_i : P_X(A) > \frac{\varepsilon}{M}\}$ for $i \in \{-1, 1\}$.
Moreover, for $n \ge mM$ and $A \in \tilde P_1 \cup \tilde P_{-1}$ let

$$F_{n,A} := \Big\{\big((x_1, y_1), \dots, (x_n, y_n)\big) \in (X \times Y)^n : |\{l : x_l \in A\}| \ge m\Big\}, \qquad F_n := \bigcap_{A \in \tilde P_1 \cup \tilde P_{-1}} F_{n,A}.$$

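For $n \ge mM$ the Chernoff-Okamoto inequality (see e.g. Dudley, 1978) then yields

$$P^n\big((X \times Y)^n \setminus F_{n,A}\big) \le \exp\Big(-\frac{\big(\frac{\varepsilon}{M} n - m + 1\big)^2}{2n\,\frac{\varepsilon}{M}\big(1 - \frac{\varepsilon}{M}\big)}\Big) \le \exp\Big(-\frac{\big(\frac{\varepsilon}{M}\big)^2 n^2 - 2\,\frac{\varepsilon}{M}\,n\,(m-1) + (m-1)^2}{2\,\frac{\varepsilon}{M}\,n}\Big) \le \exp\Big(-\frac{\varepsilon n}{2M} + m\Big)$$

and thus $P^n(F_n) \ge 1 - Me^{-\frac{\varepsilon n}{2M} + m}$ for all $n \ge mM$. Hence it suffices to show that for all $T = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in F_n$ the decision function $f_T^{2,k,c}$ classifies $\bigcup_{A \in \tilde P_1} A$ and $\bigcup_{A \in \tilde P_{-1}} A$ correctly. To see this we fix a feature map $\Phi : X \to H$ of $k$. For brevity's sake the unique solution of problem (3) is denoted by $(w, b, \xi)$. Furthermore, $(w^*, b^*) \in S_H \times \mathbb{R}$ is a hyperplane that separates $\Phi(B_1(P))$ and $\Phi(B_{-1}(P))$ with margin $\gamma > 0$. Then we have $y_l\big(\langle w^*, \Phi(x_l)\rangle + b^*\big) \ge \gamma$ for all $l = 1, \dots, n$ and therefore we obtain

$$\langle w, w\rangle + c\sum_{l=1}^{n} \xi_l^2 \le \Big\langle \frac{w^*}{\gamma}, \frac{w^*}{\gamma}\Big\rangle = \frac{1}{\gamma^2}.$$

In particular this implies $\|w\| \le 1/\gamma$. Now, let us suppose that there exists a misclassified point $z \in \bigcup_{A \in \tilde P_1} A \cup \bigcup_{A \in \tilde P_{-1}} A$. Without loss of generality we may assume that $z \in \bigcup_{A \in \tilde P_1} A$. Hence there is an $A \in \tilde P_1$ with $z \in A$ and for this $A$ there exist mutually different indexes $l_1, \dots, l_m$ such that $x_{l_j} \in A$. In particular this yields $d_k(x_{l_j}, z) \le \gamma/2$ for all $j = 1, \dots, m$. Since $\langle w, \Phi(z)\rangle + b \le 0$ we thus obtain

$$1 - \xi_{l_j} \le \langle w, \Phi(x_{l_j})\rangle + b = \langle w, \Phi(x_{l_j}) - \Phi(z)\rangle + \langle w, \Phi(z)\rangle + b \le \|w\|\, d_k(x_{l_j}, z) \le \frac{1}{2}.$$

Hence we have $\xi_{l_j} \ge 1/2$ and this leads to the contradiction

$$\frac{1}{\gamma^2} \ge \langle w, w\rangle + c\sum_{l=1}^{n} \xi_l^2 \ge c\sum_{j=1}^{m} \xi_{l_j}^2 \ge \frac{cm}{4} = \frac{c}{4}\Big(\frac{4}{c\gamma^2} + 1\Big) > \frac{1}{\gamma^2}.$$

Therefore the assertion follows.

Theorem 18 shows that the 2-SMC works well with a fixed weight factor $c$ whenever it treats a classification problem that ensures a large margin. We believe that these distributions are also the only ones for which a fixed $c$ is suitable. Our conjecture is based on the observation that the constant $c$ controls the Lipschitz constant of the solution of (3) with respect to the metric $d_k$: if we have a classification problem that does not guarantee a large margin the Lipschitz constant may grow like $\sqrt{n}$. The proofs of this section indicate that this may be too fast since for large sample sizes the solution need not be almost constant on each element of the partitions, i.e. overfitting may occur.

In the proof of the above theorem we only used elements of the partitions $P_i$ whose probability was larger than or equal to $\varepsilon/M$.
If we extend our considerations to all elements with strictly positive probability we obtain the following theorem: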

Theorem 19 Let $(X, d)$ be a compact metric space and $k$ a universal kernel on $X$. Suppose that we have a Borel probability measure $P$ on $X \times Y$ with a deterministic supervisor and with classes $B_1(P)$, $B_{-1}(P)$ which have strictly positive distance, i.e. $d\big(B_1(P), B_{-1}(P)\big) > 0$. Then $k$ separates $B_1(P)$ and $B_{-1}(P)$ with margin $\gamma > 0$ and for all $c > 0$ there exists a constant $\rho > 0$ such that for all $n \ge \big(\frac{4}{c\gamma^2} + 1\big)\, N\big((X, d_k), \gamma/2\big)$ we have

$$\mathrm{Pr}^*\Big(\big\{T \in (X \times Y)^n : R_{P,S}\big(f_T^{2,k,c}\big) = 0\big\}\Big) \ge 1 - e^{-\rho n}.$$

Up to now we have only treated universal kernels. One may ask whether other classes of kernels are also suitable for the classification problems considered in this work. One type often used are polynomial kernels of the form $(\langle \cdot, \cdot\rangle + c)^m$, $c \ge 0$, $m \in \mathbb{N}$, on a subset $X$ of $\mathbb{R}^d$. For these kernels the functions generated by the 2-SMC are polynomials on $X$ of degree up to $m$. Thus the next proposition shows that these kernels are not even capable of solving the problems of Theorems 18 and 19:

Proposition 20 Let $P_n^d$ be the set of all polynomials on $X := [0, 1]^d$ whose degree is less than $n + 1$. Then for all $\varepsilon > 0$ there exists a Borel probability measure $P$ on $X \times Y$ with $d\big(B_1(P), B_{-1}(P)\big) > 0$, $R_P = 0$ and

$$\inf\{R_P(\operatorname{sign} \circ f) : f \in P_n^d\} \ge \frac{1}{2} - \varepsilon.$$

Proof We first treat the case $d = 1$. We fix an integer $m \ge (3n + 2)/\varepsilon$ and let $I_i := \big[\frac{i + \varepsilon}{m}, \frac{i + 1 - \varepsilon}{m}\big]$, $i = 0, \dots, m - 1$. Denoting the Lebesgue measure on $I_i$ by $\lambda_{I_i}$ let $P_X := \frac{m}{1 - 2\varepsilon} \cdot \frac{1}{m} \sum_{i=0}^{m-1} \lambda_{I_i} = (1 - 2\varepsilon)^{-1} \sum_{i=0}^{m-1} \lambda_{I_i}$. Moreover, we define a deterministic supervisor $P(\,\cdot \mid \cdot\,)$ by $P(y = 1 \mid x) := 1$ for $x \in I_{2i}$, $i = 0, \dots, \lfloor (m+1)/2 \rfloor$, and $P(y = -1 \mid x) := 1$ otherwise. For a fixed polynomial $f \in P_n^d$ we denote its mutually different and ordered zeroes in $(0, 1)$ by $x_1 < \dots < x_k$, $k \le n$. For brevity's sake let $x_0 := 0$ and $x_{k+1} := 1$. Finally, we define $a_j := |\{i : I_i \subset [x_j, x_{j+1}]\}|$ for $j = 0, \dots, k$. Then by an easy observation we get $\sum_{j=0}^{k} a_j \ge m - k \ge m - n$. Moreover, at most $(a_j + 1)/2$ intervals $I_i$ are correctly classified on $[x_j, x_{j+1}]$ by the function $\operatorname{sign} \circ f$. Hence at least

$$\sum_{j=0}^{k} \Big\lfloor \frac{a_j}{2} \Big\rfloor \ge \sum_{j=0}^{k} \Big(\frac{a_j}{2} - 1\Big) \ge \frac{m - n}{2} - (k + 1) \ge \frac{m - 3n - 2}{2}$$

intervals $I_i$ are not correctly classified on $[0, 1]$ by $\operatorname{sign} \circ f$.
Since $P_X(I_i) = 1/m$ we thus obtain

$$R_P(\operatorname{sign} \circ f) \ge \frac{1}{m} \sum_{j=0}^{k} \Big\lfloor \frac{a_j}{2} \Big\rfloor \ge \frac{1}{2} - \frac{3n + 2}{2m} \ge \frac{1}{2} - \varepsilon.$$

To treat the case $d > 1$ we embed $[0, 1]$ into $[0, 1]^d$ via $t \mapsto (t, 0, \dots, 0)$. Considering the above distribution $P$ embedded into $[0, 1]^d$ we then get the assertion, since polynomials in $d$ variables restricted to $[0, 1]$ embedded into $[0, 1]^d$ as above are essentially polynomials in one variable.
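The one-dimensional construction in the proof of Proposition 20 can be simulated directly. The sketch below is an illustration only: it assumes a classifier whose sign flips exactly at a given list of zeros (as $\operatorname{sign} \circ f$ does for a polynomial with simple zeros), and the helper names `build_intervals` and `risk_of_sign_changes` are hypothetical.

```python
import bisect


def build_intervals(m: int, eps: float) -> list:
    """Intervals I_i = [(i + eps)/m, (i + 1 - eps)/m], each carrying mass 1/m."""
    return [((i + eps) / m, (i + 1 - eps) / m) for i in range(m)]


def risk_of_sign_changes(zeros, m: int, eps: float, start_sign: int = 1) -> float:
    """Risk of a classifier that changes sign exactly at the given zeros.

    Interval I_i is labelled +1 for even i and -1 for odd i; the mass
    inside each interval is uniform.
    """
    zeros = sorted(zeros)
    risk = 0.0
    for i, (lo, hi) in enumerate(build_intervals(m, eps)):
        label = 1 if i % 2 == 0 else -1
        # Split the interval at the zeros that fall strictly inside it.
        cuts = [lo] + [z for z in zeros if lo < z < hi] + [hi]
        wrong = 0.0
        for a, b in zip(cuts, cuts[1:]):
            flips = bisect.bisect_right(zeros, (a + b) / 2.0)
            sign = start_sign if flips % 2 == 0 else -start_sign
            if sign != label:
                wrong += b - a
        risk += (wrong / (hi - lo)) / m
    return risk


if __name__ == "__main__":
    n, eps = 3, 0.1
    m = 110  # an integer m >= (3n + 2) / eps
    print(risk_of_sign_changes([0.25, 0.5, 0.75], m, eps))
```

With at most $n$ sign changes the classifier cannot follow the $m$ alternating intervals, so the printed risk stays close to $1/2$ although the Bayes risk of the problem is zero.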

5. The 1-norm soft margin classifier

We now consider the 1-norm soft margin classifier. Again we fix a kernel $k$ on $X$ with feature map $\Phi : X \to H$. For a training set $T = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in (X \times Y)^n$ and $c > 0$ we denote a solution of the optimization problem

$$\begin{aligned} \text{minimize} \quad & \langle w, w\rangle + c\sum_{i=1}^{n} \xi_i \quad \text{over } w, b, \xi \\ \text{subject to} \quad & y_i\big(\langle w, \Phi(x_i)\rangle + b\big) \ge 1 - \xi_i, \quad i = 1, \dots, n \\ & \xi_i \ge 0, \quad i = 1, \dots, n \end{aligned} \qquad (10)$$

by $\big(w_T^{1,k,c}, b_T^{1,k,c}\big) \in H \times \mathbb{R}$. An algorithm $\mathcal{C}_k^{1,(c_n)}$ that provides the decision function

$$f_T^{1,k,c_n}(x) := \operatorname{sign}\big(\langle w_T^{1,k,c_n}, \Phi(x)\rangle + b_T^{1,k,c_n}\big), \qquad x \in X,$$

for every training set $T \in (X \times Y)^n$ is called a 1-norm soft margin classifier (1-SMC) with kernel $k$ and parameter sequence $(c_n)$. For an introduction to the 1-SMC as well as for implementation techniques we refer to Cristianini & Shawe-Taylor (2000, Ch. 6 and 7) and Vapnik (1998, Ch. 10). Burges & Crisp (2000) proved that in general the optimization problem (10) has no unique solution. Although the 1-SMC only has to construct an arbitrary solution, we show in this section that it enjoys all the properties of the 2-SMC proved in this work. We begin with a statement which is analogous to Theorem 12:

Theorem 21 Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$. Then for all Borel probability measures $P$ on $X \times Y$ with $q, p \in (0, 1/2)$ and all $\varepsilon > 0$ there exist $c^* > 0$ and $\delta^* > 0$ such that for all $c \ge c^*$, $0 < \delta \le \delta^*$ and all $n \ge 1$ we have

$$\mathrm{Pr}^*\Big(\Big\{T \in (X \times Y)^n : R_P\big(f_T^{1,k,c/n}\big) \le R_P + 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \varepsilon\Big\}\Big) \ge 1 - 3Me^{-2(\delta/M)^2 n},$$

where $M := 4N\big((X, d_k), \delta/\sqrt{c}\big)$ is essentially the covering number of $X$ with respect to the metric $d_k$ which is induced by the kernel $k$.

Sketch of the proof Since the proof is very similar to that of Theorem 12 we only point out the main differences besides adjusting the constants. Firstly the vector $w^* \in H$ has to be chosen in a different manner, namely it has to fulfill

$$\langle w^*, \Phi(x)\rangle \in \begin{cases} i \cdot [1,\, 1+\delta] & \text{if } x \in K_i^j \\ [-(1+\delta),\, 1+\delta] & \text{otherwise.} \end{cases}$$

This condition also enforces the second modification since Lemma 13 does not hold anymore in this setting. Indeed, we cannot guarantee that the definition of the $P_i^j$'s in Lemma 13 implies that they are mutually disjoint. Instead, we only obtain $|P_i^j| \le N\big((X, d_k), \delta/\sqrt{c}\big)$ and thus

$$\sum_{i \in \{-1,1\}} \sum_{j \in \{0,+\}} |P_i^j| \le 4N\big((X, d_k), \sigma\big).$$

The latter explains the definition of $M$, which is different from that of Theorem 12.

Following the proof of Theorem 15 we obtain an analogous result for the 1-SMC:

Theorem 22 Let $(X, d)$ be a compact metric space and $k$ be a universal kernel on $X$ such that the covering numbers of $(X, d_k)$ fulfill $N\big((X, d_k), \varepsilon\big) \in O(\varepsilon^{-\alpha})$ for some $\alpha > 0$. Suppose that we have a positive sequence $(c_n)$ with $c_n \in O(n^{\beta - 1})$ for some $0 < \beta < \frac{1}{\alpha}$ and $n c_n \to \infty$. Then for all Borel probability measures $P$ on $X \times Y$ with $q, p \in (0, 1/2)$ and all $\varepsilon > 0$ we have

$$\lim_{n \to \infty} \mathrm{Pr}^*\Big(\Big\{T \in (X \times Y)^n : R_P\big(f_T^{1,k,c_n}\big) \le R_P + 4\,\frac{p-q}{1-2q}\,P_X(X_+) + \varepsilon\Big\}\Big) = 1.$$

To complete our considerations of noisy problems we mention that for the 1-SMC with Gaussian RBF kernel we obtain the following consistency result which has already been proved for the 2-SMC:

Corollary 23 Let $X \subset \mathbb{R}^d$ be compact and $k$ be a Gaussian RBF kernel on $X$. Let $0 < \beta < \frac{1}{d}$ and $(c_n)$ be a positive sequence with $c_n \in O(n^{\beta - 1})$ and $n c_n \to \infty$. Then the 1-SMC with kernel $k$ and sequence $(c_n)$ is consistent for all Borel probability measures $P$ on $X \times Y$ with $q = p < 1/2$.

In the presence of large margins it turns out that we may fix the weight factor analogously to the 2-SMC. For brevity's sake we only state a result that is similar to Theorem 18. However, a result that is analogous to Theorem 19 also holds.

Theorem 24 Let $(X, d)$ be a compact metric space and $k$ a universal kernel on $X$. Suppose that we have a Borel probability measure $P$ on $X \times Y$ with a deterministic supervisor and with classes $B_1(P)$, $B_{-1}(P)$ which have strictly positive distance, i.e. $d\big(B_1(P), B_{-1}(P)\big) > 0$. Then $k$ separates $B_1(P)$ and $B_{-1}(P)$ with margin $\gamma > 0$ and for all $c > 0$, $\varepsilon > 0$ and $n \ge mM$ we have

$$\mathrm{Pr}^*\Big(\big\{T \in (X \times Y)^n : R_{P,S}\big(f_T^{1,k,c}\big) \le \varepsilon\big\}\Big) \ge 1 - Me^{-\frac{\varepsilon n}{2M} + m},$$

where $M := N\big((X, d_k), \gamma/2\big)$ is the covering number of $(X, d_k)$ and $m := \frac{2}{c\gamma^2} + 1$.

Sketch of the proof The proof is completely analogous to that of Theorem 18. However, since the slack variables are not squared we obtain

$$c\sum_{j=1}^{m} \xi_{l_j} \ge \frac{cm}{2}$$

in the last estimate of the proof of Theorem 18 and this yields the slightly better value of $m$.

Finally we mention that Proposition 20 can also be applied when polynomial kernels are used.
Thus problems that cannot be treated well with a fixed polynomial kernel and the 2-SMC cannot be treated well with the 1-SMC and the same kernel either, and vice versa.
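Corollaries 17 and 23 both rest on the estimate $d_k(x, y) \le \sqrt{2}\,\sigma\,\|x - y\|_2$ for the Gaussian RBF kernel. A minimal sketch of this estimate follows, assuming that $d_k$ is the feature space distance $\sqrt{k(x,x) - 2k(x,y) + k(y,y)}$; the function names are illustrative only.

```python
import math


def gaussian_kernel(x, y, sigma: float) -> float:
    """k(x, y) = exp(-sigma^2 * ||x - y||_2^2)."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-(sigma ** 2) * squared_dist)


def kernel_metric(x, y, sigma: float) -> float:
    """d_k(x, y) = sqrt(2 - 2 k(x, y)), since k(x, x) = k(y, y) = 1."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * gaussian_kernel(x, y, sigma)))


def euclidean_radius(eps: float, sigma: float) -> float:
    """Euclidean radius eps / (sqrt(2) sigma) that guarantees d_k-radius eps."""
    return eps / (math.sqrt(2.0) * sigma)


if __name__ == "__main__":
    x, y, sigma = (0.0, 0.0), (1.0, 2.0), 0.5
    print(kernel_metric(x, y, sigma), math.sqrt(2.0) * sigma * math.dist(x, y))
```

Any Euclidean cover of $X$ with radius `euclidean_radius(eps, sigma)` is therefore also a $d_k$-cover with radius `eps`, which is how the covering numbers of $(X, d_k)$ inherit the order $O(\varepsilon^{-d})$.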

6. The maximal margin hyperplane classifier

We finally consider the maximal margin classifier. Again we fix a kernel $k$ on $X$ with feature map $\Phi : X \to H$. For a training set $T = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in (X \times Y)^n$ we denote the unique solution of the optimization problem

$$\begin{aligned} \text{minimize} \quad & \langle w, w\rangle \quad \text{over } w, b \\ \text{subject to} \quad & y_i\big(\langle w, \Phi(x_i)\rangle + b\big) \ge 1, \quad i = 1, \dots, n \end{aligned} \qquad (11)$$

by $\big(w_T^k, b_T^k\big) \in H \times \mathbb{R}$. An algorithm $\mathcal{C}_k$ that provides the decision function

$$f_T^k(x) := \operatorname{sign}\big(\langle w_T^k, \Phi(x)\rangle + b_T^k\big), \qquad x \in X,$$

for every training set $T$ is called a maximal margin classifier (MMC) with kernel $k$. For an introduction to the MMC as well as for implementation techniques and a geometric motivation we again refer to Cristianini & Shawe-Taylor (2000, Ch. 6 and 7) and Vapnik (1998, Ch. 10).

The MMC is assumed to work poorly in the absence of large margins (cf. Tong Zhang, 2001). Thus we only consider the setting of Theorem 18. We begin with a result similar to Theorem 18 and Theorem 24:

Theorem 25 Let $(X, d)$ be a compact metric space and $k$ a universal kernel on $X$. Suppose that we have a Borel probability measure $P$ on $X \times Y$ with a deterministic supervisor and with classes $B_1(P)$, $B_{-1}(P)$ which have strictly positive distance, i.e. $d\big(B_1(P), B_{-1}(P)\big) > 0$. Then $k$ separates $B_1(P)$ and $B_{-1}(P)$ with margin $\gamma > 0$ and for all $\varepsilon > 0$ and $M := N\big((X, d_k), \gamma/2\big)$ we have

$$\mathrm{Pr}^*\Big(\big\{T \in (X \times Y)^n : R_{P,S}\big(f_T^k\big) \le \varepsilon\big\}\Big) \ge 1 - Me^{n \ln\left(1 - \frac{\varepsilon}{2M}\right)}.$$

Sketch of the proof We repeat the construction of the proof of Theorem 18 with $m := 1$. An easy calculation then shows $P^n(F_n) \ge 1 - Me^{n \ln\left(1 - \frac{\varepsilon}{2M}\right)}$. Now suppose that for $T \in F_n$ we have an element $z \in \bigcup_{A \in \tilde P_1} A$ misclassified by $f_T^k$. Hence there exist an $A \in \tilde P_1$ with $z \in A$ and a sample $(x_l, y_l)$ of $T$ with $x_l \in A$. Then the following estimate yields a contradiction:

$$0 \ge \langle w_T^k, \Phi(z)\rangle + b_T^k = \langle w_T^k, \Phi(z) - \Phi(x_l)\rangle + \langle w_T^k, \Phi(x_l)\rangle + b_T^k \ge -\|w_T^k\|\, d_k(x_l, z) + 1 \ge \frac{1}{2}.$$

Therefore the assertion follows.

The above result and its proof indicate that in the presence of large margins it may be more suitable to use the MMC instead of a soft margin algorithm. We mention that an estimate which is very similar to Theorem 25 can be obtained using data-dependent margin-based bounds of Shawe-Taylor et al. (1998). To compare both we first state a corollary:

Corollary 26 Let $k_\sigma$ be the Gaussian RBF kernel with parameter $\sigma$ on the unit ball $X := B_{\ell_2^d}$ of the $d$-dimensional Euclidean space. Let $P$ be a Borel probability measure on $X \times Y$ which can be separated by $k_\sigma$ with margin $\gamma > 0$. Then for all $\delta \in (0, 1)$ and all $n \ge 2\big(\frac{16\sigma}{\gamma}\big)^d$ we have

$$\mathrm{Pr}^*\Big(\Big\{T \in (X \times Y)^n : R_{P,S}\big(f_T^k\big) \le \frac{4}{n}\Big(\frac{16\sigma}{\gamma}\Big)^d \Big(d \ln\frac{16\sigma}{\gamma} + \ln\frac{2}{\delta}\Big)\Big\}\Big) \ge 1 - \delta.$$

Proof As already shown in the proof of Corollary 17 we have

$$N\big((X, d_k), \varepsilon\big) \le N\Big((X, \|\cdot\|_2), \frac{\varepsilon}{\sqrt{2}\sigma}\Big) \le 2\Big(\frac{5\sqrt{2}\,\sigma}{\varepsilon}\Big)^d \le 2\Big(\frac{8\sigma}{\varepsilon}\Big)^d$$

and thus $M := N\big((X, d_k), \gamma/2\big) \le 2\big(\frac{16\sigma}{\gamma}\big)^d$. Now let $\varepsilon := 2M\big(1 - (\delta/M)^{1/n}\big)$ for $n \ge 2\big(\frac{16\sigma}{\gamma}\big)^d$, i.e. $\delta = M\big(1 - \frac{\varepsilon}{2M}\big)^n$. Since $\varepsilon < \frac{2M}{n} \ln\frac{M}{\delta}$ Theorem 25 yields the assertion.

Corollary 26 shows that the learning curve of the MMC is of order $O(n^{-1})$ provided that $P$ guarantees a large margin. These conditions also allow the application of margin-based bounds on generalization proved by Shawe-Taylor et al. (1998) (cf. also Bartlett & Shawe-Taylor, 1999; Cristianini & Shawe-Taylor, 2000). We then obtain that the learning curve is of order $O(n^{-1} \log^2 n)$. However, to compare both results one also has to consider the constants that arise, since the sample size only varies in a typical range. Here we observe that the influence of the margin $\gamma$ is essentially of order $O(\gamma^{-2})$ in the estimates of Shawe-Taylor et al. (1998) while Corollary 26 shows that our estimates are essentially influenced by order $O(\gamma^{-d})$, where $d$ is the dimension of the input space $X$. We thus suppose that only for small input dimensions $d$ our estimates are more suitable to treat realistic sample sizes.

7. Conclusions

The aim of this paper has been to investigate which kind of distributions could be classified well by support vector machines (SVM's). It has turned out that the ability of the kernel to approximate arbitrary continuous functions plays a fundamental role for this question. Since the resulting function classes represented by the classifier are very large there always exists the risk of overfitting. However, using soft margin support vector machines with specific sequences of regularization parameters this risk can be controlled at least for simple noise models, e.g.
models with constant noise level. In particular, the restriction to large margin problems in Tong Zhang (2001, p. 442) has been significantly weakened.

Since the ansatz of this paper is new, many questions remain open and are worth further investigation. Firstly it is interesting whether the soft margin algorithms yield arbitrarily good generalization for all distributions. Up to now our results only provide consistency if the noise level is constant. However, approximating an arbitrary noise level by step functions it seems possible that the soft margin algorithms with certain parameter sequences are universally consistent. Of course, the universality of kernels, which roughly speaking enables us to do almost everything with induced functions on compact subsets