On the Influence of the Kernel on the Consistency of Support Vector Machines

Size: px
Start display at page:

Download "On the Influence of the Kernel on the Consistency of Support Vector Machines"

Transcription

1 Joural of achie Learig Research Submitted 08/01; Published 12/01 O the Ifluece of the Kerel o the Cosistecy of Support Vector achies Igo Steiwart athematisches Istitut Friedrich-Schiller-Uiversität Erst-Abbe-Platz Jea, Germay steiwart@miet.ui-jea.de Editor: Berhard Schölkopf for HP:// Abstract I this article we study the geeralizatio abilities of several classifiers of support vector machie SV type usig a certai class of kerels that we call uiversal. It is show that the soft margi algorithms with uiversal kerels are cosistet for a large class of classificatio problems icludig some kid of oisy tasks provided that the regularizatio parameter is chose well. I particular we derive a simple sufficiet coditio for this parameter i the case of Gaussia RBF kerels. O the oe had our cosideratios are based o a ivestigatio of a approximatio property the so-called uiversality of the used kerels that esures that all cotiuous fuctios ca be approximated by certai kerel expressios. his approximatio property also gives a ew isight ito the role of kerels i these ad other algorithms. O the other had the results are achieved by a precise study of the uderlyig optimizatio problems of the classifiers. Furthermore, we show cosistecy for the maximal margi classifier as well as for the soft margi SV s i the presece of large margis. I this case it turs out that also costat regularizatio parameters esure cosistecy for the soft margi SV s. Fially we prove that eve for simple, oise free classificatio problems SV s with polyomial kerels ca behave arbitrarily badly. Keywords: Computatioal learig theory, patter recogitio, PAC model, support vector machies, kerel methods 1. Itroductio Support vector machies comprise a class of learig algorithms origially itroduced for patter recogitio problems. Although their developmet was motivated by results of statistical learig theory the kow bouds o their geeralizatio performace are ot fully satisfactory. I particular, the ifluece of the chose kerel is far from beig completely uderstood. he aim of this paper is to give a ew isight ito the role of the kerels. Our cosideratios are maily based o a certai approximatio property of various stadard kerels that geerate fuctio classes with ifiite VC-dimesio. Sice i this case classical Vapik-Chervoekis theory fails to be applicable for support vector machies, other cocepts such as data depedet structural risk miimizatio, e.g. i terms of the observed margi, were itroduced cf. Shawe-aylor et al., 1998; Bartlett & Shawe-aylor, 1999; Cristiaii & Shawe-aylor, 2000, chap. 2. he latter usually eeds large margis o the c 2001 Steiwart.

2 Steiwart traiig sets to provide good bouds. It is, however, ope which distributios ad kerels guaratee this assumptio. A systematical study of this questio is the startig poit of this paper. he resultig techiques allow us to show ew bouds for the geeralizatio performace of several stadard support vector classifiers. We begi with a descriptio of the problem of patter recogitio cf. Vapik, 1998; Cristiaii & Shawe-aylor, Let X, d be a compact metric space 1, Y := { 1, 1} ad P be a probability measure o X Y, where X is equipped with the Borel σ-algebra. By disitegratio cf. Dudley, 1989, Lem there exists a map x P. x from X ito the set of all probability measures o Y such that P is the joit distributio of P. x x ad of the margial distributio P X of P o X. We call P.., which is i fact a regular coditioal probability, the supervisor. A classifier is a algorithm that costructs to every traiig set = x 1, y 1,..., x, y X Y a decisio fuctio f : X Y. I our cotext it is always assumed that is i.i.d. accordig to P, which itself is ukow. he the decisio fuctio f : X Y costructed by the classifier should guaratee a small probability for the misclassificatio of a example x, y radomly geerated accordig to P. Here, misclassificatio meas fx y. o make this precise, for a measurable fuctio f : X { 1, 1} we defie the risk of f by R P f := 1 {fx y} P dx, dy = P {x, y : fx y}. X Y Whe cosiderig oisy supervisors we caot expect that we obtai zero risk. Ideed, for B 1 P := { x X : P y = 1 x > P y = 1 x } B 1 P := { x X : P y = 1 x < P y = 1 x } B 0 P := { x X : P y = 1 x = P y = 1 x } ad a fuctio f 0 : X { 1, 1} with f 0 x = 1 if x B 1 P ad f 0 x = 1 if x B 1 P we have cf. Devroye et al., 1997, hm R P f 0 = if { R P f : f :X { 1, 1} measurable } = px P X dx, 1 where the oise level p : X R is defied by px := P y = 1 x for x B 1 P, px := P y = 1 x for x B 1 P ad px = 1/2 otherwise. Equatio 1 shows that o fuctio ca yield less risk tha f 0. he fuctio f 0 is called a optimal Bayes decisio rule ad we write R P := R P f 0. Now, a classifier C should guaratee with high probability that R P f is close to R P provided that is large eough. Here, f deotes the decisio fuctio costructed by C o the basis of. Asymptotically, this meas that R P f R P should hold i probability if. I this case the algorithm C is called cosistet for P cf. Devroye et al., 1997, Def If a classifier is cosistet for all distributios o 1. For mathematical otios see Sectio 2. X 68

3 O the Cosistecy of Support Vector achies X Y it is said to be uiversally cosistet. Although several algorithms such as the k- earest eighbour classifier for k ad k/ 0 cf. Devroye et al., 1997, hm. 6.4 are uiversally cosistet it is a ope questio whether support vector machies are uiversally cosistet for a particular choice of the free parameters. I this article we show that at least for a restricted class of distributios SV s are cosistet provided that the parameters are chose i a specific maer. I particular our results cover both the oise free case ad the case of costat oise level, i.e p p, p [0, 1/2. Our results are based o a ivestigatio of a certai approximatio property of kerels. Recall that the asatz of SV s is to solve specific optimizatio problems over the class of fuctios { w, Φ. : w H} or { w, Φ. + b : w H, b R}, where Φ : X H is a feature map of the used kerel. If this fuctio class is dese i CX we shall call the correspodig kerel uiversal cf. Def. 4. Roughly speakig this otio eables us to approximate the Bayes decisio rule i probability, a fact that is frequetly used i our proofs of cosistecy. Usig the approximatio theorem of Stoe-Weierstraß we show that kerels that ca be expaded i certai types of aylor or Fourier series are uiversal cf. Cor. 10 ad Cor. 11. I particular it turs out that the Gaussia RBF kerel is uiversal cf. Ex. 1. Besides the importace of the otio of uiversality i the cotext of cosistecy it also turs out that this cocept has strog implicatios for the geometric iterpretatio of the shape of the feature map cf. Cor. 6 ad the followig remark. Sice the class of fuctios implemeted by a SV with uiversal kerel is very rich the problem of overfittig ca always occur i the presece of oise. hus it is very importat to kow how to chose the regularizatio parameter. he secod part of this work is devoted to this questio. Here, we show i particular that for a soft margi SV with Gaussia RBF kerel o X R d the regularizatio sequece c = β 1, 0 < β < 1/d, yields cosistecy for all problems with costat oise level cf. Cor. 17 ad Cor. 23. hese results are of special iterest sice they show at the first time that SV s are able to lear oisy problems arbitrarily well. oreover, we prove that for problems that esure a large margi it suffices to use uiversal kerels ad a regularizatio parameter that is idepedet of the traiig set size cf. hm. 18, hm. 19 ad hm. 24. For this class of problems we also prove cosistecy for the maximal margi classifier cf. hm. 25. Fially, it turs out that eve for these simple, oise free classificatio problems SV s with polyomial kerels ca behave arbitrarily badly cf. Prop. 20. his work is orgaized as follows: we itroduce some mathematical otios i Sectio 2. I the third sectio we study the cocept of uiversal kerels. he followig sectios are devoted to applicatios of these kerels to support vector machies. We begi with the 2-soft margi classifier i Sectio 4. Here, cosistecy results for both oisy distributios ad problems esurig a large margi are proved. I the fifth sectio we show that these results also hold for the 1-soft margi algorithm. I the last sectio we discuss the maximal margi classifier i the presece of large margis. 2. Prelimiaries For a set X, a metric d o X is a fuctio d : X X [0, such that for all x, y, z X we have dx, y = dy, x ad dx, y dx, z+dz, y as well as dx, y = 0 if ad oly if x = y. We deote the closed ball with radius ε ad cetre x by B d x, ε := {y X : dx, y ε}. 69

4 Steiwart he coverig umbers of X are defied by { N X, d, ε := mi N { } : x 1,..., x with X i=1 } B d x i, ε for all ε > 0. he space X, d is precompact if ad oly if N X, d, ε is fiite for all ε > 0. oreover, X is called compact if every ope coverig of X has a fiite subcoverig. If the space X, d is complete, i.e. every Cauchy sequece coverges i X, the X is compact if ad oly if X is precompact. For give A, B X we deote the distace of A ad B by da, B := if dx, y. x A y B We ofte use that if A is closed, B is compact ad both sets are disjoit the da, B > 0 holds. A σ-algebra o a set X is a set of subsets of X that cotais ad is closed uder elemetary, coutable set operatios such as complemets ad coutable itersectios. For a metric space X, d the Borel σ-algebra is the smallest σ-algebra that cotais all ope sets. Let A be a σ-algebra o X. A subset A of X is called measurable if A A. We say that a fuctio f : X R is measurable if the pre-image of every Borel measurable B R is i A. Basic examples are the fuctios 1 A, where A A ad 1 A x = 1 if x A ad 1 A x = 0 otherwise. A probability measure P : A R + is a σ-additive fuctio with P = 0 ad P X = 1. If A is a Borel σ-algebra we call P a Borel probability measure. I this case P is said to be regular, if for all Borel measurable B X we have P B = sup{p K : K B, K compact}. If X, d is compact, the every Borel probability measure o X is regular cf. Dudley, 1989, p I this paper H always deotes a Hilbert space, i.e. a complete, ormed vector space edowed with a dot product.,. givig rise to its orm via x = x, x. Let B H := {x H : x 1} be the closed uit ball of H ad S H := {x H : x = 1} be its sphere. Recall, that every separable Hilbert space is isometrically isomorphic to the space of all 2-summable sequeces l 2. A commutative algebra A is a vector space equipped with a additioal associative ad commutative multiplicatio : A A A such that x y + z = x y + x z λx y = λx y holds for all x, y, z A ad λ R. A classical example of a algebra is the space CX of all cotiuous fuctios f : X R o the compact metric space X, d edowed with the usual supremum orm f := sup fx. x X he followig well-kow approximatio theorem of Stoe-Weierstraß cf. Pederse, 1988, Cor states that certai subalgebras of CX geerate the whole space. his result will be the key tool whe cosiderig approximatio properties of kerels i the ext sectio: 70

5 O the Cosistecy of Support Vector achies heorem 1 Let X, d be a compact metric space ad A CX be a algebra. he A is dese i CX if both A does ot vaish, i.e. for all x X there exists a f A with fx 0, ad A separates poits, i.e. for all x, y X with x y there exists a f A with fx fy. 3. Kerels I the followig let X, d be a metric space. A fuctio k : X X R is called a kerel o X if there exists a Hilbert space H ad a map Φ : X H with kx, y = Φx, Φy for all x, y X. We call Φ a feature map ad H a feature space of k. Note, that both H ad Φ are far from beig uique. However, for a give kerel there exists a caoical feature space with associated feature map, which is the so-called reproducig kerel Hilbert space RKHS cf. Cristiaii & Shawe-aylor, 2000, Ch. 3. Sice our asatz is maily based o specific series expasios of certai kerels we do ot eed to cosider these spaces. Let k be a kerel o X ad Φ : X H be a feature map of k. A fuctio f : X R is iduced by k if there exists a elemet w H such that f = w, Φ.. he ext lemma shows that this defiitio is idepedet of Φ ad H: Lemma 2 Let k : X X R be a kerel ad Φ 1 : X H 1, Φ 2 : X H 2 be two feature maps of k. he for all w 1 H 1 there exist w 2 H 2 with w 2 w 1 ad w 1, Φ 1 x = w 2, Φ 2 x for all x X. Proof Let H 1 := spa Φ 1X ad H 1 its orthogoal complemet i H 1. he w 1 H 1 ca be writte as w 1 = w 1 + w 1 with w 1 H 1 ad w 1 H 1. Give a x X we have w 1, Φ 1 x = 0 ad therefore we obtai w 1, Φ 1x = w 1, Φ 1 x for all x X. Now by the defiitio of H1 ad w1 = =1 w1. he for w 2 there exists a sequece w1 spa Φ 1 X with w 1 = m m=1 λ m Φ 1 x m ad l 2 l 1 1 we obtai l 2 w 1 =l 1 2 = = = := m m=1 λ m Φ 2 x l 2 m l 2 m i λ m λ i j =l 1 m=1 i=l 1 j=1 l 2 m l 2 m i λ m λ i j =l 1 m=1 i=l 1 j=1 l 2 w 2 =l 1 2. Φ 1x m, Φ 1 x i j Φ 2x m, Φ 2 x i j herefore, m =1 w2 m 1 is a Cauchy sequece ad hece coverges to w 2 := =1 w2 H 2. Clearly, we the have w 2 = w1 w 1. oreover, a easy calculatio similar to the cosideratio above shows w 1, Φ 1 x = w 2, Φ 2 x for all x X. m 71

6 Steiwart I the followig we oly cosider cotiuous kerels. he followig lemma provides some useful properties of this class: Lemma 3 Let k be a kerel o the metric space X, d ad Φ : X H be a feature map of k. he k is cotiuous if ad oly if Φ is cotiuous. I this case d k x, y := Φx Φy defies a semi-metric o X such that the idetity map id : X, d X, d k is cotiuous. If Φ is ijective d k is eve a metric. Proof Let us first suppose that k is cotiuous. Sice d k x, y = kx, x 2kx, y + ky, y we observe that d k x,. : X, d R is cotiuous for every x X. I particular, {y X : d k x, y < ε} is ope with respect to d ad therefore id : X, d X, d k is cotiuous. Furthermore, Φ : X, d k H is cotiuous ad hece Φ : X, d H is also cotiuous. Coversely, assume that Φ is cotiuous. Sice for all x, x, y, y X we have kx, y kx, y Φx, Φy Φy + Φx Φx, Φy Φx Φy Φy + Φy Φx Φx it is easily verified that k is also cotiuous. he metric d k ejoys the property that every iduced fuctio w, Φ. is Lipschitz cotiuous with respect to d k ad the Lipschitz costat is bouded from above by w. his fact turs out to be very importat i the proof of heorem 12 sice it allows us to cotrol the behaviour of solutios of SV s o subsets of small diameters. From the last lemma we kow i particular that for a cotiuous kerel every iduced fuctio is cotiuous. he followig defiitio plays a cetral role throughout this paper: Defiitio 4 A cotiuous kerel k o a compact metric space X, d is called uiversal if the space of all fuctios iduced by k is dese i CX, i.e. for every fuctio f CX ad every ε > 0 there exists a fuctio g iduced by k with f g ε. We also eed a weaker cocept. Let Φ : X H be a feature map of k ad A, B be disjoit subsets of X. We say that k separates A ad B with margi γ 0 if ΦA ad ΦB ca be separated by a hyperplae with margi γ, i.e. if there exists a pair w, b S H R such that w, Φx + b > γ for all x A ad w, Φy + b < γ for all y B. If γ = 0 we simply say that k separates A ad B. I this case the restrictio w S H is superfluous. By Lemma 2 both defiitios are idepedet of the feature map Φ. We say that the kerel k separates all fiite, resp. compact subsets if it separates all fiite, resp. 72

7 O the Cosistecy of Support Vector achies compact disjoit subsets of X. Note, that if k separates compact sets A ad B the it automatically separates them with a suitable margi γ > 0. oreover, it was show i Steiwart 2001a, Ex that there exists a cotiuous kerel that separates all fiite subsets but fails to separate all compact subsets. Before we ivestigate which kerels are uiversal we collect some useful properties of these kerels. Firstly, let X, d ad X, d be compact metric spaces, k be a uiversal kerel o X ad ι : X X be a cotiuous ad ijective map. he oe easily checks that kι., ι. is a uiversal kerel o X. oreover, we have kx, x > 0 for all x X sice ky, y = 0 implies gy = 0 for all iduced fuctios g. Sice all feature maps of k are cotiuous ad X is compact we may also restrict ourselves to separable feature spaces of k. he ext propositio is fudametal for our cosideratios of support vector machies: Propositio 5 Let X, d be a compact metric space ad k be a uiversal kerel o X. he for all compact ad mutually disjoit subsets K 1,..., K X, all α 1,..., α R ad all ε > 0 there exists a fuctio g iduced by k with g max i α i + ε such that g K α i 1 Ki ε, where K := i=1 K i ad g K deotes the restrictio of g to K. i=1 Proof Sice dk i, K j > 0 for all i j we obtai i=1 α i1 Ki CK. Sice this fuctio ca be exteded to a cotiuous fuctio f o X with f max i α i by the Lemma of Urysoh or by a direct costructio with the help of d the assertio follows. Corollary 6 Every uiversal kerel separates all compact subsets. Proof Let X, d be a compact metric space ad k be a uiversal kerel o X with feature map Φ : X H. Give two compact ad disjoit subsets K 1 ad K 1 of X there exists a iduced fuctio g = w, Φ. with g K 1 K 1 1 K1 1 K 1 < 1/2. his implies that w 1 w, 0 separates K 1 ad K 1 with margi 1 2 w. Although the previous corollary is a almost trivial cosequece of the otio of uiversality it has surprisig implicatios for the geometric iterpretatio of the shape of the feature map: let us suppose that we have a fiite subset {x 1,..., x } of X. he the above corollary esures that for every sequece of sigs y 1,..., y the correspodig traiig set ca be correctly separated by a hyperplae i the feature space. oreover, this ca eve be doe by a hyperplae that has almost the same distace to every poit of {x 1,..., x }. herefore, ay fiite dimesioal iterpretatio of the geometric situatio i a feature space of a uiversal kerel must fail. I particular this holds for 2- or 3-dimesioal drawigs. Actually, the shape of the feature map is eve more complicated sice ot oly all fiite subsets but every pair of compact disjoit subsets ca be separated. he followig corollary esures i particular that the semi-metric d k iduced by a uiversal kerel k is i fact a metric: 73

8 Steiwart Corollary 7 Every feature map of a uiversal kerel is ijective. Proof Fiite subsets are compact ad thus the assertio follows by the previous corollary. Propositio 8 Let X, d be a compact metric space ad k be a uiversal kerel o X. he k kx, y x, y := kx, xky, y defies a uiversal kerel o X. Proof Let Φ : X H be a feature map of k ad αx := kx, x 1/2. Clearly, αφ : X H is a feature map of k ad thus k is a kerel. o show that k is uiversal we fix a fuctio f CX ad ε > 0. For a := α we the get a iduced fuctio g = w, Φ. with α 1 f g ε a. his yields ad thus the assertio is proved. f w, αφ. α α 1 f g ε Up to ow we do ot kow whether there exist uiversal kerels. o attack this questio we begi with a simple criterio that makes it possible to check whether a give kerel is uiversal: heorem 9 Let X, d be a compact metric space ad k be a cotiuous kerel o X with kx, x > 0 for all x X. Suppose that we have a ijective feature map Φ : X l 2 of k with Φx = Φ x N. If A := spa {Φ : N} is a algebra the k is uiversal. Proof Because of kx, x > 0 for all x X the algebra A does ot vaish. Sice k is cotiuous every Φ : X R is cotiuous by Lemma 3 ad hece A CX. oreover, A is eve dese i CX sice the ijectivity of Φ implies that A separates poits ad thus heorem 1 ca be applied. Now we fix f CX ad ε > 0. he there exists a fuctio g = λ j Φ j A j=1 such that f g ε. However, if we defie w := λ j for = j ad w := 0 otherwise, we have w := w l 2 ad w, Φ. = g. he followig corollaries give various examples of uiversal kerels. We begi with kerels that ca be expressed by a aylor series: Corollary 10 Let 0 < r ad f : r, r R be a C -fuctio that ca be expaded ito its aylor series i 0, i.e. fx = a x for all x r, r. =0 74

9 O the Cosistecy of Support Vector achies Let X := {x R d : x 2 < r}. If we have a > 0 for all 0 the kx, y := f x, y defies a uiversal kerel o every compact subset of X. Proof Sice x, y x 2 y 2 < r for all x, y X we see that k is well-defied. We also have kx, y = = = d a x k y k =0 =0 a k 1,...,k d 0 k=1 k 1 + +k d = k 1,...,k d 0 c k1,...,k d i=1 a k1 + +k d c k1,...,k d d x i y i k i d i=1 x k i i d i=1 y k i i, where c k1,...,k d := d i=1 k i! 1 d i=1 k i! cf. also Poggio, 1975, Lem Note, that the series ca be rearraged sice it is absolutely summable. I particular, for x = y we obtai that Φ : X l 2 N d 0 is well defied by Φx := ak1 + +k d c k1,...,k d d i=1 x k i i. k 1,...,k d 0 he above equatio also shows that kx, y = Φx, Φy holds for all x, y X ad hece k is ideed a kerel. oreover, a 0 > 0 implies kx, x > 0 for all x X ad trivially, Φ is ijective. Sice A := spa {Φ k1,...,k d : k 1,..., k d 0} is a algebra we thus obtai by heorem 9 that k is uiversal. Istead of aylor series oe ca also cosider Fourier expasios. he result reads as follows: Corollary 11 Let f : [0, 2π] R be a cotiuous fuctio that ca be expaded i a poitwise absolutely coverget Fourier series of the form ft = a cost. 2 =0 If a > 0 holds for all 0 the kx, y := d i=1 f x i y i defies a uiversal kerel o every compact subset of [0, 2π d. Recall, that every fuctio f : [0, 2π] R that ca be exteded to a cotiuous, symmetric, periodic ad piecewise cotiuously differetiable fuctio o R has a Fourier series of the form 2. Proof By iductio ad the Cauchy product of series we may restrict ourselves to d = 1. he kx, y = a 0 + a six siy + a cosx cosy =1 75 =1

10 Steiwart holds for all x, y [0, 2π ad hece Φ = Φ 0 defied by Φ 0 x := a 0 ad Φ 2 1 x := a six, Φ 2 x := a cosx for 1 is a ijective feature map of k with image i l 2. oreover, A := spa { a si. : 1} { a 0 cos. : 0} is a algebra ad sice a 0 > 0 implies kx, x > 0 for all x X we obtai that k is uiversal. he followig examples show that various well-kow kerels are uiversal: Example 1 he kerels exp σ ad exp.,. are uiversal o every compact subset of R d. Proof he uiversality of exp.,. is due to Corollary 10. herefore, by Propositio 8 ad exp σ 2 x y 2 2 = exp σx 2 2 exp σy 2 2 exp 2σx, 2σy the assertio follows for the RBF kerel. Example 2 Let X := {x R d : x 2 < 1} ad α > 0. he V. Vovk s cf. Sauders et al., 1998, p. 15 ifiite polyomial kerel kx, y := 1 x, y α, x, y X, is uiversal o every compact subset of X. Proof o check the assertio we use that 1 t α = α =0 1 t holds for t < 1. Sice α 1 > 0 for all 0, the assertio the follows by Corollary 10. Example 3 Let 0 < q < 1 ad ft := 1 q 2 /2 4q cos t + 2q 2, t R. he the stroger regularized Fourier kerel kx, y := d i=1 fx i y i cosidered by Vapik 1998, p. 470 ad Sauders et al. 1998, p. 15 is uiversal o every compact subset of [0, 2π d. Proof he assertio ca be see usig Corollary 11 ad ft = 1/2 + =1 q cost cf. Gradstei & Ryshik, 1981, p. 68. Example 4 Let 0 < q < ad ft := π cosh π t /q / 2q sihπ/q for all t with 2π t 2π. he the weaker regularized Fourier kerel kx, y := d i=1 fx i y i cosidered by Vapik 1998, p. 470/1 ad Sauders et al. 1998, p. 15 is uiversal o every compact subset of [0, 2π d. Proof o obtai the assertio we use ft = 1/2 + =1 cost/1 + q2 2 cf. Gradstei & Ryshik, 1981, p he 2-orm soft margi classifier Let k be a kerel o X ad Φ : X H be a feature map of k. For a traiig set = x 1, y 1,..., x, y X Y ad c > 0 we deote the uique solutio of the 76

11 O the Cosistecy of Support Vector achies optimizatio problem miimize Ww, b, ξ := w, w + c ξi 2 i=1 over w, b, ξ subject to y i w, Φx i + b 1 ξ i, i = 1,..., 3 by w 2,k,c, b 2,k,c H R. A algorithm C 2,c k that provides the decisio fuctio f 2,k,c x := sig w 2,k,c, Φx + b 2,k,c, x X for every traiig set is called a 2-orm soft margi classifier 2-SC with kerel k ad parameter sequece c. Note, that i order to have a small set of free parameters oe usually fixes c := c for all 1. I this sectio it turs out that this is ot suitable for problems that do o guaratee a large margi. Istead oe should use sequeces c = c β 1 where β > 0 is a parameter a-priori determied by the kerel ad c is a ew free parameter cf. Cor. 17. Of course, for fixed traiig set sizes both parameterizatios are equivalet, i.e. they ca be trasformed ito each other. By Lemma 2 the decisio fuctio is idepedet of the choice of the feature map Φ. oreover, f 2,k,c ca be expressed by f 2,k,c x = i=1 y i α i kx i, x + b 2,k,c, where α i 0 are suitable costats depedig o ad b 2,k,c ca also be computed with the help of the kerel cf. Cristiaii & Shawe-aylor, 2000; Vapik, 1998; Schölkopf et al., Note, that if k is a kerel o X which separates all fiite sets ad X has ifiitely may elemets the the fuctio class represeted by the 2-SC has ifiite VC-dimesio. For more iformatio o this we refer to Vapik 1998, Ch. 4, Cristiaii & Shawe-aylor 2000, Ch. 4 ad va der Vaart & Weller 1996, Ch Give a Borel probability measure P o X Y with oise level p we deote the odetermiistic part of the supervisor by X + := {x X : px > 0}. If P X X + > 0 we write q := if x X px ad p := sup x X px. Due to techical reasos we defie q := p := 1/4 otherwise. We begi with a prelimiary result: heorem 12 Let X, d be a compact metric space ad k be a uiversal kerel o X. he for all Borel probability measures P o X Y with q, p 0, 1/2 ad all ε > 0 there exist c > 0 ad δ > 0 such that for all c c, 0 < δ δ ad all 1 we have Pr { X Y : R P f 2,k,c/ R P + 4 p q } 1 2q P XX + + ε 1 3e 2 δ 2, where := N X, d k, δ c is the coverig umber of X with respect to the metric dk which is iduced by the kerel k. oreover, Pr deotes the outer probability measure of P. Note, that i order to avoid the probably very difficult questio whether the sets { X Y : R P f 2,k,c/ α } 77

12 Steiwart are measurable we cosider the outer probability measure, oly. Sice the proof of heorem 12 is very techical we like to explai the basic idea of the proof firstly. Let us suppose that the supervisor has a costat level of oise p [0, 1/2. oreover, we assume that we have a iduced fuctio w, Φ. which has the costat values 1 2p, resp. 1 2p o B 1 P, resp. B 1 P. Now let us take a represetative traiig set of legth. he oe easily checks cf. estimate 4 that w k,c/, w k,c/ + c ξl 2 w, w + 4cp1 p l=1 holds. Here meas that the relatio oly holds approximately. O the other had, by the cotiuity of the decisio fuctio w k,c/ a misclassified compared with the optimal Bayes decisio rule elemet z forces the sum of those slack variables, which belog to samples i the eighbourhood of z, to be approximately greater tha their cardiality cf. iequality 6. Coversely, for a correctly classified elemet the correspodig sum of the slack variables is approximately larger tha 4p1 p times their cardiality cf. iequality 7. Combiig these cosideratios we obtai c1 2p 2 P X E + 4cp1 p = c P X E + 4p1 pp X X \ E w k,c/, w k,c/ + c l=1 w, w + 4cp1 p, where E deotes the set of misclassified compared with the optimal Bayes decisio rule elemets. hus P X E must be small if we have chose c large eough. he difficulty of the proof below is firstly, to trasfer the idea to the geeral case ad secodly, to give exact formulatios of represetative, eighbourhood ad approximately. Proof of heorem 12 Without loss of geerality we may assume ε 0, 1]. Let X 0 := X \ X + be the determiistic part of the supervisor ad X 1 0 := X0 B 1 P, X1 0 := X0 B 1 P be the parts of the classes B 1 P, B 1 P i X 0. Furthermore, let X 1 + := X+ B 1 P ad X 1 + := X+ B 1 P be the parts of the classes B 1 P ad B 1 P i X +. We defie δ := mi{2p, ε 74 q 1 2q } ad fix δ δ. Sice P X is regular cf. Dudley, 1989, p. 176 there exist compact subsets K j i of X j i with P X X j i \ Kj i δ/4 for i { 1, 1} ad j {0, +}. oreover, for a fixed feature map Φ : X H of k Propositio 5 esures the existece of a elemet w H with w, Φx [1, 1 + δ] if x K1 0 [ 1 + δ, 1] if x K 1 0 [1 2p δ, 1 2p ] if x K 1 + [ 1 2p, 1 2p δ] if x K 1 + [ 1 + δ, 1 + δ] otherwise. ξ 2 l 78

13 O the Cosistecy of Support Vector achies We defie c := 2 ε1 2q w 2 2 ad for fixed c c let σ := obtai partitios P j i of K j i with diam dk A σ for all A P j i ad N X, d k, σ =. i { 1,1} j {0,+} P j i δ c. By Lemma 13 we the Let P j i := {A P j i : P X A δ q } for i { 1, 1} ad j {0, +}. represetative traiig sets we defie F,A := { x1, y 1,..., x, y : {l : x l A} P X A δ } o costruct for all A P j i, i { 1, 1} ad j {0, +}. oreover, for A P+ i, i { 1, 1} let F +,A := { x1, y 1,..., x, y : { x1 F,A :=, y 1,..., x, y : F := F,A {l : xl A ad y l = i} {l : x l A ad y l i} F,A F +,A F,A. 1 p P X A δ q P X A δ } } A P 0 1 P0 1 A P + 1 P+ 1 Lemma 14 yields P F 1 3e 2 δ 2 ad thus it suffices to show that R P f 2,k,c/ R P + 4 p q 1 2q P XX + + ε holds for all F. herefore, let us fix a traiig set = x 1, y 1,..., x, y F. For c := c/ we deote the solutio of 3 by w, b, ξ. Our first step is to estimate Ww, b, ξ from above by comparig it with Ww, 0, ξ, where ξ is a admissible slack vector of w, 0. Hece we have to costruct ξ. For this let us first assume that we have a sample x l, y l K1 0. he we observe that y l w, Φx l = w, Φx l 1 ad thus we may defie ξl := 0. Aalogous cosideratios yield 0 if x l Ki 0 ξl 2p + δ if x l K i + :=, y l = i 2 2p if x l K i +, y l i 2 + δ otherwise. oreover, our costructio of F guaratees that there are at most i { 1,1} j {0,+} A P j i P X A δ 1 i { 1,1} j {0,+} P X X j i δ q = 2 q δ 79

14 Steiwart samples which are ot elemets of a suitable K j i. Furthermore, sice there are at most samples i K + := K + 1 i { 1,1} A Pi 0 K+ 1 P X X q δ P X A δ P X X q δ we also obtai that there are at most i { 1,1} A P + i 1 p P X A δ p P X X q δ samples i K + which have icorrect labels. Sice these have larger slack variables with respect to w, 0 tha the correctly labeled samples i K + we obtai Ww, b, ξ Ww, 0, ξ w, w + c i { 1,1} x l K + i y l =i ξ l 2 + c i { 1,1} x l K + i y l i ξ l 2 + 2δc2 + δ 2 w, w + c1 p P X X + 2p + δ 2 + c p P X X q δ 2 2p 2 + 2δc2 + δ 2 w, w + c 4p 1 p P X X q δ. 4 For later purposes we ote that we also have Ww, b, ξ W 0, 0, 1,..., 1 c ad thus w 2 c. I the secod step of the proof we estimate Ww, b, ξ from below. For this let us deote the set of misclassified poits i X j i by E j i := {x X j i : f x i}. For brevity s sake we also write E j := E j 1 Ej 1 ad E := E0 E +. Let us first cosider a A P 0 i with A E. Without loss of geerality we may assume that i = 1. he for x l A ad z A E we obtai 1 ξ l y l w, Φx l + b = w, Φx l Φz + w, Φz + b w d k x l, z δ, i.e. ξ 2 l 1 δ2 1 2δ. Sice the same estimate holds i the case i = 1 our costructio of F implies 1 i { 1,1} A Pi 0 x l A A E ξ 2 l i { 1,1} A Pi 0 A E 1 2δ P X A δ 80 P X E 0 3 q δ. 5

15 O the Cosistecy of Support Vector achies Now let us cosider a A P + i with A E. Without loss of geerality we may assume that i = 1 agai. he for fixed z A E ad a := w, Φz + b 0 we obtai ξ l { 1 δ + a for x l A with y l = 1 max{0, 1 δ a} for x l A with y l = 1 aalogously to the above cosideratios. We first treat the case 1 δ a 0. Sice there are at least P X A δ/ samples i A ad at least 1 p P X A δ/ correctly labeled samples i A we get 1 ξ 2 l 1 δ + a 2 1 p P X A δ x l A + 1 δ a 2 p P X A P X A 1 δ + a 2 1 p + 1 δ a 2 p 4δ. Now a easy miimizatio with respect to a [0, 1 δ] yields 1 x l A ξ 2 l P X A 1 δ 2 1 p + 1 δ 2 p 4δ 1 2δP XA 4δ. O the other had, if 1 δ a < 0 we have 1 δ + a > 2 2δ ad thus the same iequality follows: 1 ξ 2 l 1 δ + a 2 1 p P X A δ x l A 1 2δP X A 4δ. herefore, we obtai 1 i { 1,1} A P + x l A i A E ξ 2 l i { 1,1} A P + i A E 1 2δP X A 4δ. 6 Fially, a aalogous cosideratio yields 1 i { 1,1} A P + x l A i A E= ξ 2 l i { 1,1} A P + i A E= 1 2δ4q 1 q P X A 4δ. 7 81

16 Steiwart Now, the estimates 6 ad 7 imply 1 ξ 2 l + 1 i { 1,1} A P + x l A i A E i { 1,1} A P + i A E i { 1,1} A P + i A E 1 2δP X A 4δ i { 1,1} A P + x l A i A E= + ξ 2 l i { 1,1} A P + i A E= 1 2δ1 2q 2 P X A + i { 1,1} A P + i 1 2δ4q 1 q P X A 4δ 1 2δ4q 1 q P X A 4δ 1 2δ1 2q 2 P X E δ4q 1 q P X X + 6δ 2 q δ 1 2q 2 P X E + + 4q 1 q P X X + 7 q δ. he latter iequality together with 5 yields 1 l=1 ξ 2 l P X E q 2 P X E + + 4q 1 q P X X + 10 q δ. Combiig this estimate with 4 we ow obtai w, w c P X E q 2 P X E + + 4q 1 q P X X + 4p 1 p P X X + 37 q δ c P X E q 2 P X E + 4p q P X X + 37 q δ. 8 oreover, a simple calculatio shows R P f 2,k,c/ 1 = R P + 1 2p dp X R P + 1 2q P X E q 2 P X E + E ad thus P X E q 2 P X E + 1 2q R P f 2,k,c/ R P. With this ad 8 we fid w, w 2 w, w ε1 2q he latter iequality fially implies 1 2q R P f 2,k,c/ R P 4p q P X X + 37 q δ. R P f 2,k,c/ R P ε p q 1 2q P XX + 37 ε1 2q q + 1 2q q 74 = 4 p q 1 2q P XX + + ε. hus the assertio follows. 82

17 O the Cosistecy of Support Vector achies We have to prove the remaiig lemmas ow. We begi with the lemma that costructs the partitios P j i of the above proof: Lemma 13 Usig the otatios of the proof of heorem 12 there exist partitios P j i of Kj i, i { 1, 1}, j {0, +}, such that diam dk A σ for all A P j i, i { 1, 1}, j {0, +}, ad N X, d k, σ. 9 i { 1,1} j {0,+} P j i Proof By the defiitio of the coverig umbers there exists a partitio P of X with diam dk A σ for all A P ad P N X, d k, σ. Let us defie P j i := {A P : A K j i } ad Pj i := {A Kj i : A P j i }. herefore, to prove 9 we have to show that the P j i s are mutually disjoit. Assume the coverse, i.e. there exists A P j j i P i with i i or j j. By the defiitio of the partitios there exist z 1, z 2 A with z 1 K j i ad z 2 K j i. Now o the oe had, we obtai w, Φz 1 Φz 2 w d k z 1, z 2 w σ w δ c < δ but o the other had we also have w, Φz 1 Φz 2 = w, Φz 1 w, Φz p δ δ. herefore the assertio follows. Lemma 14 Usig the otatios of the proof of heorem 12 we have P F 1 3e 2 δ 2. Proof Let us recall Hoeffdig s iequality cf. Devroye et al., 1997, hm. 8.1 which i particular states that for all i.i.d. radom variables z i : Ω, A, Q {0, 1} ad all ε > 0, 1 we have Q z i q ε e 2ε2, i=1 where q := Qz i = 1. hus for A P + i we get P F +,A P { x1, y 1,..., x, y : {l : x l A, y l = i} p dp X A1 δ 1 P { x1, y 1,..., x, y : {l : x l A, y l = i} p dp X A1 δ 1 e 2 δ 2, } } 83

18 Steiwart i.e. P Z \F +,A e 2 δ 2, where Z := X Y. Aalogously, we obtai P Z \F,A e 2 δ 2 for all A P + i yield ad P Z \ F,A e 2 δ 2 for all A P j i. hese estimates P F = P F,A F,A F +,A F,A A P 0 1 P0 1 = 1 P Z \ F,A A P + 1 P+ 1 Z \ F,A Z \ F +,A Z \ F,A 1 A P 0 1 P0 1 P Z \ F,A A P + 1 P+ 1 P Z \ F +,A P Z \ F,A A P j 1 Pj 1 j {0,+} 1 3e 2 δ 2. A P + 1 P+ 1 A P + 1 P+ 1 With the help of heorem 12 we ca ow ivestigate how to choose the parameter sequece c for a give uiversal kerel: heorem 15 Let X, d be a compact metric space ad k be a uiversal kerel o X such that the coverig umbers of X, d k fulfill N X, d k, ε Oε α for some α > 0. Suppose that we have a positive sequece c with c O β 1 for some 0 < β < 1 α ad c. he for all Borel probability measures P o X Y with q, p 0, 1/2 ad all ε > 0 we have lim Pr { X Y : R P f 2,k,c R P + 4 p q } 1 2q P XX + + ε = 1. Proof Let γ := 1 αβ 41+α > 0 ad δ := γ. By heorem 12 there exist c > 0 ad δ > 0 such that for all c c, 0 < δ δ ad all 1 we have Pr { X Y : R P f 2,k,c/ R P + 4 p q } 1 2q P XX + + ε 1 3e 2 δ 2, where := N X, d k, δ c. Sice δ 0 ad c we may fix a 0 1 such that for all 0 we have both δ δ ad c c. his yields Pr { X Y : R P f 2,k,c R P + 4 p q } 1 2q P XX + + ε 1 3 e 2 δ 2, where := N δ X, d k, c. his implies O αγ+β/2 ad hece we easily check that e δ 2 Oe 21+αγ holds. hus the assertio follows. 84

19 O the Cosistecy of Support Vector achies Corollary 16 Uder the assumptios of heorem 15 the 2-SC with sequece c is cosistet for all Borel probability measures P o X Y with q = p < 1/2. I particular this holds for all Borel probability measures with costat oise level. Corollary 17 Let X R d be compact ad k be a Gaussia RBF kerel o X. Let 0 < β < 1 d ad c be a positive sequece with c O β 1 ad c. he the 2-SC with kerel k ad sequece c is cosistet for all Borel probability measures P o X Y with q = p < 1/2. Proof observe Let σ > 0 ad kx, y := exp σ 2 x y 2 2. Sice 1 e t t for all t 0 we d k x, y = 2 2 exp σ 2 x y 2 2 2σ x y 2. his yields N X, d k, ε N ε X,. 2, 2σ ad thus N X, dk, ε Oε d cf. Carl & Stephai, 1990, p. 9. For the classificatio problems we have cosidered up to ow we usually may ot expect that we obtai a large margi for sample sizes growig to ifiity. I the followig we restrict ourselves to distributios that guaratee a fixed ad strictly positive margi for all traiig sets. Of course, these classificatio problems must have a determiistic supervisor, i.e. a oise level that vaishes almost everywhere. I geeral, additioal assumptios are required. Usig uiversal kerel these reduce to a simple geometric coditio: heorem 18 Let X, d be a compact metric space ad k be a uiversal kerel o X. Suppose that we have a Borel probability measure P o X Y with a determiistic supervisor ad with classes B 1 P, B 1 P which have strictly positive distace, i.e. d B 1 P, B 1 P > 0. he k separates B 1 P ad B 1 P with margi γ > 0 ad for all c > 0, ε > 0 ad m we have Pr { X Y : R P,S f 2,k,c ε} 1 e ε 2 +m. Here, := N X, d k, γ/2 is the coverig umber of X, d k ad m := 4 cγ Proof Sice X, d k is precompact ad d B 1 P, B 1 P > 0 both B i P are compact, too. hus they ca be separated with margi γ > 0 by Corollary 6. I aalogue to Lemma 13 we ow costruct partitios P i of B i P, i { 1, 1} such that diam dk A γ/2 for all A P i ad P 1 P 1. We defie P i := {A P i : P X A > ε } for i { 1, 1}. oreover, for m ad A P 1 P 1 let F,A := F := { x1, y 1,..., x, y X Y : A P 1 P 1 F,A. 85 {l : x l A} } m

20 Steiwart For m the Cheroff-Okamoto iequality see e.g. Dudley, 1978 the yields P X Y ε \ F,A exp m ε 1 ε exp 2 ε 2 2 ε m 1 + m 12 exp ε 2 + m ad thus P F 1 e ε 2 +m for all m. Hece it suffices to show that for all = x 1, y 1,..., x, y F the decisio fuctio f 2,k,c classifies A P 1 A ad A P 1 A correctly. o see this we fix a feature map Φ : X H of k. For brevity s sake the uique solutio of problem 3 is deoted by w, b, ξ. Furthermore, w, b S H R is a hyperplae that separates Φ B 1 P ad Φ B 1 P with margi γ > 0. he we have y l w, Φx l + b γ for all l = 1,..., ad therefore we obtai w, w + c l=1 ξ 2 l w γ, w γ 2 ε = 1 γ 2. I particular this implies w 1/γ. Now, let us suppose that there exists a misclassified poit z. Without loss of geerality we may assume that z A P 1 A. Hece there is a A P 1 with z A ad for this there exist mutually differet idexes l 1,..., l m such that x lj A. I particular this yields d k x lj, z γ/2 for all j = 1,..., m. Sice w, Φz + b 0 we thus obtai 1 ξ lj w, Φx lj +b = w, Φx lj Φz + w, Φz +b w d k x lj, z 1 2. Hece we have ξ lj 1/2 ad this leads to the cotradictio 1 γ 2 w, w + c ξl 2 l=1 c m ξl 2 j cm 4 j=1 = c 4 4 cγ > 1 γ 2. herefore the assertio follows. heorem 18 shows that the 2-SC works well with fixed weight factor c wheever it treats a classificatio problem that esures a large margi. We believe that these distributios are also the oly oes for which a fixed c is suitable. Our cojecture is based o the observatio that the costat c cotrols the Lipschitz costat of the solutio of 3 with respect to the metric d k : if we have a classificatio problem that does ot guaratee a large margi the Lipschitz costat may grow like. he proofs of this sectio idicate that this may be too fast sice for large sample sizes the solutio eed ot be almost costat o each elemet of the partitios, i.e. overfittig may occur. I the proof of the above theorem we oly used elemets of the partitios P i whose probability was larger tha or equal to ε. If we exted our cosideratios to all elemets with strictly positive probability we obtai the followig theorem: 86

21 O the Cosistecy of Support Vector achies heorem 19 Let X, d be a compact metric space ad k a uiversal kerel o X. Suppose that we have a Borel probability measure P o X Y with a determiistic supervisor ad with classes B 1 P, B 1 P which have strictly positive distace, i.e. d B 1 P, B 1 P > 0. he k separates B 1 P ad B 1 P with margi γ > 0 ad for all c > 0 there exists a costat ρ > 0 such that for all 4 cγ N X, dk, γ/2 we have Pr { X Y : R P,S f 2,k,c = 0} 1 e ρ, Up to ow we have oly treated uiversal kerels. Oe may ask whether other classes of kerels are also suitable to treat with the classificatio problems cosidered i this work. Oe type ofte used are polyomial kerels of the form.,. + c m, c 0, m N, o a subset X of R. For these kerels the fuctios geerated by the 2-SC are polyomials o X of degree up to m. hus the ext propositio shows that these kerels are ot eve capable to solve the problems of heorems 18 ad 19: Propositio 20 Let P d be the set of all polyomials o X := [0, 1] d whose degree is less tha + 1. he for all ε > 0 there exists a Borel probability measure P o X Y with d B 1 P, B 1 P > 0, R P = 0 ad if{r P sig f : f P d } 1 2 ε. Proof We first treat the case d = 1. We fix a iteger m 3 + 2/ε ad let I i := [i + ε/m, i + 1 ε/m], i = 0,..., m 1. Deotig the Lebesgue measure o I i by λ Ii let P X := 1 2ε 1 m 1 i=0 λ I i. oreover, we defie a determiistic supervisor P.. by P y = 1 x := 1 for x I 2i, i = 0,..., m + 1/2, ad P y = 1 x := 1 otherwise. For a fixed polyomial f P d we deote its mutually differet ad ordered zeroes i 0, 1 by x 1 <... < x k, k. For brevity s sake let x 0 := 0 ad x k+1 := 1. Fially, we defie a j := { i : I i [x j, x j+1 ] } for j = 0,..., k. he by a easy observatio we get k j=0 a j m k m. oreover, at most a j + 1/2 itervals I i are correctly classified o [x j, x j+1 ] by the fuctio sig f. Hece at least k j=0 aj 2 k j=0 aj 2 1 m 2 k + 1 m itervals I i are ot correctly classified o [0, 1] by sig f. Sice P X I i = 1/m we thus obtai R P sig f 1 k aj 1 m m 1 2 ε. j=0 o treat the case d > 1 we have to embed [0, 1] ito [0, 1] d via t t, 0,..., 0. Cosiderig the above distributio P embedded ito [0, 1] d we the get the assertio sice polyomials i d variables o [0, 1] embedded ito [0, 1] d as above are essetially polyomials i oe variable. 87

22 Steiwart 5. he 1-orm soft margi classifier We ow cosider the 1-orm soft margi classifier. Agai we fix a kerel k o X with feature map Φ : X H. For a traiig set = x 1, y 1,..., x, y X Y ad c > 0 we deote a solutio of the optimizatio problem by w 1,k,c miimize w, w + c ξ i i=1 over w, b, ξ subject to y i w, Φx i + b 1 ξ i, i = 1,..., ξ i 0, i = 1,...,, b 1,k,c H R. A algorithm C 1,c k that provides the decisio fuctio f 1,k,c x := sig w 1,k,c, Φx + b 1,k,c, x X for every traiig set is called a 1-orm soft margi classifier 1-SC with kerel k ad parameter sequece c. For a itroductio to the 1-SC as well as for implemetatio techiques we refer to Cristiaii & Shawe-aylor 2000, Ch. 6 ad 7 ad Vapik 1998, Ch. 10 Burges & Crisp 2000 proved that i geeral the optimizatio problem 10 has o uique solutio. Although the 1-SC oly has to costruct a arbitrary solutio we show i this sectio that it ejoys all the properties of the 2-SC prove i this work. We begi with a statemet which is aalogous to heorem 12: heorem 21 Let X, d be a compact metric space ad k be a uiversal kerel o X. he for all Borel probability measures P o X Y with q, p 0, 1/2 ad all ε > 0 there exist c > 0 ad δ > 0 such that for all c c, 0 < δ δ ad all 1 we have Pr { X Y : R P f 1,k,c/ R P + 4 p q } 1 2q P XX + + ε e 2 δ 2, where := 4N X, d k, δ c is essetially the coverig umber of X with respect to the metric d k which is iduced by the kerel k. Sketch of the proof Sice the proof is very similar to that of heorem 12 we oly poit out the mai differeces besides adjustig the costats: Firstly the vector w H has to be chose i a differet maer, amely it has to fulfill { w i [1, 1 + δ] if x K j i, Φx [ 1 + δ, 1 + δ] otherwise. his coditio also eforces the secod modificatio sice Lemma 13 does ot hold aymore i this settig. Ideed, we caot guaratee that the defiitio of the P j i s i Lemma 13 implies that they are mutually disjoit. Istead, we oly obtai P j i N δ X, d k, c ad thus 4N X, d k, σ. i { 1,1} j {0,+} P j i 88

23 O the Cosistecy of Support Vector achies he latter explais the defiitio of, which is differet to that of heorem 12. Followig the proof of heorem 15 we obtai a aalogous result for the 1-SC: heorem 22 Let X, d k be a compact metric space ad k be a uiversal kerel o X such that the coverig umbers of X, d k fulfill N X, d k, ε Oε α for some α > 0. Suppose that we have a positive sequece c with c O β 1 for some 0 < β < 1 α ad c. he for all Borel probability measures P o X Y with q, p 0, 1/2 ad all ε > 0 we have lim Pr { X Y : R P f 1,k,c R P + 4 p q } 1 2q P XX + + ε = 1 o complete our cosideratios of oisy problems we metio that for the 1-SC with Gaussia RBF kerel we obtai the followig cosistecy result which has already bee proved for the 2-SC: Corollary 23 Let X R d be compact ad k be a Gaussia RBF kerel o X. Let 0 < β < 1 d ad c be a positive sequece with c O β 1 ad c. he the 1-SC with kerel k ad sequece c is cosistet for all Borel probability measures P o X Y with q = p < 1/2. I the presece of large margis it turs out that we may fix the weight factor aalogously to the 2-SC. For brevity s sake we oly state a result that is similar to heorem 18. However, a result that is aalogous to heorem 19 also holds. heorem 24 Let X, d be a compact metric space ad k a uiversal kerel o X. Suppose that we have a Borel probability measure P o X Y with a determiistic supervisor ad with classes B 1 P, B 1 P which have strictly positive distace, i.e. d B 1 P, B 1 P > 0. he k separates B 1 P ad B 1 P with margi γ > 0 ad for all c > 0, ε > 0 ad m we have Pr { X Y : R P,S f 1,k,c ε} 1 e ε 2 +m, where := N X, d k, γ/2 is the coverig umber of X, d k ad m := 2 cγ Sketch of the proof he proof is completely aalogous to that of heorem 18. However, sice the slack variables are ot squared we obtai c m ξ lj cm 2 j=1 i the last estimate of the proof of heorem 18 ad this yields the slightly better value of m. Fially we metio that usig polyomial kerels Propositio 20 ca also be applied. hus problems that caot be treated well with a fixed polyomial kerel ad the 2-SC caot be treated well with the 1-SC ad the same kerel, either ad vice versa. 89

24 Steiwart 6. he maximal margi hyperplae classifier We fially cosider the maximal margi classifier. Agai we fix a kerel k o X with feature map Φ : X H. For a traiig set = x 1, y 1,..., x, y X Y we deote the uique solutio of the optimizatio problem miimize w, w over w, b subject to y i w, Φx i + b 1, i = 1,..., 11 by w k, bk H R. A algorithm C k that provides the decisio fuctio f k x := sig w k, Φx + b k, x X for every traiig set is called a maximal margi classifier C with kerel k. For a itroductio to the C as well as for implemetatio techiques ad a geometric motivatio we agai refer to Cristiaii & Shawe-aylor 2000, Ch. 6 ad 7 ad Vapik 1998, Ch. 10. he C is assumed to work poorly i the absece of large margis cf. og Zhag, hus we oly cosider the settig of heorem 18. We begi with a result similar to heorem 18 ad heorem 24: heorem 25 Let X, d be a compact metric space ad k a uiversal kerel o X. Suppose that we have a Borel probability measure P o X Y with a determiistic supervisor ad with classes B 1 P, B 1 P which have strictly positive distace, i.e. d B 1 P, B 1 P > 0. he k separates B 1 P ad B 1 P with margi γ > 0 ad for all ε > 0 ad := N X, d k, γ/2 we have Pr { X Y : R P,S f k ε} 1 e l1 ε 2. Sketch of the proof We repeat the costructio of the proof of heorem 18 with m := 1. A easy calculatio the shows P F 1 e l1 ε 2. Now suppose that for F we have a elemet z A P 1 A misclassified by f k. Hece there exist a A P 1 with z A ad a sample x l, y l of with x l A. he the followig estimate yields a cotradictio: 0 w k, Φz +b k = w k, Φz Φx lj + w k, Φx lj +b k w k d k x lj, z herefore the assertio follows. he above result ad its proof idicate that i the presece of large margis it may be more suitable to use the C istead of a soft margi algorithm. We metio that a estimate which is very similar to heorem 25 ca be obtaied usig data-depedet margi-based bouds of Shawe-aylor et al o compare both we first state a corollary: 90

25 O the Cosistecy of Support Vector achies Corollary 26 Let k σ be the Gaussia RBF kerel with parameter σ o the uit ball X := of the d-dimesioal Euclidea space. Let P be a Borel probability measure o X Y B l d 2 which ca be separated by k σ with margi γ > 0. he for all δ 0, 1 ad all 2 16σ γ we have Pr { X Y : R P,S f k 4d 16σ d d l 16σ γ γ + l 2 } 1 δ. δ d Proof As already show i the proof of Corollary 17 we have N X, d k, ε ε N X,. 2, 2σ 2 5 d ε d 2σ 28σ d ε d ad thus := N X, d k, γ/2 216σ d γ d. Now let ε := 21 δ 1/ for 216σ d γ d, i.e. δ = 1 ε. 2 Sice ε < 2 l δ heorem 25 yields the assertio. Corollary 26 shows that the learig curve of the C is of order O 1 provided that P guaratees a large margi. hese coditios also allow the applicatio of margi-based bouds o geeralizatio proved by Shawe-aylor et al cf. also Bartlett & Shawe- aylor, 1999; Cristiaii & Shawe-aylor, We the obtai that the learig curve is of order O 1 log 2. However, to compare both results oe also has to cosider the costats that arise sice the sample size oly varies i a typical rage. Here we observe that the ifluece of the margi γ is essetially of order Oγ 2 i the estimates of Shawe- aylor et al while the Corollary 26 shows that our estimates are essetially iflueced by order Oγ d, where d is the dimesio of the iput space X. We thus suppose that oly for small iput dimesios d our estimates are more suitable to treat realistic sample sizes. 7. Coclusios he aim of this paper has bee to ivestigate which kid of distributios could be classified well by support vector machies SV s. It has tured out that the ability of the kerel to approximate arbitrary cotiuous fuctios plays a fudametal role for this questio. Sice the resultig fuctio classes represeted by the classifier are very large there always exists the risk of overfittig. However, usig soft margi support vector machies with specific sequeces of regularizatio parameters this risk ca be cotrolled at least for simple oise models, e.g. models with costat oise level. I particular, the restrictio to large margi problems i og Zhag 2001, p. 442 has bee sigificatly weakeed. Sice the asatz of this paper is ew may questios remai ope, ad are worth for further ivestigatios. Firstly it is iterestig whether the soft margi algorithms yield arbitrarily good geeralizatio for all distributios. Up to ow our results oly provide cosistecy if the oise level is costat. However, approximatig a arbitrary oise level by step fuctios it seems possible that the soft margi algorithms with certai parameter sequeces are uiversally cosistet. Of course, the uiversality of kerels, which roughly speakig eables us to do almost everythig with iduced fuctios o compact subsets 91

Measure and Measurable Functions

Measure and Measurable Functions 3 Measure ad Measurable Fuctios 3.1 Measure o a Arbitrary σ-algebra Recall from Chapter 2 that the set M of all Lebesgue measurable sets has the followig properties: R M, E M implies E c M, E M for N implies

More information

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size. Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

REGRESSION WITH QUADRATIC LOSS

REGRESSION WITH QUADRATIC LOSS REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Regression with quadratic loss

Regression with quadratic loss Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information

Math Solutions to homework 6

Math Solutions to homework 6 Math 175 - Solutios to homework 6 Cédric De Groote November 16, 2017 Problem 1 (8.11 i the book): Let K be a compact Hermitia operator o a Hilbert space H ad let the kerel of K be {0}. Show that there

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

Axioms of Measure Theory

Axioms of Measure Theory MATH 532 Axioms of Measure Theory Dr. Neal, WKU I. The Space Throughout the course, we shall let X deote a geeric o-empty set. I geeral, we shall ot assume that ay algebraic structure exists o X so that

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

University of Colorado Denver Dept. Math. & Stat. Sciences Applied Analysis Preliminary Exam 13 January 2012, 10:00 am 2:00 pm. Good luck!

University of Colorado Denver Dept. Math. & Stat. Sciences Applied Analysis Preliminary Exam 13 January 2012, 10:00 am 2:00 pm. Good luck! Uiversity of Colorado Dever Dept. Math. & Stat. Scieces Applied Aalysis Prelimiary Exam 13 Jauary 01, 10:00 am :00 pm Name: The proctor will let you read the followig coditios before the exam begis, ad

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Theorem 3. A subset S of a topological space X is compact if and only if every open cover of S by open sets in X has a finite subcover.

Theorem 3. A subset S of a topological space X is compact if and only if every open cover of S by open sets in X has a finite subcover. Compactess Defiitio 1. A cover or a coverig of a topological space X is a family C of subsets of X whose uio is X. A subcover of a cover C is a subfamily of C which is a cover of X. A ope cover of X is

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS

REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS REAL ANALYSIS II: PROBLEM SET 1 - SOLUTIONS 18th Feb, 016 Defiitio (Lipschitz fuctio). A fuctio f : R R is said to be Lipschitz if there exists a positive real umber c such that for ay x, y i the domai

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero?

If a subset E of R contains no open interval, is it of zero measure? For instance, is the set of irrationals in [0, 1] is of measure zero? 2 Lebesgue Measure I Chapter 1 we defied the cocept of a set of measure zero, ad we have observed that every coutable set is of measure zero. Here are some atural questios: If a subset E of R cotais a

More information

The Boolean Ring of Intervals

The Boolean Ring of Intervals MATH 532 Lebesgue Measure Dr. Neal, WKU We ow shall apply the results obtaied about outer measure to the legth measure o the real lie. Throughout, our space X will be the set of real umbers R. Whe ecessary,

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Lecture 3 : Random variables and their distributions

Lecture 3 : Random variables and their distributions Lecture 3 : Radom variables ad their distributios 3.1 Radom variables Let (Ω, F) ad (S, S) be two measurable spaces. A map X : Ω S is measurable or a radom variable (deoted r.v.) if X 1 (A) {ω : X(ω) A}

More information

Math 341 Lecture #31 6.5: Power Series

Math 341 Lecture #31 6.5: Power Series Math 341 Lecture #31 6.5: Power Series We ow tur our attetio to a particular kid of series of fuctios, amely, power series, f(x = a x = a 0 + a 1 x + a 2 x 2 + where a R for all N. I terms of a series

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

MA131 - Analysis 1. Workbook 3 Sequences II

MA131 - Analysis 1. Workbook 3 Sequences II MA3 - Aalysis Workbook 3 Sequeces II Autum 2004 Cotets 2.8 Coverget Sequeces........................ 2.9 Algebra of Limits......................... 2 2.0 Further Useful Results........................

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number

Integrable Functions. { f n } is called a determining sequence for f. If f is integrable with respect to, then f d does exist as a finite real number MATH 532 Itegrable Fuctios Dr. Neal, WKU We ow shall defie what it meas for a measurable fuctio to be itegrable, show that all itegral properties of simple fuctios still hold, ad the give some coditios

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

Optimally Sparse SVMs

Optimally Sparse SVMs A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but

More information

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n Review of Power Series, Power Series Solutios A power series i x - a is a ifiite series of the form c (x a) =c +c (x a)+(x a) +... We also call this a power series cetered at a. Ex. (x+) is cetered at

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

MAS111 Convergence and Continuity

MAS111 Convergence and Continuity MAS Covergece ad Cotiuity Key Objectives At the ed of the course, studets should kow the followig topics ad be able to apply the basic priciples ad theorems therei to solvig various problems cocerig covergece

More information

Chapter IV Integration Theory

Chapter IV Integration Theory Chapter IV Itegratio Theory Lectures 32-33 1. Costructio of the itegral I this sectio we costruct the abstract itegral. As a matter of termiology, we defie a measure space as beig a triple (, A, µ), where

More information

Notes 27 : Brownian motion: path properties

Notes 27 : Brownian motion: path properties Notes 27 : Browia motio: path properties Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces:[Dur10, Sectio 8.1], [MP10, Sectio 1.1, 1.2, 1.3]. Recall: DEF 27.1 (Covariace) Let X = (X

More information

sin(n) + 2 cos(2n) n 3/2 3 sin(n) 2cos(2n) n 3/2 a n =

sin(n) + 2 cos(2n) n 3/2 3 sin(n) 2cos(2n) n 3/2 a n = 60. Ratio ad root tests 60.1. Absolutely coverget series. Defiitio 13. (Absolute covergece) A series a is called absolutely coverget if the series of absolute values a is coverget. The absolute covergece

More information

Empirical Processes: Glivenko Cantelli Theorems

Empirical Processes: Glivenko Cantelli Theorems Empirical Processes: Gliveko Catelli Theorems Mouliath Baerjee Jue 6, 200 Gliveko Catelli classes of fuctios The reader is referred to Chapter.6 of Weller s Torgo otes, Chapter??? of VDVW ad Chapter 8.3

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

Chapter 0. Review of set theory. 0.1 Sets

Chapter 0. Review of set theory. 0.1 Sets Chapter 0 Review of set theory Set theory plays a cetral role i the theory of probability. Thus, we will ope this course with a quick review of those otios of set theory which will be used repeatedly.

More information

6 Integers Modulo n. integer k can be written as k = qn + r, with q,r, 0 r b. So any integer.

6 Integers Modulo n. integer k can be written as k = qn + r, with q,r, 0 r b. So any integer. 6 Itegers Modulo I Example 2.3(e), we have defied the cogruece of two itegers a,b with respect to a modulus. Let us recall that a b (mod ) meas a b. We have proved that cogruece is a equivalece relatio

More information

It is often useful to approximate complicated functions using simpler ones. We consider the task of approximating a function by a polynomial.

It is often useful to approximate complicated functions using simpler ones. We consider the task of approximating a function by a polynomial. Taylor Polyomials ad Taylor Series It is ofte useful to approximate complicated fuctios usig simpler oes We cosider the task of approximatig a fuctio by a polyomial If f is at least -times differetiable

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics ad Egieerig Lecture otes Sergei V. Shabaov Departmet of Mathematics, Uiversity of Florida, Gaiesville, FL 326 USA CHAPTER The theory of covergece. Numerical sequeces..

More information

McGill University Math 354: Honors Analysis 3 Fall 2012 Solutions to selected problems

McGill University Math 354: Honors Analysis 3 Fall 2012 Solutions to selected problems McGill Uiversity Math 354: Hoors Aalysis 3 Fall 212 Assigmet 3 Solutios to selected problems Problem 1. Lipschitz fuctios. Let Lip K be the set of all fuctios cotiuous fuctios o [, 1] satisfyig a Lipschitz

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

5 Many points of continuity

5 Many points of continuity Tel Aviv Uiversity, 2013 Measure ad category 40 5 May poits of cotiuity 5a Discotiuous derivatives.............. 40 5b Baire class 1 (classical)............... 42 5c Baire class 1 (moder)...............

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Chapter 2. Periodic points of toral. automorphisms. 2.1 General introduction

Chapter 2. Periodic points of toral. automorphisms. 2.1 General introduction Chapter 2 Periodic poits of toral automorphisms 2.1 Geeral itroductio The automorphisms of the two-dimesioal torus are rich mathematical objects possessig iterestig geometric, algebraic, topological ad

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

A REMARK ON A PROBLEM OF KLEE

A REMARK ON A PROBLEM OF KLEE C O L L O Q U I U M M A T H E M A T I C U M VOL. 71 1996 NO. 1 A REMARK ON A PROBLEM OF KLEE BY N. J. K A L T O N (COLUMBIA, MISSOURI) AND N. T. P E C K (URBANA, ILLINOIS) This paper treats a property

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Math 220A Fall 2007 Homework #2. Will Garner A

Math 220A Fall 2007 Homework #2. Will Garner A Math 0A Fall 007 Homewor # Will Garer Pg 3 #: Show that {cis : a o-egative iteger} is dese i T = {z œ : z = }. For which values of q is {cis(q): a o-egative iteger} dese i T? To show that {cis : a o-egative

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

CHAPTER I: Vector Spaces

CHAPTER I: Vector Spaces CHAPTER I: Vector Spaces Sectio 1: Itroductio ad Examples This first chapter is largely a review of topics you probably saw i your liear algebra course. So why cover it? (1) Not everyoe remembers everythig

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Chapter 10: Power Series

Chapter 10: Power Series Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Fast Rates for Support Vector Machines

Fast Rates for Support Vector Machines Fast Rates for Support Vector Machies Igo Steiwart ad Clit Scovel CCS-3, Los Alamos Natioal Laboratory, Los Alamos NM 87545, USA {igo,jcs}@lal.gov Abstract. We establish learig rates to the Bayes risk

More information

4 The Sperner property.

4 The Sperner property. 4 The Sperer property. I this sectio we cosider a surprisig applicatio of certai adjacecy matrices to some problems i extremal set theory. A importat role will also be played by fiite groups. I geeral,

More information

5 Birkhoff s Ergodic Theorem

5 Birkhoff s Ergodic Theorem 5 Birkhoff s Ergodic Theorem Amog the most useful of the various geeralizatios of KolmogorovâĂŹs strog law of large umbers are the ergodic theorems of Birkhoff ad Kigma, which exted the validity of the

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

On forward improvement iteration for stopping problems

On forward improvement iteration for stopping problems O forward improvemet iteratio for stoppig problems Mathematical Istitute, Uiversity of Kiel, Ludewig-Mey-Str. 4, D-24098 Kiel, Germay irle@math.ui-iel.de Albrecht Irle Abstract. We cosider the optimal

More information

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution EEL5: Discrete-Time Sigals ad Systems. Itroductio I this set of otes, we begi our mathematical treatmet of discrete-time s. As show i Figure, a discrete-time operates or trasforms some iput sequece x [

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Math 113, Calculus II Winter 2007 Final Exam Solutions

Math 113, Calculus II Winter 2007 Final Exam Solutions Math, Calculus II Witer 7 Fial Exam Solutios (5 poits) Use the limit defiitio of the defiite itegral ad the sum formulas to compute x x + dx The check your aswer usig the Evaluatio Theorem Solutio: I this

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

n p (Ω). This means that the

n p (Ω). This means that the Sobolev s Iequality, Poicaré Iequality ad Compactess I. Sobolev iequality ad Sobolev Embeddig Theorems Theorem (Sobolev s embeddig theorem). Give the bouded, ope set R with 3 ad p

More information

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame

Information Theory Tutorial Communication over Channels with memory. Chi Zhang Department of Electrical Engineering University of Notre Dame Iformatio Theory Tutorial Commuicatio over Chaels with memory Chi Zhag Departmet of Electrical Egieerig Uiversity of Notre Dame Abstract A geeral capacity formula C = sup I(; Y ), which is correct for

More information

7 Sequences of real numbers

7 Sequences of real numbers 40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

SOME GENERALIZATIONS OF OLIVIER S THEOREM

SOME GENERALIZATIONS OF OLIVIER S THEOREM SOME GENERALIZATIONS OF OLIVIER S THEOREM Alai Faisat, Sait-Étiee, Georges Grekos, Sait-Étiee, Ladislav Mišík Ostrava (Received Jauary 27, 2006) Abstract. Let a be a coverget series of positive real umbers.

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Solutions to home assignments (sketches)

Solutions to home assignments (sketches) Matematiska Istitutioe Peter Kumli 26th May 2004 TMA401 Fuctioal Aalysis MAN670 Applied Fuctioal Aalysis 4th quarter 2003/2004 All documet cocerig the course ca be foud o the course home page: http://www.math.chalmers.se/math/grudutb/cth/tma401/

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information