Fast Rates for Support Vector Machines


Ingo Steinwart and Clint Scovel
CCS-3, Los Alamos National Laboratory, Los Alamos NM 87545, USA

Abstract. We establish learning rates to the Bayes risk for support vector machines (SVMs) using a regularization sequence λ_n = n^{-α}, where α ∈ (0, 1) is arbitrary. Under a noise condition recently proposed by Tsybakov these rates can become faster than n^{-1/2}. In order to deal with the approximation error we present a general concept called the approximation error function which describes how well the infinite-sample versions of the considered SVMs approximate the data-generating distribution. In addition we discuss in some detail the relation between the classical approximation error and the approximation error function. Finally, for distributions satisfying a geometric noise assumption we establish some learning rates when the used RKHS is a Sobolev space.

1 Introduction

The goal in binary classification is to predict labels y ∈ Y := {-1, 1} of unseen data points x ∈ X using a training set T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n. As usual we assume that both the training samples (x_i, y_i) and the new sample (x, y) are i.i.d. drawn from an unknown distribution P on X × Y. Now given a classifier C that assigns to every T a function f_T : X → R, the prediction of C for y is sign f_T(x), where we choose a fixed definition of sign(0) ∈ {-1, 1}. In order to learn from T the decision function f_T : X → R should guarantee a small probability for the misclassification, i.e. sign f_T(x) ≠ y, of the example (x, y). To make this precise the risk of a measurable function f : X → R is defined by

    R_P(f) := P({(x, y) : sign f(x) ≠ y}),

and the smallest achievable risk R*_P := inf{R_P(f) | f : X → R measurable} is known as the Bayes risk of P. A function f_P attaining this risk is called a Bayes decision function. Obviously, a good classifier should produce decision functions whose risks are close to the Bayes risk with high probability. To make this precise, we say that a classifier is universally consistent if

    E_{T~P^n} R_P(f_T) - R*_P → 0    for n → ∞.    (1)
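As a concrete illustration of these definitions, the following sketch (our illustration, not part of the paper; the three-point marginal and the values of η(x) := P(y = 1 | x) are invented for the example) computes R_P(f) exactly for a finite X and verifies that the Bayes decision function is sign(2η - 1):

```python
import numpy as np

p_x = np.array([1/3, 1/3, 1/3])   # marginal P_X on a three-point space X
eta = np.array([0.9, 0.4, 0.2])   # eta(x) = P(y = 1 | x) at the three points

def risk(f_sign):
    # R_P(f) = sum_x P_X(x) * P(sign f(x) != y | x):
    # predicting +1 errs with probability 1 - eta(x), predicting -1 with eta(x)
    return float(np.sum(p_x * np.where(f_sign > 0, 1 - eta, eta)))

bayes_f = np.where(2 * eta - 1 >= 0, 1, -1)   # Bayes decision: sign(2*eta - 1)
bayes_risk = risk(bayes_f)                    # = sum_x P_X(x) * min(eta, 1-eta)
```

Here `bayes_risk` equals (0.1 + 0.4 + 0.2)/3, and no other sign pattern on the three points achieves a smaller risk.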
Unfortunately, it is well known that no classifier can guarantee a convergence rate in (1) that simultaneously holds for all distributions (see [1, Thm. 7.2]). However, if one restricts considerations to suitable smaller classes of distributions such rates exist for various classifiers (see e.g. [2, 3, 1]). One interesting feature of

these rates is that they are not faster than n^{-1/2} if the considered distributions P are allowed to be noisy in the sense of R*_P > 0. On the other hand, if one restricts considerations to noise-free distributions P in the sense of R*_P = 0 then some empirical risk minimization (ERM) methods can actually learn with rate n^{-1} (see e.g. [1]). Remarkably, it was only recently discovered (see [4, 5]) that there also exist classes of noisy distributions which can be learned with rates between n^{-1/2} and n^{-1}. The key property of these classes is that their noise level x ↦ |η(x) - 1/2|, with η(x) := P(y = 1 | x), is well-behaved in the sense of the following definition.

Definition 1. A distribution P on X × Y has Tsybakov noise exponent q ∈ [0, ∞] if there exists a C > 0 such that for all sufficiently small t > 0 we have

    P_X({x ∈ X : |2η(x) - 1| ≤ t}) ≤ C · t^q.    (2)

Obviously, all distributions have at least noise exponent 0. At the other extreme, (2) is satisfied for q = ∞ if and only if the conditional probability η is bounded away from the critical level 1/2. In particular this shows that noise-free distributions have exponent q = ∞.

The aim of this work is to establish learning rates for support vector machines (SVMs) under Tsybakov's noise assumption which are comparable to the rates of [4, 5]. Therefore let us now recall these classification algorithms: let X be a compact metric space and H be a RKHS over X with continuous kernel k. Furthermore, let l : Y × R → [0, ∞) be the hinge loss, which is defined by l(y, t) := max{0, 1 - yt}. Then given a training set T ∈ (X × Y)^n and a regularization parameter λ > 0, SVMs solve the optimization problems

    (f_{T,λ}, b_{T,λ}) := arg min_{f ∈ H, b ∈ R} λ‖f‖²_H + (1/n) Σ_{i=1}^n l(y_i, f(x_i) + b),    (3)

or

    f_{T,λ} := arg min_{f ∈ H} λ‖f‖²_H + (1/n) Σ_{i=1}^n l(y_i, f(x_i)),    (4)

respectively. Furthermore, in order to control the size of the offset we always choose b_{T,λ} := y if all samples of T have label y. As usual we call algorithms solving (3) L1-SVMs with offset and algorithms solving (4) L1-SVMs without offset. For more information on these methods we refer to [6].
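To make (4) concrete, here is a minimal numerical sketch (ours, not from the paper; the data and the solver are illustrative choices). For the linear kernel k(u, v) = uv on X ⊂ R the RKHS is H = {x ↦ wx} with ‖f‖_H = |w|, so (4) reduces to minimizing a strictly convex function of a single real variable, which ternary search handles:

```python
import numpy as np

def l1_svm_no_offset(x, y, lam, lo=-1e4, hi=1e4, iters=200):
    """Solve (4) for the linear kernel k(u,v) = u*v on R, i.e. minimize
    J(w) = lam*w**2 + (1/n) * sum_i max(0, 1 - y_i*w*x_i),
    a strictly convex function of w, by ternary search on [lo, hi]."""
    def J(w):
        return lam * w * w + np.mean(np.maximum(0.0, 1.0 - y * (w * x)))
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if J(m1) < J(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# toy data: |x| bounded away from 0, labels y = sign(x)
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, 200) * rng.choice([-1.0, 1.0], 200)
y = np.sign(x)
w = l1_svm_no_offset(x, y, lam=0.01)
train_err = np.mean(np.sign(w * x) != y)
```

Since any w > 0 classifies this data correctly, the hinge term only penalizes small margins and the regularizer λw² sets the scale of the solution. For a general kernel one would instead expand f = Σ_j a_j k(·, x_j) by the representer theorem and minimize over the coefficient vector a.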
The rest of this work is organized as follows: In Section 2 we first introduce two concepts which describe the richness of RKHSs. We then present our main result and discuss it. The following sections are devoted to the proof of this result: In Section 3 we recall some results from [7] which are used for the analysis of the estimation error, and in Section 4 we then prove our main result. Finally, the relation between the approximation error and infinite-sample SVMs, which is of its own interest, is discussed in the appendix.

2 Definitions and Results

For the formulation of our results we need two notions which deal with the richness of RKHSs. While the first notion is a complexity measure in terms of covering numbers which is used to bound the estimation error, the second one describes the approximation properties of RKHSs with respect to distributions.

In order to introduce the complexity measure let us recall that for a Banach space E with closed unit ball B_E, the covering numbers of A ⊂ E are defined by

    N(A, ε, E) := min{n ≥ 1 : ∃ x_1, ..., x_n ∈ E with A ⊂ ⋃_{i=1}^n (x_i + εB_E)},    ε > 0.

Given a training set T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n we denote the space of all equivalence classes of functions f : X × Y → R equipped with the norm

    ‖f‖_{L_2(T)} := ((1/n) Σ_{i=1}^n f(x_i, y_i)²)^{1/2}

by L_2(T). In other words, L_2(T) is an L_2-space with respect to the empirical measure of T. Note that for a function f : X × Y → R a canonical representative in L_2(T) is the restriction f|_T. Furthermore, we write L_2(T_X) for the space of all (equivalence classes of) square integrable functions with respect to the empirical measure of x_1, ..., x_n. Now our complexity measure is:

Definition 2. Let H be a RKHS over X and B_H its closed unit ball. We say that H has complexity exponent 0 < p ≤ 2 if there exists a constant c > 0 such that for all ε > 0 we have

    sup_{T_X ∈ X^n} log N(B_H, ε, L_2(T_X)) ≤ c ε^{-p}.

By using the theory of absolutely 2-summing operators one can show that every RKHS has complexity exponent p = 2. However, for meaningful rates we need complexity exponents which are strictly smaller than 2.

In order to introduce the second notion, describing the approximation properties of RKHSs, we first have to recall the infinite-sample versions of (3) and (4). To this end let l be the hinge loss function and P be a distribution on X × Y. Then for f : X → R the l-risk of f is defined by

    R_{l,P}(f) := E_{(x,y)~P} l(y, f(x)).    (5)

Now given a RKHS H over X and λ > 0 we define

    (f_{P,λ}, b_{P,λ}) := arg min_{f ∈ H, b ∈ R} λ‖f‖²_H + R_{l,P}(f + b)    (6)

and

    f_{P,λ} := arg min_{f ∈ H} λ‖f‖²_H + R_{l,P}(f)    (7)

(see [8] for the existence of these minimizers).
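The covering numbers appearing in Definition 2 can be estimated empirically from above: a greedy ε-net construction produces a set of centers whose ε-balls cover A, so its size upper-bounds N(A, ε, ·). The sketch below (our illustration, not from the paper; it covers a finite point cloud in the Euclidean plane rather than the ball B_H of an RKHS) shows the mechanism:

```python
import numpy as np

def greedy_cover(A, eps):
    """Greedy eps-net: repeatedly take the first still-uncovered point of A
    as a new center until every point lies within eps (Euclidean distance)
    of some center. len(result) is an upper bound on N(A, eps)."""
    centers = []
    uncovered = np.ones(len(A), dtype=bool)
    while uncovered.any():
        c = A[int(np.argmax(uncovered))]   # first uncovered point
        centers.append(c)
        uncovered &= np.linalg.norm(A - c, axis=1) > eps
    return np.array(centers)

# A = 400 points on the unit circle in R^2; here N(A, eps) grows like 1/eps,
# i.e. log N(A, eps) ~ log(1/eps), far below the eps^{-p} bound of Definition 2
t = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
A = np.column_stack([np.cos(t), np.sin(t)])
sizes = {eps: len(greedy_cover(A, eps)) for eps in (0.5, 0.25, 0.125)}
```

Halving ε roughly doubles the number of centers needed for the circle, which is the one-dimensional analogue of the polynomial growth that the complexity exponent p quantifies.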
Note that these definitions give the solutions (f_{T,λ}, b_{T,λ}) and f_{T,λ} of (3) and (4), respectively, if P is an empirical

distribution with respect to a training set T. In this case we write R_{l,T}(f) for the (empirical) l-risk. With these notations in mind we define the approximation error function by

    a(λ) := λ‖f_{P,λ}‖²_H + R_{l,P}(f_{P,λ}) - R*_{l,P},    λ ≥ 0,    (8)

where R*_{l,P} := inf{R_{l,P}(f) | f : X → R} denotes the smallest possible l-risk. Note that since the obvious variant of a(·) that involves an offset is not greater than the above approximation error function, we restrict our attention to the latter. Furthermore, we discuss the relationship between a(·) and the standard approximation error in the appendix.

The approximation error function quantifies how well an infinite-sample L1-SVM with RKHS H approximates the minimal l-risk. It was shown in [8] that if H is dense in the space of continuous functions C(X) then for all P we have a(λ) → 0 for λ → 0. However, in non-trivial situations no rate of convergence which uniformly holds for all distributions P is possible. The following definition characterizes distributions which guarantee certain polynomial rates:

Definition 3. Let H be a RKHS over X and P be a probability measure on X × Y. We say that H approximates P with exponent 0 ≤ β ≤ 1 if there exists a constant C > 0 such that for all λ > 0 we have a(λ) ≤ Cλ^β.

Note that H approximates every distribution P with exponent β = 0. We will see in the appendix that the other extremal case β = 1 is equivalent to the fact that the minimal l-risk can be achieved by an element of H.

With the help of the above notations we can now formulate our main result.

Theorem 1. Let H be a RKHS of a continuous kernel on a compact metric space X with complexity exponent 0 < p < 2, and let P be a probability measure on X × Y with Tsybakov noise exponent 0 ≤ q ≤ ∞. Furthermore, assume that H approximates P with exponent 0 < β ≤ 1. We define λ_n := n^{-α} for some α ∈ (0, 1) and all n ≥ 1. If

    α < 4(q+1) / ((2q + pq + 4)(1 + β)),

then there exists a C > 0 with

    Pr*(T ∈ (X × Y)^n : R_P(f_{T,λ_n}) ≤ R*_P + C x n^{-αβ}) ≥ 1 - e^{-x}

for all n ≥ 1 and x ≥ 1. Here Pr* is the outer probability of P^n in order to avoid measurability considerations.
Furthermore, if α ≥ 4(q+1)/((2q + pq + 4)(1 + β)), then for all ε > 0 there is a C > 0 such that for all x ≥ 1, n ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : R_P(f_{T,λ_n}) ≤ R*_P + C x n^{α + ε - 4(q+1)/(2q+pq+4)}) ≥ 1 - e^{-x}.

Finally, the same results hold for the L1-SVM with offset whenever q > 0.

Remark 1. The best rates Theorem 1 can guarantee are (up to an ε) of the form

    n^{-4β(q+1)/((2q+pq+4)(1+β))},

and an easy calculation shows that these rates are obtained for the value

    α := 4(q+1) / ((2q + pq + 4)(1 + β)).

This result has already been announced in [9] and presented in an earlier (and substantially longer) version of [7]. The main difference of Theorem 1 to its predecessors is that it does not require to choose α optimally. Finally note that unfortunately the optimal α is in terms of both q and β, which are in general not accessible. At the moment we are not aware of any method which can adaptively find the (almost) optimal values for α.

Remark 2. In [5] it is assumed that a Bayes classifier is contained in the base function classes the considered ERM method minimizes over. This assumption corresponds to a perfect approximation of P by H, i.e. β = 1, as we will see in the appendix. If in this case we rescale the complexity exponent p from (0, 2) to (0, 1) and write p′ for the new complexity measure, our optimal rate essentially becomes

    n^{-(q+1)/(q + p′q + 2)}.

Recall that this is exactly the form of Tsybakov's result in [5], which is known to be optimal in a minimax sense for some specific classes of distributions. However, as far as we know our complexity measure cannot be compared to Tsybakov's, and thus the above reasoning only indicates that our optimal rates may be optimal in a minimax sense.

Let us finally present an example which shows how the developed theory can be used to establish learning rates for specific types of kernels and distributions.

Example 1 (SVMs using Sobolev spaces). Let X ⊂ R^d be the closed unit Euclidean ball, Ω be the centered open ball of radius 3, and W^m(Ω) be the Sobolev space of order m ∈ N over Ω. Recall that W^m(Ω) is a RKHS of a continuous kernel if m > d/2 (see e.g. [10]). Let us write H_m := {f|_X : f ∈ W^m(Ω)} for the restriction of W^m(Ω) onto X endowed with the induced RKHS norm. Then (see again [10]) the RKHS H_m has complexity exponent p := d/m if m > d/2.
Now let P be a distribution on X × Y which has geometric noise exponent α ∈ (0, ∞] in the sense of [7], and let

    k_σ(x, x′) := exp(-σ²‖x - x′‖²₂),    x, x′ ∈ Ω,

be a Gaussian RBF kernel with associated integral operator T_σ : L_2(Ω) → L_2(Ω), where L_2(Ω) is with respect to the Lebesgue measure. Then by the results in [7, Secs. 3 & 4] there exist constants c_d, c_{α,m,d} ≥ 1 such that for all σ > 0 there exists an f_σ ∈ L_2(Ω) with ‖f_σ‖_{L_2(Ω)} = c_d σ^d,

    R_{l,P}((T_σ f_σ)|_X) - R*_{l,P} ≤ c_{α,m,d} σ^{-αd},

and ‖(T_σ f_σ)|_X‖_{H_m} ≤ c_{α,m,d} σ^{m-d/2} ‖f_σ‖_{L_2(Ω)}. This yields a constant c > 0 with

    a(λ) ≤ c (λσ^{2m+d} + σ^{-αd})

for all σ > 0 and all λ > 0. Minimizing with respect to σ then shows that H_m approximates P with exponent

    β := αd / ((α+1)d + 2m).

Consequently we can use Theorem 1 to obtain learning rates for SVMs using H_m for m > d/2. In particular the resulting optimal rates in the sense of Remark 1 are (essentially) of the form

    n^{-4αdm(q+1)/((2mq + dq + 4m)((2α+1)d + 2m))}.
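The exponent bookkeeping in Remark 1, Remark 2 and Example 1 is mechanical but easy to get wrong by hand. The following sketch (ours, using exact rational arithmetic) evaluates the optimal rate exponent of Remark 1 and checks it against the Tsybakov form of Remark 2 and the closed-form Sobolev rate of Example 1:

```python
from fractions import Fraction

def optimal_rate(q, p, beta):
    """Remark 1: the best guaranteed rate exponent is alpha* * beta with
    alpha* = 4(q+1) / ((2q + pq + 4)(1 + beta))."""
    q, p, beta = map(Fraction, (q, p, beta))
    alpha = 4 * (q + 1) / ((2 * q + p * q + 4) * (1 + beta))
    return alpha * beta

# Remark 2: beta = 1 and p = 2*p' with p' in (0, 1) recovers Tsybakov's
# exponent (q+1) / (q + p'q + 2)
q, pp = Fraction(2), Fraction(1, 2)
assert optimal_rate(q, 2 * pp, 1) == (q + 1) / (q + pp * q + 2)

# Example 1: p = d/m and beta = alpha*d / ((alpha+1)d + 2m) give the rate
# 4*alpha*d*m*(q+1) / ((2mq + dq + 4m)((2*alpha+1)d + 2m))
a, q, m, d = Fraction(1), Fraction(1), Fraction(3), Fraction(2)
beta = a * d / ((a + 1) * d + 2 * m)
closed = 4*a*d*m*(q+1) / ((2*m*q + d*q + 4*m) * ((2*a + 1)*d + 2*m))
assert optimal_rate(q, d / m, beta) == closed
```

Both identities hold exactly as `Fraction` equalities, which is a convenient way to double-check this kind of exponent algebra.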

3 Prerequisites

In this section we recall some important notions and results that we require in the proof of our main theorem. To this end let H be a RKHS over X that has a continuous kernel k. Then recall that every f ∈ H is continuous and satisfies

    ‖f‖_∞ ≤ K‖f‖_H,    where we use    K := sup_{x∈X} √k(x, x).

The rest of this section recalls some results from [7] which will be used to bound the estimation error of L1-SVMs. Before we state these results we have to recall some notation from [7]: let F be a class of bounded measurable functions from a set Z to R, and let L : F × Z → [0, ∞) be a function. We call L a loss function if L ∘ f := L(f, ·) is measurable for all f ∈ F. Moreover, if F is convex, we say that L is convex if L(·, z) is convex for all z ∈ Z. Finally, L is called line-continuous if for all z ∈ Z and all f, f̂ ∈ F the function t ↦ L(tf + (1-t)f̂, z) is continuous on [0, 1]. Note that if F is a vector space then every convex L is line-continuous. Now, given a probability measure P on Z we denote by f_{P,F} ∈ F a minimizer of the L-risk

    f ↦ R_{L,P}(f) := E_{z~P} L(f, z).

If P is an empirical measure with respect to T ∈ Z^n we write f_{T,F} and R_{L,T}(·) as usual. For simplicity, we assume throughout this section that f_{P,F} and f_{T,F} do exist. Also note that although there may exist multiple solutions, we use a single symbol for them whenever no confusion regarding the non-uniqueness of this symbol can be expected. Furthermore, an algorithm that produces solutions f_{T,F} for all possible T is called an empirical L-risk minimizer. Now the main result of this section, shown in [7], reads as follows:

Theorem 2. Let F be a convex set of bounded measurable functions from Z to R and let L : F × Z → [0, ∞) be a convex and line-continuous loss function. For a probability measure P on Z we define

    G := {L ∘ f - L ∘ f_{P,F} : f ∈ F}.    (9)

Suppose we have c ≥ 0, 0 < α ≤ 1, δ ≥ 0 and B > 0 with E_P g² ≤ c (E_P g)^α + δ and ‖g‖_∞ ≤ B for all g ∈ G. Furthermore, assume that G is separable with respect to ‖·‖_∞ and that there are constants a ≥ 1 and 0 < p < 2 with

    sup_{T ∈ Z^n} log N(B^{-1}G, ε, L_2(T)) ≤ a ε^{-p}    (10)

for all ε > 0.
Then there exists a constant c_p > 0 depending only on a and p such that for all n ≥ 1 and all x ≥ 1 we have

    Pr*(T ∈ Z^n : R_{L,P}(f_{T,F}) > R_{L,P}(f_{P,F}) + c_p ε(n, B, c, δ, x)) ≤ e^{-x},

where

    ε(n, B, c, δ, x) := B^{2p/(4-2α+αp)} c^{(2-p)/(4-2α+αp)} n^{-2/(4-2α+αp)} + B^{p/(2+p)} δ^{(2-p)/(2+p)} n^{-2/(2+p)} + (δx/n)^{1/2} + (cx/n)^{1/(2-α)} + Bx/n.

Let us now recall some variance bounds of the form E_P g² ≤ c (E_P g)^α + δ for SVMs proved in [7]. To this end let H be a RKHS of a continuous kernel over X, λ > 0, and l be the hinge loss function. We define

    L(f, x, y) := λ‖f‖²_H + l(y, f(x))    (11)

and

    L(f, b, x, y) := λ‖f‖²_H + l(y, f(x) + b)    (12)

for all f ∈ H, b ∈ R, x ∈ X, and y ∈ Y. Since R_{L,T}(·) and R_{L,T}(·,·) coincide with the objective functions of the L1-SVM formulations, we see that the L1-SVMs actually implement an empirical L-risk minimization in the sense of Theorem 2. Now the first variance bound from [7] does not require any assumptions on P.

Proposition 1. Let 0 < λ < 1, H be a RKHS over X, and F := λ^{-1/2} B_H. Furthermore, let L be defined by (11), P be a probability measure and G be defined as in (9). Then for all g ∈ G we have

    E_P g² ≤ λ^{-1/2}(2 + K) E_P g.

Finally, the following variance bound from [7] shows that the previous bound can be improved if one assumes a non-trivial Tsybakov exponent for P.

Proposition 2. Let P be a distribution on X × Y with Tsybakov noise exponent 0 < q ≤ ∞. Then there exists a constant C > 0 such that for all λ > 0, all 0 < r ≤ λ^{-1/2} satisfying f_{P,λ} ∈ rB_H, all f ∈ rB_H, and all b ∈ R with |b| ≤ Kr + 1 we have

    E(L∘(f, b) - L∘(f_{P,λ}, b_{P,λ}))² ≤ C(Kr + 1)^{(q+2)/(q+1)} (E(L∘(f, b) - L∘(f_{P,λ}, b_{P,λ})))^{q/(q+1)} + C(Kr + 1)^{(q+2)/(q+1)} a^{q/(q+1)}(λ).

Furthermore, the same result holds for SVMs without offset.

4 Proof of Theorem 1

In this section we prove Theorem 1. To this end we write f(x) ≾ g(x) for two functions f, g : D → [0, ∞), D ⊂ (0, ∞), if there exists a constant C > 0 such that f(x) ≤ Cg(x) holds over some range of x which usually is implicitly defined by the context. However, for sequences this range is always N. Finally we write f(x) ≍ g(x) if both f(x) ≾ g(x) and g(x) ≾ f(x) for the same range.

Since our variance bounds have different forms for the cases q = 0 and q > 0 we have to prove the theorem for these cases separately. We begin with the case q = 0 and an important lemma which describes a shrinking technique.

Lemma 1. Let H and P be as in Theorem 1. For γ > -β we define λ_n := n^{-1/(1+β+γ)}. Now assume that there are constants 0 ≤ ρ < β and C ≥ 1 such that

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{(ρ-1)/2}) ≥ 1 - e^{-x}

for all n ≥ 1, x ≥ 1. Then there is another constant Ĉ ≥ 1 such that for ρ̂ := min{β, (ρ+β+γ)/2, β+γ} and for all n ≥ 1, x ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2}) ≥ 1 - e^{-x}.

Proof. Let f̂_{T,λ_n} be a minimizer of R_{L,T} on Cxλ_n^{(ρ-1)/2} B_H, where L is defined by (11). By our assumption we have f̂_{T,λ_n} = f_{T,λ_n} with probability not less than 1 - e^{-x}, since f_{T,λ_n} is unique for every training set T by the strict convexity of L. We will show that for some Ĉ > 0 and all n ≥ 1, x ≥ 1 the improved bound

    ‖f̂_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2}    (13)

holds with probability not less than 1 - e^{-x}. Consequently, ‖f_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2} will hold with probability not less than 1 - 2e^{-x}. Obviously, the latter implies the assertion.

In order to establish (13) we will apply Theorem 2 to the modified L1-SVM classifier that produces f̂_{T,λ_n}. To this end we first observe that the separability condition of Theorem 2 is satisfied since H is separable and continuously embedded into C(X). Furthermore, it was shown in [7] that the covering number condition holds, and by Proposition 1 we may choose c such that c ≾ λ_n^{-1/2}, and δ = 0. Additionally, we can obviously choose B ≾ xλ_n^{(ρ-1)/2}. The term ε(n, B, c, δ, x) in Theorem 2 can then be estimated by

    ε(n, B, c, δ, x) ≾ x λ_n^{(pρ+2β+2γ)/(2+p)} + x λ_n^{β+γ}.

Now for ρ ≤ β + γ we have (ρ+β+γ)/2 ≤ (pρ+2β+2γ)/(2+p), and hence we obtain

    ε(n, B, c, δ, x) ≾ x λ_n^{(ρ+β+γ)/2} + x λ_n^{β+γ}.

Furthermore, if ρ > β + γ we have both β + γ < (pρ+2β+2γ)/(2+p) and β + γ < (ρ+β+γ)/2, and thus we again find

    ε(n, B, c, δ, x) ≾ x λ_n^{β+γ} ≾ x λ_n^{β+γ} + x λ_n^{(ρ+β+γ)/2}.

Now, in both cases Theorem 2 gives a constant C₁ > 0 independent of n and x such that for all n ≥ 1 and all x ≥ 1 the estimate

    λ_n‖f̂_{T,λ_n}‖² ≤ λ_n‖f̂_{T,λ_n}‖² + R_{l,P}(f̂_{T,λ_n}) - R*_{l,P} ≤ λ_n‖f̂_{P,λ_n}‖² + R_{l,P}(f̂_{P,λ_n}) - R*_{l,P} + C₁ x λ_n^{(ρ+β+γ)/2} + C₁ x λ_n^{β+γ}

holds with probability not less than 1 - e^{-x}. Furthermore, by Theorem 4 we obtain ‖f_{P,λ_n}‖ ≾ λ_n^{(β-1)/2} ≤ Cxλ_n^{(ρ-1)/2} for large n, which gives f_{P,λ_n} = f̂_{P,λ_n} for such n. With probability not less than 1 - e^{-x} we hence have

    λ_n‖f̂_{T,λ_n}‖² ≤ λ_n‖f_{P,λ_n}‖² + R_{l,P}(f_{P,λ_n}) - R*_{l,P} + C₁ x λ_n^{(ρ+β+γ)/2} + C₁ x λ_n^{β+γ} ≤ C₂ λ_n^β + C₁ x λ_n^{(ρ+β+γ)/2} + C₁ x λ_n^{β+γ}

for some constants C₁, C₂ > 0 independent of n and x. From this we easily obtain that (13) holds for all n ≥ 1 with probability not less than 1 - e^{-x}.

Proof (of Theorem 1 for q = 0). We first observe that there exists a γ > -β with α = 4(q+1)/((2q+pq+4)(1+β+γ)) = 1/(1+β+γ). We fix this γ and define ρ₀ := 0 and ρ_{i+1} := min{β, (ρ_i+β+γ)/2, β+γ}. Then it is easy to check that this definition gives

    ρ_i = min{β, (β+γ) Σ_{j=1}^i 2^{-j}, β+γ} = min{β, (β+γ)(1 - 2^{-i})}.

Now, iteratively applying Lemma 1 gives a sequence of constants C_i > 0 with

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ C_i xλ_n^{(ρ_i-1)/2}) ≥ 1 - e^{-x}    (14)

for all n ≥ 1 and all x ≥ 1.

Let us first consider the case -β < γ ≤ 0. Then we have ρ_i = (β+γ)(1 - 2^{-i}), and hence (14) shows that for all ε > 0 there exists a constant C > 0 such that

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{((1-ε)(β+γ)-1)/2}) ≥ 1 - e^{-x}

for all n ≥ 1 and all x ≥ 1. We write ρ := (1-ε)(β+γ). As in the proof of Lemma 1 we denote a minimizer of R_{L,T} on Cxλ_n^{(ρ-1)/2} B_H by f̂_{T,λ_n}. We have just seen that f̂_{T,λ_n} = f_{T,λ_n} with probability not less than 1 - e^{-x}. Therefore, we only have to apply Theorem 2 to the modified optimization problem which defines f̂_{T,λ_n}. To this end we first see as in the proof of Lemma 1 that

    ε(n, B, c, δ, x) ≾ x λ_n^{(pρ+2β+2γ)/(2+p)} + x λ_n^{β+γ} ≾ x λ_n^{(pρ+2β+2γ)/(2+p)} ≾ x λ_n^{β+γ-ε},

where in the last two estimates we used the definition of ρ. Furthermore, we have already seen in the proof of Lemma 1 that

    λ_n‖f̂_{P,λ_n}‖² + R_{l,P}(f̂_{P,λ_n}) - R*_{l,P} ≤ a(λ_n)

holds for large n. Therefore, applying Theorem 2 and an inequality of Zhang (see [11]) between the excess classification risk and the excess l-risk, we find that for all n ≥ 1 we have with probability not less than 1 - e^{-x}:

    R_P(f̂_{T,λ_n}) - R*_P ≤ λ_n‖f̂_{T,λ_n}‖² + R_{l,P}(f̂_{T,λ_n}) - R*_{l,P} ≤ λ_n‖f̂_{P,λ_n}‖² + R_{l,P}(f̂_{P,λ_n}) - R*_{l,P} + C₁ x λ_n^{β+γ-ε} ≤ C₂ x λ_n^{β+γ-ε},    (15)

where C₁, C₂ > 0 are constants independent of n and x. Now, from (15) we easily deduce the assertion using the definition of λ_n and γ.

Let us finally consider the case γ > 0. Then for large integers i we have ρ_i = β, and hence (14) gives a C > 0 such that for all n ≥ 1, x ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{(β-1)/2}) ≥ 1 - e^{-x}.

Proceeding as for γ ≤ 0 we get

    ε(n, B, c, δ, x) ≾ x λ_n^{(pβ+2β+2γ)/(2+p)} + x λ_n^{β+γ} ≾ x λ_n^β,

from which we easily obtain the assertion using the definition of λ_n and γ.

In the rest of this section we will prove Theorem 1 for q > 0. We begin with a lemma which is similar to Lemma 1.

Lemma 2. Let H and P be as in Theorem 1. For γ > -β we define λ_n := n^{-4(q+1)/((2q+pq+4)(1+β+γ))}. Now assume that there are ρ ∈ [0, β) and C ≥ 1 with

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{(ρ-1)/2}) ≥ 1 - e^{-x}

for all n ≥ 1 and all x ≥ 1. Then there is another constant Ĉ ≥ 1 such that for ρ̂ := min{β, (ρ+β+γ)/2} and for all n ≥ 1, x ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2}) ≥ 1 - e^{-x}.

The same result holds for L1-SVMs with offset.

Proof. For brevity's sake we only prove this lemma for L1-SVMs with offset. The proof for L1-SVMs without offset is almost identical. Now, let L be defined by (12). Analogously to the proof of Lemma 1 we denote a minimizer of R_{L,T}(·,·) on Cxλ_n^{(ρ-1)/2}(B_H × [-K-1, K+1]) by (f̂_{T,λ_n}, b̂_{T,λ_n}). By our assumption (see [7]) we have |b_{T,λ_n}| ≤ Cxλ_n^{(ρ-1)/2}(K+1) with probability not less than 1 - e^{-x} for all possible values of the offset. In addition, for such training sets we have f̂_{T,λ_n} = f_{T,λ_n} since the RKHS component f_{T,λ_n} of L1-SVM solutions is unique for T by the strict convexity of L in f. Furthermore, by the above considerations we may define b̂_{T,λ_n} := b_{T,λ_n} for such training sets. As in the proof of Lemma 1 it now suffices to show the existence of a Ĉ > 0 such that ‖f̂_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2} with probability not less than 1 - e^{-x}. To this end we first observe by Proposition 2 that we may choose B, c and δ such that

    B ≾ xλ_n^{(ρ-1)/2},    c ≾ x^{(q+2)/(q+1)} λ_n^{(ρ-1)(q+2)/(2(q+1))},    and    δ ≾ x^{(q+2)/(q+1)} λ_n^{(ρ-1)(q+2)/(2(q+1)) + βq/(q+1)}.
Some calculations then show that ε(n, B, c, δ, x) in Theorem 2 satisfies

    ε(n, B, c, δ, x) ≾ x λ_n^{(ρ+β+γ)/2} + x λ_n^{[(ρ+β+γ)(2q+pq+4) + 2βq(2-p)]/(8(q+1))}.

Furthermore, observe that we have ρ ≤ β - γ if and only if

    ρ + β + γ ≤ [(ρ+β+γ)(2q+pq+4) + 2βq(2-p)] / (4(q+1)).

Now let us first consider the case ρ ≤ β - γ. Then the above considerations show ε(n, a, B, c, δ, x) ≾ x λ_n^{(ρ+β+γ)/2}. Furthermore, we obviously have λ_n^β ≾ λ_n^{(ρ+β+γ)/2}. As in the proof of Lemma 1 we hence find a constant Ĉ > 0 such that for all x ≥ 1, n ≥ 1 we have λ_n‖f̂_{T,λ_n}‖² ≤ Ĉx λ_n^{(ρ+β+γ)/2} with probability not less than 1 - e^{-x}. On the other hand, if ρ > β - γ we have

    ε(n, a, B, c, δ, x) ≾ x λ_n^{[(ρ+β+γ)(2q+pq+4) + 2βq(2-p)]/(8(q+1))} ≾ x λ_n^β,

so that we get λ_n‖f̂_{T,λ_n}‖² ≤ Ĉx λ_n^β in the above sense.

Proof (of Theorem 1 for q > 0). By using Lemma 2 the proof in the case q > 0 is completely analogous to the case q = 0.

Appendix

Throughout this section P denotes a Borel probability measure on X × Y and H denotes a RKHS of continuous functions over X. We use the shorthand ‖·‖ for ‖·‖_H when no confusion should arise. Unlike in the other sections of this paper, here L denotes an arbitrary convex loss function, that is, a continuous function L : Y × R → [0, ∞) convex in its second variable. The corresponding L-risk R_{L,P}(f) of a function f : X → R and its minimal value R*_{L,P} are defined in the obvious way. For simplicity we also assume R_{L,P}(0) = 1. Note that all the requirements are met by the hinge loss function. Furthermore, let us define f_{P,λ} by replacing R_{l,P} by R_{L,P} in (7). In addition we write

    f*_{P,λ} := arg min{‖f‖ : f ∈ arg min_{‖f′‖ ≤ λ^{-1/2}} R_{L,P}(f′)}.    (16)

Of course, we need to prove the existence and uniqueness of f*_{P,λ}, which is done in the following lemma.

Lemma 3. Under the above assumptions f*_{P,λ} is well defined.

Proof. Let us first show that there exists an f ∈ λ^{-1/2} B_H which minimizes R_{L,P}(·) in λ^{-1/2} B_H. To that end consider a sequence (f_n) in λ^{-1/2} B_H such that R_{L,P}(f_n) → inf_{‖f‖ ≤ λ^{-1/2}} R_{L,P}(f). By the Eberlein–Smulyan theorem we can assume without loss of generality that there exists an f with ‖f‖ ≤ λ^{-1/2} and f_n → f weakly. Using the fact that weak convergence in RKHSs implies pointwise convergence, Lebesgue's theorem and the continuity of L then give R_{L,P}(f_n) → R_{L,P}(f).

Hence there is a minimizer of R_{L,P}(·) in λ^{-1/2} B_H, i.e. we have

    A := {f : f ∈ arg min_{‖f′‖ ≤ λ^{-1/2}} R_{L,P}(f′)} ≠ ∅.

We now show that there is exactly one f ∈ A having minimal norm.

Existence: Let (f_n) ⊂ A with ‖f_n‖ → inf_{f∈A} ‖f‖ for n → ∞. Like in the proof establishing A ≠ ∅, we can show that there exists an f* ∈ A with f_n → f* weakly, and R_{L,P}(f_n) → R_{L,P}(f*). This shows f* ∈ A. Furthermore, by the weak convergence we always have ‖f*‖ ≤ lim inf ‖f_n‖ = inf_{f∈A} ‖f‖.

Uniqueness: Suppose we have two such elements f and g with f ≠ g. By convexity we find (f+g)/2 ∈ arg min_{‖f′‖ ≤ λ^{-1/2}} R_{L,P}(f′). However, ‖·‖_H is strictly convex, which gives ‖(f+g)/2‖ < ‖f‖.

In the following we will define the approximation error and the approximation error function for general L. In order to also treat non-universal kernels we first denote the minimal L-risk of functions in H by R_{L,P,H} := inf_{f∈H} R_{L,P}(f). Furthermore, we say that f ∈ H minimizes the L-risk in H if R_{L,P}(f) = R_{L,P,H}. Note that if such a minimizer exists then by Lemma 3 there actually exists a unique element f*_{L,P,H} ∈ H minimizing the L-risk in H with ‖f*_{L,P,H}‖ ≤ ‖f‖ for all f ∈ H minimizing the L-risk in H. Moreover, we have ‖f_{P,λ}‖ ≤ ‖f*_{L,P,H}‖ for all λ > 0, since otherwise we find a contradiction by

    λ‖f*_{L,P,H}‖² + R_{L,P}(f*_{L,P,H}) < λ‖f_{P,λ}‖² + R_{L,P}(f_{P,λ}).

Now, for λ ≥ 0 we write

    a(λ) := λ‖f_{P,λ}‖² + R_{L,P}(f_{P,λ}) - R_{L,P,H},    (17)
    a*(λ) := R_{L,P}(f*_{P,λ}) - R_{L,P,H}.    (18)

Recall that for universal kernels and the hinge loss function we have R_{L,P,H} = R*_{L,P} (see [8]), and hence in this case a(·) equals the approximation error function defined in Section 2. Furthermore, for these kernels, a*(λ) is the classical approximation error of the hypothesis class λ^{-1/2} B_H. Our first theorem shows how to compare a(·) and a*(·).

Theorem 3. With the above notations we have a(0) = a*(0) = 0. Furthermore, a*(·) is increasing, and a(·) is increasing, concave, and continuous. In addition, we have a*(λ) ≤ a(λ) for all λ ≥ 0, and for any h : (0, ∞) → (0, ∞) with a*(λ) ≤ h(λ) for all λ > 0, we have a(λh(λ)) ≤ 2h(λ) for all λ > 0.
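The monotonicity and concavity of a(·) asserted in Theorem 3 are easy to check numerically in a toy setting. The sketch below (our illustration, not from the paper) uses the one-dimensional linear RKHS H = {x ↦ wx}, the hinge loss, and a six-point empirical measure; since w → ∞ classifies every point with margin, R_{L,P,H} = 0 here and a(λ) = min_w (λw² + R_{L,P}(wx)):

```python
import numpy as np

# empirical marginal on six points away from 0, labels y = sign(x)
x = np.array([-1.0, -0.6, -0.3, 0.3, 0.6, 1.0])
y = np.sign(x)

def risk(w):
    # hinge L-risk of f(x) = w*x under the empirical distribution
    return float(np.mean(np.maximum(0.0, 1.0 - y * (w * x))))

def a(lam, lo=0.0, hi=1e4, iters=300):
    # approximation error function (17); the objective is strictly convex
    # in w, so ternary search finds the minimum
    J = lambda w: lam * w * w + risk(w)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if J(m1) < J(m2):
            hi = m2
        else:
            lo = m1
    return J((lo + hi) / 2)

lams = np.linspace(0.01, 1.0, 21)
vals = np.array([a(l) for l in lams])
increasing = bool(np.all(np.diff(vals) >= -1e-9))
concave = bool(np.all(vals[1:-1] >= (vals[:-2] + vals[2:]) / 2 - 1e-9))
```

Both flags come out `True`: a(·) is the pointwise infimum over w of the functions λ ↦ λw² + R_{L,P}(wx), each affine and increasing in λ, which is exactly the argument used in the proof below.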

Proof. It is clear from the definitions (17) and (18) that a(0) = a*(0) = 0 and that a*(·) is increasing. Since a(·) is an infimum over a family of linear increasing functions of λ, it follows that a(·) is also concave and increasing. Consequently a(·) is continuous for λ > 0 (see [12, Thm. 10.1]), and continuity at 0 follows from the proof of [8, Prop. 3.2].

To prove the second assertion, observe that ‖f_{P,λ}‖² ≤ 1/λ implies R_{L,P}(f*_{P,λ}) ≤ R_{L,P}(f_{P,λ}) for all λ > 0, and hence we find a*(λ) ≤ a(λ) for all λ ≥ 0. Now let λ̄ := h(λ)‖f*_{P,λ}‖^{-2}. Then we obtain

    λ̄‖f_{P,λ̄}‖² + R_{L,P}(f_{P,λ̄}) ≤ λ̄‖f*_{P,λ}‖² + R_{L,P}(f*_{P,λ}) ≤ R_{L,P,H} + 2h(λ).

This shows a(λ̄) ≤ 2h(λ). Furthermore, we have λh(λ) ≤ h(λ)‖f*_{P,λ}‖^{-2} = λ̄ and thus the assertion follows since a(·) is an increasing function.

Our next goal is to show how the asymptotic behaviour of a(·), a*(·) and λ ↦ ‖f_{P,λ}‖ are related to each other. Let us begin with a lemma that characterizes the existence of f*_{L,P,H} ∈ H in terms of the function λ ↦ ‖f_{P,λ}‖.

Lemma 4. The minimizer f*_{L,P,H} ∈ H of the L-risk in H exists if and only if there exists a constant c > 0 with ‖f_{P,λ}‖ ≤ c for all λ > 0. In this case we additionally have lim_{λ→0+} ‖f_{P,λ} - f*_{L,P,H}‖_H = 0.

Proof. Let us first assume that f*_{L,P,H} ∈ H exists. Then we have already seen ‖f_{P,λ}‖ ≤ ‖f*_{L,P,H}‖ for all λ > 0, so that it remains to show the convergence. To this end let (λ_n) be a positive sequence converging to 0. By the boundedness of (f_{P,λ_n}) there then exists an f* ∈ H and a subsequence (f_{P,λ_{n_i}}) with f_{P,λ_{n_i}} → f* weakly. This implies R_{L,P}(f_{P,λ_{n_i}}) → R_{L,P}(f*) as in the proof of Lemma 3. Furthermore, we always have λ_{n_i}‖f_{P,λ_{n_i}}‖² → 0 and thus

    R_{L,P,H} = lim_i (λ_{n_i}‖f_{P,λ_{n_i}}‖² + R_{L,P}(f_{P,λ_{n_i}})) = R_{L,P}(f*),    (19)

where the first equality can be shown as in [8] for universal kernels. In other words, f* minimizes the L-risk in H and hence we have

    ‖f_{P,λ_{n_i}}‖ ≤ ‖f*_{L,P,H}‖ ≤ ‖f*‖ ≤ lim inf_j ‖f_{P,λ_{n_j}}‖

for all i ≥ 1. This shows both ‖f_{P,λ_{n_i}}‖ → ‖f*‖ and ‖f*_{L,P,H}‖ = ‖f*‖, and consequently we find f*_{L,P,H} = f* by (19). In addition, an easy calculation gives

    ‖f_{P,λ_{n_i}} - f*‖² = ‖f_{P,λ_{n_i}}‖² - 2⟨f_{P,λ_{n_i}}, f*⟩ + ‖f*‖² → ‖f*‖² - 2‖f*‖² + ‖f*‖² = 0.

Now assume that f_{P,λ} ↛ f*_{L,P,H}.
Then there exists a δ > 0 and a subsequence (f_{P,λ_{n_j}}) with ‖f_{P,λ_{n_j}} - f*_{L,P,H}‖ > δ. On the other hand, applying the above reasoning to this subsequence gives a sub-subsequence converging to f*_{L,P,H}, and hence we have found a contradiction.

Let us now assume ‖f_{P,λ}‖ ≤ c for some c > 0 and all λ > 0. Then there exists an f* ∈ H and a sequence (f_{P,λ_n}) with f_{P,λ_n} → f* weakly. As in the first part of the proof we easily see that f* minimizes the L-risk in H.

Note that if H is a universal kernel, i.e. it is dense in C(X), P is an empirical distribution based on a training set T, and L is the (squared) hinge loss function, then f*_{L,T,H} ∈ H exists and coincides with the hard margin SVM solution. Consequently, the above lemma shows that both the L1-SVM and the L2-SVM solutions f_{T,λ} converge to the hard margin solution if T is fixed and λ → 0.

The following lemma, which shows that the function f_{P,λ} minimizes R_{L,P}(·) over the ball ‖f_{P,λ}‖ B_H, is somewhat well-known:

Lemma 5. Let λ > 0 and γ := 1/‖f_{P,λ}‖². Then we have f*_{P,γ} = f_{P,λ}.

Proof. We first show that f_{P,λ} minimizes R_{L,P}(·) over the ball ‖f_{P,λ}‖ B_H. To this end assume the converse, R_{L,P}(f*_{P,γ}) < R_{L,P}(f_{P,λ}). Since we also have ‖f*_{P,γ}‖ ≤ 1/√γ = ‖f_{P,λ}‖ we then find the false inequality

    λ‖f*_{P,γ}‖² + R_{L,P}(f*_{P,γ}) < λ‖f_{P,λ}‖² + R_{L,P}(f_{P,λ}),    (20)

and consequently f_{P,λ} minimizes R_{L,P}(·) over ‖f_{P,λ}‖ B_H. Now assume that f_{P,λ} ≠ f*_{P,γ}, i.e. ‖f_{P,λ}‖ > ‖f*_{P,γ}‖. Since R_{L,P}(f*_{P,γ}) = R_{L,P}(f_{P,λ}) we then again find (20), and hence the assumption f_{P,λ} ≠ f*_{P,γ} must be false.

Let us now turn to the main theorem of this section, which describes asymptotic relationships between the approximation error, the approximation error function, and the function λ ↦ ‖f_{P,λ}‖.

Theorem 4. The function λ ↦ ‖f_{P,λ}‖ is bounded on (0, ∞) if and only if a(λ) ≾ λ, and in this case we also have a(λ) ≍ λ. Moreover, for all α > 0 we have a*(λ) ≾ λ^α if and only if a(λ) ≾ λ^{α/(α+1)}. If one of the estimates is true we additionally have ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)} and R_{L,P}(f_{P,λ}) - R_{L,P,H} ≾ λ^{α/(α+1)}. Furthermore, if λ^{α+ε} ≾ a*(λ) ≾ λ^α for some α > 0 and ε ≥ 0 then we have both

    λ^{-α/((α+ε)(α+1))} ≾ ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)}    and    λ^{(α+ε)/(α+1)} ≾ R_{L,P}(f_{P,λ}) - R_{L,P,H} ≾ λ^{α/(α+1)},

and hence in particular λ^{(α+ε)/(α+1)} ≾ a(λ) ≾ λ^{α/(α+1)}.

Theorem 4 shows that if a*(λ) behaves essentially like λ^α then the approximation error function behaves essentially like λ^{α/(α+1)}. Consequently we do not lose information when considering a(·) instead of the approximation error a*(·).

Proof (of Theorem 4).
If λ ↦ ‖f_{P,λ}‖ is bounded on (0, ∞) then the minimizer f*_{L,P,H} exists by Lemma 4, and hence we find

    a(λ) ≤ λ‖f*_{L,P,H}‖² + R_{L,P}(f*_{L,P,H}) - R_{L,P,H} = λ‖f*_{L,P,H}‖².

Conversely, if there exists a constant c > 0 with a(λ) ≤ cλ we find λ‖f_{P,λ}‖² ≤ a(λ) ≤ cλ, which shows ‖f_{P,λ}‖² ≤ c for all λ > 0. Moreover, by Theorem 3 we easily find λa(1) ≤ a(λ) for all 0 < λ ≤ 1.

For the rest of the proof we observe that Theorem 3 gives a(λ) ≤ a(cλ) ≤ c·a(λ) for λ > 0 and c ≥ 1, and c·a(λ) ≤ a(cλ) ≤ a(λ) for λ > 0 and 0 < c ≤ 1. Therefore we can ignore arising constants by using the notation ≾.

Now let us assume a*(λ) ≾ λ^α for some α > 0. Then from Theorem 3 we know a(λ^{1+α}) ≾ λ^α, which leads to a(λ) ≾ λ^{α/(α+1)}. The latter immediately implies ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)}. Conversely, if a(λ) ≾ λ^{α/(α+1)} we define γ := ‖f_{P,λ}‖^{-2}. By Lemma 5 we then obtain

    a*(γ) = R_{L,P}(f_{P,λ}) - R_{L,P,H} ≤ a(λ) ≾ λ^{α/(α+1)} ≾ ‖f_{P,λ}‖^{-2α} = γ^α.

Now, if f*_{L,P,H} does not exist then the function λ ↦ ‖f_{P,λ}‖^{-2} tends to 0 if λ → 0, and thus a*(λ) ≾ λ^α. In addition, if f*_{L,P,H} exists then the assertion is trivial.

For the third assertion recall that Lemma 5 states f_{P,λ} = f*_{P,γ} with γ := ‖f_{P,λ}‖^{-2}, and hence we find

    a(λ) = λ‖f_{P,λ}‖² + a*(‖f_{P,λ}‖^{-2}).    (21)

Furthermore, we have already seen ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)}, and hence we get

    λ^{(α+ε)/(α+1)} ≾ ‖f_{P,λ}‖^{-2(α+ε)} ≾ a*(‖f_{P,λ}‖^{-2}) = R_{L,P}(f_{P,λ}) - R_{L,P,H}.

Combining this with (21) yields the third assertion.

References

1. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, New York (1996)
2. Yang, Y.: Minimax nonparametric classification — part I and II. IEEE Trans. Inform. Theory 45 (1999)
3. Wu, Q., Zhou, D.X.: Analysis of support vector machine classification. Tech. Report, City University of Hong Kong (2003)
4. Mammen, E., Tsybakov, A.: Smooth discrimination analysis. Ann. Statist. 27 (1999)
5. Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 (2004)
6. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press (2002)
7. Steinwart, I., Scovel, C.: Fast rates for support vector machines using Gaussian kernels. Ann. Statist., submitted (2004). publications/a-04a.pdf
8. Steinwart, I.: Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Inform. Theory 51 (2005)
9. Steinwart, I., Scovel, C.: Fast rates to Bayes for kernel machines. In Saul, L.K., Weiss, Y., Bottou, L., eds.: Advances in Neural Information Processing Systems 17.
MIT Press, Cambridge, MA (005) Edmuds, D., Triebel, H.: Fuctio Spaces, Etropy Numbers, Differetial Operators. Cambridge Uiversity Press (1996) 11. Zhag, T.: Statistical behaviour ad cosistecy of classificatio methods based o covex risk miimizatio. A. Statist. 3 (004) Rockafellar, R.: Covex Aalysis. Priceto Uiversity Press (1970)
