arxiv: v1 [math.st] 14 Aug 2007


The Annals of Statistics, 2007, Vol. 35, No. 2. In the Public Domain.

FAST RATES FOR SUPPORT VECTOR MACHINES USING GAUSSIAN KERNELS

By Ingo Steinwart and Clint Scovel

Los Alamos National Laboratory

For binary classification we establish learning rates up to the order of $n^{-1}$ for support vector machines (SVMs) with hinge loss and Gaussian RBF kernels. These rates are in terms of two assumptions on the considered distributions: Tsybakov's noise assumption to establish a small estimation error, and a new geometric noise condition which is used to bound the approximation error. Unlike previously proposed concepts for bounding the approximation error, the geometric noise assumption does not employ any smoothness assumption.

Received December 2003; revised June. Supported by the LDRD-ER program of the Los Alamos National Laboratory.
AMS 2000 subject classifications. Primary 68Q32; secondary 62G20, 62G99, 68T05, 68T10, 41A46, 41A99.
Key words and phrases. Support vector machines, classification, nonlinear discrimination, learning rates, noise assumption, Gaussian RBF kernels.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2007, Vol. 35, No. 2. This reprint differs from the original in pagination and typographic detail.

1. Introduction. In recent years support vector machines (SVMs) have been the subject of many theoretical considerations. Despite this effort, their learning performance on restricted classes of distributions is still widely unknown. In particular, it is unknown under which nontrivial circumstances SVMs can guarantee fast learning rates. The aim of this work is to use concepts like Tsybakov's noise assumption and local Rademacher averages to establish learning rates up to the order of $n^{-1}$ for nontrivial distributions. In addition to these concepts, which are used to deal with the stochastic part of the analysis, we also introduce a geometric assumption for distributions that allows us to estimate the approximation properties of Gaussian RBF kernels. Unlike many other concepts introduced for bounding the approximation error, our geometric assumption is not in terms of smoothness but describes the concentration and the noisiness of the data-generating distribution near the decision boundary.

Let us formally introduce the statistical classification problem. To this end let us fix a subset $X \subset \mathbb{R}^d$. We write $Y := \{-1, 1\}$. Given a finite training set $T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X \times Y)^n$, the classification task is to predict the label $y$ of a new sample $(x,y)$. In the standard batch model it is assumed that the samples $(x_i,y_i)$ are i.i.d. according to an unknown (Borel) probability measure $P$ on $X \times Y$. Furthermore, the new sample $(x,y)$ is drawn from $P$ independently of $T$. Given a classifier $\mathcal{C}$ that assigns to every training set $T$ a measurable function $f_T : X \to \mathbb{R}$, the prediction of $\mathcal{C}$ for $y$ is $\operatorname{sign} f_T(x)$, where $\operatorname{sign}(0) := 1$. The quality of such a function $f$ is measured by the classification risk
\[ \mathcal{R}_P(f) := P(\{(x,y) : \operatorname{sign} f(x) \neq y\}), \]
which should be as small as possible. The smallest achievable risk $\mathcal{R}_P := \inf\{\mathcal{R}_P(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$ is called the Bayes risk of $P$, and a function attaining this risk is called a Bayes decision function and is denoted by $f_P$.

Obviously, a good classifier should at least produce decision functions whose risks converge to the Bayes risk for all distributions $P$. This leads to the notion of universally consistent classifiers, which is thoroughly treated in [14]. The next naturally arising question is whether there are classifiers which guarantee a specific convergence rate for all distributions. Unfortunately, this is impossible by a result of Devroye (see [14], Theorem 7.2). However, if one restricts consideration to certain smaller classes of distributions, such learning rates, for example, in the form of
\[ P^n\bigl(T \in (X \times Y)^n : \mathcal{R}_P(f_T) \le \mathcal{R}_P + C(x)\, n^{-\beta}\bigr) \ge 1 - e^{-x}, \qquad n \ge 1,\ x \ge 1, \]
where $\beta > 0$ and $C(x) > 0$ are constants, exist for various classifiers. Typical assumptions for such classes of distributions are either in terms of the smoothness of the function $\eta(x) := P(y = 1 \mid x)$ (see, e.g., [19, 38]), or in terms of the smoothness of the decision boundary (see, e.g., [18, 35]). Moreover, the corresponding learning rates are slower than $n^{-1/2}$ if no additional assumptions on the amount of noise in the labels, for example, on the distribution of the random variable
\[ \min\{1 - \eta(x), \eta(x)\} = \tfrac{1}{2} - \bigl|\eta(x) - \tfrac{1}{2}\bigr| \tag{1} \]
around the critical level $1/2$, are imposed.
On the other hand, [35] showed that ERM-type classifiers can learn faster than $n^{-1/2}$ if one quantifies how likely the noise in (1) is close to $1/2$ (see Definition 2.2 in the following section). Unfortunately, however, the ERM classifier considered in [35] requires substantial knowledge on how to approximate the desired Bayes decision functions. Moreover, ERM classifiers are based on combinatorial optimization problems and hence they are usually hard to implement, and in general there exist no efficient algorithms. On the one hand, SVMs do not share the implementation issues of ERM since they are based on a convex optimization (see, e.g., [12, 26] for algorithmic aspects). On the other hand, however, their known learning rates are

rather unsatisfactory since either the assumptions on the distributions are too restrictive as in [28] or the established learning rates are too slow as in [37].

Our aim is to give SVMs a better theoretical foundation by establishing fast learning rates for a wide class of distributions. To this end we propose a geometric noise assumption (see Definition 2.3) which describes the concentration of the measure $|2\eta - 1|\,dP_X$, where $P_X$ is the marginal distribution of $P$ with respect to $X$, near the decision boundary. This assumption is then used to determine the approximation properties of Gaussian kernels, which are used in the SVMs we consider. Provided that the tuning parameters are optimally chosen, our main result then shows that the resulting learning rates for these classifiers can be as fast as $n^{-1}$.

The rest of this work is organized as follows: In Section 2 we introduce the main concepts of this work and then present our results. In Section 3 we recall some basic theory on reproducing kernel Hilbert spaces and prove a new covering number bound for Gaussian kernels that describes a trade-off between the kernel widths and the radii of the covering balls. In Section 4 we then show the approximation results that are related to our proposed geometric noise assumption. The last sections of the work contain the actual proof of our rates: In Section 5 we establish a general bound for ERM-type classifiers involving local Rademacher averages which is used to bound the estimation error in our analysis of SVMs. In order to apply this result we need variance bounds for SVMs, which are established in Section 6. Interestingly, it turns out that sharp versions of these bounds depend on both Tsybakov's noise assumption and the approximation properties of the kernel used. Finally, we prove our learning rates in Section 7.

2. Definitions and main results. In this section we first recall some basic notions related to support vector machines which are needed throughout this text.
In Section 2.2, we then present a covering number bound for Gaussian RBF kernels which will play an important role in our analysis of the estimation error of SVMs. In Section 2.3 we recall Tsybakov's noise assumption, which will allow us to establish learning rates faster than $n^{-1/2}$. Then, in Section 2.4, we introduce the new geometric assumption that is used to estimate the approximation error for SVMs with Gaussian RBF kernels. Finally, we present and discuss our learning rates in Section 2.5.

2.1. RKHSs, SVMs and basic definitions. For two functions $f$ and $g$ we use the notation $f(\lambda) \preceq g(\lambda)$ to mean that there exists a constant $C > 0$ such that $f(\lambda) \le C g(\lambda)$ over some specified range of values of $\lambda$. We also use the notation $\succeq$ with similar meaning and the notation $\sim$ when both $\preceq$ and $\succeq$ hold. In particular, we use the same notation for sequences. If not stated otherwise, $X$ always denotes a compact subset of $\mathbb{R}^d$ which is equipped with the Borel $\sigma$-algebra.

Recall (see, e.g., [1, 6]) that every positive definite kernel $k : X \times X \to \mathbb{R}$ has a unique reproducing kernel Hilbert space $H$ (RKHS) whose unit ball is denoted by $B_H$. Although we sometimes use generic kernels and RKHSs, we are mainly interested in Gaussian RBF kernels, which are the most widely used kernels in practice. Recall that these kernels are of the form
\[ k_\sigma(x,x') = \exp(-\sigma^2 \|x - x'\|_2^2), \qquad x, x' \in X, \]
where $\sigma > 0$ is a free parameter whose inverse $1/\sigma$ is called the width of $k_\sigma$. We usually denote the corresponding RKHSs, which are thoroughly described in [32], by $H_\sigma(X)$ or simply $H_\sigma$.

Let us now recall the definition of SVMs. To this end let $P$ be a distribution on $X \times Y$ and $l : Y \times \mathbb{R} \to [0,\infty)$ be the hinge loss, that is, $l(y,t) := \max\{0, 1 - yt\}$, $y \in Y$, $t \in \mathbb{R}$. Furthermore, we define the $l$-risk of a measurable function $f : X \to \mathbb{R}$ by $\mathcal{R}_{l,P}(f) := \mathbb{E}_{(x,y)\sim P}\, l(y, f(x))$. Now let $H$ be an RKHS over $X$ consisting of measurable functions. For $\lambda > 0$ we denote a solution of
\[ \arg\min_{f \in H,\ b \in \mathbb{R}} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f + b)\bigr) \tag{2} \]
by $(\tilde f_{P,\lambda}, \tilde b_{P,\lambda})$. Recall that $\tilde f_{P,\lambda}$ is uniquely determined (see, e.g., [30]), while in some situations this is not true for the offset $\tilde b_{P,\lambda}$. In general we thus assume that $\tilde b_{P,\lambda}$ is an arbitrary solution. However, for the (trivial) distributions that satisfy $P(\{y^*\} \mid x) = 1$ $P_X$-a.s. for some $y^* \in Y$ we explicitly set $\tilde b_{P,\lambda} := y^*$ in order to control the size of the offset. Furthermore, if $P$ is an empirical distribution with respect to a training set $T = ((x_1,y_1),\dots,(x_n,y_n))$ we write $\mathcal{R}_{l,T}(f)$ and $(\tilde f_{T,\lambda}, \tilde b_{T,\lambda})$. Note that in this case the above condition under which we set $\tilde b_{T,\lambda} := y^*$ means that all labels $y_i$ of $T$ are equal to $y^*$. An algorithm that constructs $(\tilde f_{T,\lambda}, \tilde b_{T,\lambda})$ for every training set $T$ is called an SVM with offset. Furthermore, for $\lambda > 0$ we denote the unique solution of
\[ \arg\min_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f)\bigr) \tag{3} \]
by $f_{P,\lambda}$, and for empirical distributions based on a training set $T$ we again write $f_{T,\lambda}$. A corresponding algorithm is called an SVM without offset.

Recall that under some assumptions on the RKHS used and the choice of the regularization parameter $\lambda$ it can be shown that both SVM variants are universally consistent (see [29, 31, 39]); however, no satisfying learning rates have been established yet.
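The objects defined above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the Gaussian RBF kernel $k_\sigma$, the hinge loss, and the regularized empirical objective from (2) and (3) for a function of the form $f = \sum_i c_i k_\sigma(x_i, \cdot)$, using the standard identity $\|f\|_H^2 = c^\top K c$ for such expansions. The function names and the NumPy dependency are assumptions made for illustration.

```python
import numpy as np


def gaussian_rbf_kernel(X, Z, sigma):
    """Gram matrix of k_sigma(x, z) = exp(-sigma^2 * ||x - z||_2^2).

    X has shape (n, d), Z has shape (m, d); the result has shape (n, m).
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    sq_dist = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-(sigma ** 2) * sq_dist)


def hinge_loss(y, t):
    """Hinge loss l(y, t) = max{0, 1 - y t}, applied elementwise."""
    return np.maximum(0.0, 1.0 - y * t)


def regularized_empirical_risk(coef, K, y, lam, offset=0.0):
    """lam * ||f||_H^2 + R_{l,T}(f + b) for f = sum_i coef[i] * k(x_i, .).

    K is the Gram matrix on the training inputs; offset plays the role of b
    (offset=0.0 corresponds to the SVM without offset).
    """
    f_vals = K @ coef                # f(x_j) for every training point
    rkhs_norm_sq = coef @ K @ coef   # ||f||_H^2 for a kernel expansion
    return lam * rkhs_norm_sq + np.mean(hinge_loss(y, f_vals + offset))
```

Minimizing `regularized_empirical_risk` over `coef` (and `offset`) is a convex problem; the sketch only evaluates the objective and leaves the choice of solver open.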

We also emphasize that in many theoretical papers only SVMs without offset are considered since the offset often causes serious technical problems in the analysis. However, in practice usually SVMs with offset are used and therefore we feel that these algorithms should be considered in theory, too. As we will see, our techniques can be applied to both variants. The resulting rates coincide.

2.2. Covering numbers for Gaussian RKHSs. In order to bound the estimation error of SVMs we need a complexity measure for the RKHSs used, which is introduced in this section. To this end let $A \subset E$ be a subset of a Banach space $E$. The covering numbers of $A$ are defined by
\[ N(A,\varepsilon,E) := \min\Bigl\{ n \ge 1 : \exists\, x_1,\dots,x_n \in E \text{ with } A \subset \bigcup_{i=1}^n (x_i + \varepsilon B_E) \Bigr\}, \qquad \varepsilon > 0, \]
where $B_E$ denotes the closed unit ball of $E$. Moreover, for a bounded linear operator $S : E \to F$ between two Banach spaces $E$ and $F$, the covering numbers are $N(S,\varepsilon) := N(S B_E, \varepsilon, F)$. Given a training set $T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X \times Y)^n$ we denote the space of all equivalence classes of functions $f : X \times Y \to \mathbb{R}$ with norm
\[ \|f\|_{L_2(T)} := \Bigl( \frac{1}{n} \sum_{i=1}^n |f(x_i,y_i)|^2 \Bigr)^{1/2} \tag{4} \]
by $L_2(T)$. In other words, $L_2(T)$ is an $L_2$-space with respect to the empirical measure of $T$. Note that for a function $f : X \times Y \to \mathbb{R}$ a canonical representative in $L_2(T)$ is its restriction $f_{|T}$. In addition, $L_2(T_X)$ denotes the space of all (equivalence classes of) square integrable functions with respect to the empirical measure of $x_1,\dots,x_n$.

The proof of our learning rates uses the behavior of $N(B_{H_\sigma(X)}, \varepsilon, L_2(T_X))$ in $\varepsilon$ and $\sigma$ in order to bound the estimation error. Unfortunately, all known results on covering numbers for Gaussian RBF kernels emphasize the role of $\varepsilon$, and hence we will establish in Section 3 the following result, which describes a suitable trade-off between the influence of $\varepsilon$ and $\sigma$.

Theorem 2.1. Let $\sigma \ge 1$, $X \subset \mathbb{R}^d$ be a compact subset with nonempty interior, and $H_\sigma(X)$ be the RKHS of the Gaussian RBF kernel $k_\sigma$ on $X$.
Then for all $0 < p \le 2$ and all $\delta > 0$, there exists a constant $c_{p,\delta,d} > 0$ independent of $\sigma$ such that for all $\varepsilon > 0$ we have
\[ \sup_{T \in (X \times Y)^n} \log N(B_{H_\sigma(X)}, \varepsilon, L_2(T_X)) \le c_{p,\delta,d}\, \sigma^{(1-p/2)(1+\delta)d}\, \varepsilon^{-p}. \]

2.3. Tsybakov's noise assumption. Now we recall Tsybakov's noise condition, which describes the amount of noise in the labels. In order to motivate Tsybakov's assumption let us first observe that by equation (1) the function $|2\eta - 1|$ can be used to describe the noise in the labels of a distribution $P$. Indeed, in regions where this function is close to 1 there is only a small amount of noise, whereas function values close to 0 only occur in regions with a high level of noise. The following definition, in which we use the convention $t^\infty := 0$ for $t \in (0,1)$, describes the size of the latter regions:

Definition 2.2. Let $0 \le q \le \infty$ and $P$ be a probability measure on $X \times Y$. We say that $P$ has Tsybakov noise exponent $q$ if there exists a constant $C > 0$ such that for all sufficiently small $t > 0$ we have
\[ P_X(\{x \in X : |2\eta(x) - 1| \le t\}) \le C \cdot t^q. \tag{5} \]

Obviously, $P$ has Tsybakov noise exponent $q > 0$ if and only if $|2\eta - 1|^{-1} \in L_{q,\infty}(P_X)$, where $L_{q,\infty}$ denotes a Lorentz space (see [5]). It is also easy to see that $P$ has Tsybakov noise exponent $q'$ for all $q' < q$ if $P$ has Tsybakov noise exponent $q$. Furthermore, all distributions obviously have noise exponent 0. In the other extreme case $q = \infty$ the conditional probability $\eta$ is bounded away from $1/2$. In particular, noise-free distributions have exponent $q = \infty$. Furthermore, for $q < \infty$ it is easy to check that Definition 2.2 is satisfied if and only if (5) holds for all $t > 0$ and a possibly different constant $C$. Finally, note that (5) does not make any assumptions on the location of the noisy set, and hence we prefer the notion "noise condition" rather than the often used term "margin condition."

2.4. A new geometric assumption for distributions. In this section we introduce a condition for distributions that will allow us to estimate the approximation error for Gaussian RBF kernels. To this end let $l$ be the hinge loss function and $P$ be a distribution on $X \times Y$. Let $\mathcal{R}_{l,P} := \inf\{\mathcal{R}_{l,P}(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$ denote the smallest possible $l$-risk of $P$.
Since functions achieving the minimal $l$-risk occur in many situations we indicate them by $f_{l,P}$ if no confusion regarding the nonuniqueness of this symbol can be expected. Furthermore, recall that $f_{l,P}$ has a shape similar to the Bayes decision function $\operatorname{sign} f_P$ (see, e.g., [30]). Now, given an RKHS $H$ over $X$ we define the approximation error function with respect to $H$ and $P$ by
\[ a(\lambda) := \inf_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f) - \mathcal{R}_{l,P}\bigr), \qquad \lambda \ge 0. \tag{6} \]
Note that the obvious analogue of the approximation error function with offset is not greater than the above approximation error function without offset, and hence we restrict our attention to the latter for simplicity.

For $\lambda > 0$, the approximation error function describes how well $\lambda \|f_{P,\lambda}\|_H^2 + \mathcal{R}_{l,P}(f_{P,\lambda})$ approximates $\mathcal{R}_{l,P}$. For example, it was shown in [31] that we have $\lim_{\lambda \to 0} a(\lambda) = 0$ for all $P$ if $X$ is a compact metric space and $H$ is dense in the space of continuous functions $C(X)$. However, in nontrivial situations there cannot exist a convergence rate which holds uniformly for all distributions $P$. Since $H_\sigma(X)$ is dense in $C(X)$ for compact $X \subset \mathbb{R}^d$ and all $\sigma > 0$, these statements are in particular true for the approximation error functions $a_\sigma(\cdot)$ of the Gaussian RBF kernels with fixed width $1/\sigma$. Moreover, we are not aware of any weak condition on $\eta$ or $P$ that ensures $a_\sigma(\lambda) \preceq \lambda^\beta$ for $\lambda \to 0$ and some $\beta > 0$, and the results of [27] indicate that such behavior of $a_\sigma(\cdot)$ may actually require very restrictive conditions. In the following we will therefore present a condition on $P$ that allows us to estimate $a_\sigma(\lambda)$ by $\lambda$ and $\sigma$. In particular it will turn out that $a_\sigma(\lambda) \to 0$ with a polynomial rate in $\lambda$ if we relate $\sigma$ to $\lambda$ in a certain manner.

In order to introduce this assumption on $P$ we first define the classes of $P$ by $X_{-1} := \{x \in X : \eta(x) < \frac{1}{2}\}$, $X_1 := \{x \in X : \eta(x) > \frac{1}{2}\}$ and $X_0 := \{x \in X : \eta(x) = \frac{1}{2}\}$ for some choice of $\eta$. Now we define a distance function $x \mapsto \tau_x$ by
\[ \tau_x := \begin{cases} d(x, X_0 \cup X_1), & \text{if } x \in X_{-1}, \\ d(x, X_0 \cup X_{-1}), & \text{if } x \in X_1, \\ 0, & \text{otherwise,} \end{cases} \tag{7} \]
where $d(x,A)$ denotes the distance of $x$ to a set $A$ with respect to the Euclidean norm. Roughly speaking, $\tau_x$ measures the distance of $x$ to the decision boundary. Now we can present the already announced geometric condition for distributions.

Definition 2.3. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a probability measure on $X \times Y$. We say that $P$ has geometric noise exponent $\alpha > 0$ if there exists a constant $C > 0$ such that
\[ \int_X |2\eta(x) - 1| \exp\Bigl(-\frac{\tau_x^2}{t}\Bigr) P_X(dx) \le C t^{\alpha d/2}, \qquad t > 0. \tag{8} \]
We say that $P$ has geometric noise exponent $\infty$ if it has geometric noise exponent $\alpha$ for all $\alpha > 0$.

Note that in the above definition we neither make any kind of smoothness assumption nor do we assume a condition on $P_X$ in terms of absolute continuity with respect to the Lebesgue measure. Instead, the integral condition (8) describes the concentration of the measure $|2\eta - 1|\,dP_X$ near the decision boundary in the sense that the less the measure is concentrated in this region the larger the geometric noise exponent can be chosen. The following example illustrates this.

Example 2.4. Since $\exp(-t) \le C_\alpha t^{-\alpha}$ holds for all $t > 0$ and a constant $C_\alpha > 0$ only depending on $\alpha > 0$, we easily see that (8) is satisfied whenever
\[ (x \mapsto \tau_x^{-1}) \in L_{\alpha d}(|2\eta - 1|\,dP_X), \tag{9} \]
where $L_{\alpha d}(|2\eta - 1|\,dP_X)$ denotes the usual Lebesgue space of functions that are $\alpha d$-integrable with respect to the measure $|2\eta - 1|\,dP_X$. Now, let us suppose $X_0 = \emptyset$ for a moment. In this case $\tau_x$ measures the distance to the class $x$ does not belong to. In particular, (9) holds for $\alpha = \infty$ if and only if the two classes $X_{-1}$ and $X_1$ have strictly positive distance. Moreover, if (9) holds for some $0 < \alpha < \infty$ then the two classes may touch, that is, the decision boundary $\partial X_{-1} \cap \partial X_1$ is nonempty. Consequently, we can easily construct distributions $P$ that have geometric noise exponent $\infty$ and touching classes, but also satisfy $f_P \notin H_\sigma(X)$ for all $\sigma > 0$. However, note that for such $P$ the measure $|2\eta - 1|\,dP_X$ must obviously have a very low concentration near the decision boundary.

We now describe a simple regularity condition on $\eta$ near the decision boundary that can be used to guarantee a geometric noise exponent.

Definition 2.5. Let $X \subset \mathbb{R}^d$, $P$ be a distribution on $X \times Y$ and $\gamma > 0$. We say that $P$ has an envelope of order $\gamma$ if there is a constant $c_\gamma > 0$ such that for $P_X$-almost all $x \in X$ we have
\[ |2\eta(x) - 1| \le c_\gamma \tau_x^\gamma. \tag{10} \]

Obviously, if $P$ has an envelope of order $\gamma$ then the graph of $x \mapsto 2\eta(x) - 1$ lies in a multiple of the envelope defined by $\tau_x^\gamma$ at the top and by $-\tau_x^\gamma$ at the bottom. Consequently, $\eta$ can be very irregular away from the decision boundary but cannot be discontinuous when crossing it. The rate of convergence of $\eta(x) \to 1/2$ for $\tau_x \to 0$ is described by $\gamma$.

Interestingly, for distributions having both an envelope of order $\gamma$ and a Tsybakov noise exponent $q$ we can bound the geometric noise exponent, as the following theorem, which is proved in Section 4, shows.

Theorem 2.6. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a distribution on $X \times Y$ that has an envelope of order $\gamma > 0$ and a Tsybakov noise exponent $q \in [0,\infty)$. Then $P$ has geometric noise exponent $(q+1)\gamma/d$ if $q \ge 1$, and geometric noise exponent $\alpha$ for all $\alpha < (q+1)\gamma/d$ otherwise.

Now the main result of this subsection, which is proved in Section 4, shows that for distributions having a nontrivial geometric noise exponent we can bound the approximation error function for Gaussian RBF kernels.

Theorem 2.7. Let $\sigma > 0$, $X$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$ and $a_\sigma(\cdot)$ be the approximation error function with respect to $H_\sigma(X)$. Furthermore, let $P$ be a distribution on $X \times Y$ that has geometric noise exponent $0 < \alpha < \infty$ with constant $C$ in (8). Then there is a constant $c_d > 0$ depending only on the dimension $d$ such that for all $\lambda > 0$ we have
\[ a_\sigma(\lambda) \le c_d \bigl(\sigma^d \lambda + C (2d)^{\alpha d/2} \sigma^{-\alpha d}\bigr). \tag{11} \]

In order to let the right-hand side of (11) converge to zero it is necessary to assume both $\lambda \to 0$ and $\sigma \to \infty$. An easy consideration shows that the fastest convergence rate is achieved if $\sigma(\lambda) := \lambda^{-1/((\alpha+1)d)}$: with this choice both terms on the right-hand side of (11) are of order $\lambda^{\alpha/(\alpha+1)}$. In this case we have $a_{\sigma(\lambda)}(\lambda) \preceq \lambda^{\alpha/(\alpha+1)}$. In particular, we can obtain rates up to linear order in $\lambda$ for sufficiently benign distributions. The price for this good approximation property is, however, an increasing complexity of the hypothesis class $B_{H_{\sigma(\lambda)}}$, as we have seen in Theorem 2.1.

2.5. Learning rates for SVMs using Gaussian RBF kernels. With the help of the geometric noise assumption we can now present our learning rates for SVMs using Gaussian RBF kernels. Note again that these polynomial rates do not require a smoothness assumption on $P$. Furthermore, note that we use the convention $\frac{a \cdot \infty + b}{c \cdot \infty + d} := \frac{a}{c}$ for $a, c \in (0,\infty)$, $b, d \in [0,\infty)$ in order to make the presentation compact.

Theorem 2.8. Let $X$ be the closed unit ball of $\mathbb{R}^d$, and $P$ be a distribution on $X \times Y$ with Tsybakov noise exponent $q \in [0,\infty]$ and geometric noise exponent $\alpha \in (0,\infty)$. We define
\[ \beta := \begin{cases} \dfrac{\alpha}{2\alpha + 1}, & \text{if } \alpha \le \dfrac{q+2}{2q}, \\[2ex] \dfrac{2\alpha(q+1)}{2\alpha(q+2) + 3q + 4}, & \text{otherwise,} \end{cases} \]
and $\lambda_n := n^{-\frac{\alpha+1}{\alpha}\beta}$ and $\sigma_n := n^{\frac{\beta}{\alpha d}}$ in both cases. Then for all $\varepsilon > 0$ there exists a $C > 0$ such that for all $x \ge 1$ and $n \ge 1$ the SVM without offset using the Gaussian RBF kernel $k_{\sigma_n}$ satisfies
\[ \Pr{}^*\bigl(T \in (X \times Y)^n : \mathcal{R}_P(f_{T,\lambda_n}) \le \mathcal{R}_P + C x^2 n^{-\beta+\varepsilon}\bigr) \ge 1 - e^{-x}, \]
where $\Pr^*$ denotes the outer probability of $P^n$ in order to avoid measurability considerations. If $\alpha = \infty$ the latter inequality holds if $\sigma_n = \sigma$ is a constant with $\sigma > 2\sqrt{d}$. Finally, all results also hold for the SVM with offset.

Remark 2.9.
The above learning rates are faster than the parametric rate $n^{-1/2}$ if and only if $\alpha > (3q+4)/(2q)$. For $q = \infty$ the latter condition becomes $\alpha > 3/2$, and in an intermediate case $q = 1$ it becomes $\alpha > 7/2$.
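The case distinction in Theorem 2.8 and the threshold in Remark 2.9 are easy to check numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, $q = \infty$ is handled through the convention $\frac{a\cdot\infty+b}{c\cdot\infty+d} := \frac{a}{c}$ from the text, and for $q = 0$ the threshold $(q+2)/(2q)$ is read as $\infty$ (an assumption, so that the first case always applies).

```python
import math


def rate_exponent(alpha, q):
    """Exponent beta of Theorem 2.8 for finite alpha > 0 and q in [0, inf]."""
    if not (0 < alpha < math.inf) or q < 0:
        raise ValueError("need 0 < alpha < inf and q >= 0")
    if math.isinf(q):
        # Convention (a*inf + b)/(c*inf + d) := a/c applied to both fractions.
        threshold = 0.5
        second_case = 2 * alpha / (2 * alpha + 3)
    elif q == 0:
        # Assumption: (q + 2)/(2q) is read as +inf, so the first case applies.
        threshold = math.inf
        second_case = None
    else:
        threshold = (q + 2) / (2 * q)
        second_case = 2 * alpha * (q + 1) / (2 * alpha * (q + 2) + 3 * q + 4)
    if alpha <= threshold:
        return alpha / (2 * alpha + 1)
    return second_case


def tuning_sequences(n, alpha, q, d):
    """lambda_n = n^(-(alpha+1)/alpha * beta) and sigma_n = n^(beta/(alpha d))."""
    beta = rate_exponent(alpha, q)
    lam_n = n ** (-(alpha + 1) / alpha * beta)
    sigma_n = n ** (beta / (alpha * d))
    return lam_n, sigma_n
```

For example, `rate_exponent(3.5, 1)` sits exactly at the boundary value $1/2$ of Remark 2.9, and the sequences returned by `tuning_sequences` satisfy $\sigma_n = \lambda_n^{-1/((\alpha+1)d)}$, the relation discussed after Theorem 2.7.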

Remark 2.10. It is important to note that our techniques can also be used to establish rates for other definitions of the sequences $(\lambda_n)$ and $(\sigma_n)$. In fact, Theorem 2.7 guarantees $a_{\sigma_n}(\lambda_n) \to 0$ (which is necessary for our techniques to produce any rate) if $\sigma_n \to \infty$ and $\sigma_n^d \lambda_n \to 0$. In particular, if $\lambda_n := n^{-\iota}$ and $\sigma_n := n^{\kappa}$ for some $\iota, \kappa > 0$ with $\kappa d < \iota$, these conditions are satisfied and a conceptually easy but technically involved modification of our proof can produce rates for certain ranges of $\iota$ (and thus $\kappa$). In order to keep the presentation as short as possible we have omitted the details and focused on the best possible rates.

Remark 2.11. Unfortunately, the choice of $\lambda_n$ and $\sigma_n$ that yields the optimal rates within our techniques requires knowledge of the values of $\alpha$ and $q$, which are typically not available. Adaptive methods which do not require such knowledge are still unknown.

Remark 2.12. Theorem 2.7 and Theorem 2.8 establish results for all distributions having some geometric noise exponent. However, for certain distributions of this type the resulting rates are not satisfactory. For example, consider the distribution $P$ on $X := [-1,1]$ whose marginal distribution $P_X$ equals the uniform distribution and whose conditional distribution $\eta(x) := P(y = 1 \mid x)$ satisfies $|2\eta(x) - 1| = |x|^\gamma$, $x \in X$, for some constant $\gamma \in (0,\infty)$. Then $P$ obviously has Tsybakov noise exponent $q := 1/\gamma$, and Theorem 2.6 or a simple modification of the proof of Theorem 2.7 shows that $P$ has geometric noise exponent $\alpha := 1 + \gamma$. Theorem 2.8 thus gives a rate of the form $n^{-\beta+\varepsilon}$ for $\beta = \frac{2q^2+4q+2}{5q^2+10q+4}$, which is never faster than $n^{-1/2}$. Though this is disappointing at first glance, it is not really surprising since the proof of Theorem 2.7 is not tailored to distributions having such simple decision functions. We believe that sharper bounds on the approximation error function (and thus faster learning rates) for this and other distributions are possible, but a detailed analysis is beyond the scope of this paper.
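The two exponents claimed for the example in Remark 2.12 can be checked numerically. The sketch below is an illustration, not part of the original analysis; it assumes $\tau_x = |x|$, which holds when the only point with $\eta(x) = 1/2$ is $x = 0$, and uses hypothetical function names together with a simple midpoint rule. For this example the substitution $u = x^2/t$ gives $\int_0^1 x^\gamma e^{-x^2/t}\,dx \le \frac{1}{2}\Gamma\bigl(\frac{\gamma+1}{2}\bigr)\, t^{(1+\gamma)/2}$, which is (8) with $d = 1$ and $\alpha = 1 + \gamma$.

```python
import math

import numpy as np


def geometric_noise_integral(t, gamma, n_grid=200_000):
    """Left-hand side of (8) for X = [-1, 1], uniform P_X, |2 eta - 1| = |x|^gamma.

    Assumes tau_x = |x|. By symmetry the integral over [-1, 1] with density 1/2
    equals the integral over [0, 1] with density 1, approximated by a midpoint rule.
    """
    x = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean(x ** gamma * np.exp(-(x ** 2) / t)))


def tsybakov_mass(t, gamma):
    """P_X(|2 eta - 1| <= t) = P_X(|x| <= t^(1/gamma)) for uniform P_X on [-1, 1]."""
    return min(1.0, t ** (1.0 / gamma))


def geometric_noise_constant(gamma):
    """Constant C = Gamma((gamma + 1)/2) / 2 from the substitution u = x^2 / t."""
    return 0.5 * math.gamma((gamma + 1.0) / 2.0)
```

Dividing `geometric_noise_integral(t, gamma)` by `t ** ((1 + gamma) / 2)` gives a ratio that stays below `geometric_noise_constant(gamma)` for every `t > 0` and approaches it as `t` tends to 0, while `tsybakov_mass(t, gamma)` equals $t^{1/\gamma}$ for $t \le 1$, matching $q = 1/\gamma$.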
Remark 2.13. Another interesting but open question is whether the obtained rates are optimal for the class of considered distributions. In order to approach this question let us consider the case $\alpha = \infty$, which roughly speaking describes the case of almost no approximation error. In this case our rates are essentially of the form $n^{-(q+1)/(q+2)}$, which coincides with the rates Tsybakov (see [35]) achieved for certain ERM classifiers based on hypothesis classes of small complexity. The latter rates in turn cannot be improved in a minimax sense for certain classes of distributions, as was also shown in [35]. This discussion indicates that the techniques used for the stochastic part of our analysis may be strong enough to produce optimal results. However, if we consider the case $\alpha < \infty$ then the approximation error function described

in Theorem 2.7 and its influence on the estimation error (see our proofs, in particular Section 5 and Section 7) have a significant impact on the obtained rates. Since the sharpness of Theorem 2.7 is unclear to us we make no conjecture regarding the optimality of our rates in the general case.

3. Proof of Theorem 2.1. The main goal of this section is to prove Theorem 2.1, which is done in Section 3.2. To this end we provide in Section 3.1 some RKHS theory which is used throughout this work.

3.1. Some basic RKHS theory. For the proofs of this section we have to recall some basic facts from the theory of RKHSs. To this end let $X \subset \mathbb{R}^d$ be a compact subset and $k : X \times X \to \mathbb{R}$ be a continuous and positive semidefinite kernel with RKHS $H$. Then $H$ consists of continuous functions on $X$ and for $f \in H$ we have $\|f\|_\infty \le K \|f\|_H$, where
\[ K := \sup_{x \in X} \sqrt{k(x,x)}. \tag{12} \]
Consequently, if the embedding of the RKHS $H$ into the space of continuous functions $C(X)$ is denoted by
\[ J_H : H \to C(X), \tag{13} \]
we have $\|J_H\| \le K$. Furthermore, let us recall the representation of $H$ based on Mercer's theorem (see [13]). To this end let $K_X : L_2(X) \to L_2(X)$ be the integral operator defined by
\[ K_X f(x) := \int_X k(x,x') f(x')\,dx', \qquad f \in L_2(X),\ x \in X, \tag{14} \]
where $L_2(X)$ denotes the $L_2$-space on $X$ with respect to the Lebesgue measure. Then it was shown in [13] that the unique square root $K_X^{1/2}$ of $K_X$ is an isometric isomorphism between $L_2(X)$ and $H$.

3.2. Proof of Theorem 2.1. In order to prove Theorem 2.1 we need the following result, which bounds the covering numbers of $H_\sigma(X)$ with respect to $C(X)$.

Theorem 3.1. Let $\sigma \ge 1$, $0 < p < 2$ and $X \subset \mathbb{R}^d$ be a compact subset with nonempty interior. Then there exists a constant $c_{p,d} > 0$ independent of $\sigma$ such that for all $\varepsilon > 0$ we have
\[ \log N(B_{H_\sigma(X)}, \varepsilon, C(X)) \le c_{p,d}\, \sigma^{(1-p/4)d}\, \varepsilon^{-p}. \]

Proof. Let $B_d$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$ and $\mathring{B}_d$ be its interior. Then there exists an $r \ge 1$ such that $X \subset r B_d$. Now,

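it was recently shown in [32] that the restrictions $H_\sigma(r B_d) \to H_\sigma(X)$ and $H_\sigma(r B_d) \to H_\sigma(\mathring{B}_d)$ are both isometric isomorphisms. Consequently, in the following we assume without loss of generality that $X = B_d$ or $X = \mathring{B}_d$ and do not concern ourselves with the distinction of both cases.

Now let us write $H_\sigma := H_\sigma(X)$ and $J_\sigma := J_{H_\sigma} : H_\sigma \to C(X)$ in order to simplify notation. Furthermore, let $K_\sigma : L_2(X) \to L_2(X)$ be the integral operator of $k_\sigma$ defined as in (14), and let $\|\cdot\|$ denote the norm in $L_2(X)$. According to [13], Theorem 3, page 27, for any $f \in H_\sigma$, we obtain
\[ \inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac{1}{R} \|K_\sigma^{-1/2} f\|^2 = \frac{1}{R} \|f\|_{H_\sigma}^2, \]
where we use the convention $\|K_\sigma^{-1} h\| = \infty$ if $h \notin K_\sigma L_2(X)$. Suppose now that $\mathcal{H} \subset L_2(X)$ is a dense Hilbert space with $\|h\| \le \|h\|_{\mathcal{H}}$, and that we have $K_\sigma : L_2(X) \to \mathcal{H} \subset L_2(X)$ with $\|K_\sigma : L_2(X) \to \mathcal{H}\| \le c_{\sigma,\mathcal{H}} < \infty$ for some constant $c_{\sigma,\mathcal{H}} > 0$. It follows that
\[ \inf_{\|h\|_{\mathcal{H}} \le c_{\sigma,\mathcal{H}} R} \|f - h\| \le \inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac{1}{R} \|f\|_{H_\sigma}^2 \]
and hence
\[ \inf_{\|h\|_{\mathcal{H}} \le R} \|f - h\| \le \frac{c_{\sigma,\mathcal{H}}}{R} \|f\|_{H_\sigma}^2. \]
By [27], Theorem 3.1 it follows that $f$ is contained in the real interpolation space $(L_2(X), \mathcal{H})_{1/2,\infty}$ (see [7] for the definition of an interpolation space) and its norm in this space satisfies $\|f\|_{1/2,\infty} \le 2\sqrt{c_{\sigma,\mathcal{H}}}\, \|f\|_{H_\sigma}$. Therefore we obtain a continuous embedding $\Upsilon_1 : H_\sigma \to (L_2(X), \mathcal{H})_{1/2,\infty}$ with $\|\Upsilon_1\| \le 2\sqrt{c_{\sigma,\mathcal{H}}}$. If in addition a subset inclusion $(L_2(X), \mathcal{H})_{1/2,\infty} \subset C(X)$ exists which defines a continuous embedding $\Upsilon_2 : (L_2(X), \mathcal{H})_{1/2,\infty} \to C(X)$, we have a factorization $J_\sigma = \Upsilon_2 \Upsilon_1$ and can conclude
\[ \log N(B_{H_\sigma(X)}, \varepsilon, C(X)) = \log N(J_\sigma, \varepsilon) \le \log N\Bigl(\Upsilon_2, \frac{\varepsilon}{2\sqrt{c_{\sigma,\mathcal{H}}}}\Bigr). \tag{15} \]
Consequently, to bound $\log N(J_\sigma, \varepsilon)$ we need to select an $\mathcal{H}$, compute $c_{\sigma,\mathcal{H}}$ and bound $\log N(\Upsilon_2, \varepsilon)$. To that end let $\mathcal{H} := W^m(\mathring{X})$ be the Sobolev space with norm
\[ \|f\|_m^2 = \sum_{|\alpha| \le m} \|D^\alpha f\|^2, \]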

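where $|\alpha| := \sum_{i=1}^d \alpha_i$, $D^\alpha := \prod_{i=1}^d \partial_i^{\alpha_i}$, and $\partial_i^{\alpha_i}$ denotes the $\alpha_i$th partial derivative in the $i$th coordinate of $\mathbb{R}^d$. By the Cauchy-Schwarz inequality we obtain
\[ \|D^\alpha K_\sigma f\|^2 \le \|f\|^2 \int_X \int_X |D_x^\alpha k_\sigma(x,\acute{x})|^2 \, d\acute{x}\, dx, \tag{16} \]
where the notation $D_x^\alpha$ indicates that the differentiation takes place in the $x$ variable. To address the term $D_x^\alpha k_\sigma(x,\acute{x})$ we note that
\[ D_x^\alpha (e^{-\|x\|^2}) = (-1)^{|\alpha|} e^{-\|x\|^2/2} h_\alpha(x), \]
where the multivariate Hermite functions $h_\alpha(x) = \prod_{i=1}^d h_{\alpha_i}(x_i)$ are products of the univariate functions. Since $\int_{\mathbb{R}} h_k^2(x)\,dx = 2^k k! \sqrt{\pi}$ (see, e.g., [11]) we obtain
\[ \int_{\mathbb{R}^d} |D_x^\alpha (e^{-\|x\|^2})|^2 \, dx = \int_{\mathbb{R}^d} e^{-\|x\|^2} h_\alpha^2(x)\,dx \le \int_{\mathbb{R}^d} h_\alpha^2(x)\,dx = 2^{|\alpha|} \alpha!\, \pi^{d/2}, \tag{17} \]
where we have used the definition $\alpha! := \prod_{i=1}^d \alpha_i!$. Applying the translation invariance of $k_\sigma$, we obtain
\[ \int_{\mathbb{R}^d} |D_x^\alpha k_\sigma(x,\acute{x})|^2 \, d\acute{x} = \int_{\mathbb{R}^d} |D_{\acute{x}}^\alpha k_\sigma(0,\acute{x})|^2 \, d\acute{x} = \int_{\mathbb{R}^d} |D_{\acute{x}}^\alpha (e^{-\sigma^2 \|\acute{x}\|^2})|^2 \, d\acute{x}, \]
and by a change of variables we can apply inequality (17) to the integral on the right-hand side,
\[ \int_{\mathbb{R}^d} |D_{\acute{x}}^\alpha (e^{-\sigma^2 \|\acute{x}\|^2})|^2 \, d\acute{x} = \sigma^{2|\alpha| - d} \int_{\mathbb{R}^d} |D_{\acute{x}}^\alpha (e^{-\|\acute{x}\|^2})|^2 \, d\acute{x} \le \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2}. \]
Hence we obtain
\[ \int_X \int_X |D_x^\alpha k_\sigma(x,\acute{x})|^2 \, d\acute{x}\, dx \le \theta(d)\, \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2}, \]
where $\theta(d)$ is the volume of $X$. Since $\sum_{|\alpha| \le m} \alpha! \le d^m m!^d$ and $\|K_\sigma f\|_m^2 = \sum_{|\alpha| \le m} \|D^\alpha K_\sigma f\|^2$ we can therefore infer from (16) that for $\sigma \ge 1$ we have
\[ \|K_\sigma\| \le \sqrt{\theta(d)}\, (2d)^{m/2}\, m!^{d/2}\, \sigma^{m - d/2} =: c_{\sigma,\mathcal{H}}. \tag{18} \]
Now let us consider $\Upsilon_2 : (L_2(X), W^m(\mathring{X}))_{1/2,\infty} \to C(X)$. According to Triebel [34], page 267, we have
\[ (L_2(X), W^m(\mathring{X}))_{1/2,\infty} = (L_2(\mathring{X}), W^m(\mathring{X}))_{1/2,\infty} = B_{2,\infty}^{m/2}(\mathring{X}) \]
isomorphically. Furthermore
\[ \log N\bigl(B_{2,\infty}^{m/2}(\mathring{X}) \to C(X), \varepsilon\bigr) \le c_{m,d}\, \varepsilon^{-2d/m} \tag{19} \]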

for $m > d$ follows from a similar result of Birman and Solomjak ([8], cf. also [34]) for Slobodeckij (i.e., fractional Sobolev) spaces, where the constant $c_{m,d}$ depends only on $m$ and $d$. Consequently we obtain from (15), (18) and (19) that
\[ \log N(J_\sigma, \varepsilon) \le c_{m,d} \Bigl(\frac{\varepsilon}{2\sqrt{c_{\sigma,\mathcal{H}}}}\Bigr)^{-2d/m} = c_{m,d}\, (4 c_{\sigma,\mathcal{H}})^{d/m}\, \varepsilon^{-2d/m} = \tilde{c}_{m,d}\, \sigma^{d - d^2/(2m)}\, \varepsilon^{-2d/m} \]
for all $m > d$ and new constants $\tilde{c}_{m,d}$ depending only on $m$ and $d$. Setting $m := 2d/p$ completes the proof of Theorem 3.1.

Proof of Theorem 2.1. As before we write $H_\sigma := H_\sigma(X)$ and $J_\sigma := J_{H_\sigma} : H_\sigma \to C(X)$ in order to simplify notation. Furthermore recall for a training set $T \in (X \times Y)^n$ the space $L_2(T_X)$ introduced in Section 2.2. Now let $R_{T_X} : C(X) \to L_2(T_X)$ be the restriction map defined by $f \mapsto f_{|T_X}$. Obviously, we have $\|R_{T_X}\| \le 1$. Furthermore we define $I_\sigma := R_{T_X} \circ J_\sigma$ so that $I_\sigma : H_\sigma \to L_2(T_X)$ is the evaluation map. Then Theorem 3.1 and the product rule for covering numbers imply that
\[ \sup_{T \in Z^n} \log N(I_\sigma, \varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q} \tag{20} \]
for all $0 < q < 2$. To complete the proof of Theorem 2.1 we derive another bound on the covering numbers and interpolate the two. To that end observe that $I_\sigma : H_\sigma \to L_2(T_X)$ factors through $C(X)$ with both factors $J_\sigma$ and $R_{T_X}$ having norm not greater than 1. Hence Proposition in [23] implies that $I_\sigma$ is absolutely 2-summing with 2-summing norm not greater than 1. By König's theorem ([24], Lemma 2.7.2) we obtain for the approximation numbers $(a_k(I_\sigma))$ of $I_\sigma$ that $\sum_{k \ge 1} a_k^2(I_\sigma) \le 1$ for all $\sigma > 0$. Since the approximation numbers are decreasing it follows that $\sup_k \sqrt{k}\, a_k(I_\sigma) \le 1$. Using Carl's inequality between approximation and entropy numbers (see Theorem in [10]) we thus find a constant $c > 0$ such that
\[ \sup_{T \in Z^n} \log N(I_\sigma, \varepsilon) \le c\, \varepsilon^{-2} \tag{21} \]
for all $\varepsilon > 0$ and all $\sigma > 0$. Let us now interpolate the bound (21) with the bound (20). Since $\|I_\sigma : H_\sigma \to L_2(T_X)\| \le 1$ we only need to consider $0 < \varepsilon \le 1$. Let $0 < q < p < 2$ and $0 < a \le 1$. Then for $0 < \varepsilon < a$ we have
\[ \log N(I_\sigma, \varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q} \le c_{q,d}\, \sigma^{(1-q/4)d}\, a^{p-q}\, \varepsilon^{-p}, \]
and for $a \le \varepsilon \le 1$ we find
\[ \log N(I_\sigma, \varepsilon) \le c\, \varepsilon^{-2} \le c\, a^{p-2}\, \varepsilon^{-p}. \]

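Since $\sigma \ge 1$ we can set $a := \sigma^{-\frac{4-q}{8-4q} d}$ and obtain
\[ \log N(I_\sigma, \varepsilon) \le \tilde{c}_{q,d}\, \sigma^{(1-p/2)\frac{8-2q}{8-4q} d}\, \varepsilon^{-p}, \]
where $\tilde{c}_{q,d}$ is a constant depending only on $q, d$. The proof is completed by choosing $q := \frac{4\delta}{1+2\delta}$ when $\delta < \frac{2p}{8-4p}$ and $q$ just smaller than $p$ otherwise.

4. Proofs of Theorems 2.7 and 2.6. In this section we prove Theorems 2.7 and 2.6, which both deal with the geometric noise exponent.

4.1. Proof of Theorem 2.7. Let us begin by recalling some facts about Gaussian RBF kernels. To this end let $H_\sigma(\mathbb{R}^d)$ be the RKHS of the Gaussian RBF kernel with parameter $\sigma$. Then it was shown in [32] that the linear operator $V_\sigma : L_2(\mathbb{R}^d) \to H_\sigma(\mathbb{R}^d)$ defined by
\[ V_\sigma g(x) = \frac{(2\sigma)^{d/2}}{\pi^{d/4}} \int_{\mathbb{R}^d} e^{-2\sigma^2 \|x - y\|_2^2}\, g(y)\,dy, \qquad g \in L_2(\mathbb{R}^d),\ x \in \mathbb{R}^d, \]
is an isometric isomorphism. Consequently, we obtain
\[ a_\sigma(\lambda) \le \inf_{g \in L_2(\mathbb{R}^d)} \lambda \|g\|_{L_2(\mathbb{R}^d)}^2 + \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P}, \qquad \lambda > 0. \tag{22} \]
In the following we will estimate the right-hand side of (22) by a judicious choice of $g$. To this end we need the following lemma, which in some sense enlarges the support of $P$ to ensure that all balls of the form $B(x,\tau_x)$ are contained in the (enlarged) support. This guarantee will then make it possible to control the behavior of $V_\sigma g$ by tails of spherical Gaussian distributions [see (28) for details].

Lemma 4.1. Let $X$ be the closed unit ball of $\mathbb{R}^d$ and $P$ be a probability measure on $X \times Y$ with regular conditional probability $\eta(x) = P(y = 1 \mid x)$, $x \in X$. On $\acute{X} := 3X$ we define
\[ \acute{\eta}(x) = \begin{cases} \eta(x), & \text{if } \|x\| \le 1, \\ \eta\bigl(\frac{x}{\|x\|}\bigr), & \text{otherwise.} \end{cases} \tag{23} \]
We also write $\acute{X}_{-1} := \{x \in \acute{X} : \acute{\eta}(x) < \frac{1}{2}\}$ and $\acute{X}_1 := \{x \in \acute{X} : \acute{\eta}(x) > \frac{1}{2}\}$. Finally let $B(x,r)$ denote the open ball of radius $r$ about $x$ in $\mathbb{R}^d$. Then for $x \in X_1$ we have $B(x,\tau_x) \subset \acute{X}_1$ and for $x \in X_{-1}$ we have $B(x,\tau_x) \subset \acute{X}_{-1}$.

Proof. Let $x \in X_1$ and $x' \in B(x,\tau_x)$. If $x' \in X$ we have $\|x - x'\| < \tau_x$, which implies $\eta(x') > \frac{1}{2}$ by the definition of $\tau_x$. This shows $x' \in \acute{X}_1$. Now let us assume $\|x'\| > 1$. By $\langle x, x' \rangle \le \|x'\|$ and Pythagoras' theorem we then obtain
\[ \Bigl\| x - \frac{x'}{\|x'\|} \Bigr\|^2 \le \Bigl\| x - \frac{\langle x, x' \rangle}{\|x'\|^2}\, x' \Bigr\|^2 + \Bigl\| \frac{\langle x, x' \rangle}{\|x'\|^2}\, x' - x' \Bigr\|^2 = \|x - x'\|^2. \]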

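Therefore, we have $\bigl\| x'/\|x'\| - x \bigr\| < \tau_x$, which implies $\acute{\eta}(x') = \eta\bigl(x'/\|x'\|\bigr) > \tfrac12$.

Let us finally recall that Zhang showed in [39] that the hinge risk satisfies

\[ \mathcal{R}_{l,P}(f) - \mathcal{R}_{l,P} = E_{P_X}\bigl( |2\eta - 1|\,|f - f_P| \bigr) \tag{24} \]

for all measurable $f : X \to [-1,1]$. Now we are ready to prove Theorem 2.7.

Proof of Theorem 2.7. With the notation of Lemma 4.1 we fix a measurable $\acute{f}_P : \acute{X} \to [-1,1]$ that satisfies $\acute{f}_P = 1$ on $\acute{X}_1$, $\acute{f}_P = -1$ on $\acute{X}_{-1}$ and $\acute{f}_P = 0$ otherwise. For $g := (\sigma^2/\pi)^{d/4} \acute{f}_P$ we then immediately obtain

\[ \|g\|_{L_2(\mathbb{R}^d)} \le \Bigl( \frac{81\sigma^2}{\pi} \Bigr)^{d/4} \theta(d), \tag{25} \]

where $\theta(d)$ denotes the volume of $X$. Moreover, it is easy to see that $-1 \le \acute{f}_P \le 1$ implies $-1 \le V_\sigma g \le 1$. Since $P_X$ has support in $X$, (24) then yields

\[ \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P} = E_{P_X}\bigl( |2\eta - 1|\,|V_\sigma g - f_P| \bigr). \tag{26} \]

In order to bound $|V_\sigma g(x) - f_P(x)|$ for $x \in X_1$ we observe

\[ \begin{aligned} V_\sigma g(x) &= \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2\|x-y\|_2^2}\, \acute{f}_P(y)\,dy \\ &= \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2\|x-y\|_2^2}\, (\acute{f}_P(y)+1)\,dy - 1 \\ &\ge \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2\|x-y\|_2^2}\, (\acute{f}_P(y)+1)\,dy - 1. \end{aligned} \tag{27} \]

Now remember that Lemma 4.1 showed $B(x,\tau_x) \subset \acute{X}_1$ for all $x \in X_1$, so that (27) implies

\[ \begin{aligned} V_\sigma g(x) &\ge 2\Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2\|x-y\|_2^2}\,dy - 1 \\ &= 1 - 2P_{\gamma_\sigma}(\|u\| \ge \tau_x), \end{aligned} \tag{28} \]

where $\gamma_\sigma = (2\sigma^2/\pi)^{d/2} e^{-2\sigma^2\|u\|^2}\,du$ is a spherical Gaussian in $\mathbb{R}^d$. According to the tail bound [17], inequality (3.5) on page 59, we have $P_{\gamma_\sigma}(\|u\| \ge r) \le 4e^{-\sigma^2 r^2/(2d)}$ and consequently we obtain

\[ 1 \ge V_\sigma g(x) \ge 1 - 8e^{-\sigma^2\tau_x^2/(2d)}, \qquad x \in X_1. \]

Since for $x \in X_{-1}$ we can obtain an analogous estimate, we conclude

\[ |V_\sigma g(x) - f_P(x)| \le 8e^{-\sigma^2\tau_x^2/(2d)} \]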

for all $x \in X_{-1} \cup X_1$. Consequently (26) and the geometric noise assumption for $t := 2d/\sigma^2$ yield

\[ \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P} \le 8E_{x\sim P_X}\bigl( |2\eta(x)-1|\, e^{-\sigma^2\tau_x^2/(2d)} \bigr) \le 8C(2d)^{\alpha d/2}\sigma^{-\alpha d}, \tag{29} \]

where $C$ is the constant in (8). Combining (29), (25) and (22) now yields the assertion.

4.2. Proof of Theorem 2.6. In this subsection, all Lebesgue and Lorentz spaces (see, e.g., [5]) and their norms are with respect to the measure $P_X$.

Proof of Theorem 2.6. Let us first consider the case $q \ge 1$ where we can apply the Hölder inequality for Lorentz spaces [22], which states

\[ \|fg\|_1 \le \|f\|_{q,\infty}\,\|g\|_{q',1} \]

for all $f \in L_{q,\infty}$, $g \in L_{q',1}$ and $q'$ defined by $\frac1q + \frac1{q'} = 1$. Applying this inequality gives

\[ \begin{aligned} E_{x\sim P_X}\bigl( |2\eta(x)-1|\, e^{-\tau_x^2/t} \bigr) &\le \|(2\eta-1)^{-1}\|_{q,\infty}\, \bigl\| x \mapsto (2\eta(x)-1)^2 e^{-\tau_x^2/t} \bigr\|_{q',1} \\ &\le C\, \bigl\| (2\eta-1)^2 e^{-(|2\eta-1|/c_\gamma)^{2/\gamma} t^{-1}} \bigr\|_{q',1}, \end{aligned} \tag{30} \]

where in the last estimate we used the Tsybakov assumption (5) and the fact that $P$ has an envelope of order $\gamma$. Let us write $h(x) := |2\eta(x)-1|^{-1}$, $x \in X$, and $b := t(c_\gamma)^{2/\gamma}$ so that

\[ |2\eta(x)-1|^2\, e^{-(|2\eta(x)-1|/c_\gamma)^{2/\gamma} t^{-1}} = g(h(x)), \]

where $g(s) := s^{-2} e^{-s^{-2/\gamma}/b}$ for all $s \ge 1$. Now it is easy to see that $g : [1,\infty) \to [0,\infty)$ is strictly increasing if $0 <$ […] $> \tau) = P_X(h > g^{-1}(\tau))$. Now for a function $f : X \to [0,\infty)$ recall the nonincreasing rearrangement

\[ f^*(u) := \inf\{\sigma \ge 0 : P_X(f > \sigma) \le u\}, \qquad u > 0, \]

of $f$, which can be used to define Lorentz norms (see, e.g., [5]). For $u > 0$ equation (31) then yields

\[ (g\circ h)^*(u) = g\bigl( \inf\{ g^{-1}(\sigma) : P_X(h > g^{-1}(\sigma)) \le u \} \bigr) = g\circ h^*(u). \]
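To make the rearrangement identity concrete, the following sketch computes $f^*$ for the uniform measure on a finite sample and compares $(g\circ h)^*$ with $g\circ h^*$. The sample values of $h$, the levels $u$ and the parameters $\gamma = 1$, $b = 0.01$ are hypothetical; they are chosen so that every sample value lies below $(\gamma b)^{-\gamma/2} = 10$, where this $g$ is increasing.

```python
import math

def rearrangement(values, u):
    """f*(u) = inf{s >= 0 : P(f > s) <= u} for the uniform measure on a finite sample."""
    m = len(values)
    k = math.floor(u * m)          # at most k sample points may exceed the level
    if k >= m:
        return 0.0
    return sorted(values, reverse=True)[k]

def g(s, gamma=1.0, b=0.01):
    """g(s) = s**-2 * exp(-s**(-2/gamma) / b); increasing for s < (gamma*b)**(-gamma/2)."""
    return s ** -2 * math.exp(-(s ** (-2.0 / gamma)) / b)

h_values = [1.0, 1.5, 2.0, 3.0, 4.5, 6.0, 7.5, 9.0]   # hypothetical values of h >= 1
levels = [0.125, 0.25, 0.5, 0.75]

lhs = [rearrangement([g(s) for s in h_values], u) for u in levels]   # (g o h)*(u)
rhs = [g(rearrangement(h_values, u)) for u in levels]                # g(h*(u))
```

Because an increasing $g$ preserves the ordering of the sample, the two lists coincide, which is the content of the identity above in this discrete setting.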

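Now, inequality (5) implies $P_X\bigl(h \ge (C/u)^{1/q}\bigr) \le u$ for all $u > 0$. Therefore, we find

\[ h^*(u) \le \inf\{\sigma \ge 0 : P_X(h \ge \sigma) \le u\} \le \Bigl(\frac{C}{u}\Bigr)^{1/q} \]

for all $0 <$ […] $> 0$ the bound $e^{-x} \preceq \dfrac{x^{-\hat\alpha}}{\ln^2(x)+1}$ on $(0,\infty)$ implies

\[ g(s) \preceq \frac{b^{\hat\alpha}\, s^{2(\hat\alpha/\gamma - 1)}}{\ln^2(s^{-2/\gamma} b^{-1}) + 1} \]

for $s \in [1,\infty)$. Using the fact that $(g\circ h)^*(u) = 0$ holds for all $u \ge 1$, we hence obtain

\[ (g\circ h)^*(u) \preceq \frac{b^{\hat\alpha}\, u^{\frac2q(1-\hat\alpha/\gamma)}}{\ln^2\bigl((u/C)^{2/(q\gamma)} b^{-1}\bigr) + 1} \]

for $u > 0$ if we assume without loss of generality that $C \ge 1$. Let us define $\hat\alpha := \gamma\frac{q+1}{2}$. Then we find $\frac1{q'} + \frac2q\bigl(1 - \frac{\hat\alpha}{\gamma}\bigr) = 0$ and consequently for $b \le \frac{2}{3\gamma}$, that is, $t \le \frac{2}{3\gamma(c_\gamma)^{2/\gamma}}$, we obtain

\[ \|g\circ h\|_{q',1} = \int_0^\infty u^{1/q'-1}(g\circ h)^*(u)\,du \preceq b^{\hat\alpha} \int_0^1 \frac{u^{-1}}{\ln^2\bigl((u/C)^{2/(q\gamma)} b^{-1}\bigr) + 1}\,du \preceq t^{\gamma(q+1)/2} \tag{32} \]

by the definition of $b$. Since we also have $E_{P_X}\bigl(|2\eta(x)-1|\,e^{-\tau_x^2/t}\bigr) \le 1$ for all $t > 0$, estimate (30) together with the definition of $g$ and (32) yields the assertion in the case $q \ge 1$.

Let us now consider the case $0 \le q < 1$ where the Hölder inequality in Lorentz spaces cannot be used. Then for all $t, \tau > 0$ we have

\[ \begin{aligned} E_{x\sim P_X}\bigl( |2\eta(x)-1|\, e^{-\tau_x^2/t} \bigr) &= \int_{|2\eta-1| \le \tau} |2\eta(x)-1|\, e^{-\tau_x^2/t}\, P_X(dx) + \int_{|2\eta-1| > \tau} |2\eta(x)-1|\, e^{-\tau_x^2/t}\, P_X(dx) \\ &\le C\tau^{q+1} + \exp\Bigl( -\Bigl(\frac{\tau}{c_\gamma}\Bigr)^{2/\gamma} t^{-1} \Bigr), \end{aligned} \tag{33} \]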

where we have used the Tsybakov assumption (5) and the fact that $P$ has an envelope of order $\gamma$. Let us define $\tau$ by $\tau^{q+1} := \exp\bigl(-(\tau/c_\gamma)^{2/\gamma} t^{-1}\bigr)$. For $\hat a := (c_\gamma)^{2/\gamma}(q+1)$ and small $t$ this definition implies

\[ \tau \le \Bigl(\frac{\hat a\gamma}{2}\Bigr)^{\gamma/2} \Bigl( t\ln\frac{1}{\hat a t} \Bigr)^{\gamma/2}, \]

and hence the assertion follows from (33) for the case $0 < q < 1$.

5. The estimation error of ERM-type classifiers. To bound the estimation error in the proof of Theorem 2.8 we now establish a concentration inequality for ERM-type algorithms using a variant of Talagrand's concentration inequality together with local Rademacher averages (see, e.g., [2, 4, 21]). Our approach is inspired by [3]. However, due to the regularization term $\lambda\|f\|_H^2$ in the definition of SVMs we need a more general result than that of [3]. This section is organized as follows: In Section 5.1 we present the required modification of the result of [3]. Then in Section 5.2 we bound the resulting local Rademacher averages.

5.1. Bounding the estimation error for ERM-type algorithms. We first have to introduce some notation. To this end let $F$ be a class of bounded measurable functions from $Z$ to $\mathbb{R}$ such that $F$ is separable with respect to $\|\cdot\|_\infty$. Given a probability measure $P$ on $Z$ we define the modulus of continuity of $F$ by

\[ \omega_n(F,\varepsilon) := \omega_{P,n}(F,\varepsilon) := E_{T\sim P^n}\Bigl( \sup_{f\in F,\ E_P f^2 \le \varepsilon} |E_P f - E_T f| \Bigr), \qquad \varepsilon > 0, \]

where we note that the supremum is, as a function from $Z^n$ to $\mathbb{R}$, measurable by the separability assumption on $F$. Now, a function $L : F \times Z \to [0,\infty)$ is called a loss function if $L\circ f := L(f,\cdot)$ is measurable for all $f \in F$. Given a probability measure $P$ on $Z$ we indicate by $f_{P,F} \in F$ a minimizer of

\[ f \mapsto \mathcal{R}_{L,P}(f) := E_{z\sim P} L(f,z). \]

Throughout this paper $\mathcal{R}_{L,P}(f)$ is called the $L$-risk of $f$. If $P$ is an empirical measure with respect to $T \in Z^n$ we write $f_{T,F}$ and $\mathcal{R}_{L,T}(\cdot)$ as usual. For simplicity, we assume throughout this section that $f_{P,F}$ and $f_{T,F}$ do exist. Furthermore, although there may be multiple solutions we use a single symbol for them whenever no confusion regarding the nonuniqueness of this symbol can be expected.
An algorithm that produces solutions $f_{T,F}$ is called an empirical $L$-risk minimizer. Moreover, if $F$ is convex, we say that $L$ is convex if $L(\cdot,z)$ is convex for all $z \in Z$. Finally, $L$ is called line-continuous if for all $z \in Z$ and all $f, \hat f \in F$ the function $t \mapsto L(tf + (1-t)\hat f, z)$ is continuous on $[0,1]$. If $F$ is a vector space then every convex $L$ is line-continuous. Now the main result of this section reads as follows:

Theorem 5.1. Let $F$ be a convex set of bounded measurable functions from $Z$ to $\mathbb{R}$, and let $L : F \times Z \to [0,\infty)$ be a convex and line-continuous loss function. For a probability measure $P$ on $Z$ we define

\[ G := \{ L\circ f - L\circ f_{P,F} : f \in F \}. \]

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in G$. Furthermore, assume that $G$ is separable with respect to $\|\cdot\|_\infty$. Let $n \ge 1$, $x \ge 1$ and $\varepsilon > 0$ with

\[ \varepsilon \ge 10\max\Bigl\{ \omega_n(G, c\varepsilon^\alpha + \delta),\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n} \Bigr\}. \tag{34} \]

Then we have

\[ \Pr{}^*\bigl( T \in Z^n : \mathcal{R}_{L,P}(f_{T,F}) < \mathcal{R}_{L,P}(f_{P,F}) + \varepsilon \bigr) \ge 1 - e^{-x}. \]

Remark 5.2. Theorem 5.1 has been proved in [3] for $\delta = 0$, where it was used to find learning rates faster than $n^{-1/2}$ for certain ERM-type algorithms. At first glance such fast rates are impossible if $\delta > 0$. However, we will see later that for SVMs we have $\delta = a_\sigma^\kappa(\lambda)$ for a suitable $\kappa > 0$ depending on both Tsybakov's and the geometric noise exponent, and hence we have $\delta \to 0$ for $n \to \infty$.

As already mentioned, the proof of Theorem 5.1 is based on Talagrand's concentration inequality in [33] and its refinements in [16, 20, 25]. The version below of this inequality is derived from Bousquet's result in [9] using a little trick presented in [2], Lemma 2.5.

Theorem 5.3. Let $P$ be a probability measure on $Z$ and $H$ be a set of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$ and satisfies $E_P h = 0$ for all $h \in H$. Furthermore, let $b > 0$ and $\tau \ge 0$ be constants with $\|h\|_\infty \le b$ and $E_P h^2 \le \tau$ for all $h \in H$. Then for all $x \ge 1$ and all $n \ge 1$ we have

\[ P^n\Bigl( T \in Z^n : \sup_{h\in H} E_T h > 3E_{T'\sim P^n} \sup_{h\in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n} \Bigr) \le e^{-x}. \]

This concentration inequality is used to prove the following lemma which is a generalized version of Lemma 13 in [3].

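Lemma 5.4. Let $P$ be a probability measure on $Z$ and $G$ be a set of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$. Let $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ be constants with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in G$. Furthermore, assume that for all $T \in Z^n$ and all $\varepsilon > 0$ for which for some $g \in G$ we have

\[ E_T g \le \varepsilon/20 \quad\text{and}\quad E_P g \ge \varepsilon \]

there exists a $g^* \in G$ which satisfies

\[ E_T g^* \le \varepsilon/20 \quad\text{and}\quad E_P g^* = \varepsilon. \]

Then for all $n \ge 1$, $x \ge 1$, and all $\varepsilon > 0$ satisfying (34), we have

\[ \Pr{}^*\bigl( T \in Z^n : \text{for all } g \in G \text{ with } E_T g \le \varepsilon/20 \text{ we have } E_P g < \varepsilon \bigr) \ge 1 - e^{-x}. \]

Proof. We define $H := \{ E_P g - g : g \in G,\ E_P g = \varepsilon \}$. Obviously, we have $E_P h = 0$, $\|h\|_\infty \le 2B$, and

\[ E_P h^2 = E_P g^2 - (E_P g)^2 \le c\varepsilon^\alpha + \delta \]

for all $h \in H$. Moreover, since it is also easy to verify that $H$ is separable with respect to $\|\cdot\|_\infty$, our assumption on $G$ yields

\[ \begin{aligned} &\Pr{}^*\bigl( T \in Z^n : \exists g \in G \text{ with } E_T g \le \varepsilon/20 \text{ and } E_P g \ge \varepsilon \bigr) \\ &\qquad \le \Pr{}^*\bigl( T \in Z^n : \exists g \in G \text{ with } E_P g - E_T g \ge 19\varepsilon/20 \text{ and } E_P g = \varepsilon \bigr) \\ &\qquad \le P^n\Bigl( T \in Z^n : \sup_{h\in H} E_T h \ge 19\varepsilon/20 \Bigr). \end{aligned} \]

Note that since $H$ is separable with respect to $\|\cdot\|_\infty$, the set on the last line is actually measurable. In order to bound the last probability we will apply Theorem 5.3. To this end we have to show

\[ \frac{19\varepsilon}{20} > 3E_{T'\sim P^n} \sup_{h\in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}. \]

Our assumptions on $\varepsilon$ imply

\[ \varepsilon \ge 10E_{T'\sim P^n}\Bigl( \sup_{g\in G,\ E_P g^2 \le c\varepsilon^\alpha+\delta} |E_P g - E_{T'} g| \Bigr) \ge 10E_{T'\sim P^n} \sup_{h\in H} E_{T'} h. \tag{35} \]

Furthermore, since $10 \ge (\frac{60}{19})^2$ and $0 < \alpha \le 1$ we have

\[ \varepsilon \ge 10\Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)} \ge \Bigl(\frac{60}{19}\Bigr)^{2/(2-\alpha)} \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)}. \tag{36} \]

If $\delta \le c\varepsilon^\alpha$ a simple calculation hence shows $\varepsilon \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha+\delta)x}{n}}$. Furthermore, if $\delta > c\varepsilon^\alpha$ the assumptions of the theorem show

\[ \varepsilon \ge 10\sqrt{\frac{\delta x}{n}} \ge \frac{60}{19}\sqrt{\frac{4\delta x}{n}} \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha+\delta)x}{n}}. \]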
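The numerical constants in this step can be checked directly. The sketch below verifies that $(60/19)^{2/(2-\alpha)} \le 10$ on a grid of $\alpha$, that the two $x$-dependent requirements of (34) force $\varepsilon \ge \frac{60}{19}\sqrt{2(c\varepsilon^\alpha+\delta)x/n}$ for a few hypothetical parameter choices, and that the three contributions $\frac{3}{10}\varepsilon$, $\frac{19}{60}\varepsilon$ and $\frac{2}{10}\varepsilon$ used in the step that follows stay below $\frac{19}{20}\varepsilon$. All parameter values and function names are illustrative.

```python
import math
from fractions import Fraction

RATIO = 60 / 19

# 10 >= (60/19)**2 >= (60/19)**(2/(2 - alpha)) for 0 < alpha <= 1, which gives (36).
alphas = [k / 100 for k in range(1, 101)]
powers_ok = all(RATIO ** (2 / (2 - a)) <= 10 for a in alphas)

def eps_from_34(c, alpha, delta, x, n):
    """Smallest eps meeting the two x-dependent requirements of (34) used in this step."""
    return 10 * max(math.sqrt(delta * x / n), (4 * c * x / n) ** (1 / (2 - alpha)))

def lower_bound_holds(c, alpha, delta, x, n, eps):
    """Check eps >= (60/19) * sqrt(2 (c eps^alpha + delta) x / n)."""
    return eps >= RATIO * math.sqrt(2 * (c * eps ** alpha + delta) * x / n)

# Hypothetical parameter sets (c, alpha, delta, x, n).
cases = [(1.0, 1.0, 0.0, 1.0, 100), (0.5, 0.3, 0.2, 2.0, 50), (2.0, 0.7, 1e-3, 5.0, 1000)]
bounds_ok = all(
    lower_bound_holds(c, a, d, x, n, eps_from_34(c, a, d, x, n)) for c, a, d, x, n in cases
)

# Shares of eps taken by 3 E sup E_T h, sqrt(2 x tau / n) and b x / n with b = 2B.
budget = Fraction(3, 10) + Fraction(19, 60) + Fraction(2, 10)
budget_ok = budget < Fraction(19, 20)
```

The exact fraction `budget` equals 49/60, which leaves room below 57/60 and explains why the strict inequality required by Theorem 5.3 holds.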

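Hence we have $\varepsilon \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha+\delta)x}{n}}$ for all $\varepsilon$ satisfying the assumptions of the theorem. Now let $\tau := c\varepsilon^\alpha + \delta$ and $b := 2B$. By (35) and $\varepsilon \ge 10Bx/n$ we then find

\[ \frac{19\varepsilon}{20} > 3E_{T'\sim P^n} \sup_{h\in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}. \]

Applying Theorem 5.3 then yields

\[ \begin{aligned} &\Pr{}^*\bigl( T \in Z^n : \exists g \in G \text{ with } E_T g \le \varepsilon/20 \text{ and } E_P g \ge \varepsilon \bigr) \\ &\qquad \le P^n\Bigl( T \in Z^n : \sup_{h\in H} E_T h \ge 19\varepsilon/20 \Bigr) \\ &\qquad \le P^n\Bigl( T \in Z^n : \sup_{h\in H} E_T h > 3E_{T'\sim P^n} \sup_{h\in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n} \Bigr) \\ &\qquad \le e^{-x}. \end{aligned} \]

With the help of the above lemma we can now prove the main result of this section, that is, Theorem 5.1.

Proof of Theorem 5.1. In order to apply Lemma 5.4 to the class $G$ it obviously suffices to show the richness condition on $G$ of Lemma 5.4. To this end let $f \in F$ with

\[ E_T(L\circ f - L\circ f_{P,F}) \le \varepsilon/20 \quad\text{and}\quad E_P(L\circ f - L\circ f_{P,F}) \ge \varepsilon. \]

For $t \in [0,1]$ we define $f_t := tf + (1-t)f_{P,F}$. Since $F$ is convex we have $f_t \in F$ for all $t \in [0,1]$. By the line-continuity of $L$ and Lebesgue's theorem we find that the map $h : t \mapsto E_P(L\circ f_t - L\circ f_{P,F})$, which maps from $[0,1]$ to $[0,B]$, is continuous. Since $h(0) = 0$ and $h(1) \ge \varepsilon$ there is a $t \in (0,1]$ with

\[ E_P(L\circ f_t - L\circ f_{P,F}) = h(t) = \varepsilon \]

by the intermediate value theorem. Moreover, for this $t$ we have

\[ E_T(L\circ f_t - L\circ f_{P,F}) \le E_T\bigl( tL\circ f + (1-t)L\circ f_{P,F} - L\circ f_{P,F} \bigr) \le \varepsilon/20. \]

Now, let $\varepsilon > 0$ with $\varepsilon \ge 10\max\bigl\{ \omega_n(G, c\varepsilon^\alpha+\delta),\ (\frac{\delta x}{n})^{1/2},\ (\frac{4cx}{n})^{1/(2-\alpha)},\ \frac{Bx}{n} \bigr\}$. Then by Lemma 5.4 we find that with probability at least $1 - e^{-x}$, every $f \in F$ with $E_T(L\circ f - L\circ f_{P,F}) \le \varepsilon/20$ satisfies $E_P(L\circ f - L\circ f_{P,F}) < \varepsilon$. Since we always have

\[ E_T(L\circ f_{T,F} - L\circ f_{P,F}) \le 0 < \varepsilon/20, \]

we obtain the assertion.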
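Condition (34) is implicit in $\varepsilon$, because $\varepsilon$ also appears inside the modulus of continuity. As a rough illustration of how an admissible $\varepsilon$ can be found, the sketch below iterates the right-hand side of (34) downward from a safe starting value. The modulus bound `toy_modulus`, with $\omega_n(G,r) \le A\sqrt{r/n}$, is purely hypothetical and is not taken from the paper; it only stands in for whatever bound on $\omega_n$ is available.

```python
import math

def rhs(eps, n, x, c, alpha, delta, B, modulus):
    """Right-hand side of (34): 10 * max{omega_n(G, c eps^alpha + delta), ...}."""
    return 10 * max(
        modulus(c * eps ** alpha + delta, n),
        math.sqrt(delta * x / n),
        (4 * c * x / n) ** (1 / (2 - alpha)),
        B * x / n,
    )

def admissible_eps(n, x, c, alpha, delta, B, modulus, iters=200):
    """Return an eps with eps >= rhs(eps), found by iterating downward from a safe start."""
    eps = 1.0
    while eps < rhs(eps, n, x, c, alpha, delta, B, modulus):
        eps *= 2.0   # rhs grows sublinearly in eps for the toy modulus, so this stops
    for _ in range(iters):
        # rhs is nondecreasing in eps, so every iterate remains admissible
        eps = rhs(eps, n, x, c, alpha, delta, B, modulus)
    return eps

def toy_modulus(r, n, A=2.0):
    """HYPOTHETICAL bound omega_n(G, r) <= A * sqrt(r / n); not from the paper."""
    return A * math.sqrt(r / n)

eps_100 = admissible_eps(100, 1.0, 1.0, 1.0, 0.0, 1.0, toy_modulus)
eps_10000 = admissible_eps(10000, 1.0, 1.0, 1.0, 0.0, 1.0, toy_modulus)
```

With $\alpha = 1$ and $\delta = 0$ the toy modulus term dominates and the fixed point scales like $1/n$, which is the kind of fast rate that Remark 5.2 refers to for $\delta = 0$.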

5.2. Bounding the modulus of continuity. The aim of this subsection is to bound the modulus of continuity of the class $G$ in Theorem 5.1 with the help of covering numbers. We then present the resulting modification of Theorem 5.1.

Let us begin by recalling the definition of (local) Rademacher averages. To this end let $F$ be a class of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$. Furthermore, let $P$ be a probability measure on $Z$ and $(\varepsilon_i)$ be a sequence of i.i.d. Rademacher variables (i.e., symmetric $\{-1,1\}$-valued random variables) with respect to some probability measure $\mu$ on a set $\Omega$. Then the Rademacher average of $F$ is

\[ \mathrm{Rad}_P(F,n) := \mathrm{Rad}(F,n) := E_{P^n}E_\mu \sup_{f\in F} \Bigl| \frac1n \sum_{i=1}^n \varepsilon_i f(z_i) \Bigr|, \]

and for $\varepsilon > 0$ the local Rademacher average of $F$ is defined by

\[ \mathrm{Rad}(F,n,\varepsilon) := \mathrm{Rad}_P(F,n,\varepsilon) := E_{P^n}E_\mu \sup_{f\in F,\ E_P f^2 \le \varepsilon} \Bigl| \frac1n \sum_{i=1}^n \varepsilon_i f(z_i) \Bigr|. \]

For a given $a > 0$ we immediately obtain $\mathrm{Rad}(aF,n) = a\,\mathrm{Rad}(F,n)$ and

\[ \mathrm{Rad}(aF,n,\varepsilon) = a\,\mathrm{Rad}(F,n,a^{-2}\varepsilon). \tag{37} \]

Moreover, by symmetrization the modulus of continuity can be estimated by the local Rademacher average. More precisely, we always have (see [36])

\[ \omega_{P,n}(F,\varepsilon) \le 2\,\mathrm{Rad}_P(F,n,\varepsilon), \qquad \varepsilon > 0. \]

Local Rademacher averages can be estimated by covering numbers. Without proof we state a slight modification of a corresponding result in [21]:

Proposition 5.5. Let $F$ be a class of measurable functions from $Z$ to $[-1,1]$ which is separable with respect to $\|\cdot\|_\infty$ and let $P$ be a probability measure on $Z$. Assume there are constants $a > 0$ and $0 < p < 2$ with

\[ \sup_{T\in Z^n} \log N(F,\varepsilon,L_2(T)) \le a\varepsilon^{-p} \tag{38} \]

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such that for all $n \ge 1$ and all $\varepsilon > 0$ we have

\[ \mathrm{Rad}(F,n,\varepsilon) \le c_p \max\Bigl\{ \varepsilon^{1/2-p/4}\Bigl(\frac an\Bigr)^{1/2},\ \Bigl(\frac an\Bigr)^{2/(2+p)} \Bigr\}. \]

Using this proposition we can replace the modulus of continuity in Theorem 5.1 by an assumption on the covering numbers of $G$. Assuming that all resulting minimizers exist, the corresponding result then reads as follows:

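Theorem 5.6. Let $F$ be a convex set of bounded measurable functions from $Z$ to $\mathbb{R}$ and let $L : F \times Z \to [0,\infty)$ be a convex and line-continuous loss function. For a probability measure $P$ on $Z$ we define

\[ G := \{ L\circ f - L\circ f_{P,F} : f \in F \}. \]

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in G$. Furthermore, assume that $G$ is separable with respect to $\|\cdot\|_\infty$ and that there are constants $a \ge 1$ and $0 < p < 2$ with

\[ \sup_{T\in Z^n} \log N(B^{-1}G,\varepsilon,L_2(T)) \le a\varepsilon^{-p} \]

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such that for all $n \ge 1$ and all $x \ge 1$ we have

\[ \Pr{}^*\bigl( T \in Z^n : \mathcal{R}_{L,P}(f_{T,F}) > \mathcal{R}_{L,P}(f_{P,F}) + c_p\,\varepsilon(n,a,B,c,\delta,x) \bigr) \le e^{-x}, \]

where

\[ \begin{aligned} \varepsilon(n,a,B,c,\delta,x) := {}& B^{2p/(4-2\alpha+\alpha p)}\, c^{(2-p)/(4-2\alpha+\alpha p)} \Bigl(\frac an\Bigr)^{2/(4-2\alpha+\alpha p)} + B^{p/2}\delta^{(2-p)/4}\Bigl(\frac an\Bigr)^{1/2} \\ & + B\Bigl(\frac an\Bigr)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \Bigl(\frac{cx}{n}\Bigr)^{1/(2-\alpha)} + \frac{Bx}{n}. \end{aligned} \]

Proof. By (37) and Proposition 5.5 we find

\[ \mathrm{Rad}(G,n,\varepsilon) \le c_p \max\Bigl\{ B^{p/2}\varepsilon^{1/2-p/4}\Bigl(\frac an\Bigr)^{1/2},\ B\Bigl(\frac an\Bigr)^{2/(2+p)} \Bigr\}. \]

We assume without loss of generality that $c_p \ge 5$. Let $\varepsilon^* > 0$ be the largest real number that satisfies

\[ \varepsilon^* = 2c_p B^{p/2}\bigl( c(\varepsilon^*)^\alpha + \delta \bigr)^{1/2-p/4}\Bigl(\frac an\Bigr)^{1/2}. \tag{39} \]

Furthermore, let $\varepsilon > 0$ be such that

\[ \varepsilon = 2c_p \max\Bigl\{ B^{p/2}(c\varepsilon^\alpha+\delta)^{(2-p)/4}\Bigl(\frac an\Bigr)^{1/2},\ B\Bigl(\frac an\Bigr)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n} \Bigr\}. \]

It is easy to see that both $\varepsilon$ and $\varepsilon^*$ exist. Moreover, our above considerations show $\varepsilon \ge 10\max\bigl\{ \omega_n(G, c\varepsilon^\alpha+\delta),\ (\frac{\delta x}{n})^{1/2},\ (\frac{4cx}{n})^{1/(2-\alpha)},\ \frac{Bx}{n} \bigr\}$, that is, $\varepsilon$ satisfies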

the assumptions of Theorem 5.1. In order to show the assertion it therefore suffices to bound $\varepsilon$ from above. To this end let us first assume that

\[ B^{p/2}(c\varepsilon^\alpha+\delta)^{(2-p)/4}\Bigl(\frac an\Bigr)^{1/2} \ge \max\Bigl\{ B\Bigl(\frac an\Bigr)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n} \Bigr\}. \]

Then we have $\varepsilon = 2c_p B^{p/2}(c\varepsilon^\alpha+\delta)^{(2-p)/4}(\frac an)^{1/2}$. Since $\varepsilon^*$ is the largest solution of this equation we hence find $\varepsilon \le \varepsilon^*$. This shows that we always have

\[ \varepsilon \le \varepsilon^* + 2c_p\Bigl( B\Bigl(\frac an\Bigr)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)} + \frac{Bx}{n} \Bigr). \]

Hence it suffices to bound $\varepsilon^*$ from above. To this end let us first assume $c(\varepsilon^*)^\alpha \ge \delta$. This implies $\varepsilon^* \le 4c_p B^{p/2}\bigl(c(\varepsilon^*)^\alpha\bigr)^{1/2-p/4}(\frac an)^{1/2}$, and hence we find

\[ \varepsilon^* \le 16c_p^2\, B^{2p/(4-2\alpha+\alpha p)}\, c^{(2-p)/(4-2\alpha+\alpha p)} \Bigl(\frac an\Bigr)^{2/(4-2\alpha+\alpha p)}. \]

Conversely, if $c(\varepsilon^*)^\alpha < \delta$ holds, then we immediately obtain

\[ \varepsilon^* < 4c_p B^{p/2}\delta^{(2-p)/4}\Bigl(\frac an\Bigr)^{1/2}. \]

6. Variance bounds for SVMs. In this section we prove some variance bounds in the sense of Theorem 5.6 for SVMs. Let us first ensure that these classifiers are ERM-type algorithms that fit into the framework of Theorem 5.6. To this end let $H$ be a RKHS of a continuous kernel over $X$, $\lambda > 0$, and $l : Y \times \mathbb{R} \to [0,\infty)$ be the hinge loss function. We define

\[ L(f,x,y) := \lambda\|f\|_H^2 + l(y,f(x)) \tag{40} \]

and

\[ L(f,b,x,y) := \lambda\|f\|_H^2 + l(y,f(x)+b) \tag{41} \]

for all $f \in H$, $b \in \mathbb{R}$, $x \in X$ and $y \in Y$. Then $\mathcal{R}_{L,T}(\cdot)$ and $\mathcal{R}_{L,T}(\cdot,\cdot)$ obviously coincide with the objective functions of the SVM formulations and therefore SVMs are empirical $L$-risk minimizers. Furthermore note that all above minimizers exist (see [31]) and thus the SVM formulations in terms of $L$ actually fit into the framework of Theorem 5.6. In the following, $f_{l,P}$ denotes a minimizer of $\mathcal{R}_{l,P}$ if no confusion can arise. For the shape of these minimizers, which depend on $\eta := P(y = 1 \mid \cdot)$, we refer to [39] and [30]. Now our first result is a variance bound which can be used when considering the empirical $l$-risk minimizer.
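As a small illustration of (40) and (41), the sketch below evaluates the empirical $L$-risk $\mathcal{R}_{L,T}$ for a function of the form $f = \sum_j \alpha_j k(x_j,\cdot)$, for which $\|f\|_H^2 = \alpha^{\top}K\alpha$ with the kernel matrix $K$. The kernel convention $k(u,v) = \exp(-\sigma^2\|u-v\|^2)$, the function names and the toy data are assumptions made for this example only.

```python
import math

def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)

def gaussian_kernel(u, v, sigma):
    """ASSUMED convention k(u, v) = exp(-sigma^2 ||u - v||^2)."""
    return math.exp(-sigma ** 2 * sum((a - b) ** 2 for a, b in zip(u, v)))

def empirical_L_risk(alpha, xs, ys, lam, sigma, b=0.0):
    """Empirical risk of L from (40), or of (41) when an offset b is supplied,
    for f = sum_j alpha_j k(x_j, .), whose squared RKHS norm is alpha^T K alpha."""
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], sigma) for j in range(n)] for i in range(n)]
    f_vals = [sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]
    norm_sq = sum(alpha[i] * f_vals[i] for i in range(n))
    data_fit = sum(hinge(ys[i], f_vals[i] + b) for i in range(n)) / n
    return lam * norm_sq + data_fit

# Toy data: two well-separated points with opposite labels.
xs, ys = [(0.0,), (3.0,)], [1, -1]
risk_zero = empirical_L_risk([0.0, 0.0], xs, ys, lam=0.1, sigma=1.0)
risk_fit = empirical_L_risk([1.0, -1.0], xs, ys, lam=0.1, sigma=1.0)
```

The zero function pays the full hinge loss of 1 at every sample, whereas the fitted expansion trades a small norm penalty for an almost vanishing hinge term; minimizing this objective over $f$ is what makes the SVM an empirical $L$-risk minimizer.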

Lemma 6.1. Let $P$ be a distribution on $X \times Y$ with Tsybakov noise exponent $0 \le q \le \infty$. Then there exists a minimizer $f_{l,P}$ mapping into $[-1,1]$ such that for all bounded measurable functions $f : X \to \mathbb{R}$ we have

\[ E_P(l\circ f - l\circ f_{l,P})^2 \le C_{\eta,q}\,(\|f\|_\infty + 1)^{(q+2)/(q+1)}\,\bigl( E_P(l\circ f - l\circ f_{l,P}) \bigr)^{q/(q+1)}, \]

where $C_{\eta,q} := \|(2\eta-1)^{-1}\|_{q,\infty} + 2$ if $q > 0$ and $C_{\eta,q} = 1$ if $q = 0$.

Proof. For $q = 0$ the assertion is trivial and hence we only consider the case $q > 0$. Given a fixed $x \in X$ we write $p := P(1 \mid x)$ and $t := f(x)$. In addition, we introduce

\[ \begin{aligned} v(p,t) &:= p\bigl( l(1,t) - l(1,f_{l,P}(x)) \bigr)^2 + (1-p)\bigl( l(-1,t) - l(-1,f_{l,P}(x)) \bigr)^2, \\ m(p,t) &:= p\bigl( l(1,t) - l(1,f_{l,P}(x)) \bigr) + (1-p)\bigl( l(-1,t) - l(-1,f_{l,P}(x)) \bigr). \end{aligned} \]

Since Tsybakov's noise assumption implies $P_X(X_0) = 0$, we can restrict our consideration to $p \ne 1/2$. Now we will begin by showing

\[ v(p,t) \le \Bigl( |t| + \frac{2}{|2p-1|} \Bigr) m(p,t). \tag{42} \]

Without loss of generality we may assume $p > 1/2$. Then we may set $f_{l,P}(x) := 1$ and thus we have $l(1,f_{l,P}(x)) = 0$ and $l(-1,f_{l,P}(x)) = 2$. Let us first consider the case $t \in [-1,1]$. Then we have $l(1,t) = 1-t$ and $l(-1,t) = 1+t$, and therefore (42) reduces to

\[ (1-t)^2 \le \Bigl( |t| + \frac{2}{2p-1} \Bigr)(2p-1)(1-t). \]

Obviously, the latter inequality is equivalent to $1 - t \le (2p-1)|t| + 2$, which is always satisfied for $t \in [-1,1]$ and $p \ge 1/2$. Now let us consider the case $t \le -1$. We then have $l(1,t) = 1-t$ and $l(-1,t) = 0$, and after some elementary calculation we hence see that (42) is satisfied if and only if

\[ p^2(6-2t) - p(5-3t) - 2t \ge 0. \]

The left-hand side is minimal if $p = (5-3t)/(12-4t)$, and thus we obtain

\[ p^2(6-2t) - p(5-3t) - 2t \ge \frac{7t^2 - 18t - 25}{24 - 8t}. \]

Consequently, it suffices to show $7t^2 - 18t - 25 \ge 0$. However, the latter is true for all $t \le -1$ since $t \mapsto 7t^2 - 18t - 25$ is decreasing on $(-\infty,-1]$. Now let us consider the third case, $t > 1$. Since we then have $l(1,t) = 0$ and $l(-1,t) = 1+t$ it suffices to show

\[ t - 1 \le t + \frac{2}{2p-1}. \]
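The three cases of this argument can be sanity-checked numerically. The sketch below evaluates $v(p,t)$ and $m(p,t)$ for the hinge loss on a grid and tests inequality (42); it takes $f_{l,P}(x) = 1$ for $p > 1/2$ as in the proof and, by symmetry, $f_{l,P}(x) = -1$ for $p < 1/2$. The grid and the tolerance are illustrative choices.

```python
def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)

def v_and_m(p, t):
    """Conditional second moment v(p, t) and conditional excess risk m(p, t)."""
    f_star = 1.0 if p > 0.5 else -1.0        # minimizer f_{l,P}(x) for p != 1/2
    d_pos = hinge(1, t) - hinge(1, f_star)
    d_neg = hinge(-1, t) - hinge(-1, f_star)
    v = p * d_pos ** 2 + (1 - p) * d_neg ** 2
    m = p * d_pos + (1 - p) * d_neg
    return v, m

def inequality_42_holds(p, t, tol=1e-12):
    """Check v(p, t) <= (|t| + 2 / |2p - 1|) * m(p, t) up to rounding."""
    v, m = v_and_m(p, t)
    return v <= (abs(t) + 2.0 / abs(2.0 * p - 1.0)) * m + tol

ps = [k / 100 for k in range(1, 100) if k != 50]   # p = 1/2 is excluded
ts = [k / 10 for k in range(-50, 51)]              # t in [-5, 5]
all_ok = all(inequality_42_holds(p, t) for p in ps for t in ts)
```

The check covers all three ranges of $t$ at once, including the boundary values $t = \pm 1$ where the case distinction switches.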
