arxiv: v1 [math.st] 14 Aug 2007

The Annals of Statistics 2007, Vol. 35, No. 2, DOI: / In the Public Domain. arXiv: v1 [math.ST] 14 Aug 2007

FAST RATES FOR SUPPORT VECTOR MACHINES USING GAUSSIAN KERNELS^1

By Ingo Steinwart and Clint Scovel

Los Alamos National Laboratory

For binary classification we establish learning rates up to the order of $n^{-1}$ for support vector machines (SVMs) with hinge loss and Gaussian RBF kernels. These rates are in terms of two assumptions on the considered distributions: Tsybakov's noise assumption to establish a small estimation error, and a new geometric noise condition which is used to bound the approximation error. Unlike previously proposed concepts for bounding the approximation error, the geometric noise assumption does not employ any smoothness assumption.

1. Introduction. In recent years support vector machines (SVMs) have been the subject of many theoretical considerations. Despite this effort, their learning performance on restricted classes of distributions is still widely unknown. In particular, it is unknown under which nontrivial circumstances SVMs can guarantee fast learning rates. The aim of this work is to use concepts like Tsybakov's noise assumption and local Rademacher averages to establish learning rates up to the order of $n^{-1}$ for nontrivial distributions. In addition to these concepts, which are used to deal with the stochastic part of the analysis, we also introduce a geometric assumption for distributions that allows us to estimate the approximation properties of Gaussian RBF kernels. Unlike many other concepts introduced for bounding the approximation error, our geometric assumption is not in terms of smoothness but describes the concentration and the noisiness of the data-generating distribution near the decision boundary.

Let us formally introduce the statistical classification problem. To this end let us fix a subset $X \subset \mathbb{R}^d$. We write $Y := \{-1, 1\}$. Given a finite training set

Received December 2003; revised June. ^1 Supported by the LDRD-ER program of the Los Alamos National Laboratory. AMS 2000 subject classifications.
Primary 68Q32; secondary 62G20, 62G99, 68T05, 68T10, 41A46, 41A99. Key words and phrases. Support vector machines, classification, nonlinear discrimination, learning rates, noise assumption, Gaussian RBF kernels. This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2007, Vol. 35, No. 2. This reprint differs from the original in pagination and typographic detail.

$T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X \times Y)^n$, the classification task is to predict the label $y$ of a new sample $(x,y)$. In the standard batch model it is assumed that the samples $(x_i,y_i)$ are i.i.d. according to an unknown (Borel) probability measure $P$ on $X \times Y$. Furthermore, the new sample $(x,y)$ is drawn from $P$ independently of $T$. Given a classifier $\mathcal{C}$ that assigns to every training set $T$ a measurable function $f_T : X \to \mathbb{R}$, the prediction of $\mathcal{C}$ for $y$ is $\operatorname{sign} f_T(x)$, where $\operatorname{sign}(0) := 1$. The quality of such a function $f$ is measured by the classification risk

$\mathcal{R}_P(f) := P(\{(x,y) : \operatorname{sign} f(x) \neq y\})$,

which should be as small as possible. The smallest achievable risk $\mathcal{R}_P := \inf\{\mathcal{R}_P(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$ is called the Bayes risk of $P$, and a function attaining this risk is called a Bayes decision function and is denoted by $f_P$.

Obviously, a good classifier should at least produce decision functions whose risks converge to the Bayes risk for all distributions $P$. This leads to the notion of universally consistent classifiers, which is thoroughly treated in [14]. The next naturally arising question is whether there are classifiers which guarantee a specific convergence rate for all distributions. Unfortunately, this is impossible by a result of Devroye (see [14], Theorem 7.2). However, if one restricts consideration to certain smaller classes of distributions, such learning rates, for example, in the form of

$P^n\bigl(T \in (X \times Y)^n : \mathcal{R}_P(f_T) \le \mathcal{R}_P + C(x)\, n^{-\beta}\bigr) \ge 1 - e^{-x}, \qquad n \ge 1,\ x \ge 1,$

where $\beta > 0$ and $C(x) > 0$ are constants, exist for various classifiers. Typical assumptions for such classes of distributions are either in terms of the smoothness of the function $\eta(x) := P(y = 1 \mid x)$ (see, e.g., [19, 38]), or in terms of the smoothness of the decision boundary (see, e.g., [18, 35]). Moreover, the corresponding learning rates are slower than $n^{-1/2}$ if no additional assumptions on the amount of the noise in the labels, for example, on the distribution of the random variable

(1) $\min\{1 - \eta(x), \eta(x)\} = \tfrac{1}{2} - \bigl|\eta(x) - \tfrac{1}{2}\bigr|$

around the critical level $1/2$, are imposed.
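The classification risk and the noise variable in (1) can be illustrated with a minimal Python sketch. The helper names (`sign`, `empirical_risk`, `label_noise`) and the toy sample are assumptions introduced here for illustration; they do not appear in the paper, and the empirical risk is only the sample analogue of $\mathcal{R}_P(f)$.

```python
def sign(t):
    """Sign convention of the text: sign(0) := 1."""
    return 1 if t >= 0 else -1


def empirical_risk(f, sample):
    """Fraction of pairs (x, y) in `sample` with sign(f(x)) != y.

    This is the classification risk with P replaced by the empirical
    distribution of the (hypothetical) sample.
    """
    errors = sum(1 for x, y in sample if sign(f(x)) != y)
    return errors / len(sample)


def label_noise(eta_x):
    """Left-hand side of (1): min{1 - eta(x), eta(x)}."""
    return min(1.0 - eta_x, eta_x)


# Toy one-dimensional sample, labels in {-1, 1} (illustrative only).
sample = [(-2.0, -1), (-0.5, -1), (0.0, 1), (0.3, -1), (1.5, 1)]
risk = empirical_risk(lambda x: x, sample)  # decision function f(x) = x
```

The identity in (1) can be checked pointwise by comparing `label_noise(e)` with `0.5 - abs(e - 0.5)` for values `e` in $[0,1]$.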
On the other hand, [35] showed that ERM-type classifiers can learn faster than $n^{-1/2}$ if one quantifies how likely the noise in (1) is close to $1/2$ (see Definition 2.2 in the following section). Unfortunately, however, the ERM classifier considered in [35] requires substantial knowledge on how to approximate the desired Bayes decision functions. Moreover, ERM classifiers are based on combinatorial optimization problems and hence they are usually hard to implement, and in general there exist no efficient algorithms. On the one hand, SVMs do not share the implementation issues of ERM since they are based on a convex optimization (see, e.g., [12, 26] for algorithmic aspects). On the other hand, however, their known learning rates are

rather unsatisfactory since either the assumptions on the distributions are too restrictive as in [28] or the established learning rates are too slow as in [37].

Our aim is to give SVMs a better theoretical foundation by establishing fast learning rates for a wide class of distributions. To this end we propose a geometric noise assumption (see Definition 2.3) which describes the concentration of the measure $|2\eta - 1|\,dP_X$ (where $P_X$ is the marginal distribution of $P$ with respect to $X$) near the decision boundary. This assumption is then used to determine the approximation properties of Gaussian kernels which are used in the SVMs we consider. Provided that the tuning parameters are optimally chosen, our main result then shows that the resulting learning rates for these classifiers can be as fast as $n^{-1}$.

The rest of this work is organized as follows: In Section 2 we introduce the main concepts of this work and then present our results. In Section 3 we recall some basic theory on reproducing kernel Hilbert spaces and prove a new covering number bound for Gaussian kernels that describes a trade-off between the kernel widths and the radii of the covering balls. In Section 4 we then show the approximation results that are related to our proposed geometric noise assumption. The last sections of the work contain the actual proof of our rates: In Section 5 we establish a general bound for ERM-type classifiers involving local Rademacher averages which is used to bound the estimation error in our analysis of SVMs. In order to apply this result we need variance bounds for SVMs, which are established in Section 6. Interestingly, it turns out that sharp versions of these bounds depend on both Tsybakov's noise assumption and the approximation properties of the kernel used. Finally, we prove our learning rates in Section 7.

2. Definitions and main results. In this section we first recall some basic notions related to support vector machines which are needed throughout this text.
In Section 2.2, we then present a covering number bound for Gaussian RBF kernels which will play an important role in our analysis of the estimation error of SVMs. In Section 2.3 we recall Tsybakov's noise assumption, which will allow us to establish learning rates faster than $n^{-1/2}$. Then, in Section 2.4, we introduce the new geometric assumption that is used to estimate the approximation error for SVMs with Gaussian RBF kernels. Finally, we present and discuss our learning rates in Section 2.5.

2.1. RKHSs, SVMs and basic definitions. For two functions $f$ and $g$ we use the notation $f(\lambda) \preceq g(\lambda)$ to mean that there exists a constant $C > 0$ such that $f(\lambda) \le C g(\lambda)$ over some specified range of values of $\lambda$. We also use the notation $\succeq$ with similar meaning and the notation $\sim$ when both $\preceq$ and $\succeq$ hold. In particular, we use the same notation for sequences. If not stated otherwise, $X$ always denotes a compact subset of $\mathbb{R}^d$ which is equipped with the Borel $\sigma$-algebra.

Recall (see, e.g., [1, 6]) that every positive definite kernel $k : X \times X \to \mathbb{R}$ has a unique reproducing kernel Hilbert space $H$ (RKHS) whose unit ball is denoted by $B_H$. Although we sometimes use generic kernels and RKHSs, we are mainly interested in Gaussian RBF kernels, which are the most widely used kernels in practice. Recall that these kernels are of the form

$k_\sigma(x,x') = \exp(-\sigma^2 \|x - x'\|_2^2), \qquad x, x' \in X,$

where $\sigma > 0$ is a free parameter whose inverse $1/\sigma$ is called the width of $k_\sigma$. We usually denote the corresponding RKHSs, which are thoroughly described in [32], by $H_\sigma(X)$ or simply $H_\sigma$.

Let us now recall the definition of SVMs. To this end let $P$ be a distribution on $X \times Y$ and $l : Y \times \mathbb{R} \to [0,\infty)$ be the hinge loss, that is, $l(y,t) := \max\{0, 1 - yt\}$, $y \in Y$, $t \in \mathbb{R}$. Furthermore, we define the $l$-risk of a measurable function $f : X \to \mathbb{R}$ by $\mathcal{R}_{l,P}(f) := \mathbb{E}_{(x,y)\sim P}\, l(y, f(x))$. Now let $H$ be an RKHS over $X$ consisting of measurable functions. For $\lambda > 0$ we denote a solution of

(2) $\arg\min_{f \in H,\ b \in \mathbb{R}} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f + b)\bigr)$

by $(\tilde f_{P,\lambda}, \tilde b_{P,\lambda})$. Recall that $\tilde f_{P,\lambda}$ is uniquely determined (see, e.g., [30]), while in some situations this is not true for the offset $\tilde b_{P,\lambda}$. In general we thus assume that $\tilde b_{P,\lambda}$ is an arbitrary solution. However, for the (trivial) distributions that satisfy $P(\{y^*\} \mid x) = 1$ $P_X$-a.s. for some $y^* \in Y$ we explicitly set $\tilde b_{P,\lambda} := y^*$ in order to control the size of the offset. Furthermore, if $P$ is an empirical distribution with respect to a training set $T = ((x_1,y_1),\dots,(x_n,y_n))$ we write $\mathcal{R}_{l,T}(f)$ and $(\tilde f_{T,\lambda}, \tilde b_{T,\lambda})$. Note that in this case the above condition under which we set $\tilde b_{T,\lambda} := y^*$ means that all labels $y_i$ of $T$ are equal to $y^*$. An algorithm that constructs $(\tilde f_{T,\lambda}, \tilde b_{T,\lambda})$ for every training set $T$ is called an SVM with offset. Furthermore, for $\lambda > 0$ we denote the unique solution of

(3) $\arg\min_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f)\bigr)$

by $f_{P,\lambda}$, and for empirical distributions based on a training set $T$ we again write $f_{T,\lambda}$. A corresponding algorithm is called an SVM without offset.
Recall that under some assumptions on the RKHS used and the choice of the regularization parameter $\lambda$ it can be shown that both SVM variants are universally consistent (see [29, 31, 39]); however, no satisfying learning rates have been established yet.
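To make the objects in (2) and (3) concrete, the following Python sketch evaluates the regularized empirical hinge risk for a function of the form $f = \sum_i c_i k_\sigma(x_i, \cdot)$, for which $\|f\|_{H_\sigma}^2 = \sum_{i,j} c_i c_j k_\sigma(x_i,x_j)$. It is only an evaluation of the objective under these assumptions, not a solver; the function names and the kernel-expansion parametrization are illustrative choices made here, not notation from the paper.

```python
import math


def gaussian_kernel(x, xp, sigma):
    """k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2); 1/sigma is the width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sigma ** 2 * sq_dist)


def hinge_loss(y, t):
    """l(y, t) = max{0, 1 - y t} for y in {-1, 1} and real t."""
    return max(0.0, 1.0 - y * t)


def svm_objective(coeffs, offset, sample, sigma, lam):
    """lam * ||f||_H^2 + empirical hinge risk of f + offset.

    Assumes f = sum_i coeffs[i] * k_sigma(x_i, .), so that
    ||f||_H^2 = coeffs^T G coeffs with Gram matrix G.
    Use offset = 0.0 for the variant without offset, as in (3).
    """
    xs = [x for x, _ in sample]
    n = len(sample)
    gram = [[gaussian_kernel(a, b, sigma) for b in xs] for a in xs]
    norm_sq = sum(coeffs[i] * gram[i][j] * coeffs[j]
                  for i in range(n) for j in range(n))
    risk = 0.0
    for i, (_, y) in enumerate(sample):
        f_xi = sum(coeffs[j] * gram[i][j] for j in range(n))
        risk += hinge_loss(y, f_xi + offset)
    return lam * norm_sq + risk / n
```

For instance, with all coefficients and the offset equal to zero, every sample point incurs hinge loss 1, so the objective equals 1 regardless of $\lambda$ and $\sigma$.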

We also emphasize that in many theoretical papers only SVMs without offset are considered since the offset often causes serious technical problems in the analysis. However, in practice usually SVMs with offset are used and therefore we feel that these algorithms should be considered in theory, too. As we will see, our techniques can be applied for both variants. The resulting rates coincide.

2.2. Covering numbers for Gaussian RKHSs. In order to bound the estimation error of SVMs we need a complexity measure for the RKHSs used, which is introduced in this section. To this end let $A \subset E$ be a subset of a Banach space $E$. The covering numbers of $A$ are defined by

$\mathcal{N}(A,\varepsilon,E) := \min\Bigl\{ n \ge 1 : \exists\, x_1,\dots,x_n \in E \text{ with } A \subset \bigcup_{i=1}^n (x_i + \varepsilon B_E) \Bigr\}, \qquad \varepsilon > 0,$

where $B_E$ denotes the closed unit ball of $E$. Moreover, for a bounded linear operator $S : E \to F$ between two Banach spaces $E$ and $F$, the covering numbers are $\mathcal{N}(S,\varepsilon) := \mathcal{N}(SB_E,\varepsilon,F)$. Given a training set $T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X \times Y)^n$ we denote the space of all equivalence classes of functions $f : X \times Y \to \mathbb{R}$ with norm

(4) $\|f\|_{L_2(T)} := \Bigl( \frac{1}{n} \sum_{i=1}^n |f(x_i,y_i)|^2 \Bigr)^{1/2}$

by $L_2(T)$. In other words, $L_2(T)$ is an $L_2$-space with respect to the empirical measure of $T$. Note that for a function $f : X \times Y \to \mathbb{R}$ a canonical representative in $L_2(T)$ is its restriction $f_{|T}$. In addition, $L_2(T_X)$ denotes the space of all (equivalence classes of) square integrable functions with respect to the empirical measure of $x_1,\dots,x_n$.

The proof of our learning rates uses the behavior of $\mathcal{N}(B_{H_\sigma(X)},\varepsilon,L_2(T_X))$ in $\varepsilon$ and $\sigma$ in order to bound the estimation error. Unfortunately, all known results on covering numbers for Gaussian RBF kernels emphasize the role of $\varepsilon$, and hence we will establish in Section 3 the following result, which describes a suitable trade-off between the influence of $\varepsilon$ and $\sigma$.

Theorem 2.1. Let $\sigma \ge 1$, $X \subset \mathbb{R}^d$ be a compact subset with nonempty interior, and $H_\sigma(X)$ be the RKHS of the Gaussian RBF kernel $k_\sigma$ on $X$.
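Then for all $0 < p \le 2$ and all $\delta > 0$, there exists a constant $c_{p,\delta,d} > 0$ independent of $\sigma$ such that for all $\varepsilon > 0$ we have

$\sup_{T \in (X \times Y)^n} \log \mathcal{N}(B_{H_\sigma(X)},\varepsilon,L_2(T_X)) \le c_{p,\delta,d}\, \sigma^{(1-p/2)(1+\delta)d}\, \varepsilon^{-p}.$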
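The covering numbers with respect to the empirical norm (4) can be made tangible for a finite set of functions, each represented by its vector of values on $x_1,\dots,x_n$. The sketch below builds a greedy $\varepsilon$-net, whose size is an upper bound on the covering number of that finite set; the greedy construction and the function names are assumptions made here for illustration and are not the technique used in the proofs of this paper.

```python
import math


def empirical_l2_dist(u, v):
    """Distance in L_2(T_X): sqrt((1/n) * sum_i (u_i - v_i)^2)."""
    n = len(u)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / n)


def greedy_cover_size(vectors, eps):
    """Size of a greedy eps-net of a finite set of value vectors.

    A vector becomes a new center only if it is farther than eps from
    all current centers, so every vector ends up within eps of some
    center. The result upper-bounds the covering number of the set.
    """
    centers = []
    for v in vectors:
        if all(empirical_l2_dist(v, c) > eps for c in centers):
            centers.append(v)
    return len(centers)
```

As $\varepsilon$ shrinks the net can only grow, which mirrors the $\varepsilon^{-p}$ factor in Theorem 2.1; the dependence on $\sigma$ is not visible in such a finite toy computation.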

2.3. Tsybakov's noise assumption. Now we recall Tsybakov's noise condition, which describes the amount of noise in the labels. In order to motivate Tsybakov's assumption let us first observe that by equation (1) the function $|2\eta - 1|$ can be used to describe the noise in the labels of a distribution $P$. Indeed, in regions where this function is close to 1 there is only a small amount of noise, whereas function values close to 0 only occur in regions with a high level of noise. The following definition, in which we use the convention $t^\infty := 0$ for $t \in (0,1)$, describes the size of the latter regions:

Definition 2.2. Let $0 \le q \le \infty$ and $P$ be a probability measure on $X \times Y$. We say that $P$ has Tsybakov noise exponent $q$ if there exists a constant $C > 0$ such that for all sufficiently small $t > 0$ we have

(5) $P_X(\{x \in X : |2\eta(x) - 1| \le t\}) \le C \cdot t^q.$

Obviously, $P$ has Tsybakov noise exponent $q > 0$ if and only if $|2\eta - 1|^{-1} \in L_{q,\infty}(P_X)$, where $L_{q,\infty}$ denotes a Lorentz space (see [5]). It is also easy to see that $P$ has Tsybakov noise exponent $q'$ for all $q' < q$ if $P$ has Tsybakov noise exponent $q$. Furthermore, all distributions obviously have noise exponent 0. In the other extreme case $q = \infty$ the conditional probability $\eta$ is bounded away from $1/2$. In particular, noise-free distributions have exponent $q = \infty$. Furthermore, for $q < \infty$ it is easy to check that Definition 2.2 is satisfied if and only if (5) holds for all $t > 0$ and a possibly different constant $C$. Finally, note that (5) does not make any assumptions on the location of the noisy set, and hence we prefer the notion noise condition rather than the often used term margin condition.

2.4. A new geometric assumption for distributions. In this section we introduce a condition for distributions that will allow us to estimate the approximation error for Gaussian RBF kernels. To this end let $l$ be the hinge loss function and $P$ be a distribution on $X \times Y$. Let $\mathcal{R}_{l,P} := \inf\{\mathcal{R}_{l,P}(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$ denote the smallest possible $l$-risk of $P$.
Since functions achieving the minimal $l$-risk occur in many situations, we indicate them by $f_{l,P}$ if no confusion regarding the nonuniqueness of this symbol can be expected. Furthermore, recall that $f_{l,P}$ has a shape similar to the Bayes decision function $\operatorname{sign} f_P$ (see, e.g., [30]). Now, given an RKHS $H$ over $X$ we define the approximation error function with respect to $H$ and $P$ by

(6) $a(\lambda) := \inf_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f) - \mathcal{R}_{l,P}\bigr), \qquad \lambda \ge 0.$

Note that the obvious analogue of the approximation error function with offset is not greater than the above approximation error function without offset, and hence we restrict our attention to the latter for simplicity.

For $\lambda > 0$, the approximation error function describes how well $\lambda \|f_{P,\lambda}\|_H^2 + \mathcal{R}_{l,P}(f_{P,\lambda})$ approximates $\mathcal{R}_{l,P}$. For example, it was shown in [31] that we have $\lim_{\lambda \to 0} a(\lambda) = 0$ for all $P$ if $X$ is a compact metric space and $H$ is dense in the space of continuous functions $C(X)$. However, in nontrivial situations there cannot exist a convergence rate which holds uniformly for all distributions $P$. Since $H_\sigma(X)$ is dense in $C(X)$ for compact $X \subset \mathbb{R}^d$ and all $\sigma > 0$, these statements are in particular true for the approximation error functions $a_\sigma(\cdot)$ of the Gaussian RBF kernels with fixed width $1/\sigma$. Moreover, we are not aware of any weak condition on $\eta$ or $P$ that ensures $a_\sigma(\lambda) \preceq \lambda^\beta$ for $\lambda \to 0$ and some $\beta > 0$, and the results of [27] indicate that such behavior of $a_\sigma(\cdot)$ may actually require very restrictive conditions. In the following we will therefore present a condition on $P$ that allows us to estimate $a_\sigma(\lambda)$ by $\lambda$ and $\sigma$. In particular it will turn out that $a_\sigma(\lambda) \to 0$ with a polynomial rate in $\lambda$ if we relate $\sigma$ to $\lambda$ in a certain manner.

In order to introduce this assumption on $P$ we first define the classes of $P$ by $X_{-1} := \{x \in X : \eta(x) < \tfrac12\}$, $X_1 := \{x \in X : \eta(x) > \tfrac12\}$ and $X_0 := \{x \in X : \eta(x) = \tfrac12\}$ for some choice of $\eta$. Now we define a distance function $x \mapsto \tau_x$ by

(7) $\tau_x := \begin{cases} d(x, X_0 \cup X_1), & \text{if } x \in X_{-1}, \\ d(x, X_0 \cup X_{-1}), & \text{if } x \in X_1, \\ 0, & \text{otherwise}, \end{cases}$

where $d(x,A)$ denotes the distance of $x$ to a set $A$ with respect to the Euclidean norm. Roughly speaking, $\tau_x$ measures the distance of $x$ to the decision boundary. Now we can present the already announced geometric condition for distributions.

Definition 2.3. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a probability measure on $X \times Y$. We say that $P$ has geometric noise exponent $\alpha > 0$ if there exists a constant $C > 0$ such that

(8) $\int_X |2\eta(x) - 1| \exp\Bigl(-\frac{\tau_x^2}{t}\Bigr) P_X(dx) \le C t^{\alpha d/2}, \qquad t > 0.$

We say that $P$ has geometric noise exponent $\infty$ if it has geometric noise exponent $\alpha$ for all $\alpha > 0$.
Note that in the above definition we neither make any kind of smoothness assumption nor do we assume a condition on $P_X$ in terms of absolute continuity with respect to the Lebesgue measure. Instead, the integral condition (8) describes the concentration of the measure $|2\eta - 1|\,dP_X$ near the decision boundary in the sense that the less the measure is concentrated in this region the larger the geometric noise exponent can be chosen. The following example illustrates this.

Example 2.4. Since $\exp(-t) \le C_\alpha t^{-\alpha}$ holds for all $t > 0$ and a constant $C_\alpha > 0$ only depending on $\alpha > 0$, we easily see that (8) is satisfied whenever

(9) $(x \mapsto \tau_x^{-1}) \in L_{\alpha d}(|2\eta - 1|\,dP_X),$

where $L_{\alpha d}(|2\eta - 1|\,dP_X)$ denotes the usual Lebesgue space of functions that are $\alpha d$-integrable with respect to the measure $|2\eta - 1|\,dP_X$. Now, let us suppose $X_0 = \emptyset$ for a moment. In this case $\tau_x$ measures the distance to the class $x$ does not belong to. In particular, (9) holds for $\alpha = \infty$ if and only if the two classes $X_{-1}$ and $X_1$ have strictly positive distance. Moreover, if (9) holds for some $0 < \alpha < \infty$ then the two classes may touch, that is, the decision boundary $\partial X_{-1} \cap \partial X_1$ is nonempty. Consequently, we can easily construct distributions $P$ that have geometric noise exponent $\infty$ and touching classes, but also satisfy $f_P \notin H_\sigma(X)$ for all $\sigma > 0$. However, note that for such $P$ the measure $|2\eta - 1|\,dP_X$ must obviously have a very low concentration near the decision boundary.

We now describe a simple regularity condition on $\eta$ near the decision boundary that can be used to guarantee a geometric noise exponent.

Definition 2.5. Let $X \subset \mathbb{R}^d$, $P$ be a distribution on $X \times Y$ and $\gamma > 0$. We say that $P$ has an envelope of order $\gamma$ if there is a constant $c_\gamma > 0$ such that for $P_X$-almost all $x \in X$ we have

(10) $|2\eta(x) - 1| \le c_\gamma \tau_x^\gamma.$

Obviously, if $P$ has an envelope of order $\gamma$ then the graph of $x \mapsto 2\eta(x) - 1$ lies in a multiple of the envelope defined by $\tau_x^\gamma$ at the top and by $-\tau_x^\gamma$ at the bottom. Consequently, $\eta$ can be very irregular away from the decision boundary but cannot be discontinuous when crossing it. The rate of convergence of $\eta(x) \to 1/2$ for $\tau_x \to 0$ is described by $\gamma$.

Interestingly, for distributions having both an envelope of order $\gamma$ and a Tsybakov noise exponent $q$ we can bound the geometric noise exponent, as the following theorem, which is proved in Section 4, shows.

Theorem 2.6. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a distribution on $X \times Y$ that has an envelope of order $\gamma > 0$ and a Tsybakov noise exponent $q \in [0,\infty)$. Then $P$ has geometric noise exponent $\frac{q+1}{d}\gamma$ if $q \ge 1$, and geometric noise exponent $\alpha$ for all $\alpha < \frac{q+1}{d}\gamma$ otherwise.
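As a numerical illustration of Definition 2.3, consider a hypothetical one-dimensional distribution chosen here for simplicity: $P_X$ uniform on $[-1,1]$ with $|2\eta(x) - 1| = |x|^\gamma$ and decision boundary at $0$, so that $\tau_x = |x|$ and $d = 1$. The left-hand side of (8) then equals $\int_0^1 x^\gamma e^{-x^2/t}\,dx$, and the substitution $x = \sqrt{t}\,u$ bounds it by $\tfrac12 \Gamma\bigl(\tfrac{\gamma+1}{2}\bigr)\, t^{(1+\gamma)/2}$, i.e. (8) holds with $\alpha = 1 + \gamma$. The sketch below checks this with a midpoint rule; the step count and function names are arbitrary choices made for this example.

```python
import math


def noise_integral(gamma, t, steps=20000):
    """Midpoint-rule approximation of int_0^1 x^gamma * exp(-x^2 / t) dx.

    For the assumed example (uniform P_X on [-1, 1], |2 eta - 1| = |x|^gamma,
    tau_x = |x|) this is the left-hand side of (8).
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** gamma * math.exp(-x * x / t)
    return total * h


def geometric_noise_bound(gamma, t):
    """Right-hand side C * t^(alpha * d / 2) with d = 1, alpha = 1 + gamma,
    and C = Gamma((gamma + 1) / 2) / 2 from the substitution x = sqrt(t) u."""
    alpha = 1.0 + gamma
    constant = 0.5 * math.gamma((gamma + 1.0) / 2.0)
    return constant * t ** (alpha / 2.0)
```

Comparing `noise_integral(gamma, t)` with `geometric_noise_bound(gamma, t)` over a range of `t` shows the integral staying below the bound, up to quadrature error, which is consistent with the exponent $\frac{q+1}{d}\gamma = 1 + \gamma$ obtained from Theorem 2.6 with $q = 1/\gamma$.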
Now the main result of this subsection, which is proved in Section 4, shows that for distributions having a nontrivial geometric noise exponent we can bound the approximation error function for Gaussian RBF kernels.

Theorem 2.7. Let $\sigma > 0$, $X$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$ and $a_\sigma(\cdot)$ be the approximation error function with respect to $H_\sigma(X)$. Furthermore, let $P$ be a distribution on $X \times Y$ that has geometric noise exponent $0 < \alpha < \infty$ with constant $C$ in (8). Then there is a constant $c_d > 0$ depending only on the dimension $d$ such that for all $\lambda > 0$ we have

(11) $a_\sigma(\lambda) \le c_d \bigl(\sigma^d \lambda + C (2d)^{\alpha d/2} \sigma^{-\alpha d}\bigr).$

In order to let the right-hand side of (11) converge to zero it is necessary to assume both $\lambda \to 0$ and $\sigma \to \infty$. An easy consideration shows that the fastest convergence rate is achieved if $\sigma(\lambda) := \lambda^{-1/((\alpha+1)d)}$. In this case we have $a_{\sigma(\lambda)}(\lambda) \preceq \lambda^{\alpha/(\alpha+1)}$. In particular, we can obtain rates up to linear order in $\lambda$ for sufficiently benign distributions. The price for this good approximation property is, however, an increasing complexity of the hypothesis class $B_{H_{\sigma(\lambda)}}$, as we have seen in Theorem 2.1.

2.5. Learning rates for SVMs using Gaussian RBF kernels. With the help of the geometric noise assumption we can now present our learning rates for SVMs using Gaussian RBF kernels. Note again that these polynomial rates do not require a smoothness assumption on $P$. Furthermore note that we use the convention $\frac{a\infty + b}{c\infty + d} := \frac{a}{c}$ for $a, c \in (0,\infty)$, $b, d \in [0,\infty)$ in order to make the presentation compact.

Theorem 2.8. Let $X$ be the closed unit ball of $\mathbb{R}^d$, and $P$ be a distribution on $X \times Y$ with Tsybakov noise exponent $q \in [0,\infty]$ and geometric noise exponent $\alpha \in (0,\infty)$. We define

$\beta := \begin{cases} \dfrac{\alpha}{2\alpha + 1}, & \text{if } \alpha \le \dfrac{q+2}{2q}, \\[2ex] \dfrac{2\alpha(q+1)}{2\alpha(q+2) + 3q + 4}, & \text{otherwise}, \end{cases}$

and $\lambda_n := n^{-\frac{\alpha+1}{\alpha}\beta}$ and $\sigma_n := n^{\frac{\beta}{\alpha d}}$ in both cases. Then for all $\varepsilon > 0$ there exists a $C > 0$ such that for all $x \ge 1$ and $n \ge 1$ the SVM without offset using the Gaussian RBF kernel $k_{\sigma_n}$ satisfies

$\Pr{}^*\bigl(T \in (X \times Y)^n : \mathcal{R}_P(f_{T,\lambda_n}) \le \mathcal{R}_P + C x^2 n^{-\beta+\varepsilon}\bigr) \ge 1 - e^{-x},$

where $\Pr^*$ denotes the outer probability of $P^n$ in order to avoid measurability considerations. If $\alpha = \infty$ the latter inequality holds if $\sigma_n = \sigma$ is a constant with $\sigma > 2\sqrt{d}$. Finally, all results also hold for the SVM with offset.

Remark 2.9.
The above learning rates are faster than the parametric rate $n^{-1/2}$ if and only if $\alpha > (3q + 4)/(2q)$. For $q = \infty$ the latter condition becomes $\alpha > 3/2$, and in an intermediate case $q = 1$ it becomes $\alpha > 7/2$.
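The exponent $\beta$ and the parameter sequences of Theorem 2.8 are simple to evaluate. The following Python sketch does so for finite $\alpha$ and $q \in [0,\infty]$, applying the stated convention for $q = \infty$; the function names are hypothetical and the code is only a calculator for the formulas above, not part of the paper's analysis.

```python
import math


def rate_exponent(alpha, q):
    """beta from Theorem 2.8 for finite alpha > 0 and q in [0, inf].

    For q = inf the convention (a*inf + b)/(c*inf + d) := a/c gives the
    threshold 1/2 and the second case 2*alpha / (2*alpha + 3).
    For q = 0 the threshold (q + 2)/(2q) is infinite, so only the first
    case applies.
    """
    first = alpha / (2 * alpha + 1)
    if q == 0:
        return first
    if math.isinf(q):
        threshold = 0.5
        second = 2 * alpha / (2 * alpha + 3)
    else:
        threshold = (q + 2) / (2 * q)
        second = 2 * alpha * (q + 1) / (2 * alpha * (q + 2) + 3 * q + 4)
    return first if alpha <= threshold else second


def svm_parameters(alpha, q, d, n):
    """Return (beta, lambda_n, sigma_n) with
    lambda_n = n^(-(alpha+1)/alpha * beta), sigma_n = n^(beta/(alpha*d))."""
    beta = rate_exponent(alpha, q)
    lam = n ** (-(alpha + 1) / alpha * beta)
    sigma = n ** (beta / (alpha * d))
    return beta, lam, sigma
```

With these choices $\sigma_n = \lambda_n^{-1/((\alpha+1)d)}$, which is the balancing relation $\sigma(\lambda)$ stated after Theorem 2.7, and the boundary cases of Remark 2.9 ($\alpha = 7/2$ with $q = 1$, or $\alpha = 3/2$ with $q = \infty$) give exactly $\beta = 1/2$.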

Remark 2.10. It is important to note that our techniques can also be used to establish rates for other definitions of the sequences $(\lambda_n)$ and $(\sigma_n)$. In fact, Theorem 2.7 guarantees $a_{\sigma_n}(\lambda_n) \to 0$ (which is necessary for our techniques to produce any rate) if $\sigma_n \to \infty$ and $\sigma_n^d \lambda_n \to 0$. In particular, if $\lambda_n := n^{-\iota}$ and $\sigma_n := n^{\kappa}$ for some $\iota, \kappa > 0$ with $\kappa d < \iota$, these conditions are satisfied and a conceptually easy but technically involved modification of our proof can produce rates for certain ranges of $\iota$ (and thus $\kappa$). In order to keep the presentation as short as possible we have omitted the details and focused on the best possible rates.

Remark 2.11. Unfortunately, the choice of $\lambda_n$ and $\sigma_n$ that yields the optimal rates within our techniques requires knowledge of the values of $\alpha$ and $q$, which are typically not available. Adaptive methods which do not require such knowledge are still unknown.

Remark 2.12. Theorem 2.7 and Theorem 2.8 establish results for all distributions having some geometric noise exponent. However, for certain distributions of this type the resulting rates are not satisfactory. For example, consider the distribution $P$ on $X := [-1,1]$ whose marginal distribution $P_X$ equals the uniform distribution and whose conditional distribution $\eta(x) := P(y = 1 \mid x)$ satisfies $|2\eta(x) - 1| = |x|^\gamma$, $x \in X$, for some constant $\gamma \in (0,\infty)$. Then $P$ obviously has Tsybakov noise exponent $q := 1/\gamma$, and Theorem 2.6 or a simple modification of the proof of Theorem 2.7 shows that $P$ has geometric noise exponent $\alpha := 1 + \gamma$. Theorem 2.8 thus gives a rate of the form $n^{-\beta+\varepsilon}$ for $\beta = \frac{2q^2 + 4q + 2}{5q^2 + 10q + 4}$, which is never faster than $n^{-1/2}$. Though this is disappointing at first glance, it is not really surprising since the proof of Theorem 2.7 is not tailored to distributions having such simple decision functions. We believe that sharper bounds on the approximation error function (and thus faster learning rates) for this and other distributions are possible, but a detailed analysis is beyond the scope of this paper.
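The exponent in Remark 2.12 can be verified by substituting $q = 1/\gamma$ and $\alpha = 1 + \gamma$ into the second case of $\beta$ from Theorem 2.8. The sketch below performs this substitution in exact rational arithmetic for a few sample values of $\gamma$; the chosen values and the helper names are illustrative assumptions.

```python
from fractions import Fraction


def beta_second_case(alpha, q):
    """Second case of beta in Theorem 2.8, valid when alpha > (q + 2)/(2q)."""
    return 2 * alpha * (q + 1) / (2 * alpha * (q + 2) + 3 * q + 4)


def beta_remark(q):
    """Closed form stated in Remark 2.12."""
    return (2 * q * q + 4 * q + 2) / (5 * q * q + 10 * q + 4)


# For each sample gamma, record (second case applies?, beta from the
# theorem, beta from the remark) using exact fractions.
results = []
for gamma in (Fraction(1, 3), Fraction(1), Fraction(5, 2)):
    q = 1 / gamma          # Tsybakov noise exponent of the example
    alpha = 1 + gamma      # geometric noise exponent of the example
    applies = alpha > (q + 2) / (2 * q)
    results.append((applies, beta_second_case(alpha, q), beta_remark(q)))
```

In each case the two expressions agree and stay below $1/2$, in line with the remark: since $2(2q^2 + 4q + 2) = 4q^2 + 8q + 4 < 5q^2 + 10q + 4$ for all $q > 0$, the exponent never reaches the parametric rate.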
Remark 2.13. Another interesting but open question is whether the obtained rates are optimal for the class of considered distributions. In order to approach this question let us consider the case $\alpha = \infty$, which roughly speaking describes the case of almost no approximation error. In this case our rates are essentially of the form $n^{-(q+1)/(q+2)}$, which coincides with the rates Tsybakov (see [35]) achieved for certain ERM classifiers based on hypothesis classes of small complexity. The latter rates in turn cannot be improved in a minimax sense for certain classes of distributions, as was also shown in [35]. This discussion indicates that the techniques used for the stochastic part of our analysis may be strong enough to produce optimal results. However, if we consider the case $\alpha < \infty$ then the approximation error function described

in Theorem 2.7 and its influence on the estimation error (see our proofs, in particular Section 5 and Section 7) have a significant impact on the obtained rates. Since the sharpness of Theorem 2.7 is unclear to us we make no conjecture regarding the optimality of our rates in the general case.

3. Proof of Theorem 2.1. The main goal of this section is to prove Theorem 2.1, which is done in Section 3.2. To this end we provide in Section 3.1 some RKHS theory which is used throughout this work.

3.1. Some basic RKHS theory. For the proofs of this section we have to recall some basic facts from the theory of RKHSs. To this end let $X \subset \mathbb{R}^d$ be a compact subset and $k : X \times X \to \mathbb{R}$ be a continuous and positive semidefinite kernel with RKHS $H$. Then $H$ consists of continuous functions on $X$ and for $f \in H$ we have $\|f\|_\infty \le K \|f\|_H$, where

(12) $K := \sup_{x \in X} \sqrt{k(x,x)}.$

Consequently, if the embedding of the RKHS $H$ into the space of continuous functions $C(X)$ is denoted by

(13) $J_H : H \to C(X)$

we have $\|J_H\| \le K$. Furthermore, let us recall the representation of $H$ based on Mercer's theorem (see [13]). To this end let $K_X : L_2(X) \to L_2(X)$ be the integral operator defined by

(14) $K_X f(x) := \int_X k(x,x') f(x')\,dx', \qquad f \in L_2(X),\ x \in X,$

where $L_2(X)$ denotes the $L_2$-space on $X$ with respect to the Lebesgue measure. Then it was shown in [13] that the unique square root $K_X^{1/2}$ of $K_X$ is an isometric isomorphism between $L_2(X)$ and $H$.

3.2. Proof of Theorem 2.1. In order to prove Theorem 2.1 we need the following result, which bounds the covering numbers of $H_\sigma(X)$ with respect to $C(X)$.

Theorem 3.1. Let $\sigma \ge 1$, $0 < p < 2$ and $X \subset \mathbb{R}^d$ be a compact subset with nonempty interior. Then there is a constant $c_{p,d} > 0$ independent of $\sigma$ such that for all $\varepsilon > 0$ we have

$\log \mathcal{N}(B_{H_\sigma(X)},\varepsilon,C(X)) \le c_{p,d}\, \sigma^{(1-p/4)d}\, \varepsilon^{-p}.$

Proof. Let $B^d$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$ and $\mathring{B}^d$ be its interior. Then there exists an $r \ge 1$ such that $X \subset rB^d$. Now,

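it was recently shown in [32] that the restrictions $H_\sigma(rB^d) \to H_\sigma(X)$ and $H_\sigma(rB^d) \to H_\sigma(\mathring{B}^d)$ are both isometric isomorphisms. Consequently, in the following we assume without loss of generality that $X = B^d$ or $X = \mathring{B}^d$ and do not concern ourselves with the distinction of both cases.

Now let us write $H_\sigma := H_\sigma(X)$ and $J_\sigma := J_{H_\sigma} : H_\sigma \to C(X)$ in order to simplify notation. Furthermore, let $K_\sigma : L_2(X) \to L_2(X)$ be the integral operator of $k_\sigma$ defined as in (14), and let $\|\cdot\|$ denote the norm in $L_2(X)$. According to [13], Theorem 3, page 27, for any $f \in H_\sigma$, we obtain

$\inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac{1}{R} \|K_\sigma^{-1/2} f\|^2 = \frac{1}{R} \|f\|_{H_\sigma}^2,$

where we use the convention $\|K_\sigma^{-1} h\| = \infty$ if $h \notin K_\sigma L_2(X)$. Suppose now that $\mathcal{H} \subset L_2(X)$ is a dense Hilbert space with $\|h\| \le \|h\|_{\mathcal{H}}$, and that we have $K_\sigma : L_2(X) \to \mathcal{H} \subset L_2(X)$ with $\|K_\sigma : L_2(X) \to \mathcal{H}\| \le c_{\sigma,\mathcal{H}} < \infty$ for some constant $c_{\sigma,\mathcal{H}} > 0$. It follows that

$\inf_{\|h\|_{\mathcal{H}} \le c_{\sigma,\mathcal{H}} R} \|f - h\| \le \inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac{1}{R} \|f\|_{H_\sigma}^2$

and hence

$\inf_{\|h\|_{\mathcal{H}} \le R} \|f - h\| \le \frac{c_{\sigma,\mathcal{H}}}{R} \|f\|_{H_\sigma}^2.$

By [27], Theorem 3.1 it follows that $f$ is contained in the real interpolation space $(L_2(X), \mathcal{H})_{1/2,\infty}$ (see [7] for the definition of an interpolation space) and its norm in this space satisfies $\|f\|_{1/2,\infty} \le 2\sqrt{c_{\sigma,\mathcal{H}}}\, \|f\|_{H_\sigma}$. Therefore we obtain a continuous embedding $\Upsilon_1 : H_\sigma \to (L_2(X), \mathcal{H})_{1/2,\infty}$ with $\|\Upsilon_1\| \le 2\sqrt{c_{\sigma,\mathcal{H}}}$. If in addition a subset inclusion $(L_2(X), \mathcal{H})_{1/2,\infty} \subset C(X)$ exists which defines a continuous embedding $\Upsilon_2 : (L_2(X), \mathcal{H})_{1/2,\infty} \to C(X)$, we have a factorization $J_\sigma = \Upsilon_2 \circ \Upsilon_1$ and can conclude

(15) $\log \mathcal{N}(B_{H_\sigma(X)},\varepsilon,C(X)) = \log \mathcal{N}(J_\sigma,\varepsilon) \le \log \mathcal{N}\Bigl(\Upsilon_2, \frac{\varepsilon}{2\sqrt{c_{\sigma,\mathcal{H}}}}\Bigr).$

Consequently, to bound $\log \mathcal{N}(J_\sigma,\varepsilon)$ we need to select an $\mathcal{H}$, compute $c_{\sigma,\mathcal{H}}$ and bound $\log \mathcal{N}(\Upsilon_2,\varepsilon)$. To that end let $\mathcal{H} := W^m(\mathring{X})$ be the Sobolev space with norm

$\|f\|_m^2 = \sum_{|\alpha| \le m} \|D^\alpha f\|^2,$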

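where $|\alpha| := \sum_{i=1}^d \alpha_i$, $D^\alpha := \prod_{i=1}^d \partial_i^{\alpha_i}$, and $\partial_i^{\alpha_i}$ denotes the $\alpha_i$th partial derivative in the $i$th coordinate of $\mathbb{R}^d$. By the Cauchy-Schwarz inequality we obtain

(16) $\|D^\alpha K_\sigma f\|^2 \le \|f\|^2 \int_X \int_X |D_x^\alpha k_\sigma(x,x')|^2\,dx'\,dx,$

where the notation $D_x^\alpha$ indicates that the differentiation takes place in the $x$ variable. To address the term $D_x^\alpha k_\sigma(x,x')$ we note that $D_x^\alpha(e^{-|x|^2}) = (-1)^{|\alpha|} e^{-|x|^2/2} h_\alpha(x)$, where the multivariate Hermite functions $h_\alpha(x) = \prod_{i=1}^d h_{\alpha_i}(x_i)$ are products of the univariate functions. Since $\int_{\mathbb{R}} h_k^2(x)\,dx = 2^k k! \sqrt{\pi}$ (see, e.g., [11]) we obtain

(17) $\int_{\mathbb{R}^d} |D_x^\alpha(e^{-|x|^2})|^2\,dx = \int_{\mathbb{R}^d} e^{-|x|^2} h_\alpha^2(x)\,dx \le \int_{\mathbb{R}^d} h_\alpha^2(x)\,dx = 2^{|\alpha|} \alpha!\, \pi^{d/2},$

where we have used the definition $\alpha! := \prod_{i=1}^d \alpha_i!$. Applying the translation invariance of $k_\sigma$, we obtain

$\int_{\mathbb{R}^d} |D_x^\alpha k_\sigma(x,x')|^2\,dx' = \int_{\mathbb{R}^d} |D_{x'}^\alpha k_\sigma(0,x')|^2\,dx' = \int_{\mathbb{R}^d} |D_{x'}^\alpha(e^{-\sigma^2 |x'|^2})|^2\,dx',$

and by a change of variables we can apply inequality (17) to the integral on the right-hand side,

$\int_{\mathbb{R}^d} |D_{x'}^\alpha(e^{-\sigma^2 |x'|^2})|^2\,dx' = \sigma^{2|\alpha| - d} \int_{\mathbb{R}^d} |D_{x'}^\alpha(e^{-|x'|^2})|^2\,dx' \le \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2}.$

Hence we obtain

$\int_X \int_X |D_x^\alpha k_\sigma(x,x')|^2\,dx'\,dx \le \theta(d)\, \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2},$

where $\theta(d)$ is the volume of $X$. Since $\sum_{|\alpha| \le m} \alpha! \le d^m m!^d$ and $\|K_\sigma f\|_m^2 = \sum_{|\alpha| \le m} \|D^\alpha K_\sigma f\|^2$ we can therefore infer from (16) that for $\sigma \ge 1$ we have

(18) $\|K_\sigma\| \le \sqrt{\theta(d)}\,(2d)^{m/2}\, m!^{d/2}\, \sigma^{m - d/2} =: c_{\sigma,\mathcal{H}}.$

Now let us consider $\Upsilon_2 : (L_2(X), W^m(\mathring{X}))_{1/2,\infty} \to C(X)$. According to Triebel [34], page 267, we have

$(L_2(X), W^m(\mathring{X}))_{1/2,\infty} = (L_2(\mathring{X}), W^m(\mathring{X}))_{1/2,\infty} = B_{2,\infty}^{m/2}(\mathring{X})$

isomorphically. Furthermore

(19) $\log \mathcal{N}(B_{2,\infty}^{m/2}(\mathring{X}) \to C(X),\varepsilon) \le c_{m,d}\, \varepsilon^{-2d/m}$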

for $m > d$ follows from a similar result of Birman and Solomjak ([8], cf. also [34]) for Slobodeckij (i.e., fractional Sobolev) spaces, where the constant $c_{m,d}$ depends only on $m$ and $d$. Consequently we obtain from (15), (18) and (19) that

$\log \mathcal{N}(J_\sigma,\varepsilon) \le c_{m,d} \Bigl(\frac{\varepsilon}{2\sqrt{c_{\sigma,\mathcal{H}}}}\Bigr)^{-2d/m} = c_{m,d}\,(4c_{\sigma,\mathcal{H}})^{d/m}\, \varepsilon^{-2d/m} = \tilde c_{m,d}\, \sigma^{d - d^2/(2m)}\, \varepsilon^{-2d/m}$

for all $m > d$ and new constants $\tilde c_{m,d}$ depending only on $m$ and $d$. Setting $m := 2d/p$ completes the proof of Theorem 3.1.

Proof of Theorem 2.1. As before we write $H_\sigma := H_\sigma(X)$ and $J_\sigma := J_{H_\sigma} : H_\sigma \to C(X)$ in order to simplify notation. Furthermore recall for a training set $T \in (X \times Y)^n$ the space $L_2(T_X)$ introduced in Section 2.2. Now let $R_{T_X} : C(X) \to L_2(T_X)$ be the restriction map defined by $f \mapsto f_{|T_X}$. Obviously, we have $\|R_{T_X}\| \le 1$. Furthermore we define $I_\sigma := R_{T_X} \circ J_\sigma$ so that $I_\sigma : H_\sigma \to L_2(T_X)$ is the evaluation map. Then Theorem 3.1 and the product rule for covering numbers imply that

(20) $\sup_{T \in Z^n} \log \mathcal{N}(I_\sigma,\varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q}$

for all $0 < q < 2$. To complete the proof of Theorem 2.1 we derive another bound on the covering numbers and interpolate the two. To that end observe that $I_\sigma : H_\sigma \to L_2(T_X)$ factors through $C(X)$ with both factors $J_\sigma$ and $R_{T_X}$ having norm not greater than 1. Hence Proposition in [23] implies that $I_\sigma$ is absolutely 2-summing with 2-summing norm not greater than 1. By König's theorem ([24], Lemma 2.7.2) we obtain for the approximation numbers $(a_k(I_\sigma))$ of $I_\sigma$ that $\sum_{k \ge 1} a_k^2(I_\sigma) \le 1$ for all $\sigma > 0$. Since the approximation numbers are decreasing it follows that $\sup_k \sqrt{k}\, a_k(I_\sigma) \le 1$. Using Carl's inequality between approximation and entropy numbers (see Theorem in [10]) we thus find a constant $c > 0$ such that

(21) $\sup_{T \in Z^n} \log \mathcal{N}(I_\sigma,\varepsilon) \le c\,\varepsilon^{-2}$

for all $\varepsilon > 0$ and all $\sigma > 0$. Let us now interpolate the bound (21) with the bound (20). Since $\|I_\sigma : H_\sigma \to L_2(T_X)\| \le 1$ we only need to consider $0 < \varepsilon \le 1$. Let $0 < q < p < 2$ and $0 < a \le 1$. Then for $0 < \varepsilon < a$ we have

$\log \mathcal{N}(I_\sigma,\varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q} \le c_{q,d}\, \sigma^{(1-q/4)d}\, a^{p-q}\, \varepsilon^{-p},$

and for $a \le \varepsilon \le 1$ we find

$\log \mathcal{N}(I_\sigma,\varepsilon) \le c\,\varepsilon^{-2} \le c\,a^{p-2}\, \varepsilon^{-p}.$

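Since $\sigma \ge 1$ we can set $a := \sigma^{-\frac{4-q}{8-4q}d}$ and obtain

$\log \mathcal{N}(I_\sigma,\varepsilon) \le \tilde c_{q,d}\, \sigma^{(1-p/2)\frac{8-2q}{8-4q}d}\, \varepsilon^{-p},$

where $\tilde c_{q,d}$ is a constant depending only on $q, d$. The proof is completed by choosing $q := \frac{4\delta}{1+2\delta}$ when $\delta < \frac{2p}{8-4p}$ and $q$ just smaller than $p$ otherwise.

4. Proofs of Theorems 2.7 and 2.6. In this section we prove Theorems 2.7 and 2.6, which both deal with the geometric noise exponent.

4.1. Proof of Theorem 2.7. Let us begin by recalling some facts about Gaussian RBF kernels. To this end let $H_\sigma(\mathbb{R}^d)$ be the RKHS of the Gaussian RBF kernel with parameter $\sigma$. Then it was shown in [32] that the linear operator $V_\sigma : L_2(\mathbb{R}^d) \to H_\sigma(\mathbb{R}^d)$ defined by

$V_\sigma g(x) = \frac{(2\sigma)^{d/2}}{\pi^{d/4}} \int_{\mathbb{R}^d} e^{-2\sigma^2 \|x - y\|_2^2}\, g(y)\,dy, \qquad g \in L_2(\mathbb{R}^d),\ x \in \mathbb{R}^d,$

is an isometric isomorphism. Consequently, we obtain

(22) $a_\sigma(\lambda) = \inf_{g \in L_2(\mathbb{R}^d)} \lambda \|g\|_{L_2(\mathbb{R}^d)}^2 + \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P}, \qquad \lambda > 0.$

In the following we will estimate the right-hand side of (22) by a judicious choice of $g$. To this end we need the following lemma, which in some sense enlarges the support of $P$ to ensure that all balls of the form $B(x,\tau_x)$ are contained in the (enlarged) support. This guarantee will then make it possible to control the behavior of $V_\sigma g$ by tails of spherical Gaussian distributions [see (28) for details].

Lemma 4.1. Let $X$ be the closed unit ball of $\mathbb{R}^d$ and $P$ be a probability measure on $X \times Y$ with regular conditional probability $\eta(x) = P(y = 1 \mid x)$, $x \in X$. On $\acute{X} := 3X$ we define

(23) $\acute\eta(x) = \begin{cases} \eta(x), & \text{if } |x| \le 1, \\ \eta\Bigl(\dfrac{x}{|x|}\Bigr), & \text{otherwise}. \end{cases}$

We also write $\acute X_{-1} := \{x \in \acute X : \acute\eta(x) < \tfrac12\}$ and $\acute X_1 := \{x \in \acute X : \acute\eta(x) > \tfrac12\}$. Finally let $B(x,r)$ denote the open ball of radius $r$ about $x$ in $\mathbb{R}^d$. Then for $x \in X_1$ we have $B(x,\tau_x) \subset \acute X_1$ and for $x \in X_{-1}$ we have $B(x,\tau_x) \subset \acute X_{-1}$.

Proof. Let $x \in X_1$ and $x' \in B(x,\tau_x)$. If $x' \in X$ we have $|x - x'| < \tau_x$, which implies $\eta(x') > \tfrac12$ by the definition of $\tau_x$. This shows $x' \in \acute X_1$. Now let us assume $|x'| > 1$. By $\langle x', x\rangle \le |x'|$ and Pythagoras' theorem, applied to the orthogonal projection $\frac{\langle x', x\rangle}{|x'|^2} x'$ of $x$ onto the line spanned by $x'$, we then obtain

$\Bigl|\frac{x'}{|x'|} - x\Bigr|^2 \le \Bigl|x' - \frac{\langle x', x\rangle}{|x'|^2}\, x'\Bigr|^2 + \Bigl|\frac{\langle x', x\rangle}{|x'|^2}\, x' - x\Bigr|^2 = |x' - x|^2.$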

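Therefore, we have $\bigl\| \frac{x'}{\|x'\|} - x \bigr\| < \tau_x$, which implies $\acute\eta(x') = \eta\bigl(\frac{x'}{\|x'\|}\bigr) > \tfrac12$.

Let us finally recall that Zhang showed in [39] that the hinge risk satisfies

\[ \mathcal{R}_{l,P}(f) - \mathcal{R}_{l,P} = E_{P_X}\bigl(|2\eta - 1|\,|f - f_P|\bigr) \tag{24} \]

for all measurable $f : X \to [-1,1]$. Now we are ready to prove Theorem 2.7.

Proof of Theorem 2.7. With the notation of Lemma 4.1 we fix a measurable $\acute f_P : \acute X \to [-1,1]$ that satisfies $\acute f_P = 1$ on $\acute X_1$, $\acute f_P = -1$ on $\acute X_{-1}$ and $\acute f_P = 0$ otherwise. For $g := (\sigma^2/\pi)^{d/4} \acute f_P$ we then immediately obtain

\[ \|g\|_{L_2(\mathbb{R}^d)} \le \Bigl(\frac{81\sigma^2}{\pi}\Bigr)^{d/4} \theta(d), \tag{25} \]

where $\theta(d)$ denotes the volume of $X$. Moreover, it is easy to see that $-1 \le \acute f_P \le 1$ implies $-1 \le V_\sigma g \le 1$. Since $P_X$ has support in $X$, (24) then yields

\[ \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P} = E_{P_X}\bigl(|2\eta - 1|\,|V_\sigma g - f_P|\bigr). \tag{26} \]

In order to bound $|V_\sigma g(x) - f_P(x)|$ for $x \in X_1$ we observe

\[ \begin{aligned} V_\sigma g(x) &= \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2\|x-y\|_2^2}\,\acute f_P(y)\,dy \\ &= \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2\|x-y\|_2^2}\,(\acute f_P(y) + 1)\,dy - 1 \\ &\ge \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2\|x-y\|_2^2}\,(\acute f_P(y) + 1)\,dy - 1. \end{aligned} \tag{27} \]

Now remember that Lemma 4.1 showed $B(x,\tau_x) \subset \acute X_1$ for all $x \in X_1$, so that (27) implies

\[ V_\sigma g(x) \ge 2\Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2\|x-y\|_2^2}\,dy - 1 = 1 - 2P_{\gamma_\sigma}(\|u\| \ge \tau_x), \tag{28} \]

where $\gamma_\sigma = (2\sigma^2/\pi)^{d/2} e^{-2\sigma^2\|u\|^2}\,du$ is a spherical Gaussian in $\mathbb{R}^d$. According to the tail bound [17], inequality (3.5) on page 59, we have $P_{\gamma_\sigma}(\|u\| \ge r) \le 4e^{-\sigma^2 r^2/(2d)}$ and consequently we obtain

\[ 1 \ge V_\sigma g(x) \ge 1 - 8e^{-\sigma^2\tau_x^2/(2d)}, \qquad x \in X_1. \]

Since for $x \in X_{-1}$ we can obtain an analogous estimate, we conclude $|V_\sigma g(x) - f_P(x)| \le 8e^{-\sigma^2\tau_x^2/(2d)}$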

for all $x \in X_1 \cup X_{-1}$. Consequently (26) and the geometric noise assumption for $t := \frac{2d}{\sigma^2}$ yield

\[ \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P} \le 8E_{x \sim P_X}\bigl(|2\eta(x) - 1|\,e^{-\sigma^2\tau_x^2/(2d)}\bigr) \le 8C(2d)^{\alpha d/2}\sigma^{-\alpha d}, \tag{29} \]

where $C$ is the constant in (8). Combining (29), (25) and (22) now yields the assertion.

4.2. Proof of Theorem 2.6. In this subsection, all Lebesgue and Lorentz spaces (see, e.g., [5]) and their norms are with respect to the measure $P_X$.

Proof of Theorem 2.6. Let us first consider the case $q \ge 1$ where we can apply the Hölder inequality for Lorentz spaces [22], which states

\[ \|fg\|_1 \le \|f\|_{q,\infty}\,\|g\|_{q',1} \]

for all $f \in L_{q,\infty}$, $g \in L_{q',1}$ and $q'$ defined by $\frac1q + \frac1{q'} = 1$. Applying this inequality gives

\[ \begin{aligned} E_{x \sim P_X}\bigl(|2\eta(x) - 1|\,e^{-\tau_x^2/t}\bigr) &\le \|(2\eta - 1)^{-1}\|_{q,\infty}\,\bigl\|x \mapsto (2\eta(x) - 1)^2 e^{-\tau_x^2/t}\bigr\|_{q',1} \\ &\le C\,\bigl\|(2\eta - 1)^2 e^{-(|2\eta-1|/c_\gamma)^{2/\gamma}\,t^{-1}}\bigr\|_{q',1}, \end{aligned} \tag{30} \]

where in the last estimate we used the Tsybakov assumption (5) and the fact that $P$ has an envelope of order $\gamma$. Let us write $h(x) := |2\eta(x) - 1|^{-1}$, $x \in X$, and $b := t(c_\gamma)^{2/\gamma}$ so that

\[ |2\eta(x) - 1|^2\,e^{-(|2\eta(x)-1|/c_\gamma)^{2/\gamma}\,t^{-1}} = g(h(x)), \]

where $g(s) := s^{-2} e^{-s^{-2/\gamma}/b}$ for all $s \ge 1$. Now it is easy to see that $g : [1,\infty) \to [0,\infty)$ is strictly increasing if $0 < b \le \frac{2}{3\gamma}$, and hence we can extend $g$ to a strictly increasing, continuous and invertible function on $[0,\infty)$ in this case. Let such an extension also be denoted by $g$. Then for this extension we have

\[ P_X(g \circ h > \tau) = P_X\bigl(h > g^{-1}(\tau)\bigr). \tag{31} \]

Now for a function $f : X \to [0,\infty)$ recall the nonincreasing rearrangement

\[ f^*(u) := \inf\{\sigma \ge 0 : P_X(f > \sigma) \le u\}, \qquad u > 0, \]

of $f$ which can be used to define Lorentz norms (see, e.g., [5]). For $u > 0$ equation (31) then yields

\[ (g \circ h)^*(u) = g\bigl(\inf\{g^{-1}(\sigma) : P_X(h > g^{-1}(\sigma)) \le u\}\bigr) = g \circ h^*(u). \]
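The identity $(g \circ h)^* = g \circ h^*$ only uses that $g$ is increasing with $g(0) = 0$. The sketch below illustrates it for an empirical measure on a finite sample. It is an illustration under stated assumptions: the sample, the helper names, and the choice $g(s) = s^2$ are hypothetical stand-ins and are not the function $g$ of the proof.

```python
import math
import random

def rearrangement(values, u):
    """Nonincreasing rearrangement f*(u) = inf{s >= 0 : P(f > s) <= u}
    for the uniform (empirical) measure on the nonnegative sample `values`."""
    n = len(values)
    k = math.floor(u * n)      # at most k sample points may exceed the level s
    if k >= n:
        return 0.0             # P(f > 0) <= 1 <= u, so the infimum is 0
    return sorted(values, reverse=True)[k]

def g_increasing(s):
    """Hypothetical strictly increasing function on [0, inf) with g(0) = 0."""
    return s * s

rng = random.Random(0)
# h = 1/|2*eta - 1| for a sample of eta values kept away from 1/2
etas = [rng.choice([rng.uniform(0.0, 0.45), rng.uniform(0.55, 1.0)]) for _ in range(50)]
h_values = [1.0 / abs(2.0 * eta - 1.0) for eta in etas]
gh_values = [g_increasing(v) for v in h_values]

# levels u chosen off the grid {k/n} to avoid ties in floor(u * n)
levels = [(j + 0.5) / len(h_values) for j in range(len(h_values))] + [1.5]
pairs = [(rearrangement(gh_values, u), g_increasing(rearrangement(h_values, u)))
         for u in levels]
```

Each pair in `pairs` contains the two sides of the identity at one level $u$.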

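Now, inequality (5) implies $P_X\bigl(h \ge (C/u)^{1/q}\bigr) \le u$ for all $u > 0$. Therefore, we find

\[ h^*(u) \le \inf\{\sigma \ge 0 : P_X(h \ge \sigma) \le u\} \le \Bigl(\frac{C}{u}\Bigr)^{1/q} \]

for all $0 < u < 1$. Since $(g \circ h)^* = g \circ h^*$ and $g$ is increasing we hence have

\[ (g \circ h)^*(u) \le g\Bigl(\Bigl(\frac{C}{u}\Bigr)^{1/q}\Bigr) \]

for all $0 < u < 1$. Now, for fixed $\hat\alpha > 0$ the bound $e^{-x} \lesssim \frac{x^{-\hat\alpha}}{\ln^2(x) + 1}$ on $(0,\infty)$ implies

\[ g(s) \lesssim \frac{b^{\hat\alpha}\,s^{2(\hat\alpha/\gamma - 1)}}{\ln^2(s^{-2/\gamma} b^{-1}) + 1} \]

for $s \in [1,\infty)$. Using the fact that $(g \circ h)^*(u) = 0$ holds for all $u \ge 1$, we hence obtain

\[ (g \circ h)^*(u) \lesssim \frac{b^{\hat\alpha}\,u^{(2/q)(1 - \hat\alpha/\gamma)}}{\ln^2\bigl((u/C)^{2/(q\gamma)} b^{-1}\bigr) + 1} \]

for $u > 0$ if we assume without loss of generality that $C \ge 1$. Let us define $\hat\alpha := \gamma\frac{q+1}{2}$. Then we find $\frac{1}{q'} + \frac{2}{q}\bigl(1 - \frac{\hat\alpha}{\gamma}\bigr) = 0$ and consequently for $b \le \frac{2}{3\gamma}$, that is, $t \le \frac{2}{3\gamma(c_\gamma)^{2/\gamma}}$, we obtain

\[ \|g \circ h\|_{q',1} = \int_0^\infty u^{1/q' - 1}(g \circ h)^*(u)\,du \lesssim b^{\hat\alpha} \int_0^\infty \frac{u^{-1}}{\ln^2\bigl((u/C)^{2/(q\gamma)} b^{-1}\bigr) + 1}\,du \lesssim t^{\gamma(q+1)/2} \tag{32} \]

by the definition of $b$. Since we also have $E_{P_X}\bigl(|2\eta(x) - 1|\,e^{-\tau_x^2/t}\bigr) \le 1$ for all $t > 0$, estimate (30) together with the definition of $g$ and (32) yields the assertion in the case $q \ge 1$.

Let us now consider the case $0 \le q < 1$ where the Hölder inequality in Lorentz space cannot be used. Then for all $t, \tau > 0$ we have

\[ \begin{aligned} E_{x \sim P_X}\bigl(|2\eta(x) - 1|\,e^{-\tau_x^2/t}\bigr) &= \int_{|2\eta - 1| \le \tau} |2\eta(x) - 1|\,e^{-\tau_x^2/t}\,P_X(dx) + \int_{|2\eta - 1| > \tau} |2\eta(x) - 1|\,e^{-\tau_x^2/t}\,P_X(dx) \\ &\le C\tau^{q+1} + \exp\Bigl(-\Bigl(\frac{\tau}{c_\gamma}\Bigr)^{2/\gamma} t^{-1}\Bigr), \end{aligned} \tag{33} \]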

where we have used the Tsybakov assumption (5) and the fact that $P$ has an envelope of order $\gamma$. Let us define $\tau$ by $\tau^{q+1} := \exp\bigl(-(\tau/c_\gamma)^{2/\gamma}\,t^{-1}\bigr)$. For $\hat a := (c_\gamma)^{2/\gamma}(q+1)$ and small $t$ this definition implies

\[ \tau \le \Bigl(\frac{\hat a\gamma}{2}\Bigr)^{\gamma/2} \Bigl(t \ln\frac{1}{\hat a t}\Bigr)^{\gamma/2}, \]

and hence the assertion follows from (33) for the case $0 < q < 1$.

5. The estimation error of ERM-type classifiers. To bound the estimation error in the proof of Theorem 2.8 we now establish a concentration inequality for ERM-type algorithms using a variant of Talagrand's concentration inequality together with local Rademacher averages (see, e.g., [2, 4, 21]). Our approach is inspired by [3]. However, due to the regularization term $\lambda\|f\|_H^2$ in the definition of SVMs we need a more general result than that of [3]. This section is organized as follows: In Section 5.1 we present the required modification of the result of [3]. Then in Section 5.2 we bound the resulting local Rademacher averages.

5.1. Bounding the estimation error for ERM-type algorithms. We first have to introduce some notation. To this end let $\mathcal{F}$ be a class of bounded measurable functions from $Z$ to $\mathbb{R}$ such that $\mathcal{F}$ is separable with respect to $\|\cdot\|_\infty$. Given a probability measure $P$ on $Z$ we define the modulus of continuity of $\mathcal{F}$ by

\[ \omega_n(\mathcal{F},\varepsilon) := \omega_{P,n}(\mathcal{F},\varepsilon) := E_{T \sim P^n}\Bigl(\sup_{f \in \mathcal{F},\ E_P f^2 \le \varepsilon} |E_P f - E_T f|\Bigr), \qquad \varepsilon > 0, \]

where we note that the supremum is, as a function from $Z^n$ to $\mathbb{R}$, measurable by the separability assumption on $\mathcal{F}$. Now, a function $L : \mathcal{F} \times Z \to [0,\infty)$ is called a loss function if $L \circ f := L(f,\cdot)$ is measurable for all $f \in \mathcal{F}$. Given a probability measure $P$ on $Z$ we indicate by $f_{P,\mathcal{F}} \in \mathcal{F}$ a minimizer of

\[ f \mapsto \mathcal{R}_{L,P}(f) := E_{z \sim P} L(f,z). \]

Throughout this paper $\mathcal{R}_{L,P}(f)$ is called the $L$-risk of $f$. If $P$ is an empirical measure with respect to $T \in Z^n$ we write $f_{T,\mathcal{F}}$ and $\mathcal{R}_{L,T}(\cdot)$ as usual. For simplicity, we assume throughout this section that $f_{P,\mathcal{F}}$ and $f_{T,\mathcal{F}}$ do exist. Furthermore, although there may be multiple solutions we use a single symbol for them whenever no confusion regarding the nonuniqueness of this symbol can be expected. An algorithm that produces solutions $f_{T,\mathcal{F}}$ is called an empirical $L$-risk minimizer. Moreover, if $\mathcal{F}$ is convex, we say that $L$ is convex if $L(\cdot,z)$ is convex for all $z \in Z$. Finally, $L$ is called line-continuous

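if for all $z \in Z$ and all $f, \hat f \in \mathcal{F}$ the function $t \mapsto L(tf + (1-t)\hat f, z)$ is continuous on $[0,1]$. If $\mathcal{F}$ is a vector space then every convex $L$ is line-continuous. Now the main result of this section reads as follows:

Theorem 5.1. Let $\mathcal{F}$ be a convex set of bounded measurable functions from $Z$ to $\mathbb{R}$, and let $L : \mathcal{F} \times Z \to [0,\infty)$ be a convex and line-continuous loss function. For a probability measure $P$ on $Z$ we define

\[ \mathcal{G} := \{L \circ f - L \circ f_{P,\mathcal{F}} : f \in \mathcal{F}\}. \]

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in \mathcal{G}$. Furthermore, assume that $\mathcal{G}$ is separable with respect to $\|\cdot\|_\infty$. Let $n \ge 1$, $x \ge 1$ and $\varepsilon > 0$ with

\[ \varepsilon \ge 10\max\Bigl\{\omega_n(\mathcal{G}, c\varepsilon^\alpha + \delta),\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n}\Bigr\}. \tag{34} \]

Then we have

\[ \mathrm{Pr}^*\bigl(T \in Z^n : \mathcal{R}_{L,P}(f_{T,\mathcal{F}}) < \mathcal{R}_{L,P}(f_{P,\mathcal{F}}) + \varepsilon\bigr) \ge 1 - e^{-x}. \]

Remark 5.2. Theorem 5.1 has been proved in [3] for $\delta = 0$, where it was used to find learning rates faster than $n^{-1/2}$ for certain ERM-type algorithms. At first glance such fast rates are impossible if $\delta > 0$. However, we will see later that for SVMs we have $\delta = a_\sigma^\kappa(\lambda)$ for a suitable $\kappa > 0$ depending on both Tsybakov's and the geometric noise exponent, and hence we have $\delta \to 0$ for $n \to \infty$.

As already mentioned, the proof of Theorem 5.1 is based on Talagrand's concentration inequality in [33] and its refinements in [16, 20, 25]. The version below of this inequality is derived from Bousquet's result in [9] using a little trick presented in [2], Lemma 2.5.

Theorem 5.3. Let $P$ be a probability measure on $Z$ and $\mathcal{H}$ be a set of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$ and satisfies $E_P h = 0$ for all $h \in \mathcal{H}$. Furthermore, let $b > 0$ and $\tau \ge 0$ be constants with $\|h\|_\infty \le b$ and $E_P h^2 \le \tau$ for all $h \in \mathcal{H}$. Then for all $x \ge 1$ and all $n \ge 1$ we have

\[ P^n\Bigl(T \in Z^n : \sup_{h \in \mathcal{H}} E_T h > 3E_{T' \sim P^n} \sup_{h \in \mathcal{H}} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}\Bigr) \le e^{-x}. \]

This concentration inequality is used to prove the following lemma which is a generalized version of Lemma 13 in [3].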

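Lemma 5.4. Let $P$ be a probability measure on $Z$ and $\mathcal{G}$ be a set of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$. Let $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ be constants with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in \mathcal{G}$. Furthermore, assume that for all $T \in Z^n$ and all $\varepsilon > 0$ for which for some $g \in \mathcal{G}$ we have

\[ E_T g \le \varepsilon/20 \quad\text{and}\quad E_P g \ge \varepsilon \]

there exists a $g^* \in \mathcal{G}$ which satisfies

\[ E_T g^* \le \varepsilon/20 \quad\text{and}\quad E_P g^* = \varepsilon. \]

Then for all $n \ge 1$, $x \ge 1$, and all $\varepsilon > 0$ satisfying (34), we have

\[ \mathrm{Pr}^*\bigl(T \in Z^n : \text{for all } g \in \mathcal{G} \text{ with } E_T g \le \varepsilon/20 \text{ we have } E_P g < \varepsilon\bigr) \ge 1 - e^{-x}. \]

Proof. We define $\mathcal{H} := \{E_P g - g : g \in \mathcal{G},\ E_P g = \varepsilon\}$. Obviously, we have $E_P h = 0$, $\|h\|_\infty \le 2B$, and $E_P h^2 = E_P g^2 - (E_P g)^2 \le c\varepsilon^\alpha + \delta$ for all $h \in \mathcal{H}$. Moreover, since it is also easy to verify that $\mathcal{H}$ is separable with respect to $\|\cdot\|_\infty$, our assumption on $\mathcal{G}$ yields

\[ \begin{aligned} &\mathrm{Pr}^*\bigl(T \in Z^n : \exists g \in \mathcal{G} \text{ with } E_T g \le \varepsilon/20 \text{ and } E_P g \ge \varepsilon\bigr) \\ &\qquad\le \mathrm{Pr}^*\bigl(T \in Z^n : \exists g \in \mathcal{G} \text{ with } E_P g - E_T g \ge 19\varepsilon/20 \text{ and } E_P g = \varepsilon\bigr) \\ &\qquad\le P^n\Bigl(T \in Z^n : \sup_{h \in \mathcal{H}} E_T h \ge 19\varepsilon/20\Bigr). \end{aligned} \]

Note that since $\mathcal{H}$ is separable with respect to $\|\cdot\|_\infty$, the set in the last line is actually measurable. In order to bound the last probability we will apply Theorem 5.3. To this end we have to show

\[ \frac{19\varepsilon}{20} > 3E_{T' \sim P^n} \sup_{h \in \mathcal{H}} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}. \]

Our assumptions on $\varepsilon$ imply

\[ \varepsilon \ge 10E_{T' \sim P^n}\Bigl(\sup_{g \in \mathcal{G},\ E_P g^2 \le c\varepsilon^\alpha + \delta} |E_P g - E_{T'} g|\Bigr) \ge 10E_{T' \sim P^n} \sup_{h \in \mathcal{H}} E_{T'} h. \tag{35} \]

Furthermore, since $10 \ge (\frac{60}{19})^2$ and $0 < \alpha \le 1$ we have

\[ \varepsilon \ge 10\Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)} \ge \Bigl(\frac{60}{19}\Bigr)^{2/(2-\alpha)} \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)}. \tag{36} \]

If $\delta \le c\varepsilon^\alpha$ a simple calculation hence shows $\varepsilon \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}$. Furthermore, if $\delta > c\varepsilon^\alpha$ the assumptions of the theorem show

\[ \varepsilon \ge 10\sqrt{\frac{\delta x}{n}} \ge \frac{60}{19}\sqrt{\frac{4\delta x}{n}} \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}. \]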

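Hence we have $\varepsilon \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}$ for all $\varepsilon$ satisfying the assumptions of the theorem. Now let $\tau := c\varepsilon^\alpha + \delta$ and $b := 2B$. By (35) and $\varepsilon \ge 10\frac{Bx}{n}$ we then find

\[ \frac{19\varepsilon}{20} > 3E_{T' \sim P^n} \sup_{h \in \mathcal{H}} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}. \]

Applying Theorem 5.3 then yields

\[ \begin{aligned} &\mathrm{Pr}^*\bigl(T \in Z^n : \exists g \in \mathcal{G} \text{ with } E_T g \le \varepsilon/20 \text{ and } E_P g \ge \varepsilon\bigr) \\ &\qquad\le P^n\Bigl(T \in Z^n : \sup_{h \in \mathcal{H}} E_T h \ge 19\varepsilon/20\Bigr) \\ &\qquad\le P^n\Bigl(T \in Z^n : \sup_{h \in \mathcal{H}} E_T h > 3E_{T' \sim P^n} \sup_{h \in \mathcal{H}} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}\Bigr) \\ &\qquad\le e^{-x}. \end{aligned} \]

With the help of the above lemma we can now prove the main result of this section, that is, Theorem 5.1.

Proof of Theorem 5.1. In order to apply Lemma 5.4 to the class $\mathcal{G}$ it obviously suffices to show the richness condition on $\mathcal{G}$ of Lemma 5.4. To this end let $f \in \mathcal{F}$ with $E_T(L \circ f - L \circ f_{P,\mathcal{F}}) \le \varepsilon/20$ and $E_P(L \circ f - L \circ f_{P,\mathcal{F}}) \ge \varepsilon$. For $t \in [0,1]$ we define $f_t := tf + (1-t)f_{P,\mathcal{F}}$. Since $\mathcal{F}$ is convex we have $f_t \in \mathcal{F}$ for all $t \in [0,1]$. By the line-continuity of $L$ and Lebesgue's theorem we find that the map $h : t \mapsto E_P(L \circ f_t - L \circ f_{P,\mathcal{F}})$ which maps from $[0,1]$ to $[0,B]$ is continuous. Since $h(0) = 0$ and $h(1) \ge \varepsilon$ there is a $t \in (0,1]$ with

\[ E_P(L \circ f_t - L \circ f_{P,\mathcal{F}}) = h(t) = \varepsilon \]

by the intermediate value theorem. Moreover, for this $t$ we have

\[ E_T(L \circ f_t - L \circ f_{P,\mathcal{F}}) \le E_T\bigl(tL \circ f + (1-t)L \circ f_{P,\mathcal{F}} - L \circ f_{P,\mathcal{F}}\bigr) \le \varepsilon/20. \]

Now, let $\varepsilon > 0$ with $\varepsilon \ge 10\max\{\omega_n(\mathcal{G}, c\varepsilon^\alpha + \delta), (\frac{\delta x}{n})^{1/2}, (\frac{4cx}{n})^{1/(2-\alpha)}, \frac{Bx}{n}\}$. Then by Lemma 5.4 we find that with probability at least $1 - e^{-x}$, every $f \in \mathcal{F}$ with $E_T(L \circ f - L \circ f_{P,\mathcal{F}}) \le \varepsilon/20$ satisfies $E_P(L \circ f - L \circ f_{P,\mathcal{F}}) < \varepsilon$. Since we always have

\[ E_T(L \circ f_{T,\mathcal{F}} - L \circ f_{P,\mathcal{F}}) \le 0 < \varepsilon/20, \]

we obtain the assertion.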
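Condition (34) is implicit in $\varepsilon$, because $\varepsilon$ also appears inside the modulus of continuity. The sketch below shows one way to find an admissible $\varepsilon$ numerically once some upper bound on $\omega_n$ is available. Everything here is an assumption made for illustration: the function names are hypothetical, and the bound $\omega(r) = 2\sqrt{r/n}$ used in the example is a placeholder, not a bound derived in the paper.

```python
import math

def rhs_of_34(eps, n, x, c, alpha, delta, B, omega):
    """Right-hand side of (34); omega(r) is a user-supplied upper bound
    on the modulus of continuity at radius r = c * eps**alpha + delta."""
    return 10.0 * max(
        omega(c * eps ** alpha + delta),
        math.sqrt(delta * x / n),
        (4.0 * c * x / n) ** (1.0 / (2.0 - alpha)),
        B * x / n,
    )

def admissible_epsilon(n, x, c, alpha, delta, B, omega,
                       max_doublings=200, bisections=200):
    """Return some eps > 0 with eps >= rhs_of_34(eps), via doubling then bisection."""
    def ok(eps):
        return eps >= rhs_of_34(eps, n, x, c, alpha, delta, B, omega)

    hi = 1.0
    for _ in range(max_doublings):
        if ok(hi):
            break
        hi *= 2.0
    else:
        raise ValueError("no admissible epsilon found; omega may grow too fast")

    lo = 0.0
    for _ in range(bisections):
        mid = 0.5 * (lo + hi)
        if ok(mid):
            hi = mid   # hi always satisfies the condition
        else:
            lo = mid
    return hi

n = 1000
eps = admissible_epsilon(n=n, x=1.0, c=1.0, alpha=1.0, delta=0.0, B=1.0,
                         omega=lambda r: 2.0 * math.sqrt(r / n))
```

The returned value always satisfies (34) for the supplied `omega`, since only points that pass the check are ever assigned to `hi`. For the placeholder inputs above the binding term is the modulus of continuity, and the result is close to 0.4.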

5.2. Bounding the modulus of continuity. The aim of this subsection is to bound the modulus of continuity of the class $\mathcal{G}$ in Theorem 5.1 with the help of covering numbers. We then present the resulting modification of Theorem 5.1.

Let us begin by recalling the definition of (local) Rademacher averages. To this end let $\mathcal{F}$ be a class of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$. Furthermore, let $P$ be a probability measure on $Z$ and $(\varepsilon_i)$ be a sequence of i.i.d. Rademacher variables (i.e., symmetric $\{-1,1\}$-valued random variables) with respect to some probability measure $\mu$ on a set $\Omega$. Then the Rademacher average of $\mathcal{F}$ is

\[ \mathrm{Rad}_P(\mathcal{F},n) := \mathrm{Rad}(\mathcal{F},n) := E_{P^n} E_\mu \sup_{f \in \mathcal{F}} \Bigl|\frac1n \sum_{i=1}^n \varepsilon_i f(z_i)\Bigr|, \]

and for $\varepsilon > 0$ the local Rademacher average of $\mathcal{F}$ is defined by

\[ \mathrm{Rad}(\mathcal{F},n,\varepsilon) := \mathrm{Rad}_P(\mathcal{F},n,\varepsilon) := E_{P^n} E_\mu \sup_{f \in \mathcal{F},\ E_P f^2 \le \varepsilon} \Bigl|\frac1n \sum_{i=1}^n \varepsilon_i f(z_i)\Bigr|. \]

For a given $a > 0$ we immediately obtain $\mathrm{Rad}(a\mathcal{F},n) = a\,\mathrm{Rad}(\mathcal{F},n)$ and

\[ \mathrm{Rad}(a\mathcal{F},n,\varepsilon) = a\,\mathrm{Rad}(\mathcal{F},n,a^{-2}\varepsilon). \tag{37} \]

Moreover, by symmetrization the modulus of continuity can be estimated by the local Rademacher average. More precisely, we always have (see [36])

\[ \omega_{P,n}(\mathcal{F},\varepsilon) \le 2\,\mathrm{Rad}_P(\mathcal{F},n,\varepsilon), \qquad \varepsilon > 0. \]

Local Rademacher averages can be estimated by covering numbers. Without proof we state a slight modification of a corresponding result in [21]:

Proposition 5.5. Let $\mathcal{F}$ be a class of measurable functions from $Z$ to $[-1,1]$ which is separable with respect to $\|\cdot\|_\infty$ and let $P$ be a probability measure on $Z$. Assume there are constants $a > 0$ and $0 < p < 2$ with

\[ \sup_{T \in Z^n} \log \mathcal{N}(\mathcal{F},\varepsilon,L_2(T)) \le a\varepsilon^{-p} \]

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such that for all $n \ge 1$ and all $\varepsilon > 0$ we have

\[ \mathrm{Rad}(\mathcal{F},n,\varepsilon) \le c_p \max\Bigl\{\varepsilon^{1/2 - p/4}\Bigl(\frac{a}{n}\Bigr)^{1/2},\ \Bigl(\frac{a}{n}\Bigr)^{2/(2+p)}\Bigr\}. \]

Using this proposition we can replace the modulus of continuity in Theorem 5.1 by an assumption on the covering numbers of $\mathcal{G}$. Assuming that all resulting minimizers exist, the corresponding result then reads as follows:

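Theorem 5.6. Let $\mathcal{F}$ be a convex set of bounded measurable functions from $Z$ to $\mathbb{R}$ and let $L : \mathcal{F} \times Z \to [0,\infty)$ be a convex and line-continuous loss function. For a probability measure $P$ on $Z$ we define

\[ \mathcal{G} := \{L \circ f - L \circ f_{P,\mathcal{F}} : f \in \mathcal{F}\}. \]

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in \mathcal{G}$. Furthermore, assume that $\mathcal{G}$ is separable with respect to $\|\cdot\|_\infty$ and that there are constants $a \ge 1$ and $0 < p < 2$ with

\[ \sup_{T \in Z^n} \log \mathcal{N}(B^{-1}\mathcal{G},\varepsilon,L_2(T)) \le a\varepsilon^{-p} \tag{38} \]

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such that for all $n \ge 1$ and all $x \ge 1$ we have

\[ \mathrm{Pr}^*\bigl(T \in Z^n : \mathcal{R}_{L,P}(f_{T,\mathcal{F}}) > \mathcal{R}_{L,P}(f_{P,\mathcal{F}}) + c_p\,\varepsilon(n,a,B,c,\delta,x)\bigr) \le e^{-x}, \]

where

\[ \begin{aligned} \varepsilon(n,a,B,c,\delta,x) := {} & B^{2p/(4-2\alpha+\alpha p)}\,c^{(2-p)/(4-2\alpha+\alpha p)} \Bigl(\frac{a}{n}\Bigr)^{2/(4-2\alpha+\alpha p)} + B^{p/2}\delta^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2} \\ & + B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \Bigl(\frac{cx}{n}\Bigr)^{1/(2-\alpha)} + \frac{Bx}{n}. \end{aligned} \]

Proof. By (37) and Proposition 5.5 we find

\[ \mathrm{Rad}(\mathcal{G},n,\varepsilon) \le c_p \max\Bigl\{B^{p/2}\varepsilon^{1/2 - p/4}\Bigl(\frac{a}{n}\Bigr)^{1/2},\ B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)}\Bigr\}. \]

We assume without loss of generality that $c_p \ge 5$. Let $\varepsilon^* > 0$ be the largest real number that satisfies

\[ \varepsilon^* = 2c_p B^{p/2}\bigl(c(\varepsilon^*)^\alpha + \delta\bigr)^{1/2 - p/4}\Bigl(\frac{a}{n}\Bigr)^{1/2}. \tag{39} \]

Furthermore, let $\varepsilon > 0$ be such that

\[ \varepsilon = 2c_p \max\Bigl\{B^{p/2}(c\varepsilon^\alpha + \delta)^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2},\ B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n}\Bigr\}. \]

It is easy to see that both $\varepsilon$ and $\varepsilon^*$ exist. Moreover, our above considerations show $\varepsilon \ge 10\max\{\omega_n(\mathcal{G}, c\varepsilon^\alpha + \delta), (\frac{\delta x}{n})^{1/2}, (\frac{4cx}{n})^{1/(2-\alpha)}, \frac{Bx}{n}\}$, that is, $\varepsilon$ satisfies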

the assumptions of Theorem 5.1. In order to show the assertion it therefore suffices to bound $\varepsilon$ from above. To this end let us first assume that

\[ B^{p/2}(c\varepsilon^\alpha + \delta)^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2} \ge \max\Bigl\{B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n}\Bigr\}. \]

Then we have $\varepsilon = 2c_p B^{p/2}(c\varepsilon^\alpha + \delta)^{(2-p)/4}(\frac{a}{n})^{1/2}$. Since $\varepsilon^*$ is the largest solution of this equation we hence find $\varepsilon \le \varepsilon^*$. This shows that we always have

\[ \varepsilon \le \varepsilon^* + 2c_p\Bigl(B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)} + \frac{Bx}{n}\Bigr). \]

Hence it suffices to bound $\varepsilon^*$ from above. To this end let us first assume $c(\varepsilon^*)^\alpha \ge \delta$. This implies $\varepsilon^* \le 4c_p B^{p/2}\bigl(c(\varepsilon^*)^\alpha\bigr)^{1/2 - p/4}(\frac{a}{n})^{1/2}$, and hence we find

\[ \varepsilon^* \le 16c_p^2\,B^{2p/(4-2\alpha+\alpha p)}\,c^{(2-p)/(4-2\alpha+\alpha p)}\Bigl(\frac{a}{n}\Bigr)^{2/(4-2\alpha+\alpha p)}. \]

Conversely, if $c(\varepsilon^*)^\alpha < \delta$ holds, then we immediately obtain

\[ \varepsilon^* < 4c_p B^{p/2}\delta^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2}. \]

6. Variance bounds for SVMs. In this section we prove some variance bounds in the sense of Theorem 5.6 for SVMs. Let us first ensure that these classifiers are ERM-type algorithms that fit into the framework of Theorem 5.6. To this end let $H$ be a RKHS of a continuous kernel over $X$, $\lambda > 0$, and $l : Y \times \mathbb{R} \to [0,\infty)$ be the hinge loss function. We define

\[ L(f,x,y) := \lambda\|f\|_H^2 + l(y,f(x)) \tag{40} \]

and

\[ L(f,b,x,y) := \lambda\|f\|_H^2 + l(y,f(x) + b) \tag{41} \]

for all $f \in H$, $b \in \mathbb{R}$, $x \in X$ and $y \in Y$. Then $\mathcal{R}_{L,T}(\cdot)$ and $\mathcal{R}_{L,T}(\cdot,\cdot)$ obviously coincide with the objective functions of the SVM formulations and therefore SVMs are empirical $L$-risk minimizers. Furthermore note that all above minimizers exist (see [31]) and thus the SVM formulations in terms of $L$ actually fit into the framework of Theorem 5.6.

In the following, $f_{l,P}$ denotes a minimizer of $\mathcal{R}_{l,P}$ if no confusion can arise. For the shape of these minimizers which depend on $\eta := P(y = 1 \mid \cdot)$ we refer to [39] and [30]. Now our first result is a variance bound which can be used when considering the empirical $l$-risk minimizer.

Lemma 6.1. Let $P$ be a distribution on $X \times Y$ with Tsybakov noise exponent $0 \le q \le \infty$. Then there exists a minimizer $f_{l,P}$ mapping into $[-1,1]$ such that for all bounded measurable functions $f : X \to \mathbb{R}$ we have

\[ E_P(l \circ f - l \circ f_{l,P})^2 \le C_{\eta,q}\,(\|f\|_\infty + 1)^{(q+2)/(q+1)}\,\bigl(E_P(l \circ f - l \circ f_{l,P})\bigr)^{q/(q+1)}, \]

where $C_{\eta,q} := \|(2\eta - 1)^{-1}\|_{q,\infty} + 2$ if $q > 0$ and $C_{\eta,q} = 1$ if $q = 0$.

Proof. For $q = 0$ the assertion is trivial and hence we only consider the case $q > 0$. Given a fixed $x \in X$ we write $p := P(1 \mid x)$ and $t := f(x)$. In addition, we introduce

\[ \begin{aligned} v(p,t) &:= p\bigl(l(1,t) - l(1,f_{l,P}(x))\bigr)^2 + (1-p)\bigl(l(-1,t) - l(-1,f_{l,P}(x))\bigr)^2, \\ m(p,t) &:= p\bigl(l(1,t) - l(1,f_{l,P}(x))\bigr) + (1-p)\bigl(l(-1,t) - l(-1,f_{l,P}(x))\bigr). \end{aligned} \]

Since Tsybakov's noise assumption implies $P_X(X_0) = 0$, we can restrict our consideration to $p \ne 1/2$. Now we will begin by showing

\[ v(p,t) \le \Bigl(|t| + \frac{2}{|2p - 1|}\Bigr) m(p,t). \tag{42} \]

Without loss of generality we may assume $p > 1/2$. Then we may set $f_{l,P}(x) := 1$ and thus we have $l(1,f_{l,P}(x)) = 0$ and $l(-1,f_{l,P}(x)) = 2$. Let us first consider the case $t \in [-1,1]$. Then we have $l(1,t) = 1 - t$ and $l(-1,t) = 1 + t$, and therefore (42) reduces to

\[ (1-t)^2 \le \Bigl(|t| + \frac{2}{2p-1}\Bigr)(2p-1)(1-t). \]

Obviously, the latter inequality is equivalent to $1 - t \le (2p-1)|t| + 2$, which is always satisfied for $t \in [-1,1]$ and $p \ge 1/2$. Now let us consider the case $t \le -1$. We then have $l(1,t) = 1 - t$ and $l(-1,t) = 0$, and after some elementary calculation we hence see that (42) is satisfied if and only if

\[ p^2(6 - 2t) - p(5 - 3t) - 2t \ge 0. \]

The left-hand side is minimal if $p = (5 - 3t)/(12 - 4t)$, and thus we obtain

\[ p^2(6 - 2t) - p(5 - 3t) - 2t \ge \frac{7t^2 - 18t - 25}{24 - 8t}. \]

Consequently, it suffices to show $7t^2 - 18t - 25 \ge 0$. However, the latter is true for all $t \le -1$ since $t \mapsto 7t^2 - 18t - 25$ is decreasing on $(-\infty,-1]$. Now let us consider the third case, $t > 1$. Since we then have $l(1,t) = 0$ and $l(-1,t) = 1 + t$ it suffices to show $t - 1 \le t + \frac{2}{2p-1}$.
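The pointwise inequality (42) can also be checked numerically over a grid, which covers all three cases of the argument at once. This is only a sanity check under the same normalization as in the proof ($p > 1/2$ and $f_{l,P}(x) = 1$). The function names and the grid are illustrative choices, not part of the paper.

```python
def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y * t)."""
    return max(0.0, 1.0 - y * t)

def v_and_m(p, t):
    """Second moment v(p, t) and first moment m(p, t) of the excess hinge loss
    at a point with P(y = 1 | x) = p > 1/2, where f_{l,P}(x) = 1."""
    f_star = 1.0
    d_pos = hinge(1, t) - hinge(1, f_star)
    d_neg = hinge(-1, t) - hinge(-1, f_star)
    v = p * d_pos ** 2 + (1.0 - p) * d_neg ** 2
    m = p * d_pos + (1.0 - p) * d_neg
    return v, m

def worst_gap(p_steps=200, t_steps=400, t_max=10.0):
    """Largest value of v(p,t) - (|t| + 2/(2p-1)) * m(p,t) over a grid
    of p in (1/2, 1] and t in [-t_max, t_max]."""
    worst = float("-inf")
    for i in range(1, p_steps + 1):
        p = 0.5 + 0.5 * i / p_steps
        for j in range(t_steps + 1):
            t = -t_max + 2.0 * t_max * j / t_steps
            v, m = v_and_m(p, t)
            gap = v - (abs(t) + 2.0 / (2.0 * p - 1.0)) * m
            worst = max(worst, gap)
    return worst
```

If (42) holds on the grid, `worst_gap()` is at most zero up to rounding error. Equality is attained at $t = 1$, where both sides vanish.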


More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

Analytic Continuation

Analytic Continuation Aalytic Cotiuatio The stadard example of this is give by Example Let h (z) = 1 + z + z 2 + z 3 +... kow to coverge oly for z < 1. I fact h (z) = 1/ (1 z) for such z. Yet H (z) = 1/ (1 z) is defied for

More information

1 Review and Overview

1 Review and Overview CS9T/STATS3: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #6 Scribe: Jay Whag ad Patrick Cho October 0, 08 Review ad Overview Recall i the last lecture that for ay family of scalar fuctios F, we

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

6 Integers Modulo n. integer k can be written as k = qn + r, with q,r, 0 r b. So any integer.

6 Integers Modulo n. integer k can be written as k = qn + r, with q,r, 0 r b. So any integer. 6 Itegers Modulo I Example 2.3(e), we have defied the cogruece of two itegers a,b with respect to a modulus. Let us recall that a b (mod ) meas a b. We have proved that cogruece is a equivalece relatio

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information

The Choquet Integral with Respect to Fuzzy-Valued Set Functions

The Choquet Integral with Respect to Fuzzy-Valued Set Functions The Choquet Itegral with Respect to Fuzzy-Valued Set Fuctios Weiwei Zhag Abstract The Choquet itegral with respect to real-valued oadditive set fuctios, such as siged efficiecy measures, has bee used i

More information

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.

It is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function. MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied

More information

A Proof of Birkhoff s Ergodic Theorem

A Proof of Birkhoff s Ergodic Theorem A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.

The Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size. Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure

More information

PRELIM PROBLEM SOLUTIONS

PRELIM PROBLEM SOLUTIONS PRELIM PROBLEM SOLUTIONS THE GRAD STUDENTS + KEN Cotets. Complex Aalysis Practice Problems 2. 2. Real Aalysis Practice Problems 2. 4 3. Algebra Practice Problems 2. 8. Complex Aalysis Practice Problems

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

f n (x) f m (x) < ɛ/3 for all x A. By continuity of f n and f m we can find δ > 0 such that d(x, x 0 ) < δ implies that

f n (x) f m (x) < ɛ/3 for all x A. By continuity of f n and f m we can find δ > 0 such that d(x, x 0 ) < δ implies that Lecture 15 We have see that a sequece of cotiuous fuctios which is uiformly coverget produces a limit fuctio which is also cotiuous. We shall stregthe this result ow. Theorem 1 Let f : X R or (C) be a

More information

Approximation by Superpositions of a Sigmoidal Function

Approximation by Superpositions of a Sigmoidal Function Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 22 (2003, No. 2, 463 470 Approximatio by Superpositios of a Sigmoidal Fuctio G. Lewicki ad G. Mario Abstract. We geeralize

More information

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We

More information

Boundaries and the James theorem

Boundaries and the James theorem Boudaries ad the James theorem L. Vesely 1. Itroductio The followig theorem is importat ad well kow. All spaces cosidered here are real ormed or Baach spaces. Give a ormed space X, we deote by B X ad S

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Jacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3

Jacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3 No-Parametric Techiques Jacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3 Parametric vs. No-Parametric Parametric Based o Fuctios (e.g Normal Distributio) Uimodal Oly oe peak Ulikely real data cofies

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1

Solution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1 Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam 4 will cover.-., 0. ad 0.. Note that eve though. was tested i exam, questios from that sectios may also be o this exam. For practice problems o., refer to the last review. This

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula Joural of Multivariate Aalysis 102 (2011) 1315 1319 Cotets lists available at ScieceDirect Joural of Multivariate Aalysis joural homepage: www.elsevier.com/locate/jmva Superefficiet estimatio of the margials

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Binary classification, Part 1

Binary classification, Part 1 Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y

More information

MAS111 Convergence and Continuity

MAS111 Convergence and Continuity MAS Covergece ad Cotiuity Key Objectives At the ed of the course, studets should kow the followig topics ad be able to apply the basic priciples ad theorems therei to solvig various problems cocerig covergece

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Chapter 10: Power Series

Chapter 10: Power Series Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because

More information

4.3 Growth Rates of Solutions to Recurrences

4.3 Growth Rates of Solutions to Recurrences 4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.

More information

MATH301 Real Analysis (2008 Fall) Tutorial Note #7. k=1 f k (x) converges pointwise to S(x) on E if and

MATH301 Real Analysis (2008 Fall) Tutorial Note #7. k=1 f k (x) converges pointwise to S(x) on E if and MATH01 Real Aalysis (2008 Fall) Tutorial Note #7 Sequece ad Series of fuctio 1: Poitwise Covergece ad Uiform Covergece Part I: Poitwise Covergece Defiitio of poitwise covergece: A sequece of fuctios f

More information

Ω ). Then the following inequality takes place:

Ω ). Then the following inequality takes place: Lecture 8 Lemma 5. Let f : R R be a cotiuously differetiable covex fuctio. Choose a costat δ > ad cosider the subset Ωδ = { R f δ } R. Let Ωδ ad assume that f < δ, i.e., is ot o the boudary of f = δ, i.e.,

More information

1 Approximating Integrals using Taylor Polynomials

1 Approximating Integrals using Taylor Polynomials Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe

More information

Chapter IV Integration Theory

Chapter IV Integration Theory Chapter IV Itegratio Theory Lectures 32-33 1. Costructio of the itegral I this sectio we costruct the abstract itegral. As a matter of termiology, we defie a measure space as beig a triple (, A, µ), where

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information