arXiv: v1 [math.ST] 14 Aug 2007
The Annals of Statistics 2007, Vol. 35, No. 2, DOI: / In the Public Domain

FAST RATES FOR SUPPORT VECTOR MACHINES USING GAUSSIAN KERNELS¹

By Ingo Steinwart and Clint Scovel

Los Alamos National Laboratory

For binary classification we establish learning rates up to the order of $n^{-1}$ for support vector machines (SVMs) with hinge loss and Gaussian RBF kernels. These rates are in terms of two assumptions on the considered distributions: Tsybakov's noise assumption to establish a small estimation error, and a new geometric noise condition which is used to bound the approximation error. Unlike previously proposed concepts for bounding the approximation error, the geometric noise assumption does not employ any smoothness assumption.

1. Introduction. In recent years support vector machines (SVMs) have been the subject of many theoretical considerations. Despite this effort, their learning performance on restricted classes of distributions is still widely unknown. In particular, it is unknown under which nontrivial circumstances SVMs can guarantee fast learning rates. The aim of this work is to use concepts like Tsybakov's noise assumption and local Rademacher averages to establish learning rates up to the order of $n^{-1}$ for nontrivial distributions. In addition to these concepts that are used to deal with the stochastic part of the analysis we also introduce a geometric assumption for distributions that allows us to estimate the approximation properties of Gaussian RBF kernels. Unlike many other concepts introduced for bounding the approximation error, our geometric assumption is not in terms of smoothness but describes the concentration and the noisiness of the data-generating distribution near the decision boundary.

Let us formally introduce the statistical classification problem. To this end let us fix a subset $X \subset \mathbb{R}^d$. We write $Y := \{-1, 1\}$. Given a finite training set

Received December 2003; revised June.
¹Supported by the LDRD-ER program of the Los Alamos National Laboratory.
AMS 2000 subject classifications.
Primary 68Q32; secondary 62G20, 62G99, 68T05, 68T10, 41A46, 41A99.
Key words and phrases. Support vector machines, classification, nonlinear discrimination, learning rates, noise assumption, Gaussian RBF kernels.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2007, Vol. 35, No. 2. This reprint differs from the original in pagination and typographic detail.
$T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X\times Y)^n$, the classification task is to predict the label $y$ of a new sample $(x,y)$. In the standard batch model it is assumed that the samples $(x_i,y_i)$ are i.i.d. according to an unknown (Borel) probability measure $P$ on $X\times Y$. Furthermore, the new sample $(x,y)$ is drawn from $P$ independently of $T$. Given a classifier $\mathcal{C}$ that assigns to every training set $T$ a measurable function $f_T : X \to \mathbb{R}$, the prediction of $\mathcal{C}$ for $y$ is $\operatorname{sign} f_T(x)$, where $\operatorname{sign}(0) := 1$. The quality of such a function $f$ is measured by the classification risk

$$\mathcal{R}_P(f) := P(\{(x,y) : \operatorname{sign} f(x) \neq y\}),$$

which should be as small as possible. The smallest achievable risk $\mathcal{R}_P := \inf\{\mathcal{R}_P(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$ is called the Bayes risk of $P$, and a function attaining this risk is called a Bayes decision function and is denoted by $f_P$.

Obviously, a good classifier should at least produce decision functions whose risks converge to the Bayes risk for all distributions $P$. This leads to the notion of universally consistent classifiers, which is thoroughly treated in [14]. The next naturally arising question is whether there are classifiers which guarantee a specific convergence rate for all distributions. Unfortunately, this is impossible by a result of Devroye (see [14], Theorem 7.2). However, if one restricts consideration to certain smaller classes of distributions, such learning rates, for example, in the form of

$$P^n\bigl(T \in (X\times Y)^n : \mathcal{R}_P(f_T) \le \mathcal{R}_P + C(x)\, n^{-\beta}\bigr) \ge 1 - e^{-x}, \qquad n \ge 1,\ x \ge 1,$$

where $\beta > 0$ and $C(x) > 0$ are constants, exist for various classifiers. Typical assumptions for such classes of distributions are either in terms of the smoothness of the function $\eta(x) := P(y = 1 \mid x)$ (see, e.g., [19, 38]), or in terms of the smoothness of the decision boundary (see, e.g., [18, 35]). Moreover, the corresponding learning rates are slower than $n^{-1/2}$ if no additional assumptions on the amount of the noise in the labels, for example, on the distribution of the random variable

$$\min\{1-\eta(x), \eta(x)\} = \tfrac12 - \bigl|\eta(x) - \tfrac12\bigr| \tag{1}$$

around the critical level $1/2$, are imposed.
On the other hand, [35] showed that ERM-type classifiers can learn faster than $n^{-1/2}$ if one quantifies how likely the noise in (1) is close to $1/2$ (see Definition 2.2 in the following section). Unfortunately, however, the ERM classifier considered in [35] requires substantial knowledge on how to approximate the desired Bayes decision functions. Moreover, ERM classifiers are based on combinatorial optimization problems and hence they are usually hard to implement, and in general there exist no efficient algorithms. On the one hand, SVMs do not share the implementation issues of ERM since they are based on a convex optimization (see, e.g., [12, 26] for algorithmic aspects). On the other hand, however, their known learning rates are
rather unsatisfactory, since either the assumptions on the distributions are too restrictive as in [28] or the established learning rates are too slow as in [37].

Our aim is to give SVMs a better theoretical foundation by establishing fast learning rates for a wide class of distributions. To this end we propose a geometric noise assumption (see Definition 2.3) which describes the concentration of the measure $|2\eta - 1|\,dP_X$ (where $P_X$ is the marginal distribution of $P$ with respect to $X$) near the decision boundary. This assumption is then used to determine the approximation properties of Gaussian kernels, which are used in the SVMs we consider. Provided that the tuning parameters are optimally chosen, our main result then shows that the resulting learning rates for these classifiers can be as fast as $n^{-1}$.

The rest of this work is organized as follows: In Section 2 we introduce the main concepts of this work and then present our results. In Section 3 we recall some basic theory on reproducing kernel Hilbert spaces and prove a new covering number bound for Gaussian kernels that describes a trade-off between the kernel widths and the radii of the covering balls. In Section 4 we then show the approximation results that are related to our proposed geometric noise assumption. The last sections of the work contain the actual proof of our rates: In Section 5 we establish a general bound for ERM-type classifiers involving local Rademacher averages, which is used to bound the estimation error in our analysis of SVMs. In order to apply this result we need variance bounds for SVMs, which are established in Section 6. Interestingly, it turns out that sharp versions of these bounds depend on both Tsybakov's noise assumption and the approximation properties of the kernel used. Finally, we prove our learning rates in Section 7.

2. Definitions and main results. In this section we first recall some basic notions related to support vector machines which are needed throughout this text.
In Section 2.2, we then present a covering number bound for Gaussian RBF kernels which will play an important role in our analysis of the estimation error of SVMs. In Section 2.3 we recall Tsybakov's noise assumption, which will allow us to establish learning rates faster than $n^{-1/2}$. Then, in Section 2.4, we introduce the new geometric assumption that is used to estimate the approximation error for SVMs with Gaussian RBF kernels. Finally, we present and discuss our learning rates in Section 2.5.

2.1. RKHSs, SVMs and basic definitions. For two functions $f$ and $g$ we use the notation $f(\lambda) \preceq g(\lambda)$ to mean that there exists a constant $C > 0$ such that $f(\lambda) \le C g(\lambda)$ over some specified range of values of $\lambda$. We also use the notation $\succeq$ with similar meaning and the notation $\sim$ when both $\preceq$ and $\succeq$ hold. In particular, we use the same notation for sequences. If not stated otherwise, $X$ always denotes a compact subset of $\mathbb{R}^d$ which is equipped with the Borel $\sigma$-algebra.
Recall (see, e.g., [1, 6]) that every positive definite kernel $k : X\times X \to \mathbb{R}$ has a unique reproducing kernel Hilbert space $H$ (RKHS) whose unit ball is denoted by $B_H$. Although we sometimes use generic kernels and RKHSs, we are mainly interested in Gaussian RBF kernels, which are the most widely used kernels in practice. Recall that these kernels are of the form

$$k_\sigma(x,x') = \exp(-\sigma^2 \|x - x'\|_2^2), \qquad x, x' \in X,$$

where $\sigma > 0$ is a free parameter whose inverse $1/\sigma$ is called the width of $k_\sigma$. We usually denote the corresponding RKHSs, which are thoroughly described in [32], by $H_\sigma(X)$ or simply $H_\sigma$.

Let us now recall the definition of SVMs. To this end let $P$ be a distribution on $X\times Y$ and $l : Y\times\mathbb{R} \to [0,\infty)$ be the hinge loss, that is, $l(y,t) := \max\{0, 1 - yt\}$, $y \in Y$, $t \in \mathbb{R}$. Furthermore, we define the $l$-risk of a measurable function $f : X \to \mathbb{R}$ by $\mathcal{R}_{l,P}(f) := \mathbb{E}_{(x,y)\sim P}\, l(y, f(x))$. Now let $H$ be an RKHS over $X$ consisting of measurable functions. For $\lambda > 0$ we denote a solution of

$$\arg\min_{f \in H,\ b \in \mathbb{R}} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f + b)\bigr) \tag{2}$$

by $(\tilde f_{P,\lambda}, \tilde b_{P,\lambda})$. Recall that $\tilde f_{P,\lambda}$ is uniquely determined (see, e.g., [30]), while in some situations this is not true for the offset $\tilde b_{P,\lambda}$. In general we thus assume that $\tilde b_{P,\lambda}$ is an arbitrary solution. However, for the (trivial) distributions that satisfy $P(\{y^*\} \mid x) = 1$ $P_X$-a.s. for some $y^* \in Y$ we explicitly set $\tilde b_{P,\lambda} := y^*$ in order to control the size of the offset. Furthermore, if $P$ is an empirical distribution with respect to a training set $T = ((x_1,y_1),\dots,(x_n,y_n))$ we write $\mathcal{R}_{l,T}(f)$ and $(\tilde f_{T,\lambda}, \tilde b_{T,\lambda})$. Note that in this case the above condition under which we set $\tilde b_{T,\lambda} := y^*$ means that all labels $y_i$ of $T$ are equal to $y^*$. An algorithm that constructs $(\tilde f_{T,\lambda}, \tilde b_{T,\lambda})$ for every training set $T$ is called an SVM with offset. Furthermore, for $\lambda > 0$ we denote the unique solution of

$$\arg\min_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f)\bigr) \tag{3}$$

by $f_{P,\lambda}$, and for empirical distributions based on a training set $T$ we again write $f_{T,\lambda}$. A corresponding algorithm is called an SVM without offset.
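The objects defined above can be made concrete with a small sketch. The code below evaluates the Gaussian RBF kernel in the paper's parametrization, the hinge loss, and the empirical objective from (2) for a candidate function. It assumes that the candidate has the finite kernel expansion $f = \sum_j c_j k_\sigma(x_j,\cdot)$, for which $\|f\|_H^2 = c^\top K c$ with Gram matrix $K$; the function names and this restriction are illustrative choices, not taken from the paper, and the sketch only evaluates the objective rather than minimizing it.

```python
import math

def gaussian_rbf(x, x_prime, sigma):
    """k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2); note 1/sigma is the width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-(sigma ** 2) * sq_dist)

def hinge_loss(y, t):
    """l(y, t) = max{0, 1 - y t} for a label y in {-1, 1} and a real prediction t."""
    return max(0.0, 1.0 - y * t)

def regularized_empirical_risk(coeffs, offset, xs, ys, sigma, lam):
    """lam * ||f||_H^2 + R_{l,T}(f + b) for f = sum_j coeffs[j] * k_sigma(xs[j], .).

    Assumption (hypothetical helper): f lies in the span of the kernel
    functions at the training points, so ||f||_H^2 = c^T K c.
    """
    n = len(xs)
    gram = [[gaussian_rbf(xs[i], xs[j], sigma) for j in range(n)] for i in range(n)]
    norm_sq = sum(coeffs[i] * coeffs[j] * gram[i][j]
                  for i in range(n) for j in range(n))
    risk = sum(
        hinge_loss(ys[i], sum(coeffs[j] * gram[i][j] for j in range(n)) + offset)
        for i in range(n)
    ) / n
    return lam * norm_sq + risk
```

Setting `offset` to zero gives the objective of the SVM without offset in (3); an actual SVM would minimize this convex function over the coefficients (and the offset).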
Recall that under some assumptions on the RKHS used and the choice of the regularization parameter $\lambda$ it can be shown that both SVM variants are universally consistent (see [29, 31, 39]); however, no satisfying learning rates have been established yet.
We also emphasize that in many theoretical papers only SVMs without offset are considered, since the offset often causes serious technical problems in the analysis. However, in practice usually SVMs with offset are used and therefore we feel that these algorithms should be considered in theory, too. As we will see, our techniques can be applied for both variants. The resulting rates coincide.

2.2. Covering numbers for Gaussian RKHSs. In order to bound the estimation error of SVMs we need a complexity measure for the RKHSs used, which is introduced in this section. To this end let $A \subset E$ be a subset of a Banach space $E$. The covering numbers of $A$ are defined by

$$N(A,\varepsilon,E) := \min\Bigl\{ n \ge 1 : \exists\, x_1,\dots,x_n \in E \text{ with } A \subset \bigcup_{i=1}^n (x_i + \varepsilon B_E) \Bigr\}, \qquad \varepsilon > 0,$$

where $B_E$ denotes the closed unit ball of $E$. Moreover, for a bounded linear operator $S : E \to F$ between two Banach spaces $E$ and $F$, the covering numbers are $N(S,\varepsilon) := N(SB_E,\varepsilon,F)$. Given a training set $T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X\times Y)^n$ we denote the space of all equivalence classes of functions $f : X\times Y \to \mathbb{R}$ with norm

$$\|f\|_{L_2(T)} := \Bigl( \frac1n \sum_{i=1}^n |f(x_i,y_i)|^2 \Bigr)^{1/2} \tag{4}$$

by $L_2(T)$. In other words, $L_2(T)$ is an $L_2$-space with respect to the empirical measure of $T$. Note that for a function $f : X\times Y \to \mathbb{R}$ a canonical representative in $L_2(T)$ is its restriction $f_{|T}$. In addition, $L_2(T_X)$ denotes the space of all (equivalence classes of) square integrable functions with respect to the empirical measure of $x_1,\dots,x_n$.

The proof of our learning rates uses the behavior of $N(B_{H_\sigma(X)},\varepsilon,L_2(T_X))$ in $\varepsilon$ and $\sigma$ in order to bound the estimation error. Unfortunately, all known results on covering numbers for Gaussian RBF kernels emphasize the role of $\varepsilon$, and hence we will establish in Section 3 the following result, which describes a suitable trade-off between the influence of $\varepsilon$ and $\sigma$.

Theorem 2.1. Let $\sigma \ge 1$, $X \subset \mathbb{R}^d$ be a compact subset with nonempty interior, and $H_\sigma(X)$ be the RKHS of the Gaussian RBF kernel $k_\sigma$ on $X$.
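Then for all $0 < p \le 2$ and all $\delta > 0$, there exists a constant $c_{p,\delta,d} > 0$ independent of $\sigma$ such that for all $\varepsilon > 0$ we have

$$\sup_{T \in (X\times Y)^n} \log N(B_{H_\sigma(X)},\varepsilon,L_2(T_X)) \le c_{p,\delta,d}\, \sigma^{(1-p/2)(1+\delta)d}\, \varepsilon^{-p}.$$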
2.3. Tsybakov's noise assumption. Now we recall Tsybakov's noise condition, which describes the amount of noise in the labels. In order to motivate Tsybakov's assumption let us first observe that by equation (1) the function $|2\eta - 1|$ can be used to describe the noise in the labels of a distribution $P$. Indeed, in regions where this function is close to 1 there is only a small amount of noise, whereas function values close to 0 only occur in regions with a high level of noise. The following definition, in which we use the convention $t^\infty := 0$ for $t \in (0,1)$, describes the size of the latter regions:

Definition 2.2. Let $0 \le q \le \infty$ and $P$ be a probability measure on $X\times Y$. We say that $P$ has Tsybakov noise exponent $q$ if there exists a constant $C > 0$ such that for all sufficiently small $t > 0$ we have

$$P_X(\{x \in X : |2\eta(x) - 1| \le t\}) \le C \cdot t^q. \tag{5}$$

Obviously, $P$ has Tsybakov noise exponent $q > 0$ if and only if $|2\eta - 1|^{-1} \in L_{q,\infty}(P_X)$, where $L_{q,\infty}$ denotes a Lorentz space (see [5]). It is also easy to see that $P$ has Tsybakov noise exponent $q'$ for all $q' < q$ if $P$ has Tsybakov noise exponent $q$. Furthermore, all distributions obviously have noise exponent 0. In the other extreme case $q = \infty$ the conditional probability $\eta$ is bounded away from $1/2$. In particular, noise-free distributions have exponent $q = \infty$. Furthermore, for $q < \infty$ it is easy to check that Definition 2.2 is satisfied if and only if (5) holds for all $t > 0$ and a possibly different constant $C$. Finally, note that (5) does not make any assumptions on the location of the noisy set, and hence we prefer the notion "noise condition" rather than the often used term "margin condition."

2.4. A new geometric assumption for distributions. In this section we introduce a condition for distributions that will allow us to estimate the approximation error for Gaussian RBF kernels. To this end let $l$ be the hinge loss function and $P$ be a distribution on $X\times Y$. Let

$$\mathcal{R}_{l,P} := \inf\{\mathcal{R}_{l,P}(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$$

denote the smallest possible $l$-risk of $P$.
Since functions achieving the minimal $l$-risk occur in many situations, we indicate them by $f_{l,P}$ if no confusion regarding the nonuniqueness of this symbol can be expected. Furthermore, recall that $f_{l,P}$ has a shape similar to the Bayes decision function $\operatorname{sign} f_P$ (see, e.g., [30]). Now, given an RKHS $H$ over $X$ we define the approximation error function with respect to $H$ and $P$ by

$$a(\lambda) := \inf_{f \in H} \bigl(\lambda \|f\|_H^2 + \mathcal{R}_{l,P}(f) - \mathcal{R}_{l,P}\bigr), \qquad \lambda \ge 0. \tag{6}$$

Note that the obvious analogue of the approximation error function with offset is not greater than the above approximation error function without offset, and hence we restrict our attention to the latter for simplicity.
For $\lambda > 0$, the approximation error function describes how well $\lambda \|f_{P,\lambda}\|_H^2 + \mathcal{R}_{l,P}(f_{P,\lambda})$ approximates $\mathcal{R}_{l,P}$. For example, it was shown in [31] that we have $\lim_{\lambda \to 0} a(\lambda) = 0$ for all $P$ if $X$ is a compact metric space and $H$ is dense in the space of continuous functions $C(X)$. However, in nontrivial situations there cannot exist a convergence rate which holds uniformly for all distributions $P$. Since $H_\sigma(X)$ is dense in $C(X)$ for compact $X \subset \mathbb{R}^d$ and all $\sigma > 0$, these statements are in particular true for the approximation error functions $a_\sigma(\cdot)$ of the Gaussian RBF kernels with fixed width $1/\sigma$. Moreover, we are not aware of any weak condition on $\eta$ or $P$ that ensures $a_\sigma(\lambda) \preceq \lambda^\beta$ for $\lambda \to 0$ and some $\beta > 0$, and the results of [27] indicate that such behavior of $a_\sigma(\cdot)$ may actually require very restrictive conditions. In the following we will therefore present a condition on $P$ that allows us to estimate $a_\sigma(\lambda)$ by $\lambda$ and $\sigma$. In particular it will turn out that $a_\sigma(\lambda) \to 0$ with a polynomial rate in $\lambda$ if we relate $\sigma$ to $\lambda$ in a certain manner.

In order to introduce this assumption on $P$ we first define the classes of $P$ by $X_{-1} := \{x \in X : \eta(x) < \tfrac12\}$, $X_1 := \{x \in X : \eta(x) > \tfrac12\}$ and $X_0 := \{x \in X : \eta(x) = \tfrac12\}$ for some choice of $\eta$. Now we define a distance function $x \mapsto \tau_x$ by

$$\tau_x := \begin{cases} d(x, X_0 \cup X_1), & \text{if } x \in X_{-1}, \\ d(x, X_0 \cup X_{-1}), & \text{if } x \in X_1, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

where $d(x,A)$ denotes the distance of $x$ to a set $A$ with respect to the Euclidean norm. Roughly speaking, $\tau_x$ measures the distance of $x$ to the decision boundary. Now we can present the already announced geometric condition for distributions.

Definition 2.3. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a probability measure on $X\times Y$. We say that $P$ has geometric noise exponent $\alpha > 0$ if there exists a constant $C > 0$ such that

$$\int_X |2\eta(x) - 1| \exp\Bigl(-\frac{\tau_x^2}{t}\Bigr) P_X(dx) \le C t^{\alpha d/2}, \qquad t > 0. \tag{8}$$

We say that $P$ has geometric noise exponent $\infty$ if it has geometric noise exponent $\alpha$ for all $\alpha > 0$.
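To make the distance function $\tau_x$ from (7) tangible, the following sketch evaluates it on a finite one-dimensional grid. This is only a discrete illustration under stated assumptions: the set $X$ is replaced by a hypothetical list of grid points, $\eta$ is given by its values on that grid, and the distance to an empty set is taken to be infinite (the usual convention for an infimum over the empty set, not something the definition above spells out).

```python
def tau_on_grid(points, eta):
    """Discrete analogue of (7) on a 1-D grid (illustrative helper, not from the paper).

    points: grid locations in R; eta: values of P(y = 1 | x) at those locations.
    Returns tau_x for every grid point, where distances are taken to the
    nearest grid point lying in X_0 or in the opposite class.
    """
    # Class label per grid point: +1 on X_1, -1 on X_{-1}, 0 on X_0.
    signs = [(e > 0.5) - (e < 0.5) for e in eta]
    taus = []
    for x, s in zip(points, signs):
        if s == 0:
            taus.append(0.0)  # x lies in X_0, so tau_x = 0
            continue
        # Distances to every grid point that is not in the class of x.
        dists = [abs(x - z) for z, sz in zip(points, signs) if sz != s]
        taus.append(min(dists) if dists else float("inf"))
    return taus
```

For example, `tau_on_grid([-1.0, -0.5, 0.0, 0.5, 1.0], [0.1, 0.3, 0.5, 0.7, 0.9])` returns `[1.0, 0.5, 0.0, 0.5, 1.0]`: each point's distance to the single boundary point at 0.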
Note that in the above definition we neither make any kind of smoothness assumption nor do we assume a condition on $P_X$ in terms of absolute continuity with respect to the Lebesgue measure. Instead, the integral condition (8) describes the concentration of the measure $|2\eta - 1|\,dP_X$ near the decision boundary in the sense that the less the measure is concentrated in this region, the larger the geometric noise exponent can be chosen. The following example illustrates this.
Example 2.4. Since $\exp(-t) \le C_\alpha t^{-\alpha}$ holds for all $t > 0$ and a constant $C_\alpha > 0$ only depending on $\alpha > 0$, we easily see that (8) is satisfied whenever

$$(x \mapsto \tau_x^{-1}) \in L_{\alpha d}(|2\eta - 1|\,dP_X), \tag{9}$$

where $L_{\alpha d}(|2\eta - 1|\,dP_X)$ denotes the usual Lebesgue space of functions that are $\alpha d$-integrable with respect to the measure $|2\eta - 1|\,dP_X$. Now, let us suppose $X_0 = \emptyset$ for a moment. In this case $\tau_x$ measures the distance to the class $x$ does not belong to. In particular, (9) holds for $\alpha = \infty$ if and only if the two classes $X_{-1}$ and $X_1$ have strictly positive distance. Moreover, if (9) holds for some $0 < \alpha < \infty$ then the two classes may touch, that is, the decision boundary $\partial X_{-1} \cap \partial X_1$ is nonempty. Consequently, we can easily construct distributions $P$ that have geometric noise exponent $\infty$ and touching classes, but also satisfy $f_P \notin H_\sigma(X)$ for all $\sigma > 0$. However, note that for such $P$ the measure $|2\eta - 1|\,dP_X$ must obviously have a very low concentration near the decision boundary.

We now describe a simple regularity condition on $\eta$ near the decision boundary that can be used to guarantee a geometric noise exponent.

Definition 2.5. Let $X \subset \mathbb{R}^d$, $P$ be a distribution on $X\times Y$ and $\gamma > 0$. We say that $P$ has an envelope of order $\gamma$ if there is a constant $c_\gamma > 0$ such that for $P_X$-almost all $x \in X$ we have

$$|2\eta(x) - 1| \le c_\gamma \tau_x^\gamma. \tag{10}$$

Obviously, if $P$ has an envelope of order $\gamma$ then the graph of $x \mapsto 2\eta(x) - 1$ lies in a multiple of the envelope defined by $\tau_x^\gamma$ at the top and by $-\tau_x^\gamma$ at the bottom. Consequently, $\eta$ can be very irregular away from the decision boundary but cannot be discontinuous when crossing it. The rate of convergence of $\eta(x) \to 1/2$ for $\tau_x \to 0$ is described by $\gamma$.

Interestingly, for distributions having both an envelope of order $\gamma$ and a Tsybakov noise exponent $q$ we can bound the geometric noise exponent, as the following theorem, which is proved in Section 4, shows.

Theorem 2.6. Let $X \subset \mathbb{R}^d$ be compact and $P$ be a distribution on $X\times Y$ that has an envelope of order $\gamma > 0$ and a Tsybakov noise exponent $q \in [0,\infty)$. Then $P$ has geometric noise exponent $(q+1)\gamma d^{-1}$ if $q \ge 1$, and geometric noise exponent $\alpha$ for all $\alpha < (q+1)\gamma d^{-1}$ otherwise.
Now the main result of this subsection, which is proved in Section 4, shows that for distributions having a nontrivial geometric noise exponent we can bound the approximation error function for Gaussian RBF kernels.
Theorem 2.7. Let $\sigma > 0$, $X$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$ and $a_\sigma(\cdot)$ be the approximation error function with respect to $H_\sigma(X)$. Furthermore, let $P$ be a distribution on $X\times Y$ that has geometric noise exponent $0 < \alpha < \infty$ with constant $C$ in (8). Then there is a constant $c_d > 0$ depending only on the dimension $d$ such that for all $\lambda > 0$ we have

$$a_\sigma(\lambda) \le c_d\bigl(\sigma^d \lambda + C(2d)^{\alpha d/2} \sigma^{-\alpha d}\bigr). \tag{11}$$

In order to let the right-hand side of (11) converge to zero it is necessary to assume both $\lambda \to 0$ and $\sigma \to \infty$. An easy consideration shows that the fastest convergence rate is achieved if $\sigma(\lambda) := \lambda^{-1/((\alpha+1)d)}$. In this case we have $a_{\sigma(\lambda)}(\lambda) \preceq \lambda^{\alpha/(\alpha+1)}$. In particular, we can obtain rates up to linear order in $\lambda$ for sufficiently benign distributions. The price for this good approximation property is, however, an increasing complexity of the hypothesis class $B_{H_{\sigma(\lambda)}}$, as we have seen in Theorem 2.1.

2.5. Learning rates for SVMs using Gaussian RBF kernels. With the help of the geometric noise assumption we can now present our learning rates for SVMs using Gaussian RBF kernels. Note again that these polynomial rates do not require a smoothness assumption on $P$. Furthermore, note that we use the convention $\frac{a\cdot\infty + b}{c\cdot\infty + d} := \frac{a}{c}$ for $a, c \in (0,\infty)$, $b, d \in [0,\infty)$ in order to make the presentation compact.

Theorem 2.8. Let $X$ be the closed unit ball of $\mathbb{R}^d$, and $P$ be a distribution on $X\times Y$ with Tsybakov noise exponent $q \in [0,\infty]$ and geometric noise exponent $\alpha \in (0,\infty)$. We define

$$\beta := \begin{cases} \dfrac{\alpha}{2\alpha + 1}, & \text{if } \alpha \le \dfrac{q+2}{2q}, \\[2ex] \dfrac{2\alpha(q+1)}{2\alpha(q+2) + 3q + 4}, & \text{otherwise}, \end{cases}$$

and $\lambda_n := n^{-\frac{\alpha+1}{\alpha}\beta}$ and $\sigma_n := n^{\frac{\beta}{\alpha d}}$ in both cases. Then for all $\varepsilon > 0$ there exists a $C > 0$ such that for all $x \ge 1$ and $n \ge 1$ the SVM without offset using the Gaussian RBF kernel $k_{\sigma_n}$ satisfies

$$\mathrm{Pr}^*\bigl(T \in (X\times Y)^n : \mathcal{R}_P(f_{T,\lambda_n}) \le \mathcal{R}_P + C x^2 n^{-\beta+\varepsilon}\bigr) \ge 1 - e^{-x},$$

where $\mathrm{Pr}^*$ denotes the outer probability of $P^n$ in order to avoid measurability considerations. If $\alpha = \infty$ the latter inequality holds if $\sigma_n = \sigma$ is a constant with $\sigma > 2\sqrt{d}$. Finally, all results also hold for the SVM with offset.

Remark 2.9.
The above learning rates are faster than the parametric rate $n^{-1/2}$ if and only if $\alpha > (3q+4)/(2q)$. For $q = \infty$ the latter condition becomes $\alpha > 3/2$, and in an intermediate case $q = 1$ it becomes $\alpha > 7/2$.
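The exponent of Theorem 2.8 and the thresholds of Remark 2.9 are easy to check numerically. The sketch below computes $\beta$ for finite $\alpha > 0$ and $q \in [0,\infty]$, treating $q = \infty$ through the stated convention $\frac{a\cdot\infty+b}{c\cdot\infty+d} := \frac{a}{c}$ and $q = 0$ as always falling into the first case (since $(q+2)/(2q)$ is then infinite). It also returns $\lambda_n = n^{-\frac{\alpha+1}{\alpha}\beta}$ and $\sigma_n = n^{\beta/(\alpha d)}$. The function names are illustrative, and the formulas are the ones stated in Theorem 2.8, not an independent derivation.

```python
import math
from fractions import Fraction

def rate_exponent(alpha, q):
    """Exponent beta of Theorem 2.8 for finite alpha > 0 and q in [0, inf]."""
    alpha = Fraction(alpha)
    if q == math.inf:
        # Convention (a*inf + b)/(c*inf + d) := a/c; the threshold (q+2)/(2q) becomes 1/2.
        if alpha <= Fraction(1, 2):
            return alpha / (2 * alpha + 1)
        return 2 * alpha / (2 * alpha + 3)
    q = Fraction(q)
    # For q = 0 the threshold (q+2)/(2q) is infinite, so the first case applies.
    if q == 0 or alpha <= (q + 2) / (2 * q):
        return alpha / (2 * alpha + 1)
    return 2 * alpha * (q + 1) / (2 * alpha * (q + 2) + 3 * q + 4)

def tuning_parameters(n, alpha, q, d):
    """Return (lambda_n, sigma_n) as prescribed in Theorem 2.8."""
    beta = float(rate_exponent(alpha, q))
    lam = n ** (-(alpha + 1) / alpha * beta)
    sigma = n ** (beta / (alpha * d))
    return lam, sigma
```

With these choices both terms on the right-hand side of (11) are of the same order, $\sigma_n^d \lambda_n = \sigma_n^{-\alpha d} = n^{-\beta}$, which is the balancing behind $\sigma(\lambda) = \lambda^{-1/((\alpha+1)d)}$. At the thresholds of Remark 2.9 the exponent equals exactly $1/2$: `rate_exponent(Fraction(7, 2), 1)` and `rate_exponent(Fraction(3, 2), math.inf)` both return `Fraction(1, 2)`.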
Remark 2.10. It is important to note that our techniques can also be used to establish rates for other definitions of the sequences $(\lambda_n)$ and $(\sigma_n)$. In fact, Theorem 2.7 guarantees $a_{\sigma_n}(\lambda_n) \to 0$ (which is necessary for our techniques to produce any rate) if $\sigma_n \to \infty$ and $\sigma_n^d \lambda_n \to 0$. In particular, if $\lambda_n := n^{-\iota}$ and $\sigma_n := n^{\kappa}$ for some $\iota, \kappa > 0$ with $\kappa d < \iota$, these conditions are satisfied and a conceptually easy but technically involved modification of our proof can produce rates for certain ranges of $\iota$ (and thus $\kappa$). In order to keep the presentation as short as possible we have omitted the details and focused on the best possible rates.

Remark 2.11. Unfortunately, the choice of $\lambda_n$ and $\sigma_n$ that yields the optimal rates within our techniques requires knowing the values of $\alpha$ and $q$, which are typically not available. Adaptive methods which do not require such knowledge are still unknown.

Remark 2.12. Theorem 2.7 and Theorem 2.8 establish results for all distributions having some geometric noise exponent. However, for certain distributions of this type the resulting rates are not satisfactory. For example, consider the distribution $P$ on $X := [-1,1]$ whose marginal distribution $P_X$ equals the uniform distribution and whose conditional distribution $\eta(x) := P(y = 1 \mid x)$ satisfies $|2\eta(x) - 1| = |x|^\gamma$, $x \in X$, for some constant $\gamma \in (0,\infty)$. Then $P$ obviously has Tsybakov noise exponent $q := 1/\gamma$, and Theorem 2.6 or a simple modification of the proof of Theorem 2.7 shows that $P$ has geometric noise exponent $\alpha := 1 + \gamma$. Theorem 2.8 thus gives a rate of the form $n^{-\beta+\varepsilon}$ for

$$\beta = \frac{2q^2 + 4q + 2}{5q^2 + 10q + 4},$$

which is never faster than $n^{-1/2}$. Though this is disappointing at first glance, it is not really surprising since the proof of Theorem 2.7 is not tailored to distributions having such simple decision functions. We believe that sharper bounds on the approximation error function (and thus faster learning rates) for this and other distributions are possible, but a detailed analysis is beyond the scope of this paper.
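The example in Remark 2.12 can be checked against condition (8) numerically. The sketch below assumes that the decision boundary is the single point $0$, so that $\tau_x = |x|$ (this is an assumption of the illustration; the remark itself only specifies $|2\eta - 1|$). With $P_X$ uniform on $[-1,1]$ the left-hand side of (8) becomes $\int_0^1 x^\gamma e^{-x^2/t}\,dx$, which is at most $\int_0^\infty x^\gamma e^{-x^2/t}\,dx = \tfrac12\Gamma\bigl(\tfrac{\gamma+1}{2}\bigr)\, t^{(1+\gamma)/2}$, matching $\alpha d/2 = (1+\gamma)/2$ for $d = 1$. The midpoint rule and the function names are illustrative choices.

```python
import math

def geometric_noise_integral(gamma, t, steps=20_000):
    """Midpoint-rule approximation of the left-hand side of (8) for the example:
    int_{-1}^{1} |x|^gamma * exp(-x^2 / t) * (1/2) dx = int_0^1 x^gamma exp(-x^2/t) dx.
    Assumes tau_x = |x|, i.e. the decision boundary is {0}.
    """
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** gamma * math.exp(-x * x / t)
    return h * total

def bound_constant(gamma):
    """Constant C = Gamma((gamma + 1) / 2) / 2 from extending the integral to [0, inf)."""
    return 0.5 * math.gamma((gamma + 1) / 2)
```

Comparing `geometric_noise_integral(gamma, t)` with `bound_constant(gamma) * t ** ((1 + gamma) / 2)` over a range of `t` shows the integral staying below the bound up to quadrature error, which is consistent with geometric noise exponent $\alpha = 1 + \gamma$ in this one-dimensional example.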
Remark 2.13. Another interesting but open question is whether the obtained rates are optimal for the class of considered distributions. In order to approach this question let us consider the case $\alpha = \infty$, which roughly speaking describes the case of almost no approximation error. In this case our rates are essentially of the form $n^{-(q+1)/(q+2)}$, which coincides with the rates Tsybakov (see [35]) achieved for certain ERM classifiers based on hypothesis classes of small complexity. The latter rates in turn cannot be improved in a minimax sense for certain classes of distributions, as was also shown in [35]. This discussion indicates that the techniques used for the stochastic part of our analysis may be strong enough to produce optimal results. However, if we consider the case $\alpha < \infty$ then the approximation error function described
in Theorem 2.7 and its influence on the estimation error (see our proofs, in particular Section 5 and Section 7) have a significant impact on the obtained rates. Since the sharpness of Theorem 2.7 is unclear to us, we make no conjecture regarding the optimality of our rates in the general case.

3. Proof of Theorem 2.1. The main goal of this section is to prove Theorem 2.1, which is done in Section 3.2. To this end we provide in Section 3.1 some RKHS theory which is used throughout this work.

3.1. Some basic RKHS theory. For the proofs of this section we have to recall some basic facts from the theory of RKHSs. To this end let $X \subset \mathbb{R}^d$ be a compact subset and $k : X\times X \to \mathbb{R}$ be a continuous and positive semidefinite kernel with RKHS $H$. Then $H$ consists of continuous functions on $X$ and for $f \in H$ we have $\|f\|_\infty \le K \|f\|_H$, where

$$K := \sup_{x \in X} \sqrt{k(x,x)}. \tag{12}$$

Consequently, if the embedding of the RKHS $H$ into the space of continuous functions $C(X)$ is denoted by

$$J_H : H \to C(X) \tag{13}$$

we have $\|J_H\| \le K$. Furthermore, let us recall the representation of $H$ based on Mercer's theorem (see [13]). To this end let $K_X : L_2(X) \to L_2(X)$ be the integral operator defined by

$$K_X f(x) := \int_X k(x,x') f(x')\,dx', \qquad f \in L_2(X),\ x \in X, \tag{14}$$

where $L_2(X)$ denotes the $L_2$-space on $X$ with respect to the Lebesgue measure. Then it was shown in [13] that the unique square root $K_X^{1/2}$ of $K_X$ is an isometric isomorphism between $L_2(X)$ and $H$.

3.2. Proof of Theorem 2.1. In order to prove Theorem 2.1 we need the following result, which bounds the covering numbers of $H_\sigma(X)$ with respect to $C(X)$.

Theorem 3.1. Let $\sigma \ge 1$, $0 < p < 2$ and $X \subset \mathbb{R}^d$ be a compact subset with nonempty interior. Then there is a constant $c_{p,d} > 0$ independent of $\sigma$ such that for all $\varepsilon > 0$ we have

$$\log N(B_{H_\sigma(X)},\varepsilon,C(X)) \le c_{p,d}\, \sigma^{(1-p/4)d}\, \varepsilon^{-p}.$$

Proof. Let $B_d$ be the closed unit ball of the Euclidean space $\mathbb{R}^d$ and $\mathring{B}_d$ be its interior. Then there exists an $r \ge 1$ such that $X \subset rB_d$. Now,
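it was recently shown in [32] that the restrictions $H_\sigma(rB_d) \to H_\sigma(X)$ and $H_\sigma(rB_d) \to H_\sigma(\mathring{B}_d)$ are both isometric isomorphisms. Consequently, in the following we assume without loss of generality that $X = B_d$ or $X = \mathring{B}_d$ and do not concern ourselves with the distinction of both cases. Now let us write $H_\sigma := H_\sigma(X)$ and $J_\sigma := J_{H_\sigma} : H_\sigma \to C(X)$ in order to simplify notation. Furthermore, let $K_\sigma : L_2(X) \to L_2(X)$ be the integral operator of $k_\sigma$ defined as in (14), and let $\|\cdot\|$ denote the norm in $L_2(X)$. According to [13], Theorem 3, page 27, for any $f \in H_\sigma$, we obtain

$$\inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac1R \|K_\sigma^{-1/2} f\|^2 = \frac1R \|f\|_{H_\sigma}^2,$$

where we use the convention $\|K_\sigma^{-1} h\| = \infty$ if $h \notin K_\sigma L_2(X)$. Suppose now that $H \subset L_2(X)$ is a dense Hilbert space with $\|h\| \le \|h\|_H$, and that we have $K_\sigma : L_2(X) \to H \subset L_2(X)$ with $\|K_\sigma : L_2(X) \to H\| \le c_{\sigma,H} < \infty$ for some constant $c_{\sigma,H} > 0$. It follows that

$$\inf_{\|h\|_H \le c_{\sigma,H} R} \|f - h\| \le \inf_{\|K_\sigma^{-1} h\| \le R} \|f - h\| \le \frac1R \|f\|_{H_\sigma}^2$$

and hence

$$\inf_{\|h\|_H \le R} \|f - h\| \le \frac{c_{\sigma,H}}{R} \|f\|_{H_\sigma}^2.$$

By [27], Theorem 3.1 it follows that $f$ is contained in the real interpolation space $(L_2(X), H)_{1/2,\infty}$ (see [7] for the definition of an interpolation space) and its norm in this space satisfies $\|f\|_{1/2,\infty} \le 2\sqrt{c_{\sigma,H}}\, \|f\|_{H_\sigma}$. Therefore we obtain a continuous embedding $\Upsilon_1 : H_\sigma \to (L_2(X), H)_{1/2,\infty}$ with $\|\Upsilon_1\| \le 2\sqrt{c_{\sigma,H}}$. If in addition a subset inclusion $(L_2(X), H)_{1/2,\infty} \subset C(X)$ exists which defines a continuous embedding $\Upsilon_2 : (L_2(X), H)_{1/2,\infty} \to C(X)$, we have a factorization $J_\sigma = \Upsilon_2 \Upsilon_1$ and can conclude

$$\log N(B_{H_\sigma(X)},\varepsilon,C(X)) = \log N(J_\sigma,\varepsilon) \le \log N\Bigl(\Upsilon_2, \frac{\varepsilon}{2\sqrt{c_{\sigma,H}}}\Bigr). \tag{15}$$

Consequently, to bound $\log N(J_\sigma,\varepsilon)$ we need to select an $H$, compute $c_{\sigma,H}$ and bound $\log N(\Upsilon_2,\varepsilon)$. To that end let $H := W^m(\mathring{X})$ be the Sobolev space with norm

$$\|f\|_m^2 = \sum_{|\alpha| \le m} \|D^\alpha f\|^2,$$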
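where $|\alpha| := \sum_{i=1}^d \alpha_i$, $D^\alpha := \prod_{i=1}^d \partial_i^{\alpha_i}$, and $\partial_i^{\alpha_i}$ denotes the $\alpha_i$th partial derivative in the $i$th coordinate of $\mathbb{R}^d$. By the Cauchy-Schwarz inequality we obtain

$$\|D^\alpha K_\sigma f\|^2 \le \|f\|^2 \int_X \int_X |D_x^\alpha k_\sigma(x,x')|^2\,dx'\,dx, \tag{16}$$

where the notation $D_x^\alpha$ indicates that the differentiation takes place in the $x$ variable. To address the term $D_x^\alpha k_\sigma(x,x')$ we note that

$$D_x^\alpha(e^{-|x|^2}) = (-1)^{|\alpha|} e^{-|x|^2/2} h_\alpha(x),$$

where the multivariate Hermite functions $h_\alpha(x) = \prod_{i=1}^d h_{\alpha_i}(x_i)$ are products of the univariate functions. Since $\int_{\mathbb{R}} h_k^2(x)\,dx = 2^k k! \sqrt{\pi}$ (see, e.g., [11]) we obtain

$$\int_{\mathbb{R}^d} |D_x^\alpha(e^{-|x|^2})|^2\,dx = \int_{\mathbb{R}^d} e^{-|x|^2} h_\alpha^2(x)\,dx \le \int_{\mathbb{R}^d} h_\alpha^2(x)\,dx = 2^{|\alpha|} \alpha!\, \pi^{d/2}, \tag{17}$$

where we have used the definition $\alpha! := \prod_{i=1}^d \alpha_i!$. Applying the translation invariance of $k_\sigma$, we obtain

$$\int_{\mathbb{R}^d} |D_x^\alpha k_\sigma(x,x')|^2\,dx' = \int_{\mathbb{R}^d} |D_{x'}^\alpha k_\sigma(0,x')|^2\,dx' = \int_{\mathbb{R}^d} |D_{x'}^\alpha(e^{-\sigma^2 |x'|^2})|^2\,dx',$$

and by a change of variables we can apply inequality (17) to the integral on the right-hand side,

$$\int_{\mathbb{R}^d} |D_{x'}^\alpha(e^{-\sigma^2 |x'|^2})|^2\,dx' = \sigma^{2|\alpha| - d} \int_{\mathbb{R}^d} |D_{x'}^\alpha(e^{-|x'|^2})|^2\,dx' \le \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2}.$$

Hence we obtain

$$\int_X \int_X |D_x^\alpha k_\sigma(x,x')|^2\,dx'\,dx \le \theta(d)\, \sigma^{2|\alpha| - d}\, 2^{|\alpha|} \alpha!\, \pi^{d/2},$$

where $\theta(d)$ is the volume of $X$. Since $\sum_{|\alpha| \le m} \alpha! \le d^m m!^d$ and $\|K_\sigma f\|_m^2 = \sum_{|\alpha| \le m} \|D^\alpha K_\sigma f\|^2$ we can therefore infer from (16) that for $\sigma \ge 1$ we have

$$\|K_\sigma\| \le \theta(d)\, (2d)^{m/2}\, m!^{d/2}\, \sigma^{m - d/2} =: c_{\sigma,H}. \tag{18}$$

Now let us consider $\Upsilon_2 : (L_2(X), W^m(\mathring{X}))_{1/2,\infty} \to C(X)$. According to Triebel [34], page 267, we have

$$(L_2(X), W^m(\mathring{X}))_{1/2,\infty} = (L_2(\mathring{X}), W^m(\mathring{X}))_{1/2,\infty} = B_{2,\infty}^{m/2}(\mathring{X})$$

isomorphically. Furthermore

$$\log N\bigl(B_{2,\infty}^{m/2}(\mathring{X}) \to C(X), \varepsilon\bigr) \le c_{m,d}\, \varepsilon^{-2d/m} \tag{19}$$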
for $m > d$ follows from a similar result of Birman and Solomjak ([8], cf. also [34]) for Slobodeckij (i.e., fractional Sobolev) spaces, where the constant $c_{m,d}$ depends only on $m$ and $d$. Consequently we obtain from (15), (18) and (19) that

$$\log N(J_\sigma,\varepsilon) \le c_{m,d} \Bigl(\frac{\varepsilon}{2\sqrt{c_{\sigma,H}}}\Bigr)^{-2d/m} = c_{m,d}\, (4c_{\sigma,H})^{d/m}\, \varepsilon^{-2d/m} = \tilde c_{m,d}\, \sigma^{d - d^2/(2m)}\, \varepsilon^{-2d/m}$$

for all $m > d$ and new constants $\tilde c_{m,d}$ depending only on $m$ and $d$. Setting $m := 2d/p$ completes the proof of Theorem 3.1.

Proof of Theorem 2.1. As before we write $H_\sigma := H_\sigma(X)$ and $J_\sigma := J_{H_\sigma} : H_\sigma \to C(X)$ in order to simplify notation. Furthermore, recall for a training set $T \in (X\times Y)^n$ the space $L_2(T_X)$ introduced in Section 2.2. Now let $R_{T_X} : C(X) \to L_2(T_X)$ be the restriction map defined by $f \mapsto f_{|T_X}$. Obviously, we have $\|R_{T_X}\| \le 1$. Furthermore we define $I_\sigma := R_{T_X} \circ J_\sigma$ so that $I_\sigma : H_\sigma \to L_2(T_X)$ is the evaluation map. Then Theorem 3.1 and the product rule for covering numbers imply that

$$\sup_{T \in Z^n} \log N(I_\sigma,\varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q} \tag{20}$$

for all $0 < q < 2$. To complete the proof of Theorem 2.1 we derive another bound on the covering numbers and interpolate the two. To that end observe that $I_\sigma : H_\sigma \to L_2(T_X)$ factors through $C(X)$ with both factors $J_\sigma$ and $R_{T_X}$ having norm not greater than 1. Hence Proposition in [23] implies that $I_\sigma$ is absolutely 2-summing with 2-summing norm not greater than 1. By König's theorem ([24], Lemma 2.7.2) we obtain for the approximation numbers $(a_k(I_\sigma))$ of $I_\sigma$ that $\sum_{k \ge 1} a_k^2(I_\sigma) \le 1$ for all $\sigma > 0$. Since the approximation numbers are decreasing it follows that $\sup_k \sqrt{k}\, a_k(I_\sigma) \le 1$. Using Carl's inequality between approximation and entropy numbers (see Theorem in [10]) we thus find a constant $c > 0$ such that

$$\sup_{T \in Z^n} \log N(I_\sigma,\varepsilon) \le c\, \varepsilon^{-2} \tag{21}$$

for all $\varepsilon > 0$ and all $\sigma > 0$. Let us now interpolate the bound (21) with the bound (20). Since $\|I_\sigma : H_\sigma \to L_2(T_X)\| \le 1$ we only need to consider $0 < \varepsilon \le 1$. Let $0 < q < p < 2$ and $0 < a \le 1$. Then for $0 < \varepsilon < a$ we have

$$\log N(I_\sigma,\varepsilon) \le c_{q,d}\, \sigma^{(1-q/4)d}\, \varepsilon^{-q} \le c_{q,d}\, \sigma^{(1-q/4)d}\, a^{p-q}\, \varepsilon^{-p},$$

and for $a \le \varepsilon \le 1$ we find

$$\log N(I_\sigma,\varepsilon) \le c\, \varepsilon^{-2} \le c\, a^{p-2}\, \varepsilon^{-p}.$$
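Since $\sigma \ge 1$ we can set $a := \sigma^{-\frac{4-q}{8-4q} d}$ and obtain

$$\log N(I_\sigma,\varepsilon) \le \tilde c_{q,d}\, \sigma^{(1-p/2)\frac{8-2q}{8-4q} d}\, \varepsilon^{-p},$$

where $\tilde c_{q,d}$ is a constant depending only on $q, d$. The proof is completed by choosing $q := \frac{4\delta}{1+2\delta}$ when $\delta < \frac{2p}{8-4p}$ and $q$ just smaller than $p$ otherwise.

4. Proofs of Theorems 2.7 and 2.6. In this section we prove Theorems 2.7 and 2.6, which both deal with the geometric noise exponent.

4.1. Proof of Theorem 2.7. Let us begin by recalling some facts about Gaussian RBF kernels. To this end let $H_\sigma(\mathbb{R}^d)$ be the RKHS of the Gaussian RBF kernel with parameter $\sigma$. Then it was shown in [32] that the linear operator $V_\sigma : L_2(\mathbb{R}^d) \to H_\sigma(\mathbb{R}^d)$ defined by

$$V_\sigma g(x) = \frac{(2\sigma)^{d/2}}{\pi^{d/4}} \int_{\mathbb{R}^d} e^{-2\sigma^2 \|x - y\|_2^2}\, g(y)\,dy, \qquad g \in L_2(\mathbb{R}^d),\ x \in \mathbb{R}^d,$$

is an isometric isomorphism. Consequently, we obtain

$$a_\sigma(\lambda) = \inf_{g \in L_2(\mathbb{R}^d)} \lambda \|g\|_{L_2(\mathbb{R}^d)}^2 + \mathcal{R}_{l,P}(V_\sigma g) - \mathcal{R}_{l,P}, \qquad \lambda > 0. \tag{22}$$

In the following we will estimate the right-hand side of (22) by a judicious choice of $g$. To this end we need the following lemma, which in some sense enlarges the support of $P$ to ensure that all balls of the form $B(x,\tau_x)$ are contained in the (enlarged) support. This guarantee will then make it possible to control the behavior of $V_\sigma g$ by tails of spherical Gaussian distributions [see (28) for details].

Lemma 4.1. Let $X$ be the closed unit ball of $\mathbb{R}^d$ and $P$ be a probability measure on $X\times Y$ with regular conditional probability $\eta(x) = P(y = 1 \mid x)$, $x \in X$. On $\acute{X} := 3X$ we define

$$\acute\eta(x) = \begin{cases} \eta(x), & \text{if } |x| \le 1, \\ \eta\Bigl(\dfrac{x}{|x|}\Bigr), & \text{otherwise}. \end{cases} \tag{23}$$

We also write $\acute{X}_{-1} := \{x \in \acute{X} : \acute\eta(x) < \tfrac12\}$ and $\acute{X}_1 := \{x \in \acute{X} : \acute\eta(x) > \tfrac12\}$. Finally let $B(x,r)$ denote the open ball of radius $r$ about $x$ in $\mathbb{R}^d$. Then for $x \in X_1$ we have $B(x,\tau_x) \subset \acute{X}_1$ and for $x \in X_{-1}$ we have $B(x,\tau_x) \subset \acute{X}_{-1}$.

Proof. Let $x \in X_1$ and $x' \in B(x,\tau_x)$. If $x' \in X$ we have $|x - x'| < \tau_x$, which implies $\eta(x') > \tfrac12$ by the definition of $\tau_x$. This shows $x' \in \acute{X}_1$. Now let us assume $|x'| > 1$. By $\langle x, x' \rangle \le |x'|$ and Pythagoras' theorem we then obtain

$$\Bigl| x - \frac{x'}{|x'|} \Bigr|^2 \le \Bigl| x - \frac{\langle x, x' \rangle}{|x'|^2} x' \Bigr|^2 + \Bigl| \frac{\langle x, x' \rangle}{|x'|^2} x' - x' \Bigr|^2 = |x - x'|^2.$$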
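Therefore, we have $\bigl\|\frac{x'}{\|x'\|} - x\bigr\| < \tau_x$, which implies $\acute{\eta}(x') = \eta\bigl(\frac{x'}{\|x'\|}\bigr) > \tfrac12$.

Let us finally recall that Zhang showed in [39] that the hinge risk satisfies

$$R_{l,P}(f) - R_{l,P} = E_{P_X}\bigl(|2\eta - 1|\,|f - f_P|\bigr) \tag{24}$$

for all measurable $f : X \to [-1,1]$. Now we are ready to prove Theorem 2.7.

Proof of Theorem 2.7. With the notation of Lemma 4.1 we fix a measurable $\acute{f}_P : \acute{X} \to [-1,1]$ that satisfies $\acute{f}_P = 1$ on $\acute{X}_1$, $\acute{f}_P = -1$ on $\acute{X}_{-1}$ and $\acute{f}_P = 0$ otherwise. For $g := (\sigma^2/\pi)^{d/4} \acute{f}_P$ we then immediately obtain

$$\|g\|_{L_2(\mathbb{R}^d)} \le \Bigl(\frac{81\sigma^2}{\pi}\Bigr)^{d/4} \theta(d), \tag{25}$$

where $\theta(d)$ denotes the volume of $X$. Moreover, it is easy to see that $-1 \le \acute{f}_P \le 1$ implies $-1 \le V_\sigma g \le 1$. Since $P_X$ has support in $X$, (24) then yields

$$R_{l,P}(V_\sigma g) - R_{l,P} = E_{P_X}\bigl(|2\eta - 1|\,|V_\sigma g - f_P|\bigr). \tag{26}$$

In order to bound $|V_\sigma g(x) - f_P(x)|$ for $x \in X_1$ we observe

$$\begin{aligned} V_\sigma g(x) &= \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2\|x-y\|_2^2}\, \acute{f}_P(y)\,dy \\ &= \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\sigma^2\|x-y\|_2^2}\, (\acute{f}_P(y) + 1)\,dy - 1 \\ &\ge \Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2\|x-y\|_2^2}\, (\acute{f}_P(y) + 1)\,dy - 1. \end{aligned} \tag{27}$$

Now remember that Lemma 4.1 showed $B(x,\tau_x) \subset \acute{X}_1$ for all $x \in X_1$, so that (27) implies

$$V_\sigma g(x) \ge 2\Bigl(\frac{2\sigma^2}{\pi}\Bigr)^{d/2} \int_{B(x,\tau_x)} e^{-2\sigma^2\|x-y\|_2^2}\,dy - 1 = 1 - 2P_{\gamma_\sigma}(|u| \ge \tau_x), \tag{28}$$

where $\gamma_\sigma = (2\sigma^2/\pi)^{d/2} e^{-2\sigma^2|u|^2}\,du$ is a spherical Gaussian in $\mathbb{R}^d$. According to the tail bound [17], inequality (3.5) on page 59, we have $P_{\gamma_\sigma}(|u| \ge r) \le 4e^{-\sigma^2 r^2/(2d)}$ and consequently we obtain

$$1 \ge V_\sigma g(x) \ge 1 - 8e^{-\sigma^2\tau_x^2/(2d)}, \qquad x \in X_1.$$

Since for $x \in X_{-1}$ we can obtain an analogous estimate, we conclude

$$|V_\sigma g(x) - f_P(x)| \le 8e^{-\sigma^2\tau_x^2/(2d)}$$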
for all $x \in X_{-1} \cup X_1$. Consequently (26) and the geometric noise assumption for $t := \frac{2d}{\sigma^2}$ yield

$$R_{l,P}(V_\sigma g) - R_{l,P} \le 8E_{x \sim P_X}\bigl(|2\eta(x) - 1|\, e^{-\sigma^2\tau_x^2/(2d)}\bigr) \le 8C(2d)^{\alpha d/2}\sigma^{-\alpha d}, \tag{29}$$

where $C$ is the constant in (8). Combining (29), (25) and (22) now yields the assertion.

4.2. Proof of Theorem 2.6. In this subsection, all Lebesgue and Lorentz spaces (see, e.g., [5]) and their norms are with respect to the measure $P_X$.

Proof of Theorem 2.6. Let us first consider the case $q \ge 1$, where we can apply the Hölder inequality for Lorentz spaces [22], which states

$$\|fg\|_1 \le \|f\|_{q,\infty}\,\|g\|_{q',1}$$

for all $f \in L_{q,\infty}$, $g \in L_{q',1}$ and $q'$ defined by $\frac1q + \frac1{q'} = 1$. Applying this inequality gives

$$\begin{aligned} E_{x \sim P_X}\bigl(|2\eta(x) - 1|\, e^{-\tau_x^2/t}\bigr) &\le \|(2\eta - 1)^{-1}\|_{q,\infty}\, \bigl\|x \mapsto (2\eta(x) - 1)^2 e^{-\tau_x^2/t}\bigr\|_{q',1} \\ &\le C\, \bigl\|(2\eta - 1)^2 e^{-(|2\eta - 1|/c_\gamma)^{2/\gamma} t^{-1}}\bigr\|_{q',1}, \end{aligned} \tag{30}$$

where in the last estimate we used the Tsybakov assumption (5) and the fact that $P$ has an envelope of order $\gamma$. Let us write $h(x) := |2\eta(x) - 1|^{-1}$, $x \in X$, and $b := t(c_\gamma)^{2/\gamma}$ so that

$$|2\eta(x) - 1|^2 e^{-(|2\eta(x) - 1|/c_\gamma)^{2/\gamma} t^{-1}} = g(h(x)),$$

where $g(s) := s^{-2} e^{-s^{-2/\gamma}/b}$ for all $s \ge 1$. Now it is easy to see that $g : [1,\infty) \to [0,\infty)$ is strictly increasing if $0 < b \le \frac{2}{3\gamma}$, and hence we can extend $g$ to a strictly increasing, continuous and invertible function on $[0,\infty)$ in this case. Let such an extension also be denoted by $g$. Then for this extension we have

$$P_X(g \circ h > \tau) = P_X(h > g^{-1}(\tau)). \tag{31}$$

Now for a function $f : X \to [0,\infty)$ recall the nonincreasing rearrangement

$$f^*(u) := \inf\{\sigma \ge 0 : P_X(f > \sigma) \le u\}, \qquad u > 0,$$

of $f$, which can be used to define Lorentz norms (see, e.g., [5]). For $u > 0$ equation (31) then yields

$$(g \circ h)^*(u) = g\bigl(\inf\{g^{-1}(\sigma) : P_X(h > g^{-1}(\sigma)) \le u\}\bigr) = g \circ h^*(u).$$
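The identity $(g \circ h)^* = g \circ h^*$ can be checked by hand on a finite probability space. The sketch below is only an illustration under stated assumptions: the space has $n$ equally likely points, the sample values of $h$ are made up, and the map `g_example` is a generic strictly increasing function with $g(0) = 0$ chosen for the demonstration, not the specific $g$ from the proof.

```python
def rearrangement(values, u):
    """Nonincreasing rearrangement f*(u) = inf{s >= 0 : P(f > s) <= u}
    of a nonnegative function f on n equally likely points.

    `values` lists f on those points.
    """
    n = len(values)
    ordered = sorted(values, reverse=True)
    k = int(u * n)  # at most k points may lie strictly above the level
    # If every point may lie above the level, the infimum over s >= 0 is 0.
    return 0.0 if k >= n else ordered[k]


def g_example(s):
    """Hypothetical strictly increasing map on [0, inf) with g(0) = 0."""
    return s * s + s


h_values = [1.0, 2.5, 4.0, 1.5, 3.0]          # made-up values of h >= 1
gh_values = [g_example(v) for v in h_values]  # values of g o h

n = len(h_values)
# Use levels strictly between the jump points j/n to avoid rounding issues.
levels = [(j + 0.5) / n for j in range(n + 2)]
identity_holds = all(
    rearrangement(gh_values, u) == g_example(rearrangement(h_values, u))
    for u in levels
)
print(identity_holds)
```

Because a strictly increasing map preserves the ordering of the values, sorting before or after applying it picks out the same point, which is all the identity says in the discrete case.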
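Now, inequality (5) implies $P_X\bigl(h \ge (C/u)^{1/q}\bigr) \le u$ for all $u > 0$. Therefore, we find

$$h^*(u) \le \inf\{\sigma \ge 0 : P_X(h \ge \sigma) \le u\} \le \Bigl(\frac{C}{u}\Bigr)^{1/q}$$

for all $0 < u < 1$. Since $(g \circ h)^* = g \circ h^*$ and $g$ is increasing we hence have

$$(g \circ h)^*(u) \le g\Bigl(\Bigl(\frac{C}{u}\Bigr)^{1/q}\Bigr)$$

for all $0 < u < 1$. Now, for fixed $\hat\alpha > 0$ the bound $e^{-x} \preceq \frac{x^{-\hat\alpha}}{\ln^2(x) + 1}$ on $(0,\infty)$ implies

$$g(s) \preceq \frac{b^{\hat\alpha}\, s^{2(\hat\alpha/\gamma - 1)}}{\ln^2(s^{-2/\gamma} b^{-1}) + 1}$$

for $s \in [1,\infty)$. Using the fact that $(g \circ h)^*(u) = 0$ holds for all $u \ge 1$, we hence obtain

$$(g \circ h)^*(u) \preceq \frac{b^{\hat\alpha}\, u^{(2/q)(1 - \hat\alpha/\gamma)}}{\ln^2\bigl((u/C)^{2/(q\gamma)} b^{-1}\bigr) + 1}$$

for $u > 0$ if we assume without loss of generality that $C \ge 1$. Let us define $\hat\alpha := \gamma\,\frac{q+1}{2}$. Then we find $\frac{1}{q'} + \frac{2}{q}\bigl(1 - \frac{\hat\alpha}{\gamma}\bigr) = 0$ and consequently for $b \le \frac{2}{3\gamma}$, that is, $t \le \frac{2}{3\gamma(c_\gamma)^{2/\gamma}}$, we obtain

$$\|g \circ h\|_{q',1} = \int_0^\infty u^{1/q' - 1}(g \circ h)^*(u)\,du \preceq b^{\hat\alpha} \int_0^1 \frac{u^{-1}}{\ln^2\bigl((u/C)^{2/(q\gamma)} b^{-1}\bigr) + 1}\,du \preceq t^{\gamma(q+1)/2} \tag{32}$$

by the definition of $b$. Since we also have $E_{P_X}\bigl(|2\eta(x) - 1|\, e^{-\tau_x^2/t}\bigr) \le 1$ for all $t > 0$, estimate (30) together with the definition of $g$ and (32) yields the assertion in the case $q \ge 1$.

Let us now consider the case $0 \le q < 1$, where the Hölder inequality in Lorentz space cannot be used. Then for all $t, \tau > 0$ we have

$$\begin{aligned} E_{x \sim P_X}\bigl(|2\eta(x) - 1|\, e^{-\tau_x^2/t}\bigr) &= \int_{|2\eta - 1| \le \tau} |2\eta(x) - 1|\, e^{-\tau_x^2/t}\,P_X(dx) + \int_{|2\eta - 1| > \tau} |2\eta(x) - 1|\, e^{-\tau_x^2/t}\,P_X(dx) \\ &\le C\tau^{q+1} + \exp\Bigl(-\Bigl(\frac{\tau}{c_\gamma}\Bigr)^{2/\gamma} t^{-1}\Bigr), \end{aligned} \tag{33}$$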
where we have used the Tsybakov assumption (5) and the fact that $P$ has an envelope of order $\gamma$. Let us define $\tau$ by $\tau^{q+1} := \exp\bigl(-(\tau/c_\gamma)^{2/\gamma} t^{-1}\bigr)$. For $\hat a := (c_\gamma)^{2/\gamma}(q + 1)$ and small $t$ this definition implies

$$\tau \le \Bigl(\frac{\hat a \gamma}{2}\Bigr)^{\gamma/2} \Bigl(t \ln\frac{1}{\hat a t}\Bigr)^{\gamma/2},$$

and hence the assertion follows from (33) for the case $0 < q < 1$.

5. The estimation error of ERM-type classifiers. To bound the estimation error in the proof of Theorem 2.8 we now establish a concentration inequality for ERM-type algorithms using a variant of Talagrand's concentration inequality together with local Rademacher averages (see, e.g., [2, 4, 21]). Our approach is inspired by [3]. However, due to the regularization term $\lambda\|f\|_H^2$ in the definition of SVMs we need a more general result than that of [3]. This section is organized as follows: In Section 5.1 we present the required modification of the result of [3]. Then in Section 5.2 we bound the resulting local Rademacher averages.

5.1. Bounding the estimation error for ERM-type algorithms. We first have to introduce some notation. To this end let $F$ be a class of bounded measurable functions from $Z$ to $\mathbb{R}$ such that $F$ is separable with respect to $\|\cdot\|_\infty$. Given a probability measure $P$ on $Z$ we define the modulus of continuity of $F$ by

$$\omega_n(F,\varepsilon) := \omega_{P,n}(F,\varepsilon) := E_{T \sim P^n}\Bigl(\sup_{f \in F,\ E_P f^2 \le \varepsilon} |E_P f - E_T f|\Bigr), \qquad \varepsilon > 0,$$

where we note that the supremum is, as a function from $Z^n$ to $\mathbb{R}$, measurable by the separability assumption on $F$. Now, a function $L : F \times Z \to [0,\infty)$ is called a loss function if $L \circ f := L(f,\cdot)$ is measurable for all $f \in F$. Given a probability measure $P$ on $Z$ we indicate by $f_{P,F} \in F$ a minimizer of

$$f \mapsto R_{L,P}(f) := E_{z \sim P} L(f,z).$$

Throughout this paper $R_{L,P}(f)$ is called the $L$-risk of $f$. If $P$ is an empirical measure with respect to $T \in Z^n$ we write $f_{T,F}$ and $R_{L,T}(\cdot)$ as usual. For simplicity, we assume throughout this section that $f_{P,F}$ and $f_{T,F}$ do exist. Furthermore, although there may be multiple solutions we use a single symbol for them whenever no confusion regarding the nonuniqueness of this symbol can be expected.
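To make the notation $R_{L,T}(f)$ and $f_{T,F}$ concrete, here is a minimal sketch that evaluates the empirical $L$-risk and picks an empirical minimizer over a small finite class. Everything specific in it is an assumption made for illustration: the three threshold functions standing in for $F$, the four-point sample standing in for $T$, and the use of the plain hinge loss as $L$. A finite class is not convex, so this toy setting does not satisfy the hypotheses of the results that follow.

```python
def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)


def empirical_risk(loss, f, sample):
    """R_{L,T}(f): average of L(f, z) over the sample T."""
    return sum(loss(f, z) for z in sample) / len(sample)


def erm(loss, function_class, sample):
    """Return an empirical L-risk minimizer f_{T,F} (first one on ties)."""
    return min(function_class, key=lambda f: empirical_risk(loss, f, sample))


def make_threshold(theta):
    """Hypothetical classifier x -> +1 if x >= theta else -1."""
    return lambda x: 1.0 if x >= theta else -1.0


F = [make_threshold(theta) for theta in (-1.0, 0.0, 1.0)]  # toy class
T = [(-2.0, -1), (-0.5, -1), (0.5, 1), (2.0, 1)]           # toy sample z = (x, y)


def L(f, z):
    """Loss function L(f, z) for z = (x, y), here the hinge loss."""
    x, y = z
    return hinge(y, f(x))


risks = [empirical_risk(L, f, T) for f in F]
f_T = erm(L, F, T)
print(risks)
```

The middle threshold classifies all four toy points correctly, so it is the empirical minimizer here.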
An algorithm that produces solutions $f_{T,F}$ is called an empirical $L$-risk minimizer. Moreover, if $F$ is convex, we say that $L$ is convex if $L(\cdot,z)$ is convex for all $z \in Z$. Finally, $L$ is called line-continuous
if for all $z \in Z$ and all $f, \hat f \in F$ the function $t \mapsto L(tf + (1 - t)\hat f, z)$ is continuous on $[0,1]$. If $F$ is a vector space then every convex $L$ is line-continuous. Now the main result of this section reads as follows:

Theorem 5.1. Let $F$ be a convex set of bounded measurable functions from $Z$ to $\mathbb{R}$, and let $L : F \times Z \to [0,\infty)$ be a convex and line-continuous loss function. For a probability measure $P$ on $Z$ we define

$$G := \{L \circ f - L \circ f_{P,F} : f \in F\}.$$

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in G$. Furthermore, assume that $G$ is separable with respect to $\|\cdot\|_\infty$. Let $n \ge 1$, $x \ge 1$ and $\varepsilon > 0$ with

$$\varepsilon \ge 10\max\Bigl\{\omega_n(G, c\varepsilon^\alpha + \delta),\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n}\Bigr\}. \tag{34}$$

Then we have

$$\Pr{}^*\bigl(T \in Z^n : R_{L,P}(f_{T,F}) < R_{L,P}(f_{P,F}) + \varepsilon\bigr) \ge 1 - e^{-x}.$$

Remark 5.2. Theorem 5.1 has been proved in [3] for $\delta = 0$, where it was used to find learning rates faster than $n^{-1/2}$ for certain ERM-type algorithms. At first glance such fast rates are impossible if $\delta > 0$. However, we will see later that for SVMs we have $\delta = a_\sigma^\kappa(\lambda)$ for a suitable $\kappa > 0$ depending on both Tsybakov's and the geometric noise exponent, and hence we have $\delta \to 0$ for $n \to \infty$.

As already mentioned, the proof of Theorem 5.1 is based on Talagrand's concentration inequality in [33] and its refinements in [16, 20, 25]. The version below of this inequality is derived from Bousquet's result in [9] using a little trick presented in [2], Lemma 2.5.

Theorem 5.3. Let $P$ be a probability measure on $Z$ and $H$ be a set of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$ and satisfies $E_P h = 0$ for all $h \in H$. Furthermore, let $b > 0$ and $\tau \ge 0$ be constants with $\|h\|_\infty \le b$ and $E_P h^2 \le \tau$ for all $h \in H$. Then for all $x \ge 1$ and all $n \ge 1$ we have

$$P^n\Bigl(T \in Z^n : \sup_{h \in H} E_T h > 3E_{T' \sim P^n} \sup_{h \in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}\Bigr) \le e^{-x}.$$

This concentration inequality is used to prove the following lemma, which is a generalized version of Lemma 13 in [3].
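Lemma 5.4. Let $P$ be a probability measure on $Z$ and $G$ be a set of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$. Let $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ be constants with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in G$. Furthermore, assume that for all $T \in Z^n$ and all $\varepsilon > 0$ for which for some $g \in G$ we have

$$E_T g \le \varepsilon/20 \quad\text{and}\quad E_P g \ge \varepsilon,$$

there exists a $g^* \in G$ which satisfies $E_T g^* \le \varepsilon/20$ and $E_P g^* = \varepsilon$. Then for all $n \ge 1$, $x \ge 1$, and all $\varepsilon > 0$ satisfying (34), we have

$$\Pr{}^*\bigl(T \in Z^n : \text{for all } g \in G \text{ with } E_T g \le \varepsilon/20 \text{ we have } E_P g < \varepsilon\bigr) \ge 1 - e^{-x}.$$

Proof. We define $H := \{E_P g - g : g \in G,\ E_P g = \varepsilon\}$. Obviously, we have $E_P h = 0$, $\|h\|_\infty \le 2B$, and $E_P h^2 = E_P g^2 - (E_P g)^2 \le c\varepsilon^\alpha + \delta$ for all $h \in H$. Moreover, since it is also easy to verify that $H$ is separable with respect to $\|\cdot\|_\infty$, our assumption on $G$ yields

$$\begin{aligned} &\Pr{}^*\bigl(T \in Z^n : \exists g \in G \text{ with } E_T g \le \varepsilon/20 \text{ and } E_P g \ge \varepsilon\bigr) \\ &\quad\le \Pr{}^*\bigl(T \in Z^n : \exists g \in G \text{ with } E_P g - E_T g \ge 19\varepsilon/20 \text{ and } E_P g = \varepsilon\bigr) \\ &\quad\le P^n\Bigl(T \in Z^n : \sup_{h \in H} E_T h \ge 19\varepsilon/20\Bigr). \end{aligned}$$

Note that since $H$ is separable with respect to $\|\cdot\|_\infty$, the set on the last line is actually measurable. In order to bound the last probability we will apply Theorem 5.3. To this end we have to show

$$\frac{19\varepsilon}{20} > 3E_{T' \sim P^n} \sup_{h \in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}.$$

Our assumptions on $\varepsilon$ imply

$$\varepsilon \ge 10E_{T' \sim P^n}\Bigl(\sup_{g \in G,\ E_P g^2 \le c\varepsilon^\alpha + \delta} |E_P g - E_{T'} g|\Bigr) \ge 10E_{T' \sim P^n} \sup_{h \in H} E_{T'} h. \tag{35}$$

Furthermore, since $10 \ge (\frac{60}{19})^2$ and $0 < \alpha \le 1$ we have

$$\varepsilon \ge 10\Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)} \ge \Bigl(\frac{60}{19}\Bigr)^{2/(2-\alpha)} \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)}. \tag{36}$$

If $\delta \le c\varepsilon^\alpha$ a simple calculation hence shows $\varepsilon \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}$. Furthermore, if $\delta > c\varepsilon^\alpha$ the assumptions of the theorem show

$$\varepsilon \ge 10\sqrt{\frac{\delta x}{n}} \ge \frac{60}{19}\sqrt{\frac{4\delta x}{n}} \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}.$$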
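Hence we have $\varepsilon \ge \frac{60}{19}\sqrt{\frac{2(c\varepsilon^\alpha + \delta)x}{n}}$ for all $\varepsilon$ satisfying the assumptions of the theorem. Now let $\tau := c\varepsilon^\alpha + \delta$ and $b := 2B$. By (35) and $\varepsilon \ge 10\frac{Bx}{n}$ we then find

$$\frac{19\varepsilon}{20} > 3E_{T' \sim P^n} \sup_{h \in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}.$$

Applying Theorem 5.3 then yields

$$\begin{aligned} &\Pr{}^*\bigl(T \in Z^n : \exists g \in G \text{ with } E_T g \le \varepsilon/20 \text{ and } E_P g \ge \varepsilon\bigr) \\ &\quad\le P^n\Bigl(T \in Z^n : \sup_{h \in H} E_T h \ge 19\varepsilon/20\Bigr) \\ &\quad\le P^n\Bigl(T \in Z^n : \sup_{h \in H} E_T h > 3E_{T' \sim P^n} \sup_{h \in H} E_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{bx}{n}\Bigr) \\ &\quad\le e^{-x}. \end{aligned}$$

With the help of the above lemma we can now prove the main result of this section, that is, Theorem 5.1.

Proof of Theorem 5.1. In order to apply Lemma 5.4 to the class $G$ it obviously suffices to show the richness condition on $G$ of Lemma 5.4. To this end let $f \in F$ with $E_T(L \circ f - L \circ f_{P,F}) \le \varepsilon/20$ and $E_P(L \circ f - L \circ f_{P,F}) \ge \varepsilon$. For $t \in [0,1]$ we define $f_t := tf + (1 - t)f_{P,F}$. Since $F$ is convex we have $f_t \in F$ for all $t \in [0,1]$. By the line-continuity of $L$ and Lebesgue's theorem we find that the map $h : t \mapsto E_P(L \circ f_t - L \circ f_{P,F})$, which maps from $[0,1]$ to $[0,B]$, is continuous. Since $h(0) = 0$ and $h(1) \ge \varepsilon$ there is a $t \in (0,1]$ with

$$E_P(L \circ f_t - L \circ f_{P,F}) = h(t) = \varepsilon$$

by the intermediate value theorem. Moreover, for this $t$ we have

$$E_T(L \circ f_t - L \circ f_{P,F}) \le E_T\bigl(tL \circ f + (1 - t)L \circ f_{P,F} - L \circ f_{P,F}\bigr) \le \varepsilon/20.$$

Now, let $\varepsilon > 0$ with $\varepsilon \ge 10\max\bigl\{\omega_n(G, c\varepsilon^\alpha + \delta), (\frac{\delta x}{n})^{1/2}, (\frac{4cx}{n})^{1/(2-\alpha)}, \frac{Bx}{n}\bigr\}$. Then by Lemma 5.4 we find that with probability at least $1 - e^{-x}$, every $f \in F$ with $E_T(L \circ f - L \circ f_{P,F}) \le \varepsilon/20$ satisfies $E_P(L \circ f - L \circ f_{P,F}) < \varepsilon$. Since we always have

$$E_T(L \circ f_{T,F} - L \circ f_{P,F}) \le 0 < \varepsilon/20,$$

we obtain the assertion.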
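Condition (34) is implicit in $\varepsilon$, because $\varepsilon$ also appears inside the modulus of continuity. When an upper bound on $\omega_n(G,\cdot)$ is available, an admissible $\varepsilon$ can be found numerically. The sketch below is one possible way to do this and is not taken from the paper: the names `rhs_34` and `admissible_eps` are made up, the monotone iteration assumes that the supplied bound `omega` is nondecreasing and grows sublinearly, and the bound $r \mapsto \sqrt{r/n}$ used in the example is purely hypothetical.

```python
import math


def rhs_34(eps, omega, n, x, c, alpha, delta, B):
    """Right-hand side of condition (34), with omega(r) standing in for
    (an upper bound on) the modulus of continuity omega_n(G, r)."""
    return 10.0 * max(
        omega(c * eps ** alpha + delta),
        math.sqrt(delta * x / n),
        (4.0 * c * x / n) ** (1.0 / (2.0 - alpha)),
        B * x / n,
    )


def admissible_eps(omega, n, x, c, alpha, delta, B, iters=200):
    """Return some eps > 0 with eps >= rhs_34(eps), found by iterating
    eps <- rhs_34(eps) downward from a feasible starting point.

    Assumes omega is nondecreasing and sublinear, so that the doubling
    loop terminates and the iteration decreases monotonically.
    """
    def rhs(e):
        return rhs_34(e, omega, n, x, c, alpha, delta, B)

    eps = 1.0
    while eps < rhs(eps):  # find a feasible start
        eps *= 2.0
    for _ in range(iters):
        candidate = rhs(eps)
        # Stop once no progress is made or feasibility would be lost.
        if candidate >= eps or rhs(candidate) > candidate:
            break
        eps = candidate
    return eps


# Hypothetical example: n = 10_000 and an assumed bound omega(r) = sqrt(r / n).
n_ex = 10_000
params = dict(n=n_ex, x=1.0, c=1.0, alpha=1.0, delta=0.0, B=1.0)
eps_ex = admissible_eps(lambda r: math.sqrt(r / n_ex), **params)
print(eps_ex)
```

In this example the first term of the maximum dominates, and the iteration settles near the solution of $\varepsilon = 10\sqrt{\varepsilon/n}$, that is, $\varepsilon = 100/n$.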
5.2. Bounding the modulus of continuity. The aim of this subsection is to bound the modulus of continuity of the class $G$ in Theorem 5.1 with the help of covering numbers. We then present the resulting modification of Theorem 5.1.

Let us begin by recalling the definition of (local) Rademacher averages. To this end let $F$ be a class of bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to $\|\cdot\|_\infty$. Furthermore, let $P$ be a probability measure on $Z$ and $(\varepsilon_i)$ be a sequence of i.i.d. Rademacher variables (i.e., symmetric $\{-1,1\}$-valued random variables) with respect to some probability measure $\mu$ on a set $\Omega$. Then the Rademacher average of $F$ is

$$\mathrm{Rad}_P(F,n) := \mathrm{Rad}(F,n) := E_{P^n} E_\mu \sup_{f \in F} \Bigl|\frac1n \sum_{i=1}^n \varepsilon_i f(z_i)\Bigr|,$$

and for $\varepsilon > 0$ the local Rademacher average of $F$ is defined by

$$\mathrm{Rad}(F,n,\varepsilon) := \mathrm{Rad}_P(F,n,\varepsilon) := E_{P^n} E_\mu \sup_{f \in F,\ E_P f^2 \le \varepsilon} \Bigl|\frac1n \sum_{i=1}^n \varepsilon_i f(z_i)\Bigr|.$$

For a given $a > 0$ we immediately obtain $\mathrm{Rad}(aF,n) = a\,\mathrm{Rad}(F,n)$ and

$$\mathrm{Rad}(aF,n,\varepsilon) = a\,\mathrm{Rad}(F,n,a^{-2}\varepsilon). \tag{37}$$

Moreover, by symmetrization the modulus of continuity can be estimated by the local Rademacher average. More precisely, we always have (see [36])

$$\omega_{P,n}(F,\varepsilon) \le 2\,\mathrm{Rad}_P(F,n,\varepsilon), \qquad \varepsilon > 0.$$

Local Rademacher averages can be estimated by covering numbers. Without proof we state a slight modification of a corresponding result in [21]:

Proposition 5.5. Let $F$ be a class of measurable functions from $Z$ to $[-1,1]$ which is separable with respect to $\|\cdot\|_\infty$ and let $P$ be a probability measure on $Z$. Assume there are constants $a > 0$ and $0 < p < 2$ with

$$\sup_{T \in Z^n} \log N(F,\varepsilon,L_2(T)) \le a\varepsilon^{-p}$$

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such that for all $n \ge 1$ and all $\varepsilon > 0$ we have

$$\mathrm{Rad}(F,n,\varepsilon) \le c_p \max\Bigl\{\varepsilon^{1/2 - p/4}\Bigl(\frac{a}{n}\Bigr)^{1/2},\ \Bigl(\frac{a}{n}\Bigr)^{2/(2+p)}\Bigr\}.$$

Using this proposition we can replace the modulus of continuity in Theorem 5.1 by an assumption on the covering numbers of $G$. Assuming that all resulting minimizers exist, the corresponding result then reads as follows:
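Theorem 5.6. Let $F$ be a convex set of bounded measurable functions from $Z$ to $\mathbb{R}$ and let $L : F \times Z \to [0,\infty)$ be a convex and line-continuous loss function. For a probability measure $P$ on $Z$ we define

$$G := \{L \circ f - L \circ f_{P,F} : f \in F\}.$$

Suppose that there are constants $c \ge 0$, $0 < \alpha \le 1$, $\delta \ge 0$ and $B > 0$ with $E_P g^2 \le c(E_P g)^\alpha + \delta$ and $\|g\|_\infty \le B$ for all $g \in G$. Furthermore, assume that $G$ is separable with respect to $\|\cdot\|_\infty$ and that there are constants $a \ge 1$ and $0 < p < 2$ with

$$\sup_{T \in Z^n} \log N(B^{-1}G,\varepsilon,L_2(T)) \le a\varepsilon^{-p} \tag{38}$$

for all $\varepsilon > 0$. Then there exists a constant $c_p > 0$ depending only on $p$ such that for all $n \ge 1$ and all $x \ge 1$ we have

$$\Pr{}^*\bigl(T \in Z^n : R_{L,P}(f_{T,F}) > R_{L,P}(f_{P,F}) + c_p\,\varepsilon(n,a,B,c,\delta,x)\bigr) \le e^{-x},$$

where

$$\begin{aligned} \varepsilon(n,a,B,c,\delta,x) :={}& B^{2p/(4-2\alpha+\alpha p)}\, c^{(2-p)/(4-2\alpha+\alpha p)} \Bigl(\frac{a}{n}\Bigr)^{2/(4-2\alpha+\alpha p)} + B^{p/2}\delta^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2} \\ &+ B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \Bigl(\frac{cx}{n}\Bigr)^{1/(2-\alpha)} + \frac{Bx}{n}. \end{aligned}$$

Proof. By (37) and Proposition 5.5 we find

$$\mathrm{Rad}(G,n,\varepsilon) \le c_p \max\Bigl\{B^{p/2}\varepsilon^{1/2 - p/4}\Bigl(\frac{a}{n}\Bigr)^{1/2},\ B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)}\Bigr\}.$$

We assume without loss of generality that $c_p \ge 5$. Let $\varepsilon^* > 0$ be the largest real number that satisfies

$$\varepsilon^* = 2c_p B^{p/2}\bigl(c(\varepsilon^*)^\alpha + \delta\bigr)^{1/2 - p/4}\Bigl(\frac{a}{n}\Bigr)^{1/2}. \tag{39}$$

Furthermore, let $\varepsilon > 0$ be such that

$$\varepsilon = 2c_p \max\Bigl\{B^{p/2}(c\varepsilon^\alpha + \delta)^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2},\ B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n}\Bigr\}.$$

It is easy to see that both $\varepsilon$ and $\varepsilon^*$ exist. Moreover, our above considerations show $\varepsilon \ge 10\max\bigl\{\omega_n(G, c\varepsilon^\alpha + \delta), (\frac{\delta x}{n})^{1/2}, (\frac{4cx}{n})^{1/(2-\alpha)}, \frac{Bx}{n}\bigr\}$, that is, $\varepsilon$ satisfies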
the assumptions of Theorem 5.1. In order to show the assertion it therefore suffices to bound $\varepsilon$ from above. To this end let us first assume that

$$B^{p/2}(c\varepsilon^\alpha + \delta)^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2} \ge \max\Bigl\{B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)},\ \sqrt{\frac{\delta x}{n}},\ \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)},\ \frac{Bx}{n}\Bigr\}.$$

Then we have $\varepsilon = 2c_p B^{p/2}(c\varepsilon^\alpha + \delta)^{(2-p)/4}(\frac{a}{n})^{1/2}$. Since $\varepsilon^*$ is the largest solution of this equation we hence find $\varepsilon \le \varepsilon^*$. This shows that we always have

$$\varepsilon \le \varepsilon^* + 2c_p\Bigl(B\Bigl(\frac{a}{n}\Bigr)^{2/(2+p)} + \sqrt{\frac{\delta x}{n}} + \Bigl(\frac{4cx}{n}\Bigr)^{1/(2-\alpha)} + \frac{Bx}{n}\Bigr).$$

Hence it suffices to bound $\varepsilon^*$ from above. To this end let us first assume $c(\varepsilon^*)^\alpha \ge \delta$. This implies $\varepsilon^* \le 4c_p B^{p/2}\bigl(c(\varepsilon^*)^\alpha\bigr)^{1/2 - p/4}(\frac{a}{n})^{1/2}$, and hence we find

$$\varepsilon^* \le 16c_p^2\, B^{2p/(4-2\alpha+\alpha p)}\, c^{(2-p)/(4-2\alpha+\alpha p)} \Bigl(\frac{a}{n}\Bigr)^{2/(4-2\alpha+\alpha p)}.$$

Conversely, if $c(\varepsilon^*)^\alpha < \delta$ holds, then we immediately obtain

$$\varepsilon^* < 4c_p B^{p/2}\delta^{(2-p)/4}\Bigl(\frac{a}{n}\Bigr)^{1/2}.$$

6. Variance bounds for SVMs. In this section we prove some variance bounds in the sense of Theorem 5.6 for SVMs. Let us first ensure that these classifiers are ERM-type algorithms that fit into the framework of Theorem 5.6. To this end let $H$ be a RKHS of a continuous kernel over $X$, $\lambda > 0$, and $l : Y \times \mathbb{R} \to [0,\infty)$ be the hinge loss function. We define

$$L(f,x,y) := \lambda\|f\|_H^2 + l(y,f(x)) \tag{40}$$

and

$$L(f,b,x,y) := \lambda\|f\|_H^2 + l(y,f(x) + b) \tag{41}$$

for all $f \in H$, $b \in \mathbb{R}$, $x \in X$ and $y \in Y$. Then $R_{L,T}(\cdot)$ and $R_{L,T}(\cdot,\cdot)$ obviously coincide with the objective functions of the SVM formulations and therefore SVMs are empirical $L$-risk minimizers. Furthermore note that all above minimizers exist (see [31]) and thus the SVM formulations in terms of $L$ actually fit into the framework of Theorem 5.6. In the following, $f_{l,P}$ denotes a minimizer of $R_{l,P}$ if no confusion can arise. For the shape of these minimizers, which depend on $\eta := P(y = 1 \mid \cdot)$, we refer to [39] and [30]. Now our first result is a variance bound which can be used when considering the empirical $l$-risk minimizer.
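Before turning to that bound, the losses (40) and (41) can be made concrete with a short sketch that evaluates $R_{L,T}$ for a function given as a finite kernel expansion $f = \sum_i \alpha_i k(x_i,\cdot)$, for which $\|f\|_H^2 = \alpha^\top K \alpha$ with the kernel matrix $K$. The function names, the toy samples and the kernel parametrization $k_\sigma(x,x') = \exp(-\sigma^2\|x - x'\|_2^2)$ are assumptions made for this illustration; the sketch only evaluates the objective and does not solve the SVM optimization problem.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian RBF kernel exp(-sigma^2 ||x - y||^2) (assumed convention)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-(sigma ** 2) * sq_dist)


def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)


def svm_objective(alpha, sample, lam, sigma, b=0.0):
    """Empirical L-risk for f = sum_i alpha_i k(x_i, .):
    lam * ||f||_H^2 + (1/n) * sum_i hinge(y_i, f(x_i) + b).

    With b = 0 this corresponds to (40), with an offset b to (41).
    """
    xs = [x for x, _ in sample]
    n = len(sample)
    K = [[gaussian_kernel(u, v, sigma) for v in xs] for u in xs]
    f_vals = [sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]
    norm_sq = sum(alpha[i] * f_vals[i] for i in range(n))  # alpha^T K alpha
    avg_hinge = sum(hinge(y, f_vals[i] + b) for i, (_, y) in enumerate(sample)) / n
    return lam * norm_sq + avg_hinge


# Toy checks: f = 0 has empirical hinge risk 1 and RKHS norm 0.
toy = [((0.0,), 1), ((1.0,), -1)]
obj_zero = svm_objective([0.0, 0.0], toy, lam=0.5, sigma=1.0)
# A single point fitted exactly: f(x_1) = 1, so only the penalty remains.
obj_single = svm_objective([1.0], [((0.0,), 1)], lam=0.5, sigma=2.0)
print(obj_zero, obj_single)
```

The second check shows the trade-off that the regularization term introduces: the data are fitted with zero hinge loss, yet the objective is still $\lambda\|f\|_H^2 = 0.5$.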
Lemma 6.1. Let $P$ be a distribution on $X \times Y$ with Tsybakov noise exponent $0 \le q \le \infty$. Then there exists a minimizer $f_{l,P}$ mapping into $[-1,1]$ such that for all bounded measurable functions $f : X \to \mathbb{R}$ we have

$$E_P(l \circ f - l \circ f_{l,P})^2 \le C_{\eta,q}\,(\|f\|_\infty + 1)^{(q+2)/(q+1)}\,\bigl(E_P(l \circ f - l \circ f_{l,P})\bigr)^{q/(q+1)},$$

where $C_{\eta,q} := \|(2\eta - 1)^{-1}\|_{q,\infty} + 2$ if $q > 0$ and $C_{\eta,q} = 1$ if $q = 0$.

Proof. For $q = 0$ the assertion is trivial and hence we only consider the case $q > 0$. Given a fixed $x \in X$ we write $p := P(1 \mid x)$ and $t := f(x)$. In addition, we introduce

$$\begin{aligned} v(p,t) &:= p\bigl(l(1,t) - l(1,f_{l,P}(x))\bigr)^2 + (1 - p)\bigl(l(-1,t) - l(-1,f_{l,P}(x))\bigr)^2, \\ m(p,t) &:= p\bigl(l(1,t) - l(1,f_{l,P}(x))\bigr) + (1 - p)\bigl(l(-1,t) - l(-1,f_{l,P}(x))\bigr). \end{aligned}$$

Since Tsybakov's noise assumption implies $P_X(X_0) = 0$, we can restrict our consideration to $p \ne 1/2$. Now we will begin by showing

$$v(p,t) \le \Bigl(|t| + \frac{2}{|2p - 1|}\Bigr) m(p,t). \tag{42}$$

Without loss of generality we may assume $p > 1/2$. Then we may set $f_{l,P}(x) := 1$ and thus we have $l(1,f_{l,P}(x)) = 0$ and $l(-1,f_{l,P}(x)) = 2$.

Let us first consider the case $t \in [-1,1]$. Then we have $l(1,t) = 1 - t$ and $l(-1,t) = 1 + t$, and therefore (42) reduces to

$$(1 - t)^2 \le \Bigl(|t| + \frac{2}{2p - 1}\Bigr)(2p - 1)(1 - t).$$

Obviously, the latter inequality is equivalent to $1 - t \le (2p - 1)|t| + 2$, which is always satisfied for $t \in [-1,1]$ and $p \ge 1/2$.

Now let us consider the case $t \le -1$. We then have $l(1,t) = 1 - t$ and $l(-1,t) = 0$, and after some elementary calculation we hence see that (42) is satisfied if and only if

$$p^2(6 - 2t) - p(5 - 3t) - 2t \ge 0.$$

The left-hand side is minimal if $p = (5 - 3t)/(12 - 4t)$, and thus we obtain

$$p^2(6 - 2t) - p(5 - 3t) - 2t \ge \frac{7t^2 - 18t - 25}{24 - 8t}.$$

Consequently, it suffices to show $7t^2 - 18t - 25 \ge 0$. However, the latter is true for all $t \le -1$ since $t \mapsto 7t^2 - 18t - 25$ is decreasing on $(-\infty,-1]$.

Now let us consider the third case, $t > 1$. Since we then have $l(1,t) = 0$ and $l(-1,t) = 1 + t$ it suffices to show

$$t - 1 \le t + \frac{2}{2p - 1}.$$
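The pointwise inequality (42) lends itself to a numerical sanity check across all three cases of $t$. The sketch below is an illustration only: the function names and the grid of $(p,t)$ values are arbitrary choices, and it covers the case $p > 1/2$ with $f_{l,P}(x) = 1$, as in the proof. A grid scan does not replace the case analysis.

```python
def hinge(y, t):
    """Hinge loss l(y, t) = max(0, 1 - y t)."""
    return max(0.0, 1.0 - y * t)


def v_and_m(p, t):
    """v(p, t) and m(p, t) from the proof of Lemma 6.1 for p > 1/2,
    where the minimizer is taken as f_{l,P}(x) = 1."""
    d_pos = hinge(1, t) - hinge(1, 1.0)    # l(1, t) - 0
    d_neg = hinge(-1, t) - hinge(-1, 1.0)  # l(-1, t) - 2
    v = p * d_pos ** 2 + (1.0 - p) * d_neg ** 2
    m = p * d_pos + (1.0 - p) * d_neg
    return v, m


def inequality_42_holds(p, t, tol=1e-9):
    """Check v(p, t) <= (|t| + 2 / |2p - 1|) * m(p, t) up to rounding."""
    v, m = v_and_m(p, t)
    bound = (abs(t) + 2.0 / abs(2.0 * p - 1.0)) * m
    return v <= bound + tol * (1.0 + abs(bound))


# Scan p in (1/2, 1] and t in [-5, 5]; the grid itself is an arbitrary choice.
grid_ok = all(
    inequality_42_holds(0.5 + i / 100.0, -5.0 + j / 10.0)
    for i in range(1, 51)
    for j in range(0, 101)
)
print(grid_ok)
```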
More informationThe Choquet Integral with Respect to Fuzzy-Valued Set Functions
The Choquet Itegral with Respect to Fuzzy-Valued Set Fuctios Weiwei Zhag Abstract The Choquet itegral with respect to real-valued oadditive set fuctios, such as siged efficiecy measures, has bee used i
More informationIt is always the case that unions, intersections, complements, and set differences are preserved by the inverse image of a function.
MATH 532 Measurable Fuctios Dr. Neal, WKU Throughout, let ( X, F, µ) be a measure space ad let (!, F, P ) deote the special case of a probability space. We shall ow begi to study real-valued fuctios defied
More informationA Proof of Birkhoff s Ergodic Theorem
A Proof of Birkhoff s Ergodic Theorem Joseph Hora September 2, 205 Itroductio I Fall 203, I was learig the basics of ergodic theory, ad I came across this theorem. Oe of my supervisors, Athoy Quas, showed
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete
More informationAn Introduction to Randomized Algorithms
A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis
More informationThe Borel hierarchy classifies subsets of the reals by their topological complexity. Another approach is to classify them by size.
Lecture 7: Measure ad Category The Borel hierarchy classifies subsets of the reals by their topological complexity. Aother approach is to classify them by size. Filters ad Ideals The most commo measure
More informationPRELIM PROBLEM SOLUTIONS
PRELIM PROBLEM SOLUTIONS THE GRAD STUDENTS + KEN Cotets. Complex Aalysis Practice Problems 2. 2. Real Aalysis Practice Problems 2. 4 3. Algebra Practice Problems 2. 8. Complex Aalysis Practice Problems
More informationLecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound
Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet
More informationRandom Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.
Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)
More informationf n (x) f m (x) < ɛ/3 for all x A. By continuity of f n and f m we can find δ > 0 such that d(x, x 0 ) < δ implies that
Lecture 15 We have see that a sequece of cotiuous fuctios which is uiformly coverget produces a limit fuctio which is also cotiuous. We shall stregthe this result ow. Theorem 1 Let f : X R or (C) be a
More informationApproximation by Superpositions of a Sigmoidal Function
Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 22 (2003, No. 2, 463 470 Approximatio by Superpositios of a Sigmoidal Fuctio G. Lewicki ad G. Mario Abstract. We geeralize
More informationAda Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities
CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We
More informationBoundaries and the James theorem
Boudaries ad the James theorem L. Vesely 1. Itroductio The followig theorem is importat ad well kow. All spaces cosidered here are real ormed or Baach spaces. Give a ormed space X, we deote by B X ad S
More informationRiesz-Fischer Sequences and Lower Frame Bounds
Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.
More informationJacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3
No-Parametric Techiques Jacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3 Parametric vs. No-Parametric Parametric Based o Fuctios (e.g Normal Distributio) Uimodal Oly oe peak Ulikely real data cofies
More informationLecture 12: September 27
36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.
More informationlim za n n = z lim a n n.
Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget
More informationSolution. 1 Solutions of Homework 1. Sangchul Lee. October 27, Problem 1.1
Solutio Sagchul Lee October 7, 017 1 Solutios of Homework 1 Problem 1.1 Let Ω,F,P) be a probability space. Show that if {A : N} F such that A := lim A exists, the PA) = lim PA ). Proof. Usig the cotiuity
More informationLecture 15: Learning Theory: Concentration Inequalities
STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that
More informationBeurling Integers: Part 2
Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers
More informationMath 113 Exam 3 Practice
Math Exam Practice Exam 4 will cover.-., 0. ad 0.. Note that eve though. was tested i exam, questios from that sectios may also be o this exam. For practice problems o., refer to the last review. This
More informationLecture 3: August 31
36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,
More information1 Convergence in Probability and the Weak Law of Large Numbers
36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec
More informationJournal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula
Joural of Multivariate Aalysis 102 (2011) 1315 1319 Cotets lists available at ScieceDirect Joural of Multivariate Aalysis joural homepage: www.elsevier.com/locate/jmva Superefficiet estimatio of the margials
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More information1 Review and Overview
DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,
More informationBinary classification, Part 1
Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y
More informationMAS111 Convergence and Continuity
MAS Covergece ad Cotiuity Key Objectives At the ed of the course, studets should kow the followig topics ad be able to apply the basic priciples ad theorems therei to solvig various problems cocerig covergece
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece
More informationChapter 10: Power Series
Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because
More information4.3 Growth Rates of Solutions to Recurrences
4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.
More informationMATH301 Real Analysis (2008 Fall) Tutorial Note #7. k=1 f k (x) converges pointwise to S(x) on E if and
MATH01 Real Aalysis (2008 Fall) Tutorial Note #7 Sequece ad Series of fuctio 1: Poitwise Covergece ad Uiform Covergece Part I: Poitwise Covergece Defiitio of poitwise covergece: A sequece of fuctios f
More informationΩ ). Then the following inequality takes place:
Lecture 8 Lemma 5. Let f : R R be a cotiuously differetiable covex fuctio. Choose a costat δ > ad cosider the subset Ωδ = { R f δ } R. Let Ωδ ad assume that f < δ, i.e., is ot o the boudary of f = δ, i.e.,
More information1 Approximating Integrals using Taylor Polynomials
Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 3
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe
More informationChapter IV Integration Theory
Chapter IV Itegratio Theory Lectures 32-33 1. Costructio of the itegral I this sectio we costruct the abstract itegral. As a matter of termiology, we defie a measure space as beig a triple (, A, µ), where
More informationLinear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d
Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y
More information