Fast Rates for Support Vector Machines


Ingo Steinwart and Clint Scovel
CCS-3, Los Alamos National Laboratory, Los Alamos NM 87545, USA

Abstract. We establish learning rates to the Bayes risk for support vector machines (SVMs) using a regularization sequence λ_n = n^{-α}, where α ∈ (0, 1) is arbitrary. Under a noise condition recently proposed by Tsybakov these rates can become faster than n^{-1/2}. In order to deal with the approximation error we present a general concept called the approximation error function which describes how well the infinite-sample versions of the considered SVMs approximate the data-generating distribution. In addition we discuss in some detail the relation between the classical approximation error and the approximation error function. Finally, for distributions satisfying a geometric noise assumption we establish some learning rates when the used RKHS is a Sobolev space.

1 Introduction

The goal in binary classification is to predict labels y ∈ Y := {-1, 1} of unseen data points x ∈ X using a training set T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n. As usual we assume that both the training samples (x_i, y_i) and the new sample (x, y) are i.i.d. drawn from an unknown distribution P on X × Y. Now given a classifier C that assigns to every T a function f_T : X → R, the prediction of C for y is sign f_T(x), where we choose a fixed definition of sign(0) ∈ {-1, 1}. In order to learn from T the decision function f_T : X → R should guarantee a small probability for the misclassification, i.e. sign f_T(x) ≠ y, of the example (x, y). To make this precise the risk of a measurable function f : X → R is defined by

    R_P(f) := P({(x, y) : sign f(x) ≠ y}),

and the smallest achievable risk R*_P := inf{R_P(f) | f : X → R measurable} is known as the Bayes risk of P. A function f_P attaining this risk is called a Bayes decision function. Obviously, a good classifier should produce decision functions whose risks are close to the Bayes risk with high probability. To make this precise, we say that a classifier is universally consistent if

    E_{T~P^n} R_P(f_T) - R*_P → 0    for n → ∞.    (1)
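As a concrete illustration of these definitions, the following sketch (our illustration, not part of the paper; the three-point marginal and the values of η(x) := P(y = 1 | x) are invented for the example) computes R_P(f) exactly for a finite X and verifies that the Bayes decision function is sign(2η - 1):

```python
import numpy as np

p_x = np.array([1/3, 1/3, 1/3])   # marginal P_X on a three-point space X
eta = np.array([0.9, 0.4, 0.2])   # eta(x) = P(y = 1 | x) at the three points

def risk(f_sign):
    # R_P(f) = sum_x P_X(x) * P(sign f(x) != y | x):
    # predicting +1 errs with probability 1 - eta(x), predicting -1 with eta(x)
    return float(np.sum(p_x * np.where(f_sign > 0, 1 - eta, eta)))

bayes_f = np.where(2 * eta - 1 >= 0, 1, -1)   # Bayes decision: sign(2*eta - 1)
bayes_risk = risk(bayes_f)                    # = sum_x P_X(x) * min(eta, 1-eta)
```

Here `bayes_risk` equals (0.1 + 0.4 + 0.2)/3, and no other sign pattern on the three points achieves a smaller risk.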
Unfortunately, it is well known that no classifier can guarantee a convergence rate in (1) that simultaneously holds for all distributions (see [1, Thm. 7.2]). However, if one restricts considerations to suitable smaller classes of distributions such rates exist for various classifiers (see e.g. [2, 3, 1]). One interesting feature of

these rates is that they are not faster than n^{-1/2} if the considered distributions P are allowed to be noisy in the sense of R*_P > 0. On the other hand, if one restricts considerations to noise-free distributions P in the sense of R*_P = 0 then some empirical risk minimization (ERM) methods can actually learn with rate n^{-1} (see e.g. [1]). Remarkably, it was only recently discovered (see [4, 5]) that there also exist classes of noisy distributions which can be learned with rates between n^{-1/2} and n^{-1}. The key property of these classes is that their noise level x ↦ |η(x) - 1/2|, with η(x) := P(y = 1 | x), is well-behaved in the sense of the following definition.

Definition 1. A distribution P on X × Y has Tsybakov noise exponent q ∈ [0, ∞] if there exists a C > 0 such that for all sufficiently small t > 0 we have

    P_X({x ∈ X : |2η(x) - 1| ≤ t}) ≤ C · t^q.    (2)

Obviously, all distributions have at least noise exponent 0. At the other extreme, (2) is satisfied for q = ∞ if and only if the conditional probability η is bounded away from the critical level 1/2. In particular this shows that noise-free distributions have exponent q = ∞.

The aim of this work is to establish learning rates for support vector machines (SVMs) under Tsybakov's noise assumption which are comparable to the rates of [4, 5]. Therefore let us now recall these classification algorithms: let X be a compact metric space and H be a RKHS over X with continuous kernel k. Furthermore, let l : Y × R → [0, ∞) be the hinge loss, which is defined by l(y, t) := max{0, 1 - yt}. Then given a training set T ∈ (X × Y)^n and a regularization parameter λ > 0, SVMs solve the optimization problems

    (f_{T,λ}, b_{T,λ}) := arg min_{f ∈ H, b ∈ R} λ‖f‖²_H + (1/n) Σ_{i=1}^n l(y_i, f(x_i) + b),    (3)

or

    f_{T,λ} := arg min_{f ∈ H} λ‖f‖²_H + (1/n) Σ_{i=1}^n l(y_i, f(x_i)),    (4)

respectively. Furthermore, in order to control the size of the offset we always choose b_{T,λ} := y if all samples of T have label y. As usual we call algorithms solving (3) L1-SVMs with offset and algorithms solving (4) L1-SVMs without offset. For more information on these methods we refer to [6].
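To make (4) concrete, here is a minimal numerical sketch (ours, not from the paper; the data and the solver are illustrative choices). For the linear kernel k(u, v) = uv on X ⊂ R the RKHS is H = {x ↦ wx} with ‖f‖_H = |w|, so (4) reduces to minimizing a strictly convex function of a single real variable, which ternary search handles:

```python
import numpy as np

def l1_svm_no_offset(x, y, lam, lo=-1e4, hi=1e4, iters=200):
    """Solve (4) for the linear kernel k(u,v) = u*v on R, i.e. minimize
    J(w) = lam*w**2 + (1/n) * sum_i max(0, 1 - y_i*w*x_i),
    a strictly convex function of w, by ternary search on [lo, hi]."""
    def J(w):
        return lam * w * w + np.mean(np.maximum(0.0, 1.0 - y * (w * x)))
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if J(m1) < J(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# toy data: |x| bounded away from 0, labels y = sign(x)
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, 200) * rng.choice([-1.0, 1.0], 200)
y = np.sign(x)
w = l1_svm_no_offset(x, y, lam=0.01)
train_err = np.mean(np.sign(w * x) != y)
```

Since any w > 0 classifies this data correctly, the hinge term only penalizes small margins and the regularizer λw² sets the scale of the solution. For a general kernel one would instead expand f = Σ_j a_j k(·, x_j) by the representer theorem and minimize over the coefficient vector a.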
The rest of this work is organized as follows: In Section 2 we first introduce two concepts which describe the richness of RKHSs. We then present our main result and discuss it. The following sections are devoted to the proof of this result: In Section 3 we recall some results from [7] which are used for the analysis of the estimation error, and in Section 4 we then prove our main result. Finally, the relation between the approximation error and infinite-sample SVMs, which is of its own interest, is discussed in the appendix.

2 Definitions and Results

For the formulation of our results we need two notions which deal with the richness of RKHSs. While the first notion is a complexity measure in terms of covering numbers which is used to bound the estimation error, the second one describes the approximation properties of RKHSs with respect to distributions.

In order to introduce the complexity measure let us recall that for a Banach space E with closed unit ball B_E, the covering numbers of A ⊂ E are defined by

    N(A, ε, E) := min{n ≥ 1 : ∃ x_1, ..., x_n ∈ E with A ⊂ ⋃_{i=1}^n (x_i + εB_E)},    ε > 0.

Given a training set T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n we denote the space of all equivalence classes of functions f : X × Y → R equipped with the norm

    ‖f‖_{L_2(T)} := ((1/n) Σ_{i=1}^n f(x_i, y_i)²)^{1/2}

by L_2(T). In other words, L_2(T) is an L_2-space with respect to the empirical measure of T. Note that for a function f : X × Y → R a canonical representative in L_2(T) is the restriction f|_T. Furthermore, we write L_2(T_X) for the space of all (equivalence classes of) square integrable functions with respect to the empirical measure of x_1, ..., x_n. Now our complexity measure is:

Definition 2. Let H be a RKHS over X and B_H its closed unit ball. We say that H has complexity exponent 0 < p ≤ 2 if there exists a constant c > 0 such that for all ε > 0 we have

    sup_{T_X ∈ X^n} log N(B_H, ε, L_2(T_X)) ≤ c ε^{-p}.

By using the theory of absolutely 2-summing operators one can show that every RKHS has complexity exponent p = 2. However, for meaningful rates we need complexity exponents which are strictly smaller than 2.

In order to introduce the second notion, describing the approximation properties of RKHSs, we first have to recall the infinite-sample versions of (3) and (4). To this end let l be the hinge loss function and P be a distribution on X × Y. Then for f : X → R the l-risk of f is defined by

    R_{l,P}(f) := E_{(x,y)~P} l(y, f(x)).    (5)

Now given a RKHS H over X and λ > 0 we define

    (f_{P,λ}, b_{P,λ}) := arg min_{f ∈ H, b ∈ R} λ‖f‖²_H + R_{l,P}(f + b)    (6)

and

    f_{P,λ} := arg min_{f ∈ H} λ‖f‖²_H + R_{l,P}(f)    (7)

(see [8] for the existence of these minimizers).
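The covering numbers appearing in Definition 2 can be estimated empirically from above: a greedy ε-net construction produces a set of centers whose ε-balls cover A, so its size upper-bounds N(A, ε, ·). The sketch below (our illustration, not from the paper; it covers a finite point cloud in the Euclidean plane rather than the ball B_H of an RKHS) shows the mechanism:

```python
import numpy as np

def greedy_cover(A, eps):
    """Greedy eps-net: repeatedly take the first still-uncovered point of A
    as a new center until every point lies within eps (Euclidean distance)
    of some center. len(result) is an upper bound on N(A, eps)."""
    centers = []
    uncovered = np.ones(len(A), dtype=bool)
    while uncovered.any():
        c = A[int(np.argmax(uncovered))]   # first uncovered point
        centers.append(c)
        uncovered &= np.linalg.norm(A - c, axis=1) > eps
    return np.array(centers)

# A = 400 points on the unit circle in R^2; here N(A, eps) grows like 1/eps,
# i.e. log N(A, eps) ~ log(1/eps), far below the eps^{-p} bound of Definition 2
t = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
A = np.column_stack([np.cos(t), np.sin(t)])
sizes = {eps: len(greedy_cover(A, eps)) for eps in (0.5, 0.25, 0.125)}
```

Halving ε roughly doubles the number of centers needed for the circle, which is the one-dimensional analogue of the polynomial growth that the complexity exponent p quantifies.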
Note that these definitions give the solutions (f_{T,λ}, b_{T,λ}) and f_{T,λ} of (3) and (4), respectively, if P is an empirical

distribution with respect to a training set T. In this case we write R_{l,T}(f) for the (empirical) l-risk. With these notations in mind we define the approximation error function by

    a(λ) := λ‖f_{P,λ}‖²_H + R_{l,P}(f_{P,λ}) - R*_{l,P},    λ ≥ 0,    (8)

where R*_{l,P} := inf{R_{l,P}(f) | f : X → R} denotes the smallest possible l-risk. Note that since the obvious variant of a(·) that involves an offset is not greater than the above approximation error function, we restrict our attention to the latter. Furthermore, we discuss the relationship between a(·) and the standard approximation error in the appendix.

The approximation error function quantifies how well an infinite-sample L1-SVM with RKHS H approximates the minimal l-risk. It was shown in [8] that if H is dense in the space of continuous functions C(X) then for all P we have a(λ) → 0 for λ → 0. However, in non-trivial situations no rate of convergence which uniformly holds for all distributions P is possible. The following definition characterizes distributions which guarantee certain polynomial rates:

Definition 3. Let H be a RKHS over X and P be a probability measure on X × Y. We say that H approximates P with exponent 0 ≤ β ≤ 1 if there exists a constant C > 0 such that for all λ > 0 we have a(λ) ≤ Cλ^β.

Note that H approximates every distribution P with exponent β = 0. We will see in the appendix that the other extremal case β = 1 is equivalent to the fact that the minimal l-risk can be achieved by an element of H.

With the help of the above notations we can now formulate our main result.

Theorem 1. Let H be a RKHS of a continuous kernel on a compact metric space X with complexity exponent 0 < p < 2, and let P be a probability measure on X × Y with Tsybakov noise exponent 0 ≤ q ≤ ∞. Furthermore, assume that H approximates P with exponent 0 < β ≤ 1. We define λ_n := n^{-α} for some α ∈ (0, 1) and all n ≥ 1. If

    α < 4(q+1) / ((2q + pq + 4)(1 + β)),

then there exists a C > 0 with

    Pr*(T ∈ (X × Y)^n : R_P(f_{T,λ_n}) ≤ R*_P + C x n^{-αβ}) ≥ 1 - e^{-x}

for all n ≥ 1 and x ≥ 1. Here Pr* is the outer probability of P^n in order to avoid measurability considerations.
Furthermore, if α ≥ 4(q+1)/((2q + pq + 4)(1 + β)), then for all ε > 0 there is a C > 0 such that for all x ≥ 1, n ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : R_P(f_{T,λ_n}) ≤ R*_P + C x n^{α + ε - 4(q+1)/(2q+pq+4)}) ≥ 1 - e^{-x}.

Finally, the same results hold for the L1-SVM with offset whenever q > 0.

Remark 1. The best rates Theorem 1 can guarantee are (up to an ε) of the form

    n^{-4β(q+1)/((2q+pq+4)(1+β))},

and an easy calculation shows that these rates are obtained for the value

    α := 4(q+1) / ((2q + pq + 4)(1 + β)).

This result has already been announced in [9] and presented in an earlier (and substantially longer) version of [7]. The main difference of Theorem 1 to its predecessors is that it does not require to choose α optimally. Finally note that unfortunately the optimal α is in terms of both q and β, which are in general not accessible. At the moment we are not aware of any method which can adaptively find the (almost) optimal values for α.

Remark 2. In [5] it is assumed that a Bayes classifier is contained in the base function classes the considered ERM method minimizes over. This assumption corresponds to a perfect approximation of P by H, i.e. β = 1, as we will see in the appendix. If in this case we rescale the complexity exponent p from (0, 2) to (0, 1) and write p′ for the new complexity measure, our optimal rate essentially becomes

    n^{-(q+1)/(q + p′q + 2)}.

Recall that this is exactly the form of Tsybakov's result in [5], which is known to be optimal in a minimax sense for some specific classes of distributions. However, as far as we know our complexity measure cannot be compared to Tsybakov's, and thus the above reasoning only indicates that our optimal rates may be optimal in a minimax sense.

Let us finally present an example which shows how the developed theory can be used to establish learning rates for specific types of kernels and distributions.

Example 1 (SVMs using Sobolev spaces). Let X ⊂ R^d be the closed unit Euclidean ball, Ω be the centered open ball of radius 3, and W^m(Ω) be the Sobolev space of order m ∈ N over Ω. Recall that W^m(Ω) is a RKHS of a continuous kernel if m > d/2 (see e.g. [10]). Let us write H_m := {f|_X : f ∈ W^m(Ω)} for the restriction of W^m(Ω) onto X endowed with the induced RKHS norm. Then (see again [10]) the RKHS H_m has complexity exponent p := d/m if m > d/2.
Now let P be a distribution on X × Y which has geometric noise exponent α ∈ (0, ∞] in the sense of [7], and let

    k_σ(x, x′) := exp(-σ²‖x - x′‖²₂),    x, x′ ∈ Ω,

be a Gaussian RBF kernel with associated integral operator T_σ : L_2(Ω) → L_2(Ω), where L_2(Ω) is with respect to the Lebesgue measure. Then by the results in [7, Secs. 3 & 4] there exist constants c_d, c_{α,m,d} ≥ 1 such that for all σ > 0 there exists an f_σ ∈ L_2(Ω) with ‖f_σ‖_{L_2(Ω)} = c_d σ^d,

    R_{l,P}((T_σ f_σ)|_X) - R*_{l,P} ≤ c_{α,m,d} σ^{-αd},

and ‖(T_σ f_σ)|_X‖_{H_m} ≤ c_{α,m,d} σ^{m-d/2} ‖f_σ‖_{L_2(Ω)}. This yields a constant c > 0 with

    a(λ) ≤ c (λσ^{2m+d} + σ^{-αd})

for all σ > 0 and all λ > 0. Minimizing with respect to σ then shows that H_m approximates P with exponent

    β := αd / ((α+1)d + 2m).

Consequently we can use Theorem 1 to obtain learning rates for SVMs using H_m for m > d/2. In particular the resulting optimal rates in the sense of Remark 1 are (essentially) of the form

    n^{-4αdm(q+1)/((2mq + dq + 4m)((2α+1)d + 2m))}.
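The exponent bookkeeping in Remark 1, Remark 2 and Example 1 is mechanical but easy to get wrong by hand. The following sketch (ours, using exact rational arithmetic) evaluates the optimal rate exponent of Remark 1 and checks it against the Tsybakov form of Remark 2 and the closed-form Sobolev rate of Example 1:

```python
from fractions import Fraction

def optimal_rate(q, p, beta):
    """Remark 1: the best guaranteed rate exponent is alpha* * beta with
    alpha* = 4(q+1) / ((2q + pq + 4)(1 + beta))."""
    q, p, beta = map(Fraction, (q, p, beta))
    alpha = 4 * (q + 1) / ((2 * q + p * q + 4) * (1 + beta))
    return alpha * beta

# Remark 2: beta = 1 and p = 2*p' with p' in (0, 1) recovers Tsybakov's
# exponent (q+1) / (q + p'q + 2)
q, pp = Fraction(2), Fraction(1, 2)
assert optimal_rate(q, 2 * pp, 1) == (q + 1) / (q + pp * q + 2)

# Example 1: p = d/m and beta = alpha*d / ((alpha+1)d + 2m) give the rate
# 4*alpha*d*m*(q+1) / ((2mq + dq + 4m)((2*alpha+1)d + 2m))
a, q, m, d = Fraction(1), Fraction(1), Fraction(3), Fraction(2)
beta = a * d / ((a + 1) * d + 2 * m)
closed = 4*a*d*m*(q+1) / ((2*m*q + d*q + 4*m) * ((2*a + 1)*d + 2*m))
assert optimal_rate(q, d / m, beta) == closed
```

Both identities hold exactly as `Fraction` equalities, which is a convenient way to double-check this kind of exponent algebra.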

3 Prerequisites

In this section we recall some important notions and results that we require in the proof of our main theorem. To this end let H be a RKHS over X that has a continuous kernel k. Then recall that every f ∈ H is continuous and satisfies

    ‖f‖_∞ ≤ K‖f‖_H,    where we use    K := sup_{x∈X} √k(x, x).

The rest of this section recalls some results from [7] which will be used to bound the estimation error of L1-SVMs. Before we state these results we have to recall some notation from [7]: let F be a class of bounded measurable functions from a set Z to R, and let L : F × Z → [0, ∞) be a function. We call L a loss function if L ∘ f := L(f, ·) is measurable for all f ∈ F. Moreover, if F is convex, we say that L is convex if L(·, z) is convex for all z ∈ Z. Finally, L is called line-continuous if for all z ∈ Z and all f, f̂ ∈ F the function t ↦ L(tf + (1-t)f̂, z) is continuous on [0, 1]. Note that if F is a vector space then every convex L is line-continuous. Now, given a probability measure P on Z we denote by f_{P,F} ∈ F a minimizer of the L-risk

    f ↦ R_{L,P}(f) := E_{z~P} L(f, z).

If P is an empirical measure with respect to T ∈ Z^n we write f_{T,F} and R_{L,T}(·) as usual. For simplicity, we assume throughout this section that f_{P,F} and f_{T,F} do exist. Also note that although there may exist multiple solutions, we use a single symbol for them whenever no confusion regarding the non-uniqueness of this symbol can be expected. Furthermore, an algorithm that produces solutions f_{T,F} for all possible T is called an empirical L-risk minimizer. Now the main result of this section, shown in [7], reads as follows:

Theorem 2. Let F be a convex set of bounded measurable functions from Z to R and let L : F × Z → [0, ∞) be a convex and line-continuous loss function. For a probability measure P on Z we define

    G := {L ∘ f - L ∘ f_{P,F} : f ∈ F}.    (9)

Suppose we have c ≥ 0, 0 < α ≤ 1, δ ≥ 0 and B > 0 with E_P g² ≤ c (E_P g)^α + δ and ‖g‖_∞ ≤ B for all g ∈ G. Furthermore, assume that G is separable with respect to ‖·‖_∞ and that there are constants a ≥ 1 and 0 < p < 2 with

    sup_{T ∈ Z^n} log N(B^{-1}G, ε, L_2(T)) ≤ a ε^{-p}    (10)

for all ε > 0.
Then there exists a constant c_p > 0 depending only on a and p such that for all n ≥ 1 and all x ≥ 1 we have

    Pr*(T ∈ Z^n : R_{L,P}(f_{T,F}) > R_{L,P}(f_{P,F}) + c_p ε(n, B, c, δ, x)) ≤ e^{-x},

where

    ε(n, B, c, δ, x) := B^{2p/(4-2α+αp)} c^{(2-p)/(4-2α+αp)} n^{-2/(4-2α+αp)} + B^{p/(2+p)} δ^{(2-p)/(2+p)} n^{-2/(2+p)} + (δx/n)^{1/2} + (cx/n)^{1/(2-α)} + Bx/n.

Let us now recall some variance bounds of the form E_P g² ≤ c (E_P g)^α + δ for SVMs proved in [7]. To this end let H be a RKHS of a continuous kernel over X, λ > 0, and l be the hinge loss function. We define

    L(f, x, y) := λ‖f‖²_H + l(y, f(x))    (11)

and

    L(f, b, x, y) := λ‖f‖²_H + l(y, f(x) + b)    (12)

for all f ∈ H, b ∈ R, x ∈ X, and y ∈ Y. Since R_{L,T}(·) and R_{L,T}(·,·) coincide with the objective functions of the L1-SVM formulations, we see that the L1-SVMs actually implement an empirical L-risk minimization in the sense of Theorem 2. Now the first variance bound from [7] does not require any assumptions on P.

Proposition 1. Let 0 < λ < 1, H be a RKHS over X, and F := λ^{-1/2} B_H. Furthermore, let L be defined by (11), P be a probability measure and G be defined as in (9). Then for all g ∈ G we have

    E_P g² ≤ λ^{-1/2}(2 + K) E_P g.

Finally, the following variance bound from [7] shows that the previous bound can be improved if one assumes a non-trivial Tsybakov exponent for P.

Proposition 2. Let P be a distribution on X × Y with Tsybakov noise exponent 0 < q ≤ ∞. Then there exists a constant C > 0 such that for all λ > 0, all 0 < r ≤ λ^{-1/2} satisfying f_{P,λ} ∈ rB_H, all f ∈ rB_H, and all b ∈ R with |b| ≤ Kr + 1 we have

    E(L∘(f, b) - L∘(f_{P,λ}, b_{P,λ}))² ≤ C(Kr + 1)^{(q+2)/(q+1)} (E(L∘(f, b) - L∘(f_{P,λ}, b_{P,λ})))^{q/(q+1)} + C(Kr + 1)^{(q+2)/(q+1)} a^{q/(q+1)}(λ).

Furthermore, the same result holds for SVMs without offset.

4 Proof of Theorem 1

In this section we prove Theorem 1. To this end we write f(x) ≾ g(x) for two functions f, g : D → [0, ∞), D ⊂ (0, ∞), if there exists a constant C > 0 such that f(x) ≤ Cg(x) holds over some range of x which usually is implicitly defined by the context. However, for sequences this range is always N. Finally we write f(x) ≍ g(x) if both f(x) ≾ g(x) and g(x) ≾ f(x) for the same range.

Since our variance bounds have different forms for the cases q = 0 and q > 0 we have to prove the theorem for these cases separately. We begin with the case q = 0 and an important lemma which describes a shrinking technique.

Lemma 1. Let H and P be as in Theorem 1. For γ > -β we define λ_n := n^{-1/(1+β+γ)}. Now assume that there are constants 0 ≤ ρ < β and C ≥ 1 such that

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{(ρ-1)/2}) ≥ 1 - e^{-x}

for all n ≥ 1, x ≥ 1. Then there is another constant Ĉ ≥ 1 such that for ρ̂ := min{β, (ρ+β+γ)/2, β+γ} and for all n ≥ 1, x ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2}) ≥ 1 - e^{-x}.

Proof. Let f̂_{T,λ_n} be a minimizer of R_{L,T} on Cxλ_n^{(ρ-1)/2} B_H, where L is defined by (11). By our assumption we have f̂_{T,λ_n} = f_{T,λ_n} with probability not less than 1 - e^{-x}, since f_{T,λ_n} is unique for every training set T by the strict convexity of L. We will show that for some Ĉ > 0 and all n ≥ 1, x ≥ 1 the improved bound

    ‖f̂_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2}    (13)

holds with probability not less than 1 - e^{-x}. Consequently, ‖f_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2} will hold with probability not less than 1 - 2e^{-x}. Obviously, the latter implies the assertion.

In order to establish (13) we will apply Theorem 2 to the modified L1-SVM classifier that produces f̂_{T,λ_n}. To this end we first observe that the separability condition of Theorem 2 is satisfied since H is separable and continuously embedded into C(X). Furthermore, it was shown in [7] that the covering number condition holds, and by Proposition 1 we may choose c such that c ≾ λ_n^{-1/2}, and δ = 0. Additionally, we can obviously choose B ≾ xλ_n^{(ρ-1)/2}. The term ε(n, B, c, δ, x) in Theorem 2 can then be estimated by

    ε(n, B, c, δ, x) ≾ x λ_n^{(pρ+2β+2γ)/(2+p)} + x λ_n^{β+γ}.

Now for ρ ≤ β + γ we have (ρ+β+γ)/2 ≤ (pρ+2β+2γ)/(2+p), and hence we obtain

    ε(n, B, c, δ, x) ≾ x λ_n^{(ρ+β+γ)/2} + x λ_n^{β+γ}.

Furthermore, if ρ > β + γ we have both β + γ < (pρ+2β+2γ)/(2+p) and β + γ < (ρ+β+γ)/2, and thus we again find

    ε(n, B, c, δ, x) ≾ x λ_n^{β+γ} ≾ x λ_n^{β+γ} + x λ_n^{(ρ+β+γ)/2}.

Now, in both cases Theorem 2 gives a constant C₁ > 0 independent of n and x such that for all n ≥ 1 and all x ≥ 1 the estimate

    λ_n‖f̂_{T,λ_n}‖² ≤ λ_n‖f̂_{T,λ_n}‖² + R_{l,P}(f̂_{T,λ_n}) - R*_{l,P} ≤ λ_n‖f̂_{P,λ_n}‖² + R_{l,P}(f̂_{P,λ_n}) - R*_{l,P} + C₁ x λ_n^{(ρ+β+γ)/2} + C₁ x λ_n^{β+γ}

holds with probability not less than 1 - e^{-x}. Furthermore, by Theorem 4 we obtain ‖f_{P,λ_n}‖ ≾ λ_n^{(β-1)/2} ≤ Cxλ_n^{(ρ-1)/2} for large n, which gives f_{P,λ_n} = f̂_{P,λ_n} for such n. With probability not less than 1 - e^{-x} we hence have

    λ_n‖f̂_{T,λ_n}‖² ≤ λ_n‖f_{P,λ_n}‖² + R_{l,P}(f_{P,λ_n}) - R*_{l,P} + C₁ x λ_n^{(ρ+β+γ)/2} + C₁ x λ_n^{β+γ} ≤ C₂ λ_n^β + C₁ x λ_n^{(ρ+β+γ)/2} + C₁ x λ_n^{β+γ}

for some constants C₁, C₂ > 0 independent of n and x. From this we easily obtain that (13) holds for all n ≥ 1 with probability not less than 1 - e^{-x}.

Proof (of Theorem 1 for q = 0). We first observe that there exists a γ > -β with α = 4(q+1)/((2q+pq+4)(1+β+γ)) = 1/(1+β+γ). We fix this γ and define ρ₀ := 0 and ρ_{i+1} := min{β, (ρ_i+β+γ)/2, β+γ}. Then it is easy to check that this definition gives

    ρ_i = min{β, (β+γ) Σ_{j=1}^i 2^{-j}, β+γ} = min{β, (β+γ)(1 - 2^{-i})}.

Now, iteratively applying Lemma 1 gives a sequence of constants C_i > 0 with

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ C_i xλ_n^{(ρ_i-1)/2}) ≥ 1 - e^{-x}    (14)

for all n ≥ 1 and all x ≥ 1.

Let us first consider the case -β < γ ≤ 0. Then we have ρ_i = (β+γ)(1 - 2^{-i}), and hence (14) shows that for all ε > 0 there exists a constant C > 0 such that

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{((1-ε)(β+γ)-1)/2}) ≥ 1 - e^{-x}

for all n ≥ 1 and all x ≥ 1. We write ρ := (1-ε)(β+γ). As in the proof of Lemma 1 we denote a minimizer of R_{L,T} on Cxλ_n^{(ρ-1)/2} B_H by f̂_{T,λ_n}. We have just seen that f̂_{T,λ_n} = f_{T,λ_n} with probability not less than 1 - e^{-x}. Therefore, we only have to apply Theorem 2 to the modified optimization problem which defines f̂_{T,λ_n}. To this end we first see as in the proof of Lemma 1 that

    ε(n, B, c, δ, x) ≾ x λ_n^{(pρ+2β+2γ)/(2+p)} + x λ_n^{β+γ} ≾ x λ_n^{(pρ+2β+2γ)/(2+p)} ≾ x λ_n^{β+γ-ε},

where in the last two estimates we used the definition of ρ. Furthermore, we have already seen in the proof of Lemma 1 that

    λ_n‖f̂_{P,λ_n}‖² + R_{l,P}(f̂_{P,λ_n}) - R*_{l,P} ≤ a(λ_n)

holds for large n. Therefore, applying Theorem 2 and an inequality of Zhang (see [11]) between the excess classification risk and the excess l-risk, we find that for all n ≥ 1 we have with probability not less than 1 - e^{-x}:

    R_P(f̂_{T,λ_n}) - R*_P ≤ λ_n‖f̂_{T,λ_n}‖² + R_{l,P}(f̂_{T,λ_n}) - R*_{l,P} ≤ λ_n‖f̂_{P,λ_n}‖² + R_{l,P}(f̂_{P,λ_n}) - R*_{l,P} + C₁ x λ_n^{β+γ-ε} ≤ C₂ x λ_n^{β+γ-ε},    (15)

where C₁, C₂ > 0 are constants independent of n and x. Now, from (15) we easily deduce the assertion using the definition of λ_n and γ.

Let us finally consider the case γ > 0. Then for large integers i we have ρ_i = β, and hence (14) gives a C > 0 such that for all n ≥ 1, x ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{(β-1)/2}) ≥ 1 - e^{-x}.

Proceeding as for γ ≤ 0 we get

    ε(n, B, c, δ, x) ≾ x λ_n^{(pβ+2β+2γ)/(2+p)} + x λ_n^{β+γ} ≾ x λ_n^β,

from which we easily obtain the assertion using the definition of λ_n and γ.

In the rest of this section we will prove Theorem 1 for q > 0. We begin with a lemma which is similar to Lemma 1.

Lemma 2. Let H and P be as in Theorem 1. For γ > -β we define λ_n := n^{-4(q+1)/((2q+pq+4)(1+β+γ))}. Now assume that there are ρ ∈ [0, β) and C ≥ 1 with

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Cxλ_n^{(ρ-1)/2}) ≥ 1 - e^{-x}

for all n ≥ 1 and all x ≥ 1. Then there is another constant Ĉ ≥ 1 such that for ρ̂ := min{β, (ρ+β+γ)/2} and for all n ≥ 1, x ≥ 1 we have

    Pr*(T ∈ (X × Y)^n : ‖f_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2}) ≥ 1 - e^{-x}.

The same result holds for L1-SVMs with offset.

Proof. For brevity's sake we only prove this lemma for L1-SVMs with offset. The proof for L1-SVMs without offset is almost identical. Now, let L be defined by (12). Analogously to the proof of Lemma 1 we denote a minimizer of R_{L,T}(·,·) on Cxλ_n^{(ρ-1)/2}(B_H × [-K-1, K+1]) by (f̂_{T,λ_n}, b̂_{T,λ_n}). By our assumption (see [7]) we have |b_{T,λ_n}| ≤ Cxλ_n^{(ρ-1)/2}(K+1) with probability not less than 1 - e^{-x} for all possible values of the offset. In addition, for such training sets we have f̂_{T,λ_n} = f_{T,λ_n} since the RKHS component f_{T,λ_n} of L1-SVM solutions is unique for T by the strict convexity of L in f. Furthermore, by the above considerations we may define b̂_{T,λ_n} := b_{T,λ_n} for such training sets. As in the proof of Lemma 1 it now suffices to show the existence of a Ĉ > 0 such that ‖f̂_{T,λ_n}‖ ≤ Ĉxλ_n^{(ρ̂-1)/2} with probability not less than 1 - e^{-x}. To this end we first observe by Proposition 2 that we may choose B, c and δ such that

    B ≾ xλ_n^{(ρ-1)/2},    c ≾ x^{(q+2)/(q+1)} λ_n^{(ρ-1)(q+2)/(2(q+1))},    and    δ ≾ x^{(q+2)/(q+1)} λ_n^{(ρ-1)(q+2)/(2(q+1)) + βq/(q+1)}.
Some calculations then show that ε(n, B, c, δ, x) in Theorem 2 satisfies

    ε(n, B, c, δ, x) ≾ x λ_n^{(ρ+β+γ)/2} + x λ_n^{[(ρ+β+γ)(2q+pq+4) + 2βq(2-p)]/(8(q+1))}.

Furthermore, observe that we have ρ ≤ β - γ if and only if

    ρ + β + γ ≤ [(ρ+β+γ)(2q+pq+4) + 2βq(2-p)] / (4(q+1)).

Now let us first consider the case ρ ≤ β - γ. Then the above considerations show ε(n, a, B, c, δ, x) ≾ x λ_n^{(ρ+β+γ)/2}. Furthermore, we obviously have λ_n^β ≾ λ_n^{(ρ+β+γ)/2}. As in the proof of Lemma 1 we hence find a constant Ĉ > 0 such that for all x ≥ 1, n ≥ 1 we have λ_n‖f̂_{T,λ_n}‖² ≤ Ĉx λ_n^{(ρ+β+γ)/2} with probability not less than 1 - e^{-x}. On the other hand, if ρ > β - γ we have

    ε(n, a, B, c, δ, x) ≾ x λ_n^{[(ρ+β+γ)(2q+pq+4) + 2βq(2-p)]/(8(q+1))} ≾ x λ_n^β,

so that we get λ_n‖f̂_{T,λ_n}‖² ≤ Ĉx λ_n^β in the above sense.

Proof (of Theorem 1 for q > 0). By using Lemma 2 the proof in the case q > 0 is completely analogous to the case q = 0.

Appendix

Throughout this section P denotes a Borel probability measure on X × Y and H denotes a RKHS of continuous functions over X. We use the shorthand ‖·‖ for ‖·‖_H when no confusion should arise. Unlike in the other sections of this paper, here L denotes an arbitrary convex loss function, that is, a continuous function L : Y × R → [0, ∞) convex in its second variable. The corresponding L-risk R_{L,P}(f) of a function f : X → R and its minimal value R*_{L,P} are defined in the obvious way. For simplicity we also assume R_{L,P}(0) = 1. Note that all the requirements are met by the hinge loss function. Furthermore, let us define f_{P,λ} by replacing R_{l,P} by R_{L,P} in (7). In addition we write

    f*_{P,λ} := arg min{‖f‖ : f ∈ arg min_{‖f′‖ ≤ λ^{-1/2}} R_{L,P}(f′)}.    (16)

Of course, we need to prove the existence and uniqueness of f*_{P,λ}, which is done in the following lemma.

Lemma 3. Under the above assumptions f*_{P,λ} is well defined.

Proof. Let us first show that there exists an f ∈ λ^{-1/2} B_H which minimizes R_{L,P}(·) in λ^{-1/2} B_H. To that end consider a sequence (f_n) in λ^{-1/2} B_H such that R_{L,P}(f_n) → inf_{‖f‖ ≤ λ^{-1/2}} R_{L,P}(f). By the Eberlein–Smulyan theorem we can assume without loss of generality that there exists an f with ‖f‖ ≤ λ^{-1/2} and f_n → f weakly. Using the fact that weak convergence in RKHSs implies pointwise convergence, Lebesgue's theorem and the continuity of L then give R_{L,P}(f_n) → R_{L,P}(f).

Hence there is a minimizer of R_{L,P}(·) in λ^{-1/2} B_H, i.e. we have

    A := {f : f ∈ arg min_{‖f′‖ ≤ λ^{-1/2}} R_{L,P}(f′)} ≠ ∅.

We now show that there is exactly one f ∈ A having minimal norm.

Existence: Let (f_n) ⊂ A with ‖f_n‖ → inf_{f∈A} ‖f‖ for n → ∞. Like in the proof establishing A ≠ ∅, we can show that there exists an f* ∈ A with f_n → f* weakly, and R_{L,P}(f_n) → R_{L,P}(f*). This shows f* ∈ A. Furthermore, by the weak convergence we always have ‖f*‖ ≤ lim inf ‖f_n‖ = inf_{f∈A} ‖f‖.

Uniqueness: Suppose we have two such elements f and g with f ≠ g. By convexity we find (f+g)/2 ∈ arg min_{‖f′‖ ≤ λ^{-1/2}} R_{L,P}(f′). However, ‖·‖_H is strictly convex, which gives ‖(f+g)/2‖ < ‖f‖.

In the following we will define the approximation error and the approximation error function for general L. In order to also treat non-universal kernels we first denote the minimal L-risk of functions in H by R_{L,P,H} := inf_{f∈H} R_{L,P}(f). Furthermore, we say that f ∈ H minimizes the L-risk in H if R_{L,P}(f) = R_{L,P,H}. Note that if such a minimizer exists then by Lemma 3 there actually exists a unique element f*_{L,P,H} ∈ H minimizing the L-risk in H with ‖f*_{L,P,H}‖ ≤ ‖f‖ for all f ∈ H minimizing the L-risk in H. Moreover, we have ‖f_{P,λ}‖ ≤ ‖f*_{L,P,H}‖ for all λ > 0, since otherwise we find a contradiction by

    λ‖f*_{L,P,H}‖² + R_{L,P}(f*_{L,P,H}) < λ‖f_{P,λ}‖² + R_{L,P}(f_{P,λ}).

Now, for λ ≥ 0 we write

    a(λ) := λ‖f_{P,λ}‖² + R_{L,P}(f_{P,λ}) - R_{L,P,H},    (17)
    a*(λ) := R_{L,P}(f*_{P,λ}) - R_{L,P,H}.    (18)

Recall that for universal kernels and the hinge loss function we have R_{L,P,H} = R*_{L,P} (see [8]), and hence in this case a(·) equals the approximation error function defined in Section 2. Furthermore, for these kernels, a*(λ) is the classical approximation error of the hypothesis class λ^{-1/2} B_H. Our first theorem shows how to compare a(·) and a*(·).

Theorem 3. With the above notations we have a(0) = a*(0) = 0. Furthermore, a*(·) is increasing, and a(·) is increasing, concave, and continuous. In addition, we have a*(λ) ≤ a(λ) for all λ ≥ 0, and for any h : (0, ∞) → (0, ∞) with a*(λ) ≤ h(λ) for all λ > 0, we have a(λh(λ)) ≤ 2h(λ) for all λ > 0.
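The monotonicity and concavity of a(·) asserted in Theorem 3 are easy to check numerically in a toy setting. The sketch below (our illustration, not from the paper) uses the one-dimensional linear RKHS H = {x ↦ wx}, the hinge loss, and a six-point empirical measure; since w → ∞ classifies every point with margin, R_{L,P,H} = 0 here and a(λ) = min_w (λw² + R_{L,P}(wx)):

```python
import numpy as np

# empirical marginal on six points away from 0, labels y = sign(x)
x = np.array([-1.0, -0.6, -0.3, 0.3, 0.6, 1.0])
y = np.sign(x)

def risk(w):
    # hinge L-risk of f(x) = w*x under the empirical distribution
    return float(np.mean(np.maximum(0.0, 1.0 - y * (w * x))))

def a(lam, lo=0.0, hi=1e4, iters=300):
    # approximation error function (17); the objective is strictly convex
    # in w, so ternary search finds the minimum
    J = lambda w: lam * w * w + risk(w)
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if J(m1) < J(m2):
            hi = m2
        else:
            lo = m1
    return J((lo + hi) / 2)

lams = np.linspace(0.01, 1.0, 21)
vals = np.array([a(l) for l in lams])
increasing = bool(np.all(np.diff(vals) >= -1e-9))
concave = bool(np.all(vals[1:-1] >= (vals[:-2] + vals[2:]) / 2 - 1e-9))
```

Both flags come out `True`: a(·) is the pointwise infimum over w of the functions λ ↦ λw² + R_{L,P}(wx), each affine and increasing in λ, which is exactly the argument used in the proof below.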

Proof. It is clear from the definitions (17) and (18) that a(0) = a*(0) = 0 and that a*(·) is increasing. Since a(·) is an infimum over a family of linear increasing functions of λ, it follows that a(·) is also concave and increasing. Consequently a(·) is continuous for λ > 0 (see [12, Thm. 10.1]), and continuity at 0 follows from the proof of [8, Prop. 3.2].

To prove the second assertion, observe that ‖f_{P,λ}‖² ≤ 1/λ implies R_{L,P}(f*_{P,λ}) ≤ R_{L,P}(f_{P,λ}) for all λ > 0, and hence we find a*(λ) ≤ a(λ) for all λ ≥ 0. Now let λ̄ := h(λ)‖f*_{P,λ}‖^{-2}. Then we obtain

    λ̄‖f_{P,λ̄}‖² + R_{L,P}(f_{P,λ̄}) ≤ λ̄‖f*_{P,λ}‖² + R_{L,P}(f*_{P,λ}) ≤ R_{L,P,H} + 2h(λ).

This shows a(λ̄) ≤ 2h(λ). Furthermore, we have λh(λ) ≤ h(λ)‖f*_{P,λ}‖^{-2} = λ̄ and thus the assertion follows since a(·) is an increasing function.

Our next goal is to show how the asymptotic behaviour of a(·), a*(·) and λ ↦ ‖f_{P,λ}‖ are related to each other. Let us begin with a lemma that characterizes the existence of f*_{L,P,H} ∈ H in terms of the function λ ↦ ‖f_{P,λ}‖.

Lemma 4. The minimizer f*_{L,P,H} ∈ H of the L-risk in H exists if and only if there exists a constant c > 0 with ‖f_{P,λ}‖ ≤ c for all λ > 0. In this case we additionally have lim_{λ→0+} ‖f_{P,λ} - f*_{L,P,H}‖_H = 0.

Proof. Let us first assume that f*_{L,P,H} ∈ H exists. Then we have already seen ‖f_{P,λ}‖ ≤ ‖f*_{L,P,H}‖ for all λ > 0, so that it remains to show the convergence. To this end let (λ_n) be a positive sequence converging to 0. By the boundedness of (f_{P,λ_n}) there then exists an f* ∈ H and a subsequence (f_{P,λ_{n_i}}) with f_{P,λ_{n_i}} → f* weakly. This implies R_{L,P}(f_{P,λ_{n_i}}) → R_{L,P}(f*) as in the proof of Lemma 3. Furthermore, we always have λ_{n_i}‖f_{P,λ_{n_i}}‖² → 0 and thus

    R_{L,P,H} = lim_i (λ_{n_i}‖f_{P,λ_{n_i}}‖² + R_{L,P}(f_{P,λ_{n_i}})) = R_{L,P}(f*),    (19)

where the first equality can be shown as in [8] for universal kernels. In other words, f* minimizes the L-risk in H and hence we have

    ‖f_{P,λ_{n_i}}‖ ≤ ‖f*_{L,P,H}‖ ≤ ‖f*‖ ≤ lim inf_j ‖f_{P,λ_{n_j}}‖

for all i ≥ 1. This shows both ‖f_{P,λ_{n_i}}‖ → ‖f*‖ and ‖f*_{L,P,H}‖ = ‖f*‖, and consequently we find f*_{L,P,H} = f* by (19). In addition, an easy calculation gives

    ‖f_{P,λ_{n_i}} - f*‖² = ‖f_{P,λ_{n_i}}‖² - 2⟨f_{P,λ_{n_i}}, f*⟩ + ‖f*‖² → ‖f*‖² - 2‖f*‖² + ‖f*‖² = 0.

Now assume that f_{P,λ} ↛ f*_{L,P,H}.
Then there exists a δ > 0 and a subsequence (f_{P,λ_{n_j}}) with ‖f_{P,λ_{n_j}} - f*_{L,P,H}‖ > δ. On the other hand, applying the above reasoning to this subsequence gives a sub-subsequence converging to f*_{L,P,H}, and hence we have found a contradiction.

Let us now assume ‖f_{P,λ}‖ ≤ c for some c > 0 and all λ > 0. Then there exists an f* ∈ H and a sequence (f_{P,λ_n}) with f_{P,λ_n} → f* weakly. As in the first part of the proof we easily see that f* minimizes the L-risk in H.

Note that if H is a universal kernel, i.e. it is dense in C(X), P is an empirical distribution based on a training set T, and L is the (squared) hinge loss function, then f*_{L,T,H} ∈ H exists and coincides with the hard margin SVM solution. Consequently, the above lemma shows that both the L1-SVM and the L2-SVM solutions f_{T,λ} converge to the hard margin solution if T is fixed and λ → 0.

The following lemma, which shows that the function f_{P,λ} minimizes R_{L,P}(·) over the ball ‖f_{P,λ}‖ B_H, is somewhat well-known:

Lemma 5. Let λ > 0 and γ := 1/‖f_{P,λ}‖². Then we have f*_{P,γ} = f_{P,λ}.

Proof. We first show that f_{P,λ} minimizes R_{L,P}(·) over the ball ‖f_{P,λ}‖ B_H. To this end assume the converse, R_{L,P}(f*_{P,γ}) < R_{L,P}(f_{P,λ}). Since we also have ‖f*_{P,γ}‖ ≤ 1/√γ = ‖f_{P,λ}‖ we then find the false inequality

    λ‖f*_{P,γ}‖² + R_{L,P}(f*_{P,γ}) < λ‖f_{P,λ}‖² + R_{L,P}(f_{P,λ}),    (20)

and consequently f_{P,λ} minimizes R_{L,P}(·) over ‖f_{P,λ}‖ B_H. Now assume that f_{P,λ} ≠ f*_{P,γ}, i.e. ‖f_{P,λ}‖ > ‖f*_{P,γ}‖. Since R_{L,P}(f*_{P,γ}) = R_{L,P}(f_{P,λ}) we then again find (20), and hence the assumption f_{P,λ} ≠ f*_{P,γ} must be false.

Let us now turn to the main theorem of this section, which describes asymptotic relationships between the approximation error, the approximation error function, and the function λ ↦ ‖f_{P,λ}‖.

Theorem 4. The function λ ↦ ‖f_{P,λ}‖ is bounded on (0, ∞) if and only if a(λ) ≾ λ, and in this case we also have a(λ) ≍ λ. Moreover, for all α > 0 we have a*(λ) ≾ λ^α if and only if a(λ) ≾ λ^{α/(α+1)}. If one of the estimates is true we additionally have ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)} and R_{L,P}(f_{P,λ}) - R_{L,P,H} ≾ λ^{α/(α+1)}. Furthermore, if λ^{α+ε} ≾ a*(λ) ≾ λ^α for some α > 0 and ε ≥ 0 then we have both

    λ^{-α/((α+ε)(α+1))} ≾ ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)}    and    λ^{(α+ε)/(α+1)} ≾ R_{L,P}(f_{P,λ}) - R_{L,P,H} ≾ λ^{α/(α+1)},

and hence in particular λ^{(α+ε)/(α+1)} ≾ a(λ) ≾ λ^{α/(α+1)}.

Theorem 4 shows that if a*(λ) behaves essentially like λ^α then the approximation error function behaves essentially like λ^{α/(α+1)}. Consequently we do not lose information when considering a(·) instead of the approximation error a*(·).

Proof (of Theorem 4).
If λ ↦ ‖f_{P,λ}‖ is bounded on (0, ∞) then the minimizer f*_{L,P,H} exists by Lemma 4, and hence we find

    a(λ) ≤ λ‖f*_{L,P,H}‖² + R_{L,P}(f*_{L,P,H}) - R_{L,P,H} = λ‖f*_{L,P,H}‖².

Conversely, if there exists a constant c > 0 with a(λ) ≤ cλ we find λ‖f_{P,λ}‖² ≤ a(λ) ≤ cλ, which shows ‖f_{P,λ}‖² ≤ c for all λ > 0. Moreover, by Theorem 3 we easily find λa(1) ≤ a(λ) for all 0 < λ ≤ 1.

For the rest of the proof we observe that Theorem 3 gives a(λ) ≤ a(cλ) ≤ c·a(λ) for λ > 0 and c ≥ 1, and c·a(λ) ≤ a(cλ) ≤ a(λ) for λ > 0 and 0 < c ≤ 1. Therefore we can ignore arising constants by using the notation ≾.

Now let us assume a*(λ) ≾ λ^α for some α > 0. Then from Theorem 3 we know a(λ^{1+α}) ≾ λ^α, which leads to a(λ) ≾ λ^{α/(α+1)}. The latter immediately implies ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)}. Conversely, if a(λ) ≾ λ^{α/(α+1)} we define γ := ‖f_{P,λ}‖^{-2}. By Lemma 5 we then obtain

    a*(γ) = R_{L,P}(f_{P,λ}) - R_{L,P,H} ≤ a(λ) ≾ λ^{α/(α+1)} ≾ ‖f_{P,λ}‖^{-2α} = γ^α.

Now, if f*_{L,P,H} does not exist then the function λ ↦ ‖f_{P,λ}‖^{-2} tends to 0 if λ → 0, and thus a*(λ) ≾ λ^α. In addition, if f*_{L,P,H} exists then the assertion is trivial.

For the third assertion recall that Lemma 5 states f_{P,λ} = f*_{P,γ} with γ := ‖f_{P,λ}‖^{-2}, and hence we find

    a(λ) = λ‖f_{P,λ}‖² + a*(‖f_{P,λ}‖^{-2}).    (21)

Furthermore, we have already seen ‖f_{P,λ}‖² ≾ λ^{-1/(α+1)}, and hence we get

    λ^{(α+ε)/(α+1)} ≾ ‖f_{P,λ}‖^{-2(α+ε)} ≾ a*(‖f_{P,λ}‖^{-2}) = R_{L,P}(f_{P,λ}) - R_{L,P,H}.

Combining this with (21) yields the third assertion.

References

1. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, New York (1996)
2. Yang, Y.: Minimax nonparametric classification — part I and II. IEEE Trans. Inform. Theory 45 (1999)
3. Wu, Q., Zhou, D.X.: Analysis of support vector machine classification. Tech. Report, City University of Hong Kong (2003)
4. Mammen, E., Tsybakov, A.: Smooth discrimination analysis. Ann. Statist. 27 (1999)
5. Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 (2004)
6. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press (2002)
7. Steinwart, I., Scovel, C.: Fast rates for support vector machines using Gaussian kernels. Ann. Statist., submitted (2004). publications/a-04a.pdf
8. Steinwart, I.: Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Inform. Theory 51 (2005)
9. Steinwart, I., Scovel, C.: Fast rates to Bayes for kernel machines. In Saul, L.K., Weiss, Y., Bottou, L., eds.: Advances in Neural Information Processing Systems 17.
MIT Press, Cambridge, MA (005) Edmuds, D., Triebel, H.: Fuctio Spaces, Etropy Numbers, Differetial Operators. Cambridge Uiversity Press (1996) 11. Zhag, T.: Statistical behaviour ad cosistecy of classificatio methods based o covex risk miimizatio. A. Statist. 3 (004) Rockafellar, R.: Covex Aalysis. Priceto Uiversity Press (1970)
