Exponential convergence of testing error for stochastic gradient methods


Loucas Pillaud-Vivien, Alessandro Rudi, Francis Bach. Exponential convergence of testing error for stochastic gradient methods. HAL preprint (hal, version 2), submitted on 28 Jun 2018.

EXPONENTIAL CONVERGENCE OF TESTING ERROR FOR STOCHASTIC GRADIENT METHODS

LOUCAS PILLAUD-VIVIEN, ALESSANDRO RUDI, FRANCIS BACH

ABSTRACT. We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed. To achieve these rates of convergence we show sharper high-probability bounds with respect to the number of observations for stochastic gradient descent.

1. INTRODUCTION

Stochastic gradient methods are now ubiquitous in machine learning, both from the practical side, as a simple algorithm that can learn from a single or a few passes over the data [BLC05], and from the theoretical side, as they lead to optimal rates for estimation problems in a variety of situations [NY83, PJ92]. They follow a simple principle [RM51]: to find a minimizer of a function F defined on a vector space from noisy gradients, simply follow the negative stochastic gradient and the algorithm will converge to a stationary point, local minimum or global minimum of F (depending on the properties of the function F), with a rate of convergence that decays with the number n of gradient steps, typically as O(1/√n) or O(1/n) depending on the assumptions made on the problem [PJ92, NV08, NJLS09, SSSS07, Xia10, BM11, BM13, DFB17].

On the one hand, these rates are optimal for the estimation of the minimizer of a function given access to noisy gradients [NY83], which is essentially the usual machine learning set-up where the function F is the expected loss, e.g., logistic or hinge for classification, or least-squares for regression, and the noisy gradients are obtained from sampling a single pair of observations. On the other hand, although these rates in O(1/√n) or O(1/n) are optimal, there are a variety of extra assumptions that allow for faster rates, even exponential rates.

First, for stochastic gradient from a finite pool, that is, for F = (1/k) Σ_{i=1}^k F_i, a sequence of works starting from SAG [LSB12], SVRG [JZ13], SAGA [DBLJ14], have shown explicit exponential convergence. However, these results, once applied to machine learning where the function F_i is the loss function associated with the i-th observation of a finite training data set of size k, say nothing about the loss on unseen data (test loss). The rates we present in this paper are on unseen data.

Second, assuming that at the optimum all stochastic gradients are equal to zero, then for strongly-convex problems (e.g., linear predictions with low-correlated features), linear convergence rates can be obtained for test losses [Sol98, SL13]. However, for supervised machine learning, this has limited relevance, as having all stochastic gradients equal to zero at the optimum essentially implies prediction problems with no uncertainty (that is, the output is a deterministic function of the input). Moreover, an exponential rate is only obtained for strongly-convex problems, which imposes a parametric noiseless problem and limits the applicability (even if the problem were noiseless, this can only reasonably happen in a nonparametric way, with neural networks or positive definite kernels). Our rates hold for noisy problems and for infinite-dimensional problems where we can hope to approach the optimal prediction function with large numbers of observations.
For prediction functions described by a reproducing kernel Hilbert space, and for the square loss, the excess testing loss (equal to the testing loss minus the minimal testing loss over all measurable prediction functions) is known to converge to zero at a subexponential rate (typically greater than O(1/n)) [DB16, DFB17], these rates being optimal for the estimation of testing losses.

INRIA - DÉPARTEMENT D'INFORMATIQUE DE L'ENS, ÉCOLE NORMALE SUPÉRIEURE, CNRS, INRIA, PSL RESEARCH UNIVERSITY, PARIS, FRANCE.
E-mail addresses: loucas.pillaud-vivien@inria.fr, alessandro.rudi@inria.fr, francis.bach@inria.fr.

Going back to the origins of supervised machine learning with binary labels, we will not consider getting to the optimal testing loss using a convex surrogate such as logistic, hinge or least-squares, but the testing error (number of mistakes in predictions), also referred to as the 0-1 loss. It is known that the excess testing error (testing error minus the minimal testing error over all measurable prediction functions) is upper bounded by a function of the excess testing loss [Zha04, BJM06], but always with a loss in the convergence rate (e.g., no difference or taking square roots). Thus a slow rate in O(1/√n) or O(1/n) on the excess loss leads to a slower rate on the excess testing error. Such general relationships between excess loss and excess error have been refined with the use of margin conditions, which characterize how hard the prediction problems are [MT99]. The simplest input points are points where the label is deterministic (i.e., conditional probabilities of the label are equal to zero or one), while the hardest points are the ones where the conditional probabilities are equal to 1/2. Margin conditions quantify the mass of input points which are hardest to predict, and lead to improved transfer functions from testing losses to testing errors, but still no exponential convergence rates [BJM06].

In this paper, we consider the strongest margin condition, that is, conditional probabilities bounded away from 1/2, but not necessarily equal to 0 or 1. This assumption on the learning problem has been used in the past to show that regularized empirical convex risk minimization leads to exponential convergence rates [AT07, KB05]. Our main contribution is to show that stochastic gradient descent also achieves similar rates (see an empirical illustration in Figure 2 in Appendix A). This requires several side contributions that are interesting on their own, namely a new and simple formalization of the learning problem that allows exponential rates of estimation (regardless of the algorithm used to find the estimator), and a new concentration result for averaged stochastic gradient descent (SGD) applied to least-squares, which is finer than existing work [BM13].

The paper is organized as follows: in Section 2, we present the learning set-up, namely binary classification with positive definite kernels, with a particular focus on the relationship between errors and losses. Our main results rely on a generic condition for which we give concrete examples in Section 3. In Section 4, we present our version of stochastic gradient descent, with the use of tail averaging [JKK+16], and provide new deviation inequalities, which we apply in Section 5 to our learning problem, leading to exponential convergence rates for the testing errors. We conclude in Section 6 by providing several avenues for future work. Finally, synthetic experiments illustrating our results can be found in Section A of the Appendix.

Main contributions of the paper. We would like to underline that our main contributions are the two following results: (a) we show in Theorem 4 the exponential convergence of stochastic gradient descent on the testing error, and (b) this result strongly rests on a new deviation inequality, stated in Corollary 1, for stochastic gradient descent on least-squares problems. This last result is interesting on its own and gives an improved high-probability bound which does not depend on the dimension of the problem and has a tighter dependence on the strong-convexity parameter, through the effective dimension of the problem, see [CDV07, DB16].

2. PROBLEM SET-UP

In this section, we present the general machine learning set-up, from generic assumptions to more specific assumptions.

2.1. Generic assumptions.
We consider a measurable set X and a probability distribution ρ on data (x, y) ∈ X × {−1, 1}; we denote by ρ_X the marginal probability on x, and by ρ(±1|x) the conditional probability that y = ±1 given x. We have E[y|x] = ρ(1|x) − ρ(−1|x). Our main margin condition is the following (and is independent of the learning framework):

(A1) |E[y|x]| ≥ δ almost surely, for some δ ∈ (0, 1].

This margin condition (often referred to as a low-noise condition) is commonly used in the theoretical study of binary classification [MT99, AT07, KB05], and usually takes the following form: for all δ > 0, P(|E[y|x]| < δ) = O(δ^α) for some α > 0. Here, however, δ is a fixed constant. Our stronger margin condition (A1) is necessary to show exponential convergence rates, but we also give explicit rates under the latter low-noise condition. This extension is derived in Appendix J, and more precisely in Corollary 4. Note that the smaller the α, the larger the mass of inputs with hard-to-predict labels. Our condition corresponds to α = +∞, and simply states that for all inputs the problem is never totally ambiguous, and the degree of non-ambiguity is

bounded from below by δ. When δ = 1, the label y ∈ {−1, 1} is a deterministic function of x, but our results apply for all δ ∈ (0, 1] and thus to noisy problems with low noise. Note that problems like image classification or object recognition are well characterized by (A1): indeed, the noise in classifying an image between two disparate classes (cars/pedestrians, bikes/airplanes) is usually much smaller than 1/2.

We will consider learning functions in a reproducing kernel Hilbert space (RKHS) H with kernel function K : X × X → R and dot-product ⟨·, ·⟩_H. We make the following standard assumptions on H:

(A2) H is a separable Hilbert space and there exists R > 0 such that, for all x ∈ X, K(x, x) ≤ R².

For x ∈ X, we consider the function K_x : X → R defined as K_x(x′) = K(x, x′). We have the classical reproducing property: for g ∈ H, g(x) = ⟨g, K_x⟩_H [STC04, SS02]. We will consider other norms beyond the RKHS norm ∥g∥_H, that is, the L²-norm (always with respect to ρ_X), defined as ∥g∥²_{L²} = ∫_X g(x)² dρ_X(x), as well as the L∞-norm ∥·∥_{L∞} on the support of ρ_X. A key property is that (A2) implies ∥g∥_{L∞} ≤ R ∥g∥_H. Finally, we will consider observations with standard assumptions:

(A3) The observations (x_n, y_n) ∈ X × {−1, 1}, n ≥ 1, are independent and identically distributed with respect to the distribution ρ.

2.2. Ridge regression. In this paper, we focus primarily on least-squares estimation to obtain estimators. We define g_* as the minimizer over L² of E(y − g(x))² = ∫_{X×{−1,1}} (y − g(x))² dρ(x, y). We always have g_*(x) = E[y|x] = ρ(1|x) − ρ(−1|x), but we do not require g_* ∈ H. We also consider the ridge regression problem [CDV07] and denote by g_λ the unique (when λ > 0) minimizer in H of E(y − g(x))² + λ ∥g∥²_H. The function g_λ always exists for λ > 0 and is always an element of H. When H is dense in L², our results depend on the L∞-error ∥g_λ − g_*∥_{L∞}, which is weaker than ∥g_λ − g_*∥_H, which itself only exists when g_* ∈ H (which we do not assume). When H is not dense, we simply define g_* as the orthogonal projection, for the L² norm, of E[y|x] onto the closure of H in L², so that our bound will then depend on ∥g_λ − g_*∥_{L∞}. Note that g_* is then the minimizer of E(y − g(x))² with respect to g in the closure of H in L². Moreover, our main technical assumption is:

(A4) There exists λ > 0 such that, almost surely, sign(E[y|x]) g_λ(x) ≥ δ/2.

In the assumption above, we could replace δ/2 by any multiplicative constant in (0, 1) times δ, instead of 1/2. Note that with (A4), λ depends on δ and on the probability measure ρ, which are both fixed (respectively by (A1) and by the problem), so that λ is fixed too. It implies that for any estimator ĝ such that ∥g_λ − ĝ∥_{L∞} < δ/2, the predictions from ĝ, obtained by taking the sign of ĝ(x) for any x, are the same as the sign of the optimal prediction sign(E[y|x]). Note that a sufficient condition is ∥g_λ − ĝ∥_H < δ/(2R), which does not assume that g_* ∈ H (see the next subsection). Note that, more generally, for all problems for which (A1) is true and ridge regression in the population case is consistent, so that ∥g_λ − g_*∥_{L∞} tends to zero as λ tends to zero, then (A4) is satisfied: indeed ∥g_λ − g_*∥_{L∞} ≤ δ/2 for λ small enough, which together with (A1) implies (A4).
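To make (A4) concrete, here is a minimal numerical sketch (not from the paper) of the check it asks for: given the values of a candidate g_λ and of the conditional mean E[y|x] on a grid of points, verify that sign(E[y|x]) g_λ(x) ≥ δ/2 everywhere. The function name and the toy example are ours.

```python
import numpy as np

def satisfies_A4(g_lambda_vals, cond_mean_vals, delta):
    """Check sign(E[y|x]) * g_lambda(x) >= delta/2 on a grid of points.

    g_lambda_vals: values of the ridge-regression function g_lambda on the grid.
    cond_mean_vals: values of E[y|x] on the same grid (assumed to satisfy (A1), i.e. |E[y|x]| >= delta).
    """
    return np.all(np.sign(cond_mean_vals) * g_lambda_vals >= delta / 2)

# Toy illustration: a margin problem on [0, 1] where g_lambda slightly shrinks E[y|x].
x = np.linspace(0.0, 1.0, 201)
cond_mean = np.where(x < 0.5, 0.9, -0.9)         # |E[y|x]| = 0.9, so (A1) holds with delta = 0.9
g_lam = 0.8 * cond_mean                          # a shrunk estimate playing the role of g_lambda
print(satisfies_A4(g_lam, cond_mean, delta=0.9)) # True: 0.8 * 0.9 = 0.72 >= 0.45
```

In practice g_λ is not available in closed form in general; Appendix A describes a one-dimensional setting where it can be computed explicitly.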
In Section 3, we provide concrete examples where (A4) is satisfied, and we then present the SGD algorithm and our convergence results. Before that, we relate excess testing losses to excess testing errors.

2.3. From testing losses to testing errors. Here we provide some results that will be useful to prove exponential rates for classification with the squared loss and stochastic gradient descent. First we define the 0-1 loss defining the classification error: R(g) = ρ({(x, y) : sign(g(x)) ≠ y}), where sign(u) = +1 for u ≥ 0 and −1 for u < 0. In particular, denote by R* the so-called Bayes risk R* = R(E[y|x]), which is the minimum achievable classification error [DGL13]. A well-known approach to bound testing errors by testing losses is via transfer functions. In particular, we recall the following result [DGL13, BJM06]: let g_*(x) be equal to E[y|x] a.e.; then, for any g ∈ L²(dρ_X),
R(g) − R* ≤ φ(∥g − g_*∥²_{L²}),

with φ(u) = √u or φ(u) = u^β, with β ∈ [1/2, 1), depending on some properties of ρ [BJM06]. While this result does not require (A1) or (A4), it does not readily lead to exponential rates, since the squared-loss excess risk has minimax lower bounds that are polynomial in n [CDV07]. Here we follow a different approach, requiring (via (A4)) the existence of g_λ having the same sign as g_* and with absolute value uniformly bounded from below. Then we can bound the 0-1 error with respect to the distance in H of the estimator ĝ from g_λ, as shown in the next lemma (proof in Appendix C). This will lead to exponential rates when the distribution satisfies the margin condition (A1), as we prove in the next section and in Section 5. Note also that, for the sake of completeness, we recall in Appendix D that exponential rates can be achieved for kernel ridge regression.

Lemma 1 (From approximately correct sign to 0-1 error). Let q ∈ (0, 1). Under (A1), (A2), (A4), let ĝ ∈ H be a random function such that ∥ĝ − g_λ∥_H < δ/(2R) with probability at least 1 − q. Then R(ĝ) = R* with probability at least 1 − q, and in particular E[R(ĝ)] − R* ≤ q.

In the next section we provide sufficient conditions and explicit settings naturally satisfying (A4).

3. CONCRETE EXAMPLES AND RELATED WORK

In this section we illustrate specific settings that naturally satisfy (A4). We start with the following simple result, showing that the existence of g_* ∈ H such that g_*(x) = E[y|x] a.e. on the support of ρ_X is sufficient to have (A4) (proof in Appendix E.1).

Proposition 1. Under (A1), assume that there exists g_* ∈ H such that g_*(x) := E[y|x] on the support of ρ_X; then for any δ there exists λ > 0 satisfying (A4), that is, sign(E[y|x]) g_λ(x) ≥ δ/2.

We are going to use the proposition above to derive more specific settings. In particular, we consider the case where the positive and negative classes are separated by a margin that is strictly positive. Let X ⊆ R^d and denote by S the support of the probability ρ_X, by S_+ = {x ∈ X : g_*(x) > 0} the part associated to the positive class, and by S_− the one associated with the negative class. Consider the following assumption:

(A5) There exists µ > 0 such that min_{x ∈ S_+, x′ ∈ S_−} ∥x − x′∥ ≥ µ.

Denote by W^{s,2} the Sobolev space of order s, defined with respect to the L² norm on R^d (see [AF03] and Appendix E.2). We also introduce the following assumption:

(A6) X ⊆ R^d and the kernel is such that W^{s,2} ⊆ H, with s > d/2.

An example of kernel such that H = W^{s,2}, with s > d/2, is the Abel kernel K(x, x′) = e^{−σ∥x−x′∥}, for σ > 0. In the following proposition we show that if there exist two functions in H, one matching E[y|x] on S_+ and the second matching E[y|x] on S_−, and if the kernel satisfies (A6), then (A4) is satisfied.

Proposition 2. Under (A1), (A5), (A6), if there exist two functions g_+, g_− ∈ W^{s,2} such that g_+(x) = E[y|x] on S_+ and g_−(x) = E[y|x] on S_−, then (A4) is satisfied.

Finally, we introduce another setting where (A4) is naturally satisfied (the proof of the proposition above and the example below are given in Appendix E.2).

Example 1 (Independent noise on the labels). Let ρ_X be a probability distribution on X ⊆ R^d and let S_+, S_− ⊆ X be a partition of the support of ρ_X satisfying ρ_X(S_+), ρ_X(S_−) > 0 and (A5). Let n ≥ 1. For i ≤ n, let x_i be independently sampled from ρ_X and the label y_i defined by the law
y_i = ζ_i if x_i ∈ S_+,   y_i = −ζ_i if x_i ∈ S_−,
with ζ_i independently distributed as ζ_i = −1 with probability p ∈ [0, 1/2) and ζ_i = 1 with probability 1 − p. Then (A1) is satisfied with δ = 1 − 2p, and (A4) is satisfied as soon as (A2) and (A6) are, that is, the kernel is bounded and H is rich enough (see an example in Appendix E, Figure 4).
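As an illustration, the following sketch samples data according to Example 1 on X = [0, 1], with the two classes separated by a margin ε and labels flipped independently with probability p, so that (A1) holds with δ = 1 − 2p. This is our own toy implementation, not the authors' code; the uniform marginal and the parameter values are illustrative choices.

```python
import numpy as np

def sample_example_1(n, p=0.1, eps=0.05, seed=None):
    """Sample (x_i, y_i) as in Example 1 on X = [0, 1]:
    S_+ = [0, (1-eps)/2], S_- = [(1+eps)/2, 1], labels flipped independently with prob. p."""
    rng = np.random.default_rng(seed)
    side = rng.integers(0, 2, size=n)                  # 0 -> S_+, 1 -> S_- (equal mass here)
    u = rng.uniform(0.0, (1.0 - eps) / 2.0, size=n)
    x = np.where(side == 0, u, u + (1.0 + eps) / 2.0)
    zeta = np.where(rng.random(n) < p, -1, 1)          # independent label noise
    y = np.where(side == 0, zeta, -zeta)
    return x, y

x, y = sample_example_1(2000, p=0.1, seed=0)
# (A1) holds with delta = 1 - 2p: here |E[y|x]| = 0.8 for every x in the support.
print("empirical P(y = +1 | x in S_+):", np.mean(y[x <= 0.475] == 1))
```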

Finally, note that the results of this section can easily be generalized from X = R^d to any Polish space, by using a separating kernel [DVRT14, RCDVR14] instead of (A6).

4. STOCHASTIC GRADIENT DESCENT

We now consider the stochastic gradient algorithm to solve the ridge regression problem with a fixed, strictly positive regularization parameter λ. We consider solving the regularized problem with regularization λ∥g − g_0∥²_H through stochastic approximation, starting from a function g_0 ∈ H (typically 0)¹. Denote by F : H → R the functional F(g) = E(Y − g(X))² = E(Y − ⟨K_X, g⟩)², where the last identity is due to the reproducing property of the RKHS H. Note that F has the gradient ∇F(g) = −2 E[(Y − ⟨K_X, g⟩) K_X]. We also consider F_λ = F + λ∥· − g_0∥²_H, for which ∇F_λ(g) = ∇F(g) + 2λ(g − g_0), and we have, for each pair of observations (x_n, y_n), that F_λ(g) = E[F_{n,λ}(g)], with F_{n,λ}(g) = (⟨g, K_{x_n}⟩ − y_n)² + λ∥g − g_0∥²_H.

Denoting by Σ = E[K_x ⊗ K_x] the covariance operator, defined as a linear operator from H to H (see [FBJ04] and references therein), we have the optimality conditions for g_λ and g_*:
Σ g_λ − E[y K_x] + λ(g_λ − g_0) = 0,   E[(y − g_*(x)) K_x] = 0,
see [CDV07] or Appendix F.1 for the proof of the last identity. Let (γ_n) be a positive sequence; we consider the stochastic gradient recursion² in H started at g_0:
g_n = g_{n−1} − (γ_n/2) ∇F_{n,λ}(g_{n−1}) = g_{n−1} − γ_n [(⟨K_{x_n}, g_{n−1}⟩ − y_n) K_{x_n} + λ(g_{n−1} − g_0)].   (1)
We are going to consider Polyak-Ruppert averaging [PJ92], that is, ḡ_n = (1/(n+1)) Σ_{i=0}^n g_i, as well as the tail-averaging estimate ḡ_n^tail = (2/n) Σ_{i=n/2+1}^n g_i, studied by [JKK+16]. For the sake of clarity, all the results in the main text are for the tail-averaged estimate, but note that all of them have also been proved for the full average in Appendix I. As explained earlier (see Lemma 1), we need to show the convergence of g_n to g_λ in H-norm. We are going to consider two cases: (1) for the non-averaged recursion, (γ_n) is a decreasing sequence, with the important particular case γ_n = γ/n^α for α ∈ (0, 1); (2) for the averaged or tail-averaged functions, (γ_n) is a constant sequence equal to γ. For all the proofs of this section, see Appendix G. In the next subsection we reformulate the recursion in Eq. (1) as a least-squares recursion converging to g_λ.

4.1. Reformulation as a noisy recursion. We can first reformulate the SGD recursion in Eq. (1) as a regular least-squares SGD recursion with noise, with the notation ξ_n = y_n − g_*(x_n), which satisfies E[ξ_n K_{x_n}] = 0. This is the object of the following lemma (for the proof see Appendix F.2):

Lemma 2. The SGD recursion (1) can be rewritten as follows:
g_n − g_λ = (I − γ_n (K_{x_n} ⊗ K_{x_n} + λI)) (g_{n−1} − g_λ) + γ_n ε_n,   (2)
with the noise term ε_k = ξ_k K_{x_k} + (g_*(x_k) − g_λ(x_k)) K_{x_k} − E[(g_*(x_k) − g_λ(x_k)) K_{x_k}] ∈ H.

We are thus in the presence of a least-squares problem in the Hilbert space H, to estimate a function g_λ ∈ H with a specific noise ε_n in the gradient and feature vector K_{x_n}. In the next section, we consider the generic recursion above, which requires some bounds on the noise. In our setting, we have the following almost sure bounds on the noise (see Lemma 9 of Appendix G):
∥ε_n∥_H ≤ R(1 + 2∥g_* − g_λ∥_{L∞}),   E[ε_n ⊗ ε_n] ≼ 2(1 + ∥g_* − g_λ∥²_{L∞}) Σ,
where Σ = E[K_x ⊗ K_x] is the covariance operator.

¹ Note that g_0 is the initialization of the recursion, and is not the limit of g_λ when λ tends to zero (this limit being g_*).
² The complexity of n steps of the recursion is O(n²) if using kernel functions, or O(nτ) when using explicit feature representations, with τ the complexity of computing dot-products and adding feature vectors.
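The recursion in Eq. (1) is straightforward to implement. Below is a minimal sketch written with an explicit finite-dimensional feature map (the setting of footnote 2), together with the Polyak-Ruppert and tail averages used later; the function name, the constant step-size and the toy data are our own choices, not the paper's.

```python
import numpy as np

def sgd_ridge(features, y, gamma, lam, w0=None):
    """Regularized least-squares SGD of Eq. (1), written with an explicit feature map
    phi(x) in R^d: one pass over the data with constant step-size gamma.
    Returns the last iterate, the Polyak-Ruppert average and the tail average."""
    n, d = features.shape
    w = np.zeros(d) if w0 is None else w0.copy()
    w0 = w.copy()
    iterates = [w.copy()]
    for t in range(n):
        phi, yt = features[t], y[t]
        grad = (phi @ w - yt) * phi + lam * (w - w0)   # half-gradient of F_{t,lambda}
        w = w - gamma * grad
        iterates.append(w.copy())
    iterates = np.array(iterates)                      # shape (n+1, d)
    w_avg = iterates.mean(axis=0)                      # Polyak-Ruppert average
    w_tail = iterates[n // 2 + 1:].mean(axis=0)        # tail average over the last n/2 iterates
    return w, w_avg, w_tail

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)); y = np.sign(X[:, 0])
w_last, w_avg, w_tail = sgd_ridge(X, y, gamma=0.1, lam=0.01)
```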

4.2. SGD for general least-squares problems. We now consider results on averaged SGD for least-squares that are interesting on their own. As said before, we show results in two different settings depending on the step-size sequence: first, we consider (γ_n) as a decreasing sequence; second, we take γ_n constant but prove the convergence of the tail-averaged iterates. Since the results we need could be of interest even for finite-dimensional models, in this section we study the following general recursion:
η_n = (I − γ_n H_n) η_{n−1} + γ_n ε_n.   (3)
We make the following assumptions:

(H1) We start at some η_0 ∈ H.
(H2) (H_n, ε_n) are i.i.d., and H_n is a positive self-adjoint operator such that, almost surely, H_n ≽ λI; we denote H := E[H_n].
(H3) Noise: E[ε_n] = 0, ∥ε_n∥_H ≤ c^{1/2} almost surely, and E[ε_n ⊗ ε_n] ≼ C, with C commuting with H. Note that one consequence of this assumption is E∥ε_n∥²_H ≤ tr(C).
(H4) For all n, E[H_n C H_n] ≼ (1/γ_0) C and γ_n ≤ γ_0.
(H5) A is a positive self-adjoint operator which commutes with H.

Note that we will later apply the results of this section to H_n = K_{x_n} ⊗ K_{x_n} + λI, H = Σ + λI, C = Σ and A ∈ {I, Σ}.

We first consider the non-averaged SGD recursion, then the tail-averaged recursion. The key difference with existing bounds is the need for precise probabilistic deviation results. For least-squares, one can always separate the impact of the initial condition η_0 and of the noise terms ε_k, namely η_n = η_n^bias + η_n^variance, where η_n^bias is the recursion with no noise (ε_k = 0), and η_n^variance is the recursion started at η_0 = 0. The final performance is bounded by the sum of the two separate performances (see, e.g., [DB15]); hence all of our bounds depend on these two terms. See more details in Appendix G.

4.3. Non-averaged SGD. In this section, we prove results for the recursion defined by Eq. (3) in the case where γ_n = γ/n^α for α ∈ (0, 1). These results extend the ones of [BM11] by providing deviation inequalities, but are limited to least-squares. For general loss functions and the strongly-convex case, see also [KT09].

Theorem 1 (SGD, decreasing step-size γ_n = γ/n^α). Assume (H1), (H2), (H3), γ_n = γ/n^α with γλ < 1, and denote by η_n ∈ H the n-th iterate of the recursion in Eq. (3). We have, for t > 0 and α ∈ (0, 1),
∥η_n∥_H ≤ exp(−γλ (n+1)^{1−α}/(1−α)) ∥η_0∥_H + V_n,
almost surely for n large enough, with
P(V_n ≥ t) ≤ 2 exp(− n^α t² / (8 (γ tr(C)/λ + γ c^{1/2} t))).

We can make the following observations:
- The proof technique (see Appendix G.1 for the detailed proof) relies on the following scheme: we notice that η_n can be decomposed into two terms, (a) the bias, obtained from a product of contracting operators, and (b) the variance, a sum of increments of a martingale, and we treat the two terms separately. For the second one, we prove almost sure bounds on the increments and on the variance, which lead to a Bernstein-type concentration result on the tail P(V_n ≥ t). Following this proof technique, the coefficient in the latter exponential is composed of the variance bound plus the almost sure bound on the martingale increments times t.
- Note that we only presented in Theorem 1 the case α ∈ (0, 1). Indeed, we only focused on the case where we have exponential convergence (see the whole result in the Appendix, Proposition 6). Actually, there are three different regimes. For α = 0 (constant step-size), the algorithm is not converging, as the tail probability bound on P(V_n ≥ t) does not depend on n. For α = 1, confirming results from [BM11], there is no exponential forgetting of initial conditions. And for α ∈ (0, 1), the forgetting of initial conditions and the tail probability converge to zero exponentially fast, respectively as exp(−C n^{1−α}) and exp(−C n^α) for a constant C; hence the natural choice of α = 1/2 in our experiments.
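The three regimes described above are easy to visualize numerically. The following sketch simulates the abstract recursion (3) in R^d for toy choices of H_n (a rank-one term plus λI, so that H_n ≽ λI) and of bounded zero-mean noise; it is only an illustration of the statement, not part of any proof, and all parameter values are arbitrary.

```python
import numpy as np

def run_recursion(n, d=20, gamma=0.5, lam=0.1, alpha=0.5, noise=0.1, seed=0):
    """Simulate eta_t = (I - gamma_t H_t) eta_{t-1} + gamma_t eps_t with
    H_t = h_t h_t^T + lam*I (so H_t >= lam*I) and bounded zero-mean noise eps_t."""
    rng = np.random.default_rng(seed)
    eta = np.ones(d)                                   # initial condition eta_0
    norms = []
    for t in range(1, n + 1):
        gamma_t = gamma / t ** alpha
        h = rng.normal(size=d) / np.sqrt(d)
        eps = noise * rng.uniform(-1, 1, size=d)
        eta = eta - gamma_t * (h * (h @ eta) + lam * eta) + gamma_t * eps
        norms.append(np.linalg.norm(eta))
    return np.array(norms)

# alpha = 0: the tail does not shrink with n; alpha = 1: slow forgetting of eta_0;
# alpha in (0, 1): both terms decay, the natural trade-off being alpha = 1/2.
for alpha in [0.0, 0.5, 1.0]:
    print(alpha, run_recursion(5000, alpha=alpha)[-1])
```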

4.4. Averaged and tail-averaged SGD with constant step-size. In this subsection, we take γ_n = γ for all n. We first start with a result on the variance term, whose proof extends the work of [DFB17] to deviation inequalities that are sharper than the ones from [BM13].

Theorem 2 (Convergence of the variance term in averaged SGD). Assume (H1), (H2), (H3), (H4), (H5) and consider the average of the n+1 first iterates of the sequence defined in Eq. (3): η̄_n = (1/(n+1)) Σ_{i=0}^n η_i. Assume η_0 = 0. We have, for t > 0 and n ≥ 1,
P(∥A^{1/2} η̄_n∥_H ≥ t) ≤ 2 exp(−(n+1) t² / E_t),   (4)
where E_t is defined with respect to the constants introduced in the assumptions:
E_t = 4 tr(A H^{−2} C) + (2 c^{1/2} ∥A^{1/2}∥_op / (3λ)) t.   (5)

The work that remains to be done is to bound the bias term of the recursion, η_n^bias. We have done it for the full averaged sequence (see Appendix I, Theorem 6), but as it is quite technical and could lower a bit the clarity of the reasoning, we have decided to leave it in the Appendix. We present here another approach and consider the tail-averaged recursion η̄_n^tail = (2/n) Σ_{i=n/2+1}^n η_i, as proposed by [JKK+16, Sha11]. For this, we use the simple almost sure bound ∥η_i^bias∥_H ≤ (1 − λγ)^i ∥η_0∥_H, such that ∥η̄_n^{tail,bias}∥_H ≤ (1 − λγ)^{n/2} ∥η_0∥_H. For the variance term, we can simply use the result above for n and n/2, as η̄_n^tail = 2 η̄_n − η̄_{n/2}. This leads to:

Corollary 1 (Convergence of tail-averaged SGD). Assume (H1), (H2), (H3), (H4), (H5) and consider the tail-average of the sequence defined in Eq. (3): η̄_n^tail = (2/n) Σ_{i=n/2+1}^n η_i. We have, for t > 0 and n ≥ 1,
∥A^{1/2} η̄_n^tail∥_H ≤ (1 − γλ)^{n/2} ∥A^{1/2}∥_op ∥η_0∥_H + L_n,   (6)
with P(L_n ≥ t) ≤ 4 exp(−(n+1) t² / (4 E_t)),   (7)
where L_n is defined in the proof (see Appendix G.3) and is the variance term of the tail-averaged recursion.

We can make the following observations on the two previous results:
- The proof technique (see Appendices G.2 and G.3 for the detailed proofs) relies on a concentration inequality of Bernstein type. Indeed, in the setting of Theorem 2, η̄_n is a sum of increments of a martingale. We prove almost sure bounds on the increments and on the variance, following the proof technique of [DFB17], which lead to a Bernstein-type concentration result on the tail. Following the proof technique summed up before, we see that E_t is composed of the variance bound plus the almost sure bound times t.
- Remark that, classically, A and C are proportional to H for excess risk predictions. In the finite d-dimensional setting this leads to the usual variance bound proportional to the dimension d: tr(A H^{−2} C) = tr(I) = d.
- The result is general in the sense that we can apply it for all operators A commuting with H (this can be used to prove results in L² or in H).
- Finally, note that we improved the variance bound with respect to the strong-convexity parameter λ, which is usually of the order 1/λ² (see [Sha11]) and is here tr(A H^{−2} C). Indeed, in our setting we will apply it with A = C = Σ and H = Σ + λI, so that tr(A H^{−2} C) is upper bounded by the effective dimension tr(Σ(Σ + λI)^{−1}), which can be much smaller than 1/λ² (see [CDV07, DB16]).
- The complete proof for the full average is written in Appendix I.1, and more precisely in Theorem 6. In this case, however, the initial conditions are not forgotten exponentially fast.

5. EXPONENTIALLY CONVERGENT SGD FOR CLASSIFICATION ERROR

In this section we show our main results, on the error made on unseen data by the n-th iterate of the regularized SGD algorithm. Hence, we go back to the original SGD recursion defined in Eq. (2). Let us recall it:
g_n − g_λ = (I − γ_n (K_{x_n} ⊗ K_{x_n} + λI)) (g_{n−1} − g_λ) + γ_n ε_n,
with the noise term ε_k = ξ_k K_{x_k} + (g_*(x_k) − g_λ(x_k)) K_{x_k} − E[(g_*(x_k) − g_λ(x_k)) K_{x_k}] ∈ H. As in the previous section, we state two results in two different settings, the first one for SGD with decreasing step-size γ_n = γ/n^α, and the second one for tail-averaged SGD with constant step-size. For all the proofs of this section, see Appendix H.
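The effective dimension tr(Σ(Σ + λI)^{−1}) mentioned in the observations above (and appearing again through tr(Σ(Σ + λI)^{−2}) in the constant K_R of Theorem 4 below) is easy to compute from the eigenvalues of Σ. The short sketch below, with a hypothetical polynomially decaying spectrum, illustrates how much smaller it can be than the worst-case 1/λ².

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """tr(Sigma (Sigma + lam I)^{-1}), computed from the eigenvalues of the covariance operator."""
    eigvals = np.asarray(eigvals, dtype=float)
    return np.sum(eigvals / (eigvals + lam))

# Toy spectrum decaying as j^{-2}: the effective dimension grows roughly like lam^{-1/2},
# much more slowly than the 1/lam^2 appearing in worst-case bounds.
sigma = 1.0 / np.arange(1, 10001) ** 2
for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, effective_dimension(sigma, lam), 1.0 / lam ** 2)
```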

5.1. SGD with decreasing step-size. In this section, we focus on decreasing step-sizes γ_n = γ/n^α for α ∈ (0, 1), which lead to exponential convergence rates. Results for α = 1 and α = 0 can be derived in a similar way but do not lead to exponential rates.

Theorem 3. Assume (A1), (A2), (A3), (A4), and γ_n = γ/n^α with α ∈ (0, 1) and γλ < 1. Let g_n be the n-th iterate of the recursion defined in Eq. (2). As soon as n satisfies exp(−γλ (n+1)^{1−α}/(1−α)) ≤ δ/(5R ∥g_0 − g_λ∥_H), then R(g_n) = R* with probability at least 1 − 2 exp(−δ² n^α / C_R), with
C_R = 2^{α+7} γ R² (tr(Σ) + ∥g_* − g_λ∥²_{L∞})/λ + 8 γ R² δ (1 + 2∥g_* − g_λ∥_{L∞})/3,
and in particular E[R(g_n)] − R* ≤ 2 exp(−δ² n^α / C_R).

Note that Theorem 3 shows that, with probability at least 1 − 2 exp(−δ² n^α / C_R), the predictions of g_n are perfect. We can also make the following observations:
- The idea of the proof (see Appendix H.1 for the detailed proof) is the following: we know that as soon as ∥g_n − g_λ∥_H ≤ δ/(2R), the predictions of g_n are perfect (Lemma 1). We just have to apply Theorem 1 to the original SGD recursion and make sure to bound each term by δ/(4R).
- Similar results for non-averaged SGD could be derived beyond least-squares (e.g., for the hinge or logistic loss) using results from [KT09].
- Also note that the larger the α, the smaller the bound. However, it is only valid for n larger than a certain quantity depending on λγ. A good trade-off is α = 1/2, for which we get an excess error of 2 exp(−δ² n^{1/2}/C_R), valid as soon as n ≥ (log(10R ∥g_0 − g_λ∥_H/δ))²/(4λ²γ²). Notice also that we should take γλ as large as possible to increase the factor in the exponential and make the condition on n hold as soon as possible.
- If we want to emphasize the dependence of the bound on the important parameters, we can write that E[R(g_n)] − R* ≤ 2 exp(−λ δ² n^α / R²).
- When the condition on n is not met, we still have the usual bound obtained by working directly with the excess loss [BJM06], but we lose exponential convergence.

5.2. Tail-averaged SGD with constant step-size. We now consider the tail-averaged recursion (the corresponding full-averaging result is proved in Appendix I.2, Theorem 7), with the following result:

Theorem 4. Assume (A1), (A2), (A3), (A4), and γ_n = γ for any n, with γλ < 1 and γ ≤ γ_0 = 1/(R² + 2λ). Let g_n be the n-th iterate of the recursion defined in Eq. (2), and ḡ_n^tail = (2/n) Σ_{i=n/2+1}^n g_i. As soon as n ≥ (2/(γλ)) ln(5R ∥g_0 − g_λ∥_H/δ), then R(ḡ_n^tail) = R* with probability at least 1 − 4 exp(−δ² (n+1)/K_R), with
K_R = 2⁹ (R² (1 + ∥g_* − g_λ∥²_{L∞}) tr(Σ(Σ + λI)^{−2}) + δ R (1 + 2∥g_* − g_λ∥_{L∞})/(3λ)),
and in particular E[R(ḡ_n^tail)] − R* ≤ 4 exp(−δ² (n+1)/K_R).

Theorem 4 shows that, with probability at least 1 − 4 exp(−δ² (n+1)/K_R), the predictions of ḡ_n^tail are perfect. We can also make the following observations:
- The idea of the proof (see Appendix H.2 for the detailed proof) is the following: we know that as soon as ∥ḡ_n^tail − g_λ∥_H ≤ δ/(2R), the predictions of ḡ_n^tail are perfect (Lemma 1). We just have to apply Corollary 1 to the original SGD recursion, and make sure to bound each term by δ/(4R).

- If we want to emphasize the dependence of the bound on the important parameters, we can write that E[R(ḡ_n^tail)] − R* ≤ 2 exp(−n λ² δ² / R⁴). Note that the λ² could be made much smaller under assumptions on the decay of the eigenvalues of Σ: it has been shown [CDV07] that if the decay happens at speed 1/i^β, then tr(Σ(Σ + λI)^{−2}) ≤ (1/λ) tr(Σ(Σ + λI)^{−1}) ≤ (R²/λ)^{1+1/β}.
- We want to take γλ as big as possible to satisfy the condition on n quickly. In comparison to the convergence rate in the case of decreasing step-sizes, the dependence on n is improved, as the convergence is really an exponential of n and not of some power of n as in the previous result.
- Finally, the complete proof for the full average is contained in Appendix I.2, and more precisely in Theorem 7.

6. CONCLUSION

In this paper, we have shown that stochastic gradient descent can be exponentially convergent once some margin conditions are assumed; and even under a weaker margin condition, fast rates can be achieved (see Appendix J). This is obtained by running averaged stochastic gradient on a least-squares problem, and proving new deviation inequalities. Our work could be extended in several natural ways: (a) our work relies on new concentration results for the least-mean-squares algorithm (i.e., SGD for the square loss), and it is natural to extend it to other losses, such as the logistic or hinge loss; (b) going beyond binary classification is also natural, with the square loss [CRR16, OBLJ17] or without [TCKG05]; (c) in our experiments, we use regularization, but we have experimented with unregularized recursions, which do exhibit fast convergence, but for which proofs are usually harder [DB16]; finally, (d) in order to avoid the O(n²) complexity, extending the results of [RCR17, RR17] would lead to a subquadratic complexity.

ACKNOWLEDGEMENTS

We acknowledge support from the European Research Council (grant SEQUOIA). We would like to thank Raphaël Berthier for useful discussions.

REFERENCES

[AF03] Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140. Academic Press, 2003.
[AT07] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2), 2007.
[BJM06] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[BLC05] L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2), 2005.
[BM11] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), 2011.
[BM13] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
[CDV07] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.
[CRR16] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems, 2016.
[DB15] A. Défossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal sampling distributions. In Proc. AISTATS, 2015.
[DB16] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4), 2016.
[DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
[DFB17] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 2017.
[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[DVRT14] Ernesto De Vito, Lorenzo Rosasco, and Alessandro Toigo. Learning sets with separating kernels. Applied and Computational Harmonic Analysis, 37(2):185-217, 2014.
[FBJ04] Kenji Fukumizu, Francis Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73-99, 2004.
[JKK+16] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. Technical report, arXiv, 2016.
[JZ13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

[KB05] Vladimir Koltchinskii and Olexandra Beznosova. Exponential convergence rates in classification. In International Conference on Computational Learning Theory. Springer, 2005.
[KT09] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, 2009.
[LSB12] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.
[MT99] Enno Mammen and Alexandre Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6), 1999.
[NJLS09] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009.
[NV08] Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6), 2008.
[NY83] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
[OBLJ17] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, 2017.
[Pin94] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 1994.
[PJ92] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 1992.
[RBV10] Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11(Feb), 2010.
[RCDVR14] Alessandro Rudi, Guillermo D. Canas, Ernesto De Vito, and Lorenzo Rosasco. Learning sets and subspaces. Regularization, Optimization, Kernels, and Support Vector Machines, 2014.
[RCR17] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, 2017.
[RM51] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22, 1951.
[RR17] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, 2017.
[Sha11] Ohad Shamir. Making gradient descent optimal for strongly convex stochastic optimization. CoRR, 2011.
[SL13] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. Technical report, arXiv, 2013.
[Sol98] Mikhail V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11:23-35, 1998.
[SS02] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[SSSS07] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning (ICML), 2007.
[STC04] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[TCKG05] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the International Conference on Machine Learning (ICML), 2005.
[Xia10] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 2010.
[YRC07] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.
[Zha04] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56-85, 2004.

Organization of the Appendix.
A. Experiments, where the experiments and their settings are explained.
B. Probabilistic lemmas, where the concentration inequalities in Hilbert spaces used in Section G are recalled.
C. From H to 0-1 loss, where, from high-probability bounds in H-norm, we derive bounds on the 0-1 error.
D. Proofs of exponential rates for kernel ridge regression, where exponential rates for kernel ridge regression are proven (Theorem 5).
E. Proofs and additional results about concrete examples, where additional results and concrete examples satisfying (A4) are given.
F. Preliminaries for stochastic gradient descent, where the SGD recursion is derived.
G. Proofs of stochastic gradient descent results, where high-probability bounds for the general SGD recursion are shown (Theorems 1 and 2).
H. Exponentially convergent SGD for classification error, where exponential convergence of the test error is shown (Theorems 3 and 4).
I. Extension to the full-averaged case, where the previous results are extended to full-averaged SGD instead of tail-averaged SGD.
J. Convergence under a weaker margin assumption, where the previous results are extended to the case of a weaker margin assumption.

APPENDIX A. EXPERIMENTS

To illustrate our results, we consider one-dimensional synthetic examples (X = [0, 1]) for which our assumptions are easily satisfied. Indeed, we consider the following set-up, which fulfils our assumptions:

(A1), (A3): We consider X ∼ U([0, (1−ε)/2] ∪ [(1+ε)/2, 1]) and, with the notations of Example 1, we take S_+ = [0, (1−ε)/2] and S_− = [(1+ε)/2, 1]. For i ≤ n, with x_i independently sampled from ρ_X, we define y_i = 1 if x_i ∈ S_+ and y_i = −1 if x_i ∈ S_−.

(A2): We take the kernel to be the exponential kernel K(x, x′) = exp(−|x − x′|), for which the RKHS is a Sobolev space H = W^{s,2} with s > d/2, which is dense in L² [AF03].

(A4): With this setting we could find a closed form for g_λ and check that it verifies (A4). Indeed, we could solve the optimality equation satisfied by g_λ: for all z ∈ [0, 1],
∫_0^1 K(x, z) g_λ(x) dρ_X(x) + λ g_λ(z) = ∫_0^1 K(x, z) g_ρ(x) dρ_X(x),
where g_ρ denotes E[y|x], the solution being a linear combination of exponentials on each of the sets [0, (1−ε)/2], [(1−ε)/2, (1+ε)/2] and [(1+ε)/2, 1].

FIGURE 1. Representing the ρ_X density (uniform with ε-margin), the best estimator, i.e., E[y|x], and the g_λ used for the simulations (λ = 0.01).

In the case of SGD with decreasing step-size, we computed only the test error E[R(g_n)] − R*. For tail-averaged SGD with constant step-size, we computed the test error as well as the training error, the test loss, which corresponds to the L² loss ∫_0^1 (g_n(x) − g_λ(x))² dρ(x), and the training loss.
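The optimality equation above can be solved approximately by discretizing the integral over ρ_X on a grid; the following sketch does so for the uniform-with-margin distribution and the exponential kernel of this section. It is a numerical approximation of the closed form used in the paper, with our own grid size and parameter values.

```python
import numpy as np

eps, lam = 0.05, 0.01
# grid on the support S_+ = [0, (1-eps)/2] and S_- = [(1+eps)/2, 1]
m = 200
grid = np.concatenate([np.linspace(0, (1 - eps) / 2, m),
                       np.linspace((1 + eps) / 2, 1, m)])
w = np.full(grid.size, 1.0 / grid.size)             # uniform weights approximating rho_X
g_rho = np.where(grid <= 0.5, 1.0, -1.0)            # E[y|x] on the support (noiseless case)
K = np.exp(-np.abs(grid[:, None] - grid[None, :]))  # exponential kernel of (A2)

# discretized optimality equation: (K diag(w) + lam I) g_lambda = K diag(w) g_rho
A = K * w[None, :] + lam * np.eye(grid.size)
g_lam = np.linalg.solve(A, (K * w[None, :]) @ g_rho)
print("min of sign(E[y|x]) * g_lambda on the grid:", np.min(np.sign(g_rho) * g_lam))
```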

In all cases we computed the errors of the n-th iterate with respect to the calculated g_λ, taking g_0 = 0. For any n ≥ 1,
g_n = g_{n−1} − γ [(g_{n−1}(x_n) − y_n) K_{x_n} + λ g_{n−1}].
We can use representants to find the recursion on the coefficients. Indeed, if g_n = Σ_{i=1}^n a_i^n K_{x_i}, then the following recursion for the (a_i^n) holds: for i ≤ n−1, a_i^n = (1 − γλ) a_i^{n−1}, and a_n^n = −γ (Σ_{i=1}^{n−1} a_i^{n−1} K(x_n, x_i) − y_n). From the (a_i^n), we can also compute the coefficients of ḡ_n and ḡ_n^tail, which we denote ā_i^n and ā_i^{n,tail} respectively: ā_i^n = (1/(n+1)) Σ_{k=i}^n a_i^k and ā_i^{n,tail} = (2/n) Σ_{k=n/2}^n a_i^k.

To illustrate our theoretical results we present the following figures:
- For the exponential convergence of the averaged and tail-averaged cases, we plotted the error log_10(E[R(g_n)] − R*) as a function of n. With this scale, and following our results, it behaves as a line after a certain n (Figures 2 and 3, right).
- We recover the results of [DFB17] that show convergence at speed 1/n for the loss (Figure 2, left). We adapted the scale to compare with the error plot.
- For Figure 3 (left), we plotted log_10(−log(E[R(g_n)] − R*)) of the excess error with respect to log(n) to show a line of slope 1/2. It matches our theoretical bound of the form exp(−K√n).

Note that for the plots showing the expected excess errors, i.e., E[R(g_n)] − R*, we plotted the mean of the errors over 1000 replications until n = 200, whereas for the plots showing the losses, i.e., a function of ∥g_n − g_*∥², we plotted the mean of the loss over 100 replications until n = 2000.

FIGURE 2. Showing linear convergence (in log scale) for the L² loss and the 0-1 error in the case of a margin of width ε. The left plot corresponds to the test and training loss in the averaged case, whereas the right one corresponds to the error in the same setting. Note that the y-axis is the same while the x-axis differs by a factor 10. The fact that the error plot is a line after a certain n matches our theoretical results. We took the following parameters: ε = 0.05, γ = 0.25, λ = 0.01.

We can make the following observations. First, remark that between the plots of losses and errors (Figure 2, left and right respectively), there is a factor 10 between the numbers of samples (200 for errors and 2000 for losses) and another factor 10 between errors and losses (10^{-4} for errors and 10^{-3} for losses). This underlines well our theoretical result, namely the difference between exponential rates of convergence for the excess error and the 1/n rate of convergence for the loss. Moreover, we see that even if the excess error with tail averaging seems a bit faster, we also have linear rates for the convergence of the excess error in the averaged case. Finally, we remark that the error on the train set is always below the one on an unknown test set, by what seems to be close to a factor 2.
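For completeness, here is a sketch of the coefficient recursion just described (the a_i^n, together with the averaged and tail-averaged coefficients). Variable names are ours, the average is normalized by n rather than n+1 (the initialization g_0 = 0 contributes nothing), and the kernel and parameters follow this section.

```python
import numpy as np

def kernel_sgd_coefficients(x, y, kernel, gamma, lam):
    """One pass of recursion (1) with g_0 = 0, storing g_n = sum_i a_i K_{x_i}.
    Also accumulates the coefficients of the averaged and tail-averaged iterates."""
    n = len(x)
    a = np.zeros(n)          # coefficients a_i^n of the current iterate
    a_sum = np.zeros(n)      # running sum over iterates, for the full average
    a_tail = np.zeros(n)     # sum over the last n/2 iterates, for the tail average
    for t in range(n):
        pred = a[:t] @ kernel(x[:t], x[t]) if t > 0 else 0.0  # g_{t-1}(x_t)
        a[:t] *= (1.0 - gamma * lam)                          # a_i^t = (1 - gamma*lam) a_i^{t-1}
        a[t] = -gamma * (pred - y[t])                         # new coefficient on K_{x_t}
        a_sum += a
        if t >= n // 2:
            a_tail += a
    return a, a_sum / n, a_tail / (n - n // 2)

k = lambda xs, z: np.exp(-np.abs(xs - z))                     # exponential kernel on [0, 1]
rng = np.random.default_rng(0)
x_data = rng.uniform(size=500); y_data = np.where(x_data < 0.5, 1.0, -1.0)
a_last, a_avg, a_tail = kernel_sgd_coefficients(x_data, y_data, k, gamma=0.25, lam=0.01)
```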

14 log logrg R test_error log log 0 Rg R test_error_average test_error_tail_average FIGURE 3. Left plot shows the error i the o-averaged case for γ = γ/ ad right compares the test error betwee averaged ad tail averaged case. We took the followig parameters : ε = 0.05, γ = 0.25, λ = 0.0. Propositio 3 Let X k k N be a sequece of vectors of H adapted to a o decreasig sequece of σ-fields F k such that E X k F k = 0, sup k X k a ad E X k 2 F k b 2 for some sequeces a, b N. R+ The, for all t 0,, P X k t 2 exp t a t + b2 a a 2 l + ta. 8 b Proof : As E X k F k = 0, the F j-adapted sequece f j defied by f j = j X k is a martigale ad so is the stopped-martigale f j. By applyig Theorem 3.4 of Pi94 to the martigale f j, we have the result. Corollary 2 Let X k k N be a sequece of vectors of H adapted to a o decreasig sequece of σ-fields F k such that E X k F k = 0, sup k X k a ad E X k 2 F k b 2 for some sequeces a, b N. R+ The, for all t 0,, t 2 P X k t 2 exp 2 b a t/3 Proof : We apply 3 ad simply otice that t t + b2 l + ta a a a 2 b = b2 + at a 2 b 2 at = b2 φ a 2 where φu = + u l + u u for u > 0. Moreover φu t t + b2 l + ta b2 a a a 2 b a 2 b 2, l + at at b 2 b 2 u 2, so that: 2 + u/3 a t/b a t/3b 2 = t 2 2 b 2 + a t/3. APPENDIX C. FROM H TO 0- LOSS I this sectio we prove Lemma. Note that A4 requires the existece of g λ havig the same sig of g almost everywhere o the support of ρ X ad with absolute value uiformly bouded from below. I Lemma we prove that we ca boud the 0- error with respect to the distace i H of the estimator ĝ form g λ. Proof of Lemma : Deote by W the evet such that ĝ gλ H < δ/2r. Note that for ay f H, for ay x X. So for ĝ W, we have fx = f, K x H Kx H f H R f H, ĝx g λ x R ĝ gλ H < δ/2 x X. 3

APPENDIX C. FROM H TO 0-1 LOSS

In this section we prove Lemma 1. Note that (A4) requires the existence of g_λ having the same sign as g_* almost everywhere on the support of ρ_X and with absolute value uniformly bounded from below. In Lemma 1 we prove that we can bound the 0-1 error with respect to the distance in H of the estimator ĝ from g_λ.

Proof of Lemma 1: Denote by W the event such that ∥ĝ − g_λ∥_H < δ/(2R). Note that for any f ∈ H and any x ∈ X, f(x) = ⟨f, K_x⟩_H ≤ ∥K_x∥_H ∥f∥_H ≤ R ∥f∥_H. So, on W, we have
|ĝ(x) − g_λ(x)| ≤ R ∥ĝ − g_λ∥_H < δ/2,   ∀x ∈ X.
Let x be in the support of ρ_X. By (A4), |g_λ(x)| ≥ δ/2 a.e. On W, for x ∈ X such that g_λ(x) > 0, we have ĝ(x) = g_λ(x) − (g_λ(x) − ĝ(x)) ≥ g_λ(x) − |g_λ(x) − ĝ(x)| > 0, so sign(ĝ(x)) = sign(g_λ(x)) = +1. Similarly, on W, for x ∈ X such that g_λ(x) < 0, we have ĝ(x) = g_λ(x) + (ĝ(x) − g_λ(x)) ≤ g_λ(x) + |g_λ(x) − ĝ(x)| < 0, so sign(ĝ(x)) = sign(g_λ(x)) = −1. Finally, by (A4), either g_λ(x) > 0 or g_λ(x) < 0 a.e., so sign(ĝ(x)) = sign(g_λ(x)) a.e.

Now note that by (A1) and (A4) we have sign(g_*(x)) = sign(g_λ(x)) a.e., where g_*(x) := E[y|x]. So, on W, sign(ĝ(x)) = sign(g_λ(x)) = sign(g_*(x)) a.e., and thus
R(ĝ) = ρ({(x, y) : sign(ĝ(x)) ≠ y}) = ρ({(x, y) : sign(g_*(x)) ≠ y}) = R*.
Finally, note that
E[R(ĝ)] = E[R(ĝ) 1_W] + E[R(ĝ) 1_{W^c}],
where 1_W is 1 on the set W and 0 outside, and W^c is the complement of W. On W we have E[R(ĝ) 1_W] = R* E[1_W] ≤ R*, while E[R(ĝ) 1_{W^c}] ≤ E[1_{W^c}] ≤ q.

APPENDIX D. EXPONENTIAL RATES FOR KERNEL RIDGE REGRESSION

D.1. Results. In this section, we first specialize some results already known in the literature about the consistency of kernel ridge least-squares regression (KRLS) in H-norm [CDV07], and then derive exponential classification learning rates. Let (x_i, y_i)_{i=1}^n be examples independently and identically distributed according to ρ, that is, Assumption (A3). Denote by Σ̂, Σ the linear operators on H defined by
Σ̂ = (1/n) Σ_{i=1}^n K_{x_i} ⊗ K_{x_i},   Σ = ∫_X K_x ⊗ K_x dρ_X(x),
referred to as the empirical (non-centered) covariance operator and the covariance operator (see [FBJ04] and references therein). We recall that the KRLS estimator ĝ_λ ∈ H, which minimizes the regularized empirical risk, is defined as follows in terms of Σ̂:
ĝ_λ = (Σ̂ + λI)^{−1} (1/n) Σ_{i=1}^n y_i K_{x_i}.
Moreover, we recall that the population regularized estimator g_λ is characterized by (see [CDV07])
g_λ = (Σ + λI)^{−1} E[y K_x].
The following lemma bounds the empirical regularized estimator with respect to the population one in terms of λ and n, and is essentially contained in the work of [CDV07]; here we rederive it in a subcase (see below for the proof).

Lemma 3. Under assumptions (A2), (A3), for any λ > 0, denoting u = ∥(1/n) Σ_{i=1}^n y_i K_{x_i} − E[y K_x]∥_H and v = ∥Σ̂ − Σ∥_op, we have
∥ĝ_λ − g_λ∥_H ≤ u/λ + R v/λ².

By using deviation inequalities for u, v in Lemma 3 and then applying Lemma 1, we obtain the following exponential bound for kernel ridge regression (see the complete proof below):

Theorem 5. Under (A1), (A2), (A3), (A4), we have that, for any n ≥ 1, R(ĝ_λ) = R* with probability at least 1 − 4 exp(−C_0 λ⁴ δ² n / R⁸). Moreover,
E[R(ĝ_λ)] − R* ≤ 4 exp(−C_0 λ⁴ δ² n / R⁸),   with C_0 := 1/(72(1 + λ/R²)²).
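In practice, ĝ_λ is computed through the n×n kernel matrix: writing ĝ_λ = Σ_i c_i K_{x_i}, the operator equation above reduces to the standard linear system (K + nλI) c = y with K_{ij} = K(x_i, x_j). The sketch below uses this identity on toy data with label noise; it is a minimal illustration, not the authors' implementation.

```python
import numpy as np

def krls_fit(x_train, y_train, kernel, lam):
    """Kernel ridge regression, g_hat_lambda = (Sigma_hat + lam I)^{-1} (1/n) sum_i y_i K_{x_i},
    computed through its coefficients c = (K + n*lam*I)^{-1} y on the training points."""
    K = kernel(x_train[:, None], x_train[None, :])
    n = len(y_train)
    c = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return lambda x_new: kernel(x_new[:, None], x_train[None, :]) @ c

kernel = lambda a, b: np.exp(-np.abs(a - b))        # a bounded kernel, as in (A2)
rng = np.random.default_rng(0)
x = rng.uniform(size=300)
y = np.where(x < 0.5, 1, -1) * np.where(rng.random(300) < 0.1, -1, 1)  # 10% label noise
g_hat = krls_fit(x, y, kernel, lam=0.01)
x_test = rng.uniform(size=1000)
print("test error:", np.mean(np.sign(g_hat(x_test)) != np.where(x_test < 0.5, 1, -1)))
```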

The result above is a refinement of Thm. 2.6 from [YRC07]: we improved the dependency in n and removed the requirement that g_* ∈ H or g_* = Σ^r w for some w ∈ L²(dρ_X) and r > 1/2. Similar results exist for losses that are usually considered more suitable for classification, like the hinge or logistic loss, and more generally for non-decreasing losses [KB05]. With respect to this latter work, our analysis uses the explicit characterization of the kernel ridge regression estimator in terms of linear operators on H [CDV07]. This, together with (A4), allows us to use analytic tools specific to reproducing kernel Hilbert spaces, leading to proofs that are comparatively simpler, with explicit constants and a clearer problem setting (consisting essentially in (A1), (A4) and no assumptions on E[y|x]). Finally, note that the exponent of λ could be reduced by using a refined analysis under additional regularity assumptions on ρ_X and E[y|x] (such as the source condition and intrinsic dimension from [CDV07]), but it is beyond the scope of this paper.

D.2. Proofs. Here we prove that kernel ridge regression achieves exponential classification rates under assumptions (A1), (A4). In particular, by Lemma 3 we bound ∥ĝ_λ − g_λ∥_H in high probability, and then we use Lemma 1, which gives exponential classification rates when ∥ĝ_λ − g_λ∥_H is small enough in high probability.

Proof of Lemma 3: Denote by Σ̂_λ the operator Σ̂ + λI and by Σ_λ the operator Σ + λI. We have
ĝ_λ − g_λ = Σ̂_λ^{−1} (1/n) Σ_{i=1}^n y_i K_{x_i} − Σ_λ^{−1} E[y K_x] = Σ̂_λ^{−1} ((1/n) Σ_{i=1}^n y_i K_{x_i} − E[y K_x]) + (Σ̂_λ^{−1} − Σ_λ^{−1}) E[y K_x].
For the first term, since ∥Σ̂_λ^{−1}∥_op ≤ 1/λ, we have
∥Σ̂_λ^{−1} ((1/n) Σ_i y_i K_{x_i} − E[y K_x])∥_H ≤ ∥Σ̂_λ^{−1}∥_op ∥(1/n) Σ_i y_i K_{x_i} − E[y K_x]∥_H ≤ (1/λ) ∥(1/n) Σ_i y_i K_{x_i} − E[y K_x]∥_H.
For the second term, since ∥Σ_λ^{−1}∥_op ≤ 1/λ and ∥E[y K_x]∥ ≤ E[|y| ∥K_x∥] ≤ R, we have
∥(Σ̂_λ^{−1} − Σ_λ^{−1}) E[y K_x]∥_H = ∥Σ̂_λ^{−1} (Σ − Σ̂) Σ_λ^{−1} E[y K_x]∥_H ≤ ∥Σ − Σ̂∥_op ∥Σ̂_λ^{−1}∥_op ∥Σ_λ^{−1} E[y K_x]∥_H ≤ (R/λ²) ∥Σ̂ − Σ∥_op.

Proof of Theorem 5: Let τ > 0. By Lemma 3 we know that ∥ĝ_λ − g_λ∥_H ≤ u/λ + R v/λ², with u = ∥(1/n) Σ_{i=1}^n y_i K_{x_i} − E[y K_x]∥_H and v = ∥Σ̂ − Σ∥_op. For u we can apply Pinelis' inequality (Thm. 3.5, [Pin94]), since the (x_i, y_i)_{i=1}^n are sampled independently according to the probability ρ and y_i K_{x_i} − E[y K_x] is zero mean. Since ∥y_i K_{x_i} − E[y K_x]∥_H ≤ 2R a.e. and H is a Hilbert space, we apply Pinelis' inequality with b² = 4R²/n and D = 1, obtaining u ≤ √(8R²τ/n) with probability at least 1 − 2e^{−τ}. Now denote by ∥·∥_HS the Hilbert-Schmidt norm and recall that ∥·∥_op ≤ ∥·∥_HS. To bound v we apply again the Pinelis inequality [RBV10], considering that the space of Hilbert-Schmidt operators is again a Hilbert space, that Σ̂ = (1/n) Σ_{i=1}^n K_{x_i} ⊗ K_{x_i}, that the (x_i)_{i=1}^n are independently sampled from ρ_X and that E[K_{x_i} ⊗ K_{x_i}] = Σ. In particular we apply it with D = 1 and b² = 4R⁴/n, so
v = ∥Σ̂ − Σ∥_op ≤ ∥Σ̂ − Σ∥_HS ≤ √(8R⁴τ/n),
with probability at least 1 − 2e^{−τ}. Finally, we take the intersection of the two events, obtaining, with probability at least 1 − 4e^{−τ},
∥ĝ_λ − g_λ∥_H ≤ √(8R²τ/(nλ²)) + √(8R⁶τ/(nλ⁴)).

By selecting τ = δ²n / (9R² (√(8R²/λ²) + √(8R⁶/λ⁴))²), we obtain ∥ĝ_λ − g_λ∥_H ≤ δ/(3R) ≤ δ/(2R), with probability at least 1 − 4e^{−τ}. Now we can apply Lemma 1 to obtain the exponential bound for the classification error.

APPENDIX E. PROOFS AND ADDITIONAL RESULTS ABOUT CONCRETE EXAMPLES

In the next subsection we prove that g_* ∈ H is sufficient to satisfy (A4), while in Subsection E.2 we prove that specific settings naturally satisfy (A4).

E.1. From g_* ∈ H to (A4). Here we assume that there exists g_* ∈ H such that g_*(x) = E[y|x] a.e. on the support of ρ_X. First we introduce A(λ), a quantity related to the approximation error of g_λ with respect to g_*, and we study its behavior when λ → 0. Then we express ∥g_λ − g_*∥_H in terms of A(λ). Finally we prove that for any δ given by (A1), there exists λ such that (A4) is satisfied.

Let (σ_t, u_t)_{t∈N} be an eigenbasis of Σ with σ_1 ≥ σ_2 ≥ ... ≥ 0, and let α_j = ⟨g_*, u_j⟩. We introduce the following quantity:
A(λ) = Σ_{t : σ_t ≤ λ} α_t².

Lemma 4. Under (A2), A(λ) is decreasing for any λ > 0 and lim_{λ→0} A(λ) = 0.

Proof: Under (A2) and by linearity of the trace, we have that
Σ_{j∈N} σ_j = tr(Σ) = tr(∫ K_x ⊗ K_x dρ_X(x)) = ∫ ⟨K_x, K_x⟩_H dρ_X(x) = ∫ K(x, x) dρ_X(x) ≤ R².
Denote by t_λ ∈ N the number min{t ∈ N : σ_t ≤ λ}. Since (σ_j)_{j∈N} is a non-increasing summable sequence, it converges to 0, so lim_{λ→0} t_λ = ∞. Finally, since (α_j²)_{j∈N} is a summable sequence, we have that
lim_{λ→0} A(λ) = lim_{λ→0} Σ_{t : σ_t ≤ λ} α_t² ≤ lim_{λ→0} Σ_{j ≥ t_λ} α_j² = 0.

Here we express ∥g_λ − g_*∥_H in terms of ∥g_*∥_H and of A(λ).

Lemma 5. Under (A2), for any λ > 0 we have
∥g_λ − g_*∥_H ≤ √(√λ ∥g_*∥²_H + A(√λ)).

Proof: Denote by Σ_λ the operator Σ + λI. Note that, since g_* ∈ H, the sequence (α_j²)_{j∈N} is summable. Moreover,
E[y K_x] = E[g_*(x) K_x] = E[(K_x ⊗ K_x) g_*] = Σ g_*,
so g_λ = Σ_λ^{−1} E[y K_x] = Σ_λ^{−1} Σ g_*. We thus have
∥g_λ − g_*∥_H = ∥Σ_λ^{−1} Σ g_* − g_*∥_H = ∥(Σ_λ^{−1} Σ − I) g_*∥_H = λ ∥Σ_λ^{−1} g_*∥_H.
Moreover,
λ ∥(Σ + λI)^{−1} g_*∥_H ≤ ∥√λ (Σ + λI)^{−1/2}∥_op ∥√λ (Σ + λI)^{−1/2} g_*∥_H ≤ ∥√λ (Σ + λI)^{−1/2} g_*∥_H.
Now we express ∥√λ (Σ + λI)^{−1/2} g_*∥_H in terms of A(λ). We have that
λ ∥(Σ + λI)^{−1/2} g_*∥²_H = λ ⟨g_*, (Σ + λI)^{−1} g_*⟩ = λ ⟨g_*, Σ_{j∈N} (σ_j + λ)^{−1} (u_j ⊗ u_j) g_*⟩ = Σ_{j∈N} λ α_j² / (σ_j + λ).


More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

TENSOR PRODUCTS AND PARTIAL TRACES

TENSOR PRODUCTS AND PARTIAL TRACES Lecture 2 TENSOR PRODUCTS AND PARTIAL TRACES Stéphae ATTAL Abstract This lecture cocers special aspects of Operator Theory which are of much use i Quatum Mechaics, i particular i the theory of Quatum Ope

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

Estimation of the essential supremum of a regression function

Estimation of the essential supremum of a regression function Estimatio of the essetial supremum of a regressio fuctio Michael ohler, Adam rzyżak 2, ad Harro Walk 3 Fachbereich Mathematik, Techische Uiversität Darmstadt, Schlossgartestr. 7, 64289 Darmstadt, Germay,

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECURE 4 his lecture is partly based o chapters 4-5 i [SSBD4]. Let us o give a variat of SGD for strogly covex fuctios. Algorithm SGD for strogly covex

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machies ad Kerel Methods Daiel Khashabi Fall 202 Last Update: September 26, 206 Itroductio I Support Vector Machies the goal is to fid a separator betwee data which has the largest margi,

More information

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We

More information

Testing the number of parameters with multidimensional MLP

Testing the number of parameters with multidimensional MLP Testig the umber of parameters with multidimesioal MLP Joseph Rykiewicz To cite this versio: Joseph Rykiewicz. Testig the umber of parameters with multidimesioal MLP. ASMDA 2005, 2005, Brest, Frace. pp.561-568,

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

arxiv: v1 [math.pr] 13 Oct 2011

arxiv: v1 [math.pr] 13 Oct 2011 A tail iequality for quadratic forms of subgaussia radom vectors Daiel Hsu, Sham M. Kakade,, ad Tog Zhag 3 arxiv:0.84v math.pr] 3 Oct 0 Microsoft Research New Eglad Departmet of Statistics, Wharto School,

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5 CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

EFFECTIVE WLLN, SLLN, AND CLT IN STATISTICAL MODELS

EFFECTIVE WLLN, SLLN, AND CLT IN STATISTICAL MODELS EFFECTIVE WLLN, SLLN, AND CLT IN STATISTICAL MODELS Ryszard Zieliński Ist Math Polish Acad Sc POBox 21, 00-956 Warszawa 10, Polad e-mail: rziel@impagovpl ABSTRACT Weak laws of large umbers (W LLN), strog

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression

Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression Joural of Machie Learig Research 204) 595-627 Submitted 0/3; Revised 2/3; Published 2/4 Adaptivity of Averaged Stochastic Gradiet Descet to Local Strog Covexity for Logistic Regressio Fracis Bach INRIA

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Self-normalized deviation inequalities with application to t-statistic

Self-normalized deviation inequalities with application to t-statistic Self-ormalized deviatio iequalities with applicatio to t-statistic Xiequa Fa Ceter for Applied Mathematics, Tiaji Uiversity, 30007 Tiaji, Chia Abstract Let ξ i i 1 be a sequece of idepedet ad symmetric

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information

CSE 527, Additional notes on MLE & EM

CSE 527, Additional notes on MLE & EM CSE 57 Lecture Notes: MLE & EM CSE 57, Additioal otes o MLE & EM Based o earlier otes by C. Grat & M. Narasimha Itroductio Last lecture we bega a examiatio of model based clusterig. This lecture will be

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Multi parameter proximal point algorithms

Multi parameter proximal point algorithms Multi parameter proximal poit algorithms Ogaeditse A. Boikayo a,b,, Gheorghe Moroşau a a Departmet of Mathematics ad its Applicatios Cetral Europea Uiversity Nador u. 9, H-1051 Budapest, Hugary b Departmet

More information

A Weak Law of Large Numbers Under Weak Mixing

A Weak Law of Large Numbers Under Weak Mixing A Weak Law of Large Numbers Uder Weak Mixig Bruce E. Hase Uiversity of Wiscosi Jauary 209 Abstract This paper presets a ew weak law of large umbers (WLLN) for heterogeous depedet processes ad arrays. The

More information

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data Doubly Stochastic Primal-Dual Coordiate Method for Regularized Empirical Risk Miimizatio with Factorized Data Adams Wei Yu, Qihag Li, Tiabao Yag Caregie Mello Uiversity The Uiversity of Iowa weiyu@cs.cmu.edu,

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques

More information

The Wasserstein distances

The Wasserstein distances The Wasserstei distaces March 20, 2011 This documet presets the proof of the mai results we proved o Wasserstei distaces themselves (ad ot o curves i the Wasserstei space). I particular, triagle iequality

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric

More information

Optimization Results for a Generalized Coupon Collector Problem

Optimization Results for a Generalized Coupon Collector Problem Optimizatio Results for a Geeralized Coupo Collector Problem Emmauelle Aceaume, Ya Busel, E Schulte-Geers, B Sericola To cite this versio: Emmauelle Aceaume, Ya Busel, E Schulte-Geers, B Sericola. Optimizatio

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

1 Review and Overview

1 Review and Overview CS9T/STATS3: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #6 Scribe: Jay Whag ad Patrick Cho October 0, 08 Review ad Overview Recall i the last lecture that for ay family of scalar fuctios F, we

More information

Linear Classifiers III

Linear Classifiers III Uiversität Potsdam Istitut für Iformatik Lehrstuhl Maschielles Lere Liear Classifiers III Blaie Nelso, Tobias Scheffer Cotets Classificatio Problem Bayesia Classifier Decisio Liear Classifiers, MAP Models

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

Introduction to Optimization Techniques. How to Solve Equations

Introduction to Optimization Techniques. How to Solve Equations Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually

More information

Polynomial identity testing and global minimum cut

Polynomial identity testing and global minimum cut CHAPTER 6 Polyomial idetity testig ad global miimum cut I this lecture we will cosider two further problems that ca be solved usig probabilistic algorithms. I the first half, we will cosider the problem

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Regression with an Evaporating Logarithmic Trend

Regression with an Evaporating Logarithmic Trend Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

SOME GENERALIZATIONS OF OLIVIER S THEOREM

SOME GENERALIZATIONS OF OLIVIER S THEOREM SOME GENERALIZATIONS OF OLIVIER S THEOREM Alai Faisat, Sait-Étiee, Georges Grekos, Sait-Étiee, Ladislav Mišík Ostrava (Received Jauary 27, 2006) Abstract. Let a be a coverget series of positive real umbers.

More information

TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES

TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES M Sghiar To cite this versio: M Sghiar. TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES. Iteratioal

More information