Exponential convergence of testing error for stochastic gradient methods


Loucas Pillaud-Vivien, Alessandro Rudi, Francis Bach. Exponential convergence of testing error for stochastic gradient methods. HAL preprint (hal, version 2), submitted on 28 Jun 2018.

EXPONENTIAL CONVERGENCE OF TESTING ERROR FOR STOCHASTIC GRADIENT METHODS

LOUCAS PILLAUD-VIVIEN, ALESSANDRO RUDI, FRANCIS BACH

ABSTRACT. We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed. To achieve these rates of convergence we show sharper high-probability bounds with respect to the number of observations for stochastic gradient descent.

1. INTRODUCTION

Stochastic gradient methods are now ubiquitous in machine learning, both from the practical side, as a simple algorithm that can learn from a single or a few passes over the data [BLC05], and from the theoretical side, as they lead to optimal rates for estimation problems in a variety of situations [NY83, PJ92]. They follow a simple principle [RM51]: to find a minimizer of a function F defined on a vector space from noisy gradients, simply follow the negative stochastic gradient and the algorithm will converge to a stationary point, local minimum or global minimum of F (depending on the properties of the function F), with a rate of convergence that decays with the number n of gradient steps, typically as O(1/√n) or O(1/n) depending on the assumptions made on the problem [PJ92, NV08, NJLS09, SSSS07, Xia10, BM11, BM13, DFB17].

On the one hand, these rates are optimal for the estimation of the minimizer of a function given access to noisy gradients [NY83], which is essentially the usual machine learning set-up where the function F is the expected loss, e.g., logistic or hinge for classification, or least-squares for regression, and the noisy gradients are obtained from sampling a single pair of observations. On the other hand, although these rates in O(1/√n) or O(1/n) are optimal, there are a variety of extra assumptions that allow for faster rates, even exponential rates.

First, for stochastic gradient from a finite pool, that is, for F = (1/k) Σ_{i=1}^k F_i, a sequence of works starting from SAG [LSB12], SVRG [JZ13], SAGA [DBLJ14], have shown explicit exponential convergence. However, these results, once applied to machine learning where the function F_i is the loss function associated with the i-th observation of a finite training data set of size k, say nothing about the loss on unseen data (test loss). The rates we present in this paper are on unseen data.

Second, assuming that at the optimum all stochastic gradients are equal to zero, then for strongly-convex problems (e.g., linear predictions with low-correlated features), linear convergence rates can be obtained for test losses [Sol98, SL13]. However, for supervised machine learning, this has limited relevance, as having all stochastic gradients equal to zero at the optimum essentially implies prediction problems with no uncertainty (that is, the output is a deterministic function of the input). Moreover, an exponential rate is only obtained for strongly-convex problems, which imposes a parametric noiseless problem and limits the applicability (even if the problem were noiseless, this can only reasonably happen in a nonparametric way, with neural networks or positive definite kernels). Our rates hold for noisy problems and for infinite-dimensional problems where we can hope to approach the optimal prediction function with large numbers of observations.
For prediction functions described by a reproducing kernel Hilbert space, and for the square loss, the excess testing loss (equal to the testing loss minus the minimal testing loss over all measurable prediction functions) is known to converge to zero at a subexponential rate (typically greater than O(1/n)) [DB16, DFB17], these rates being optimal for the estimation of testing losses.

INRIA - DÉPARTEMENT D'INFORMATIQUE DE L'ENS, ÉCOLE NORMALE SUPÉRIEURE, CNRS, INRIA, PSL RESEARCH UNIVERSITY, PARIS, FRANCE.
E-mail addresses: loucas.pillaud-vivien@inria.fr, alessandro.rudi@inria.fr, francis.bach@inria.fr.

Going back to the origins of supervised machine learning with binary labels, we will not consider getting to the optimal testing loss using a convex surrogate such as logistic, hinge or least-squares, but the testing error (number of mistakes in predictions), also referred to as the 0-1 loss. It is known that the excess testing error (testing error minus the minimal testing error over all measurable prediction functions) is upper bounded by a function of the excess testing loss [Zha04, BJM06], but always with a loss in the convergence rate (e.g., no difference or taking square roots). Thus a slow rate in O(1/√n) or O(1/n) on the excess loss leads to a slower rate on the excess testing error. Such general relationships between excess loss and excess error have been refined with the use of margin conditions, which characterize how hard the prediction problems are [MT99]. The simplest input points are points where the label is deterministic (i.e., conditional probabilities of the label are equal to zero or one), while the hardest points are the ones where the conditional probabilities are equal to 1/2. Margin conditions quantify the mass of input points which are hardest to predict, and lead to improved transfer functions from testing losses to testing errors, but still no exponential convergence rates [BJM06].

In this paper, we consider the strongest margin condition, that is, conditional probabilities bounded away from 1/2, but not necessarily equal to 0 or 1. This assumption on the learning problem has been used in the past to show that regularized empirical convex risk minimization leads to exponential convergence rates [AT07, KB05]. Our main contribution is to show that stochastic gradient descent also achieves similar rates (see an empirical illustration in Figure 2 in Appendix A). This requires several side contributions that are interesting on their own, namely a new and simple formalization of the learning problem that allows exponential rates of estimation (regardless of the algorithm used to find the estimator), and a new concentration result for averaged stochastic gradient descent (SGD) applied to least-squares, which is finer than existing work [BM13].

The paper is organized as follows: in Section 2, we present the learning set-up, namely binary classification with positive definite kernels, with a particular focus on the relationship between errors and losses. Our main results rely on a generic condition for which we give concrete examples in Section 3. In Section 4, we present our version of stochastic gradient descent, with the use of tail averaging [JKK+16], and provide new deviation inequalities, which we apply in Section 5 to our learning problem, leading to exponential convergence rates for the testing errors. We conclude in Section 6 by providing several avenues for future work. Finally, synthetic experiments illustrating our results can be found in Section A of the Appendix.

Main contributions of the paper. We would like to underline that our main contributions are the two following results: (a) we show in Theorem 4 the exponential convergence of stochastic gradient descent on the testing error, and (b) this result strongly rests on a new deviation inequality, stated in Corollary 1, for stochastic gradient descent on least-squares problems. This last result is interesting on its own and gives an improved high-probability bound which does not depend on the dimension of the problem and has a tighter dependence on the strong-convexity parameter, through the effective dimension of the problem, see [CDV07, DB16].

2. PROBLEM SET-UP

In this section, we present the general machine learning set-up, from generic assumptions to more specific assumptions.

2.1. Generic assumptions.
We consider a measurable set X and a probability distribution ρ on data (x, y) ∈ X × {−1, 1}; we denote by ρ_X the marginal probability on x, and by ρ(±1|x) the conditional probability that y = ±1 given x. We have E[y|x] = ρ(1|x) − ρ(−1|x). Our main margin condition is the following (and is independent of the learning framework):

(A1) |E[y|x]| ≥ δ almost surely, for some δ ∈ (0, 1].

This margin condition (often referred to as a low-noise condition) is commonly used in the theoretical study of binary classification [MT99, AT07, KB05], and usually takes the following form: for all δ > 0, P(|E[y|x]| < δ) = O(δ^α) for some α > 0. Here, however, δ is a fixed constant. Our stronger margin condition (A1) is necessary to show exponential convergence rates, but we also give explicit rates under the latter low-noise condition. This extension is derived in Appendix J, and more precisely in Corollary 4. Note that the smaller the α, the larger the mass of inputs with hard-to-predict labels. Our condition corresponds to α = +∞, and simply states that for all inputs the problem is never totally ambiguous, and the degree of non-ambiguity is

bounded from below by δ. When δ = 1, the label y ∈ {−1, 1} is a deterministic function of x, but our results apply for all δ ∈ (0, 1] and thus to noisy problems with low noise. Note that problems like image classification or object recognition are well characterized by (A1): indeed, the noise in classifying an image between two disparate classes (cars/pedestrians, bikes/airplanes) is usually much smaller than 1/2.

We will consider learning functions in a reproducing kernel Hilbert space (RKHS) H with kernel function K : X × X → R and dot-product ⟨·, ·⟩_H. We make the following standard assumptions on H:

(A2) H is a separable Hilbert space and there exists R > 0 such that, for all x ∈ X, K(x, x) ≤ R².

For x ∈ X, we consider the function K_x : X → R defined as K_x(x′) = K(x, x′). We have the classical reproducing property: for g ∈ H, g(x) = ⟨g, K_x⟩_H [STC04, SS02]. We will consider other norms beyond the RKHS norm ∥g∥_H, that is, the L²-norm (always with respect to ρ_X), defined as ∥g∥²_{L²} = ∫_X g(x)² dρ_X(x), as well as the L∞-norm ∥·∥_{L∞} on the support of ρ_X. A key property is that (A2) implies ∥g∥_{L∞} ≤ R ∥g∥_H. Finally, we will consider observations with standard assumptions:

(A3) The observations (x_n, y_n) ∈ X × {−1, 1}, n ≥ 1, are independent and identically distributed with respect to the distribution ρ.

2.2. Ridge regression. In this paper, we focus primarily on least-squares estimation to obtain estimators. We define g_* as the minimizer over L² of E(y − g(x))² = ∫_{X×{−1,1}} (y − g(x))² dρ(x, y). We always have g_*(x) = E[y|x] = ρ(1|x) − ρ(−1|x), but we do not require g_* ∈ H. We also consider the ridge regression problem [CDV07] and denote by g_λ the unique (when λ > 0) minimizer in H of E(y − g(x))² + λ ∥g∥²_H. The function g_λ always exists for λ > 0 and is always an element of H. When H is dense in L², our results depend on the L∞-error ∥g_λ − g_*∥_{L∞}, which is weaker than ∥g_λ − g_*∥_H, which itself only exists when g_* ∈ H (which we do not assume). When H is not dense, we simply define g_* as the orthogonal projection, for the L² norm, of E[y|x] onto the closure of H in L², so that our bound will then depend on ∥g_λ − g_*∥_{L∞}. Note that g_* is then the minimizer of E(y − g(x))² with respect to g in the closure of H in L². Moreover, our main technical assumption is:

(A4) There exists λ > 0 such that, almost surely, sign(E[y|x]) g_λ(x) ≥ δ/2.

In the assumption above, we could replace δ/2 by any multiplicative constant in (0, 1) times δ, instead of 1/2. Note that with (A4), λ depends on δ and on the probability measure ρ, which are both fixed (respectively by (A1) and by the problem), so that λ is fixed too. It implies that for any estimator ĝ such that ∥g_λ − ĝ∥_{L∞} < δ/2, the predictions from ĝ, obtained by taking the sign of ĝ(x) for any x, are the same as the sign of the optimal prediction sign(E[y|x]). Note that a sufficient condition is ∥g_λ − ĝ∥_H < δ/(2R), which does not assume that g_* ∈ H (see the next subsection). Note that, more generally, for all problems for which (A1) is true and ridge regression in the population case is consistent, so that ∥g_λ − g_*∥_{L∞} tends to zero as λ tends to zero, then (A4) is satisfied: indeed ∥g_λ − g_*∥_{L∞} ≤ δ/2 for λ small enough, which together with (A1) implies (A4).
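To make (A4) concrete, here is a minimal numerical sketch (not from the paper) of the check it asks for: given the values of a candidate g_λ and of the conditional mean E[y|x] on a grid of points, verify that sign(E[y|x]) g_λ(x) ≥ δ/2 everywhere. The function name and the toy example are ours.

```python
import numpy as np

def satisfies_A4(g_lambda_vals, cond_mean_vals, delta):
    """Check sign(E[y|x]) * g_lambda(x) >= delta/2 on a grid of points.

    g_lambda_vals: values of the ridge-regression function g_lambda on the grid.
    cond_mean_vals: values of E[y|x] on the same grid (assumed to satisfy (A1), i.e. |E[y|x]| >= delta).
    """
    return np.all(np.sign(cond_mean_vals) * g_lambda_vals >= delta / 2)

# Toy illustration: a margin problem on [0, 1] where g_lambda slightly shrinks E[y|x].
x = np.linspace(0.0, 1.0, 201)
cond_mean = np.where(x < 0.5, 0.9, -0.9)         # |E[y|x]| = 0.9, so (A1) holds with delta = 0.9
g_lam = 0.8 * cond_mean                          # a shrunk estimate playing the role of g_lambda
print(satisfies_A4(g_lam, cond_mean, delta=0.9)) # True: 0.8 * 0.9 = 0.72 >= 0.45
```

In practice g_λ is not available in closed form in general; Appendix A describes a one-dimensional setting where it can be computed explicitly.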
In Section 3, we provide concrete examples where (A4) is satisfied, and we then present the SGD algorithm and our convergence results. Before that, we relate excess testing losses to excess testing errors.

2.3. From testing losses to testing errors. Here we provide some results that will be useful to prove exponential rates for classification with the squared loss and stochastic gradient descent. First we define the 0-1 loss defining the classification error: R(g) = ρ({(x, y) : sign(g(x)) ≠ y}), where sign(u) = +1 for u ≥ 0 and −1 for u < 0. In particular, denote by R* the so-called Bayes risk R* = R(E[y|x]), which is the minimum achievable classification error [DGL13]. A well-known approach to bound testing errors by testing losses is via transfer functions. In particular, we recall the following result [DGL13, BJM06]: let g_*(x) be equal to E[y|x] a.e.; then, for any g ∈ L²(dρ_X),
R(g) − R* ≤ φ(∥g − g_*∥²_{L²}),

with φ(u) = √u or φ(u) = u^β, with β ∈ [1/2, 1), depending on some properties of ρ [BJM06]. While this result does not require (A1) or (A4), it does not readily lead to exponential rates, since the squared-loss excess risk has minimax lower bounds that are polynomial in n [CDV07]. Here we follow a different approach, requiring (via (A4)) the existence of g_λ having the same sign as g_* and with absolute value uniformly bounded from below. Then we can bound the 0-1 error with respect to the distance in H of the estimator ĝ from g_λ, as shown in the next lemma (proof in Appendix C). This will lead to exponential rates when the distribution satisfies the margin condition (A1), as we prove in the next section and in Section 5. Note also that, for the sake of completeness, we recall in Appendix D that exponential rates can be achieved for kernel ridge regression.

Lemma 1 (From approximately correct sign to 0-1 error). Let q ∈ (0, 1). Under (A1), (A2), (A4), let ĝ ∈ H be a random function such that ∥ĝ − g_λ∥_H < δ/(2R) with probability at least 1 − q. Then R(ĝ) = R* with probability at least 1 − q, and in particular E[R(ĝ)] − R* ≤ q.

In the next section we provide sufficient conditions and explicit settings naturally satisfying (A4).

3. CONCRETE EXAMPLES AND RELATED WORK

In this section we illustrate specific settings that naturally satisfy (A4). We start with the following simple result, showing that the existence of g_* ∈ H such that g_*(x) = E[y|x] a.e. on the support of ρ_X is sufficient to have (A4) (proof in Appendix E.1).

Proposition 1. Under (A1), assume that there exists g_* ∈ H such that g_*(x) := E[y|x] on the support of ρ_X; then for any δ there exists λ > 0 satisfying (A4), that is, sign(E[y|x]) g_λ(x) ≥ δ/2.

We are going to use the proposition above to derive more specific settings. In particular, we consider the case where the positive and negative classes are separated by a margin that is strictly positive. Let X ⊆ R^d and denote by S the support of the probability ρ_X, by S_+ = {x ∈ X : g_*(x) > 0} the part associated to the positive class, and by S_− the one associated with the negative class. Consider the following assumption:

(A5) There exists µ > 0 such that min_{x ∈ S_+, x′ ∈ S_−} ∥x − x′∥ ≥ µ.

Denote by W^{s,2} the Sobolev space of order s, defined with respect to the L² norm on R^d (see [AF03] and Appendix E.2). We also introduce the following assumption:

(A6) X ⊆ R^d and the kernel is such that W^{s,2} ⊆ H, with s > d/2.

An example of kernel such that H = W^{s,2}, with s > d/2, is the Abel kernel K(x, x′) = e^{−σ∥x−x′∥}, for σ > 0. In the following proposition we show that if there exist two functions in H, one matching E[y|x] on S_+ and the second matching E[y|x] on S_−, and if the kernel satisfies (A6), then (A4) is satisfied.

Proposition 2. Under (A1), (A5), (A6), if there exist two functions g_+, g_− ∈ W^{s,2} such that g_+(x) = E[y|x] on S_+ and g_−(x) = E[y|x] on S_−, then (A4) is satisfied.

Finally, we introduce another setting where (A4) is naturally satisfied (the proof of the proposition above and the example below are given in Appendix E.2).

Example 1 (Independent noise on the labels). Let ρ_X be a probability distribution on X ⊆ R^d and let S_+, S_− ⊆ X be a partition of the support of ρ_X satisfying ρ_X(S_+), ρ_X(S_−) > 0 and (A5). Let n ≥ 1. For i ≤ n, let x_i be independently sampled from ρ_X and the label y_i defined by the law
y_i = ζ_i if x_i ∈ S_+,   y_i = −ζ_i if x_i ∈ S_−,
with ζ_i independently distributed as ζ_i = −1 with probability p ∈ [0, 1/2) and ζ_i = 1 with probability 1 − p. Then (A1) is satisfied with δ = 1 − 2p, and (A4) is satisfied as soon as (A2) and (A6) are, that is, the kernel is bounded and H is rich enough (see an example in Appendix E, Figure 4).
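As an illustration, the following sketch samples data according to Example 1 on X = [0, 1], with the two classes separated by a margin ε and labels flipped independently with probability p, so that (A1) holds with δ = 1 − 2p. This is our own toy implementation, not the authors' code; the uniform marginal and the parameter values are illustrative choices.

```python
import numpy as np

def sample_example_1(n, p=0.1, eps=0.05, seed=None):
    """Sample (x_i, y_i) as in Example 1 on X = [0, 1]:
    S_+ = [0, (1-eps)/2], S_- = [(1+eps)/2, 1], labels flipped independently with prob. p."""
    rng = np.random.default_rng(seed)
    side = rng.integers(0, 2, size=n)                  # 0 -> S_+, 1 -> S_- (equal mass here)
    u = rng.uniform(0.0, (1.0 - eps) / 2.0, size=n)
    x = np.where(side == 0, u, u + (1.0 + eps) / 2.0)
    zeta = np.where(rng.random(n) < p, -1, 1)          # independent label noise
    y = np.where(side == 0, zeta, -zeta)
    return x, y

x, y = sample_example_1(2000, p=0.1, seed=0)
# (A1) holds with delta = 1 - 2p: here |E[y|x]| = 0.8 for every x in the support.
print("empirical P(y = +1 | x in S_+):", np.mean(y[x <= 0.475] == 1))
```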

Finally, note that the results of this section can easily be generalized from X = R^d to any Polish space, by using a separating kernel [DVRT14, RCDVR14] instead of (A6).

4. STOCHASTIC GRADIENT DESCENT

We now consider the stochastic gradient algorithm to solve the ridge regression problem with a fixed, strictly positive regularization parameter λ. We consider solving the regularized problem with regularization λ∥g − g_0∥²_H through stochastic approximation, starting from a function g_0 ∈ H (typically 0)¹. Denote by F : H → R the functional F(g) = E(Y − g(X))² = E(Y − ⟨K_X, g⟩)², where the last identity is due to the reproducing property of the RKHS H. Note that F has the gradient ∇F(g) = −2 E[(Y − ⟨K_X, g⟩) K_X]. We also consider F_λ = F + λ∥· − g_0∥²_H, for which ∇F_λ(g) = ∇F(g) + 2λ(g − g_0), and we have, for each pair of observations (x_n, y_n), that F_λ(g) = E[F_{n,λ}(g)], with F_{n,λ}(g) = (⟨g, K_{x_n}⟩ − y_n)² + λ∥g − g_0∥²_H.

Denoting by Σ = E[K_x ⊗ K_x] the covariance operator, defined as a linear operator from H to H (see [FBJ04] and references therein), we have the optimality conditions for g_λ and g_*:
Σ g_λ − E[y K_x] + λ(g_λ − g_0) = 0,   E[(y − g_*(x)) K_x] = 0,
see [CDV07] or Appendix F.1 for the proof of the last identity. Let (γ_n) be a positive sequence; we consider the stochastic gradient recursion² in H started at g_0:
g_n = g_{n−1} − (γ_n/2) ∇F_{n,λ}(g_{n−1}) = g_{n−1} − γ_n [(⟨K_{x_n}, g_{n−1}⟩ − y_n) K_{x_n} + λ(g_{n−1} − g_0)].   (1)
We are going to consider Polyak-Ruppert averaging [PJ92], that is, ḡ_n = (1/(n+1)) Σ_{i=0}^n g_i, as well as the tail-averaging estimate ḡ_n^tail = (2/n) Σ_{i=n/2+1}^n g_i, studied by [JKK+16]. For the sake of clarity, all the results in the main text are for the tail-averaged estimate, but note that all of them have also been proved for the full average in Appendix I. As explained earlier (see Lemma 1), we need to show the convergence of g_n to g_λ in H-norm. We are going to consider two cases: (1) for the non-averaged recursion, (γ_n) is a decreasing sequence, with the important particular case γ_n = γ/n^α for α ∈ (0, 1); (2) for the averaged or tail-averaged functions, (γ_n) is a constant sequence equal to γ. For all the proofs of this section, see Appendix G. In the next subsection we reformulate the recursion in Eq. (1) as a least-squares recursion converging to g_λ.

4.1. Reformulation as a noisy recursion. We can first reformulate the SGD recursion in Eq. (1) as a regular least-squares SGD recursion with noise, with the notation ξ_n = y_n − g_*(x_n), which satisfies E[ξ_n K_{x_n}] = 0. This is the object of the following lemma (for the proof see Appendix F.2):

Lemma 2. The SGD recursion (1) can be rewritten as follows:
g_n − g_λ = (I − γ_n (K_{x_n} ⊗ K_{x_n} + λI)) (g_{n−1} − g_λ) + γ_n ε_n,   (2)
with the noise term ε_k = ξ_k K_{x_k} + (g_*(x_k) − g_λ(x_k)) K_{x_k} − E[(g_*(x_k) − g_λ(x_k)) K_{x_k}] ∈ H.

We are thus in the presence of a least-squares problem in the Hilbert space H, to estimate a function g_λ ∈ H with a specific noise ε_n in the gradient and feature vector K_{x_n}. In the next section, we consider the generic recursion above, which requires some bounds on the noise. In our setting, we have the following almost sure bounds on the noise (see Lemma 9 of Appendix G):
∥ε_n∥_H ≤ R(1 + 2∥g_* − g_λ∥_{L∞}),   E[ε_n ⊗ ε_n] ≼ 2(1 + ∥g_* − g_λ∥²_{L∞}) Σ,
where Σ = E[K_x ⊗ K_x] is the covariance operator.

¹ Note that g_0 is the initialization of the recursion, and is not the limit of g_λ when λ tends to zero (this limit being g_*).
² The complexity of n steps of the recursion is O(n²) if using kernel functions, or O(nτ) when using explicit feature representations, with τ the complexity of computing dot-products and adding feature vectors.
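The recursion in Eq. (1) is straightforward to implement. Below is a minimal sketch written with an explicit finite-dimensional feature map (the setting of footnote 2), together with the Polyak-Ruppert and tail averages used later; the function name, the constant step-size and the toy data are our own choices, not the paper's.

```python
import numpy as np

def sgd_ridge(features, y, gamma, lam, w0=None):
    """Regularized least-squares SGD of Eq. (1), written with an explicit feature map
    phi(x) in R^d: one pass over the data with constant step-size gamma.
    Returns the last iterate, the Polyak-Ruppert average and the tail average."""
    n, d = features.shape
    w = np.zeros(d) if w0 is None else w0.copy()
    w0 = w.copy()
    iterates = [w.copy()]
    for t in range(n):
        phi, yt = features[t], y[t]
        grad = (phi @ w - yt) * phi + lam * (w - w0)   # half-gradient of F_{t,lambda}
        w = w - gamma * grad
        iterates.append(w.copy())
    iterates = np.array(iterates)                      # shape (n+1, d)
    w_avg = iterates.mean(axis=0)                      # Polyak-Ruppert average
    w_tail = iterates[n // 2 + 1:].mean(axis=0)        # tail average over the last n/2 iterates
    return w, w_avg, w_tail

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)); y = np.sign(X[:, 0])
w_last, w_avg, w_tail = sgd_ridge(X, y, gamma=0.1, lam=0.01)
```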

4.2. SGD for general least-squares problems. We now consider results on averaged SGD for least-squares that are interesting on their own. As said before, we show results in two different settings depending on the step-size sequence: first, we consider (γ_n) as a decreasing sequence; second, we take γ_n constant but prove the convergence of the tail-averaged iterates. Since the results we need could be of interest even for finite-dimensional models, in this section we study the following general recursion:
η_n = (I − γ_n H_n) η_{n−1} + γ_n ε_n.   (3)
We make the following assumptions:

(H1) We start at some η_0 ∈ H.
(H2) (H_n, ε_n) are i.i.d., and H_n is a positive self-adjoint operator such that, almost surely, H_n ≽ λI; we denote H := E[H_n].
(H3) Noise: E[ε_n] = 0, ∥ε_n∥_H ≤ c^{1/2} almost surely, and E[ε_n ⊗ ε_n] ≼ C, with C commuting with H. Note that one consequence of this assumption is E∥ε_n∥²_H ≤ tr(C).
(H4) For all n, E[H_n C H_n] ≼ (1/γ_0) C and γ_n ≤ γ_0.
(H5) A is a positive self-adjoint operator which commutes with H.

Note that we will later apply the results of this section to H_n = K_{x_n} ⊗ K_{x_n} + λI, H = Σ + λI, C = Σ and A ∈ {I, Σ}.

We first consider the non-averaged SGD recursion, then the tail-averaged recursion. The key difference with existing bounds is the need for precise probabilistic deviation results. For least-squares, one can always separate the impact of the initial condition η_0 and of the noise terms ε_k, namely η_n = η_n^bias + η_n^variance, where η_n^bias is the recursion with no noise (ε_k = 0), and η_n^variance is the recursion started at η_0 = 0. The final performance is bounded by the sum of the two separate performances (see, e.g., [DB15]); hence all of our bounds depend on these two terms. See more details in Appendix G.

4.3. Non-averaged SGD. In this section, we prove results for the recursion defined by Eq. (3) in the case where γ_n = γ/n^α for α ∈ (0, 1). These results extend the ones of [BM11] by providing deviation inequalities, but are limited to least-squares. For general loss functions and the strongly-convex case, see also [KT09].

Theorem 1 (SGD, decreasing step-size γ_n = γ/n^α). Assume (H1), (H2), (H3), γ_n = γ/n^α with γλ < 1, and denote by η_n ∈ H the n-th iterate of the recursion in Eq. (3). We have, for t > 0 and α ∈ (0, 1),
∥η_n∥_H ≤ exp(−γλ (n+1)^{1−α}/(1−α)) ∥η_0∥_H + V_n,
almost surely for n large enough, with
P(V_n ≥ t) ≤ 2 exp(− n^α t² / (8 (γ tr(C)/λ + γ c^{1/2} t))).

We can make the following observations:
- The proof technique (see Appendix G.1 for the detailed proof) relies on the following scheme: we notice that η_n can be decomposed into two terms, (a) the bias, obtained from a product of contracting operators, and (b) the variance, a sum of increments of a martingale, and we treat the two terms separately. For the second one, we prove almost sure bounds on the increments and on the variance, which lead to a Bernstein-type concentration result on the tail P(V_n ≥ t). Following this proof technique, the coefficient in the latter exponential is composed of the variance bound plus the almost sure bound on the martingale increments times t.
- Note that we only presented in Theorem 1 the case α ∈ (0, 1). Indeed, we only focused on the case where we have exponential convergence (see the whole result in the Appendix, Proposition 6). Actually, there are three different regimes. For α = 0 (constant step-size), the algorithm is not converging, as the tail probability bound on P(V_n ≥ t) does not depend on n. For α = 1, confirming results from [BM11], there is no exponential forgetting of initial conditions. And for α ∈ (0, 1), the forgetting of initial conditions and the tail probability converge to zero exponentially fast, respectively as exp(−C n^{1−α}) and exp(−C n^α) for a constant C; hence the natural choice of α = 1/2 in our experiments.
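The three regimes described above are easy to visualize numerically. The following sketch simulates the abstract recursion (3) in R^d for toy choices of H_n (a rank-one term plus λI, so that H_n ≽ λI) and of bounded zero-mean noise; it is only an illustration of the statement, not part of any proof, and all parameter values are arbitrary.

```python
import numpy as np

def run_recursion(n, d=20, gamma=0.5, lam=0.1, alpha=0.5, noise=0.1, seed=0):
    """Simulate eta_t = (I - gamma_t H_t) eta_{t-1} + gamma_t eps_t with
    H_t = h_t h_t^T + lam*I (so H_t >= lam*I) and bounded zero-mean noise eps_t."""
    rng = np.random.default_rng(seed)
    eta = np.ones(d)                                   # initial condition eta_0
    norms = []
    for t in range(1, n + 1):
        gamma_t = gamma / t ** alpha
        h = rng.normal(size=d) / np.sqrt(d)
        eps = noise * rng.uniform(-1, 1, size=d)
        eta = eta - gamma_t * (h * (h @ eta) + lam * eta) + gamma_t * eps
        norms.append(np.linalg.norm(eta))
    return np.array(norms)

# alpha = 0: the tail does not shrink with n; alpha = 1: slow forgetting of eta_0;
# alpha in (0, 1): both terms decay, the natural trade-off being alpha = 1/2.
for alpha in [0.0, 0.5, 1.0]:
    print(alpha, run_recursion(5000, alpha=alpha)[-1])
```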

4.4. Averaged and tail-averaged SGD with constant step-size. In this subsection, we take γ_n = γ for all n. We first start with a result on the variance term, whose proof extends the work of [DFB17] to deviation inequalities that are sharper than the ones from [BM13].

Theorem 2 (Convergence of the variance term in averaged SGD). Assume (H1), (H2), (H3), (H4), (H5) and consider the average of the n+1 first iterates of the sequence defined in Eq. (3): η̄_n = (1/(n+1)) Σ_{i=0}^n η_i. Assume η_0 = 0. We have, for t > 0 and n ≥ 1,
P(∥A^{1/2} η̄_n∥_H ≥ t) ≤ 2 exp(−(n+1) t² / E_t),   (4)
where E_t is defined with respect to the constants introduced in the assumptions:
E_t = 4 tr(A H^{−2} C) + (2 c^{1/2} ∥A^{1/2}∥_op / (3λ)) t.   (5)

The work that remains to be done is to bound the bias term of the recursion, η_n^bias. We have done it for the full averaged sequence (see Appendix I, Theorem 6), but as it is quite technical and could lower a bit the clarity of the reasoning, we have decided to leave it in the Appendix. We present here another approach and consider the tail-averaged recursion η̄_n^tail = (2/n) Σ_{i=n/2+1}^n η_i, as proposed by [JKK+16, Sha11]. For this, we use the simple almost sure bound ∥η_i^bias∥_H ≤ (1 − λγ)^i ∥η_0∥_H, such that ∥η̄_n^{tail,bias}∥_H ≤ (1 − λγ)^{n/2} ∥η_0∥_H. For the variance term, we can simply use the result above for n and n/2, as η̄_n^tail = 2 η̄_n − η̄_{n/2}. This leads to:

Corollary 1 (Convergence of tail-averaged SGD). Assume (H1), (H2), (H3), (H4), (H5) and consider the tail-average of the sequence defined in Eq. (3): η̄_n^tail = (2/n) Σ_{i=n/2+1}^n η_i. We have, for t > 0 and n ≥ 1,
∥A^{1/2} η̄_n^tail∥_H ≤ (1 − γλ)^{n/2} ∥A^{1/2}∥_op ∥η_0∥_H + L_n,   (6)
with P(L_n ≥ t) ≤ 4 exp(−(n+1) t² / (4 E_t)),   (7)
where L_n is defined in the proof (see Appendix G.3) and is the variance term of the tail-averaged recursion.

We can make the following observations on the two previous results:
- The proof technique (see Appendices G.2 and G.3 for the detailed proofs) relies on a concentration inequality of Bernstein type. Indeed, in the setting of Theorem 2, η̄_n is a sum of increments of a martingale. We prove almost sure bounds on the increments and on the variance, following the proof technique of [DFB17], which lead to a Bernstein-type concentration result on the tail. Following the proof technique summed up before, we see that E_t is composed of the variance bound plus the almost sure bound times t.
- Remark that, classically, A and C are proportional to H for excess risk predictions. In the finite d-dimensional setting this leads to the usual variance bound proportional to the dimension d: tr(A H^{−2} C) = tr(I) = d.
- The result is general in the sense that we can apply it for all operators A commuting with H (this can be used to prove results in L² or in H).
- Finally, note that we improved the variance bound with respect to the strong-convexity parameter λ, which is usually of the order 1/λ² (see [Sha11]) and is here tr(A H^{−2} C). Indeed, in our setting we will apply it with A = C = Σ and H = Σ + λI, so that tr(A H^{−2} C) is upper bounded by the effective dimension tr(Σ(Σ + λI)^{−1}), which can be much smaller than 1/λ² (see [CDV07, DB16]).
- The complete proof for the full average is written in Appendix I.1, and more precisely in Theorem 6. In this case, however, the initial conditions are not forgotten exponentially fast.

5. EXPONENTIALLY CONVERGENT SGD FOR CLASSIFICATION ERROR

In this section we show our main results, on the error made on unseen data by the n-th iterate of the regularized SGD algorithm. Hence, we go back to the original SGD recursion defined in Eq. (2). Let us recall it:
g_n − g_λ = (I − γ_n (K_{x_n} ⊗ K_{x_n} + λI)) (g_{n−1} − g_λ) + γ_n ε_n,
with the noise term ε_k = ξ_k K_{x_k} + (g_*(x_k) − g_λ(x_k)) K_{x_k} − E[(g_*(x_k) − g_λ(x_k)) K_{x_k}] ∈ H. As in the previous section, we state two results in two different settings, the first one for SGD with decreasing step-size γ_n = γ/n^α, and the second one for tail-averaged SGD with constant step-size. For all the proofs of this section, see Appendix H.
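The effective dimension tr(Σ(Σ + λI)^{−1}) mentioned in the observations above (and appearing again through tr(Σ(Σ + λI)^{−2}) in the constant K_R of Theorem 4 below) is easy to compute from the eigenvalues of Σ. The short sketch below, with a hypothetical polynomially decaying spectrum, illustrates how much smaller it can be than the worst-case 1/λ².

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """tr(Sigma (Sigma + lam I)^{-1}), computed from the eigenvalues of the covariance operator."""
    eigvals = np.asarray(eigvals, dtype=float)
    return np.sum(eigvals / (eigvals + lam))

# Toy spectrum decaying as j^{-2}: the effective dimension grows roughly like lam^{-1/2},
# much more slowly than the 1/lam^2 appearing in worst-case bounds.
sigma = 1.0 / np.arange(1, 10001) ** 2
for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, effective_dimension(sigma, lam), 1.0 / lam ** 2)
```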

5.1. SGD with decreasing step-size. In this section, we focus on decreasing step-sizes γ_n = γ/n^α for α ∈ (0, 1), which lead to exponential convergence rates. Results for α = 1 and α = 0 can be derived in a similar way but do not lead to exponential rates.

Theorem 3. Assume (A1), (A2), (A3), (A4), and γ_n = γ/n^α with α ∈ (0, 1) and γλ < 1. Let g_n be the n-th iterate of the recursion defined in Eq. (2). As soon as n satisfies exp(−γλ (n+1)^{1−α}/(1−α)) ≤ δ/(5R ∥g_0 − g_λ∥_H), then R(g_n) = R* with probability at least 1 − 2 exp(−δ² n^α / C_R), with
C_R = 2^{α+7} γ R² (tr(Σ) + ∥g_* − g_λ∥²_{L∞})/λ + 8 γ R² δ (1 + 2∥g_* − g_λ∥_{L∞})/3,
and in particular E[R(g_n)] − R* ≤ 2 exp(−δ² n^α / C_R).

Note that Theorem 3 shows that, with probability at least 1 − 2 exp(−δ² n^α / C_R), the predictions of g_n are perfect. We can also make the following observations:
- The idea of the proof (see Appendix H.1 for the detailed proof) is the following: we know that as soon as ∥g_n − g_λ∥_H ≤ δ/(2R), the predictions of g_n are perfect (Lemma 1). We just have to apply Theorem 1 to the original SGD recursion and make sure to bound each term by δ/(4R).
- Similar results for non-averaged SGD could be derived beyond least-squares (e.g., for the hinge or logistic loss) using results from [KT09].
- Also note that the larger the α, the smaller the bound. However, it is only valid for n larger than a certain quantity depending on λγ. A good trade-off is α = 1/2, for which we get an excess error of 2 exp(−δ² n^{1/2}/C_R), valid as soon as n ≥ (log(10R ∥g_0 − g_λ∥_H/δ))²/(4λ²γ²). Notice also that we should take γλ as large as possible to increase the factor in the exponential and make the condition on n hold as soon as possible.
- If we want to emphasize the dependence of the bound on the important parameters, we can write that E[R(g_n)] − R* ≤ 2 exp(−λ δ² n^α / R²).
- When the condition on n is not met, we still have the usual bound obtained by working directly with the excess loss [BJM06], but we lose exponential convergence.

5.2. Tail-averaged SGD with constant step-size. We now consider the tail-averaged recursion (the corresponding full-averaging result is proved in Appendix I.2, Theorem 7), with the following result:

Theorem 4. Assume (A1), (A2), (A3), (A4), and γ_n = γ for any n, with γλ < 1 and γ ≤ γ_0 = 1/(R² + 2λ). Let g_n be the n-th iterate of the recursion defined in Eq. (2), and ḡ_n^tail = (2/n) Σ_{i=n/2+1}^n g_i. As soon as n ≥ (2/(γλ)) ln(5R ∥g_0 − g_λ∥_H/δ), then R(ḡ_n^tail) = R* with probability at least 1 − 4 exp(−δ² (n+1)/K_R), with
K_R = 2⁹ (R² (1 + ∥g_* − g_λ∥²_{L∞}) tr(Σ(Σ + λI)^{−2}) + δ R (1 + 2∥g_* − g_λ∥_{L∞})/(3λ)),
and in particular E[R(ḡ_n^tail)] − R* ≤ 4 exp(−δ² (n+1)/K_R).

Theorem 4 shows that, with probability at least 1 − 4 exp(−δ² (n+1)/K_R), the predictions of ḡ_n^tail are perfect. We can also make the following observations:
- The idea of the proof (see Appendix H.2 for the detailed proof) is the following: we know that as soon as ∥ḡ_n^tail − g_λ∥_H ≤ δ/(2R), the predictions of ḡ_n^tail are perfect (Lemma 1). We just have to apply Corollary 1 to the original SGD recursion, and make sure to bound each term by δ/(4R).

- If we want to emphasize the dependence of the bound on the important parameters, we can write that E[R(ḡ_n^tail)] − R* ≤ 2 exp(−n λ² δ² / R⁴). Note that the λ² could be made much smaller under assumptions on the decay of the eigenvalues of Σ: it has been shown [CDV07] that if the decay happens at speed 1/i^β, then tr(Σ(Σ + λI)^{−2}) ≤ (1/λ) tr(Σ(Σ + λI)^{−1}) ≤ (R²/λ)^{1+1/β}.
- We want to take γλ as big as possible to satisfy the condition on n quickly. In comparison to the convergence rate in the case of decreasing step-sizes, the dependence on n is improved, as the convergence is really an exponential of n and not of some power of n as in the previous result.
- Finally, the complete proof for the full average is contained in Appendix I.2, and more precisely in Theorem 7.

6. CONCLUSION

In this paper, we have shown that stochastic gradient descent can be exponentially convergent once some margin conditions are assumed; and even under a weaker margin condition, fast rates can be achieved (see Appendix J). This is obtained by running averaged stochastic gradient on a least-squares problem, and proving new deviation inequalities. Our work could be extended in several natural ways: (a) our work relies on new concentration results for the least-mean-squares algorithm (i.e., SGD for the square loss), and it is natural to extend it to other losses, such as the logistic or hinge loss; (b) going beyond binary classification is also natural, with the square loss [CRR16, OBLJ17] or without [TCKG05]; (c) in our experiments, we use regularization, but we have experimented with unregularized recursions, which do exhibit fast convergence, but for which proofs are usually harder [DB16]; finally, (d) in order to avoid the O(n²) complexity, extending the results of [RCR17, RR17] would lead to a subquadratic complexity.

ACKNOWLEDGEMENTS

We acknowledge support from the European Research Council (grant SEQUOIA). We would like to thank Raphaël Berthier for useful discussions.

REFERENCES

[AF03] Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140. Academic Press, 2003.
[AT07] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2), 2007.
[BJM06] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[BLC05] L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2), 2005.
[BM11] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), 2011.
[BM13] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
[CDV07] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.
[CRR16] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems, 2016.
[DB15] A. Défossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal sampling distributions. In Proc. AISTATS, 2015.
[DB16] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4), 2016.
[DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
[DFB17] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 2017.
[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[DVRT14] Ernesto De Vito, Lorenzo Rosasco, and Alessandro Toigo. Learning sets with separating kernels. Applied and Computational Harmonic Analysis, 37(2):185-217, 2014.
[FBJ04] Kenji Fukumizu, Francis Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73-99, 2004.
[JKK+16] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. Technical report, arXiv, 2016.
[JZ13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

[KB05] Vladimir Koltchinskii and Olexandra Beznosova. Exponential convergence rates in classification. In International Conference on Computational Learning Theory. Springer, 2005.
[KT09] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, 2009.
[LSB12] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.
[MT99] Enno Mammen and Alexandre Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6), 1999.
[NJLS09] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009.
[NV08] Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6), 2008.
[NY83] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
[OBLJ17] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, 2017.
[Pin94] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 1994.
[PJ92] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 1992.
[RBV10] Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11(Feb), 2010.
[RCDVR14] Alessandro Rudi, Guillermo D. Canas, Ernesto De Vito, and Lorenzo Rosasco. Learning sets and subspaces. Regularization, Optimization, Kernels, and Support Vector Machines, 2014.
[RCR17] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, 2017.
[RM51] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22, 1951.
[RR17] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, 2017.
[Sha11] Ohad Shamir. Making gradient descent optimal for strongly convex stochastic optimization. CoRR, 2011.
[SL13] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. Technical report, arXiv, 2013.
[Sol98] Mikhail V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11:23-35, 1998.
[SS02] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[SSSS07] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning (ICML), 2007.
[STC04] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[TCKG05] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the International Conference on Machine Learning (ICML), 2005.
[Xia10] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 2010.
[YRC07] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.
[Zha04] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56-85, 2004.

Organization of the Appendix.
A. Experiments, where the experiments and their settings are explained.
B. Probabilistic lemmas, where the concentration inequalities in Hilbert spaces used in Section G are recalled.
C. From H to 0-1 loss, where, from high-probability bounds in H-norm, we derive bounds on the 0-1 error.
D. Proofs of exponential rates for kernel ridge regression, where exponential rates for kernel ridge regression are proven (Theorem 5).
E. Proofs and additional results about concrete examples, where additional results and concrete examples satisfying (A4) are given.
F. Preliminaries for stochastic gradient descent, where the SGD recursion is derived.
G. Proofs of stochastic gradient descent results, where high-probability bounds for the general SGD recursion are shown (Theorems 1 and 2).
H. Exponentially convergent SGD for classification error, where exponential convergence of the test error is shown (Theorems 3 and 4).
I. Extension to the full-averaged case, where the previous results are extended to full-averaged SGD instead of tail-averaged SGD.
J. Convergence under a weaker margin assumption, where the previous results are extended to the case of a weaker margin assumption.

APPENDIX A. EXPERIMENTS

To illustrate our results, we consider one-dimensional synthetic examples (X = [0, 1]) for which our assumptions are easily satisfied. Indeed, we consider the following set-up, which fulfils our assumptions:

(A1), (A3): We consider X ∼ U([0, (1−ε)/2] ∪ [(1+ε)/2, 1]) and, with the notations of Example 1, we take S_+ = [0, (1−ε)/2] and S_− = [(1+ε)/2, 1]. For i ≤ n, with x_i independently sampled from ρ_X, we define y_i = 1 if x_i ∈ S_+ and y_i = −1 if x_i ∈ S_−.

(A2): We take the kernel to be the exponential kernel K(x, x′) = exp(−|x − x′|), for which the RKHS is a Sobolev space H = W^{s,2} with s > d/2, which is dense in L² [AF03].

(A4): With this setting we could find a closed form for g_λ and check that it verifies (A4). Indeed, we could solve the optimality equation satisfied by g_λ: for all z ∈ [0, 1],
∫_0^1 K(x, z) g_λ(x) dρ_X(x) + λ g_λ(z) = ∫_0^1 K(x, z) g_ρ(x) dρ_X(x),
where g_ρ denotes E[y|x], the solution being a linear combination of exponentials on each of the sets [0, (1−ε)/2], [(1−ε)/2, (1+ε)/2] and [(1+ε)/2, 1].

FIGURE 1. Representing the ρ_X density (uniform with ε-margin), the best estimator, i.e., E[y|x], and the g_λ used for the simulations (λ = 0.01).

In the case of SGD with decreasing step-size, we computed only the test error E[R(g_n)] − R*. For tail-averaged SGD with constant step-size, we computed the test error as well as the training error, the test loss, which corresponds to the L² loss ∫_0^1 (g_n(x) − g_λ(x))² dρ(x), and the training loss.
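The optimality equation above can be solved approximately by discretizing the integral over ρ_X on a grid; the following sketch does so for the uniform-with-margin distribution and the exponential kernel of this section. It is a numerical approximation of the closed form used in the paper, with our own grid size and parameter values.

```python
import numpy as np

eps, lam = 0.05, 0.01
# grid on the support S_+ = [0, (1-eps)/2] and S_- = [(1+eps)/2, 1]
m = 200
grid = np.concatenate([np.linspace(0, (1 - eps) / 2, m),
                       np.linspace((1 + eps) / 2, 1, m)])
w = np.full(grid.size, 1.0 / grid.size)             # uniform weights approximating rho_X
g_rho = np.where(grid <= 0.5, 1.0, -1.0)            # E[y|x] on the support (noiseless case)
K = np.exp(-np.abs(grid[:, None] - grid[None, :]))  # exponential kernel of (A2)

# discretized optimality equation: (K diag(w) + lam I) g_lambda = K diag(w) g_rho
A = K * w[None, :] + lam * np.eye(grid.size)
g_lam = np.linalg.solve(A, (K * w[None, :]) @ g_rho)
print("min of sign(E[y|x]) * g_lambda on the grid:", np.min(np.sign(g_rho) * g_lam))
```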

In all cases we computed the errors of the n-th iterate with respect to the calculated g_λ, taking g_0 = 0. For any n ≥ 1,
g_n = g_{n−1} − γ [(g_{n−1}(x_n) − y_n) K_{x_n} + λ g_{n−1}].
We can use representants to find the recursion on the coefficients. Indeed, if g_n = Σ_{i=1}^n a_i^n K_{x_i}, then the following recursion for the (a_i^n) holds: for i ≤ n−1, a_i^n = (1 − γλ) a_i^{n−1}, and a_n^n = −γ (Σ_{i=1}^{n−1} a_i^{n−1} K(x_n, x_i) − y_n). From the (a_i^n), we can also compute the coefficients of ḡ_n and ḡ_n^tail, which we denote ā_i^n and ā_i^{n,tail} respectively: ā_i^n = (1/(n+1)) Σ_{k=i}^n a_i^k and ā_i^{n,tail} = (2/n) Σ_{k=n/2}^n a_i^k.

To illustrate our theoretical results we present the following figures:
- For the exponential convergence of the averaged and tail-averaged cases, we plotted the error log_10(E[R(g_n)] − R*) as a function of n. With this scale, and following our results, it behaves as a line after a certain n (Figures 2 and 3, right).
- We recover the results of [DFB17] that show convergence at speed 1/n for the loss (Figure 2, left). We adapted the scale to compare with the error plot.
- For Figure 3 (left), we plotted log_10(−log(E[R(g_n)] − R*)) of the excess error with respect to log(n) to show a line of slope 1/2. It matches our theoretical bound of the form exp(−K√n).

Note that for the plots showing the expected excess errors, i.e., E[R(g_n)] − R*, we plotted the mean of the errors over 1000 replications until n = 200, whereas for the plots showing the losses, i.e., a function of ∥g_n − g_*∥², we plotted the mean of the loss over 100 replications until n = 2000.

FIGURE 2. Showing linear convergence (in log scale) for the L² loss and the 0-1 error in the case of a margin of width ε. The left plot corresponds to the test and training loss in the averaged case, whereas the right one corresponds to the error in the same setting. Note that the y-axis is the same while the x-axis differs by a factor 10. The fact that the error plot is a line after a certain n matches our theoretical results. We took the following parameters: ε = 0.05, γ = 0.25, λ = 0.01.

We can make the following observations. First, remark that between the plots of losses and errors (Figure 2, left and right respectively), there is a factor 10 between the numbers of samples (200 for errors and 2000 for losses) and another factor 10 between errors and losses (10^{-4} for errors and 10^{-3} for losses). This underlines well our theoretical result, namely the difference between exponential rates of convergence for the excess error and the 1/n rate of convergence for the loss. Moreover, we see that even if the excess error with tail averaging seems a bit faster, we also have linear rates for the convergence of the excess error in the averaged case. Finally, we remark that the error on the train set is always below the one on an unknown test set, by what seems to be close to a factor 2.
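For completeness, here is a sketch of the coefficient recursion just described (the a_i^n, together with the averaged and tail-averaged coefficients). Variable names are ours, the average is normalized by n rather than n+1 (the initialization g_0 = 0 contributes nothing), and the kernel and parameters follow this section.

```python
import numpy as np

def kernel_sgd_coefficients(x, y, kernel, gamma, lam):
    """One pass of recursion (1) with g_0 = 0, storing g_n = sum_i a_i K_{x_i}.
    Also accumulates the coefficients of the averaged and tail-averaged iterates."""
    n = len(x)
    a = np.zeros(n)          # coefficients a_i^n of the current iterate
    a_sum = np.zeros(n)      # running sum over iterates, for the full average
    a_tail = np.zeros(n)     # sum over the last n/2 iterates, for the tail average
    for t in range(n):
        pred = a[:t] @ kernel(x[:t], x[t]) if t > 0 else 0.0  # g_{t-1}(x_t)
        a[:t] *= (1.0 - gamma * lam)                          # a_i^t = (1 - gamma*lam) a_i^{t-1}
        a[t] = -gamma * (pred - y[t])                         # new coefficient on K_{x_t}
        a_sum += a
        if t >= n // 2:
            a_tail += a
    return a, a_sum / n, a_tail / (n - n // 2)

k = lambda xs, z: np.exp(-np.abs(xs - z))                     # exponential kernel on [0, 1]
rng = np.random.default_rng(0)
x_data = rng.uniform(size=500); y_data = np.where(x_data < 0.5, 1.0, -1.0)
a_last, a_avg, a_tail = kernel_sgd_coefficients(x_data, y_data, k, gamma=0.25, lam=0.01)
```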

14 log logrg R test_error log log 0 Rg R test_error_average test_error_tail_average FIGURE 3. Left plot shows the error i the o-averaged case for γ = γ/ ad right compares the test error betwee averaged ad tail averaged case. We took the followig parameters : ε = 0.05, γ = 0.25, λ = 0.0. Propositio 3 Let X k k N be a sequece of vectors of H adapted to a o decreasig sequece of σ-fields F k such that E X k F k = 0, sup k X k a ad E X k 2 F k b 2 for some sequeces a, b N. R+ The, for all t 0,, P X k t 2 exp t a t + b2 a a 2 l + ta. 8 b Proof : As E X k F k = 0, the F j-adapted sequece f j defied by f j = j X k is a martigale ad so is the stopped-martigale f j. By applyig Theorem 3.4 of Pi94 to the martigale f j, we have the result. Corollary 2 Let X k k N be a sequece of vectors of H adapted to a o decreasig sequece of σ-fields F k such that E X k F k = 0, sup k X k a ad E X k 2 F k b 2 for some sequeces a, b N. R+ The, for all t 0,, t 2 P X k t 2 exp 2 b a t/3 Proof : We apply 3 ad simply otice that t t + b2 l + ta a a a 2 b = b2 + at a 2 b 2 at = b2 φ a 2 where φu = + u l + u u for u > 0. Moreover φu t t + b2 l + ta b2 a a a 2 b a 2 b 2, l + at at b 2 b 2 u 2, so that: 2 + u/3 a t/b a t/3b 2 = t 2 2 b 2 + a t/3. APPENDIX C. FROM H TO 0- LOSS I this sectio we prove Lemma. Note that A4 requires the existece of g λ havig the same sig of g almost everywhere o the support of ρ X ad with absolute value uiformly bouded from below. I Lemma we prove that we ca boud the 0- error with respect to the distace i H of the estimator ĝ form g λ. Proof of Lemma : Deote by W the evet such that ĝ gλ H < δ/2r. Note that for ay f H, for ay x X. So for ĝ W, we have fx = f, K x H Kx H f H R f H, ĝx g λ x R ĝ gλ H < δ/2 x X. 3

APPENDIX C. FROM H TO 0-1 LOSS

In this section we prove Lemma 1. Note that (A4) requires the existence of g_λ having the same sign as g_* almost everywhere on the support of ρ_X and with absolute value uniformly bounded from below. In Lemma 1 we prove that we can bound the 0-1 error with respect to the distance in H of the estimator ĝ from g_λ.

Proof of Lemma 1: Denote by W the event such that ∥ĝ − g_λ∥_H < δ/(2R). Note that for any f ∈ H and any x ∈ X, f(x) = ⟨f, K_x⟩_H ≤ ∥K_x∥_H ∥f∥_H ≤ R ∥f∥_H. So, on W, we have
|ĝ(x) − g_λ(x)| ≤ R ∥ĝ − g_λ∥_H < δ/2,   ∀x ∈ X.
Let x be in the support of ρ_X. By (A4), |g_λ(x)| ≥ δ/2 a.e. On W, for x ∈ X such that g_λ(x) > 0, we have ĝ(x) = g_λ(x) − (g_λ(x) − ĝ(x)) ≥ g_λ(x) − |g_λ(x) − ĝ(x)| > 0, so sign(ĝ(x)) = sign(g_λ(x)) = +1. Similarly, on W, for x ∈ X such that g_λ(x) < 0, we have ĝ(x) = g_λ(x) + (ĝ(x) − g_λ(x)) ≤ g_λ(x) + |g_λ(x) − ĝ(x)| < 0, so sign(ĝ(x)) = sign(g_λ(x)) = −1. Finally, by (A4), either g_λ(x) > 0 or g_λ(x) < 0 a.e., so sign(ĝ(x)) = sign(g_λ(x)) a.e.

Now note that by (A1) and (A4) we have sign(g_*(x)) = sign(g_λ(x)) a.e., where g_*(x) := E[y|x]. So, on W, sign(ĝ(x)) = sign(g_λ(x)) = sign(g_*(x)) a.e., and thus
R(ĝ) = ρ({(x, y) : sign(ĝ(x)) ≠ y}) = ρ({(x, y) : sign(g_*(x)) ≠ y}) = R*.
Finally, note that
E[R(ĝ)] = E[R(ĝ) 1_W] + E[R(ĝ) 1_{W^c}],
where 1_W is 1 on the set W and 0 outside, and W^c is the complement of W. On W we have E[R(ĝ) 1_W] = R* E[1_W] ≤ R*, while E[R(ĝ) 1_{W^c}] ≤ E[1_{W^c}] ≤ q.

APPENDIX D. EXPONENTIAL RATES FOR KERNEL RIDGE REGRESSION

D.1. Results. In this section, we first specialize some results already known in the literature about the consistency of kernel ridge least-squares regression (KRLS) in H-norm [CDV07], and then derive exponential classification learning rates. Let (x_i, y_i)_{i=1}^n be examples independently and identically distributed according to ρ, that is, Assumption (A3). Denote by Σ̂, Σ the linear operators on H defined by
Σ̂ = (1/n) Σ_{i=1}^n K_{x_i} ⊗ K_{x_i},   Σ = ∫_X K_x ⊗ K_x dρ_X(x),
referred to as the empirical (non-centered) covariance operator and the covariance operator (see [FBJ04] and references therein). We recall that the KRLS estimator ĝ_λ ∈ H, which minimizes the regularized empirical risk, is defined as follows in terms of Σ̂:
ĝ_λ = (Σ̂ + λI)^{−1} (1/n) Σ_{i=1}^n y_i K_{x_i}.
Moreover, we recall that the population regularized estimator g_λ is characterized by (see [CDV07])
g_λ = (Σ + λI)^{−1} E[y K_x].
The following lemma bounds the empirical regularized estimator with respect to the population one in terms of λ and n, and is essentially contained in the work of [CDV07]; here we rederive it in a subcase (see below for the proof).

Lemma 3. Under assumptions (A2), (A3), for any λ > 0, denoting u = ∥(1/n) Σ_{i=1}^n y_i K_{x_i} − E[y K_x]∥_H and v = ∥Σ̂ − Σ∥_op, we have
∥ĝ_λ − g_λ∥_H ≤ u/λ + R v/λ².

By using deviation inequalities for u, v in Lemma 3 and then applying Lemma 1, we obtain the following exponential bound for kernel ridge regression (see the complete proof below):

Theorem 5. Under (A1), (A2), (A3), (A4), we have that, for any n ≥ 1, R(ĝ_λ) = R* with probability at least 1 − 4 exp(−C_0 λ⁴ δ² n / R⁸). Moreover,
E[R(ĝ_λ)] − R* ≤ 4 exp(−C_0 λ⁴ δ² n / R⁸),   with C_0 := 1/(72(1 + λ/R²)²).
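In practice, ĝ_λ is computed through the n×n kernel matrix: writing ĝ_λ = Σ_i c_i K_{x_i}, the operator equation above reduces to the standard linear system (K + nλI) c = y with K_{ij} = K(x_i, x_j). The sketch below uses this identity on toy data with label noise; it is a minimal illustration, not the authors' implementation.

```python
import numpy as np

def krls_fit(x_train, y_train, kernel, lam):
    """Kernel ridge regression, g_hat_lambda = (Sigma_hat + lam I)^{-1} (1/n) sum_i y_i K_{x_i},
    computed through its coefficients c = (K + n*lam*I)^{-1} y on the training points."""
    K = kernel(x_train[:, None], x_train[None, :])
    n = len(y_train)
    c = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return lambda x_new: kernel(x_new[:, None], x_train[None, :]) @ c

kernel = lambda a, b: np.exp(-np.abs(a - b))        # a bounded kernel, as in (A2)
rng = np.random.default_rng(0)
x = rng.uniform(size=300)
y = np.where(x < 0.5, 1, -1) * np.where(rng.random(300) < 0.1, -1, 1)  # 10% label noise
g_hat = krls_fit(x, y, kernel, lam=0.01)
x_test = rng.uniform(size=1000)
print("test error:", np.mean(np.sign(g_hat(x_test)) != np.where(x_test < 0.5, 1, -1)))
```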

The result above is a refinement of Thm. 2.6 from [YRC07]: we improved the dependency in n and removed the requirement that g_* ∈ H or g_* = Σ^r w for some w ∈ L²(dρ_X) and r > 1/2. Similar results exist for losses that are usually considered more suitable for classification, like the hinge or logistic loss, and more generally for non-decreasing losses [KB05]. With respect to this latter work, our analysis uses the explicit characterization of the kernel ridge regression estimator in terms of linear operators on H [CDV07]. This, together with (A4), allows us to use analytic tools specific to reproducing kernel Hilbert spaces, leading to proofs that are comparatively simpler, with explicit constants and a clearer problem setting (consisting essentially in (A1), (A4) and no assumptions on E[y|x]). Finally, note that the exponent of λ could be reduced by using a refined analysis under additional regularity assumptions on ρ_X and E[y|x] (such as the source condition and intrinsic dimension from [CDV07]), but it is beyond the scope of this paper.

D.2. Proofs. Here we prove that kernel ridge regression achieves exponential classification rates under assumptions (A1), (A4). In particular, by Lemma 3 we bound ∥ĝ_λ − g_λ∥_H in high probability, and then we use Lemma 1, which gives exponential classification rates when ∥ĝ_λ − g_λ∥_H is small enough in high probability.

Proof of Lemma 3: Denote by Σ̂_λ the operator Σ̂ + λI and by Σ_λ the operator Σ + λI. We have
ĝ_λ − g_λ = Σ̂_λ^{−1} (1/n) Σ_{i=1}^n y_i K_{x_i} − Σ_λ^{−1} E[y K_x] = Σ̂_λ^{−1} ((1/n) Σ_{i=1}^n y_i K_{x_i} − E[y K_x]) + (Σ̂_λ^{−1} − Σ_λ^{−1}) E[y K_x].
For the first term, since ∥Σ̂_λ^{−1}∥_op ≤ 1/λ, we have
∥Σ̂_λ^{−1} ((1/n) Σ_i y_i K_{x_i} − E[y K_x])∥_H ≤ ∥Σ̂_λ^{−1}∥_op ∥(1/n) Σ_i y_i K_{x_i} − E[y K_x]∥_H ≤ (1/λ) ∥(1/n) Σ_i y_i K_{x_i} − E[y K_x]∥_H.
For the second term, since ∥Σ_λ^{−1}∥_op ≤ 1/λ and ∥E[y K_x]∥ ≤ E[|y| ∥K_x∥] ≤ R, we have
∥(Σ̂_λ^{−1} − Σ_λ^{−1}) E[y K_x]∥_H = ∥Σ̂_λ^{−1} (Σ − Σ̂) Σ_λ^{−1} E[y K_x]∥_H ≤ ∥Σ − Σ̂∥_op ∥Σ̂_λ^{−1}∥_op ∥Σ_λ^{−1} E[y K_x]∥_H ≤ (R/λ²) ∥Σ̂ − Σ∥_op.

Proof of Theorem 5: Let τ > 0. By Lemma 3 we know that ∥ĝ_λ − g_λ∥_H ≤ u/λ + R v/λ², with u = ∥(1/n) Σ_{i=1}^n y_i K_{x_i} − E[y K_x]∥_H and v = ∥Σ̂ − Σ∥_op. For u we can apply Pinelis' inequality (Thm. 3.5, [Pin94]), since the (x_i, y_i)_{i=1}^n are sampled independently according to the probability ρ and y_i K_{x_i} − E[y K_x] is zero mean. Since ∥y_i K_{x_i} − E[y K_x]∥_H ≤ 2R a.e. and H is a Hilbert space, we apply Pinelis' inequality with b² = 4R²/n and D = 1, obtaining u ≤ √(8R²τ/n) with probability at least 1 − 2e^{−τ}. Now denote by ∥·∥_HS the Hilbert-Schmidt norm and recall that ∥·∥_op ≤ ∥·∥_HS. To bound v we apply again the Pinelis inequality [RBV10], considering that the space of Hilbert-Schmidt operators is again a Hilbert space, that Σ̂ = (1/n) Σ_{i=1}^n K_{x_i} ⊗ K_{x_i}, that the (x_i)_{i=1}^n are independently sampled from ρ_X and that E[K_{x_i} ⊗ K_{x_i}] = Σ. In particular we apply it with D = 1 and b² = 4R⁴/n, so
v = ∥Σ̂ − Σ∥_op ≤ ∥Σ̂ − Σ∥_HS ≤ √(8R⁴τ/n),
with probability at least 1 − 2e^{−τ}. Finally, we take the intersection of the two events, obtaining, with probability at least 1 − 4e^{−τ},
∥ĝ_λ − g_λ∥_H ≤ √(8R²τ/(nλ²)) + √(8R⁶τ/(nλ⁴)).

By selecting τ = δ²n / (9R² (√(8R²/λ²) + √(8R⁶/λ⁴))²), we obtain ∥ĝ_λ − g_λ∥_H ≤ δ/(3R) ≤ δ/(2R), with probability at least 1 − 4e^{−τ}. Now we can apply Lemma 1 to obtain the exponential bound for the classification error.

APPENDIX E. PROOFS AND ADDITIONAL RESULTS ABOUT CONCRETE EXAMPLES

In the next subsection we prove that g_* ∈ H is sufficient to satisfy (A4), while in Subsection E.2 we prove that specific settings naturally satisfy (A4).

E.1. From g_* ∈ H to (A4). Here we assume that there exists g_* ∈ H such that g_*(x) = E[y|x] a.e. on the support of ρ_X. First we introduce A(λ), a quantity related to the approximation error of g_λ with respect to g_*, and we study its behavior when λ → 0. Then we express ∥g_λ − g_*∥_H in terms of A(λ). Finally we prove that for any δ given by (A1), there exists λ such that (A4) is satisfied.

Let (σ_t, u_t)_{t∈N} be an eigenbasis of Σ with σ_1 ≥ σ_2 ≥ ... ≥ 0, and let α_j = ⟨g_*, u_j⟩. We introduce the following quantity:
A(λ) = Σ_{t : σ_t ≤ λ} α_t².

Lemma 4. Under (A2), A(λ) is decreasing for any λ > 0 and lim_{λ→0} A(λ) = 0.

Proof: Under (A2) and by linearity of the trace, we have that
Σ_{j∈N} σ_j = tr(Σ) = tr(∫ K_x ⊗ K_x dρ_X(x)) = ∫ ⟨K_x, K_x⟩_H dρ_X(x) = ∫ K(x, x) dρ_X(x) ≤ R².
Denote by t_λ ∈ N the number min{t ∈ N : σ_t ≤ λ}. Since (σ_j)_{j∈N} is a non-increasing summable sequence, it converges to 0, so lim_{λ→0} t_λ = ∞. Finally, since (α_j²)_{j∈N} is a summable sequence, we have that
lim_{λ→0} A(λ) = lim_{λ→0} Σ_{t : σ_t ≤ λ} α_t² ≤ lim_{λ→0} Σ_{j ≥ t_λ} α_j² = 0.

Here we express ∥g_λ − g_*∥_H in terms of ∥g_*∥_H and of A(λ).

Lemma 5. Under (A2), for any λ > 0 we have
∥g_λ − g_*∥_H ≤ √(√λ ∥g_*∥²_H + A(√λ)).

Proof: Denote by Σ_λ the operator Σ + λI. Note that, since g_* ∈ H, the sequence (α_j²)_{j∈N} is summable. Moreover,
E[y K_x] = E[g_*(x) K_x] = E[(K_x ⊗ K_x) g_*] = Σ g_*,
so g_λ = Σ_λ^{−1} E[y K_x] = Σ_λ^{−1} Σ g_*. We thus have
∥g_λ − g_*∥_H = ∥Σ_λ^{−1} Σ g_* − g_*∥_H = ∥(Σ_λ^{−1} Σ − I) g_*∥_H = λ ∥Σ_λ^{−1} g_*∥_H.
Moreover,
λ ∥(Σ + λI)^{−1} g_*∥_H ≤ ∥√λ (Σ + λI)^{−1/2}∥_op ∥√λ (Σ + λI)^{−1/2} g_*∥_H ≤ ∥√λ (Σ + λI)^{−1/2} g_*∥_H.
Now we express ∥√λ (Σ + λI)^{−1/2} g_*∥_H in terms of A(λ). We have that
λ ∥(Σ + λI)^{−1/2} g_*∥²_H = λ ⟨g_*, (Σ + λI)^{−1} g_*⟩ = λ ⟨g_*, Σ_{j∈N} (σ_j + λ)^{−1} (u_j ⊗ u_j) g_*⟩ = Σ_{j∈N} λ α_j² / (σ_j + λ).


More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

TENSOR PRODUCTS AND PARTIAL TRACES

TENSOR PRODUCTS AND PARTIAL TRACES Lecture 2 TENSOR PRODUCTS AND PARTIAL TRACES Stéphae ATTAL Abstract This lecture cocers special aspects of Operator Theory which are of much use i Quatum Mechaics, i particular i the theory of Quatum Ope

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

Estimation of the essential supremum of a regression function

Estimation of the essential supremum of a regression function Estimatio of the essetial supremum of a regressio fuctio Michael ohler, Adam rzyżak 2, ad Harro Walk 3 Fachbereich Mathematik, Techische Uiversität Darmstadt, Schlossgartestr. 7, 64289 Darmstadt, Germay,

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECURE 4 his lecture is partly based o chapters 4-5 i [SSBD4]. Let us o give a variat of SGD for strogly covex fuctios. Algorithm SGD for strogly covex

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machies ad Kerel Methods Daiel Khashabi Fall 202 Last Update: September 26, 206 Itroductio I Support Vector Machies the goal is to fid a separator betwee data which has the largest margi,

More information

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We

More information

Testing the number of parameters with multidimensional MLP

Testing the number of parameters with multidimensional MLP Testig the umber of parameters with multidimesioal MLP Joseph Rykiewicz To cite this versio: Joseph Rykiewicz. Testig the umber of parameters with multidimesioal MLP. ASMDA 2005, 2005, Brest, Frace. pp.561-568,

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

arxiv: v1 [math.pr] 13 Oct 2011

arxiv: v1 [math.pr] 13 Oct 2011 A tail iequality for quadratic forms of subgaussia radom vectors Daiel Hsu, Sham M. Kakade,, ad Tog Zhag 3 arxiv:0.84v math.pr] 3 Oct 0 Microsoft Research New Eglad Departmet of Statistics, Wharto School,

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5 CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

EFFECTIVE WLLN, SLLN, AND CLT IN STATISTICAL MODELS

EFFECTIVE WLLN, SLLN, AND CLT IN STATISTICAL MODELS EFFECTIVE WLLN, SLLN, AND CLT IN STATISTICAL MODELS Ryszard Zieliński Ist Math Polish Acad Sc POBox 21, 00-956 Warszawa 10, Polad e-mail: rziel@impagovpl ABSTRACT Weak laws of large umbers (W LLN), strog

More information

lim za n n = z lim a n n.

lim za n n = z lim a n n. Lecture 6 Sequeces ad Series Defiitio 1 By a sequece i a set A, we mea a mappig f : N A. It is customary to deote a sequece f by {s } where, s := f(). A sequece {z } of (complex) umbers is said to be coverget

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression

Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression Joural of Machie Learig Research 204) 595-627 Submitted 0/3; Revised 2/3; Published 2/4 Adaptivity of Averaged Stochastic Gradiet Descet to Local Strog Covexity for Logistic Regressio Fracis Bach INRIA

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Self-normalized deviation inequalities with application to t-statistic

Self-normalized deviation inequalities with application to t-statistic Self-ormalized deviatio iequalities with applicatio to t-statistic Xiequa Fa Ceter for Applied Mathematics, Tiaji Uiversity, 30007 Tiaji, Chia Abstract Let ξ i i 1 be a sequece of idepedet ad symmetric

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information

CSE 527, Additional notes on MLE & EM

CSE 527, Additional notes on MLE & EM CSE 57 Lecture Notes: MLE & EM CSE 57, Additioal otes o MLE & EM Based o earlier otes by C. Grat & M. Narasimha Itroductio Last lecture we bega a examiatio of model based clusterig. This lecture will be

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Multi parameter proximal point algorithms

Multi parameter proximal point algorithms Multi parameter proximal poit algorithms Ogaeditse A. Boikayo a,b,, Gheorghe Moroşau a a Departmet of Mathematics ad its Applicatios Cetral Europea Uiversity Nador u. 9, H-1051 Budapest, Hugary b Departmet

More information

A Weak Law of Large Numbers Under Weak Mixing

A Weak Law of Large Numbers Under Weak Mixing A Weak Law of Large Numbers Uder Weak Mixig Bruce E. Hase Uiversity of Wiscosi Jauary 209 Abstract This paper presets a ew weak law of large umbers (WLLN) for heterogeous depedet processes ad arrays. The

More information

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data Doubly Stochastic Primal-Dual Coordiate Method for Regularized Empirical Risk Miimizatio with Factorized Data Adams Wei Yu, Qihag Li, Tiabao Yag Caregie Mello Uiversity The Uiversity of Iowa weiyu@cs.cmu.edu,

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques

More information

The Wasserstein distances

The Wasserstein distances The Wasserstei distaces March 20, 2011 This documet presets the proof of the mai results we proved o Wasserstei distaces themselves (ad ot o curves i the Wasserstei space). I particular, triagle iequality

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric

More information

Optimization Results for a Generalized Coupon Collector Problem

Optimization Results for a Generalized Coupon Collector Problem Optimizatio Results for a Geeralized Coupo Collector Problem Emmauelle Aceaume, Ya Busel, E Schulte-Geers, B Sericola To cite this versio: Emmauelle Aceaume, Ya Busel, E Schulte-Geers, B Sericola. Optimizatio

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

1 Review and Overview

1 Review and Overview CS9T/STATS3: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #6 Scribe: Jay Whag ad Patrick Cho October 0, 08 Review ad Overview Recall i the last lecture that for ay family of scalar fuctios F, we

More information

Linear Classifiers III

Linear Classifiers III Uiversität Potsdam Istitut für Iformatik Lehrstuhl Maschielles Lere Liear Classifiers III Blaie Nelso, Tobias Scheffer Cotets Classificatio Problem Bayesia Classifier Decisio Liear Classifiers, MAP Models

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

Introduction to Optimization Techniques. How to Solve Equations

Introduction to Optimization Techniques. How to Solve Equations Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually

More information

Polynomial identity testing and global minimum cut

Polynomial identity testing and global minimum cut CHAPTER 6 Polyomial idetity testig ad global miimum cut I this lecture we will cosider two further problems that ca be solved usig probabilistic algorithms. I the first half, we will cosider the problem

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

Regression with an Evaporating Logarithmic Trend

Regression with an Evaporating Logarithmic Trend Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

Random Variables, Sampling and Estimation

Random Variables, Sampling and Estimation Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig

More information

SOME GENERALIZATIONS OF OLIVIER S THEOREM

SOME GENERALIZATIONS OF OLIVIER S THEOREM SOME GENERALIZATIONS OF OLIVIER S THEOREM Alai Faisat, Sait-Étiee, Georges Grekos, Sait-Étiee, Ladislav Mišík Ostrava (Received Jauary 27, 2006) Abstract. Let a be a coverget series of positive real umbers.

More information

TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES

TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES M Sghiar To cite this versio: M Sghiar. TURBULENT FUNCTIONS AND SOLVING THE NAVIER-STOKES EQUATION BY FOURIER SERIES. Iteratioal

More information