Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression


Journal of Machine Learning Research (2014) Submitted 10/13; Revised 12/13; Published 2/14

Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression

Francis Bach, INRIA - Sierra Project-team, Département d'Informatique de l'Ecole Normale Supérieure, Paris, France. francis.bach@ens.fr

Editor: Léon Bottou

Abstract

In this paper, we consider supervised learning problems such as logistic regression and study the stochastic gradient method with averaging, in the usual stochastic approximation setting where observations are used only once. We show that after N iterations, with a constant step-size proportional to $1/(R^2\sqrt{N})$ where N is the number of observations and R is the maximum norm of the observations, the convergence rate is always of order $O(1/\sqrt{N})$, and improves to $O(R^2/(\mu N))$ where $\mu$ is the lowest eigenvalue of the Hessian at the global optimum (when this eigenvalue is greater than $R^2/\sqrt{N}$). Since $\mu$ does not need to be known in advance, this shows that averaged stochastic gradient is adaptive to unknown (local) strong convexity of the objective function. Our proof relies on the generalized self-concordance properties of the logistic loss and thus extends to all generalized linear models with uniformly bounded features.

Keywords: stochastic approximation, logistic regression, self-concordance

1. Introduction

The minimization of an objective function which is only available through unbiased estimates of the function values or its gradients is a key methodological problem in many disciplines. Its analysis has been attacked mainly in three scientific communities: stochastic approximation (Fabian, 1968; Ruppert, 1988; Polyak and Juditsky, 1992; Kushner and Yin, 2003; Broadie et al., 2009), optimization (Nesterov and Vial, 2008; Nemirovski et al., 2009), and machine learning (Bottou and Le Cun, 2005; Shalev-Shwartz et al., 2007; Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008; Shalev-Shwartz et al., 2009; Duchi and Singer, 2009; Xiao, 2010). The main algorithms which have emerged are stochastic gradient descent (a.k.a. the Robbins-Monro algorithm), as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging).

For convex optimization problems, the convergence rates of these algorithms depend primarily on the potential strong convexity of the objective function (Nemirovski and Yudin, 1983). For $\mu$-strongly convex functions, after $n$ iterations (i.e., $n$ observations), the optimal rate of convergence of function values is $O(1/(\mu n))$, while for convex functions the optimal rate is $O(1/\sqrt{n})$, both of them achieved by averaged stochastic gradient with step size respectively proportional to $1/(\mu n)$ or $1/\sqrt{n}$ (Nemirovski and Yudin, 1983; Agarwal et al., 2012).

(c) 2014 Francis Bach.

For smooth functions, averaged stochastic gradient with step sizes proportional to $1/\sqrt{n}$ achieves them up to logarithmic terms (Bach and Moulines, 2011).

Convex optimization problems coming from supervised machine learning are typically of the form $f(\theta) = E\big[\ell(y, \langle\theta, x\rangle)\big]$, where $\ell(y, \langle\theta, x\rangle)$ is the loss between the response $y \in R$ and the prediction $\langle\theta, x\rangle \in R$, where $x$ is the input data in a Hilbert space $H$ and linear predictions parameterized by $\theta \in H$ are considered. They may or may not have strongly convex objective functions. This most often depends on (a) the correlations between covariates $x$, and (b) the strong convexity of the loss function $\ell$. The logistic loss $\ell: u \mapsto \log(1 + e^{-u})$ is not strongly convex unless restricted to a compact set (indeed, restricted to $u \in [-U, U]$, we have $\ell''(u) = e^{-u}(1+e^{-u})^{-2} \ge \frac{1}{4}e^{-U}$). Moreover, in the sequential observation model, the correlations are not known at training time. Therefore, many theoretical results based on strong convexity do not apply (adding a squared norm $\frac{\mu}{2}\|\theta\|^2$ is a possibility; however, in order to avoid adding too much bias, $\mu$ has to be small and typically much smaller than $1/\sqrt{n}$, which then makes all strongly-convex bounds vacuous).

The goal of this paper is to show that with proper assumptions, namely self-concordance, one can readily obtain favorable theoretical guarantees for logistic regression, namely a rate of the form $O(R^2/(\mu n))$ where $\mu$ is the lowest eigenvalue of the Hessian at the global optimum, without any exponentially increasing constant factor (e.g., with the notations above, without terms of the form $e^U$).

Another goal of this paper is to design an algorithm and provide an analysis that benefit from hidden local strong convexity without requiring to know the local strong convexity constant in advance. In smooth situations, the results of Bach and Moulines (2011) imply that the averaged stochastic gradient method with step sizes of the form $O(1/\sqrt{n})$ is adaptive to the strong convexity of the problem. However, the dependence on $\mu$ in the strongly convex case is of the form $O(1/(\mu^2 n))$, which is sub-optimal. Moreover, the final rate is rather complicated, notably because all possible step-sizes are considered. Finally, it does not apply here because even in low-correlation settings, the objective function of logistic regression cannot be globally strongly convex.

In this paper, we provide an analysis for stochastic gradient with averaging for generalized linear models such as logistic regression, with a step size proportional to $1/(R^2\sqrt{n})$ where $R$ is the radius of the data and $n$ the number of observations, showing such adaptivity. In particular, we show that the algorithm can adapt to the local strong-convexity constant, that is, the lowest eigenvalue of the Hessian at the optimum. The analysis is done for a finite horizon $N$ and a constant step size decreasing in $N$ as $1/(R^2\sqrt{N})$, since the analysis is then slightly easier, though (a) a decaying step-size could be considered as well, and (b) it could be classically extended to varying step-sizes by a doubling trick (Hazan and Kale, 2011).

2. Stochastic Approximation for Generalized Linear Models

In this section, we present the assumptions our work relies on, as well as related work.

2.1 Assumptions

Throughout this paper, we make the following assumptions. We consider a function $f$ defined on a Hilbert space $H$, equipped with a norm $\|\cdot\|$. Throughout the paper, we identify the Hilbert space and its dual; thus, the gradients of $f$ also belong to $H$ and we use the same norm on these.

Moreover, we consider an increasing family of $\sigma$-fields $(\mathcal{F}_n)_{n \ge 1}$ and we assume that we are given a deterministic $\theta_0 \in H$, and a sequence of functions $f_n: H \to R$, for $n \ge 1$. We make the following assumptions, for a certain $R > 0$:

(A1) Convexity and differentiability of $f$: $f$ is convex and three-times differentiable.

(A2) Generalized self-concordance of $f$ (Bach, 2010): for all $\theta_1, \theta_2 \in H$, the function $\varphi: t \mapsto f[\theta_1 + t(\theta_2 - \theta_1)]$ satisfies: $\forall t \in R$, $|\varphi'''(t)| \le R\|\theta_1 - \theta_2\|\,\varphi''(t)$.

(A3) Attained global minimum: $f$ has a global minimum attained at $\theta_* \in H$.

(A4) Lipschitz-continuity of $f_n$ and $f$: all gradients of $f$ and $f_n$ are bounded by $R$, that is, for all $\theta \in H$, $\|f'(\theta)\| \le R$ and, for all $n \ge 1$, $\|f_n'(\theta)\| \le R$ almost surely.

(A5) Adapted measurability: $\forall n \ge 1$, $f_n$ is $\mathcal{F}_n$-measurable.

(A6) Unbiased gradients: $\forall n \ge 1$, $E(f_n'(\theta_{n-1}) \mid \mathcal{F}_{n-1}) = f'(\theta_{n-1})$.

(A7) Stochastic gradient recursion: $\forall n \ge 1$, $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$, where $(\gamma_n)_{n \ge 1}$ is a deterministic sequence.

In this paper, we will also consider the averaged iterate $\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$, which may be trivially computed on-line through the recursion $\bar\theta_n = \frac{n-1}{n}\bar\theta_{n-1} + \frac{1}{n}\theta_{n-1}$.

Among the seven assumptions above, the non-standard one is (A2): the notion of self-concordance is an important tool in convex optimization and in particular for the study of Newton's method (Nesterov and Nemirovskii, 1994). It corresponds to having the third derivative bounded by the 3/2-th power of the second derivative. For machine learning, Bach (2010) has generalized the notion of self-concordance by removing the 3/2-th power, so that it is applicable to cost functions arising from probabilistic modeling, as shown below. The key consequence of our notion of self-concordance is a relationship, shown in Lemma 9 (Section 5), between the norm of a gradient $\|f'(\theta)\|$ and the excess cost function $f(\theta) - f(\theta_*)$, which is the same as for strongly convex functions, but with the local strong convexity constant rather than the global one (which is equal to zero here).

Our set of assumptions corresponds to the following examples (with i.i.d. data, and $\mathcal{F}_n$ equal to the $\sigma$-field generated by $x_1, y_1, \ldots, x_n, y_n$):

- Logistic regression: $f_n(\theta) = \log(1 + \exp(-y_n\langle x_n, \theta\rangle))$, with data $x_n$ uniformly almost surely bounded by $R$ and $y_n \in \{-1, 1\}$. The norm considered here is also the norm of the Hilbert space. Note that this includes other binary classification losses, such as $f_n(\theta) = -y_n\langle x_n, \theta\rangle + \sqrt{1 + \langle x_n, \theta\rangle^2}$.

- Generalized linear models with uniformly bounded features: $f_n(\theta) = -\langle\theta, \Phi(x_n, y_n)\rangle + \log\int h(y)\exp(\langle\theta, \Phi(x_n, y)\rangle)\,dy$, with $\Phi(x, y) \in H$ almost surely bounded in norm by $R$, for all observations $x_n$ and all potential responses $y$ in a measurable space. This includes multinomial regression and conditional random fields (Lafferty et al., 2001).

- Robust regression: we may use $f_n(\theta) = \varphi(y_n - \langle x_n, \theta\rangle)$, with $\varphi(t) = \log\cosh t = \log\frac{e^t + e^{-t}}{2}$, with a similar boundedness assumption on $x_n$.
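As a concrete illustration of the setting above, the following Python sketch (not part of the paper; the function name and data handling are illustrative) implements the recursion (A7) and the on-line averaging for the logistic regression example, each observation being used exactly once. With a known horizon $N$ and $R = \max_k\|x_k\|$, the constant step size studied later in the paper is $\gamma = 1/(2R^2\sqrt{N})$; the averaging convention below ($\frac{1}{n}\sum_{k=1}^{n}\theta_k$) differs from the text's $\frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$ only by one iterate.

```python
import numpy as np

def averaged_sgd_logistic(X, y, gamma, theta0=None):
    """Single-pass stochastic gradient with averaging for the logistic loss,
    constant step size gamma; labels y must take values in {-1, +1}."""
    n, d = X.shape
    theta = np.zeros(d) if theta0 is None else np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()
    for k in range(n):
        s = 1.0 / (1.0 + np.exp(y[k] * (X[k] @ theta)))   # sigma(-y_k <x_k, theta>)
        grad = -y[k] * s * X[k]                           # f_k'(theta_{k-1}), norm <= ||x_k|| <= R
        theta = theta - gamma * grad                      # stochastic gradient recursion (A7)
        theta_bar += (theta - theta_bar) / (k + 1)        # on-line average of the iterates
    return theta, theta_bar

# Example usage with the step size gamma = 1/(2 R^2 sqrt(N)) for a known horizon N:
# R = np.max(np.linalg.norm(X, axis=1)); gamma = 1.0 / (2 * R**2 * np.sqrt(len(X)))
# theta_N, theta_bar_N = averaged_sgd_logistic(X, y, gamma)
```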

2.2 Running-time Complexity

The stochastic gradient descent recursion $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$ operates in full generality in the potentially infinite-dimensional Hilbert space $H$. There are two practical set-ups where this recursion can be implemented. When $H$ is finite-dimensional with dimension $d$, then the complexity of a single iteration is $O(d)$, and thus $O(dn)$ after $n$ iterations. When $H$ is infinite-dimensional, the recursion can be readily implemented when (a) all functions $f_n$ depend on one-dimensional projections $\langle x_n, \theta\rangle$, that is, are of the form $f_n(\theta) = \varphi_n(\langle x_n, \theta\rangle)$ for certain random functions $\varphi_n$ (e.g., $\varphi_n(u) = \ell(y_n, u)$ in machine learning), and (b) all scalar products $K_{ij} = \langle x_i, x_j\rangle$ between $x_i$ and $x_j$, for $i, j \le n$, can be computed. This may be done through the classical application of the kernel trick (Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004): if $\theta_0 = 0$, we may represent $\theta_n$ as a linear combination of the vectors $x_1, \ldots, x_n$, that is, $\theta_n = \sum_{i=1}^{n}\alpha_i x_i$, and the recursion may be written in terms of the weights $\alpha_n$, through $\alpha_n = -\gamma_n\,\varphi_n'\big(\sum_{i=1}^{n-1}\alpha_i K_{in}\big)$. A key element to notice here is that without regularization, the weights $\alpha_i$ corresponding to previous observations remain constant. The overall complexity of the algorithm is $O(n^2)$ times the cost of evaluating a single kernel function. See Bordes et al. (2005) and Wang et al. (2012) for approaches aiming at reducing the computational load in this setting. Finally, note that in the kernel setting, the function $f(\theta)$ cannot be strongly convex because the covariance operator of $x$ is typically a compact operator, with a sequence of eigenvalues tending to zero (some regularization is then needed).

3. Related Work

In this section, we review related work, first for non-strongly convex problems, then for strongly convex problems.

3.1 Non-strongly-convex Functions

When only convexity of the objective function is assumed, several authors (Nesterov and Vial, 2008; Nemirovski et al., 2009; Shalev-Shwartz et al., 2009; Xiao, 2010) have shown that using a step-size proportional to $1/\sqrt{n}$, together with some form of averaging, leads to the minimax optimal rate of $O(1/\sqrt{n})$ (Nemirovski and Yudin, 1983; Agarwal et al., 2012). Without averaging, the known convergence rates are suboptimal, that is, averaging is key to obtaining the optimal rate (Bach and Moulines, 2011). Note that the smoothness of the loss does not change the rate, but may help to obtain better constants, with the potential use of acceleration (Lan, 2012). Recent work (Bach and Moulines, 2013) has considered algorithms which improve on the rate $O(1/\sqrt{n})$ for smooth self-concordant losses, such as the square and logistic losses. Their analysis relies on some of the results proved in this paper (in particular the high-order bounds in Section 4).

The compactness of the domain is often used within the algorithm (by using orthogonal projections) and within the analysis (in particular to optimize the step size and obtain high-probability bounds). In this paper, we do not make such compactness assumptions, since in a machine learning context, the available bound would be loose and hurt practical performance.

Note that the analysis of the related dual averaging methods (Nesterov, 2009; Xiao, 2010) has also been carried out without compactness assumptions, and previous analyses would also go through in the same set-up for stochastic mirror descent (Nemirovski and Yudin, 1983), at least for bounds in expectation. In the present paper, we derive higher-order bounds and bounds in high probability, where the lack of compactness is harder to deal with.

Another difference between several analyses is the use of decaying step sizes of the form $\gamma_n \propto 1/\sqrt{n}$ vs. the use of a constant step size of the form $\gamma_n \propto 1/\sqrt{N}$ for a finite known horizon $N$ of iterations. The use of a doubling trick as done by Hazan and Kale (2011) for strongly convex optimization, where a constant step size is used for iterations between $2^p$ and $2^{p+1}$, with a constant that is proportional to $1/\sqrt{2^p}$, would allow to obtain an anytime algorithm from a finite-horizon one. In order to simplify our analysis, we only consider a finite horizon $N$ and a constant step-size that will be proportional to $1/\sqrt{N}$.

3.2 Strongly-convex Functions

When the function is $\mu$-strongly convex, that is, $\theta \mapsto f(\theta) - \frac{\mu}{2}\|\theta\|^2$ is convex, there are essentially two approaches to obtaining the minimax-optimal rate of $O(1/(\mu n))$ (Nemirovski and Yudin, 1983; Agarwal et al., 2012): (a) using a step size proportional to $1/(\mu n)$ with averaging for non-smooth problems (Nesterov and Vial, 2008; Nemirovski et al., 2009; Xiao, 2010; Shalev-Shwartz et al., 2009; Duchi and Singer, 2009; Lacoste-Julien et al., 2012), or a step size proportional to $1/(R^2 + \mu n)$, also with averaging, for smooth problems, where $R^2$ is the smoothness constant of the loss of a single observation (Le Roux et al., 2012); (b) for smooth problems, using longer step-sizes proportional to $1/n^{\alpha}$ for $\alpha \in (1/2, 1)$ with averaging (Polyak and Juditsky, 1992; Ruppert, 1988; Bach and Moulines, 2011). Note that the often advocated step size of the form $C/n$, where $C$ is larger than $1/\mu$, leads, without averaging, to a convergence rate of $O(1/(\mu^2 n))$ (Fabian, 1968; Bach and Moulines, 2011), hence with a worse dependence on $\mu$.

The solution (a) requires a good estimate of the strong-convexity constant $\mu$, while the second solution (b) does not require such an estimate and leads to a convergence rate achieving asymptotically the Cramer-Rao lower bound (Polyak and Juditsky, 1992). Thus, this last solution is adaptive to an unknown (but positive) amount of strong convexity. However, unless we take the limiting setting $\alpha = 1/2$, it is not adaptive to lack of strong convexity. While the non-asymptotic analysis of Bach and Moulines (2011) already gives a convergence rate in that situation, the bound is rather complicated and also has a suboptimal dependence on $\mu$. Another goal of this paper is to consider a less general result, but one that is more compact and, as already mentioned, has a better dependence on the strong convexity constant $\mu$ (moreover, as reviewed below, we consider the local strong convexity constant, which is much larger).

Finally, note that unless we restrict the support, the objective function for logistic regression cannot be globally strongly convex (since the Hessian tends to zero when $\theta$ tends to infinity). In this paper we show that stochastic gradient descent with averaging is adaptive to the local strong convexity constant, that is, the lowest eigenvalue of the Hessian of $f$ at the global optimum, without any exponential terms in $RD$ (which would be present if a compact domain of diameter $D$ was imposed and traditional analyses were performed).

3.3 Adaptivity to Unknown Constants

The desirable property of adaptivity to the difficulty of an optimization problem has also been studied in several settings. Gradient descent with constant step size is for example naturally adaptive to the strong convexity of the problem (see, e.g., Nesterov, 2004). In the stochastic context, Juditsky and Nesterov (2010) provide another strategy than averaging with longer step sizes, but for uniform convexity constants.

4. Non-Strongly Convex Analysis

In this section, we study the averaged stochastic gradient method in the non-strongly convex case, that is, without any (global or local) strong convexity assumptions. We first recall existing results in Section 4.1, which bound the expectation of the excess risk, leading to a bound in $O(1/\sqrt{N})$. We then show using martingale moment inequalities how all higher-order moments may be bounded in Section 4.2, still with a rate of $O(1/\sqrt{N})$. However, in Section 4.3, we consider the convergence of the squared gradient, with now a rate of $O(1/N)$. This last result is key to obtaining the adaptivity to local strong convexity in Section 5.

4.1 Existing Results

In this section, we review existing results for Lipschitz-continuous non-strongly convex problems (Nesterov and Vial, 2008; Nemirovski et al., 2009; Shalev-Shwartz et al., 2009; Duchi and Singer, 2009; Xiao, 2010). Note that smoothness is not needed here. We consider a constant step size $\gamma_n = \gamma > 0$, for all $n \ge 1$, and we denote by $\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$ the averaged iterate. We prove the following proposition, which provides a bound on the expectation of $f(\bar\theta_n) - f(\theta_*)$ that decays at rate $O(\gamma + 1/(\gamma n))$, hence the usual choice $\gamma \propto 1/\sqrt{n}$:

Lemma 1 Assume (A1) and (A3-7). With constant step size equal to $\gamma$, for any $n \ge 1$, we have:
$$E f\Big(\frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}\Big) - f(\theta_*) + \frac{1}{2\gamma n}E\|\theta_n - \theta_*\|^2 \le \frac{1}{2\gamma n}\|\theta_0 - \theta_*\|^2 + \frac{\gamma}{2}R^2.$$

Proof We have the following recursion, obtained from the Lipschitz-continuity of $f_n$:
$$\|\theta_n - \theta_*\|^2 = \|\theta_{n-1} - \theta_*\|^2 - 2\gamma\langle\theta_{n-1} - \theta_*, f_n'(\theta_{n-1})\rangle + \gamma^2\|f_n'(\theta_{n-1})\|^2 \le \|\theta_{n-1} - \theta_*\|^2 - 2\gamma\langle\theta_{n-1} - \theta_*, f'(\theta_{n-1})\rangle + \gamma^2 R^2 + M_n,$$
with $M_n = -2\gamma\langle\theta_{n-1} - \theta_*, f_n'(\theta_{n-1}) - f'(\theta_{n-1})\rangle$. We thus get, using the classical result from convexity $f(\theta_{n-1}) - f(\theta_*) \le \langle\theta_{n-1} - \theta_*, f'(\theta_{n-1})\rangle$:
$$2\gamma\big[f(\theta_{n-1}) - f(\theta_*)\big] \le \|\theta_{n-1} - \theta_*\|^2 - \|\theta_n - \theta_*\|^2 + \gamma^2 R^2 + M_n. \qquad (1)$$

Summing over the first $n$ integers, this implies:
$$\frac{1}{n}\sum_{k=1}^{n}f(\theta_{k-1}) - f(\theta_*) + \frac{1}{2\gamma n}\|\theta_n - \theta_*\|^2 \le \frac{1}{2\gamma n}\|\theta_0 - \theta_*\|^2 + \frac{\gamma}{2}R^2 + \frac{1}{2\gamma n}\sum_{k=1}^{n}M_k.$$
We get the desired result by taking expectations in the last inequality, using $E M_k = E\big(E(M_k \mid \mathcal{F}_{k-1})\big) = 0$ and, by convexity, $f(\bar\theta_n) \le \frac{1}{n}\sum_{k=1}^{n}f(\theta_{k-1})$.

The following corollary considers a specific choice of the step size (note that the bound is only true for the last iterate):

Corollary 2 Assume (A1) and (A3-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, we have:
$$\forall n \in \{1, \ldots, N\}, \quad E\|\theta_n - \theta_*\|^2 \le \|\theta_0 - \theta_*\|^2 + \frac{1}{4R^2},$$
$$E f\Big(\frac{1}{N}\sum_{k=1}^{N}\theta_{k-1}\Big) - f(\theta_*) \le \frac{R^2\|\theta_0 - \theta_*\|^2}{\sqrt{N}} + \frac{1}{4\sqrt{N}}.$$

Note that if $\|\theta_0 - \theta_*\|^2$ was known, then a better step-size would be $\gamma = \frac{\|\theta_0 - \theta_*\|}{R\sqrt{N}}$, leading to a convergence rate proportional to $\frac{R\|\theta_0 - \theta_*\|}{\sqrt{N}}$. However, this requires an estimate (or simply an upper-bound) of $\|\theta_0 - \theta_*\|^2$, which is typically not available.

We are going to improve this result in several ways:

- All moments of $\|\theta_n - \theta_*\|^2$ and $f(\bar\theta_n) - f(\theta_*)$ will be bounded, leading to a subexponential behavior. Note that we do not assume that the iterates are restricted to a predefined bounded set, which is the usual assumption made to derive tail bounds for stochastic approximation (Nesterov and Vial, 2008; Nemirovski et al., 2009; Kakade and Tewari, 2009).

- We are going to show that the squared norm of the gradient at $\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$ converges at rate $O(1/n)$, even in the non-strongly convex case. This will allow us to derive finer convergence rates in the presence of local strong convexity in Section 5.

- The bounds above do not explicitly depend on the dimension of the problem; however, in practice, the quantity $R^2\|\theta_0 - \theta_*\|^2$ typically implicitly scales linearly in the problem dimension.
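The step-size choice and the $O(1/\sqrt{N})$ behavior of Corollary 2 can be checked numerically. The following Python sketch (not from the paper; the synthetic data, the proxy for the population objective, and all names are illustrative assumptions) runs the averaged recursion for several horizons $N$ with $\gamma = 1/(2R^2\sqrt{N})$ and compares the estimated excess risk to the right-hand side of Corollary 2 (with $\theta_0 = 0$).

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 5, 1.0
theta_true = rng.normal(size=d)

def sample(n):
    """Synthetic logistic data with inputs of norm at most R and labels in {-1, +1}."""
    x = rng.normal(size=(n, d))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x * (R / np.maximum(norms, R))                    # enforce ||x_k|| <= R
    p = 1.0 / (1.0 + np.exp(-(x @ theta_true)))
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    return x, y

# Proxy for the population objective f: a large independent sample.
X_pop, y_pop = sample(100_000)
def f(theta):
    return np.mean(np.log1p(np.exp(-y_pop * (X_pop @ theta))))

# Approximate theta_* by full-batch gradient descent on the proxy objective
# (the smoothness constant is at most R^2/4 = 1/4, so a step of 4 is safe).
theta_star = np.zeros(d)
for _ in range(500):
    s = 1.0 / (1.0 + np.exp(y_pop * (X_pop @ theta_star)))
    theta_star -= 4.0 * ((-y_pop * s)[:, None] * X_pop).mean(axis=0)

for N in [100, 1_000, 10_000]:
    gamma = 1.0 / (2 * R**2 * np.sqrt(N))                 # step size of Corollary 2
    X, y = sample(N)
    theta, theta_bar = np.zeros(d), np.zeros(d)
    for k in range(N):
        s = 1.0 / (1.0 + np.exp(y[k] * (X[k] @ theta)))
        theta = theta - gamma * (-y[k] * s) * X[k]        # recursion (A7)
        theta_bar += (theta - theta_bar) / (k + 1)        # on-line averaging
    bound = (R**2 * (theta_star @ theta_star) + 0.25) / np.sqrt(N)
    # theta_star is only approximate, so the excess risk estimate is itself approximate.
    print(N, f(theta_bar) - f(theta_star), bound)
```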

4.2 Higher-Order and Tail Bounds

In this section, we prove novel higher-order bounds (see the proof in Appendix C), both for any constant step-size and then for the specific choice $\gamma = \frac{1}{2R^2\sqrt{N}}$. This will immediately lead to tail bounds.

Proposition 3 Assume (A1) and (A3-7). With constant step size equal to $\gamma$, for any $n \ge 1$ and integer $p \ge 1$, we have:
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big]^p \le \big(3\|\theta_0 - \theta_*\|^2 + 20np\gamma^2 R^2\big)^p.$$

Corollary 4 Assume (A1) and (A3-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any integer $p \ge 1$, we have:
$$\forall n \in \{1, \ldots, N\}, \quad E\|\theta_n - \theta_*\|^{2p} \le \Big[\frac{1}{R^2}\big(3R^2\|\theta_0 - \theta_*\|^2 + 5p\big)\Big]^p,$$
$$E\big[f(\bar\theta_N) - f(\theta_*)\big]^p \le \Big[\frac{1}{\sqrt{N}}\big(3R^2\|\theta_0 - \theta_*\|^2 + 5p\big)\Big]^p.$$

In Appendix C, we first provide two alternative proofs of the same result: (a) our original, somewhat tedious, proof based on taking powers of the inequality in Equation (1) and using martingale moment inequalities, and (b) a shorter proof later derived by Bach and Moulines (2013), which uses the Burkholder-Rosenthal-Pinelis inequality (Pinelis, 1994, Theorem 4.1). We also provide in Appendix C a direct proof of the large deviation bound that we now present. Having a bound on all moments allows to immediately derive large deviation bounds (in the same two cases), by applying Lemma 11 from Appendix A:

Proposition 5 Assume (A1) and (A3-7). With constant step size equal to $\gamma$, for any $n \ge 1$ and $t \ge 0$, we have:
$$P\Big(f(\bar\theta_n) - f(\theta_*) \ge 30\gamma R^2 t + \frac{3}{\gamma n}\|\theta_0 - \theta_*\|^2\Big) \le 2\exp(-t),$$
$$P\Big(\|\theta_n - \theta_*\|^2 \ge 60n\gamma^2 R^2 t + 6\|\theta_0 - \theta_*\|^2\Big) \le 2\exp(-t).$$

Corollary 6 Assume (A1) and (A3-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any $t \ge 0$ we have:
$$P\Big(f(\bar\theta_N) - f(\theta_*) \ge \frac{15t}{\sqrt{N}} + \frac{6R^2\|\theta_0 - \theta_*\|^2}{\sqrt{N}}\Big) \le 2\exp(-t),$$
$$P\Big(\|\theta_N - \theta_*\|^2 \ge \frac{15t}{R^2} + 6\|\theta_0 - \theta_*\|^2\Big) \le 2\exp(-t).$$

We can make the following observations:

- The results above are obtained by direct application of Proposition 3. In Appendix C, we also provide an alternative direct proof of a slightly weaker result, which was suggested and outlined by Alekh Agarwal (personal communication), and which uses Freedman's inequality for martingales (Freedman, 1975, Theorem 1.6).

- The results above bounding the norm between the last iterate and a global optimum extend to the averaged iterate. The iterates $\theta_n$ and $\bar\theta_n$ do not necessarily converge to $\theta_*$ (note that $\theta_*$ may not be unique in general anyway).

- Given that $\big(E[f(\bar\theta_n) - f(\theta_*)]^p\big)^{1/p}$ is affine in $p$, we obtain a subexponential behavior, that is, tail bounds similar to an exponential distribution. The same decay was obtained by Nesterov and Vial (2008) and Nemirovski et al. (2009), but with an extra orthogonal projection step that is equivalent in our setting to knowing a bound on $\|\theta_*\|$, which is in practice not available.

- The constants in the bounds of Proposition 3 (and thus other results as well) could clearly be improved. In particular, we have, for $p = 1, 2, 3$ (see proof in Appendix E):
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big] \le \|\theta_0 - \theta_*\|^2 + n\gamma^2 R^2,$$
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big]^2 \le \big(\|\theta_0 - \theta_*\|^2 + 9n\gamma^2 R^2\big)^2,$$
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big]^3 \le \big(\|\theta_0 - \theta_*\|^2 + 20n\gamma^2 R^2\big)^3.$$

4.3 Convergence of Gradients

In this section, we prove higher-order bounds on the convergence of the gradient, with an improved rate $O(1/n)$ for $\|f'(\bar\theta_n)\|^2$. In this section, we will need the self-concordance property in Assumption (A2).

Proposition 7 Assume (A1-7). With constant step size equal to $\gamma$, for any $n \ge 1$ and integer $p \ge 1$, we have:
$$\Big(E\Big\|f'\Big(\frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}\Big)\Big\|^{2p}\Big)^{1/(2p)} \le \frac{R}{\sqrt{n}}\Big[8\sqrt{p} + \frac{4p}{\sqrt{n}} + 40R^2\gamma p\sqrt{n} + \frac{3}{\gamma\sqrt{n}}\|\theta_0 - \theta_*\|^2 + \frac{3}{\gamma R\sqrt{n}}\|\theta_0 - \theta_*\|\Big].$$

Corollary 8 Assume (A1-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any integer $p \ge 1$, we have:
$$\Big(E\Big\|f'\Big(\frac{1}{N}\sum_{k=1}^{N}\theta_{k-1}\Big)\Big\|^{2p}\Big)^{1/(2p)} \le \frac{R}{\sqrt{N}}\Big[8\sqrt{p} + \frac{4p}{\sqrt{N}} + 20p + 6R^2\|\theta_0 - \theta_*\|^2 + 6R\|\theta_0 - \theta_*\|\Big].$$

We can make the following observations:

- The squared norm of the gradient $\|f'(\bar\theta_N)\|^2$ converges at rate $O(1/N)$.

- Given that $\big(E\|f'(\bar\theta_N)\|^{2p}\big)^{1/(2p)}$ is affine in $p$, we obtain a subexponential behavior for $\|f'(\bar\theta_N)\|$, that is, tail bounds similar to an exponential distribution.

- The proof of Proposition 7 makes use of the self-concordance assumption (which allows to upper-bound deviations of gradients by deviations of function values), together with the proof technique of Polyak and Juditsky (1992).

5. Self-Concordance Analysis for Strongly-Convex Problems

In the previous section, we have shown that $\|f'(\bar\theta_N)\|^2$ is of order $O(1/N)$. If the function $f$ were strongly convex with constant $\mu > 0$, this would immediately lead to the bound $f(\bar\theta_N) - f(\theta_*) \le \frac{1}{2\mu}\|f'(\bar\theta_N)\|^2$, of order $O(1/(\mu N))$. However, because of the Lipschitz-continuity of $f$ on the full Hilbert space $H$, it cannot be strongly convex. In this section, we show how the self-concordance assumption may be used to obtain the exact same behavior, but with $\mu$ replaced by the local strong convexity constant, which is more likely to be strictly positive. The required property is summarized in the following proposition about (generalized) self-concordant functions (see proof in Appendix B.1):

Lemma 9 Let $f$ be a convex three-times differentiable function from $H$ to $R$, such that for all $\theta_1, \theta_2 \in H$, the function $\varphi: t \mapsto f[\theta_1 + t(\theta_2 - \theta_1)]$ satisfies: $\forall t \in R$, $|\varphi'''(t)| \le R\|\theta_1 - \theta_2\|\,\varphi''(t)$. Let $\theta_*$ be a global minimizer of $f$ and $\mu$ the lowest eigenvalue of $f''(\theta_*)$, which is assumed strictly positive. If $\|f'(\theta)\| \le \frac{3\mu}{4R}$, then
$$\|\theta - \theta_*\|^2 \le \frac{4\|f'(\theta)\|^2}{\mu^2} \quad \text{and} \quad f(\theta) - f(\theta_*) \le \frac{2\|f'(\theta)\|^2}{\mu}.$$

We may now use this proposition for the averaged stochastic gradient. For simplicity, we only consider the step-size $\gamma = \frac{1}{2R^2\sqrt{N}}$, and the last iterate (see proof in Appendix F):

Proposition 10 Assume (A1-7). Assume $\gamma = \frac{1}{2R^2\sqrt{N}}$. Let $\mu > 0$ be the lowest eigenvalue of the Hessian of $f$ at the unique global optimum $\theta_*$. Then:
$$E f(\bar\theta_N) - f(\theta_*) \le \frac{R^2}{N\mu}\big(5R\|\theta_0 - \theta_*\| + 1\big)^4, \qquad E\|\bar\theta_N - \theta_*\|^2 \le \frac{R^2}{N\mu^2}\big(6R\|\theta_0 - \theta_*\| + 2\big)^4.$$

We can make the following observations:

- The proof relies on Lemma 9 and requires a control of the probability that $\|f'(\bar\theta_N)\| \le \frac{3\mu}{4R}$, which is obtained from Proposition 7.

- We conjecture a bound of the form $\big[\frac{R^2 p}{N\mu}\big(c_1 R\|\theta_0 - \theta_*\| + c_2\big)^4\big]^p$ for the $p$-th order moment of $f(\bar\theta_N) - f(\theta_*)$, for some scalar constants $c_1$ and $c_2$.

- The new bound now has the term $R\|\theta_0 - \theta_*\|$ with a fourth power (compared to the bound in Lemma 1, which has a second power), which typically grows with the dimension of the underlying space (or the slowness of the decay of the eigenvalues of the covariance operator when $H$ is infinite-dimensional). It would be interesting to study whether this dependence can be reduced.

- The key elements in the previous proposition are that (a) the constant $\mu$ is the local convexity constant, and (b) the step-size does not depend on that constant $\mu$, hence the claimed adaptivity. The bounds are only better than the non-strongly-convex bounds from Lemma 1 when the Hessian lowest eigenvalue is large enough, that is, when $\mu R^{-2}\sqrt{N}$ is larger than a fixed constant.

- In the context of logistic regression, even when the covariance matrix of the inputs is invertible, the only available lower bound on $\mu$ is equal to the lowest eigenvalue of the covariance matrix times $\exp(-R\|\theta_*\|)$, which is exponentially small. However, this lower bound is overly pessimistic since it is based on an upper bound on the largest possible value of $\langle x, \theta_*\rangle$. In practice, the actual value of $\mu$ is much larger and only a small constant smaller than the lowest eigenvalue of the covariance matrix.

- In order to assess whether this result can be improved, it is interesting to look at the asymptotic result from Polyak and Juditsky (1992) for logistic regression, which leads to a limit rate of $1/n$ times $\mathrm{tr}\big[f''(\theta_*)^{-1}\,E\big(f_n'(\theta_*)\otimes f_n'(\theta_*)\big)\big]$; note that this rate holds both for the stochastic approximation algorithm and for the global optimum of the training cost, using standard asymptotic statistics results (Van der Vaart, 1998).

When the model is well-specified, that is, when the log-odds ratio of the conditional distribution of the label given the input is linear, then $E\big(f_n'(\theta_*)\otimes f_n'(\theta_*)\big) = E f_n''(\theta_*) = f''(\theta_*)$, and the asymptotic rate is exactly $d/n$, where $d$ is the dimension of $H$ (which has to be finite-dimensional for the covariance matrix to be invertible). It would be interesting to see if, by making the extra assumption of well-specification, we can also get an improved non-asymptotic result. When the model is mis-specified however, the quantity $E\big(f_n'(\theta_*)\otimes f_n'(\theta_*)\big)$ may be large even when $f''(\theta_*)$ is small, and the asymptotic regime does not readily lead to an improved bound.

6. Conclusion

In this paper, we have provided a novel analysis of averaged stochastic gradient for logistic regression and related problems. The key aspects of our result are (a) the adaptivity to local strong convexity provided by averaging and (b) the use of self-concordance to obtain a simple bound that does not involve a term which is explicitly exponential in $R\|\theta_0 - \theta_*\|$, which could be obtained by constraining the domain of the iterates.

Our results could be extended in several ways: (a) with a finite and known horizon $N$, we considered a constant step-size proportional to $1/(R^2\sqrt{N})$; it thus seems natural to study the decaying step size $\gamma_n = O(1/(R^2\sqrt{n}))$, which should, up to logarithmic terms, lead to similar results and thus likely provide a solution to a recently posed open problem for online logistic regression (McMahan and Streeter, 2012); (b) an alternative would be to consider a doubling trick where the step-sizes are piecewise constant; also, (c) it may be possible to consider other assumptions, such as exp-concavity (Hazan and Kale, 2011) or uniform convexity (Juditsky and Nesterov, 2010), to derive similar or improved results. Finally, by departing from a plain averaged stochastic gradient recursion, Bach and Moulines (2013) have considered an online Newton algorithm with the same running-time complexity, which leads to a rate of $O(1/n)$ without strong convexity assumptions for logistic regression (though with additional assumptions regarding the distributions of the inputs). It would be interesting to understand if simple assumptions such as the ones made in the present paper are possible while preserving the improved convergence rate.

Acknowledgments

The author was partially supported by the European Research Council (SIERRA Project), and thanks Simon Lacoste-Julien, Eric Moulines and Mark Schmidt for helpful discussions. Moreover, Alekh Agarwal suggested and provided a detailed outline of the proof technique based on Freedman's inequality; this was greatly appreciated.

Appendix A. Probability Lemmas

In this appendix, we prove simple lemmas relating bounds on moments to tail bounds, through the traditional use of Markov's inequality. See more general results by Boucheron et al. (2013).
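The core of the argument behind these lemmas can be summarized as follows (a compact sketch, not verbatim from the paper): a polynomial moment bound of the form $E[X^p] \le (A + Bp)^p$ turns into a subexponential tail bound by applying Markov's inequality at the level $2A + 2Bp$ and then choosing $p$ of order $t$.

```latex
% Sketch: from moment bounds to a subexponential tail via Markov's inequality.
% Assume X >= 0 and E[X^p] <= (A + Bp)^p for all integers p >= 1.
\[
  \mathbb{P}\big(X \geq 2A + 2Bp\big)
  \;\leq\; \frac{\mathbb{E}[X^p]}{(2A + 2Bp)^p}
  \;\leq\; \Big(\frac{A + Bp}{2A + 2Bp}\Big)^{p}
  \;=\; 2^{-p}
  \;=\; \exp(-p \log 2).
\]
% Taking the integer p closest to t / log 2 (and absorbing the rounding and the
% restriction on the range of p into the constants) yields a bound of the form
% P(X >= 3Bt + 2A) <= 2 exp(-t), which is the shape used in Proposition 5 and Corollary 6.
```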

12 Bach Lemma Let X be a o-egative radom variable such that for some positive costats A ad B, ad all p {,..., }, EX p A + Bp) p. The, if t 2, PX 3Bt + 2A) 2 exp t). Proof We have, by Markov s iequality, for ay p {,..., }: PX 2Bp + 2A) EX p 2Bp + 2A) p For u, ], we cosider p = u, so that A + Bp)p = exp log2)p). 2A + 2Bp) p PX 2Bu + 2A) PX 2Bp + 2A) exp log2)p) 2 exp log2)u). We take t = log2)u ad use 2/ log 2 3. This is thus valid if t 2. Lemma 2 Let X be a o-egative radom variable such that for some positive costats A, B ad C, ad for all p {,..., }, EX p A p + Bp + C) 2p. The, if t, PX 2A t + 2Bt + 2C) 2 ) 4 exp t). Proof We have, by Markov s iequality, for ay p {,..., }: PX 2A p + 2Bp + 2C) 2 ) EX p 2A p + 2Bp + 2C) 2p A p + Bp + C) 2p 2A exp log4)p). p + 2Bp + 2C) 2p For u, ], we cosider p = u, so that PX 2A u + 2Bu + 2C) 2 ) PX 2A u + 2Bu + 2C) 2 ) exp log2)p) 4 exp log4)u). We take t = log4)u ad use log 4. This is thus valid if t. 606

13 Adaptivity of Averaged Stochastic Gradiet Descet Appedix B. Self-Cocordace Properties I this appedix, we show two lemmas regardig our geeralized otio of self-cocordace, as well as Lemma 9. For more details, see Bach 200) ad refereces therei. The followig lemma provide a upper-boud o a oe-dimesioal self-cocordat fuctio at a give poit which is based o the gradiet at this poit ad the value ad the Hessia at the global miimum. This is key to goig i Sectio 5 from a covergece of gradiets to a covergece of fuctio values. Lemma 3 Let ϕ : 0, ] R a strictly covex three-times differetiable fuctio such that for some S > 0, t 0, ], ϕ t) Sϕ t). Assume ϕ 0) = 0, ϕ 0) > 0. The: ϕ ) ϕ 0) S e S ad ϕ) ϕ0) + ϕ ) 2 ϕ + S). 0) Moreover, if α = ϕ )S ϕ 0) <, the ϕ) ϕ0) + ϕ ) 2 ϕ 0) α log α. If i additio α 3 4, the ϕ) ϕ0) + 2 ϕ ) 2 ϕ 0) ad ϕ 0) 2ϕ ). Proof By self-cocordace, we obtai that the derivative of u log ϕ u) is lower-bouded by S. By itegratig betwee 0 ad t 0, ], we get log ϕ t) log ϕ 0) St, that is, ϕ t) ϕ 0)e St, 2) ad by itegratig betwee 0 ad, we obtai ote that we have assumed ϕ 0) = 0): ϕ ) ϕ 0) e S. 3) S We the get with a first iequality from covexity of ϕ, ad the last iequality from e S + S): ϕ) ϕ0) ϕ ) ϕ ) ϕ ) S ϕ 0) e S = ϕ ) 2 ϕ S + S ) 0) e S ϕ ) 2 ϕ + S). 0) Equatio 3) implies that α e S, which implies, if α <, S log that ϕ) ϕ0) ϕ ) ϕ ) S ϕ 0) e S ϕ ) 2 ϕ 0) α log α, S α. This implies usig the mootoicity of S. Fially the last bouds are a cosequece of S e S α α log α 2, which is valid for α 3 4. Note that i Equatio 2), we do cosider a lower-boud o the Hessia with a expoetial factor e St. The key feature of usig self-cocordace properties is to get aroud this expoetial factor i the fial boud. The followig lemma upper-bouds the remaider i the first-order Taylor expasio of the gradiet by the remaider i the first-order Taylor expasio of the fuctio. This is importat whe fuctio values behave well i.e., coverge to the miimal value) while the iterates may ot. 607

14 Bach Lemma 4 Let f be a covex three-times differetiable fuctio from H to R, such that for all θ, θ 2 H, the fuctio ϕ : t f θ + tθ 2 θ ) ] satisfies: t R, ϕ t) R θ θ 2 ϕ t). For ay θ, θ 2 H, we have: f θ ) f θ 2 ) f θ 2 )θ 2 θ ) R fθ ) fθ 2 ) f θ 2 ), θ 2 θ ]. Proof For a give z H of uit orm, let ϕt) = z, f θ 2 + tθ θ 2 ) ) f θ 2 ) tf θ 2 )θ 2 θ ) ad ψt) = R fθ 2 + tθ θ 2 )) fθ 2 ) t f θ 2 ), θ 2 θ ]. We have ϕ0) = ψ0) = 0. Moreover, we have the followig derivatives: ϕ t) = z, f θ 2 + tθ θ 2 ) ) f θ 2 ), θ θ 2 ϕ t) = f θ 2 + tθ θ 2 ) ) z, θ θ 2, θ θ 2 ] R z 2 f θ 2 + tθ θ 2 ) ) θ θ 2, θ θ 2 ], usig the Appedix A of Bach 200), = R θ 2 θ, f θ 2 + tθ θ 2 ) ) θ θ 2 ) ψ t) = R f θ 2 + tθ θ 2 ) ) f θ 2 ), θ θ 2 ψ t) = R θ 2 θ, f θ 2 + tθ θ 2 ) ) θ θ 2 ), where f θ) is the third order tesor of third derivatives. This leads to ϕ 0) = ψ 0) = 0 ad ϕ t) ψ t). We thus have ϕ) ψ) by itegratig twice, which leads to the desired result by maximizig with respect to z. B. Proof of Lemma 9 We follow the stadard proof techiques i self-cocordat aalysis ad defie a appropriate fuctio of a sigle real variable ad apply simple lemmas like the oes above. Defie ϕ : t f θ + tθ θ ) ] fθ ). We have ϕ t) = f θ + tθ θ ) ], θ θ ϕ t) = θ θ, f θ + tθ θ ) ] θ θ ) ϕ t) = f θ + tθ θ ) ] θ θ, θ θ, θ θ ]. We thus have: ϕ0) = ϕ 0) = 0, 0 ϕ ) = f θ), θ θ f θ) θ θ, ϕ 0) = θ θ, f θ )θ θ ) µ θ θ 2, ad ϕt) 0 for all t 0, ]. Moreover, ϕ t) R θ θ ϕ t) for all t 0, ], that is, Lemma 3 applies with S = R θ θ. This leads to the desired result, with α = ϕ )S ϕ 0) iequality i Lemma 3), for all θ H ad without ay assumptio o θ): Appedix C. Proof of Propositio 3 f θ) R µ. Note that we also have usig the secod fθ) fθ ) + R θ θ ) f θ) 2. µ We provide two alterative proofs of the same result: a) our origial somewhat tedious proof i Appedices C.3 ad C.4, based o takig powers of the iequality i Equatio ) 608

15 Adaptivity of Averaged Stochastic Gradiet Descet ad usig martigale momet iequalities, b) a shorter proof i Appedix C.5, later derived by Bach ad Moulies 203), that uses Burkholder-Rosethal-Pielis iequality Pielis, 994, Theorem 4.). Aother proof techique was suggested ad outlied by Alekh Agarwal persoal commuicatio), that uses Freedma s iequality for martigales Freedma, 975, Theorem.6); it allows to directly get a tail boud like i Propositio 5. This proof will be preseted i Appedix C.6. Note that the two shorter proofs curretly lead to slightly worse costats or to extra logarithmic factors), that may be improved with more refied derivatios. All proofs start from a similar martigale set-up that we describe i Appedix C. ad use a almost-sure boud whe p gets large Appedix C.2). C. Boudig Martigales From the proof of Lemma, we have the recursio: 2γ fθ ) fθ ) ] + θ θ 2 θ θ 2 + γ 2 R 2 + M, with M = 2γ θ θ, f θ ) f θ ). This leads to, by summig from to, ad usig the covexity of f: with 2γf ) θ k 2γfθ ) + θ θ 2 A, A = θ 0 θ 2 + γ 2 R 2 + M k 0. Note that A may also be defied recursively as A 0 = θ 0 θ 2 ad A = A + γ 2 R 2 + M. 4) The radom variables M ) ad A ) satisfy the followig properties that will proved useful throughout the proof: a) Martigale icremet: for all k, EM k F k ) = 0. This implies that S = M k is a martigale. b) Boudedess: M k 4γR θ k θ 4γRA /2 k almost surely. C.2 Almost Sure Boud I this sectio, we derive a almost sure boud that will be valid for small. From the stochastic gradiet recursio θ = θ γf θ ), we get, usig Assumptio A4) ad the triagle iequality: θ θ θ θ + γ f θ ) θ θ + γr almost surely. 609

16 Bach This leads to θ θ θ 0 θ + γr for all 0. This i tur implies that A θ 0 θ 2 + γ 2 R 2 + 4γR θ 0 θ 2 + γ 2 R 2 + 4γR θ k θ usig M k 4γR θ k θ, θ0 θ + k )γr ] usig the iequality above, θ 0 θ 2 + γ 2 R 2 + 4γR θ 0 θ + 2γ 2 R 2 2 by summig over the first itegers, θ 0 θ 2 + γ 2 R 2 + 2γ 2 2 R θ 0 θ 2 + 2γ 2 R 2 2 usig ab a2 2 + b2 2, 3 θ 0 θ γ 2 R 2 almost surely. 5) This implies that the boud is show for all p 4. C.3 Derivatio of p-th Order Recursio The first proof works as follows: a) derive a recursio betwee the p-th momets ad the lower-order momets this sectio) ad c) prove the result by iductio o p Appedix C.4). Note that we have to treat separately small values o i the recursio, for which we use the almost sure boud from Appedix C.2. Startig from Equatio 4), usig the biomial expasio formula, we get: ) p p ) p A p A + γ 2 R 2 + M = A + γ 2 R 2) p k M k k A + γ 2 R 2) p + p A + γ 2 R 2) p ) p p M + A + γ 2 R 2) p k /2 k. 4γRA k ) This leads to, usig EM F ) = 0, upper boudig γ 2 R 2 by 4γ 2 R 2, ad usig the biomial expasio formula several times: k=2 E A p ] F A + 4γ 2 R 2) p ) p p + A + 4γ 2 R 2) p k /2 ) k 4γRA k k=2 = A + 4γ 2 R 2 + 4γRA /2 ) p 4γRp A + 4γ 2 R 2) p /2 A by isolatig the term k = i the biomial formula, = A /2 + 2γR) 2p 4γRp A + 4γ 2 R 2) p /2 A 2p ) 2p = A k/2 k 2γR)2p k 4γRpA /2 p p k = 2p A k/2 2γR)2p k C k, ) A k 2γR) 2p k) 60

17 Adaptivity of Averaged Stochastic Gradiet Descet with the costats C k defied as: ) 2p C 2q = for q {0,..., p}, 2q ) ) 2p p C 2q+ = 2p for q {0,..., p }. 2q + q I particular, C 0 =, C 2p =, C = 0 ad C 2p = ) 2p 2p 2p p ) = 0. Our goal is ow to boudig the values of C k to obtai Equatio 8) below. This will be doe by boudig the odd-idexed elemet by the eve-idexed elemets. We have, for q {,..., p 2}, C 2q+ 2q + 2p 2q = = ) 2p 2q + 2q + 2p 2q 2p)! 2q + )!2p 2q )! 2p)! 2p 2q 2q)!2p 2q)! 2p 2q = 2q + 2p 2q 2p 2q ) 2p 2q 2p 2q. 6) 2q+ For the ed of the iterval above i q, that is, q = p 2, we obtai C 2q+ 2p 2q C 2q 4 3, 2q+ while for q p 3, we obtai C 2q+ 2p 2q C 2q 6 5. Moreover, for q {,..., p 2}, C 2q+ 2p 2q 2q + = = 2p 2q + ) 2p 2q 2q + 2p)! 2q + )!2p 2q )! 2p)! 2q + 2)!2p 2q 2)! 2p 2q 2q + 2q + 2 2q + = ) 2p 2q + 2 2q + 2 2q +. 7) 2p 2q 4 For the ed of the iterval above i q, that is, q =, we obtai C 2q+ 2q+ C 2q+2 3, 2p 2q 6 while for q 2, we obtai C 2q+ 2q+ C 2q+2 5. We have moreover, by usig the boud 2γRA /2 α 2 2γR)2 + 2α A for α = 2q+ 2p 2q : C 2q+ A q+/2 2γR)2p 2q = C 2q+ A q 2γR)2p 2q 2 A /2 2γR) C 2q+ A q ] 2q + 2p 2q 2γR)2p 2q 2 2 2p 2q 2γR)2 + A 2q + = 2 C 2p 2q 2q+ A q+ 2q + 2γR)2p 2q C 2q + 2q+ 2p 2q Aq 2γR)2p 2q. By combiig the previous iequality with Equatio 6) ad Equatio 7), we get that the terms idexed by 2q + are bouded by the terms idexed by 2q + 2 ad 2q. All terms with q {2,..., p 3} are expaded with costats 3 5, while for q = ad q = p 2, this is 6

18 Bach 2 3. Overall each eve term receives a cotributio which is less tha max{ 6 5, , 2 3 } = 9 This leads to p 2 C 2q+ A q+/2 2γR)2p 2q 9 p C 2q A q 2γR)2p 2q, q= leadig to the recursio that will allow us to derive our result: C.4 Proof by Iductio E A p ] F A p + 34 p q=0 q=0. ) 2p A q 2q 2γR)2p 2q. 8) We ow proceed by iductio o p. If we assume that EA q k 3 θ 0 θ 2 + kqγ 2 R 2 B ) q for all q < p, ad a certai B which we will choose to be equal to 20). We first ote that if 4p, the from Equatio 5), we have EA p 3 θ 0 θ γ 2 R 2 ) p 3 θ 0 θ pγ 2 R 2 ) p. Thus, we oly eed to cosider 4p. We the get from Equatio 8): E θ θ 2p θ 0 θ 2p + 34 θ 0 θ 2p + 34 p q=0 p q=0 ) 2p EA q 2q k 2γR)2p 2q ) 2p 3 θ0 θ 2 + kqγ 2 R 2 B ) q 2γR) 2p 2q, 2q usig the iductio hypothesis. We may ow sum with respect to k: E θ θ 2p θ 0 θ 2p + 34 θ 0 θ 2p + 34 usig p q=0 p q=0 2p 2q k α α+ α + = θ 0 θ 2p + 34 p j=0 )2γR) 2p 2q ) 2p 2γR) 2p 2q 2q for ay α > 0, 3 θ0 θ 2 + kqγ 2 R 2 B ) q q ) q qγ 3 j θ 0 θ 2j 2 R 2 B ) q j q j+ j q j + j=0 p 3 j θ 0 θ 2j 4γ 2 R 2 ) p j q=j 2p 2q ) q j ) qb 4 by chagig the order of summatios. We ow aim to show that it is less tha j=0 ) q j q p+ q j +, p p ) p 3 θ 0 θ 2 + kpγ 2 R B) 2 = 3 p θ 0 θ 2p + 3 j θ 0 θ 2j γ 2 R 2 ) p j Bp) p j. j 62

19 Adaptivity of Averaged Stochastic Gradiet Descet By comparig all terms i θ 0 θ 2j, this is true as soo as for all j {0,..., p }, 34 p q=j ) ) 2p q qb/4 ) q j 2q j q j + Bp/4)p j p q ) p j 34 p j 2p 2k + 2 ) p k j ) p k)b/4 ) p k j p k j Bp/4)p j k ) p, j obtaied by usig the chage of variable k = p q. This is implied by, usig 4p: 36 p j ) ) p k 2p B k p k p+j j 2k + 2 p j) p k ) p k j p k j. By expadig the biomial coefficiets ad simplifyig by p k j, this is equivalet to 36 p j ) 2p p k) p k j + ) B k p k p+j ) p k j p k. 2k + 2 p p j + ) We may ow write p k) p k j + ) p p j + ) = = p k)! p j)! p k)! p j)! = p k j)! p! p! p k j)! p j) p k j + ), p p k) so that we oly eed to show that 36 p j ) 2p p j) p k j + ) B k p k p+j ) p k j p k. 2k + 2 p p k) 63

20 Bach We have, by boudig all terms the tha p by p: = 36 = 36 = p j p j p j ) 2p p j) p k j + ) A k p k p+j ) p k j p k 2k + 2 p p k) ) 2p A k p k p+j 2k + 2 ) 2p A k p k 2k + 2 p j A k p j p j p k 2k + 2)! A k p 2 2 2k+2 2k + 2)! A k 2 2k+2 2k + 2)! p k p p k) pp k j p p k) 2p2p ) 2p 2k ) p p k) pp /2) p k /2) p p k) by associatig all 2k + 2 terms i ratios which are all less tha, + 2/ A) 2k+2 2k + 2)! = 36 ] cosh2/ A) < if A 20. We thus get the desired result EA p 3 θ 0 θ pγ 2 R 2) p, ad the propositio is proved by iductio. C.5 Alterative Proof Usig Burkholder-Rosethal-Pielis Iequality I this sectio, we preset a slightly modified versio of) the proof from Bach ad Moulies 203) which is based o Burkholder-Rosethal-Pielis iequality Pielis, 994, Theorem 4.), which we ow recall. C.5. BRP Iequality Throughout the proof, we use the otatio for X H a radom vector, ad p ay real umber greater tha, X p = E X p) /p. We first recall the Burkholder-Rosethal- Pielis BRP) iequality Pielis, 994, Theorem 4.). Let p R, p 2 ad F ) 0 be a sequece of icreasig σ-fields, ad X ) a adapted sequece of elemets of H, such that E ] X F = 0, ad X p is fiite. The, sup k {,...,} k p X j p j= p E X k 2 ] /2 F k + p p/2 E X k 2 ] /2 F k + p p/2 sup k {,...,} sup k {,...,} X k 9) p /2 X k 2. p/2 64

21 Adaptivity of Averaged Stochastic Gradiet Descet C.5.2 Proof of Propositio 3 With Slightly Worse Costats) We use BRP s iequality i Equatio 9) to get, for p 2, /4]: sup k {0,...,} Thus if B = p A k θ 0 θ 2 + γ 2 R 2 + p 6γ2 R 2 +p sup /2 θ k θ 2 k {,...,} θ 0 θ 2 + γ 2 R 2 + p 4γR θ 0 θ 2 + γ 2 R 2 + 4γR +p 4γR sup k {0,..., } p/2 4γR θ k θ sup k {0,..., } A k /2 p/2 p sup A /2 k k {0,..., } p /2 ) A k p + p. sup k {0,...,} A k p, we have usig p /4, which implies p + p 3 2 p): p/2 By solvig this quadratic iequality, we get: B θ 0 θ 2 + γ 2 R 2 + 6γRB /2 p. B /2 3γR p ) 2 θ0 θ 2 + γ 2 R 2 + 9γ 2 R 2 p, which implies B /2 3γR p + θ 0 θ 2 + γ 2 R 2 + 9γ 2 R 2 p B 2 9γ 2 R 2 p + 2 θ 0 θ 2 + γ 2 R 2 + 9γ 2 R 2 p ) 40γ 2 R 2 p + 2 θ 0 θ 2. The previous statemet is valid for p 2 ad trivial for p =. From Appedix C.2, we oly eed to have the result for p 4. Thus the boud is slightly worse but could be clearly improved with more care, for example, by usig iductio o ). C.6 Alterative Proof Usig Freedma s Iequality I the previous sectio, we have used p-th order momet martigale iequalities that relate the orm of a martigale to the orm of its predictable quadratic variatio process. Similar results may be obtaied for tail bouds through Freedma s iequality Freedma, 975, Theorem.6). This proof techique was suggested ad outlied by Alekh Agarwal persoal commuicatio). C.6. Freedma s Iequality ad Extesios Let X ) be a real-valued martigale icremet adapted to the icreasig sequece of σ- fields F ), that is, such that EX F ) = 0, that is almost surely bouded, that is, X R 6

22 Bach almost surely. Let Σ = EX2 k F k ) the predictable quadratic variatio process. The for ay costats t ad σ 2, P max k {,...,} k i= X i t, Σ σ 2) 2 exp t 2 2σ 2 + Rt/3) Whe X ) are idepedet radom variables, this recovers Berstei s iequality. From this boud, oe may derive the followig boud Kakade ad Tewari, 2009); with probability 4log )δ, we have: max k {,...,} k i= { X i max 2 Σ, 3R log δ } log δ 2 Σ log δ + 3R log δ. 0) Note that the result of Kakade ad Tewari 2009) cosiders oly max k {,...,} k X i, but that the extesio of their proof is straightforward. i= C.6.2 Proof of Propositio 5 With Slightly Worse Costats ad Scaligs) ). i= X i rather tha We ca ow apply the iequality i Equatio 0) to M ). We have M 4γR θ θ 4γR θ 0 θ +γr ) almost surely. Moreover, EM F 2 ) 6γ 2 R 2 θ θ 2 6γ 2 R 2 A. This leads to with probability greater tha 4log )δ, max A k k {,...,} θ 0 θ 2 + γ 2 R 2 + 8γR A k log δ + 2γR θ 0 θ + γr ) log δ θ 0 θ 2 + γ 2 R 2 + 8γR max Ak log δ k {,...,} +2γR θ 0 θ + γr ) log δ. We may ow solve the quadratic iequality i max k {,...,} Ak. This leads to The max k {,...,} Ak 4γR ) 2 log δ θ 0 θ 2 + γ 2 R 2 + 2γR θ 0 θ + γr ) log δ + 6γ2 R 2 log δ = θ 0 θ 2 + γ 2 R 2 + 2γR θ 0 θ + 28γ 2 R 2) log δ. max k {,...,} Ak log δ + θ 0 θ + γr + 2γR θ 0 θ + 28γ 2 R 2 4γR log δ 66

23 Adaptivity of Averaged Stochastic Gradiet Descet ad max k {,...,} A k 64γ 2 R 2 log δ + 4 θ 0 θ 2 + 4γ 2 R γR θ 0 θ + 28γ 2 R 2) log δ 4 θ 0 θ 2 + 4γ 2 R γ 2 R γR θ 0 θ + 2γ 2 R 2) log δ ) 4 θ 0 θ 2 + 4γ 2 R γ 2 R γR θ 0 θ log δ. We thus recover a tail boud which is very similar to the oe obtaied i Propositio 5, with the followig differeces: the additioal term 48γR θ 0 θ is uimportat because γ = ON /2 ); however, because the extesio of Freedma s iequality is satisfied with probability 4log )δ, this proof techique loses a logarithmic factor. Appedix D. Proof of Propositio 7 The proof is orgaized i two parts: we first show a boud o the averaged gradiet f θ k ), the relate it to the gradiet at the averaged iterate, that is, f ) θ k, usig self-cocordace. D. Boud o f θ k ) We have, followig Polyak ad Juditsky 992) ad Bach ad Moulies 20): f θ ) = γ θ θ ), which implies, by summig over all itegers betwee ad : f θ k ) = f θ k ) f k θ k ) ] + γ θ 0 θ ) + γ θ θ ). We deote X k = f θ k ) f k θ k ) ] H. We have: X k 2R almost surely ad EX k F k ) = 0, with E X k 2 F k ) ) /2 2R. We may thus apply the Burkholder-Rosethal-Pielis iequality Pielis, 994, Theorem 4.), ad get: E f θ k ) f k θ k ) ] 2p] /2p 2p 2R + 2p 2R /2. 67

24 Bach This leads to, usig Propositio 3 ad Mikowski s iequality: E f 2p] /2p θ k ) E f θ k ) f k θ k ) ] 2p] /2p + γ θ 0 θ + E θ θ 2p] /2p γ 2p 2R + 2p 2R /2 + γ θ 0 θ + 3 θ0 θ γ pγ 2 R 2] 2p 2R + 2p 2R /2 + γ θ 0 θ + 3 γ θ 0 θ + ] 20pγR γ 4pR + 2p 2R /2 + 2 γ θ 0 θ + 20pγR γ 4pR + p R ] γ θ 0 θ 4pR + 8 p R + 3 γ θ 0 θ. ) D.2 Usig Self-Cocordace Usig the self-cocordace property of Lemma 4 several times, we obtai: ) f θ k ) f θ k = f θ k ) f θ ) f θ )θ k θ ) ] ) ) f θ k + f θ ) + f θ ) θ k θ R fθk ) fθ ) f θ ), θ k θ ] ) +R f θ k fθ ) + f θ ), 2R ) fθ k ) fθ ) usig the covexity of f. θ k θ ] This leads to, usig Propositio 3: E ) 2p) /2p f θ k ) f θ k ] 2p ) /2p 2R E fθ k ) fθ ) 2R ) 3 θ 0 θ pγ 2 R 2. 2) 2γ Summig Equatio ) ad Equatio 2) leads to the desired result. 68

25 Adaptivity of Averaged Stochastic Gradiet Descet Appedix E. Results for Small p I Propositio 3, we may replace the boud 3 θ 0 θ pγ 2 R 2 with a boud with smaller costats for p =, 2, 3 to be used i proofs of results i Sectio 5). This is doe usig the same proof priciple but fier derivatios, as follows. We deote γ 2 R 2 = b ad θ θ 2 = a, ad cosider the followig iequalities which we have cosidered i the proof of Propositio 3: A p A + b + M ) p M 4b /2 A /2 ad EM F ) = 0, A 0 = a. We simply take expasios of the p-th power above, ad sum for all first itegers. We have: EA EA 2 EA + b a + b, EA 2 + b 2 + 2bA + M) 2 EA 2 + 2EA b + b 2 + 6bEA ] a 2 + 8b a + kb + b 2 a 2 + 8ba b] + b2 usig the result about EA, = a 2 + 8ba + b ) a + 9b) 2. We may ow pursue for the third order momets: EA 3 EA + b) 3 + 3EA + b) 2 M 2 + 3EA + b) 3 M + EM 3 EA + b) 3 + 3EA + b) 2 6bA b 3/2 EA 3/2 EA 3 + 3EA 2 b + 3EA b 2 + b 3 ) + 3EA + b)6ba + 64b 3/2 EA 3/2 = EA 3 + 3EA 2 b + 3EA b 2 + b 3 ) + 3EA + b)6ba By expadig, we get EA 3 +32bEA 2b /2 A /2 ]. EA 3 + 3EA 2 b + 3EA b 2 + b 3 ) + 3EA + b)6ba +32EbA A + 4b] 4 = EA 3 + EA 2 b ] + EA b ] + b 3 = EA EA 2 b + 79EA b 2 + b 3 ] ] a b a 2 + 8bka + b 2 k + 9k 2 ) + 79b 2 a + kb + b 3 a ba 2 + 9b 2 a + b 2 2 / )] + 79b 2 a + b 2 /2] + b 3 = a ba 2 + b 2 a ] + b 3 59/ /2 2 + ] = a ba 2 + b 2 a ] + b ] a + 20b) 3. 69

26 Bach We the obtai: E 2γ f θ ) fθ ) ] ] 2 + θ θ 2 θ 0 θ 2 + 9γ 2 R 2) 2 E 2γ f θ ) fθ ) ] ] 3 + θ θ 2 θ 0 θ γ 2 R 2) 3. Appedix F. Proof of Propositio 0 The proof follows from applyig self-cocordace properties Lemma 9) to θ. We thus eed to provide a cotrol o the probability that f θ ) 3µ 4R. F. Tail Boud for f θ ) We derive a large deviatio boud, as a cosequece of the boud o all momets of f θ ) Propositio 7) ad Lemma 2, that allows to go from momets to tail bouds: f P θ ) 2R 0 t + 40R 2 γt + 3 γ θ 0 θ ]) γr θ 0 θ 4 exp t). I order to derive the boud above, we eed to assume that p /4 so that 4p/ 2 p/ ), ad thus, whe applyig Lemma 2, the boud above is valid as log as t /4. It is however valid for all t, because the gradiets are bouded by R, ad for t >, we have 2R 0 t R, ad the iequality is satisfied with zero probability. F.2 Boudig the Fuctio Values From Lemma 9, if f θ ) 3µ 4R, the f θ ) fθ ) 2 f θ ) 2 µ. This will allow us to derive a tail boud for f θ ) fθ ), for sufficietly small deviatios. For larger deviatios, we will use the tail boud which does ot use strog covexity Propositio 5). We cosider the evet { A t = f θ ) 2R 0 t + 40R 2 γt + 3 γ θ 0 θ ]} γr θ 0 θ. We make the followig two assumptios regardig γ ad t: 0 t + 40R 2 γt 2 3µ 3 4R 2R = µ 4R 2 3) 3 ad γ θ 0 θ γr θ 0 θ 3µ 3 4R 2R = µ 8R 2, so that the upper-boud o f θ ) i the defiitio of A t is less tha 3µ 4R so that we ca apply Lemma 9). We thus have: A t {f θ ) fθ ) 8R2 0 t + 40R 2 γt + 3 µ γ θ 0 θ ] 2 } γr θ 0 θ {f θ ) fθ ) 8R2 0 ] 2 } t + 20 t +, µ 620

27 Adaptivity of Averaged Stochastic Gradiet Descet with = 2γR 2 ad = 3 γ θ 0 θ γr θ 0 θ. This implies that for all t 0, such that 0 t + 20 t µ 4R 2, that is, our assumptio i Equatio 3), we may apply the tail boud from Appedix F. to get: P f θ ) fθ ) 8R2 0 ] 2 ) t + 20 t + 4e t. 4) µ Moreover, we have for all v 0 from Propositio 5): P f θ ) fθ ) 30γR 2 v + 3 θ 0 θ 2 ) 2 exp v). ) γ We may ow use the last two iequalities to boud the expectatio Ef θ ) fθ )]. We first express the expectatio as a itegral of the tail boud ad split it ito three parts: E f θ ) fθ ) ] = = R 2 µ R 2 µ 2 8R2 µ + P f θ ) fθ ) u ] du P f θ ) fθ ) u ] du 6) µ 4R 2 + ) 2 8R 2 µ ) 2 P µ 4R 2 + P f θ ) fθ ) u ] du f θ ) fθ ) u ] du. 2 8R 2 µ We may ow boud the three terms separately. For the first itegral, we boud the probability by oe to get P f θ ) fθ ) u ] du 2 8R2 0 µ. For the third term i Equatio 6), we use the tail boud i Equatio ) to get = 2 + 8R 2 µ ) 2 P µ 4R R 2 µ ) 2 µ 4R f θ ) fθ ) u ] du P γ θ 0 θ 2 8R 2 µ ) 2 exp µ 4R γ θ 0 θ 2 We may apply Equatio ) because 8R 2 µ µ 4R 2 + ) 2 3 γ θ 0 θ 2 8R2 µ µ f θ ) fθ ) u + 3 γ θ 0 θ 2 u 30γR 2 ) du. 4R 2 + ) 2 µ 8R2 8R2 µ 62 µ 4R 2 ) 2 µ ] du 8R 2 = 3µ 0. 8R2

28 Bach We ca ow compute the boud explicitly to get + 8R 2 µ µ ) 2 P f θ ) fθ ) u ] du 4R 2 + 8R 60γR 2 2 µ exp 30γR 2 µ 4R 2 + ) γR 2 exp µ ) 80γR 4 60γR 2 80γR4 2µ = 2400γ2 R 6. µ ]) γ θ 0 θ 2 usig e α 2α µ ) 60γR 2 3µ exp 30γR 2 8R 2 for all α > 0 We ow cosider the secod term i Equatio 6) for which we will use Equatio 4). We cosider the chage of variable u = 8R2 0 2, t + 20 t + ] for which u 2 8R2 µ, 8R2 µ µ 4R 2 8R 2 µ 2 8R2 µ + ) 2 ] implies t 0, + ). This implies that µ 4R 2 + ) 2 P f θ ) fθ ) u ] du 8R 4e t 2 d 0 ] 2 ) t + 20 t + 0 µ = 32R2 e 00 t t µ 0 2 t/ ) 2 t / dt = 00Γ) 32R Γ2) ) µ Γ3/2) Γ/2) + 40 Γ) with Γ deotig the Gamma fuctio, = 32R ) π + 20 π µ We may ow combie the three bouds to get, from Equatio 6), E f θ ) fθ ) ] 2 8R2 µ γ2 R 6 µ + 32R µ 2 32R2 µ ) π + 20 π γ2 R π+0 ] π+40. For γ = 2R 2, with α = R θ N 0 θ, = ad = 6α 2 + 6α, we obtai E f θ N ) fθ ) ] ] 32R2 Nµ ] 32R2 9α 4 + 8α 3 + 9α α α Nµ R2 625α α α α ) = R2 ) 4. 5α + Nµ Nµ 622


More information

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression Harder, Better, Faster, Stroger Covergece Rates for Least-Squares Regressio Aoymous Author(s) Affiliatio Address email Abstract 1 2 3 4 5 6 We cosider the optimizatio of a quadratic objective fuctio whose

More information

ENGI Series Page 6-01

ENGI Series Page 6-01 ENGI 3425 6 Series Page 6-01 6. Series Cotets: 6.01 Sequeces; geeral term, limits, covergece 6.02 Series; summatio otatio, covergece, divergece test 6.03 Stadard Series; telescopig series, geometric series,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Introduction to Optimization Techniques. How to Solve Equations

Introduction to Optimization Techniques. How to Solve Equations Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Self-normalized deviation inequalities with application to t-statistic

Self-normalized deviation inequalities with application to t-statistic Self-ormalized deviatio iequalities with applicatio to t-statistic Xiequa Fa Ceter for Applied Mathematics, Tiaji Uiversity, 30007 Tiaji, Chia Abstract Let ξ i i 1 be a sequece of idepedet ad symmetric

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Binary classification, Part 1

Binary classification, Part 1 Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Optimal Two-Choice Stopping on an Exponential Sequence

Optimal Two-Choice Stopping on an Exponential Sequence Sequetial Aalysis, 5: 35 363, 006 Copyright Taylor & Fracis Group, LLC ISSN: 0747-4946 prit/53-476 olie DOI: 0.080/07474940600934805 Optimal Two-Choice Stoppig o a Expoetial Sequece Larry Goldstei Departmet

More information

1 Review and Overview

1 Review and Overview CS9T/STATS3: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #6 Scribe: Jay Whag ad Patrick Cho October 0, 08 Review ad Overview Recall i the last lecture that for ay family of scalar fuctios F, we

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

Math 104: Homework 2 solutions

Math 104: Homework 2 solutions Math 04: Homework solutios. A (0, ): Sice this is a ope iterval, the miimum is udefied, ad sice the set is ot bouded above, the maximum is also udefied. if A 0 ad sup A. B { m + : m, N}: This set does

More information

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities O Equivalece of Martigale Tail Bouds ad Determiistic Regret Iequalities Sasha Rakhli Departmet of Statistics, The Wharto School Uiversity of Pesylvaia Dec 16, 2015 Joit work with K. Sridhara arxiv:1510.03925

More information

The natural exponential function

The natural exponential function The atural expoetial fuctio Attila Máté Brookly College of the City Uiversity of New York December, 205 Cotets The atural expoetial fuctio for real x. Beroulli s iequality.....................................2

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities Chapter 5 Iequalities 5.1 The Markov ad Chebyshev iequalities As you have probably see o today s frot page: every perso i the upper teth percetile ears at least 1 times more tha the average salary. I other

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector Summary ad Discussio o Simultaeous Aalysis of Lasso ad Datzig Selector STAT732, Sprig 28 Duzhe Wag May 4, 28 Abstract This is a discussio o the work i Bickel, Ritov ad Tsybakov (29). We begi with a short

More information

AN EXTENSION OF SIMONS INEQUALITY AND APPLICATIONS. Robert DEVILLE and Catherine FINET

AN EXTENSION OF SIMONS INEQUALITY AND APPLICATIONS. Robert DEVILLE and Catherine FINET 2001 vol. XIV, um. 1, 95-104 ISSN 1139-1138 AN EXTENSION OF SIMONS INEQUALITY AND APPLICATIONS Robert DEVILLE ad Catherie FINET Abstract This article is devoted to a extesio of Simos iequality. As a cosequece,

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5 Ma 42: Itroductio to Lebesgue Itegratio Solutios to Homework Assigmet 5 Prof. Wickerhauser Due Thursday, April th, 23 Please retur your solutios to the istructor by the ed of class o the due date. You

More information

ON POINTWISE BINOMIAL APPROXIMATION

ON POINTWISE BINOMIAL APPROXIMATION Iteratioal Joural of Pure ad Applied Mathematics Volume 71 No. 1 2011, 57-66 ON POINTWISE BINOMIAL APPROXIMATION BY w-functions K. Teerapabolar 1, P. Wogkasem 2 Departmet of Mathematics Faculty of Sciece

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Notes 27 : Brownian motion: path properties

Notes 27 : Brownian motion: path properties Notes 27 : Browia motio: path properties Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces:[Dur10, Sectio 8.1], [MP10, Sectio 1.1, 1.2, 1.3]. Recall: DEF 27.1 (Covariace) Let X = (X

More information

Sequences and Limits

Sequences and Limits Chapter Sequeces ad Limits Let { a } be a sequece of real or complex umbers A ecessary ad sufficiet coditio for the sequece to coverge is that for ay ɛ > 0 there exists a iteger N > 0 such that a p a q

More information

4.3 Growth Rates of Solutions to Recurrences

4.3 Growth Rates of Solutions to Recurrences 4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURE 23. SOME CONSEQUENCES OF ONLINE NO-REGRET METHODS I this lecture, we explore some cosequeces of the developed techiques.. Covex optimizatio Wheever

More information

Recurrence Relations

Recurrence Relations Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial(-)); } Let t be the umber of multiplicatios eeded to calculate factorial(). The

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Chapter 10: Power Series

Chapter 10: Power Series Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

Law of the sum of Bernoulli random variables

Law of the sum of Bernoulli random variables Law of the sum of Beroulli radom variables Nicolas Chevallier Uiversité de Haute Alsace, 4, rue des frères Lumière 68093 Mulhouse icolas.chevallier@uha.fr December 006 Abstract Let be the set of all possible

More information

Regression with an Evaporating Logarithmic Trend

Regression with an Evaporating Logarithmic Trend Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,

More information

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018 CSE 353 Discrete Computatioal Structures Sprig 08 Sequeces, Mathematical Iductio, ad Recursio (Chapter 5, Epp) Note: some course slides adopted from publisher-provided material Overview May mathematical

More information

7 Sequences of real numbers

7 Sequences of real numbers 40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Information Theory and Statistics Lecture 4: Lempel-Ziv code

Information Theory and Statistics Lecture 4: Lempel-Ziv code Iformatio Theory ad Statistics Lecture 4: Lempel-Ziv code Łukasz Dębowski ldebowsk@ipipa.waw.pl Ph. D. Programme 203/204 Etropy rate is the limitig compressio rate Theorem For a statioary process (X i)

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information