Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression


Journal of Machine Learning Research (2014) Submitted 10/13; Revised 12/13; Published 2/14

Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression

Francis Bach, INRIA - Sierra Project-team, Département d'Informatique de l'Ecole Normale Supérieure, Paris, France. francis.bach@ens.fr

Editor: Léon Bottou

Abstract

In this paper, we consider supervised learning problems such as logistic regression and study the stochastic gradient method with averaging, in the usual stochastic approximation setting where observations are used only once. We show that after N iterations, with a constant step-size proportional to $1/(R^2\sqrt{N})$ where N is the number of observations and R is the maximum norm of the observations, the convergence rate is always of order $O(1/\sqrt{N})$, and improves to $O(R^2/(\mu N))$ where $\mu$ is the lowest eigenvalue of the Hessian at the global optimum (when this eigenvalue is greater than $R^2/\sqrt{N}$). Since $\mu$ does not need to be known in advance, this shows that averaged stochastic gradient is adaptive to unknown (local) strong convexity of the objective function. Our proof relies on the generalized self-concordance properties of the logistic loss and thus extends to all generalized linear models with uniformly bounded features.

Keywords: stochastic approximation, logistic regression, self-concordance

1. Introduction

The minimization of an objective function which is only available through unbiased estimates of the function values or its gradients is a key methodological problem in many disciplines. Its analysis has been attacked mainly in three scientific communities: stochastic approximation (Fabian, 1968; Ruppert, 1988; Polyak and Juditsky, 1992; Kushner and Yin, 2003; Broadie et al., 2009), optimization (Nesterov and Vial, 2008; Nemirovski et al., 2009), and machine learning (Bottou and Le Cun, 2005; Shalev-Shwartz et al., 2007; Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008; Shalev-Shwartz et al., 2009; Duchi and Singer, 2009; Xiao, 2010). The main algorithms which have emerged are stochastic gradient descent (a.k.a. the Robbins-Monro algorithm), as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging).

For convex optimization problems, the convergence rates of these algorithms depend primarily on the potential strong convexity of the objective function (Nemirovski and Yudin, 1983). For $\mu$-strongly convex functions, after $n$ iterations (i.e., $n$ observations), the optimal rate of convergence of function values is $O(1/(\mu n))$, while for convex functions the optimal rate is $O(1/\sqrt{n})$, both of them achieved by averaged stochastic gradient with step size respectively proportional to $1/(\mu n)$ or $1/\sqrt{n}$ (Nemirovski and Yudin, 1983; Agarwal et al., 2012).

(c) 2014 Francis Bach.

For smooth functions, averaged stochastic gradient with step sizes proportional to $1/\sqrt{n}$ achieves them up to logarithmic terms (Bach and Moulines, 2011).

Convex optimization problems coming from supervised machine learning are typically of the form $f(\theta) = E\big[\ell(y, \langle\theta, x\rangle)\big]$, where $\ell(y, \langle\theta, x\rangle)$ is the loss between the response $y \in R$ and the prediction $\langle\theta, x\rangle \in R$, where $x$ is the input data in a Hilbert space $H$ and linear predictions parameterized by $\theta \in H$ are considered. They may or may not have strongly convex objective functions. This most often depends on (a) the correlations between covariates $x$, and (b) the strong convexity of the loss function $\ell$. The logistic loss $\ell: u \mapsto \log(1 + e^{-u})$ is not strongly convex unless restricted to a compact set (indeed, restricted to $u \in [-U, U]$, we have $\ell''(u) = e^{-u}(1+e^{-u})^{-2} \ge \frac{1}{4}e^{-U}$). Moreover, in the sequential observation model, the correlations are not known at training time. Therefore, many theoretical results based on strong convexity do not apply (adding a squared norm $\frac{\mu}{2}\|\theta\|^2$ is a possibility; however, in order to avoid adding too much bias, $\mu$ has to be small and typically much smaller than $1/\sqrt{n}$, which then makes all strongly-convex bounds vacuous).

The goal of this paper is to show that with proper assumptions, namely self-concordance, one can readily obtain favorable theoretical guarantees for logistic regression, namely a rate of the form $O(R^2/(\mu n))$ where $\mu$ is the lowest eigenvalue of the Hessian at the global optimum, without any exponentially increasing constant factor (e.g., with the notations above, without terms of the form $e^U$).

Another goal of this paper is to design an algorithm and provide an analysis that benefit from hidden local strong convexity without requiring to know the local strong convexity constant in advance. In smooth situations, the results of Bach and Moulines (2011) imply that the averaged stochastic gradient method with step sizes of the form $O(1/\sqrt{n})$ is adaptive to the strong convexity of the problem. However, the dependence on $\mu$ in the strongly convex case is of the form $O(1/(\mu^2 n))$, which is sub-optimal. Moreover, the final rate is rather complicated, notably because all possible step-sizes are considered. Finally, it does not apply here because even in low-correlation settings, the objective function of logistic regression cannot be globally strongly convex.

In this paper, we provide an analysis for stochastic gradient with averaging for generalized linear models such as logistic regression, with a step size proportional to $1/(R^2\sqrt{n})$ where $R$ is the radius of the data and $n$ the number of observations, showing such adaptivity. In particular, we show that the algorithm can adapt to the local strong-convexity constant, that is, the lowest eigenvalue of the Hessian at the optimum. The analysis is done for a finite horizon $N$ and a constant step size decreasing in $N$ as $1/(R^2\sqrt{N})$, since the analysis is then slightly easier, though (a) a decaying step-size could be considered as well, and (b) it could be classically extended to varying step-sizes by a doubling trick (Hazan and Kale, 2011).

2. Stochastic Approximation for Generalized Linear Models

In this section, we present the assumptions our work relies on, as well as related work.

2.1 Assumptions

Throughout this paper, we make the following assumptions. We consider a function $f$ defined on a Hilbert space $H$, equipped with a norm $\|\cdot\|$. Throughout the paper, we identify the Hilbert space and its dual; thus, the gradients of $f$ also belong to $H$ and we use the same norm on these.

Moreover, we consider an increasing family of $\sigma$-fields $(\mathcal{F}_n)_{n \ge 1}$ and we assume that we are given a deterministic $\theta_0 \in H$, and a sequence of functions $f_n: H \to R$, for $n \ge 1$. We make the following assumptions, for a certain $R > 0$:

(A1) Convexity and differentiability of $f$: $f$ is convex and three-times differentiable.

(A2) Generalized self-concordance of $f$ (Bach, 2010): for all $\theta_1, \theta_2 \in H$, the function $\varphi: t \mapsto f[\theta_1 + t(\theta_2 - \theta_1)]$ satisfies: $\forall t \in R$, $|\varphi'''(t)| \le R\|\theta_1 - \theta_2\|\,\varphi''(t)$.

(A3) Attained global minimum: $f$ has a global minimum attained at $\theta_* \in H$.

(A4) Lipschitz-continuity of $f_n$ and $f$: all gradients of $f$ and $f_n$ are bounded by $R$, that is, for all $\theta \in H$, $\|f'(\theta)\| \le R$ and, for all $n \ge 1$, $\|f_n'(\theta)\| \le R$ almost surely.

(A5) Adapted measurability: $\forall n \ge 1$, $f_n$ is $\mathcal{F}_n$-measurable.

(A6) Unbiased gradients: $\forall n \ge 1$, $E(f_n'(\theta_{n-1}) \mid \mathcal{F}_{n-1}) = f'(\theta_{n-1})$.

(A7) Stochastic gradient recursion: $\forall n \ge 1$, $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$, where $(\gamma_n)_{n \ge 1}$ is a deterministic sequence.

In this paper, we will also consider the averaged iterate $\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$, which may be trivially computed on-line through the recursion $\bar\theta_n = \frac{n-1}{n}\bar\theta_{n-1} + \frac{1}{n}\theta_{n-1}$.

Among the seven assumptions above, the non-standard one is (A2): the notion of self-concordance is an important tool in convex optimization and in particular for the study of Newton's method (Nesterov and Nemirovskii, 1994). It corresponds to having the third derivative bounded by the 3/2-th power of the second derivative. For machine learning, Bach (2010) has generalized the notion of self-concordance by removing the 3/2-th power, so that it is applicable to cost functions arising from probabilistic modeling, as shown below. The key consequence of our notion of self-concordance is a relationship, shown in Lemma 9 (Section 5), between the norm of a gradient $\|f'(\theta)\|$ and the excess cost function $f(\theta) - f(\theta_*)$, which is the same as for strongly convex functions, but with the local strong convexity constant rather than the global one (which is equal to zero here).

Our set of assumptions corresponds to the following examples (with i.i.d. data, and $\mathcal{F}_n$ equal to the $\sigma$-field generated by $x_1, y_1, \ldots, x_n, y_n$):

- Logistic regression: $f_n(\theta) = \log(1 + \exp(-y_n\langle x_n, \theta\rangle))$, with data $x_n$ uniformly almost surely bounded by $R$ and $y_n \in \{-1, 1\}$. The norm considered here is also the norm of the Hilbert space. Note that this includes other binary classification losses, such as $f_n(\theta) = -y_n\langle x_n, \theta\rangle + \sqrt{1 + \langle x_n, \theta\rangle^2}$.

- Generalized linear models with uniformly bounded features: $f_n(\theta) = -\langle\theta, \Phi(x_n, y_n)\rangle + \log\int h(y)\exp(\langle\theta, \Phi(x_n, y)\rangle)\,dy$, with $\Phi(x, y) \in H$ almost surely bounded in norm by $R$, for all observations $x_n$ and all potential responses $y$ in a measurable space. This includes multinomial regression and conditional random fields (Lafferty et al., 2001).

- Robust regression: we may use $f_n(\theta) = \varphi(y_n - \langle x_n, \theta\rangle)$, with $\varphi(t) = \log\cosh t = \log\frac{e^t + e^{-t}}{2}$, with a similar boundedness assumption on $x_n$.
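As a concrete illustration of the setting above, the following Python sketch (not part of the paper; the function name and data handling are illustrative) implements the recursion (A7) and the on-line averaging for the logistic regression example, each observation being used exactly once. With a known horizon $N$ and $R = \max_k\|x_k\|$, the constant step size studied later in the paper is $\gamma = 1/(2R^2\sqrt{N})$; the averaging convention below ($\frac{1}{n}\sum_{k=1}^{n}\theta_k$) differs from the text's $\frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$ only by one iterate.

```python
import numpy as np

def averaged_sgd_logistic(X, y, gamma, theta0=None):
    """Single-pass stochastic gradient with averaging for the logistic loss,
    constant step size gamma; labels y must take values in {-1, +1}."""
    n, d = X.shape
    theta = np.zeros(d) if theta0 is None else np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()
    for k in range(n):
        s = 1.0 / (1.0 + np.exp(y[k] * (X[k] @ theta)))   # sigma(-y_k <x_k, theta>)
        grad = -y[k] * s * X[k]                           # f_k'(theta_{k-1}), norm <= ||x_k|| <= R
        theta = theta - gamma * grad                      # stochastic gradient recursion (A7)
        theta_bar += (theta - theta_bar) / (k + 1)        # on-line average of the iterates
    return theta, theta_bar

# Example usage with the step size gamma = 1/(2 R^2 sqrt(N)) for a known horizon N:
# R = np.max(np.linalg.norm(X, axis=1)); gamma = 1.0 / (2 * R**2 * np.sqrt(len(X)))
# theta_N, theta_bar_N = averaged_sgd_logistic(X, y, gamma)
```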

2.2 Running-time Complexity

The stochastic gradient descent recursion $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$ operates in full generality in the potentially infinite-dimensional Hilbert space $H$. There are two practical set-ups where this recursion can be implemented. When $H$ is finite-dimensional with dimension $d$, then the complexity of a single iteration is $O(d)$, and thus $O(dn)$ after $n$ iterations. When $H$ is infinite-dimensional, the recursion can be readily implemented when (a) all functions $f_n$ depend on one-dimensional projections $\langle x_n, \theta\rangle$, that is, are of the form $f_n(\theta) = \varphi_n(\langle x_n, \theta\rangle)$ for certain random functions $\varphi_n$ (e.g., $\varphi_n(u) = \ell(y_n, u)$ in machine learning), and (b) all scalar products $K_{ij} = \langle x_i, x_j\rangle$ between $x_i$ and $x_j$, for $i, j \le n$, can be computed. This may be done through the classical application of the kernel trick (Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004): if $\theta_0 = 0$, we may represent $\theta_n$ as a linear combination of the vectors $x_1, \ldots, x_n$, that is, $\theta_n = \sum_{i=1}^{n}\alpha_i x_i$, and the recursion may be written in terms of the weights $\alpha_n$, through $\alpha_n = -\gamma_n\,\varphi_n'\big(\sum_{i=1}^{n-1}\alpha_i K_{in}\big)$. A key element to notice here is that without regularization, the weights $\alpha_i$ corresponding to previous observations remain constant. The overall complexity of the algorithm is $O(n^2)$ times the cost of evaluating a single kernel function. See Bordes et al. (2005) and Wang et al. (2012) for approaches aiming at reducing the computational load in this setting. Finally, note that in the kernel setting, the function $f(\theta)$ cannot be strongly convex because the covariance operator of $x$ is typically a compact operator, with a sequence of eigenvalues tending to zero (some regularization is then needed).

3. Related Work

In this section, we review related work, first for non-strongly convex problems, then for strongly convex problems.

3.1 Non-strongly-convex Functions

When only convexity of the objective function is assumed, several authors (Nesterov and Vial, 2008; Nemirovski et al., 2009; Shalev-Shwartz et al., 2009; Xiao, 2010) have shown that using a step-size proportional to $1/\sqrt{n}$, together with some form of averaging, leads to the minimax optimal rate of $O(1/\sqrt{n})$ (Nemirovski and Yudin, 1983; Agarwal et al., 2012). Without averaging, the known convergence rates are suboptimal, that is, averaging is key to obtaining the optimal rate (Bach and Moulines, 2011). Note that the smoothness of the loss does not change the rate, but may help to obtain better constants, with the potential use of acceleration (Lan, 2012). Recent work (Bach and Moulines, 2013) has considered algorithms which improve on the rate $O(1/\sqrt{n})$ for smooth self-concordant losses, such as the square and logistic losses. Their analysis relies on some of the results proved in this paper (in particular the high-order bounds in Section 4).

The compactness of the domain is often used within the algorithm (by using orthogonal projections) and within the analysis (in particular to optimize the step size and obtain high-probability bounds). In this paper, we do not make such compactness assumptions, since in a machine learning context, the available bound would be loose and hurt practical performance.

Note that the analysis of the related dual averaging methods (Nesterov, 2009; Xiao, 2010) has also been carried out without compactness assumptions, and previous analyses would also go through in the same set-up for stochastic mirror descent (Nemirovski and Yudin, 1983), at least for bounds in expectation. In the present paper, we derive higher-order bounds and bounds in high probability, where the lack of compactness is harder to deal with.

Another difference between several analyses is the use of decaying step sizes of the form $\gamma_n \propto 1/\sqrt{n}$ vs. the use of a constant step size of the form $\gamma_n \propto 1/\sqrt{N}$ for a finite known horizon $N$ of iterations. The use of a doubling trick as done by Hazan and Kale (2011) for strongly convex optimization, where a constant step size is used for iterations between $2^p$ and $2^{p+1}$, with a constant that is proportional to $1/\sqrt{2^p}$, would allow to obtain an anytime algorithm from a finite-horizon one. In order to simplify our analysis, we only consider a finite horizon $N$ and a constant step-size that will be proportional to $1/\sqrt{N}$.

3.2 Strongly-convex Functions

When the function is $\mu$-strongly convex, that is, $\theta \mapsto f(\theta) - \frac{\mu}{2}\|\theta\|^2$ is convex, there are essentially two approaches to obtaining the minimax-optimal rate of $O(1/(\mu n))$ (Nemirovski and Yudin, 1983; Agarwal et al., 2012): (a) using a step size proportional to $1/(\mu n)$ with averaging for non-smooth problems (Nesterov and Vial, 2008; Nemirovski et al., 2009; Xiao, 2010; Shalev-Shwartz et al., 2009; Duchi and Singer, 2009; Lacoste-Julien et al., 2012), or a step size proportional to $1/(R^2 + \mu n)$, also with averaging, for smooth problems, where $R^2$ is the smoothness constant of the loss of a single observation (Le Roux et al., 2012); (b) for smooth problems, using longer step-sizes proportional to $1/n^{\alpha}$ for $\alpha \in (1/2, 1)$ with averaging (Polyak and Juditsky, 1992; Ruppert, 1988; Bach and Moulines, 2011). Note that the often advocated step size of the form $C/n$, where $C$ is larger than $1/\mu$, leads, without averaging, to a convergence rate of $O(1/(\mu^2 n))$ (Fabian, 1968; Bach and Moulines, 2011), hence with a worse dependence on $\mu$.

The solution (a) requires a good estimate of the strong-convexity constant $\mu$, while the second solution (b) does not require such an estimate and leads to a convergence rate achieving asymptotically the Cramer-Rao lower bound (Polyak and Juditsky, 1992). Thus, this last solution is adaptive to an unknown (but positive) amount of strong convexity. However, unless we take the limiting setting $\alpha = 1/2$, it is not adaptive to lack of strong convexity. While the non-asymptotic analysis of Bach and Moulines (2011) already gives a convergence rate in that situation, the bound is rather complicated and also has a suboptimal dependence on $\mu$. Another goal of this paper is to consider a less general result, but one that is more compact and, as already mentioned, has a better dependence on the strong convexity constant $\mu$ (moreover, as reviewed below, we consider the local strong convexity constant, which is much larger).

Finally, note that unless we restrict the support, the objective function for logistic regression cannot be globally strongly convex (since the Hessian tends to zero when $\theta$ tends to infinity). In this paper we show that stochastic gradient descent with averaging is adaptive to the local strong convexity constant, that is, the lowest eigenvalue of the Hessian of $f$ at the global optimum, without any exponential terms in $RD$ (which would be present if a compact domain of diameter $D$ was imposed and traditional analyses were performed).

3.3 Adaptivity to Unknown Constants

The desirable property of adaptivity to the difficulty of an optimization problem has also been studied in several settings. Gradient descent with constant step size is for example naturally adaptive to the strong convexity of the problem (see, e.g., Nesterov, 2004). In the stochastic context, Juditsky and Nesterov (2010) provide another strategy than averaging with longer step sizes, but for uniform convexity constants.

4. Non-Strongly Convex Analysis

In this section, we study the averaged stochastic gradient method in the non-strongly convex case, that is, without any (global or local) strong convexity assumptions. We first recall existing results in Section 4.1, which bound the expectation of the excess risk, leading to a bound in $O(1/\sqrt{N})$. We then show using martingale moment inequalities how all higher-order moments may be bounded in Section 4.2, still with a rate of $O(1/\sqrt{N})$. However, in Section 4.3, we consider the convergence of the squared gradient, with now a rate of $O(1/N)$. This last result is key to obtaining the adaptivity to local strong convexity in Section 5.

4.1 Existing Results

In this section, we review existing results for Lipschitz-continuous non-strongly convex problems (Nesterov and Vial, 2008; Nemirovski et al., 2009; Shalev-Shwartz et al., 2009; Duchi and Singer, 2009; Xiao, 2010). Note that smoothness is not needed here. We consider a constant step size $\gamma_n = \gamma > 0$, for all $n \ge 1$, and we denote by $\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$ the averaged iterate. We prove the following proposition, which provides a bound on the expectation of $f(\bar\theta_n) - f(\theta_*)$ that decays at rate $O(\gamma + 1/(\gamma n))$, hence the usual choice $\gamma \propto 1/\sqrt{n}$:

Lemma 1 Assume (A1) and (A3-7). With constant step size equal to $\gamma$, for any $n \ge 1$, we have:
$$E f\Big(\frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}\Big) - f(\theta_*) + \frac{1}{2\gamma n}E\|\theta_n - \theta_*\|^2 \le \frac{1}{2\gamma n}\|\theta_0 - \theta_*\|^2 + \frac{\gamma}{2}R^2.$$

Proof We have the following recursion, obtained from the Lipschitz-continuity of $f_n$:
$$\|\theta_n - \theta_*\|^2 = \|\theta_{n-1} - \theta_*\|^2 - 2\gamma\langle\theta_{n-1} - \theta_*, f_n'(\theta_{n-1})\rangle + \gamma^2\|f_n'(\theta_{n-1})\|^2 \le \|\theta_{n-1} - \theta_*\|^2 - 2\gamma\langle\theta_{n-1} - \theta_*, f'(\theta_{n-1})\rangle + \gamma^2 R^2 + M_n,$$
with $M_n = -2\gamma\langle\theta_{n-1} - \theta_*, f_n'(\theta_{n-1}) - f'(\theta_{n-1})\rangle$. We thus get, using the classical result from convexity $f(\theta_{n-1}) - f(\theta_*) \le \langle\theta_{n-1} - \theta_*, f'(\theta_{n-1})\rangle$:
$$2\gamma\big[f(\theta_{n-1}) - f(\theta_*)\big] \le \|\theta_{n-1} - \theta_*\|^2 - \|\theta_n - \theta_*\|^2 + \gamma^2 R^2 + M_n. \qquad (1)$$

Summing over the first $n$ integers, this implies:
$$\frac{1}{n}\sum_{k=1}^{n}f(\theta_{k-1}) - f(\theta_*) + \frac{1}{2\gamma n}\|\theta_n - \theta_*\|^2 \le \frac{1}{2\gamma n}\|\theta_0 - \theta_*\|^2 + \frac{\gamma}{2}R^2 + \frac{1}{2\gamma n}\sum_{k=1}^{n}M_k.$$
We get the desired result by taking expectations in the last inequality, using $E M_k = E\big(E(M_k \mid \mathcal{F}_{k-1})\big) = 0$ and, by convexity, $f(\bar\theta_n) \le \frac{1}{n}\sum_{k=1}^{n}f(\theta_{k-1})$.

The following corollary considers a specific choice of the step size (note that the bound is only true for the last iterate):

Corollary 2 Assume (A1) and (A3-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, we have:
$$\forall n \in \{1, \ldots, N\}, \quad E\|\theta_n - \theta_*\|^2 \le \|\theta_0 - \theta_*\|^2 + \frac{1}{4R^2},$$
$$E f\Big(\frac{1}{N}\sum_{k=1}^{N}\theta_{k-1}\Big) - f(\theta_*) \le \frac{R^2\|\theta_0 - \theta_*\|^2}{\sqrt{N}} + \frac{1}{4\sqrt{N}}.$$

Note that if $\|\theta_0 - \theta_*\|^2$ was known, then a better step-size would be $\gamma = \frac{\|\theta_0 - \theta_*\|}{R\sqrt{N}}$, leading to a convergence rate proportional to $\frac{R\|\theta_0 - \theta_*\|}{\sqrt{N}}$. However, this requires an estimate (or simply an upper-bound) of $\|\theta_0 - \theta_*\|^2$, which is typically not available.

We are going to improve this result in several ways:

- All moments of $\|\theta_n - \theta_*\|^2$ and $f(\bar\theta_n) - f(\theta_*)$ will be bounded, leading to a subexponential behavior. Note that we do not assume that the iterates are restricted to a predefined bounded set, which is the usual assumption made to derive tail bounds for stochastic approximation (Nesterov and Vial, 2008; Nemirovski et al., 2009; Kakade and Tewari, 2009).

- We are going to show that the squared norm of the gradient at $\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}$ converges at rate $O(1/n)$, even in the non-strongly convex case. This will allow us to derive finer convergence rates in the presence of local strong convexity in Section 5.

- The bounds above do not explicitly depend on the dimension of the problem; however, in practice, the quantity $R^2\|\theta_0 - \theta_*\|^2$ typically implicitly scales linearly in the problem dimension.
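The step-size choice and the $O(1/\sqrt{N})$ behavior of Corollary 2 can be checked numerically. The following Python sketch (not from the paper; the synthetic data, the proxy for the population objective, and all names are illustrative assumptions) runs the averaged recursion for several horizons $N$ with $\gamma = 1/(2R^2\sqrt{N})$ and compares the estimated excess risk to the right-hand side of Corollary 2 (with $\theta_0 = 0$).

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 5, 1.0
theta_true = rng.normal(size=d)

def sample(n):
    """Synthetic logistic data with inputs of norm at most R and labels in {-1, +1}."""
    x = rng.normal(size=(n, d))
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x * (R / np.maximum(norms, R))                    # enforce ||x_k|| <= R
    p = 1.0 / (1.0 + np.exp(-(x @ theta_true)))
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    return x, y

# Proxy for the population objective f: a large independent sample.
X_pop, y_pop = sample(100_000)
def f(theta):
    return np.mean(np.log1p(np.exp(-y_pop * (X_pop @ theta))))

# Approximate theta_* by full-batch gradient descent on the proxy objective
# (the smoothness constant is at most R^2/4 = 1/4, so a step of 4 is safe).
theta_star = np.zeros(d)
for _ in range(500):
    s = 1.0 / (1.0 + np.exp(y_pop * (X_pop @ theta_star)))
    theta_star -= 4.0 * ((-y_pop * s)[:, None] * X_pop).mean(axis=0)

for N in [100, 1_000, 10_000]:
    gamma = 1.0 / (2 * R**2 * np.sqrt(N))                 # step size of Corollary 2
    X, y = sample(N)
    theta, theta_bar = np.zeros(d), np.zeros(d)
    for k in range(N):
        s = 1.0 / (1.0 + np.exp(y[k] * (X[k] @ theta)))
        theta = theta - gamma * (-y[k] * s) * X[k]        # recursion (A7)
        theta_bar += (theta - theta_bar) / (k + 1)        # on-line averaging
    bound = (R**2 * (theta_star @ theta_star) + 0.25) / np.sqrt(N)
    # theta_star is only approximate, so the excess risk estimate is itself approximate.
    print(N, f(theta_bar) - f(theta_star), bound)
```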

4.2 Higher-Order and Tail Bounds

In this section, we prove novel higher-order bounds (see the proof in Appendix C), both for any constant step-size and then for the specific choice $\gamma = \frac{1}{2R^2\sqrt{N}}$. This will immediately lead to tail bounds.

Proposition 3 Assume (A1) and (A3-7). With constant step size equal to $\gamma$, for any $n \ge 1$ and integer $p \ge 1$, we have:
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big]^p \le \big(3\|\theta_0 - \theta_*\|^2 + 20np\gamma^2 R^2\big)^p.$$

Corollary 4 Assume (A1) and (A3-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any integer $p \ge 1$, we have:
$$\forall n \in \{1, \ldots, N\}, \quad E\|\theta_n - \theta_*\|^{2p} \le \Big[\frac{1}{R^2}\big(3R^2\|\theta_0 - \theta_*\|^2 + 5p\big)\Big]^p,$$
$$E\big[f(\bar\theta_N) - f(\theta_*)\big]^p \le \Big[\frac{1}{\sqrt{N}}\big(3R^2\|\theta_0 - \theta_*\|^2 + 5p\big)\Big]^p.$$

In Appendix C, we first provide two alternative proofs of the same result: (a) our original, somewhat tedious, proof based on taking powers of the inequality in Equation (1) and using martingale moment inequalities, and (b) a shorter proof later derived by Bach and Moulines (2013), which uses the Burkholder-Rosenthal-Pinelis inequality (Pinelis, 1994, Theorem 4.1). We also provide in Appendix C a direct proof of the large deviation bound that we now present. Having a bound on all moments allows to immediately derive large deviation bounds (in the same two cases), by applying Lemma 11 from Appendix A:

Proposition 5 Assume (A1) and (A3-7). With constant step size equal to $\gamma$, for any $n \ge 1$ and $t \ge 0$, we have:
$$P\Big(f(\bar\theta_n) - f(\theta_*) \ge 30\gamma R^2 t + \frac{3}{\gamma n}\|\theta_0 - \theta_*\|^2\Big) \le 2\exp(-t),$$
$$P\Big(\|\theta_n - \theta_*\|^2 \ge 60n\gamma^2 R^2 t + 6\|\theta_0 - \theta_*\|^2\Big) \le 2\exp(-t).$$

Corollary 6 Assume (A1) and (A3-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any $t \ge 0$ we have:
$$P\Big(f(\bar\theta_N) - f(\theta_*) \ge \frac{15t}{\sqrt{N}} + \frac{6R^2\|\theta_0 - \theta_*\|^2}{\sqrt{N}}\Big) \le 2\exp(-t),$$
$$P\Big(\|\theta_N - \theta_*\|^2 \ge \frac{15t}{R^2} + 6\|\theta_0 - \theta_*\|^2\Big) \le 2\exp(-t).$$

We can make the following observations:

- The results above are obtained by direct application of Proposition 3. In Appendix C, we also provide an alternative direct proof of a slightly weaker result, which was suggested and outlined by Alekh Agarwal (personal communication), and which uses Freedman's inequality for martingales (Freedman, 1975, Theorem 1.6).

- The results above bounding the norm between the last iterate and a global optimum extend to the averaged iterate. The iterates $\theta_n$ and $\bar\theta_n$ do not necessarily converge to $\theta_*$ (note that $\theta_*$ may not be unique in general anyway).

- Given that $\big(E[f(\bar\theta_n) - f(\theta_*)]^p\big)^{1/p}$ is affine in $p$, we obtain a subexponential behavior, that is, tail bounds similar to an exponential distribution. The same decay was obtained by Nesterov and Vial (2008) and Nemirovski et al. (2009), but with an extra orthogonal projection step that is equivalent in our setting to knowing a bound on $\|\theta_*\|$, which is in practice not available.

- The constants in the bounds of Proposition 3 (and thus other results as well) could clearly be improved. In particular, we have, for $p = 1, 2, 3$ (see proof in Appendix E):
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big] \le \|\theta_0 - \theta_*\|^2 + n\gamma^2 R^2,$$
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big]^2 \le \big(\|\theta_0 - \theta_*\|^2 + 9n\gamma^2 R^2\big)^2,$$
$$E\Big[2\gamma n\big[f(\bar\theta_n) - f(\theta_*)\big] + \|\theta_n - \theta_*\|^2\Big]^3 \le \big(\|\theta_0 - \theta_*\|^2 + 20n\gamma^2 R^2\big)^3.$$

4.3 Convergence of Gradients

In this section, we prove higher-order bounds on the convergence of the gradient, with an improved rate $O(1/n)$ for $\|f'(\bar\theta_n)\|^2$. In this section, we will need the self-concordance property in Assumption (A2).

Proposition 7 Assume (A1-7). With constant step size equal to $\gamma$, for any $n \ge 1$ and integer $p \ge 1$, we have:
$$\Big(E\Big\|f'\Big(\frac{1}{n}\sum_{k=1}^{n}\theta_{k-1}\Big)\Big\|^{2p}\Big)^{1/(2p)} \le \frac{R}{\sqrt{n}}\Big[8\sqrt{p} + \frac{4p}{\sqrt{n}} + 40R^2\gamma p\sqrt{n} + \frac{3}{\gamma\sqrt{n}}\|\theta_0 - \theta_*\|^2 + \frac{3}{\gamma R\sqrt{n}}\|\theta_0 - \theta_*\|\Big].$$

Corollary 8 Assume (A1-7). With constant step size equal to $\gamma = \frac{1}{2R^2\sqrt{N}}$, for any integer $p \ge 1$, we have:
$$\Big(E\Big\|f'\Big(\frac{1}{N}\sum_{k=1}^{N}\theta_{k-1}\Big)\Big\|^{2p}\Big)^{1/(2p)} \le \frac{R}{\sqrt{N}}\Big[8\sqrt{p} + \frac{4p}{\sqrt{N}} + 20p + 6R^2\|\theta_0 - \theta_*\|^2 + 6R\|\theta_0 - \theta_*\|\Big].$$

We can make the following observations:

- The squared norm of the gradient $\|f'(\bar\theta_N)\|^2$ converges at rate $O(1/N)$.

- Given that $\big(E\|f'(\bar\theta_N)\|^{2p}\big)^{1/(2p)}$ is affine in $p$, we obtain a subexponential behavior for $\|f'(\bar\theta_N)\|$, that is, tail bounds similar to an exponential distribution.

- The proof of Proposition 7 makes use of the self-concordance assumption (which allows to upper-bound deviations of gradients by deviations of function values), together with the proof technique of Polyak and Juditsky (1992).

5. Self-Concordance Analysis for Strongly-Convex Problems

In the previous section, we have shown that $\|f'(\bar\theta_N)\|^2$ is of order $O(1/N)$. If the function $f$ were strongly convex with constant $\mu > 0$, this would immediately lead to the bound $f(\bar\theta_N) - f(\theta_*) \le \frac{1}{2\mu}\|f'(\bar\theta_N)\|^2$, of order $O(1/(\mu N))$. However, because of the Lipschitz-continuity of $f$ on the full Hilbert space $H$, it cannot be strongly convex. In this section, we show how the self-concordance assumption may be used to obtain the exact same behavior, but with $\mu$ replaced by the local strong convexity constant, which is more likely to be strictly positive. The required property is summarized in the following proposition about (generalized) self-concordant functions (see proof in Appendix B.1):

Lemma 9 Let $f$ be a convex three-times differentiable function from $H$ to $R$, such that for all $\theta_1, \theta_2 \in H$, the function $\varphi: t \mapsto f[\theta_1 + t(\theta_2 - \theta_1)]$ satisfies: $\forall t \in R$, $|\varphi'''(t)| \le R\|\theta_1 - \theta_2\|\,\varphi''(t)$. Let $\theta_*$ be a global minimizer of $f$ and $\mu$ the lowest eigenvalue of $f''(\theta_*)$, which is assumed strictly positive. If $\|f'(\theta)\| \le \frac{3\mu}{4R}$, then
$$\|\theta - \theta_*\|^2 \le \frac{4\|f'(\theta)\|^2}{\mu^2} \quad \text{and} \quad f(\theta) - f(\theta_*) \le \frac{2\|f'(\theta)\|^2}{\mu}.$$

We may now use this proposition for the averaged stochastic gradient. For simplicity, we only consider the step-size $\gamma = \frac{1}{2R^2\sqrt{N}}$, and the last iterate (see proof in Appendix F):

Proposition 10 Assume (A1-7). Assume $\gamma = \frac{1}{2R^2\sqrt{N}}$. Let $\mu > 0$ be the lowest eigenvalue of the Hessian of $f$ at the unique global optimum $\theta_*$. Then:
$$E f(\bar\theta_N) - f(\theta_*) \le \frac{R^2}{N\mu}\big(5R\|\theta_0 - \theta_*\| + 1\big)^4, \qquad E\|\bar\theta_N - \theta_*\|^2 \le \frac{R^2}{N\mu^2}\big(6R\|\theta_0 - \theta_*\| + 2\big)^4.$$

We can make the following observations:

- The proof relies on Lemma 9 and requires a control of the probability that $\|f'(\bar\theta_N)\| \le \frac{3\mu}{4R}$, which is obtained from Proposition 7.

- We conjecture a bound of the form $\big[\frac{R^2 p}{N\mu}\big(c_1 R\|\theta_0 - \theta_*\| + c_2\big)^4\big]^p$ for the $p$-th order moment of $f(\bar\theta_N) - f(\theta_*)$, for some scalar constants $c_1$ and $c_2$.

- The new bound now has the term $R\|\theta_0 - \theta_*\|$ with a fourth power (compared to the bound in Lemma 1, which has a second power), which typically grows with the dimension of the underlying space (or the slowness of the decay of the eigenvalues of the covariance operator when $H$ is infinite-dimensional). It would be interesting to study whether this dependence can be reduced.

- The key elements in the previous proposition are that (a) the constant $\mu$ is the local convexity constant, and (b) the step-size does not depend on that constant $\mu$, hence the claimed adaptivity. The bounds are only better than the non-strongly-convex bounds from Lemma 1 when the Hessian lowest eigenvalue is large enough, that is, when $\mu R^{-2}\sqrt{N}$ is larger than a fixed constant.

- In the context of logistic regression, even when the covariance matrix of the inputs is invertible, the only available lower bound on $\mu$ is equal to the lowest eigenvalue of the covariance matrix times $\exp(-R\|\theta_*\|)$, which is exponentially small. However, this lower bound is overly pessimistic since it is based on an upper bound on the largest possible value of $\langle x, \theta_*\rangle$. In practice, the actual value of $\mu$ is much larger and only a small constant smaller than the lowest eigenvalue of the covariance matrix.

- In order to assess whether this result can be improved, it is interesting to look at the asymptotic result from Polyak and Juditsky (1992) for logistic regression, which leads to a limit rate of $1/n$ times $\mathrm{tr}\big[f''(\theta_*)^{-1}\,E\big(f_n'(\theta_*)\otimes f_n'(\theta_*)\big)\big]$; note that this rate holds both for the stochastic approximation algorithm and for the global optimum of the training cost, using standard asymptotic statistics results (Van der Vaart, 1998).

When the model is well-specified, that is, when the log-odds ratio of the conditional distribution of the label given the input is linear, then $E\big(f_n'(\theta_*)\otimes f_n'(\theta_*)\big) = E f_n''(\theta_*) = f''(\theta_*)$, and the asymptotic rate is exactly $d/n$, where $d$ is the dimension of $H$ (which has to be finite-dimensional for the covariance matrix to be invertible). It would be interesting to see if, by making the extra assumption of well-specification, we can also get an improved non-asymptotic result. When the model is mis-specified however, the quantity $E\big(f_n'(\theta_*)\otimes f_n'(\theta_*)\big)$ may be large even when $f''(\theta_*)$ is small, and the asymptotic regime does not readily lead to an improved bound.

6. Conclusion

In this paper, we have provided a novel analysis of averaged stochastic gradient for logistic regression and related problems. The key aspects of our result are (a) the adaptivity to local strong convexity provided by averaging and (b) the use of self-concordance to obtain a simple bound that does not involve a term which is explicitly exponential in $R\|\theta_0 - \theta_*\|$, which could be obtained by constraining the domain of the iterates.

Our results could be extended in several ways: (a) with a finite and known horizon $N$, we considered a constant step-size proportional to $1/(R^2\sqrt{N})$; it thus seems natural to study the decaying step size $\gamma_n = O(1/(R^2\sqrt{n}))$, which should, up to logarithmic terms, lead to similar results and thus likely provide a solution to a recently posed open problem for online logistic regression (McMahan and Streeter, 2012); (b) an alternative would be to consider a doubling trick where the step-sizes are piecewise constant; also, (c) it may be possible to consider other assumptions, such as exp-concavity (Hazan and Kale, 2011) or uniform convexity (Juditsky and Nesterov, 2010), to derive similar or improved results. Finally, by departing from a plain averaged stochastic gradient recursion, Bach and Moulines (2013) have considered an online Newton algorithm with the same running-time complexity, which leads to a rate of $O(1/n)$ without strong convexity assumptions for logistic regression (though with additional assumptions regarding the distributions of the inputs). It would be interesting to understand if simple assumptions such as the ones made in the present paper are possible while preserving the improved convergence rate.

Acknowledgments

The author was partially supported by the European Research Council (SIERRA Project), and thanks Simon Lacoste-Julien, Eric Moulines and Mark Schmidt for helpful discussions. Moreover, Alekh Agarwal suggested and provided a detailed outline of the proof technique based on Freedman's inequality; this was greatly appreciated.

Appendix A. Probability Lemmas

In this appendix, we prove simple lemmas relating bounds on moments to tail bounds, through the traditional use of Markov's inequality. See more general results by Boucheron et al. (2013).
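The core of the argument behind these lemmas can be summarized as follows (a compact sketch, not verbatim from the paper): a polynomial moment bound of the form $E[X^p] \le (A + Bp)^p$ turns into a subexponential tail bound by applying Markov's inequality at the level $2A + 2Bp$ and then choosing $p$ of order $t$.

```latex
% Sketch: from moment bounds to a subexponential tail via Markov's inequality.
% Assume X >= 0 and E[X^p] <= (A + Bp)^p for all integers p >= 1.
\[
  \mathbb{P}\big(X \geq 2A + 2Bp\big)
  \;\leq\; \frac{\mathbb{E}[X^p]}{(2A + 2Bp)^p}
  \;\leq\; \Big(\frac{A + Bp}{2A + 2Bp}\Big)^{p}
  \;=\; 2^{-p}
  \;=\; \exp(-p \log 2).
\]
% Taking the integer p closest to t / log 2 (and absorbing the rounding and the
% restriction on the range of p into the constants) yields a bound of the form
% P(X >= 3Bt + 2A) <= 2 exp(-t), which is the shape used in Proposition 5 and Corollary 6.
```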

12 Bach Lemma Let X be a o-egative radom variable such that for some positive costats A ad B, ad all p {,..., }, EX p A + Bp) p. The, if t 2, PX 3Bt + 2A) 2 exp t). Proof We have, by Markov s iequality, for ay p {,..., }: PX 2Bp + 2A) EX p 2Bp + 2A) p For u, ], we cosider p = u, so that A + Bp)p = exp log2)p). 2A + 2Bp) p PX 2Bu + 2A) PX 2Bp + 2A) exp log2)p) 2 exp log2)u). We take t = log2)u ad use 2/ log 2 3. This is thus valid if t 2. Lemma 2 Let X be a o-egative radom variable such that for some positive costats A, B ad C, ad for all p {,..., }, EX p A p + Bp + C) 2p. The, if t, PX 2A t + 2Bt + 2C) 2 ) 4 exp t). Proof We have, by Markov s iequality, for ay p {,..., }: PX 2A p + 2Bp + 2C) 2 ) EX p 2A p + 2Bp + 2C) 2p A p + Bp + C) 2p 2A exp log4)p). p + 2Bp + 2C) 2p For u, ], we cosider p = u, so that PX 2A u + 2Bu + 2C) 2 ) PX 2A u + 2Bu + 2C) 2 ) exp log2)p) 4 exp log4)u). We take t = log4)u ad use log 4. This is thus valid if t. 606

13 Adaptivity of Averaged Stochastic Gradiet Descet Appedix B. Self-Cocordace Properties I this appedix, we show two lemmas regardig our geeralized otio of self-cocordace, as well as Lemma 9. For more details, see Bach 200) ad refereces therei. The followig lemma provide a upper-boud o a oe-dimesioal self-cocordat fuctio at a give poit which is based o the gradiet at this poit ad the value ad the Hessia at the global miimum. This is key to goig i Sectio 5 from a covergece of gradiets to a covergece of fuctio values. Lemma 3 Let ϕ : 0, ] R a strictly covex three-times differetiable fuctio such that for some S > 0, t 0, ], ϕ t) Sϕ t). Assume ϕ 0) = 0, ϕ 0) > 0. The: ϕ ) ϕ 0) S e S ad ϕ) ϕ0) + ϕ ) 2 ϕ + S). 0) Moreover, if α = ϕ )S ϕ 0) <, the ϕ) ϕ0) + ϕ ) 2 ϕ 0) α log α. If i additio α 3 4, the ϕ) ϕ0) + 2 ϕ ) 2 ϕ 0) ad ϕ 0) 2ϕ ). Proof By self-cocordace, we obtai that the derivative of u log ϕ u) is lower-bouded by S. By itegratig betwee 0 ad t 0, ], we get log ϕ t) log ϕ 0) St, that is, ϕ t) ϕ 0)e St, 2) ad by itegratig betwee 0 ad, we obtai ote that we have assumed ϕ 0) = 0): ϕ ) ϕ 0) e S. 3) S We the get with a first iequality from covexity of ϕ, ad the last iequality from e S + S): ϕ) ϕ0) ϕ ) ϕ ) ϕ ) S ϕ 0) e S = ϕ ) 2 ϕ S + S ) 0) e S ϕ ) 2 ϕ + S). 0) Equatio 3) implies that α e S, which implies, if α <, S log that ϕ) ϕ0) ϕ ) ϕ ) S ϕ 0) e S ϕ ) 2 ϕ 0) α log α, S α. This implies usig the mootoicity of S. Fially the last bouds are a cosequece of S e S α α log α 2, which is valid for α 3 4. Note that i Equatio 2), we do cosider a lower-boud o the Hessia with a expoetial factor e St. The key feature of usig self-cocordace properties is to get aroud this expoetial factor i the fial boud. The followig lemma upper-bouds the remaider i the first-order Taylor expasio of the gradiet by the remaider i the first-order Taylor expasio of the fuctio. This is importat whe fuctio values behave well i.e., coverge to the miimal value) while the iterates may ot. 607

14 Bach Lemma 4 Let f be a covex three-times differetiable fuctio from H to R, such that for all θ, θ 2 H, the fuctio ϕ : t f θ + tθ 2 θ ) ] satisfies: t R, ϕ t) R θ θ 2 ϕ t). For ay θ, θ 2 H, we have: f θ ) f θ 2 ) f θ 2 )θ 2 θ ) R fθ ) fθ 2 ) f θ 2 ), θ 2 θ ]. Proof For a give z H of uit orm, let ϕt) = z, f θ 2 + tθ θ 2 ) ) f θ 2 ) tf θ 2 )θ 2 θ ) ad ψt) = R fθ 2 + tθ θ 2 )) fθ 2 ) t f θ 2 ), θ 2 θ ]. We have ϕ0) = ψ0) = 0. Moreover, we have the followig derivatives: ϕ t) = z, f θ 2 + tθ θ 2 ) ) f θ 2 ), θ θ 2 ϕ t) = f θ 2 + tθ θ 2 ) ) z, θ θ 2, θ θ 2 ] R z 2 f θ 2 + tθ θ 2 ) ) θ θ 2, θ θ 2 ], usig the Appedix A of Bach 200), = R θ 2 θ, f θ 2 + tθ θ 2 ) ) θ θ 2 ) ψ t) = R f θ 2 + tθ θ 2 ) ) f θ 2 ), θ θ 2 ψ t) = R θ 2 θ, f θ 2 + tθ θ 2 ) ) θ θ 2 ), where f θ) is the third order tesor of third derivatives. This leads to ϕ 0) = ψ 0) = 0 ad ϕ t) ψ t). We thus have ϕ) ψ) by itegratig twice, which leads to the desired result by maximizig with respect to z. B. Proof of Lemma 9 We follow the stadard proof techiques i self-cocordat aalysis ad defie a appropriate fuctio of a sigle real variable ad apply simple lemmas like the oes above. Defie ϕ : t f θ + tθ θ ) ] fθ ). We have ϕ t) = f θ + tθ θ ) ], θ θ ϕ t) = θ θ, f θ + tθ θ ) ] θ θ ) ϕ t) = f θ + tθ θ ) ] θ θ, θ θ, θ θ ]. We thus have: ϕ0) = ϕ 0) = 0, 0 ϕ ) = f θ), θ θ f θ) θ θ, ϕ 0) = θ θ, f θ )θ θ ) µ θ θ 2, ad ϕt) 0 for all t 0, ]. Moreover, ϕ t) R θ θ ϕ t) for all t 0, ], that is, Lemma 3 applies with S = R θ θ. This leads to the desired result, with α = ϕ )S ϕ 0) iequality i Lemma 3), for all θ H ad without ay assumptio o θ): Appedix C. Proof of Propositio 3 f θ) R µ. Note that we also have usig the secod fθ) fθ ) + R θ θ ) f θ) 2. µ We provide two alterative proofs of the same result: a) our origial somewhat tedious proof i Appedices C.3 ad C.4, based o takig powers of the iequality i Equatio ) 608

15 Adaptivity of Averaged Stochastic Gradiet Descet ad usig martigale momet iequalities, b) a shorter proof i Appedix C.5, later derived by Bach ad Moulies 203), that uses Burkholder-Rosethal-Pielis iequality Pielis, 994, Theorem 4.). Aother proof techique was suggested ad outlied by Alekh Agarwal persoal commuicatio), that uses Freedma s iequality for martigales Freedma, 975, Theorem.6); it allows to directly get a tail boud like i Propositio 5. This proof will be preseted i Appedix C.6. Note that the two shorter proofs curretly lead to slightly worse costats or to extra logarithmic factors), that may be improved with more refied derivatios. All proofs start from a similar martigale set-up that we describe i Appedix C. ad use a almost-sure boud whe p gets large Appedix C.2). C. Boudig Martigales From the proof of Lemma, we have the recursio: 2γ fθ ) fθ ) ] + θ θ 2 θ θ 2 + γ 2 R 2 + M, with M = 2γ θ θ, f θ ) f θ ). This leads to, by summig from to, ad usig the covexity of f: with 2γf ) θ k 2γfθ ) + θ θ 2 A, A = θ 0 θ 2 + γ 2 R 2 + M k 0. Note that A may also be defied recursively as A 0 = θ 0 θ 2 ad A = A + γ 2 R 2 + M. 4) The radom variables M ) ad A ) satisfy the followig properties that will proved useful throughout the proof: a) Martigale icremet: for all k, EM k F k ) = 0. This implies that S = M k is a martigale. b) Boudedess: M k 4γR θ k θ 4γRA /2 k almost surely. C.2 Almost Sure Boud I this sectio, we derive a almost sure boud that will be valid for small. From the stochastic gradiet recursio θ = θ γf θ ), we get, usig Assumptio A4) ad the triagle iequality: θ θ θ θ + γ f θ ) θ θ + γr almost surely. 609

16 Bach This leads to θ θ θ 0 θ + γr for all 0. This i tur implies that A θ 0 θ 2 + γ 2 R 2 + 4γR θ 0 θ 2 + γ 2 R 2 + 4γR θ k θ usig M k 4γR θ k θ, θ0 θ + k )γr ] usig the iequality above, θ 0 θ 2 + γ 2 R 2 + 4γR θ 0 θ + 2γ 2 R 2 2 by summig over the first itegers, θ 0 θ 2 + γ 2 R 2 + 2γ 2 2 R θ 0 θ 2 + 2γ 2 R 2 2 usig ab a2 2 + b2 2, 3 θ 0 θ γ 2 R 2 almost surely. 5) This implies that the boud is show for all p 4. C.3 Derivatio of p-th Order Recursio The first proof works as follows: a) derive a recursio betwee the p-th momets ad the lower-order momets this sectio) ad c) prove the result by iductio o p Appedix C.4). Note that we have to treat separately small values o i the recursio, for which we use the almost sure boud from Appedix C.2. Startig from Equatio 4), usig the biomial expasio formula, we get: ) p p ) p A p A + γ 2 R 2 + M = A + γ 2 R 2) p k M k k A + γ 2 R 2) p + p A + γ 2 R 2) p ) p p M + A + γ 2 R 2) p k /2 k. 4γRA k ) This leads to, usig EM F ) = 0, upper boudig γ 2 R 2 by 4γ 2 R 2, ad usig the biomial expasio formula several times: k=2 E A p ] F A + 4γ 2 R 2) p ) p p + A + 4γ 2 R 2) p k /2 ) k 4γRA k k=2 = A + 4γ 2 R 2 + 4γRA /2 ) p 4γRp A + 4γ 2 R 2) p /2 A by isolatig the term k = i the biomial formula, = A /2 + 2γR) 2p 4γRp A + 4γ 2 R 2) p /2 A 2p ) 2p = A k/2 k 2γR)2p k 4γRpA /2 p p k = 2p A k/2 2γR)2p k C k, ) A k 2γR) 2p k) 60

17 Adaptivity of Averaged Stochastic Gradiet Descet with the costats C k defied as: ) 2p C 2q = for q {0,..., p}, 2q ) ) 2p p C 2q+ = 2p for q {0,..., p }. 2q + q I particular, C 0 =, C 2p =, C = 0 ad C 2p = ) 2p 2p 2p p ) = 0. Our goal is ow to boudig the values of C k to obtai Equatio 8) below. This will be doe by boudig the odd-idexed elemet by the eve-idexed elemets. We have, for q {,..., p 2}, C 2q+ 2q + 2p 2q = = ) 2p 2q + 2q + 2p 2q 2p)! 2q + )!2p 2q )! 2p)! 2p 2q 2q)!2p 2q)! 2p 2q = 2q + 2p 2q 2p 2q ) 2p 2q 2p 2q. 6) 2q+ For the ed of the iterval above i q, that is, q = p 2, we obtai C 2q+ 2p 2q C 2q 4 3, 2q+ while for q p 3, we obtai C 2q+ 2p 2q C 2q 6 5. Moreover, for q {,..., p 2}, C 2q+ 2p 2q 2q + = = 2p 2q + ) 2p 2q 2q + 2p)! 2q + )!2p 2q )! 2p)! 2q + 2)!2p 2q 2)! 2p 2q 2q + 2q + 2 2q + = ) 2p 2q + 2 2q + 2 2q +. 7) 2p 2q 4 For the ed of the iterval above i q, that is, q =, we obtai C 2q+ 2q+ C 2q+2 3, 2p 2q 6 while for q 2, we obtai C 2q+ 2q+ C 2q+2 5. We have moreover, by usig the boud 2γRA /2 α 2 2γR)2 + 2α A for α = 2q+ 2p 2q : C 2q+ A q+/2 2γR)2p 2q = C 2q+ A q 2γR)2p 2q 2 A /2 2γR) C 2q+ A q ] 2q + 2p 2q 2γR)2p 2q 2 2 2p 2q 2γR)2 + A 2q + = 2 C 2p 2q 2q+ A q+ 2q + 2γR)2p 2q C 2q + 2q+ 2p 2q Aq 2γR)2p 2q. By combiig the previous iequality with Equatio 6) ad Equatio 7), we get that the terms idexed by 2q + are bouded by the terms idexed by 2q + 2 ad 2q. All terms with q {2,..., p 3} are expaded with costats 3 5, while for q = ad q = p 2, this is 6

18 Bach 2 3. Overall each eve term receives a cotributio which is less tha max{ 6 5, , 2 3 } = 9 This leads to p 2 C 2q+ A q+/2 2γR)2p 2q 9 p C 2q A q 2γR)2p 2q, q= leadig to the recursio that will allow us to derive our result: C.4 Proof by Iductio E A p ] F A p + 34 p q=0 q=0. ) 2p A q 2q 2γR)2p 2q. 8) We ow proceed by iductio o p. If we assume that EA q k 3 θ 0 θ 2 + kqγ 2 R 2 B ) q for all q < p, ad a certai B which we will choose to be equal to 20). We first ote that if 4p, the from Equatio 5), we have EA p 3 θ 0 θ γ 2 R 2 ) p 3 θ 0 θ pγ 2 R 2 ) p. Thus, we oly eed to cosider 4p. We the get from Equatio 8): E θ θ 2p θ 0 θ 2p + 34 θ 0 θ 2p + 34 p q=0 p q=0 ) 2p EA q 2q k 2γR)2p 2q ) 2p 3 θ0 θ 2 + kqγ 2 R 2 B ) q 2γR) 2p 2q, 2q usig the iductio hypothesis. We may ow sum with respect to k: E θ θ 2p θ 0 θ 2p + 34 θ 0 θ 2p + 34 usig p q=0 p q=0 2p 2q k α α+ α + = θ 0 θ 2p + 34 p j=0 )2γR) 2p 2q ) 2p 2γR) 2p 2q 2q for ay α > 0, 3 θ0 θ 2 + kqγ 2 R 2 B ) q q ) q qγ 3 j θ 0 θ 2j 2 R 2 B ) q j q j+ j q j + j=0 p 3 j θ 0 θ 2j 4γ 2 R 2 ) p j q=j 2p 2q ) q j ) qb 4 by chagig the order of summatios. We ow aim to show that it is less tha j=0 ) q j q p+ q j +, p p ) p 3 θ 0 θ 2 + kpγ 2 R B) 2 = 3 p θ 0 θ 2p + 3 j θ 0 θ 2j γ 2 R 2 ) p j Bp) p j. j 62

19 Adaptivity of Averaged Stochastic Gradiet Descet By comparig all terms i θ 0 θ 2j, this is true as soo as for all j {0,..., p }, 34 p q=j ) ) 2p q qb/4 ) q j 2q j q j + Bp/4)p j p q ) p j 34 p j 2p 2k + 2 ) p k j ) p k)b/4 ) p k j p k j Bp/4)p j k ) p, j obtaied by usig the chage of variable k = p q. This is implied by, usig 4p: 36 p j ) ) p k 2p B k p k p+j j 2k + 2 p j) p k ) p k j p k j. By expadig the biomial coefficiets ad simplifyig by p k j, this is equivalet to 36 p j ) 2p p k) p k j + ) B k p k p+j ) p k j p k. 2k + 2 p p j + ) We may ow write p k) p k j + ) p p j + ) = = p k)! p j)! p k)! p j)! = p k j)! p! p! p k j)! p j) p k j + ), p p k) so that we oly eed to show that 36 p j ) 2p p j) p k j + ) B k p k p+j ) p k j p k. 2k + 2 p p k) 63

20 Bach We have, by boudig all terms the tha p by p: = 36 = 36 = p j p j p j ) 2p p j) p k j + ) A k p k p+j ) p k j p k 2k + 2 p p k) ) 2p A k p k p+j 2k + 2 ) 2p A k p k 2k + 2 p j A k p j p j p k 2k + 2)! A k p 2 2 2k+2 2k + 2)! A k 2 2k+2 2k + 2)! p k p p k) pp k j p p k) 2p2p ) 2p 2k ) p p k) pp /2) p k /2) p p k) by associatig all 2k + 2 terms i ratios which are all less tha, + 2/ A) 2k+2 2k + 2)! = 36 ] cosh2/ A) < if A 20. We thus get the desired result EA p 3 θ 0 θ pγ 2 R 2) p, ad the propositio is proved by iductio. C.5 Alterative Proof Usig Burkholder-Rosethal-Pielis Iequality I this sectio, we preset a slightly modified versio of) the proof from Bach ad Moulies 203) which is based o Burkholder-Rosethal-Pielis iequality Pielis, 994, Theorem 4.), which we ow recall. C.5. BRP Iequality Throughout the proof, we use the otatio for X H a radom vector, ad p ay real umber greater tha, X p = E X p) /p. We first recall the Burkholder-Rosethal- Pielis BRP) iequality Pielis, 994, Theorem 4.). Let p R, p 2 ad F ) 0 be a sequece of icreasig σ-fields, ad X ) a adapted sequece of elemets of H, such that E ] X F = 0, ad X p is fiite. The, sup k {,...,} k p X j p j= p E X k 2 ] /2 F k + p p/2 E X k 2 ] /2 F k + p p/2 sup k {,...,} sup k {,...,} X k 9) p /2 X k 2. p/2 64

21 Adaptivity of Averaged Stochastic Gradiet Descet C.5.2 Proof of Propositio 3 With Slightly Worse Costats) We use BRP s iequality i Equatio 9) to get, for p 2, /4]: sup k {0,...,} Thus if B = p A k θ 0 θ 2 + γ 2 R 2 + p 6γ2 R 2 +p sup /2 θ k θ 2 k {,...,} θ 0 θ 2 + γ 2 R 2 + p 4γR θ 0 θ 2 + γ 2 R 2 + 4γR +p 4γR sup k {0,..., } p/2 4γR θ k θ sup k {0,..., } A k /2 p/2 p sup A /2 k k {0,..., } p /2 ) A k p + p. sup k {0,...,} A k p, we have usig p /4, which implies p + p 3 2 p): p/2 By solvig this quadratic iequality, we get: B θ 0 θ 2 + γ 2 R 2 + 6γRB /2 p. B /2 3γR p ) 2 θ0 θ 2 + γ 2 R 2 + 9γ 2 R 2 p, which implies B /2 3γR p + θ 0 θ 2 + γ 2 R 2 + 9γ 2 R 2 p B 2 9γ 2 R 2 p + 2 θ 0 θ 2 + γ 2 R 2 + 9γ 2 R 2 p ) 40γ 2 R 2 p + 2 θ 0 θ 2. The previous statemet is valid for p 2 ad trivial for p =. From Appedix C.2, we oly eed to have the result for p 4. Thus the boud is slightly worse but could be clearly improved with more care, for example, by usig iductio o ). C.6 Alterative Proof Usig Freedma s Iequality I the previous sectio, we have used p-th order momet martigale iequalities that relate the orm of a martigale to the orm of its predictable quadratic variatio process. Similar results may be obtaied for tail bouds through Freedma s iequality Freedma, 975, Theorem.6). This proof techique was suggested ad outlied by Alekh Agarwal persoal commuicatio). C.6. Freedma s Iequality ad Extesios Let X ) be a real-valued martigale icremet adapted to the icreasig sequece of σ- fields F ), that is, such that EX F ) = 0, that is almost surely bouded, that is, X R 6

22 Bach almost surely. Let Σ = EX2 k F k ) the predictable quadratic variatio process. The for ay costats t ad σ 2, P max k {,...,} k i= X i t, Σ σ 2) 2 exp t 2 2σ 2 + Rt/3) Whe X ) are idepedet radom variables, this recovers Berstei s iequality. From this boud, oe may derive the followig boud Kakade ad Tewari, 2009); with probability 4log )δ, we have: max k {,...,} k i= { X i max 2 Σ, 3R log δ } log δ 2 Σ log δ + 3R log δ. 0) Note that the result of Kakade ad Tewari 2009) cosiders oly max k {,...,} k X i, but that the extesio of their proof is straightforward. i= C.6.2 Proof of Propositio 5 With Slightly Worse Costats ad Scaligs) ). i= X i rather tha We ca ow apply the iequality i Equatio 0) to M ). We have M 4γR θ θ 4γR θ 0 θ +γr ) almost surely. Moreover, EM F 2 ) 6γ 2 R 2 θ θ 2 6γ 2 R 2 A. This leads to with probability greater tha 4log )δ, max A k k {,...,} θ 0 θ 2 + γ 2 R 2 + 8γR A k log δ + 2γR θ 0 θ + γr ) log δ θ 0 θ 2 + γ 2 R 2 + 8γR max Ak log δ k {,...,} +2γR θ 0 θ + γr ) log δ. We may ow solve the quadratic iequality i max k {,...,} Ak. This leads to The max k {,...,} Ak 4γR ) 2 log δ θ 0 θ 2 + γ 2 R 2 + 2γR θ 0 θ + γr ) log δ + 6γ2 R 2 log δ = θ 0 θ 2 + γ 2 R 2 + 2γR θ 0 θ + 28γ 2 R 2) log δ. max k {,...,} Ak log δ + θ 0 θ + γr + 2γR θ 0 θ + 28γ 2 R 2 4γR log δ 66

23 Adaptivity of Averaged Stochastic Gradiet Descet ad max k {,...,} A k 64γ 2 R 2 log δ + 4 θ 0 θ 2 + 4γ 2 R γR θ 0 θ + 28γ 2 R 2) log δ 4 θ 0 θ 2 + 4γ 2 R γ 2 R γR θ 0 θ + 2γ 2 R 2) log δ ) 4 θ 0 θ 2 + 4γ 2 R γ 2 R γR θ 0 θ log δ. We thus recover a tail boud which is very similar to the oe obtaied i Propositio 5, with the followig differeces: the additioal term 48γR θ 0 θ is uimportat because γ = ON /2 ); however, because the extesio of Freedma s iequality is satisfied with probability 4log )δ, this proof techique loses a logarithmic factor. Appedix D. Proof of Propositio 7 The proof is orgaized i two parts: we first show a boud o the averaged gradiet f θ k ), the relate it to the gradiet at the averaged iterate, that is, f ) θ k, usig self-cocordace. D. Boud o f θ k ) We have, followig Polyak ad Juditsky 992) ad Bach ad Moulies 20): f θ ) = γ θ θ ), which implies, by summig over all itegers betwee ad : f θ k ) = f θ k ) f k θ k ) ] + γ θ 0 θ ) + γ θ θ ). We deote X k = f θ k ) f k θ k ) ] H. We have: X k 2R almost surely ad EX k F k ) = 0, with E X k 2 F k ) ) /2 2R. We may thus apply the Burkholder-Rosethal-Pielis iequality Pielis, 994, Theorem 4.), ad get: E f θ k ) f k θ k ) ] 2p] /2p 2p 2R + 2p 2R /2. 67

24 Bach This leads to, usig Propositio 3 ad Mikowski s iequality: E f 2p] /2p θ k ) E f θ k ) f k θ k ) ] 2p] /2p + γ θ 0 θ + E θ θ 2p] /2p γ 2p 2R + 2p 2R /2 + γ θ 0 θ + 3 θ0 θ γ pγ 2 R 2] 2p 2R + 2p 2R /2 + γ θ 0 θ + 3 γ θ 0 θ + ] 20pγR γ 4pR + 2p 2R /2 + 2 γ θ 0 θ + 20pγR γ 4pR + p R ] γ θ 0 θ 4pR + 8 p R + 3 γ θ 0 θ. ) D.2 Usig Self-Cocordace Usig the self-cocordace property of Lemma 4 several times, we obtai: ) f θ k ) f θ k = f θ k ) f θ ) f θ )θ k θ ) ] ) ) f θ k + f θ ) + f θ ) θ k θ R fθk ) fθ ) f θ ), θ k θ ] ) +R f θ k fθ ) + f θ ), 2R ) fθ k ) fθ ) usig the covexity of f. θ k θ ] This leads to, usig Propositio 3: E ) 2p) /2p f θ k ) f θ k ] 2p ) /2p 2R E fθ k ) fθ ) 2R ) 3 θ 0 θ pγ 2 R 2. 2) 2γ Summig Equatio ) ad Equatio 2) leads to the desired result. 68

25 Adaptivity of Averaged Stochastic Gradiet Descet Appedix E. Results for Small p I Propositio 3, we may replace the boud 3 θ 0 θ pγ 2 R 2 with a boud with smaller costats for p =, 2, 3 to be used i proofs of results i Sectio 5). This is doe usig the same proof priciple but fier derivatios, as follows. We deote γ 2 R 2 = b ad θ θ 2 = a, ad cosider the followig iequalities which we have cosidered i the proof of Propositio 3: A p A + b + M ) p M 4b /2 A /2 ad EM F ) = 0, A 0 = a. We simply take expasios of the p-th power above, ad sum for all first itegers. We have: EA EA 2 EA + b a + b, EA 2 + b 2 + 2bA + M) 2 EA 2 + 2EA b + b 2 + 6bEA ] a 2 + 8b a + kb + b 2 a 2 + 8ba b] + b2 usig the result about EA, = a 2 + 8ba + b ) a + 9b) 2. We may ow pursue for the third order momets: EA 3 EA + b) 3 + 3EA + b) 2 M 2 + 3EA + b) 3 M + EM 3 EA + b) 3 + 3EA + b) 2 6bA b 3/2 EA 3/2 EA 3 + 3EA 2 b + 3EA b 2 + b 3 ) + 3EA + b)6ba + 64b 3/2 EA 3/2 = EA 3 + 3EA 2 b + 3EA b 2 + b 3 ) + 3EA + b)6ba By expadig, we get EA 3 +32bEA 2b /2 A /2 ]. EA 3 + 3EA 2 b + 3EA b 2 + b 3 ) + 3EA + b)6ba +32EbA A + 4b] 4 = EA 3 + EA 2 b ] + EA b ] + b 3 = EA EA 2 b + 79EA b 2 + b 3 ] ] a b a 2 + 8bka + b 2 k + 9k 2 ) + 79b 2 a + kb + b 3 a ba 2 + 9b 2 a + b 2 2 / )] + 79b 2 a + b 2 /2] + b 3 = a ba 2 + b 2 a ] + b 3 59/ /2 2 + ] = a ba 2 + b 2 a ] + b ] a + 20b) 3. 69

26 Bach We the obtai: E 2γ f θ ) fθ ) ] ] 2 + θ θ 2 θ 0 θ 2 + 9γ 2 R 2) 2 E 2γ f θ ) fθ ) ] ] 3 + θ θ 2 θ 0 θ γ 2 R 2) 3. Appedix F. Proof of Propositio 0 The proof follows from applyig self-cocordace properties Lemma 9) to θ. We thus eed to provide a cotrol o the probability that f θ ) 3µ 4R. F. Tail Boud for f θ ) We derive a large deviatio boud, as a cosequece of the boud o all momets of f θ ) Propositio 7) ad Lemma 2, that allows to go from momets to tail bouds: f P θ ) 2R 0 t + 40R 2 γt + 3 γ θ 0 θ ]) γr θ 0 θ 4 exp t). I order to derive the boud above, we eed to assume that p /4 so that 4p/ 2 p/ ), ad thus, whe applyig Lemma 2, the boud above is valid as log as t /4. It is however valid for all t, because the gradiets are bouded by R, ad for t >, we have 2R 0 t R, ad the iequality is satisfied with zero probability. F.2 Boudig the Fuctio Values From Lemma 9, if f θ ) 3µ 4R, the f θ ) fθ ) 2 f θ ) 2 µ. This will allow us to derive a tail boud for f θ ) fθ ), for sufficietly small deviatios. For larger deviatios, we will use the tail boud which does ot use strog covexity Propositio 5). We cosider the evet { A t = f θ ) 2R 0 t + 40R 2 γt + 3 γ θ 0 θ ]} γr θ 0 θ. We make the followig two assumptios regardig γ ad t: 0 t + 40R 2 γt 2 3µ 3 4R 2R = µ 4R 2 3) 3 ad γ θ 0 θ γr θ 0 θ 3µ 3 4R 2R = µ 8R 2, so that the upper-boud o f θ ) i the defiitio of A t is less tha 3µ 4R so that we ca apply Lemma 9). We thus have: A t {f θ ) fθ ) 8R2 0 t + 40R 2 γt + 3 µ γ θ 0 θ ] 2 } γr θ 0 θ {f θ ) fθ ) 8R2 0 ] 2 } t + 20 t +, µ 620

27 Adaptivity of Averaged Stochastic Gradiet Descet with = 2γR 2 ad = 3 γ θ 0 θ γr θ 0 θ. This implies that for all t 0, such that 0 t + 20 t µ 4R 2, that is, our assumptio i Equatio 3), we may apply the tail boud from Appedix F. to get: P f θ ) fθ ) 8R2 0 ] 2 ) t + 20 t + 4e t. 4) µ Moreover, we have for all v 0 from Propositio 5): P f θ ) fθ ) 30γR 2 v + 3 θ 0 θ 2 ) 2 exp v). ) γ We may ow use the last two iequalities to boud the expectatio Ef θ ) fθ )]. We first express the expectatio as a itegral of the tail boud ad split it ito three parts: E f θ ) fθ ) ] = = R 2 µ R 2 µ 2 8R2 µ + P f θ ) fθ ) u ] du P f θ ) fθ ) u ] du 6) µ 4R 2 + ) 2 8R 2 µ ) 2 P µ 4R 2 + P f θ ) fθ ) u ] du f θ ) fθ ) u ] du. 2 8R 2 µ We may ow boud the three terms separately. For the first itegral, we boud the probability by oe to get P f θ ) fθ ) u ] du 2 8R2 0 µ. For the third term i Equatio 6), we use the tail boud i Equatio ) to get = 2 + 8R 2 µ ) 2 P µ 4R R 2 µ ) 2 µ 4R f θ ) fθ ) u ] du P γ θ 0 θ 2 8R 2 µ ) 2 exp µ 4R γ θ 0 θ 2 We may apply Equatio ) because 8R 2 µ µ 4R 2 + ) 2 3 γ θ 0 θ 2 8R2 µ µ f θ ) fθ ) u + 3 γ θ 0 θ 2 u 30γR 2 ) du. 4R 2 + ) 2 µ 8R2 8R2 µ 62 µ 4R 2 ) 2 µ ] du 8R 2 = 3µ 0. 8R2

28 Bach We ca ow compute the boud explicitly to get + 8R 2 µ µ ) 2 P f θ ) fθ ) u ] du 4R 2 + 8R 60γR 2 2 µ exp 30γR 2 µ 4R 2 + ) γR 2 exp µ ) 80γR 4 60γR 2 80γR4 2µ = 2400γ2 R 6. µ ]) γ θ 0 θ 2 usig e α 2α µ ) 60γR 2 3µ exp 30γR 2 8R 2 for all α > 0 We ow cosider the secod term i Equatio 6) for which we will use Equatio 4). We cosider the chage of variable u = 8R2 0 2, t + 20 t + ] for which u 2 8R2 µ, 8R2 µ µ 4R 2 8R 2 µ 2 8R2 µ + ) 2 ] implies t 0, + ). This implies that µ 4R 2 + ) 2 P f θ ) fθ ) u ] du 8R 4e t 2 d 0 ] 2 ) t + 20 t + 0 µ = 32R2 e 00 t t µ 0 2 t/ ) 2 t / dt = 00Γ) 32R Γ2) ) µ Γ3/2) Γ/2) + 40 Γ) with Γ deotig the Gamma fuctio, = 32R ) π + 20 π µ We may ow combie the three bouds to get, from Equatio 6), E f θ ) fθ ) ] 2 8R2 µ γ2 R 6 µ + 32R µ 2 32R2 µ ) π + 20 π γ2 R π+0 ] π+40. For γ = 2R 2, with α = R θ N 0 θ, = ad = 6α 2 + 6α, we obtai E f θ N ) fθ ) ] ] 32R2 Nµ ] 32R2 9α 4 + 8α 3 + 9α α α Nµ R2 625α α α α ) = R2 ) 4. 5α + Nµ Nµ 622


More information

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression Harder, Better, Faster, Stroger Covergece Rates for Least-Squares Regressio Aoymous Author(s) Affiliatio Address email Abstract 1 2 3 4 5 6 We cosider the optimizatio of a quadratic objective fuctio whose

More information

ENGI Series Page 6-01

ENGI Series Page 6-01 ENGI 3425 6 Series Page 6-01 6. Series Cotets: 6.01 Sequeces; geeral term, limits, covergece 6.02 Series; summatio otatio, covergece, divergece test 6.03 Stadard Series; telescopig series, geometric series,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

Introduction to Optimization Techniques. How to Solve Equations

Introduction to Optimization Techniques. How to Solve Equations Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

Lecture 3 The Lebesgue Integral

Lecture 3 The Lebesgue Integral Lecture 3: The Lebesgue Itegral 1 of 14 Course: Theory of Probability I Term: Fall 2013 Istructor: Gorda Zitkovic Lecture 3 The Lebesgue Itegral The costructio of the itegral Uless expressly specified

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Self-normalized deviation inequalities with application to t-statistic

Self-normalized deviation inequalities with application to t-statistic Self-ormalized deviatio iequalities with applicatio to t-statistic Xiequa Fa Ceter for Applied Mathematics, Tiaji Uiversity, 30007 Tiaji, Chia Abstract Let ξ i i 1 be a sequece of idepedet ad symmetric

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Binary classification, Part 1

Binary classification, Part 1 Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013. Large deviations Theory. Cramér s Theorem MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/5.070J Fall 203 Lecture 3 9//203 Large deviatios Theory. Cramér s Theorem Cotet.. Cramér s Theorem. 2. Rate fuctio ad properties. 3. Chage of measure techique.

More information

Optimal Two-Choice Stopping on an Exponential Sequence

Optimal Two-Choice Stopping on an Exponential Sequence Sequetial Aalysis, 5: 35 363, 006 Copyright Taylor & Fracis Group, LLC ISSN: 0747-4946 prit/53-476 olie DOI: 0.080/07474940600934805 Optimal Two-Choice Stoppig o a Expoetial Sequece Larry Goldstei Departmet

More information

1 Review and Overview

1 Review and Overview CS9T/STATS3: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #6 Scribe: Jay Whag ad Patrick Cho October 0, 08 Review ad Overview Recall i the last lecture that for ay family of scalar fuctios F, we

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

Math 104: Homework 2 solutions

Math 104: Homework 2 solutions Math 04: Homework solutios. A (0, ): Sice this is a ope iterval, the miimum is udefied, ad sice the set is ot bouded above, the maximum is also udefied. if A 0 ad sup A. B { m + : m, N}: This set does

More information

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities O Equivalece of Martigale Tail Bouds ad Determiistic Regret Iequalities Sasha Rakhli Departmet of Statistics, The Wharto School Uiversity of Pesylvaia Dec 16, 2015 Joit work with K. Sridhara arxiv:1510.03925

More information

The natural exponential function

The natural exponential function The atural expoetial fuctio Attila Máté Brookly College of the City Uiversity of New York December, 205 Cotets The atural expoetial fuctio for real x. Beroulli s iequality.....................................2

More information

Notes 19 : Martingale CLT

Notes 19 : Martingale CLT Notes 9 : Martigale CLT Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces: [Bil95, Chapter 35], [Roc, Chapter 3]. Sice we have ot ecoutered weak covergece i some time, we first recall

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

MAT1026 Calculus II Basic Convergence Tests for Series

MAT1026 Calculus II Basic Convergence Tests for Series MAT026 Calculus II Basic Covergece Tests for Series Egi MERMUT 202.03.08 Dokuz Eylül Uiversity Faculty of Sciece Departmet of Mathematics İzmir/TURKEY Cotets Mootoe Covergece Theorem 2 2 Series of Real

More information

Fall 2013 MTH431/531 Real analysis Section Notes

Fall 2013 MTH431/531 Real analysis Section Notes Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters

More information

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3 MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities Chapter 5 Iequalities 5.1 The Markov ad Chebyshev iequalities As you have probably see o today s frot page: every perso i the upper teth percetile ears at least 1 times more tha the average salary. I other

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector Summary ad Discussio o Simultaeous Aalysis of Lasso ad Datzig Selector STAT732, Sprig 28 Duzhe Wag May 4, 28 Abstract This is a discussio o the work i Bickel, Ritov ad Tsybakov (29). We begi with a short

More information

AN EXTENSION OF SIMONS INEQUALITY AND APPLICATIONS. Robert DEVILLE and Catherine FINET

AN EXTENSION OF SIMONS INEQUALITY AND APPLICATIONS. Robert DEVILLE and Catherine FINET 2001 vol. XIV, um. 1, 95-104 ISSN 1139-1138 AN EXTENSION OF SIMONS INEQUALITY AND APPLICATIONS Robert DEVILLE ad Catherie FINET Abstract This article is devoted to a extesio of Simos iequality. As a cosequece,

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Application to Random Graphs

Application to Random Graphs A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

1 Convergence in Probability and the Weak Law of Large Numbers

1 Convergence in Probability and the Weak Law of Large Numbers 36-752 Advaced Probability Overview Sprig 2018 8. Covergece Cocepts: i Probability, i L p ad Almost Surely Istructor: Alessadro Rialdo Associated readig: Sec 2.4, 2.5, ad 4.11 of Ash ad Doléas-Dade; Sec

More information

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5

Ma 4121: Introduction to Lebesgue Integration Solutions to Homework Assignment 5 Ma 42: Itroductio to Lebesgue Itegratio Solutios to Homework Assigmet 5 Prof. Wickerhauser Due Thursday, April th, 23 Please retur your solutios to the istructor by the ed of class o the due date. You

More information

ON POINTWISE BINOMIAL APPROXIMATION

ON POINTWISE BINOMIAL APPROXIMATION Iteratioal Joural of Pure ad Applied Mathematics Volume 71 No. 1 2011, 57-66 ON POINTWISE BINOMIAL APPROXIMATION BY w-functions K. Teerapabolar 1, P. Wogkasem 2 Departmet of Mathematics Faculty of Sciece

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Notes 27 : Brownian motion: path properties

Notes 27 : Brownian motion: path properties Notes 27 : Browia motio: path properties Math 733-734: Theory of Probability Lecturer: Sebastie Roch Refereces:[Dur10, Sectio 8.1], [MP10, Sectio 1.1, 1.2, 1.3]. Recall: DEF 27.1 (Covariace) Let X = (X

More information

Sequences and Limits

Sequences and Limits Chapter Sequeces ad Limits Let { a } be a sequece of real or complex umbers A ecessary ad sufficiet coditio for the sequece to coverge is that for ay ɛ > 0 there exists a iteger N > 0 such that a p a q

More information

4.3 Growth Rates of Solutions to Recurrences

4.3 Growth Rates of Solutions to Recurrences 4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Lecture 2. The Lovász Local Lemma

Lecture 2. The Lovász Local Lemma Staford Uiversity Sprig 208 Math 233A: No-costructive methods i combiatorics Istructor: Ja Vodrák Lecture date: Jauary 0, 208 Origial scribe: Apoorva Khare Lecture 2. The Lovász Local Lemma 2. Itroductio

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURE 23. SOME CONSEQUENCES OF ONLINE NO-REGRET METHODS I this lecture, we explore some cosequeces of the developed techiques.. Covex optimizatio Wheever

More information

Recurrence Relations

Recurrence Relations Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial(-)); } Let t be the umber of multiplicatios eeded to calculate factorial(). The

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013 Fuctioal Law of Large Numbers. Costructio of the Wieer Measure Cotet. 1. Additioal techical results o weak covergece

More information

Chapter 10: Power Series

Chapter 10: Power Series Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

Law of the sum of Bernoulli random variables

Law of the sum of Bernoulli random variables Law of the sum of Beroulli radom variables Nicolas Chevallier Uiversité de Haute Alsace, 4, rue des frères Lumière 68093 Mulhouse icolas.chevallier@uha.fr December 006 Abstract Let be the set of all possible

More information

Regression with an Evaporating Logarithmic Trend

Regression with an Evaporating Logarithmic Trend Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,

More information

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018 CSE 353 Discrete Computatioal Structures Sprig 08 Sequeces, Mathematical Iductio, ad Recursio (Chapter 5, Epp) Note: some course slides adopted from publisher-provided material Overview May mathematical

More information

7 Sequences of real numbers

7 Sequences of real numbers 40 7 Sequeces of real umbers 7. Defiitios ad examples Defiitio 7... A sequece of real umbers is a real fuctio whose domai is the set N of atural umbers. Let s : N R be a sequece. The the values of s are

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Information Theory and Statistics Lecture 4: Lempel-Ziv code

Information Theory and Statistics Lecture 4: Lempel-Ziv code Iformatio Theory ad Statistics Lecture 4: Lempel-Ziv code Łukasz Dębowski ldebowsk@ipipa.waw.pl Ph. D. Programme 203/204 Etropy rate is the limitig compressio rate Theorem For a statioary process (X i)

More information

Lecture Notes for Analysis Class

Lecture Notes for Analysis Class Lecture Notes for Aalysis Class Topological Spaces A topology for a set X is a collectio T of subsets of X such that: (a) X ad the empty set are i T (b) Uios of elemets of T are i T (c) Fiite itersectios

More information