arXiv:1109.5647v6 [cs.LG] 5 Jul 2012


Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Alexander Rakhlin, University of Pennsylvania (rakhlin@wharton.upenn.edu)
Ohad Shamir, Microsoft Research New England (ohadsh@microsoft.com)
Karthik Sridharan, University of Pennsylvania (skarthik@wharton.upenn.edu)

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be $O(\log(T)/T)$, by running SGD for $T$ iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal $O(1/T)$ rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal $O(1/T)$ rate. However, for non-smooth problems, the convergence rate with averaging might really be $\Omega(\log(T)/T)$, and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the $O(1/T)$ rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

1. Introduction

Stochastic gradient descent (SGD) is one of the simplest and most popular first-order methods to solve convex learning problems. Given a convex loss function and a training set of $T$ examples, SGD can be used to obtain a sequence of $T$ predictors, whose average has a generalization error which converges (with $T$) to the optimal one in the class of predictors we consider. The common framework to analyze such first-order algorithms is via stochastic optimization, where our goal is to optimize an unknown convex function $F$, given only unbiased estimates of $F$'s subgradients (see Sec. 2 for a more precise definition).

An important special case is when $F$ is strongly convex (intuitively, can be lower bounded by a quadratic function). Such functions arise, for instance, in Support Vector Machines and other regularized learning algorithms. For such problems, there is a well-known $O(\log(T)/T)$ convergence guarantee for SGD with averaging. This rate is obtained using the analysis of the algorithm in the harder setting of online learning (Hazan et al., 2007), combined with an online-to-batch conversion (see (Hazan & Kale, 2011) for more details).

Surprisingly, a recent paper by Hazan and Kale (Hazan & Kale, 2011) showed that in fact, an $O(\log(T)/T)$ rate is not the best that one can achieve for strongly convex stochastic problems. In particular, an optimal $O(1/T)$ rate can be obtained using a different algorithm, which is somewhat similar to SGD but is more complex (although with comparable computational complexity). A very similar algorithm was also presented recently by Juditsky and Nesterov (Juditsky & Nesterov, 2010). Roughly speaking, the algorithm divides the $T$ iterations into exponentially increasing epochs, and runs stochastic gradient descent with averaging on each one. The resulting point of each epoch is used as the starting point of the next epoch. The algorithm returns the resulting point of the last epoch.
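The epoch structure described above can be sketched in a few lines. This is only an illustration of that verbal description, not the exact algorithm of Hazan & Kale or Juditsky & Nesterov: the function names, the doubling epoch lengths, the constant step size within an epoch, and the halving of the step size between epochs are all assumptions made for the sketch.

```python
def sgd_average(oracle, w, steps, eta):
    """Run `steps` SGD updates from w with constant step size eta.

    Returns the average of the `steps` iterates visited (starting point included).
    """
    total = [0.0] * len(w)
    for _ in range(steps):
        total = [s + wi for s, wi in zip(total, w)]
        g = oracle(w)  # stochastic (sub)gradient estimate at w
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return [s / steps for s in total]


def epoch_scheme(oracle, w0, T, eta0, first_len=1):
    """Split a budget of T oracle calls into epochs of doubling length.

    Each epoch runs SGD with averaging and hands its average to the next
    epoch as the starting point. Assumption: the step size is halved after
    every epoch. Calls left over after the last full epoch are not used.
    """
    w, used, length, eta = list(w0), 0, first_len, eta0
    while used + length <= T:
        w = sgd_average(oracle, w, length, eta)
        used += length
        length *= 2
        eta /= 2.0
    return w
```

For example, with the noiseless oracle `lambda w: list(w)` (the gradient of $\frac{1}{2}\|w\|^2$), a budget of `T=63` yields epochs of lengths 1, 2, 4, 8, 16 and 32.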

These results left an important gap: Namely, whether the true convergence rate of SGD, possibly with some sort of averaging, might also be $O(1/T)$, and the known $O(\log(T)/T)$ result is just an artifact of the analysis. Indeed, the whole motivation of (Hazan & Kale, 2011) was that the standard online analysis is too loose to analyze the stochastic setting properly. Perhaps a similar looseness applies to the analysis of SGD as well? This question has immediate practical relevance: if the new algorithms enjoy a better rate than SGD, it might indicate they will work better in practice, and that practitioners should abandon SGD in favor of them.

In this paper, we study the convergence rate of SGD for stochastic strongly convex problems, with the following contributions:

- First, we extend known results to show that if $F$ is not only strongly convex, but also smooth (with respect to the optimum), then SGD with and without averaging achieves the optimal $O(1/T)$ convergence rate.
- We then show that for non-smooth $F$, there are cases where the convergence rate of SGD with averaging is $\Omega(\log(T)/T)$. In other words, the $O(\log(T)/T)$ bound for general strongly convex problems is real, and not just an artifact of the currently-known analysis.
- However, we show that one can recover the optimal $O(1/T)$ convergence rate (in expectation and in high probability) by a simple modification of the averaging step: Instead of averaging of $T$ points, we only average the last $\alpha T$ points, where $\alpha \in (0,1)$ is arbitrary. Thus, to obtain an optimal rate, one does not need to use an algorithm significantly different than SGD, such as those discussed earlier.
- We perform an empirical study on both artificial and real-world data, which supports our findings.

Moreover, our rate upper bounds are shown to hold in expectation, as well as in high probability (up to a $\log(\log(T))$ factor). While the focus here is on getting the optimal rate in terms of $T$, we note that our upper bounds are also optimal in terms of other standard problem parameters, such as the strong convexity parameter and the variance of the stochastic gradients.

Following the paradigm of (Hazan & Kale, 2011), we analyze the algorithm directly in the stochastic setting, and avoid an online analysis with an online-to-batch conversion. This also allows us to prove results which are more general. In particular, the standard online analysis of SGD requires the step size of the algorithm at round $t$ to equal $1/(\lambda t)$, where $\lambda$ is the strong convexity parameter of $F$. In contrast, our analysis copes with any step size $c/(\lambda t)$, as long as $c$ is not too small.

In terms of related work, we note that the performance of SGD in a stochastic setting has been extensively researched in stochastic approximation theory (see for instance (Kushner & Yin, 2003)). However, these results are usually obtained under smoothness assumptions, and are often asymptotic, so we do not get an explicit bound in terms of $T$ which applies to our setting. We also note that a finite-sample analysis of SGD in the stochastic setting was recently presented in (Bach & Moulines, 2011). However, the focus there was different than ours, and also obtained bounds which hold only in expectation rather than in high probability. More importantly, the analysis was carried out under stronger smoothness assumptions than our analysis, and to the best of our understanding, does not apply to general, possibly non-smooth, strongly convex stochastic optimization problems. For example, smoothness assumptions may not cover the application of SGD to support vector machines (as in (Shalev-Shwartz et al., 2011)), since it uses a non-smooth loss function, and thus the underlying function $F$ we are trying to stochastically optimize may not be smooth.

2. Preliminaries

We use bold-face letters to denote vectors. Given some vector $w$, we use $w_i$ to denote its $i$-th coordinate. Similarly, given some indexed vector $w_t$, we let $w_{t,i}$ denote its $i$-th coordinate. We let $\mathbf{1}_A$ denote the indicator function for some event $A$.

We consider the standard setting of convex stochastic optimization, using first-order methods. Our goal is to minimize a convex function $F$ over some convex domain $\mathcal{W}$ (which is assumed to be a subset of some Hilbert space). However, we do not know $F$, and the only information available is through a stochastic gradient oracle, which given some $w \in \mathcal{W}$, produces a vector $\hat{g}$, whose expectation $\mathbb{E}[\hat{g}] = g$ is a subgradient of $F$ at $w$. Using a bounded number $T$ of calls to this oracle, we wish to find a point $w_T$ such that $F(w_T)$ is as small as possible. In particular, we will assume that $F$ attains a minimum at some $w^* \in \mathcal{W}$, and our analysis provides bounds on $F(w_T) - F(w^*)$ either in expectation or in high probability (the high probability results are stronger, but require more effort and have slightly worse dependence on some problem parameters). The application of this framework to learning is straightforward (see for instance (Shalev-Shwartz et al., 2009)):

given a hypothesis class $\mathcal{W}$ and a set of $T$ i.i.d. examples, we wish to find a predictor $w$ whose expected loss $F(w)$ is close to optimal over $\mathcal{W}$. Since the examples are chosen i.i.d., the subgradient of the loss function with respect to any individual example can be shown to be an unbiased estimate of a subgradient of $F$.

We will focus on an important special case of the problem, characterized by $F$ being a strongly convex function. Formally, we say that a function $F$ is $\lambda$-strongly convex, if for all $w, w' \in \mathcal{W}$ and any subgradient $g$ of $F$ at $w$,

$$F(w') \ge F(w) + \langle g, w' - w \rangle + \frac{\lambda}{2}\|w' - w\|^2. \qquad (1)$$

Another possible property of $F$ we will consider is smoothness, at least with respect to the optimum $w^*$. Formally, a function $F$ is $\mu$-smooth with respect to $w^*$ if for all $w \in \mathcal{W}$,

$$F(w) - F(w^*) \le \frac{\mu}{2}\|w - w^*\|^2. \qquad (2)$$

Such functions arise, for instance, in logistic and least-squares regression, and in general for learning linear predictors where the loss function has a Lipschitz-continuous gradient.

The algorithm we focus on is stochastic gradient descent (SGD). The SGD algorithm is parameterized by step sizes $\eta_1, \ldots, \eta_T$, and is defined as follows:

- Initialize $w_1 \in \mathcal{W}$ arbitrarily (or randomly).
- For $t = 1, \ldots, T$: Query the stochastic gradient oracle at $w_t$ to get a random $\hat{g}_t$ such that $\mathbb{E}[\hat{g}_t] = g_t$ is a subgradient of $F$ at $w_t$. Let $w_{t+1} = \Pi_{\mathcal{W}}(w_t - \eta_t \hat{g}_t)$, where $\Pi_{\mathcal{W}}$ is the projection operator on $\mathcal{W}$.

This algorithm returns a sequence of points $w_1, \ldots, w_T$. To obtain a single point, one can use several strategies. Perhaps the simplest one is to return the last point, $w_{T+1}$. Another procedure, for which the standard online analysis of SGD applies (Hazan et al., 2007), is to return the average point

$$\bar{w}_T = \frac{1}{T}(w_1 + \ldots + w_T).$$

For stochastic optimization of $\lambda$-strongly convex functions, the standard analysis (through online learning) focuses on the step size $\eta_t$ being exactly $1/(\lambda t)$ (Hazan et al., 2007). Our analysis will consider more general step-sizes $c/(\lambda t)$, where $c$ is a constant. We note that a step size of $\Theta(1/t)$ is necessary for the algorithm to obtain an optimal convergence rate (see Appendix A).

In general, we will assume that regardless of how $w_1$ is initialized, it holds that $\mathbb{E}[\|\hat{g}_t\|^2] \le G^2$ for some fixed constant $G$. Note that this is a somewhat weaker assumption than (Hazan & Kale, 2011), which required that $\|\hat{g}_t\|^2 \le G^2$ with probability 1, since we focus only on bounds which hold in expectation. These types of assumptions are common in the literature, and are generally implied by taking $\mathcal{W}$ to be a bounded domain, or alternatively, assuming that $w_1$ is initialized not too far from $w^*$ and $F$ satisfies certain technical conditions (see for instance the proof of Theorem 1 in (Shalev-Shwartz et al., 2011)).

Full proofs of our results are provided in Appendix B.

3. Smooth Functions

We begin by considering the case where the expected function $F(\cdot)$ is both strongly convex and smooth with respect to $w^*$. Our starting point is to show a $O(1/T)$ rate for the last point obtained by SGD. This result is well known in the literature (see for instance (Nemirovski et al., 2009)) and we include a proof for completeness. Later on, we will show how to extend it to a high-probability bound.

Theorem 1. Suppose $F$ is $\lambda$-strongly convex and $\mu$-smooth with respect to $w^*$ over a convex set $\mathcal{W}$, and that $\mathbb{E}[\|\hat{g}_t\|^2] \le G^2$. Then if we pick $\eta_t = c/(\lambda t)$ for some constant $c > 1/2$, it holds for any $T$ that

$$\mathbb{E}[F(w_T) - F(w^*)] \le \frac{1}{2}\max\left\{4, \frac{c}{2 - 1/c}\right\}\frac{\mu G^2}{\lambda^2 T}.$$

The theorem is an immediate corollary of the following key lemma, and the definition of $\mu$-smoothness with respect to $w^*$.

Lemma 1. Suppose $F$ is $\lambda$-strongly convex over a convex set $\mathcal{W}$, and that $\mathbb{E}[\|\hat{g}_t\|^2] \le G^2$. Then if we pick $\eta_t = c/(\lambda t)$ for some constant $c > 1/2$, it holds for any $T$ that

$$\mathbb{E}\left[\|w_T - w^*\|^2\right] \le \max\left\{4, \frac{c}{2 - 1/c}\right\}\frac{G^2}{\lambda^2 T}.$$

We now turn to discuss the behavior of the average point $\bar{w}_T = (w_1 + \ldots + w_T)/T$, and show that for smooth $F$, it also enjoys an optimal $O(1/T)$ convergence rate (with even better dependence on $c$).

Theorem 2. Suppose $F$ is $\lambda$-strongly convex and $\mu$-smooth with respect to $w^*$ over a convex set $\mathcal{W}$, and that $\mathbb{E}[\|\hat{g}_t\|^2] \le G^2$. Then if we pick $\eta_t = c/(\lambda t)$ for

some constant $c > 1/2$, $\mathbb{E}[F(\bar{w}_T) - F(w^*)]$ is at most

$$\max\{\,[\ldots]\,\}\cdot\frac{1}{T}.$$

A rough proof intuition is the following: Lemma 1 implies that the Euclidean distance of $w_t$ from $w^*$ is on the order of $1/\sqrt{t}$, so the squared distance of $\bar{w}_T$ from $w^*$ is on the order of $\left(\frac{1}{T}\sum_{t=1}^{T} 1/\sqrt{t}\right)^2 \approx 1/T$, and the rest follows from smoothness.

4. Non-Smooth Functions

We now turn to discuss the more general case where the function $F$ may not be smooth (i.e. there is no constant $\mu$ which satisfies Eq. (2) uniformly for all $w \in \mathcal{W}$). In the context of learning, this may happen when we try to learn a predictor with respect to a non-smooth loss function, such as the hinge loss.

As discussed earlier, SGD with averaging is known to have a rate of at most $O(\log(T)/T)$. In the previous section, we saw that for smooth $F$, the rate is actually $O(1/T)$. Moreover, (Hazan & Kale, 2011) showed that by using a different algorithm than SGD, one can obtain a rate of $O(1/T)$ even in the non-smooth case. This might lead us to believe that an $O(1/T)$ rate for SGD is possible in the non-smooth case, and that the $O(\log(T)/T)$ analysis is simply not tight.

However, this intuition turns out to be wrong. Below, we show that there are strongly convex stochastic optimization problems in Euclidean space, in which the convergence rate of SGD with averaging is lower bounded by $\Omega(\log(T)/T)$. Thus, the logarithm in the bound is not merely a shortcoming in the standard online analysis of SGD, but is really a property of the algorithm.

We begin with the following relatively simple example, which shows the essence of the idea. Let $F$ be the 1-strongly convex function

$$F(w) = \frac{1}{2}\|w\|^2 + w_1,$$

over the domain $\mathcal{W} = [0,1]^d$, which has a global minimum at $0$. Suppose the stochastic gradient oracle, given a point $w_t$, returns the gradient estimate

$$\hat{g}_t = w_t + (Z_t, 0, \ldots, 0),$$

where $Z_t$ is uniformly distributed over $[-1, 3]$. It is easily verified that $\mathbb{E}[\hat{g}_t]$ is a subgradient of $F(w_t)$, and that $\mathbb{E}[\|\hat{g}_t\|^2] \le d + 5$ which is a bounded quantity for fixed $d$.

The following theorem implies that in this case, the convergence rate of SGD with averaging has a $\Omega(\log(T)/T)$ lower bound. The intuition for this is that the global optimum lies at a corner of $\mathcal{W}$, so SGD approaches it only from one direction. As a result, averaging the points returned by SGD actually hurts us.

Theorem 3. Consider the strongly convex stochastic optimization problem presented above. If SGD is initialized at any point in $\mathcal{W}$, and ran with $\eta_t = c/t$, then for any $T \ge T_0 + 1$, where $T_0 = \max\{2, c/2\}$, we have

$$\mathbb{E}[F(\bar{w}_T) - F(w^*)] \ge \frac{c}{16T}\sum_{t=T_0}^{T-1}\frac{1}{t}.$$

When $c$ is considered a constant, this lower bound is $\Omega(\log(T)/T)$.

While the lower bound scales with $c$, we remind the reader that one must pick $\eta_t = c/t$ with constant $c$ for an optimal convergence rate in general (see discussion in Sec. 2).

This example is relatively straightforward but not fully satisfying, since it crucially relies on the fact that $w^*$ is on the border of $\mathcal{W}$. In strongly convex problems, $w^*$ usually lies in the interior of $\mathcal{W}$, so perhaps the $\Omega(\log(T)/T)$ lower bound does not hold in such cases. Our main result, presented below, shows that this is not the case, and that even if $w^*$ is well inside the interior of $\mathcal{W}$, an $\Omega(\log(T)/T)$ rate for SGD with averaging can be unavoidable. The intuition is that we construct a non-smooth $F$, which forces $w_t$ to approach the optimum from just one direction, creating the same effect as in the previous example.

In particular, let $F$ be the 1-strongly convex function

$$F(w) = \frac{1}{2}\|w\|^2 + \begin{cases} w_1 & w_1 \ge 0 \\ -7w_1 & w_1 < 0, \end{cases}$$

over the domain $\mathcal{W} = [-1,1]^d$, which has a global minimum at $0$. Suppose the stochastic gradient oracle, given a point $w_t$, returns the gradient estimate

$$\hat{g}_t = w_t + \begin{cases} (Z_t, 0, \ldots, 0) & w_{t,1} \ge 0 \\ (-7, 0, \ldots, 0) & w_{t,1} < 0, \end{cases}$$

where $Z_t$ is a random variable uniformly distributed over $[-1, 3]$. It is easily verified that $\mathbb{E}[\hat{g}_t]$ is a subgradient of $F(w_t)$, and that $\mathbb{E}[\|\hat{g}_t\|^2] \le d + 63$ which is a bounded quantity for fixed $d$.

Theorem 4. Consider the strongly convex stochastic optimization problem presented above. If SGD is initialized at any point $w_1$ with $w_{1,1} \ge 0$, and ran with $\eta_t = c/t$, then for any $T \ge T_0 + 2$, where $T_0 = \max\{2, 6c + 1\}$, we have

$$\mathbb{E}[F(\bar{w}_T) - F(w^*)] \ge \frac{3c}{16T}\sum_{t=T_0+2}^{T}\left(\frac{1}{t}\right) - \frac{T_0}{T}.$$
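The first construction above is easy to simulate. The following is a minimal sketch, assuming the oracle $\hat{g}_t = w_t + (Z_t, 0, \ldots, 0)$ with $Z_t$ uniform on $[-1, 3]$, the domain $[0,1]^d$ and step sizes $\eta_t = c/t$; the function names, the seed handling and the random initialization are illustrative choices, not part of the paper.

```python
import random


def objective(w):
    """F(w) = 0.5 * ||w||^2 + w_1, minimized over [0, 1]^d at w = 0."""
    return 0.5 * sum(wi * wi for wi in w) + w[0]


def sgd_theorem3_example(T, c=1.0, d=1, seed=0):
    """Projected SGD with eta_t = c / t on the example above.

    Returns (average of w_1, ..., w_T, last iterate w_{T+1}).
    """
    rng = random.Random(seed)
    w = [rng.random() for _ in range(d)]  # w_1 drawn from [0, 1)^d (assumption)
    total = [0.0] * d
    for t in range(1, T + 1):
        total = [s + wi for s, wi in zip(total, w)]
        eta = c / t
        g = list(w)                       # gradient of 0.5 * ||w||^2
        g[0] += rng.uniform(-1.0, 3.0)    # noisy estimate of the "+ w_1" term
        # Projection onto the box [0, 1]^d is coordinate-wise clipping.
        w = [min(1.0, max(0.0, wi - eta * gi)) for wi, gi in zip(w, g)]
    return [s / T for s in total], w
```

Averaging `objective(avg) * T` over many seeds and plotting it against `log(T)` is one way to look for the logarithmic growth that the lower bound predicts for the averaged iterate.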

When $c$ is considered a constant, this lower bound is $\Omega(\log(T)/T)$.

We note that the requirement of $w_{1,1} \ge 0$ is just for convenience, and the analysis also carries through, with some second-order factors, if we let $w_{1,1} < 0$.

5. Recovering an $O(1/T)$ Rate for SGD with $\alpha$-Suffix Averaging

In the previous section, we showed that SGD with averaging may have a rate of $\Omega(\log(T)/T)$ for non-smooth $F$. To get the optimal $O(1/T)$ rate for any $F$, we might turn to the algorithms of (Hazan & Kale, 2011) and (Juditsky & Nesterov, 2010). However, these algorithms constitute a significant departure from standard SGD. In this section, we show that it is actually possible to get an $O(1/T)$ rate using a much simpler modification of the algorithm: given the sequence of points $w_1, \ldots, w_T$ provided by SGD, instead of returning the average $\bar{w}_T = (w_1 + \ldots + w_T)/T$, we average and return just a suffix, namely

$$\bar{w}^{\alpha}_T = \frac{w_{(1-\alpha)T+1} + \ldots + w_T}{\alpha T},$$

for some constant $\alpha \in (0,1)$ (assuming $\alpha T$ and $(1-\alpha)T$ are integers). We call this procedure $\alpha$-suffix averaging.

Theorem 5. Consider SGD with $\alpha$-suffix averaging as described above, and with step sizes $\eta_t = c/(\lambda t)$ where $c > 1/2$ is a constant. Suppose $F$ is $\lambda$-strongly convex, and that $\mathbb{E}[\|\hat{g}_t\|^2] \le G^2$ for all $t$. Then for any $T$, it holds that

$$\mathbb{E}[F(\bar{w}^{\alpha}_T) - F(w^*)] \le \frac{c' + \left(\frac{c}{2} + c'\right)\log\left(\frac{1}{1-\alpha}\right)}{\alpha}\cdot\frac{G^2}{\lambda T},$$

where $c' = \max\left\{\frac{2}{c}, \frac{1}{4 - 2/c}\right\}$.

Note that for any constant $\alpha \in (0,1)$, the bound above is $O(G^2/(\lambda T))$. This applies to any relevant step size $c/(\lambda t)$, and matches the optimal guarantees in (Hazan & Kale, 2011) up to constant factors. However, this is shown for standard SGD, as opposed to the more specialized algorithm of (Hazan & Kale, 2011). Finally, we note that it might be tempting to use Thm. 5 as a guide to choose the averaging window, by optimizing the bound for $\alpha$ (for instance, for $c = 1$, the optimum is achieved around $\alpha \approx 0.65$). However, we note that the optimal value of $\alpha$ is dependent on the constants in the bound, which may not be the tightest or most correct ones.

Proof Sketch. The proof combines the analysis of online gradient descent (Hazan et al., 2007) and Lemma 1. In particular, starting as in the proof of Lemma 1, and extracting the inner products, we get

$$\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\langle g_t, w_t - w^*\rangle] \le \sum_{t=(1-\alpha)T+1}^{T}\left(\frac{\eta_t G^2}{2} + \frac{\mathbb{E}[\|w_t - w^*\|^2]}{2\eta_t} - \frac{\mathbb{E}[\|w_{t+1} - w^*\|^2]}{2\eta_t}\right). \qquad (3)$$

Rearranging the r.h.s., and using the convexity of $F$ to relate the l.h.s. to $\mathbb{E}[F(\bar{w}^{\alpha}_T) - F(w^*)]$, we get a convergence upper bound of

$$\frac{1}{2\alpha T}\left(\frac{\mathbb{E}[\|w_{(1-\alpha)T+1} - w^*\|^2]}{\eta_{(1-\alpha)T+1}} + G^2\sum_{t=(1-\alpha)T+1}^{T}\eta_t + \sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\|w_t - w^*\|^2]\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right)\right).$$

Lemma 1 tells us that with any strongly convex $F$, even non-smooth, we have $\mathbb{E}[\|w_t - w^*\|^2] \le O(1/t)$. Plugging this in and performing a few more manipulations, the result follows.

One potential disadvantage of suffix averaging is that if we cannot store all the iterates $w_t$ in memory, then we need to know from which iterate $\alpha T$ to start computing the suffix average (in contrast, standard averaging can be computed on-the-fly without knowing the stopping time $T$ in advance). However, even if $T$ is not known, this can be easily addressed in several ways. For example, since our results are robust to the value of $\alpha$, it is really enough to guess when we passed some constant portion of all iterates. Alternatively, one can divide the rounds into exponentially increasing epochs, and maintain the average just of the current epoch. Such an average would always correspond to a constant-portion suffix of all iterates.

6. High-Probability Bounds

All our previous bounds were on the expected suboptimality $\mathbb{E}[F(w) - F(w^*)]$ of an appropriate predictor $w$. We now outline how these results can be strengthened to bounds on $F(w) - F(w^*)$ which hold with arbitrarily high probability $1 - \delta$, with the bound depending logarithmically on $\delta$. They are slightly worse than our in-expectation bounds by having worse dependence on the step size parameter $c$ and an additional $\log(\log(T))$ factor (interestingly, a similar factor also appears in the analysis of (Hazan & Kale, 2011), and we do not

know if it is necessary). The key result is the following strengthening of Lemma 1, under slightly stronger technical conditions.

Lemma 2. Let $\delta \in (0, 1/e)$ and $T \ge 4$. Suppose $F$ is $\lambda$-strongly convex over a convex set $\mathcal{W}$, and that $\|\hat{g}_t\|^2 \le G^2$ with probability 1. Then if we pick $\eta_t = c/(\lambda t)$ for some constant $c > 1/2$, such that $2c$ is a whole number, it holds with probability at least $1 - \delta$ that for any $t \in \{4c^2 + 4c, \ldots, T-1, T\}$ that

$$\|w_t - w^*\|^2 \le \frac{12c^2G^2}{\lambda^2 t} + 8(121G + 1)G\,\frac{c\log(\log(t)/\delta)}{\lambda t}.$$

We note that the assumptions on $c$ and $t$ are only for simplifying the result. To obtain high probability versions of Thm. 1, Thm. 2, and Thm. 5, we simply need to plug in this lemma in lieu of Lemma 1 in their proofs. This leads overall to rates of the form $O(\log(\log(T)/\delta)/T)$ which hold with probability $1 - \delta$.

7. Experiments

We now turn to empirically study how the algorithms behave, and compare it to our theoretical findings. We studied the following four algorithms:

1. Sgd-A: Performing SGD and then returning the average point over all $T$ rounds.
2. Sgd-α: Performing SGD with $\alpha$-suffix averaging. We chose $\alpha = 1/2$ - namely, we return the average point over the last $T/2$ rounds.
3. Sgd-L: Performing SGD and returning the point obtained in the last round.
4. Epoch-Gd: The optimal algorithm of (Hazan & Kale, 2011) for strongly convex stochastic optimization.

First, as a simple sanity check, we measured the performance of these algorithms on a simple, strongly convex stochastic optimization problem, which is also smooth. We define $\mathcal{W} = [-1,1]^5$, and $F(w) = \|w\|^2$. The stochastic gradient oracle, given a point $w$, returns the stochastic gradient $2w + z$ where $z$ is uniformly distributed in $[-1,1]^5$. Clearly, this is an unbiased estimate of the gradient of $F$ at $w$. The initial point $w_1$ of all 4 algorithms was chosen uniformly at random from $\mathcal{W}$. The results are presented in Fig. 1, and it is clear that all 4 algorithms indeed achieve a $\Theta(1/T)$ rate, matching our theoretical analysis (Thm. 1, Thm. 2 and Thm. 5). The results also seem to indicate that Sgd-A has a somewhat worse performance in terms of leading constants.

Figure 1. Results for smooth strongly convex stochastic optimization problem (curves: Sgd-A, Sgd-α, Sgd-L, Epoch-Gd). The experiment was repeated 10 times, and we report the mean and standard deviation for each choice of $T$. The X-axis is the log-number of rounds $\log(T)$, and the Y-axis is $(F(w_T) - F(w^*)) \cdot T$. The scaling by $T$ means that a roughly constant graph corresponds to a $\Theta(1/T)$ rate, whereas a linearly increasing graph corresponds to a $\Theta(\log(T)/T)$ rate.

Second, as another simple experiment, we measured the performance of the algorithms on the non-smooth, strongly convex problem described in the proof of Thm. 4. In particular, we simulated this problem with $d = 5$, and picked $w_1$ uniformly at random from $\mathcal{W}$. The results are presented in Fig. 2. As our theory indicates, Sgd-A seems to have an $\Theta(\log(T)/T)$ convergence rate, whereas the other 3 algorithms all seem to have the optimal $\Theta(1/T)$ convergence rate. Among these algorithms, the SGD variants Sgd-L and Sgd-α seem to perform somewhat better than Epoch-Gd. Also, while the average performance of Sgd-L and Sgd-α are similar, Sgd-α has less variance. This is reasonable, considering the fact that Sgd-α returns an average of many points, whereas Sgd-L returns only the very last point.

Figure 2. Results for the non-smooth strongly convex stochastic optimization problem (curves: Sgd-A, Sgd-α, Sgd-L, Epoch-Gd). The experiment was repeated 10 times, and we report the mean and standard deviation for each choice of $T$. The X-axis is the log-number of rounds $\log(T)$, and the Y-axis is $(F(w_T) - F(w^*)) \cdot T$. The scaling by $T$ means that a roughly constant graph corresponds to a $\Theta(1/T)$ rate, whereas a linearly increasing graph corresponds to a $\Theta(\log(T)/T)$ rate.

Finally, we performed a set of experiments on real-world data. We used the same 3 binary classification datasets (ccat, cov1 and astro-ph) used by (Shalev-Shwartz et al., 2011) and (Joachims, 2006), to test the performance of optimization algorithms for Support Vector Machines using linear kernels. Each of these datasets is composed of a training set and a test set. Given a training set of instance-label pairs, $\{x_i, y_i\}_{i=1}^{m}$, we defined $F$ to be the standard (non-smooth) objective function of Support Vector Machines, namely

$$F(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\max\{0, 1 - y_i\langle x_i, w\rangle\}. \qquad (4)$$

Following (Shalev-Shwartz et al., 2011) and (Joachims, 2006), we took $\lambda = 10^{-4}$ for ccat, $\lambda = 10^{-6}$ for cov1, and $\lambda = 5 \times 10^{-5}$ for astro-ph. The stochastic gradient given $w_t$ was computed by taking a single randomly drawn training example $(x_i, y_i)$, and computing the gradient with respect to that example, namely

$$\hat{g}_t = \lambda w_t - \mathbf{1}_{y_i\langle x_i, w_t\rangle \le 1}\, y_i x_i.$$

Each dataset comes with a separate test set, and we also report the objective function value with respect to that set (as in Eq. (4), this time with $\{x_i, y_i\}$ representing the test set examples). All algorithms were initialized at $w_1 = 0$, with $\mathcal{W} = \mathbb{R}^d$ (i.e. no projections were performed - see the discussion in Sec. 2).

The results of the experiments are presented in Fig. 3, Fig. 4 and Fig. 5. In all experiments, Sgd-A performed the worst. The other 3 algorithms performed rather similarly, with Sgd-α being slightly better on the Cov1 dataset, and Sgd-L being slightly better on the other 2 datasets.

In summary, our experiments indicate the following:

Sgd-A, which averages over all $T$ predictors, is worse than the other approaches. This accords with our theory, as well as the results reported in (Shalev-Shwartz et al., 2011).

The Epoch-Gd algorithm does have better performance than Sgd-A, but a similar or better performance was obtained using the simpler approaches of $\alpha$-suffix averaging (Sgd-α) or even just returning the last predictor (Sgd-L). The good performance of Sgd-α is supported by our theoretical results, and so is the performance of Sgd-L in the strongly convex and smooth case.

Figure 3. Results for the astro-ph dataset (panels: ASTRO Training Loss, ASTRO Test Loss; curves: Sgd-A, Sgd-α, Sgd-L, Epoch-Gd). The left row refers to the average loss on the training data, and the right row refers to the average loss on the test data. Each experiment was repeated 10 times, and we report the mean and standard deviation for each choice of $T$. The X-axis is the log-number of rounds $\log(T)$, and the Y-axis is the log of the objective function $\log(F(w_T))$.

Figure 4. Results for the ccat dataset (panels: CCAT Training Loss, CCAT Test Loss). See Fig. 3 caption for details.
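The per-example stochastic subgradient used in these SVM experiments can be written out directly. The sketch below assumes dense feature vectors stored as plain Python lists and labels in $\{-1, +1\}$; the function names are illustrative and are not taken from the paper or from any SVM library.

```python
def svm_stochastic_subgradient(w, x, y, lam):
    """Subgradient at w of (lam / 2) * ||w||^2 + max(0, 1 - y * <x, w>)
    for a single example (x, y)."""
    margin = y * sum(xi * wi for xi, wi in zip(x, w))
    active = 1.0 if margin <= 1.0 else 0.0  # indicator of y * <x, w> <= 1
    return [lam * wi - active * y * xi for wi, xi in zip(w, x)]


def svm_objective(w, data, lam):
    """Regularized average hinge loss over `data`, a list of (x, y) pairs."""
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - y * sum(xi * wi for xi, wi in zip(x, w)))
        for x, y in data
    )
    return reg + hinge / len(data)
```

Drawing the example uniformly at random from the training set makes the expected value of this vector a subgradient of the full objective, which is the unbiasedness property the analysis relies on.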

Figure 5. Results for the cov1 dataset (panels: COV Training Loss, COV Test Loss). See Fig. 3 caption for details.

Sgd-L also performed rather well (with what seems like a $\Theta(1/T)$ rate) on the non-smooth problem reported in Fig. 2, although with a larger variance than Sgd-α. Our current theory does not cover the convergence of the last predictor (in non-smooth problems) - see the discussion below.

8. Discussion

In this paper, we analyzed the behavior of SGD for strongly convex stochastic optimization problems. We demonstrated that this simple and well-known algorithm performs optimally whenever the underlying function is smooth, but the standard averaging step can make it suboptimal for non-smooth problems. However, a simple modification of the averaging step suffices to recover the optimal rate, and a more sophisticated algorithm is not necessary. Our experiments seem to support this conclusion.

There are several open issues remaining. In particular, the $O(1/T)$ rate in the non-smooth case still requires some sort of averaging. However, in our experiments and other studies (e.g. (Shalev-Shwartz et al., 2011)), returning the last iterate $w_T$ also seems to perform quite well. Our current theory does not cover this - at best, one can use Lemma 1 and Jensen's inequality to argue that the last iterate has a $O(1/\sqrt{T})$ rate, but the behavior in practice is clearly much better. Does SGD, without averaging, obtain an $O(1/T)$ rate for general strongly convex problems?
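The modified averaging step is small enough to sketch in full. The code below assumes all iterates are kept in memory as a list of coordinate lists, and rounds $\alpha T$ up when it is not an integer. The second function evaluates the constant multiplying $G^2/(\lambda T)$, assuming the Thm. 5 bound has the form $\big(c' + (c/2 + c')\log(1/(1-\alpha))\big)/\alpha$ with $c' = \max\{2/c, 1/(4 - 2/c)\}$. Both function names are illustrative.

```python
import math


def suffix_average(iterates, alpha=0.5):
    """Average the last ceil(alpha * T) of the T iterates w_1, ..., w_T."""
    T = len(iterates)
    k = max(1, math.ceil(alpha * T))  # assumption: round up if alpha * T is fractional
    tail = iterates[T - k:]
    dim = len(tail[0])
    return [sum(w[i] for w in tail) / k for i in range(dim)]


def theorem5_constant(alpha, c=1.0):
    """Constant in front of G^2 / (lambda * T), under the form assumed above.

    Requires 0 < alpha < 1 and c > 1/2.
    """
    c_prime = max(2.0 / c, 1.0 / (4.0 - 2.0 / c))
    return (c_prime + (c / 2.0 + c_prime) * math.log(1.0 / (1.0 - alpha))) / alpha
```

Scanning `theorem5_constant` over a grid of `alpha` values for `c = 1` places the minimizer between 0.6 and 0.7. As the paper cautions, that value reflects the constants of the bound rather than the best window in practice.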
Also, a fuller empirical study is warranted of whether and which averaging scheme is best in practice.

Acknowledgements: We thank Elad Hazan and Satyen Kale for helpful comments on an earlier version of this paper.

References

Bach, F. and Moulines, E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, 2011.

Bartlett, P.L., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In COLT, 2008.

De La Peña, V.H. A general class of exponential inequalities for martingales and ratios. The Annals of Probability, 27(1):537-564, 1999.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In COLT, 2011.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Joachims, T. Training linear SVMs in linear time. In KDD, 2006.

Juditsky, A. and Nesterov, Y. Primal-dual subgradient methods for minimizing uniformly convex functions. Technical Report (August 2010), available at http://hal.archives-ouvertes.fr/docs/00/50/89/33/PDF/Strong-hal.pdf, 2010.

Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574-1609, 2009.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT, 2009.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.

A. Justifying $\eta_t = \Theta(1/t)$ Step-Sizes

In this appendix, we justify our focus on the step-size regime $\eta_t = \Theta(1/t)$, by showing that for other step sizes, one cannot hope for an optimal convergence rate in general.

Let us begin by considering the scalar, strongly convex function $F(w) = \frac{1}{2}w^2$, in the deterministic case where $\hat{g}_t = g_t = F'(w_t) = w_t$ with probability 1, and show that $\eta_t$ cannot be smaller than $\Omega(1/t)$. Intuitively, such small step sizes do not allow the iterates $w_t$ to move towards the optimum sufficiently fast. More formally, starting from (say) $w_1 = 1$ and using the recursive equality $w_{t+1} = w_t - \eta_t\hat{g}_t$, we immediately get

$$w_T = \prod_{t=1}^{T-1}(1 - \eta_t).$$

Thus, if we want to obtain a $O(1/T)$ convergence rate using the iterates returned by the algorithm, we must at least require that

$$\prod_{t=1}^{T-1}(1 - \eta_t) \le O(1/T).$$

This is equivalent to requiring that

$$-\sum_{t=1}^{T-1}\log(1 - \eta_t) \ge \Omega(\log(T)).$$

For large enough $t$ and small enough $\eta_t$, $-\log(1 - \eta_t) \approx \eta_t$, and we get that $\sum_{t=1}^{T-1}\eta_t$ must scale at least logarithmically with $T$. This requires $\eta_t \ge \Omega(1/t)$.

To show that $\eta_t$ cannot be larger than $O(1/t)$, one can consider the function $F(w) = \frac{1}{2}w^2 + w$ over the domain $\mathcal{W} = [0,1]$ - this is a one-dimensional special case of the example considered in Thm. 3. Intuitively, with an appropriate stochastic gradient model, the random fluctuations in $F(w_{t+1})$ (conditioned on $w_1, \ldots, w_t$) are of order $\eta_t$, so we need $\eta_t = O(1/t)$ to get optimal rates. In particular, in the proof of Thm. 3, we show that for an appropriate stochastic gradient model, $\mathbb{E}[F(w_t)] \ge \mathbb{E}[w_t] \ge \eta_t/16$ (see Eq. (8)). A similar lower bound can also be shown for the unconstrained setting considered in Thm. 4.

B. Proofs

B.1. Some Technical Results

In this subsection we collect some technical results we will need for the other proofs.

Lemma 3. Let $a > 1$, $b \ge 0$ and $x_1 \in [0, D]$ be arbitrary constants. Let $x_2, x_3, \ldots$ be a non-negative sequence which satisfies

$$x_{t+1} \le \left(1 - \frac{a}{t}\right)x_t + \frac{b}{t^2}.$$

Then for all $t$,

$$x_t \le \frac{1}{t}\max\left(D, \frac{b}{a-1}\right).$$

Proof. Let $m = \max(D, b/(a-1))$. The proof is by simple induction. The assertion clearly holds for $t = 1$. Now, suppose that $x_t \le m/t$. Then it suffices to show that

$$\left(1 - \frac{a}{t}\right)\frac{m}{t} + \frac{b}{t^2} \le \frac{m}{t+1}.$$

This can be simplified to

$$b(t+1) \le m\left((a-1)t + a\right),$$

which clearly holds as $m \ge b/(a-1)$.

Lemma 4. Let $b \ge 0$, $c \ge 0$ and $x_1 \in [0, D]$ be arbitrary constants. Let $x_2, x_3, \ldots$ be a non-negative sequence which satisfies

$$x_{t+1} \le \left(\frac{t}{t+1}\right)^2 x_t + \frac{b}{(t+1)^{3/2}}\sqrt{x_t} + \frac{c}{(t+1)^3},$$

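then for all $t$,

$$x_t \le \frac{1}{t}\max\{D,\ [\ldots]\}.$$

Proof. Let $m = \max\{D,\ [\ldots]\}$. As in the previous lemma, the proof is by induction. The assertion clearly holds for $t = 1$. Now, suppose that $x_t \le m/t$. Then it suffices to show that

$$\left(\frac{t}{t+1}\right)^2\frac{m}{t} + \frac{b}{(t+1)^{3/2}}\sqrt{\frac{m}{t}} + \frac{c}{(t+1)^3} \le \frac{m}{t+1}.$$

This can be simplified to

$$m - b\sqrt{\frac{t+1}{t}}\sqrt{m} - \frac{c}{t+1} \ge 0.$$

By solving the quadratic inequality, we get that this holds whenever

$$\sqrt{m} \ge \frac{1}{2}\left(b\sqrt{\frac{t+1}{t}} + \sqrt{b^2\,\frac{t+1}{t} + \frac{4c}{t+1}}\right).$$

To ensure this holds for all $t$, it suffices to verify for the value of $t$ maximizing the right hand side, namely $t = 1$:

$$\sqrt{m} \ge \frac{1}{\sqrt{2}}\left(b + \sqrt{b^2 + c}\right),$$

which follows immediately from the definition of $m$.

Lemma 5. If $\mathbb{E}[\|\hat{g}_1\|^2] \le G^2$, then

$$\mathbb{E}[\|w_1 - w^*\|^2] \le \frac{4G^2}{\lambda^2}.$$

Proof. Intuitively, the lemma holds because the strong convexity of $F$ implies that the expected value of $\|\hat{g}_t\|$ must strictly increase as we get farther from $w^*$. More precisely, strong convexity implies that for any $w_t$,

$$\langle g_t, w_t - w^*\rangle \ge \frac{\lambda}{2}\|w_t - w^*\|^2,$$

so by the Cauchy-Schwarz inequality,

$$\|g_t\|^2 \ge \frac{\lambda^2}{4}\|w_t - w^*\|^2. \qquad (5)$$

Also, we have that whether $g_t$ is random or not (depending on whether $w_t$ is chosen arbitrarily or randomly),

$$\mathbb{E}[\|\hat{g}_t\|^2] = \mathbb{E}[\|g_t + (\hat{g}_t - g_t)\|^2] = \mathbb{E}[\|g_t\|^2] + \mathbb{E}[\|\hat{g}_t - g_t\|^2] + 2\,\mathbb{E}[\langle\hat{g}_t - g_t, g_t\rangle] \ge \mathbb{E}[\|g_t\|^2].$$

Combining this and Eq. (5), we get that for all $t$,

$$\mathbb{E}[\|w_t - w^*\|^2] \le \frac{4}{\lambda^2}\mathbb{E}[\|\hat{g}_t\|^2] \le \frac{4G^2}{\lambda^2}.$$

The following version of Freedman's inequality appears in (De La Peña, 1999) (Theorem 1.2A):

Theorem 6. Let $d_1, \ldots, d_T$ be a martingale difference sequence with a uniform upper bound $b$ on the steps $d_i$. Let $V_s$ denote the sum of conditional variances,

$$V_s = \sum_{i=1}^{s}\mathrm{Var}(d_i \mid d_1, \ldots, d_{i-1}).$$

Then, for every $a, v > 0$,

$$\mathrm{Prob}\left(\sum_{i=1}^{s}d_i \ge a \text{ and } V_s \le v \text{ for some } s \le T\right) \le \exp\left(\frac{-a^2}{2(v + ba)}\right).$$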

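The proof of the following lemma is taken almost verbatim from (Bartlett et al., 2008), with the only modification being the use of Theorem 6 to avoid an unnecessary union bound.

Lemma 6. Let $d_1, \ldots, d_T$ be a martingale difference sequence with a uniform bound $|d_i| \le b$. Let $V_s = \sum_{t=1}^{s}\mathrm{Var}_{t-1}(d_t)$ be the sum of conditional variances of the $d_t$'s. Further, let $\sigma_s = \sqrt{V_s}$. Then we have, for any $\delta < 1/e$ and $T \ge 4$,

$$\mathrm{Prob}\left(\sum_{t=1}^{s}d_t > 2\max\left\{2\sigma_s,\ b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right) \le \log(T)\,\delta. \qquad (6)$$

Proof. Note that a crude upper bound on $\mathrm{Var}_{t-1}(d_t)$ is $b^2$. Thus, $\sigma_s \le b\sqrt{T}$. We choose a discretization $0 = \alpha_{-1} < \alpha_0 < \ldots < \alpha_l$ such that $\alpha_{i+1} = 2\alpha_i$ for $i \ge 0$ and $\alpha_l \ge b\sqrt{T}$. We will specify the choice of $\alpha_0$ shortly. We then have,

$$\mathrm{Prob}\left(\sum_{t=1}^{s}d_t > 2\max\{2\sigma_s, \alpha_0\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right)$$
$$\le \sum_{j=0}^{l}\mathrm{Prob}\left(\sum_{t=1}^{s}d_t > 2\max\{2\sigma_s, \alpha_0\}\sqrt{\ln(1/\delta)}\ \&\ \alpha_{j-1} < \sigma_s \le \alpha_j \text{ for some } s \le T\right)$$
$$\le \sum_{j=0}^{l}\mathrm{Prob}\left(\sum_{t=1}^{s}d_t > 2\alpha_j\sqrt{\ln(1/\delta)}\ \&\ \alpha_{j-1}^2 < V_s \le \alpha_j^2 \text{ for some } s \le T\right)$$
$$\le \sum_{j=0}^{l}\mathrm{Prob}\left(\sum_{t=1}^{s}d_t > 2\alpha_j\sqrt{\ln(1/\delta)}\ \&\ V_s \le \alpha_j^2 \text{ for some } s \le T\right)$$
$$\le \sum_{j=0}^{l}\exp\left(\frac{-4\alpha_j^2\ln(1/\delta)}{2\alpha_j^2 + \frac{2}{3}\left(2\alpha_j\sqrt{\ln(1/\delta)}\right)b}\right) = \sum_{j=0}^{l}\exp\left(\frac{-2\alpha_j\ln(1/\delta)}{\alpha_j + \frac{2}{3}\sqrt{\ln(1/\delta)}\,b}\right),$$

where the last inequality follows from Theorem 6. If we now choose $\alpha_0 = b\sqrt{\ln(1/\delta)}$, then $\alpha_j \ge b\sqrt{\ln(1/\delta)}$ for all $j$. Hence every term in the above summation is bounded by $\exp\left(\frac{-2\ln(1/\delta)}{1 + 2/3}\right) < \delta$. Choosing $l = \log(\sqrt{T})$ ensures that $\alpha_l \ge b\sqrt{T}$. Thus we have

$$\mathrm{Prob}\left(\sum_{t=1}^{s}d_t > 2\max\left\{2\sigma_s,\ b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right)$$
$$= \mathrm{Prob}\left(\sum_{t=1}^{s}d_t > 2\max\{2\sigma_s, \alpha_0\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right) \le (l+1)\delta = (\log(\sqrt{T}) + 1)\delta \le \log(T)\,\delta.$$

B.2. Proof of Lemma 1

By the strong convexity of $F$ and the fact that $w^*$ minimizes $F$ in $\mathcal{W}$, we have

$$\langle g_t, w_t - w^*\rangle \ge F(w_t) - F(w^*) + \frac{\lambda}{2}\|w_t - w^*\|^2,$$

as well as

$$F(w_t) - F(w^*) \ge \frac{\lambda}{2}\|w_t - w^*\|^2.$$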

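Also, by convexity of $\mathcal{W}$, for any point $v$ and any $w \in \mathcal{W}$ we have $\|\Pi_{\mathcal{W}}(v) - w\| \le \|v - w\|$. Using these inequalities, we have the following:

$$\begin{aligned}
\mathbb{E}[\|w_{t+1} - w^*\|^2] &= \mathbb{E}[\|\Pi_{\mathcal{W}}(w_t - \eta_t\hat{g}_t) - w^*\|^2] \\
&\le \mathbb{E}[\|w_t - \eta_t\hat{g}_t - w^*\|^2] \\
&= \mathbb{E}[\|w_t - w^*\|^2] - 2\eta_t\,\mathbb{E}[\langle \hat{g}_t, w_t - w^* \rangle] + \eta_t^2\,\mathbb{E}[\|\hat{g}_t\|^2] \\
&= \mathbb{E}[\|w_t - w^*\|^2] - 2\eta_t\,\mathbb{E}[\langle g_t, w_t - w^* \rangle] + \eta_t^2\,\mathbb{E}[\|\hat{g}_t\|^2] \\
&\le \mathbb{E}[\|w_t - w^*\|^2] - 2\eta_t\,\mathbb{E}\left[F(w_t) - F(w^*) + \frac{\lambda}{2}\|w_t - w^*\|^2\right] + \eta_t^2 G^2 \\
&\le \mathbb{E}[\|w_t - w^*\|^2] - 2\eta_t\,\mathbb{E}\left[\frac{\lambda}{2}\|w_t - w^*\|^2 + \frac{\lambda}{2}\|w_t - w^*\|^2\right] + \eta_t^2 G^2 \\
&= (1 - 2\eta_t\lambda)\,\mathbb{E}[\|w_t - w^*\|^2] + \eta_t^2 G^2.
\end{aligned}$$

Invoking Lemma 3, using the fact that $\eta_t = c/(\lambda t)$ and $\mathbb{E}[\|w_1 - w^*\|^2] \le 4G^2/\lambda^2$ (by Lemma 5), we get that

$$\mathbb{E}[\|w_T - w^*\|^2] \le \frac{G^2}{\lambda^2 T}\max\left\{4,\ \frac{c}{2 - 1/c}\right\}.$$

B.3 Proof of Thm. 2

For any $t$, define $\bar{w}_t = (w_1 + \ldots + w_t)/t$. Then we have

$$\begin{aligned}
\mathbb{E}[\|\bar{w}_{t+1} - w^*\|^2] &= \mathbb{E}\left[\left\|\frac{t}{t+1}\bar{w}_t + \frac{1}{t+1}w_{t+1} - w^*\right\|^2\right] \\
&= \mathbb{E}\left[\left\|\frac{t}{t+1}(\bar{w}_t - w^*) + \frac{1}{t+1}(w_{t+1} - w^*)\right\|^2\right] \\
&= \left(\frac{t}{t+1}\right)^2\mathbb{E}[\|\bar{w}_t - w^*\|^2] + \frac{2t}{(t+1)^2}\,\mathbb{E}[\langle \bar{w}_t - w^*, w_{t+1} - w^* \rangle] + \frac{1}{(t+1)^2}\,\mathbb{E}[\|w_{t+1} - w^*\|^2] \\
&\le \left(\frac{t}{t+1}\right)^2\mathbb{E}[\|\bar{w}_t - w^*\|^2] + \frac{2t}{(t+1)^2}\,\mathbb{E}[\|\bar{w}_t - w^*\|\,\|w_{t+1} - w^*\|] + \frac{1}{(t+1)^2}\,\mathbb{E}[\|w_{t+1} - w^*\|^2].
\end{aligned}$$

Using the inequality $\mathbb{E}[|XY|] \le \sqrt{\mathbb{E}[X^2]}\sqrt{\mathbb{E}[Y^2]}$ for any random variables $X, Y$ (which follows from Cauchy-Schwarz), and the bound of Lemma 1, we get that $\mathbb{E}[\|\bar{w}_{t+1} - w^*\|^2]$ is at most

$$\left(\frac{t}{t+1}\right)^2\mathbb{E}[\|\bar{w}_t - w^*\|^2] + \frac{2\max\left\{2, \sqrt{c/(2 - 1/c)}\right\}G}{\lambda\,(t+1)^{3/2}}\sqrt{\mathbb{E}[\|\bar{w}_t - w^*\|^2]} + \frac{\max\{4, c/(2 - 1/c)\}\,G^2}{\lambda^2 (t+1)^3}.$$

Using Lemma 4, and using the fact that $\mathbb{E}[\|\bar{w}_1 - w^*\|^2] \le 4G^2/\lambda^2$ by Lemma 5, we get that E [ w T w ] T max { 4G λ, 5 max{, } / /)}G λ.

By the assumed smoothness of $F$ with respect to $w^*$, we have $F(\bar{w}_T) - F(w^*) \le \frac{\mu}{2}\|\bar{w}_T - w^*\|^2$. Combining it with the inequality above, and slightly upper bounding the constants for readability, the result follows.

B.4 Proof of Thm. 3

The SGD iterate can be written separately for the first coordinate as

$$w_{t+1,1} = \Pi_{[0,1]}\left((1 - \eta_t)\,w_{t,1} - \eta_t Z_t\right). \qquad (7)$$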

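Fix some $t \ge T_0$, and suppose first that $Z_t \le -1/2$. Conditioned on this event, we have

$$w_{t+1,1} \ge \Pi_{[0,1]}\left((1 - \eta_t)\,w_{t,1} + \frac{\eta_t}{2}\right) \ge \Pi_{[0,1]}\left(\frac{\eta_t}{2}\right) = \frac{\eta_t}{2},$$

since $t \ge T_0$ implies $\eta_t = c/t \le 1$. On the other hand, if $Z_t > -1/2$, we are still guaranteed that $w_{t+1,1} \ge 0$ by the domain constraints. Using these results, we get

$$\begin{aligned}
\mathbb{E}[w_{t+1,1}] &= \Pr(Z_t \le -1/2)\,\mathbb{E}[w_{t+1,1} \mid Z_t \le -1/2] + \Pr(Z_t > -1/2)\,\mathbb{E}[w_{t+1,1} \mid Z_t > -1/2] \\
&\ge \Pr(Z_t \le -1/2)\,\mathbb{E}[w_{t+1,1} \mid Z_t \le -1/2] \ge \Pr(Z_t \le -1/2)\,\frac{\eta_t}{2} = \frac{1}{16}\eta_t. \qquad (8)
\end{aligned}$$

Therefore,

$$\mathbb{E}[\bar{w}_{T,1}] \ge \frac{1}{T}\sum_{t=T_0+1}^{T}\mathbb{E}[w_{t,1}] \ge \frac{1}{16T}\sum_{t=T_0+1}^{T}\eta_{t-1} = \frac{1}{16T}\sum_{t=T_0}^{T-1}\eta_t.$$

Thus, by definition of $F$,

$$\mathbb{E}[F(\bar{w}_T) - F(w^*)] = \mathbb{E}[F(\bar{w}_T)] \ge \mathbb{E}[F((\bar{w}_{T,1}, 0, \ldots, 0))] \ge \mathbb{E}[\bar{w}_{T,1}] \ge \frac{1}{16T}\sum_{t=T_0}^{T-1}\eta_t.$$

Substituting $\eta_t = c/t$ gives the required result.

B.5 Proof of Thm. 4

The SGD iterate for the first coordinate is

$$w_{t+1,1} = \begin{cases} \Pi_{[-1,1]}\left(\left(1 - \frac{c}{t}\right)w_{t,1} - \frac{c}{t}Z_t\right) & w_{t,1} \ge 0 \\ \Pi_{[-1,1]}\left(\left(1 - \frac{c}{t}\right)w_{t,1} + \frac{7c}{t}\right) & w_{t,1} < 0. \end{cases} \qquad (9)$$

The intuition of the proof is that whenever $w_{t,1}$ becomes negative, then the large gradient of $F$ causes $w_{t+1,1}$ to always be significantly larger than 0. This means that in some sense, $w_{t,1}$ is "constrained" to be larger than 0, mimicking the actual constraint in the example of Thm. 3 and forcing the same kind of behavior, with a resulting $\Omega(\log(T)/T)$ rate.

To make this intuition rigorous, we begin with the following lemma, which shows that $w_{t,1}$ can never be significantly smaller than 0, or "stay" below 0 for more than one iteration.

Lemma 7. For any $t \ge T_0 = \max\{2, 6c + 1\}$, it holds that w,, and if $w_{t,1} < 0$, then w +, 5.

Proof. Suppose first that $w_{t-1,1} \ge 0$. Then by Eq. (9) and the fact that Z, we get w, / ). Moreover, in that case, if $w_{t,1} < 0$, then by Eq. (9) and the previous observation, w +, = Π [,] ) w, + 7 ) Π [,] ) + 7 ). Since, we have / 0, ), which implies that the above is lower bounded by Π [,] + 7 ).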

Moreover, since 6, we have / ) + 7/ [, ], so the projection operator is unnecessary, and we get overall that w +,. This result was shown to hold assuming that $w_{t-1,1} \ge 0$. If $w_{t-1,1} < 0$, then repeating the argument above for $t-1$ instead of $t$, we must have w, 5/ ) / ), so the statement in the lemma holds also when $w_{t-1,1} < 0$.

We turn to the proof of Thm. 4 itself. By Lemma 7, if $t \ge T_0$, then $w_{t,1} < 0$ implies $w_{t+1,1} \ge 0$, and moreover, w, + w +, / ) + 5/ 3/. Therefore, we have the following, where the sums below are only over $t$'s which are between $T_0$ and $T$:

E[w T0, + + w T, ] = E[w T 0, + + w T, ] E w, + w +, ) + E E :w,<0 :w, / [0,/] 3 + :w,>/ = :w,<0 :w,>/ w, Prw, / [0, η ]) (10)

Now, we claim that the probabilities above can be lower bounded by a constant. To see this, consider each such probability for w,, conditioned on the event w, 0. Using the fact that / <, we have

Prw, [0, /] w, 0) = Pr Π [,] /) w, /)Z ) [0, /] w, 0 ) = Pr /) w, /)Z [0, /] w, 0) = Pr Z [/ ) w,, / ) w, ] w, 0)

This probability is at most $1/4$, since it asks for $Z$ being constrained in an interval of size 1, whereas $Z$ is uniformly distributed over $[-1, 3]$, which is an interval of length 4. As a result, we get

Prw, / [0, /] w, 0) = Prw, [0, /] w, 0) 3 4

From this, we can lower bound Eq. (10) as follows:

Prw, / [0, η ]) Prw, / [0, η ], w, 0) = Prw, 0) Prw, / [0, η ] w, 0) [ 3 8 Prw, 0) = 3 T ] 8 E w, 0 =T 0 =T 0 [ = 3 T ] 6 E w, 0 + w, E =T 0 =T 0 = 3 6 E [ T w, 0 + =T 0 =T 0 + w, 0 ] [ T =T 0 [ 3 6 E w, 0 + =T 0+ =T 0+ ] w, 0 ) ] w, 0 + w, 0 +

Now, by Lemma 7, for any realization of $w_{T_0,1}, \ldots, w_{T,1}$, the indicators w, 0 cannot equal 0 consecutively. Therefore, w, 0 + w, 0 must always be at least 1. Plugging it in the equation above, we get the lower

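bound 3/6) T =T 0+. Summing up, we have shown that

E[w T0, + + w T, ] 3 6 =T 0+

By the boundedness assumption on $\mathcal{W}$, we have $w_{t,1} \ge -1$, so $\mathbb{E}[w_{1,1} + \ldots + w_{T_0-1,1}] \ge -T_0$. (To analyze the case where $\mathcal{W}$ is unbounded, one can replace this by a coarse bound on how much $w_{t,1}$ can change in the first $T_0$ iterations, since the step sizes are bounded and $T_0$ is essentially a constant anyway.) Overall, we get

E[ w T, ] = T E w, 3 6T = =T 0+ ) T 0 T

Therefore, $\mathbb{E}[F(\bar{w}_T) - F(w^*)] = \mathbb{E}[F(\bar{w}_T)] \ge \mathbb{E}[F((\bar{w}_{T,1}, 0, \ldots, 0))] \ge \mathbb{E}[\bar{w}_{T,1}]$, which is at least 3 6T =T 0+ ) T 0 T, as required.

B.6 Proof of Thm. 5

Proof. Using the derivation as in the proof of Lemma 1, we can upper bound $\mathbb{E}[\|w_{t+1} - w^*\|^2]$ by

$$\mathbb{E}[\|w_t - w^*\|^2] - 2\eta_t\,\mathbb{E}[\langle g_t, w_t - w^* \rangle] + \eta_t^2 G^2.$$

Extracting the inner product and summing over $t = (1-\alpha)T + 1, \ldots, T$, we get

$$\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\langle g_t, w_t - w^* \rangle] \le \sum_{t=(1-\alpha)T+1}^{T}\frac{\eta_t G^2}{2} + \sum_{t=(1-\alpha)T+1}^{T}\left(\frac{\mathbb{E}[\|w_t - w^*\|^2]}{2\eta_t} - \frac{\mathbb{E}[\|w_{t+1} - w^*\|^2]}{2\eta_t}\right). \qquad (11)$$

By convexity of $F$, $\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\langle g_t, w_t - w^* \rangle]$ is lower bounded by

$$\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[F(w_t) - F(w^*)] \ge \alpha T\,\mathbb{E}[F(\bar{w}_T^{\alpha}) - F(w^*)].$$

Substituting this lower bound into Eq. (11) and slightly rearranging the right hand side, we get that $\mathbb{E}[F(\bar{w}_T^{\alpha}) - F(w^*)]$ can be upper bounded by

$$\frac{1}{2\alpha T}\left(\frac{\mathbb{E}[\|w_{(1-\alpha)T+1} - w^*\|^2]}{\eta_{(1-\alpha)T+1}} + \sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\|w_t - w^*\|^2]\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) + G^2\sum_{t=(1-\alpha)T+1}^{T}\eta_t\right).$$

Now, we invoke Lemma 1, which tells us that with any strongly convex $F$, even non-smooth, we have $\mathbb{E}[\|w_t - w^*\|^2] \le O(1/t)$. More specifically, we can upper bound the expression above by

$$\frac{\max\{4, c/(2 - 1/c)\}\,G^2}{2\alpha T\lambda^2}\left(\frac{1}{((1-\alpha)T+1)\,\eta_{(1-\alpha)T+1}} + \sum_{t=(1-\alpha)T+1}^{T}\frac{1/\eta_t - 1/\eta_{t-1}}{t}\right) + \frac{G^2}{2\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\eta_t.$$

In particular, since we take $\eta_t = c/(\lambda t)$, we get

$$\frac{\max\{4, c/(2 - 1/c)\}\,G^2}{2c\,\alpha T\lambda}\left(1 + \sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}\right) + \frac{c\,G^2}{2\alpha T\lambda}\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}.$$

It can be shown that $\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t} \le \log(1/(1-\alpha))$. Plugging it in and slightly simplifying, we get the desired bound.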
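The contrast between averaging all iterates and averaging only the last $\alpha T$ of them can be illustrated with a small simulation. The following is a minimal sketch under assumptions that are not taken verbatim from the paper: a one-dimensional objective $F(w) = w^2/2 + w$ on $[0, 1]$ (chosen to be consistent with the first-coordinate update of Eq. (7)), noise $Z_t$ uniform on $[-1, 3]$, step size $\eta_t = c/t$ with $c = 1$, and hypothetical helper names `run_sgd`, `objective` and `mean_suboptimality`.

```python
import random


def project(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(hi, max(lo, x))


def objective(w: float) -> float:
    """Toy objective F(w) = w^2/2 + w; its minimizer over [0, 1] is w* = 0, F(w*) = 0."""
    return 0.5 * w * w + w


def run_sgd(T: int, alpha: float, rng: random.Random, c: float = 1.0):
    """Run projected SGD for T rounds; return (average of all iterates, alpha-suffix average)."""
    w = 0.0                                      # w_1 = 0
    total = 0.0
    suffix_total = 0.0
    suffix_start = int((1.0 - alpha) * T) + 1    # first round included in the suffix
    for t in range(1, T + 1):
        total += w
        if t >= suffix_start:
            suffix_total += w
        z = rng.uniform(-1.0, 3.0)               # noisy gradient is w + z, with E[z] = 1
        w = project(w - (c / t) * (w + z))       # step size eta_t = c / t (lambda = 1)
    return total / T, suffix_total / (T - suffix_start + 1)


def mean_suboptimality(T: int = 2000, alpha: float = 0.5, runs: int = 200, seed: int = 0):
    """Monte Carlo estimates of E[F(full average)] and E[F(suffix average)]."""
    rng = random.Random(seed)
    full = suffix = 0.0
    for _ in range(runs):
        avg_all, avg_suffix = run_sgd(T, alpha, rng)
        full += objective(avg_all)
        suffix += objective(avg_suffix)
    return full / runs, suffix / runs


if __name__ == "__main__":
    full, suffix = mean_suboptimality()
    print("full averaging  :", full)
    print("suffix averaging:", suffix)
```

On this toy instance the suffix average is expected to have visibly smaller suboptimality than the full average, in line with the $O(1/T)$ versus $\Omega(\log(T)/T)$ discussion; the simulation is illustrative and does not replace the proofs.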

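B.7 Proof of Lemma 2

To prove this lemma, we will first prove the following auxiliary result, which rewrites $\|w_{t+1} - w^*\|^2$ in a more explicit form.

Lemma 8. Under the conditions of Lemma 2, it holds for any that

w + w ) w i w, ẑ i + 3 G λ i j λ j=i+

Proof. By the strong convexity of $F$ and the fact that $w^*$ minimizes $F$ in $\mathcal{W}$, we have

$$\langle g_t, w_t - w^* \rangle \ge F(w_t) - F(w^*) + \frac{\lambda}{2}\|w_t - w^*\|^2,$$

as well as

$$F(w_t) - F(w^*) \ge \frac{\lambda}{2}\|w_t - w^*\|^2.$$

Also, by convexity of $\mathcal{W}$, for any point $v$ and any $w \in \mathcal{W}$ we have $\|\Pi_{\mathcal{W}}(v) - w\| \le \|v - w\|$. Using these inequalities, we have the following:

$$\begin{aligned}
\|w_{t+1} - w^*\|^2 &= \|\Pi_{\mathcal{W}}(w_t - \eta_t\hat{g}_t) - w^*\|^2 \\
&\le \|w_t - \eta_t\hat{g}_t - w^*\|^2 \\
&= \|w_t - w^*\|^2 - 2\eta_t\langle \hat{g}_t, w_t - w^* \rangle + \eta_t^2\|\hat{g}_t\|^2 \\
&= \|w_t - w^*\|^2 - 2\eta_t\langle g_t, w_t - w^* \rangle + 2\eta_t\langle \hat{z}_t, w_t - w^* \rangle + \eta_t^2\|\hat{g}_t\|^2 \\
&\le \|w_t - w^*\|^2 - 2\eta_t\left(F(w_t) - F(w^*) + \frac{\lambda}{2}\|w_t - w^*\|^2\right) + 2\eta_t\langle \hat{z}_t, w_t - w^* \rangle + \eta_t^2 G^2 \\
&\le \|w_t - w^*\|^2 - 2\eta_t\left(\frac{\lambda}{2}\|w_t - w^*\|^2 + \frac{\lambda}{2}\|w_t - w^*\|^2\right) + 2\eta_t\langle \hat{z}_t, w_t - w^* \rangle + \eta_t^2 G^2 \\
&= (1 - 2\eta_t\lambda)\|w_t - w^*\|^2 + 2\eta_t\langle \hat{z}_t, w_t - w^* \rangle + \eta_t^2 G^2.
\end{aligned}$$

For simplicity of notation, let us write w := w w. Plugging in our choice of $\eta_t$, we get

w + ) w + ) G λ ẑ, w + λ

Unwinding this recursive inequality till =, we get that for any,

w + ) ẑ i, w λ i j i + G λ i

We now note that for any a >,

) = i i=a i=a j=i+ i i = j=i+ ) j a i=a i ) ) a a i= + i +

Plugging this back and slightly simplifying the upper bound, we get

w + ) ẑ i, w λ i j i + G λ j=i+ ) i i

We also note that if (which holds assuming $c > 1/2$ and that $c$ is a whole number), then

) i + i ) i ) i di ) + ) ) ) + ) + + )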

For any 4 + 4, this is at most + ) ) 3 ) 3. Therefore, we get that for any 4 + 4,

w + λ i j=i+ ) ẑ i, w j i + 3 G λ

With this result at hand, we are now in a position to prove our high probability bound. Denote X i = w i w and Z i = w i w, ẑ i. We have that the conditional expectation of $Z_i$, given previous rounds, is E i [Z i ] = 0, and the conditional variance Var i Z i ) X i G. By Lemma 8,

X + λ α ) i Z i + 3 G λ (12)

where

α ) i = i j=i+ ) = i + ) i )i j i + ) ) = β i γ

with β i = i + ) i ) and γ = + ) )).

Let us now study the sum β i Z i of the martingale differences d i = β i Z i. Observe that the sum of conditional variances satisfies

σ = Var i β i Z i ) βi G X i

and we also have the uniform bound

β i Z i Gβ i Gβ = G + ) ) := b

We now apply Lemma 6 to the sum of martingale differences β iz i. We get that with probability at least $1 - \delta$,

β i Z i < max σ, b lnlnt )/δ) lnlnt )/δ)

for all $t \le T$. Recall that σ = G β i X i. Define a shorthand B = lnlnt )/δ). Multiplying both sides by γ, we have for all $t \le T$,

α ) i Z i = γ β i Z i max 4γ BG βi X i, γ b B

Observe that γ b G. Using Eq. (12), we have

X + λ G λ 8BG λ α ) i Z i + 3 G λ max 4γ B βi X i, B β i γ ) X i + 4B G λ + 3 G λ + 3 G λ

We also have

β i γ ) = ) i + ) i )i i4 i + ) ) 4

Assume by the way of induction that X i Ai for all i, for some constant $A$ to be defined later. Now, let us prove the result for X +. We have

X + 8BG λ = 8BG λ 8BG λ BG λ β i γ ) X i + 4B G λ 4 i 4 3 A + 4B G λ + 3 G λ + 3 G λ + ) 4 A 4 + 4B G + 3 G λ λ A 4 + 4B G λ + 3 G λ,

where in the last step we used the fact that since, then + /) + /) e. We would like the last quantity to be less than A/ + ) in order to prove the induction step. Since +, it is enough to find $A$ that satisfies

BG A λ 4 + 4B G + 3 G λ λ A

This can be written as a quadratic equation of the form A AΓ Γ 0, with a feasible solution for $A$ being any upper bound on

Γ + Γ + Γ ) 4Γ + 4Γ

This gives

BG) A = 4 λ 4 ) + 4B G + 3 G ) λ λ

Upper bounding /4 ) by / (since $c$ is a whole number and $c > 1/2$), and slightly simplifying, we get the required bound.
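The $1/t$ decay of $\|w_t - w^*\|^2$ that underlies both the in-expectation and the high-probability bounds can be observed empirically on a toy problem. The sketch below makes several assumptions that are not part of the paper: a one-dimensional objective $F(w) = w^2/2$ on $[-1, 1]$ (so $\lambda = 1$ and $w^* = 0$), gradient noise uniform on $[-1, 1]$ (so $\|\hat{g}_t\|^2 \le G^2 = 4$), step size $\eta_t = c/(\lambda t)$ with $c = 1$, and hypothetical helper names. The reference constant $4G^2/\lambda^2 = 16$ used in the comments is the $c = 1$ reading of the Lemma 1 bound and should itself be treated as an assumption.

```python
import random


def last_iterate_sq_error(T: int, rng: random.Random, c: float = 1.0) -> float:
    """Projected SGD on F(w) = w^2/2 over [-1, 1]; returns |w_T - w*|^2 with w* = 0."""
    w = 1.0                                        # arbitrary starting point w_1 in W
    for t in range(1, T):                          # T - 1 updates produce w_T
        g_hat = w + rng.uniform(-1.0, 1.0)         # unbiased gradient estimate, |g_hat| <= 2
        w = max(-1.0, min(1.0, w - (c / t) * g_hat))   # eta_t = c / (lambda t), lambda = 1
    return w * w


def scaled_mean_error(T: int = 1000, runs: int = 500, seed: int = 0) -> float:
    """Monte Carlo estimate of T * E[|w_T - w*|^2]; stays bounded if the error decays like 1/T."""
    rng = random.Random(seed)
    mean = sum(last_iterate_sq_error(T, rng) for _ in range(runs)) / runs
    return T * mean


if __name__ == "__main__":
    for horizon in (100, 400, 1600):
        # Each value should remain well below the assumed reference constant 16.
        print(horizon, scaled_mean_error(T=horizon))
```

If the squared distance decays like $1/T$, the printed values should stay roughly constant as the horizon grows, rather than increasing with $T$.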


Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania