Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite-Sum Structure

Size: px
Start display at page:

Download "Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite-Sum Structure"

Transcription

1 Sochasic Opimizaio wih Variace Reducio for Ifiie Daases wih Fiie-Sum Srucure Albero Biei, Julie Mairal To cie his versio: Albero Biei, Julie Mairal. Sochasic Opimizaio wih Variace Reducio for Ifiie Daases wih Fiie-Sum Srucure. 07. <hal v4> HAL Id: hal hps://hal.iria.fr/hal v4 Submied o 7 Feb 07 v4), las revised 5 Nov 07 v6) HAL is a muli-discipliary ope access archive for he deposi ad dissemiaio of scieific research documes, wheher hey are published or o. The documes may come from eachig ad research isiuios i Frace or abroad, or from public or privae research ceers. L archive ouvere pluridiscipliaire HAL, es desiée au dépô e à la diffusio de documes scieifiques de iveau recherche, publiés ou o, émaa des éablissemes d eseigeme e de recherche fraçais ou éragers, des laboraoires publics ou privés.

2 Sochasic Opimizaio wih Variace Reducio for Ifiie Daases wih Fiie-Sum Srucure Albero Biei Iria Julie Mairal Iria February 7, 07 Absrac Sochasic opimizaio algorihms wih variace reducio have prove successful for miimizig large fiie sums of fucios. Uforuaely, hese echiques are uable o deal wih sochasic perurbaios of ipu daa, iduced for example by daa augmeaio. I such cases, he objecive is o loger a fiie sum, ad he mai cadidae for opimizaio is he sochasic gradie desce mehod SGD). I his paper, we iroduce a variace reducio approach for hese seigs whe he objecive is srogly covex. Afer a iiial liearly coverge phase, he algorihm achieves a O/) covergece rae i expecaio like SGD, bu wih a cosa facor ha is ypically much smaller, depedig o he variace of gradie esimaes due o perurbaios o a sigle example. We also iroduce exesios of he algorihm o composie objecives ad o-uiform samplig. Iroducio May supervised machie learig problems ca be cas io he miimizaio of a expeced loss over a daa disribuio D wih respec o a vecor x i R p of model parameers: E ζ D [fx, ζ)]. Whe a ifiie amou of daa is available, sochasic opimizaio mehods such as he sochasic gradie desce SGD) or sochasic mirror desce algorihms, or heir varias, are ypically used see, e.g., Boou e al., 06; Nemirovski e al., 009). Neverheless, whe he daase is fiie, icremeal mehods based o variace reducio echiques e.g., Alle-Zhu, 06; Defazio e al., 04a; Johso ad Zhag, 03; La ad Zhou, 05; Li e al., 05; Schmid e al., 06; Shalev-Shwarz ad Zhag, 03) have prove o be sigificaly faser ha SGD a solvig he fiie-sum problem } mi x R p F x) := fx) + hx) = f i x) + hx) where he fucios f i are smooh ad covex, ad h is a simple covex pealy ha eed o be differeiable such as he l orm. A classical seig is f i x) = ly i, x ξ i ) + µ/) x, where ξ i, y i ) is a example-label pair, l is a covex loss fucio, ad µ is a regularizaio parameer. Iroducig radom perurbaios of daa is a fudameal cocep i machie learig; for isace, his is a key o achieve sable feaure selecio Meishause ad Bühlma, 00), or for privacy-aware learig Duchi e al., 0). I his paper, we cosider he augmeaio of fiie raiig ses wih well-chose radom perurbaios of each example, which ca lead o smaller es error heory Wager e al., 04) ad i pracice Loosli e al., 007; va der Maae e al., 03). Examples of such procedures iclude This work was suppored by a gra from ANR MACARON projec uder gra umber ANR-4-CE ) ad from he MSR-Iria joi cere.,

3 radom rasformaios of images i classificaio problems e.g., Loosli e al., 007) ad Dropou Srivasava e al., 04). The objecive describig hese scearios, which is he focus of his paper, is he followig: } mi F x) = E ρ Γ [ x R p f i x, ρ)] + hx), ) ha is, we cosider he fiie-sum problem wih f i x) = E ρ Γ [ f i x, ρ)], where ρ paramerizes he radom perurbaio ad f i, ρ) is a covex smooh fucio wih L-Lipschiz coiuous gradies for all i ad ρ. We also assume ha F is µ-srogly covex. Because each fucio f i is a expecaio, compuig a sigle gradie f i is iracable i geeral ad sadard variace reducio mehods cao be used. A aural way o opimize his objecive whe h = 0 is o use SGD by radomly choosig a idex a ieraio alog wih a perurbaio ρ Γ, ad performig he updae x = x η f i x, ρ ) wih a sep-size η. Uforuaely, his approach igores he fiie-sum srucure ad leads o gradie esimaes wih high variace ad slow covergece. The goal of his paper is o iroduce a algorihm, called sochasic MISO, ha ca exploi he problem srucure usig variace reducio. Our mehod achieves a O/) covergece rae like SGD, bu wih a much smaller cosa erm ypical seigs, oly depedig o he variace of he gradie esimaes due o he radom perurbaios o a sigle example. To he bes of our kowledge, our mehod is he firs hybrid algorihm ha aurally ierpolaes bewee icremeal algorihms for fiie sums whe here are o perurbaios) ad he sochasic approximaio seig whe =); as a firs sep, we ackle he problem from a covex opimizaio poi of view. We also remark ha he sochasic composie case wih =, we obai a ovel algorihm wih he same covergece properies as SGD. Relaed work. Our work is ispired by he rece surge of ieres for sochasic opimizaio mehods ha are dedicaed o he miimizaio of fiie sums, which arise aurally i machie learig. Surprisigly, i has bee show ha by exploiig he fiie-sum srucure he objecive, oe ca develop much faser opimizaio mehods ha previous oes ha did o exploi his srucure, such as SGD or full gradie desce see, amog ohers, Schmid e al., 06; Shalev-Shwarz ad Zhag, 03). May of hese mehods have bee moivaed by he fac ha heir updaes ca be ierpreed as SGD seps wih ubiased esimaes of he full gradie, bu wih a variace ha decreases as he algorihm approaches he opimum Johso ad Zhag, 03); o he oher had, vailla SGD requires decreasig sep-sizes o achieve his reducio of variace, hereby slowig dow covergece. Our work aims a exedig hese echiques, i paricular he MISO/Fiio algorihms Defazio e al., 04b; Li e al., 05; Mairal, 05) o he case where each fucio he fiie sum ca oly be accessed via a firs-order sochasic oracle. Despie is relevace o machie learig va der Maae e al., 03; Wager e al., 04), problem ) is o well sudied he opimizaio lieraure. Mos relaed o our work, rece mehods ha use cluserig iformaio o improve he covergece of variace reducio echiques Alle-Zhu e al., 06; Hofma e al., 05) ca be see as acklig a special case of ), where he expecaios i f i are replaced by empirical averages over pois i a cluser. While he approximaio assumpio of N-SAGA Hofma e al., 05) ca be see as a variace codiio o sochasic gradies as i our case, heir algorihm is asympoically biased ad does o coverge o he opimum. O he oher had, CluserSVRG Alle-Zhu e al., 06) is o biased, bu does o suppor ifiie daases. The mehod proposed by Achab e al. 05) uses variace reducio i a seig where gradies are compued approximaely, bu he algorihm compues a full gradie a every pass, which is o available i our sochasic seig. Paper orgaizaio. I Secio, we prese our algorihm for smooh objecives, ad we aalyze is covergece i Secio 3. 
We iroduce a exesio of he algorihm o composie objecives ad o-uiform samplig i Secio 4. I Secio 5, we prese empirical resuls o wo sigificaly differe problems: image classificaio wih radom rasformaios of he ipu examples, ad classificaio of biological ad ex daa wih Dropou.

4 Algorihm S-MISO for smooh objecives Ipu: sep-size sequece α ) ; Iiialize x 0 = i z0 i for some z0 i ),...,; for =,... do Sample a idex uiformly a radom, a perurbaio ρ Γ, ad updae wih g = f i x, ρ )): z α )z i + α x µ = g ), if i = z ) i, oherwise. x = z = x + z z ). 3) ed for The Sochasic MISO Algorihm I his secio, we iroduce he sochasic MISO approach for smooh objecives h = 0), which relies o he followig assumpios: A) global srog covexiy: f is µ-srogly covex; A) smoohess: fi, ρ) is L-smooh for all i ad ρ i.e., differeiable wih L-Lipschiz gradies); A3) small variace from perurbaios: for all i, where x is he uique) miimizer of f. E ρ [ fi x, ρ) f i x ) ] σ, Noe ha we will relax he smoohess assumpio A) i Secio 4 by supporig composie objecives wih o-smooh regularizers, ad by exploiig differe smoohess parameers L i o each example, a seig where o-uiform samplig of he raiig pois is ypically helpful o accelerae he covergece of icremeal mehods e.g., Xiao ad Zhag, 04). I is impora o oe ha our variace assumpio A3) is oly affeced by he oise iduced by he perurbaios ρ ad o by he radomess he choice of he idex i. I coras, a similar assumpio for he SGD algorihm o he objecive ) would ake he form E i,ρ [ f i x, ρ) ] σo for all x see Appedix C). The quaiy σo akes io accou he oise iduced by he radom idex i i addiio o ρ, ad ca hus be much larger ha σ, paricularly if he perurbaios o ipu daa are small. I Secio 3, we will show ha afer a iiial liearly coverge phase, ad uder appropriae choice of sep-sizes α ), S-MISO saisfies E[ x x ] ɛ afer ) σ O µ ɛ ieraios. This complexiy is similar o ha of SGD Boou e al., 06; Nemirovski e al., 009), bu wih σ replacig he quaiy σ o, leadig o a much faser rae ha SGD if σ σ o, somehig which we observed i our experimes see Secio 5). Our mehod is give i Algorihm. Wihou he perurbaios ad wih a cosa sep-size, he algorihm resembles he MISO/Fiio algorihms Defazio e al., 04b; Li e al., 05; Mairal, 05), which may be see as primal varias of SDCA Shalev-Shwarz, 06; Shalev-Shwarz ad Zhag, 03). MISO/Fiio are par of a larger body of opimizaio mehods ha ieraively build a model of he objecive fucio, ypically he form of a lower or upper boud o he objecive ha is easier o opimize; for isace, his sraegy is commoly adoped i budle mehods Hiriar-Urruy ad Lemaréchal, 993) or he EM algorihm ad is icremeal varias Neal ad Hio, 998). Specifically, MISO/Fiio 3

5 assumes ha each f i is srogly covex, ad builds a model of he objecive usig lower bouds of he form D x) = d i x), where each d i is a lower boud o f i ad akes he form d ix) = c i, + µ x z i = c i, µ x, z i + µ x. 4) These lower bouds are updaed durig he algorihm usig srog covexiy lower bouds a x of he form l i x) = f ix ) + f i x ), x x + µ x x : d ix) = α )d i x) + α lx), if i = d i x), oherwise, which correspods o a z updae wih g = f i x )) z α )z i + α x µ = g ), if i = z i, oherwise. The ex ierae is he compued as x = arg mi x D x), which is equivale o 3). The origial MISO/Fiio algorihms use α = uder a big daa codiio o he sample size Defazio e al., 04b; Mairal, 05), while he heory was laer exeded i Li e al. 05) o relax his codiio by supporig smaller cosa seps α = α, leadig o a algorihm ha may be ierpreed as a primal varia of SDCA see also Shalev-Shwarz, 06). Noe ha whe f i is a expecaio, i is hard o obai such lower bouds sice he gradie f i x ) is o available i geeral. Neverheless, S-MISO ca exploi approximae lower bouds o each f i usig gradie esimaes g, by leig he sep-sizes α decrease appropriaely as commoly doe i sochasic approximaio. This leads o updae ). Separaely, SDCA Shalev-Shwarz ad Zhag, 03) cosiders he Fechel cojugaes of f i, defied by fi y) = sup x x y f i x). Whe f i is a expecaio, fi is o available i closed form i geeral, or are is gradies, ad i fac exploiig sochasic gradie esimaes is difficul he dualiy framework. I coras, Shalev-Shwarz 06) gives a aalysis of SDCA he primal, aka. wihou dualiy, for smooh fiie sums, ad our work exeds his lie of reasoig o he sochasic approximaio ad composie seigs. Relaioship wih SGD he smooh case. The lik bewee S-MISO he o-composie seig ad SGD ca be see by rewriig he updae 3) as 5) where x = x + z z ) = x + α v, v := x µ f i x, ρ ) z. 6) Noe ha E[v F ] = µ fx ), where F coais all iformaio up o ieraio ; hece, he algorihm ca be see as a isace of he sochasic gradie mehod wih ubiased gradies, which was a key moivaio i SVRG Johso ad Zhag, 03) ad laer i oher variace reducio algorihms Defazio e al., 04a; Shalev-Shwarz, 06). I is also worh oig ha he absece of a fiie-sum srucure =), we have z =x, hece our mehod becomes ideical o vailla SGD, up o a redefiiio of sep-sizes. Memory requiremes ad hadlig of sparse daases. The algorihm requires sorig he vecors z),...,, which akes he same amou of memory as he origial daase, ad is herefore a reasoable requireme i pracice. I he case of sparse daases, i is fair o assume ha he radom perurbaios applied o he ipu daa preserve he sparsiy paers of he origial vecors, as is he case, e.g., whe applyig Dropou o ex documes described wih bag-of-words represeaios Wager e al., 04). If 4

6 we furher assume he ypical seig where he µ-srog covexiy comes from a l regularizer: fi x, ρ) = φ i x ξ ρ i )+µ/) x, where ξ ρ i is he sparse) perurbed example ad φ i ecodes he loss, he he updae ) ca be wrie as z α )z i α µ = φ i x ξ ρ i )ξρ i, if i = z i, oherwise, which shows ha for every idex i, he vecor z preserves he same sparsiy paer as he examples ξρ i hroughou he algorihm assumig he iiializaio zi 0 = 0), makig he updae ) efficie. The updae 3) has he same cos sice v = z z is also sparse. 3 Covergece Aalysis of S-MISO We ow sudy he covergece properies of he S-MISO algorihm. We defer all proofs o he appedix. We sar by defiig he problem-depede quaiies zi := x µ f ix ), ad he iroduce he Lyapuov fucio C = x x + α z zi. 7) Proposiio gives a recursio o C, obaied by upper-boudig separaely is wo erms, ad fidig coefficies o cacel ou oher appearig quaiies whe relaig C o C. To his ed, we borrow elemes of he covergece proof of SDCA wihou dualiy Shalev-Shwarz, 06); our echical coribuio is o exed heir resul o he sochasic approximaio ad composie see Secio 4) cases, ad o sudy he covergece behavior of he algorihm hese seigs. Proposiio Recursio o C ). If α ) is a posiive ad o-icreasig sequece of sep-sizes saisfyig } α mi,, 8) κ ) wih κ = L/µ, he C obeys he recursio E[C ] α ) E[C ] + α ) σ µ. 9) Compariso wih SGD. A simple aalysis of SGD wih sep-sizes η ) 0 gives he followig recursio o B := E[ x x ] we provide a proof i Appedix C): B µη )B + µη ) σ o µ, where we assume E i,ρ [ f i x, ρ) ] σ o. Thus, afer forgeig he iiial codiio C 0, S-MISO miimizes B C a a faser rae if σ σ o. I paricular, if he gradie variace across examples bouded by σ o a he opimum) is much larger ha he gradie variace due o he daa perurbaio oly ρ Γ bouded by σ a he opimum), he our algorihm has a much faser covergece rae. As show he experimeal secio, σ o may be ideed orders of magiude larger ha σ i real scearios, leadig o boh heoreical ad pracical beefis. We ow sae he mai covergece resul, which provides he expeced rae O/) o C based o decreasig sep-sizes, similar o Boou e al., 06) for SGD. Noe ha covergece of objecive fucio values is direcly relaed o ha of he Lyapuov fucio C via smoohess: E[fx ) fx )] L E [ x x ] L E[C ]. 0) Noe ha he cosa L is a upper boud of he smoohess cosa of each fucio f i, ρ); i ca be replaced here by he global smoohess cosa of f, which may be smaller ha L. 5

7 Theorem Covergece of Lyapuov fucio). Le he sequece of sep-sizes α ) be defied by α = β γ+ wih β > ad γ 0 such ha α saisfies 8). For all 0, i holds ha E[C ] ν γ + +, where β σ } ν := max µ β ), γ + )C 0. ) Choice of sep-sizes i pracice. Naurally, we would like ν o be small, i paricular idepede of he iiial codiio C 0 ad equal o he firs erm he defiiio ). We would like he depedece o C 0 o vaish a a faser rae ha O/), as i is he case i variace reducio algorihms o fiie sums. As advised i Boou e al. 06) he coex of SGD, we ca iiially ru he algorihm wih a cosa sep-size ᾱ ad exploi his liear covergece regime uil we reach he level of oise give by σ, ad he sar decayig he sep-size. I is easy o see ha by usig a cosa sep-size ᾱ, C coverges ear a value C := ᾱσ µ. Ideed, Eq. 9) wih α = ᾱ yields E[C C] ᾱ ) E[C C]. Thus, we ca reach a value C 0 wih E[C 0] ɛ := C i O ᾱ log C 0/ ɛ) ieraios. The, if we sar decayig sep-sizes as i Theorem wih β = ad γ large eough so ha α = β γ+ = ᾱ, we have γ + ) E[C 0] γ + ) ɛ = 8σ /µ, makig boh erms i ) smaller ha or equal o ν = 8σ /µ. Cosiderig hese wo phases, wih a iiial sep-size ᾱ give by he upper boud i 8), he fial work complexiy of he algorihm for reachig E[ x x ] ɛ is O + κ) log C ) ) 0 σ + O ɛ µ. ) ɛ We ca use 0) i order o obahe complexiy for reachig E[fx ) fx )] ɛ, ad he secod erm becomes OLσ /µ ɛ). Noe ha followig his sep-size sraegy was foud o be very effecive i pracice see Secio 5). Acceleraio by ierae averagig. Whe oe is ieresed he covergece i fucio values, he complexiy erm OLσ /µ ɛ) meioed above ca be problemaic for ill-codiioed problems large κ = L/µ). The followig heorem preses a ierae averagig scheme which brigs he complexiy erm dow o Oσ /µɛ). Theorem 3 Covergece uder ierae averagig). Le he sep-size sequece α ) be defied by α = γ + } for γ s.. α mi,. 4κ ) We have where E[f x T ) fx )] µγγ )C 0 T γ + T ) + 6σ µγ + T ), x T := T γ + )x. T γ + T ) =0 6

8 The proof uses a similar elescopig sum echique o Lacose-Julie e al. 0). Noe ha if T γ, he firs erm, which depeds o he iiial codiio C 0, decays as /T ad is hus domiaed by he secod erm. Moreover, if we sar averagig afer a iiial phase wih cosa sep-size ᾱ, we ca cosider C 0 4ᾱσ µ. I he ill-codiioed regime, akig ᾱ = α = /γ + ) as large as allowed we have γ of he order of κ. The full covergece rae he becomes E[f x T ) fx )] O σ µγ + T ) + γ T ) ). Whe T is large eough compared o γ, his becomes Oσ /µt ), leadig o a complexiy erm Oσ /µɛ). 4 Exesio o Composie Objecives ad No-Uiform Samplig I his secio, we sudy exesios of S-MISO o differe siuaios where our previous smoohess assumpio A) is o suiable, eiher because of a o-smooh erm h he objecive or because i igores addiioal useful kowledge abou each f i such as he orm of each example. I he presece of o-smooh regularizers such as he l -orm, he objecive is o loger smooh, bu we ca leverage is composie srucure by usig proximal operaors. To his ed, we assume ha oe ca easily compue he proximal operaor of h, defied by prox h z) := arg mi x R p } x z + hx). Whe he smoohess cosas L i vary sigificaly across differe examples ypically hrough he orm of he feaure vecors), he uiform upper boud L = L max = max i L i ca be resricive. I has bee oiced see, e.g., Schmid e al., 06; Xiao ad Zhag, 04) ha whe he L i are kow, oe ca i L i achieve beer covergece raes ypically depedig o he average smoohess cosa L = raher ha L max by samplig examples i a o-uiform way. For ha purpose, we ow make he followig assumpios: A4) srog covexiy: fi, ρ) is µ-srogly covex for all i, ρ; A5) smoohess: fi, ρ) is L i -smooh for all i, ρ; A6) small variace from perurbaios a x : E ρ [ fi x, ρ) f i x ) ] σ i for all i. Noe ha our proof relies o a slighly sroger assumpio A4) ha he global srog covexiy assumpio A) made above, which holds he commo siuaio where srog covexiy comes from a l regularizaio erm. I order o exploi he differe smoohess cosas, we allow he algorihm o sample idices i o-uiformly, from ay disribuio q such ha q i 0 for all i ad i q i =. The exesio of S-MISO o his seig is give i Algorihm. Noe ha he sep-sizes vary depedig o he example, wih larger seps for examples ha are sampled less frequely ypically easier examples wih smaller L i ). Noe ha whe h = 0, he updae direcios are ubiased esimaes of he gradie: we have E[x x F ] = α µ fx ) as he uiform case. However, he composie case, he algorihm cao be wrie i a proximal sochasic gradie form like Prox-SVRG Xiao ad Zhag, 04) or SAGA Defazio e al., 04a). Relaioship wih RDA. Whe =, our algorihm performs similar updaes o Regularized Dual Averagig RDA) Xiao, 00) wih srogly covex regularizers. I paricular, if f x, ρ) = φx ξρ)) + µ/) x, he updaes are he same whe akig α = /, sice prox h/µ z ) = arg mi µ z, x + µ } x x + hx), 7

9 Algorihm S-MISO for composie objecives, wih o-uiform samplig. Ipu: sep-sizes α ), samplig disribuio q; Iiialize x 0 = prox h/µ z 0 ) wih z 0 = i z0 i for some z0 i ),..., ha safisfies 5); for =,... do Sample a idex q, a perurbaio ρ Γ, ad updae wih α i = α /q i, g = f i x, ρ )): z α i = )z i + αx µ g ), if i = z 3) i, oherwise z = z = z + z z ) ed for x = prox h/µ z ). 4) ad µ z is equal o he average of he gradies of he loss erm up o, which appears he same way i he RDA updaes Xiao, 00, Secio.). However, ulike RDA, our mehod suppors arbirary decreasig sep-sizes, i paricular keepig he sep-size cosa, which ca lead o faser covergece he iiial ieraios see Secio 3). Lower-boud model ad covergece aalysis. Agai, we ca view he algorihm as ieraively updaig approximae lower bouds o he objecive F of he form D x) = i d i x) + hx) aalogously o 5), ad miimizig he ew D i 4). Similar o MISO-Prox, we require ha d 0 i is iiialized wih a µ-srogly covex quadraic such ha f i x, ρ i ) d 0 i x) wih ρ i Γ. Give he form of d i i 4), i suffices o choose zi 0 ha saisfies fi x, ρ i ) µ x z0 i + c, 5) for some cosa c. I he commo case of a l regularizer wih a o-egaive loss, oe ca simply choose zi 0 = 0 for all i, oherwise, z0 i ca be obaied by cosiderig a srog covexiy lower boud o f i, ρ i ). Our ew aalysis relies o he miimum D x ) of he lower bouds D hrough he followig Lyapuov fucio: C q = F x ) D x ) + µα The covergece of he ieraes x is corolled by he covergece i C q Lemma 4 Boud o he ieraes). For all, we have q i z i z i. 6) haks o he followig lemma: µ E[ x x ] E[F x ) D x )]. 7) The followig proposiio gives a recursio o C q similar o Proposiio. Proposiio 5 Recursio o C q ). If α ) is a posiive ad o-icreasig sequece of sep-sizes saisfyig qmi α mi, µ }, 8) 4L q wih q mi = mi i q i ad L q = max i L i µ q i, he Cq wih σ q = σ i i q. i E[C q ] α obeys he recursio ) E[C q ] + α ) σq µ, 9) 8

10 F - F* 0 - STL-0 ck, µ = 0 3 S-MISO η = 0. S-MISO η =. 0 SGD η = 0. SGD η =. 0 N-SAGA η = STL-0 scaerig, µ = F - F* 0 - STL-0 ck, µ = STL-0 scaerig, 0 µ = F - F* 0 - STL-0 ck, µ = STL-0 scaerig, 0 µ = Figure : Impac of codiioig o he differe mehods for daa augmeaio o STL-0 corolled by µ, where µ=0 4 gives he bes es accuracy). Values of he raiig loss are show i logarihmic scale ui = facor 0). η = 0. saisfies he heory for all mehods, ad we iclude curves for larger sep-sizes η =. We omi N-SAGA for η = because i plaeaus very quickly ad far from he opimum. For he scaerig represeaio, he problem we sudy is l -regularized, hus we use he composie varias of he algorihms for SGD, we use a varia of FOBOS Duchi ad Siger, 009), see Appedix C). Noe ha if we cosider he quaiy E[C q /µ], which is a upper boud o E[ x x ] by Lemma 4, we have he same recursio as 9), ad hus ca apply Theorem wih he ew codiio 8). If we choose q i = + L i µ i L i µ), 0) we have q mi / ad L q L µ), where L = i L i. The, akig α = mi/4, µ/8 L µ)) saisfies 8), ad usig similar argumes o Secio 3, he complexiy for reachig E[ x x ] ɛ is O + L ) ) ) log Cq 0 σ q + O µ ɛ µ, ɛ where ɛ = 4ᾱσq/µ, ad ᾱ is he iiial cosa sep-size. For he complexiy i fucio subopimaliy, he secod erm becomes Oσq/µɛ) by usig he same averagig scheme preseed i Theorem 3 ad adapig he proof. Noe ha wih our choice of q, we have σq i σ i = σ, for geeral perurbaios, where σ = i σ i is he variace he uiform case. Addiioally, i is ofe reasoable o assume ha he variace from perurbaios icreases wih he orm of examples, for isace Dropou perurbaios ge larger whe coordiaes have larger magiudes. Based o his observaio, if we make he assumpio ha σi L i µ, ha is σi Li µ = σ L µ, he for boh q i = / uiform case) ad q i = L i µ)/ L µ), we have σq = σ, ad hus we have σq σ for he choice of q give i 0), sice σq is covex i q. Thus, we ca expec ha he O/) covergece phase behaves similarly or beer ha for uiform samplig, which is cofirmed by our experimes see Secio 5). 5 Experimes We prese experimes comparig S-MISO wih SGD ad N-SAGA Hofma e al., 05) o four differe scearios, i order o demosrae he wide applicabiliy of our mehod: we cosider a image classificaio 9

11 0 - gee dropou, δ = 0.30 S-MISO η = 0. S-MISO η =. 0 SGD η = 0. SGD η =. 0 N-SAGA η = 0. N-SAGA η = imdb dropou, δ = S-MISO-NU η =. 0 S-MISO η = 0. 0 SGD-NU η =. 0 SGD η = 0. 0 N-SAGA η = gee dropou, δ = imdb dropou, δ = gee dropou, δ = imdb dropou, δ = Figure : Impac of perurbaios o he mehods corolled by he Dropou rae δ). The gee daa is l -ormalized hece we cosider similar sep-sizes as Figure. The IMDB daase is highly heerogeeous, hus we also iclude o-uiform NU) samplig varias. For uiform samplig, heoreical sep-sizes perform very poorly for all mehods, hus we show a larger ued sep-size η = 0. daase wih wo differe image represeaios ad radom rasformaios, ad wo classificaio asks wih Dropou regularizaio, oe o geeic daa, ad oe o sparse) ex daa. Figures ad show he curves we obai for a esimae of he raiig objecive usig 5 sampled perurbaios per example. The plos are show o a logarihmic scale, ad he values are compared o he bes value obaied amog he differe mehods i 500. The srog covexiy cosa µ is he regularizaio parameer. For all mehods, we cosider sep-sizes suppored by he heory as well as larger sep-sizes ha may work beer i pracice. Choices of sep-sizes. For boh S-MISO ad SGD, we use he sep-size sraegy meioed i Secio 3 ad advised by Boou e al. 06), which we have foud o be mos effecive amog may heurisics we have ried: we iiially keep he sep-size cosa corolled by a facor η he figures) for, ad he sar decayig as α = C/γ + ), where C = for S-MISO, C = /µ for SGD, ad γ is chose large eough o mach he previous cosa sep-size. For N-SAGA, we maiai a cosa sep-size hroughou he opimizaio, as suggesed he origial paper Hofma e al., 05). The facor η show he figures is such ha η = correspods o a iiial sep-size µ/l µ) for S-MISO from 8) he uiform case) ad /L for SGD ad N-SAGA wih L isead of L he o-uiform case). Image classificaio wih daa augmeaio. The success of deep eural eworks is ofe limied by he availabiliy of large amous of labeled images. Whe here are may ulabeled images bu few labeled oes, a commo approach is o rai a liear classifier o op of a deep ework leared i a usupervised maer, or pre-raied o a differe ask e.g., o he ImageNe daase). We follow his approach o he STL-0 daase Coaes e al., 0), which coais 5K raiig images from 0 classes ad 00K ulabeled images, usig a -layer usupervised covoluioal kerel ework Mairal, 06), givig represeaios of dimesio 9 6. The perurbaio cosiss of radomly croppig ad scalig he ipu images. We use he squared hige loss i a oe-versus-all seig. The vecor represeaios are l -ormalized such ha we may use he upper boud L = + µ for he smoohess cosa. We also prese resuls o he same daase usig a scaerig represeaio Brua ad Malla, 03) of dimesio 696, wih radom gamma correcios raisig all pixels o he power γ, where γ is chose radomly aroud ). For his represeaio, 0

12 we add a l regularizaio erm ad use he composie varia of S-MISO. Figure shows covergece resuls o oe raiig fold 500 images), for differe values of µ, allowig us o sudy he behavior of he algorihms for differe codiio umbers. The low variace iduced by he daa rasformaios allows S-MISO o reach subopimaliy ha is orders of magiude smaller ha SGD afer he same umber of. Noe ha oe ui o hese plos correspods o oe order of magiude i he logarihmic scale. N-SAGA iiially reaches a smaller subopimaliy ha SGD, bu quickly ges suck due o he bias he algorihm, as prediced by he heory Hofma e al., 05), while S-MISO ad SGD coiue o coverge o he opimum haks o he decreasig sep-sizes. The bes validaio accuracy for boh represeaios is obaied for µ 0 4 middle colum i Figure ), ad we observed relaive gais of up o % from usig daa augmeaio. We compued empirical variaces of he image represeaios for hese wo sraegies, which are closely relaed o he variace i gradie esimaes, ad observed hese rasformaios o accou for abou 0% of he oal variace across muliple images. Dropou o gee expressio daa. We raied a biary logisic regressio model o he breas cacer daase of va de Vijver e al. 00), wih differe Dropou raes δ, i.e., where a every ieraio, each coordiae ξ j of a feaure vecor ξ is se o zero idepedely wih probabiliy δ ad o ξ j / δ) oherwise. The daase cosiss of 95 vecors of dimesio 8 4 of gee expressio daa, which we ormalize i l orm. Figure op) compares S-MISO wih SGD ad N-SAGA for hree values of δ, as a way o corol he variace of he perurbaios. We iclude a Dropou rae of 0.0 o illusrae he impac of δ o he algorihms ad sudy he ifluece of he perurbaio variace σ, eve hough his value of δ is less releva for he ask. The plos show very clearly how he variace iduced by he perurbaios affecs he covergece of S-MISO, givig subopimaliy values ha may be orders of magiude smaller ha SGD. This behavior is cosise wih he heoreical covergece rae esablished i Secio 3 ad shows ha he pracice maches he heory. Dropou o movie review seime aalysis daa. We raied a biary classifier wih a squared hige loss o he IMDB daase Maas e al., 0) wih differe Dropou raes δ. We use he labeled par of he IMDB daase, which cosiss of 5K raiig ad 50K esig movie reviews, represeed as dimesioal sparse bag-of-words vecors. I coras o he previous experimes, we do o ormalize he represeaios, which have grea variabiliy heir orms, i paricular L max is roughly 00 imes larger ha L. Figure boom) compares o-uiform samplig versios of S-MISO ad SGD wih heir uiform samplig couerpars as well as N-SAGA. Noe ha we use a large sep-size η = 0 for he uiform samplig algorihms, sice η = was sigificaly slower for all mehods, likely due o ouliers he daase. I coras, he o-uiform samplig algorihms required o uig ad jus use η =. The curves clearly show ha S-MISO-NU has a much faser covergece he iiial phase, haks o he larger sep-size allowed by o-uiform samplig, ad laer coverges similarly o S-MISO, i.e., a a much faser rae ha SGD whe he perurbaios are small. The value of µ used he experimes was chose by cross-validaio, ad he use of Dropou gave improvemes es accuracy from 88.5% wih o dropou o ± 0.03% wih δ = 0. ad ± 0.% wih δ = 0.3 based o 0 differe rus of S-MISO-NU afer 400 ). Effec of averagig. We also sudy he effec of he ierae averagig scheme of Theorem 3 i Appedix D. 6 Coclusio I his paper, we iroduced he S-MISO mehod, a hybrid sochasic/icremeal opimizaio algorihm, which is able o exploi uderlyig fiie-sum srucures i sochasic opimizaio problems. 
Our approach uses variace reducio i seigs where radom perurbaios of raiig examples i a fiie daase are cosidered durig learig, hereby makig he daase ifiie, ad hus usuiable for sadard icremeal mehods. The algorihm aurally ierpolaes bewee sochasic approximaio whe = ) ad a classical variace reducio algorihm for fiie sums whe here are o perurbaios). Our mehod suppors composie objecives, o-uiform samplig, ad gives covergece guaraees ha are similar o SGD, bu

13 wih a sigificaly smaller cosa erm ha depeds o he variace of gradie esimaes iduced by he perurbaios o a sigle example, raher ha across all daa. We demosraed he effeciveess of he mehod for daa augmeaio ad Dropou. Aoher promisig applicaio is i usig perurbaios for sable feaure selecio Meishause ad Bühlma, 00), bu his requires aoher saisical aalysis ha goes beyod he scope of his paper. Refereces M. Achab, A. Guilloux, S. Gaïffas, ad E. Bacry. SGD wih Variace Reducio beyod Empirical Risk Miimizaio. arxiv:50.048, 05. Z. Alle-Zhu. Kayusha: The firs direc acceleraio of sochasic gradie mehods. arxiv: , 06. Z. Alle-Zhu, Y. Yua, ad K. Sridhara. Exploiig he Srucure: Sochasic Gradie Mehods Usig Raw Clusers. I Advaces i Neural Iformaio Processig Sysems NIPS), 06. L. Boou, F. E. Curis, ad J. Nocedal. Opimizaio Mehods for Large-Scale Machie Learig. arxiv: , 06. J. Brua ad S. Malla. Ivaria scaerig covoluio eworks. IEEE rasacios o paer aalysis ad machie ielligece, 358):87 886, 03. A. Coaes, H. Lee, ad A. Y. Ng. A Aalysis of Sigle-Layer Neworks i Usupervised Feaure Learig. I Ieraioal Coferece o Arificial Ielligece ad Saisics AISTATS), 0. A. Defazio, F. Bach, ad S. Lacose-Julie. Saga: A fas icremeal gradie mehod wih suppor for o-srogly covex composie objecives. I Advaces i Neural Iformaio Processig Sysems NIPS), 04a. A. Defazio, J. Domke, ad T. S. Caeao. Fiio: A faser, permuable icremeal gradie mehod for big daa problems. I Ieraioal Coferece o Machie Learig ICML), 04b. J. C. Duchi ad Y. Siger. Efficie olie ad bach learig usig forward backward spliig. Joural of Machie Learig Research, 0Dec): , 009. J. C. Duchi, M. I. Jorda, ad M. J. Waiwrigh. Privacy aware learig. I Advaces i Neural Iformaio Processig Sysems NIPS), 0. J.-B. Hiriar-Urruy ad C. Lemaréchal. Covex aalysis ad miimizaio algorihms I: Fudameals. Spriger sciece & busiess media, 993. T. Hofma, A. Lucchi, S. Lacose-Julie, ad B. McWilliams. Variace Reduced Sochasic Gradie Desce wih Neighbors. I Advaces i Neural Iformaio Processig Sysems NIPS), 05. R. Johso ad T. Zhag. Acceleraig sochasic gradie desce usig predicive variace reducio. I Advaces i Neural Iformaio Processig Sysems NIPS), 03. S. Lacose-Julie, M. Schmid, ad F. Bach. A simpler approach o obaiig a O/) covergece rae for he projeced sochasic subgradie mehod. arxiv:.00, 0. G. La ad Y. Zhou. A opimal radomized icremeal gradie mehod. arxiv: , 05. H. Li, J. Mairal, ad Z. Harchaoui. A Uiversal Caalys for Firs-Order Opimizaio. I Advaces i Neural Iformaio Processig Sysems NIPS), 05. G. Loosli, S. Cau, ad L. Boou. Traiig ivaria suppor vecor machies usig selecive samplig. I Large Scale Kerel Machies, pages MIT Press, Cambridge, MA., 007.

14 A. L. Maas, R. E. Daly, P. T. Pham, D. Huag, A. Y. Ng, ad C. Pos. Learig word vecors for seime aalysis. I The 49h Aual Meeig of he Associaio for Compuaioal Liguisics ACL), pages Associaio for Compuaioal Liguisics, 0. J. Mairal. Icremeal Majorizaio-Miimizaio Opimizaio wih Applicaio o Large-Scale Machie Learig. SIAM Joural o Opimizaio, 5):89 855, 05. J. Mairal. Ed-o-Ed Kerel Learig wih Supervised Covoluioal Kerel Neworks. I Advaces i Neural Iformaio Processig Sysems NIPS), 06. N. Meishause ad P. Bühlma. Sabiliy selecio. Joural of he Royal Saisical Sociey: Series B Saisical Mehodology), 74):47 473, 00. R. Neal ad G. E. Hio. A view of he EM algorihm ha jusifies icremeal, sparse, ad oher varias. I Learig i Graphical Models, pages Kluwer Academic Publishers, 998. A. Nemirovski, A. Judisky, G. La, ad A. Shapiro. Robus Sochasic Approximaio Approach o Sochasic Programmig. SIAM Joural o Opimizaio, 94): , 009. Y. Neserov. Iroducory Lecures o Covex Opimizaio. Spriger, 004. M. Schmid, N. Le Roux, ad F. Bach. Miimizig fiie sums wih he sochasic average gradie. Mahemaical Programmig, 06. S. Shalev-Shwarz. SDCA wihou Dualiy, Regularizaio, ad Idividual Covexiy. I Ieraioal Coferece o Machie Learig ICML), 06. S. Shalev-Shwarz ad T. Zhag. Sochasic dual coordiae asce mehods for regularized loss miimizaio. Joural of Machie Learig Research, 4Feb): , 03. N. Srivasava, G. E. Hio, A. Krizhevsky, I. Suskever, ad R. Salakhudiov. Dropou: a simple way o preve eural eworks from overfiig. Joural of Machie Learig Research, 5):99 958, 04. M. J. va de Vijver e al. A Gee-Expressio Sigaure as a Predicor of Survival i Breas Cacer. New Eglad Joural of Medicie, 3475): , Dec. 00. L. va der Maae, M. Che, S. Tyree, ad K. Q. Weiberger. Learig wih margialized corruped feaures. I Ieraioal Coferece o Machie Learig ICML), 03. S. Wager, W. Fihia, S. Wag, ad P. Liag. Aliude Traiig: Srog Bouds for Sigle-layer Dropou. I Advaces i Neural Iformaio Processig Sysems NIPS), 04. L. Xiao. Dual averagig mehods for regularized sochasic learig ad olie opimizaio. Joural of Machie Learig Research, Oc): , 00. L. Xiao ad T. Zhag. A proximal sochasic gradie mehod wih progressive variace reducio. SIAM Joural o Opimizaio, 44): , 04. 3

15 Appedix Secios A ad B of his appedix prese he proofs of he resuls i Secios 3 ad 4 of he paper, respecively. I Secio C, we provide proofs of a simple resul for SGD ad proximal SGD, givig a recursio ha depeds o a variace codiio a he opimum i coras o Boou e al. 06); Nemirovski e al. 009) where his codiio eeds o hold everywhere), for a more aural compariso wih S-MISO. A Proofs for he Smooh Case Secio 3) A. Proof of Proposiio Recursio o Lyapuov fucio C ) We begi by saig he followig lemma, which exeds a key resul of variace reducio mehods see, e.g., Johso ad Zhag, 03) o he siuaio cosidered his paper, where oe oly has access o oisy esimaes of he gradies of each f i. Lemma A.. Le i be uiformly disribued i,..., } ad ρ Γ. Uder assumpios A) ad A3) o he fucios f,..., f ad heir expecaios f,..., f, we have, for all x R p, Proof. We have E i,ρ [ f i x, ρ) f i x ) ] 4Lfx) fx )) + σ. f i x, ρ) f i x ) f i x, ρ) f i x, ρ) + f i x, ρ) f i x ) 4L f i x, ρ) f i x, ρ) f i x, ρ), x x ) + f i x, ρ) f i x ). The firs iequaliy comes from he simple relaio u+v + u v = u + v. The secod iequaliy follows from he smoohess of f i, ρ), i paricular we used he classical relaio gy) gx) + gx), y x + L gy) gx), which is kow o hold for ay covex ad L-smooh fucio g see, e.g., Neserov, 004, Theorem..5). The resul follows by akig expecaios o i ad ρ ad oig ha E i,ρ [ f i x, ρ)] = fx ) = 0, as well as assumpio A3). We ow proceed wih he proof of Proposiio. Proof. Defie he quaiies A = z zi ad B = x x. The proof successively describes recursios o A, B, ad eveually C. Recursio o A. We have A A = z zi z zi ) = α )z zi ) + α x ) µ f i x, ρ ) zi = z z ) α z z + α x µ f i x, ρ ) z α α ) v ), ) 4

16 where we firs use he defiiio of z i ), he he relaio λ)u + λv = λ) u + λ v λ λ) u v, ad he defiiio of v give i 6). A similar relaio is derived he proof of SDCA wihou dualiy Shalev-Shwarz 06). Usig he defiiio of zi, he secod erm ca be expaded as follows x µ f i x, ρ ) zi = x x µ f i x, ρ ) f i x )) = x x µ x x, f i x, ρ ) f i x ) + µ fi x, ρ ) f i x )). ) We he ake codiioal expecaios wih respec o F, defied i Secio. Uless oherwise specified, we will simply wrie E[ ] isead of E[ F ] for hese codiioal expecaios he res of he proof. [ E x ] µ f i x, ρ ) zi x x µ x x, fx ) + 4L µ fx ) fx )) + σ µ x x µ fx ) fx ) + µ x x ) + 4L µ fx ) fx )) + σ µ κ ) = fx ) fx )) + σ µ µ, where we used E[ f i x )] = fx ) = 0, Lemma A., ad he µ-srog covexiy of f. Takig expecaios o he previous relaio o A yields [ E[A A ] = α A + α E x ] µ f i x, ρ ) zi α α ) E[ v ] Recursio o B. α A + α κ ) fx ) fx )) α α ) µ Separaely, we have x x = x x + α v = x x + α x x, v + α ) v E[ x x ] = x x α µ x x, fx ) + α ) E[ v ] x x α µ fx ) fx ) + µ x x ) + usig ha E[v ] = µ fx ) ad he srog covexiy of f. This gives E[B B ] α B α µ fx ) fx )) + E[ v ] + α σ µ. 3) α ) E[ v ], α ) E[ v ]. 4) 5

17 Recursio o C. If we cosider C = p A + B ad C = p A + B, combiig 3) ad 4) yields E[C C ] α C + α µ p κ ) )fx ) fx ))+ α α ) p α ) E[ v ]+ α p σ µ. If we ake p = α, ad if α ) is a decreasig sequece saisfyig 8), he he facors i fro of fx ) fx ) ad E[ v ] are o-posiive ad we ge E[C ] α ) C + α ) σ µ. Fially, sice α α, we have C C. Afer akig oal expecaios o F, we are lef wih he desired recursio. A. Proof of Theorem Covergece of C uder decreasig sep-sizes) Proof. Le us proceed by iducio. We have C 0 ν/γ + ) by defiiio of ν. For, E[C ] α ) α ) σ E[C ] + µ βˆ ) νˆ + β σ wih ˆ := γ + ) ˆ µ ) ˆ β = ν + β σ ˆ ˆ µ ) ) ˆ β = ν ν + β σ ˆ ˆ ˆ µ ) ˆ ν ν ˆ ˆ +, where he las wo iequaliies follow from he defiiio of ν ad from ˆ ˆ + )ˆ ). A.3 Proof of Theorem 3 Covergece i fucio values uder ierae averagig) Proof. From he proof of Proposiio, we have E[C ] α ) E[C ] + α µ α κ ) ) E[fx ) fx )] + α ) σ µ. The resul holds because he choice of sep-sizes α ) safisfies he assumpios of Proposiio. Wih our ew choice of sep-sizes, we have he sroger boud Afer rearragig, we obai α µ E[fx ) fx )] α κ ) 4. α ) E[C ] E[C ] + α ) σ µ. 5) 6

18 Dividig by α µ gives Muliplyig by γ + ) yields [ ) E[fx ) fx )] µ E[C ] ] E[C ] + 4 α σ α α µ = µ γ + ) E[C ] γ + ) E[C ]) + 8 σ γ + µ. γ + ) E[fx ) fx )] µ γ + )γ + ) E[C ] γ + )γ + ) E[C ]) + µ γ + )γ + ) E[C ] γ + )γ + ) E[C ]) + 8σ µ. 8γ + ) σ γ + µ By summig he above iequaliy from = o = T, we have a elescopig sum ha simplifies as follows: [ T ] E γ + )fx ) fx 8T σ )) µ γγ )C 0 γ + T )γ + T ) E[C T ]) + µ = µγγ )C 0 + 8T σ µ. Dividig by T = γ + ) = T γ + T T ))/ ad usig Jese s iequaliy o f x T ) gives he desired resul. B Proofs for Composie Objecives ad No-Uiform Samplig Secio 4) We recall here he updaes o he lower bouds d i he seig of his secio, which are aalogous o 5) bu wih o-uiform weighs ad sochasic perurbaios, ad will be useful he proofs: d α q ix) = i )d i x) + α q f i i x, ρ ) + f i x, ρ ), x x + µ x x ), if i = d i x), oherwise. 6) B. Proof of Lemma 4 Boud o he ieraes) Proof. Le F x) := f x)+hx), where f i 0x) = f i x, ρ i ) where ρ i is used i 5)), ad f aalogously o d i as follows: f α q x) = i )fi x) + α f q i i x, ρ ), if i = f i x), oherwise. is updaed By iducio, we have F x ) D x ) D x ) + µ x x, 7) where he las iequaliy follows from he µ-srog covexiy of D ad he fac ha x is is miimizer. 7

19 Agai by iducio, we ow show ha E[F x )] = F x ). Ideed, we have E[F 0 x )] = F x ) by cosrucio, he F x ) = F x ) + α q i f i x, ρ ) f x )) E[F x ) F ] = F x ) + α fx ) f i x )) = F x ) + α F x ) F x )), Afer akig oal expecaios ad usig he iducio hypohesis, we obai E[F x )] = F x ), ad he resul follows from 7). B. Proof of Proposiio 5 Recursio o Lyapuov fucio C q ) We begi by preseig a lemma ha plays a similar role o Lemma A. i our proof, bu cosiders he composie objecive ad akes io accou he ew srog covexiy ad o-uiformiy assumpios. Lemma B.. Le i q, where q is he samplig disribuio, ad ρ Γ. Uder assumpios A4), A5) ad A6) o he fucios f,..., f ad heir expecaios f,..., f, we have, for all x R p, [ ] E i,ρ q i ) f i x, ρ) µx f i x ) µx ) 4L q F x) F x )) + σq, wih L q = max i L i µ q i ad σ q = σ i i q. i Proof. Sice f i, ρ) is µ-srogly covex ad L i -smooh, we have ha f i, ρ) µ is covex ad L i µ)-smooh his is a sraighforward cosequece of Neserov, 004, Eq...9 ad..). The, we have f i x, ρ) µx f i x ) µx ) f i x, ρ) µx f i x, ρ) µx ) + f i x, ρ) f i x ) 4L i µ) fi x, ρ) µ x f i x, ρ) + µ x f ) i x, ρ) µx, x x + f i x, ρ) f i x ) = 4L i µ) fi x, ρ) f i x, ρ) f i x, ρ), x x µ x x ) + f i x, ρ) f i x ) 4L i µ) fi x, ρ) f i x, ρ) f i x, ρ), x x ) + f i x, ρ) f i x ). The firs iequaliy comes from he classical relaio u + v + u v = u + v. The secod iequaliy follows from he covexiy ad L i µ)-smoohess of f i, ρ) µ. Dividig by q i ) ad akig expecaios yields [ ] E i,ρ q i ) f i x, ρ) µx f i x ) µx ) 4 = 4 q i L i µ) q i ) f i x) f i x ) f i x ), x x ) + L i µ q i f ix) f i x ) f i x ), x x ) + 4L q fx) fx ) fx ), x x ) + σ q q i q i ) σ i σ i q i 4L q fx) fx ) + hx) hx )) + σ q = 4L q F x) F x )) + σ q, where he las iequaliy follows from he opimaliy of x, which implies ha fx ) hx ), ad i ur implies fx ), x x hx) hx ) by covexiy of h. 8

20 We ca ow proceed wih he proof of Proposiio 5. Proof. Defie he quaiies A = q i z i z i ad B = F x ) D x ). The proof successively describes recursios o A, B, ad eveually C we drop he superscrip i C q simpliciy), usig he same approach as for he proof of Proposiio. for Recursio o A. Usig similar echiques as he proof of Proposiio, we have A A = q i z zi ) q i z zi = q i α ) z i q i zi ) + α x µ ) ) q i f i x, ρ ) z i q i z zi = α q i ) z zi + α q i ) x µ f i x, ρ ) zi α q i ) α ) ) v i, q i where v i := x µ f i x, ρ ) z i. Takig codiioal expecaios w.r.. F ad usig Lemma B. o boud he secod erm yields E[A A ] α A + 4α L q µ F x ) F x )) + α σ q µ α α ) v i ) q i q i Recursio o B. We sar by usig a lemma from he proof of MISO-Prox Li e al., 05, Lemma D.4), which oly relies o he form of D ad he fac ha x miimizes i, ad hus holds i our seig: D x ) D x ) µ z z = D x ) We he expad D x ) usig 6) as follows: D x ) = D x ) + α = D x ) + α q i q i 8) µ α ) v q i ) i 9) fi x, ρ ) d x ) ) Afer akig codiioal expecios w.r.. F, 9) becomes fi x, ρ ) + hx ) d x ) hx ) ). E[D x )] D x ) + α F x ) D x )) µ Subracig F x ) ad rearragig yields E[B B ] α B α F x ) F x )) + µ α ) q i v i. α ) q i v i. 30) 9

21 Recursio o C. If we cosider C = µp A +B ad C = µp A +B, combiig 8) ad 30) yields E[C C ] α C + α p L q /µ )F x ) F x )) + µα δi q i v i + α p σq, 3) µ wih δ = α p α ). q i If we ake p = α, ad if α ) is a decreasig sequece saisfyig 8), he we obahe desired recursio afer oicig ha C C ad akig oal expecaios o F. Noe ha if we ake qmi α mi, µ }, 8L q he 3) yields E [ C q ] µ α ) [ C q E µ ] α µ F x ) F x )) + α ) σq µ. This relaio akes he same form as Eq. 5), hece i is sraighforward o adap he proof of Theorem 3 o his seig, ad he same ierae averagig scheme applies. C SGD Recursios Proposiio C. Simple SGD recursio wih variace a opimum). Uder assumpios A) ad A), if η /L, he he SGD recursio x := x η f i x, ρ ) saisfies where B := E[ x x ] ad σ o is such ha Proof. We have B µη )B + η σ o, E i,ρ [ fi x, ρ) ] σ o. x x = x x η f i x, ρ ), x x + η f i x, ρ ) x x η f i x, ρ ), x x + η f i x, ρ ) f i x, ρ ) + η f i x, ρ ) E [ x x ] F x x η fx ), x x + η [ E i,ρ fi x, ρ ) f i x, ρ ) ] + η [ E i,ρ fi x, ρ ) ] ) x x η fx ) fx ) + µ x x ) + 4Lη fx ) fx )) + η σ o = µη ) x x η Lη )fx ) fx )) + η σ o, where iequaliy ) follows from he srog covexiy of f ad E i,ρ [ f i x, ρ ) f i x, ρ ) ] is bouded by Lfx ) fx )) as he proof of Lemma A.. Whe η /L, he secod erm is o-posiive ad we obahe desired resul afer akig oal expecaios. 0

22 Noe ha whe η /4L, we have E [ x x ] µη ) E [ x x ] η fx ) fx )) + η σ o. This akes a similar form o Eq. 5), ad oe ca use he same ierae averagig scheme as Theorem 3 wih sep-sizes η = /µγ + ) by adapig he proof. We ow give a similar recursio for he proximal SGD algorihm see, e.g., Duchi ad Siger, 009). This allows us o apply he resuls of Theorem ad he sep-size sraegy meioed i Secio 3. Proposiio C. Simple recursio for proximal SGD wih variace a opimum). Uder assumpios A) ad A), if η /L, he he proximal SGD recursio x := prox ηhx η f i x, ρ )) saisfies where B := E[ x x ] ad σ o is such ha Proof. We have B µη )B + η σ o, E i,ρ [ fi x, ρ) fx ) ] σ o. x x = prox ηhx η f i x, ρ )) prox ηhx η fx )) x η f i x, ρ ) x + η fx ) = x x η f i x, ρ ) fx ), x x + η f i x, ρ ) fx ) x x η f i x, ρ ) fx ), x x + η f i x, ρ ) f i x, ρ ) + η f i x, ρ ) fx ), where he firs equaliy follows from he opimaliy of x ad he followig iequaliy follows from he o-expasiveess of proximal operaors. Takig codiioal expecaios o F yields E [ x x ] F x x η fx ) fx ), x x + η [ E i,ρ fi x, ρ ) f i x, ρ ) ] + η [ E i,ρ fi x, ρ ) fx ) ] ) x x η fx ) fx ) + µ ) x x fx ), x x + 4Lη fx ) fx ) fx ), x x ) + η σ o = µη ) x x η Lη )fx ) fx ) fx ), x x ) + η σ o, where iequaliy ) follows from he µ-srog covexiy of f ad E i,ρ [ f i x, ρ ) f i x, ρ ) ] is bouded by Lfx ) fx ) fx ), x x ) as he proof of Lemma B.. By covexiy of f, we have fx ) fx ) fx ), x x 0, hece he secod erm is o-posiive whe η /L. We coclude by akig oal expecaios. We oe ha Proposiios C. ad C. ca be easily adaped o o-uiform samplig wih samplig disribuio q ad sep-sizes η /q i, leadig o sep-size codiios η /L q, wih L q = max i L i q i ad variace σ q,o = E i,ρ [ q i) f i x, ρ) fx ) ]. D Averagig Experimes Figure 3 shows a compariso of S-MISO ad SGD wih he averagig scheme proposed i Theorem 3 see Secio C for commes o how i applies o SGD), o he breas cacer daase preseed i Secio 5, for differe values of he regularizaio µ ad hus of he codiio umber κ = L/µ), ad Dropou raes δ. We ca see ha he averagig scheme gives some small improvemes for boh mehods, ad ha he

23 0 - gee dropou, δ = 0.30 S-MISO η =. 0 S-MISO-AVG η =. 0 SGD η =. 0 SGD-AVG η = gee dropou, δ = 0.30 S-MISO η =. 0 S-MISO-AVG η =. 0 SGD η =. 0 SGD-AVG η = gee dropou, δ = µ = gee dropou, δ = µ = gee dropou, δ = gee dropou, δ = Figure 3: Compariso of S-MISO ad SGD wih averagig, for differe codiioig corolled by µ) ad differe Dropou raes δ. We begi sep-size decay ad averagig a epoch 3 op) ad 30 boom). improvemes are more sigifica whe he problem is more ill-codiioed Figure 3, boom). We oe ha he ime a which we sar averagig ca have sigifica impac o he covergece, i paricular, sarig oo early ca sigificaly slow dow he iiial covergece, as commoly oiced for sochasic gradie mehods see, e.g., Nemirovski e al., 009).

Supplement for SADAGRAD: Strongly Adaptive Stochastic Gradient Methods"

Supplement for SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Suppleme for SADAGRAD: Srogly Adapive Sochasic Gradie Mehods" Zaiyi Che * 1 Yi Xu * Ehog Che 1 iabao Yag 1. Proof of Proposiio 1 Proposiio 1. Le ɛ > 0 be fixed, H 0 γi, γ g, EF (w 1 ) F (w ) ɛ 0 ad ieraio

More information

Big O Notation for Time Complexity of Algorithms

Big O Notation for Time Complexity of Algorithms BRONX COMMUNITY COLLEGE of he Ciy Uiversiy of New York DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE CSI 33 Secio E01 Hadou 1 Fall 2014 Sepember 3, 2014 Big O Noaio for Time Complexiy of Algorihms Time

More information

Math 6710, Fall 2016 Final Exam Solutions

Math 6710, Fall 2016 Final Exam Solutions Mah 67, Fall 6 Fial Exam Soluios. Firs, a sude poied ou a suble hig: if P (X i p >, he X + + X (X + + X / ( evaluaes o / wih probabiliy p >. This is roublesome because a radom variable is supposed o be

More information

BEST LINEAR FORECASTS VS. BEST POSSIBLE FORECASTS

BEST LINEAR FORECASTS VS. BEST POSSIBLE FORECASTS BEST LINEAR FORECASTS VS. BEST POSSIBLE FORECASTS Opimal ear Forecasig Alhough we have o meioed hem explicily so far i he course, here are geeral saisical priciples for derivig he bes liear forecas, ad

More information

1 Notes on Little s Law (l = λw)

1 Notes on Little s Law (l = λw) Copyrigh c 26 by Karl Sigma Noes o Lile s Law (l λw) We cosider here a famous ad very useful law i queueig heory called Lile s Law, also kow as l λw, which assers ha he ime average umber of cusomers i

More information

Extremal graph theory II: K t and K t,t

Extremal graph theory II: K t and K t,t Exremal graph heory II: K ad K, Lecure Graph Theory 06 EPFL Frak de Zeeuw I his lecure, we geeralize he wo mai heorems from he las lecure, from riagles K 3 o complee graphs K, ad from squares K, o complee

More information

STK4080/9080 Survival and event history analysis

STK4080/9080 Survival and event history analysis STK48/98 Survival ad eve hisory aalysis Marigales i discree ime Cosider a sochasic process The process M is a marigale if Lecure 3: Marigales ad oher sochasic processes i discree ime (recap) where (formally

More information

Ideal Amplifier/Attenuator. Memoryless. where k is some real constant. Integrator. System with memory

Ideal Amplifier/Attenuator. Memoryless. where k is some real constant. Integrator. System with memory Liear Time-Ivaria Sysems (LTI Sysems) Oulie Basic Sysem Properies Memoryless ad sysems wih memory (saic or dyamic) Causal ad o-causal sysems (Causaliy) Liear ad o-liear sysems (Lieariy) Sable ad o-sable

More information

FIXED FUZZY POINT THEOREMS IN FUZZY METRIC SPACE

FIXED FUZZY POINT THEOREMS IN FUZZY METRIC SPACE Mohia & Samaa, Vol. 1, No. II, December, 016, pp 34-49. ORIGINAL RESEARCH ARTICLE OPEN ACCESS FIED FUZZY POINT THEOREMS IN FUZZY METRIC SPACE 1 Mohia S. *, Samaa T. K. 1 Deparme of Mahemaics, Sudhir Memorial

More information

ODEs II, Supplement to Lectures 6 & 7: The Jordan Normal Form: Solving Autonomous, Homogeneous Linear Systems. April 2, 2003

ODEs II, Supplement to Lectures 6 & 7: The Jordan Normal Form: Solving Autonomous, Homogeneous Linear Systems. April 2, 2003 ODEs II, Suppleme o Lecures 6 & 7: The Jorda Normal Form: Solvig Auoomous, Homogeeous Liear Sysems April 2, 23 I his oe, we describe he Jorda ormal form of a marix ad use i o solve a geeral homogeeous

More information

Online Supplement to Reactive Tabu Search in a Team-Learning Problem

Online Supplement to Reactive Tabu Search in a Team-Learning Problem Olie Suppleme o Reacive abu Search i a eam-learig Problem Yueli She School of Ieraioal Busiess Admiisraio, Shaghai Uiversiy of Fiace ad Ecoomics, Shaghai 00433, People s Republic of Chia, she.yueli@mail.shufe.edu.c

More information

Lecture 15 First Properties of the Brownian Motion

Lecture 15 First Properties of the Brownian Motion Lecure 15: Firs Properies 1 of 8 Course: Theory of Probabiliy II Term: Sprig 2015 Isrucor: Gorda Zikovic Lecure 15 Firs Properies of he Browia Moio This lecure deals wih some of he more immediae properies

More information

λiv Av = 0 or ( λi Av ) = 0. In order for a vector v to be an eigenvector, it must be in the kernel of λi

λiv Av = 0 or ( λi Av ) = 0. In order for a vector v to be an eigenvector, it must be in the kernel of λi Liear lgebra Lecure #9 Noes This week s lecure focuses o wha migh be called he srucural aalysis of liear rasformaios Wha are he irisic properies of a liear rasformaio? re here ay fixed direcios? The discussio

More information

Some Properties of Semi-E-Convex Function and Semi-E-Convex Programming*

Some Properties of Semi-E-Convex Function and Semi-E-Convex Programming* The Eighh Ieraioal Symposium o Operaios esearch ad Is Applicaios (ISOA 9) Zhagjiajie Chia Sepember 2 22 29 Copyrigh 29 OSC & APOC pp 33 39 Some Properies of Semi-E-Covex Fucio ad Semi-E-Covex Programmig*

More information

Lecture 9: Polynomial Approximations

Lecture 9: Polynomial Approximations CS 70: Complexiy Theory /6/009 Lecure 9: Polyomial Approximaios Isrucor: Dieer va Melkebeek Scribe: Phil Rydzewski & Piramaayagam Arumuga Naiar Las ime, we proved ha o cosa deph circui ca evaluae he pariy

More information

COS 522: Complexity Theory : Boaz Barak Handout 10: Parallel Repetition Lemma

COS 522: Complexity Theory : Boaz Barak Handout 10: Parallel Repetition Lemma COS 522: Complexiy Theory : Boaz Barak Hadou 0: Parallel Repeiio Lemma Readig: () A Parallel Repeiio Theorem / Ra Raz (available o his websie) (2) Parallel Repeiio: Simplificaios ad he No-Sigallig Case

More information

F D D D D F. smoothed value of the data including Y t the most recent data.
