SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives


SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

Aaron Defazio, Francis Bach, Simon Lacoste-Julien

To cite this version: Aaron Defazio, Francis Bach, Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. Advances In Neural Information Processing Systems, Nov 2014, Montreal, Canada. HAL Id: hal, version 3, submitted on 12 Nov 2014.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives

Aaron Defazio
Ambiata, Australian National University, Canberra

Francis Bach
INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France

Simon Lacoste-Julien
INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France

(The first author completed this work while under funding from NICTA. This work was partially supported by the MSR-Inria Joint Centre and a grant by the European Research Council (SIERRA project).)

Abstract

In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.

1 Introduction

Remarkably, recent advances [1, 2] have shown that it is possible to minimise strongly convex finite sums provably faster in expectation than is possible without the finite sum structure. This is significant for machine learning problems as a finite sum structure is common in the empirical risk minimisation setting. The requirement of strong convexity is likewise satisfied in machine learning problems in the typical case where a quadratic regulariser is used.

In particular, we are interested in minimising functions of the form

    f(x) = (1/n) Σ_{i=1}^{n} f_i(x),

where x ∈ R^d, each f_i is convex and has Lipschitz continuous derivatives with constant L. We will also consider the case where each f_i is strongly convex with constant µ, and the composite (or proximal) case where an additional regularisation function is added:

    F(x) = f(x) + h(x),

where h: R^d → R is convex but potentially non-differentiable, and where the proximal operation of h is easy to compute; few incremental gradient methods are applicable in this setting [3, 4].

Our contributions are as follows. In Section 2 we describe the SAGA algorithm, a novel incremental gradient method. In Section 5 we prove theoretical convergence rates for SAGA in the strongly convex case better than those for SAG [1] and SVRG [5], and a factor of 2 from the SDCA [2] convergence rates. These rates also hold in the composite setting.

Additionally, we show that, like SAG but unlike SDCA, our method is applicable to non-strongly convex problems without modification. We establish theoretical convergence rates for this case also. In Section 3 we discuss the relation between each of the fast incremental gradient methods, showing that each stems from a very small modification of another.

2 SAGA Algorithm

We start with some known initial vector x^0 ∈ R^d and known derivatives f_i'(φ_i^0) ∈ R^d with φ_i^0 = x^0 for each i. These derivatives are stored in a table data-structure of length n, or alternatively an n × d matrix. For many problems of interest, such as binary classification and least-squares, only a single floating point value instead of a full gradient vector needs to be stored (see Section 4). SAGA is inspired both from SAG [1] and SVRG [5] (as we will discuss in Section 3). SAGA uses a step size of γ and makes the following updates, starting with k = 0:

SAGA Algorithm: Given the value of x^k and of each f_i'(φ_i^k) at the end of iteration k, the updates for iteration k+1 are as follows:

1. Pick a j uniformly at random.
2. Take φ_j^{k+1} = x^k, and store f_j'(φ_j^{k+1}) in the table. All other entries in the table remain unchanged. The quantity φ_j^{k+1} is not explicitly stored.
3. Update x using f_j'(φ_j^{k+1}), f_j'(φ_j^k) and the table average:

    w^{k+1} = x^k − γ [ f_j'(φ_j^{k+1}) − f_j'(φ_j^k) + (1/n) Σ_{i=1}^n f_i'(φ_i^k) ],    (1)

    x^{k+1} = prox_γ^h ( w^{k+1} ).    (2)

The proximal operator we use above is defined as

    prox_γ^h (y) := argmin_{x ∈ R^d} { h(x) + (1/(2γ)) ||x − y||^2 }.    (3)

In the strongly convex case, when a step size of γ = 1/(2(µn + L)) is chosen, we have the following convergence rate in the composite and hence also in the non-composite case:

    E||x^k − x^*||^2 ≤ ( 1 − µ/(2(µn + L)) )^k [ ||x^0 − x^*||^2 + (n/(µn + L)) ( f(x^0) − ⟨f'(x^*), x^0 − x^*⟩ − f(x^*) ) ].

We prove this result in Section 5. The requirement of strong convexity can be relaxed from needing to hold for each f_i to just holding on average, but at the expense of a worse geometric rate (1 − µ/(6(µn + L))), requiring a step size of γ = 1/(3(µn + L)).

In the non-strongly convex case, we have established the convergence rate in terms of the average iterate, excluding step 0: x̄^k = (1/k) Σ_{t=1}^k x^t. Using a step size of γ = 1/(3L) we have

    E[ F(x̄^k) ] − F(x^*) ≤ (4n/k) [ (2L/n) ||x^0 − x^*||^2 + f(x^0) − ⟨f'(x^*), x^0 − x^*⟩ − f(x^*) ].

This result is proved in the supplementary material. Importantly, when this step size γ = 1/(3L) is used, our algorithm automatically adapts to the level of strong convexity µ > 0 naturally present, giving a convergence rate of (see the comment at the end of the proof of Theorem 1):

    E||x^k − x^*||^2 ≤ ( 1 − min{ 1/(4n), µ/(3L) } )^k [ ||x^0 − x^*||^2 + (2n/(3L)) ( f(x^0) − ⟨f'(x^*), x^0 − x^*⟩ − f(x^*) ) ].

Although any incremental gradient method can be applied to non-strongly convex problems via the addition of a small quadratic regularisation, the amount of regularisation is an additional tunable parameter which our method avoids.
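To make the update concrete, the following is a minimal NumPy sketch of steps 1-3 and equations (1)-(2) for dense gradients. It is an illustrative reading of the algorithm rather than the authors' implementation, and the names grad_i, prox_h and n_steps are ours; the table average is maintained incrementally so each step costs O(d) plus one prox evaluation.

    import numpy as np

    def saga(grad_i, prox_h, x0, n, gamma, n_steps, rng=np.random.default_rng(0)):
        # grad_i(i, x) returns f_i'(x); prox_h(gamma, y) returns prox_gamma^h(y).
        x = x0.copy()
        table = np.array([grad_i(i, x0) for i in range(n)])   # stored f_i'(phi_i), phi_i^0 = x^0
        table_avg = table.mean(axis=0)
        for _ in range(n_steps):
            j = rng.integers(n)                               # step 1
            g_new = grad_i(j, x)                              # step 2: f_j'(phi_j^{k+1}), phi_j^{k+1} = x^k
            w = x - gamma * (g_new - table[j] + table_avg)    # equation (1)
            x = prox_h(gamma, w)                              # equation (2)
            table_avg += (g_new - table[j]) / n               # keep the table average current
            table[j] = g_new
        return x

For an L1 regulariser h(x) = lam * ||x||_1 (lam being a chosen constant of ours), prox_h is the usual soft-thresholding, e.g. prox_h = lambda gamma, y: np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0).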

3 Related Work

We explore the relationship between SAGA and the other fast incremental gradient methods in this section. By using SAGA as a midpoint, we are able to provide a more unified view than is available in the existing literature. A brief summary of the properties of each method considered in this section is given in Figure 1. The method from [3], which handles the non-composite setting, is not listed as its rate is of the slow type and can be up to n times smaller than the one for SAGA or SVRG [5].

[Figure 1: Basic summary of method properties. Columns: SAGA, SAG, SDCA, SVRG, FINITO. Rows: Strongly Convex (SC); Convex, Non-SC*; Prox Reg.; Non-smooth; Low Storage Cost; Simple(-ish) Proof; Adaptive to SC. Question marks denote unproven, but not experimentally ruled out, cases. (*) Note that any method can be applied to non-strongly convex problems by adding a small amount of L2 regularisation; this row describes methods that do not require this trick.]

SAGA: midpoint between SAG and SVRG/S2GD

In [5], the authors make the observation that the variance of the standard stochastic gradient (SGD) update direction can only go to zero if decreasing step sizes are used, thus preventing a linear convergence rate unlike for batch gradient descent. They thus propose to use a variance reduction approach (see [7] and references therein for example) on the SGD update in order to be able to use constant step sizes and get a linear convergence rate. We present the updates of their method called SVRG (Stochastic Variance Reduced Gradient) in (6) below, comparing it with the non-composite form of SAGA rewritten in (5). They also mention that SAG (Stochastic Average Gradient) [1] can be interpreted as reducing the variance, though they do not provide the specifics. Here, we make this connection clearer and relate it to SAGA.

We first review a slightly more generalized version of the variance reduction approach (we allow the updates to be biased). Suppose that we want to use Monte Carlo samples to estimate E[X] and that we can compute efficiently E[Y] for another random variable Y that is highly correlated with X. One variance reduction approach is to use the following estimator θ_α as an approximation to E[X]: θ_α := α(X − Y) + E[Y], for a step size α ∈ [0, 1]. We have that E[θ_α] is a convex combination of E[X] and E[Y]: E[θ_α] = αE[X] + (1 − α)E[Y]. The standard variance reduction approach uses α = 1 and the estimate is unbiased: E[θ_1] = E[X]. The variance of θ_α is:

    Var(θ_α) = α^2 [ Var(X) + Var(Y) − 2 Cov(X, Y) ],

and so if Cov(X, Y) is big enough, the variance of θ_α is reduced compared to X, giving the method its name. By varying α from 0 to 1, we increase the variance of θ_α towards its maximum value (which usually is still smaller than the one for X) while decreasing its bias towards zero.

Both SAGA and SAG can be derived from such a variance reduction viewpoint: here X is the SGD direction sample f_j'(x^k), whereas Y is a past stored gradient f_j'(φ_j^k). SAG is obtained by using α = 1/n (update rewritten in our notation in (4)), whereas SAGA is the unbiased version with α = 1 (see (5) below). For the same φ's, the variance of the SAG update is 1/n^2 times the one of SAGA, but at the expense of having a non-zero bias. This non-zero bias might explain the complexity of the convergence proof of SAG and why the theory has not yet been extended to proximal operators. By using an unbiased update in SAGA, we are able to obtain a simple and tight theory, with better constants than SAG, as well as theoretical rates for the use of proximal operators.

    (SAG)    x^{k+1} = x^k − γ [ (f_j'(x^k) − f_j'(φ_j^k))/n + (1/n) Σ_{i=1}^n f_i'(φ_i^k) ],    (4)
    (SAGA)   x^{k+1} = x^k − γ [ f_j'(x^k) − f_j'(φ_j^k) + (1/n) Σ_{i=1}^n f_i'(φ_i^k) ],    (5)
    (SVRG)   x^{k+1} = x^k − γ [ f_j'(x^k) − f_j'(x̃) + (1/n) Σ_{i=1}^n f_i'(x̃) ].    (6)

The SVRG update (6) is obtained by using Y = f_j'(x̃) with α = 1 (and is thus unbiased; we note that SAG is the only method that we present in the related work that has a biased update direction). The vector x̃ is not updated every step, but rather the loop over k appears inside an outer loop, where x̃ is updated at the start of each outer iteration. Essentially SAGA is at the midpoint between SVRG and SAG; it updates the φ_j value each time index j is picked, whereas SVRG updates all of the φ's as a batch. The S2GD method [8] has the same update as SVRG, just differing in how the number of inner loop iterations is chosen. We use SVRG henceforth to refer to both methods.
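As a quick illustration of the biased estimator θ_α described above, the following small Monte Carlo check (our own, not from the paper; the variables and constants are arbitrary) shows the bias growing and the variance shrinking as α decreases from 1.

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.normal(size=200_000)
    X = z + 1.0 + rng.normal(scale=0.3, size=z.size)   # E[X] = 1, Var(X) about 1.09
    Y = z                                              # highly correlated with X, E[Y] = 0 known exactly
    for alpha in (1.0, 0.5, 0.1):
        theta = alpha * (X - Y) + 0.0                  # theta_alpha = alpha*(X - Y) + E[Y]
        # the mean drifts from E[X] = 1 towards E[Y] = 0; the variance scales as alpha^2 * Var(X - Y)
        print(alpha, theta.mean(), theta.var())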

SVRG makes a trade-off between time and space. For the equivalent practical convergence rate it makes 2x-3x more gradient evaluations, but in doing so it does not need to store a table of gradients, only a single average gradient. The usage of SAG vs. SVRG is problem dependent. For example for linear predictors where gradients can be stored as a reduced vector of dimension p−1 for p classes, SAGA is preferred over SVRG both theoretically and in practice. For neural networks, where no theory is available for either method, the storage of gradients is generally more expensive than the additional backpropagations, but this is computer architecture dependent.

SVRG also has an additional parameter besides step size that needs to be set, namely the number of iterations per inner loop (m). This parameter can be set via the theory, or conservatively as m = n, however doing so does not give anywhere near the best practical performance. Having to tune one parameter instead of two is a practical advantage for SAGA.

Finito/MISOµ

To make the relationship with other prior methods more apparent, we can rewrite the SAGA algorithm (in the non-composite case) in terms of an additional intermediate quantity u^k, with u^0 := x^0 + γ Σ_{i=1}^n f_i'(x^0), in addition to the usual x^k iterate as described previously:

SAGA: Equivalent reformulation for non-composite case: Given the value of u^k and of each f_i'(φ_i^k) at the end of iteration k, the updates for iteration k+1 are as follows:

1. Calculate x^k:

    x^k = u^k − γ Σ_{i=1}^n f_i'(φ_i^k).    (7)

2. Update u with u^{k+1} = u^k + (1/n)(x^k − u^k).
3. Pick a j uniformly at random.
4. Take φ_j^{k+1} = x^k, and store f_j'(φ_j^{k+1}) in the table replacing f_j'(φ_j^k). All other entries in the table remain unchanged. The quantity φ_j^{k+1} is not explicitly stored.

Eliminating u^k recovers the update (5) for x^k. We now describe how the Finito [9] and MISOµ [10] methods are closely related to SAGA. Both Finito and MISOµ use updates of the following form, for a step length γ:

    x^{k+1} = (1/n) Σ_{i=1}^n φ_i^k − γ Σ_{i=1}^n f_i'(φ_i^k).    (8)

The step size used is of the order of 1/(µn). To simplify the discussion of this algorithm we will introduce the notation φ̄ = (1/n) Σ_i φ_i^k.

SAGA can be interpreted as Finito, but with the quantity φ̄ replaced with u, which is updated in the same way as φ̄, but in expectation. To see this, consider how φ̄ changes in expectation:

    E[ φ̄^{k+1} ] = E[ φ̄^k + (1/n)(x^k − φ_j^k) ] = φ̄^k + (1/n)(x^k − φ̄^k).

The update is identical in expectation to the update for u, u^{k+1} = u^k + (1/n)(x^k − u^k).

There are three advantages of SAGA over Finito/MISOµ. SAGA does not require strong convexity to work, it has support for proximal operators, and it does not require storing the φ values. MISO has proven support for proximal operators only in the case where impractically small step sizes are used [10]. The big advantage of Finito/MISOµ is that when using a per-pass re-permuted access ordering, empirical speed-ups of up to a factor of 2x have been observed. This access order can also be used with the other methods discussed, but with smaller empirical speed-ups. Finito/MISOµ is particularly useful when f_i is computationally expensive to compute compared to the extra storage costs required over the other methods.
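The reformulation above can be transcribed directly. The following is a small sketch (ours, with illustrative names) of steps 1-4 and equation (7); eliminating u recovers the direct update (5), as noted above. The full table sum is recomputed each step for clarity rather than efficiency.

    import numpy as np

    def saga_u_form(grad_i, x0, n, gamma, n_steps, rng=np.random.default_rng(0)):
        table = np.array([grad_i(i, x0) for i in range(n)])   # f_i'(phi_i^0) with phi_i^0 = x^0
        u = x0 + gamma * table.sum(axis=0)                    # u^0 := x^0 + gamma * sum_i f_i'(x^0)
        x = x0.copy()
        for _ in range(n_steps):
            x = u - gamma * table.sum(axis=0)                 # step 1, equation (7)
            u = u + (x - u) / n                               # step 2
            j = rng.integers(n)                               # step 3
            table[j] = grad_i(j, x)                           # step 4
        # x holds the iterate computed in the last completed iteration
        return x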

SDCA

The Stochastic Dual Coordinate Ascent (SDCA) [2] method on the surface appears quite different from the other methods considered. It works with the convex conjugates of the f_i functions. However, in this section we show a novel transformation of SDCA to an equivalent method that only works with primal quantities, and is closely related to the MISOµ method.

Consider the following algorithm:

SDCA algorithm in the primal. Step k+1:

1. Pick an index j uniformly at random.
2. Compute φ_j^{k+1} = prox_γ^{f_j}(z), where γ = 1/(µn) and z = −γ Σ_{i≠j} f_i'(φ_i^k).
3. Store the gradient f_j'(φ_j^{k+1}) = (1/γ)(z − φ_j^{k+1}) in the table at location j. For i ≠ j, the table entries are unchanged (f_i'(φ_i^{k+1}) = f_i'(φ_i^k)).

At completion, return x^k = −γ Σ_i f_i'(φ_i^k).

We claim that this algorithm is equivalent to the version of SDCA where exact block-coordinate maximisation is used on the dual (more precisely, to Option I of Prox-SDCA as described in [11, Figure 1]; we will simply refer to this method as SDCA in this paper for brevity). Firstly, note that while SDCA was originally described for one-dimensional outputs (binary classification or regression), it has been expanded to cover the multiclass predictor case [11] (called Prox-SDCA there). In this case, the primal objective has a separate strongly convex regulariser, and the functions f_i are restricted to the form f_i(x) := ψ_i(X_i^T x), where X_i is a d × p feature matrix, and ψ_i is the loss function that takes a p-dimensional input, for p classes. To stay in the same general setting as the other incremental gradient methods, we work directly with the f_i(x) functions rather than the more structured ψ_i(X_i^T x). The dual objective to maximise then becomes

    D(α) = − (µ/2) || (1/(µn)) Σ_{i=1}^n α_i ||^2 − (1/n) Σ_{i=1}^n f_i^*(−α_i),

where the α_i's are d-dimensional dual variables. Generalising the exact block-coordinate maximisation update that SDCA performs to this form, we get the dual update for block j (with x^k the current primal iterate):

    α_j^{k+1} = α_j^k + argmax_{Δα_j ∈ R^d} { − f_j^*( −(α_j^k + Δα_j) ) − (µn/2) || x^k + (1/(µn)) Δα_j ||^2 }.    (9)

In the special case where f_i(x) = ψ_i(X_i^T x), we can see that (9) gives exactly the same update as Option I of Prox-SDCA [11, Figure 1], which operates instead on the equivalent p-dimensional dual variables α̃_i, with the relationship that α_i = X_i α̃_i (this is because f_i^*(α_i) = inf over α̃_i such that α_i = X_i α̃_i of ψ_i^*(α̃_i)). As noted by Shalev-Shwartz & Zhang [11], the update (9) is actually an instance of the proximal operator of the convex conjugate of f_j. Our primal formulation exploits this fact by using a relation between the proximal operator of a function and its convex conjugate known as the Moreau decomposition:

    prox_{f^*}(v) = v − prox_f(v).

This decomposition allows us to compute the proximal operator of the conjugate via the primal proximal operator. As this is the only use in the basic SDCA method of the conjugate function, applying this decomposition allows us to completely eliminate the dual aspect of the algorithm, yielding the above primal form of SDCA. The dual variables are related to the primal representatives φ_i's through α_i = −f_i'(φ_i). The KKT conditions ensure that if the α_i values are dual optimal then x^k = γ Σ_i α_i as defined above is primal optimal. The same trick is commonly used to interpret Dijkstra's set intersection as a primal algorithm instead of a dual block coordinate descent algorithm [12].

The primal form of SDCA differs from the other incremental gradient methods described in this section in that it assumes strong convexity is induced by a separate strongly convex regulariser, rather than each f_i being strongly convex. In fact, SDCA can be modified to work without a separate regulariser, giving a method that is at the midpoint between Finito and SDCA. We detail such a method in the supplementary material.
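The primal form of SDCA above is straightforward to transcribe once a prox oracle for each f_j is available. The following is a minimal sketch (ours; prox_fj, g_table and the other names are illustrative), keeping the stored gradients f_i'(φ_i) and their running sum.

    import numpy as np

    def sdca_primal(prox_fj, g_table, mu, n, n_steps, rng=np.random.default_rng(0)):
        # g_table[i] holds f_i'(phi_i); prox_fj(gamma, z, j) returns prox_gamma^{f_j}(z).
        gamma = 1.0 / (mu * n)
        g_sum = g_table.sum(axis=0)
        for _ in range(n_steps):
            j = rng.integers(n)                    # step 1
            z = -gamma * (g_sum - g_table[j])      # z = -gamma * sum_{i != j} f_i'(phi_i^k)
            phi_j = prox_fj(gamma, z, j)           # step 2
            g_new = (z - phi_j) / gamma            # step 3: stored gradient f_j'(phi_j^{k+1})
            g_sum += g_new - g_table[j]
            g_table[j] = g_new
        return -gamma * g_sum                      # x^k = -gamma * sum_i f_i'(phi_i^k)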

SDCA variants

The SDCA theory has been expanded to cover a number of other methods of performing the coordinate step [11]. These variants replace the proximal operation in our primal interpretation in the previous section with an update where φ_j^{k+1} is chosen so that:

    f_j'(φ_j^{k+1}) = (1 − β) f_j'(φ_j^k) + β f_j'(x^k),    where x^k = −(1/(µn)) Σ_i f_i'(φ_i^k).

The variants differ in how β ∈ [0, 1] is chosen. Note that φ_j^{k+1} does not actually have to be explicitly known, just the gradient f_j'(φ_j^{k+1}), which is the result of the above interpolation. Variant 5 by Shalev-Shwartz & Zhang [11] does not require operations on the conjugate function, it simply uses β = µn/(L + µn). The most practical variant performs a line search involving the convex conjugate to determine β. As far as we are aware, there is no simple primal equivalent of this line search. So in cases where we cannot compute the proximal operator from the standard SDCA variant, we can either introduce a tuneable parameter into the algorithm (β), or use a dual line search, which requires an efficient way to evaluate the convex conjugates of each f_i.

4 Implementation

We briefly discuss some implementation concerns:

- For many problems each derivative f_i' is just a simple weighting of the i-th data vector. Logistic regression and least squares have this property. In such cases, instead of storing the full derivative f_i' for each i, we need only to store the weighting constants. This reduces the storage requirements to be the same as the SDCA method in practice. A similar trick can be applied to multi-class classifiers with p classes by storing p−1 values for each i (a sketch of the binary logistic regression case is given below).
- Our algorithm assumes that initial gradients are known for each f_i at the starting point x^0. Instead, a heuristic may be used where during the first pass, data-points are introduced one-by-one, in a non-randomized order, with averages computed in terms of those data-points processed so far. This procedure has been successfully used with SAG [1].
- The SAGA update as stated is slower than necessary when derivatives are sparse. A just-in-time updating of u or x may be performed just as is suggested for SAG [1], which ensures that only sparse updates are done at each iteration.
- We give the form of SAGA for the case where each f_i is strongly convex. However in practice we usually have only convex f_i, with strong convexity in f induced by the addition of a quadratic regulariser. This quadratic regulariser may be split amongst the f_i functions evenly, to satisfy our assumptions. It is perhaps easier to use a variant of SAGA where the regulariser (µ/2)||x||^2 is explicit, such as the following modification of Equation (5):

      x^{k+1} = (1 − γµ) x^k − γ [ f_j'(x^k) − f_j'(φ_j^k) + (1/n) Σ_i f_i'(φ_i^k) ].

  For sparse implementations, instead of scaling x^k at each step, a separate scaling constant β^k may be scaled instead, with β^k x^k being used in place of x^k. This is a standard trick used with stochastic gradient methods.

For sparse problems with a quadratic regulariser the just-in-time updating can be a little intricate. In the supplementary material we provide example python code showing a correct implementation that uses each of the above tricks.
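As an illustration of the first and last points above, the following sketch (ours; dense updates for brevity, illustrative names) runs SAGA on l2-regularised logistic regression with labels in {−1, +1}, storing one scalar per data point instead of a full gradient vector and using the explicit-regulariser form of the update.

    import numpy as np

    def saga_logistic(A, b, mu, gamma, n_epochs, rng=np.random.default_rng(0)):
        n, d = A.shape
        x = np.zeros(d)
        # f_i(x) = log(1 + exp(-b_i a_i^T x)), so f_i'(x) = s_i(x) * a_i with the scalar
        # weight s_i(x) = -b_i / (1 + exp(b_i a_i^T x)); only the scalars s_i are stored.
        s = -0.5 * b                                   # s_i at phi_i^0 = 0
        g_avg = A.T @ s / n                            # table average (1/n) sum_i s_i a_i
        for _ in range(n_epochs * n):
            j = rng.integers(n)
            s_new = -b[j] / (1.0 + np.exp(b[j] * (A[j] @ x)))
            direction = (s_new - s[j]) * A[j] + g_avg
            x = (1.0 - gamma * mu) * x - gamma * direction   # explicit-regulariser variant of (5)
            g_avg += (s_new - s[j]) * A[j] / n
            s[j] = s_new
        return x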

5 Theory

In this section, all expectations are taken with respect to the choice of j at iteration k+1 and conditioned on x^k and each f_i'(φ_i^k) unless stated otherwise. We start with two basic lemmas that just state properties of convex functions, followed by Lemma 3, which is specific to our algorithm. The proofs of each of these lemmas are in the supplementary material.

Lemma 1. Let f(x) = (1/n) Σ_{i=1}^n f_i(x). Suppose each f_i is µ-strongly convex and has Lipschitz continuous gradients with constant L. Then for all x and x^*:

    ⟨f'(x), x^* − x⟩ ≤ ((L − µ)/L) [ f(x^*) − f(x) ] − (µ/2) ||x^* − x||^2
                      − (1/(2Ln)) Σ_i ||f_i'(x^*) − f_i'(x)||^2 − (µ/L) ⟨f'(x^*), x − x^*⟩.

Lemma 2. We have that for all φ_i and x^*:

    (1/n) Σ_i ||f_i'(φ_i) − f_i'(x^*)||^2 ≤ 2L [ (1/n) Σ_i f_i(φ_i) − f(x^*) − (1/n) Σ_i ⟨f_i'(x^*), φ_i − x^*⟩ ].

Lemma 3. It holds that for any φ_i^k, x^*, x^k and β > 0, with w^{k+1} as defined in Equation (1):

    E||w^{k+1} − x^k + γ f'(x^*)||^2 ≤ γ^2 (1 + β^{-1}) E||f_j'(φ_j^k) − f_j'(x^*)||^2
        + γ^2 (1 + β) E||f_j'(x^k) − f_j'(x^*)||^2 − γ^2 β ||f'(x^k) − f'(x^*)||^2.

Theorem 1. With x^* the optimal solution, define the Lyapunov function T as:

    T^k := T(x^k, {φ_i^k}_{i=1}^n) := (1/n) Σ_i f_i(φ_i^k) − f(x^*) − (1/n) Σ_i ⟨f_i'(x^*), φ_i^k − x^*⟩ + c ||x^k − x^*||^2.

Then with γ = 1/(2(µn + L)), c = 1/(2γ(1 − γµ)n), and κ = 1/(γµ), we have the following expected change in the Lyapunov function between steps of the SAGA algorithm (conditional on T^k):

    E[T^{k+1}] ≤ (1 − 1/κ) T^k.

Proof. The first three terms in T^{k+1} are straightforward to simplify:

    E[ (1/n) Σ_i f_i(φ_i^{k+1}) ] = (1/n) f(x^k) + (1 − 1/n) (1/n) Σ_i f_i(φ_i^k),

    E[ − (1/n) Σ_i ⟨f_i'(x^*), φ_i^{k+1} − x^*⟩ ] = − (1/n) ⟨f'(x^*), x^k − x^*⟩ − (1 − 1/n) (1/n) Σ_i ⟨f_i'(x^*), φ_i^k − x^*⟩.

For the change in the last term of T^{k+1}, we apply the non-expansiveness of the proximal operator (the first inequality below is the only place in the proof where we use the fact that x^* is an optimality point):

    c ||x^{k+1} − x^*||^2 = c ||prox_γ(w^{k+1}) − prox_γ(x^* − γ f'(x^*))||^2 ≤ c ||w^{k+1} − x^* + γ f'(x^*)||^2.

We expand the quadratic and apply E[w^{k+1}] = x^k − γ f'(x^k) to simplify the inner product term:

    c E||w^{k+1} − x^* + γ f'(x^*)||^2
      = c E||x^k − x^* + w^{k+1} − x^k + γ f'(x^*)||^2
      = c ||x^k − x^*||^2 + 2c E⟨w^{k+1} − x^k + γ f'(x^*), x^k − x^*⟩ + c E||w^{k+1} − x^k + γ f'(x^*)||^2
      = c ||x^k − x^*||^2 − 2cγ ⟨f'(x^k) − f'(x^*), x^k − x^*⟩ + c E||w^{k+1} − x^k + γ f'(x^*)||^2
      ≤ c ||x^k − x^*||^2 − 2cγ ⟨f'(x^k), x^k − x^*⟩ + 2cγ ⟨f'(x^*), x^k − x^*⟩ − cγ^2 β ||f'(x^k) − f'(x^*)||^2
        + (1 + β^{-1}) cγ^2 E||f_j'(φ_j^k) − f_j'(x^*)||^2 + (1 + β) cγ^2 E||f_j'(x^k) − f_j'(x^*)||^2.    (Lemma 3)

The value of β shall be fixed later. Now we apply Lemma 1 to bound −2cγ ⟨f'(x^k), x^k − x^*⟩ and Lemma 2 to bound E||f_j'(φ_j^k) − f_j'(x^*)||^2:

    c E||x^{k+1} − x^*||^2 ≤ (c − cγµ) ||x^k − x^*||^2 + ( (1 + β) cγ^2 − cγ/L ) E||f_j'(x^k) − f_j'(x^*)||^2
        − (2cγ(L − µ)/L) [ f(x^k) − f(x^*) − ⟨f'(x^*), x^k − x^*⟩ ] − cγ^2 β ||f'(x^k) − f'(x^*)||^2
        + 2(1 + β^{-1}) cγ^2 L [ (1/n) Σ_i f_i(φ_i^k) − f(x^*) − (1/n) Σ_i ⟨f_i'(x^*), φ_i^k − x^*⟩ ].

We can now combine the bounds that we have derived for each term in T, and pull out a fraction 1/κ of T^k (for any κ at this point). Together with the inequality ||f'(x^k) − f'(x^*)||^2 ≥ 2µ [ f(x^k) − f(x^*) − ⟨f'(x^*), x^k − x^*⟩ ] [13], that yields:

    E[T^{k+1}] − T^k ≤ − (1/κ) T^k
        + ( 1/n − 2cγ(L − µ)/L − 2cγ^2 µβ ) [ f(x^k) − f(x^*) − ⟨f'(x^*), x^k − x^*⟩ ]
        + ( 1/κ + 2(1 + β^{-1}) cγ^2 L − 1/n ) [ (1/n) Σ_i f_i(φ_i^k) − f(x^*) − (1/n) Σ_i ⟨f_i'(x^*), φ_i^k − x^*⟩ ]
        + ( 1/κ − γµ ) c ||x^k − x^*||^2 + ( (1 + β)γ − 1/L ) cγ E||f_j'(x^k) − f_j'(x^*)||^2.    (10)

Note that each of the terms in square brackets is positive, and it can be readily verified that our assumed values for the constants (γ = 1/(2(µn + L)), c = 1/(2γ(1 − γµ)n), and κ = 1/(γµ)), together with β = (2µn + L)/L, ensure that each of the quantities in round brackets is non-positive (the constants were determined by setting all the round brackets to zero except the second one; see [14] for the details).

Adaptivity to strong convexity result: Note that when using the γ = 1/(3L) step size, the same c as above can be used with β = 2 and 1/κ = min{1/(4n), µ/(3L)} to ensure non-positive terms.

Corollary 1. Note that c ||x^k − x^*||^2 ≤ T^k, and therefore by chaining the expectations and plugging in the constants explicitly to simplify the expression, we get:

    E[ ||x^k − x^*||^2 ] ≤ ( 1 − µ/(2(µn + L)) )^k [ ||x^0 − x^*||^2 + (n/(µn + L)) ( f(x^0) − ⟨f'(x^*), x^0 − x^*⟩ − f(x^*) ) ].

Here the expectation is over all choices of index j_k up to step k.

[Figure 2: plots of function sub-optimality against gradient evaluations divided by n, for Finito (perm), Finito, SAGA, SVRG, SAG and SDCA. From left to right we have the MNIST, COVTYPE, IJCNN1 and MILLIONSONG datasets. Top row is the L2 regularised case, bottom row the L1 regularised case.]

6 Experiments

We performed a series of experiments to validate the effectiveness of SAGA. We tested a binary classifier on MNIST, COVTYPE, IJCNN1 and a least squares predictor on MILLIONSONG. Details of these datasets can be found in [9]. We used the same code base for each method, just changing the main update rule. SVRG was tested with the recalibration pass used every n iterations, as suggested in [8]. Each method had its step size parameter chosen so as to give the fastest convergence.

We tested with a L2 regulariser, which all methods support, and with a L1 regulariser on a subset of the methods. The results are shown in Figure 2. We can see that Finito (perm) performs the best on a per epoch equivalent basis, but it can be the most expensive method per step. SVRG is similarly fast on a per epoch basis, but when considering the number of gradient evaluations per epoch (which is double that of the other methods for this problem), it is middle of the pack. SAGA can be seen to perform similarly to the non-permuted Finito case, and to SDCA. Note that SAG is slower than the other methods at the beginning. To get the optimal results for SAG, an adaptive step size rule needs to be used rather than the constant step size we used. In general, these tests confirm that the choice of methods should be done based on their properties as discussed in Section 3, rather than their convergence rate.

References

[1] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Technical report, INRIA.
[2] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14.
[3] Paul Tseng and Sangwoo Yun. Incrementally updated gradient methods for constrained and regularized optimization. Journal of Optimization Theory and Applications, 160:832-853.
[4] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. Technical report, Microsoft Research, Redmond and Rutgers University, Piscataway, NJ.
[5] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS.
[6] Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. Proceedings of the 31st International Conference on Machine Learning.
[7] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. JMLR, 5.
[8] Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv e-prints.
[9] Aaron Defazio, Tiberio Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. Proceedings of the 31st International Conference on Machine Learning.
[10] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, INRIA Grenoble Rhône-Alpes / LJK Laboratoire Jean Kuntzmann.
[11] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Technical report, The Hebrew University, Jerusalem and Rutgers University, NJ, USA.
[12] Patrick Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer.
[13] Yu. Nesterov. Introductory Lectures on Convex Programming. Springer.
[14] Aaron Defazio. New Optimization Methods for Machine Learning. PhD thesis (draft under examination), Australian National University.

A The SDCA/Finito Midpoint Algorithm

Using Lagrangian duality theory, SDCA can be shown at step k as minimising the following lower bound:

    A^k(x) = (1/n) f_j(x) + (1/n) Σ_{i≠j} [ f_i(φ_i^k) + ⟨f_i'(φ_i^k), x − φ_i^k⟩ ] + (µ/2) ||x||^2.

Instead of directly including the regulariser in this bound, we can use the standard strong convexity lower bound for each f_i, by removing (µ/2)||x||^2 and changing the expression in the summation to f_i(φ_i^k) + ⟨f_i'(φ_i^k), x − φ_i^k⟩ + (µ/2)||x − φ_i||^2. The transformation to having strong convexity within the f_i functions yields the following simple modification to the algorithm:

    φ_j^{k+1} = prox_{1/(µ(n−1))}^{f_j} (z),    where: z = (1/(n−1)) Σ_{i≠j} φ_i^k − (1/(µ(n−1))) Σ_{i≠j} f_i'(φ_i^k).

It can be shown that after this update:

    x^{k+1} = φ_j^{k+1} = (1/n) Σ_i φ_i^{k+1} − (1/(µn)) Σ_i f_i'(φ_i^{k+1}).

Now the similarity to Finito is apparent if this equation is compared to Equation (8): x^{k+1} = (1/n) Σ_i φ_i^k − γ Σ_i f_i'(φ_i^k). The only difference is that the vectors on the right hand side of the equation are at their values at step k+1 instead of k. Note that there is a circular dependency here, as φ_j^{k+1} := x^{k+1} but φ_j^{k+1} appears in the definition of x^{k+1}. Solving the proximal operator is the resolution of the circular dependency.

This mid-point between Finito and SDCA is interesting in its own right, as it appears experimentally to have similar robustness to permuted orderings as Finito, but it has no tunable parameters, like SDCA.

When the proximal operator above is fast to compute, say on the same order as just evaluating f_j, then SDCA can be the best method among those discussed. It is a little slower than the other methods discussed here, but it has no tunable parameters at all. It is also the only choice when each f_i is not differentiable. The major disadvantage of SDCA is that it cannot handle non-strongly convex problems directly, although like most methods, adding a small amount of quadratic regularisation can be used to recover a convergence rate. It is also not adapted to use proximal operators for the regulariser in the composite objective case. The requirement of computing the proximal operator of each loss f_i initially appears to be a big disadvantage; however there are variants of SDCA that remove this requirement, but they introduce additional downsides.

B Lemmas

Lemma 4. Let f be µ-strongly convex and have Lipschitz continuous gradients with constant L. Then we have for all x and y:

    f(x) ≥ f(y) + ⟨f'(y), x − y⟩ + (1/(2(L − µ))) ||f'(x) − f'(y)||^2 + (µL/(2(L − µ))) ||y − x||^2
          + (µ/(L − µ)) ⟨f'(x) − f'(y), y − x⟩.

Proof. Define the function g as g(x) = f(x) − (µ/2)||x||^2. Then the gradient is g'(x) = f'(x) − µx. g has a Lipschitz gradient with constant L − µ. By convexity, we have [13]:

    g(x) ≥ g(y) + ⟨g'(y), x − y⟩ + (1/(2(L − µ))) ||g'(x) − g'(y)||^2.

Substituting the definition of g and g', and simplifying the terms gives the result.

Lemma 5. Let f(x) = (1/n) Σ_{i=1}^n f_i(x). Suppose each f_i is µ-strongly convex and has Lipschitz continuous gradients with constant L. Then for all x and x^*:

    ⟨f'(x), x^* − x⟩ ≤ ((L − µ)/L) [ f(x^*) − f(x) ] − (µ/2) ||x^* − x||^2
                      − (1/(2Ln)) Σ_i ||f_i'(x^*) − f_i'(x)||^2 − (µ/L) ⟨f'(x^*), x − x^*⟩.

Proof. This is a straightforward corollary of Lemma 4, using y = x^*, and averaging over the f_i functions.

Lemma 6. We have that for all φ_i and x^*:

    (1/n) Σ_i ||f_i'(φ_i) − f_i'(x^*)||^2 ≤ 2L [ (1/n) Σ_i f_i(φ_i) − f(x^*) − (1/n) Σ_i ⟨f_i'(x^*), φ_i − x^*⟩ ].

Proof. Apply the standard inequality f(y) ≥ f(x) + ⟨f'(x), y − x⟩ + (1/(2L)) ||f'(x) − f'(y)||^2, with y = φ_i and x = x^*, for each f_i, and sum.

Lemma 7. It holds that for any φ_i^k, x^*, x^k and β > 0, with w^{k+1} as defined in Equation (1):

    E||w^{k+1} − x^k + γ f'(x^*)||^2 ≤ γ^2 (1 + β^{-1}) E||f_j'(φ_j^k) − f_j'(x^*)||^2
        + γ^2 (1 + β) E||f_j'(x^k) − f_j'(x^*)||^2 − γ^2 β ||f'(x^k) − f'(x^*)||^2.

Proof. We follow a similar argument as occurs in the SVRG proof [5] for this term, but with a tighter argument. The tightening comes from using ||x + y||^2 ≤ (1 + β^{-1}) ||x||^2 + (1 + β) ||y||^2 instead of the simpler β = 1 case they use. The other key trick is the use of the standard variance decomposition E[||X − E[X]||^2] = E[||X||^2] − ||E[X]||^2 three times. Writing the update direction in terms of X := f_j'(φ_j^k) − f_j'(x^k), whose expectation is E[X] = (1/n) Σ_i f_i'(φ_i^k) − f'(x^k), we have:

    E||w^{k+1} − x^k + γ f'(x^*)||^2
      = E|| −(γ/n) Σ_i f_i'(φ_i^k) + γ f'(x^*) + γ ( f_j'(φ_j^k) − f_j'(x^k) ) ||^2
      = γ^2 E|| [ f_j'(φ_j^k) − f_j'(x^*) − ( (1/n) Σ_i f_i'(φ_i^k) − f'(x^*) ) ]
                 − [ f_j'(x^k) − f_j'(x^*) − ( f'(x^k) − f'(x^*) ) ] ||^2 + γ^2 ||f'(x^k) − f'(x^*)||^2
      ≤ γ^2 (1 + β^{-1}) E|| f_j'(φ_j^k) − f_j'(x^*) − ( (1/n) Σ_i f_i'(φ_i^k) − f'(x^*) ) ||^2
        + γ^2 (1 + β) E|| f_j'(x^k) − f_j'(x^*) − ( f'(x^k) − f'(x^*) ) ||^2 + γ^2 ||f'(x^k) − f'(x^*)||^2

(use the variance decomposition twice more):

      ≤ γ^2 (1 + β^{-1}) E||f_j'(φ_j^k) − f_j'(x^*)||^2 + γ^2 (1 + β) E||f_j'(x^k) − f_j'(x^*)||^2 − γ^2 β ||f'(x^k) − f'(x^*)||^2.

C Non-strongly-convex Problems

Theorem 2. When each f_i is convex, using γ = 1/(3L), we have for x̄^k = (1/k) Σ_{t=1}^k x^t that:

    E[ F(x̄^k) ] − F(x^*) ≤ (4n/k) [ (2L/n) ||x^0 − x^*||^2 + f(x^0) − ⟨f'(x^*), x^0 − x^*⟩ − f(x^*) ].

Here the expectation is over all choices of index j_k up to step k.

Proof. A more detailed version of this proof is available in [14]. We proceed by using a similar argument as in Theorem 1, but we add an additional α ||x^k − x^*||^2 together with the existing c ||x^k − x^*||^2 term in the Lyapunov function. We will bound α ||x^k − x^*||^2 in a different manner to c ||x^k − x^*||^2. Define δ^{k+1} = −(1/γ)(w^{k+1} − x^k) − f'(x^k), the difference between our approximation to the gradient at x^k and the

true gradient. Then instead of using the non-expansiveness property at the beginning, we use a result proved for prox-SVRG [4, 2nd eq. on p.12]:

    α E||x^{k+1} − x^*||^2 ≤ α ||x^k − x^*||^2 − 2αγ E[ F(x^{k+1}) − F(x^*) ] + 2αγ^2 E||δ^{k+1}||^2.

Although their quantity is different, they only use the property that E[δ^{k+1}] = 0 to prove the above equation. A full proof of this property for the SAGA algorithm that follows their argument appears in [14]. To bound the E||δ^{k+1}||^2 term, a small modification of the argument in Lemma 7 can be used, giving:

    E||δ^{k+1}||^2 ≤ (1 + β^{-1}) E||f_j'(φ_j^k) − f_j'(x^*)||^2 + (1 + β) E||f_j'(x^k) − f_j'(x^*)||^2.

Applying this gives:

    α E||x^{k+1} − x^*||^2 ≤ α ||x^k − x^*||^2 − 2αγ E[ F(x^{k+1}) − F(x^*) ]
        + 2(1 + β^{-1}) αγ^2 E||f_j'(φ_j^k) − f_j'(x^*)||^2 + 2(1 + β) αγ^2 E||f_j'(x^k) − f_j'(x^*)||^2.

As in Theorem 1, we then apply Lemma 6 to bound E||f_j'(φ_j^k) − f_j'(x^*)||^2. Combining with the rest of the Lyapunov function as was derived in Theorem 1 gives (we basically add the α terms to inequality (10) with µ = 0):

    E[T^{k+1}] − T^k ≤ ( 1/n − 2cγ ) [ f(x^k) − f(x^*) − ⟨f'(x^*), x^k − x^*⟩ ] − 2αγ E[ F(x^{k+1}) − F(x^*) ]
        + ( 4(1 + β^{-1}) αLγ^2 + 2(1 + β^{-1}) cLγ^2 − 1/n ) [ (1/n) Σ_i f_i(φ_i^k) − f(x^*) − (1/n) Σ_i ⟨f_i'(x^*), φ_i^k − x^*⟩ ]
        + ( (1 + β) cγ + 2(1 + β) αγ − c/L ) γ E||f_j'(x^k) − f_j'(x^*)||^2.

As before, the terms in square brackets are positive by convexity. Given that our choice of step size is γ = 1/(3L) (to match the adaptive to strong convexity step size), we can set the three round brackets to zero by using β = 1, c = 3L/(2n) and α = 3L/(8n). We thus obtain:

    E[T^{k+1}] − T^k ≤ − (1/(4n)) E[ F(x^{k+1}) − F(x^*) ].

These expectations are conditional on information from step k. We now take the expectation with respect to all previous steps, yielding

    E[T^{k+1}] − E[T^k] ≤ − (1/(4n)) E[ F(x^{k+1}) − F(x^*) ],

where all expectations are unconditional. Further negating and summing for k from 0 to k − 1 results in telescoping of the T terms, giving:

    (1/(4n)) Σ_{t=1}^k E[ F(x^t) − F(x^*) ] ≤ T^0 − E[T^k].

We can drop the E[T^k] term since T^k is always positive. Then we apply convexity to pull the summation inside F, and multiply through by 4n/k, giving:

    E[ F( (1/k) Σ_{t=1}^k x^t ) ] − F(x^*) ≤ (1/k) Σ_{t=1}^k E[ F(x^t) − F(x^*) ] ≤ (4n/k) T^0.

We get a (c + α) = 15L/(8n) ≤ 2L/n term that we upper bound by 2L/n in T^0 for simplicity.

D Example Code for Sparse Least Squares & Ridge Regression

The SAGA method is quite easy to implement for dense gradients, however the implementation for sparse gradient problems can be tricky. The main complication is the need for just-in-time updating of the elements of the iterate vector. This is needed to avoid having to do any full dense vector operations at each iteration. We provide below a simple implementation for the case of least-squares problems that illustrates how to correctly do this. The code is in the compiled Python (Cython) language.

import random
import numpy as np
cimport numpy as np
cimport cython
from cython.view cimport array as cvarray

# Performs the lagged update of x by g.
cdef inline lagged_update(long k, double[:] x, double[:] g, unsigned long[:] lag,
                          long[:] yindices, int ylen, double[:] lag_scaling, double a):
    cdef unsigned int i
    cdef long ind
    cdef unsigned long lagged_amount = 0
    for i in range(ylen):
        ind = yindices[i]
        lagged_amount = k - lag[ind]
        lag[ind] = k
        x[ind] += lag_scaling[lagged_amount] * (a * g[ind])

# Performs x += a*y, where x is dense and y is sparse.
cdef inline add_weighted(double[:] x, double[:] ydata, long[:] yindices, int ylen, double a):
    cdef unsigned int i
    for i in range(ylen):
        x[yindices[i]] += a * ydata[i]

# Dot product of a dense vector with a sparse vector
cdef inline double spdot(double[:] x, double[:] ydata, long[:] yindices, int ylen):
    cdef unsigned int i
    cdef double v = 0.0
    for i in range(ylen):
        v += ydata[i] * x[yindices[i]]
    return v

def saga_lstsq(A, double[:] b, unsigned int maxiter, props):
    # temporaries
    cdef double[:] ydata
    cdef long[:] yindices
    cdef unsigned int i, j, epoch, lagged_amount
    cdef long indstart, indend, ylen, ind
    cdef double cnew, Ax, cchange, gscaling

    # Data points are stored in columns in CSC format (indices/indptr assumed int64).
    cdef double[:] data = A.data
    cdef long[:] indices = A.indices
    cdef long[:] indptr = A.indptr

    cdef unsigned int m = A.shape[0]  # dimensions
    cdef unsigned int n = A.shape[1]  # datapoints

    cdef double[:] xk = np.zeros(m)
    cdef double[:] gk = np.zeros(m)

    cdef double eta = props['eta']  # Inverse step size = 1/gamma
    cdef double reg = props.get('reg', 0.0)  # Default 0
    cdef double betak = 1.0  # Scaling factor for xk.

    # Tracks for each entry of x, what iteration it was last updated at.
    cdef unsigned long[:] lag = np.zeros(m, dtype=np.uint64)

    # Initialize gradients
    cdef double gd = -1.0/n
    for i in range(n):
        indstart = indptr[i]
        indend = indptr[i+1]
        ydata = data[indstart:indend]
        yindices = indices[indstart:indend]
        ylen = indend - indstart
        add_weighted(gk, ydata, yindices, ylen, gd*b[i])

    # This is just a table of the sum of the geometric series (1 - reg/eta).
    # It is used to correctly do the just-in-time updating when
    # L2 regularisation is used.
    cdef double[:] lag_scaling = np.zeros(n*maxiter + 1)
    lag_scaling[0] = 0.0
    lag_scaling[1] = 1.0
    cdef double geosum = 1.0
    cdef double mult = 1.0 - reg/eta
    for i in range(2, n*maxiter + 1):
        geosum *= mult
        lag_scaling[i] = lag_scaling[i-1] + geosum

    # For least squares, we only need to store a single
    # double for each data point, rather than a full gradient vector.
    # The value stored is the A_i * betak * x product.
    cdef double[:] c = np.zeros(n)

    cdef unsigned long k = 0  # Current iteration number

    for epoch in range(maxiter):
        for j in range(n):
            if epoch == 0:
                i = j
            else:
                i = np.random.randint(0, n)

            # Selects the (sparse) column of the data matrix containing datapoint i.
            indstart = indptr[i]
            indend = indptr[i+1]
            ydata = data[indstart:indend]
            yindices = indices[indstart:indend]
            ylen = indend - indstart

            # Apply the missed updates to xk just in time
            lagged_update(k, xk, gk, lag, yindices, ylen, lag_scaling, -1.0/(eta*betak))

            Ax = betak * spdot(xk, ydata, yindices, ylen)
            cnew = Ax
            cchange = cnew - c[i]

            c[i] = cnew
            betak *= 1.0 - reg/eta

            # Update xk with the sparse part of the step (with betak scaling)
            add_weighted(xk, ydata, yindices, ylen, -cchange/(eta*betak))

            k += 1

            # Perform the gradient-average part of the step
            lagged_update(k, xk, gk, lag, yindices, ylen, lag_scaling, -1.0/(eta*betak))

            # update the gradient average
            add_weighted(gk, ydata, yindices, ylen, cchange/n)

    # Perform the just-in-time updates for the whole xk vector, so that all entries are up to date.
    gscaling = -1.0/(eta*betak)
    for ind in range(m):
        lagged_amount = k - lag[ind]
        lag[ind] = k
        xk[ind] += lag_scaling[lagged_amount] * gscaling * gk[ind]

    return betak * np.asarray(xk)
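A hypothetical driver for the routine above, assuming the Cython source has been compiled (for example with cythonize) into a module named saga_sparse; the module name, problem sizes and parameter values are all illustrative, and eta is the inverse step size 1/γ.

import numpy as np
import scipy.sparse as sp
from saga_sparse import saga_lstsq   # hypothetical compiled module name

rng = np.random.default_rng(0)
# Columns are data points: m = 1000 features, n = 5000 data points.
A = sp.random(1000, 5000, density=0.01, format='csc', random_state=0)
A.indices = A.indices.astype(np.int64)   # the routine indexes with C longs
A.indptr = A.indptr.astype(np.int64)
b = rng.normal(size=5000)
x = saga_lstsq(A, b, maxiter=20, props={'eta': 3.0, 'reg': 1e-4})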


More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Mache Learg Problem set Due Frday, September 9, rectato Please address all questos ad commets about ths problem set to 6.867-staff@a.mt.edu. You do ot eed to use MATLAB for ths problem set though

More information

1 Review and Overview

1 Review and Overview CS9T/STATS3: Statstcal Learg Teory Lecturer: Tegyu Ma Lecture #7 Scrbe: Bra Zag October 5, 08 Revew ad Overvew We wll frst gve a bref revew of wat as bee covered so far I te frst few lectures, we stated

More information

Simulation Output Analysis

Simulation Output Analysis Smulato Output Aalyss Summary Examples Parameter Estmato Sample Mea ad Varace Pot ad Iterval Estmato ermatg ad o-ermatg Smulato Mea Square Errors Example: Sgle Server Queueg System x(t) S 4 S 4 S 3 S 5

More information

Multivariate Transformation of Variables and Maximum Likelihood Estimation

Multivariate Transformation of Variables and Maximum Likelihood Estimation Marquette Uversty Multvarate Trasformato of Varables ad Maxmum Lkelhood Estmato Dael B. Rowe, Ph.D. Assocate Professor Departmet of Mathematcs, Statstcs, ad Computer Scece Copyrght 03 by Marquette Uversty

More information

Lecture 02: Bounding tail distributions of a random variable

Lecture 02: Bounding tail distributions of a random variable CSCI-B609: A Theorst s Toolkt, Fall 206 Aug 25 Lecture 02: Boudg tal dstrbutos of a radom varable Lecturer: Yua Zhou Scrbe: Yua Xe & Yua Zhou Let us cosder the ubased co flps aga. I.e. let the outcome

More information

Generative classification models

Generative classification models CS 75 Mache Learg Lecture Geeratve classfcato models Mlos Hauskrecht mlos@cs.ptt.edu 539 Seott Square Data: D { d, d,.., d} d, Classfcato represets a dscrete class value Goal: lear f : X Y Bar classfcato

More information

MATH 247/Winter Notes on the adjoint and on normal operators.

MATH 247/Winter Notes on the adjoint and on normal operators. MATH 47/Wter 00 Notes o the adjot ad o ormal operators I these otes, V s a fte dmesoal er product space over, wth gve er * product uv, T, S, T, are lear operators o V U, W are subspaces of V Whe we say

More information

NP!= P. By Liu Ran. Table of Contents. The P vs. NP problem is a major unsolved problem in computer

NP!= P. By Liu Ran. Table of Contents. The P vs. NP problem is a major unsolved problem in computer NP!= P By Lu Ra Table of Cotets. Itroduce 2. Strategy 3. Prelmary theorem 4. Proof 5. Expla 6. Cocluso. Itroduce The P vs. NP problem s a major usolved problem computer scece. Iformally, t asks whether

More information

Chapter 4 Multiple Random Variables

Chapter 4 Multiple Random Variables Revew for the prevous lecture: Theorems ad Examples: How to obta the pmf (pdf) of U = g (, Y) ad V = g (, Y) Chapter 4 Multple Radom Varables Chapter 44 Herarchcal Models ad Mxture Dstrbutos Examples:

More information

Kernel-based Methods and Support Vector Machines

Kernel-based Methods and Support Vector Machines Kerel-based Methods ad Support Vector Maches Larr Holder CptS 570 Mache Learg School of Electrcal Egeerg ad Computer Scece Washgto State Uverst Refereces Muller et al. A Itroducto to Kerel-Based Learg

More information

Generalized Linear Regression with Regularization

Generalized Linear Regression with Regularization Geeralze Lear Regresso wth Regularzato Zoya Bylsk March 3, 05 BASIC REGRESSION PROBLEM Note: I the followg otes I wll make explct what s a vector a what s a scalar usg vec t or otato, to avo cofuso betwee

More information

NP!= P. By Liu Ran. Table of Contents. The P versus NP problem is a major unsolved problem in computer

NP!= P. By Liu Ran. Table of Contents. The P versus NP problem is a major unsolved problem in computer NP!= P By Lu Ra Table of Cotets. Itroduce 2. Prelmary theorem 3. Proof 4. Expla 5. Cocluso. Itroduce The P versus NP problem s a major usolved problem computer scece. Iformally, t asks whether a computer

More information

A conic cutting surface method for linear-quadraticsemidefinite

A conic cutting surface method for linear-quadraticsemidefinite A coc cuttg surface method for lear-quadratcsemdefte programmg Mohammad R. Osoorouch Calfora State Uversty Sa Marcos Sa Marcos, CA Jot wor wth Joh E. Mtchell RPI July 3, 2008 Outle: Secod-order coe: defto

More information

Strong Convergence of Weighted Averaged Approximants of Asymptotically Nonexpansive Mappings in Banach Spaces without Uniform Convexity

Strong Convergence of Weighted Averaged Approximants of Asymptotically Nonexpansive Mappings in Banach Spaces without Uniform Convexity BULLETIN of the MALAYSIAN MATHEMATICAL SCIENCES SOCIETY Bull. Malays. Math. Sc. Soc. () 7 (004), 5 35 Strog Covergece of Weghted Averaged Appromats of Asymptotcally Noepasve Mappgs Baach Spaces wthout

More information

Distributed Accelerated Proximal Coordinate Gradient Methods

Distributed Accelerated Proximal Coordinate Gradient Methods Dstrbuted Accelerated Proxmal Coordate Gradet Methods Yog Re, Ju Zhu Ceter for Bo-Ispred Computg Research State Key Lab for Itell. Tech. & Systems Dept. of Comp. Sc. & Tech., TNLst Lab, Tsghua Uversty

More information

Unimodality Tests for Global Optimization of Single Variable Functions Using Statistical Methods

Unimodality Tests for Global Optimization of Single Variable Functions Using Statistical Methods Malaysa Umodalty Joural Tests of Mathematcal for Global Optmzato Sceces (): of 05 Sgle - 5 Varable (007) Fuctos Usg Statstcal Methods Umodalty Tests for Global Optmzato of Sgle Varable Fuctos Usg Statstcal

More information

CHAPTER VI Statistical Analysis of Experimental Data

CHAPTER VI Statistical Analysis of Experimental Data Chapter VI Statstcal Aalyss of Expermetal Data CHAPTER VI Statstcal Aalyss of Expermetal Data Measuremets do ot lead to a uque value. Ths s a result of the multtude of errors (maly radom errors) that ca

More information

Newton s Power Flow algorithm

Newton s Power Flow algorithm Power Egeerg - Egll Beedt Hresso ewto s Power Flow algorthm Power Egeerg - Egll Beedt Hresso The ewto s Method of Power Flow 2 Calculatos. For the referece bus #, we set : V = p.u. ad δ = 0 For all other

More information

Simple Linear Regression

Simple Linear Regression Correlato ad Smple Lear Regresso Berl Che Departmet of Computer Scece & Iformato Egeerg Natoal Tawa Normal Uversty Referece:. W. Navd. Statstcs for Egeerg ad Scetsts. Chapter 7 (7.-7.3) & Teachg Materal

More information

Lecture Note to Rice Chapter 8

Lecture Note to Rice Chapter 8 ECON 430 HG revsed Nov 06 Lecture Note to Rce Chapter 8 Radom matrces Let Y, =,,, m, =,,, be radom varables (r.v. s). The matrx Y Y Y Y Y Y Y Y Y Y = m m m s called a radom matrx ( wth a ot m-dmesoal dstrbuto,

More information

Multiple Choice Test. Chapter Adequacy of Models for Regression

Multiple Choice Test. Chapter Adequacy of Models for Regression Multple Choce Test Chapter 06.0 Adequac of Models for Regresso. For a lear regresso model to be cosdered adequate, the percetage of scaled resduals that eed to be the rage [-,] s greater tha or equal to

More information

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. x, where. = y - ˆ " 1

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. x, where. = y - ˆ  1 STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS Recall Assumpto E(Y x) η 0 + η x (lear codtoal mea fucto) Data (x, y ), (x 2, y 2 ),, (x, y ) Least squares estmator ˆ E (Y x) ˆ " 0 + ˆ " x, where ˆ

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Marquette Uverst Maxmum Lkelhood Estmato Dael B. Rowe, Ph.D. Professor Departmet of Mathematcs, Statstcs, ad Computer Scece Coprght 08 b Marquette Uverst Maxmum Lkelhood Estmato We have bee sag that ~

More information

Taylor s Series and Interpolation. Interpolation & Curve-fitting. CIS Interpolation. Basic Scenario. Taylor Series interpolates at a specific

Taylor s Series and Interpolation. Interpolation & Curve-fitting. CIS Interpolation. Basic Scenario. Taylor Series interpolates at a specific CIS 54 - Iterpolato Roger Crawfs Basc Scearo We are able to prod some fucto, but do ot kow what t really s. Ths gves us a lst of data pots: [x,f ] f(x) f f + x x + August 2, 25 OSU/CIS 54 3 Taylor s Seres

More information

13. Parametric and Non-Parametric Uncertainties, Radial Basis Functions and Neural Network Approximations

13. Parametric and Non-Parametric Uncertainties, Radial Basis Functions and Neural Network Approximations Lecture 7 3. Parametrc ad No-Parametrc Ucertates, Radal Bass Fuctos ad Neural Network Approxmatos he parameter estmato algorthms descrbed prevous sectos were based o the assumpto that the system ucertates

More information

Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization

Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization Stochastc Prmal-Dual Coordate Method for Regularzed Emprcal Rsk Mmzato Yuche Zhag L Xao September 24 Abstract We cosder a geerc covex optmzato problem assocated wth regularzed emprcal rsk mmzato of lear

More information

EECE 301 Signals & Systems

EECE 301 Signals & Systems EECE 01 Sgals & Systems Prof. Mark Fowler Note Set #9 Computg D-T Covoluto Readg Assgmet: Secto. of Kame ad Heck 1/ Course Flow Dagram The arrows here show coceptual flow betwee deas. Note the parallel

More information