DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization


Lin Xiao (Microsoft Research AI, Redmond, WA 98052, USA)
Adams Wei Yu (Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA)
Qihang Lin (Tippie College of Business, The University of Iowa, Iowa City, IA 52242, USA)
Weizhu Chen (Microsoft AI and Research, Redmond, WA 98052, USA)

October 13, 2017

Abstract

Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.

Keywords: asynchronous distributed optimization, parameter servers, randomized algorithms, saddle-point problems, primal-dual coordinate algorithms, empirical risk minimization.

1. Introduction

Algorithms and systems for distributed optimization are critical for solving large-scale machine learning problems, especially when the dataset cannot fit into the memory or storage of a single machine. In this paper, we consider distributed optimization problems of the form

    minimize_{w∈R^d}  (1/m) Σ_{i=1}^m f_i(X_i w) + g(w),    (1)

where X_i ∈ R^{N_i×d} is the local data stored at the i-th machine, f_i: R^{N_i} → R is a convex cost function associated with the linear mapping X_i w, and g(w) is a convex regularization function. In addition,

we assume that g is separable, i.e., for some integer n > 0, we can write

    g(w) = Σ_{k=1}^n g_k(w_k),    (2)

where g_k: R^{d_k} → R, and w_k ∈ R^{d_k} for k = 1,...,n are non-overlapping subvectors of w ∈ R^d with Σ_{k=1}^n d_k = d (they form a partition of w). Many popular regularization functions in machine learning are separable, for example, g(w) = (λ/2)‖w‖² or g(w) = λ‖w‖₁ for some λ > 0.

An important special case of (1) is distributed empirical risk minimization (ERM) of linear predictors. Let (x₁, y₁),...,(x_N, y_N) be N training examples, where each x_j ∈ R^d is a feature vector and y_j ∈ R is its label. The ERM problem is formulated as

    minimize_{w∈R^d}  (1/N) Σ_{j=1}^N φ_j(x_j^T w) + g(w),    (3)

where each φ_j: R → R is a loss function measuring the mismatch between the linear prediction x_j^T w and the label y_j. Popular loss functions in machine learning include, e.g., for regression, the squared loss φ_j(t) = (1/2)(t − y_j)², and for classification, the logistic loss φ_j(t) = log(1 + exp(−y_j t)) where y_j ∈ {±1}. In the distributed optimization setting, the N examples are divided into m subsets, each stored on a different machine. For i = 1,...,m, let I_i denote the subset of {1,...,N} stored at machine i and let N_i = |I_i| (they satisfy Σ_{i=1}^m N_i = N). Then the ERM problem (3) can be written in the form of (1) by letting X_i consist of x_j^T with j ∈ I_i as its rows and defining f_i: R^{N_i} → R as

    f_i(u_{I_i}) = (m/N) Σ_{j∈I_i} φ_j(u_j),    (4)

where u_{I_i} ∈ R^{N_i} is a subvector of u ∈ R^N, consisting of u_j with j ∈ I_i.

The nature of distributed algorithms and their convergence properties largely depend on the model of the communication network that connects the m computing machines. A popular setting in the literature is to model the communication network as a graph, where each node can only communicate (in one step) with its neighbors connected by an edge, either synchronously or asynchronously (e.g., Bertsekas and Tsitsiklis, 1989; Nedić and Ozdaglar, 2009). The convergence rates of distributed algorithms in this setting often depend on characteristics of the graph, such as its diameter and the eigenvalues of the graph Laplacian (e.g., Xiao and Boyd, 2006; Duchi et al., 2012; Nedić et al., 2016; Scaman et al., 2017). This is often called the decentralized setting.

Another model for the communication network is centralized, where all the machines participate in synchronous, collective communication, e.g., broadcasting a vector to all m machines, or computing the sum of m vectors, each from a different machine (AllReduce). These collective communication protocols hide the underlying implementation details, which often involve operations on graphs. They are adopted by many popular distributed computing standards and packages, such as MPI (MPI Forum, 2012), MapReduce (Dean and Ghemawat, 2008) and Apache Spark (Zaharia et al., 2016), and are widely used in machine learning practice (e.g., Lin et al., 2014; Meng et al., 2016). In particular, collective communications are very useful for addressing data parallelism, i.e., by allowing different machines to work in parallel to improve the same model w ∈ R^d using their local datasets. A disadvantage of collective communications is their synchronization cost: faster machines or machines with less computing tasks have to become idle while waiting for other machines to finish their tasks in order to participate in a collective communication.
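To make the reduction from (3) to (1) concrete, the following sketch (our Python/NumPy illustration, not part of the paper's implementation; the helper name make_local_problems is ours) splits the examples across m hypothetical machines and builds the local functions f_i of (4):

```python
# Our illustration of the reduction from ERM (3) to form (1): the N examples
# are split evenly across m hypothetical machines, and each local f_i is
# scaled by m/N as in (4).
import numpy as np

def make_local_problems(X, y, m):
    """Split (X, y) row-wise into m local blocks X_i with losses f_i."""
    N = X.shape[0]
    index_sets = np.array_split(np.arange(N), m)   # I_1, ..., I_m
    problems = []
    for I in index_sets:
        X_i, y_i = X[I], y[I]
        # f_i(u) = (m/N) * sum_j log(1 + exp(-y_j u_j)) (logistic loss)
        f_i = lambda u, y_i=y_i: (m / N) * np.sum(np.log1p(np.exp(-y_i * u)))
        problems.append((X_i, f_i))
    return problems

# With these definitions, (1/m) * sum_i f_i(X_i @ w) equals the ERM
# objective (1/N) * sum_j phi_j(x_j^T w), so (3) is an instance of (1).
```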

One effective approach for reducing synchronization cost is to exploit model parallelism (here "model" refers to w ∈ R^d, including all optimization variables). The idea is to allow different machines to work in parallel with different versions of the full model or different parts of a common model, with little or no synchronization. The model partitioning approach can be very effective for solving problems with large models (large dimension d). Dedicated parameter servers can be set up to store and maintain different subsets of the model parameters, such as the w_k's in (2), and be responsible for coordinating their updates at different workers (Li et al., 2014; Xing et al., 2015). This requires flexible point-to-point communication.

In this paper, we develop a family of randomized algorithms that exploit simultaneous data and model parallelism. Correspondingly, we adopt a centralized communication model that supports both synchronous collective communication and asynchronous point-to-point communication. In particular, it allows any pair of machines to send/receive a message in a single step, and multiple point-to-point communications may happen in parallel in an event-driven, asynchronous manner. Such a communication model is well supported by the MPI standard. To evaluate the performance of distributed algorithms in this setting, we consider the following three measures.

- Computation complexity: total amount of computation, measured by the number of passes over all datasets X_i for i = 1,...,m, which can happen in parallel on different machines.
- Communication complexity: the total amount of communication required, measured by the equivalent number of vectors in R^d sent or received across all machines.
- Synchronous communication: measured by the total number of vectors in R^d that require synchronous collective communication involving all m machines. We single it out from the overall communication complexity as a partial measure of the synchronization cost.

In Section 2, we introduce the framework of our randomized algorithms, Doubly Stochastic Coordinate Optimization with Variance Reduction (DSCOVR), and summarize our theoretical results on the three measures achieved by DSCOVR. Compared with other first-order methods for distributed optimization, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. Then we present the details of several DSCOVR variants and their convergence analysis in Sections 3-6. We discuss the implementation of different DSCOVR algorithms in Section 7, and present results of our numerical experiments in Section 8.

2. The DSCOVR Framework and Main Results

First, we derive a saddle-point formulation of the convex optimization problem (1). Let f_i* be the convex conjugate of f_i, i.e., f_i*(α_i) = sup_{u_i∈R^{N_i}} {α_i^T u_i − f_i(u_i)}, and define

    L(w, α) = (1/m) Σ_{i=1}^m α_i^T X_i w − (1/m) Σ_{i=1}^m f_i*(α_i) + g(w),    (5)

where α = [α₁; ...; α_m] ∈ R^N. Since both the f_i's and g are convex, L(w, α) is convex in w and concave in α. We also define a pair of primal and dual functions:

    P(w) = max_{α∈R^N} L(w, α) = (1/m) Σ_{i=1}^m f_i(X_i w) + g(w),    (6)
    D(α) = min_{w∈R^d} L(w, α) = −(1/m) Σ_{i=1}^m f_i*(α_i) − g*(−(1/m) Σ_{i=1}^m X_i^T α_i),    (7)

where P(w) is exactly the objective function in (1) and g* is the convex conjugate of g.
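As a concrete instance of the conjugates appearing in (5)-(7) (a worked example we add for illustration; it is not spelled out in the text), the squared loss has a simple closed-form conjugate:

```latex
% Conjugate of the squared loss \phi_j(t) = \tfrac{1}{2}(t - y_j)^2:
%   \phi_j^*(\beta) = \sup_t \{ \beta t - \tfrac{1}{2}(t - y_j)^2 \}.
% The supremum is attained where \beta - (t - y_j) = 0, i.e., t = \beta + y_j, so
\phi_j^*(\beta) = \beta(\beta + y_j) - \tfrac{1}{2}\beta^2
                = \tfrac{1}{2}\beta^2 + \beta y_j .
```

By the scaling rule (c f)*(α) = c f*(α/c) for c > 0, the conjugate of f_i in (4) then also has a closed form, which is what makes the dual proximal updates below inexpensive for quadratic losses.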

Figure 1: Partition of primal variable w, dual variable α, and the data matrix X.

We assume that L has a saddle point (w★, α★), that is,

    L(w★, α) ≤ L(w★, α★) ≤ L(w, α★),  ∀(w, α) ∈ R^d × R^N.

In this case, we have w★ = arg min_w P(w) and α★ = arg max_α D(α), and P(w★) = D(α★).¹

The DSCOVR framework is based on solving the convex-concave saddle-point problem

    min_{w∈R^d} max_{α∈R^N} L(w, α).    (8)

Since we assume that g has a separable structure as in (2), we rewrite the saddle-point problem as

    min_{w∈R^d} max_{α∈R^N} { (1/m) Σ_{i=1}^m Σ_{k=1}^n α_i^T X_{ik} w_k − (1/m) Σ_{i=1}^m f_i*(α_i) + Σ_{k=1}^n g_k(w_k) },    (9)

where X_{ik} ∈ R^{N_i×d_k} for k = 1,...,n are column partitions of X_i. For convenience, we define the following notations. First, let X = [X₁; ...; X_m] ∈ R^{N×d} be the overall data matrix, obtained by stacking the X_i's vertically. Conforming to the separation of g, we also partition X into block columns X_{:k} ∈ R^{N×d_k} for k = 1,...,n, where each X_{:k} = [X_{1k}; ...; X_{mk}] (stacked vertically). For consistency, we also use X_{i:} to denote X_i from now on. See Figure 1 for an illustration.

We exploit the doubly separable structure in (9) by a doubly stochastic coordinate update algorithm outlined in Algorithm 1. Let p = {p₁,...,p_m} and q = {q₁,...,q_n} be two probability distributions. During each iteration t, we randomly pick an index i ∈ {1,...,m} with probability p_i, and independently pick an index k ∈ {1,...,n} with probability q_k. Then we compute two vectors u_i^{t+1} ∈ R^{N_i} and v_k^{t+1} ∈ R^{d_k} (details to be discussed later), and use them to update the block coordinates α_i and w_k while leaving the other block coordinates unchanged. The update formulas in (10) and (11) use the proximal mappings of the (scaled) functions f_i* and g_k respectively.

1. More technically, we need to assume that each f_i is convex and lower semi-continuous so that f_i** = f_i (see, e.g., Rockafellar, 1970, Section 12). It automatically holds if f_i is convex and differentiable, which we will assume later.

Algorithm 1: DSCOVR framework

input: initial points w^0, α^0, and step sizes σ_i for i = 1,...,m and τ_k for k = 1,...,n.
1: for t = 0, 1, 2, ..., do
2:   pick i ∈ {1,...,m} and k ∈ {1,...,n} randomly with distributions p and q respectively.
3:   compute variance-reduced stochastic gradients u_i^{t+1} and v_k^{t+1}.
4:   update primal and dual block coordinates:

         α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}), keeping α_{i'}^{t+1} = α_{i'}^t for all i' ≠ i,    (10)
         w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}), keeping w_{k'}^{t+1} = w_{k'}^t for all k' ≠ k.    (11)

5: end for

Recall that the proximal mapping for any convex function φ: R^d → R ∪ {+∞} is defined as

    prox_φ(v) = arg min_{u∈R^d} { φ(u) + (1/2)‖u − v‖² }.

There are several different ways to compute the vectors u_i^{t+1} and v_k^{t+1} in Step 3 of Algorithm 1. They should be the partial gradients or stochastic gradients of the bilinear coupling term in L(w, α) with respect to α_i and w_k respectively. Let

    K(w, α) = α^T X w = Σ_{i=1}^m Σ_{k=1}^n α_i^T X_{ik} w_k,

which is the bilinear term in L(w, α) without the factor 1/m. We can use the following partial gradients in Step 3:

    ū_i^{t+1} = ∂K(w^t, α^t)/∂α_i = Σ_{k=1}^n X_{ik} w_k^t,
    v̄_k^{t+1} = (1/m) ∂K(w^t, α^t)/∂w_k = (1/m) Σ_{i=1}^m X_{ik}^T α_i^t.    (12)

We note that the factor 1/m does not appear in the first equation because it multiplies both K(w, α) and f_i*(α_i) in (9) and hence does not appear in updating α_i. Another choice is to use

    u_i^{t+1} = (1/q_k) X_{ik} w_k^t,    v_k^{t+1} = (1/(p_i m)) X_{ik}^T α_i^t,    (13)

which are unbiased stochastic partial gradients, because

    E_k[u_i^{t+1}] = Σ_{k'=1}^n q_{k'} (1/q_{k'}) X_{ik'} w_{k'}^t = Σ_{k'=1}^n X_{ik'} w_{k'}^t = ū_i^{t+1},
    E_i[v_k^{t+1}] = Σ_{i'=1}^m p_{i'} (1/(p_{i'} m)) X_{i'k}^T α_{i'}^t = (1/m) Σ_{i'=1}^m X_{i'k}^T α_{i'}^t = v̄_k^{t+1},

where E_i and E_k denote expectations with respect to the random indices i and k respectively.
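The following serial NumPy simulation (our sketch, not the paper's C++/MPI implementation) instantiates Algorithm 1 with the stochastic partial gradients (13) for ridge regression, where both proximal mappings in (10)-(11) have closed forms; all names are ours:

```python
# A minimal sketch of Algorithm 1 with option (13), assuming squared loss
# phi_j(t) = (t - y_j)^2/2 and g(w) = (lam/2)*||w||^2.
import numpy as np

def dscovr_framework(Xs, ys, lam, sigma, tau, n_blocks, iters, seed=0):
    rng = np.random.default_rng(seed)
    m = len(Xs)
    N = sum(X.shape[0] for X in Xs)
    d = Xs[0].shape[1]
    c = m / N                                        # scaling of f_i in (4)
    w = np.zeros(d)
    alphas = [np.zeros(X.shape[0]) for X in Xs]
    blocks = np.array_split(np.arange(d), n_blocks)  # partition of w
    p, q = 1.0 / m, 1.0 / n_blocks                   # uniform sampling
    for _ in range(iters):
        i = rng.integers(m)                   # data block index
        k = rng.integers(n_blocks)            # model block index
        cols = blocks[k]
        X_ik = Xs[i][:, cols]
        # unbiased stochastic partial gradients, equation (13)
        u = (1.0 / q) * (X_ik @ w[cols])
        v = (1.0 / (p * m)) * (X_ik.T @ alphas[i])
        # dual step (10): f_i*(a) = ||a||^2/(2c) + a^T y_i for the squared
        # loss, so prox_{sigma f_i*} is a closed-form shrinkage
        z = alphas[i] + sigma * u
        alphas[i] = c * (z - sigma * ys[i]) / (c + sigma)
        # primal step (11): prox of tau*(lam/2)*||.||^2 is a rescaling
        w[cols] = (w[cols] - tau * v) / (1.0 + tau * lam)
    return w, alphas
```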

Figure 2: Simultaneous data and model parallelism. At any given time, each machine is busy updating one parameter block and its own dual variable. Whenever some machine is done, it is assigned to work on a random block that is not being updated.

It can be shown that Algorithm 1 converges to a saddle point of L(w, α) with either choice (12) or (13) in Step 3, and with suitable step sizes σ_i and τ_k. It is expected that using the stochastic gradients in (13) leads to a slower convergence rate than applying (12). However, using (13) has the advantage of much less computation during each iteration. Specifically, it employs only one block matrix-vector multiplication for both updates, instead of the n and m block multiplications done in (12).

More importantly, the choice in (13) is suitable for parallel and distributed computing. To see this, let (i(t), k(t)) denote the pair of random indices drawn at iteration t (we omit the superscript (t) to simplify notation whenever there is no confusion from the context). Suppose for a sequence of s consecutive iterations t,...,t+s−1, there is no common index among i(t),...,i(t+s−1), nor among k(t),...,k(t+s−1); then these s iterations can be done in parallel and they produce the same updates as being done sequentially. Suppose there are s processors or machines; then each can carry out one iteration, which includes the updates in (13) as well as (10) and (11). These s iterations are independent of each other, and in fact can be done in any order, because each only involves one primal block w_{k(t)} and one dual block α_{i(t)}, for both input and output (variables on the right and left sides of the assignments respectively). In contrast, the input for the updates in (12) depends on all primal and dual blocks at the previous iteration, so they cannot be done in parallel.

In practice, suppose we have m machines for solving problem (9), and each holds the data matrix X_{i:} in memory and maintains the dual block α_i, for i = 1,...,m. We assume that the number of model partitions n is larger than m, and the n model blocks {w₁,...,w_n} are stored at one or more parameter servers. In the beginning, we can randomly pick m model blocks (sampling without replacement) from {w₁,...,w_n}, and assign each machine to update one of them. If machine i is assigned to update block k, then both α_i and w_k are updated, using only the matrix X_{ik}; moreover, it needs to communicate only the block w_k with the parameter server that is responsible for maintaining it. Whenever one machine finishes its update, a scheduler can randomly pick another parameter block that is not currently updated by other machines, and assign it to the free machine. Therefore all machines can work in parallel, in an asynchronous, event-driven manner. Here an event is the completion of a block update at any machine, as illustrated in Figure 2. We will discuss the implementation details in Section 7.

The idea of using doubly stochastic updates for distributed optimization is not new. It has been studied by Yun et al. (2014) for solving the matrix completion problem, and by Matsushima et al. (2014) for solving the saddle-point formulation of the ERM problem. Despite their nice features for parallelization, these algorithms inherit the O(1/√t) (or O(1/t) with strong convexity) sublinear convergence rate of the classical stochastic gradient method. They translate into high communication and computation cost for distributed optimization. In this paper, we propose new variants of doubly stochastic update algorithms by using variance-reduced stochastic gradients (Step 3 of Algorithm 1). More specifically, we borrow the variance-reduction techniques from SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014) to develop the DSCOVR algorithms, which enjoy fast linear rates of convergence. In the rest of this section, we summarize our theoretical results characterizing the three measures for DSCOVR: computation complexity, communication complexity, and synchronization cost. We compare them with distributed implementations of batch first-order algorithms.

2.1 Summary of Main Results

Throughout this paper, we use ‖·‖ to denote the standard Euclidean norm for vectors. For matrices, ‖·‖ denotes the operator (spectral) norm and ‖·‖_F denotes the Frobenius norm. We make the following assumption regarding the optimization problem (1).

Assumption 1. Each f_i is convex and differentiable, and its gradient is (1/γ_i)-Lipschitz continuous, i.e.,

    ‖∇f_i(u) − ∇f_i(v)‖ ≤ (1/γ_i)‖u − v‖,  ∀u, v ∈ R^{N_i},  i = 1,...,m.    (14)

In addition, the regularization function g is λ-strongly convex, i.e.,

    g(w') ≥ g(w) + ξ^T(w' − w) + (λ/2)‖w' − w‖²,  ∀ξ ∈ ∂g(w),  ∀w', w ∈ R^d.

Under Assumption 1, each f_i* is γ_i-strongly convex (see, e.g., Hiriart-Urruty and Lemaréchal, 2001, Theorem 4.2.2), and L(w, α) defined in (5) has a unique saddle point (w★, α★).

The condition (14) is often referred to as f_i being (1/γ_i)-smooth. To simplify discussion, here we assume γ_i = γ for i = 1,...,m. Under these assumptions, each composite function f_i(X_i w) has a smoothness parameter ‖X_i‖²/γ (an upper bound on the largest eigenvalue of its Hessian). Their average (1/m) Σ_{i=1}^m f_i(X_i w) has a smoothness parameter ‖X‖²/(mγ), which is no larger than the average of the individual smoothness parameters (1/m) Σ_{i=1}^m ‖X_i‖²/γ. We define a condition number for problem (1) as the ratio between this smoothness parameter and the convexity parameter λ of g:

    κ_bat = ‖X‖²/(mλγ) ≤ (1/m) Σ_{i=1}^m ‖X_{i:}‖²/(λγ) ≤ ‖X‖²_max/(λγ),    (15)

where ‖X‖_max = max_i {‖X_{i:}‖}. This condition number is a key factor in characterizing the iteration complexity of batch first-order methods for solving problem (1), i.e., minimizing P(w). Specifically, to find a w such that P(w) − P(w★) ≤ ε, the proximal gradient method requires O(κ_bat log(1/ε)) iterations, and its accelerated variants require O(√κ_bat log(1/ε)) iterations (e.g., Nesterov, 2004; Beck and Teboulle, 2009; Nesterov, 2013). Primal-dual first-order methods for solving the saddle-point problem (8) share the same complexity (Chambolle and Pock, 2011, 2015).

Table 1: Computation and communication complexities of batch first-order methods and DSCOVR (for both SVRG and SAGA variants). We omit the O(·) notation in all entries and an extra log(1 + κ_rand/m) factor for accelerated DSCOVR algorithms.

| Algorithms | Computation complexity (number of passes over data) | Communication complexity (number of vectors in R^d) |
|---|---|---|
| batch first-order methods | κ_bat log(1/ε) | m κ_bat log(1/ε) |
| DSCOVR | (1 + κ_rand/m) log(1/ε) | (m + κ_rand) log(1/ε) |
| accelerated batch first-order methods | √κ_bat log(1/ε) | m √κ_bat log(1/ε) |
| accelerated DSCOVR | (1 + √(κ_rand/m)) log(1/ε) | (m + √(m κ_rand)) log(1/ε) |

A fundamental baseline for evaluating any distributed optimization algorithm is the distributed implementation of batch first-order methods. Let us consider solving problem (1) using the proximal gradient method. During every iteration t, each machine receives a copy of w^t ∈ R^d from a master machine (through Broadcast), and computes the local gradient z_i^t = X_i^T ∇f_i(X_i w^t) ∈ R^d. Then a collective communication is invoked to compute the batch gradient z^t = (1/m) Σ_{i=1}^m z_i^t at the master (Reduce). The master then takes a proximal gradient step, using z^t and the proximal mapping of g, to compute the next iterate w^{t+1}, and broadcasts it to every machine for the next iteration. We can also use the AllReduce operation in MPI to obtain z^t at each machine without a master. In either case, the total number of passes over the data is twice the number of iterations (due to matrix-vector multiplications using both X_i and X_i^T), and the number of vectors in R^d sent/received across all machines is m times the number of iterations (see Table 1). Moreover, all communications are collective and synchronous.
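The following serial sketch (ours, for illustration) simulates this batch baseline, with the Reduce/Broadcast pattern replaced by a plain sum over the local gradients; squared loss and g(w) = (λ/2)‖w‖² are assumed so that ∇f_i and prox_g are explicit, and a fixed step size stands in for a line search:

```python
# Our simulation of the distributed proximal gradient baseline.
import numpy as np

def distributed_pgd(Xs, ys, lam, step, iters):
    m = len(Xs)
    d = Xs[0].shape[1]
    N = sum(X.shape[0] for X in Xs)
    w = np.zeros(d)
    for _ in range(iters):
        # each "machine" i computes z_i = X_i^T grad f_i(X_i w) locally;
        # grad f_i(u) = (m/N) * (u - y_i) for the squared loss in (4)
        zs = [X.T @ ((m / N) * (X @ w - y)) for X, y in zip(Xs, ys)]
        z = sum(zs) / m                        # Reduce: batch gradient
        w = (w - step * z) / (1 + step * lam)  # proximal gradient step
        # broadcasting the new w is implicit in the next loop iteration
    return w
```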

Since DSCOVR is a family of randomized algorithms for solving the saddle-point problem (8), we would like to find (w, α) such that ‖w^t − w★‖² + (1/m)‖α^t − α★‖² ≤ ε holds in expectation and with high probability. We list the communication and computation complexities of DSCOVR in Table 1, comparing them with batch first-order methods. Similar guarantees also hold for reducing the duality gap P(w^t) − D(α^t), where P and D are defined in (6) and (7) respectively.

The key quantity characterizing the complexities of DSCOVR is the condition number κ_rand, which can be defined in several different ways. If we pick the data block i and model block k with uniform distributions, i.e., p_i = 1/m for i = 1,...,m and q_k = 1/n for k = 1,...,n, then

    κ_rand = n ‖X‖²_{m×n} / (λγ),  where ‖X‖_{m×n} = max_{i,k} ‖X_{ik}‖.    (16)

Comparing with the definition of κ_bat in (15), we have κ_bat ≤ κ_rand because

    (1/m)‖X‖² ≤ (1/m) Σ_{i=1}^m ‖X_{i:}‖² ≤ (1/m) Σ_{i=1}^m Σ_{k=1}^n ‖X_{ik}‖² ≤ n ‖X‖²_{m×n}.

With X_{i:} = [X_{i1} ⋯ X_{in}] ∈ R^{N_i×d} and X_{:k} = [X_{1k}; ...; X_{mk}] ∈ R^{N×d_k}, we can also define

    κ'_rand = ‖X‖²_{max,F} / (λγ),  where ‖X‖_{max,F} = max_{i,k} {‖X_{i:}‖_F, ‖X_{:k}‖_F}.    (17)

Table 2: Breakdown of communication complexities into synchronous and asynchronous communications for two different types of DSCOVR algorithms. We omit the O(·) notation and an extra log(1 + κ_rand/m) factor for accelerated DSCOVR algorithms.

| Algorithms | Synchronous communication (number of vectors in R^d) | Asynchronous communication (equiv. number of vectors in R^d) |
|---|---|---|
| DSCOVR-SVRG | m log(1/ε) | κ_rand log(1/ε) |
| DSCOVR-SAGA | m | (m + κ_rand) log(1/ε) |
| accelerated DSCOVR-SVRG | m log(1/ε) | √(m κ_rand) log(1/ε) |
| accelerated DSCOVR-SAGA | m | (m + √(m κ_rand)) log(1/ε) |

In this case, we also have κ_bat ≤ κ'_rand because ‖X‖_max ≤ ‖X‖_{max,F}. Finally, if we pick the pair (i, k) with the non-uniform distribution p_i = ‖X_{i:}‖²_F / ‖X‖²_F and q_k = ‖X_{:k}‖²_F / ‖X‖²_F, then we can define

    κ''_rand = ‖X‖²_F / (mλγ).    (18)

Again we have κ_bat ≤ κ''_rand because ‖X‖ ≤ ‖X‖_F. We may replace κ_rand in Tables 1 and 2 by either κ'_rand or κ''_rand, depending on the probability distributions p and q and the different proof techniques.

From Table 1, we observe similar types of speed-ups in computation complexity as obtained by variance reduction techniques over the batch first-order algorithms for convex optimization (e.g., Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Lan and Zhou, 2015; Allen-Zhu, 2017), as well as for convex-concave saddle-point problems (Zhang and Xiao, 2017; Balamurugan and Bach, 2016). Basically, DSCOVR algorithms have potential improvement over batch first-order methods by a factor of m (for non-accelerated algorithms) or √m (for accelerated algorithms), but with a worse condition number. In the worst case, the ratio between κ_rand and κ_bat may be of order m or larger, thus canceling the potential improvements.

More interestingly, DSCOVR also has similar improvements in terms of communication complexity over batch first-order methods. In Table 2, we decompose the communication complexity of DSCOVR into synchronous and asynchronous communication. The decomposition turns out to be different depending on the variance reduction technique employed: SVRG (Johnson and Zhang, 2013) versus SAGA (Defazio et al., 2014). We note that DSCOVR-SAGA essentially requires only asynchronous communication, because the synchronous communication of m vectors is only necessary for initialization with a non-zero starting point.

The comparisons in Tables 1 and 2 give us a good understanding of the complexities of different algorithms. However, these complexities are not accurate measures of their performance in practice. For example, collective communication of m vectors in R^d can often be done in parallel over a spanning tree of the underlying communication network, thus costing only log(m) times (instead of m times) as much as sending a single vector. Also, for point-to-point communication, sending one vector in R^d altogether can be much faster than sending n smaller vectors of total length d separately. A fair comparison in terms of wall-clock time on a real-world distributed computing system requires customized, efficient implementations of the different algorithms. We will shed some light on timing comparisons with numerical experiments in Section 8.

2.2 Related Work

There is an extensive literature on distributed optimization. Many algorithms developed for machine learning adopt the centralized communication setting, due to the wide availability of supporting standards and platforms such as MPI, MapReduce and Spark (as discussed in the introduction). They include parallel implementations of the batch first-order and second-order methods (e.g., Lin et al., 2014; Chen et al., 2014; Lee et al., 2017), ADMM (Boyd et al., 2011), and distributed dual coordinate ascent (Yang, 2013; Jaggi et al., 2014; Ma et al., 2015).

For minimizing the average function (1/m) Σ_{i=1}^m f_i(w), in the centralized setting and with only first-order oracles (i.e., gradients of the f_i's or their conjugates), it has been shown that distributed implementations of accelerated gradient methods achieve the optimal convergence rate and communication complexity (Arjevani and Shamir, 2015; Scaman et al., 2017). The problem we consider has the extra structure of composition with a linear transformation by the local data, which allows us to exploit simultaneous data and model parallelism using randomized algorithms and obtain improved communication and computation complexity.

Most work on asynchronous distributed algorithms exploits model parallelism in order to reduce the synchronization cost, especially in the setting with parameter servers (e.g., Li et al., 2014; Xing et al., 2015; Aytekin et al., 2016). Besides, the delay caused by asynchrony can be incorporated into the step size to gain practical improvement on convergence (e.g., Agarwal and Duchi, 2011; McMahan and Streeter, 2014; Sra et al., 2016), though the theoretical sublinear rates remain. There are also many recent works on asynchronous parallel stochastic gradient and coordinate-descent algorithms for convex optimization (e.g., Recht et al., 2011; Liu et al., 2014; Shi et al., 2015; Reddi et al., 2015; Richtárik and Takáč, 2016; Peng et al., 2016). When the workloads or computing power of different machines or processors are nonuniform, they may significantly increase iteration efficiency (the number of iterations done in unit time), but often at the cost of requiring more iterations than their synchronous counterparts (due to delays and stale updates). So there is a subtle balance between iteration efficiency and iteration complexity (e.g., Hannah and Yin, 2017). Our discussions in Section 2.1 show that DSCOVR is capable of improving both aspects.

For solving bilinear saddle-point problems with a finite-sum structure, Zhang and Xiao (2017) proposed a randomized algorithm that works with dual coordinate update but full primal update. Yu et al. (2015) proposed a doubly stochastic algorithm that works with both primal and dual coordinate updates based on equation (12). Both achieved accelerated linear convergence rates, but neither can be readily applied to distributed computing. In addition, Balamurugan and Bach (2016) proposed stochastic variance-reduction methods (also based on SVRG and SAGA) for solving more general convex-concave saddle-point problems. For the special case with bilinear coupling, they obtained similar computation complexity as DSCOVR. However, their methods require full model updates at each iteration (even though working with only one sub-block of data), and thus are not suitable for distributed computing.

With additional assumptions and structure, such as similarity between the local cost functions at different machines or the use of second-order information, it is possible to obtain better communication complexity for distributed optimization; see, e.g., Shamir et al. (2014); Zhang and Xiao (2015); Reddi et al. (2016). However, these algorithms rely on much more computation at each machine for solving a local sub-problem at each iteration. With additional memory and preprocessing at each machine, Lee et al. (2015) showed that SVRG can be adapted for distributed optimization to obtain low communication complexity.

3. The DSCOVR-SVRG Algorithm

From this section to Section 6, we present several realizations of DSCOVR using different variance reduction techniques and acceleration schemes, and analyze their convergence properties. These algorithms are presented and analyzed as sequential randomized algorithms. We will discuss how to implement them for asynchronous distributed computing in Section 7.

Algorithm 2 is a DSCOVR algorithm that uses the technique of SVRG (Johnson and Zhang, 2013) for variance reduction. The iterations are divided into stages and each stage has an inner loop. Each stage is initialized by a pair of vectors w̄^s ∈ R^d and ᾱ^s ∈ R^N, which come from either the initialization (if s = 0) or the last iterate of the previous stage (if s > 0). At the beginning of each stage, we compute the batch gradients

    ū^s = ∂K(w̄^s, ᾱ^s)/∂α = X w̄^s,    v̄^s = (1/m) ∂K(w̄^s, ᾱ^s)/∂w = (1/m) X^T ᾱ^s.

Algorithm 2: DSCOVR-SVRG

input: initial points w̄^0, ᾱ^0, number of stages S and number of iterations per stage M.
1: for s = 0, 1, 2, ..., S−1 do
2:   ū^s = X w̄^s and v̄^s = (1/m) X^T ᾱ^s
3:   w^0 = w̄^s and α^0 = ᾱ^s
4:   for t = 0, 1, 2, ..., M−1 do
5:     pick i ∈ {1,...,m} and k ∈ {1,...,n} randomly with distributions p and q respectively.
6:     compute variance-reduced stochastic gradients:

           u_i^{t+1} = ū_i^s + (1/q_k)(X_{ik} w_k^t − X_{ik} w̄_k^s),    (19)
           v_k^{t+1} = v̄_k^s + (1/(p_i m)) X_{ik}^T (α_i^t − ᾱ_i^s).    (20)

7:     update primal and dual block coordinates:

           α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}), keeping α_{i'}^{t+1} = α_{i'}^t for all i' ≠ i,
           w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}), keeping w_{k'}^{t+1} = w_{k'}^t for all k' ≠ k.

8:   end for
9:   w̄^{s+1} = w^M and ᾱ^{s+1} = α^M.
10: end for
output: w̄^S and ᾱ^S.

The vectors ū^s and v̄^s share the same partitions as α^t and w^t, respectively. Inside each stage s, the variance-reduced stochastic gradients are computed in (19) and (20).
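Before analyzing these updates, here is our serial sketch of one stage (the paper's system is asynchronous C++/MPI; this is only an illustration), reusing the ridge-regression closed forms of the Algorithm 1 sketch, with c = m/N the scaling from (4):

```python
# One stage of DSCOVR-SVRG (Algorithm 2), serial simulation.
import numpy as np

def svrg_stage(Xs, ys, w_bar, alpha_bar, lam, c, sigma, tau, blocks, M, rng):
    m, n = len(Xs), len(blocks)
    p, q = 1.0 / m, 1.0 / n                      # uniform sampling
    # batch gradients at the stage start (Step 2)
    u_bar = [X @ w_bar for X in Xs]                            # X w_bar
    v_bar = sum(X.T @ a for X, a in zip(Xs, alpha_bar)) / m    # X^T alpha_bar / m
    w = w_bar.copy()
    alphas = [a.copy() for a in alpha_bar]
    for _ in range(M):
        i, k = rng.integers(m), rng.integers(n)
        cols = blocks[k]
        X_ik = Xs[i][:, cols]
        # variance-reduced stochastic gradients, equations (19)-(20)
        u = u_bar[i] + (1.0 / q) * (X_ik @ (w[cols] - w_bar[cols]))
        v = v_bar[cols] + (1.0 / (p * m)) * (X_ik.T @ (alphas[i] - alpha_bar[i]))
        # closed-form proximal updates, as in the Algorithm 1 sketch
        z = alphas[i] + sigma * u
        alphas[i] = c * (z - sigma * ys[i]) / (c + sigma)
        w[cols] = (w[cols] - tau * v) / (1.0 + tau * lam)
    return w, alphas
```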

It is easy to check that they are unbiased. More specifically, taking the expectation of u_i^{t+1} with respect to the random index k gives

    E_k[u_i^{t+1}] = ū_i^s + Σ_{k'=1}^n q_{k'} (1/q_{k'})(X_{ik'} w_{k'}^t − X_{ik'} w̄_{k'}^s) = ū_i^s + X_{i:} w^t − X_{i:} w̄^s = X_{i:} w^t,

and taking the expectation of v_k^{t+1} with respect to the random index i gives

    E_i[v_k^{t+1}] = v̄_k^s + Σ_{i'=1}^m p_{i'} (1/(p_{i'} m)) X_{i'k}^T (α_{i'}^t − ᾱ_{i'}^s) = v̄_k^s + (1/m)(X_{:k})^T(α^t − ᾱ^s) = (1/m)(X_{:k})^T α^t.

In order to measure the distance of any pair of primal and dual variables to the saddle point, we define a weighted squared Euclidean norm on R^{d+N}. Specifically, for any pair (w, α) where w ∈ R^d and α = [α₁; ...; α_m] ∈ R^N with α_i ∈ R^{N_i}, we define

    Ω(w, α) = λ‖w‖² + (1/m) Σ_{i=1}^m γ_i ‖α_i‖².    (21)

If γ_i = γ for all i = 1,...,m, then Ω(w, α) = λ‖w‖² + (γ/m)‖α‖². We have the following theorem concerning the convergence rate of Algorithm 2.

Theorem 1. Suppose Assumption 1 holds, and let (w★, α★) be the unique saddle point of L(w, α). Let Γ be a constant that satisfies

    Γ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{ik}‖²/(2 q_k λ γ_i)), (1/q_k)(1 + 9n‖X_{ik}‖²/(2 m p_i λ γ_i)) }.    (22)

In Algorithm 2, if we choose the step sizes as

    σ_i = 1/(2 γ_i (p_i Γ − 1)),  i = 1,...,m,    (23)
    τ_k = 1/(2 λ (q_k Γ − 1)),  k = 1,...,n,    (24)

and the number of iterations during each stage satisfies M ≥ Γ log 3, then for any s > 0,

    E[Ω(w̄^s − w★, ᾱ^s − α★)] ≤ (2/3)^s Ω(w̄^0 − w★, ᾱ^0 − α★).    (25)

The proof of Theorem 1 is given in Appendix A. Here we discuss how to choose the parameter Γ to satisfy (22). For simplicity, we assume γ_i = γ for all i = 1,...,m. If we let ‖X‖_{m×n} = max_{i,k}{‖X_{ik}‖} and sample with the uniform distribution across both rows and columns, i.e., p_i = 1/m for i = 1,...,m and q_k = 1/n for k = 1,...,n, then we can set

    Γ = max{m, n}(1 + 9n‖X‖²_{m×n}/(2λγ)) = max{m, n}(1 + (9/2) κ_rand),

where κ_rand = n‖X‖²_{m×n}/(λγ) as defined in (16).

An alternative condition for Γ to satisfy is (shown in Section A.1 in the Appendix)

    Γ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{:k}‖²_F/(2 q_k m λγ)), (1/q_k)(1 + 9‖X_{i:}‖²_F/(2 p_i m λγ)) }.    (26)

Again using uniform sampling, we can set

    Γ = max{m, n}(1 + 9‖X‖²_{max,F}/(2λγ)) = max{m, n}(1 + (9/2) κ'_rand),

where ‖X‖_{max,F} = max_{i,k}{‖X_{i:}‖_F, ‖X_{:k}‖_F} and κ'_rand = ‖X‖²_{max,F}/(λγ) as defined in (17). Using the condition (26), if we choose the probabilities to be proportional to the squared Frobenius norms of the data partitions, i.e.,

    p_i = ‖X_{i:}‖²_F / ‖X‖²_F,    q_k = ‖X_{:k}‖²_F / ‖X‖²_F,    (27)

then we can choose

    Γ = (1/min_{i,k}{p_i, q_k})(1 + 9‖X‖²_F/(2mλγ)) = (1/min_{i,k}{p_i, q_k})(1 + (9/2) κ''_rand),

where κ''_rand = ‖X‖²_F/(mλγ). Moreover, we can set the step sizes as (see Appendix A.2)

    σ_i = 2mλ/(9‖X‖²_F),    τ_k = 2mγ/(9‖X‖²_F).

For the ERM problem (3), we assume that each loss function φ_j, for j = 1,...,N, is (1/ν)-smooth. According to (4), the smoothness parameter for each f_i is γ_i = γ = (N/m)ν. Let R be the largest Euclidean norm among all rows of X (or we can normalize each row to have the same norm R); then we have ‖X‖²_F ≤ NR² and

    κ''_rand = ‖X‖²_F/(mλγ) ≤ NR²/(mλγ) = R²/(λν).    (28)

The upper bound R²/(λν) is a condition number used for characterizing the iteration complexity of many randomized algorithms for ERM (e.g., Shalev-Shwartz and Zhang, 2013; Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Zhang and Xiao, 2017). In this case, using the non-uniform sampling in (27), we can set the step sizes to be

    σ_i = 2λm/(9R²N),    τ_k = 2γm/(9R²N) = 2ν/(9R²).    (29)
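The following small sketch (ours) computes the non-uniform sampling distributions (27) and the ERM step sizes (29) from the data blocks; `blocks` is the list of column-index sets defining the partition of w:

```python
# Sampling probabilities (27) and step sizes (29) for the ERM setting.
import numpy as np

def sampling_and_steps(Xs, blocks, lam, nu):
    m = len(Xs)
    N = sum(X.shape[0] for X in Xs)
    # p_i, q_k proportional to squared Frobenius norms of row/column blocks
    row_sq = np.array([np.linalg.norm(X, 'fro') ** 2 for X in Xs])
    col_sq = np.array([sum(np.linalg.norm(X[:, cols], 'fro') ** 2 for X in Xs)
                       for cols in blocks])
    total = row_sq.sum()                       # = ||X||_F^2
    p, q = row_sq / total, col_sq / total
    # step sizes (29), with R the largest row norm of X
    R2 = max(np.max(np.sum(X * X, axis=1)) for X in Xs)
    sigma = 2 * lam * m / (9 * R2 * N)
    tau = 2 * nu / (9 * R2)
    return p, q, sigma, tau
```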

Next we estimate the overall computation complexity of DSCOVR-SVRG in order to achieve E[Ω(w̄^s − w★, ᾱ^s − α★)] ≤ ε. From (25), the number of stages required is log(Ω₀/ε)/log(3/2), where Ω₀ = Ω(w̄^0 − w★, ᾱ^0 − α★). The number of inner iterations within each stage is M = Γ log 3. At the beginning of each stage, computing the batch gradients ū^s and v̄^s requires going through the whole data set X, whose computational cost is equivalent to m·n inner iterations. Therefore, the overall complexity of Algorithm 2, measured by the total number of inner iterations, is

    O((mn + Γ) log(Ω₀/ε)).

To simplify discussion, we further assume m ≤ n, which is always the case for distributed implementation (see Figure 2 and Section 7). In this case, we can let Γ = n(1 + (9/2)κ_rand). Thus the above iteration complexity becomes

    O(n(m + κ_rand) log(1/ε)).    (30)

Since the iteration complexity in (30) counts the number of blocks X_{ik} being processed, the number of passes over the whole dataset X can be obtained by dividing it by mn, i.e.,

    O((1 + κ_rand/m) log(1/ε)).    (31)

This is the computation complexity of DSCOVR listed in Table 1. We can replace κ_rand by κ'_rand or κ''_rand depending on the different proof techniques and sampling probabilities discussed above. We will address the communication complexity of DSCOVR-SVRG, including its decomposition into synchronous and asynchronous parts, after describing its implementation details in Section 7.

In addition to convergence to the saddle point, our next result shows that the primal-dual optimality gap also enjoys the same convergence rate, under slightly different conditions.

Theorem 2. Suppose Assumption 1 holds, and let P(w) and D(α) be the primal and dual functions defined in (6) and (7), respectively. Let Λ and Γ be two constants that satisfy

    Λ ≥ ‖X_{ik}‖²_F,  i = 1,...,m,  k = 1,...,n,

and

    Γ ≥ max_{i,k} { (1/p_i)(1 + 18Λ/(q_k λ γ_i)), (1/q_k)(1 + 18nΛ/(p_i m λ γ_i)) }.

In Algorithm 2, if we choose the step sizes as

    σ_i = 1/(2 γ_i (p_i Γ − 1)),  i = 1,...,m,    (32)
    τ_k = 1/(2 λ (q_k Γ − 1)),  k = 1,...,n,    (33)

and the number of iterations during each stage satisfies M ≥ Γ log 3, then

    E[P(w̄^s) − D(ᾱ^s)] ≤ (2/3)^s Γ (P(w̄^0) − D(ᾱ^0)).    (34)

The proof of Theorem 2 is given in Appendix B. In terms of iteration complexity or total number of passes to reach E[P(w̄^s) − D(ᾱ^s)] ≤ ε, we need to add an extra factor of log(1 + κ_rand) to (30) or (31), due to the factor Γ on the right-hand side of (34).

4. The DSCOVR-SAGA Algorithm

Algorithm 3 is a DSCOVR algorithm that uses the techniques of SAGA (Defazio et al., 2014) for variance reduction. This is a single-stage algorithm with iterations indexed by t. In order to compute the variance-reduced stochastic gradients u_i^{t+1} and v_k^{t+1} at each iteration, we also need to maintain and update two vectors ū^t ∈ R^N and v̄^t ∈ R^d, and two matrices U^t ∈ R^{N×n} and V^t ∈ R^{m×d}. The vector ū^t shares the same partition as α^t into m blocks, and v̄^t shares the same partition as w^t into n blocks. The matrix U^t is partitioned into m×n blocks, with each block U_{ik}^t ∈ R^{N_i}. The matrix V^t is also partitioned into m×n blocks, with each block V_{ik}^t ∈ R^{1×d_k}.

Algorithm 3: DSCOVR-SAGA

input: initial points w^0, α^0, and number of iterations M.
1: ū^0 = X w^0 and v̄^0 = (1/m) X^T α^0
2: U_{ik}^0 = X_{ik} w_k^0 and V_{ik}^0 = (1/m)(α_i^0)^T X_{ik}, for all i = 1,...,m and k = 1,...,n.
3: for t = 0, 1, 2, ..., M−1 do
4:   pick i ∈ {1,...,m} and k ∈ {1,...,n} randomly with distributions p and q respectively.
5:   compute variance-reduced stochastic gradients:

         u_i^{t+1} = ū_i^t − (1/q_k) U_{ik}^t + (1/q_k) X_{ik} w_k^t,    (35)
         v_k^{t+1} = v̄_k^t − (1/p_i)(V_{ik}^t)^T + (1/(p_i m)) X_{ik}^T α_i^t.    (36)

6:   update primal and dual block coordinates:

         α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}), keeping α_{i'}^{t+1} = α_{i'}^t for all i' ≠ i,
         w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}), keeping w_{k'}^{t+1} = w_{k'}^t for all k' ≠ k.

7:   update averaged stochastic gradients:

         ū_i^{t+1} = ū_i^t − U_{ik}^t + X_{ik} w_k^t, keeping ū_{i'}^{t+1} = ū_{i'}^t for all i' ≠ i,
         v̄_k^{t+1} = v̄_k^t − (V_{ik}^t)^T + (1/m) X_{ik}^T α_i^t, keeping v̄_{k'}^{t+1} = v̄_{k'}^t for all k' ≠ k.

8:   update the table of historical stochastic gradients:

         U_{ik}^{t+1} = X_{ik} w_k^t and V_{ik}^{t+1} = (1/m)(X_{ik}^T α_i^t)^T, keeping all other blocks of U^t and V^t unchanged.

9: end for
output: w^M and α^M.
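Our sketch of one iteration (Steps 4-8) follows; the tables and running sums are assumed initialized as in Steps 1-2 (U[i][k] = X_ik @ w0, V[i][k] = (1/m) X_ik^T alpha0_i, with u_bar and v_bar their row/column sums), and the prox steps reuse the ridge closed forms of the earlier sketches:

```python
# One DSCOVR-SAGA iteration (our serial illustration), c = m/N as in (4).
import numpy as np

def saga_step(Xs, ys, w, alphas, U, V, u_bar, v_bar,
              lam, c, sigma, tau, blocks, rng):
    m, n = len(Xs), len(blocks)
    p, q = 1.0 / m, 1.0 / n                      # uniform sampling
    i, k = rng.integers(m), rng.integers(n)
    cols = blocks[k]
    X_ik = Xs[i][:, cols]
    new_U = X_ik @ w[cols]                       # X_ik w_k^t
    new_V = (1.0 / m) * (X_ik.T @ alphas[i])     # (1/m) X_ik^T alpha_i^t
    # variance-reduced stochastic gradients, equations (35)-(36)
    u = u_bar[i] + (new_U - U[i][k]) / q
    v = v_bar[cols] + (new_V - V[i][k]) / p
    # Step 6: proximal updates (ridge closed forms)
    z = alphas[i] + sigma * u
    alphas[i] = c * (z - sigma * ys[i]) / (c + sigma)
    w[cols] = (w[cols] - tau * v) / (1.0 + tau * lam)
    # Steps 7-8: refresh running sums and the table with the gradient
    # pieces evaluated at the pre-update iterates
    u_bar[i] += new_U - U[i][k]
    v_bar[cols] += new_V - V[i][k]
    U[i][k], V[i][k] = new_U, new_V
```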

According to the updates in Steps 7 and 8 of Algorithm 3, we have

    ū_i^t = Σ_{k=1}^n U_{ik}^t,  i = 1,...,m,    (37)
    v̄_k^t = Σ_{i=1}^m (V_{ik}^t)^T,  k = 1,...,n.    (38)

Based on the above constructions, we can show that u_i^{t+1} is an unbiased stochastic gradient of K(w^t, α^t) = (α^t)^T X w^t with respect to α_i, and v_k^{t+1} is an unbiased stochastic gradient of (1/m)(α^t)^T X w^t with respect to w_k. More specifically, according to (35), we have

    E_k[u_i^{t+1}] = ū_i^t − Σ_{k'=1}^n q_{k'} (1/q_{k'}) U_{ik'}^t + Σ_{k'=1}^n q_{k'} (1/q_{k'}) X_{ik'} w_{k'}^t = ū_i^t − ū_i^t + X_{i:} w^t = ∂/∂α_i [(α^t)^T X w^t],    (39)

where the third equality is due to (37). Similarly, according to (36), we have

    E_i[v_k^{t+1}] = v̄_k^t − Σ_{i'=1}^m p_{i'} (1/p_{i'})(V_{i'k}^t)^T + Σ_{i'=1}^m p_{i'} (1/(p_{i'} m)) X_{i'k}^T α_{i'}^t = v̄_k^t − v̄_k^t + (1/m)(X_{:k})^T α^t = ∂/∂w_k [(1/m)(α^t)^T X w^t],    (40)

where the third equality is due to (38).

Regarding the convergence of DSCOVR-SAGA, we have the following theorem, which is proved in Appendix C.

Theorem 3. Suppose Assumption 1 holds, and let (w★, α★) be the unique saddle point of L(w, α). Let Γ be a constant that satisfies

    Γ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{ik}‖²/(2 q_k λ γ_i)), (1/q_k)(1 + 9n‖X_{ik}‖²/(2 m p_i λ γ_i)), 1/(p_i q_k) }.    (41)

If we choose the step sizes as

    σ_i = 1/(2 γ_i (p_i Γ − 1)),  i = 1,...,m,    (42)
    τ_k = 1/(2 λ (q_k Γ − 1)),  k = 1,...,n,    (43)

then the iterations of Algorithm 3 satisfy, for t = 1, 2, ...,

    E[Ω(w^t − w★, α^t − α★)] ≤ (1 − 1/(3Γ))^t (4/3) Ω(w^0 − w★, α^0 − α★).    (44)

The condition on Γ in (41) is very similar to the one in (22), except that here we have an additional term 1/(p_i q_k) when taking the maximum over i and k. This results in an extra mn term in estimating Γ under uniform sampling. Assuming m ≤ n (true for distributed implementation), we can let

    Γ = n(1 + (9/2) κ_rand) + mn.

According to (44), in order to achieve E[Ω(w^t − w★, α^t − α★)] ≤ ε, DSCOVR-SAGA needs O(Γ log(1/ε)) iterations. Using the above expression for Γ, the iteration complexity is

    O(n(m + κ_rand) log(1/ε)),    (45)

which is the same as (30) for DSCOVR-SVRG. This also leads to the same computational complexity measured by the number of passes over the whole dataset, which is given in (31). Again we can replace κ_rand by κ'_rand or κ''_rand as discussed in Section 3. We will discuss the communication complexity of DSCOVR-SAGA in Section 7, after describing its implementation details.

5. Accelerated DSCOVR Algorithms

In this section, we develop an accelerated DSCOVR algorithm by following the "catalyst" framework (Lin et al., 2015; Frostig et al., 2015). More specifically, we adopt the same procedure as Balamurugan and Bach (2016) for solving convex-concave saddle-point problems.

Algorithm 4 proceeds in rounds indexed by r = 0, 1, 2, .... Given the initial points w̃^0 ∈ R^d and α̃^0 ∈ R^N, each round r computes two new vectors w̃^{r+1} and α̃^{r+1} using either the DSCOVR-SVRG or the DSCOVR-SAGA algorithm for solving a regularized saddle-point problem, similar to the classical proximal point algorithm (Rockafellar, 1976).

Algorithm 4: Accelerated DSCOVR

input: initial points w̃^0, α̃^0, and parameter δ > 0.
1: for r = 0, 1, 2, ..., do
2:   find an approximate saddle point of (46) using one of the following two options:
     option 1: run Algorithm 2 with S = ⌈log(2(1+δ))/log(3/2)⌉ and M = ⌈Γ_δ log 3⌉ to obtain (w̃^{r+1}, α̃^{r+1}) = DSCOVR-SVRG(w̃^r, α̃^r, S, M).
     option 2: run Algorithm 3 with M = ⌈6 log(8(1+δ)/3) Γ_δ⌉ to obtain (w̃^{r+1}, α̃^{r+1}) = DSCOVR-SAGA(w̃^r, α̃^r, M).
3: end for

Let δ > 0 be a parameter which we will determine later. Consider the following perturbed saddle-point function for round r:

    L_δ^(r)(w, α) = L(w, α) + (δλ/2)‖w − w̃^r‖² − (δ/(2m)) Σ_{i=1}^m γ_i ‖α_i − α̃_i^r‖².    (46)

Under Assumption 1, the function L_δ^(r)(w, α) is (1+δ)λ-strongly convex in w and (1+δ)γ_i/m-strongly concave in α_i. Let Γ_δ be a constant that satisfies

    Γ_δ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{ik}‖²/(2 q_k λ γ_i (1+δ)²)), (1/q_k)(1 + 9n‖X_{ik}‖²/(2 m p_i λ γ_i (1+δ)²)), 1/(p_i q_k) },

where the right-hand side is obtained from (41) by replacing λ and γ_i with (1+δ)λ and (1+δ)γ_i respectively. The constant Γ_δ is used in Algorithm 4 to determine the number of inner iterations to run within each round, as well as for setting the step sizes. The following theorem is proved in Appendix D.

Theorem 4. Suppose Assumption 1 holds, and let (w★, α★) be the saddle point of L(w, α). With either option in Algorithm 4, if we choose the step sizes (inside Algorithm 2 or Algorithm 3) as

    σ_i = 1/(2(1+δ) γ_i (p_i Γ_δ − 1)),  i = 1,...,m,    (47)
    τ_k = 1/(2(1+δ) λ (q_k Γ_δ − 1)),  k = 1,...,n,    (48)

then for all r ≥ 0,

    E[Ω(w̃^r − w★, α̃^r − α★)] ≤ (1 − 1/(2(1+δ)))^r Ω(w̃^0 − w★, α̃^0 − α★).

According to Theorem 4, in order to have E[Ω(w̃^r − w★, α̃^r − α★)] ≤ ε, we need the number of rounds r to satisfy

    r ≥ 2(1+δ) log(Ω(w̃^0 − w★, α̃^0 − α★)/ε).

Following the discussions in Sections 3 and 4, when using uniform sampling and assuming m ≤ n, we can have

    Γ_δ = n(1 + 9κ_rand/(2(1+δ)²)) + mn.    (49)

Then the total number of block coordinate updates in Algorithm 4 is

    Õ((1+δ) Γ_δ log(1+δ) log(1/ε)),

where the log(1+δ) factor comes from the number of stages S in option 1 and the number of steps M in option 2. We hide the log(1+δ) factor with the Õ notation and plug (49) into the expression above to obtain

    Õ(n((1+δ)(1+m) + κ_rand/(1+δ)) log(1/ε)).

Now we can choose δ depending on the relative size of κ_rand and m:

1. If κ_rand > m, we can minimize the above expression by choosing δ = √(κ_rand/m) − 1, so that the overall iteration complexity becomes Õ(n√(m κ_rand) log(1/ε)).
2. If κ_rand ≤ m, then no acceleration is necessary and we can choose δ = 0 to proceed with a single round. In this case, the iteration complexity is O(mn), as seen from (49).

Therefore, in either case, the total number of block iterations by Algorithm 4 can be written as

    Õ(mn + n√(m κ_rand) log(1/ε)).    (50)

As discussed before, the total number of passes over the whole dataset is obtained by dividing by mn: Õ((1 + √(κ_rand/m)) log(1/ε)). This is the computational complexity of accelerated DSCOVR listed in Table 1.

5.1 Proximal Mapping for Accelerated DSCOVR

When applying Algorithm 2 or 3 to approximate the saddle point of (46), we need to replace the proximal mappings of g_k and f_i* by those of g_k + (δλ/2)‖· − w̃_k^r‖² and f_i* + (δγ_i/2)‖· − α̃_i^r‖², respectively. More precisely, we replace w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}) by

    w_k^{t+1} = arg min_{w∈R^{d_k}} { g_k(w) + (δλ/2)‖w − w̃_k^r‖² + (1/(2τ_k))‖w − (w_k^t − τ_k v_k^{t+1})‖² }
              = prox_{τ'_k g_k}( τ'_k (w_k^t/τ_k − v_k^{t+1} + δλ w̃_k^r) ),  where τ'_k = τ_k/(1 + δλ τ_k),    (51)

and replace α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}) by

    α_i^{t+1} = arg min_{α∈R^{N_i}} { f_i*(α) + (δγ_i/2)‖α − α̃_i^r‖² + (1/(2σ_i))‖α − (α_i^t + σ_i u_i^{t+1})‖² }
              = prox_{σ'_i f_i*}( σ'_i (α_i^t/σ_i + u_i^{t+1} + δγ_i α̃_i^r) ),  where σ'_i = σ_i/(1 + δγ_i σ_i).    (52)

We also examine the number of inner iterations determined by Γ_δ and how to set the step sizes. If we choose δ = √(κ_rand/m) − 1, then Γ_δ in (49) becomes

    Γ_δ = n(1 + 9κ_rand/(2(1+δ)²)) + mn = n(1 + (9/2)m) + mn = (5.5m + 1)n.

Therefore a small constant number of passes is sufficient within each round. Using uniform sampling, the step sizes can be estimated as follows:

    σ_i = 1/(2(1+δ)γ(p_i Γ_δ − 1)) ≈ 1/(11 γ n √(κ_rand/m)),    (53)
    τ_k = 1/(2(1+δ)λ(q_k Γ_δ − 1)) ≈ 1/(11 λ √(m κ_rand)).    (54)

As shown by our numerical experiments in Section 8, the step sizes can be set much larger in practice.
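Because the extra quadratic terms in (51)-(52) only rescale the prox parameter and shift its argument, any existing prox routine can be reused unchanged. The following sketch (ours; prox_g(t, v) and prox_f_conj(t, v) are assumed to compute the prox of t·g_k and t·f_i* at v) makes this explicit:

```python
# Perturbed proximal steps (51)-(52) reduced to ordinary prox calls.
def perturbed_primal_prox(prox_g, w_k, v, w_tilde, tau, delta, lam):
    tau_p = tau / (1.0 + delta * lam * tau)            # tau' in (51)
    return prox_g(tau_p, tau_p * (w_k / tau - v + delta * lam * w_tilde))

def perturbed_dual_prox(prox_f_conj, a_i, u, a_tilde, sigma, delta, gamma):
    sigma_p = sigma / (1.0 + delta * gamma * sigma)    # sigma' in (52)
    return prox_f_conj(sigma_p,
                       sigma_p * (a_i / sigma + u + delta * gamma * a_tilde))
```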

6. Conjugate-Free DSCOVR Algorithms

A major disadvantage of primal-dual algorithms for solving problem (1) is the requirement of computing the proximal mapping of the conjugate function f_i*, which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.

Lan and Zhou (2015) developed "conjugate-free" variants of primal-dual algorithms that avoid computing the proximal mapping of the conjugate functions. The main idea is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the conjugate function itself. This technique has been used by Wang and Xiao (2017) to solve structured ERM problems with primal-dual first-order methods. Here we use this approach to derive conjugate-free DSCOVR algorithms. In particular, we replace the proximal mapping for the dual update,

    α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}) = arg min_{α∈R^{N_i}} { f_i*(α) − ⟨α, u_i^{t+1}⟩ + (1/(2σ_i))‖α − α_i^t‖² },

by

    α_i^{t+1} = arg min_{α∈R^{N_i}} { f_i*(α) − ⟨α, u_i^{t+1}⟩ + (1/σ_i) B_i(α, α_i^t) },    (55)

where B_i(α, α_i^t) = f_i*(α) − f_i*(α_i^t) − ⟨∇f_i*(α_i^t), α − α_i^t⟩. The solution to (55) is given by

    α_i^{t+1} = ∇f_i(β_i^{t+1}),

where β_i^{t+1} can be computed recursively by

    β_i^{t+1} = (β_i^t + σ_i u_i^{t+1})/(1 + σ_i),  t ≥ 0,

with initial condition β_i^0 = ∇f_i*(α_i^0) (see Lan and Zhou, 2015, Lemma 1). Therefore, in order to update the dual variables α_i, we do not need to compute the proximal mapping of the conjugate function f_i*; instead, taking the gradient of f_i at some easy-to-compute points is sufficient. This conjugate-free update can be applied in Algorithms 1, 2 and 3.

For the accelerated DSCOVR algorithms, we replace (52) by

    α_i^{t+1} = arg min_{α∈R^{N_i}} { f_i*(α) − ⟨α, u_i^{t+1}⟩ + (1/σ_i) B_i(α, α_i^t) + δγ_i B_i(α, α̃_i^r) }.

The solution to the above minimization problem can also be written as α_i^{t+1} = ∇f_i(β_i^{t+1}), where β_i^{t+1} can be computed recursively as

    β_i^{t+1} = (β_i^t + σ_i u_i^{t+1} + σ_i δγ_i β̃_i^r)/(1 + σ_i + σ_i δγ_i),  t ≥ 0,

with the initialization β_i^0 = ∇f_i*(α_i^0) and β̃_i^r = ∇f_i*(α̃_i^r).

The convergence rates and computational complexities of the conjugate-free DSCOVR algorithms are very similar to the ones given in Sections 3-5. We omit the details here, and refer the readers to Lan and Zhou (2015) and Wang and Xiao (2017) for related results.
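For concreteness, here is our sketch of the conjugate-free dual update (55) for the logistic loss, where f_i(u) = (m/N) Σ_j log(1 + exp(−y_j u_j)); only gradients of f_i are needed, never the prox of its conjugate (c = m/N is the scaling from (4), and the function names are ours):

```python
# Conjugate-free dual step for the logistic loss (illustrative sketch).
import numpy as np

def grad_f_logistic(beta, y, c):
    """Gradient of f_i at beta: -c * y_j / (1 + exp(y_j * beta_j))."""
    return -c * y / (1.0 + np.exp(y * beta))

def conjugate_free_dual_step(beta, u, y, c, sigma):
    beta_new = (beta + sigma * u) / (1.0 + sigma)  # recursion below (55)
    alpha_new = grad_f_logistic(beta_new, y, c)    # alpha^{t+1} = grad f_i(beta^{t+1})
    return beta_new, alpha_new
```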

Figure 3: A distributed system for implementing DSCOVR consists of m workers, h parameter servers, and one scheduler. The arrows labeled with the numbers 1, 2 and 3 represent three collective communications at the beginning of each stage in DSCOVR-SVRG.

7. Asynchronous Distributed Implementation

In this section, we show how to implement the DSCOVR algorithms presented in Sections 3-6 in a distributed computing system. We assume that the system provides both synchronous collective communication and asynchronous point-to-point communication, which are all supported by the MPI standard (MPI Forum, 2012). Throughout this section, we assume m < n (see Figure 2).

7.1 Implementation of DSCOVR-SVRG

In order to implement Algorithm 2, the distributed system needs to have the following components (see Figure 3):

- m workers. Each worker i, for i = 1,...,m, stores the following local data and variables:
  - data matrix X_{i:} ∈ R^{N_i×d};
  - vectors in R^{N_i}: ū_i^s, α_i^t, ᾱ_i^s;
  - vectors in R^d: w̄^s, v̄^s;
  - extra buffers for computation and communication: u_i^{t+1}, v_k^{t+1}, w_k^t and w_k^{t+1}.

- h parameter servers. Each server j stores a subset of the blocks {w_k^t ∈ R^{d_k} : k ∈ S_j}, where S₁,...,S_h form a partition of the set {1,...,n}.
- one scheduler. It maintains a set of block indices S_free ⊆ {1,...,n}. At any given time, S_free contains the indices of parameter blocks that are not currently being updated by any worker.

The reason for having h > 1 servers is not insufficient storage for the parameters, but rather to avoid communication overload between a single server and all m workers (m can be in hundreds).

At the beginning of each stage s, the following three collective communications take place across the system (illustrated in Figure 3 by arrows with circled labels 1, 2 and 3):

(1) The scheduler sends a "sync" message to all servers and workers, and resets S_free = {1,...,n}.
(2) Upon receiving the sync message, the servers aggregate their blocks of parameters together to form w̄^s and send it to all workers (e.g., through the AllReduce operation in MPI).
(3) Upon receiving w̄^s, each worker i computes ū_i^s = X_{i:} w̄^s and (X_{i:})^T ᾱ_i^s, then invokes a collective communication (AllReduce) to compute v̄^s = (1/m) Σ_{i=1}^m (X_{i:})^T ᾱ_i^s.

The number of vectors in R^d sent and received during the above process is 2m, counting the communications to form w̄^s and v̄^s at the m workers (ignoring the short sync messages).

After the collective communications at the beginning of each stage, all workers start working on the inner iterations of Algorithm 2 in parallel, in an asynchronous, event-driven manner. Each worker interacts with the scheduler and the servers in a four-step loop shown in Figure 4. There are always m iterations taking place concurrently (see also Figure 2), each of which may be at a different phase of the four-step loop:

(1) Whenever worker i finishes updating a block k', it sends the pair (i, k') to the scheduler to request another block to update. At the beginning of each stage, k' is not needed.
(2) When the scheduler receives the pair (i, k'), it randomly chooses a block k from the list of free blocks S_free (which are not currently updated by any worker), looks up the server j which stores the parameter block w_k (i.e., k ∈ S_j), and then sends the pair (i, k) to server j. In addition, the scheduler updates the list S_free by adding k' and deleting k.
(3) When server j receives the pair (i, k), it sends the vector w_k to worker i, and waits to receive the updated version from worker i.
(4) After worker i receives w_k, it computes the updates α_i^{t+1} and w_k^{t+1} following Steps 6-7 in Algorithm 2, and then sends w_k^{t+1} back to server j. At last, it assigns the value of k to k' and sends the pair (i, k') to the scheduler, requesting the next block to work on.

The amount of point-to-point communication required during the above process is 2d_k float numbers, for sending and receiving w_k (we ignore the small messages carrying the index pairs). Since the blocks are picked randomly, the average amount of communication per iteration is 2d/n, or equivalently 2/n vectors in R^d. According to Theorem 1, each stage of Algorithm 2 requires Γ log 3 inner iterations; in addition, the discussion above (30) shows that we can take Γ = n(1 + (9/2)κ_rand). Therefore, the average amount of point-to-point communication within each stage is O(κ_rand) vectors in R^d.
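The scheduler's bookkeeping in this four-step loop is simple; the following sketch (ours) captures it, with message transport (MPI point-to-point communication in the paper's implementation) abstracted away:

```python
# A minimal sketch of the scheduler: it hands out random free blocks and
# reclaims finished ones; server_of_block maps a block index k to the
# server j with k in S_j.
import random

class Scheduler:
    def __init__(self, n, server_of_block):
        self.free = set(range(n))               # S_free
        self.server_of_block = server_of_block

    def request(self, worker_id, finished_block=None):
        """Step (2): reclaim finished_block, assign a fresh random block."""
        if finished_block is not None:
            self.free.add(finished_block)
        k = random.choice(tuple(self.free))
        self.free.remove(k)
        # the pair (worker_id, k) would be forwarded to this server:
        return k, self.server_of_block[k]
```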

Figure 4: Communication and computation processes for one inner iteration of DSCOVR-SVRG (Algorithm 2). The blue texts in the parentheses are the additional vectors required by DSCOVR-SAGA (Algorithm 3). There are always m iterations taking place in parallel asynchronously, each evolving around one worker. A server may support multiple (or zero) iterations if more than one (or none) of its stored parameter blocks are being updated.

Now we are ready to quantify the communication complexity of DSCOVR-SVRG to find an ε-optimal solution. Our discussions above show that each stage requires collective communication of 2m vectors in R^d and asynchronous point-to-point communication of (equivalently) κ_rand such vectors. Since there are O(log(1/ε)) stages in total, the total communication complexity is

    O((m + κ_rand) log(1/ε)).

This gives the communication complexity shown in Table 1, as well as its decomposition in Table 2.

7.2 Implementation of DSCOVR-SAGA

We can implement Algorithm 3 using the same distributed system shown in Figure 3, but with some modifications described below. First, the storage at the different components is different:

- m workers. Each worker i, for i = 1,...,m, stores the following data and variables:
  - data matrix X_{i:} ∈ R^{N_i×d};
  - vectors in R^{N_i}: α_i^t, u_i^{t+1}, ū_i^t, and U_{ik}^t for k = 1,...,n;

  - vector in R^d: V_{i:}^t = [V_{i1}^t ⋯ V_{in}^t], which is the i-th row of V^t;
  - buffers for communication and update of w_k^t and v̄_k^t ∈ R^{d_k} (both stored at some server).
- h servers. Each server j stores a subset of the blocks {w_k^t, v̄_k^t ∈ R^{d_k} : k ∈ S_j}.
- one scheduler. It maintains the set of indices S_free ⊆ {1,...,n}, same as in DSCOVR-SVRG.

Unlike DSCOVR-SVRG, there are no stage-wise "sync" messages. All workers and servers work in parallel asynchronously all the time, following the four-step loops illustrated in Figure 4 (including the blue colored texts in the parentheses). Within each iteration, the main difference from DSCOVR-SVRG is that the server and worker need to exchange two vectors of length d_k, namely w_k^t and v̄_k^t, and their updates. This doubles the amount of point-to-point communication, and the average amount of communication per iteration is 4/n vectors of length d. Using the iteration complexity in (45), the total amount of communication required (measured by the number of vectors of length d) is

    O((m + κ_rand) log(1/ε)),

which is the same as for DSCOVR-SVRG. However, its decomposition into synchronous and asynchronous communication is different, as shown in Table 2. If the initial vectors w^0 ≠ 0 or α^0 ≠ 0, then one round of collective communication is required to propagate the initial conditions to all servers and workers, which reflects the O(m) synchronous communication in Table 2.

7.3 Implementation of Accelerated DSCOVR

Implementation of the accelerated DSCOVR algorithm is very similar to the non-accelerated ones. The main differences lie in the two proximal mappings presented in Section 5.1. In particular, the primal update in (51) needs the extra variable w̃_k^r, which should be stored at a parameter server together with w_k^t. We modify the four-step loop shown in Figure 4 as follows:

- Each parameter server j stores the extra block parameters {w̃_k^r : k ∈ S_j}. During step (3), w̃_k^r is sent together with w_k^t (for SVRG) or (w_k^t, v̄_k^t) (for SAGA) to a worker.
- In step (4), no update of w̃_k^r is sent back to the server. Instead, whenever switching rounds, the scheduler informs each server to update its w̃_k^r to the most recent w_k^t.

For the dual proximal mapping in (52), each worker i needs to store an extra vector α̃_i^r, and reset it to the most recent α_i^t when moving to the next round. There is no need for additional synchronization or collective communication when switching rounds in Algorithm 4. The communication complexity (measured by the number of vectors of length d sent or received) can be obtained by dividing the iteration complexity in (50) by n, i.e., O((m + √(m κ_rand)) log(1/ε)), as shown in Table 1.

Finally, in order to implement the conjugate-free DSCOVR algorithms described in Section 6, each worker simply needs to maintain and update an extra vector β_i^t locally.

8. Experiments

In this section, we present numerical experiments on an industrial distributed computing system. This system has hundreds of computers connected by high-speed Ethernet in a data center.

The hardware and software configurations for each machine are listed in Table 3.

Table 3: Configuration of each machine in the distributed computing system.

| CPU | #cores | RAM | network | operating system |
|---|---|---|---|---|
| dual Intel Xeon processors E5-2650 v2, 2.6 GHz | 16 | 128 GB | 10 Gbps Ethernet adapter | Windows Server 2012 |

We implemented all DSCOVR algorithms presented in this paper, including the SVRG and SAGA versions, their accelerated variants, as well as the conjugate-free algorithms. All implementations are written in C++, using MPI for both collective and point-to-point communications (see Figures 3 and 4 respectively). On each worker machine, we also use OpenMP (OpenMP Architecture Review Board, 2011) to exploit the multi-core architecture for parallel computing, including sparse matrix-vector multiplications and vectorized function evaluations.

Implementing the DSCOVR algorithms requires m + h + 1 machines, among which m are workers with local datasets, h are parameter servers, and one is a scheduler (see Figure 3). We focus on solving the ERM problem (3), where the total of N training examples are evenly partitioned and stored at the m workers. We partition the d-dimensional parameters into n subsets of roughly the same size (differing by at most one), where each subset consists of randomly chosen coordinates (without replacement). Then we store the n subsets of parameters on h servers, each getting either ⌊n/h⌋ or ⌈n/h⌉ subsets. As described in Section 7, we make the configurations satisfy n > m > h.

For DSCOVR-SVRG and DSCOVR-SAGA, the step sizes in (29) are very conservative. In the experiments, we replace the coefficient 2/9 by two tuning parameters η_d and η_p for the dual and primal step sizes respectively, i.e.,

    σ_i = η_d λm/(R²N),    τ_k = η_p ν/R².    (56)

For the accelerated DSCOVR algorithms, we use κ_rand = R²/(λν) as shown in (28) for ERM. Then the step sizes in (53) and (54), with γ = (N/m)ν and a generic constant coefficient η, become (a small sketch of these formulas appears at the end of this section)

    σ_i = η_d (m/(nRN)) √(mλ/ν),    τ_k = (η_p/R) √(ν/(mλ)).    (57)

For comparison, we also implemented the following first-order methods for solving problem (1):

- PGD: parallel implementation of the Proximal Gradient Descent method, using synchronous collective communication over m machines. We use the adaptive line search procedure proposed in Nesterov (2013); the exact form used is Algorithm 2 in Lin and Xiao (2015).
- APG: parallel implementation of the Accelerated Proximal Gradient method (Nesterov, 2004, 2013). We use a similar adaptive line search scheme to the one for PGD; the exact form used (with strong convexity) is Algorithm 4 in Lin and Xiao (2015).
- ADMM: the Alternating Direction Method of Multipliers. We use the regularized consensus version in Boyd et al. (2011, Section 7.1.1). For solving the local optimization problems at each node, we use the SDCA method (Shalev-Shwartz and Zhang, 2013).
- CoCoA: the adding version of CoCoA in Ma et al. (2015). Following the suggestion in Ma et al. (2017), we use a randomized coordinate descent algorithm (Nesterov, 2012; Richtárik and Takáč, 2014) for solving the local optimization problems.
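The tuned step sizes (56)-(57) referenced above are straightforward to compute; the following small sketch (ours, for illustration) spells them out, with η_d and η_p the tuning parameters that replace the 2/9 constant:

```python
# Tuned step sizes (56)-(57) for the ERM experiments.
import math

def steps_svrg_saga(eta_d, eta_p, lam, nu, m, N, R):
    sigma = eta_d * lam * m / (R**2 * N)   # dual step size in (56)
    tau = eta_p * nu / R**2                # primal step size in (56)
    return sigma, tau

def steps_accelerated(eta_d, eta_p, lam, nu, m, n, N, R):
    sigma = eta_d * (m / (n * R * N)) * math.sqrt(m * lam / nu)  # (57)
    tau = (eta_p / R) * math.sqrt(nu / (m * lam))                # (57)
    return sigma, tau
```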


More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

A Class of Distributed Optimization Methods with Event-Triggered Communication

A Class of Distributed Optimization Methods with Event-Triggered Communication A Cass of Dstrbuted Optmzaton Methods wth Event-Trggered Communcaton Martn C. Mene Mchae Ubrch Sebastan Abrecht the date of recept and acceptance shoud be nserted ater Abstract We present a cass of methods

More information

Grover s Algorithm + Quantum Zeno Effect + Vaidman

Grover s Algorithm + Quantum Zeno Effect + Vaidman Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the

More information

Distributed Moving Horizon State Estimation of Nonlinear Systems. Jing Zhang

Distributed Moving Horizon State Estimation of Nonlinear Systems. Jing Zhang Dstrbuted Movng Horzon State Estmaton of Nonnear Systems by Jng Zhang A thess submtted n parta fufment of the requrements for the degree of Master of Scence n Chemca Engneerng Department of Chemca and

More information

Delay tomography for large scale networks

Delay tomography for large scale networks Deay tomography for arge scae networks MENG-FU SHIH ALFRED O. HERO III Communcatons and Sgna Processng Laboratory Eectrca Engneerng and Computer Scence Department Unversty of Mchgan, 30 Bea. Ave., Ann

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Lecture 8: Time & Clocks. CDK: Sections TVS: Sections

Lecture 8: Time & Clocks. CDK: Sections TVS: Sections Lecture 8: Tme & Clocks CDK: Sectons 11.1 11.4 TVS: Sectons 6.1 6.2 Topcs Synchronzaton Logcal tme (Lamport) Vector clocks We assume there are benefts from havng dfferent systems n a network able to agree

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

A new P system with hybrid MDE- k -means algorithm for data. clustering. 1 Introduction

A new P system with hybrid MDE- k -means algorithm for data. clustering. 1 Introduction Wesun, Lasheng Xang, Xyu Lu A new P system wth hybrd MDE- agorthm for data custerng WEISUN, LAISHENG XIANG, XIYU LIU Schoo of Management Scence and Engneerng Shandong Norma Unversty Jnan, Shandong CHINA

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1]

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1] DYNAMIC SHORTEST PATH SEARCH AND SYNCHRONIZED TASK SWITCHING Jay Wagenpfel, Adran Trachte 2 Outlne Shortest Communcaton Path Searchng Bellmann Ford algorthm Algorthm for dynamc case Modfcatons to our algorthm

More information

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0 MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector

More information

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System.

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System. Part 7: Neura Networ & earnng /2/05 Superved earnng Neura Networ and Bac-Propagaton earnng Produce dered output for tranng nput Generaze reaonaby & appropratey to other nput Good exampe: pattern recognton

More information

IV. Performance Optimization

IV. Performance Optimization IV. Performance Optmzaton A. Steepest descent algorthm defnton how to set up bounds on learnng rate mnmzaton n a lne (varyng learnng rate) momentum learnng examples B. Newton s method defnton Gauss-Newton

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Development of whole CORe Thermal Hydraulic analysis code CORTH Pan JunJie, Tang QiFen, Chai XiaoMing, Lu Wei, Liu Dong

Development of whole CORe Thermal Hydraulic analysis code CORTH Pan JunJie, Tang QiFen, Chai XiaoMing, Lu Wei, Liu Dong Deveopment of whoe CORe Therma Hydrauc anayss code CORTH Pan JunJe, Tang QFen, Cha XaoMng, Lu We, Lu Dong cence and technoogy on reactor system desgn technoogy, Nucear Power Insttute of Chna, Chengdu,

More information

Research Article Green s Theorem for Sign Data

Research Article Green s Theorem for Sign Data Internatonal Scholarly Research Network ISRN Appled Mathematcs Volume 2012, Artcle ID 539359, 10 pages do:10.5402/2012/539359 Research Artcle Green s Theorem for Sgn Data Lous M. Houston The Unversty of

More information

Analysis of Bipartite Graph Codes on the Binary Erasure Channel

Analysis of Bipartite Graph Codes on the Binary Erasure Channel Anayss of Bpartte Graph Codes on the Bnary Erasure Channe Arya Mazumdar Department of ECE Unversty of Maryand, Coege Par ema: arya@umdedu Abstract We derve densty evouton equatons for codes on bpartte

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Greyworld White Balancing with Low Computation Cost for On- Board Video Capturing

Greyworld White Balancing with Low Computation Cost for On- Board Video Capturing reyword Whte aancng wth Low Computaton Cost for On- oard Vdeo Capturng Peng Wu Yuxn Zoe) Lu Hewett-Packard Laboratores Hewett-Packard Co. Pao Ato CA 94304 USA Abstract Whte baancng s a process commony

More information

Chapter - 2. Distribution System Power Flow Analysis

Chapter - 2. Distribution System Power Flow Analysis Chapter - 2 Dstrbuton System Power Flow Analyss CHAPTER - 2 Radal Dstrbuton System Load Flow 2.1 Introducton Load flow s an mportant tool [66] for analyzng electrcal power system network performance. Load

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

18.1 Introduction and Recap

18.1 Introduction and Recap CS787: Advanced Algorthms Scrbe: Pryananda Shenoy and Shjn Kong Lecturer: Shuch Chawla Topc: Streamng Algorthmscontnued) Date: 0/26/2007 We contnue talng about streamng algorthms n ths lecture, ncludng

More information

n-step cycle inequalities: facets for continuous n-mixing set and strong cuts for multi-module capacitated lot-sizing problem

n-step cycle inequalities: facets for continuous n-mixing set and strong cuts for multi-module capacitated lot-sizing problem n-step cyce nequates: facets for contnuous n-mxng set and strong cuts for mut-modue capactated ot-szng probem Mansh Bansa and Kavash Kanfar Department of Industra and Systems Engneerng, Texas A&M Unversty,

More information

Outline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique

Outline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique Outlne and Readng Dynamc Programmng The General Technque ( 5.3.2) -1 Knapsac Problem ( 5.3.3) Matrx Chan-Product ( 5.3.1) Dynamc Programmng verson 1.4 1 Dynamc Programmng verson 1.4 2 Dynamc Programmng

More information

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD Matrx Approxmaton va Samplng, Subspace Embeddng Lecturer: Anup Rao Scrbe: Rashth Sharma, Peng Zhang 0/01/016 1 Solvng Lnear Systems Usng SVD Two applcatons of SVD have been covered so far. Today we loo

More information

Cyclic Codes BCH Codes

Cyclic Codes BCH Codes Cycc Codes BCH Codes Gaos Feds GF m A Gaos fed of m eements can be obtaned usng the symbos 0,, á, and the eements beng 0,, á, á, á 3 m,... so that fed F* s cosed under mutpcaton wth m eements. The operator

More information

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed DISTRIBUTED PROCESSIG OVER ADAPTIVE ETWORKS Casso G Lopes and A H Sayed Department of Eectrca Engneerng Unversty of Caforna Los Angees, CA, 995 Ema: {casso, sayed@eeucaedu ABSTRACT Dstrbuted adaptve agorthms

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Expected Value and Variance

Expected Value and Variance MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or

More information

Polite Water-filling for Weighted Sum-rate Maximization in MIMO B-MAC Networks under. Multiple Linear Constraints

Polite Water-filling for Weighted Sum-rate Maximization in MIMO B-MAC Networks under. Multiple Linear Constraints 2011 IEEE Internatona Symposum on Informaton Theory Proceedngs Pote Water-fng for Weghted Sum-rate Maxmzaton n MIMO B-MAC Networks under Mutpe near Constrants An u 1, Youjan u 2, Vncent K. N. au 3, Hage

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning

A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning A Delay-tolerant Proxmal-Gradent Algorthm for Dstrbuted Learnng Konstantn Mshchenko Franck Iutzeler Jérôme Malck Massh Amn KAUST Unv. Grenoble Alpes CNRS and Unv. Grenoble Alpes Unv. Grenoble Alpes ICML

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

CHAPTER III Neural Networks as Associative Memory

CHAPTER III Neural Networks as Associative Memory CHAPTER III Neural Networs as Assocatve Memory Introducton One of the prmary functons of the bran s assocatve memory. We assocate the faces wth names, letters wth sounds, or we can recognze the people

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

Lecture 4: November 17, Part 1 Single Buffer Management

Lecture 4: November 17, Part 1 Single Buffer Management Lecturer: Ad Rosén Algorthms for the anagement of Networs Fall 2003-2004 Lecture 4: November 7, 2003 Scrbe: Guy Grebla Part Sngle Buffer anagement In the prevous lecture we taled about the Combned Input

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information

Chapter 6. Rotations and Tensors

Chapter 6. Rotations and Tensors Vector Spaces n Physcs 8/6/5 Chapter 6. Rotatons and ensors here s a speca knd of near transformaton whch s used to transforms coordnates from one set of axes to another set of axes (wth the same orgn).

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Distributed and Stochastic Machine Learning on Big Data

Distributed and Stochastic Machine Learning on Big Data Dstrbuted and Stochastc Machne Learnng on Bg Data Department of Computer Scence and Engneerng Hong Kong Unversty of Scence and Technology Hong Kong Introducton Synchronous ADMM Asynchronous ADMM Stochastc

More information

A MIN-MAX REGRET ROBUST OPTIMIZATION APPROACH FOR LARGE SCALE FULL FACTORIAL SCENARIO DESIGN OF DATA UNCERTAINTY

A MIN-MAX REGRET ROBUST OPTIMIZATION APPROACH FOR LARGE SCALE FULL FACTORIAL SCENARIO DESIGN OF DATA UNCERTAINTY A MIN-MAX REGRET ROBST OPTIMIZATION APPROACH FOR ARGE SCAE F FACTORIA SCENARIO DESIGN OF DATA NCERTAINTY Travat Assavapokee Department of Industra Engneerng, nversty of Houston, Houston, Texas 7704-4008,

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013 COS 511: heoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 15 Scrbe: Jemng Mao Aprl 1, 013 1 Bref revew 1.1 Learnng wth expert advce Last tme, we started to talk about learnng wth expert advce.

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Games of Threats. Elon Kohlberg Abraham Neyman. Working Paper

Games of Threats. Elon Kohlberg Abraham Neyman. Working Paper Games of Threats Elon Kohlberg Abraham Neyman Workng Paper 18-023 Games of Threats Elon Kohlberg Harvard Busness School Abraham Neyman The Hebrew Unversty of Jerusalem Workng Paper 18-023 Copyrght 2017

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

Journal of Multivariate Analysis

Journal of Multivariate Analysis Journa of Mutvarate Anayss 3 (04) 74 96 Contents sts avaabe at ScenceDrect Journa of Mutvarate Anayss journa homepage: www.esever.com/ocate/jmva Hgh-dmensona sparse MANOVA T. Tony Ca a, Yn Xa b, a Department

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

QUARTERLY OF APPLIED MATHEMATICS

QUARTERLY OF APPLIED MATHEMATICS QUARTERLY OF APPLIED MATHEMATICS Voume XLI October 983 Number 3 DIAKOPTICS OR TEARING-A MATHEMATICAL APPROACH* By P. W. AITCHISON Unversty of Mantoba Abstract. The method of dakoptcs or tearng was ntroduced

More information

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION European Journa of Mathematcs and Computer Scence Vo. No. 1, 2017 ON AUTOMATC CONTNUTY OF DERVATONS FOR BANACH ALGEBRAS WTH NVOLUTON Mohamed BELAM & Youssef T DL MATC Laboratory Hassan Unversty MORO CCO

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Lecture 4: Universal Hash Functions/Streaming Cont d

Lecture 4: Universal Hash Functions/Streaming Cont d CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected

More information

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2 Note 2 Lng fong L Contents Ken Gordon Equaton. Probabty nterpretaton......................................2 Soutons to Ken-Gordon Equaton............................... 2 2 Drac Equaton 3 2. Probabty nterpretaton.....................................

More information

Lecture 4: Constant Time SVD Approximation

Lecture 4: Constant Time SVD Approximation Spectral Algorthms and Representatons eb. 17, Mar. 3 and 8, 005 Lecture 4: Constant Tme SVD Approxmaton Lecturer: Santosh Vempala Scrbe: Jangzhuo Chen Ths topc conssts of three lectures 0/17, 03/03, 03/08),

More information

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1 Abstract The Entre Souton Path for Support Vector Machne n Postve and Unabeed Cassfcaton 1 Yao Lmn, Tang Je, and L Juanz Department of Computer Scence, Tsnghua Unversty 1-308, FIT, Tsnghua Unversty, Bejng,

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Proceedngs of th European Symposum on Artfca Neura Networks, pp. 25-222, ESANN 2003, Bruges, Begum, 2003 On the Equaty of Kerne AdaTron and Sequenta Mnma Optmzaton n Cassfcaton and Regresson Tasks and

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

A General Column Generation Algorithm Applied to System Reliability Optimization Problems

A General Column Generation Algorithm Applied to System Reliability Optimization Problems A Genera Coumn Generaton Agorthm Apped to System Reabty Optmzaton Probems Lea Za, Davd W. Cot, Department of Industra and Systems Engneerng, Rutgers Unversty, Pscataway, J 08854, USA Abstract A genera

More information