Communication-Efficient Distributed Primal-Dual Algorithm for Saddle Point Problems


Yaodong Yu* (Nanyang Technological University, ydyu@ntu.edu.sg), Sulin Liu* (Nanyang Technological University, liusl@ntu.edu.sg), Sinno Jialin Pan (Nanyang Technological University, sinnopan@ntu.edu.sg). *Indicates equal contributions.

Abstract

Primal-dual algorithms, which are proposed to solve reformulated convex-concave saddle point problems, have been proven to be effective for solving a generic class of convex optimization problems, especially when the problems are ill-conditioned. However, the saddle point problem still lacks a distributed optimization framework in which primal-dual algorithms can be employed. In this paper, we propose a novel communication-efficient distributed optimization framework to solve the convex-concave saddle point problem based on primal-dual methods. We carefully design local subproblems and a central problem such that our proposed distributed optimization framework is communication-efficient. We provide a convergence analysis of our proposed algorithm, and extend it to address non-smooth and non-strongly convex loss functions. We conduct extensive experiments on several real-world datasets to demonstrate the competitive performance of the proposed method, especially on ill-conditioned problems.

1 INTRODUCTION

In the era of big data, developing distributed machine learning algorithms has become increasingly important yet challenging. In this work, we focus on developing a new distributed optimization algorithm for regularized empirical risk minimization (ERM), a generic class of convex optimization problems that arises often in machine learning. Specifically, our goal is to minimize the empirical loss defined over n data samples:

min_{x∈R^d} P(x) = (1/n) Σ_{i=1}^n φ_i(⟨a_i, x⟩) + g(x),   (1)

where a_1, ..., a_n ∈ R^d are the feature vectors of n data points, φ_i : R → R is a convex loss function of the linear predictor ⟨a_i, x⟩, for i = 1, ..., n, and g : R^d → R is a convex regularization function for the predictor x ∈ R^d. Suppose that a distributed system consists of K machines, and each machine k has access only to a subset P_k of the data indices [n] := {1, ..., n}, where {P_k}_{k=1}^K is a given partition of [n], and we denote n_k = |P_k|.

1.1 COMMUNICATION-EFFICIENT DISTRIBUTED OPTIMIZATION

One of the most important issues in distributed optimization is communication efficiency, because communication between machines is much more expensive than reading data from memory on local machines. Therefore, more and more effort has been devoted to communication-efficient methods in distributed optimization (Jaggi et al., 2014; Ma et al., 2015; Reddi et al., 2016; Shamir et al., 2014; Smith et al., 2015; Yang, 2013). The basic idea behind these methods is to carefully design the local computation and communication in a distributed system. To achieve this goal, existing methods usually decompose the optimization problem (1) into K local subproblems, denoted by L_k, with respect to the local data on each machine. After each local machine performs an arbitrary optimization method on L_k and solves it approximately, the updated information from the local machines is sent to a central node to carry out a central update. This allows one to control the trade-off between communication and local computation, which is more flexible in the distributed setting. In general, existing communication-efficient methods can be classified into two categories.
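For concreteness, here is a minimal Python sketch of objective (1) and the partition {P_k} just introduced, instantiating φ_i with the smoothed hinge loss that is also used in the experiments of Section 6; all names are illustrative rather than part of the paper's implementation.

```python
import numpy as np

def smoothed_hinge(z, b, gamma=1.0):
    """Smoothed hinge loss phi_i(z) with label b in {-1, +1} (one admissible phi_i)."""
    m = b * z
    if m >= 1.0:
        return 0.0
    if m <= 1.0 - gamma:
        return 1.0 - m - gamma / 2.0
    return (1.0 - m) ** 2 / (2.0 * gamma)

def P(x, A, b, lam, gamma=1.0):
    """Objective (1) with g(x) = (lam/2)||x||^2: rows of A are the feature vectors a_i."""
    n = A.shape[0]
    loss = sum(smoothed_hinge(A[i] @ x, b[i], gamma) for i in range(n)) / n
    return loss + 0.5 * lam * (x @ x)

def partition(n, K, rng):
    """A partition {P_k} of [n] into K index blocks, one per machine."""
    return np.array_split(rng.permutation(n), K)
```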

The first category is referred to as the gradient type, which aims to decompose the problem into K local subproblems and solve each subproblem using gradient descent methods, such as the stochastic variance reduced gradient (SVRG) method (Johnson and Zhang, 2013), on each local machine independently, and then update the optimization variables by averaging the local variables of each machine (Reddi et al., 2016; Shamir et al., 2014). The second category is referred to as the coordinate type, which usually focuses on solving the dual problem of (1),

max_{α∈R^n} D(α) = (1/n) Σ_{i=1}^n −φ*_i(−α_i) − g*( (1/n) Σ_{i=1}^n α_i a_i ),   (2)

where φ*_i is the conjugate function of φ_i, g* is the conjugate function of g, and α is the dual variable vector with i-th element α_i. This category of methods decomposes the dual problem into K local subproblems, one per machine, and employs a coordinate-type method, such as the stochastic dual coordinate ascent (SDCA) method (Shalev-Shwartz and Zhang, 2013), to solve each local subproblem. After each local subproblem is approximately solved, the central node takes an averaging/adding step according to the local update from each machine (Jaggi et al., 2014; Ma et al., 2015; Yang, 2013). Our proposed method falls into this category.

There are two limitations of the coordinate-type approach. One limitation is that most existing coordinate-type methods fail to match the communication complexity lower bounds proved in Arjevani and Shamir (2015). For example, when φ_i is (1/γ)-smooth, g is λ-strongly convex, and R = max_i ||a_i|| in problem (1), the condition number is defined as

κ = R²/(λγ),   (3)

and the lower bound on communication complexity obtained in Arjevani and Shamir (2015) is Õ(√κ log(1/ε)) in order to achieve an ε-suboptimal solution (Nesterov, 2013). Since these coordinate-type methods (Jaggi et al., 2014; Ma et al., 2015; Yang, 2013) can be seen as distributed versions of SDCA, and the iteration complexity of SDCA is Õ((n + κ) log(1/ε)), the communication complexity of these methods is Õ(κ log(1/ε)), which does not match the optimal communication complexity. The problem gets worse when problem (1) is ill-conditioned, i.e., κ ≫ n. The other limitation is that it is not straightforward to extend existing coordinate-type approaches (Jaggi et al., 2014; Ma et al., 2015; Yang, 2013) to deal with forms of the regularization term g other than ℓ2 regularization.

1.2 OUR APPROACH

To overcome the limitations mentioned above, we propose a novel communication-efficient distributed framework with a primal-dual algorithm for saddle point problems (Chambolle and Pock, 2011; Zhang and Xiao, 2015b). The reasons we focus on developing a distributed framework for saddle point problems are threefold:

1. It has been shown that saddle point algorithms are able to obtain comparable and even more stable performance than other state-of-the-art techniques for convex optimization (Chambolle and Pock, 2011; Yu et al., 2015; Zhang and Xiao, 2015b).
2. Primal-dual coordinate algorithms for saddle point problems are able to reach the optimal iteration complexity, as proved by Zhang and Xiao (2015b).
3. Since primal-dual algorithms keep both primal and dual variables in the optimization, they can deal with different kinds of regularization terms g, such as ℓ1 regularization, naturally.

Therefore, we aim to develop a distributed primal-dual algorithm which inherits the above three properties in a distributed environment. To be specific, we first reformulate problem (1) as a convex-concave saddle point problem through convex conjugation, which keeps both primal and dual variables in the optimization. We then decompose the saddle point problem into carefully designed local subproblems, which can be solved independently by each machine with respect to its local data. After the local subproblems are solved approximately, the updated local dual variables of each machine are sent to a central node. Based on all the updates of the local dual variables, we construct a central problem and get the central update of the primal variables on the central node.
Finally, the central node sends the updated primal variables and the aggregated dual variables to each local machine for the next local optimization iteration. Details will be described in Section 3. As will be discussed in Section 4 and Section 5, by carefully designing the local subproblems and the central problem, our algorithm is able to reach the communication complexity lower bounds, and to deal with different kinds of regularization terms. As the parameter λ of the regularization function g is usually on the order of 1/√n or 1/n for many machine learning problems, we are especially interested in solving problem (1) in the distributed setting when it is ill-conditioned. With an extremely large-scale dataset, the condition number (3) can be relatively large, i.e., κ ≫ n.
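To see the scale involved, take hypothetical but typical values: normalized features (R = 1), γ = 1 (as for the smoothed hinge loss used later), and λ = 1/n on a dataset with n = 10^6 samples. Then κ = R²/(λγ) = 10^6, so a method needing Õ(κ log(1/ε)) communication rounds pays on the order of 10^6 · log(1/ε) communications, whereas a method matching the Õ(√κ log(1/ε)) lower bound pays only on the order of 10^3 · log(1/ε), a thousand-fold reduction (constants and logarithmic factors suppressed).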

We will show through experiments in Section 6 that our proposed algorithm obtains better and more stable performance on ill-conditioned problems compared to other communication-efficient distributed optimization methods. Moreover, we provide a solution to extend our algorithm to deal with non-smooth and non-strongly convex optimization problems in Section 5.

1.3 OTHER RELATED WORK

Besides the communication-efficient distributed optimization methods reviewed above, there exist other parallel and distributed optimization techniques. For example, a well-known approach for solving problem (1) is to perform a gradient descent method implemented in a distributed system. Each local machine computes its local gradient and sends it to the central node. The central node aggregates the local gradients to take a gradient step and update x, and then broadcasts it back to each local machine for the next iteration's updates. If the accelerated gradient descent method (Nesterov, 2013) is used, one obtains the iteration complexity Õ(√κ log(1/ε)). Another popular technique is the distributed alternating direction method of multipliers (ADMM) (Boyd et al., 2011; Shi et al., 2014), whose complexity is Õ(√κ log(1/ε)) under certain conditions. Recently, Zhang and Xiao (2015a) proposed a distributed algorithm based on an inexact damped Newton method, which matches the communication complexity lower bound Õ(√κ log(1/ε)). For more related distributed and parallel optimization methods, we refer the reader to Arjevani and Shamir (2015), Chang et al. (2014), Duchi et al. (2013), Zhang et al. (2012) for a more comprehensive coverage of the related work.

2 PRELIMINARIES

In this section, we introduce first-order primal-dual algorithms for convex-concave saddle point problems on a single machine. To begin with, we reformulate the primal problem (1) as a convex-concave saddle point problem through convex conjugation; this saddle point reformulation is widely applied in machine learning (Dai et al., 2016; Zhang and Xiao, 2015b). Based on the definition of the convex conjugate, we replace each component function φ_i(⟨a_i, x⟩) by

φ_i(⟨a_i, x⟩) = sup_{y_i∈R} { y_i ⟨a_i, x⟩ − φ*_i(y_i) },

where φ*_i(y_i) = sup_{α∈R} { α y_i − φ_i(α) } is the convex conjugate of φ_i. We then arrive at a convex-concave saddle point problem

min_{x∈R^d} max_{y∈R^n} f(x, y),   (4)

where

f(x, y) := (1/n) Σ_{i=1}^n ( y_i ⟨a_i, x⟩ − φ*_i(y_i) ) + g(x),

and y ∈ R^n is referred to as the vector of dual variables, with i-th element y_i, each y_i being associated with a data point a_i. Under the assumption that each φ_i is (1/γ)-smooth (so that each φ*_i is γ-strongly convex) and g is λ-strongly convex, the saddle point problem (4) has a unique solution, denoted by (x*, y*). Based on the reformulated problem (4), we can rewrite the optimization problem as

min_{x∈R^d} max_{y∈R^n} f(x, y) = (1/n) ⟨Ay, x⟩ − Φ*(y) + g(x),

where A = [a_1, ..., a_n] and Φ*(y) = (1/n) Σ_{i=1}^n φ*_i(y_i). This is the generic saddle point problem of Chambolle and Pock (2011). To solve this optimization problem, the basic idea is to alternately maximize f with respect to y and minimize f with respect to x:

y^t = argmax_{y∈R^n} { (1/n) ⟨Ay, x̄^{t−1}⟩ − Φ*(y) − (1/(2σ)) ||y − y^{t−1}||² },
x^t = argmin_{x∈R^d} { (1/n) ⟨Ay^t, x⟩ + g(x) + (1/(2τ)) ||x − x^{t−1}||² },
x̄^t = x^t + θ (x^t − x^{t−1}),

where the parameters τ and σ control the quadratic regularization terms with respect to x and y, respectively, which is similar to the use of step sizes in primal methods, and x̄^t is the extrapolation from x^t and x^{t−1} with parameter θ ∈ [0, 1]. Here, θ(x^t − x^{t−1}) is similar to the momentum term in Nesterov's acceleration (Nesterov, 2013). Various primal-dual coordinate algorithms have been designed based on the above scheme, such as Chambolle and Pock (2011), Yu et al. (2015), Zhang and Xiao (2015b). Among them, Yu et al. (2015) and Zhang and Xiao (2015b) developed stochastic versions of the primal-dual algorithm for the saddle point problem. Suppose φ_i is (1/γ)-smooth and g is λ-strongly convex; most existing algorithms (Chambolle and Pock, 2011; Yu et al., 2015; Zhang and Xiao, 2015b) achieve a linear convergence rate to obtain an ε-accurate solution, with expected iteration complexity Õ((n + √(nκ)) log(1/ε)), whose √κ dependence is desirable on ill-conditioned problems.
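To make this iteration concrete, here is a minimal single-machine Python sketch of the batch primal-dual scheme for problem (4), instantiated with ridge losses φ_i(z) = (z − b_i)²/2 (so φ*_i(y) = y²/2 + b_i y) and g(x) = (λ/2)||x||², for which both proximal steps have closed forms; the parameter values in the usage lines are illustrative, not tuned settings.

```python
import numpy as np

def primal_dual(A, b, lam, tau, sigma, theta, T):
    """Batch primal-dual iteration for min_x max_y (1/n)<Ay, x> - Phi*(y) + g(x),
    with phi_i(z) = (z - b_i)^2/2 (so phi*_i(y) = y^2/2 + b_i*y) and g(x) = (lam/2)||x||^2."""
    n, d = A.shape  # rows of A are the feature vectors a_i
    x = np.zeros(d)
    y = np.zeros(n)
    x_bar = x.copy()
    for _ in range(T):
        # y-step: argmax_y (1/n) sum_i (y_i <a_i, x_bar> - phi*_i(y_i)) - (1/(2 sigma))||y - y^{t-1}||^2
        y = (sigma * (A @ x_bar - b) + n * y) / (n + sigma)
        # x-step: argmin_x (1/n)<A^T y, x> + (lam/2)||x||^2 + (1/(2 tau))||x - x^{t-1}||^2
        x_new = (x - (tau / n) * (A.T @ y)) / (1.0 + tau * lam)
        # extrapolation with momentum-like parameter theta in [0, 1]
        x_bar = x_new + theta * (x_new - x)
        x = x_new
    return x, y

# illustrative usage on random data
rng = np.random.default_rng(0)
x_hat, y_hat = primal_dual(rng.standard_normal((200, 5)), rng.standard_normal(200),
                           lam=1e-2, tau=0.5, sigma=5.0, theta=0.9, T=500)
```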
3 DISTRIBUTED SADDLE-POINT FRAMEWORK

In this section, we present our proposed communication-efficient Distributed Saddle Point Algorithm (DSPA) in detail.

Assume that the dataset {a_i}_{i=1}^n is distributed over K local machines, and each machine k has access to a subset P_k of the data, of size n_k. Here we define K local vectors of y, using the notation y_{[k]} ∈ R^n for k = 1, ..., K:

(y_{[k]})_i = y_i if i ∈ P_k, and 0 otherwise,

and K local data matrices of A, using the notation A_{[k]} ∈ R^{d×n} for k = 1, ..., K:

(A_{[k]})_i = a_i if i ∈ P_k, and 0_d otherwise,

where (A_{[k]})_i denotes the i-th column of A_{[k]}, and 0_d is a d-dimensional vector of all zeros. According to the data partition, we decompose problem (4) into K local subproblems, and define an associated central problem. The core idea is to solve the local subproblems on each local machine independently, and then centralize the updated information of each machine to solve an easy central problem.

3.1 LOCAL SUBPROBLEMS

We define a local subproblem of the original saddle point problem (4) for machine k, which only requires accessing data that is available locally, i.e., i ∈ P_k. Specifically, the local subproblem of machine k at the t-th iteration is defined as

min_{x∈R^d} max_{y_{[k]}∈R^n} L_k^t(x, y_{[k]}),   (5)

where

L_k^t(x, y_{[k]}) := (1/n) Σ_{j∈P_k} ( y_j ⟨a_j, x⟩ − φ*_j(y_j) ) + (1/K) g(x) + (1/n) Σ_{i∉P_k} y_i^{t−1} ⟨a_i, x⟩ + r_k^t(x, y_{[k]}),

and

r_k^t(x, y_{[k]}) = (1/(2τK)) ||x − x^{t−1}||² − (K/(2σn)) ||y_{[k]} − y_{[k]}^{t−1}||².

Each local subproblem is anchored at the solution (x^{t−1}, y^{t−1}) obtained in the previous iteration; we will define the parameters τ, σ later. The first two terms can be regarded as a local version of the original saddle point problem (4). The third term,

(1/n) Σ_{i∉P_k} y_i^{t−1} ⟨a_i, x⟩ = (1/n) ⟨ Σ_{l≠k} A_{[l]} y_{[l]}^{t−1}, x ⟩,

can be interpreted as the interaction with the other local subproblems. The quadratic regularization term r_k^t(x, y_{[k]}) is used to enforce that x and y_{[k]} do not move too far away from the solution obtained at the previous iteration. To solve the local subproblem L_k^t(x, y_{[k]}), we can apply primal-dual coordinate types of methods (Yu et al., 2015; Zhang and Xiao, 2015b). The input needed for solving the local subproblem is x^{t−1} and Ay^{t−1}, since the term Σ_{l≠k} A_{[l]} y_{[l]}^{t−1} can be expressed as Ay^{t−1} − A_{[k]} y_{[k]}^{t−1}, and machine k has access to A_{[k]} and y_{[k]}^{t−1}. Denoting the local optimization algorithm by Local-DSPA, we can write the procedure of solving the local subproblem as

(x_k^t, y_{[k]}^t) ← Local-DSPA(x^{t−1}, Ay^{t−1}).

Note that in our framework, each local subproblem is only required to be solved approximately.

3.2 CENTRAL PROBLEM

Let (x_k^t, y_{[k]}^t) denote the approximate solution obtained by solving L_k^t(x, y_{[k]}) on machine k. Letting y^t = Σ_{k=1}^K y_{[k]}^t, we define the central problem at the t-th iteration as

x^t = argmin_{x∈R^d} C^t(x),   (6)

where

C^t(x) := (1/n) Σ_{i=1}^n y_i^t ⟨a_i, x⟩ + g(x) + (1/(2τ)) ||x − x^{t−1}||².

The central update involves two steps. First, the updated local vectors A_{[k]} y_{[k]}^t ∈ R^d are sent to the central node. After that, we solve (6) on the central node, and then send the updated x^t ∈ R^d and Ay^t ∈ R^d of the t-th iteration back to each local machine. The central update can be viewed as minimizing the original saddle point problem (4) over x with the updated y^t. Also note that the parameter τ defined in (5) and (6) is the same parameter, which is one of the key points of our framework.

3.3 OVERALL ALGORITHM DESCRIPTION

The overall algorithm of DSPA is presented in Algorithm 1. First, we distribute the initial dual variable y^0 to the K local machines, and aggregate A_{[k]} y_{[k]}^0 from all machines to compute Ay^0 = Σ_{k=1}^K A_{[k]} y_{[k]}^0. After that, the initial primal variables x^0 and the transformed dual variables Ay^0 are sent to each machine (Steps 1-4). Second, each machine performs optimization on its own local subproblem L_k^t(x, y_{[k]}) using a primal-dual algorithm, such as SPDC (Zhang and Xiao, 2015b), and sends its local update A_{[k]} y_{[k]}^t ∈ R^d to the central node.

Algorithm 1 DSPA(f, x^0, y^0, τ, σ, T)
1: Input: Data points {a_i}_{i=1}^n distributed across K machines {P_k}_{k=1}^K, parameters τ, σ ∈ R_+, initial primal variables x^0 ∈ R^d and initial dual variables y^0 ∈ R^n.
2: Distribute y^0 to the K machines as y_{[k]}^0 for each k.
3: Each machine k computes A_{[k]} y_{[k]}^0 and sends it back to the central node.
4: Compute Ay^0 = Σ_{k=1}^K A_{[k]} y_{[k]}^0 on the central node, and send x^0 and Ay^0 to each machine.
5: for t = 1, 2, ..., T do
6:   for k ∈ [K], in parallel over all machines, do
7:     (x_k^t, y_{[k]}^t) ← Local-DSPA(x^{t−1}, Ay^{t−1})
8:     Send A_{[k]} y_{[k]}^t to the central node
9:   end for
10:  Ay^t = Σ_{k=1}^K A_{[k]} y_{[k]}^t on the central node
11:  x^t ← argmin_{x∈R^d} C^t(x) on the central node
12:  Send x^t and Ay^t back to each machine
13: end for
14: Output: (x^T, y^T)

The goal of this internal procedure is to approximately solve each local subproblem; we refer to it as the inner loop, stated in Steps 6-9 of Algorithm 1. Third, with the local updates A_{[k]} y_{[k]}^t, the central node aggregates them to compute Ay^t, and solves the central problem C^t(x) to get the update x^t (Steps 10-11 of Algorithm 1). Finally, the central node sends x^t and Ay^t back to each machine in order to start the next round of local optimization (Step 12). We refer to Steps 5-13 as the outer loop of our algorithm. Note that, on the one hand, the essential information that the central problem requires can be represented by the vectors A_{[k]} y_{[k]}^t from the local machines, each of dimension d. On the other hand, each local machine only requires (x^t, Ay^t), which is of 2d dimensions, from the central node.

4 CONVERGENCE ANALYSIS

Before we present our main convergence results, we make some assumptions on the objective functions and on the local suboptimality of each local subproblem.

Assumption 1. Each φ_i is convex and differentiable, and φ_i is (1/γ)-smooth, i.e., |φ'_i(a) − φ'_i(b)| ≤ (1/γ)|a − b| for all a, b ∈ R.

Assumption 2. g is λ-strongly convex, i.e., for x, y ∈ R^d and g'(y) ∈ ∂g(y), g(x) ≥ g(y) + ⟨g'(y), x − y⟩ + (λ/2)||x − y||².

Assumption 3. There exist constants Ω_x, Ω_y > 0 such that max_t ||x^t − x*||² ≤ Ω_x and max_t ||y^t − y*||² ≤ Ω_y, where (x^t, y^t) is the t-th-iteration update of Algorithm 1, and (x*, y*) is the optimal solution of (4).

Assumption 4. Each local machine produces an approximate solution (x_k^t, y_{[k]}^t) that satisfies

E[ ||y_{[k]}^t − ŷ_{[k]}^t||² ] ≤ Θ̂ ||y_{[k]}^{t−1} − ŷ_{[k]}^t||²

and

E[ f(x̂_k^t, ŷ_{[k]}^t) − f(x̂_k^t, y_{[k]}^t) ] ≤ Θ′ ( f(x̂_k^t, ŷ_{[k]}^t) − f(x̂_k^t, y_{[k]}^{t−1}) ),

where Θ̂, Θ′ < 1, and (x̂_k^t, ŷ_{[k]}^t) is the optimal solution of the local subproblem L_k^t.

Assumption 5. There exist constants c_1, c_2 > 0 such that the approximate solution y^t = Σ_{k=1}^K y_{[k]}^t satisfies E[Δ̃^t] ≤ Θ″ ( f(x^{t−1}, y*) − f(x*, y^{t−1}) ), where Δ̃^t is defined as

Δ̃^t = Σ_{k=1}^K ( f(x̂_k^t, ŷ_{[k]}^t) − f(x̂_k^t, y_{[k]}^t) ) + M ||y^t − ŷ^t||²,

with M = c_1 Ω_x + c_2 Ω_y, and Θ″ ∈ (0, 1) is a pre-defined constant.

Note that Assumptions 4 and 5 require the local subproblems to be solved to sufficient relative accuracy. As the local subproblems are saddle point problems, we can apply different kinds of saddle point algorithms to solve them, such as Chambolle and Pock (2011), Yu et al. (2015), Zhang and Xiao (2015b), which include both stochastic and non-stochastic optimization methods. To analyse the convergence behavior of our distributed algorithm, we need to characterize the connection between the local subproblems and the central update on the central node. Based on the relationship between the central update x^t and the local optimal solutions (x̂_k^t, ŷ_{[k]}^t), we get the convergence guarantee of Algorithm 1. All proofs can be found in the Appendix.
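Before stating the convergence results, here is a compact Python sketch that mirrors the outer loop of Algorithm 1 under strong simplifications: ridge losses φ_i(z) = (z − b_i)²/2 and g(x) = (λ/2)||x||² (so the central problem (6) has a closed-form minimizer), and a local-solver stub in place of a real Local-DSPA/SPDC implementation. Every name here is illustrative, and the stub is not claimed to satisfy Assumptions 4-5; the sketch only shows where each quantity flows.

```python
import numpy as np

def local_stub(A_k, b_k, y_k, x_prev, n, sigma, passes=5):
    """Stand-in for Local-DSPA: updates only the local dual block y_{[k]},
    holding x at x^{t-1}, via the closed-form prox for phi_i(z) = (z - b_i)^2/2.
    A real implementation would run SPDC on the full local subproblem (5)."""
    y = y_k.copy()
    for _ in range(passes):
        y = (sigma * (A_k @ x_prev - b_k) + n * y) / (n + sigma)
    return y

def dspa(blocks, d, lam, tau, sigma, T):
    """Outer loop of Algorithm 1 for g(x) = (lam/2)||x||^2. `blocks` is a list of
    (A_k, b_k) pairs, one per machine, where the rows of A_k are {a_i : i in P_k}."""
    n = sum(A_k.shape[0] for A_k, _ in blocks)
    x = np.zeros(d)
    ys = [np.zeros(A_k.shape[0]) for A_k, _ in blocks]
    for t in range(T):
        # inner loop: local updates (parallel across machines in a real deployment)
        ys = [local_stub(A_k, b_k, y_k, x, n, sigma)
              for (A_k, b_k), y_k in zip(blocks, ys)]
        # each machine sends A_{[k]} y^t_{[k]} (a d-vector); the central node aggregates
        Ay = sum(A_k.T @ y_k for (A_k, _), y_k in zip(blocks, ys))
        # central problem (6) in closed form:
        # x^t = argmin (1/n)<Ay^t, x> + (lam/2)||x||^2 + (1/(2 tau))||x - x^{t-1}||^2
        x = (x - (tau / n) * Ay) / (1.0 + tau * lam)
    return x, ys
```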

Lemma. Suppose that Assumptos -5 hold. Let x, y be the uque saddle pot of fx, y defed 4, ad defe t =fx t, y fx, y t τλ x t x K Kσγ y t y. 4σK If the parameters τ ad σ are chose such that τσ = 4R ad σ > Kγ, the for t, the proposed DSPA Algorthm acheves E[ t ] ΘE[ t ], 7 where Θ = max{ σγ K K, τλ }. Theorem. Suppose that Assumptos -5 hold, ad the parameters τ = γ/4r, σ = /γ ad K κ. I order for Algorthm to obta E[ x T x ] ɛ, t suffces to have the umber of commucato teratos T satsfy T 4R C log, λγ ɛ where C = 0 λ 4R λγ. Based o Lemma ad Theorem, we ca get the commucato complexty of Algorthm. By tag τ = γ 4R ad σ = γ, the commucato complexty of Algorthm s Õ κ log/ɛ. Ths does ot match the optmal commucato complexty. I the ext secto, we derve a accelerated verso of DSPA that attas the optmal commucato complexty. 5 EXTENSION OF DISPA I ths secto, we derve two extesos of DSPA. The frst exteso apples a geerc accelerato scheme proposed by L et al. 05 to obta a accelerated verso of DSPA. The secod oe exteds our algorthm to hadle o-smooth ad o-strogly covex loss fuctos wth covergece guaratee. 5. ACCELERATION The commucato complexty of our algorthm derved Theorem s Õ κ log/ɛ order to acheve a ɛ-suboptmal soluto, whch does ot match the lower boud proved Arjeva ad Shamr, 05. Based o the Catalyst accelerato scheme proposed by L et al. 05, we develop the accelerated DSPA to acheve the optmal commucato complexty. Catalyst s a geerc scheme for acceleratg frst-order optmzato methods, smlar to classcal gradet descet schemes of Nesterov accelerato. It s a atural choce for dstrbuted optmzato algorthms that solve local subproblem approxmately each commucato terato. Based o Catalyst accelerato, we modfy the orgal objectve fucto by addg a quadratc term f t x, y = fx, y ϑ x zt, where ϑ > 0 s a parameter defed Catalyst, z t s obtaed by a extrapolato step smlar to the accelerato scheme Nesterov, 03. Wth a carefully selected parameter ϑ, we ca develop a accelerated verso of DSPA descrbed Algorthm, whch acheves the optmal commucato complexty. We refer to the accelerated dstrbuted saddle pot algorthm as A-DSPA. Theorem. Suppose Assumpto -5 hold ad assume ϑ = 4R γ λ > 0, f the parameters Catalyst are chose as T s = Õ, q = λ/λ ϑ, α 0 = q. The total commucato teratos of Algorthm for achevg Px T Px < ɛ s Õ κlog ɛ. The Catalyst accelerato scheme s appled wdely to accelerate dfferet ds of optmzato methods Redd et al., 06. As the accelerato quadratc term s related to the prmal varable x ad the saddle pot problem eeps the prmal varable, t s coveet to apply ths accelerato scheme to DSPA ad acheve the optmal commucato complexty. 5. NON-SMOOTH OR/AND NON-STRONGLY CONVEX FUNCTIONS The commucato complexty bouds Theorem ad Theorem are developed uder the Assumptos ad, whch meas that the dervatve of φ eeds to be /γ-lpschtz cotuous ad g eeds to be λ-strogly covex. For geeral loss fuctos mache learg, both assumptos may fal, for example, whe φ s hge loss ad g s l -regularzato term. Therefore, we am to develop a exteso of Algorthm to deal wth o-smooth ad o-strogly covex loss fuctos. To be cocse, we oly cosder the case whe φ s o-smooth ad g s o-strogly covex. The ey dea here s to apply a perturbato term to aalyze the o-smooth ad o-strogly covex settg, whch s commoly used may optmzato methods Redd et al., 06, Zhag ad Xao, 05b. Here we assume that ether φ s smooth or g s strogly covex.
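Since Algorithm 2 below instantiates this scheme, a compact Python sketch of the generic Catalyst outer loop (Lin et al., 2015) may help; the inner solver is a black-box callable standing in for running DSPA on f_t for T_s iterations, the quadratic for α_t is solved in closed form, and all names are illustrative.

```python
import numpy as np

def catalyst(dspa_step, x0, lam, vartheta, T_outer):
    """Generic Catalyst outer loop (Lin et al., 2015) as used by A-DSPA.
    `dspa_step(x_start, z)` is assumed to approximately minimize
    f_t(x) = f(x) + (vartheta/2)||x - z||^2 (a black-box callable here)."""
    q = lam / (lam + vartheta)
    alpha = np.sqrt(q)                # alpha_0 = sqrt(q)
    x_prev = x = x0
    z = x0.copy()
    for _ in range(T_outer):
        x_prev, x = x, dspa_step(x, z)
        # solve alpha_t^2 = (1 - alpha_t) alpha_{t-1}^2 + q alpha_t for alpha_t in (0, 1)
        c = alpha ** 2
        alpha_new = 0.5 * (q - c + np.sqrt((c - q) ** 2 + 4.0 * c))
        beta = alpha * (1.0 - alpha) / (alpha ** 2 + alpha_new)
        z = x + beta * (x - x_prev)   # extrapolation step
        alpha = alpha_new
    return x
```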

Algorithm 2 A-DSPA
1: Input: Data points {a_i}_{i=1}^n distributed across K machines {P_k}_{k=1}^K, parameters τ, σ, ϑ, α_0 ∈ R_+, initial primal variable x^0 generated on the central node, initial local dual variables y^0 generated on the local machines, optimization algorithm DSPA, number of inner iterations T_s of DSPA.
2: Initialization: q = λ/(λ + ϑ), z^0 = x^0, t = 1
3: while the stopping criterion is not satisfied do
4:   (x^t, y^t) = DSPA(f_t, x^{t−1}, y^{t−1}, τ, σ, T_s)
5:   Compute α_t ∈ (0, 1) from the equation α_t² = (1 − α_t) α_{t−1}² + q α_t
6:   Compute z^t = x^t + β_t (x^t − x^{t−1}), where β_t = α_{t−1}(1 − α_{t−1}) / (α_{t−1}² + α_t)
7:   t = t + 1
8: end while
9: Output: x^t

In particular, we consider the following modified saddle point function:

f_ε(x, y) := (1/n) Σ_{i=1}^n ( y_i ⟨a_i, x⟩ − φ*_i(y_i) − (ε/2) y_i² ) + g(x) + (ε/2) ||x||²,   (8)

where ε > 0 is a pre-defined scalar. f_ε(x, y) can be regarded as an approximation of the original saddle point problem f(x, y). We then obtain that both φ*_i(y_i) + (ε/2) y_i² and g(x) + (ε/2) ||x||² are ε-strongly convex. In the following corollary, we show that the perturbed function f_ε(x, y) is a good approximation of the original function f(x, y).

Corollary 1. Assume that each φ_i is convex and L_φ-Lipschitz continuous, and g is convex. Let (x*, y*) be the unique saddle point of f(x, y), and (x*_ε, y*_ε) be the unique saddle point of f_ε(x, y). Then we have

P(x*_ε) − P(x*) ≤ (ε/2) ( ||x*||² + L_φ² ).

As a result, we can apply Algorithm 2 to the perturbed function f_ε(x, y). By applying Theorem 2 and the relationship between f_ε(x, y) and f(x, y), we obtain that the communication complexity of Algorithm 2 for non-smooth and non-strongly convex functions is Õ((1/ε) log(1/ε)); the proof is similar to the one in Zhang and Xiao (2015b).

6 NUMERICAL EXPERIMENTS

In this section, we conduct numerical experiments with DSPA and A-DSPA on several real-world distributed datasets. We compare the performance of our algorithms with the state-of-the-art distributed optimization algorithm CoCoA (Ma et al., 2015). For our experiments, we solve standard binary classification tasks with datasets obtained from the LIBSVM collection (Chang and Lin, 2011). The statistics of the datasets are summarized in Table 1. Our goal is to minimize the regularized empirical risk with the smoothed hinge loss:

min_{x∈R^d} P(x) = (1/n) Σ_{i=1}^n φ_i(⟨a_i, x⟩) + (λ/2) ||x||².

For each task, a data point takes the form (a_i, b_i), where a_i is a feature vector and b_i ∈ {+1, −1} is the corresponding class label, associated with the loss function φ_i. The smoothed hinge loss (Shalev-Shwartz and Zhang, 2013) is defined as

φ_i(z) = 0 if b_i z ≥ 1;  φ_i(z) = 1 − b_i z − γ/2 if b_i z ≤ 1 − γ;  φ_i(z) = (1 − b_i z)²/(2γ) otherwise,

where φ_i is (1/γ)-smooth. We set γ = 1 in our experiments. The conjugate function of φ_i is φ*_i(β) = b_i β + (γ/2) β² for b_i β ∈ [−1, 0], and +∞ otherwise.

Table 1: Three Datasets for Numerical Experiments

DATASET  | # SAMPLES n | # FEATURES d
RCV1     | 677,399     | 47,236
Real-sim | 72,309      | 20,958
Covtype  | 581,012     | 54

6.1 IMPLEMENTATION DETAILS

We implement DSPA, A-DSPA and CoCoA in Petuum (Xing et al., 2015). For DSPA and A-DSPA, we use SPDC (Zhang and Xiao, 2015b) as the local solver, and we use SDCA (Shalev-Shwartz and Zhang, 2013) as the local solver for CoCoA. When comparing different algorithms, we run the same number of local iterations on the local machines (i.e., iterations of SDCA or SPDC) before communicating with the central node. We compare the algorithms based on communication iterations. From our results, we find that DSPA converges faster when the parameters τ and σ are large, which is similar to the findings reported in Johnson and Zhang (2013) and Zhang and Xiao (2015a). We adopt tuned parameters τ and σ from a predefined range.
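As a sanity check on the smoothed hinge loss and its conjugate defined above, the following Python sketch evaluates both and numerically verifies the Fenchel-Young relation φ_i(z) = sup_β { βz − φ*_i(β) } on a grid; the helper names are ours, not the paper's.

```python
import numpy as np

def phi(z, b, gamma=1.0):
    """Smoothed hinge loss phi_i (Shalev-Shwartz and Zhang, 2013), label b in {-1, +1}."""
    m = b * z
    if m >= 1.0:
        return 0.0
    if m <= 1.0 - gamma:
        return 1.0 - m - gamma / 2.0
    return (1.0 - m) ** 2 / (2.0 * gamma)

def phi_conj(beta, b, gamma=1.0):
    """Conjugate: phi*_i(beta) = b*beta + (gamma/2)*beta^2 for b*beta in [-1, 0], +inf otherwise."""
    if -1.0 <= b * beta <= 0.0:
        return b * beta + 0.5 * gamma * beta ** 2
    return np.inf

# Fenchel-Young check: phi(z) = sup_beta { beta*z - phi*(beta) }, grid-approximated.
b = 1.0
betas = b * np.linspace(-1.0, 0.0, 10001)          # feasible range: b*beta in [-1, 0]
conj = np.array([phi_conj(be, b) for be in betas])
for z in (-2.0, 0.0, 0.5, 1.0, 2.0):
    sup = np.max(betas * z - conj)
    assert abs(sup - phi(z, b)) < 1e-3
```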

[Figure 1: Comparison of DSPA, A-DSPA and CoCoA on three datasets, with one panel per dataset and regularization value: (a)-(c) RCV1 with λ = 10^{-6}, 10^{-7}, 10^{-8}; (d)-(f) Real-sim with the same λ values; (g)-(i) Covtype with the same λ values. The horizontal axis is the number of communications, and the vertical axis is the optimality gap P(x^T) − P(x*). Each plot compares CoCoA (blue, circle), DSPA (red, asterisk) and A-DSPA (yellow, diamond). All plots are shown on a log-y scale. The number of local iterations performed in each communication iteration is 5,000 for Real-sim, and 10,000 iterations for RCV1 and Covtype.]

For CoCoA, we present the results for the selected optimal parameter σ.

6.2 COMPARISON BETWEEN DSPA AND COCOA

We compare DSPA and CoCoA on three datasets, RCV1, Real-sim and Covtype, across different values of the regularization parameter λ. To be specific, the regularization parameter is set to λ ∈ {10^{-6}, 10^{-7}, 10^{-8}} on the three datasets. Both DSPA and CoCoA are implemented on 16 machines (K = 16). For both DSPA and CoCoA, the numbers of local iterations performed in each communication iteration are the same. In Figure 1, the results show that the performance of DSPA is comparable with CoCoA. We can see that when λ = 10^{-6}, DSPA has comparable convergence performance with regard to CoCoA. On all three datasets, when λ ∈ {10^{-7}, 10^{-8}}, DSPA converges to the optimal solution faster and more stably than CoCoA. We also notice that the performance of CoCoA on Covtype is not stable compared to that on the other two datasets, especially when λ is very small, i.e., λ = 10^{-7} or λ = 10^{-8}, which is consistent with the experimental results of CoCoA on Covtype in Reddi et al. (2016). In Section 6.3, we will show the superior performance of A-DSPA on ill-conditioned problems.

[Figure 2: The performance of DSPA with an increasing number of machines, K ∈ {4, 8, 16, 32, 64}: (a) Real-sim, λ = 10^{-7}; (b) RCV1, λ = 10^{-7}. The horizontal axis is the number of communications, and the vertical axis is the optimality gap P(x^T) − P(x*). The number of local iterations performed in each communication iteration is 1,000 for Real-sim, and 5,000 iterations for RCV1.]

6.3 COMPARISON OF DSPA AND A-DSPA

In Figure 1, we also compare A-DSPA and DSPA to show the acceleration obtained by applying Catalyst. We compare DSPA and A-DSPA on the three datasets RCV1, Real-sim and Covtype across different regularization parameters, under the same settings as in Section 6.2. The results show that when the problem is ill-conditioned, i.e., κ = R²/(λγ) ≫ n, A-DSPA converges substantially faster than DSPA. This confirms our theoretical analysis, as the communication complexity of A-DSPA is Õ(√κ log(1/ε)) while that of DSPA is Õ(κ log(1/ε)).

6.4 SCALABILITY

Since scalability is an important metric for distributed optimization algorithms, in Figure 2 we study the scalability of DSPA by observing its numerical performance with an increasing number of machines. We conduct experiments on two datasets, RCV1 and Real-sim, and the number of machines is set to K ∈ {4, 8, 16, 32, 64}. For comparison, we keep the iterations and the parameters τ, σ the same for each dataset. The experiments show that DSPA scales effectively with the number of machines, which confirms our theory in Section 4. We observe that performance slightly drops on RCV1 with 64 machines; this is possibly because the local subproblems are solved with higher accuracy, which may affect the effectiveness of aggregation in the central update.

7 CONCLUSION

In this work, we present a novel distributed optimization framework for solving the saddle point problem, which can be applied to a generic class of convex optimization problems. We provide theoretical guarantees for our algorithms, and show that the accelerated algorithm achieves the optimal communication complexity. We also extend our algorithm to handle non-smooth and non-strongly convex loss functions. The experimental results demonstrate that our algorithms obtain better performance compared to a state-of-the-art distributed optimization method.

Acknowledgements

This work is supported by the NTU Singapore Nanyang Assistant Professorship (NAP) grant M40853.00, Singapore MOE AcRF Tier-1 grant MOE06-T--060, and MOE AcRF Tier-2 grant 06-T-00-59.

References

Y. Arjevani and O. Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems, pages 1756-1764, 2015.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

T.-H. Chang, A. Nedic, and A. Scaglione. Distributed constrained optimization by consensus-based primal-dual perturbation method. IEEE Transactions on Automatic Control, 59(6):1524-1538, 2014.

B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual kernel embeddings. arXiv preprint arXiv:1607.04579, 2016.

J. Duchi, M. I. Jordan, and B. McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, pages 2832-2840, 2013.

M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068-3076, 2014.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384-3392, 2015.

C. Ma, V. Smith, M. Jaggi, M. Jordan, P. Richtárik, and M. Takáč. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1973-1982, 2015.

Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

S. J. Reddi, J. Konečný, P. Richtárik, B. Póczós, and A. Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.

O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning, pages 1000-1008, 2014.

W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750-1761, 2014.

V. Smith, S. Forte, M. I. Jordan, and M. Jaggi. L1-regularized distributed optimization: A communication-efficient primal-dual framework. arXiv preprint arXiv:1512.04011, 2015.

E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49-67, 2015.

T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 629-637, 2013.

A. W. Yu, Q. Lin, and T. Yang. Doubly stochastic primal-dual coordinate method for regularized empirical risk minimization with factorized data. CoRR, abs/1508.03390, 2015.

Y. Zhang and L. Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In Proceedings of the 32nd International Conference on Machine Learning, pages 362-370, 2015a.

Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 353-361, 2015b.

Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502-1510, 2012.

Appendix

A Lemmas for Convergence Analysis of DSPA

We first introduce the lemma that characterizes the optimality conditions of the local subproblem L_k^t(x, y_{[k]}).

Lemma 2. Assume that each φ_i is (1/γ)-smooth and g is λ-strongly convex, and let R = max{||a_1||, ..., ||a_n||}. Let (x̂_k^t, ŷ_{[k]}^t) be the optimal solution of L_k^t(x, y_{[k]}), k = 1, 2, ..., K. Based on strong convexity, we have

L_k^t(x, ŷ_{[k]}^t) − L_k^t(x̂_k^t, ŷ_{[k]}^t) ≥ (1/2)(1/τ + λ) ||x − x̂_k^t||², ∀x ∈ R^d,

L_k^t(x̂_k^t, y_{[k]}) − L_k^t(x̂_k^t, ŷ_{[k]}^t) ≤ −(1/2)(K/σ + γ) ||y_{[k]} − ŷ_{[k]}^t||², ∀y_{[k]} ∈ R^n.

Proof. Based on the definition of the saddle point, notice that L_k^t(x̂_k^t, ·) is a (K/σ + γ)-strongly concave function of y_{[k]}, maximized by ŷ_{[k]}^t, which implies the second inequality. Also notice that L_k^t(·, ŷ_{[k]}^t) is a (1/τ + λ)-strongly convex function of x, minimized by x̂_k^t, which implies the first inequality. □

Based on the optimality conditions of L_k^t(x, y_{[k]}) and the central update x^t on the central node, we get the connection between the local subproblems and the central update.

Lemma 3 (Relationship between local optimal solutions and central update). Assume that each φ_i is (1/γ)-smooth and g is λ-strongly convex, and R = max{||a_1||, ..., ||a_n||}. Let (x̂_k^t, ŷ_{[k]}^t) be the optimal solution of L_k^t(x, y_{[k]}), k = 1, 2, ..., K, and let x^t = argmin_{x∈R^d} C^t(x). Then it holds that

(3(K − 1)/(4σK)) ||y^t − y^{t−1}||² + Λ ≥ (1/(4τ) + λ/2) Σ_{k=1}^K ||x^t − x̂_k^t||²,   (9)

where Λ = (1/n) Σ_{k=1}^K ⟨ŷ_{[k]}^t − y_{[k]}^t, A^⊤(x^t − x̂_k^t)⟩ and 1/τ = 4σR².

Proof. By the first-order optimality condition of L_k^t in x at x̂_k^t (Lemma 2), we get, for k = 1, 2, ..., K,

(1/n) Σ_{j∈P_k} ŷ_j^t ⟨a_j, x^t − x̂_k^t⟩ + (1/n) Σ_{i∉P_k} y_i^{t−1} ⟨a_i, x^t − x̂_k^t⟩ + (1/K)( g(x^t) − g(x̂_k^t) ) ≥ (1/(2τK) + λ/(2K)) ||x^t − x̂_k^t||².

Since x^t minimizes C^t(x), we also have, for k = 1, 2, ..., K,

(1/n) Σ_{i=1}^n y_i^t ⟨a_i, x̂_k^t − x^t⟩ + g(x̂_k^t) − g(x^t) ≥ (1/(2τ) + λ/2) ||x^t − x̂_k^t||².

Multiplying the second inequality by 1/K and summing the two inequalities, the g terms cancel, leaving inner products between x^t − x̂_k^t and the dual discrepancies ŷ_{[k]}^t − y_{[k]}^t and y_{[l]}^{t−1} − y_{[l]}^t (l ≠ k) on the left-hand side. The cross term involving Σ_{l≠k} A_{[l]}(y_{[l]}^{t−1} − y_{[l]}^t) is bounded via Young's inequality, using ||a_i|| ≤ R and 1/τ = 4σR², by (1/(4τ)) ||x^t − x̂_k^t||² plus a multiple of ||y^t − y^{t−1}||². Summing over k = 1, 2, ..., K and denoting Λ = (1/n) Σ_{k=1}^K ⟨ŷ_{[k]}^t − y_{[k]}^t, A^⊤(x^t − x̂_k^t)⟩, we obtain (9). □

Lemma 3 shows that the distance between x^t and the local optima x̂_k^t can be controlled by the update of y^t in each iteration.

B Proofs of Convergence for DSPA and A-DSPA

B.1 Proof of Lemma 1

Proof. We start by characterizing the relationship between x^t and x* after the t-th update of Algorithm 1. According to the definition of x^t, we have C^t(x*) ≥ C^t(x^t) + (1/(2τ) + λ/2) ||x^t − x*||², i.e.,

(1/n) Σ_{i=1}^n y_i^t ⟨a_i, x* − x^t⟩ + g(x*) − g(x^t) ≥ (1/(2τ))( ||x^t − x*||² + ||x^t − x^{t−1}||² − ||x* − x^{t−1}||² ) + (λ/2) ||x^t − x*||².   (10)

We can also derive the inequality characterizing the relation between ŷ^t = Σ_{k=1}^K ŷ_{[k]}^t and y*. For k = 1, 2, ..., K, we have ŷ_{[k]}^t = argmax_{y_{[k]}∈R^n} L_k^t(x̂_k^t, y_{[k]}). Since each φ_j is (1/γ)-smooth, each φ*_j is γ-strongly convex; applying the resulting optimality inequality at y*_j, summing over j ∈ P_k and then over k = 1, ..., K, and multiplying both sides by 1/n, we get

(1/n) Σ_{k=1}^K Σ_{j∈P_k} ( (ŷ_j^t − y_j*) ⟨a_j, x̂_k^t⟩ + φ*_j(y_j*) − φ*_j(ŷ_j^t) ) ≥ (K/(2σn))( ||ŷ^t − y^{t−1}||² − ||y* − y^{t−1}||² ) + ((K/σ + γ)/(2n)) ||y* − ŷ^t||².   (11)

In addition, we consider a combination of the saddle-point function values at different points:

f(x^t, y*) − f(x*, y*) + f(x*, y*) − f(x*, y^t) = (1/n) Σ_{i=1}^n y_i* ⟨a_i, x^t − x*⟩ + g(x^t) − g(x*) + (1/n) Σ_{i=1}^n ( φ*_i(y_i^t) − φ*_i(y_i*) ) + (1/n) Σ_{i=1}^n (y_i* − y_i^t) ⟨a_i, x*⟩.

Adding (10) and (11) to this equality couples f(x^t, y*) − f(x*, y^t) with the proximal terms. The remaining inner-product terms on the left-hand side are bounded, using ||a_j|| ≤ R, Young's inequality, and 1/τ = 4σR² (with a balanced partition, n_k = n/K), as

(1/n) ⟨ŷ_{[k]}^t − y*_{[k]}, A^⊤(x̂_k^t − x^t)⟩ ≤ (1/(4τ)) ||x̂_k^t − x^t||² + (1/(4σK)) ||ŷ_{[k]}^t − y*_{[k]}||².

Next we collect the terms that involve the inexactness of the local solutions: we denote by Λ_1 the sum of the inner products involving ŷ^t − y^t, by Λ_2 the terms generated through the expansion

||y* − ŷ^t||² = ||y* − y^t||² + ||y^t − ŷ^t||² + 2⟨y* − y^t, y^t − ŷ^t⟩,

and by Λ_3 the residual of the x-proximal terms, which is controlled through inequality (9) of Lemma 3. Setting Λ = Λ_1 + Λ_2 + Λ_3 > 0 and combining the displays above, we arrive at

Δ^t ≤ max{ K/(K + σγ), 1/(1 + τλ) } Δ^{t−1} + Λ.

It remains to bound Λ. Based on the definition of Λ and Assumption 3, one can show that

Λ ≤ Σ_{k=1}^K ( f(x̂_k^t, ŷ_{[k]}^t) − f(x̂_k^t, y_{[k]}^t) ) + M ||y^t − ŷ^t||²,

where M = 4R²Ω_x + (3K/σ + γ)Ω_y = c_1 Ω_x + c_2 Ω_y, as defined in Assumption 5. By Assumption 4,

E[Λ] ≤ Θ̂ M ||y^t − ŷ^t||² + Θ′ Σ_{k=1}^K ( f(x̂_k^t, ŷ_{[k]}^t) − f(x̂_k^t, y_{[k]}^{t−1}) ),

where Θ̂ < 1, Θ′ < 1 are defined in Assumption 4. Defining the parameter Θ as

Θ = max{ K/(K + σγ), 1/(1 + τλ) },   (12)

and using Assumption 5, i.e.,

E[Δ̃^t] = E[ M ||ŷ^t − y^t||² + Σ_{k=1}^K ( f(x̂_k^t, ŷ_{[k]}^t) − f(x̂_k^t, y_{[k]}^t) ) ] ≤ Θ″ ( f(x^{t−1}, y*) − f(x*, y^{t−1}) ),

we obtain the final conclusion

E[Δ^t] ≤ Θ E[Δ^{t−1}],

where we require that

σ > 1/(Kγ), τσ = 1/(4R²).   (13) □

B.2 Proof of Theorem 1

Proof. By Lemma 1, for each t > 0 we have E[Δ^t] ≤ Θ^t E[Δ^0]; according to the definition of C, this gives

E[||x^t − x*||²] ≤ Θ^t C.   (14)

In order to obtain Θ^T C ≤ ε, we need

T ≥ log(C/ε) / log(1/Θ).

Suppose the parameters τ, σ are set as σ = 1/γ, τ = γ/(4R²). Then, according to the definition (12) of Θ,

Θ = max{ K/(K + σγ), 1/(1 + τλ) } = max{ K/(K + 1), 1/(1 + λγ/(4R²)) }.

As long as 1/K ≥ λγ/(4R²), which holds since we assume K ≤ κ, we have

Θ = 1/(1 + λγ/(4R²)).

Thus we get

T = log(C/ε) / log(1/Θ) ≤ (1 + 4R²/(λγ)) log(C/ε),

where we apply the inequality log(1 + x) ≥ x/(1 + x) in the last step, and the conclusion follows. □

B.3 Convergence guarantee for the primal-dual gap

Next we derive the convergence rate of the primal-dual gap based on Theorem 1.

Lemma 4 (Yu et al., 2015). Suppose Assumptions 1 and 2 hold. Let (x*, y*) be the unique saddle point of f(x, y), and R = max_i ||a_i||. Then for any point (x, y) ∈ dom(g) × dom(Φ*), we have

P(x) ≤ f(x, y*) + (R²/(2γ)) ||x − x*||²,  D(y) ≥ f(x*, y) − (R²/(2λ)) ||y − y*||².   (15)

Corollary 2. Suppose Assumptions 1-5 hold and the parameters τ, σ, Θ are set as in (12) and (13). Then for the iterates of Algorithm 1 to satisfy E[P(x^T) − D(y^T)] ≤ ε, it suffices to have the number of communication iterations T satisfy

T ≥ (1 + 4R²/(λγ)) log( (1 + R²/(λγ)) C / ε ).

Proof. Based on Lemma 4, we have

P(x^t) − D(y^t) = ( P(x^t) − P(x*) ) + ( D(y*) − D(y^t) ) ≤ f(x^t, y*) − f(x*, y^t) + (R²/(2γ)) ||x^t − x*||² + (R²/(2λ)) ||y^t − y*||² ≤ (1 + R²/(λγ)) Δ^t.

Proceeding as in the proof of Theorem 1, we obtain the conclusion. □

B.4 Proof of Theorem 2

Proof. If the parameters τ, σ in Algorithm 1 are set as τ = γ/(4R²), σ = 1/γ, then based on Lemma 1 we have

Θ = max{ K/(K + σγ), 1/(1 + τλ) } = max{ K/(K + 1), 1/(1 + λγ/(4R²)) }.

Assume that K ≥ 2 and 1/K ≥ λγ/(4R²), which means we focus on the situation where the problem is ill-conditioned, i.e., κ = R²/(λγ) ≥ K. Then we have

Θ = 1/(1 + λγ/(4R²)).

Based on Corollary 2, we get

P(x^t) − P(x*) ≤ P(x^t) − D(y^t) ≤ (1 + 4R²/(λγ)) C′ Θ^t ( P(x^0) − P(x*) ),

where C′ > 0 is a constant and P is defined in (1). If we apply DSPA to f_t defined in Section 5.1, then, based on Proposition 3.1 of Lin et al. (2015), the linear rate τ_M defined in Proposition 3.1 equals

τ_M = (λ + ϑ)γ / (4R² + (λ + ϑ)γ),

where ϑ is the parameter defined in Section 5.1. We get that τ_M ≥ (λ + ϑ)γ/(8R²), since κ ≥ 1. Now applying Proposition 3.2 of Lin et al. (2015), we get the global linear rate of convergence with parameter τ_{A,F}:

τ_{A,F} = Õ( τ_M √(λ/(λ + ϑ)) ) = Õ( ((λ + ϑ)γ/(4R²)) √(λ/(λ + ϑ)) ).

If we take ϑ = 4R²/γ − λ, then we get that the communication complexity of A-DSPA for achieving P(x^T) − P(x*) < ε is Õ(√κ log(1/ε)). □