A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization


Journal of Machine Learning Research (Submitted 9/16; Revised 1/17; Published 1/17)

A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization

Shun Zheng (zhengs14@mails.tsinghua.edu.cn), Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Jiale Wang (jiale@uchicago.edu), Department of Computer Science, The University of Chicago, Chicago, Illinois
Fen Xia (xiafen@ebrain.ai), Beijing Wisdom Uranium Technology Co., Ltd., Beijing, China
Wei Xu (weixu@tsinghua.edu.cn), Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Tong Zhang (tongzhang@tongzhang-ml.org), Tencent AI Lab, Shenzhen, China

Editor: Sathiya Keerthi

Abstract

In modern large-scale machine learning applications, the training data are often partitioned and stored on multiple machines. It is customary to employ the data parallelism approach, where the aggregated training loss is minimized without moving data across machines. In this paper, we introduce a novel distributed dual formulation for regularized loss minimization problems that can directly handle data parallelism in the distributed setting. This formulation allows us to systematically derive dual coordinate optimization procedures, which we refer to as Distributed Alternating Dual Maximization (DADM). The framework extends earlier studies described in (Boyd et al., 2011; Ma et al., 2017; Jaggi et al., 2014; Yang, 2013) and has rigorous theoretical analyses. Moreover, with the help of the new formulation, we develop the accelerated version of DADM (Acc-DADM) by generalizing the acceleration technique from (Shalev-Shwartz and Zhang, 2014) to the distributed setting. We also provide theoretical results for the proposed accelerated version, and the new result improves previous ones (Yang, 2013; Ma et al., 2017) whose iteration complexities grow linearly on the condition number. Our empirical studies validate our theory and show that our accelerated approach significantly improves the previous state-of-the-art distributed dual coordinate optimization algorithms.

*Most of the work was done during the internship of Shun Zheng at Baidu Big Data Lab in Beijing.

©2017 Shun Zheng, Jiale Wang, Fen Xia, Wei Xu, and Tong Zhang.
License: CC-BY 4.0; attribution requirements are provided at

Zheng, Wang, Xia, Xu, and Zhang

Keywords: distributed optimization, stochastic dual coordinate ascent, acceleration, regularized loss minimization, computational complexity

1. Introduction

In large-scale machine learning applications for big data analysis, it has become a common practice to partition the training data and store them on multiple machines connected via a commodity network. A typical setting of distributed machine learning is to allow these machines to train in parallel, with each machine processing its local data with no data communication. This paradigm is often referred to as data parallelism. To reduce the overall training time, it is often necessary to increase the number of machines and to minimize the communication overhead. A significant challenge is to reduce the training time as much as possible when we increase the number of machines. A practical solution requires two research directions: one is to improve the underlying system design, making it suitable for machine learning algorithms (Dean and Ghemawat, 2008; Zaharia et al., 2010; Dean et al., 2012; Li et al., 2014); the other is to adapt traditional single-machine optimization methods to handle data parallelism (Boyd et al., 2011; Yang, 2013; Mahajan et al., 2013; Shamir et al., 2014; Jaggi et al., 2014; Mahajan et al., 2017; Ma et al., 2017; Takáč et al., 2015; Zhang and Lin, 2015). This paper focuses on the latter.

For big data machine learning on a single machine, there are two types of algorithms: batch algorithms such as gradient descent or L-BFGS (Liu and Nocedal, 1989), and stochastic optimization algorithms such as stochastic gradient descent and their modern variance-reduced versions (Defazio et al., 2014; Johnson and Zhang, 2013). It is known that batch algorithms are relatively easy to parallelize. However, on a single machine, they converge more slowly than the modern stochastic optimization algorithms due to their high per-iteration computation costs. Specifically, it has been shown that the modern stochastic optimization algorithms converge faster than the traditional batch algorithms for convex regularized loss minimization problems. The faster convergence can be guaranteed in theory and observed in practice.
The fast convergence of modern stochastic optimization methods has led to studies extending these methods to the distributed computing setting. Specifically, this paper considers the generalization of the Stochastic Dual Coordinate Ascent (SDCA) method (Hsieh et al., 2008; Shalev-Shwartz and Zhang, 2013) and its proximal variant (Shalev-Shwartz and Zhang, 2014) to handle distributed training using data parallelism. Although this problem has been considered previously (Yang, 2013; Jaggi et al., 2014; Ma et al., 2017), these earlier approaches work with a dual formulation that is the same as the traditional single-machine dual formulation, where dual variables are coupled, and hence they run into difficulties when they try to motivate and analyze the derived methods in the distributed environment. One contribution of this work is to introduce a new dual formulation specifically for distributed regularized loss minimization problems when data are distributed to multiple machines. In our new formulation, we decouple the local dual variables by introducing another dual variable β. This unique dual formulation allows us to naturally extend the proximal SDCA algorithm (ProxSDCA) of Shalev-Shwartz and Zhang (2014) to the setting of multi-machine distributed optimization that can benefit from data parallelism. Moreover, the analysis of the original ProxSDCA can be easily adapted to the new formulation, leading to new theoretical results. This new dual formulation can also be combined with the acceleration technique of Shalev-Shwartz and Zhang (2014) to further improve convergence.

In the proposed formulation, each iteration of the distributed dual coordinate ascent optimization is naturally decomposed into a local step and a global step. In the local step, we allow the use of any local procedure to optimize a local dual objective function using local parameters and local data on each machine. This flexibility is similar to those of (Ma et al., 2017; Jaggi et al., 2014). For example, we may apply ProxSDCA as the local procedure. In the local step, each computer node can perform the optimization independently, without communicating with the others. In the global step, nodes communicate with each other to synchronize the local parameters and jointly update the global primal solution. Only this global step requires communication among nodes. We summarize our main contributions as follows:

New distributed dual formulation. This new formulation naturally leads to a two-step local-global dual alternating optimization procedure for distributed machine learning. We thus call the resulting procedure Distributed Alternating Dual Maximization (DADM). Note that DADM directly generalizes ProxSDCA, which can handle complex regularizations such as L2-L1 regularization.

New convergence analysis. The new formulation allows us to directly generalize the analysis of ProxSDCA in (Shalev-Shwartz and Zhang, 2014) to the distributed setting. This analysis is in contrast to that of CoCoA+ in (Ma et al., 2017), which employs a different approach based on the Θ-approximate solution assumption on the local solver. Our analysis can lead to simplified results in the commonly used mini-batch setup.

Acceleration with theoretical guarantees. Based on the new distributed dual formulation, we can naturally derive a distributed version of the accelerated proximal SDCA method (AccProxSDCA) of Shalev-Shwartz and Zhang (2014), which has been shown to be effective on a single machine. We call the resulting procedure Accelerated Distributed Alternating Dual Maximization (Acc-DADM).
The main idea is to modify the original formulation using a sequence of approximations that have stronger regularizations. Moreover, we directly adapt the theoretical analyses of AccProxSDCA to the distributed setting and provide guarantees for Acc-DADM. Our theorems guarantee that we can always obtain a computation speedup compared with the single-machine AccProxSDCA. These guarantees improve the theoretical results of DADM and previous methods (Yang, 2013; Ma et al., 2017), whose iteration complexities grow linearly on the condition number; the latter methods possibly fail to provide a computation time improvement over the single-machine ProxSDCA when the condition number is large.

Extensive empirical studies. We perform extensive experiments to compare the convergence and the scalability of the accelerated approach with those of previous state-of-the-art distributed dual coordinate ascent methods. Our empirical studies show that Acc-DADM can achieve faster convergence and better scalability than the previous state-of-the-art, in particular when the condition number is large. This phenomenon is consistent with our theory.

We organize the rest of the paper as follows. Section 2 discusses related work. Section 3 provides preliminary definitions. Sections 4 to 6 present the distributed primal formulation, the distributed dual formulation, and our DADM method, respectively. Section 7 then provides theorems for DADM. Section 8 introduces the accelerated version and provides the corresponding theoretical guarantees. Section 9 includes all proofs of this paper. Section 10 provides extensive empirical studies of our novel method. Finally, Section 11 concludes the whole paper.

2. Related Work

Several generalizations of SDCA to the distributed setting have been proposed in the literature, including DisDCA (Yang, 2013), CoCoA (Jaggi et al., 2014), and CoCoA+ (Ma et al., 2017). DisDCA was the first attempt to study distributed SDCA, and it provided a basic theoretical analysis and a practical variant that behaves well empirically. Nevertheless, their theoretical result only applies to a few specially chosen mini-batch local dual updates that differ from the practical method used in their experiments. In particular, they did not show that optimizing each local dual problem leads to convergence. This limitation makes the methods they analyzed inflexible. CoCoA was proposed to fix the above gap between theory and practice, and it was claimed to be a framework for distributed dual coordinate ascent in that it allows any local dual solver to be used for the local dual problem, rather than the impractical choices of DisDCA. However, the actual performance of CoCoA is inferior to the practical variant proposed in DisDCA with an aggressive local update. We note that the practical variant of DisDCA did not have a solid theoretical guarantee at that time. CoCoA+ fixed this situation and may be regarded as a generalization of CoCoA. The most effective choice of the aggregation parameter leads to a version which is similar to DisDCA, but allows exact optimization of each dual problem in their theory. According to the studies in (Ma et al., 2017), the resulting CoCoA+ algorithm performs significantly better than the original CoCoA both theoretically and empirically.
The original CoCoA+ (Ma et al., 2015) can only handle problems with the L2 regularizer, and it was generalized to general strongly convex regularizers in the long version (Ma et al., 2017). Besides, Smith et al. (2016) extended the framework to solve the primal problem of regularized loss minimization and cover general non-strongly convex regularizers such as the L1 regularizer, and Hsieh et al. (2015) studied parallel SDCA with asynchronous updates. Although CoCoA+ has the advantage of allowing arbitrary local solvers and flexible approximate solutions of the local dual problems, its theoretical analyses do not explicitly capture the contribution of the number of machines and the mini-batch size to the iteration complexity. Moreover, the iteration complexities of both CoCoA+ and DisDCA grow linearly with the condition number. Thus they probably cannot provide a computation time improvement over the single-machine SDCA when the condition number is large. This paper will remedy these unsatisfied aspects by providing a different analysis based on a new distributed dual formulation. Using this formulation, we can analyze procedures that can take an arbitrary local dual solver, which is like CoCoA+; moreover, we allow the dual updates to be a mini-batch, which is like DisDCA. Besides, this formulation also allows

us to naturally generalize AccProxSDCA and the relevant theoretical results to the distributed setting. Our empirical results also validate the superiority of the accelerated approach.

While we focus on extending SDCA in this paper, we note that there are other approaches for parallel optimization. For example, there are direct attempts to parallelize stochastic gradient descent (Niu et al., 2011; Zinkevich et al., 2010). Some of these procedures only consider the multi-core shared memory situation, which is very different from the distributed computing environment investigated in this paper. In the setting of distributed computing, data are partitioned onto multiple machines, and one often needs to study communication-efficient algorithms. In such cases, one extreme is to allow exact optimization of subproblems on each local machine, as considered in (Shamir et al., 2014; Zhang and Lin, 2015). Although this approach minimizes communication, the computation cost of each local solver can dominate the overall training. Therefore in practice, it is necessary to make a trade-off by using the mini-batch update approach (Takáč et al., 2013, 2015). However, it is difficult for traditional mini-batch methods to design reasonable aggregation strategies to achieve fast convergence. Takáč et al. (2015) studied how the step size can be reduced when the mini-batch size grows in the distributed setting. Lee and Roth (2015) derived an analytical solution of the optimal step size for dual linear support vector machine problems. Besides, Mahajan et al. (2013) presented a general framework for distributed optimization based on local functional approximation, which includes several first-order and second-order methods as special cases. Mahajan et al. (2017) considered each machine to handle a block of coordinates and proposed distributed block coordinate descent methods for solving L1 regularized loss minimization problems. Different from those methods, the Distributed Alternating Dual Maximization (DADM) proposed in this work handles the trade-off between computation and communication by developing bounds for mini-batch dual updates, which is similar to (Yang, 2013).
Moreover, DADM allows other, better local solvers to achieve faster convergence in practice.

3. Preliminaries

In this section, we introduce some notations used later. All functions that we consider in this paper are proper convex functions over a Euclidean space. Given a function f : ℝ^d → ℝ, we denote its conjugate function as

f*(b) = sup_a [ b^⊤ a − f(a) ].

A function f : ℝ^d → ℝ is L-Lipschitz with respect to ‖·‖ if for all a, b ∈ ℝ^d, we have

|f(a) − f(b)| ≤ L ‖a − b‖.

A function f : ℝ^d → ℝ is (1/γ)-smooth with respect to ‖·‖ if it is differentiable and its gradient is (1/γ)-Lipschitz with respect to ‖·‖. An equivalent definition is that for all a, b ∈ ℝ^d, we have

f(b) ≤ f(a) + ∇f(a)^⊤ (b − a) + (1/(2γ)) ‖b − a‖².

A function f : ℝ^d → ℝ is γ-strongly convex with respect to ‖·‖ if for any a, b ∈ ℝ^d, we have

f(b) ≥ f(a) + ∇f(a)^⊤ (b − a) + (γ/2) ‖b − a‖²,

where ∇f(a) is any subgradient of f at a. It is well known that a function f is γ-strongly convex with respect to ‖·‖ if and only if its conjugate function f* is (1/γ)-smooth with respect to the dual norm.

4. Distributed Primal Formulation

In this paper, we consider the following generic regularized loss minimization problem:

min_{w ∈ ℝ^d} P(w) := Σ_{i=1}^n φ_i(X_i^⊤ w) + λ n g(w) + λ h(w),   (1)

which is often encountered in practical machine learning problems. Here we assume each X_i ∈ ℝ^{d×q} is a d × q matrix, w ∈ ℝ^d is the model parameter vector, φ_i(u) is a convex loss function defined on ℝ^q, which is associated with the i-th data point, λ > 0 is the regularization parameter, g(w) is a strongly convex regularizer, and h(w) is another convex regularizer. A special case is to simply set h(w) = 0. Here we allow the more general formulation, which can be used to derive different distributed dual forms that may be useful for special purposes.

The above optimization formulation can be specialized to a variety of machine learning problems. As an example, we may consider the L2-L1 regularized least squares problem, where φ_i(x_i^⊤ w) = ½ (w^⊤ x_i − y_i)² for vector input data x_i ∈ ℝ^d and real-valued output y_i ∈ ℝ, g(w) = ½ ‖w‖² + a ‖w‖₁, and h(w) = b ‖w‖₁ for some a, b ≥ 0.

If we set h(w) = 0, then it is well known (see, for example, Shalev-Shwartz and Zhang, 2014) that the primal problem (1) has an equivalent single-machine dual form of

max_α D(α) := − Σ_{i=1}^n φ_i*(−α_i) − λ n g*( Σ_{i=1}^n X_i α_i / (λ n) ),   (2)

where α = [α₁, …, α_n], α_i ∈ ℝ^q (i = 1, …, n) are dual variables, φ_i* is the convex conjugate function of φ_i, and similarly, g* is the convex conjugate function of g. The stochastic dual coordinate ascent method, referred to as SDCA in (Shalev-Shwartz and Zhang, 2014), maximizes the dual formulation by optimizing one randomly chosen dual variable at each iteration. Throughout the algorithm, the following primal-dual relationship is maintained:

w(α) = ∇g*( Σ_{i=1}^n X_i α_i / (λ n) ),   (3)

for some subgradient ∇g*(v). It is known that w(α*) = w*, where w* and α* are optimal solutions of the primal problem and the dual problem, respectively. It was shown in (Shalev-Shwartz and Zhang, 2014) that the duality gap, defined as P(w(α)) − D(α), which is an upper bound of the primal sub-optimality P(w(α)) − P(w*), converges to zero.
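As a concrete sanity check of the dual (2), the relationship (3), and the vanishing duality gap, the following is a minimal single-machine SDCA sketch for ridge regression (squared loss φ_i(z) = ½(z − y_i)², g(w) = ½‖w‖², h = 0, q = 1); the data, sizes, and the closed-form coordinate update are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def primal(w):                       # P(w) = Σ ½(x_i·w − y_i)² + λn·½‖w‖²
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * n * 0.5 * w @ w

def dual(alpha):                     # D(α) = −Σ φ*(−α_i) − λn g*(Σ x_i α_i/(λn))
    w = X.T @ alpha / (lam * n)      # primal-dual relationship (3): w(α)
    return alpha @ y - 0.5 * alpha @ alpha - lam * n * 0.5 * w @ w

alpha = np.zeros(n)
v = np.zeros(d)                      # maintains v = Σ x_i α_i/(λn), so w(α) = v here
for epoch in range(30):
    for i in rng.permutation(n):
        # closed-form maximization of the dual over the single coordinate α_i
        delta = (y[i] - X[i] @ v - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        v += X[i] * delta / (lam * n)

gap = primal(v) - dual(alpha)        # non-negative by weak duality, shrinks to zero
```

After a few epochs the gap is tiny; by weak duality it also upper-bounds the primal sub-optimality P(w(α)) − P(w*).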
Moreover, a convergence rate can be established. In particular, for smooth loss functions, the convergence rate is linear. We note that SDCA is suitable for optimization on a single machine because it works with a dual formulation designed for a single machine. In the following, we will

generalize the single-machine dual formulation to the distributed setting, and study the corresponding distributed version of SDCA.

In the distributed setting, we assume that the training data are partitioned and distributed to m machines. In other words, the index set S = {1, …, n} of the training data is divided into m non-overlapping partitions, where each machine l ∈ {1, …, m} contains its own partition S_l ⊂ S. We assume that ∪_l S_l = S, and we use n_l := |S_l| to denote the size of the training data on machine l. Next, we can rewrite the primal problem (1) as the following constrained minimization problem that is suitable for the multi-machine distributed setting:

min_{w; {w_l}} Σ_{l=1}^m P_l(w_l) + λ h(w)
s.t. w_l = w, for all l ∈ {1, …, m},
where P_l(w_l) := Σ_{i∈S_l} φ_i(X_i^⊤ w_l) + λ n_l g(w_l),   (4)

where w_l represents the local primal variable on each machine l, P_l is the corresponding local primal objective, and the constraints w_l = w are imposed to synchronize the local primal variables. Obviously this multi-machine distributed primal formulation (4) is equivalent to the original primal problem (1). We note that the idea of objective splitting in (4) is similar to the global variable consensus formulation described in (Boyd et al., 2011). Instead of using the commonly used ADMM (Alternating Direction Method of Multipliers) method, which is not a generalization of (2), in this paper we derive a distributed dual formulation based on (4) that directly generalizes (2). We further propose a framework called Distributed Alternating Dual Maximization (DADM) to solve the distributed dual formulation. One advantage of DADM over ADMM is that DADM does not need to solve the subproblems to high accuracy, and thus it can naturally enjoy the trade-off between computation and communication, similar to related methods such as DisDCA, CoCoA and CoCoA+.

5. Distributed Dual Formulation

The optimization problem (4) can be further rewritten as:

min_{w; {w_l}; {u_i}} Σ_{l=1}^m [ Σ_{i∈S_l} φ_i(u_i) + λ n_l g(w_l) ] + λ h(w)
s.t. u_i = X_i^⊤ w_l, for all i ∈ S_l,
w_l = w, for all l ∈ {1, …, m}.   (5)
Here we introduce n dual variables α := {α_i}_{i=1}^n, where each α_i is the Lagrange multiplier for the constraint u_i − X_i^⊤ w_l = 0, and m dual variables β := {β_l}_{l=1}^m, where each β_l is the Lagrange multiplier for the constraint w_l − w = 0. We can now introduce the primal-dual

objective function with Lagrange multipliers as follows:

J(w; {w_l}; {u_i}; {α_i}; {β_l}) := Σ_{l=1}^m [ Σ_{i∈S_l} ( φ_i(u_i) + α_i^⊤ (u_i − X_i^⊤ w_l) ) + λ n_l g(w_l) + β_l^⊤ (w_l − w) ] + λ h(w).

Proposition 1 Define the dual objective as

D(α, β) := − Σ_{l=1}^m [ Σ_{i∈S_l} φ_i*(−α_i) + λ n_l g*( ( Σ_{i∈S_l} X_i α_i − β_l ) / (λ n_l) ) ] − λ h*( Σ_{l=1}^m β_l / λ ).

Then we have

D(α, β) = min_{w; {w_l}; {u_i}} J(w; {w_l}; {u_i}; {α_i}; {β_l}),

where the minimizers are achieved when the following equations are satisfied:

∇φ_i(u_i) + α_i = 0,
− Σ_{i∈S_l} X_i α_i + β_l + λ n_l ∇g(w_l) = 0,   (6)
− Σ_{l=1}^m β_l + λ ∇h(w) = 0,

for some subgradients ∇φ_i(u_i), ∇g(w_l), and ∇h(w).

When β = {β_l} are fixed, we may define the local single-machine dual formulation on each machine l with respect to α_l as

D_l(α_l | β_l) := − Σ_{i∈S_l} φ_i*(−α_i) − λ n_l g*( ( Σ_{i∈S_l} X_i α_i − β_l ) / (λ n_l) ),   (7)

where α_l represents the local dual variables {α_i; i ∈ S_l} on machine l, and β_l ∈ ℝ^d serves as a carrier for the synchronization of machine l. Based on Proposition 1, we obtain the following multi-machine distributed dual formulation for the corresponding primal problem (4):

D(α, β) = Σ_{l=1}^m D_l(α_l | β_l) − λ h*( Σ_{l=1}^m β_l / λ ).   (8)

Moreover, we have a non-negative duality gap, and zero duality gap can be achieved when w is the minimizer of P(w) and (α, β) maximizes the dual D(α, β).

Proposition 2 Given any (w, α, β), the following duality gap is non-negative:

P(w) − D(α, β) ≥ 0.

Moreover, zero duality gap is achieved at (w*, α*, β*), where w* is the minimizer of P(w) and (α*, β*) is a maximizer of D(α, β).

We note that the parameters {β_l}_{l=1}^m pass the global information across multiple machines. When β_l is fixed, D_l(α_l | β_l) with respect to α_l corresponds to the dual of the adjusted local primal problem:

P_l(w | β_l) := Σ_{i∈S_l} φ_i(X_i^⊤ w) + λ n_l g̃_l(w),   (9)

where the original regularizer λ n_l g(w) in P_l(w) is replaced by the adjusted regularizer

λ n_l g̃_l(w) := λ n_l g(w) + β_l^⊤ w.

Similar to the single-machine primal-dual relationship (3), we have the following local primal-dual relationship on each machine l:

w_l(α_l, β_l) = ∇g̃_l*(v_l) = ∇g*(ṽ_l),   (10)

where

v_l = Σ_{i∈S_l} X_i α_i / (λ n_l),   ṽ_l = v_l − β_l / (λ n_l).

Moreover, we can define the global primal-dual relationship as

w(α, β) = ∇g̃*(v) = ∇g*(ṽ),   (11)

where

v = Σ_{i=1}^n X_i α_i / (λ n),   ṽ = v − Σ_{l=1}^m β_l / (λ n).

We can also establish the relationship of global-local duality in Proposition 3.

Proposition 3 Given (w, α, β) and {w_l} such that w_1 = ⋯ = w_m = w, we have the following decomposition of the global duality gap as the sum of local duality gaps:

P(w) − D(α, β) ≥ Σ_{l=1}^m [ P_l(w_l | β_l) − D_l(α_l | β_l) ],

and the equality holds when λ ∇h(w) = Σ_l β_l for some subgradient ∇h(w).

Although we allow arbitrary h(w), the case of h(w) = 0 is of special interest. This corresponds to the conjugate function

h*(β) = +∞ if β ≠ 0; h*(β) = 0 if β = 0.

That is, the term −λ h*( Σ_{l=1}^m β_l / λ ) is equivalent to imposing the constraint Σ_{l=1}^m β_l = 0.
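Proposition 3 (together with the β_l choice of Proposition 5 below) can be checked numerically. The sketch uses a toy ridge setup (g(w) = ½‖w‖², h = 0, so Σ_l β_l = 0 and w = ∇g*(ṽ) = ṽ) with made-up data, and verifies that the global duality gap equals the sum of the local gaps; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_per, d, lam = 3, 10, 4, 0.2
n = m * n_per
Xs = [rng.normal(size=(n_per, d)) for _ in range(m)]
ys = [rng.normal(size=n_per) for _ in range(m)]
alphas = [rng.normal(size=n_per) for _ in range(m)]    # an arbitrary dual point

v_loc = [Xl.T @ a / (lam * n_per) for Xl, a in zip(Xs, alphas)]
v = sum((n_per / n) * vl for vl in v_loc)              # global v(α)
betas = [lam * n_per * (vl - v) for vl in v_loc]       # Proposition 5's choice
w = v                                                  # h = 0, g = ½‖·‖² ⇒ w = ∇g*(v) = v

def P_l(Xl, yl, beta, wv):                             # adjusted local primal (9)
    return 0.5 * np.sum((Xl @ wv - yl) ** 2) + lam * n_per * 0.5 * wv @ wv + beta @ wv

def D_l(Xl, yl, a, beta):                              # local dual (7)
    vt = (Xl.T @ a - beta) / (lam * n_per)             # ṽ_l = v_l − β_l/(λ n_l)
    return a @ yl - 0.5 * a @ a - lam * n_per * 0.5 * vt @ vt

P_glob = sum(0.5 * np.sum((Xl @ w - yl) ** 2) for Xl, yl in zip(Xs, ys)) \
         + lam * n * 0.5 * w @ w
D_glob = sum(D_l(Xl, yl, a, b) for Xl, yl, a, b in zip(Xs, ys, alphas, betas))
local_gap_sum = sum(P_l(Xl, yl, b, w) - D_l(Xl, yl, a, b)
                    for Xl, yl, a, b in zip(Xs, ys, alphas, betas))
```

With these β_l the constraint Σ_l β_l = 0 holds and the decomposition is exact, matching the equality case of Proposition 3.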

Algorithm 1 Local Dual Update (machine l)
  Retrieve local parameters α_l^{t−1}, ṽ_l^{t−1}
  Randomly pick a mini-batch Q ⊂ S_l
  Approximately maximize (12) w.r.t. Δα_Q
  Update α_l^t as α_i^t = α_i^{t−1} + Δα_i for all i ∈ Q
  return Δv_l^t = Σ_{i∈Q} X_i Δα_i / (λ n_l)

6. Distributed Alternating Dual Maximization

Minimizing the primal formulation (4) is equivalent to maximizing the dual formulation (8), and the latter can be achieved by repeatedly applying the following alternating optimization strategy, which we refer to as Distributed Alternating Dual Maximization (DADM):

Local step: fix β and let each machine l approximately optimize D_l(α_l | β_l) w.r.t. α_l in parallel.

Global step: maximize the global dual objective w.r.t. β, and set the global primal parameter w accordingly.

The above steps are applied in iterations t = 1, 2, …, T. At the beginning of each iteration t, we assume that the local primal and dual variables on each local machine l are α_l^{t−1}, β_l^{t−1}, v_l^{t−1}; then we seek to update α_l^{t−1} to α_l^t and v_l^{t−1} to v_l^t in the local step, and seek to update β_l^{t−1} to β_l^t in the global step. We note that the local step can be executed in parallel w.r.t. the dual variables {α_l}_{l=1}^m. In practice, it is often useful to optimize (7) approximately by using a randomly selected mini-batch Q ⊂ S_l of size |Q| = M_l. That is, we want to find Δα_i^t with i ∈ Q to approximately maximize the local dual objective as follows:

D_Q^t(Δα_Q) := − Σ_{i∈Q} φ_i*( −(α_i^{t−1} + Δα_i) ) − λ n_l g*( ṽ_l^{t−1} + Σ_{i∈Q} X_i Δα_i / (λ n_l) ).   (12)

This step is described in Algorithm 1. We can use any solver for this approximate optimization; in our experiments, we choose ProxSDCA. The global step is to synchronize all local solutions, which requires communication among the machines. This is achieved by optimizing the following dual objective with respect to all β = {β_l}:

β^t ∈ arg max_β D(α^t, β).   (13)

Proposition 4 Given v, let w(v) be the unique solution of the following optimization problem

w(v) = arg min_w [ −λ n w^⊤ v + λ n g(w) + λ h(w) ]   (14)

that satisfies

λ n ∇g(w) + λ ∇h(w) = λ n v

for some subgradients ∇g(w) and ∇h(w) = ρ at w = w(v). Then β(v) = λ ρ is a solution of

max_b [ −λ n g*( v − b/(λ n) ) − λ h*( b/λ ) ],

and

w(v) = ∇g*( v − β(v)/(λ n) ).

Proposition 5 Given α, a solution of

max_β D(α, β)

can be obtained by setting

β_l = λ n_l ( v_l(α_l) − v(α) ) + (n_l / n) β(v(α)),

where β(v(α)) is defined in Proposition 4, and

v(α) = Σ_{i=1}^n X_i α_i / (λ n),   v_l(α_l) = Σ_{i∈S_l} X_i α_i / (λ n_l).

Moreover, if we let

w = w(α, β) = w(v(α)) = ∇g*( v(α) − β(v(α))/(λ n) ),

where w(v) is defined in Proposition 4, and

w_l = w_l(α_l, β_l) = ∇g*( v_l − β_l/(λ n_l) ),

then w_l = w for all l, and

P(w) − D(α, β) = Σ_{l=1}^m [ P_l(w_l | β_l) − D_l(α_l | β_l) ].

According to Proposition 5, the solution of (13) is given by

β_l^t = λ n_l ( v_l^t − v^t ) + (n_l / n) λ ρ^t,

where

v^t = Σ_{l=1}^m (n_l / n) v_l^t = v^{t−1} + Σ_{l=1}^m (n_l / n) Δv_l^t,

Algorithm 2 Distributed Alternating Dual Maximization (DADM)
  Input: objective P(w), target duality gap ε, warm-start variables w_init, α_init, β_init, v_init; if not specified, set w_init = 0, α_init = 0, β_init = 0, v_init = 0.
  Initialize: let w^0 = w_init, α^0 = α_init, β^0 = β_init, v^0 = v_init.
  for t = 1, 2, … do
    (Local step)
    for all machines l = 1, 2, …, m in parallel do
      call an arbitrary local procedure, such as Algorithm 1
    end for
    (Global step)
    Aggregate v^t = v^{t−1} + Σ_{l=1}^m (n_l/n) Δv_l^t
    Compute ṽ^t according to (15)
    Let Δṽ^t = ṽ^t − ṽ^{t−1}
    for all machines l = 1, 2, …, m in parallel do
      update the local parameter ṽ_l^t = ṽ_l^{t−1} + Δṽ^t
    end for
    Stopping condition: stop if P(w^t) − D(α^t, β^t) ≤ ε.
  end for
  return w^t = ∇g*(ṽ^t), α^t, β^t, v^t, and the duality gap P(w^t) − D(α^t, β^t).

and ρ^t = ∇h(w^t) is a subgradient of h at the solution w^t of

w^t = arg min_w [ −λ n w^⊤ v^t + λ n g(w) + λ h(w) ],

which achieves the first-order optimality condition

−λ n v^t + λ n ∇g(w^t) + λ ρ^t = 0

for some subgradient ∇g(w^t). The definition of ṽ implies that after each global update, we have

ṽ_l^t = ṽ^t = v^t − ρ^t / n = ∇g(w^t), for all l = 1, …, m.   (15)

Since the objective (12) for the local step on each machine l only depends on the mini-batch Q sampled from S_l and the vector ṽ_l^t, which needs to be synchronized at each global step, we know from (15) that at each time t, we can pass the same vector ṽ^t as ṽ_l^t to all nodes. In practice, it may be beneficial to pass Δṽ^t instead, especially when Δṽ^t is sparse but ṽ^t is dense. Putting things together, the local-global DADM iterations can be summarized in Algorithm 2. If we consider the special case of h(w) = 0, the solution of (15) is simply ṽ_l^t = ṽ^t = v^t, and the global step in Algorithm 2 can be simplified as first aggregating updates by

Δṽ^t = Δv^t = Σ_{l=1}^m (n_l / n) Δv_l^t,

and then updating the local parameters in parallel. Further, if h(w) = 0 and the data partition is balanced, that is, the n_l are identical for all l = 1, …, m, it can be verified that the DADM procedure (ignoring the mini-batch variation) is equivalent to CoCoA+. Therefore the framework presented here may be regarded as an alternative interpretation.

Moreover, when the added regularization in (1) is complex and might involve more than one non-smooth term, considering the splitting of g(w) and h(w) can bring computational advantages. For example, to promote both sparsity and group sparsity in the predictor we often use the sparse group lasso regularization (Friedman et al., 2010), where a combination of the L1 norm and the mixed L2/L1 (group sparse) norm is introduced:

λ₁ Σ_G ‖w_G‖₂ + λ₂ ‖w‖₁ + (λ/2) ‖w‖²,

where we add a slight L2 regularization to make it strongly convex, as was done in (Shalev-Shwartz and Zhang, 2014). The proximal mapping with respect to the sparse group lasso regularization function does not have a closed-form solution, and thus often relies on iterative minimization steps; but there are closed-form proximal mappings with respect to either the L2-L1 norm or the group norm. Thus if we simply set h(w) = 0 and let g(w) carry the whole regularizer, then neither the local optimization update (12) nor the global synchronization step (14) has a closed-form solution. However, if we assign the group norm to h(w), so that h(w) carries λ₁ Σ_G ‖w_G‖₂ and g(w) carries the remaining L2-L1 part, the local update steps (12) enjoy closed-form updates, which makes the implementation much easier, and we only need to use iterative minimization in the rare global synchronization step (14).

7. Convergence Analysis

Let w* be the optimal solution of the primal problem P(w), and let (α*, β*) be the optimal solution of the dual problem D(α, β), respectively. For the primal solution w^t and the dual solution (α^t, β^t) at iteration t, we define the primal sub-optimality as

ε_P^t := P(w^t) − P(w*),

and the dual sub-optimality as

ε_D^t := D(α*, β*) − D(α^t, β^t).

Due to the close relationship between the distributed dual formulation and the single-machine dual formulation, an analysis of DADM can be obtained by directly generalizing that of SDCA.
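The local step (Algorithm 1) and the simplified global step of Algorithm 2 for the case h(w) = 0 can be sketched end-to-end on a toy distributed ridge problem (g(w) = ½‖w‖², squared loss, balanced partitions); the data, λ, batch sizes, and the closed-form local coordinate update are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_l, d, lam = 4, 25, 5, 0.5
n = m * n_l
Xs = [rng.normal(size=(n_l, d)) for _ in range(m)]
ys = [rng.normal(size=n_l) for _ in range(m)]

def local_step(Xl, yl, alpha_l, v_tilde, batch):
    """Algorithm 1: mini-batch coordinate maximization of the local dual (12)."""
    dv = np.zeros(d)
    for i in batch:
        w = v_tilde + dv                          # local primal estimate ∇g*(ṽ_l) = ṽ_l
        delta = (yl[i] - Xl[i] @ w - alpha_l[i]) / (1 + Xl[i] @ Xl[i] / (lam * n_l))
        alpha_l[i] += delta
        dv += Xl[i] * delta / (lam * n_l)         # Δv_l = Σ_{i∈Q} x_i Δα_i / (λ n_l)
    return dv

alphas = [np.zeros(n_l) for _ in range(m)]
v = np.zeros(d)                                   # with h = 0: ṽ = v and w = v
for t in range(200):                              # DADM iterations
    batches = [rng.choice(n_l, size=5, replace=False) for _ in range(m)]
    dvs = [local_step(Xs[l], ys[l], alphas[l], v, batches[l]) for l in range(m)]
    v = v + sum((n_l / n) * dv for dv in dvs)     # global aggregation: v += Σ (n_l/n) Δv_l

primal = sum(0.5 * np.sum((Xl @ v - yl) ** 2) for Xl, yl in zip(Xs, ys)) \
         + lam * n * 0.5 * v @ v
dual = sum(a @ yl - 0.5 * a @ a for a, yl in zip(alphas, ys)) - lam * n * 0.5 * v @ v
gap = primal - dual
```

Only the global aggregation line would require communication in a real deployment; the local steps touch only each machine's own shard.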
We consider two kinds of loss functions: smooth loss functions, which imply fast linear convergence, and general L-Lipschitz loss functions. For the following two theorems, we always assume that g is 1-strongly convex w.r.t. ‖·‖, ‖X_i‖ ≤ R for all i, M_l = |Q| is fixed on each machine l, and our local procedure optimizes D_Q^t sufficiently well on each machine such that D_Q^t(Δα_Q) ≥ D_Q^t(Δα_Q*), where Δα_Q* is given by a special choice in each theorem.

Theorem 6 Assume that each φ_i is (1/γ)-smooth w.r.t. ‖·‖ and Δα_Q* is given by

Δα_i := s_l ( u_i^{t−1} − α_i^{t−1} ), for all i ∈ Q,

where u_i^{t−1} := −∇φ_i(X_i^⊤ w^{t−1}) and s_l := λγn_l / (λγn_l + M_l R²) ∈ [0, 1]. To reach an expected duality gap of E[P(w^T) − D(α^T, β^T)] ≤ ε, every T satisfying the following condition is sufficient:

T ≥ ( R²/(λγ) + max_l n_l/M_l ) log( ( R²/(λγ) + max_l n_l/M_l ) · ε_D^{(0)} / ε ).   (16)

Theorem 7 Assume that each φ_i is L-Lipschitz w.r.t. ‖·‖, and Δα_Q* is given by

Δα_i := (q n_l / M_l) ( u_i^{t−1} − α_i^{t−1} ), for all i ∈ Q,

where u_i^{t−1} ∈ −∂φ_i(X_i^⊤ w^{t−1}) and q ∈ [0, min_l M_l/n_l]. To reach an expected normalized duality gap of E[ (P(w̄) − D(ᾱ, β̄)) / n ] ≤ ε, every T satisfying the following condition is sufficient:

T ≥ max{ 0, ñ log( λ ñ ε_D^{(0)} / (n G²) ) } + ñ + 5G²/(λ ε),   (17)

where T₀ ≥ max{ t₀, 4G²/(λ ε) − ñ + t₀ }, t₀ = max{ 0, ñ log( λ ñ ε_D^{(0)} / (n G²) ) }, ñ = max_l n_l/M_l, G² = 4R²L², and w̄, ᾱ, β̄ represent either the average vectors or randomly chosen vectors of w^{t−1}, α^{t−1}, β^{t−1} over t ∈ {T₀+1, …, T}, respectively, such as

ᾱ = (1/(T − T₀)) Σ_{t=T₀+1}^{T} α^{t−1},  β̄ = (1/(T − T₀)) Σ_{t=T₀+1}^{T} β^{t−1},  w̄ = (1/(T − T₀)) Σ_{t=T₀+1}^{T} w^{t−1}.

Remark 8 Both Theorem 6 and Theorem 7 incorporate two key components: the term max_l n_l/M_l and the condition-number term (R²/(λγ) or G²/(λε)). When the term max_l n_l/M_l dominates the iteration complexity, we can speed up convergence and reduce the number of communications by increasing the number of machines m or the local mini-batch size M_l. However, in some circumstances, when the condition number is large, it becomes the leading factor, and increasing m or M_l will not contribute to the computation speedup. To tackle this problem, we develop the accelerated version of DADM in Section 8.

Remark 9 Our method is closely related to previous distributed extensions of SDCA. Theorems 6 and 7, which provide theoretical guarantees for more general local updates, achieve the same iteration complexity as the ones in DisDCA, which only allows some special choices of local mini-batch updates. Compared with the theoretical results of CoCoA+, which are based on the Θ-approximate solution of the local dual subproblem, although the derived bounds are within the same scale, Õ(1/ε) for Lipschitz losses and Õ(log(1/ε)) for smooth losses, our bounds are different and complementary. The analysis of CoCoA+ can provide better insights for more accurate solutions of the local sub-problems.
While our analysis is based on the mini-batch setup, it can capture the contributions of the mini-batch size and the number of machines more explicitly.

Remark 10 Since the bounds are derived with a special choice of Δα_Q*, the actual performance of the algorithm can be significantly better than what is indicated by the bounds when the local duals are better optimized. For example, we can choose ProxSDCA in (Shalev-Shwartz and Zhang, 2014) as the local procedure and adopt the sequential update strategy, as the local solver of CoCoA+ does. This is also the one used in our experiments.
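To make Remarks 8-10 concrete, the toy calculation below plugs purely illustrative (made-up) values into a Theorem 6-style bound T ≈ κ_eff · log(κ_eff · ε_D⁰/ε) with κ_eff = R²/(λγ) + max_l n_l/M_l, and compares the well-conditioned and ill-conditioned regimes.

```python
import math

def dadm_iteration_bound(cond, n_over_M, eps_ratio):
    """Theorem 6-style bound T ≥ κ_eff·log(κ_eff·ε_D⁰/ε),
    with κ_eff = R²/(λγ) + max_l n_l/M_l (illustrative numbers only)."""
    k_eff = cond + n_over_M
    return k_eff * math.log(k_eff * eps_ratio)

# Well-conditioned: the mini-batch term dominates, so doubling the machines
# (halving max_l n_l/M_l) roughly halves the number of rounds.
easy_1x = dadm_iteration_bound(cond=10, n_over_M=1000, eps_ratio=1e6)
easy_2x = dadm_iteration_bound(cond=10, n_over_M=500, eps_ratio=1e6)

# Ill-conditioned: R²/(λγ) dominates and more machines barely help,
# which is the regime Acc-DADM (Section 8) is designed for.
hard_1x = dadm_iteration_bound(cond=1e5, n_over_M=1000, eps_ratio=1e6)
hard_2x = dadm_iteration_bound(cond=1e5, n_over_M=500, eps_ratio=1e6)
```

The ratio easy_2x/easy_1x is close to ½, while hard_2x/hard_1x stays close to 1, mirroring the dichotomy described in Remark 8.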

Algorithm 3 Accelerated Distributed Alternating Dual Maximization (Acc-DADM)
  Parameters: κ, η = λ/(λ + κ), ν = (1 − √η)/(1 + √η).
  Initialize: v^0 = y^0 = w^0 = 0, α^0 = 0, ξ^0 = (1 + η)(P(w^0) − D(α^0, β^0)).
  for t = 1, 2, …, T_outer do
    1. Construct the new objective:
       P_t(w) = Σ_{i=1}^n φ_i(X_i^⊤ w) + λ n g(w) + λ h(w) + (κ n / 2) ‖w − y^{t−1}‖².
    2. Call the DADM solver:
       (w^t, α^t, β^t, v^t, ε^t) = DADM(P_t, η ξ^{t−1}/(2(1 + η)), w^{t−1}, α^{t−1}, β^{t−1}, v^{t−1}).
    3. Update: y^t = w^t + ν (w^t − w^{t−1}).
    4. Update: ξ^t = (1 − η/2) ξ^{t−1}.
  end for
  Return w^{T_outer}.

8. Acceleration

Theorems 6 and 7 all imply that when the condition number (R²/(λγ) or L²R²/(λε)) is relatively small, DADM converges fast. However, the convergence may be slow when the condition number is large and dominates the iteration complexity. In fact, we observe empirically that the basic DADM method converges slowly when the regularization parameter λ is small. This phenomenon is also consistent with that of SDCA in the single-machine case. In this section, we introduce the Accelerated Distributed Alternating Dual Maximization (Acc-DADM) method that can alleviate the problem.

The procedure is motivated by (Shalev-Shwartz and Zhang, 2014), which employs an inner-outer iteration: at every iteration t, we solve a slightly modified objective, which adds a regularization term centered around the vector

y^{t−1} = w^{t−1} + ν ( w^{t−1} − w^{t−2} ),   (18)

where ν ∈ [0, 1] is called the momentum parameter. The accelerated DADM procedure described in Algorithm 3 can be similarly viewed as an inner-outer algorithm, where DADM serves as the inner iteration. In the outer iteration, we adjust the regularization vector y^{t−1}. That is, at each outer iteration t, we define a modified local primal objective on each machine l, which has the same form as the original local primal objective (9), except that g̃_l(w) is modified to g̃_l^t(w), defined by

λ n_l g̃_l^t(w) = λ n_l g_t(w) + β_l^⊤ w,   g_t(w) = g(w) + (κ/(2λ)) ‖w − y^{t−1}‖².
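The inner-outer structure of Algorithm 3 can be illustrated on a toy ridge problem in which the inner DADM stage is replaced by an exact solve of the κ-regularized objective P_t (a deliberate simplification); the data and constants are made up, and the momentum constant follows the ν = (1 − √η)/(1 + √η) choice above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam, kappa = 40, 6, 0.01, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
# reference minimizer of P(w) = ½‖Xw − y‖² + (λn/2)‖w‖²
w_star = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

eta = lam / (lam + kappa)
nu = (1 - np.sqrt(eta)) / (1 + np.sqrt(eta))      # momentum parameter

A = X.T @ X + (lam + kappa) * n * np.eye(d)       # Hessian of the modified P_t
w = np.zeros(d)
y_c = np.zeros(d)                                  # the center y^{t-1}
for t in range(200):
    # inner stage: minimize P_t(w) = P(w) + (κn/2)‖w − y^{t-1}‖² (here: exactly)
    w_prev, w = w, np.linalg.solve(A, X.T @ y + kappa * n * y_c)
    y_c = w + nu * (w - w_prev)                    # outer momentum step (18)
err = float(np.linalg.norm(w - w_star))
```

Setting ν = 0 turns the same loop into a plain proximal-point scheme; the momentum step is what recovers the accelerated dependence on √((λ+κ)/λ) reflected in Theorem 11.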

It follows that we will need to solve a modified dual at each local step, with g replaced by g_t in the local dual problem (12). Therefore, compared to the basic DADM procedure, nothing changes other than g being replaced by g_t at each iteration. Specifically, when the number of machines m equals 1, this algorithm reduces to AccProxSDCA described in (Shalev-Shwartz and Zhang, 2014). Thus Acc-DADM can be naturally regarded as the distributed generalization of the single-machine AccProxSDCA. Moreover, Acc-DADM also allows arbitrary local procedures, as DADM does.

Our empirical studies show that Acc-DADM significantly outperforms DADM in many cases. There are probably two reasons. One reason is the use of a modified regularizer g_t(w) that is more strongly convex than the original regularizer g(w) when κ is much larger than λ. The other reason is closely related to the distributed setting considered in this paper. Observe that in the modified local primal objective

P_l^t(w | β_l) := P_l(w | β_l) + (κ n_l / 2) ‖w − y^{t−1}‖²,

the first term corresponds to the original local primal objective and the second term is an extra regularization due to acceleration that constrains w to be close to y^{t−1}. The effect is that different local problems become more similar to each other, which stabilizes the overall system.

8.1 Theoretical Results of Acc-DADM for Smooth Losses

The following theorem establishes the computational efficiency guarantees for Acc-DADM.

Theorem 11 Assume that each φ_i is (1/γ)-smooth, g is 1-strongly convex w.r.t. ‖·‖, ‖X_i‖ ≤ R for all i, and M_l = |Q| is fixed on each machine l. To obtain expected ε primal sub-optimality, E[P(w^t)] − P(w*) ≤ ε, it is sufficient to have the following number of stages in Algorithm 3:

T_outer ≥ ( 1 + √((λ + κ)/λ) ) log( 4 (λ + κ) ( P(w^0) − D(α^0, β^0) ) / (λ ε) ),

and the following number of inner iterations in DADM at each stage:

T_inner ≥ ( R²/((λ + κ)γ) + max_l n_l/M_l ) log( ( R²/((λ + κ)γ) + max_l n_l/M_l ) · (λ + κ)/λ ).

In particular, suppose we assume n₁ = n₂ = … = n_m and M₁ = M₂ = … = M_m = b; then the total vector computations for each machine are bounded by

Õ( T_outer T_inner b ) = Õ( √(1 + κ/λ) ( R²/((λ + κ)γ) + n/(m b) ) b ).

Remark 12 When κ = 0, the guarantees reduce to those of DADM. However, DADM only enjoys a linear speedup over ProxSDCA when the number of machines satisfies m ≤ n γ λ / R²,
However, DADM only enjoys a linear speedup over ProxSDCA when the number of machines satisfies $m \lesssim n\lambda\gamma/R^2$,

and it can only obtain a sub-linear speedup when $R^2/(\lambda\gamma) = \Omega(n/m)$. Besides enjoying the properties of DADM described above, if we choose $\kappa$ in Algorithm 3 as
$$\kappa = \frac{mR^2}{\gamma n}$$
and $b = 1$, then the total number of vector computations on each machine is bounded by
$$\tilde{O}\Big(\sqrt{\frac{mR^2}{\lambda\gamma n}}\cdot\frac{n}{m}\Big) = \tilde{O}\Big(\sqrt{\frac{R^2\,n}{\lambda\gamma\,m}}\Big),$$
which means Acc-DADM can be much faster than DADM when the condition number is large, and it always obtains a reduction in computation over the single-machine AccProxSDCA by a factor of $\tilde{O}(\sqrt{1/m})$.

8.2 Acceleration for Non-smooth, Lipschitz Losses

Theorem 11 establishes the rate of convergence for smooth loss functions, but the acceleration framework can also be used on non-smooth, Lipschitz loss functions. The main idea is to use Nesterov's smoothing technique (Nesterov, 2005) to construct a smooth approximation of the non-smooth function $\phi_i$, by adding a strongly convex regularization term to the conjugate of $\phi_i$:
$$\tilde{\phi}_i^*(\alpha) := \phi_i^*(\alpha) + \frac{\gamma}{2}\,\alpha^2.$$
By the properties of conjugate functions (e.g., Lemma 2 in Shalev-Shwartz and Zhang, 2014), we know that $\tilde{\phi}_i$, the conjugate function of $\tilde{\phi}_i^*$, is $1/\gamma$-smooth, and
$$\tilde{\phi}_i(u) \le \phi_i(u) \le \tilde{\phi}_i(u) + \frac{\gamma L^2}{2}.$$
Then, instead of the original objective with non-smooth losses, we minimize the smoothed objective:
$$\min_{w\in\mathbb{R}^d}\ \hat{P}(w) := \sum_{i=1}^n \tilde{\phi}_i(X_i^\top w) + n\lambda\,g(w) + h(w). \qquad (19)$$
The following corollary establishes the computational efficiency guarantees of Acc-DADM for non-smooth, Lipschitz loss functions.

Corollary 13 Assume that each $\phi_i$ is $L$-Lipschitz, $g$ is 1-strongly convex w.r.t. $\|\cdot\|$, $\|X_i\| \le R$ for all $i$, and $M_l = |Q_l|$ is fixed on each machine. To obtain an expected $\epsilon$ normalized primal sub-optimality,
$$\mathbb{E}\Big[\frac{P(w^{(t)})}{n}\Big] - \frac{P(w^*)}{n} \le \epsilon,$$
it is sufficient to run Algorithm 3 on the smoothed objective (19) with
$$\gamma = \frac{\epsilon}{L^2}$$
and the following number of stages
$$T_{\text{outer}} \ge \Big(1+\frac{2}{\sqrt{\eta}}\Big)\log\frac{4\,\xi_0}{\epsilon} = \tilde{O}\Big(\sqrt{\frac{\lambda+\kappa}{\lambda}}\,\Big(\log\frac{\lambda+\kappa}{\lambda} + \log\frac{P(0)-D(0,0)}{\epsilon}\Big)\Big),$$

and the following number of inner DADM iterations at each stage:
$$T_{\text{inner}} \ge \Big(\frac{L^2R^2}{\epsilon(\lambda+\kappa)} + \max_l\frac{n_l}{M_l}\Big)\,\log\Big(\Big(\frac{L^2R^2}{\epsilon(\lambda+\kappa)} + \max_l\frac{n_l}{M_l}\Big)\cdot\frac{\kappa(\lambda+\kappa)}{\lambda^2}\Big).$$
In particular, if we assume $n_1 = n_2 = \cdots = n_m$ and $M_1 = M_2 = \cdots = M_m = b$, then the total number of vector computations on each machine is bounded by
$$\tilde{O}(T_{\text{outer}}\,T_{\text{inner}}\,b) = \tilde{O}\Big(\sqrt{1+\frac{\kappa}{\lambda}}\,\Big(\frac{L^2R^2}{\epsilon(\lambda+\kappa)} + \frac{n}{mb}\Big)\,b\Big).$$

Remark 14 When $\kappa = 0$, the guarantees reduce to those of DADM for Lipschitz losses. Moreover, when $L^2R^2m/(n\epsilon) \gtrsim \lambda$, if we choose $\kappa$ in Algorithm 3 as
$$\kappa = \frac{mL^2R^2}{n\epsilon}$$
and $b = 1$, then the total number of vector computations on each machine is bounded by
$$\tilde{O}\Big(\sqrt{\frac{mL^2R^2}{\lambda n\epsilon}}\cdot\frac{n}{m}\Big) = \tilde{O}\Big(\sqrt{\frac{L^2R^2\,n}{\lambda\,\epsilon\,m}}\Big),$$
which means Acc-DADM can be much faster than DADM when $\epsilon$ is small, and it always obtains a reduction in computation over the single-machine AccProxSDCA by a factor of $\tilde{O}(\sqrt{1/m})$.

9. Proofs

In this section, we first present the proofs of the earlier propositions, in order to establish our framework solidly. Then, based on our new distributed dual formulation, we directly generalize the analysis of SDCA and adapt it to DADM in the commonly used mini-batch setup. Finally, we describe the proof of the theoretical guarantees of Acc-DADM.

9.1 Proof of Proposition 1

Proof Given any set of parameters $(w; \{w_l\}; \{u_i\}; \{\alpha_i\}; \{\beta_l\})$, we have
$$\min_{w;\{w_l\};\{u_i\}} J(w;\{w_l\};\{u_i\};\{\alpha_i\};\{\beta_l\}) = \min_{w;\{w_l\}}\ \underbrace{\sum_{l=1}^m\Big[\min_{u_i}\sum_{i\in S_l}\big(\phi_i(u_i) + \alpha_i\,(u_i - X_i^\top w_l)\big) + n_l\lambda\,g(w_l) + \beta_l^\top(w_l - w)\Big] + h(w)}_{A},$$

where the minimum is achieved at $\{u_i\}$ such that $\phi_i'(u_i) + \alpha_i = 0$. Eliminating $u_i$, we obtain
$$A = \min_{w;\{w_l\}}\ \sum_{l=1}^m\Big[\sum_{i\in S_l}\big(-\phi_i^*(-\alpha_i) - \alpha_i\,X_i^\top w_l\big) + n_l\lambda\,g(w_l) + \beta_l^\top(w_l - w)\Big] + h(w)$$
$$= \min_{w}\ \underbrace{\sum_{l=1}^m\min_{w_l}\Big[-\sum_{i\in S_l}\phi_i^*(-\alpha_i) - \Big(\sum_{i\in S_l}X_i\alpha_i - \beta_l\Big)^{\!\top}w_l + n_l\lambda\,g(w_l)\Big] - \Big(\sum_{l=1}^m\beta_l\Big)^{\!\top}w + h(w)}_{B},$$
where the minimum is achieved at $\{w_l\}$ such that $-\sum_{i\in S_l}X_i\alpha_i + \beta_l + n_l\lambda\,\nabla g(w_l) = 0$. Eliminating $w_l$, we obtain
$$B = \min_w\ \sum_{l=1}^m\Big[-\sum_{i\in S_l}\phi_i^*(-\alpha_i) - n_l\lambda\,g^*\Big(\frac{\sum_{i\in S_l}X_i\alpha_i - \beta_l}{n_l\lambda}\Big)\Big] - \Big(\sum_{l=1}^m\beta_l\Big)^{\!\top}w + h(w)$$
$$= \underbrace{\sum_{l=1}^m\Big[-\sum_{i\in S_l}\phi_i^*(-\alpha_i) - n_l\lambda\,g^*\Big(\frac{\sum_{i\in S_l}X_i\alpha_i - \beta_l}{n_l\lambda}\Big)\Big] - h^*\Big(\sum_{l=1}^m\beta_l\Big)}_{D(\alpha,\beta)},$$
where the minimizer is achieved at $w$ such that $-\sum_l\beta_l + \nabla h(w) = 0$. This completes the proof.

9.2 Proof of Proposition 2

Proof Given any $w$, if we take $u_i = X_i^\top w$ and $w_l = w$ for all $i$ and $l$, then $P(w) = J(w;\{w_l\};\{u_i\};\{\alpha_i\};\{\beta_l\})$ for arbitrary $\{\alpha_i\};\{\beta_l\}$. It follows from Proposition 1 that
$$P(w) = J(w;\{w_l\};\{u_i\};\{\alpha_i\};\{\beta_l\}) \ge D(\alpha,\beta).$$
Let $w^*$ be the minimizer of $P(w)$. When $w = w^*$, we may set $u_i = u_i^* = X_i^\top w^*$ and $w_l = w_l^* = w^*$. From the first-order optimality condition, we obtain
$$\sum_i X_i\,\phi_i'(X_i^\top w^*) + n\lambda\,\nabla g(w^*) + \nabla h(w^*) = 0.$$
If we take $\alpha_i^* = -\phi_i'(u_i^*)$ and $\beta_l^* = \sum_{i\in S_l}X_i\alpha_i^* - n_l\lambda\,\nabla g(w^*)$ for some subgradients, then it is not difficult to check that all equations in (6) are satisfied. It follows that we can achieve equality in Proposition 1:
$$P(w^*) = J(w^*;\{w_l^*\};\{u_i^*\};\{\alpha_i^*\};\{\beta_l^*\}) = D(\alpha^*,\beta^*).$$

This means that zero duality gap can be achieved at $w^*$. It is easy to verify that $(\alpha^*,\beta^*)$ maximizes $D(\alpha,\beta)$, since for any $(\alpha,\beta)$ we have
$$D(\alpha,\beta) \le J(w^*;\{w_l^*\};\{u_i^*\};\{\alpha_i\};\{\beta_l\}) = P(w^*) = D(\alpha^*,\beta^*).$$

9.3 Proof of Proposition 3

Proof We have the decompositions
$$D(\alpha,\beta) = \sum_{l=1}^m D_l(\alpha_{S_l}\mid\beta_l) - h^*\Big(\sum_{l=1}^m\beta_l\Big)$$
and
$$P(w) = \sum_{l=1}^m\big[P_l(w\mid\beta_l) - \beta_l^\top w\big] + h(w).$$
It follows that the duality gap is
$$P(w) - D(\alpha,\beta) = \sum_{l=1}^m\big[P_l(w\mid\beta_l) - D_l(\alpha_{S_l}\mid\beta_l)\big] + h^*\Big(\sum_l\beta_l\Big) + h(w) - \Big(\sum_l\beta_l\Big)^{\!\top}w.$$
Note that the definition of the convex conjugate implies
$$h^*\Big(\sum_l\beta_l\Big) + h(w) - \Big(\sum_l\beta_l\Big)^{\!\top}w \ge 0,$$
and equality holds when $\nabla h(w) = \sum_l\beta_l$. This implies the desired result.

9.4 Proof of Proposition 4

Proof It is easy to check, using duality, that for any $b$ and $w$:
$$-n\lambda\,g^*\Big(\frac{v-b}{n\lambda}\Big) - h^*(b) \le \Big[n\lambda\,g(w) - w^\top(v-b)\Big] + \Big[h(w) - b^\top w\Big] = -w^\top v + n\lambda\,g(w) + h(w),$$
and equality holds if $b = \nabla h(w)$ and $(v-b)/(n\lambda) = \nabla g(w)$ for some subgradients. Based on the assumptions, equality can be achieved at $b = \beta(v) = \nabla h(w(v))$ and $w = w(v)$. This proves the desired result, by noticing that $(v-b)/(n\lambda) = \nabla g(w)$ implies $w = \nabla g^*\big((v-b)/(n\lambda)\big)$.
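The Fenchel-Young argument in the proof of Proposition 4 can be checked numerically on a small instance. The scalar quadratics below are hypothetical choices ($g(w) = w^2/2$, $h(w) = c\,w^2/2$, so $g^*(z) = z^2/2$ and $h^*(b) = b^2/(2c)$), used only to verify that the maximum of the dual expression equals the minimum of the primal one, attained at $b = \nabla h(w(v))$.

```python
import numpy as np

# illustrative scalar instance (hypothetical g, h)
n_lam, c, v = 2.0, 3.0, 1.5   # n*lambda, curvature of h, fixed v

def lhs(b):
    # -n*lam * g*((v - b)/(n*lam)) - h*(b)
    return -(v - b) ** 2 / (2.0 * n_lam) - b ** 2 / (2.0 * c)

def rhs(w):
    # -w*v + n*lam * g(w) + h(w)
    return -w * v + n_lam * w ** 2 / 2.0 + c * w ** 2 / 2.0

bs = np.linspace(-5.0, 5.0, 100001)
ws = np.linspace(-5.0, 5.0, 100001)
best_b = bs[np.argmax(lhs(bs))]   # maximizer of the dual expression
best_w = ws[np.argmin(rhs(ws))]   # minimizer of the primal expression
```

On this instance the maximizing $b$ equals $c \cdot w(v) = \nabla h(w(v))$, matching the equality condition in the proposition.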

9.5 Proof of Proposition 5

Proof Since $\alpha$ is fixed, the problem $\max_\beta D(\alpha,\beta)$ is equivalent to
$$\max_\beta\ \sum_{l=1}^m\Big[-n_l\lambda\,g^*\Big(\frac{v_l(\alpha) - \beta_l}{n_l\lambda}\Big)\Big] - h^*\Big(\sum_{l=1}^m\beta_l\Big).$$
Now, using Jensen's inequality, we obtain for any $\beta$:
$$\sum_{l=1}^m n_l\lambda\,g^*\Big(\frac{v_l(\alpha)-\beta_l}{n_l\lambda}\Big) \ge n\lambda\,g^*\Big(\frac{\sum_l\big(v_l(\alpha)-\beta_l\big)}{n\lambda}\Big) = n\lambda\,g^*\Big(\frac{v(\alpha) - \sum_l\beta_l}{n\lambda}\Big),$$
and therefore
$$-\sum_{l=1}^m n_l\lambda\,g^*\Big(\frac{v_l(\alpha)-\beta_l}{n_l\lambda}\Big) - h^*\Big(\sum_l\beta_l\Big) \le -n\lambda\,g^*\Big(\frac{v(\alpha)-\beta(v(\alpha))}{n\lambda}\Big) - h^*\big(\beta(v(\alpha))\big).$$
In the above derivation, the last inequality uses Proposition 4. Here the equalities can be achieved when
$$\frac{v_l(\alpha)-\beta_l}{n_l} = \frac{v(\alpha)-\beta(v(\alpha))}{n}\quad\text{for all } l,$$
which can be obtained with the choice of $\{\beta_l\}$ given in the statement of the proposition.

9.6 Proof of Theorem 6

The following result is the mini-batch version of a related result in the analysis of ProxSDCA, which we apply to each local machine. The proof is included for completeness.

Lemma 15 Assume that $\phi_i^*$ is $\gamma$-strongly convex (where $\gamma$ can be zero) and $g^*$ is 1-smooth w.r.t. $\|\cdot\|$. At every local step, we randomly pick a mini-batch $Q_l \subseteq S_l$ of size $M_l := |Q_l|$ and optimize w.r.t. the dual variables $\alpha_i$, $i\in Q_l$. Then, using the simplified notation
$$P_l(w^{t-1}) = P_l(w^{t-1}\mid\beta_l^{t-1}), \qquad D_l(\alpha^{t-1}) = D_l(\alpha^{t-1}\mid\beta_l^{t-1}),$$
we have
$$\mathbb{E}\big[D_l(\alpha^{t}) - D_l(\alpha^{t-1})\big] \ge \frac{sM_l}{n_l}\,\mathbb{E}\big[P_l(w^{t-1}) - D_l(\alpha^{t-1})\big] - \frac{sM_l}{n_l}\,G_l^{(t)},$$
where
$$G_l^{(t)} := \frac{sM_l}{2n_l\lambda}\sum_{i\in S_l}\Big(\|X_i\|^2 - \frac{\lambda\gamma n_l(1-s)}{M_l s}\Big)\,\mathbb{E}\big[(u_i^{t-1}-\alpha_i^{t-1})^2\big],$$
$$\Delta\alpha_i := \alpha_i^{t} - \alpha_i^{t-1} = s\,\big(u_i^{t-1} - \alpha_i^{t-1}\big)\quad\text{for all } i\in Q_l, \qquad (21)$$

and $u_i^{t-1} = -\phi_i'(X_i^\top w^{t-1})$, $s\in[0,1]$.

Proof Since only the elements in $Q_l$ are updated, the improvement in the dual objective can be written as
$$D_l(\alpha^{t}) - D_l(\alpha^{t-1}) = \Big[-\sum_{i\in Q_l}\phi_i^*(-\alpha_i^{t}) - n_l\lambda\,g^*\Big(v_l^{t-1} + \frac{1}{n_l\lambda}\sum_{i\in Q_l}X_i\,\Delta\alpha_i\Big)\Big] - \Big[-\sum_{i\in Q_l}\phi_i^*(-\alpha_i^{t-1}) - n_l\lambda\,g^*(v_l^{t-1})\Big]$$
$$\ge \underbrace{-\sum_{i\in Q_l}\phi_i^*(-\alpha_i^{t}) - \nabla g^*(v_l^{t-1})^\top\sum_{i\in Q_l}X_i\,\Delta\alpha_i - \frac{1}{2n_l\lambda}\Big\|\sum_{i\in Q_l}X_i\,\Delta\alpha_i\Big\|^2}_{A_l} - \underbrace{\Big(-\sum_{i\in Q_l}\phi_i^*(-\alpha_i^{t-1})\Big)}_{B_l},$$
where we have used the fact that $g^*$ is 1-smooth in the derivation of the inequality. By the definition of the update in the algorithm, and the definition $\Delta\alpha_i = s\,(u_i^{t-1}-\alpha_i^{t-1})$, $s\in[0,1]$, we have
$$A_l = -\sum_{i\in Q_l}\phi_i^*\big(-\alpha_i^{t-1} - s(u_i^{t-1}-\alpha_i^{t-1})\big) - s\,\nabla g^*(v_l^{t-1})^\top\sum_{i\in Q_l}X_i(u_i^{t-1}-\alpha_i^{t-1}) - \frac{s^2}{2n_l\lambda}\Big\|\sum_{i\in Q_l}X_i(u_i^{t-1}-\alpha_i^{t-1})\Big\|^2. \qquad (21)$$
From now on, we omit the superscript $t-1$. Since $\phi_i^*$ is $\gamma$-strongly convex, we have
$$\phi_i^*\big(-\alpha_i - s(u_i-\alpha_i)\big) = \phi_i^*\big(-(s\,u_i + (1-s)\,\alpha_i)\big) \le s\,\phi_i^*(-u_i) + (1-s)\,\phi_i^*(-\alpha_i) - \frac{\gamma}{2}\,s(1-s)\,(u_i-\alpha_i)^2. \qquad (22)$$

Bringing Eq. (22) into Eq. (21), we get
$$A_l - B_l \ge \sum_{i\in Q_l}\Big[s\,\phi_i^*(-\alpha_i) - s\,\phi_i^*(-u_i) + \frac{\gamma}{2}s(1-s)(u_i-\alpha_i)^2 - s\,w^\top X_i(u_i-\alpha_i)\Big] - \frac{s^2 M_l}{2n_l\lambda}\sum_{i\in Q_l}\|X_i\|^2(u_i-\alpha_i)^2,$$
where $w = \nabla g^*(v_l^{t-1})$, and the second inequality uses the fact that $\|\sum_{i\in Q_l}a_i\|^2 \le M_l\sum_{i\in Q_l}\|a_i\|^2$. Since we choose $u_i = -\phi_i'(X_i^\top w)$ for some subgradient $\phi_i'(X_i^\top w)$, Fenchel-Young equality yields
$$-w^\top X_i\,u_i - \phi_i^*(-u_i) = \phi_i(X_i^\top w),$$
and we obtain
$$A_l - B_l \ge s\sum_{i\in Q_l}\Big[\phi_i(X_i^\top w) + \phi_i^*(-\alpha_i) + w^\top X_i\alpha_i\Big] - \frac{s^2 M_l}{2n_l\lambda}\sum_{i\in Q_l}\Big(\|X_i\|^2 - \frac{\lambda\gamma n_l(1-s)}{M_l s}\Big)(u_i-\alpha_i)^2. \qquad (23)$$
Recall that, with $w = \nabla g^*(\tilde v)$, we have $g(w) + g^*(\tilde v) = w^\top\tilde v$. We can then write the local duality gap as
$$P_l(w\mid\beta_l) - D_l(\alpha\mid\beta_l) = \sum_{i\in S_l}\phi_i(X_i^\top w) + n_l\lambda\,g(w) + \beta_l^\top w + \sum_{i\in S_l}\phi_i^*(-\alpha_i) + n_l\lambda\,g^*\Big(\frac{\sum_{i\in S_l}X_i\alpha_i - \beta_l}{n_l\lambda}\Big)$$
$$= \sum_{i\in S_l}\Big[\phi_i(X_i^\top w) + \phi_i^*(-\alpha_i) + w^\top X_i\alpha_i\Big].$$

Then, taking the expectation of Eq. (23) w.r.t. the random choice of the mini-batch $Q_l$ at round $t$ (each $i\in S_l$ belongs to $Q_l$ with probability $M_l/n_l$), we obtain
$$\mathbb{E}_t[A_l - B_l] \ge \frac{sM_l}{n_l}\sum_{i\in S_l}\Big[\phi_i(X_i^\top w) + \phi_i^*(-\alpha_i) + w^\top X_i\alpha_i\Big] - \frac{s^2M_l^2}{2n_l^2\lambda}\sum_{i\in S_l}\Big(\|X_i\|^2 - \frac{\lambda\gamma n_l(1-s)}{M_l s}\Big)(u_i-\alpha_i)^2.$$
Taking the expectation of both sides w.r.t. the randomness in the previous iterations, we have
$$\mathbb{E}[A_l - B_l] \ge \frac{sM_l}{n_l}\,\mathbb{E}\big[P_l(w)-D_l(\alpha)\big] - \frac{sM_l}{n_l}\,G_l^{(t)},\qquad G_l^{(t)} := \frac{sM_l}{2n_l\lambda}\sum_{i\in S_l}\Big(\|X_i\|^2 - \frac{\lambda\gamma n_l(1-s)}{M_l s}\Big)\,\mathbb{E}\big[(u_i^{t-1}-\alpha_i^{t-1})^2\big].$$

Proof of Theorem 6. Proof We apply Lemma 15 with
$$s = \Big(1 + \frac{R^2M_l}{\lambda\gamma n_l}\Big)^{-1} = \frac{\lambda\gamma n_l}{\lambda\gamma n_l + M_lR^2}\ \in [0,1], \qquad l = 1,\ldots,m.$$
Recall that $\|X_i\| \le R$ for all $i\in S_l$; then we have
$$\|X_i\|^2 - \frac{\lambda\gamma n_l(1-s)}{M_l s} = \|X_i\|^2 - R^2 \le 0 \quad\text{for all } i\in S_l,$$
which implies $G_l^{(t)} \le 0$ for all $l$. It follows that, after the local update step, for all $l$ we have
$$\mathbb{E}\big[D_l(\alpha^{t}\mid\beta_l^{t-1})\big] - D_l(\alpha^{t-1}\mid\beta_l^{t-1}) \ge \frac{s M_l}{n_l}\,\mathbb{E}\big[P_l(w^{t-1}\mid\beta_l^{t-1}) - D_l(\alpha^{t-1}\mid\beta_l^{t-1})\big]. \qquad (24)$$
Now, note that after the global step at iteration $t-1$, the choice of $w^{t-1}$ and $\beta^{t-1}$ in DADM is made according to Proposition 4 and Proposition 5. It follows from

Proposition 5 that the following relationship between the global and local duality gaps holds at the beginning of the $t$-th iteration:
$$P(w^{t-1}) - D(\alpha^{t-1},\beta^{t-1}) = \sum_l\Big[P_l(w^{t-1}\mid\beta_l^{t-1}) - D_l(\alpha^{t-1}\mid\beta_l^{t-1})\Big].$$
Using this decomposition and summing over $l$ in (24), we obtain
$$\mathbb{E}\big[D(\alpha^{t},\beta^{t-1}) - D(\alpha^{t-1},\beta^{t-1})\big] \ge q\,\mathbb{E}\big[P(w^{t-1}) - D(\alpha^{t-1},\beta^{t-1})\big],$$
where
$$q = \min_l\frac{s M_l}{n_l} = \min_l\frac{\lambda\gamma M_l}{\lambda\gamma n_l + M_lR^2}.$$
Since $D(\alpha^{t},\beta^{t}) \ge D(\alpha^{t},\beta^{t-1})$, we obtain
$$\mathbb{E}\big[D(\alpha^{t},\beta^{t}) - D(\alpha^{t-1},\beta^{t-1})\big] \ge q\,\mathbb{E}\big[P(w^{t-1}) - D(\alpha^{t-1},\beta^{t-1})\big].$$
Let $(\alpha^*,\beta^*)$ be an optimal solution of the dual problem, and define the dual sub-optimality as $\epsilon_D^{t} := D(\alpha^*,\beta^*) - D(\alpha^{t},\beta^{t})$. Let $\epsilon_G^{t-1} = P(w^{t-1}) - D(\alpha^{t-1},\beta^{t-1})$, and note that $\epsilon_D^{t-1} \le \epsilon_G^{t-1}$. It follows that
$$\mathbb{E}[\epsilon_D^{t-1}] \ge \mathbb{E}[\epsilon_D^{t-1} - \epsilon_D^{t}] \ge q\,\mathbb{E}[\epsilon_G^{t-1}] \ge q\,\mathbb{E}[\epsilon_D^{t-1}].$$
Therefore we have
$$q\,\mathbb{E}[\epsilon_G^{t}] \le \mathbb{E}[\epsilon_D^{t}] \le (1-q)\,\mathbb{E}[\epsilon_D^{t-1}] \le (1-q)^{t}\,\epsilon_D^{0} \le e^{-qt}\,\epsilon_D^{0}.$$
To obtain an expected duality gap of $\mathbb{E}[\epsilon_G^{T}] \le \epsilon$, every $T$ satisfying
$$T \ge \frac{1}{q}\,\log\frac{\epsilon_D^{0}}{q\,\epsilon}$$
is sufficient. This proves the desired bound.

9.7 Proof of Theorem 7

Now we consider $L$-Lipschitz loss functions and use the following basic lemma for $L$-Lipschitz losses, taken from (Shalev-Shwartz and Zhang, 2013, 2014).

Lemma 16 Let $\phi:\mathbb{R}\to\mathbb{R}$ be an $L$-Lipschitz function. Then $\phi^*(\alpha) = \infty$ for any $\alpha$ such that $|\alpha| > L$.

Proof of Theorem 7. Proof Applying Lemma 15 with $\gamma = 0$, we have
$$G_l^{(t)} = \frac{sM_l}{2n_l\lambda}\sum_{i\in S_l}\|X_i\|^2\,\mathbb{E}\big[(u_i^{t-1}-\alpha_i^{t-1})^2\big].$$

According to Lemma 16, we know that $|u_i^{t-1}| \le L$ and $|\alpha_i^{t-1}| \le L$, and thus
$$(u_i^{t-1}-\alpha_i^{t-1})^2 \le \big(|u_i^{t-1}| + |\alpha_i^{t-1}|\big)^2 \le 4L^2.$$
Recall that $\|X_i\| \le R$; then $G_l^{(t)} \le \frac{sM_l}{2n_l\lambda}\cdot 4n_lL^2R^2 = \frac{2sM_lL^2R^2}{\lambda}$. Combining this with Lemma 15, we have
$$\mathbb{E}\big[D_l(\alpha^{t}\mid\beta_l^{t-1})\big] - D_l(\alpha^{t-1}\mid\beta_l^{t-1}) \ge \frac{sM_l}{n_l}\,\mathbb{E}\big[P_l(w^{t-1}\mid\beta_l^{t-1}) - D_l(\alpha^{t-1}\mid\beta_l^{t-1})\big] - \frac{sM_l}{n_l}\cdot\frac{2sM_lL^2R^2}{\lambda}. \qquad (25)$$
We also note that, after the global step at iteration $t-1$, the choices of $w^{t-1}$ and $\beta^{t-1}$ in DADM follow Proposition 4 and Proposition 5. It follows from Proposition 5 that the global duality gap decomposes into the local duality gaps at the beginning of the $t$-th iteration:
$$P(w^{t-1}) - D(\alpha^{t-1},\beta^{t-1}) = \sum_l\Big[P_l(w^{t-1}\mid\beta_l^{t-1}) - D_l(\alpha^{t-1}\mid\beta_l^{t-1})\Big].$$
Summing the inequality (25) over $l$, combining with the above decomposition, and using $D(\alpha^{t},\beta^{t}) \ge D(\alpha^{t},\beta^{t-1})$, we get
$$\mathbb{E}\big[D(\alpha^{t},\beta^{t}) - D(\alpha^{t-1},\beta^{t-1})\big] \ge q\,\mathbb{E}\big[P(w^{t-1}) - D(\alpha^{t-1},\beta^{t-1})\big] - \frac{q^2 n\bar G}{2\lambda}, \qquad (26)$$
where $q = s_lM_l/n_l \in [0,\min_l M_l/n_l]$, with $s_l\in[0,1]$ chosen so that all the $s_lM_l/n_l$, $l=1,\ldots,m$, are equal, and $\bar G := 4R^2L^2$.

Let $(\alpha^*,\beta^*)$ be an optimal solution of the dual problem $D(\alpha,\beta)$, and define the dual sub-optimality $\epsilon_D^{t} := D(\alpha^*,\beta^*) - D(\alpha^{t},\beta^{t})$. Since the duality gap is an upper bound of the dual sub-optimality, $P(w^{t-1}) - D(\alpha^{t-1},\beta^{t-1}) \ge \epsilon_D^{t-1}$, (26) implies
$$\mathbb{E}[\epsilon_D^{t}] \le (1-q)\,\mathbb{E}[\epsilon_D^{t-1}] + \frac{q^2 n\bar G}{2\lambda}.$$
Starting from this recursion, we can apply the same analysis as for $L$-Lipschitz loss functions in single-machine SDCA (Shalev-Shwartz and Zhang, 2013) to obtain the desired inequality
$$\mathbb{E}[\epsilon_D^{t}] \le \frac{2n\bar G}{\lambda\,(2\tilde n + t - t_0)}\quad\text{for all } t \ge t_0 = \max\Big(0,\ \Big\lceil\tilde n\,\log\frac{2\lambda\,\epsilon_D^{0}}{n\bar G}\Big\rceil\Big), \qquad (27)$$
where $\tilde n = \max_l n_l/M_l$. Further applying the same strategies as in (Shalev-Shwartz and Zhang, 2013), based on (27), proves the desired bound.
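The smoothing construction used for Lipschitz losses is fully explicit for the hinge loss $\phi(u) = \max(0, 1-u)$ (so $L = 1$): its conjugate is finite only on $[-1, 0]$, consistent with Lemma 16, and adding $(\gamma/2)\alpha^2$ to the conjugate and conjugating back gives the standard smoothed hinge in closed form. The sketch below (function names are ours) checks the sandwich bound $\tilde\phi \le \phi \le \tilde\phi + \gamma L^2/2$.

```python
import numpy as np

def hinge(u):
    # phi(u) = max(0, 1 - u), an L-Lipschitz loss with L = 1
    return np.maximum(0.0, 1.0 - u)

def smoothed_hinge(u, gamma):
    # Conjugate of phi*(alpha) + (gamma/2)*alpha^2 on the domain alpha in [-1, 0]:
    # piecewise closed form of the Nesterov-smoothed hinge.
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1.0, 0.0,
           np.where(u <= 1.0 - gamma, 1.0 - u - gamma / 2.0,
                    (1.0 - u) ** 2 / (2.0 * gamma)))

gamma = 0.5
u = np.linspace(-3.0, 3.0, 1201)
gap = hinge(u) - smoothed_hinge(u, gamma)   # should lie in [0, gamma * L^2 / 2]
```

The gap equals exactly $\gamma/2$ on the linear part ($u \le 1-\gamma$) and shrinks to 0 for $u \ge 1$, matching the sandwich bound used in the proof of Corollary 13.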

9.8 Proof of Theorem 11

Our proof strategy follows (Shalev-Shwartz and Zhang, 2014) and (Frostig et al., 2015), both of which use the acceleration technique of (Nesterov, 2004) on top of approximate proximal point steps. The main differences are that here we warm-start with two groups of dual variables, $\alpha$ and $\beta$, whereas (Shalev-Shwartz and Zhang, 2014) warm-starts only with $\alpha$, as it considers the single-machine setting, and (Frostig et al., 2015) warm-starts from the primal variable $w$.

Proof The proof consists of the following steps. In Lemma 17 we show that one can construct a quadratic lower bound of the original objective $P(w)$ from an approximate minimizer of the proximal objective $P_t(w)$. Using the quadratic lower bound, we construct an estimate sequence, based on which, in Lemma 18, we prove the accelerated convergence rate of the outer loop. We show in Lemma 19 that, by warm-starting the iterates from the last stage, the dual sub-optimality at the next stage is small. Based on Lemma 19, the contraction factor between the initial dual sub-optimality and the target primal-dual gap at stage $t$ can be upper bounded by
$$\frac{D_t(\alpha^{t}_{\mathrm{opt}},\beta^{t}_{\mathrm{opt}}) - D_t(\alpha^{t-1},\beta^{t-1})}{\eta\,\xi_{t-1}/(2(1+\eta))} \le \frac{\tilde\epsilon_{t-1}}{\eta\,\xi_{t-1}/(2(1+\eta))} \le \frac{36\,\kappa\,\xi_{t-3}}{\lambda}\cdot\frac{2(1+\eta)}{\eta\,\xi_{t-1}} = O\Big(\frac{\kappa(\lambda+\kappa)}{\lambda^2}\Big),$$
where the last step uses $\xi_{t-1} \ge (1-\sqrt\eta/2)^2\,\xi_{t-3} \ge \xi_{t-3}/4$ and $\eta = \lambda/(\lambda+\kappa)$. Thus, using the results for plain DADM (Theorem 6), the number of inner iterations in each stage is upper bounded by
$$\chi\,\log\Big(\chi\cdot\frac{D_t(\alpha^{t}_{\mathrm{opt}},\beta^{t}_{\mathrm{opt}}) - D_t(\alpha^{t-1},\beta^{t-1})}{\eta\,\xi_{t-1}/(2(1+\eta))}\Big) = O\Big(\chi\,\log\Big(\chi\cdot\frac{\kappa(\lambda+\kappa)}{\lambda^2}\Big)\Big),$$
where $\chi = \dfrac{R^2}{\gamma(\lambda+\kappa)} + \max_l\dfrac{n_l}{M_l}$.

9.9 Proof of Corollary 13

By the property of $\tilde\phi_i(u)$, for every $w$ we have
$$\frac{\hat P(w)}{n} \le \frac{P(w)}{n} \le \frac{\hat P(w)}{n} + \frac{\gamma L^2}{2};$$
thus, if we find a predictor $w^{t}$ that is $\epsilon/2$-suboptimal with respect to $\hat P(w)/n$, i.e.,
$$\frac{\hat P(w^{t})}{n} - \min_w\frac{\hat P(w)}{n} \le \frac{\epsilon}{2},$$
and we choose $\gamma = \epsilon/L^2$, then $w^t$ must be $\epsilon$-suboptimal with respect to $P(w)/n$, because
$$\frac{P(w^{t})}{n} - \frac{P(w^*)}{n} \le \frac{\hat P(w^{t})}{n} + \frac{\gamma L^2}{2} - \min_w\frac{\hat P(w)}{n} \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$
The rest of the proof follows the smooth case, as proved in Theorem 11.

Dual subproblems in Acc-DADM Define $\tilde\lambda = \lambda + \kappa$ and $\tilde\lambda\,f(w) = \lambda\,g(w) + \frac{\kappa}{2}\|w\|^2$. Let
$$P_t(w) = \sum_{i=1}^n\phi_i(X_i^\top w) + n\lambda\,g(w) + h(w) + \frac{\kappa n}{2}\,\|w - y^{(t-1)}\|^2$$
$$= \sum_{i=1}^n\phi_i(X_i^\top w) + n\tilde\lambda\,f(w) - \kappa n\,w^\top y^{(t-1)} + h(w) + \frac{\kappa n}{2}\,\|y^{(t-1)}\|^2$$
be the global primal problem to solve, and
$$P_{l,t}(w) = \sum_{i\in S_l}\phi_i(X_i^\top w) + n_l\lambda\,g(w) + \frac{\kappa n_l}{2}\,\|w - y^{(t-1)}\|^2$$
be the separated local problem. Given each dual variable $\beta_l$, we also define the adjusted local primal problem as
$$P_{l,t}(w\mid\beta_l) = \sum_{i\in S_l}\phi_i(X_i^\top w) + n_l\lambda\,g(w) + \beta_l^\top w + \frac{\kappa n_l}{2}\,\|w - y^{(t-1)}\|^2.$$
It is not hard to see that the adjusted local dual problem is
$$D_{l,t}(\alpha\mid\beta_l) = -\sum_{i\in S_l}\phi_i^*(-\alpha_i) - n_l\tilde\lambda\,f^*\Big(\frac{\sum_{i\in S_l}X_i\alpha_i - \beta_l + \kappa n_l\,y^{(t-1)}}{n_l\tilde\lambda}\Big) + \frac{\kappa n_l}{2}\,\|y^{(t-1)}\|^2,$$
and the global dual objective can be written as
$$D_t(\alpha,\beta) = \sum_{l=1}^m D_{l,t}(\alpha_{S_l}\mid\beta_l) - h^*\Big(\sum_{l=1}^m\beta_l\Big).$$

Quadratic lower bound for $P(w)$ based on the approximate proximal point algorithm Since $P_t(w) = P(w) + \frac{\kappa n}{2}\|w - y^{(t-1)}\|^2$, let $w_{\mathrm{opt}}^{t} = \arg\min_w P_t(w)$. The following lemma shows that we can construct a lower bound of $P(w)$ from an approximate minimizer of $P_t(w)$.

Lemma 17 Let $w^+$ be an $\epsilon$-approximate minimizer of $P_t(w)$, i.e., $P_t(w^+) \le P_t(w_{\mathrm{opt}}^{t}) + \epsilon$. Then we can construct the following quadratic lower bound for $P(w)$: for all $w$,
$$P(w) \ge P(w^+) + Q(w;\,w^+,\,y^{(t-1)},\,\epsilon), \qquad (28)$$
where
$$Q(w;\,w^+,\,y^{(t-1)},\,\epsilon) = \frac{\lambda n}{4}\,\Big\|w - w^+ - \frac{2\kappa}{\lambda}\,\big(w^+ - y^{(t-1)}\big)\Big\|^2 - \frac{\kappa^2 n}{\lambda}\,\|w^+ - y^{(t-1)}\|^2 - 2\Big(1+\frac{\kappa}{\lambda}\Big)\,\epsilon.$$

Proof Since $w_{\mathrm{opt}}^{t}$ is the minimizer of the $(\lambda+\kappa)n$-strongly convex objective $P_t(w)$, we have, for all $w$,
$$P_t(w) \ge P_t(w_{\mathrm{opt}}^{t}) + \frac{(\lambda+\kappa)n}{2}\,\|w - w_{\mathrm{opt}}^{t}\|^2 \ge P_t(w^+) - \epsilon + \frac{(\lambda+\kappa)n}{2}\,\|w - w_{\mathrm{opt}}^{t}\|^2,$$
which is equivalent to
$$P(w) \ge P(w^+) + \frac{\kappa n}{2}\,\|w^+ - y^{(t-1)}\|^2 - \frac{\kappa n}{2}\,\|w - y^{(t-1)}\|^2 - \epsilon + \frac{(\lambda+\kappa)n}{2}\,\|w - w_{\mathrm{opt}}^{t}\|^2.$$
Moreover, strong convexity also gives $\frac{(\lambda+\kappa)n}{2}\,\|w^+ - w_{\mathrm{opt}}^{t}\|^2 \le P_t(w^+) - P_t(w_{\mathrm{opt}}^{t}) \le \epsilon$. For $\theta = \frac{\lambda}{2(\lambda+\kappa)} \in (0,1)$ we have
$$\|w - w_{\mathrm{opt}}^{t}\|^2 \ge (1-\theta)\,\|w - w^+\|^2 - \Big(\frac{1}{\theta}-1\Big)\,\|w^+ - w_{\mathrm{opt}}^{t}\|^2,$$
and re-organizing terms we get
$$P(w) \ge P(w^+) + \Big(\frac{\lambda n}{4} + \frac{\kappa n}{2}\Big)\|w - w^+\|^2 - \frac{\kappa n}{2}\,\|w - y^{(t-1)}\|^2 + \frac{\kappa n}{2}\,\|w^+ - y^{(t-1)}\|^2 - 2\Big(1+\frac{\kappa}{\lambda}\Big)\,\epsilon.$$

Also, note that the right-hand side of the above inequality is a quadratic function with respect to $w$: collecting the two quadratic terms,
$$\Big(\frac{\lambda n}{4}+\frac{\kappa n}{2}\Big)\|w-w^+\|^2 - \frac{\kappa n}{2}\,\|w-y^{(t-1)}\|^2 = \frac{\lambda n}{4}\,\Big\|w - w^+ - \frac{2\kappa}{\lambda}\big(w^+ - y^{(t-1)}\big)\Big\|^2 - \frac{\kappa n}{2}\Big(1+\frac{2\kappa}{\lambda}\Big)\|w^+ - y^{(t-1)}\|^2,$$
whose minimum over $w$ is achieved at $w = w^+ + \frac{2\kappa}{\lambda}\,(w^+ - y^{(t-1)})$. Since
$$\frac{\kappa n}{2}\,\|w^+ - y^{(t-1)}\|^2 - \frac{\kappa n}{2}\Big(1+\frac{2\kappa}{\lambda}\Big)\|w^+ - y^{(t-1)}\|^2 = -\frac{\kappa^2 n}{\lambda}\,\|w^+ - y^{(t-1)}\|^2,$$
with the above we have finished the proof of Lemma 17.

Convergence proof Define the following sequence of quadratic functions:
$$\psi_0(w) = P(w^{(0)}) + \frac{(\lambda+\kappa)n}{4}\,\|w\|^2 - \big(P(0) - D(0,0)\big),$$
and, for $t \ge 1$,
$$\psi_t(w) = \Big(1-\frac{\sqrt\eta}{2}\Big)\,\psi_{t-1}(w) + \frac{\sqrt\eta}{2}\,\Big[P(w^{(t)}) + Q\big(w;\,w^{(t)},\,y^{(t-1)},\,\tilde\epsilon_t\big)\Big],$$
where $\eta = \lambda/(\lambda+\kappa)$. We first calculate the explicit form of the quadratic function $\psi_t(w)$ and its minimizer $v_t = \arg\min_w\psi_t(w)$. Clearly $v_0 = 0$, and since each $\psi_t$ is a convex quadratic whose Hessian is at least $\frac{\lambda n}{2}I$, it can be written in the form
$$\psi_t(w) = \psi_t(v_t) + \frac{c_t}{2}\,\|w - v_t\|^2, \qquad c_t \ge \frac{\lambda n}{2}.$$

Based on the definition of $\psi_{t+1}(w)$ and the fact that $v_{t+1}$ minimizes $\psi_{t+1}(w)$, the first-order optimality condition gives
$$\Big(1-\frac{\sqrt\eta}{2}\Big)\,c_t\,(v_{t+1}-v_t) + \frac{\sqrt\eta}{2}\cdot\frac{\lambda n}{2}\,\Big(v_{t+1} - y^{(t)} - \Big(1+\frac{2\kappa}{\lambda}\Big)\big(w^{(t+1)} - y^{(t)}\big)\Big) = 0;$$
rearranging, $v_{t+1}$ is a convex combination
$$v_{t+1} = (1-\tau_t)\,v_t + \tau_t\,\Big[y^{(t)} + \Big(1+\frac{2\kappa}{\lambda}\Big)\big(w^{(t+1)} - y^{(t)}\big)\Big]$$
for an appropriate $\tau_t \in (0,1]$. The following lemma proves the convergence of $w^{(t)}$ to the minimizer $w^*$.

Lemma 18 Let $\tilde\epsilon_t \le \frac{\eta}{2(1+\eta)}\,\xi_t$ and $\xi_t = \big(1-\frac{\sqrt\eta}{2}\big)^{t}\,\xi_0$. Then we have the following convergence guarantee:
$$P(w^{(t)}) - P(w^*) \le \xi_t.$$
Proof It is sufficient to prove
$$P(w^{(t)}) - \min_w\psi_t(w) \le \xi_t, \qquad (29)$$
since $\psi_t(w) \le P(w)$ for all $w$ by construction (each term entering the convex combination is a lower bound of $P$), and then
$$P(w^{(t)}) - P(w^*) \le P(w^{(t)}) - \psi_t(w^*) \le P(w^{(t)}) - \min_w\psi_t(w) \le \xi_t.$$
We prove Eq. (29) by induction. When $t = 0$, we have
$$P(w^{(0)}) - \psi_0(v_0) = P(0) - D(0,0) \le \xi_0,$$

which verifies that (29) is true for $t = 0$. Suppose the claim holds for some $t \ge 0$; for stage $t+1$, writing $\theta = \sqrt\eta/2$, we have
$$\psi_{t+1}(v_{t+1}) = (1-\theta)\,\Big[\psi_t(v_t) + \frac{c_t}{2}\,\|v_{t+1}-v_t\|^2\Big] + \theta\,\Big[P(w^{(t+1)}) + Q\big(v_{t+1};\,w^{(t+1)},\,y^{(t)},\,\tilde\epsilon_{t+1}\big)\Big]$$
$$= (1-\theta)\,\psi_t(v_t) + \theta\,P(w^{(t+1)}) + (1-\theta)\,\frac{c_t}{2}\,\|v_{t+1}-v_t\|^2 + \theta\,\frac{\lambda n}{4}\,\Big\|v_{t+1} - y^{(t)} - \Big(1+\frac{2\kappa}{\lambda}\Big)\big(w^{(t+1)}-y^{(t)}\big)\Big\|^2$$
$$\qquad - \theta\,\frac{\kappa^2 n}{\lambda}\,\|w^{(t+1)} - y^{(t)}\|^2 - 2\theta\,\Big(1+\frac{\kappa}{\lambda}\Big)\,\tilde\epsilon_{t+1}.$$
Substituting the expression of $v_{t+1}$ as a convex combination of $v_t$ and $y^{(t)} + (1+2\kappa/\lambda)(w^{(t+1)}-y^{(t)})$, expanding the two squared norms, and using $2\langle a,b\rangle \le \|a\|^2 + \|b\|^2$ to absorb the cross terms, we obtain
$$\psi_{t+1}(v_{t+1}) \ge (1-\theta)\,\psi_t(v_t) + \theta\,P(w^{(t+1)}) - \theta^2\,\frac{\kappa^2 n}{\lambda}\,\|y^{(t)} - w^{(t+1)}\|^2 + \theta(1-\theta)\,\frac{\lambda n}{2}\Big(1+\frac{2\kappa}{\lambda}\Big)\,\big\langle v_t - y^{(t)},\,y^{(t)} - w^{(t+1)}\big\rangle - 2\theta\,\Big(1+\frac{\kappa}{\lambda}\Big)\,\tilde\epsilon_{t+1}.$$


Multispectral Remote Sensing Image Classification Algorithm Based on Rough Set Theory Proceedngs of the 2009 IEEE Internatona Conference on Systems Man and Cybernetcs San Antono TX USA - October 2009 Mutspectra Remote Sensng Image Cassfcaton Agorthm Based on Rough Set Theory Yng Wang Xaoyun

More information

Cyclic Codes BCH Codes

Cyclic Codes BCH Codes Cycc Codes BCH Codes Gaos Feds GF m A Gaos fed of m eements can be obtaned usng the symbos 0,, á, and the eements beng 0,, á, á, á 3 m,... so that fed F* s cosed under mutpcaton wth m eements. The operator

More information

Chapter 6. Rotations and Tensors

Chapter 6. Rotations and Tensors Vector Spaces n Physcs 8/6/5 Chapter 6. Rotatons and ensors here s a speca knd of near transformaton whch s used to transforms coordnates from one set of axes to another set of axes (wth the same orgn).

More information

Price Competition under Linear Demand and Finite Inventories: Contraction and Approximate Equilibria

Price Competition under Linear Demand and Finite Inventories: Contraction and Approximate Equilibria Prce Competton under Lnear Demand and Fnte Inventores: Contracton and Approxmate Equbra Jayang Gao, Krshnamurthy Iyer, Huseyn Topaogu 1 Abstract We consder a compettve prcng probem where there are mutpe

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Key words. corner singularities, energy-corrected finite element methods, optimal convergence rates, pollution effect, re-entrant corners

Key words. corner singularities, energy-corrected finite element methods, optimal convergence rates, pollution effect, re-entrant corners NESTED NEWTON STRATEGIES FOR ENERGY-CORRECTED FINITE ELEMENT METHODS U. RÜDE1, C. WALUGA 2, AND B. WOHLMUTH 2 Abstract. Energy-corrected fnte eement methods provde an attractve technque to dea wth eptc

More information

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA RESEARCH ARTICLE MOELING FIXE OS BETTING FOR FUTURE EVENT PREICTION Weyun Chen eartment of Educatona Informaton Technoogy, Facuty of Educaton, East Chna Norma Unversty, Shangha, CHINA {weyun.chen@qq.com}

More information

QUARTERLY OF APPLIED MATHEMATICS

QUARTERLY OF APPLIED MATHEMATICS QUARTERLY OF APPLIED MATHEMATICS Voume XLI October 983 Number 3 DIAKOPTICS OR TEARING-A MATHEMATICAL APPROACH* By P. W. AITCHISON Unversty of Mantoba Abstract. The method of dakoptcs or tearng was ntroduced

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

COXREG. Estimation (1)

COXREG. Estimation (1) COXREG Cox (972) frst suggested the modes n whch factors reated to fetme have a mutpcatve effect on the hazard functon. These modes are caed proportona hazards (PH) modes. Under the proportona hazards

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

A General Column Generation Algorithm Applied to System Reliability Optimization Problems

A General Column Generation Algorithm Applied to System Reliability Optimization Problems A Genera Coumn Generaton Agorthm Apped to System Reabty Optmzaton Probems Lea Za, Davd W. Cot, Department of Industra and Systems Engneerng, Rutgers Unversty, Pscataway, J 08854, USA Abstract A genera

More information

Application of Particle Swarm Optimization to Economic Dispatch Problem: Advantages and Disadvantages

Application of Particle Swarm Optimization to Economic Dispatch Problem: Advantages and Disadvantages Appcaton of Partce Swarm Optmzaton to Economc Dspatch Probem: Advantages and Dsadvantages Kwang Y. Lee, Feow, IEEE, and Jong-Bae Par, Member, IEEE Abstract--Ths paper summarzes the state-of-art partce

More information

Monica Purcaru and Nicoleta Aldea. Abstract

Monica Purcaru and Nicoleta Aldea. Abstract FILOMAT (Nš) 16 (22), 7 17 GENERAL CONFORMAL ALMOST SYMPLECTIC N-LINEAR CONNECTIONS IN THE BUNDLE OF ACCELERATIONS Monca Purcaru and Ncoeta Adea Abstract The am of ths paper 1 s to fnd the transformaton

More information

An Augmented Lagrangian Coordination-Decomposition Algorithm for Solving Distributed Non-Convex Programs

An Augmented Lagrangian Coordination-Decomposition Algorithm for Solving Distributed Non-Convex Programs An Augmented Lagrangan Coordnaton-Decomposton Agorthm for Sovng Dstrbuted Non-Convex Programs Jean-Hubert Hours and Con N. Jones Abstract A nove augmented Lagrangan method for sovng non-convex programs

More information

Correspondence. Performance Evaluation for MAP State Estimate Fusion I. INTRODUCTION

Correspondence. Performance Evaluation for MAP State Estimate Fusion I. INTRODUCTION Correspondence Performance Evauaton for MAP State Estmate Fuson Ths paper presents a quanttatve performance evauaton method for the maxmum a posteror (MAP) state estmate fuson agorthm. Under dea condtons

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

MAXIMUM NORM STABILITY OF DIFFERENCE SCHEMES FOR PARABOLIC EQUATIONS ON OVERSET NONMATCHING SPACE-TIME GRIDS

MAXIMUM NORM STABILITY OF DIFFERENCE SCHEMES FOR PARABOLIC EQUATIONS ON OVERSET NONMATCHING SPACE-TIME GRIDS MATHEMATICS OF COMPUTATION Voume 72 Number 242 Pages 619 656 S 0025-57180201462-X Artce eectroncay pubshed on November 4 2002 MAXIMUM NORM STABILITY OF DIFFERENCE SCHEMES FOR PARABOLIC EQUATIONS ON OVERSET

More information

The Application of BP Neural Network principal component analysis in the Forecasting the Road Traffic Accident

The Application of BP Neural Network principal component analysis in the Forecasting the Road Traffic Accident ICTCT Extra Workshop, Bejng Proceedngs The Appcaton of BP Neura Network prncpa component anayss n Forecastng Road Traffc Accdent He Mng, GuoXucheng &LuGuangmng Transportaton Coege of Souast Unversty 07

More information

Predicting Model of Traffic Volume Based on Grey-Markov

Predicting Model of Traffic Volume Based on Grey-Markov Vo. No. Modern Apped Scence Predctng Mode of Traffc Voume Based on Grey-Marov Ynpeng Zhang Zhengzhou Muncpa Engneerng Desgn & Research Insttute Zhengzhou 5005 Chna Abstract Grey-marov forecastng mode of

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Short-Term Load Forecasting for Electric Power Systems Using the PSO-SVR and FCM Clustering Techniques

Short-Term Load Forecasting for Electric Power Systems Using the PSO-SVR and FCM Clustering Techniques Energes 20, 4, 73-84; do:0.3390/en40073 Artce OPEN ACCESS energes ISSN 996-073 www.mdp.com/journa/energes Short-Term Load Forecastng for Eectrc Power Systems Usng the PSO-SVR and FCM Custerng Technques

More information

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed DISTRIBUTED PROCESSIG OVER ADAPTIVE ETWORKS Casso G Lopes and A H Sayed Department of Eectrca Engneerng Unversty of Caforna Los Angees, CA, 995 Ema: {casso, sayed@eeucaedu ABSTRACT Dstrbuted adaptve agorthms

More information

Achieving Optimal Throughput Utility and Low Delay with CSMA-like Algorithms: A Virtual Multi-Channel Approach

Achieving Optimal Throughput Utility and Low Delay with CSMA-like Algorithms: A Virtual Multi-Channel Approach Achevng Optma Throughput Utty and Low Deay wth SMA-ke Agorthms: A Vrtua Mut-hanne Approach Po-Ka Huang, Student Member, IEEE, and Xaojun Ln, Senor Member, IEEE Abstract SMA agorthms have recenty receved

More information

Inexact Newton Methods for Inverse Eigenvalue Problems

Inexact Newton Methods for Inverse Eigenvalue Problems Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

SIMULTANEOUS wireless information and power transfer. Joint Optimization of Power and Data Transfer in Multiuser MIMO Systems

SIMULTANEOUS wireless information and power transfer. Joint Optimization of Power and Data Transfer in Multiuser MIMO Systems Jont Optmzaton of Power and Data ransfer n Mutuser MIMO Systems Javer Rubo, Antono Pascua-Iserte, Dane P. Paomar, and Andrea Godsmth Unverstat Potècnca de Cataunya UPC, Barceona, Span ong Kong Unversty

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

Finding low error clusterings

Finding low error clusterings Fndng ow error custerngs Mara-Forna Bacan Mcrosoft Research, New Engand One Memora Drve, Cambrdge, MA mabacan@mcrosoft.com Mark Braverman Mcrosoft Research, New Engand One Memora Drve, Cambrdge, MA markbrav@mcrosoft.com

More information

[WAVES] 1. Waves and wave forces. Definition of waves

[WAVES] 1. Waves and wave forces. Definition of waves 1. Waves and forces Defnton of s In the smuatons on ong-crested s are consdered. The drecton of these s (μ) s defned as sketched beow n the goba co-ordnate sstem: North West East South The eevaton can

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Research Article H Estimates for Discrete-Time Markovian Jump Linear Systems

Research Article H Estimates for Discrete-Time Markovian Jump Linear Systems Mathematca Probems n Engneerng Voume 213 Artce ID 945342 7 pages http://dxdoorg/11155/213/945342 Research Artce H Estmates for Dscrete-Tme Markovan Jump Lnear Systems Marco H Terra 1 Gdson Jesus 2 and

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1 Abstract The Entre Souton Path for Support Vector Machne n Postve and Unabeed Cassfcaton 1 Yao Lmn, Tang Je, and L Juanz Department of Computer Scence, Tsnghua Unversty 1-308, FIT, Tsnghua Unversty, Bejng,

More information

Downlink Power Allocation for CoMP-NOMA in Multi-Cell Networks

Downlink Power Allocation for CoMP-NOMA in Multi-Cell Networks Downn Power Aocaton for CoMP-NOMA n Mut-Ce Networs Md Shpon A, Eram Hossan, Arafat A-Dwe, and Dong In Km arxv:80.0498v [eess.sp] 6 Dec 207 Abstract Ths wor consders the probem of dynamc power aocaton n

More information

arxiv: v1 [cs.gt] 28 Mar 2017

arxiv: v1 [cs.gt] 28 Mar 2017 A Dstrbuted Nash qubrum Seekng n Networked Graphca Games Farzad Saehsadaghan, and Lacra Pave arxv:7009765v csgt 8 Mar 07 Abstract Ths paper consders a dstrbuted gossp approach for fndng a Nash equbrum

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Games of Threats. Elon Kohlberg Abraham Neyman. Working Paper

Games of Threats. Elon Kohlberg Abraham Neyman. Working Paper Games of Threats Elon Kohlberg Abraham Neyman Workng Paper 18-023 Games of Threats Elon Kohlberg Harvard Busness School Abraham Neyman The Hebrew Unversty of Jerusalem Workng Paper 18-023 Copyrght 2017

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution.

Solutions HW #2. minimize. Ax = b. Give the dual problem, and make the implicit equality constraints explicit. Solution. Solutons HW #2 Dual of general LP. Fnd the dual functon of the LP mnmze subject to c T x Gx h Ax = b. Gve the dual problem, and make the mplct equalty constrants explct. Soluton. 1. The Lagrangan s L(x,

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Greyworld White Balancing with Low Computation Cost for On- Board Video Capturing

Greyworld White Balancing with Low Computation Cost for On- Board Video Capturing reyword Whte aancng wth Low Computaton Cost for On- oard Vdeo Capturng Peng Wu Yuxn Zoe) Lu Hewett-Packard Laboratores Hewett-Packard Co. Pao Ato CA 94304 USA Abstract Whte baancng s a process commony

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

Analysis of Bipartite Graph Codes on the Binary Erasure Channel

Analysis of Bipartite Graph Codes on the Binary Erasure Channel Anayss of Bpartte Graph Codes on the Bnary Erasure Channe Arya Mazumdar Department of ECE Unversty of Maryand, Coege Par ema: arya@umdedu Abstract We derve densty evouton equatons for codes on bpartte

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

Delay tomography for large scale networks

Delay tomography for large scale networks Deay tomography for arge scae networks MENG-FU SHIH ALFRED O. HERO III Communcatons and Sgna Processng Laboratory Eectrca Engneerng and Computer Scence Department Unversty of Mchgan, 30 Bea. Ave., Ann

More information