An Accelerated Proximal Coordinate Gradient Method

Qihang Lin (University of Iowa, Iowa City, IA, USA; qihang-lin@uiowa.edu)
Zhaosong Lu (Simon Fraser University, Burnaby, BC, Canada; zhaosong@sfu.ca)
Lin Xiao (Microsoft Research, Redmond, WA, USA; lin.xiao@microsoft.com)

Abstract

We develop an accelerated randomized proximal coordinate gradient (APCG) method for solving a broad class of composite convex optimization problems. In particular, our method achieves faster linear convergence rates for minimizing strongly convex functions than existing randomized proximal coordinate gradient methods. We show how to apply the APCG method to solve the dual of the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that avoid full-dimensional vector operations. For ill-conditioned ERM problems, our method obtains better convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method.

1 Introduction

Coordinate descent methods have received extensive attention in recent years due to their potential for solving large-scale optimization problems arising from machine learning and other applications. In this paper, we develop an accelerated proximal coordinate gradient (APCG) method for solving convex optimization problems of the following form:

\[
\text{minimize}_{x \in \mathbb{R}^N} \; \Big\{ F(x) \overset{\text{def}}{=} f(x) + \Psi(x) \Big\}, \qquad (1)
\]

where $f$ is differentiable on $\mathrm{dom}\,\Psi$, and $\Psi$ has a block separable structure. More specifically,

\[
\Psi(x) = \sum_{i=1}^{n} \Psi_i(x_i),
\]

where each $x_i$ denotes a sub-vector of $x$ with cardinality $N_i$, and each $\Psi_i : \mathbb{R}^{N_i} \to \mathbb{R} \cup \{+\infty\}$ is a closed convex function. We assume the collection $\{x_i : i = 1, \ldots, n\}$ forms a partition of the components of $x \in \mathbb{R}^N$. In addition to the capability of modeling nonsmooth regularization terms such as $\Psi(x) = \lambda \|x\|_1$, this model also includes optimization problems with block separable constraints. More precisely, each block constraint $x_i \in C_i$, where $C_i$ is a closed convex set, can be modeled by an indicator function defined as $\Psi_i(x_i) = 0$ if $x_i \in C_i$ and $\infty$ otherwise.

At each iteration, coordinate descent methods choose one block of coordinates $x_i$ to sufficiently reduce the objective value while keeping the other blocks fixed. One common and simple approach for choosing such a block is the cyclic scheme. The global and local convergence properties of the cyclic coordinate descent method have been studied, for example, in [2, 11, 16, 21, 5]. Recently, randomized strategies for choosing the block to update have become more popular. In addition to their theoretical benefits [13, 14, 19], numerous experiments have demonstrated that randomized coordinate descent methods are very powerful for solving large-scale machine learning problems [3, 6, 18, 19]. Inspired by the success of accelerated full gradient methods (e.g., [12, 1, 22]), several recent works applied Nesterov's acceleration schemes to speed up randomized coordinate descent methods. In particular, Nesterov [13] developed an accelerated randomized coordinate gradient method for minimizing unconstrained smooth convex functions, which corresponds to the case of $\Psi(x) \equiv 0$ in (1).

Lu and Xiao [10] gave a sharper convergence analysis of Nesterov's method, and Lee and Sidford [8] developed extensions with weighted random sampling schemes. More recently, Fercoq and Richtárik [4] proposed an APPROX (Accelerated, Parallel and PROXimal) coordinate descent method for solving the more general problem (1) and obtained accelerated sublinear convergence rates, but their method cannot exploit strong convexity to obtain accelerated linear rates.

In this paper, we develop a general APCG method that achieves accelerated linear convergence rates when the objective function is strongly convex. Without the strong convexity assumption, our method recovers the APPROX method [4]. Moreover, we show how to apply the APCG method to solve the dual of the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that avoid full-dimensional vector operations. For ill-conditioned ERM problems, our method obtains faster convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method [19], and the improved iteration complexity matches the accelerated SDCA method [20]. We present numerical experiments to illustrate the advantage of our method.

1.1 Notations and assumptions

For any partition of $x \in \mathbb{R}^N$ into $\{x_i \in \mathbb{R}^{N_i} : i = 1, \ldots, n\}$, there is an $N \times N$ permutation matrix $U$, partitioned as $U = [U_1 \cdots U_n]$ with $U_i \in \mathbb{R}^{N \times N_i}$, such that

\[
x = \sum_{i=1}^{n} U_i x_i, \qquad x_i = U_i^T x, \quad i = 1, \ldots, n.
\]

For any $x \in \mathbb{R}^N$, the partial gradient of $f$ with respect to $x_i$ is defined as

\[
\nabla_i f(x) = U_i^T \nabla f(x), \quad i = 1, \ldots, n. \qquad (2)
\]

We associate each subspace $\mathbb{R}^{N_i}$, for $i = 1, \ldots, n$, with the standard Euclidean norm, denoted by $\|\cdot\|$. We make the following assumptions, which are standard in the literature on coordinate descent methods (e.g., [13, 14]).

Assumption 1. The gradient of the function $f$ is block-wise Lipschitz continuous with constants $L_i$, i.e.,

\[
\|\nabla_i f(x + U_i h_i) - \nabla_i f(x)\| \le L_i \|h_i\|, \quad \forall\, h_i \in \mathbb{R}^{N_i}, \; i = 1, \ldots, n, \; x \in \mathbb{R}^N.
\]

For convenience, we define the following norm on the whole space $\mathbb{R}^N$:

\[
\|x\|_L = \Big( \sum_{i=1}^{n} L_i \|x_i\|^2 \Big)^{1/2}, \quad \forall\, x \in \mathbb{R}^N. \qquad (3)
\]

Assumption 2. There exists $\mu \ge 0$ such that for all $y \in \mathbb{R}^N$ and $x \in \mathrm{dom}\,\Psi$,

\[
f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_L^2.
\]

The convexity parameter of $f$ with respect to the norm $\|\cdot\|_L$ is the largest $\mu$ such that the above inequality holds. Every convex function satisfies Assumption 2 with $\mu = 0$. If $\mu > 0$, the function $f$ is called strongly convex. We note that an immediate consequence of Assumption 1 is

\[
f(x + U_i h_i) \le f(x) + \langle \nabla_i f(x), h_i \rangle + \frac{L_i}{2} \|h_i\|^2, \quad \forall\, h_i \in \mathbb{R}^{N_i}, \; i = 1, \ldots, n, \; x \in \mathbb{R}^N, \qquad (4)
\]

and this together with Assumption 2 implies $\mu \le 1$.

2 The APCG method

In this section we describe the general APCG method and several variants that are more suitable for implementation under different assumptions. These algorithms extend Nesterov's accelerated gradient methods [12, Section 2.2] to the composite and coordinate descent setting.

We first explain the notations used in our algorithms. The algorithms proceed in iterations, with $k$ being the iteration counter. Lower case letters $x$, $y$, $z$ represent vectors in the full space $\mathbb{R}^N$, and $x^{(k)}$, $y^{(k)}$ and $z^{(k)}$ are their values at the $k$-th iteration. Each block coordinate is indicated with a subscript; for example, $x_i^{(k)}$ represents the value of the $i$-th block of the vector $x^{(k)}$. The Greek letters $\alpha$, $\beta$, $\gamma$ are scalars, and $\alpha_k$, $\beta_k$ and $\gamma_k$ represent their values at iteration $k$.
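To make the block notation concrete, the following minimal sketch (an illustration added here, not part of the paper's presentation) computes partial gradients and the weighted norm $\|\cdot\|_L$ for single-coordinate blocks ($N_i = 1$) and a least-squares $f$; the matrix and data are arbitrary assumptions.

```python
import numpy as np

# Illustration of Section 1.1 with f(x) = 0.5*||A x - b||^2 and N_i = 1.
A = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]])
b = np.array([1.0, 0.0, -1.0])

def grad(x):
    return A.T @ (A @ x - b)            # full gradient of f

def partial_grad(x, i):
    return grad(x)[i]                   # nabla_i f(x) = U_i^T nabla f(x), eq. (2)

# Block-wise Lipschitz constants: for this quadratic, L_i = ||A_i||^2,
# the squared norm of the i-th column of A.
L = np.sum(A * A, axis=0)

def norm_L(x):
    return np.sqrt(np.sum(L * x * x))   # the weighted norm ||x||_L in (3)

x = np.array([0.3, -0.7])
print(partial_grad(x, 0), norm_L(x))
```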

Algorithm 1: the APCG method

Input: $x^{(0)} \in \mathrm{dom}\,\Psi$ and convexity parameter $\mu \ge 0$.
Initialize: set $z^{(0)} = x^{(0)}$ and choose $0 < \gamma_0 \in [\mu, 1]$.
Iterate: repeat for $k = 0, 1, 2, \ldots$

1. Compute $\alpha_k \in (0, \tfrac{1}{n}]$ from the equation
\[
n^2 \alpha_k^2 = (1 - \alpha_k)\gamma_k + \alpha_k \mu, \qquad (5)
\]
and set
\[
\gamma_{k+1} = (1 - \alpha_k)\gamma_k + \alpha_k \mu, \qquad \beta_k = \frac{\alpha_k \mu}{\gamma_{k+1}}. \qquad (6)
\]

2. Compute $y^{(k)}$ as
\[
y^{(k)} = \frac{1}{\alpha_k \gamma_k + \gamma_{k+1}} \big( \alpha_k \gamma_k z^{(k)} + \gamma_{k+1} x^{(k)} \big). \qquad (7)
\]

3. Choose $i_k \in \{1, \ldots, n\}$ uniformly at random and compute
\[
z^{(k+1)} = \operatorname*{arg\,min}_{x \in \mathbb{R}^N} \Big\{ \frac{n \alpha_k}{2} \big\| x - (1 - \beta_k) z^{(k)} - \beta_k y^{(k)} \big\|_L^2 + \langle \nabla_{i_k} f(y^{(k)}), x_{i_k} - y^{(k)}_{i_k} \rangle + \Psi_{i_k}(x_{i_k}) \Big\}.
\]

4. Set
\[
x^{(k+1)} = y^{(k)} + n \alpha_k \big( z^{(k+1)} - z^{(k)} \big) + \frac{\mu}{n} \big( z^{(k)} - y^{(k)} \big). \qquad (8)
\]

The general APCG method is given as Algorithm 1. At each iteration $k$, it chooses a random coordinate $i_k \in \{1, \ldots, n\}$ and generates $y^{(k)}$, $x^{(k+1)}$ and $z^{(k+1)}$. One can observe that $x^{(k+1)}$ and $z^{(k+1)}$ depend on the realization of the random variable $\xi_k = \{i_0, i_1, \ldots, i_k\}$, while $y^{(k)}$ is independent of $i_k$ and only depends on $\xi_{k-1}$.

To better understand this method, we make the following observations. For convenience, we define

\[
\tilde{z}^{(k+1)} = \operatorname*{arg\,min}_{x \in \mathbb{R}^N} \Big\{ \frac{n \alpha_k}{2} \big\| x - (1 - \beta_k) z^{(k)} - \beta_k y^{(k)} \big\|_L^2 + \langle \nabla f(y^{(k)}), x - y^{(k)} \rangle + \Psi(x) \Big\}, \qquad (9)
\]

which is a full-dimensional update version of Step 3. One can observe that $z^{(k+1)}$ is updated as

\[
z^{(k+1)}_i =
\begin{cases}
\tilde{z}^{(k+1)}_i & \text{if } i = i_k, \\
(1 - \beta_k)\, z^{(k)}_i + \beta_k\, y^{(k)}_i & \text{if } i \ne i_k.
\end{cases} \qquad (10)
\]

Notice that from (5), (6), (7) and (8) we have

\[
x^{(k+1)} = y^{(k)} + n \alpha_k \big( z^{(k+1)} - (1 - \beta_k) z^{(k)} - \beta_k y^{(k)} \big), \qquad (11)
\]

which together with (10) yields

\[
x^{(k+1)}_i =
\begin{cases}
y^{(k)}_i + n \alpha_k \big( \tilde{z}^{(k+1)}_i - z^{(k)}_i \big) + \frac{\mu}{n} \big( z^{(k)}_i - y^{(k)}_i \big) & \text{if } i = i_k, \\
y^{(k)}_i & \text{if } i \ne i_k.
\end{cases}
\]

That is, in Step 4 we only need to update the block coordinate $x^{(k+1)}_{i_k}$ and set the rest to be $y^{(k)}_i$.

We now state a theorem concerning the expected rate of convergence of the APCG method, whose proof can be found in the full report [9].

Theorem 1. Suppose Assumptions 1 and 2 hold. Let $F^\star$ be the optimal value of problem (1), and let $\{x^{(k)}\}$ be the sequence generated by the APCG method. Then, for any $k \ge 0$,

\[
\mathbb{E}_{\xi_{k-1}}\big[ F(x^{(k)}) \big] - F^\star \;\le\; \min\Big\{ \Big(1 - \frac{\sqrt{\mu}}{n}\Big)^{k}, \; \Big( \frac{2n}{2n + k\sqrt{\gamma_0}} \Big)^{2} \Big\} \Big( F(x^{(0)}) - F^\star + \frac{\gamma_0}{2} R_0^2 \Big),
\]

where

\[
R_0 \overset{\text{def}}{=} \min_{x^\star \in X^\star} \| x^{(0)} - x^\star \|_L, \qquad (12)
\]

and $X^\star$ is the set of optimal solutions of problem (1).
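To make the update rules (5)-(8) concrete, here is a minimal sketch of Algorithm 1 on a toy composite problem, $f(x) = \tfrac{1}{2}\|Ax - b\|^2 + \tfrac{\sigma}{2}\|x\|^2$ with $\Psi(x) = \lambda\|x\|_1$ and single-coordinate blocks. The problem instance, the values of $\sigma$ and $\lambda$, and the soft-thresholding proximal step are assumptions made only for this illustration; note that this naive version still forms full vectors at every iteration, which is exactly the inefficiency addressed in Section 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
sigma, lam = 0.1, 0.05                       # assumed problem parameters

L = np.sum(A * A, axis=0) + sigma            # block Lipschitz constants L_i
mu = sigma / L.max()                         # convexity parameter w.r.t. ||.||_L

def soft(t, tau):                            # prox of tau*|.|
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

x = np.zeros(n)
z = x.copy()
gamma = mu                                   # any gamma_0 in [mu, 1]

for k in range(2000):
    # Step 1: alpha_k solves n^2*alpha^2 = (1 - alpha)*gamma + alpha*mu, eq. (5)-(6).
    c1 = gamma - mu
    alpha = (-c1 + np.sqrt(c1 * c1 + 4.0 * n * n * gamma)) / (2.0 * n * n)
    gamma_next = (1.0 - alpha) * gamma + alpha * mu
    beta = alpha * mu / gamma_next
    # Step 2: y as a convex combination of x and z, eq. (7).
    y = (alpha * gamma * z + gamma_next * x) / (alpha * gamma + gamma_next)
    # Step 3: random coordinate, proximal coordinate step (soft-thresholding).
    i = rng.integers(n)
    g_i = A[:, i] @ (A @ y - b) + sigma * y[i]
    z_new = (1.0 - beta) * z + beta * y      # untouched blocks, eq. (10)
    w_i = z_new[i]
    z_new[i] = soft(w_i - g_i / (n * alpha * L[i]), lam / (n * alpha * L[i]))
    # Step 4: eq. (8).
    x = y + n * alpha * (z_new - z) + (mu / n) * (z - y)
    z, gamma = z_new, gamma_next

obj = 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * sigma * x @ x + lam * np.sum(np.abs(x))
print("objective:", obj)
```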

Our result in Theorem 1 improves upon the convergence rates of the proximal coordinate gradient methods [14, 10], which have convergence rates on the order of

\[
O\Big( \min\Big\{ \Big(1 - \frac{\mu}{n}\Big)^{k}, \; \frac{n}{n + k} \Big\} \Big).
\]

For $n = 1$, our result matches exactly that of the accelerated full gradient method [12, Section 2.2].

2.1 Two special cases

Here we give two simplified versions of the APCG method, for the special cases $\mu = 0$ and $\mu > 0$, respectively. Algorithm 2 shows the simplified version for $\mu = 0$, which can be applied to problems without strong convexity, or when the convexity parameter $\mu$ is unknown.

Algorithm 2: APCG with $\mu = 0$

Input: $x^{(0)} \in \mathrm{dom}\,\Psi$.
Initialize: set $z^{(0)} = x^{(0)}$ and choose $\alpha_0 \in (0, \tfrac{1}{n}]$.
Iterate: repeat for $k = 0, 1, 2, \ldots$

1. Compute $y^{(k)} = (1 - \alpha_k)\, x^{(k)} + \alpha_k z^{(k)}$.

2. Choose $i_k \in \{1, \ldots, n\}$ uniformly at random and compute
\[
z^{(k+1)}_{i_k} = \operatorname*{arg\,min}_{x \in \mathbb{R}^{N_{i_k}}} \Big\{ \frac{n \alpha_k L_{i_k}}{2} \big\| x - z^{(k)}_{i_k} \big\|^2 + \langle \nabla_{i_k} f(y^{(k)}), x - y^{(k)}_{i_k} \rangle + \Psi_{i_k}(x) \Big\},
\]
and set $z^{(k+1)}_i = z^{(k)}_i$ for all $i \ne i_k$.

3. Set $x^{(k+1)} = y^{(k)} + n \alpha_k (z^{(k+1)} - z^{(k)})$.

4. Compute $\alpha_{k+1} = \tfrac{1}{2}\big( \sqrt{\alpha_k^4 + 4 \alpha_k^2} - \alpha_k^2 \big)$.

According to Theorem 1, Algorithm 2 has an accelerated sublinear convergence rate, that is,

\[
\mathbb{E}_{\xi_{k-1}}\big[ F(x^{(k)}) \big] - F^\star \;\le\; \Big( \frac{2}{2 + k \alpha_0} \Big)^{2} \Big( F(x^{(0)}) - F^\star + \frac{n^2 \alpha_0^2}{2} R_0^2 \Big).
\]

With the choice $\alpha_0 = 1/n$, Algorithm 2 reduces to the APPROX method [4] with a single block update at each iteration (i.e., $\tau = 1$ in their Algorithm 1).

For the strongly convex case with $\mu > 0$, we can initialize Algorithm 1 with the parameter $\gamma_0 = \mu$, which implies $\gamma_k = \mu$ and $\alpha_k = \beta_k = \frac{\sqrt{\mu}}{n}$ for all $k \ge 0$. This results in Algorithm 3.

Algorithm 3: APCG with $\gamma_0 = \mu > 0$

Input: $x^{(0)} \in \mathrm{dom}\,\Psi$ and convexity parameter $\mu > 0$.
Initialize: set $z^{(0)} = x^{(0)}$ and $\alpha = \frac{\sqrt{\mu}}{n}$.
Iterate: repeat for $k = 0, 1, 2, \ldots$

1. Compute $y^{(k)} = \dfrac{x^{(k)} + \alpha z^{(k)}}{1 + \alpha}$.

2. Choose $i_k \in \{1, \ldots, n\}$ uniformly at random and compute
\[
z^{(k+1)} = \operatorname*{arg\,min}_{x \in \mathbb{R}^N} \Big\{ \frac{n \alpha}{2} \big\| x - (1 - \alpha) z^{(k)} - \alpha y^{(k)} \big\|_L^2 + \langle \nabla_{i_k} f(y^{(k)}), x_{i_k} - y^{(k)}_{i_k} \rangle + \Psi_{i_k}(x_{i_k}) \Big\}.
\]

3. Set $x^{(k+1)} = y^{(k)} + n \alpha\, (z^{(k+1)} - z^{(k)}) + n \alpha^2 (z^{(k)} - y^{(k)})$.

As a direct corollary of Theorem 1, Algorithm 3 enjoys an accelerated linear convergence rate:

\[
\mathbb{E}_{\xi_{k-1}}\big[ F(x^{(k)}) \big] - F^\star \;\le\; \Big(1 - \frac{\sqrt{\mu}}{n}\Big)^{k} \Big( F(x^{(0)}) - F^\star + \frac{\mu}{2} R_0^2 \Big).
\]

To the best of our knowledge, this is the first time such an accelerated rate has been obtained for solving the general problem (1) with strong convexity using coordinate descent type methods.
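The reduction from Algorithm 1 to Algorithm 3 can be checked numerically: with $\gamma_0 = \mu$, equation (5) becomes $n^2 \alpha_k^2 = \mu$, so the parameter sequence stops changing. A small check with illustrative values of $n$ and $\mu$ (assumptions for this sketch only):

```python
import numpy as np

# With gamma_0 = mu, the recursion (5)-(6) gives alpha_k = beta_k = sqrt(mu)/n
# and gamma_k = mu for every k, which is exactly the setting of Algorithm 3.
n, mu = 8, 0.3
gamma = mu
for k in range(5):
    c1 = gamma - mu
    alpha = (-c1 + np.sqrt(c1 ** 2 + 4 * n ** 2 * gamma)) / (2 * n ** 2)
    gamma = (1 - alpha) * gamma + alpha * mu
    beta = alpha * mu / gamma
    print(k, alpha, beta, gamma)        # alpha = beta = sqrt(mu)/n, gamma = mu

print("sqrt(mu)/n =", np.sqrt(mu) / n)
```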

2.2 Efficient implementation

The APCG methods presented so far all need to perform full-dimensional vector operations at each iteration. For example, $y^{(k)}$ is updated as a convex combination of $x^{(k)}$ and $z^{(k)}$, and this can be very costly since in general these are dense vectors. Moreover, for the strongly convex case (Algorithms 1 and 3), all blocks of $z^{(k+1)}$ need to be updated at each iteration, although only the $i_k$-th block needs to compute the partial gradient and perform a proximal mapping. These full-dimensional vector updates cost $O(N)$ operations per iteration and may cause the overall computational cost of APCG to be even higher than that of the full gradient methods (see the discussions in [13]).

In order to avoid full-dimensional vector operations, Lee and Sidford [8] proposed a change-of-variables scheme for accelerated coordinate descent methods for unconstrained smooth minimization. Fercoq and Richtárik [4] devised a similar scheme for efficient implementation in the $\mu = 0$ case for composite minimization. Here we show that such a scheme can also be developed for the case $\mu > 0$ in the composite optimization setting. For simplicity, we only present an equivalent implementation of the simplified APCG method described in Algorithm 3.

Algorithm 4: Efficient implementation of APCG with $\gamma_0 = \mu > 0$

Input: $x^{(0)} \in \mathrm{dom}\,\Psi$ and convexity parameter $\mu > 0$.
Initialize: set $\alpha = \frac{\sqrt{\mu}}{n}$ and $\rho = \frac{1 - \alpha}{1 + \alpha}$, and initialize $u^{(0)} = 0$ and $v^{(0)} = x^{(0)}$.
Iterate: repeat for $k = 0, 1, 2, \ldots$

1. Choose $i_k \in \{1, \ldots, n\}$ uniformly at random and compute
\[
\Delta^{(k)}_{i_k} = \operatorname*{arg\,min}_{\Delta \in \mathbb{R}^{N_{i_k}}} \Big\{ \frac{n \alpha L_{i_k}}{2} \|\Delta\|^2 + \langle \nabla_{i_k} f(\rho^{k+1} u^{(k)} + v^{(k)}), \Delta \rangle + \Psi_{i_k}\big( -\rho^{k+1} u^{(k)}_{i_k} + v^{(k)}_{i_k} + \Delta \big) \Big\}.
\]

2. Let $u^{(k+1)} = u^{(k)}$ and $v^{(k+1)} = v^{(k)}$, and update
\[
u^{(k+1)}_{i_k} = u^{(k)}_{i_k} - \frac{1 - n\alpha}{2 \rho^{k+1}} \Delta^{(k)}_{i_k}, \qquad v^{(k+1)}_{i_k} = v^{(k)}_{i_k} + \frac{1 + n\alpha}{2} \Delta^{(k)}_{i_k}. \qquad (13)
\]

Output: $x^{(k+1)} = \rho^{k+1} u^{(k+1)} + v^{(k+1)}$.

The following proposition is proved in the full report [9].

Proposition 1. The iterates of Algorithm 3 and Algorithm 4 satisfy the following relationships:
\[
x^{(k)} = \rho^{k} u^{(k)} + v^{(k)}, \qquad y^{(k)} = \rho^{k+1} u^{(k)} + v^{(k)}, \qquad z^{(k)} = -\rho^{k} u^{(k)} + v^{(k)}. \qquad (14)
\]

We note that in Algorithm 4 only a single block coordinate of the vectors $u^{(k)}$ and $v^{(k)}$ is updated at each iteration, which costs $O(N_{i_k})$. However, computing the partial gradient $\nabla_{i_k} f(\rho^{k+1} u^{(k)} + v^{(k)})$ may still cost $O(N)$ in general. In the next section, we show how to further exploit the structure of many ERM problems to completely avoid full-dimensional vector operations.

3 Application to regularized empirical risk minimization (ERM)

Let $A_1, \ldots, A_n$ be vectors in $\mathbb{R}^d$, let $\phi_1, \ldots, \phi_n$ be a sequence of convex functions defined on $\mathbb{R}$, and let $g$ be a convex function on $\mathbb{R}^d$. Regularized ERM aims to solve the following problem:

\[
\text{minimize}_{w \in \mathbb{R}^d} \; P(w), \qquad \text{with} \quad P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(A_i^T w) + \lambda\, g(w),
\]

where $\lambda > 0$ is a regularization parameter. For example, given a label $b_i \in \{\pm 1\}$ for each vector $A_i$, $i = 1, \ldots, n$, we obtain the linear SVM problem by setting $\phi_i(z) = \max\{0, 1 - b_i z\}$ and $g(w) = \tfrac{1}{2}\|w\|_2^2$. Regularized logistic regression is obtained by setting $\phi_i(z) = \log(1 + \exp(-b_i z))$. This formulation also includes regression problems. For example, ridge regression is obtained by setting $\phi_i(z) = \tfrac{1}{2}(z - b_i)^2$ and $g(w) = \tfrac{1}{2}\|w\|_2^2$, and we get the Lasso if $g(w) = \|w\|_1$.
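As a small illustration of the ERM objective above, the following sketch evaluates $P(w)$ for the hinge, logistic and squared losses; the synthetic data and the value of $\lambda$ are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
A = rng.standard_normal((d, n))                  # columns are the vectors A_i
b = rng.choice([-1.0, 1.0], size=n)              # labels (responses for ridge)
lam = 0.01                                       # assumed regularization parameter

def P(w, phi, g):
    # P(w) = (1/n) sum_i phi_i(A_i^T w) + lam * g(w)
    return np.mean([phi(A[:, i] @ w, b[i]) for i in range(n)]) + lam * g(w)

hinge    = lambda z, bi: max(0.0, 1.0 - bi * z)
logistic = lambda z, bi: np.log1p(np.exp(-bi * z))
square   = lambda z, bi: 0.5 * (z - bi) ** 2
l2       = lambda w: 0.5 * w @ w
l1       = lambda w: np.sum(np.abs(w))

w = rng.standard_normal(d)
print(P(w, hinge, l2), P(w, logistic, l2), P(w, square, l2), P(w, square, l1))
```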

Let $\phi_i^*$ be the convex conjugate of $\phi_i$, that is, $\phi_i^*(u) = \max_{z \in \mathbb{R}} \{ zu - \phi_i(z) \}$. The dual of the regularized ERM problem is (see, e.g., [19])

\[
\text{maximize}_{x \in \mathbb{R}^n} \; D(x), \qquad \text{with} \quad D(x) = \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-x_i) - \lambda\, g^*\Big( \frac{1}{\lambda n} A x \Big),
\]

where $A = [A_1, \ldots, A_n]$. This is equivalent to minimizing $F(x) \overset{\text{def}}{=} -D(x)$, that is,

\[
\text{minimize}_{x \in \mathbb{R}^n} \; F(x) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \phi_i^*(-x_i) + \lambda\, g^*\Big( \frac{1}{\lambda n} A x \Big).
\]

The structure of $F(x)$ above matches the formulation (1) with

\[
f(x) = \lambda\, g^*\Big( \frac{1}{\lambda n} A x \Big), \qquad \Psi_i(x_i) = \frac{1}{n}\, \phi_i^*(-x_i),
\]

and we can apply the APCG method to minimize $F(x)$. In order to exploit the fast linear convergence rate, we make the following assumption.

Assumption 3. Each function $\phi_i$ is $1/\gamma$-smooth, and the function $g$ has convexity parameter 1. (Here we slightly abuse notation by overloading $\gamma$, which also appeared in Algorithm 1; in this section it solely represents the inverse smoothness parameter of $\phi_i$.)

Assumption 3 implies that each $\phi_i^*$ has strong convexity parameter $\gamma$ (with respect to the local Euclidean norm), and that $g^*$ is differentiable with $\nabla g^*$ having Lipschitz constant 1. In the following, we split the function $F(x) = f(x) + \Psi(x)$ by relocating the strong convexity term as follows:

\[
f(x) = \lambda\, g^*\Big( \frac{1}{\lambda n} A x \Big) + \frac{\gamma}{2n} \|x\|^2, \qquad \Psi_i(x_i) = \frac{1}{n}\, \phi_i^*(-x_i) - \frac{\gamma}{2n} \|x_i\|^2. \qquad (15)
\]

As a result, the function $f$ is strongly convex and each $\Psi_i$ is still convex. Now we can apply the APCG method to minimize $F(x) = -D(x)$ and obtain the following guarantee.

Theorem 2. Suppose Assumption 3 holds and $\|A_i\| \le R$ for all $i = 1, \ldots, n$. In order to obtain an expected dual optimality gap $\mathbb{E}[D^\star - D(x^{(k)})] \le \epsilon$ using the APCG method, it suffices to have

\[
k \;\ge\; \Big( n + \sqrt{\frac{n R^2}{\lambda \gamma}} \Big) \log(C/\epsilon), \qquad (16)
\]

where $D^\star = \max_{x \in \mathbb{R}^n} D(x)$ and the constant $C = D^\star - D(x^{(0)}) + \frac{\gamma}{2n} \|x^{(0)} - x^\star\|^2$.

Proof. The function $f(x)$ in (15) has coordinate Lipschitz constants

\[
L_i = \frac{\|A_i\|^2}{\lambda n^2} + \frac{\gamma}{n} \;\le\; \frac{R^2 + \lambda n \gamma}{\lambda n^2},
\]

and convexity parameter $\frac{\gamma}{n}$ with respect to the unweighted Euclidean norm. The strong convexity parameter of $f(x)$ with respect to the norm $\|\cdot\|_L$ defined in (3) is therefore

\[
\mu = \frac{\gamma / n}{(R^2 + \lambda n \gamma)/(\lambda n^2)} = \frac{\lambda n \gamma}{R^2 + \lambda n \gamma}.
\]

According to Theorem 1, we have $\mathbb{E}[D^\star - D(x^{(k)})] \le \big(1 - \frac{\sqrt{\mu}}{n}\big)^{k} C \le \exp\big(-k \frac{\sqrt{\mu}}{n}\big) C$. Therefore, it suffices to have the number of iterations $k$ larger than

\[
\frac{n}{\sqrt{\mu}} \log(C/\epsilon) = \sqrt{\frac{n^2 (R^2 + \lambda n \gamma)}{\lambda n \gamma}} \log(C/\epsilon) = \sqrt{n^2 + \frac{n R^2}{\lambda \gamma}}\, \log(C/\epsilon) \;\le\; \Big( n + \sqrt{\frac{n R^2}{\lambda \gamma}} \Big) \log(C/\epsilon).
\]

This finishes the proof.

Several state-of-the-art algorithms for ERM, including SDCA [19], SAG [15, 17] and SVRG [7, 23], obtain the iteration complexity

\[
O\Big( \Big( n + \frac{R^2}{\lambda \gamma} \Big) \log(1/\epsilon) \Big). \qquad (17)
\]

We note that our result in (16) can be much better for ill-conditioned problems, i.e., when the condition number $\frac{R^2}{\lambda \gamma}$ is larger than $n$. This is also confirmed by our numerical experiments in Section 4.

The complexity bound (17) for the aforementioned work is for minimizing the primal objective $P(w)$ or the duality gap $P(w) - D(x)$, while our result in Theorem 2 is in terms of the dual optimality. In the full report [9], we show that the same guarantee on accelerated primal-dual convergence can be obtained by our method with an extra primal gradient step, without affecting the overall complexity. The experiments in Section 4 illustrate the superior performance of our algorithm in reducing the primal objective value, even without performing the extra step.
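To see how the two bounds compare, the following back-of-the-envelope computation uses made-up but representative values ($n = 10^5$, $R = 1$, $\lambda = 10^{-7}$, $\gamma = 1$; assumptions for this sketch only) and ignores the common $\log(1/\epsilon)$ factor.

```python
# Comparing bound (16) for APCG against bound (17) for SDCA/SAG/SVRG
# when the condition number R^2/(lambda*gamma) far exceeds n.
n, R, lam, gamma = 1e5, 1.0, 1e-7, 1.0
cond = R ** 2 / (lam * gamma)               # condition number = 1e7 >> n
apcg = n + (n * cond) ** 0.5                # bound (16): about 1.1e6
sdca = n + cond                             # bound (17): about 1.0e7
print(f"condition number = {cond:.1e}, APCG ~ {apcg:.2e}, SDCA-type ~ {sdca:.2e}")
```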

We note that Shalev-Shwartz and Zhang [20] recently developed an accelerated SDCA method which achieves the same complexity $O\big( (n + \sqrt{n R^2/(\lambda \gamma)}) \log(1/\epsilon) \big)$ as our method. Their method calls the SDCA method within a full-dimensional accelerated gradient method in an inner-outer iteration procedure. In contrast, our APCG method is a straightforward single-loop coordinate gradient method.

3.1 Implementation details

Here we show how to exploit the structure of the regularized ERM problem to efficiently compute the coordinate gradient $\nabla_{i_k} f(y^{(k)})$ and totally avoid full-dimensional updates in Algorithm 4. We focus on the special case $g(w) = \tfrac{1}{2}\|w\|^2$ and show how to compute $\nabla_{i_k} f(y^{(k)})$. According to (15),

\[
\nabla_{i_k} f(y^{(k)}) = \frac{1}{\lambda n^2} A_{i_k}^T \big(A y^{(k)}\big) + \frac{\gamma}{n} y^{(k)}_{i_k}.
\]

Since we do not form $y^{(k)}$ in Algorithm 4, we update $A y^{(k)}$ by storing and updating two vectors in $\mathbb{R}^d$: $p^{(k)} = A u^{(k)}$ and $q^{(k)} = A v^{(k)}$. The resulting method is detailed in Algorithm 5.

Algorithm 5: APCG for solving the dual ERM problem

Input: $x^{(0)} \in \mathrm{dom}\,\Psi$ and convexity parameter $\mu > 0$.
Initialize: set $\alpha = \frac{\sqrt{\mu}}{n}$ and $\rho = \frac{1-\alpha}{1+\alpha}$, and let $u^{(0)} = 0$, $v^{(0)} = x^{(0)}$, $p^{(0)} = 0$ and $q^{(0)} = A x^{(0)}$.
Iterate: repeat for $k = 0, 1, 2, \ldots$

1. Choose $i_k \in \{1, \ldots, n\}$ uniformly at random and compute the coordinate gradient
\[
\nabla^{(k)}_{i_k} = \frac{1}{\lambda n^2} \big( \rho^{k+1} A_{i_k}^T p^{(k)} + A_{i_k}^T q^{(k)} \big) + \frac{\gamma}{n} \big( \rho^{k+1} u^{(k)}_{i_k} + v^{(k)}_{i_k} \big).
\]

2. Compute the coordinate increment
\[
\Delta^{(k)}_{i_k} = \operatorname*{arg\,min}_{\Delta \in \mathbb{R}^{N_{i_k}}} \Big\{ \frac{\alpha \big(\|A_{i_k}\|^2 + \lambda n \gamma\big)}{2 \lambda n} \|\Delta\|^2 + \langle \nabla^{(k)}_{i_k}, \Delta \rangle + \Psi_{i_k}\big( -\rho^{k+1} u^{(k)}_{i_k} + v^{(k)}_{i_k} + \Delta \big) \Big\},
\]
where $\Psi_{i_k}$ is given in (15) and the quadratic coefficient equals $n \alpha L_{i_k}/2$.

3. Let $u^{(k+1)} = u^{(k)}$ and $v^{(k+1)} = v^{(k)}$, and update
\[
u^{(k+1)}_{i_k} = u^{(k)}_{i_k} - \frac{1 - n\alpha}{2 \rho^{k+1}} \Delta^{(k)}_{i_k}, \qquad
p^{(k+1)} = p^{(k)} - \frac{1 - n\alpha}{2 \rho^{k+1}} A_{i_k} \Delta^{(k)}_{i_k},
\]
\[
v^{(k+1)}_{i_k} = v^{(k)}_{i_k} + \frac{1 + n\alpha}{2} \Delta^{(k)}_{i_k}, \qquad
q^{(k+1)} = q^{(k)} + \frac{1 + n\alpha}{2} A_{i_k} \Delta^{(k)}_{i_k}. \qquad (18)
\]

Output: approximate primal and dual solutions
\[
w^{(k+1)} = \frac{1}{\lambda n} \big( \rho^{k+2} p^{(k+1)} + q^{(k+1)} \big), \qquad
x^{(k+1)} = \rho^{k+1} u^{(k+1)} + v^{(k+1)}.
\]

Each iteration of Algorithm 5 only involves the two inner products $A_{i_k}^T p^{(k)}$ and $A_{i_k}^T q^{(k)}$ in computing $\nabla^{(k)}_{i_k}$, and the two vector additions in (18). They all cost $O(d)$ rather than $O(n)$. When the $A_i$'s are sparse (the case in most large-scale problems), these operations can be carried out very efficiently. Basically, each iteration of Algorithm 5 costs only twice as much as that of SDCA [6, 19].

4 Experiments

In our experiments, we solve ERM problems with the smoothed hinge loss for binary classification. That is, we pre-multiply each feature vector $A_i$ by its label $b_i \in \{\pm 1\}$ and use the loss function

\[
\phi_i(a) =
\begin{cases}
0 & \text{if } a \ge 1, \\
1 - a - \frac{\gamma}{2} & \text{if } a \le 1 - \gamma, \\
\frac{1}{2\gamma}(1 - a)^2 & \text{otherwise}.
\end{cases}
\]

The conjugate function of $\phi_i$ is $\phi_i^*(b) = b + \frac{\gamma}{2} b^2$ if $b \in [-1, 0]$, and $\infty$ otherwise. Therefore we have

\[
\Psi_i(x_i) = \frac{1}{n} \phi_i^*(-x_i) - \frac{\gamma}{2n} x_i^2 =
\begin{cases}
-\frac{x_i}{n} & \text{if } x_i \in [0, 1], \\
\infty & \text{otherwise}.
\end{cases}
\]

The datasets used in our experiments are summarized in Table 1.
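Before turning to the results, here is a minimal sketch of Algorithm 5 specialized to the smoothed hinge loss above, with $g(w) = \tfrac{1}{2}\|w\|^2$. The synthetic data, the choices $\gamma = 1$, $\lambda = 10^{-3}$ and the number of passes are assumptions made for illustration only; the coordinate increment has a closed form because $\Psi_i(x_i) = -x_i/n$ on $[0, 1]$ (and $+\infty$ outside). The sketch also exercises the change-of-variables bookkeeping of Algorithm 4: per iteration it touches one coordinate of $u$, $v$ and the two $O(d)$ vectors $p$, $q$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 500
labels = rng.choice([-1.0, 1.0], size=n)
A = rng.standard_normal((d, n)) + 0.5 * labels   # columns are the features A_i,
A *= labels                                       # pre-multiplied by their labels
gamma, lam = 1.0, 1e-3                            # assumed smoothing and regularization

R2 = np.sum(A * A, axis=0)                        # ||A_i||^2
L = R2 / (lam * n ** 2) + gamma / n               # coordinate Lipschitz constants
mu = (gamma / n) / L.max()                        # strong convexity w.r.t. ||.||_L
alpha = np.sqrt(mu) / n
rho = (1.0 - alpha) / (1.0 + alpha)

u, v = np.zeros(n), np.zeros(n)                   # x^(0) = 0
p, q = np.zeros(d), np.zeros(d)                   # p = A u, q = A v

for k in range(20 * n):                           # a modest number of passes
    i = rng.integers(n)
    scale = rho ** (k + 1)
    # Step 1: coordinate gradient at y^(k), using A y^(k) = scale*p + q.
    g = (A[:, i] @ (scale * p + q)) / (lam * n ** 2) + (gamma / n) * (scale * u[i] + v[i])
    # Step 2: closed-form increment for Psi_i(t) = -t/n + indicator(0 <= t <= 1).
    s = -scale * u[i] + v[i]
    t = np.clip(s + (1.0 / n - g) / (n * alpha * L[i]), 0.0, 1.0)
    delta = t - s
    # Step 3: single-coordinate and O(d) updates, eq. (18).
    u[i] -= (1.0 - n * alpha) / (2.0 * scale) * delta
    p -= (1.0 - n * alpha) / (2.0 * scale) * delta * A[:, i]
    v[i] += (1.0 + n * alpha) / 2.0 * delta
    q += (1.0 + n * alpha) / 2.0 * delta * A[:, i]

w = (rho ** (k + 2) * p + q) / (lam * n)          # approximate primal solution
margins = A.T @ w
loss = np.where(margins >= 1, 0.0,
                np.where(margins <= 1 - gamma, 1 - margins - gamma / 2,
                         (1 - margins) ** 2 / (2 * gamma)))
print("primal objective:", loss.mean() + 0.5 * lam * w @ w)
```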

Figure 1: Comparison of the APCG method with SDCA and the accelerated full gradient method (AFG) with adaptive line search. In each plot, the vertical axis is the primal objective gap $P(w^{(k)}) - P^\star$, and the horizontal axis is the number of passes through the entire dataset. The three columns correspond to the three datasets (rcv1, covtype, news20), and each row corresponds to a particular value of the regularization parameter $\lambda$. (Plots not reproduced here.)

In our experiments, we compare the APCG method with SDCA and the accelerated full gradient method (AFG) [12], the latter with an additional line search procedure to improve efficiency. When the regularization parameter $\lambda$ is not too small, APCG performs similarly to SDCA, as predicted by our complexity results, and both outperform AFG by a substantial margin. Figure 1 shows the results in the ill-conditioned setting, with $\lambda$ decreasing down to $10^{-8}$. Here we see that APCG has superior performance in reducing the primal objective value compared with SDCA and AFG, even though our theory only gives the complexity for solving the dual ERM problem. AFG eventually catches up for the cases with very large condition numbers (see the plots for $\lambda = 10^{-8}$).

Table 1: Characteristics of the three binary classification datasets, available from the LIBSVM web page (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets).

  dataset    number of samples n    number of features d    sparsity
  rcv1       20,242                 47,236                  0.16%
  covtype    581,012                54                      22%
  news20     19,996                 1,355,191               0.04%

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[2] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037-2060, 2013.
[3] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9:1369-1398, 2008.
[4] O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. Manuscript, arXiv:1312.5799, 2013.
[5] M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo. Iteration complexity analysis of block coordinate descent methods. Manuscript, arXiv:1310.6957, 2013.
[6] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 408-415, 2008.
[7] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323, 2013.
[8] Y. T. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. arXiv:1305.1922, 2013.
[9] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical Report MSR-TR-2014-94, Microsoft Research, 2014. arXiv:1407.1296.
[10] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Accepted by Mathematical Programming, Series A, 2014. arXiv:1305.4723.
[11] Z.-Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.
[12] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.
[13] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[14] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1):1-38, 2014.
[15] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2672-2680, 2012.
[16] A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576-601, 2013.
[17] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Technical Report HAL 00860051, INRIA, Paris, France, 2013.
[18] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 929-936, Montreal, Canada, 2009.
[19] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567-599, 2013.
[20] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR W&CP 32(1):64-72, 2014.
[21] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 140:513-535, 2001.
[22] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Unpublished manuscript, 2008.
[23] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. Technical Report MSR-TR-2014-38, Microsoft Research, 2014. arXiv:1403.4699.