Distributed Accelerated Proximal Coordinate Gradient Methods


Yong Ren, Jun Zhu
Center for Bio-Inspired Computing Research
State Key Lab for Intell. Tech. & Systems
Dept. of Comp. Sci. & Tech., TNList Lab, Tsinghua University
renyong15@mails.tsinghua.edu.cn; dcszj@tsinghua.edu.cn
(Corresponding author: Jun Zhu)

Abstract

We develop a general accelerated proximal coordinate descent algorithm in distributed settings (DisAPCG) for the optimization problem that minimizes the sum of two convex functions: the first part $f$ is smooth with a gradient oracle, and the other one $\Psi$ is separable with respect to blocks of coordinates and has a simple known structure (e.g., the $L_1$ norm). Our algorithm gets a new accelerated convergence rate in the case that $f$ is strongly convex by making use of modern parallel structures, and includes the previous non-strongly convex case as a special case. We further present efficient implementations to avoid full-dimensional operations in each step, significantly reducing the computational cost. Experiments on the regularized empirical risk minimization problem demonstrate the effectiveness of our algorithm and match our theoretical findings.

1 Introduction

We consider the following optimization problem with a composite objective function:
$$\min_{x \in \mathbb{R}^N} F(x) := f(x) + \Psi(x), \quad (1)$$
where $f$ and $\Psi$ are proper and lower semi-continuous convex functions. We further assume that $f$ is differentiable on $\mathbb{R}^N$, and $\Psi$ has a simple blockwise separable structure $\Psi(x) = \sum_i \Psi_i(x_i)$, where $x_i$ is the $i$-th block of $x$ with cardinality $N_i$. This problem is ubiquitous in machine learning, where $f(x)$ denotes the loss and $\Psi$ represents some constraints or regularizations. Given a set $D$ of i.i.d. data, a typical loss function $f$ has the form
$$f(x) = \sum_{A_j \in D} l(x; A_j), \quad (2)$$
where $l$ is some smoothed loss function such as the smoothed hinge loss [Lin et al., 2014]. The choice of $\Psi$ depends on the requirements of certain problems, such as bound constraints (e.g., $\Psi_i(x) = 0$ for $x \in [0, 1]$ and $\infty$ otherwise) or regularizations for a special purpose (e.g., the $L_1$-regularizer for sparsity). $f(x)$ can be strongly convex or not. The strong convexity property usually means a faster (linear) convergence rate and hence is of interest in many cases (e.g., ridge regression). We aim to develop a general accelerated coordinate descent method to solve problem (1) under distributed settings, and to accelerate the convergence rate by making use of the parallel structure.

1.1 Related Work and Motivation

Coordinate descent (CD) methods are popular optimization algorithms for problems that can break down into small pieces, since they can usually exploit special structures underlying the problems. At each iteration $t$, the basic CD method chooses one block of coordinates $x_{i_t}$ to sufficiently reduce the objective value while keeping the other blocks fixed. There are two common strategies for choosing such a block: the cyclic scheme and the randomized scheme, where the former chooses $i_t$ in a cyclic fashion (i.e., $i_{t+1} = i_t \bmod n + 1$), while the latter chooses $i_t$ uniformly or via some distribution. The randomized scheme is more common since it enjoys both theoretical and practical benefits [Wright, 2015]. Due to their superior convergence properties, CD algorithms have been widely used in many practical problems, especially in regimes that need relatively highly accurate solutions [Wright, 2015]. To improve the performance of the basic CD algorithm, many variants have been proposed. Among them, Nesterov's acceleration technique [Nesterov, 1983], which is proven to be an optimal first-order (gradient) method in convergence rate, is an important line. Specially, Nesterov [2012] developed an accelerated randomized CD method for minimizing unconstrained smooth functions (i.e., with $\Psi(x) = 0$), which is the first one that applies acceleration techniques to CD methods. Later on, Lu and Xiao [2013] gave an improved version with a sharper convergence analysis.
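To make the composite setup in problem (1) concrete before moving on, here is a minimal single-machine sketch of one randomized proximal coordinate step, using a least-squares loss and the $L_1$ regularizer (whose proximal map is soft-thresholding). This illustrates the generic CD scheme discussed above, not the authors' distributed method; all names and constants are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * |.|: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def rcd_step(A, b, x, L, lam, rng):
    """One randomized proximal coordinate step for
    min_x 0.5 * ||Ax - b||^2 + lam * ||x||_1  (blocks of size 1)."""
    i = rng.integers(A.shape[1])          # sample a coordinate uniformly
    g_i = A[:, i] @ (A @ x - b)           # partial gradient grad_i f(x)
    # prox-gradient update on coordinate i with step size 1 / L_i
    x[i] = soft_threshold(x[i] - g_i / L[i], lam / L[i])
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
L = (A ** 2).sum(axis=0)                  # block Lipschitz constants ||A_i||^2
x = np.zeros(20)
for _ in range(2000):
    x = rcd_step(A, b, x, L, lam=0.1, rng=rng)
```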
For the more general problem (1), Fercoq and Richtarik [2015] proposed the APPROX algorithm for solving functions $f(x)$ without strong convexity, obtaining an accelerated sublinear convergence rate, and Lin et al. [2014] proposed a general APCG algorithm to obtain an accelerated convergence rate for $f(x)$ either with strong convexity or not. Another important line is to utilize parallel computing architectures to accelerate CD algorithms. Here we focus on the synchronous case, where multiple blocks of coordinates are sampled and gradients are then computed in each iteration, running in parallel on multiple processors.

After that, a barrier synchronization step is set to update the coordinates; see a series of recent works in parallel and distributed settings [Bradley et al., 2011; Necoara and Clipici, 2013; Richtarik and Takac, 2013a]. To meet the requirements of the big-data challenge, combining the acceleration technique with advanced parallel computing architectures seems natural to leverage their respective advantages. For the general problem (1), the APPROX algorithm is parallel in nature but deals with $f(x)$ without strong convexity. For a special case of (1) called the empirical risk minimization (ERM) problem, Shalev-Shwartz and Zhang [2014] proposed an inner-outer loop optimization scheme to obtain an accelerated convergence rate under a strong convexity assumption, where the inner loop is easily parallelized using existing techniques [Takac and Richtarik, 2015]. However, a parallel algorithm for problem (1) in the strongly convex case is still unexplored.

In this paper, we propose a new distributed accelerated coordinate gradient descent method (DisAPCG) to solve the general form (1) for $f(x)$ with or without strong convexity. Our algorithm lies in the line of Nesterov acceleration methods, with new, carefully modified Nesterov's sequences to adapt to distributed settings. Our algorithm includes APPROX as a special case when the function does not have a strong convexity assumption, where a sublinear convergence rate is obtained. In the strongly convex case, we obtain an accelerated linear convergence rate, thanks to the parallel structure. Furthermore, we propose an efficient implementation to avoid full-dimensional vector operations, which reduces the updating cost in each iteration. For practical use, we apply our algorithm to the ERM problem and find a significant improvement compared with several other distributed CD solvers.

1.2 Outline of The Paper

In Section 2, we introduce the notations and assumptions, present the general DisAPCG method, analyze its convergence rate with or without strong convexity, and show the gain by leveraging parallel structures. In Section 3, we present an equivalent version of the general algorithm to avoid full-dimensional manipulation in many cases. In Section 4, we apply our DisAPCG method to the widely studied ERM problem. Besides, we use experiments on a smoothed version of SVM problems to demonstrate the effectiveness of our algorithm. Finally, we conclude in Section 5.

2 The DisAPCG Method

2.1 Notations, Assumptions and Settings

For an $N$-dimensional vector $x \in \mathbb{R}^N$, let $\Omega = \{x_i \in \mathbb{R}^{N_i}\}$ denote a partition of the coordinates with $\sum_i N_i = N$. Let $U = [U_1, \ldots, U_n]$ be an $N \times N$ permutation matrix with $U_i \in \mathbb{R}^{N \times N_i}$. Then, we have $x = \sum_i U_i x_i$ and $x_i = U_i^\top x$, $i \in [n]$. Without loss of generality, we assume that $U$ is the identity matrix.

Our distributed setting is similar to [Richtarik and Takac, 2013a]. Suppose that the cluster consists of $K$ nodes and the blocks of coordinates (features) are uniformly distributed over the nodes (i.e., each node keeps and updates $n/K$ blocks of coordinates). For simplicity, we assume that $n$ is divisible by $K$. We denote by $S_k \subseteq \Omega$ the collection of blocks of coordinates that are distributed to the $k$-th node, and we have $\bigcup_{k=1}^K S_k = \Omega$. In the meanwhile, all the data describing the features in $S_k$ are stored on the $k$-th node. Following [Lin et al., 2014], for any $x \in \mathbb{R}^N$, the partial gradient of $f$ with respect to $x_i$ is defined as $\nabla_i f(x) = U_i^\top \nabla f(x)$, $i \in [n]$, where $[n] := \{1, \ldots, n\}$ denotes the set of integers from 1 to $n$. We make the following assumptions, which are commonly used in the literature on coordinate descent methods [Fercoq and Richtarik, 2015].

Assumption 1. The gradient of $f(x)$ is block-wise Lipschitz continuous with constants $L_i$:
$$\|\nabla_i f(x + U_i h_i) - \nabla_i f(x)\| \le L_i \|h_i\|, \quad \forall x \in \mathbb{R}^N,\ h_i \in \mathbb{R}^{N_i}.$$
For convenience, we denote by $\|x\|_L = \left(\sum_i L_i \|x_i\|^2\right)^{1/2}$ the $L_2$-norm blockwisely weighted by the coefficients $L_i$.

Assumption 2. There exists $\mu \ge 0$ such that for all $y \in \mathbb{R}^N$ and $x \in \mathrm{dom}(\Psi)$:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_L^2.$$
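As a quick numerical illustration of Assumption 1 (our own sketch, assuming a least-squares $f$, not taken from the paper): for $f(x) = \frac{1}{2}\|Ax - b\|^2$ with single-coordinate blocks, $\nabla_i f(x) = A_i^\top(Ax - b)$, the constants are $L_i = \|A_i\|^2$, and the weighted norm $\|x\|_L$ follows directly from them.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

L = (A ** 2).sum(axis=0)                 # L_i = ||A_i||^2 for f = 0.5*||Ax-b||^2

def norm_L(x, L):
    """Blockwise weighted norm ||x||_L = sqrt(sum_i L_i * ||x_i||^2)."""
    return np.sqrt(np.sum(L * x ** 2))

# numerical check of the blockwise Lipschitz bound on coordinate i
i, h = 3, 0.37
x = rng.standard_normal(10)
e = np.zeros(10)
e[i] = h
lhs = abs(A[:, i] @ (A @ (x + e) - b) - A[:, i] @ (A @ x - b))
assert lhs <= L[i] * abs(h) + 1e-9       # |grad_i f(x + U_i h) - grad_i f(x)| <= L_i |h|
```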
An immediate consequence of Assumption 1 is
$$f(x + U_i h_i) \le f(x) + \langle \nabla_i f(x), h_i \rangle + \frac{L_i}{2} \|h_i\|^2, \quad \forall h_i \in \mathbb{R}^{N_i},$$
which bounds the variation of $f(x)$ when a single block changes. In distributed settings, as more than one block varies in each iteration, we need the following lemma to bound the total variation of the function $f(x)$.

Lemma 1 ([Richtarik and Takac, 2013b]). Assume that $f$ satisfies Assumption 1. For all $x, h \in \mathbb{R}^N$, we have
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \frac{|\mathrm{Supp}(h)|}{2} \|h\|_L^2,$$
where $\mathrm{Supp}(h) := \{i \in [n] : h_i \ne 0\}$ is the set of blocks that are not equal to zero.

The above bound is tight in general. Suppose that in distributed settings we alter $\kappa$ blocks in each iteration and define $L^\kappa = \kappa L$. Then the bound gives that for $x, y \in \mathbb{R}^N$,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu_\kappa}{2} \|y - x\|_{L^\kappa}^2, \quad (3)$$
and Assumption 1 implies that
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \frac{1}{2} \|h\|_{L^\kappa}^2, \quad (4)$$
where $\mu_\kappa = \mu / \kappa$.

2.2 The DisAPCG Algorithm

The general DisAPCG algorithm is summarized in Alg. 1. We start $K$ processors and each processor $k$ runs Alg. 1 simultaneously. The algorithm maintains a non-increasing step size $\alpha_t$ and an intermediate variable $y^{(t)}$, similarly to the original Nesterov's method. At iteration $t$, each node samples multiple blocks of coordinates and computes the intermediate variable $z^{(t)}$ using a proximal gradient step, as in step 4.

Finally, the coordinates are updated in a synchronized manner in step 5. We make more comments on several key steps as follows.

Step 1: $S_k^{(t)}$ collects the indices of the coordinates to be updated on node $k$. The total number of updated coordinates in one iteration is $\kappa = \tau K$, where $\tau$ is the mini-batch size that every node samples at each iteration. Note that the sampling procedure is not equivalent to uniformly sampling $\kappa$ blocks of coordinates from all $n$, since they are stored locally. Our algorithm is essentially of a mini-batch manner. When $K = 1$, we can merge several small blocks into a larger one and use [Lin et al., 2014] directly (i.e., uniformly sample one large block at each iteration), and hence the mini-batch analysis is not necessary. However, such merging cannot be done in the distributed settings, since every node needs to sample its own data, not as a whole.

Step 2: The original Nesterov's sequence is computed as $n^2 \alpha_t^2 = (1 - \alpha_t)\gamma_t + \alpha_t \mu$. Our new sequence considers the influence introduced by multi-block alteration in one iteration. Similar properties hold as for the original ones.

Step 3: The Reduce operator is similar to MPI::AllReduce, where $y^{(t)}$ is first computed by gathering coordinates from all nodes and is then broadcast to all nodes. This is a crucial step since $y^{(t)}$ is of size $O(N)$ and hence introduces the largest part of the communication cost. $y^{(t)}$ is used to compute the gradient; however, one can further reduce the communication cost by exploiting the special structure of $f(x)$, as we shall see.

Step 5: The update step involves full-dimensional operations since $z^{(t+1)}$, $z^{(t)}$ and $y^{(t)}$ are dense in general, and a similar problem exists for the computation of $y^{(t)}$. This issue will be dealt with by proposing an efficient and equivalent algorithm in Section 3.

Algorithm 1: The DisAPCG algorithm.
Input: $x^{(0)} \in \mathrm{dom}(\Psi)$ and convexity parameter $\mu_\kappa \ge 0$.
Initialize: set $z^{(0)} = x^{(0)}$ and choose $0 < \gamma_0 \in [\mu_\kappa, 1]$. Start $K$ nodes with their corresponding data and coordinates.
for $t = 0, 1, 2, 3, \ldots$ do:
1. Uniformly sample $\tau$ blocks of coordinates $S_k^{(t)}$.
2. Compute $\alpha_t \in (0, \kappa/n]$ using the relation
$$\frac{n^2}{\kappa^2} \alpha_t^2 = (1 - \alpha_t)\gamma_t + \alpha_t \mu_\kappa,$$
and set $\gamma_{t+1} = (1 - \alpha_t)\gamma_t + \alpha_t \mu_\kappa$, $\beta_t = \frac{\alpha_t \mu_\kappa}{\gamma_{t+1}}$.
3. Reduce:
$$y^{(t)} = \frac{1}{\alpha_t \gamma_t + \gamma_{t+1}} \left( \alpha_t \gamma_t z^{(t)} + \gamma_{t+1} x^{(t)} \right) \quad (5)$$
4. For all indices $i \in S_k$, compute $z_i^{(t+1)}$ as
$$z_i^{(t+1)} = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^{N_i}} \left\{ \frac{n \alpha_t L_i}{2\kappa} \left\| x - (1 - \beta_t) z_i^{(t)} - \beta_t y_i^{(t)} \right\|^2 + \langle \nabla_i f(y^{(t)}), x \rangle + \Psi_i(x) \right\},$$
and for all indices $i \notin S_k$, compute $z_i^{(t+1)}$ as
$$z_i^{(t+1)} = (1 - \beta_t) z_i^{(t)} + \beta_t y_i^{(t)}.$$
5. Set
$$x^{(t+1)} = y^{(t)} + \frac{n}{\kappa} \alpha_t \left(z^{(t+1)} - z^{(t)}\right) + \frac{\kappa \mu_\kappa}{n} \left(z^{(t)} - y^{(t)}\right) \quad (6)$$
end for

2.3 Convergence Analysis

We analyze the convergence rate of Alg. 1. The main theorem is as follows. We defer the full proof to the appendix.

Theorem 1. Suppose that Assumptions 1 and 2 hold. Let $F^*$ be the optimal value of problem (1) and $\{x^{(t)}\}$ be the sequence generated by the DisAPCG method. Then for any $t \ge 0$, the following holds:
$$\mathbb{E}[F(x^{(t)})] - F^* \le \min\left\{ \left(1 - \frac{\sqrt{\kappa\mu}}{n}\right)^t, \left( \frac{2n}{2n + t\kappa\sqrt{\gamma_0}} \right)^2 \right\} \left( F(x^{(0)}) - F^* + \frac{\gamma_0 \kappa}{2} R_0^2 \right), \quad (7)$$
where $R_0 := \min_{x^* \in X^*} \|x^{(0)} - x^*\|_L$, and $X^*$ is the set of optimal solutions of problem (1).

The two terms $\left(1 - \frac{\sqrt{\kappa\mu}}{n}\right)^t$ and $\left(\frac{2n}{2n + t\kappa\sqrt{\gamma_0}}\right)^2$ correspond to the strongly convex and non-strongly convex cases respectively. For the non-strongly convex case, a slight change of the proof can remove the $\kappa$ in the last term of (7), and we then recover the convergence rate of [Fercoq and Richtarik, 2015] as a special case. For the strongly convex case, we get an accelerated linear convergence rate, as the following corollary shows:

Corollary 1. Suppose that the same conditions as in Theorem 1 hold and further assume that $f$ is $\mu$-strongly convex. In order to obtain $\mathbb{E}[F(x^{(t)})] - F^* \le \epsilon$, it suffices to have the iteration $t$ satisfy
$$t \ge \frac{n}{\sqrt{\kappa\mu}} \log \frac{C + D\kappa}{\epsilon},$$
where $C = F(x^{(0)}) - F^*$ and $D = \gamma_0 R_0^2 / 2$.

The previous single-machine version APCG provided in [Lin et al., 2014] needs $O(n/\sqrt{\mu})$ iterations to achieve $\epsilon$ accuracy, while our DisAPCG further accelerates this rate by a factor of $\sqrt{\kappa}$, omitting the log term. It remains an open problem whether we can accelerate this rate by $\kappa$.

Remark 1.
We have assumed that the coordinates (and their corresponding data) are uniformly distributed over the nodes. However, the computation power of each node may differ in practice, in which case stragglers may slow down the barrier synchronization. To avoid this, nodes can store different amounts of coordinates, depending on their computation power, and consequently, each node $k$ has its own mini-batch size $\tau_k$. The above theorem still holds as long as each block is sampled with equal probability.

3 Efficient Implementation

For the strongly convex case with $\mu > 0$, we can choose $\gamma_0 = \mu/\kappa$ to get a concise version. In this case, we have $\gamma_t = \mu/\kappa$ and consequently $\alpha_t = \beta_t = \frac{\sqrt{\kappa\mu}}{n}$.
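The following is a small sketch (ours, for illustration only) of the parameter recursion in step 2 of Alg. 1: it solves the quadratic $\frac{n^2}{\kappa^2}\alpha_t^2 = (1-\alpha_t)\gamma_t + \alpha_t\mu_\kappa$ for its positive root and checks numerically that the choice $\gamma_0 = \mu/\kappa$ is indeed a fixed point with $\alpha_t = \beta_t = \sqrt{\kappa\mu}/n$.

```python
import math

def apcg_params(gamma_t, mu_k, n, kappa):
    """Solve (n/kappa)^2 * a^2 + (gamma_t - mu_k) * a - gamma_t = 0 for a > 0,
    then form gamma_{t+1} and beta_t as in step 2 of Algorithm 1."""
    c2 = (n / kappa) ** 2
    c1 = gamma_t - mu_k
    a = (-c1 + math.sqrt(c1 ** 2 + 4.0 * c2 * gamma_t)) / (2.0 * c2)
    gamma_next = (1.0 - a) * gamma_t + a * mu_k
    beta = a * mu_k / gamma_next
    return a, gamma_next, beta

n, kappa, mu = 100, 8, 1e-3
mu_k = mu / kappa
gamma = mu_k                        # strongly convex choice gamma_0 = mu / kappa
for _ in range(5):
    alpha, gamma, beta = apcg_params(gamma, mu_k, n, kappa)
    # fixed point: alpha = beta = sqrt(kappa * mu) / n
    assert abs(alpha - math.sqrt(kappa * mu) / n) < 1e-12
    assert abs(beta - alpha) < 1e-12
```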

As mentioned in Section 2.2, a straightforward implementation of Alg. 1 requires full-dimensional operations on $x$ (i.e., $O(N)$), which is comparable to a full-gradient step. Lin et al. [2014] provided an equivalent version of their original algorithm, corresponding to the special single-machine case in our paper, with an efficient update step. Here we show that such a strategy can also be used in the distributed settings with some modifications. The overall algorithm is summarized in Alg. 2 and the equivalence assertion is made in Proposition 1. The proof can be found in the appendix.

Algorithm 2: DisAPCG without full-dimensional vector operations in the case $\mu_\kappa > 0$.
Input: $x^{(0)} \in \mathrm{dom}(\Psi)$ and convexity parameter $\mu_\kappa > 0$.
Initialize: set $u^{(0)} = 0$, $v^{(0)} = x^{(0)}$, $\alpha = \frac{\sqrt{\kappa\mu}}{n}$, $\rho = \frac{1-\alpha}{1+\alpha}$. Start $K$ nodes with their corresponding data and coordinates.
for $t = 0, 1, 2, 3, \ldots$ do:
1. Reduce: $\rho^{t+1} u^{(t)} + v^{(t)}$.
2. Uniformly sample $\tau$ blocks of coordinates $S_k^{(t)}$.
3. For all $i \in S_k^{(t)}$, compute $h_i^{(t)}$ as
$$h_i^{(t)} = \mathop{\mathrm{argmin}}_{h \in \mathbb{R}^{N_i}} \left\{ \frac{n\alpha L_i}{2\kappa} \|h\|^2 + \langle \nabla_i f(\rho^{t+1} u^{(t)} + v^{(t)}), h \rangle + \Psi_i\left(-\rho^{t+1} u_i^{(t)} + v_i^{(t)} + h\right) \right\}.$$
4. Let $u^{(t+1)} = u^{(t)}$, $v^{(t+1)} = v^{(t)}$ and update, for $i \in S^{(t)}$,
$$u_i^{(t+1)} = u_i^{(t)} - \frac{1 - \frac{n}{\kappa}\alpha}{2\rho^{t+1}}\, h_i^{(t)}, \qquad v_i^{(t+1)} = v_i^{(t)} + \frac{1 + \frac{n}{\kappa}\alpha}{2}\, h_i^{(t)}.$$
end for
Output: $x^{(t)} = \rho^t u^{(t)} + v^{(t)}$.

Proposition 1. Algorithm 2 and Algorithm 1 are equivalent, with
$$x^{(t)} = \rho^t u^{(t)} + v^{(t)}, \quad y^{(t)} = \rho^{t+1} u^{(t)} + v^{(t)}, \quad z^{(t)} = -\rho^t u^{(t)} + v^{(t)}, \quad \text{for all } t \ge 0.$$

As we can see, the updating step in Alg. 2 avoids full-dimensional operations. However, the reduce step $\rho^{t+1} u^{(t)} + v^{(t)}$ still needs $O(N)$ computation cost in general. We can further exploit the structure of certain problems to avoid it, as we shall see for the problem of ERM.

4 Application to the Primal-Dual ERM Problem

4.1 Primal and Dual ERM Problems

The ERM problem arises ubiquitously in supervised machine learning applications. Let $A_1, \ldots, A_n$ be vectors in $\mathbb{R}^d$, $\phi_1, \ldots, \phi_n$ be a sequence of convex functions on $\mathbb{R}$, and $g$ be a convex function on $\mathbb{R}^d$. The regularized ERM problem is defined as follows:
$$\mathop{\mathrm{argmin}}_{w \in \mathbb{R}^d} \left\{ P(w) = \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right\}.$$
The dual of the above problem is
$$\mathop{\mathrm{argmax}}_{x \in \mathbb{R}^n} \left\{ D(x) = \frac{1}{n} \sum_{i=1}^n -\phi_i^*(-x_i) - \lambda g^*\left(\frac{1}{\lambda n} A x\right) \right\},$$
where $A = [A_1, \ldots, A_n]$, and $\phi_i^*$, $g^*$ are the conjugate functions of $\phi_i$, $g$ respectively. Notice that in the above problem, the sample size $n$ is the dimensionality when we consider the dual problem. We focus on the strongly convex case, where we need the following assumption, which is also standard in the literature on solving the primal and dual ERM problems.

Assumption 3. Each function $\phi_i$ is $1/\gamma$-smooth and the function $g$ is strongly convex with parameter 1.

The above assumption implies that $\phi_i^*$ is $\gamma$-strongly convex and $g^*$ is continuously differentiable. The structure of $D(x)$ matches problem (1) with the equivalent form $F(x) = -D(x)$, where $f(x) = \lambda g^*\left(\frac{1}{\lambda n} A x\right)$ and $\Psi(x) = \frac{1}{n} \sum_i \phi_i^*(-x_i)$. In order to match the linear convergence rate assumption, we re-locate the strong convexity of $\phi_i^*$ and get the final optimization problem as $\mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} F(x)$, where
$$F(x) = \underbrace{\lambda g^*\left(\frac{1}{\lambda n} A x\right) + \frac{\gamma}{2n} \|x\|^2}_{f(x)} + \underbrace{\frac{1}{n} \sum_i \left( \phi_i^*(-x_i) - \frac{\gamma}{2} x_i^2 \right)}_{\Psi(x)}.$$

We focus on the special case $g(w) = \frac{1}{2} \|w\|^2$. This special case implies that we can efficiently compute the partial coordinate gradients, and it is the most commonly used regularization term in ERM problems [Ma et al., 2015]. In this case, we have
$$\nabla_i f(y^{(t)}) = \frac{1}{\lambda n^2} A_i^\top (A y^{(t)}) + \frac{\gamma}{n} y_i^{(t)}. \quad (8)$$
Besides, we can determine an upper bound for the Lipschitz constant, $L_i \le \frac{R^2 + \lambda\gamma n}{\lambda n^2}$, where $R = \max_i \|A_i\|$, $i \in [n]$, and a lower bound for the strong convexity parameter of $f(x)$ with respect to $\|\cdot\|_L$ as $\mu \ge \frac{\lambda\gamma n}{R^2 + \lambda\gamma n}$. Details can be found in the appendix.

4.2 Numerical Experiments

We consider minimizing the smoothed hinge loss problem, in order to satisfy the $1/\gamma$-smoothness condition. Precisely, we have
$$\phi_i(a) = \begin{cases} 0 & \text{if } a \ge 1, \\ 1 - a - \frac{\gamma}{2} & \text{if } a \le 1 - \gamma, \\ \frac{1}{2\gamma}(1 - a)^2 & \text{otherwise.} \end{cases}$$
The conjugate function of $\phi_i$ is then as follows:
$$\phi_i^*(b) = \begin{cases} b + \frac{\gamma}{2} b^2 & \text{if } b \in [-1, 0], \\ \infty & \text{otherwise.} \end{cases}$$
Consequently, we have
$$\Psi_i(x) = \frac{1}{n}\left( \phi_i^*(-x) - \frac{\gamma}{2} x^2 \right) = \begin{cases} -\frac{x}{n} & \text{if } x \in [0, 1], \\ \infty & \text{otherwise.} \end{cases}$$
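Since $\Psi_i(x) = -x/n$ on $[0,1]$ and $+\infty$ elsewhere, the coordinate subproblems in Algorithms 2 and 3 admit a closed form for this loss: minimizing $\frac{c}{2}h^2 + g_i h + \Psi_i(z_i + h)$ over $h$ reduces to a clip onto $[0,1]$. A minimal sketch under this assumption (variable names are ours, not from the paper):

```python
import numpy as np

def smoothed_hinge_coord_update(z_i, grad_i, c, n):
    """argmin_h  (c/2) * h^2 + grad_i * h + Psi_i(z_i + h),
    where Psi_i(x) = -x/n for x in [0, 1] and +inf otherwise."""
    target = z_i + (1.0 / n - grad_i) / c    # unconstrained minimizer of the quadratic
    return np.clip(target, 0.0, 1.0) - z_i   # project back into dom Psi_i = [0, 1]

# tiny numerical check against a grid search over feasible steps
z_i, grad_i, c, n = 0.3, 0.02, 5.0, 1000
h_star = smoothed_hinge_coord_update(z_i, grad_i, c, n)
hs = np.linspace(-z_i, 1.0 - z_i, 200001)    # feasible h values keep z_i + h in [0, 1]
obj = 0.5 * c * hs ** 2 + grad_i * hs - (z_i + hs) / n
assert abs(h_star - hs[np.argmin(obj)]) < 1e-4
```

The same one-dimensional pattern applies to any separable $\Psi_i$ whose proximal map is cheap, which is what keeps each inner step inexpensive.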

In the context of the primal-dual optimization problem, people often care about the duality gap $P(w(x)) - D(x)$, which is an upper bound to the gap $D^* - D(x)$. An overall discussion about the relation between these gaps can be found in [Dünner et al., 2016]. Here we directly use the duality gap as the statistical indicator.

Algorithm 3: DisAPCG for regularized ERM with $\mu > 0$.
Input: $x^{(0)} \in \mathrm{dom}(\Psi)$ and $\mu = \frac{\lambda\gamma n}{R^2 + \lambda\gamma n}$.
Initialize: set $\alpha = \frac{\sqrt{\kappa\mu}}{n}$, $\rho = \frac{1-\alpha}{1+\alpha}$, $v^{(0)} = x^{(0)}$, $u^{(0)} = 0$, $p^{(0)} = 0$ and $q^{(0)} = A x^{(0)}$. Start $K$ nodes with their corresponding data and coordinates.
for $t = 0, 1, 2, 3, \ldots$ do:
1. Reduce: $p^{(t)}$, $q^{(t)}$.
2. Uniformly sample $\tau$ blocks of coordinates $S_k^{(t)}$.
3. For all $i \in S^{(t)}$, compute
$$h_i^{(t)} = \mathop{\mathrm{argmin}}_{h \in \mathbb{R}^{N_i}} \left\{ \frac{n\alpha \left(\|A_i\|^2 + \lambda\gamma n\right)}{2\kappa\lambda n^2} \|h\|^2 + \langle \nabla_i^{(t)}, h \rangle + \Psi_i\left(-\rho^{t+1} u_i^{(t)} + v_i^{(t)} + h\right) \right\},$$
where
$$\nabla_i^{(t)} = \frac{1}{\lambda n^2} \left( \rho^{t+1} A_i^\top p^{(t)} + A_i^\top q^{(t)} \right) + \frac{\gamma}{n} \left( \rho^{t+1} u_i^{(t)} + v_i^{(t)} \right).$$
4. Let $u^{(t+1)} = u^{(t)}$, $v^{(t+1)} = v^{(t)}$ and, for all $i \in S^{(t)}$,
$$u_i^{(t+1)} = u_i^{(t)} - \frac{1 - \frac{n}{\kappa}\alpha}{2\rho^{t+1}}\, h_i^{(t)}, \qquad v_i^{(t+1)} = v_i^{(t)} + \frac{1 + \frac{n}{\kappa}\alpha}{2}\, h_i^{(t)}.$$
Update $p$, $q$ as
$$p^{(t+1)} = p^{(t)} - \frac{1 - \frac{n}{\kappa}\alpha}{2\rho^{t+1}}\, A_i h_i^{(t)}, \qquad q^{(t+1)} = q^{(t)} + \frac{1 + \frac{n}{\kappa}\alpha}{2}\, A_i h_i^{(t)}.$$
end for
Output: primal and dual solutions
$$x^{(t+1)} = \rho^{t+1} u^{(t+1)} + v^{(t+1)}, \qquad w^{(t+1)} = \frac{1}{\lambda n} \left( \rho^{t+1} p^{(t+1)} + q^{(t+1)} \right).$$

We implement the algorithms in C++ and OpenMPI and run them in clusters on the Tianhe-II supercomputer, where on each node we use a single CPU. Experiments are performed on 3 datasets from [Fan and Lin, 2011], whose information is summarized in Table 1.

dataset   dimension d   sample size   sparsity
epsilon   2,000         100,000       100%
covtype   54            581,012       22%
RCV1      47,236        677,399       0.16%

Table 1: Information of the three binary classification datasets.

Influence of mini-batch size τ and number of nodes K

We first analyze the influence of the mini-batch size $\tau$ and the number of nodes $K$ on the epsilon dataset. We either vary the mini-batch size $\tau$ on each node with the number of nodes $K$ fixed, or vice versa. Intuitively, a larger mini-batch size $\tau$ means a larger descent in one iteration; however, it needs more computation cost. Similarly, a larger number of nodes $K$ means a larger descent but more communication cost. Therefore, on the one hand, we show the duality gap w.r.t. the number of iterations to verify the linear convergence rate and the benefits of increasing computation resources. On the other hand, we show the duality gap w.r.t. running time to make clear the trade-off between computation and communication. The overall results are summarized in Fig. 1.

Figure 1: Duality gap vs. the number of iterations, as well as duality gap vs. elapsed time, for the epsilon dataset. We vary the mini-batch size $\tau$ while keeping the number of nodes $K$ fixed, or vice versa. $\lambda$ is fixed to be $10^{-6}$. The duality gap and elapsed time are shown in log domain while the number of iterations is shown normally to emphasize the linear convergence rate.

In terms of the duality gap w.r.t. the number of iterations, our DisAPCG algorithm indeed achieves a linear convergence rate in all settings, which is consistent with our theory. And the increase of $\tau$ and $K$ does give a larger descent in each iteration. Taking a closer look, we can see there is a $\sqrt{\kappa}$ accelerating factor in terms of the iteration number, again matching our theoretical findings. For the duality gap w.r.t. running time, the result suggests that a smaller mini-batch size (e.g., $10^2$) has better performance, and more nodes means a faster convergence rate.
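For completeness, the duality gap reported above can be computed directly from a dual point $x$ and the induced primal point $w(x) = \frac{1}{\lambda n} A x$. Below is a hedged sketch for the smoothed hinge loss with $g(w) = \frac{1}{2}\|w\|^2$ (labels assumed folded into the columns $A_i$; helper names are ours):

```python
import numpy as np

def smoothed_hinge(a, gamma):
    """phi_i(a): 0 if a >= 1; 1 - a - gamma/2 if a <= 1 - gamma;
    (1 - a)^2 / (2 * gamma) otherwise."""
    return np.where(a >= 1.0, 0.0,
           np.where(a <= 1.0 - gamma, 1.0 - a - gamma / 2.0,
                    (1.0 - a) ** 2 / (2.0 * gamma)))

def duality_gap(A, x, lam, gamma):
    """P(w(x)) - D(x) for the smoothed-hinge ERM problem with
    g(w) = 0.5 * ||w||^2, where w(x) = A x / (lam * n)."""
    n = A.shape[1]
    w = A @ x / (lam * n)
    primal = smoothed_hinge(A.T @ w, gamma).mean() + 0.5 * lam * w @ w
    # D(x) = (1/n) sum_i (x_i - (gamma/2) x_i^2) - (lam/2) ||w||^2,
    # using phi_i*(-x_i) = -x_i + (gamma/2) x_i^2 for x_i in [0, 1]
    dual = (x - 0.5 * gamma * x ** 2).mean() - 0.5 * lam * w @ w
    return primal - dual

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 100))    # columns A_i hold the (label-signed) data
x = rng.uniform(0.0, 1.0, 100)        # a feasible dual point
print(duality_gap(A, x, lam=1e-3, gamma=1.0))  # nonnegative by weak duality
```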

Comparison with other solvers

Now we compare our algorithm with other state-of-the-art distributed solvers for the primal and dual regularized ERM problem, including the mini-batch version of SDCA in distributed settings [Yang, 2013] (denoted by DisDCA) and CoCoA+ [Ma et al., 2015]. The CoCoA+ solver is an inner-outer loop scheme that involves a local solver, providing a local approximation to the global problem, while the outer loop is responsible for communication and global parameter updating. Here we follow the original paper [Ma et al., 2015] with SDCA as the local solver. In terms of the number of iterations $H$ for the local SDCA solver, we found that using a relatively small number of iterations (e.g., $10^2$ or $10^3$) achieves the best performance when running time is taken into consideration. Hence we choose $H = 10^2$ for CoCoA+. For a fair comparison, we set the mini-batch size to be $\tau = 10^2$ for our DisAPCG method and DisDCA. We vary $\lambda$ from $10^{-6}$ to $10^{-8}$, which is a relatively hard setting since the strong convexity parameter is small. For all settings, we use $K = 16$ nodes. The overall comparison is summarized in Fig. 2 and Fig. 3.

Figure 2: Duality gap vs. the number of iterations, as well as duality gap vs. elapsed time, for the RCV1 dataset with number of nodes $K = 16$, $m = H = 10^2$. $\lambda$ varies from $10^{-6}$ to $10^{-8}$.

Figure 3: Duality gap vs. the number of iterations, as well as duality gap vs. elapsed time, for the covtype dataset with number of nodes $K = 16$, $m = H = 10^2$. $\lambda$ varies from $10^{-6}$ to $10^{-8}$.

In all settings, CoCoA+ and our DisAPCG method outperform DisDCA by a large margin. For the former two, when $\lambda$ is relatively large, i.e., $\lambda = 10^{-6}$, the CoCoA+ solver reduces the duality gap quickly at the beginning; however, the speed slows down rapidly at a relatively high accuracy level, e.g., $10^{-6}$ on the RCV1 dataset. This phenomenon happens regardless of the choice of the number of iterations for the local SDCA solver. In contrast, the DisAPCG algorithm keeps the linear convergence rate all the time and hence achieves better performance in the case that high accuracy is needed. For the most ill-conditioned case, i.e., $\lambda = 10^{-8}$, our DisAPCG algorithm achieves the best performance, in terms of both iterations and running time, on both datasets.

It is worth mentioning that CoCoA+ is a framework that can use any local solver, not only SDCA. As mentioned in Section 2.2, our algorithm can be regarded as a mini-batch version of the APCG algorithm when $K = 1$, which supports shared-memory parallel computing on multi-core nodes. A combination of CoCoA+ with our single-machine version of DisAPCG seems to be a good choice to further accelerate the algorithm in practice.

5 Conclusions

We have presented a distributed accelerated coordinate gradient method (DisAPCG) for the composite convex optimization problem. Our method combines Nesterov's acceleration and parallel structures, enjoying an accelerated convergence rate for both strongly and non-strongly convex cases. Experiments on the ERM problem show better performance compared with several other state-of-the-art solvers, matching our theoretical findings as well.

Acknowledgments

The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), NSFC Projects (Nos. 61620106010, 61621136008, 61332007), the Tiangong Institute for Intelligent Computing, and the Youth Top-notch Talent Support Program.

References

[Bradley et al., 2011] Joseph Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. ICML, 2011.

[Dünner et al., 2016] Celestine Dünner, Simone Forte, Martin Takac, and Martin Jaggi. Primal-dual rates and certificates. ICML, 2016.

[Fan and Lin, 2011] Rong-En Fan and Chih-Jen Lin. LIBSVM data: Classification, regression and multi-label. URL: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, 2011.

[Fercoq and Richtarik, 2015] Olivier Fercoq and Peter Richtarik. Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization, 2015.

[Lin et al., 2014] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. NIPS, 2014.

[Ma et al., 2015] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael Jordan, Peter Richtarik, and Martin Takac. Adding vs. averaging in distributed primal-dual optimization. ICML, 2015.

[Necoara and Clipici, 2013] Ion Necoara and Dragos Clipici. Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. Journal of Process Control, 2013.

[Nesterov, 1983] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983.

[Nesterov, 2012] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 2012.

[Richtarik and Takac, 2013a] Peter Richtarik and Martin Takac. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013.

[Richtarik and Takac, 2013b] Peter Richtarik and Martin Takac. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873v2, 2013.

[Shalev-Shwartz and Zhang, 2014] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. ICML, 2014.

[Takac and Richtarik, 2015] Martin Takac and Peter Richtarik. Distributed mini-batch SDCA. arXiv:1507.08322v1, 2015.

[Wright, 2015] Stephen Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3-34, 2015.

[Xiao and Lu, 2013] Lin Xiao and Zhaosong Lu. On the complexity analysis of randomized block-coordinate descent methods. arXiv:1304.4723, 2013.

[Yang, 2013] Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. NIPS, 2013.