arxiv: v1 [stat.ml] 15 Mar 2018

Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate

Shen-Yi Zhao ¹, Gong-Duo Zhang ¹, Ming-Wei Li ¹, Wu-Jun Li * ¹

¹ Nanjing University. Correspondence to: Shen-Yi Zhao <zhaosy@lamda.nju.edu.cn>, Gong-Duo Zhang <zhanggd@lamda.nju.edu.cn>, Ming-Wei Li <limw@lamda.nju.edu.cn>, Wu-Jun Li <liwujun@nju.edu.cn>.

Abstract

Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use $L_1$ regularization. In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with $L_1$ regularization. pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework of pSCOPE, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that pSCOPE is convergent with a linear convergence rate if the data partition is good enough. We also prove that better data partition implies faster convergence rate. Furthermore, pSCOPE is also communication efficient. Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

1. Introduction

Many machine learning models can be formulated as the following regularized empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) + R(w), \tag{1}$$

where $w$ is the parameter to learn, $f_i(w)$ is generally the loss on training instance $i$, $n$ is the number of training instances, and $R(w)$ is a regularization term.
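As a concrete reading of (1), the following minimal sketch evaluates $P(w)$ on a tiny synthetic data set. The squared loss, the toy data, and the helper names `squared_loss` and `objective` are illustrative assumptions and do not come from the paper; the regularizer is taken to be $R(w) = \lambda\|w\|_1$.

```python
import numpy as np

def squared_loss(w, X, y):
    """Per-instance losses f_i(w) = (y_i - x_i^T w)^2 (an illustrative choice)."""
    return (y - X @ w) ** 2

def objective(w, X, y, lam):
    """P(w) = (1/n) * sum_i f_i(w) + lam * ||w||_1, following the form of (1)."""
    return squared_loss(w, X, y).mean() + lam * np.abs(w).sum()

# Toy data (hypothetical): n = 3 instances, d = 2 features.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, -0.5])
value = objective(w, X, y, lam=0.1)
```

The average loss and the regularizer are kept as separate terms because the methods discussed in this paper treat the smooth part through gradients and the $L_1$ part through a proximal mapping.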
For example, given a set of labeled instances $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the feature vector of instance $i$ and $y_i \in \{+1, -1\}$ is the class label of instance $i$, if we take $f_i(w) = (y_i - x_i^T w)^2$ and $R(w) = \lambda\|w\|_1$, it is known as lasso regression (Tibshirani, 1994). If we take $f_i(w) = \log(1 + e^{-y_i x_i^T w})$ and $R(w) = \frac{\lambda}{2}\|w\|_2^2$, it is known as logistic regression. Here, $\lambda$ is a hyper-parameter. In some cases, $f_i(w)$ can also contain regularization besides the loss. One example is the elastic net (Zou & Hastie, 2005), which will be introduced in Section 6.

Recently, sparse learning, which tries to learn a sparse model for prediction, has become a hot topic in machine learning. There are different ways to implement sparse learning (Tibshirani, 1994; Wang et al., 2017). One popular way is to use $L_1$ regularization, i.e., $R(w) = \lambda\|w\|_1$. In this paper, we focus on sparse learning with $R(w) = \lambda\|w\|_1$. Hence, in the rest of this paper, $R(w) = \lambda\|w\|_1$ unless otherwise stated.

One traditional method to solve (1) is proximal gradient descent (pGD) (Beck & Teboulle, 2009), which can be written as follows:

$$w_{t+1} = \mathrm{prox}_{R,\eta}\big(w_t - \eta\nabla F(w_t)\big), \tag{2}$$

where $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, $w_t$ is the value of $w$ at iteration $t$, $\eta$ is the learning rate, and $\mathrm{prox}$ is the proximal mapping defined as

$$\mathrm{prox}_{R,\eta}(u) = \arg\min_{v}\Big(R(v) + \frac{1}{2\eta}\|v - u\|^2\Big). \tag{3}$$

Recently, stochastic learning methods, including stochastic gradient descent (SGD) (Nemirovski et al., 2009), stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), and stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz & Zhang, 2013), have been proposed to speed up the learning procedure in machine learning. Inspired by the success of these stochastic learning methods, proximal stochastic methods, including proximal SGD (pSGD), proximal block coordinate descent (BCD), proximal SVRG (pSVRG) and proximal SDCA (pSDCA), have also been proposed for sparse learning in recent years (Langford et al., 2008; Duchi & Singer, 2009; Tseng, 2001; Wu & Lange, 2008; Scherrer et al., 2012; Shalev-Shwartz & Zhang, 2014; Xiao & Zhang, 2014; Shi & Liu, 2015; Byrd et al., 2016; Schmidt et al., 2017). All these proximal stochastic methods are sequential (serial) and implemented with a single thread.

The serial proximal stochastic methods may not be efficient enough for solving large-scale sparse learning problems. Furthermore, the training set might be stored in a distributed manner on a cluster of multiple machines in some applications. Hence, distributed sparse learning (Aybat et al., 2015) with a cluster of multiple machines has attracted much attention in recent years, especially for large-scale applications with high-dimensional data. In particular, researchers have recently proposed several distributed proximal stochastic methods for sparse learning.

One main branch of the distributed proximal stochastic methods includes distributed pSGD (dpSGD) (Li et al., 2016) and distributed pSVRG (dpSVRG) (Huo et al., 2016; Meng et al., 2017). Both dpSGD and dpSVRG adopt a mini-batch based strategy for distributed learning. Although both synchronous and asynchronous communication strategies can be used for these methods, most methods (Li et al., 2016; Huo et al., 2016; Meng et al., 2017) adopt the asynchronous strategy because it has lower waiting cost than the synchronous strategy. The asynchronous version of these methods can be easily implemented with existing distributed frameworks like Parameter Server (Li et al., 2014; Xing et al., 2015). One shortcoming of these dpSGD and dpSVRG methods is that the communication cost is high. More specifically, the communication cost of each epoch is $O(n)$, where $n$ is the number of training instances.

Another branch of the distributed proximal stochastic methods is based on block coordinate descent (Bradley et al., 2011; Richtárik & Takáč, 2016; Fercoq & Richtárik, 2016; Mahajan et al., 2017).
Although in each iteration these methods update only a block of coordinates, they usually have to pass through the whole data set. For example, in a linear model, to calculate the gradient of any coordinate, they have to calculate $x_i^T w$ for $i = 1$ to $n$. Because the data is partitioned across machines, this also brings $O(n)$ communication cost in each iteration.

Another kind of distributed proximal stochastic method is based on SDCA. One representative of this kind is PROXCOCOA+ (Smith et al., 2015). Based on the primal-dual framework, the communication cost of PROXCOCOA+ in each epoch is reduced to a constant. However, it can only solve a special problem of the form $f(Xw) + \lambda\|w\|_1$, where $X \in \mathbb{R}^{n \times d}$ is the data matrix. Many sparse learning problems, like the $L_1$ regularized logistic regression, cannot be solved by PROXCOCOA+.

In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with $L_1$ regularization. pSCOPE is a proximal generalization of the scalable composite optimization for learning (SCOPE) (Zhao et al., 2017). SCOPE cannot be used for sparse learning, while pSCOPE can. The contributions of pSCOPE are briefly summarized as follows:

- pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The CALL framework is communication efficient because there is no communication during the inner iterations of each epoch.
- pSCOPE is theoretically guaranteed to be convergent with a linear convergence rate if the data partition is good enough, and better data partition implies faster convergence rate. Hence, pSCOPE is also computation efficient.
- Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

2. Preliminary

In this paper, we use $\|\cdot\|$ to denote the $L_2$ norm $\|\cdot\|_2$, and $w^*$ to denote the optimal solution of (1). For a vector $a$, we use $a^{(j)}$ to denote the $j$th coordinate value of $a$. $[n]$ denotes the set $\{1, 2, \ldots, n\}$. For a function $h(a; b)$, we use $\nabla h(a; b)$ to denote the gradient of $h(a; b)$ with respect to (w.r.t.) the first argument $a$. Furthermore, we give the following definitions.

Definition 1. We call a function $h(\cdot)$ Lipschitz continuous if there exists a positive constant $L_0$ such that $\forall a, b$, $|h(a) - h(b)| \le L_0\|a - b\|$.

Definition 2. We call a function $h(\cdot)$ $L$-smooth if it is differentiable and there exists a positive constant $L$ such that $\forall a, b$, $h(b) \le h(a) + \nabla h(a)^T(b - a) + \frac{L}{2}\|a - b\|^2$.

Definition 3. We call a function $h(\cdot)$ $\mu$-strongly convex ($\mu \ge 0$) if $\forall a, b$, $h(b) \ge h(a) + \zeta^T(b - a) + \frac{\mu}{2}\|a - b\|^2$, where $\zeta \in \partial h(a) = \{c \mid h(b) \ge h(a) + c^T(b - a), \forall b\}$. If $h(\cdot)$ is differentiable, then $\zeta = \nabla h(a)$. If $\mu = 0$, we also call $h(\cdot)$ convex.
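To make Definitions 2 and 3 concrete, the sketch below checks both inequalities numerically for a quadratic function. The quadratic $h(a) = \frac{1}{2}a^T A a - b^T a$, the matrix `A`, and the helper `check` are hypothetical choices for illustration; for such a quadratic, $\mu$ and $L$ can be taken as the smallest and largest eigenvalues of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (hypothetical)
b = np.array([1.0, -1.0])

def h(a):
    return 0.5 * a @ A @ a - b @ a

def grad_h(a):
    return A @ a - b

# For this quadratic, mu and L are the extreme eigenvalues of A.
eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]

def check(a, c, tol=1e-12):
    """True if h(c) lies between the lower bound of Definition 3 and the upper bound of Definition 2."""
    linear = h(a) + grad_h(a) @ (c - a)
    half_sq = 0.5 * np.sum((a - c) ** 2)
    return linear + mu * half_sq - tol <= h(c) <= linear + L * half_sq + tol

ok = all(check(rng.normal(size=2), rng.normal(size=2)) for _ in range(1000))
```

The ratio $L/\mu$ of the two constants is the quantity that later appears as the condition number in the convergence results.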

Definition 4. For any $b$ and function $h(\cdot)$, the conjugate function $h^*(\cdot)$ is

$$h^*(b) = \sup_{a}\big(b^T a - h(a)\big),$$

which is always a convex function.

Throughout this paper, we assume that $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ is strongly convex and each $f_i(\cdot)$ is $L$-smooth. We do not assume that each $f_i(\cdot)$ is convex.

3. Proximal SCOPE

In this paper, we focus on distributed learning with one master (server) and $p$ workers in the cluster, although the algorithm and theory of this paper can also be easily extended to cases with multiple servers like the Parameter Server framework (Li et al., 2014; Xing et al., 2015).

The parameter $w$ is stored in the master, and the training set $D = \{x_i, y_i\}_{i=1}^{n}$ is partitioned into $p$ parts denoted as $D_1, D_2, \ldots, D_p$. Here, $D_k$ contains a subset of instances from $D$, and $D_k$ will be assigned to the $k$th worker. $D = \bigcup_{k=1}^{p} D_k$. Based on this data partition scheme, the proximal SCOPE (pSCOPE) for distributed sparse learning is presented in Algorithm 1.

The main task of the master is to add and average vectors received from workers. Specifically, it needs to calculate the full gradient $z = \nabla F(w_t) = \frac{1}{n}\sum_{k=1}^{p} z_k$. Then it needs to calculate $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$. The main task of the workers is to update the local parameters $u_{1,m}, u_{2,m}, \ldots, u_{p,m}$, initialized with $u_{k,0} = w_t$. Specifically, for each worker $k$, after it gets the full gradient $z$ from the master, it calculates a stochastic gradient

$$v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z, \tag{4}$$

and then updates its local parameter $u_{k,m}$ by a proximal mapping

$$u_{k,m+1} = \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m}), \tag{5}$$

where $\eta$ is the learning rate.

From Algorithm 1, we can find that pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The cooperative operation is mainly through adding and averaging in the master.
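The master/worker procedure described above can be sketched in a single process as follows. This is a simulation under stated assumptions, not the authors' implementation: the workers are run one after another instead of in parallel, the loss is taken to be $f_i(w) = \frac{1}{2}(x_i^T w - y_i)^2$, the regularizer is $R(w) = \lambda\|w\|_1$ (so the proximal mapping is soft-thresholding), and the names `pscope`, `soft_threshold`, `grad_fi` as well as all hyper-parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal mapping of tau * ||.||_1, i.e. prox_{R,eta} with tau = lam * eta."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def grad_fi(w, x, y):
    """Gradient of the illustrative loss f_i(w) = 0.5 * (x^T w - y)^2."""
    return (x @ w - y) * x

def pscope(X, y, parts, lam, eta, T, M, seed=0):
    """Single-process simulation of the master/worker loop described above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        # Each worker sums its local gradients z_k; the master forms z = (1/n) sum_k z_k.
        z = sum(X[idx].T @ (X[idx] @ w - y[idx]) for idx in parts) / n
        local = []
        for idx in parts:                     # worker k: no communication inside this loop
            u = w.copy()
            for _ in range(M):
                i = idx[rng.integers(len(idx))]
                v = grad_fi(u, X[i], y[i]) - grad_fi(w, X[i], y[i]) + z   # eq. (4)
                u = soft_threshold(u - eta * v, lam * eta)                # eq. (5)
            local.append(u)
        w = np.mean(local, axis=0)            # master averages the u_{k,M}
    return w

def objective(w, X, y, lam):
    return 0.5 * np.mean((X @ w - y) ** 2) + lam * np.abs(w).sum()

# Synthetic sparse regression problem (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
y = X @ w_true + 0.01 * rng.normal(size=400)
parts = np.array_split(rng.permutation(400), 4)   # random equal-size split, p = 4
w_hat = pscope(X, y, parts, lam=0.1, eta=0.01, T=15, M=400)
```

Only two vectors per worker cross the simulated network in each outer iteration ($z_k$ and $u_{k,M}$), which is the source of the constant per-epoch communication cost discussed in the text.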
During the autonomous local learning procedure in each outer iteration, which contains $M$ inner iterations (see Algorithm 1), there is no communication. Hence, the communication cost for each epoch of pSCOPE is constant, which is much less than that of the mini-batch based strategy with $O(n)$ communication cost for each epoch (Li et al., 2016; Huo et al., 2016; Meng et al., 2017).

Algorithm 1 Proximal SCOPE (pSCOPE)
 1: Initialize $w_0$ and the learning rate $\eta$;
 2: Task of master:
 3: for $t = 0, 1, 2, \ldots, T-1$ do
 4:   Send $w_t$ to each worker;
 5:   Wait until receiving $z_1, z_2, \ldots, z_p$ from all workers;
 6:   Calculate the full gradient $z = \frac{1}{n}\sum_{k=1}^{p} z_k$ and send $z$ to each worker;
 7:   Wait until receiving $u_{1,M}, u_{2,M}, \ldots, u_{p,M}$ from all workers and calculate $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$;
 8: end for
 9: Task of the $k$th worker:
10: for $t = 0, 1, 2, \ldots, T-1$ do
11:   Wait until receiving $w_t$ from master;
12:   Let $u_{k,0} = w_t$, calculate $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$ and send $z_k$ to master;
13:   Wait until receiving $z$ from master;
14:   for $m = 0, 1, 2, \ldots, M-1$ do
15:     Randomly choose an instance $x_{i_{k,m}} \in D_k$;
16:     Calculate $v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z$;
17:     Update $u_{k,m+1} = \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m})$;
18:   end for
19:   Send $u_{k,M}$ to master
20: end for

pSCOPE is a proximal generalization of SCOPE (Zhao et al., 2017). The main differences between them are listed as follows:

- Although pSCOPE is mainly motivated by sparse learning with $L_1$ regularization, the algorithm and theory of pSCOPE can be used for both $L_1$ and $L_2$ regularization (refer to Theorem 2). On the contrary, SCOPE cannot be used for sparse learning with $L_1$ regularization.
- The update rule of SCOPE for the inner iterations is $u_{k,m+1} = u_{k,m} - \eta\big(v_{k,m} + c(u_{k,m} - w_t)\big)$, where $c > L - \mu > 0$ is a hyper-parameter (Zhao et al., 2017). According to the theory in SCOPE, the positive number $c$ is necessary for SCOPE to converge, and the convergence rate will be slower when $c$ becomes larger.
If an $L_2$ regularization has been integrated into $f_i(\cdot)$ and there is no $R(w)$ in (1), the problem in (1) is just the problem solved by SCOPE. In this special case, the update rule in (5) of pSCOPE degenerates to $u_{k,m+1} = u_{k,m} - \eta v_{k,m}$. Hence, for the problem with only $L_2$ regularization, the update rule of pSCOPE is similar to that of SCOPE but with $c = 0$. This does not mean the theory in SCOPE is wrong. If we do not put any constraint on the data partition strategy, $c$ cannot be 0 if convergence is to be guaranteed. This is just what SCOPE has proved. However, if we can design some good data partition strategy, the algorithm can still be convergent even if $c = 0$. This is one of the main contributions of this paper, and it is introduced in the following section.

4. Effect of Data Partition

In our experiment, we find that the data partition affects the convergence of the learning procedure. Hence, in this section we propose a metric to measure the goodness of a data partition, based on which the convergence of pSCOPE can be theoretically proved. Due to space limitation, the detailed proofs of the lemmas and theorems are moved to the supplementary material.

4.1. Partition

First, we give the following definition:

Definition 5. Define $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$. We call $\pi$ a partition w.r.t. $P(\cdot)$ if $F(w) = \frac{1}{p}\sum_{k=1}^{p} \phi_k(w)$ and each $\phi_k(\cdot)$ $(k = 1, \ldots, p)$ is $\mu_k$-strongly convex and $L_k$-smooth ($\mu_k, L_k > 0$). Here, $P(\cdot)$ and $F(\cdot)$ are defined in (1). We denote $A(P) = \{\pi \mid \pi \text{ is a partition w.r.t. } P(\cdot)\}$.

Remark 1. Here, $\pi$ is an ordered sequence of functions. In particular, if we construct another partition $\pi'$ by permuting the $\phi_i(\cdot)$ of $\pi$, we consider them to be two different partitions. Furthermore, two functions $\phi_i(\cdot), \phi_j(\cdot)$ $(i \neq j)$ in $\pi$ can be the same. Two partitions $\pi_1 = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ and $\pi_2 = [\psi_1(\cdot), \ldots, \psi_p(\cdot)]$ are considered to be equal, i.e., $\pi_1 = \pi_2$, if and only if $\phi_k(w) = \psi_k(w)$ $(k = 1, \ldots, p)$, $\forall w$.

For any partition $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ w.r.t. $P(\cdot)$, we construct new functions by introducing a new parameter $a$:

$$\phi_k(w; a) = \phi_k(w) + G_k(a)^T w, \tag{6}$$

$$P_k(w; a) = \phi_k(w; a) + R(w), \tag{7}$$

where

$$G_k(a) = \nabla F(a) - \nabla\phi_k(a). \tag{8}$$

It is easy to get $P(w) = \frac{1}{p}\sum_{k=1}^{p} P_k(w; a)$, $\forall w, a$, since $\frac{1}{p}\sum_{k=1}^{p} G_k(a) = 0$. We call such a $P_k(w; a)$ the local objective function w.r.t. $\pi$.

In particular, given a data partition $D_1, D_2, \ldots, D_p$ of the training set $D$, let $F_k(w) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w)$, which is also called the local loss function. Assume each $F_k(\cdot)$ is strongly convex and smooth, and $F(w) = \frac{1}{p}\sum_{k=1}^{p} F_k(w)$. Then, we can find that $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is a partition w.r.t. $P(\cdot)$.
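The identity $P(w) = \frac{1}{p}\sum_{k} P_k(w; a)$ that follows from (6), (7) and (8) can be checked numerically. The sketch below uses hypothetical quadratic local functions $\phi_k(w) = \frac{1}{2}w^T A_k w - b_k^T w$ and $R(w) = \lambda\|w\|_1$; none of these specific choices or function names comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, lam = 4, 3, 0.1
# Hypothetical local functions phi_k(w) = 0.5 * w^T A_k w - b_k^T w.
A = [np.diag(rng.uniform(1.0, 2.0, size=d)) for _ in range(p)]
b = [rng.normal(size=d) for _ in range(p)]

def phi(k, w):
    return 0.5 * w @ A[k] @ w - b[k] @ w

def grad_phi(k, w):
    return A[k] @ w - b[k]

def F(w):
    return np.mean([phi(k, w) for k in range(p)])

def grad_F(w):
    return np.mean([grad_phi(k, w) for k in range(p)], axis=0)

def R(w):
    return lam * np.abs(w).sum()

def G(k, a):
    """Correction term of eq. (8)."""
    return grad_F(a) - grad_phi(k, a)

def P_local(k, w, a):
    """Local objective of eqs. (6) and (7)."""
    return phi(k, w) + G(k, a) @ w + R(w)

w, a = rng.normal(size=d), rng.normal(size=d)
mean_G = np.mean([G(k, a) for k in range(p)], axis=0)   # should vanish
P_global = F(w) + R(w)
P_avg = np.mean([P_local(k, w, a) for k in range(p)])
```

Because the correction terms $G_k(a)$ average to zero, adding them changes each worker's local problem without changing the global objective.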
Moreover, let $f_i(w; a) = f_i(w) + G_k(a)^T w$, $i \in D_k$. According to (6) and (7), we can find some interesting properties of $F_k(w; a) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w; a)$ and $P_k(w; a)$ as follows:

- $P_k(w; w_t) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w; w_t) + R(w)$.
- The full gradient of $F_k(w; w_t)$ at $w = w_t$ is $\nabla F_k(w_t; w_t) = \nabla F_k(w_t) + G_k(w_t) = \nabla F(w_t)$.
- $v_{k,m}$ defined in (4) can be rewritten as $v_{k,m} = \nabla f_i(u_{k,m}; w_t) - \nabla f_i(w_t; w_t) + \nabla F_k(w_t; w_t)$.

According to the theory in (Xiao & Zhang, 2014), in the inner iterations of pSCOPE each worker actually tries to optimize the local objective function $P_k(w; w_t)$ using proximal SVRG with initialization $w = w_t$ and training data $D_k$, rather than optimizing $F_k(w) + R(w)$.

4.2. Qualified Partition

For a partition $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ w.r.t. $P(\cdot)$, each worker tries to optimize the local objective function $P_k(w; a)$ based on the data assigned to that worker. In general, the data distribution on each worker is different from the distribution of the whole training set. Hence, there exists a gap between each local optimal value and the global optimal value. Intuitively, the whole learning algorithm converges slowly, or may even fail to converge, if this gap is too large. Here, we use qualified partition to denote a partition whose gap is not too large.

Given a specific $a$, the local optimal solution is denoted as follows:

$$w_k^*(a) = \arg\min_{w} P_k(w; a). \tag{9}$$

Definition 6. For any partition $\pi$ w.r.t. $P(\cdot)$, we define the Local-Global Gap as

$$l_\pi(a) = P(w^*) - \frac{1}{p}\sum_{k=1}^{p} P_k(w_k^*(a); a).$$

Definition 7. We call $\pi$ a qualified partition w.r.t. $P(\cdot)$ if there is a constant $\gamma < \infty$, which is independent of $a$, such that $l_\pi(a) \le \gamma\|a - w^*\|^2$, $\forall a$. We denote $Q(P) = \{\pi \mid \pi \text{ is a qualified partition w.r.t. } P(\cdot)\}$.

Then, we have the following lemmas:

Lemma 1. $\forall \pi \in A(P)$, $l_\pi(a) \ge l_\pi(w^*) = 0$, $\forall a$.

Lemma 2. $\forall \pi \in A(P)$,

$$l_\pi(a) = P(w^*) + \frac{1}{p}\sum_{k=1}^{p} H_k^*(-G_k(a)),$$

where $H_k^*(\cdot)$ is the conjugate function of $\phi_k(\cdot) + R(\cdot)$.
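The Local-Global Gap of Definition 6 and the statement of Lemma 1 can be illustrated on a one-dimensional toy problem in which every minimizer has a closed form. The sketch assumes scalar quadratics $\phi_k(w) = \frac{\alpha_k}{2}w^2 - \beta_k w$ and $R(w) = \lambda|w|$; the values of `alpha`, `beta`, `lam` and the function names are hypothetical.

```python
import numpy as np

def soft(c, tau):
    """Scalar soft-thresholding."""
    return np.sign(c) * max(abs(c) - tau, 0.0)

alpha = np.array([1.0, 2.0, 3.0])   # curvatures of phi_k (hypothetical)
beta = np.array([2.0, -1.0, 4.0])   # linear coefficients of phi_k (hypothetical)
lam = 0.5

def phi(k, w):
    return 0.5 * alpha[k] * w**2 - beta[k] * w

def dphi(k, w):
    return alpha[k] * w - beta[k]

def F(w):
    return np.mean(0.5 * alpha * w**2 - beta * w)

def dF(w):
    return np.mean(alpha * w - beta)

def P(w):
    return F(w) + lam * abs(w)

# Global minimizer of (mean(alpha)/2) w^2 - mean(beta) w + lam |w|.
w_star = soft(beta.mean(), lam) / alpha.mean()

def gap(a):
    """Local-Global Gap l_pi(a) of Definition 6 for this toy problem."""
    total = 0.0
    for k in range(len(alpha)):
        g = dF(a) - dphi(k, a)                    # G_k(a), eq. (8)
        wk = soft(beta[k] - g, lam) / alpha[k]    # w_k^*(a), eq. (9)
        total += phi(k, wk) + g * wk + lam * abs(wk)
    return P(w_star) - total / len(alpha)

gaps = [gap(a) for a in np.linspace(-3.0, 3.0, 61)]
```

In this example the gap is zero at $a = w^*$ and nonnegative elsewhere, in line with Lemma 1; dividing `gap(a)` by $(a - w^*)^2$ gives a numerical feel for the constant $\gamma$ in Definition 7.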
Without loss of generality, we set $\lambda = 1$ and prove that if $R(w) = \|w\|_1$, then $A(P) = Q(P)$, which means that any partition $\pi$ is a qualified partition if $R(w) = \|w\|_1$. In particular, we have the following theorems.

Theorem 1. Let $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ be a partition w.r.t. $P(\cdot)$. If $R(w) = \|w\|_1$, then $\pi$ is a qualified partition.

The result in Theorem 1 can be easily extended to smooth regularization.

Theorem 2. Let $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ be a partition w.r.t. $P(\cdot)$. If $R(\cdot)$ is smooth, then $\pi$ is a qualified partition.

4.3. Good Partition

For a qualified partition, the local-global gap can be bounded by $\gamma\|a - w^*\|^2$. Given a specific $a$, the smaller $\gamma$ is, the smaller the local-global gap will be. According to Definition 7, the constant $\gamma$ only depends on the partition $\pi$. Intuitively, $\gamma$ can be used to evaluate the goodness of a partition $\pi$. We define a good partition as follows:

Definition 8. We call $\pi$ an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$ if it is a qualified partition and

$$\gamma(\pi; \epsilon) \triangleq \sup_{\|a - w^*\|^2 \ge \epsilon} \frac{l_\pi(a)}{\|a - w^*\|^2} \le \xi. \tag{10}$$

Obviously, we wish $\gamma(\pi; \epsilon)$ to be small for a good partition. Consider the partition $\pi^* = [F(\cdot), \ldots, F(\cdot)]$, where $F(\cdot) = \frac{1}{n}\sum_{i=1}^{n} f_i(\cdot)$ is defined in (1). We can find that $\pi^*$ is the best partition since $l_{\pi^*}(a) = 0$, $\forall a$, which implies $\gamma(\pi^*; 0) = 0$. In the following, we will prove that $\gamma(\pi; \epsilon)$ is continuous w.r.t. $\pi$, and that $\gamma(\pi; \epsilon) \to 0$ as $\pi$ approaches $\pi^*$. We first define the distance between two partitions as follows:

Definition 9. Let $\pi_1 = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ and $\pi_2 = [\psi_1(\cdot), \ldots, \psi_p(\cdot)]$ be two partitions w.r.t. $P(\cdot)$. The distance between $\pi_1$ and $\pi_2$ is

$$d(\pi_1, \pi_2) = \max_{i=1,\ldots,p}\Big\{\sup_{w}|\phi_i(w) - \psi_i(w)|,\ \sup_{w}\|\nabla\phi_i(w) - \nabla\psi_i(w)\|\Big\}.$$

We can prove that it is a reasonable definition of distance. For convenience, we assume the domain of these $\phi_k(\cdot)$ and $\psi_k(\cdot)$ is closed and bounded, which is denoted as $W = \{w \mid \|w\| \le B\}$, and $w^* \in W$. Hence, $d(\pi_1, \pi_2) < \infty$. With the definition of distance between two partitions, we have

Lemma 3. Let $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ be a partition w.r.t. $P(\cdot)$. $l_\pi(a)$ uniformly converges to $l_{\pi^*}(a) = 0$ as $d(\pi, \pi^*) \to 0$.

Besides the uniform convergence of $l_\pi(a)$ w.r.t. $\pi$, we also prefer the uniform convergence of $\gamma(\pi; \epsilon)$ w.r.t. $\pi$ since it reflects the goodness of partition $\pi$. In the appendix of the supplementary material, we give the condition for convergence of $\gamma(\pi; 0)$. However, it needs higher order information, which is not common in first order optimization. In the following, we give a bound of $\gamma(\pi; \epsilon)$ which is more practical.

Lemma 4. Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is a partition w.r.t. $P(\cdot)$, where $F_k(w) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w)$ is the local loss function, and each $f_i(\cdot)$ is Lipschitz continuous and sampled from some unknown distribution $\mathbb{P}$. If we assign these $\{f_i(\cdot)\}$ uniformly to each worker, then with high probability,

$$\gamma(\pi; \epsilon) \le \frac{1}{p}\sum_{k=1}^{p} O\Big(\frac{1}{\epsilon\sqrt{|D_k|}}\Big).$$

Moreover, if $l_\pi(w)$ is convex w.r.t. $w$, then

$$\gamma(\pi; \epsilon) \le \frac{1}{p}\sum_{k=1}^{p} O\Big(\frac{1}{\sqrt{\epsilon|D_k|}}\Big).$$

Here we ignore the log term and dimensionality $d$.

This implies that as long as the size of the training data is large enough, $\gamma(\pi; \epsilon)$ will be small and $\pi$ will be a good partition. Please note that "uniformly" here means each $f_i(\cdot)$ is assigned to one of the $p$ workers and each worker has equal probability of being chosen. We call the partition resulting from uniform assignment the uniform partition in this paper. With uniform partition, each worker will have almost the same number of instances. As long as the size of the training data is large enough, uniform partition is a good partition.

5. Convergence of Proximal SCOPE

In this section, we will prove the convergence of Algorithm 1 for proximal SCOPE (pSCOPE) using the results in Section 4. For convenience, we rewrite the local update rule as follows:

$$u_{k,m+1} = u_{k,m} - \eta g_{k,m}, \tag{11}$$

where

$$g_{k,m} = \frac{1}{\eta}\big(u_{k,m} - \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m})\big). \tag{12}$$

Then we have the following theorems about the convergence of pSCOPE.

Theorem 3. Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$, each $F_k(\cdot)$ is $\mu$-strongly convex ($\mu > 0$) and $L$-smooth, and $R(\cdot)$ is convex. If $\|w_t - w^*\|^2 \ge \epsilon$, then

$$\mathbb{E}\|w_{t+1} - w^*\|^2 \le \Big[(1 - \mu\eta + 2L^2\eta^2)^M + \frac{2L^2\eta + 2\xi}{\mu - 2L^2\eta}\Big]\|w_t - w^*\|^2.$$

By setting appropriate $\eta$ and $M$, we can always make $\big[(1 - \mu\eta + 2L^2\eta^2)^M + \frac{2L^2\eta + 2\xi}{\mu - 2L^2\eta}\big] < 1$. Hence, if the partition $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$, pSCOPE is convergent with a linear convergence rate. Furthermore, the smaller $\xi$ is, the faster the convergence rate of pSCOPE will be. Because smaller $\xi$ means better partition and the partition $\pi$ corresponds to the data partition in Algorithm 1, we can see that better data partition implies faster convergence rate.

Corollary 1. Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is an $(\epsilon, \frac{\mu}{8})$-good partition w.r.t. $P(\cdot)$, and each $F_k(\cdot)$ is $\mu$-strongly convex ($\mu > 0$) and $L$-smooth. If $\|w_t - w^*\|^2 \ge \epsilon$, taking $\eta = \frac{\mu}{12L^2}$ and $M = 20\kappa^2$, where $\kappa = \frac{L}{\mu}$ is the condition number, we have $\mathbb{E}\|w_{t+1} - w^*\|^2 \le \frac{3}{4}\|w_t - w^*\|^2$. To get an $\epsilon$-suboptimal solution, the computation complexity of each worker is $O\big((n/p + \kappa^2)\log(\frac{1}{\epsilon})\big)$.

Corollary 2. When $p = 1$, which means we only use one worker, pSCOPE degenerates to proximal SVRG (Xiao & Zhang, 2014). Assume $F(\cdot)$ is $\mu$-strongly convex ($\mu > 0$) and $L$-smooth. Taking $\eta = \frac{\mu}{6L^2}$ and $M = 13\kappa^2$, we have $\mathbb{E}\|w_{t+1} - w^*\|^2 \le \frac{3}{4}\|w_t - w^*\|^2$. To get an $\epsilon$-optimal solution, the computation complexity is $O\big((n + \kappa^2)\log(\frac{1}{\epsilon})\big)$.

Because pSCOPE has a linear convergence rate if the partition is $(\epsilon, \xi)$-good, we need $T = O(\log(\frac{1}{\epsilon}))$ outer iterations to get an $\epsilon$-optimal solution. For all inner iterations, each worker updates $u_{k,m}$ without any communication. Hence, the communication cost is $O(\log(\frac{1}{\epsilon}))$, which is much smaller than that of the mini-batch based strategy with $O(n)$ communication cost for each epoch (Li et al., 2016; Huo et al., 2016; Meng et al., 2017).

Furthermore, in the above theorems and corollaries, we only assume that the local loss function $F_k(\cdot)$ is strongly convex. We do not need each $f_i(\cdot)$ to be convex.
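The contraction factor in Theorem 3 can be evaluated numerically for the parameter choice of Corollary 1. The sketch below reads the bracketed factor as $(1 - \mu\eta + 2L^2\eta^2)^M + \frac{2L^2\eta + 2\xi}{\mu - 2L^2\eta}$; the minus sign in the denominator is an assumption about the typeset fraction, chosen because it is what a geometric-series argument on the first term would produce. The function names are hypothetical.

```python
def contraction_factor(mu, L, eta, M, xi):
    """Bracketed factor of Theorem 3, read with denominator mu - 2 L^2 eta (an assumption)."""
    decay = (1.0 - mu * eta + 2.0 * L**2 * eta**2) ** M
    bias = (2.0 * L**2 * eta + 2.0 * xi) / (mu - 2.0 * L**2 * eta)
    return decay + bias

def corollary1_factor(mu, L):
    """Factor for eta = mu / (12 L^2), M = 20 kappa^2 and xi = mu / 8."""
    kappa = L / mu
    return contraction_factor(mu, L, eta=mu / (12 * L**2), M=20 * kappa**2, xi=mu / 8)

# Evaluate the factor for a few condition numbers (mu fixed to 1).
factors = {kappa: corollary1_factor(1.0, float(kappa)) for kappa in (1, 10, 100)}
```

Under this reading, the second term equals exactly $\frac{1}{2}$ for the Corollary 1 parameters, and the first term stays below $e^{-25/18} \approx 0.25$, so each outer iteration shrinks the expected squared distance by a constant factor that does not depend on $\kappa$.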
Hence, $M = O(\kappa^2)$, and our assumption is weaker than that of proximal SVRG (Xiao & Zhang, 2014), whose computation complexity is $O\big((n + \kappa)\log(\frac{1}{\epsilon})\big)$ when $p = 1$. Please note that proximal SVRG (Xiao & Zhang, 2014) is a centralized method. In addition, without the convexity assumption for each $f_i(\cdot)$, our result for the degenerate case $p = 1$ is consistent with that in (Shalev-Shwartz, 2016).

6. Handle High Dimensional Sparse Data

For the cases with high dimensional sparse data, we can exploit the sparsity to speed up the training procedure. Here, we adopt the widely used elastic net (Zou & Hastie, 2005) as an example for illustration. Elastic net can be formulated as follows:

$$\min_{w} P(w) := \frac{1}{n}\sum_{i=1}^{n} h_i(x_i^T w) + \frac{\lambda_1}{2}\|w\|^2 + \lambda_2\|w\|_1,$$

where $h_i: \mathbb{R} \to \mathbb{R}$ is a convex function. With $f_i(w) = h_i(x_i^T w) + \frac{\lambda_1}{2}\|w\|^2$ and $R(w) = \lambda_2\|w\|_1$, the elastic net can be rewritten in the form of (1). We assume many instances in $\{x_i \in \mathbb{R}^d \mid i \in [n]\}$ are sparse vectors and let $C_i = \{j \mid x_i^{(j)} \neq 0\}$, which denotes the nonzero dimensions of instance $i$.

For elastic net, the update rule in (5) can be written as:

$$u_{k,m+1}^{(j)} = \begin{cases} \delta^{(j)} - \lambda_2\eta, & \text{if } \delta^{(j)} \ge \lambda_2\eta \\ 0, & \text{if } |\delta^{(j)}| < \lambda_2\eta \\ \delta^{(j)} + \lambda_2\eta, & \text{if } \delta^{(j)} \le -\lambda_2\eta \end{cases} \tag{13}$$

where $\delta^{(j)} = u_{k,m}^{(j)} - \eta v_{k,m}^{(j)}$, and $u_{k,m}^{(j)}$ and $v_{k,m}^{(j)}$ are the $j$th coordinate values of $u_{k,m}$ and $v_{k,m}$ respectively. This is unacceptable when the data dimensionality $d$ is too large, since we need to execute the conditional statements in (13) $O(Md)$ times, which is time consuming.

The update in (5) consists of the calculation of $v_{k,m}$ and the proximal mapping. The calculation of $v_{k,m}$ for elastic net can be written as:

$$v_{k,m} = \lambda_1 u_{k,m} + h_s'(x_s^T u_{k,m})x_s - h_s'(x_s^T w_t)x_s + z,$$

where $s = i_{k,m}$ for short. The proximal mapping has two cases:

- If $j \in C_s$, then the update is the same as that in (13).
- If $j \notin C_s$, $\delta^{(j)}$ can be rewritten as

$$\delta^{(j)} = (1 - \lambda_1\eta)u_{k,m}^{(j)} - \eta z^{(j)}. \tag{14}$$

Since $z^{(j)}$ is a constant during the update of the local parameter $u_{k,m}$, we will design a recovery strategy to recover it when necessary.
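The three-case rule in (13) is the soft-thresholding operator with threshold $\lambda_2\eta$. The sketch below compares an explicit coordinate-by-coordinate implementation with a vectorized one; the function names, the test vector `delta`, and the threshold `tau` are hypothetical.

```python
import numpy as np

def update_by_cases(delta, tau):
    """Coordinate-wise three-case rule in the style of (13), with tau = lambda_2 * eta."""
    out = np.zeros_like(delta)
    for j, dj in enumerate(delta):
        if dj >= tau:
            out[j] = dj - tau
        elif dj <= -tau:
            out[j] = dj + tau
        # otherwise |dj| < tau and the coordinate stays at 0
    return out

def update_vectorized(delta, tau):
    """The same mapping written as sign(delta) * max(|delta| - tau, 0)."""
    return np.sign(delta) * np.maximum(np.abs(delta) - tau, 0.0)

delta = np.array([0.30, -0.05, 0.00, -0.40, 0.10])
tau = 0.1
by_cases = update_by_cases(delta, tau)
vectorized = update_vectorized(delta, tau)
```

Either form touches all $d$ coordinates in every inner iteration, which is the $O(Md)$ cost the text refers to; this is what motivates updating only the coordinates in $C_s$ and recovering the others later.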
More specifically, in each inner iteration, with the random index $s = i_{k,m}$, we only recover $u^{(j)}$ to calculate the inner product $x_s^T u_{k,m}$ and update $u_{k,m}^{(j)}$ for $j \in C_s$. For those $j \notin C_s$, we do not immediately update $u_{k,m}^{(j)}$. The learning algorithm with recovery strategy is shown in Algorithm 2. We design recovery rules to guarantee that Algorithm 1 and Algorithm 2 are exactly equivalent. The basic idea of these recovery rules is: for some coordinate $j$, we can calculate $u_{k,m_2}^{(j)}$ directly from $u_{k,m_1}^{(j)}$, rather than doing iterations from $m = m_1$ to $m_2$. Here, $0 \le m_1 < m_2 \le M$. It will save about $O(d(m_2 - m_1)(1 - \rho))$ executions of conditional statements, where $\rho$ is the sparsity of $\{x_i \in \mathbb{R}^d \mid i \in [n]\}$. This reduction of computation is significant, especially for high dimensional sparse training data. Due to space limitation, these recovery rules are moved to the Appendix in the supplementary material.

Algorithm 2 Efficient Learning for Sparse Data
 1: Task of the $k$th worker:
 2: for $t = 0, 1, 2, \ldots, T-1$ do
 3:   Wait until receiving $w_t$ from master;
 4:   Let $u_{k,0} = w_t$, calculate $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$ and send $z_k$ to master;
 5:   Wait until receiving $z$ from master;
 6:   Set $r = (0, 0, \ldots, 0) \in \mathbb{R}^d$;
 7:   for $m = 0, 1, 2, \ldots, M-1$ do
 8:     Randomly choose an instance index $s$ and denote $C_s = \{j \mid x_s^{(j)} \neq 0\}$;
 9:     $\forall j \in C_s$, recover $u_{k,m}^{(j)}$ according to the recovery rules with $m_1 = r^{(j)}$, $m_2 = m$;
10:     Calculate the inner products $x_s^T u_{k,m}$ and $x_s^T w_t$;
11:     for $j \in C_s$ do
12:       $v_{k,m}^{(j)} = h_s'(x_s^T u_{k,m})x_s^{(j)} - h_s'(x_s^T w_t)x_s^{(j)} + z^{(j)}$;
13:       $u_{k,m+1}^{(j)} \leftarrow \mathrm{prox}_{\lambda_2|u|,\eta}\big((1 - \eta\lambda_1)u_{k,m}^{(j)} - \eta v_{k,m}^{(j)}\big)$;
14:       $r^{(j)} = m + 1$;
15:     end for
16:   end for
17:   $\forall j \in [d]$, recover $u^{(j)}$ to get $u_{k,M}$;
18:   Send $u_{k,M}$ to master
19: end for

7. Experiment

We use two sparse learning models for evaluation. One is logistic regression (LR) with elastic net (Zou & Hastie, 2005):

$$P(w) = \frac{1}{n}\sum_{i=1}^{n}\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda_1}{2}\|w\|^2 + \lambda_2\|w\|_1.$$

The other is Lasso regression (Tibshirani, 1994):

$$P(w) = \frac{1}{2n}\sum_{i=1}^{n}(x_i^T w - y_i)^2 + \lambda_2\|w\|_1.$$

These two models are among the most representative sparse learning models. All experiments are conducted on a cluster of multiple machines. The CPU for each machine has 12 Intel E cores, and the memory of each machine is 96GB. The machines are connected by 10GB Ethernet.

7.1. Datasets

Evaluation is based on four datasets: cov, rcv1, avazu, kdd2012. All of them can be downloaded from the LibSVM website (footnote 1). The last three datasets are high-dimensional sparse data. Details about datasets and hyper-parameters are shown in Table 1.

Footnote 1: cjlin/ libsvmtools/datasets/

Table 1. Datasets

Dataset    #instances     #features     λ1     λ2
cov        581,
rcv1       677,399        47,
avazu      23,567,843     1,000,
kdd2012    ,705,032       54,686,

7.2. Baselines

We compare pSCOPE with six representative baselines: the proximal gradient descent based method FISTA (Beck & Teboulle, 2009), the ADMM type method DFAL (Aybat et al., 2015), the Newton type method mOWL-QN (Gong & Ye, 2015), the proximal SGD/SVRG based method AsyProx-SVRG (Meng et al., 2017), the proximal SDCA based method PROXCOCOA+ (Smith et al., 2015), and the distributed block coordinate descent (DBCD) method (Mahajan et al., 2017). For those baselines which are serial, like FISTA and mOWL-QN, we design distributed versions of them, in which workers compute the gradients in a distributed way and then the master gathers the gradients from the workers for the parameter update.

All methods use 8 workers. One master will be used if necessary. Unless otherwise stated, all methods except DBCD and PROXCOCOA+ use the same data partition, which is obtained by uniformly assigning each instance to a worker (uniform partition). Hence, different workers will have almost the same number of instances. This uniform partition strategy satisfies the condition in Lemma 4. Hence, it is a good partition. DBCD and PROXCOCOA+ adopt a coordinate distributed strategy to partition the data.

7.3. Results

The convergence results on LR with elastic net are shown in Figure 1. Please note that as stated in the introduction, PROXCOCOA+ cannot be used for solving LR-type problems. DBCD is too slow, and hence we report its results separately in Table 2. AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present its results on the datasets cov and rcv1. From Figure 1, we can find that pSCOPE outperforms all the other baselines on all datasets.

The convergence results on the Lasso regression are shown in Figure 2. Similarly, AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present its results on the datasets cov and rcv1.
We can see that pSCOPE also outperforms all other baselines on all datasets for the Lasso regression problem.

Figure 1. Evaluation on LR with elastic net (panels: cov, rcv1, avazu, kdd2012).

Figure 2. Evaluation on Lasso regression (panels: cov, rcv1, avazu, kdd2012).

DBCD is too slow, and hence we report its results separately in Table 2. We report the time (in seconds) of pSCOPE and DBCD on the two datasets cov and rcv1 when they get a suboptimal solution. We can find that pSCOPE is much faster than the coordinate descent method DBCD.

Table 2. Time comparison (in seconds) between pSCOPE and DBCD.

Model    Dataset    pSCOPE    DBCD
LR       cov
LR       rcv1                 > 1000
Lasso    cov
Lasso    rcv1                 >

7.4. Speedup

We also evaluate the speedup of pSCOPE on the four datasets for LR. We run pSCOPE and stop it when the gap $P(w) - P(w^*)$ falls below a fixed threshold. The speedup is defined as:

$$\text{Speedup} = \frac{\text{Time using one worker}}{\text{Time using } p \text{ workers}}.$$

We set $p = 1, 2, 4, 8$. The speedup results are in Figure 3. We can find that pSCOPE gets promising speedup.

Figure 3. Speedup of pSCOPE (x-axis: workers; y-axis: speedup; curves: ideal, cov, rcv1, avazu, kdd2012).

7.5. Effect of Data Partition

We evaluate pSCOPE under different data partitions. We use the two datasets cov and rcv1 for illustration, since they are balanced datasets, which means the number of positive instances is almost the same as that of negative instances. For each dataset, we construct four data partitions:

- $\pi^*$: Each worker has the whole data, which is the best partition according to our theory;
- $\pi_1$: Uniform partition;
- $\pi_2$: 75% of the positive instances and 25% of the negative instances are on the first 4 workers, and the other instances are on the last 4 workers;
- $\pi_3$: All positive instances are on the first 4 workers, and all negative instances are on the last 4 workers.

The convergence results are shown in Figure 4. We can see that data partition does affect the convergence of pSCOPE. The best partition $\pi^*$ does achieve the best performance, which verifies the theory in this paper. The performance of the uniform partition $\pi_1$ is similar to that of the best partition $\pi^*$, and is better than the other two data partitions. In real applications with large-scale datasets, it is impractical to assign each worker the whole dataset.
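The partitions $\pi_1$, $\pi_2$ and $\pi_3$ described in Section 7.5 can be built from the class labels as sketched below. How instances are spread among the four workers inside each half is not specified in the text; the sketch assumes a random equal-size split, and it also uses a shuffled equal-size split as a stand-in for uniform assignment. The function name `skewed_partition` and the synthetic labels are hypothetical.

```python
import numpy as np

def skewed_partition(labels, frac_pos_first, p=8, seed=0):
    """Put frac_pos_first of the positives and (1 - frac_pos_first) of the
    negatives on the first p/2 workers, and the rest on the last p/2 workers."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.flatnonzero(labels > 0))
    neg = rng.permutation(np.flatnonzero(labels <= 0))
    cut_pos = int(round(frac_pos_first * len(pos)))
    cut_neg = int(round((1.0 - frac_pos_first) * len(neg)))
    first = rng.permutation(np.concatenate([pos[:cut_pos], neg[:cut_neg]]))
    last = rng.permutation(np.concatenate([pos[cut_pos:], neg[cut_neg:]]))
    half = p // 2
    # Assumption: instances are spread evenly within each half of the workers.
    return np.array_split(first, half) + np.array_split(last, p - half)

# Balanced synthetic labels (hypothetical): 400 positive, 400 negative.
labels = np.array([1] * 400 + [-1] * 400)
pi_1 = np.array_split(np.random.default_rng(0).permutation(len(labels)), 8)
pi_2 = skewed_partition(labels, 0.75)
pi_3 = skewed_partition(labels, 1.0)
```

Each partition is a list of index arrays, one per worker, so the same training loop can be run unchanged under $\pi_1$, $\pi_2$ and $\pi_3$ to compare convergence.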
Hence, we prefer to choose the uniform partition $\pi_1$ in real applications, which is also adopted in the above experiments of this paper.

Figure 4. Effect of data partition (panels: cov, rcv1).

8. Conclusion

In this paper, we propose a novel method, called pSCOPE, for distributed sparse learning. Furthermore, we theoretically analyze how the data partition affects the convergence of pSCOPE. pSCOPE is both communication and computation efficient. Experiments on real data show that pSCOPE can outperform other state-of-the-art methods to achieve the best performance.

