arxiv: v1 [stat.ml] 15 Mar 2018
Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate

Shen-Yi Zhao, Gong-Duo Zhang, Ming-Wei Li, Wu-Jun Li

Nanjing University. Correspondence to: Shen-Yi Zhao <zhaosy@lamda.nju.edu.cn>, Gong-Duo Zhang <zhanggd@lamda.nju.edu.cn>, Ming-Wei Li <limw@lamda.nju.edu.cn>, Wu-Jun Li <liwujun@nju.edu.cn>.

Abstract

Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use $L_1$ regularization. In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with $L_1$ regularization. pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework of pSCOPE, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that pSCOPE is convergent with a linear convergence rate if the data partition is good enough. We also prove that better data partition implies faster convergence rate. Furthermore, pSCOPE is also communication efficient. Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

1. Introduction

Many machine learning models can be formulated as the following regularized empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) + R(w), \quad (1)$$

where $w$ is the parameter to learn, $f_i(w)$ is generally the loss on training instance $i$, $n$ is the number of training instances, and $R(w)$ is a regularization term.
For example, given a set of labeled instances $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the feature vector of instance $i$ and $y_i \in \{+1, -1\}$ is the class label of instance $i$, if we take $f_i(w) = (y_i - x_i^T w)^2$ and $R(w) = \lambda \|w\|_1$, it is known as lasso regression (Tibshirani, 1994). If we take $f_i(w) = \log(1 + e^{-y_i x_i^T w})$ and $R(w) = \frac{\lambda}{2}\|w\|_2^2$, it is known as logistic regression. Here, $\lambda$ is a hyper-parameter. In some cases, $f_i(w)$ can also contain regularization besides the loss. One example is the elastic net (Zou & Hastie, 2005), which will be introduced in Section 6.

Recently, sparse learning, which tries to learn a sparse model for prediction, has become a hot topic in machine learning. There are different ways to implement sparse learning (Tibshirani, 1994; Wang et al., 2017). One popular way is to use $L_1$ regularization, i.e., $R(w) = \lambda \|w\|_1$. In this paper, we focus on sparse learning with $R(w) = \lambda \|w\|_1$. Hence, in the rest of this paper, $R(w) = \lambda \|w\|_1$ unless otherwise stated.

One traditional method to solve (1) is proximal gradient descent (pGD) (Beck & Teboulle, 2009), which can be written as follows:

$$w_{t+1} = \mathrm{prox}_{R,\eta}(w_t - \eta \nabla F(w_t)), \quad (2)$$

where $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, $w_t$ is the value of $w$ at iteration $t$, $\eta$ is the learning rate, and prox is the proximal mapping defined as

$$\mathrm{prox}_{R,\eta}(u) = \arg\min_{v}\Big(R(v) + \frac{1}{2\eta}\|v - u\|^2\Big). \quad (3)$$

Recently, stochastic learning methods, including stochastic gradient descent (SGD) (Nemirovski et al., 2009), stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), and stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz & Zhang, 2013), have been proposed to speed up the learning procedure in machine learning. Inspired by the success of these stochastic learning methods, proximal stochastic methods, including proximal SGD (pSGD), proximal block coordinate descent (BCD), proximal SVRG (pSVRG) and proximal SDCA (pSDCA), have also been proposed for sparse learning in recent years (Langford et al., 2008; Duchi & Singer,
2009; Tseng, 2001; Wu & Lange, 2008; Scherrer et al., 2012; Shalev-Shwartz & Zhang, 2014; Xiao & Zhang, 2014; Shi & Liu, 2015; Byrd et al., 2016; Schmidt et al., 2017). All these proximal stochastic methods are sequential (serial) and implemented with a single thread.

The serial proximal stochastic methods may not be efficient enough for solving large-scale sparse learning problems. Furthermore, the training set might be stored distributively on a cluster of multiple machines in some applications. Hence, distributed sparse learning (Aybat et al., 2015) with a cluster of multiple machines has attracted much attention in recent years, especially for large-scale applications with high-dimensional data. In particular, researchers have recently proposed several distributed proximal stochastic methods for sparse learning.

One main branch of the distributed proximal stochastic methods includes distributed pSGD (dpSGD) (Li et al., 2016) and distributed pSVRG (dpSVRG) (Huo et al., 2016; Meng et al., 2017). Both dpSGD and dpSVRG adopt a mini-batch based strategy for distributed learning. Although both synchronous and asynchronous communication strategies can be used for these methods, most of them (Li et al., 2016; Huo et al., 2016; Meng et al., 2017) adopt the asynchronous strategy because it has a lower waiting cost than the synchronous strategy. The asynchronous version of these methods can be easily implemented with existing distributed frameworks like Parameter Server (Li et al., 2014; Xing et al., 2015). One shortcoming of these dpSGD and dpSVRG methods is that the communication cost is high. More specifically, the communication cost of each epoch is $O(n)$, where $n$ is the number of training instances.

Another branch of the distributed proximal stochastic methods is based on block coordinate descent (Bradley et al., 2011; Richtárik & Takáč, 2016; Fercoq & Richtárik, 2016; Mahajan et al., 2017).
Although in each iteration these methods update only a block of coordinates, they usually have to pass through the whole data set. For example, in a linear model, to calculate the gradient of any coordinate, they have to calculate $x_i^T w$ for $i = 1$ to $n$. Because the data are partitioned across machines, this also brings $O(n)$ communication cost in each iteration.

Another kind of distributed proximal stochastic methods is based on SDCA. One representative of this kind is PROXCOCOA+ (Smith et al., 2015). Based on the primal-dual framework, the communication cost of PROXCOCOA+ in each epoch is reduced to be constant. However, it can only solve a special problem of the form $f(Xw) + \lambda \|w\|_1$, where $X \in \mathbb{R}^{n \times d}$ is the data matrix. Many sparse learning problems, like the $L_1$ regularized logistic regression, cannot be solved by PROXCOCOA+.

In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with $L_1$ regularization. pSCOPE is a proximal generalization of the scalable composite optimization for learning (SCOPE) (Zhao et al., 2017). SCOPE cannot be used for sparse learning, while pSCOPE can. The contributions of pSCOPE are briefly summarized as follows:

- pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The CALL framework is communication efficient because there is no communication during the inner iterations of each epoch.
- pSCOPE is theoretically guaranteed to be convergent with a linear convergence rate if the data partition is good enough, and better data partition implies faster convergence rate. Hence, pSCOPE is also computation efficient.
- Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

2. Preliminary

In this paper, we use $\|\cdot\|$ to denote the $L_2$ norm $\|\cdot\|_2$, and $w^*$ to denote the optimal solution of (1). For a vector $a$, we use $a^{(j)}$ to denote the $j$th coordinate value of $a$. $[n]$ denotes the set $\{1, 2, \ldots, n\}$. For a function $h(a; b)$, we use $\nabla h(a; b)$ to denote the gradient of $h(a; b)$ with respect to (w.r.t.) the first argument $a$. Furthermore, we give the following definitions.

Definition 1. We call a function $h(\cdot)$ Lipschitz continuous if there exists a positive constant $L_0$ such that $\forall a, b$, $|h(a) - h(b)| \le L_0 \|a - b\|$.

Definition 2. We call a function $h(\cdot)$ $L$-smooth if it is differentiable and there exists a positive constant $L$ such that $\forall a, b$, $h(b) \le h(a) + \nabla h(a)^T (b - a) + \frac{L}{2}\|a - b\|^2$.

Definition 3. We call a function $h(\cdot)$ $\mu$-strongly convex ($\mu \ge 0$) if $\forall a, b$, $h(b) \ge h(a) + \zeta^T (b - a) + \frac{\mu}{2}\|a - b\|^2$, where $\zeta \in \partial h(a) = \{c \mid h(b) \ge h(a) + c^T (b - a), \forall b\}$. If $h(\cdot)$ is differentiable, then $\zeta = \nabla h(a)$. If $\mu = 0$, we also call $h(\cdot)$ convex.
Definition 4. For any $b$ and function $h(\cdot)$, the conjugate function $h^*(\cdot)$ is $h^*(b) = \sup_{a}\,(b^T a - h(a))$, which is always a convex function.

Throughout this paper, we assume that $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ is strongly convex and each $f_i(\cdot)$ is $L$-smooth. We do not assume that each $f_i(\cdot)$ is convex.

3. Proximal SCOPE

In this paper, we focus on distributed learning with one master (server) and $p$ workers in the cluster, although the algorithm and theory of this paper can also be easily extended to cases with multiple servers like the Parameter Server framework (Li et al., 2014; Xing et al., 2015). The parameter $w$ is stored in the master, and the training set $D = \{x_i, y_i\}_{i=1}^{n}$ is partitioned into $p$ parts denoted as $D_1, D_2, \ldots, D_p$. Here, $D_k$ contains a subset of instances from $D$, and $D_k$ will be assigned to the $k$th worker. $D = \bigcup_{k=1}^{p} D_k$. Based on this data partition scheme, the proximal SCOPE (pSCOPE) for distributed sparse learning is presented in Algorithm 1.

The main task of the master is to add and average vectors received from workers. Specifically, it needs to calculate the full gradient $z = \nabla F(w_t) = \frac{1}{n}\sum_{k=1}^{p} z_k$. Then it needs to calculate $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$. The main task of the workers is to update the local parameters $u_{1,m}, u_{2,m}, \ldots, u_{p,m}$, initialized with $u_{k,0} = w_t$. Specifically, after worker $k$ gets the full gradient $z$ from the master, it calculates a stochastic gradient

$$v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z, \quad (4)$$

and then updates its local parameter $u_{k,m}$ by a proximal mapping

$$u_{k,m+1} = \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m}), \quad (5)$$

where $\eta$ is the learning rate.

From Algorithm 1, we can find that pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The cooperative operation is mainly through adding and averaging in the master.
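The master and worker tasks described above can be sketched in a single process. This is a minimal simulation under stated assumptions, not the authors' implementation: it assumes the lasso loss $f_i(w) = (y_i - x_i^T w)^2$ with $R(w) = \lambda\|w\|_1$, a uniform random partition, and a heuristic step size; the names `soft_threshold`, `lasso_objective` and `pscope_lasso` are hypothetical, and the workers are run one after another instead of in parallel.

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal mapping of tau * ||.||_1: coordinate-wise shrinkage toward zero."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lasso_objective(w, X, y, lam):
    """P(w) = (1/n) sum_i (y_i - x_i^T w)^2 + lam * ||w||_1."""
    return np.mean((X @ w - y) ** 2) + lam * np.abs(w).sum()

def pscope_lasso(X, y, lam, p=4, T=30, M=None, eta=None, seed=0):
    """Single-process simulation of the master/worker loop for lasso (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumption: uniform partition, each worker gets roughly n / p instances.
    parts = np.array_split(rng.permutation(n), p)
    if M is None:
        M = max(len(idx) for idx in parts)   # one local pass per outer iteration
    if eta is None:
        # Heuristic step size: 0.1 / max_i L_i, with L_i = 2 * ||x_i||^2.
        eta = 0.1 / (2.0 * np.max(np.sum(X ** 2, axis=1)))
    w = np.zeros(d)
    for _ in range(T):
        # Workers send z_k = sum_{i in D_k} grad f_i(w_t); master forms z = (1/n) sum_k z_k.
        z = sum(2.0 * X[idx].T @ (X[idx] @ w - y[idx]) for idx in parts) / n
        local = []
        for idx in parts:                    # each worker, no communication inside
            u = w.copy()
            for _ in range(M):
                i = rng.choice(idx)
                xi = X[i]
                # Eq. (4): v = grad f_i(u) - grad f_i(w_t) + z
                v = 2.0 * (xi @ u - y[i]) * xi - 2.0 * (xi @ w - y[i]) * xi + z
                # Eq. (5): proximal step for R(w) = lam * ||w||_1
                u = soft_threshold(u - eta * v, eta * lam)
            local.append(u)
        w = np.mean(local, axis=0)           # master averages u_{k,M}
    return w
```

Only two vectors per worker cross the simulated network in each outer iteration (`z_k` and the final local iterate), which is the communication pattern the CALL framework relies on.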
During the autonomous local learning procedure in each outer iteration, which contains $M$ inner iterations (see Algorithm 1), there is no communication. Hence, the communication cost for each epoch of pSCOPE is constant, which is much less than that of the mini-batch based strategy with $O(n)$ communication cost for each epoch (Li et al., 2016; Huo et al., 2016; Meng et al., 2017).

Algorithm 1 Proximal SCOPE (pSCOPE)
Initialize $w_0$ and the learning rate $\eta$;
Task of master:
for $t = 0, 1, 2, \ldots, T-1$ do
    Send $w_t$ to each worker;
    Wait until receiving $z_1, z_2, \ldots, z_p$ from all workers;
    Calculate the full gradient $z = \frac{1}{n}\sum_{k=1}^{p} z_k$ and send $z$ to each worker;
    Wait until receiving $u_{1,M}, u_{2,M}, \ldots, u_{p,M}$ from all workers and calculate $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$;
end for
Task of the $k$th worker:
for $t = 0, 1, 2, \ldots, T-1$ do
    Wait until receiving $w_t$ from master;
    Let $u_{k,0} = w_t$, calculate $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$ and send $z_k$ to master;
    Wait until receiving $z$ from master;
    for $m = 0, 1, 2, \ldots, M-1$ do
        Randomly choose an instance $x_{i_{k,m}} \in D_k$;
        Calculate $v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z$;
        Update $u_{k,m+1} = \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m})$;
    end for
    Send $u_{k,M}$ to master
end for

pSCOPE is a proximal generalization of SCOPE (Zhao et al., 2017). The main differences between them are listed as follows:

- Although pSCOPE is mainly motivated by sparse learning with $L_1$ regularization, the algorithm and theory of pSCOPE can be used for both $L_1$ and $L_2$ regularization (refer to Theorem 2). On the contrary, SCOPE cannot be used for sparse learning with $L_1$ regularization.
- The update rule of SCOPE for the inner iterations is $u_{k,m+1} = u_{k,m} - \eta\,[v_{k,m} + c(u_{k,m} - w_t)]$, where $c > L - \mu > 0$ is a hyper-parameter (Zhao et al., 2017). According to the theory in SCOPE, the positive number $c$ is necessary for SCOPE to converge, and the convergence rate becomes slower as $c$ becomes larger.
If an $L_2$ regularization has been integrated into $f_i(\cdot)$ and there is no $R(w)$ in (1), the problem in (1) is exactly the problem solved by SCOPE. In this special case, the update rule in (5) of pSCOPE degenerates to $u_{k,m+1} = u_{k,m} - \eta v_{k,m}$. Hence, for the problem with only $L_2$ regularization, the update rule of pSCOPE is similar to that of SCOPE but with $c = 0$. This does not mean the theory in SCOPE is wrong. If we do not put any constraint on the data partition strategy, $c$ cannot be 0 if convergence is to be guaranteed. This is exactly what SCOPE has proved. However, if we can design some good data partition strategy, the algorithm can still be convergent even if $c = 0$. This
is one of the main contributions of this paper. This content will be introduced in the following section.

4. Effect of Data Partition

In our experiments, we find that the data partition affects the convergence of the learning procedure. Hence, in this section we propose a metric to measure the goodness of a data partition, based on which the convergence of pSCOPE can be theoretically proved. Due to space limitation, the detailed proofs of the lemmas and theorems are moved to the supplementary material.

4.1. Partition

First, we give the following definition:

Definition 5. Define $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$. We call $\pi$ a partition w.r.t. $P(\cdot)$ if $F(w) = \frac{1}{p}\sum_{k=1}^{p} \phi_k(w)$ and each $\phi_k(\cdot)$ $(k = 1, \ldots, p)$ is $\mu_k$-strongly convex and $L_k$-smooth ($\mu_k, L_k > 0$). Here, $P(\cdot)$ and $F(\cdot)$ are defined in (1). We denote $\mathcal{A}(P) = \{\pi \mid \pi \text{ is a partition w.r.t. } P(\cdot)\}$.

Remark 1. Here, $\pi$ is an ordered sequence of functions. In particular, if we construct another partition $\pi'$ by permuting the $\phi_i(\cdot)$ of $\pi$, we consider them to be two different partitions. Furthermore, two functions $\phi_i(\cdot), \phi_j(\cdot)$ $(i \ne j)$ in $\pi$ can be the same. Two partitions $\pi_1 = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ and $\pi_2 = [\psi_1(\cdot), \ldots, \psi_p(\cdot)]$ are considered to be equal, i.e., $\pi_1 = \pi_2$, if and only if $\phi_k(w) = \psi_k(w)$ $(k = 1, \ldots, p)$, $\forall w$.

For any partition $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ w.r.t. $P(\cdot)$, we construct new functions by introducing a new parameter $a$:

$$\phi_k(w; a) = \phi_k(w) + G_k(a)^T w, \quad (6)$$
$$P_k(w; a) = \phi_k(w; a) + R(w), \quad (7)$$

where

$$G_k(a) = \nabla F(a) - \nabla \phi_k(a). \quad (8)$$

It is easy to get $P(w) = \frac{1}{p}\sum_{k=1}^{p} P_k(w; a)$, $\forall w, a$, since $\frac{1}{p}\sum_{k=1}^{p} G_k(a) = 0$. We call such a $P_k(w; a)$ the local objective function w.r.t. $\pi$.

In particular, given a data partition $D_1, D_2, \ldots, D_p$ of the training set $D$, let $F_k(w) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w)$, which is also called the local loss function. Assume each $F_k(\cdot)$ is strongly convex and smooth, and $F(w) = \frac{1}{p}\sum_{k=1}^{p} F_k(w)$. Then, we can find that $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is a partition w.r.t. $P(\cdot)$.
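The identity $P(w) = \frac{1}{p}\sum_{k} P_k(w; a)$ stated above follows in two short steps from (6), (7), (8) and the partition condition $F = \frac{1}{p}\sum_k \phi_k$ of Definition 5. The following worked derivation spells them out; it uses only symbols already defined in this section.

```latex
% Step 1: the correction terms G_k(a) average to zero.
\frac{1}{p}\sum_{k=1}^{p} G_k(a)
  = \nabla F(a) - \frac{1}{p}\sum_{k=1}^{p} \nabla \phi_k(a)
  = \nabla F(a) - \nabla F(a)
  = 0 .

% Step 2: averaging the local objectives recovers the global objective.
\frac{1}{p}\sum_{k=1}^{p} P_k(w; a)
  = \frac{1}{p}\sum_{k=1}^{p} \phi_k(w)
    + \Big(\frac{1}{p}\sum_{k=1}^{p} G_k(a)\Big)^{T} w
    + R(w)
  = F(w) + 0 + R(w)
  = P(w) .
```

The linear term $G_k(a)^T w$ therefore changes each local problem without changing their average, for every choice of $a$.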
Moreover, let $f_i(w; a) = f_i(w) + G_k(a)^T w$, $i \in D_k$. According to (6) and (7), we can find some interesting properties of $F_k(w; a) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w; a)$ and $P_k(w; a)$ as follows:

- $P_k(w; w_t) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w; w_t) + R(w)$.
- The full gradient of $F_k(w; w_t)$ at $w = w_t$ is $\nabla F_k(w_t; w_t) = \nabla F_k(w_t) + G_k(w_t) = \nabla F(w_t)$.
- $v_{k,m}$ defined in (4) can be rewritten as $v_{k,m} = \nabla f_i(u_{k,m}; w_t) - \nabla f_i(w_t; w_t) + \nabla F_k(w_t; w_t)$.

According to the theory in (Xiao & Zhang, 2014), in the inner iterations of pSCOPE each worker actually tries to optimize the local objective function $P_k(w; w_t)$ using proximal SVRG with initialization $w = w_t$ and training data $D_k$, rather than optimizing $F_k(w) + R(w)$.

4.2. Qualified Partition

For a partition $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ w.r.t. $P(\cdot)$, each worker tries to optimize the local objective function $P_k(w; a)$ based on the data assigned to that worker. In general, the data distribution on each worker is different from the distribution of the whole training set. Hence, there exists a gap between each local optimal value and the global optimal value. Intuitively, the whole learning algorithm converges slowly, or may not converge at all, if this gap is too large. Here, we use qualified partition to denote a partition whose gap is not too large.

Given a specific $a$, the local optimum is denoted as follows:

$$w_k^*(a) = \arg\min_{w} P_k(w; a). \quad (9)$$

Definition 6. For any partition $\pi$ w.r.t. $P(\cdot)$, we define the Local-Global Gap as

$$l_\pi(a) = P(w^*) - \frac{1}{p}\sum_{k=1}^{p} P_k(w_k^*(a); a).$$

Definition 7. We call $\pi$ a qualified partition w.r.t. $P(\cdot)$ if there is a constant $\gamma < \infty$, which is independent of $a$, such that $l_\pi(a) \le \gamma \|a - w^*\|^2$, $\forall a$. We denote $\mathcal{Q}(P) = \{\pi \mid \pi \text{ is a qualified partition w.r.t. } P(\cdot)\}$.

Then, we have the following lemmas:

Lemma 1. $\forall \pi \in \mathcal{A}(P)$, $l_\pi(a) \ge l_\pi(w^*) = 0$, $\forall a$.

Lemma 2. $\forall \pi \in \mathcal{A}(P)$,

$$l_\pi(a) = P(w^*) + \frac{1}{p}\sum_{k=1}^{p} H_k^*(-G_k(a)),$$

where $H_k^*(\cdot)$ is the conjugate function of $\phi_k(\cdot) + R(\cdot)$.
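One way to see why the conjugate function appears in Lemma 2 is to rewrite the local minimum value with Definition 4. The sketch below is a short consistency check against Definition 6, (7) and (9), not the proof given in the supplementary material; it writes $H_k = \phi_k + R$ as in the lemma.

```latex
% Local optimal value expressed through the conjugate of H_k = \phi_k + R:
P_k(w_k^*(a); a)
  = \min_{w}\,\big\{ H_k(w) + G_k(a)^{T} w \big\}
  = -\sup_{w}\,\big\{ (-G_k(a))^{T} w - H_k(w) \big\}
  = -H_k^{*}(-G_k(a)) .

% Substituting into Definition 6 gives the statement of Lemma 2:
l_\pi(a)
  = P(w^*) - \frac{1}{p}\sum_{k=1}^{p} P_k(w_k^*(a); a)
  = P(w^*) + \frac{1}{p}\sum_{k=1}^{p} H_k^{*}(-G_k(a)) .
```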
Without loss of generality, we set $\lambda = 1$ and prove that if $R(w) = \|w\|_1$, then $\mathcal{A}(P) = \mathcal{Q}(P)$, which means that any partition $\pi$ is a qualified partition if $R(w) = \|w\|_1$. In particular, we have the following theorems.
Theorem 1. Let $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ be a partition w.r.t. $P(\cdot)$ and let $R(w) = \|w\|_1$. Then, $\pi$ is a qualified partition.

The result in Theorem 1 can be easily extended to smooth regularization.

Theorem 2. Let $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ be a partition w.r.t. $P(\cdot)$. If $R(\cdot)$ is smooth, then $\pi$ is a qualified partition.

4.3. Good Partition

For a qualified partition, the local-global gap can be bounded by $\gamma \|a - w^*\|^2$. Given a specific $a$, the smaller $\gamma$ is, the smaller the local-global gap will be. According to Definition 7, the constant $\gamma$ only depends on the partition $\pi$. Intuitively, $\gamma$ can be used to evaluate the goodness of a partition $\pi$. We define a good partition as follows:

Definition 8. We call $\pi$ an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$ if it is a qualified partition and

$$\gamma(\pi; \epsilon) \triangleq \sup_{\|a - w^*\|^2 \ge \epsilon} \frac{l_\pi(a)}{\|a - w^*\|^2} \le \xi. \quad (10)$$

Obviously, we wish $\gamma(\pi; \epsilon)$ to be small for a good partition. Consider the partition $\pi^* = [F(\cdot), \ldots, F(\cdot)]$, where $F(\cdot) = \frac{1}{n}\sum_{i=1}^{n} f_i(\cdot)$ is defined in (1). We can find that $\pi^*$ is the best partition since $l_{\pi^*}(a) = 0$, $\forall a$, which implies $\gamma(\pi^*; 0) = 0$. In the following content, we will prove that $\gamma(\pi; \epsilon)$ is continuous w.r.t. $\pi$, and $\gamma(\pi; \epsilon) \to 0$ as $\pi$ approaches $\pi^*$.

We first define the distance between two partitions as follows:

Definition 9. Let $\pi_1 = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ and $\pi_2 = [\psi_1(\cdot), \ldots, \psi_p(\cdot)]$ be two partitions w.r.t. $P(\cdot)$. The distance between $\pi_1$ and $\pi_2$ is

$$d(\pi_1, \pi_2) = \max_{i=1,\ldots,p}\Big\{\sup_{w} |\phi_i(w) - \psi_i(w)|,\ \sup_{w} \|\nabla\phi_i(w) - \nabla\psi_i(w)\|\Big\}.$$

We can prove that this is a reasonable definition of distance. For convenience, we assume the domain of these $\phi_k(\cdot)$ and $\psi_k(\cdot)$ is closed and bounded, which is denoted as $\mathcal{W} = \{w \mid \|w\| \le B\}$, and $w^* \in \mathcal{W}$. Hence, $d(\pi_1, \pi_2) < \infty$. With the definition of distance between two partitions, we have

Lemma 3. Let $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ be a partition w.r.t. $P(\cdot)$. $l_\pi(a)$ uniformly converges to $l_{\pi^*}(a) = 0$ as $d(\pi, \pi^*) \to 0$.

Besides the uniform convergence of $l_\pi(a)$ w.r.t. $\pi$, we also prefer the uniform convergence of $\gamma(\pi; \epsilon)$ w.r.t.
$\pi$ since it reflects the goodness of partition $\pi$. In the appendix of the supplementary material, we give the condition for convergence of $\gamma(\pi; 0)$. However, it needs higher order information, which is not common in first order optimization. In the following, we give a bound of $\gamma(\pi; \epsilon)$ which is more practical.

Lemma 4. Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is a partition w.r.t. $P(\cdot)$, where $F_k(w) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w)$ is the local loss function, and each $f_i(\cdot)$ is Lipschitz continuous and sampled from some unknown distribution $\mathcal{P}$. If we assign these $\{f_i(\cdot)\}$ uniformly to each worker, then with high probability,

$$\gamma(\pi; \epsilon) \le \frac{1}{p}\sum_{k=1}^{p} O\Big(\frac{1}{\epsilon\sqrt{|D_k|}}\Big).$$

Moreover, if $l_\pi(w)$ is convex w.r.t. $w$, then

$$\gamma(\pi; \epsilon) \le \frac{1}{p}\sum_{k=1}^{p} O\Big(\frac{1}{\sqrt{\epsilon |D_k|}}\Big).$$

Here we ignore the log term and the dimensionality $d$.

It implies that as long as the size of the training data is large enough, $\gamma(\pi; \epsilon)$ will be small and $\pi$ will be a good partition. Note that "uniformly" here means that each $f_i(\cdot)$ is assigned to exactly one of the $p$ workers, and every worker is equally likely to receive it. We call the partition resulting from uniform assignment the uniform partition in this paper. With the uniform partition, each worker will have almost the same number of instances. As long as the size of the training data is large enough, the uniform partition is a good partition.

5. Convergence of Proximal SCOPE

In this section, we will prove the convergence of Algorithm 1 for proximal SCOPE (pSCOPE) using the results in Section 4. For convenience, we rewrite the local update rule as follows:

$$u_{k,m+1} = u_{k,m} - \eta g_{k,m}, \quad (11)$$

where

$$g_{k,m} = \frac{1}{\eta}\big(u_{k,m} - \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m})\big). \quad (12)$$

Then we have the following theorems about the convergence of pSCOPE.

Theorem 3. Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$. $F_k(\cdot)$ is $\mu$-strongly convex
6 (µ > 0) and L-smooth. R( ) is convex. If w t w 2 ɛ, then E w t+1 w 2 [(1 µη + 2L 2 η 2 ) M + 2L2 η + 2ξ µ + 2L 2 η ] w t w 2. By setting appropriate η and M, we can always make [(1 µη + 2L 2 η 2 ) M + 2L2 η+2ξ µ+2l 2 η ] < 1. Hence, if the partition π = [F 1 ( ),..., F p ( )] is a (ɛ, ξ)-good partition w.r.t. P ( ), is convergent with a linear convergence rate. Furthermore, the smaller ξ is, the faster convergence rate will have. Because smaller ξ means better partition and the partition π corresponds to data partition in Algorithm 1, we can see that better data partition implies faster convergence rate. Corollary 1. Assume π = [F 1 ( ),..., F p ( )] is a (ɛ, µ 8 )- good partition w.r.t. P ( ). F k ( ) is µ-strongly convex (µ > 0) and L-smooth. If w t w 2 ɛ, taking η = µ 12L, 2 M = 20κ 2, where κ = L µ is the conditional number, then we have E w t+1 w w t w 2. To get the ɛ-suboptimal solution, the computation complexity of each worker is O((n/p + κ 2 ) log( 1 ɛ )). Corollary 2. When p = 1, which means we only use one worker, degenerates to proximal SVRG (Xiao & Zhang, 2014). Assume F ( ) is µ-strongly convex (µ > 0) and L-smooth. Taking η = µ 6L 2, M = 13κ 2, we have E w t+1 w w t w 2. To get the ɛ-optimal solution, the computation complexity is O((n + κ 2 ) log( 1 ɛ )). Because has linear convergence rate if the partition is (ɛ, ξ)-good, we need T = O(log( 1 ɛ )) outer iterations to get a ɛ-optimal solution. For all inner iterations, each worker updates u k,m without any communication. Hence, the communication cost is O(log( 1 ɛ )), which is much small than the mini-batch based strategy with O(n) communication cost for each epoch (Li et al., 2016; Huo et al., 2016; Meng et al., 2017).. Furthermore, in the above theorems and corollaries, we only assume that the local loss function F k ( ) is convex. We do not need each f i ( ) to be convex. 
Hence, we need $M = O(\kappa^2)$, and our assumption is weaker than the one in proximal SVRG (Xiao & Zhang, 2014), whose computation complexity is $O((n + \kappa)\log(\frac{1}{\epsilon}))$ when $p = 1$. Please note that proximal SVRG (Xiao & Zhang, 2014) is a centralized method. In addition, without the convexity assumption for each $f_i(\cdot)$, our result for the degenerate case $p = 1$ is consistent with that in (Shalev-Shwartz, 2016).

6. Handle High Dimensional Sparse Data

For the cases with high dimensional sparse data, we can exploit the sparsity to speed up the training procedure. Here, we adopt the widely used elastic net (Zou & Hastie, 2005) as an example for illustration. Elastic net can be formulated as follows:

$$\min_{w} P(w) := \frac{1}{n}\sum_{i=1}^{n} h_i(x_i^T w) + \frac{\lambda_1}{2}\|w\|^2 + \lambda_2 \|w\|_1,$$

where $h_i : \mathbb{R} \to \mathbb{R}$ is a convex function. With $f_i(w) = h_i(x_i^T w) + \frac{\lambda_1}{2}\|w\|^2$ and $R(w) = \lambda_2 \|w\|_1$, the elastic net can be rewritten in the form of (1). We assume many instances in $\{x_i \in \mathbb{R}^d \mid i \in [n]\}$ are sparse vectors and let $C_i = \{j \mid x_i^{(j)} \ne 0\}$, which denotes the nonzero dimensions of instance $i$.

For elastic net, the update rule in (5) can be written as:

$$u_{k,m+1}^{(j)} = \begin{cases} \delta^{(j)} - \lambda_2\eta, & \text{if } \delta^{(j)} \ge \lambda_2\eta \\ 0, & \text{if } |\delta^{(j)}| < \lambda_2\eta \\ \delta^{(j)} + \lambda_2\eta, & \text{if } \delta^{(j)} \le -\lambda_2\eta \end{cases} \quad (13)$$

where $\delta^{(j)} = u_{k,m}^{(j)} - \eta v_{k,m}^{(j)}$, and $u_{k,m}^{(j)}$ and $v_{k,m}^{(j)}$ are the $j$th coordinate values of $u_{k,m}$ and $v_{k,m}$ respectively. This is unacceptable when the data dimensionality $d$ is too large, since we need to execute the conditional statements in (13) $O(Md)$ times, which is time consuming.

The update in (5) consists of the calculation of $v_{k,m}$ and the proximal mapping. The calculation of $v_{k,m}$ for elastic net can be written as:

$$v_{k,m} = \lambda_1 u_{k,m} + h_s'(x_s^T u_{k,m})\,x_s - h_s'(x_s^T w_t)\,x_s + z,$$

where $s = i_{k,m}$ for short. The proximal mapping has two cases:

- if $j \in C_s$, then the update is the same as that in (13);
- if $j \notin C_s$, $\delta^{(j)}$ can be rewritten as

$$\delta^{(j)} = (1 - \lambda_1\eta)\,u_{k,m}^{(j)} - \eta z^{(j)}. \quad (14)$$

Since $z^{(j)}$ is a constant during the update of the local parameter $u_{k,m}$, we will design a recovery strategy to recover $u_{k,m}^{(j)}$ when necessary.
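The two cases of the proximal mapping above can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the scalar derivative `h_prime_squared` (for $h(a) = \frac{1}{2}(a - 1)^2$) is an assumed stand-in for $h_s'$. It implements the dense step of (13) and the off-support shortcut of (14); it does not implement the multi-step recovery rules, which the paper defers to its supplementary material.

```python
import numpy as np

def soft_threshold(delta, tau):
    """Eq. (13): shrink each coordinate of delta toward zero by tau = lam2 * eta."""
    return np.sign(delta) * np.maximum(np.abs(delta) - tau, 0.0)

def h_prime_squared(a):
    """Assumed example loss derivative: h(a) = 0.5 * (a - 1)^2, so h'(a) = a - 1."""
    return a - 1.0

def dense_step(u, w_t, z, x_s, eta, lam1, lam2, h_prime=h_prime_squared):
    """One inner iteration over all d coordinates (the O(d) update)."""
    v = lam1 * u + (h_prime(x_s @ u) - h_prime(x_s @ w_t)) * x_s + z
    return soft_threshold(u - eta * v, lam2 * eta)

def off_support_step(u_j, z_j, eta, lam1, lam2):
    """Eq. (14): for coordinates with x_s^(j) = 0, delta depends only on u^(j) and z^(j)."""
    delta = (1.0 - lam1 * eta) * u_j - eta * z_j
    return soft_threshold(delta, lam2 * eta)
```

Because the off-support update never touches $x_s$, a coordinate outside $C_s$ evolves by the same scalar map in every inner iteration until it next appears in a sampled instance, which is what makes a lazy update possible.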
More specifically, in each inner iteration, with the random index $s = i_{k,m}$, we only recover $u^{(j)}$ to calculate the inner product $x_s^T u_{k,m}$ and update $u_{k,m}^{(j)}$ for $j \in C_s$. For those $j \notin C_s$, we do not immediately update $u_{k,m}^{(j)}$. The learning algorithm with the recovery strategy is shown in Algorithm 2. We design recovery rules to guarantee that Algorithm 1 and Algorithm 2 are exactly equivalent. The basic idea of these recovery rules is: for some coordinate $j$, we can calculate $u_{k,m_2}^{(j)}$ directly from $u_{k,m_1}^{(j)}$, rather than doing iterations from $m = m_1$ to $m_2$. Here, $0 \le m_1 < m_2 \le M$. It will save about $O(d(m_2 - m_1)(1 - \rho))$ executions of the conditional statements,
where $\rho$ is the sparsity of $\{x_i \in \mathbb{R}^d \mid i \in [n]\}$. This reduction of computation is significant, especially for high dimensional sparse training data. Due to space limitation, these recovery rules are moved to the Appendix in the supplementary material.

Algorithm 2 Efficient Learning for Sparse Data
Task of the $k$th worker:
for $t = 0, 1, 2, \ldots, T-1$ do
    Wait until receiving $w_t$ from master;
    Let $u_{k,0} = w_t$, calculate $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$ and send $z_k$ to master;
    Wait until receiving $z$ from master;
    Set $r = (0, 0, \ldots, 0) \in \mathbb{R}^d$;
    for $m = 0, 1, 2, \ldots, M-1$ do
        Randomly choose an instance index $s$ and denote $C_s = \{j \mid x_s^{(j)} \ne 0\}$;
        $\forall j \in C_s$, recover $u_{k,m}^{(j)}$ according to the recovery rules with $m_1 = r^{(j)}$, $m_2 = m$;
        Calculate the inner products $x_s^T u_{k,m}$ and $x_s^T w_t$;
        for $j \in C_s$ do
            $v_{k,m}^{(j)} = h_s'(x_s^T u_{k,m})\,x_s^{(j)} - h_s'(x_s^T w_t)\,x_s^{(j)} + z^{(j)}$;
            $u_{k,m}^{(j)} \leftarrow \mathrm{prox}_{\lambda_2 |u|,\eta}\big((1 - \eta\lambda_1)\,u_{k,m}^{(j)} - \eta v_{k,m}^{(j)}\big)$;
            $r^{(j)} = m + 1$;
        end for
    end for
    $\forall j \in [d]$, recover $u^{(j)}$ to get $u_{k,M}$;
    Send $u_{k,M}$ to master
end for

7. Experiment

We use two sparse learning models for evaluation. One is logistic regression (LR) with elastic net (Zou & Hastie, 2005):

$$P(w) = \frac{1}{n}\sum_{i=1}^{n} \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda_1}{2}\|w\|^2 + \lambda_2 \|w\|_1.$$

The other is Lasso regression (Tibshirani, 1994):

$$P(w) = \frac{1}{2n}\sum_{i=1}^{n} (x_i^T w - y_i)^2 + \lambda_2 \|w\|_1.$$

These two models are among the most representative sparse learning models. All experiments are conducted on a cluster of multiple machines. The CPU for each machine has 12 Intel E cores, and the memory of each machine is 96GB. The machines are connected by 10GB Ethernet.

7.1. Datasets

Evaluation is based on four datasets: cov, rcv1, avazu, kdd2012. All of them can be downloaded from the LibSVM website (cjlin/libsvmtools/datasets/). The last three datasets are high-dimensional sparse data. Details about the datasets and hyper-parameters are shown in Table 1.

Table 1. Datasets
            #instances    #features    λ1    λ2
cov         581,
rcv1        677,399       47,
avazu       23,567,843    1,000,
kdd2012     ,705,032      54,686,

7.2. Baselines

We compare pSCOPE with six representative baselines: the proximal gradient descent based method FISTA (Beck & Teboulle, 2009), the ADMM type method DFAL (Aybat et al., 2015), the Newton type method mOWL-QN (Gong & Ye, 2015), the proximal SGD/SVRG based method AsyProx-SVRG (Meng et al., 2017), the proximal SDCA based method PROXCOCOA+ (Smith et al., 2015), and the distributed block coordinate descent (DBCD) method (Mahajan et al., 2017). For the baselines that are serial, such as FISTA and mOWL-QN, we design distributed versions in which the workers compute the gradients distributively and the master then gathers the gradients from the workers for the parameter update.

All methods use 8 workers. One master will be used if necessary. Unless otherwise stated, all methods except DBCD and PROXCOCOA+ use the same data partition, which is obtained by uniformly assigning each instance to one of the workers (uniform partition). Hence, different workers will have almost the same number of instances. This uniform partition strategy satisfies the condition in Lemma 4. Hence, it is a good partition. DBCD and PROXCOCOA+ adopt a coordinate distributed strategy to partition the data.

7.3. Results

The convergence results on LR with elastic net are shown in Figure 1. Please note that, as stated in the introduction, PROXCOCOA+ cannot be used for solving LR-type problems. DBCD is too slow, and hence we report its results separately in Table 2. AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present its results on the datasets cov and rcv1. From Figure 1, we can find that pSCOPE outperforms all the other baselines on all datasets.

The convergence results on the Lasso regression are shown in Figure 2. Similarly, AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present its results on the datasets cov and rcv1.
We can see that pSCOPE also outperforms all other baselines on all datasets for the Lasso regression problem.

Figure 1. Evaluation on LR with elastic net (panels: cov, rcv1, avazu, kdd2012).

Figure 2. Evaluation on Lasso regression (panels: cov, rcv1, avazu, kdd2012).

Because DBCD is too slow, we report its results separately in Table 2, which gives the time (in seconds) that pSCOPE and DBCD need to reach a suboptimal solution on the two datasets cov and rcv1. We can find that pSCOPE is much faster than the coordinate descent method DBCD.

Table 2. Time comparison (in seconds) between pSCOPE and DBCD.
Model    Dataset    pSCOPE    DBCD
LR       cov
LR       rcv1                 > 1000
Lasso    cov
Lasso    rcv1                 >

7.4. Speedup

We also evaluate the speedup of pSCOPE on the four datasets for LR. We run pSCOPE and stop it when the gap $P(w) - P(w^*)$ falls below a fixed threshold. The speedup is defined as:

$$\text{Speedup} = \frac{\text{Time using one worker}}{\text{Time using } p \text{ workers}}.$$

We set $p = 1, 2, 4, 8$. The speedup results are in Figure 3. We can find that pSCOPE gets promising speedup.

Figure 3. Speedup of pSCOPE (x-axis: workers; y-axis: speedup; curves: ideal, cov, rcv1, avazu, kdd2012).

7.5. Effect of Data Partition

We evaluate pSCOPE under different data partitions. We use the two datasets cov and rcv1 for illustration, since they are balanced datasets, which means the number of positive instances is almost the same as that of negative instances. For each dataset, we construct four data partitions:

- $\pi^*$: Each worker has the whole data, which is the best partition according to our theory;
- $\pi_1$: Uniform partition;
- $\pi_2$: 75% of the positive instances and 25% of the negative instances are on the first 4 workers, and the other instances are on the last 4 workers;
- $\pi_3$: All positive instances are on the first 4 workers, and all negative instances are on the last 4 workers.

The convergence results are shown in Figure 4. We can see that the data partition does affect the convergence of pSCOPE. The best partition $\pi^*$ does achieve the best performance, which verifies the theory in this paper. The performance of the uniform partition $\pi_1$ is similar to that of the best partition $\pi^*$, and is better than that of the other two data partitions. In real applications with large-scale datasets, it is impractical to assign each worker the whole dataset.
Hence, we prefer to choose the uniform partition $\pi_1$ in real applications, which is also adopted in the above experiments of this paper.

Figure 4. Effect of data partition (panels: cov, rcv1).

8. Conclusion

In this paper, we propose a novel method, called pSCOPE, for distributed sparse learning. Furthermore, we theoretically analyze how the data partition affects the convergence of pSCOPE. pSCOPE is both communication and computation efficient. Experiments on real data show that pSCOPE can outperform other state-of-the-art methods to achieve the best performance.
References

Aybat, Necdet S., Wang, Zi, and Iyengar, Garud. An asynchronous distributed proximal gradient method for composite convex optimization. In Proceedings of the 32nd International Conference on Machine Learning.

Beck, Amir and Teboulle, Marc. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1).

Bradley, Joseph K., Kyrola, Aapo, Bickson, Danny, and Guestrin, Carlos. Parallel coordinate descent for l1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning.

Byrd, Richard H., Hansen, S. L., Nocedal, Jorge, and Singer, Yoram. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2).

Duchi, John C. and Singer, Yoram. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10.

Fercoq, Olivier and Richtárik, Peter. Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent. SIAM Review, 58(4).

Gong, Pinghua and Ye, Jieping. A modified orthant-wise limited memory quasi-Newton method with convergence analysis. In Proceedings of the 32nd International Conference on Machine Learning.

Huo, Zhouyuan, Gu, Bin, and Huang, Heng. Decoupled asynchronous proximal stochastic gradient descent with variance reduction. CoRR.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.

Langford, John, Li, Lihong, and Zhang, Tong. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems.

Li, Mu, Andersen, David G., Park, Jun Woo, Smola, Alexander J., Ahmed, Amr, Josifovski, Vanja, Long, James, Shekita, Eugene J., and Su, Bor-Yiing. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation.

Li, Yitan, Xu, Linli, Zhong, Xiaowei, and Ling, Qing. Make workers work harder: Decoupled asynchronous proximal stochastic gradient descent. CoRR.

Mahajan, Dhruv, Keerthi, S. Sathiya, and Sundararajan, S. A distributed block coordinate descent method for training l1 regularized linear classifiers. Journal of Machine Learning Research, 18:91:1-91:35.

Meng, Qi, Chen, Wei, Yu, Jingcheng, Wang, Taifeng, Ma, Zhiming, and Liu, Tie-Yan. Asynchronous stochastic proximal optimization algorithms with variance reduction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.

Nemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and Shapiro, Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4).

Richtárik, Peter and Takáč, Martin. Parallel coordinate descent methods for big data optimization. Math. Program., 156(1-2).

Scherrer, Chad, Halappanavar, Mahantesh, Tewari, Ambuj, and Haglin, David. Scaling up coordinate descent algorithms for large l1 regularization problems. In Proceedings of the 29th International Conference on Machine Learning.

Schmidt, Mark W., Roux, Nicolas Le, and Bach, Francis R. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83-112.

Shalev-Shwartz, Shai. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33rd International Conference on Machine Learning.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1).

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning.

Shi, Ziqiang and Liu, Rujie. Large scale optimization with proximal stochastic Newton-type gradient descent. In ECML/PKDD.

Smith, Virginia, Forte, Simone, Jordan, Michael I., and Jaggi, Martin. L1-regularized distributed optimization: A communication-efficient primal-dual framework. CoRR.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3), June.

Wang, Jialei, Kolar, Mladen, Srebro, Nathan, and Zhang, Tong. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning.

Wu, Tong T. and Lange, Kenneth. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1).

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4).

Xing, Eric P., Ho, Qirong, Dai, Wei, Kim, Jin Kyu, Wei, Jinliang, Lee, Seunghak, Zheng, Xun, Xie, Pengtao, Kumar, Abhimanu, and Yu, Yaoliang. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

Zhao, Shen-Yi, Xiang, Ru, Shi, Ying-Hao, Gao, Peng, and Li, Wu-Jun. SCOPE: scalable composite optimization for learning on Spark. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.

Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 2005.