arxiv: v1 [stat.ml] 15 Mar 2018

Size: px
Start display at page:

Download "arxiv: v1 [stat.ml] 15 Mar 2018"

Transcription

1 Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate Shen-Yi Zhao 1 Gong-Duo Zhang 1 Ming-Wei Li 1 Wu-Jun Li * 1 arxiv: v1 [stat.ml] 15 Mar 2018 Abstract Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for largescale applications with high-dimensional data. One popular way to implement sparse learning is to use L 1 regularization. In this paper, we propose a novel method, called proximal SCOPE (), for distributed sparse learning with L 1 regularization. is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework of, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that is convergent with a linear convergence rate if the data partition is good enough. We also prove that better data partition implies faster convergence rate. Furthermore, is also communication efficient. Experimental results on real data sets show that can outperform other state-of-the-art distributed methods for sparse learning. 1. Introduction Many machine learning models can be formulated as the following regularized empirical risk minimization problem: min w R d P (w) = 1 n n f i (w) + R(w), (1) i=1 where w is the parameter to learn, f i (w) is generally the loss on training instance i, n is the number of train- 1 Nanjing University. Correspondence to: Shen- Yi Zhao <zhaosy@lamda.nju.edu.cn>, Gong-Duo Zhang <zhanggd@lamda.nju.edu.cn>, Ming-Wei Li <limw@lamda.nju.edu.cn>, Wu-Jun Li <liwujun@nju.edu.cn>. ing instances, and R(w) is a regularization term. For example, given a set of labeled instances {(x i, y i )} n i=1 where x i R d is the feature vector of instance i and y i {+1, 1} is the class label of instance i, if we take f i (w) = (y i x T i w)2 and R(w) = λ w 1, it is known as lasso regression (Tibshirani, 1994). If we take f i (w) = log(1 + e yixt i w ) and R(w) = λ 2 w 2 2, it is known as logistic regression. Here, λ is a hyper-parameter. In some cases, f i (w) can also contain regularization besides the loss. One example is the elastic net (Zou & Hastie, 2005) which will be introduced in Section 6. Recently, sparse learning, which tries to learn a sparse model for prediction, has become a hot topic in machine learning. There are different ways to implement sparse learning (Tibshirani, 1994; Wang et al., 2017). One popular way is to use L 1 regularization, i.e., R(w) = λ w 1. In this paper, we focus on sparse learning with R(w) = λ w 1. Hence, in the following content of this paper, R(w) = λ w 1 unless otherwise stated. One traditional method to solve (1) is proximal gradient descent (pgd) (Beck & Teboulle, 2009), which can be written as follows: w t+1 = prox R,η (w t η F (w t )), (2) where F (w) = 1 n n i=1 f i(w), w t is the value of w at iteration t, η is the learning rate, prox is the proximal mapping defined as prox R,η (u) = arg min(r(v) + 1 v 2η v u 2 ). (3) Recently, stochastic learning methods, including stochastic gradient descent (SGD) (Nemirovski et al., 2009), stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), and stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz & Zhang, 2013), have been proposed to speedup the learning procedure in machine learning. Inspired by the success of these stochastic learning methods, proximal stochastic methods, including proximal SGD (psgd), proximal block coordinate descent (BCD), proximal SVRG (psvrg) and proximal SDCA (psdca), have also been proposed for sparse learning in recent years (Langford et al., 2008; Duchi & Singer,

2 2009; Tseng, 2001; Wu & Lange, 2008; Scherrer et al., 2012; Shalev-Shwartz & Zhang, 2014; Xiao & Zhang, 2014; Shi & Liu, 2015; Byrd et al., 2016; Schmidt et al., 2017). All these proximal stochastic methods are sequential (serial) and implemented with one single thread. The serial proximal stochastic methods may not be efficient enough for solving large-scale sparse learning problems. Furthermore, the training set might be distributively stored on a cluster of multiple machines in some applications. Hence, distributed sparse learning (Aybat et al., 2015) with a cluster of multiple machines has attracted much attention in recent years, especially for large-scale applications with high-dimensional data. In particular, researchers have recently proposed several distributed proximal stochastic methods for sparse learning. One main branch of the distributed proximal stochastic methods includes distributed psgd (dpsgd) (Li et al., 2016) and distributed psvrg (dpsvrg) (Huo et al., 2016; Meng et al., 2017). Both dpsgd and dpsvrg adopt minibatch based strategy for distributed learning. Although both synchronous and asynchronous communication strategies can be used for these methods, most methods (Li et al., 2016; Huo et al., 2016; Meng et al., 2017) adopt asynchronous strategy because it has lower waiting cost than synchronous strategy. The asynchronous version of these methods can be easily implemented with existing distributed frameworks like Parameter Server (Li et al., 2014; Xing et al., 2015). One shortcoming of these dpsgd and dpsvrg methods is that the communication cost is high. More specifically, the communication cost of each epoch is O(n), where n is the number of training instances. Another branch of the distributed proximal stochastic methods is based on block coordinate descent (Bradley et al., 2011; Richtárik & Takác, 2016; Fercoq & Richtárik, 2016; Mahajan et al., 2017). Although in each iteration these methods update only a block of coordinates, they usually have to pass through the whole data set. For example in linear model, to calculate the gradient of any coordinate, they have to calculate x T i w from i = 1 to n. Due to the partition of data, it would also bring O(n) communication cost in each iteration. Another kind of distributed proximal stochastic methods is based on SDCA. One representative of this kind is PROXCOCOA+ (Smith et al., 2015). Based on the primal-dual framework, the communication cost of PROXCOCOA+ in each epoch is reduced to be constant. However, it can only solve a special problem with the form of f(xw) + λ w 1, where X R n d is the data matrix. Many sparse learning problems, like the L 1 regularized logistic regression, cannot be solved by PROXCOCOA+. In this paper, we propose a novel method, called proximal SCOPE (), for distributed sparse learning with L 1 regularization. is a proximal generalization of the scalable composite optimization for learning (SCOPE) (Zhao et al., 2017). SCOPE cannot be used for sparse learning, while can be used for sparse learning. The contributions of are briefly summarized as follows: is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The CALL framework is communication efficient because there is no communication during the inner iterations of each epoch. is theoretically guaranteed to be convergent with a linear convergence rate if the data partition is good enough, and better data partition implies faster convergence rate. Hence, is also computation efficient. Experimental results on real data sets show that can outperform other state-of-the-art distributed methods for sparse learning. 2. Preliminary In this paper, we use to denote the L 2 norm 2, w to denote the optimal solution of (1). For a vector a, we use a (j) to denote the jth coordinate value of a. [n] denotes the set {1, 2,..., n}. For a function h(a; b), we use h(a; b) to denote the gradient of h(a; b) with respect to (w.r.t.) the first argument a. Furthermore, we give the following definitions. Definition 1. We call a function h( ) is Lipschitz continuous if there exists a positive constant L 0 such that a, b h(a) h(b) L 0 a b. Definition 2. We call a function h( ) is L-smooth if it is differentiable and there exists a positive constant L such that a, b h(b) h(a) + h(a) T (b a) + L 2 a b 2. Definition 3. We call a function h( ) is µ-strongly convex (µ 0) if a, b h(b) h(a) + ζ T (b a) + µ 2 a b 2, where ζ h(a) = {c h(b) h(a) + c T (b a), a, b}. If h( ) is differentiable, then ζ = h(a). If µ = 0, we also call h( ) is convex.

3 Definition 4. b and function h( ), the conjugate function h ( ) is h (b) = sup(b T a h(a)), a which is always a convex function. Throughout this paper, we assume that F (w) = 1 n n i=1 f i(w) is strongly convex and each f i ( ) is L-smooth. We do not assume that each f i ( ) is convex. 3. Proximal SCOPE In this paper, we focus on distributed learning with one master (server) and p workers in the cluster, although the algorithm and theory of this paper can also be easily extended to cases with multiple servers like the Parameter Server framework (Li et al., 2014; Xing et al., 2015). The parameter w is stored in the master, and the training set D = {x i, y i } n i=1 are partitioned into p parts denoted as D 1, D 2,..., D p. Here, D k contains a subset of instances from D, and D k will be assigned to the kth worker. D = p k=1 D k. Based on this data partition scheme, the proximal SCOPE () for distributed sparse learning is presented in Algorithm 1. The main task of master is to add and average vectors received from workers. Specifically, it needs to calculate the full gradient z = F (w t ) = 1 n k=1 z k. Then it needs to calculate w t+1 = 1 p k=1 u k,m. The main task of workers is to update the local parameters u 1,m, u 2,m,..., u p,m initialized with u k,0 = w t. Specifically, for each worker k, after it gets the full gradient z from master, it calculates a stochastic gradient v k,m = f ik,m (u k,m ) f ik,m (w t ) + z, (4) and then update its local parameter u k,m by proximal mapping u k,m+1 = prox R,η (u k,m ηv k,m ), (5) where η is the learning rate. From Algorithm 1, we can find that is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The cooperative operation is mainly though adding and averaging in the master. During the autonomous local learning procedure in each outer iteration which contains M inner iterations (see Algorithm 1), there is no communication. Hence, the communication cost for each epoch of is constant, which is much less than the mini-batch based strategy Algorithm 1 Proximal SCOPE 1: Initialize w 0 and the learning rate η; 2: Task of master: 3: for t = 0, 1, 2,...T 1 do 4: Send w t to each worker; 5: Wait until receiving z 1, z 2,..., z p from all workers; 6: Calculate the full gradient z = 1 n k=1 z k and send z to each worker; 7: Wait until receiving u 1,M, u 2,M,..., u p,m from all workers and calculate w t+1 = 1 p k=1 u k,m ; 8: end for 9: Task of the kth worker: 10: for t = 0, 1, 2,...T 1 do 11: Wait until receiving w t from master; 12: Let u k,0 = w t, calculate z k = i D k f i(w t) and send z k to master; 13: Wait until receiving z from master; 14: for m = 0, 1, 2,...M 1 do 15: Randomly choose an instance x ik,m D k ; 16: Calculate v k,m = f ik,m (u k,m ) f ik,m (w t) + z; 17: Update u k,m+1 = prox R,η(u k,m ηv k,m ); 18: end for 19: Send u k,m to master 20: end for with O(n) communication cost for each epoch (Li et al., 2016; Huo et al., 2016; Meng et al., 2017). is a proximal generalization of SCOPE (Zhao et al., 2017). The main differences between them are listed as follows: Although is mainly motivated by sparse learning with L 1 regularization, the algorithm and theory of can be used for both L 1 and L 2 regularization (refer to Theorem 2). On the contrary, SCOPE cannot be used for sparse learning with L 1 regularization. The update rule of SCOPE for the inner iterations is u k,m+1 = u k,m ηv k,m + c(u k,m w t ) where c > L µ > 0 is a hyper-parameter (Zhao et al., 2017). According to the theory in SCOPE, the positive number c is necessary for SCOPE to converge, and the convergence rate will be slower when c becomes larger. If a L 2 regularization has been integrated into f i ( ) and there is no R(w) in (1), the problem in (1) is just the problem solved by SCOPE. In this special case, the update rule in (5) of will be degenerated to u k,m+1 = u k,m ηv k,m. Hence, for the problem with only L 2 regularization, the update rule of is similar to that of SCOPE but c = 0 in. This does not mean the theory in SCOPE is wrong. If we do not put any constraint on the data partition strategy, c cannot be 0 for convergence guarantee. This is just what SCOPE has proved. However, if we can design some good data partition strategy, the algorithm can still be convergent even if c = 0. This

4 is one of the main contributions of this paper. This content will be introduced in the following section. 4. Effect of Data Partition In our experiment, we find that the data partition affects the convergence of the learning procedure. Hence, in this section we propose a metric to measure the goodness of a data partition, based on which the convergence of can be theoretically proved. Due to space limitation, the detailed proof of Lemmas and Theorems are moved to the supplementary material Partition First, we give the following definition: Definition 5. Define π = [φ 1 ( ),..., φ p ( )]. We call π a partition w.r.t. P ( ), if F (w) = 1 p k=1 φ k(w) and each φ k ( )(k = 1,..., p) is µ k -strongly convex and L k -smooth (µ k, L k > 0). Here, P ( ) and F ( ) are defined in (1). We denote A(P ) = {π π is a partition w.r.t. P ( )}. Remark 1. Here, π is an ordered sequence of functions. In particular, if we construct another partition π by permuting φ i ( ) of π, we consider them to be two different partitions. Furthermore, two functions φ i ( ), φ j ( )(i j) in π can be the same. Two partitions π 1 = [φ 1 ( ),..., φ p ( )], π 2 = [ψ 1 ( ),..., ψ p ( )] are considered to be equal, i.e., π 1 = π 2, if and only if φ k (w) = ψ k (w)(k = 1,..., p), w. For any partition π = [φ 1 ( ),..., φ p ( )] w.r.t. P ( ), we construct new functions by introducing a new parameter a: where φ k (w; a) = φ k (w) + G k (a) T w, (6) P k (w; a) = φ k (w; a) + R(w), (7) G k (a) = F (a) φ k (a). (8) It is easy to get P (w) = 1 p k=1 P k(w; a), w, a since 1 p k=1 G k(a) = 0. We call such a P k (w; a) the local objective function w.r.t. π. In particular, given a data partition D 1, D 2,..., D p of the training set D, let F k (w) = 1 D k i D k f i (w) which is also called the local loss function. Assume each F k ( ) is strongly convex and smooth, and F (w) = 1 p k=1 F k(w). Then, we can find that π = [F 1 ( ),..., F p ( )] is a partition w.r.t. P ( ). Moreover, let f i (w; a) = f i (w) + G k (a) T w, i D k. According to (6) and (7), we can find some interesting properties about F k (w; a) = 1 D k i D k f i (w; a) and P k (w; a) as follows: P k (w; w t ) = 1 D k i D k f i (w; w t ) + R(w). The full gradient of F k (w; w t ) at w = w t is F k (w t ; w t ) = F k (w t ) + G k (w t ) = F (w t ). v k,m defined in (4) can be rewritten as v k,m = f i (w k,m ; w t ) f i (w t ; w t ) + F k (w t ; w t ). According to the theory in (Xiao & Zhang, 2014), in the inner iterations of each worker actually tries to optimize the local objective function P k (w; w t ) using proximal SVRG with initialization w = w t and training data D k, rather than optimizing F k (w) + R(w) Qualified Partition For a partition π = [φ 1 ( ),..., φ p ( )] w.r.t. P ( ), each worker tries to optimize the local objective function P k (w; a) based on the data assigned to that worker. In general, the data distribution on each worker is different from the distribution of the whole training set. Hence, there exists a gap between each local optimal value and the global optimal value. Intuitively, the whole learning algorithm has slow convergence rate or even cannot converge if this gap is too large. Here, we use qualified partition to denote a partition whose gap is not too large. Given a specific a, the local optimal value is denoted as follows: wk(a) = arg min P k (w; a). (9) w Definition 6. For any partition π w.r.t. P ( ), we define the Local-Global Gap as l π (a) = P (w ) 1 p P k (w p k(a); a). k=1 Definition 7. We call π a qualified partition w.r.t. P ( ) if there is a constant γ <, which is independent of a, such that l π (a) γ a w 2, a. We denote Q(P ) = {π π is a qualified partition w.r.t. P ( )}. Then, we have the following lemmas: Lemma 1. π A(P ), l π (a) l π (w ) = 0, a. Lemma 2. π A(P ), l π (a) = P (w ) + 1 p p Hk( G k (a)), k=1 where H k ( ) is the conjugate function of φ k( ) + R( ). Without loss of generality, we set λ = 1 and prove that if R(w) = w 1, A(P ) = Q(P ), which means that any partition π is a qualified partition if R(w) = w 1. In particular, we have the following theorems.

5 Theorem 1. Let π = [φ 1 ( ),..., φ p ( )] be a partition w.r.t. P ( ). R(w) = w 1. Then, π is a qualified partition. The result in Theorem 1 can be easily extended to smooth regularization. Theorem 2. Let π = [φ 1 ( ),..., φ p ( )] be a partition w.r.t. P ( ). If R( ) is smooth, then π is a qualified partition Good Partition For a qualified partition, the local-global gap can be bounded by γ a w 2. Given a speficic a, the smaller γ is, the smaller the local-global gap will be. According to Definition 7, the constant γ only depends on the partition π. Intuitively, γ can be used to evaluate the goodness of a partition π. We define a good partition as follows: Definition 8. We call π a (ɛ, ξ)-good partition w.r.t. P ( ) if it is a qualified partition and γ(π; ɛ) l π (a) sup a w 2 ɛ a w ξ. (10) 2 Obviously, we wish γ(π; ɛ) to be small for some good partition. Assume the partition π = [F ( ),..., F ( )], where F ( ) = 1 n n i=1 f i( ) is defined in (1). We can find that π is the best partition since l π (a) = 0, a, which implies γ(π ; 0) = 0 In the following content, we will prove that γ(π; ɛ) is continuous w.r.t. π, and γ(π; ɛ) 0 as π approaches π. We first define the distance between two partitions as follows: Definition 9. Let π 1 = [φ 1 ( ),..., φ p ( )] and π 2 = [ψ 1 ( ),..., ψ p ( )] be two partitions w.r.t. P ( ), the distance between π 1 and π 2 is d(π 1, π 2 ) = max {sup φ i (w) ψ i (w), sup φ i (w) ψ i (w) }. i=1,...,p w w We can prove that it is a reasonable definition of distance. For convenience, we assume the domain of these φ k ( ) and ψ k ( ) is closed and bounded, which is denoted as W = {w w B} and w W. Hence, d(π 1, π 2 ) <. With the definition of distance between two partitions, we have Lemma 3. Let π = [φ 1 ( ),..., φ p ( )] be a partition w.r.t. P ( ). l π (a) uniformly converges to l π (a) = 0 as d(π, π ) 0. Besides the uniform convergence of l π (a) w.r.t. π, we also prefer the uniform convergence of γ(π; ɛ) w.r.t. π since it reflects the goodness of partition π. In the appendix of supplementary, we give the condition for convergence of γ(π; 0). However, it needs higher order information, which is not common in first order optimization. In the following, we give the bound of γ(π; ɛ) which is more practical. Lemma 4. Assume π = [F 1 ( ),..., F p ( )] is a partition w.r.t. P ( ), where F k (w) = 1 D k i D k f i (w) is the local loss function, each f i ( ) is Lipschitz continuous and sampled from some unknown distribution P. If we assign these {f i ( )} uniformly to each worker, then with high probability, γ(π; ɛ) 1 p p 1 O( ɛ D k ). k=1 Moreover, if l π (w) is convex w.r.t. w, then γ(π; ɛ) 1 p p 1 O( ɛ Dk ). k=1 Here we ignore the log term and dimensionality d. It implies that as long as the size of training data is large enough, γ(π; ɛ) will be small and π will be a good partition. Please note that the uniformly here means each f i ( ) will be assigned to one of the p workers and each worker has the equal probability to be assigned. We call the partition resulted from uniform assignment uniform partition in this paper. With uniform partition, each worker will have almost the same number of instances. As long as the size of training data is large enough, uniform partition is a good partition. 5. Convergence of Proximal SCOPE In this section, we will prove the convergence of Algorithm 1 for proximal SCOPE () using the results in Section 4. For convenience, we rewrite the local update rule as follows: where u k,m+1 = u k,m ηg k,m, (11) g k,m = 1 η (u k,m prox R,η (u k,m ηv k,m )). (12) Then we have the following theorems about the convergence of. Theorem 3. Assume π = [F 1 ( ),..., F p ( )] is a (ɛ, ξ)- good partition w.r.t. P ( ). F k ( ) is µ-strongly convex

6 (µ > 0) and L-smooth. R( ) is convex. If w t w 2 ɛ, then E w t+1 w 2 [(1 µη + 2L 2 η 2 ) M + 2L2 η + 2ξ µ + 2L 2 η ] w t w 2. By setting appropriate η and M, we can always make [(1 µη + 2L 2 η 2 ) M + 2L2 η+2ξ µ+2l 2 η ] < 1. Hence, if the partition π = [F 1 ( ),..., F p ( )] is a (ɛ, ξ)-good partition w.r.t. P ( ), is convergent with a linear convergence rate. Furthermore, the smaller ξ is, the faster convergence rate will have. Because smaller ξ means better partition and the partition π corresponds to data partition in Algorithm 1, we can see that better data partition implies faster convergence rate. Corollary 1. Assume π = [F 1 ( ),..., F p ( )] is a (ɛ, µ 8 )- good partition w.r.t. P ( ). F k ( ) is µ-strongly convex (µ > 0) and L-smooth. If w t w 2 ɛ, taking η = µ 12L, 2 M = 20κ 2, where κ = L µ is the conditional number, then we have E w t+1 w w t w 2. To get the ɛ-suboptimal solution, the computation complexity of each worker is O((n/p + κ 2 ) log( 1 ɛ )). Corollary 2. When p = 1, which means we only use one worker, degenerates to proximal SVRG (Xiao & Zhang, 2014). Assume F ( ) is µ-strongly convex (µ > 0) and L-smooth. Taking η = µ 6L 2, M = 13κ 2, we have E w t+1 w w t w 2. To get the ɛ-optimal solution, the computation complexity is O((n + κ 2 ) log( 1 ɛ )). Because has linear convergence rate if the partition is (ɛ, ξ)-good, we need T = O(log( 1 ɛ )) outer iterations to get a ɛ-optimal solution. For all inner iterations, each worker updates u k,m without any communication. Hence, the communication cost is O(log( 1 ɛ )), which is much small than the mini-batch based strategy with O(n) communication cost for each epoch (Li et al., 2016; Huo et al., 2016; Meng et al., 2017).. Furthermore, in the above theorems and corollaries, we only assume that the local loss function F k ( ) is convex. We do not need each f i ( ) to be convex. Hence, M = O(κ 2 ) and it is weaker than the assumption in proximal SVRG (Xiao & Zhang, 2014) whose computation complexity is O((n + κ) log( 1 ɛ )) when p = 1. Please note that proximal SVRG (Xiao & Zhang, 2014) is a centralized method. In addition, without convexity assumption for each f i ( ), our result for the degenerate case p = 1 is consistent with that in (Shalev-Shwartz, 2016). 6. Handle High Dimensional Sparse Data For the cases with high dimensional sparse data, we can take the sparsity to speedup the training procedure. Here, we adopt the widely used elastic net (Zou & Hastie, 2005) as an example for illustration. Elastic net can be formulated as follows: min w P (w) := 1 n n i=1 h i(x T λ1 i w)+ 2 w 2 + λ 2 w 1, where h i : R R is a convex function. With f i (w) = h i (x T λ1 i w) + 2 w 2, R(w) = λ 2 w 1, the elastic net can be rewritten as the form in (1). We assume many instances in {x i R d i [n]} are sparse vectors and let C i = {j x (j) i 0} which denotes the nonzero dimensions of instance i. For elastic net, the update rule in (5) can be written as: u (j) k,m+1 = δ (j) λ 2 η, if δ (j) λ 2 η 0, if δ (j) < λ 2 η δ (j) + λ 2 η, if δ (j) λ 2 η (13) where δ (j) = u (j) k,m ηv(j) k,m, u(j) k,m and v(j) k,m are the jth coordinate values of u k,m and v k,m respectively. This is unacceptable when the data dimensionality d is too large, since we need to execute the conditional statements in (13) O(Md) times which is time consuming. The update in (5) consists of calculation of v k,m and proximal mapping. The calculation of v k,m for elastic net can be written as: v k,m = λ 1 u k,m + h s(x T s u k,m )x s h s(x T s w t )x s + z, where s = i k,m for short. The proximal mapping has two cases: if j C s, then the update is the same as that in (13). if j / C s, δ (j) can be rewritten as δ (j) = (1 λ 1 η)u (j) k,m ηz(j). (14) Since z (j) is a constant during the update of local parameter u k,m, we will design a recovery strategy to recover it when necessary. More specifically, in each inner iteration, with the random index s = i k,m, we only recover u (j) to calculate the inner product x T s u k,m and update u (j) k,m for j C s. For those j / C s, we do not immediately update u (j) k,m. The learning algorithm with recovery strategy is shown in Algorithm 2. We design recovery rules to guarantee that Algorithm 1 and Algorithm 2 are exactly equivalent. is: for some coordinate j, we can calculate u (j) k,m 2 The basic idea of these recovery rules directly from u (j) k,m 1, rather than doing iterations from m = m 1 to m 2. Here, 0 m 1 < m 2 M. It will save about O(d(m 2 m 1 )(1 ρ)) times of conditional statements,

7 where ρ is the sparsity of {x i R d i [n]}. This reduction of computation is significant especially for high dimensional sparse training data. Due to space limitation, these recovery rules are moved to the Appendix in the supplementary material. Algorithm 2 Efficient Learning for Sparse Data 1: Task of the kth worker: 2: for t = 0, 1, 2,...T 1 do 3: Wait until receiving w t from master; 4: Let u k,0 = w t, calculate z k = i D k f i(w t) and send z k to master; 5: Wait until receiving z from master; 6: Set r = (0, 0,..., 0) R d ; 7: for m = 0, 1, 2,...M 1 do 8: Randomly choose an instance index s and denote C s = {j x (j) s 0}; 9: j C s, recover u (j) k,m according to the recovery rules with m 1 = r (j), m 2 = m; 10: Calculate the inner products x T s u k,m and x T s w t; 11: for j C s do 12: v (j) k,m = h s(x T s u k,m )x s (j) h s(x T s w t)x s (j) + z (j) ; 13: u (j) k,m prox λ 2 u,η((1 ηλ 1)u (j) k,m ηv(j) k,m ); 14: r (j) = m + 1; 15: end for 16: end for 17: j [d], recover u (j) to get u k,m ; 18: Send u k,m to master 19: end for 7. Experiment We use two sparse learning models for evaluation. One is logistic regression (LR) with elastic net (Zou & Hastie, 2005): P (w) = 1 n n i=1 log(1 + e yixt i w ) + λ1 2 w 2 + λ 2 w 1. The other is Lasso regression (Tibshirani, 1994): P (w) = 1 n 2n i=1 (xt i w y i) 2 + λ 2 w 1. These two models are among the most representative sparse learning models. All experiments are conducted on a cluster of multiple machines. The CPU for each machine has 12 Intel E cores, and the memory of each machine is 96GB. The machines are connected by 10GB Ethernet Datasets Evaluation is based on four datasets: cov, rcv1, avazu, kdd2012. All of them can be downloaded from LibSVM website 1. The last three datasets are high-dimensional sparse data. Details about datasets and hyper-parameters are shown in Table cjlin/ libsvmtools/datasets/ Table 1. Datasets #instances #features λ 1 λ 2 cov 581, rcv1 677,399 47, avazu 23,567,843 1,000, kdd ,705,032 54,686, Baselines We compare our with six representative baselines: proximal gradient descent based method (Beck & Teboulle, 2009), ADMM type method DFAL (Aybat et al., 2015), Newton type method (Gong & Ye, 2015), proximal SGD/SVRG based method AsyProx- SVRG (Meng et al., 2017), proximal SDCA based method PROXCOCOA+ (Smith et al., 2015), and distributed block coordinate descent (DBCD) method (Mahajan et al., 2017). For those baselines, like and, which are serial. We design distributed versions of them, in which workers distributively compute the gradients and then master gather the gradients from workers for parameter update. All methods use 8 workers. One master will be used if necessary. Unless otherwise stated, all methods except DBCD and PROXCOCOA+ use the same data partition, which is got by uniformly assigning each instance to each worker (uniform partition). Hence, different workers will have almost the same number of instances. This uniform partition strategy satisfies the condition in Lemma 4. Hence, it is a good partition. DBCD and PROXCOCOA+ adopts a coordinate distributed strategy to partition the data Results The convergence results on LR with elastic net are shown in Figure 1. Please note that as stated in the introduction, PROXCOCOA+ cannot be used for solving LR-type problems. DBCD is too slow, and hence we will separately report the results of it in Table 2. AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present the results of it on the datasets cov and rcv1. From Figure 1, we can find that outperforms all the other baselines on all datasets. The convergence results on the Lasso regression are shown in Figure 2. Similarly, AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present the results of it on the datasets cov and rcv1. We can see that also outperforms all other baselines on all datasets for the Lasso regression problem. DBCD is too slow, and hence we separately report the results of it in Table 2. We report the time (in second) of and DBCD on two datasets cov, rcv1 when they

8 cov rcv1 avazu kdd2012 DFAL AsyProxSVRG DFAL AsyProx-SVRG DFAL DFAL Figure 1. Evaluation on LR with elastic net cov rcv1 avazu kdd2012 PROXCOCOA+ AsyProx-SVRG PROXCOCOA+ AsyProx-SVRG PROXCOCOA+ PROXCOCOA Figure 2. Evaluation on Lasso regression get a suboptimal solution. We can find that is much faster than the coordinate descent method DBCD. Table 2. Time comparison (in second) between and DBCD. DBCD cov LR rcv > 1000 cov Lasso rcv > Speedup We also evaluate the speedup of on the four datasets for LR. We run and stop it when the gap P (w) P (w ) The speedup is defined as: Time using one worker Speedup = Time using p workers. We set p = 1, 2, 4, 8. The speedup results are in Figure 3. We can find that gets promising speedup. Speedup Ideal cov rcv1 avazu kdd Workers Figure 3. Speedup of 7.5. Effect of Data Partition We evaluate under different data partitions. We use two datasets cov and rcv1 for illustration, since they are balanced datasets which means the number of positive instances is almost the same as that of negative instances. For each dataset, we construct four data partitions: π : Each worker has the whole data, which is the best partition according to our theory; π 1 : Uniform partition; π 2 : 75% positive instances and 25% negative instances are on the first 4 workers, and other instances are on the last 4 workers; π 3 : All positive instances are on the first 4 workers, and all negative instances are on the last 4 workers. The convergence results are shown in Figure 4. We can see that data partition does affect the convergence of. The best partition π does achieve the best performance, which verifies the theory in this paper. The performance of uniform partition π 1 is similar to that of the best partition π, and is better than the other two data partitions. In real applications with large-scale dataset, it is impractical to assign each worker the whole dataset. Hence, we prefer to choose uniform partition π 1 in real applications, which is also adopted in above experiments of this paper. cov Conclusion * rcv Figure 4. Effect of data partition In this paper, we propose a novel method, called, for distributed sparse learning. Furthermore, we theoretically analyze how the data partition affects the convergence of. is both communication and computation efficient. Experiments on real data show that can outperform other state-of-the-art methods to achieve the best performance. *

9 References Aybat, Necdet S., Wang, Zi, and Iyengar, Garud. An asynchronous distributed proximal gradient method for composite convex optimization. In Proceedings of the 32nd International Conference on Machine Learning, pp , Beck, Amir and Teboulle, Marc. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1): , Bradley, Joseph K., Kyrola, Aapo, Bickson, Danny, and Guestrin, Carlos. Parallel coordinate descent for l1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning, pp , Byrd, Richard H., Hansen, S. L., Nocedal, Jorge, and Singer, Yoram. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2): , Duchi, John C. and Singer, Yoram. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10: , Fercoq, Olivier and Richtárik, Peter. Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent. SIAM Review, 58(4): , Gong, Pinghua and Ye, Jieping. A modified orthant-wise limited memory quasi-newton method with convergence analysis. In Proceedings of the 32nd International Conference on Machine Learning, pp , Huo, Zhouyuan, Gu, Bin, and Huang, Heng. Decoupled asynchronous proximal stochastic gradient descent with variance reduction. CoRR, abs/ , Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp , Langford, John, Li, Lihong, and Zhang, Tong. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems, pp , Li, Mu, Andersen, David G., Park, Jun Woo, Smola, Alexander J., Ahmed, Amr, Josifovski, Vanja, Long, James, Shekita, Eugene J., and Su, Bor-Yiing. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation, pp , Li, Yitan, Xu, Linli, Zhong, Xiaowei, and Ling, Qing. Make workers work harder: Decoupled asynchronous proximal stochastic gradient descent. CoRR, abs/ , Mahajan, Dhruv, Keerthi, S. Sathiya, and Sundararajan, S. A distributed block coordinate descent method for training l1 regularized linear classifiers. Journal of Machine Learning Research, 18:91:1 91:35, Meng, Qi, Chen, Wei, Yu, Jingcheng, Wang, Taifeng, Ma, Zhiming, and Liu, Tie-Yan. Asynchronous stochastic proximal optimization algorithms with variance reduction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp , Nemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and Shapiro, Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19 (4): , Richtárik, Peter and Takác, Martin. Parallel coordinate descent methods for big data optimization. Math. Program., 156(1-2): , Scherrer, Chad, Halappanavar, Mahantesh, Tewari, Ambuj, and Haglin, David. Scaling up coordinate descent algorithms for large l 1 regularization problems. In Proceedings of the 29th International Conference on Machine Learning, Schmidt, Mark W., Roux, Nicolas Le, and Bach, Francis R. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83 112, Shalev-Shwartz, Shai. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33nd International Conference on Machine Learning, pp , Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1): , Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31th International Conference on Machine Learning, pp , Shi, Ziqiang and Liu, Rujie. Large scale optimization with proximal stochastic newton-type gradient descent. In ECML/PKDD, pp , Smith, Virginia, Forte, Simone, Jordan, Michael I., and Jaggi, Martin. L1-regularized distributed optimization: A communication-efficient primal-dual framework. CoRR, abs/ , Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58: , Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109 (3): , June ISSN Wang, Jialei, Kolar, Mladen, Srebro, Nathan, and Zhang, Tong. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, pp , Wu, Tong T. and Lange, Kenneth. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1): , Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4): , Xing, Eric P., Ho, Qirong, Dai, Wei, Kim, Jin Kyu, Wei, Jinliang, Lee, Seunghak, Zheng, Xun, Xie, Pengtao, Kumar, Abhimanu, and Yu, Yaoliang. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp , 2015.

10 Zhao, Shen-Yi, Xiang, Ru, Shi, Ying-Hao, Gao, Peng, and Li, Wu-Jun. SCOPE: scalable composite optimization for learning on spark. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp , Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67: , 2005.

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

ParaGraphE: A Library for Parallel Knowledge Graph Embedding

ParaGraphE: A Library for Parallel Knowledge Graph Embedding ParaGraphE: A Library for Parallel Knowledge Graph Embedding Xiao-Fan Niu, Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology, Nanjing University,

More information

Importance Sampling for Minibatches

Importance Sampling for Minibatches Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

arxiv: v2 [cs.lg] 10 Oct 2018

arxiv: v2 [cs.lg] 10 Oct 2018 Journal of Machine Learning Research 9 (208) -49 Submitted 0/6; Published 7/8 CoCoA: A General Framework for Communication-Efficient Distributed Optimization arxiv:6.0289v2 [cs.lg] 0 Oct 208 Virginia Smith

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Parallel Coordinate Optimization

Parallel Coordinate Optimization 1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function

More information

1. Introduction. We consider the problem of minimizing the sum of two convex functions: = F (x)+r(x)}, f i (x),

1. Introduction. We consider the problem of minimizing the sum of two convex functions: = F (x)+r(x)}, f i (x), SIAM J. OPTIM. Vol. 24, No. 4, pp. 2057 2075 c 204 Society for Industrial and Applied Mathematics A PROXIMAL STOCHASTIC GRADIENT METHOD WITH PROGRESSIVE VARIANCE REDUCTION LIN XIAO AND TONG ZHANG Abstract.

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

Accelerated Gradient Method for Multi-Task Sparse Learning Problem

Accelerated Gradient Method for Multi-Task Sparse Learning Problem Accelerated Gradient Method for Multi-Task Sparse Learning Problem Xi Chen eike Pan James T. Kwok Jaime G. Carbonell School of Computer Science, Carnegie Mellon University Pittsburgh, U.S.A {xichen, jgc}@cs.cmu.edu

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Supplement: Distributed Box-constrained Quadratic Optimization for Dual Linear SVM

Supplement: Distributed Box-constrained Quadratic Optimization for Dual Linear SVM Supplement: Distributed Box-constrained Quadratic Optimization for Dual Linear SVM Ching-pei Lee LEECHINGPEI@GMAIL.COM Dan Roth DANR@ILLINOIS.EDU University of Illinois at Urbana-Champaign, 201 N. Goodwin

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Composite Objective Mirror Descent

Composite Objective Mirror Descent Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research

More information

Stochastic Second-Order Method for Large-Scale Nonconvex Sparse Learning Models

Stochastic Second-Order Method for Large-Scale Nonconvex Sparse Learning Models Stochastic Second-Order Method for Large-Scale Nonconvex Sparse Learning Models Hongchang Gao, Heng Huang Department of Electrical and Computer Engineering, University of Pittsburgh, USA hongchanggao@gmail.com,

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

Stochastic Dual Coordinate Ascent with Adaptive Probabilities Dominik Csiba Zheng Qu Peter Richtárik University of Edinburgh CDOMINIK@GMAIL.COM ZHENG.QU@ED.AC.UK PETER.RICHTARIK@ED.AC.UK Abstract This paper introduces AdaSDCA: an adaptive variant of stochastic dual

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction

More information

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve

More information

StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent

StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent : Safely Avoiding Wasteful Updates in Coordinate Descent Tyler B. Johnson 1 Carlos Guestrin 1 Abstract Coordinate descent (CD) is a scalable and simple algorithm for solving many optimization problems

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Bundle CDN: A Highly Parallelized Approach for Large-scale l 1 -regularized Logistic Regression

Bundle CDN: A Highly Parallelized Approach for Large-scale l 1 -regularized Logistic Regression Bundle : A Highly Parallelized Approach for Large-scale l 1 -regularized Logistic Regression Yatao Bian, Xiong Li, Mingqi Cao, and Yuncai Liu Institute of Image Processing and Pattern Recognition, Shanghai

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

arxiv: v5 [cs.lg] 23 Dec 2017 Abstract

arxiv: v5 [cs.lg] 23 Dec 2017 Abstract Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction Xingguo Li, Raman Arora, Han Liu, Jarvis Haupt, and Tuo Zhao arxiv:1605.02711v5 [cs.lg] 23 Dec 2017 Abstract We

More information

Divide-and-combine Strategies in Statistical Modeling for Massive Data

Divide-and-combine Strategies in Statistical Modeling for Massive Data Divide-and-combine Strategies in Statistical Modeling for Massive Data Liqun Yu Washington University in St. Louis March 30, 2017 Liqun Yu (WUSTL) D&C Statistical Modeling for Massive Data March 30, 2017

More information

Asynchronous Doubly Stochastic Sparse Kernel Learning

Asynchronous Doubly Stochastic Sparse Kernel Learning The Thirty-Second AAAI Conference on Artificial Intelligence AAAI-8 Asynchronous Doubly Stochastic Sparse Kernel Learning Bin Gu, Xin Miao, Zhouyuan Huo, Heng Huang * Department of Electrical & Computer

More information

Coordinate Descent Faceoff: Primal or Dual?

Coordinate Descent Faceoff: Primal or Dual? JMLR: Workshop and Conference Proceedings 83:1 22, 2018 Algorithmic Learning Theory 2018 Coordinate Descent Faceoff: Primal or Dual? Dominik Csiba peter.richtarik@ed.ac.uk School of Mathematics University

More information

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-7) A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis Zhe Li, Tianbao Yang, Lijun Zhang, Rong

More information

An Asynchronous Parallel Stochastic Coordinate Descent Algorithm

An Asynchronous Parallel Stochastic Coordinate Descent Algorithm Ji Liu JI-LIU@CS.WISC.EDU Stephen J. Wright SWRIGHT@CS.WISC.EDU Christopher Ré CHRISMRE@STANFORD.EDU Victor Bittorf BITTORF@CS.WISC.EDU Srikrishna Sridhar SRIKRIS@CS.WISC.EDU Department of Computer Sciences,

More information

LOCO: Distributing Ridge Regression with Random Projections

LOCO: Distributing Ridge Regression with Random Projections LOCO: Distributing Ridge Regression with Random Projections Brian McWilliams Christina Heinze Nicolai Meinshausen Gabriel Krummenacher Hastagiri P. Vanchinathan Department of Computer Science Seminar für

More information

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu

More information

Stochastic Methods for l 1 Regularized Loss Minimization

Stochastic Methods for l 1 Regularized Loss Minimization Shai Shalev-Shwartz Ambuj Tewari Toyota Technological Institute at Chicago, 6045 S Kenwood Ave, Chicago, IL 60637, USA SHAI@TTI-CORG TEWARI@TTI-CORG Abstract We describe and analyze two stochastic methods

More information

Distributed Stochastic Optimization of Regularized Risk via Saddle-point Problem

Distributed Stochastic Optimization of Regularized Risk via Saddle-point Problem Distributed Stochastic Optimization of Regularized Risk via Saddle-point Problem Shin Matsushima 1, Hyokun Yun 2, Xinhua Zhang 3, and S.V.N. Vishwanathan 2,4 1 The University of Tokyo, Tokyo, Japan shin

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

arxiv: v1 [cs.lg] 10 Jan 2019

arxiv: v1 [cs.lg] 10 Jan 2019 Quantized Epoch-SGD for Communication-Efficient Distributed Learning arxiv:1901.0300v1 [cs.lg] 10 Jan 2019 Shen-Yi Zhao Hao Gao Wu-Jun Li Department of Computer Science and Technology Nanjing University,

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

arxiv: v2 [stat.ml] 16 Jun 2015

arxiv: v2 [stat.ml] 16 Jun 2015 Semi-Stochastic Gradient Descent Methods Jakub Konečný Peter Richtárik arxiv:1312.1666v2 [stat.ml] 16 Jun 2015 School of Mathematics University of Edinburgh United Kingdom June 15, 2015 (first version:

More information

arxiv: v3 [cs.lg] 15 Sep 2018

arxiv: v3 [cs.lg] 15 Sep 2018 Asynchronous Stochastic Proximal Methods for onconvex onsmooth Optimization Rui Zhu 1, Di iu 1, Zongpeng Li 1 Department of Electrical and Computer Engineering, University of Alberta School of Computer

More information

Asynchronous Doubly Stochastic Group Regularized Learning

Asynchronous Doubly Stochastic Group Regularized Learning Bin Gu Zhouyuan Huo Heng Huang Department of Electrical and Computer Engineering, University of Pittsburgh, USA Abstract Group regularized learning problems such as group Lasso) are important in machine

More information

Fast-and-Light Stochastic ADMM

Fast-and-Light Stochastic ADMM Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-6) Fast-and-Light Stochastic ADMM Shuai Zheng James T. Kwok Department of Computer Science and Engineering

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent

PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Hsiang-Fu Yu Inderjit S. Dhillon Department of Computer Science, The University of Texas, Austin, TX 787, USA CJHSIEH@CS.UTEXAS.EDU ROFUYU@CS.UTEXAS.EDU INDERJIT@CS.UTEXAS.EDU Abstract Stochastic

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information