FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES
Olivier Fercoq, Zheng Qu, Peter Richtárik, Martin Takáč

School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom
(Thanks to EPSRC for funding via grants EP/K02325X/1, EP/I017127/1 and EP/G036136/1.)

ABSTRACT

We propose an efficient distributed randomized coordinate descent method for minimizing regularized non-strongly convex loss functions. The method attains the optimal O(1/k^2) convergence rate, where k is the iteration counter. The core of the work is the theoretical study of stepsize parameters. We have implemented the method on Archer (the largest supercomputer in the UK) and show that the method is capable of solving a synthetic LASSO optimization problem with 50 billion variables.

Index Terms— Coordinate descent, distributed algorithms, acceleration.

1. INTRODUCTION

In this paper we are concerned with solving regularized convex optimization problems in huge dimensions in cases when the loss being minimized is not strongly convex. That is, we consider the problem

    min_{x ∈ R^d} L(x) := f(x) + R(x),    (1)

where f : R^d → R is a convex differentiable loss function, d is huge, and R(x) = Σ_{i=1}^d R_i(x^i), where the R_i : R → R ∪ {+∞} are (possibly nonsmooth) convex regularizers and x^i denotes the i-th coordinate of x. We make the following technical assumption about f: there exists an n-by-d matrix A such that for all x, h ∈ R^d,

    f(x + h) ≤ f(x) + (∇f(x))^T h + (1/2) h^T A^T A h.    (2)

For examples of regularizers R and loss functions f satisfying the above assumptions, relevant to machine learning applications, we refer the reader to [1, 2, 3].

It is increasingly the case in modern applications that the data describing the problem (encoded in A and R in the above model) is so large that it does not fit into the RAM of a single computer. In such a case, unless the application at hand can tolerate slow performance due to frequent HDD reads/writes, it is often necessary to distribute the data among the nodes of a cluster and solve the problem in a distributed manner.

While efficient distributed methods exist in cases when the regularized loss L is strongly convex (e.g., Hydra [3]), here we assume that L is not strongly convex. Problems of this type arise frequently: for instance, in the SVM dual, f is typically a non-strongly convex quadratic, d is the number of examples, n is the number of features, A encodes the data, and R is a 0-∞ indicator function encoding box constraints (e.g., see [2]). If d > n, then L will not be strongly convex.

In this paper we propose Hydra² ("Hydra squared"; Algorithm 1), the first distributed stochastic coordinate descent (CD) method with a fast O(1/k^2) convergence rate for our problem. The method can be seen both as a specialization of the APPROX algorithm [4] to the distributed setting, and as an accelerated version of the Hydra algorithm [3] (Hydra converges as O(1/k) for our problem). The core of the paper is the development of new stepsizes, and of new efficiently computable bounds on the stepsizes proposed for distributed CD methods in [3]. We show that Hydra² is able to solve a big data problem with d equal to 50 billion.

2. THE ALGORITHM

Assume we have c nodes/computers available. In Hydra² (Algorithm 1), the coordinates i ∈ [d] := {1, 2, ..., d} are first partitioned into c sets {P_l, l = 1, ..., c}, each of size |P_l| = s := d/c. The columns of A are partitioned accordingly, with those belonging to partition P_l stored on node l.
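As a concrete illustration (not part of the paper; all names below are ours), the following Python sketch builds a small LASSO instance of problem (1) — f(x) = (1/2)‖Ax − b‖^2, which satisfies assumption (2) with the same matrix A, and R_i(t) = λ|t| — and splits the d columns of A into c blocks {P_l} of size s = d/c, as Hydra² would distribute them across the nodes.

```python
import numpy as np

# Small synthetic LASSO instance of problem (1):
#   f(x) = 0.5*||Ax - b||^2   (satisfies assumption (2) with the same matrix A),
#   R(x) = lam*||x||_1        (separable, with R_i(t) = lam*|t|).
rng = np.random.default_rng(0)
n, d, c = 40, 200, 4                 # rows, columns, number of nodes
s = d // c                           # block size, so that d = c*s
A = rng.standard_normal((n, d)) * (rng.random((n, d)) < 0.1)   # sparse-ish data
b = rng.standard_normal(n)
lam = 0.1

def L(x):
    """Regularized loss L(x) = f(x) + R(x)."""
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()

# Partition {0,...,d-1} into c consecutive blocks P_1,...,P_c; node l would
# store the columns of A whose indices lie in P_l.
partition = [np.arange(l * s, (l + 1) * s) for l in range(c)]
print(L(np.zeros(d)), [len(P) for P in partition])
```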
During one iteration, all computers l ∈ {1,...,c}, in parallel, pick a subset Ŝ_l of τ coordinates from those they own, i.e., from P_l, uniformly at random, where τ (1 ≤ τ ≤ s) is a parameter of the method (Step 6). From now on we denote by Ŝ the union of all these sets, Ŝ := ∪_l Ŝ_l, and refer to it by the name "distributed sampling".

Hydra² maintains two sequences of iterates: u_k, z_k ∈ R^d. Note that this is usually the case with accelerated/fast gradient-type algorithms [5, 6, 7]. Also note that the output of the method is a linear combination of these two vectors. The iterates are stored and updated in a distributed way, with the i-th coordinate stored on computer l if i ∈ P_l. Once computer l picks Ŝ_l, it computes for each i ∈ Ŝ_l an update scalar t_k^i, which is then used to update z_k^i and u_k^i, in parallel, using the multicore capability of computer l. The main work is done in Step 8, where the update scalars t_k^i are computed.
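The distributed sampling Ŝ described above is easy to simulate. A minimal sketch (helper names are ours, not from the paper): each node draws τ of its s coordinates uniformly without replacement, and Ŝ is the union of the c independent draws, so |Ŝ| = cτ.

```python
import numpy as np

def distributed_sampling(partition, tau, rng):
    """One draw of the distributed sampling S-hat: every node l picks, uniformly
    at random and without replacement, tau coordinates from its own block P_l."""
    picks = [rng.choice(P, size=tau, replace=False) for P in partition]
    return np.concatenate(picks)              # |S-hat| = c * tau

rng = np.random.default_rng(1)
partition = [np.arange(l * 50, (l + 1) * 50) for l in range(4)]   # c = 4, s = 50
S_hat = distributed_sampling(partition, tau=10, rng=rng)
print(len(S_hat), sorted(S_hat)[:5])          # 40 coordinates, 10 per block
```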

Algorithm 1: Hydra²

1:  INPUT: {P_l}_{l=1}^c, τ (1 ≤ τ ≤ s), {D_ii}_{i=1}^d, z_0 ∈ R^d
2:  set θ_0 = τ/s and u_0 = 0
3:  for k ≥ 0 do
4:    z_{k+1} ← z_k, u_{k+1} ← u_k
5:    for each computer l ∈ {1,...,c} in parallel do
6:      pick a random set of coordinates Ŝ_l ⊆ P_l, |Ŝ_l| = τ
7:      for each i ∈ Ŝ_l in parallel do
8:        t_k^i = argmin_t { ∇_i f(θ_k^2 u_k + z_k) t + (s θ_k D_ii)/(2τ) t^2 + R_i(z_k^i + t) }
9:        z_{k+1}^i ← z_k^i + t_k^i,   u_{k+1}^i ← u_k^i − (1/θ_k^2 − s/(τ θ_k)) t_k^i
10:     end parallel for
11:   end parallel for
12:   θ_{k+1} = (1/2) ( sqrt(θ_k^4 + 4 θ_k^2) − θ_k^2 )
13: end for
14: OUTPUT: θ_k^2 u_{k+1} + z_{k+1}

The partial derivatives in Step 8 are computed at the point θ_k^2 u_k + z_k. For the algorithm to be efficient, this computation should be performed without explicitly forming the sum θ_k^2 u_k + z_k; see [4] for details on how this can be done. Steps 8 and 9 depend on a deterministic scalar sequence θ_k, which is updated in Step 12 as in [6]. Note that by taking θ_k = θ_0 for all k, u_k remains equal to 0, and Hydra² reduces to Hydra [3]. The output of the algorithm is x_{k+1} = θ_k^2 u_{k+1} + z_{k+1}. We only need to compute this vector sum at the end of the execution, and whenever we want to track L(x_k). Note that one should not evaluate x_k and L(x_k) at each iteration, since these computations have a non-negligible cost.

3. CONVERGENCE RATE

The magnitude of D_ii > 0 directly influences the size of the update t_k^i. In particular, when there is no regularizer (R_i = 0), then t_k^i = −τ ∇_i f(θ_k^2 u_k + z_k) (s θ_k D_ii)^{-1}. That is, a small D_ii leads to a larger step t_k^i, which is used to update z_k^i and u_k^i. For this reason we refer to {D_ii}_{i=1}^d as stepsize parameters. Naturally, some technical assumptions on {D_ii}_{i=1}^d must be made in order to guarantee the convergence of the algorithm. The so-called ESO (Expected Separable Overapproximation) assumption has been introduced for this purpose. For h ∈ R^d, denote h_[Ŝ] := Σ_{i ∈ Ŝ} h^i e_i, where e_i is the i-th unit coordinate vector.

Assumption 3.1 (ESO). Assume that for all x ∈ R^d and h ∈ R^d we have

    E[ f(x + h_[Ŝ]) ] ≤ f(x) + (E[|Ŝ|]/d) ( (∇f(x))^T h + (1/2) h^T D h ),    (3)

where D is a diagonal matrix with diagonal elements D_ii > 0 and Ŝ is the distributed sampling described above.

The ESO assumption involves the smooth function f, the sampling Ŝ and the parameters {D_ii}_{i=1}^d. It was first introduced by Richtárik and Takáč [2] as a generic approach to the convergence analysis of parallel coordinate descent methods (PCDM). Their approach boils down the convergence analysis of the whole class of PCDMs to the problem of finding proper parameters which make the ESO assumption hold. The same idea has been extended to the analysis of many variants of PCDM, including the accelerated coordinate descent algorithm APPROX [4] and the distributed coordinate descent method Hydra [3]. In particular, the following complexity result, valid under the ESO assumption 3.1, can be deduced from [4, Theorem 3] using the Markov inequality:

Theorem 3.2 ([4]). Let x* ∈ R^d be any optimal point of (1), C_1 := (1 − τ/s)(L(x_0) − L(x*)), C_2 := (1/2)(x_0 − x*)^T D (x_0 − x*), and choose 0 < ρ < 1, ε < L(x_0) − L(x*) and

    k ≥ (2s/τ) sqrt( (C_1 + C_2)/(ρ ε) ).    (4)

Then under Assumption 3.1, the iterates {x_k}_{k ≥ 1} of Algorithm 1 satisfy Prob(L(x_k) − L(x*) ≤ ε) ≥ 1 − ρ.

The bound on the number of iterations k in (4) suggests choosing the smallest possible {D_ii}_{i=1}^d satisfying the ESO assumption.
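The pieces above can be put together in a compact serial simulation of Algorithm 1 (one process playing all c nodes), here specialized to a small synthetic LASSO instance with f(x) = (1/2)‖Ax − b‖^2 and R_i(t) = λ|t|. This is only an illustrative sketch, not the authors' distributed implementation: the gradient is recomputed naively rather than with the efficient bookkeeping of [4], it uses the easily computable stepsize parameters derived later in Section 4 (formula (17)), and all helper names are ours.

```python
import numpy as np

def soft_threshold(v, r):
    """Proximal operator of r*|.| evaluated at the scalar v."""
    return np.sign(v) * max(abs(v) - r, 0.0)

def hydra2_lasso(A, b, lam, c, tau, iters, seed=0):
    """Serial simulation of Algorithm 1 for f(x) = 0.5*||Ax - b||^2, R_i(t) = lam*|t|."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    s = d // c
    partition = [np.arange(l * s, (l + 1) * s) for l in range(c)]

    # easily computable stepsizes (17): D_ii = 2(1 + (tau-1)(max_j omega_j - 1)/s1) * sum_j A_ji^2
    s1 = max(1, s - 1)
    omega_max = (A != 0).sum(axis=1).max()
    D = 2.0 * (1.0 + (tau - 1) * (omega_max - 1) / s1) * (A ** 2).sum(axis=0)

    u, z = np.zeros(d), np.zeros(d)
    theta = tau / s                                    # Step 2: theta_0 = tau/s
    theta_prev = theta
    for _ in range(iters):
        x = theta ** 2 * u + z                         # point used in Step 8
        grad = A.T @ (A @ x - b)                       # only the sampled entries are read below
        for P in partition:                            # "in parallel" over the c nodes
            for i in rng.choice(P, size=tau, replace=False):     # Step 6
                if D[i] == 0.0:                        # empty column of A: nothing to update
                    continue
                a = s * theta * D[i] / tau             # curvature of the 1D model in Step 8
                t = soft_threshold(z[i] - grad[i] / a, lam / a) - z[i]
                z[i] += t                                               # Step 9
                u[i] -= (1.0 / theta ** 2 - s / (tau * theta)) * t      # Step 9
        theta_prev = theta
        theta = 0.5 * (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2)   # Step 12
    return theta_prev ** 2 * u + z                     # Step 14: x_{k+1} = theta_k^2 u_{k+1} + z_{k+1}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 200)) * (rng.random((40, 200)) < 0.1)
    b = rng.standard_normal(40)
    x = hydra2_lasso(A, b, lam=0.1, c=4, tau=10, iters=300)
    print("L(x_k) =", 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.1 * np.abs(x).sum())
```

Note that at k = 0 the factor 1/θ_0^2 − s/(τ θ_0) vanishes, so u stays at zero until θ_k starts to decrease, which matches the remark above that with θ_k ≡ θ_0 the method reduces to Hydra.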
4. STEPSIZES

In this section we study stepsize parameters D_ii for which the ESO assumption is satisfied.

4.1. New stepsizes

For any matrix G ∈ R^{d×d}, denote by D^G the diagonal matrix such that D^G_ii = G_ii for all i and D^G_ij = 0 for i ≠ j. We denote by B^G ∈ R^{d×d} the block matrix associated with the partition {P_1,...,P_c}, i.e., B^G_ij = G_ij whenever {i, j} ⊆ P_l for some l, and B^G_ij = 0 otherwise. Let ω_j be the number of nonzeros in the j-th row of A, and let ω'_j be the number of partitions active at row j, i.e., the number of indices l ∈ {1,...,c} for which the set {i ∈ P_l : A_ji ≠ 0} is nonempty. Note that since A does not have an empty row or column, we know that 1 ≤ ω_j ≤ d and 1 ≤ ω'_j ≤ c. Moreover, we have the following characterization of these two quantities. For notational convenience we denote M_j = A_{j:}^T A_{j:} for all j ∈ {1,...,n}.
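Both ω_j and ω'_j can be computed in a single pass over the data. A minimal sketch (assuming a SciPy CSR matrix and an array mapping each column to its block; the helper names are ours):

```python
import numpy as np
import scipy.sparse as sp

def row_omegas(A_csr, block_of):
    """For each row j of A return
    omega_j  = number of nonzero entries in row j, and
    omega'_j = number of distinct blocks P_l containing at least one column
               index with a nonzero entry in row j."""
    n = A_csr.shape[0]
    omega = np.diff(A_csr.indptr)                        # nonzeros per row
    omega_p = np.empty(n, dtype=int)
    for j in range(n):
        cols = A_csr.indices[A_csr.indptr[j]:A_csr.indptr[j + 1]]
        omega_p[j] = np.unique(block_of[cols]).size      # partitions active at row j
    return omega, omega_p

# toy example: d = 8 columns split into c = 2 blocks of size s = 4
d, c = 8, 2
block_of = np.arange(d) // (d // c)                      # block index of each column
A = sp.random(5, d, density=0.4, format="csr", random_state=0)
omega, omega_p = row_omegas(A, block_of)
print(omega, omega_p)        # for nonempty rows, 1 <= omega'_j <= min(omega_j, c)
```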

Lemma 4.1. For j ∈ {1,...,n} we have:

    ω_j = max{ x^T M_j x : x^T D^{M_j} x ≤ 1 },    (5)
    ω'_j = max{ x^T M_j x : x^T B^{M_j} x ≤ 1 }.    (6)

Proof. For l ∈ {1,...,c} and y ∈ R^d, denote y_l := (y^i)_{i ∈ P_l}; that is, y_l is the subvector of y composed of the coordinates belonging to partition P_l. Then we have

    x^T M_j x = ( Σ_{l=1}^c A_{j:}^l x_l )^2,    x^T B^{M_j} x = Σ_{l=1}^c ( A_{j:}^l x_l )^2.

Let S_j = {l : A_{j:}^l ≠ 0}; then ω'_j = |S_j| and by Cauchy–Schwarz we have

    ( Σ_{l ∈ S_j} A_{j:}^l x_l )^2 ≤ ω'_j Σ_{l ∈ S_j} ( A_{j:}^l x_l )^2.

Equality is reached when A_{j:}^l x_l = α for some constant α and all l ∈ S_j (this is feasible since the subsets {P_1,...,P_c} are disjoint). Hence we proved (6). The characterization (5) follows from (6) by setting c = d.

We shall also need the following lemma.

Lemma 4.2 ([3]). Fix arbitrary G ∈ R^{d×d} and x ∈ R^d, and let s_1 := max{1, s−1}. Then E[ x_[Ŝ]^T G x_[Ŝ] ] is equal to

    (τ/s) [ α_1 x^T D^G x + α_2 x^T G x + α_3 x^T (G − B^G) x ],    (7)

where α_1 = 1 − (τ−1)/s_1, α_2 = (τ−1)/s_1, α_3 = τ/s − (τ−1)/s_1.

Below is the main result of this section.

Theorem 4.1. For a convex differentiable function f satisfying (2) and the distributed sampling Ŝ described in Section 2, the ESO assumption 3.1 is satisfied for

    (D^1)   D_ii = Σ_{j=1}^n α_j A_ji^2,   i = 1,...,d,    (8)

where

    α_j := α_{1,j} + α_{2,j},   α_{1,j} = 1 + (τ−1)(ω_j − 1)/s_1,   α_{2,j} = (τ/s − (τ−1)/s_1) (ω'_j − 1) ω_j / ω'_j.    (9)

Proof. Fix any h ∈ R^d. It is easy to see that E[ (∇f(x))^T h_[Ŝ] ] = (E[|Ŝ|]/d) (∇f(x))^T h; this follows by noting that Ŝ is a uniform sampling (we refer the reader to [2] and [8] for more identities of this type). In view of (2), we only need to show that

    E[ h_[Ŝ]^T A^T A h_[Ŝ] ] ≤ (E[|Ŝ|]/d) h^T D h.    (10)

By applying Lemma 4.2 with G = M_j and using Lemma 4.1, we get

    E[ h_[Ŝ]^T M_j h_[Ŝ] ] ≤ (τ/s) [ α_1 h^T D^{M_j} h + α_2 ω_j h^T D^{M_j} h + α_3 ((ω'_j − 1)/ω'_j) h^T M_j h ]
                           ≤ (τ/s) [ α_1 + α_2 ω_j + α_3 (ω'_j − 1) ω_j / ω'_j ] h^T D^{M_j} h
                           = (E[|Ŝ|]/d) α_j h^T D^{M_j} h.    (11)

Since E[ h_[Ŝ]^T A^T A h_[Ŝ] ] = Σ_j E[ h_[Ŝ]^T M_j h_[Ŝ] ], (10) is obtained by summing up (11) over j from 1 to n.

4.2. Existing stepsizes

For simplicity, let M := A^T A. Define

    σ := max{ x^T M x : x ∈ R^d, x^T D^M x ≤ 1 },    σ' := max{ x^T M x : x ∈ R^d, x^T B^M x ≤ 1 }.    (12)

The quantities σ and σ' are identical to those defined in [3], although the definitions are slightly different. The following stepsize parameters have been introduced in [3]:

Lemma 4.3 ([3]). The ESO assumption 3.1 is satisfied for

    (D^2)   D_ii = β Σ_{j=1}^n A_ji^2,   i = 1,...,d,    (13)

where

    β := β_1 + β_2,   β_1 = 1 + (τ−1)(σ − 1)/s_1,   β_2 = (τ/s − (τ−1)/s_1)(σ' − 1) σ / σ'.    (14)
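The new parameters D^1 of Theorem 4.1 can be assembled in a single pass over the data, since only ω_j, ω'_j and the squared entries of A are needed. A minimal sketch (our own helper names; dense A for brevity):

```python
import numpy as np

def new_stepsizes_D1(A, block_of, tau):
    """Stepsize parameters (8)-(9): D_ii = sum_j alpha_j * A_ji^2 with
    alpha_j = 1 + (tau-1)(omega_j-1)/s1
            + (tau/s - (tau-1)/s1) * (omega'_j - 1) * omega_j / omega'_j."""
    n, d = A.shape
    c = block_of.max() + 1
    s = d // c
    s1 = max(1, s - 1)
    nz = A != 0
    omega = nz.sum(axis=1)                                    # omega_j
    omega_p = np.array([np.unique(block_of[nz[j]]).size for j in range(n)])
    alpha = (1.0 + (tau - 1) * (omega - 1) / s1
             + (tau / s - (tau - 1) / s1) * (omega_p - 1) * omega / np.maximum(omega_p, 1))
    return (alpha[:, None] * A ** 2).sum(axis=0)              # D_ii, i = 1,...,d

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 200)) * (rng.random((40, 200)) < 0.1)
block_of = np.arange(200) // 50                               # c = 4 blocks of size s = 50
print(new_stepsizes_D1(A, block_of, tau=10)[:5])
```

By contrast, the parameters D^2 of Lemma 4.3 require σ and σ', which are extreme eigenvalue-type quantities; their computation is discussed next.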

4.3. Known bounds on existing stepsizes

In general, the computation of σ and σ' can be done using the power iteration method [9]. However, the number of operations required in each iteration of the power method is at least twice the number of nonzero elements of A. Hence, merely computing the parameters σ and σ' would already require quite a few passes through the data. Instead, if one could provide easily computable upper bounds for σ and σ', where by "easily computable" we mean computable in a single pass through the data, then one could run Algorithm 1 immediately, without spending too much time on the computation of σ and σ'. Note that the ESO assumption still holds when σ and σ' in (13)–(14) are replaced by upper bounds.

In [3], the authors established the bound

    β ≤ 2 β_1 = 2 (1 + (τ−1)(σ − 1)/s_1),    (15)

which is independent of the partition {P_l}_{l=1,...,c}. This bound holds for τ ≥ 2. Further, they showed that

    σ ≤ max_j ω_j.    (16)

Then, in view of (15), (16) and Lemma 4.3, the ESO assumption 3.1 is also satisfied for the following easily computable parameters:

    (D^3)   D_ii = 2 (1 + (τ−1)(max_j ω_j − 1)/s_1) Σ_{j=1}^n A_ji^2,   i = 1,...,d.    (17)

4.4. Improved bounds on existing stepsizes

In what follows, we show that both (15) and (16) can be improved, so that smaller parameters {D_ii}_{i=1}^d are allowed in the algorithm.

Lemma 4.4. Suppose τ ≥ 2 (note that then s ≥ 2). For all 1 ≤ η ≤ s the following holds:

    (τ/s − (τ−1)/s_1) η ≤ (η − 1)/s_1 + 1/(τ−1).    (18)

Proof. Both sides of the inequality are linear functions of η. It therefore suffices to verify that the inequality holds for η = 1 and η = s, which can be done.

The following result is an improvement on (15), which was shown in [3, Lemma 2].

Lemma 4.5. If τ ≥ 2, then β ≤ (1 + 1/(τ−1)) β_1.

Proof. We only need to apply Lemma 4.4 to (14) with η = σ, and additionally use the bound (σ' − 1)/σ' ≤ 1. This gives β_2 ≤ β_1/(τ−1), from which the result follows.

Remark 4.1. Applying the same reasoning to the new stepsizes, we can show that α_j ≤ (1 + 1/(τ−1)) α_{1,j} for all j ∈ {1,...,n}. Remark that this is not as useful as Lemma 4.5, since α_{2,j} does not need to be approximated (ω'_j is easily computable). However, the same conclusion holds for the new stepsize parameters: whatever the partition {P_l}_{l=1,...,c} is, each parameter D_ii is at most 1 + 1/(τ−1) times larger than the smallest one we could use by choosing an optimal partition, if the latter exists.

We next give a better upper bound on σ than (16).

Lemma 4.6. If we let

    v_i := Σ_{j=1}^n ω_j A_ji^2,   i = 1,...,d,    (19)

then σ ≤ σ'' := max_i v_i / Σ_{j=1}^n A_ji^2.

Proof. In view of Lemma 4.1, we know that M_j ⪯ ω_j D^{M_j}, hence

    M = Σ_j M_j ⪯ Σ_j ω_j D^{M_j} = Diag(v) ⪯ σ'' D^M.

The rest follows from the definition of σ.

By combining Lemma 4.3, Lemma 4.5 and Lemma 4.6, we obtain the following admissible and easily computable stepsize parameters, smaller than (17):

    (D^4)   D_ii = (1 + 1/(τ−1)) (1 + (τ−1)(σ'' − 1)/s_1) Σ_{j=1}^n A_ji^2,   i = 1,...,d.    (20)

4.5. Comparison between the new and old stepsizes

So far we have seen four different admissible stepsize parameters. For ease of reference, let us call {D^1_ii}_{i=1}^d the new parameters defined in (8), {D^2_ii}_{i=1}^d the existing ones given in [3] (see (13)), {D^3_ii}_{i=1}^d the upper bound of {D^2_ii}_{i=1}^d used in [3] (see (17)), and {D^4_ii}_{i=1}^d the new upper bound of {D^2_ii}_{i=1}^d defined in (20). The next lemma compares the four admissible ESO parameters. The lemma implies that D^1 and D^2 are uniformly smaller (i.e., better; see Theorem 3.2) than D^4 and D^3. However, D^2 involves quantities which are hard to compute (e.g., σ). Moreover, D^4 is always better than D^3.

Lemma 4.7. Let τ ≥ 2. The following holds for all i:

    D^1_ii ≤ D^4_ii ≤ D^3_ii,    (21)
    D^2_ii ≤ D^4_ii ≤ D^3_ii.    (22)

Proof. We only need to show that D^1_ii ≤ D^4_ii; the other relations are already proved. It follows from (19) and the definition of σ'' that for all i we have

    Σ_j ω_j A_ji^2 ≤ σ'' Σ_j A_ji^2.    (23)

Moreover, letting

    σ̄ := max_j ω'_j,    (24)

and using Lemma 4.4 with η = σ'', together with (σ̄ − 1)/σ̄ ≤ 1, we get

    (τ/s − (τ−1)/s_1) ((σ̄ − 1)/σ̄) σ'' ≤ (σ'' − 1)/s_1 + 1/(τ−1).    (25)

Now, for all i, we can write

    D^1_ii = Σ_j [ 1 + (τ−1)(ω_j − 1)/s_1 ] A_ji^2 + (τ/s − (τ−1)/s_1) Σ_j ((ω'_j − 1)/ω'_j) ω_j A_ji^2
           ≤ [ 1 + (τ−1)(σ'' − 1)/s_1 ] Σ_j A_ji^2 + (τ/s − (τ−1)/s_1) ((σ̄ − 1)/σ̄) σ'' Σ_j A_ji^2      (by (23) and (24))
           ≤ [ 1 + (τ−1)(σ'' − 1)/s_1 + (σ'' − 1)/s_1 + 1/(τ−1) ] Σ_j A_ji^2      (by (25))
           = D^4_ii.

Note that we do not have a simple relation between {D^1_ii}_{i=1}^d and {D^2_ii}_{i=1}^d. Indeed, for some coordinates i the new stepsize parameters {D^1_ii}_{i=1}^d can be smaller than {D^2_ii}_{i=1}^d; see Figure 1 in the next section for an illustration.
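Before moving to the experiments, note that σ'' from (19), and hence the parameters D^3 and D^4, can indeed be obtained in one pass over the data. A minimal sketch (helper names are ours; dense A for brevity), which also checks the relation D^4_ii ≤ D^3_ii from Lemma 4.7:

```python
import numpy as np

def easy_stepsizes(A, block_of, tau):
    """Easily computable parameters D^3 from (17) and D^4 from (19)-(20)."""
    n, d = A.shape
    s = d // (block_of.max() + 1)
    s1 = max(1, s - 1)
    A2 = A ** 2
    col_norms = A2.sum(axis=0)                        # sum_j A_ji^2
    omega = (A != 0).sum(axis=1)                      # omega_j: nonzeros per row
    v = (omega[:, None] * A2).sum(axis=0)             # v_i from (19)
    mask = col_norms > 0
    sigma2 = (v[mask] / col_norms[mask]).max()        # sigma'' = max_i v_i / sum_j A_ji^2
    D3 = 2.0 * (1.0 + (tau - 1) * (omega.max() - 1) / s1) * col_norms
    D4 = (1.0 + 1.0 / (tau - 1)) * (1.0 + (tau - 1) * (sigma2 - 1) / s1) * col_norms
    return D3, D4

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 200)) * (rng.random((40, 200)) < 0.1)
block_of = np.arange(200) // 50
D3, D4 = easy_stepsizes(A, block_of, tau=10)
print(bool(np.all(D4 <= D3 + 1e-12)))                 # Lemma 4.7: D^4_ii <= D^3_ii
```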

5. NUMERICAL EXPERIMENTS

In this section we present preliminary computational results and demonstrate that Hydra² can indeed outperform non-accelerated Hydra. We also show that the runtime of Hydra² per iteration is, as expected, less than twice that of Hydra, while the convergence speed is significantly better on non-strongly convex problems. The experiments were performed on two problems:

1. Dual of SVM. This problem can be formulated as finding x ∈ [0,1]^d that minimizes

    L(x) = (1/(2λd^2)) ‖ Σ_{i=1}^d b_i A_{:i} x^i ‖^2 − (1/d) Σ_{i=1}^d x^i + I_{[0,1]^d}(x).

Here d is the number of examples, each row of the matrix A corresponds to a feature, b ∈ R^d, λ > 0, and I_{[0,1]^d} denotes the indicator function of the set [0,1]^d, defined as I_{[0,1]^d}(x) = 0 if x ∈ [0,1]^d and +∞ if x ∉ [0,1]^d.

2. LASSO:

    L(x) = (1/2) ‖Ax − b‖^2 + λ ‖x‖_1.

The computations were performed on Archer (archer.ac.uk), which is the UK's #1 supercomputer.

5.1. Benefit from the new stepsizes

In this section we compare the four different stepsize parameters and show, through numerical experiments, how they influence the convergence of Hydra². The dataset used in this section is the astro-ph dataset.¹ The problem that we solve is the SVM dual problem; see [10, 11].

Figure 1 displays the values of D^1_ii, D^2_ii, D^3_ii and D^4_ii for i ∈ {1,...,d}, for (c, τ) = (32, 10), on the astro-ph dataset. For this particular normalized dataset the diagonal elements of A^T A are all equal to 1, so the parameters {D^2_ii}_i are equal to a constant for all i; the same holds for {D^3_ii}_i and {D^4_ii}_i. Recall that according to Theorem 3.2, a better convergence rate is guaranteed if smaller stepsize parameters {D_ii}_i are used. Hence it is clear that {D^1_ii}_i and {D^2_ii}_i are better choices than {D^3_ii}_i and {D^4_ii}_i. However, as we mentioned, computing the parameters {D^2_ii}_i requires a considerable computational effort, while computing {D^1_ii}_i is as easy as one pass through the dataset. For this relatively small astro-ph dataset, the computation takes about 1 minute for {D^2_ii}_i and less than 1 second for {D^1_ii}_i.

¹ Astro-ph is a binary classification problem consisting of abstracts of physics papers. The dataset consists of d = 29,882 samples and the feature space has dimension n = 99,757.

Fig. 1: Plots of D^1_ii, D^2_ii, D^3_ii and D^4_ii versus the coordinate index i.

In order to investigate the benefit of the new stepsize parameters, we solved the SVM dual problem on the astro-ph dataset for (c, τ) = (32, 10), using the different stepsize parameters. Figure 2 shows the evolution of the duality gap obtained with the four stepsize parameters mentioned previously. We see clearly from the figure that smaller stepsize parameters yield better convergence speed, as predicted by Theorem 3.2. Moreover, using our easily computable new stepsize parameters {D^1_ii}_i, we achieve convergence speed comparable to that obtained with the existing ESO parameters {D^2_ii}_i.

Fig. 2: Duality gap vs. number of iterations, for the four different stepsize parameters.
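For both test problems the regularizer is separable and the one-dimensional subproblem in Step 8 of Algorithm 1 has a closed-form solution, which is what keeps the coordinate updates cheap. A minimal sketch (our own helper names), with a = s θ_k D_ii / τ and g = ∇_i f(θ_k^2 u_k + z_k):

```python
import numpy as np

def step8_lasso(g, a, z_i, lam):
    """Step 8 with R_i(w) = lam*|w| (LASSO):
    minimize over t:  g*t + (a/2)*t^2 + lam*|z_i + t|."""
    w = z_i - g / a
    w = np.sign(w) * max(abs(w) - lam / a, 0.0)       # soft-thresholding
    return w - z_i

def step8_box(g, a, z_i):
    """Step 8 with R_i = indicator of [0,1] (dual SVM):
    minimize over t:  g*t + (a/2)*t^2  subject to  z_i + t in [0,1]."""
    w = min(max(z_i - g / a, 0.0), 1.0)               # projection onto [0,1]
    return w - z_i

print(step8_lasso(g=0.3, a=2.0, z_i=0.5, lam=0.2))    # approx. -0.25
print(step8_box(g=0.3, a=2.0, z_i=0.5))               # approx. -0.15
```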

5.2. Hydra² vs Hydra

In this section we report experimental results comparing Hydra with Hydra² on the LASSO problem. We generated a sparse matrix A having the same block angular structure as the one used in [3, Sec. 7], with d = 50 billion, c = 8, s = 62,500,000 (note that d = cs) and n = 500,000. The average number of nonzero elements per row of A, i.e., (1/n) Σ_j ω_j, is 5,000, and the maximal number of nonzero elements in a row, i.e., max_j ω_j, is 201,389. The dataset size is 400 GB.

We show in Figure 3 the error decrease with respect to the number of iterations, plotted in log scale. It is clear that Hydra² provides a better iteration convergence rate than Hydra. Moreover, we see from Figure 4 that the speedup in terms of the number of iterations is sufficiently large that Hydra² also converges faster than Hydra in time, even though the runtime of Hydra² per iteration is more expensive than that of Hydra.

Fig. 3: Evolution of L(x_k) − L* with the number of iterations, plotted in log scale.

Fig. 4: Evolution of L(x_k) − L* in time, plotted in log scale.

6. REFERENCES

[1] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin, Parallel coordinate descent for l1-regularized loss minimization, in ICML, 2011.
[2] Peter Richtárik and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv preprint.
[3] Peter Richtárik and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv preprint.
[4] Olivier Fercoq and Peter Richtárik, Accelerated, parallel and proximal coordinate descent, arXiv preprint.
[5] Yurii Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, vol. 27, no. 2, 1983.
[6] Paul Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM Journal on Optimization.
[7] Amir Beck and Marc Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, vol. 2, no. 1, 2009.
[8] Olivier Fercoq and Peter Richtárik, Smooth minimization of nonsmooth functions with parallel coordinate descent methods, arXiv preprint.
[9] Grégoire Allaire and Sidi Mahmoud Kaber, Numerical Linear Algebra, vol. 55 of Texts in Applied Mathematics, Springer, New York, 2008. Translated from the 2002 French original by Karim Trabelsi.
[10] Shai Shalev-Shwartz and Tong Zhang, Stochastic dual coordinate ascent methods for regularized loss, Journal of Machine Learning Research, vol. 14, no. 1, 2013.
[11] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro, Mini-batch primal and dual methods for SVMs, in ICML, 2013.
