arxiv: v1 [math.oc] 18 Mar 2016


Katyusha: Accelerated Variance Reduction for Faster SGD

Zeyuan Allen-Zhu
Princeton University

March 18, 2016

Abstract

We consider minimizing $f(x)$ that is an average of $n$ convex, smooth functions $f_i(x)$, and provide the first direct stochastic gradient method, Katyusha, that has the accelerated convergence rate. It converges to an $\varepsilon$-approximate minimizer using $O\big((n + \sqrt{n\kappa}) \cdot \log\frac{f(x_0) - f(x^*)}{\varepsilon}\big)$ stochastic gradients, where $\kappa$ is the condition number. Katyusha is a primal-only method, supporting proximal updates, non-Euclidean norm smoothness, mini-batch sampling, as well as non-uniform sampling. It also resolves the following open questions in machine learning:

• If $f(x)$ is not strongly convex (e.g., Lasso, logistic regression), Katyusha gives the first stochastic method that achieves the optimal $1/\sqrt{\varepsilon}$ rate.
• If $f(x)$ is strongly convex and each $f_i(x)$ is "rank-one" (e.g., SVM), Katyusha gives the first stochastic method that achieves the optimal $1/\sqrt{\varepsilon}$ rate.
• If $f(x)$ is not strongly convex and each $f_i(x)$ is "rank-one" (e.g., L1SVM), Katyusha gives the first stochastic method that achieves the optimal $1/\varepsilon$ rate.

The main ingredient in Katyusha is a novel "negative momentum on top of momentum" that can be elegantly coupled with the existing variance reduction trick for stochastic gradient descent. As a result, since variance reduction has been successfully applied to a fast growing list of practical problems, our paper implies that one had better hurry up and give Katyusha a hug in each of them, in the hope of a faster running time also in practice.

1 Introduction

Consider the following composite convex minimization problem

$$\min_{x \in \mathbb{R}^d} \Big\{ F(x) \overset{\mathrm{def}}{=} f(x) + \psi(x) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \Big\}. \tag{1.1}$$

Here, $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ is a convex function that is a finite average of $n$ smooth, convex functions $f_i(x)$, and $\psi(x)$ is a convex, lower semicontinuous (but possibly non-differentiable) function, sometimes referred to as the proximal function. We shall mostly focus on the case when $\psi(x)$ is $\sigma$-strongly convex in this paper. Both the smoothness assumption on $f_i(x)$ and the strong convexity assumption on $\psi(x)$ can be relaxed, see Section 1.2. We are interested in finding an approximate minimizer $x \in \mathbb{R}^d$ satisfying $F(x) \le F(x^*) + \varepsilon$, where $x^*$ is a minimizer of $F(x)$.

Problem (1.1) arises in many places in machine learning, statistics, and operations research. For instance, all convex regularized empirical risk minimization (ERM) problems fall into this category, see Section 1.2 for details. It has also been recently observed that efficient stochastic algorithms for solving (1.1) give rise to fast training algorithms for neural nets [2, 15].

Perhaps the simplest first-order method to solve (1.1) is proximal gradient descent:

$$x_{k+1} \leftarrow \arg\min_{y \in \mathbb{R}^d} \Big\{ \frac{1}{2\eta}\|y - x_k\|^2 + \langle \nabla f(x_k), y\rangle + \psi(y) \Big\}.$$

Above, $\eta$ is the step length, and if the proximal function $\psi(y)$ equals zero, the update simply reduces to $x_{k+1} \leftarrow x_k - \eta\nabla f(x_k)$. Since computing the full gradient $\nabla f(\cdot)$ is usually very expensive, stochastic gradient update rules have been proposed instead:

$$x_{k+1} \leftarrow \arg\min_{y \in \mathbb{R}^d} \Big\{ \frac{1}{2\eta}\|y - x_k\|^2 + \langle \tilde{\nabla}_k, y\rangle + \psi(y) \Big\},$$

where $\tilde{\nabla}_k$ is a random vector satisfying $\mathbb{E}[\tilde{\nabla}_k] = \nabla f(x_k)$ and is referred to as the gradient estimator. Given the finite average structure $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, a popular choice for the gradient estimator is to set $\tilde{\nabla}_k = \nabla f_i(x_k)$ for some random index $i \in [n]$ per iteration. Methods based on this choice are known as stochastic gradient descent (SGD) methods [9, 34]. As the computation of $\nabla f_i(x)$ is usually $n$ times faster than that of $\nabla f(x)$, SGD is suitable for large-scale machine learning tasks.

Variance Reduction. Recently, the convergence speed of SGD has been improved with the variance-reduction technique [8, 10, 11, 15, 22, 23, 28-30, 32, 33]. In all of these cited results, the authors have, in one way or another, shown that SGD converges much faster if one makes a better choice of the gradient estimator $\tilde{\nabla}_k$ so that its variance $\mathbb{E}[\|\tilde{\nabla}_k - \nabla f(x_k)\|^2]$ reduces as $k$ increases. One particular way to choose this estimator can be described as follows. Keep a snapshot $\tilde{x} = x_k$ after every $m$ stochastic update steps (where $m$ is some parameter that is usually on the order of $n$), and compute the full gradient $\nabla f(\tilde{x})$ only for such snapshots. Then, set $\tilde{\nabla}_k = \nabla f_i(x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$ as the gradient estimator. One can verify that, under this choice of $\tilde{\nabla}_k$, it satisfies $\mathbb{E}[\tilde{\nabla}_k] = \nabla f(x_k)$ and $\lim_{k \to \infty} \mathbb{E}[\|\tilde{\nabla}_k - \nabla f(x_k)\|^2] = 0$.

Unfortunately, all of these cited results on variance reduction provide non-accelerated convergence rates for solving (1.1).
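The snapshot-based estimator described above can be checked numerically on a toy problem. The sketch below is illustrative only: the scalar least-squares components $f_i(x) = \frac12(a_i x - b_i)^2$, the data in `DATA`, and all function names are assumptions made for this example, not part of the paper. It averages the estimator exactly over all indices $i$, so it shows unbiasedness and the shrinking variance as the query point approaches the snapshot.

```python
def grad_fi(a, b, x):
    """Gradient of the hypothetical component f_i(x) = 0.5 * (a*x - b)**2."""
    return a * (a * x - b)


def full_grad(data, x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    return sum(grad_fi(a, b, x) for a, b in data) / len(data)


def vr_estimator(data, x, snapshot, snapshot_grad, i):
    """grad f_i(x) - grad f_i(snapshot) + grad f(snapshot) for index i."""
    a, b = data[i]
    return grad_fi(a, b, x) - grad_fi(a, b, snapshot) + snapshot_grad


def mean_and_variance(data, x, snapshot):
    """Exact mean and variance of the estimator over a uniform index i."""
    n = len(data)
    mu = full_grad(data, snapshot)      # full gradient, computed once per snapshot
    target = full_grad(data, x)
    ests = [vr_estimator(data, x, snapshot, mu, i) for i in range(n)]
    mean = sum(ests) / n
    variance = sum((e - target) ** 2 for e in ests) / n
    return mean, variance


DATA = [(1.0, 2.0), (2.0, -1.0), (0.5, 3.0), (1.5, 0.0)]  # toy (a_i, b_i) pairs

if __name__ == "__main__":
    for x in (1.3, 0.4, 0.3):
        mean, var = mean_and_variance(DATA, x, snapshot=0.3)
        print(x, mean - full_grad(DATA, x), var)
```

On this toy instance the variance is proportional to the squared distance between the query point and the snapshot, which is the mechanism that lets the variance vanish as both converge to the minimizer.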
For instance, the widely-used SVRG and SAGA algorithms obtain $\varepsilon$-approximate minimizers for (1.1) in $O\big((n + \frac{L}{\sigma}) \log\frac{F(x_0) - F(x^*)}{\varepsilon}\big)$ iterations of stochastic gradient updates. The ratio $\kappa \overset{\mathrm{def}}{=} L/\sigma$ is often called the condition number of the problem, and it is an open question how to obtain an accelerated method that gives the optimal square-root dependence on $\kappa$, rather than the linear dependence on $\kappa$.

The recent work on Catalyst [13, 19] by two independent groups of researchers partially answered this open question. They demonstrate that one can solve (1.1) in only $O\big((n + \sqrt{n\kappa}) \log\kappa \log\frac{1}{\varepsilon}\big)$ stochastic iterations, through a black-box reduction (that they refer to as Catalyst) to non-accelerated methods. Their result is still imperfect for at least the following reasons:

• Optimality. It does not match the optimal dependence on $\kappa$. It does not give the optimal rate $1/\sqrt{\varepsilon}$ if $F(\cdot)$ is not strongly convex. It does not give the optimal rate $1/\sqrt{\varepsilon}$ if $f(\cdot)$ is non-smooth. It does not give the optimal rate $1/\varepsilon$ if both $F(\cdot)$ is not strongly convex and $f(\cdot)$ is non-smooth. To the best of our knowledge, it does not support non-Euclidean norm smoothness on $f_i(\cdot)$.
• Practicality. Catalyst is not very practical since each of its inner iterations needs to be executed very accurately. This makes the stopping criterion hard to tune, and makes Catalyst slower than its competitors for several subclasses of problem (1.1) such as the famous ERM problems, see Section 1.2.

• Non-convexity. The non-accelerated variance-reduction algorithms apply very well even to non-convex problems in practice (such as training neural nets), both empirically [15] and theoretically [2]. Therefore, it is very desirable to develop a direct accelerated variance-reduction method due to its potential applicability to non-convex problems as well. Unfortunately, the Catalyst reduction does not seem to help in non-convex cases.

1.1 Our Results

In this paper, we provide a direct, accelerated stochastic gradient method Katyusha for solving (1.1) in

$$O\Big((n + \sqrt{n\kappa}) \log\frac{F(x_0) - F(x^*)}{\varepsilon}\Big) \text{ iterations},$$

where $x_0$ is the given starting vector. Each iteration of Katyusha requires only the computation of $O(1)$ stochastic gradients $\nabla f_i(x)$. This gives both the optimal dependence on $\kappa$ and on $\varepsilon$ which, to the best of our knowledge, was never obtained before among the class of stochastic gradient methods (footnote 1).

If $F(\cdot)$ is not strongly convex, Katyusha can also work for such objectives and needs a total of

$$O\Big(n \log\frac{F(x_0) - F(x^*)}{\varepsilon} + \frac{\sqrt{nL}\,\|x_0 - x^*\|}{\sqrt{\varepsilon}}\Big) \text{ iterations}.$$

This gives the optimal rate $1/\sqrt{\varepsilon}$ which, to the best of our knowledge, was never obtained before among the class of stochastic gradient methods (footnote 2).

Our Algorithm. If ignoring the proximal term $\psi(\cdot)$, our Katyusha method iteratively updates:

• $x_{k+1} \leftarrow \tau_1 z_k + \tau_2 \tilde{x} + (1 - \tau_1 - \tau_2) y_k$;
• define $\tilde{\nabla}_{k+1} \leftarrow \nabla f(\tilde{x}) + \nabla f_i(x_{k+1}) - \nabla f_i(\tilde{x})$ where $i$ is a random index in $[n]$;
• $y_{k+1} \leftarrow x_{k+1} - \frac{1}{3L}\tilde{\nabla}_{k+1}$, and
• $z_{k+1} \leftarrow z_k - \alpha\tilde{\nabla}_{k+1}$.

Above, $\tilde{x}$ is a snapshot point which is updated every $n$ iterations. $\tilde{\nabla}_{k+1}$ is the gradient estimator that satisfies $\mathbb{E}_i[\tilde{\nabla}_{k+1}] = \nabla f(x_{k+1})$ and is defined in the variance-reduction manner similar to known non-accelerated variance-reduction methods. Keeping a sequence of three vectors $(x_k, y_k, z_k)$ is a common ingredient that can be found in all existing accelerated methods (footnote 3).

Our New Technique: Negative Momentum. The most surprising part of Katyusha is the novel choice of $x_{k+1}$, which is a convex combination of three vectors: $y_k$, $z_k$, and $\tilde{x}$. Our theoretical analysis suggests the parameter choices $\tau_2 = 0.5$ and $\tau_1 = \min\{\sqrt{n\sigma/L}, 0.5\}$. To properly explain this novel combination, let us recall a "momentum" view of accelerated methods.

In a classical accelerated (non-stochastic) gradient method, $x_{k+1}$ is only a convex combination of $y_k$ and $z_k$ (see for instance [4]). In fact, $z_k$ plays the role of "momentum" which adds a weighted sum of the history of the gradients into $y_{k+1}$. As an illustrative example, suppose that $\tau_2 = 0$, $\tau_1 = \tau$, and $x_0 = y_0 = z_0$. Then, one can compute that

$$y_k = \begin{cases} x_0 - \frac{1}{3L}\tilde{\nabla}_1, & k = 1; \\ x_0 - \frac{1}{3L}\tilde{\nabla}_2 - \big(\frac{1-\tau}{3L} + \tau\alpha\big)\tilde{\nabla}_1, & k = 2; \\ x_0 - \frac{1}{3L}\tilde{\nabla}_3 - \big(\frac{1-\tau}{3L} + \tau\alpha\big)\tilde{\nabla}_2 - \big(\frac{(1-\tau)^2}{3L} + (1 - (1-\tau)^2)\alpha\big)\tilde{\nabla}_1, & k = 3. \end{cases}$$

Since the parameter $\alpha$ is usually much larger than $\frac{1}{3L}$, the above recursion suggests that one can gradually increase the weight of gradients from earlier iterations, and this is known as "momentum", which is at the heart of accelerated first-order methods.

Unfortunately, momentum is very dangerous for stochastic gradients. For instance, if one of the historical gradient estimators $\tilde{\nabla}_t$ is somewhat inaccurate (i.e., it is very different from $\nabla f(x_t)$), then further moving in this direction may put us in trouble and not decrease the objective anymore. This is one of the major reasons that a majority of the researchers working on stochastic gradient descent have found acceleration / momentum not very useful in practice.

Footnote 1. Of course, the non-stochastic full-gradient method of Nesterov can achieve such optimal dependence on $\kappa$ and $\varepsilon$.
Footnote 2. Of course, the non-stochastic full-gradient method of Nesterov can achieve such optimal convergence rate $1/\sqrt{\varepsilon}$.
Footnote 3. One can of course rewrite the algorithm and keep track of only two vectors per iteration. However, that will make the statement of the algorithm less clean so we refrain from doing so in this paper.

In Katyusha, we put a "magnet" around $\tilde{x}$, which we define to be essentially the average of $y_k$ over the most recent $n$ iterations. Whenever we define $x_{k+1}$, it will be attracted by the magnet $\tilde{x}$ with weight $\tau_2 = 0.5$. This is a very strong magnet: it ensures that $x_{k+1}$ is not too far away from $\tilde{x}$ so the gradient estimator remains somewhat accurate; at the same time, it retracts $x_{k+1}$ back to $\tilde{x}$, which can be understood as introducing a "negative momentum" that removes a fraction of the past stochastic gradients. This summarizes the high-level idea behind Katyusha, and its formal convergence analysis can be found in the subsequent sections.

Comparison with Other Accelerated Gradient Methods. For smooth convex minimization problems, gradient descent converges at a rate $\frac{L}{\varepsilon}$, or $\frac{L}{\sigma}\log\frac{1}{\varepsilon}$ if the objective is $\sigma$-strongly convex. This is not optimal among the class of first-order methods. In 1983, Nesterov showed that the optimal rate should be $\frac{\sqrt{L}}{\sqrt{\varepsilon}}$, or $\sqrt{\frac{L}{\sigma}}\log\frac{1}{\varepsilon}$ if the objective is $\sigma$-strongly convex, and this was achieved by his celebrated accelerated gradient descent method [24].

Randomized Coordinate Descent. An alternative way to define the gradient estimator is to set $\tilde{\nabla}_k = d\,\nabla_i f(x_k)$ where $\nabla_i f(x_k)$ is the coordinate gradient and $i$ is randomly chosen in $\{1, 2, \dots, d\}$. This is known as randomized coordinate descent, as opposed to stochastic gradient descent. We emphasize here that designing accelerated methods for coordinate descent is significantly easier than designing them for stochastic gradient descent, and this has indeed been done in many previous results including [7, 12, 18, 20, 21, 26] (footnote 4). The state-of-the-art accelerated coordinate descent method is NUACDM [7].

Linear Coupling.
In a recent work by Allen-Zhu and Orecchia, the authors have proposed a new framework called linear coupling that facilitates the design of accelerated gradient methods [4]. Their framework not only reconstructs Nesterov's accelerated full-gradient method [24] and provides an even faster accelerated coordinate descent method [7], but also leads to many recent breakthroughs in designing accelerated methods for non-smooth problems (such as positive LP [5, 6] and positive SDP [3]) or even general non-convex problems [2]. The present paper also falls into this linear-coupling framework.

1.2 Optimal Convergence Rates for Empirical Risk Minimization Problems

There are a few interesting subcategories of problem (1.1), and each of them corresponds to some well-known training problem in machine learning. Suppose we are given $n$ vectors $a_1, \dots, a_n \in \mathbb{R}^d$ that are the feature vectors of $n$ samples. Then, it is interesting to study the special case of (1.1) where each $f_i(x)$ is "rank-one" structured: that is, $f_i(x) \overset{\mathrm{def}}{=} f_i(\langle a_i, x\rangle)$. Assuming rank-one simplifies the notations; all of the results stated in this subsection generalize to rank-$O(1)$ structured $f_i(x)$'s. In such a case, we rewrite (1.1) as

$$\min_{x \in \mathbb{R}^d} \Big\{ F(x) \overset{\mathrm{def}}{=} f(x) + \psi(x) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(\langle a_i, x\rangle) + \psi(x) \Big\}. \tag{1.2}$$

Footnote 4. The reason behind this can be understood as follows. If a function $f(\cdot)$ is $L$-smooth with respect to each coordinate $i$, then a constant-step update $x' \leftarrow x - \frac{1}{L}\nabla_i f(x) e_i$ at least guarantees that it decreases the objective, i.e., $f\big(x - \frac{1}{L}\nabla_i f(x) e_i\big) < f(x)$. Decreasing the objective value is usually viewed as an important component in existing accelerated methods (see for instance the gradient descent step summarized in [4]). Unfortunately, this property is false for stochastic gradient descent, because $f(x_k - \eta\tilde{\nabla}_k)$ may be even larger than $f(x_k)$ even for very small step length $\eta > 0$.
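The contrast drawn in the footnote above can be seen on two toy functions. The sketch below is an illustration under assumed examples, not taken from the paper: `avg_objective` averages the two hypothetical components $\frac12(x-1)^2$ and $\frac12(x+1)^2$, and `quad` is the coordinate-wise 2-smooth function $u^2 + uv + v^2$. At the minimizer of the average, any single-component gradient step increases the objective for every positive step length, while a $1/L$ coordinate step on `quad` decreases it.

```python
def avg_objective(x):
    """f(x) = 0.5 * (f_1(x) + f_2(x)), f_1 = 0.5*(x-1)^2, f_2 = 0.5*(x+1)^2."""
    return 0.5 * (0.5 * (x - 1.0) ** 2 + 0.5 * (x + 1.0) ** 2)


def stochastic_step(x, i, eta):
    """One SGD step using only component i (0 or 1)."""
    grad_i = (x - 1.0) if i == 0 else (x + 1.0)
    return x - eta * grad_i


def quad(u, v):
    """Convex quadratic that is 2-smooth along each coordinate."""
    return u * u + u * v + v * v


def coordinate_step(u, v, coord):
    """Constant-step (1/L with L = 2) update along a single coordinate."""
    L = 2.0
    if coord == 0:
        return u - (2.0 * u + v) / L, v
    return u, v - (u + 2.0 * v) / L


if __name__ == "__main__":
    # x = 0 minimizes avg_objective, yet every stochastic step moves away from it.
    for eta in (0.1, 0.01, 0.001):
        for i in (0, 1):
            print(eta, i, avg_objective(stochastic_step(0.0, i, eta)) - avg_objective(0.0))
    # Coordinate steps decrease quad from the point (1, 1).
    for coord in (0, 1):
        print(coord, quad(*coordinate_step(1.0, 1.0, coord)) - quad(1.0, 1.0))
```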

Without loss of generality, we assume that each $a_i$ has Euclidean norm 1. Denoting by $l_i$ the training label of data sample $a_i$, one can consider the following four interesting classes of (1.2):

Case 1: $\psi(x)$ is $\sigma$-strongly convex and $f(x)$ is $L$-smooth. Examples:
• ridge regression: $f(x) = \frac{1}{2n}\sum_{i=1}^n (\langle a_i, x\rangle - l_i)^2$ and $\psi(x) = \frac{\sigma}{2}\|x\|_2^2$.
• elastic net: $f(x) = \frac{1}{2n}\sum_{i=1}^n (\langle a_i, x\rangle - l_i)^2$ and $\psi(x) = \frac{\sigma}{2}\|x\|_2^2 + \lambda\|x\|_1$.

Case 2: $\psi(x)$ is not strongly convex and $f(x)$ is $L$-smooth. Examples:
• Lasso: $f(x) = \frac{1}{2n}\sum_{i=1}^n (\langle a_i, x\rangle - l_i)^2$ and $\psi(x) = \lambda\|x\|_1$.
• $\ell_1$ logistic regression: $f(x) = \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-l_i\langle a_i, x\rangle))$ and $\psi(x) = \lambda\|x\|_1$.

Case 3: $\psi(x)$ is $\sigma$-strongly convex and $f(x)$ is non-smooth (but Lipschitz continuous). Examples:
• SVM: $f(x) = \frac{1}{n}\sum_{i=1}^n \max\{0, 1 - l_i\langle a_i, x\rangle\}$ and $\psi(x) = \frac{\sigma}{2}\|x\|_2^2$.

Case 4: $\psi(x)$ is not strongly convex and $f(x)$ is non-smooth (but Lipschitz continuous). Examples:
• $\ell_1$-SVM: $f(x) = \frac{1}{n}\sum_{i=1}^n \max\{0, 1 - l_i\langle a_i, x\rangle\}$ and $\psi(x) = \lambda\|x\|_1$.

For all of the four classes above, accelerated stochastic methods have already been introduced in the literature, most notably AccSDCA [31], APCG [20], and SPDC [35]. However, to the best of our knowledge, all known accelerated methods have suboptimal convergence rates for Cases 2, 3 and 4 (footnote 5). In particular, the best known convergence rate was $\frac{\log(1/\varepsilon)}{\sqrt{\varepsilon}}$ for Cases 2 and 3, and was $\frac{\log(1/\varepsilon)}{\varepsilon}$ for Case 4. This is a factor $\log\frac{1}{\varepsilon}$ worse than the optimal rate for each of the three classes, and is especially interesting to the optimization community because obtaining the optimal rate is one of the ultimate goals in optimization. See for instance an interesting attempt by Lan and Zhou [17] to fix this log factor, as well as successful attempts at fixing a similar log factor in online learning, which is a different problem [14, 16, 27].

Our Katyusha algorithm simultaneously closes the gap for all three classes of problems with the help of the recent optimal reductions by Allen-Zhu and Hazan [1]. In short, we obtain an $\varepsilon$-approximate minimizer for Case 2 in $O\big(n\log\frac{1}{\varepsilon} + \frac{\sqrt{nL}}{\sqrt{\varepsilon}}\big)$ iterations, for Case 3 in $O\big(n\log\frac{1}{\varepsilon} + \frac{\sqrt{n}}{\sqrt{\sigma\varepsilon}}\big)$ iterations, and for Case 4 in $O\big(n\log\frac{1}{\varepsilon} + \frac{\sqrt{n}}{\varepsilon}\big)$ iterations. In contrast, none of the existing accelerated methods can lead to such optimal rates even if the optimal reductions of [1] are used, see the discussion in Section 5.

Besides giving optimal rates for the mentioned classes, our algorithm is also a primal-only method because we only need to compute $\nabla f_i(x)$ at different points $x$ and for different indices $i$. None of the existing accelerated stochastic methods for solving ERM problems are primal-only, and this is essentially the reason that their convergence rates are suboptimal (footnote 6).

Footnote 5. In fact, they also have the suboptimal dependence on the condition number $L/\sigma$ for Case 1.
Footnote 6. This is so because even for Case 1, when converting an $\varepsilon$-approximate maximizer for the dual objective to the primal, one only obtains an $n\kappa\varepsilon$-approximate minimizer on the primal objective. As a result, algorithms like APCG that directly work on the dual, algorithms like SPDC that maintain both primal and dual variables, and algorithms like RPDG [17] that are primal-like but still use dual analysis, have to suffer from this log loss in the convergence rates.

1.3 Other Extensions

Mini-batch. Katyusha naturally extends to the mini-batch scenario. Instead of using a single stochastic gradient $\nabla f_i(\cdot)$ per iteration, one can use the average of $b$ stochastic gradients $\frac{1}{b}\sum_{j \in S}\nabla f_j(\cdot)$ where $S$ is a random subset of $[n]$ with cardinality $b$. All of our results extend to this setting, where the only change needed in the algorithm is to re-compute the snapshot every $n/b$ iterations rather than every $n$ iterations.

Non-Uniform Sampling. If each $f_i(\cdot)$ has a different smoothness parameter $L_i$, then one has to select the random index $i$ from a non-uniform distribution in order to obtain the fastest running time. This can be done following the same techniques proposed in [7], but will make the notations significantly heavier, especially in the presence of the proximal term $\psi(\cdot)$. We refrain from doing so in this version of the paper.

Non-Euclidean Norms. If the smoothness of the functions $f_i(x)$ is with respect to a non-Euclidean norm (such as the well known $\ell_1$ norm case over the simplex), our results in this paper still hold. In particular, our update on the $y_{k+1}$ side becomes the non-Euclidean norm gradient descent, and the update on the $z_{k+1}$ side becomes the non-Euclidean norm mirror descent with respect to the Bregman divergence of a strongly convex potential function. Our analysis in this paper can be translated into this more general scenario following the techniques of [4]. Since this extension is simple but complicates the notations, we defer it to a full version of this paper.

1.4 Conclusion

Roadmap. We provide necessary notations and useful theorems in Section 2. In Section 3, we focus on analyzing one single iteration of Katyusha, and in Section 4 we provide the convergence analysis of Katyusha for the strongly convex case of problem (1.1). In Section 5, we apply Katyusha to non-strongly convex or non-smooth objectives by applying the optimal reductions in [1]. In Section 6, we provide a direct algorithm for solving the non-strongly convex case of problem (1.1) with the optimal $1/\sqrt{\varepsilon}$ rate, and compare it with the literature. We shall include experimental results in a next version of this paper.

2 Preliminaries

Throughout this paper, we denote by $\|\cdot\|$ the Euclidean norm.
We denote by $\nabla f(x)$ the full gradient vector of function $f$ if it is differentiable, or the subgradient vector if $f$ is only Lipschitz continuous. Recall some classical definitions on strong convexity and smoothness.

Definition 2.1 (Smoothness and strong convexity). For a convex function $f : \mathbb{R}^n \to \mathbb{R}$,
• We say $f$ is $\sigma$-strongly convex if for all $x, y \in \mathbb{R}^n$, it satisfies $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\sigma}{2}\|x - y\|^2$.
• We say $f$ is $L$-smooth if for all $x, y \in \mathbb{R}^n$, it satisfies $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.

We also need to use the following definition by Allen-Zhu and Hazan:

Definition 2.2 ([1]). An algorithm solving the strongly convex case of problem (1.1) satisfies the homogenous objective decrease (HOOD) property with time $\mathrm{Time}(L, \sigma)$ if, for every starting point $x_0$, it produces an output $x'$ satisfying $\mathbb{E}[F(x')] - F(x^*) \le \frac{F(x_0) - F(x^*)}{4}$ in time at most $\mathrm{Time}(L, \sigma)$.

Allen-Zhu and Hazan provided three black-box reduction algorithms AdaptReg, AdaptSmooth, and JointAdaptRegSmooth in their paper [1] to convert any algorithm satisfying the HOOD property with time $\mathrm{Time}(L, \sigma)$ respectively to (1) the non-strongly convex but smooth case, (2) the strongly convex but non-smooth case, and (3) the non-strongly convex and non-smooth case. We simplify and restate their theorems as follows:

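Theorem 2.3 (AdaptReg). Suppose that in problem (1.1) $f(\cdot)$ is $L$-smooth and $x_0$ is a starting vector. Then, AdaptReg produces an output $x$ satisfying $\mathbb{E}[F(x)] - F(x^*) \le O(\varepsilon)$ in a total running time of $\sum_{t=0}^{T-1}\mathrm{Time}(L, \sigma_0 \cdot 2^{-t})$, where $\sigma_0 = \frac{F(x_0) - F(x^*)}{\|x_0 - x^*\|^2}$ and $T = \log_2\frac{F(x_0) - F(x^*)}{\varepsilon}$.

Theorem 2.4 (AdaptSmooth). Suppose that in problem (1.2), $\psi(\cdot)$ is $\sigma$-strongly convex, each $f_i(\cdot)$ is $G$-Lipschitz continuous, and $x_0$ is a starting vector. Then, AdaptSmooth produces an output $x$ satisfying $\mathbb{E}[F(x)] - F(x^*) \le O(\varepsilon)$ in a total running time of $\sum_{t=0}^{T-1}\mathrm{Time}(2^t/\lambda_0, \sigma)$, where $\lambda_0 = \frac{F(x_0) - F(x^*)}{G^2}$ and $T = \log_2\frac{F(x_0) - F(x^*)}{\varepsilon}$.

Theorem 2.5 (JointAdaptRegSmooth). Suppose that in problem (1.2), each $f_i(\cdot)$ is $G$-Lipschitz continuous and $x_0$ is a starting vector. Then, JointAdaptRegSmooth produces an output $x$ satisfying $\mathbb{E}[F(x)] - F(x^*) \le O(\varepsilon)$ in a total running time of $\sum_{t=0}^{T-1}\mathrm{Time}(2^t/\lambda_0, \sigma_0 \cdot 2^{-t})$, where $\lambda_0 = \frac{F(x_0) - F(x^*)}{G^2}$, $\sigma_0 = \frac{F(x_0) - F(x^*)}{\|x_0 - x^*\|^2}$ and $T = \log_2\frac{F(x_0) - F(x^*)}{\varepsilon}$.

Algorithm 1 Katyusha($x_0$, $S$, $\sigma$, $L$)
 1: $m \leftarrow 2n$;  ▷ the time window for re-computing the snapshot $\tilde{x}$
 2: $\tau_2 \leftarrow \frac{1}{2}$, $\tau_1 \leftarrow \min\big\{\sqrt{\frac{m\sigma}{3L}}, \frac{1}{2}\big\}$, $\alpha \leftarrow \frac{1}{3\tau_1 L}$;  ▷ parameters
 3: $y_0 = z_0 = \tilde{x}^0 \leftarrow x_0$;  ▷ initial vectors
 4: for $s \leftarrow 0$ to $S - 1$ do
 5:   $\mu^s \leftarrow \nabla f(\tilde{x}^s)$;  ▷ compute the full gradient only once every $m$ iterations
 6:   for $j \leftarrow 0$ to $m - 1$ do
 7:     $k \leftarrow (sm) + j$;
 8:     $x_{k+1} \leftarrow \tau_1 z_k + \tau_2 \tilde{x}^s + (1 - \tau_1 - \tau_2) y_k$;
 9:     $\tilde{\nabla}_{k+1} \leftarrow \mu^s + \nabla f_i(x_{k+1}) - \nabla f_i(\tilde{x}^s)$ where $i$ is randomly chosen from $\{1, 2, \dots, n\}$;
10:     $z_{k+1} = \arg\min_z \big\{\frac{1}{2\alpha}\|z - z_k\|^2 + \langle\tilde{\nabla}_{k+1}, z\rangle + \psi(z)\big\}$;
11:     Option I: $y_{k+1} \leftarrow \arg\min_y \big\{\frac{3L}{2}\|y - x_{k+1}\|^2 + \langle\tilde{\nabla}_{k+1}, y\rangle + \psi(y)\big\}$;
12:     Option II: $y_{k+1} \leftarrow x_{k+1} + \tau_1(z_{k+1} - z_k)$  ▷ we analyze only Option I in this paper, but Option II also works
13:   end for
14:   $\tilde{x}^{s+1} \leftarrow \big(\sum_{j=0}^{m-1}(1+\alpha\sigma)^j\big)^{-1} \cdot \big(\sum_{j=0}^{m-1}(1+\alpha\sigma)^j \cdot y_{sm+j+1}\big)$;  ▷ weighted average of the previous $m$ iterations
15: end for
16: return $\tilde{x}^S$.

3 One-Iteration Analysis

In this section, we focus on analyzing the behavior of Katyusha (see Algorithm 1) in a single iteration, i.e., for a fixed $k$. We view $y_k$, $z_k$ and $x_{k+1}$ as fixed in this section, so the only randomness comes from the choice of $i$ in iteration $k$. We abbreviate in this section by $\tilde{x} = \tilde{x}^s$ where $s$ is the epoch that iteration $k$ belongs to, and denote by $\sigma_{k+1}^2 \overset{\mathrm{def}}{=} \|\nabla f(x_{k+1}) - \tilde{\nabla}_{k+1}\|^2$, so that $\mathbb{E}[\sigma_{k+1}^2]$ is the variance of the gradient estimator $\tilde{\nabla}_{k+1}$ in this iteration.

Our first lemma lower bounds the expected objective decrease $F(x_{k+1}) - \mathbb{E}[F(y_{k+1})]$. Our $\mathrm{Prog}(x_{k+1})$ defined below is a non-negative, classical quantity that would be a lower bound on the amount of objective decrease if $\tilde{\nabla}_{k+1}$ were equal to $\nabla f(x_{k+1})$, see for instance [4]. However, since the variance $\sigma_{k+1}^2$ is non-zero, this lower bound must be compensated by a negative term that depends on $\mathbb{E}[\sigma_{k+1}^2]$.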
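To make the flow of Algorithm 1 concrete, here is a minimal pure-Python sketch of Option I on a ridge-regression instance (Case 1 of Section 1.2). It is an illustration under stated assumptions rather than the author's implementation: the function names, the zero starting vector, and the use of `random.Random` are choices made for this example; the two proximal steps are written in closed form for $\psi(x) = \frac{\sigma}{2}\|x\|^2$; and the snapshot is taken as the weighted average of the $y$ iterates, which is what the proof of Theorem 4.1 relies on.

```python
import math
import random


def objective(A, labels, sigma, x):
    """F(x) = (1/2n) * sum_i (<a_i, x> - l_i)^2 + (sigma/2) * ||x||^2."""
    n = len(A)
    loss = sum((sum(a * xc for a, xc in zip(row, x)) - l) ** 2
               for row, l in zip(A, labels)) / (2 * n)
    return loss + 0.5 * sigma * sum(xc * xc for xc in x)


def katyusha_ridge(A, labels, sigma, epochs, seed=0):
    """Sketch of Algorithm 1 (Option I) for f_i(x) = 0.5 * (<a_i, x> - l_i)^2."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    L = max(sum(v * v for v in row) for row in A)   # every f_i is L-smooth
    m = 2 * n
    tau2 = 0.5
    tau1 = min(math.sqrt(m * sigma / (3 * L)), 0.5)
    alpha = 1.0 / (3 * tau1 * L)
    theta = 1.0 + alpha * sigma

    def grad_i(i, x):
        residual = sum(a * xc for a, xc in zip(A[i], x)) - labels[i]
        return [residual * a for a in A[i]]

    def full_grad(x):
        total = [0.0] * d
        for i in range(n):
            total = [t + g for t, g in zip(total, grad_i(i, x))]
        return [t / n for t in total]

    y = z = snapshot = [0.0] * d                    # assumed start x_0 = 0
    for _ in range(epochs):
        mu = full_grad(snapshot)                    # one full gradient per epoch
        weighted_sum, weight_total = [0.0] * d, 0.0
        for j in range(m):
            x = [tau1 * zc + tau2 * sc + (1 - tau1 - tau2) * yc
                 for zc, sc, yc in zip(z, snapshot, y)]
            i = rng.randrange(n)
            est = [mc + gx - gs for mc, gx, gs
                   in zip(mu, grad_i(i, x), grad_i(i, snapshot))]
            # Closed-form proximal steps for psi(x) = (sigma/2) * ||x||^2.
            z = [(zc - alpha * gc) / (1 + alpha * sigma) for zc, gc in zip(z, est)]
            y = [(3 * L * xc - gc) / (3 * L + sigma) for xc, gc in zip(x, est)]
            w = theta ** j
            weighted_sum = [acc + w * yc for acc, yc in zip(weighted_sum, y)]
            weight_total += w
        snapshot = [acc / weight_total for acc in weighted_sum]
    return snapshot


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, -0.6]]   # toy unit-norm features
    labels = [1.0, -1.0, 0.5, 2.0]
    x_out = katyusha_ridge(A, labels, sigma=0.1, epochs=100)
    print(x_out, objective(A, labels, 0.1, x_out))
```

For this toy data the exact minimizer solves $(A^\top A/n + \sigma I)x = A^\top l/n$, which gives $x^* = (0.725/0.6,\ -0.75)$, so the output can be compared against it directly. For other choices of $\psi$ the two closed-form lines would have to be replaced by the corresponding proximal operator.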

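Lemma 3.1 (proximal gradient descent). If

$$y_{k+1} = \arg\min_y \Big\{\frac{3L}{2}\|y - x_{k+1}\|^2 + \langle\tilde{\nabla}_{k+1}, y - x_{k+1}\rangle + \psi(y) - \psi(x_{k+1})\Big\}, \text{ and}$$
$$\mathrm{Prog}(x_{k+1}) \overset{\mathrm{def}}{=} -\min_y \Big\{\frac{3L}{2}\|y - x_{k+1}\|^2 + \langle\tilde{\nabla}_{k+1}, y - x_{k+1}\rangle + \psi(y) - \psi(x_{k+1})\Big\} \ge 0,$$

we have

$$F(x_{k+1}) - \mathbb{E}[F(y_{k+1})] \ge \mathbb{E}[\mathrm{Prog}(x_{k+1})] - \frac{1}{4L}\mathbb{E}[\sigma_{k+1}^2].$$

Proof.

$$\begin{aligned}
\mathrm{Prog}(x_{k+1}) &= -\min_y \Big\{\frac{3L}{2}\|y - x_{k+1}\|^2 + \langle\tilde{\nabla}_{k+1}, y - x_{k+1}\rangle + \psi(y) - \psi(x_{k+1})\Big\} \\
&\overset{①}{=} -\Big(\frac{3L}{2}\|y_{k+1} - x_{k+1}\|^2 + \langle\tilde{\nabla}_{k+1}, y_{k+1} - x_{k+1}\rangle + \psi(y_{k+1}) - \psi(x_{k+1})\Big) \\
&= -\Big(\frac{L}{2}\|y_{k+1} - x_{k+1}\|^2 + \langle\nabla f(x_{k+1}), y_{k+1} - x_{k+1}\rangle + \psi(y_{k+1}) - \psi(x_{k+1})\Big) \\
&\qquad + \Big(\langle\nabla f(x_{k+1}) - \tilde{\nabla}_{k+1}, y_{k+1} - x_{k+1}\rangle - L\|y_{k+1} - x_{k+1}\|^2\Big) \\
&\overset{②}{\le} -\big(f(y_{k+1}) - f(x_{k+1}) + \psi(y_{k+1}) - \psi(x_{k+1})\big) + \frac{1}{4L}\|\nabla f(x_{k+1}) - \tilde{\nabla}_{k+1}\|^2.
\end{aligned}$$

Above, ① is by the definition of $y_{k+1}$, and ② uses the smoothness of function $f(\cdot)$, as well as the inequality $\langle a, b\rangle - \frac{1}{2}\|b\|^2 \le \frac{1}{2}\|a\|^2$. Taking expectation on both sides we arrive at the desired result.

The following lemma provides a novel upper bound on the expected variance of the gradient estimator. Note that all known variance reduction analyses for convex optimization, in one way or another, upper bound this variance essentially by $4L \cdot (f(\tilde{x}) - f(x^*))$, the objective distance to the minimizer (c.f. [10, 15]). The recent breakthrough of Allen-Zhu and Hazan [1] upper bounds it by the point distance $\|x_{k+1} - \tilde{x}\|^2$ for non-convex objectives, which is tighter if $\tilde{x}$ is close to $x_{k+1}$ but unfortunately not enough for the purpose of this paper.

In this paper, we upper bound it by the tightest possible quantity, which is almost $2L \cdot (f(\tilde{x}) - f(x_{k+1})) \ll 4L \cdot (f(\tilde{x}) - f(x^*))$. Unfortunately, this upper bound needs to be compensated by an additional term $-\langle\nabla f(x_{k+1}), \tilde{x} - x_{k+1}\rangle$, which could be positive but which we shall cancel using the introduced negative momentum.

Lemma 3.2.

$$\mathbb{E}\big[\|\tilde{\nabla}_{k+1} - \nabla f(x_{k+1})\|^2\big] \le 2L \cdot \big(f(\tilde{x}) - f(x_{k+1}) - \langle\nabla f(x_{k+1}), \tilde{x} - x_{k+1}\rangle\big).$$

Proof. Each $f_i(x)$, being convex and $L$-smooth, implies the following inequality, which is classical in convex optimization and can be found for instance in Theorem 2.1.5 of the textbook of Nesterov [25]:

$$\|\nabla f_i(x_{k+1}) - \nabla f_i(\tilde{x})\|^2 \le 2L \cdot \big(f_i(\tilde{x}) - f_i(x_{k+1}) - \langle\nabla f_i(x_{k+1}), \tilde{x} - x_{k+1}\rangle\big).$$

Therefore, taking expectation over the random choice of $i$, we have

$$\begin{aligned}
\mathbb{E}\big[\|\tilde{\nabla}_{k+1} - \nabla f(x_{k+1})\|^2\big] &= \mathbb{E}\big[\|(\nabla f_i(x_{k+1}) - \nabla f_i(\tilde{x})) - (\nabla f(x_{k+1}) - \nabla f(\tilde{x}))\|^2\big] \\
&\overset{①}{\le} \mathbb{E}\big[\|\nabla f_i(x_{k+1}) - \nabla f_i(\tilde{x})\|^2\big] \\
&\le 2L \cdot \mathbb{E}\big[f_i(\tilde{x}) - f_i(x_{k+1}) - \langle\nabla f_i(x_{k+1}), \tilde{x} - x_{k+1}\rangle\big] \\
&= 2L \cdot \big(f(\tilde{x}) - f(x_{k+1}) - \langle\nabla f(x_{k+1}), \tilde{x} - x_{k+1}\rangle\big).
\end{aligned}$$

Above, ① is because for any random vector $\zeta \in \mathbb{R}^d$, it holds that $\mathbb{E}\|\zeta - \mathbb{E}\zeta\|^2 = \mathbb{E}\|\zeta\|^2 - \|\mathbb{E}\zeta\|^2$.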
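The bound of Lemma 3.2 can be sanity-checked numerically. The sketch below is only an illustration under assumed toy data: it uses scalar least-squares components $f_i(x) = \frac12(a_i x - b_i)^2$ (so $L = \max_i a_i^2$), and the function name and sample values are hypothetical. It computes the exact variance of the estimator over a uniform index $i$ together with the right-hand side $2L\,(f(\tilde{x}) - f(x) - \langle\nabla f(x), \tilde{x} - x\rangle)$.

```python
def variance_and_bound(data, x, snapshot):
    """Return (exact estimator variance, Lemma 3.2 bound) for scalar x."""
    n = len(data)
    L = max(a * a for a, _ in data)          # smoothness of every f_i

    def f_i(a, b, t):
        return 0.5 * (a * t - b) ** 2

    def g_i(a, b, t):
        return a * (a * t - b)

    def f(t):
        return sum(f_i(a, b, t) for a, b in data) / n

    def g(t):
        return sum(g_i(a, b, t) for a, b in data) / n

    mu = g(snapshot)                         # full gradient at the snapshot
    variance = sum((mu + g_i(a, b, x) - g_i(a, b, snapshot) - g(x)) ** 2
                   for a, b in data) / n
    bound = 2 * L * (f(snapshot) - f(x) - g(x) * (snapshot - x))
    return variance, bound


SAMPLES = [(1.0, 2.0), (2.0, -1.0), (0.5, 3.0), (1.5, 0.0)]  # toy (a_i, b_i)

if __name__ == "__main__":
    for x, snap in ((1.3, 0.3), (-2.0, 0.5), (0.31, 0.3)):
        print(x, snap, variance_and_bound(SAMPLES, x, snap))
```

For quadratics both sides scale with $(\tilde{x} - x)^2$, so the ratio between them is constant on this instance; the interest of the lemma is that the bound involves $f(\tilde{x}) - f(x_{k+1})$ rather than the distance to the minimizer.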

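The next lemma is a classical one for proximal mirror descent.

Lemma 3.3 (proximal mirror descent). Suppose $\psi(\cdot)$ is $\sigma$-strongly convex. If $\tilde{\nabla}_{k+1}$ is fixed and

$$z_{k+1} = \arg\min_z \Big\{\frac{1}{2}\|z - z_k\|^2 + \alpha\langle\tilde{\nabla}_{k+1}, z - z_k\rangle + \alpha\psi(z) - \alpha\psi(z_k)\Big\},$$

then it satisfies for all $u \in \mathbb{R}^d$,

$$\alpha\langle\tilde{\nabla}_{k+1}, z_{k+1} - u\rangle + \alpha\psi(z_{k+1}) - \alpha\psi(u) \le -\frac{1}{2}\|z_k - z_{k+1}\|^2 + \frac{1}{2}\|z_k - u\|^2 - \frac{1 + \alpha\sigma}{2}\|z_{k+1} - u\|^2.$$

Proof. By the minimality definition of $z_{k+1}$, we have that

$$z_{k+1} - z_k + \alpha\tilde{\nabla}_{k+1} + \alpha g = 0,$$

where $g$ is some subgradient of $\psi(z)$ at point $z = z_{k+1}$. This implies that for every $u$ it satisfies

$$0 = \langle z_{k+1} - z_k + \alpha\tilde{\nabla}_{k+1} + \alpha g, z_{k+1} - u\rangle.$$

At this point, using the equality $\langle z_{k+1} - z_k, z_{k+1} - u\rangle = \frac{1}{2}\|z_k - z_{k+1}\|^2 - \frac{1}{2}\|z_k - u\|^2 + \frac{1}{2}\|z_{k+1} - u\|^2$, as well as the inequality $\langle g, z_{k+1} - u\rangle \ge \psi(z_{k+1}) - \psi(u) + \frac{\sigma}{2}\|z_{k+1} - u\|^2$ which comes from the strong convexity of $\psi(\cdot)$, we can write

$$\begin{aligned}
\alpha\langle\tilde{\nabla}_{k+1}, z_{k+1} - u\rangle + \alpha\psi(z_{k+1}) - \alpha\psi(u) &= -\langle z_{k+1} - z_k, z_{k+1} - u\rangle - \alpha\langle g, z_{k+1} - u\rangle + \alpha\psi(z_{k+1}) - \alpha\psi(u) \\
&\le -\frac{1}{2}\|z_k - z_{k+1}\|^2 + \frac{1}{2}\|z_k - u\|^2 - \frac{1 + \alpha\sigma}{2}\|z_{k+1} - u\|^2.
\end{aligned}$$

The following lemma combines Lemma 3.1, Lemma 3.2 and Lemma 3.3 all together, using the special choice of $x_{k+1}$ which is a convex combination of $y_k$, $z_k$ and $\tilde{x}$:

Lemma 3.4 (coupling step 1). If $x_{k+1} = \tau_1 z_k + \tau_2\tilde{x} + (1 - \tau_1 - \tau_2) y_k$, where $\tau_1 \le \frac{1}{3\alpha L}$ and $\tau_2 = \frac{1}{2}$, then it satisfies

$$\begin{aligned}
\alpha\langle\nabla f(x_{k+1}), z_k - u\rangle - \alpha\psi(u) &\le \frac{\alpha}{\tau_1}\Big(F(x_{k+1}) - \mathbb{E}[F(y_{k+1})] + \tau_2 F(\tilde{x}) - \tau_2 f(x_{k+1}) - \tau_2\langle\nabla f(x_{k+1}), \tilde{x} - x_{k+1}\rangle\Big) \\
&\quad + \frac{1}{2}\|z_k - u\|^2 - \frac{1 + \alpha\sigma}{2}\mathbb{E}\big[\|z_{k+1} - u\|^2\big] + \frac{\alpha(1 - \tau_1 - \tau_2)}{\tau_1}\psi(y_k) - \frac{\alpha}{\tau_1}\psi(x_{k+1}).
\end{aligned}$$

Proof. We first apply Lemma 3.3 and get

$$\begin{aligned}
\alpha\langle\tilde{\nabla}_{k+1}, z_k - u\rangle + \alpha\psi(z_{k+1}) - \alpha\psi(u) &= \alpha\langle\tilde{\nabla}_{k+1}, z_k - z_{k+1}\rangle + \alpha\langle\tilde{\nabla}_{k+1}, z_{k+1} - u\rangle + \alpha\psi(z_{k+1}) - \alpha\psi(u) \\
&\le \alpha\langle\tilde{\nabla}_{k+1}, z_k - z_{k+1}\rangle - \frac{1}{2}\|z_k - z_{k+1}\|^2 + \frac{1}{2}\|z_k - u\|^2 - \frac{1 + \alpha\sigma}{2}\|z_{k+1} - u\|^2. \qquad (3.1)
\end{aligned}$$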

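By defining $v \overset{\mathrm{def}}{=} \tau_1 z_{k+1} + \tau_2\tilde{x} + (1 - \tau_1 - \tau_2) y_k$, we have $x_{k+1} - v = \tau_1(z_k - z_{k+1})$ and therefore

$$\begin{aligned}
&\mathbb{E}\Big[\alpha\langle\tilde{\nabla}_{k+1}, z_k - z_{k+1}\rangle - \frac{1}{2}\|z_k - z_{k+1}\|^2\Big] = \mathbb{E}\Big[\frac{\alpha}{\tau_1}\langle\tilde{\nabla}_{k+1}, x_{k+1} - v\rangle - \frac{1}{2\tau_1^2}\|x_{k+1} - v\|^2\Big] \\
&\quad = \mathbb{E}\Big[\frac{\alpha}{\tau_1}\Big(\langle\tilde{\nabla}_{k+1}, x_{k+1} - v\rangle - \frac{1}{2\alpha\tau_1}\|x_{k+1} - v\|^2 - \psi(v) + \psi(x_{k+1})\Big) + \frac{\alpha}{\tau_1}\big(\psi(v) - \psi(x_{k+1})\big)\Big] \\
&\quad \overset{①}{\le} \mathbb{E}\Big[\frac{\alpha}{\tau_1}\Big(\langle\tilde{\nabla}_{k+1}, x_{k+1} - v\rangle - \frac{3L}{2}\|x_{k+1} - v\|^2 - \psi(v) + \psi(x_{k+1})\Big) + \frac{\alpha}{\tau_1}\big(\psi(v) - \psi(x_{k+1})\big)\Big] \\
&\quad \overset{②}{\le} \mathbb{E}\Big[\frac{\alpha}{\tau_1}\Big(F(x_{k+1}) - F(y_{k+1}) + \frac{1}{4L}\sigma_{k+1}^2\Big) + \frac{\alpha}{\tau_1}\big(\psi(v) - \psi(x_{k+1})\big)\Big] \\
&\quad \overset{③}{\le} \mathbb{E}\Big[\frac{\alpha}{\tau_1}\Big(F(x_{k+1}) - F(y_{k+1}) + \frac{1}{2}\big(f(\tilde{x}) - f(x_{k+1}) - \langle\nabla f(x_{k+1}), \tilde{x} - x_{k+1}\rangle\big)\Big) \\
&\qquad\qquad + \frac{\alpha}{\tau_1}\big(\tau_1\psi(z_{k+1}) + \tau_2\psi(\tilde{x}) + (1 - \tau_1 - \tau_2)\psi(y_k) - \psi(x_{k+1})\big)\Big]. \qquad (3.2)
\end{aligned}$$

Above, ① uses our choice $\tau_1 \le \frac{1}{3\alpha L}$, ② uses Lemma 3.1 (the bracketed quantity is at most $\mathrm{Prog}(x_{k+1})$), and ③ uses Lemma 3.2 together with the convexity of $\psi(\cdot)$ applied to the convex combination $v$. Finally, noticing that $\mathbb{E}[\langle\tilde{\nabla}_{k+1}, z_k - u\rangle] = \langle\nabla f(x_{k+1}), z_k - u\rangle$ and $\tau_2 = \frac{1}{2}$, we obtain the desired inequality by combining (3.1) and (3.2).

The next lemma simplifies the left hand side of Lemma 3.4 using the convexity of $f(\cdot)$, and gives an inequality that relates the objective-distance-to-minimizer quantities $F(y_k) - F(x^*)$, $F(y_{k+1}) - F(x^*)$, and $F(\tilde{x}) - F(x^*)$ to the point-distance-to-minimizer quantities $\|z_k - x^*\|^2$ and $\|z_{k+1} - x^*\|^2$.

Lemma 3.5 (coupling step 2). Under the same choices of $\tau_1, \tau_2$ as in Lemma 3.4, we have

$$\begin{aligned}
0 &\le \frac{\alpha(1 - \tau_1 - \tau_2)}{\tau_1}\big(F(y_k) - F(x^*)\big) - \frac{\alpha}{\tau_1}\big(\mathbb{E}[F(y_{k+1})] - F(x^*)\big) + \frac{\alpha\tau_2}{\tau_1}\big(F(\tilde{x}) - F(x^*)\big) \\
&\quad + \frac{1}{2}\|z_k - x^*\|^2 - \frac{1 + \alpha\sigma}{2}\mathbb{E}\big[\|z_{k+1} - x^*\|^2\big].
\end{aligned}$$

Proof. We first compute that

$$\begin{aligned}
\alpha\big(f(x_{k+1}) - f(u)\big) &\overset{①}{\le} \alpha\langle\nabla f(x_{k+1}), x_{k+1} - u\rangle \\
&= \alpha\langle\nabla f(x_{k+1}), x_{k+1} - z_k\rangle + \alpha\langle\nabla f(x_{k+1}), z_k - u\rangle \\
&\overset{②}{=} \frac{\alpha\tau_2}{\tau_1}\langle\nabla f(x_{k+1}), \tilde{x} - x_{k+1}\rangle + \frac{\alpha(1 - \tau_1 - \tau_2)}{\tau_1}\langle\nabla f(x_{k+1}), y_k - x_{k+1}\rangle + \alpha\langle\nabla f(x_{k+1}), z_k - u\rangle \\
&\overset{③}{\le} \frac{\alpha\tau_2}{\tau_1}\langle\nabla f(x_{k+1}), \tilde{x} - x_{k+1}\rangle + \frac{\alpha(1 - \tau_1 - \tau_2)}{\tau_1}\big(f(y_k) - f(x_{k+1})\big) + \alpha\langle\nabla f(x_{k+1}), z_k - u\rangle.
\end{aligned}$$

Above, ① uses the convexity of $f(\cdot)$, ② uses the choice that $x_{k+1} = \tau_1 z_k + \tau_2\tilde{x} + (1 - \tau_1 - \tau_2) y_k$, and ③ uses the convexity of $f(\cdot)$ again. By applying Lemma 3.4 to the above inequality, we have

$$\begin{aligned}
\alpha\big(f(x_{k+1}) - F(u)\big) &\le \frac{\alpha(1 - \tau_1 - \tau_2)}{\tau_1}\big(F(y_k) - f(x_{k+1})\big) + \frac{\alpha}{\tau_1}\Big(F(x_{k+1}) - \mathbb{E}[F(y_{k+1})] + \tau_2 F(\tilde{x}) - \tau_2 f(x_{k+1})\Big) \\
&\quad + \frac{1}{2}\|z_k - u\|^2 - \frac{1 + \alpha\sigma}{2}\mathbb{E}\big[\|z_{k+1} - u\|^2\big] - \frac{\alpha}{\tau_1}\psi(x_{k+1}),
\end{aligned}$$

which implies

$$\begin{aligned}
\alpha\big(F(x_{k+1}) - F(u)\big) &\le \frac{\alpha(1 - \tau_1 - \tau_2)}{\tau_1}\big(F(y_k) - F(x_{k+1})\big) + \frac{\alpha}{\tau_1}\Big(F(x_{k+1}) - \mathbb{E}[F(y_{k+1})] + \tau_2 F(\tilde{x}) - \tau_2 F(x_{k+1})\Big) \\
&\quad + \frac{1}{2}\|z_k - u\|^2 - \frac{1 + \alpha\sigma}{2}\mathbb{E}\big[\|z_{k+1} - u\|^2\big].
\end{aligned}$$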

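After rearranging and setting $u = x^*$, the above inequality yields

$$\begin{aligned}
0 &\le \frac{\alpha(1 - \tau_1 - \tau_2)}{\tau_1}\big(F(y_k) - F(x^*)\big) - \frac{\alpha}{\tau_1}\big(\mathbb{E}[F(y_{k+1})] - F(x^*)\big) + \frac{\alpha\tau_2}{\tau_1}\big(F(\tilde{x}) - F(x^*)\big) \\
&\quad + \frac{1}{2}\|z_k - x^*\|^2 - \frac{1 + \alpha\sigma}{2}\mathbb{E}\big[\|z_{k+1} - x^*\|^2\big].
\end{aligned}$$

4 Strongly Convex Case

In this section we telescope Lemma 3.5 from the previous section across all iterations $k$, and prove the following theorem:

Theorem 4.1. If each $f_i(x)$ is convex and $L$-smooth, and $\psi(x)$ is $\sigma$-strongly convex in (1.1), then Katyusha($x_0$, $S$, $\sigma$, $L$) satisfies

$$\mathbb{E}\big[F(\tilde{x}^S)\big] - F(x^*) \le \begin{cases} O\big((1 + \alpha\sigma)^{-Sm}\big) \cdot \big(F(x_0) - F(x^*)\big), & \text{if } \frac{m\sigma}{L} \le \frac{3}{4}; \\ O\big(1.5^{-S}\big) \cdot \big(F(x_0) - F(x^*)\big), & \text{if } \frac{m\sigma}{L} > \frac{3}{4}. \end{cases}$$

In other words, Katyusha achieves an $\varepsilon$-additive error (i.e., $\mathbb{E}[F(\tilde{x}^S)] - F(x^*) \le \varepsilon$) using at most $O\big((n + \sqrt{nL/\sigma}) \cdot \log\frac{F(x_0) - F(x^*)}{\varepsilon}\big)$ iterations (footnote 7).

Remark 4.2. Because $m = 2n$, each iteration of Katyusha computes only $O(1)$ stochastic gradients $\nabla f_i(\cdot)$ in the amortized sense. Therefore, the per-iteration time complexity of Katyusha is dominated by the computation time of $\nabla f_i(\cdot)$, the proximal update in Line 10 of Algorithm 1, plus an overhead $O(d)$. If $\nabla f_i(\cdot)$ has at most $d' \le d$ non-zero entries, this overhead $O(d)$ is improvable to $O(d')$ using a sparse implementation of Katyusha (footnote 8). As a result, for all the ERM problems defined in Section 1.2, the amortized per-iteration complexity of Katyusha is only $O(d')$ where $d'$ is the sparsity of the feature vectors, asymptotically the same as the per-iteration complexity of SGD.

Footnote 7. Like in all stochastic first-order methods, one can apply a Markov inequality to conclude that with probability at least 2/3, Katyusha satisfies $F(\tilde{x}^S) - F(x^*) \le O(\varepsilon)$ in the same stated asymptotic running time.
Footnote 8. This requires one to defer the updates such as $x_{k+1} \leftarrow \tau_1 z_k + \tau_2\tilde{x}^s + (1 - \tau_1 - \tau_2) y_k$ without naively implementing them. An experienced programmer can consult for instance [5] for a detailed treatment, but on a different problem.

Proof of Theorem 4.1. We define $D_k \overset{\mathrm{def}}{=} F(y_k) - F(x^*)$ and $\tilde{D}^s \overset{\mathrm{def}}{=} F(\tilde{x}^s) - F(x^*)$, and rewrite Lemma 3.5 as follows:

$$0 \le \frac{1 - \tau_1 - \tau_2}{\tau_1} D_k - \frac{1}{\tau_1}\mathbb{E}[D_{k+1}] + \frac{\tau_2}{\tau_1}\tilde{D}^s + \frac{1}{2\alpha}\|z_k - x^*\|^2 - \frac{1 + \alpha\sigma}{2\alpha}\mathbb{E}\big[\|z_{k+1} - x^*\|^2\big].$$

At this point, let us define $\theta = 1 + \alpha\sigma$ and multiply the above inequality by $\theta^j$ for each $k = sm + j$. Then, we sum up the resulting $m$ inequalities for all $j = 0, 1, \dots, m - 1$:

$$\begin{aligned}
0 &\le \mathbb{E}\Big[\frac{1 - \tau_1 - \tau_2}{\tau_1}\sum_{j=0}^{m-1} D_{sm+j}\theta^j - \frac{1}{\tau_1}\sum_{j=0}^{m-1} D_{sm+j+1}\theta^j\Big] + \frac{\tau_2}{\tau_1}\tilde{D}^s\sum_{j=0}^{m-1}\theta^j \\
&\quad + \frac{1}{2\alpha}\|z_{sm} - x^*\|^2 - \frac{\theta^m}{2\alpha}\mathbb{E}\big[\|z_{(s+1)m} - x^*\|^2\big].
\end{aligned}$$

Note that in the above inequality we have assumed all the randomness in the first $s - 1$ epochs is fixed and the only source of randomness comes from epoch $s$. We can rearrange the terms in the above inequality and get

$$\begin{aligned}
\mathbb{E}\Big[\frac{\tau_1 + \tau_2 - (1 - 1/\theta)}{\tau_1}\sum_{j=1}^{m} D_{sm+j}\theta^j\Big] &\le \frac{1 - \tau_1 - \tau_2}{\tau_1}\Big(D_{sm} - \theta^m\mathbb{E}\big[D_{(s+1)m}\big]\Big) + \frac{\tau_2}{\tau_1}\tilde{D}^s\sum_{j=0}^{m-1}\theta^j \\
&\quad + \frac{1}{2\alpha}\|z_{sm} - x^*\|^2 - \frac{\theta^m}{2\alpha}\mathbb{E}\big[\|z_{(s+1)m} - x^*\|^2\big].
\end{aligned}$$

Using the special choice that $\tilde{x}^{s+1} = \big(\sum_{j=0}^{m-1}\theta^j\big)^{-1} \cdot \sum_{j=0}^{m-1} y_{sm+j+1}\theta^j$ and the convexity of $F(\cdot)$, we derive that $\tilde{D}^{s+1} \le \big(\sum_{j=0}^{m-1}\theta^j\big)^{-1} \cdot \sum_{j=0}^{m-1} D_{sm+j+1}\theta^j$. Substituting this into the above inequality, we get

$$\begin{aligned}
\frac{\tau_1 + \tau_2 - (1 - 1/\theta)}{\tau_1}\,\theta\,\mathbb{E}\big[\tilde{D}^{s+1}\big]\sum_{j=0}^{m-1}\theta^j &\le \frac{1 - \tau_1 - \tau_2}{\tau_1}\Big(D_{sm} - \theta^m\mathbb{E}\big[D_{(s+1)m}\big]\Big) + \frac{\tau_2}{\tau_1}\tilde{D}^s\sum_{j=0}^{m-1}\theta^j \\
&\quad + \frac{1}{2\alpha}\|z_{sm} - x^*\|^2 - \frac{\theta^m}{2\alpha}\mathbb{E}\big[\|z_{(s+1)m} - x^*\|^2\big]. \qquad (4.1)
\end{aligned}$$

We consider two cases next.

Case 1. Suppose that $\frac{m\sigma}{L} \le \frac{3}{4}$. In this case, we choose $\alpha = \frac{1}{\sqrt{3m\sigma L}}$ and $\tau_1 = \frac{1}{3\alpha L} = m\alpha\sigma = \sqrt{\frac{m\sigma}{3L}} \in [0, \frac{1}{2}]$ as in Katyusha. Our parameter choices imply $\alpha\sigma \le \frac{1}{2m}$ and therefore the following inequality holds:

$$\tau_2(\theta^{m-1} - 1) + (1 - 1/\theta) = \frac{1}{2}\big((1 + \alpha\sigma)^{m-1} - 1\big) + \Big(1 - \frac{1}{1 + \alpha\sigma}\Big) \le (m - 1)\alpha\sigma + \alpha\sigma = m\alpha\sigma = \tau_1.$$

In other words, we have $\tau_1 + \tau_2 - (1 - 1/\theta) \ge \tau_2\theta^{m-1}$ and thus (4.1) implies that

$$\begin{aligned}
&\mathbb{E}\Big[\frac{\tau_2}{\tau_1}\tilde{D}^{s+1}\sum_{j=0}^{m-1}\theta^j + \frac{1 - \tau_1 - \tau_2}{\tau_1}D_{(s+1)m} + \frac{1}{2\alpha}\|z_{(s+1)m} - x^*\|^2\Big] \\
&\qquad \le \theta^{-m} \cdot \Big(\frac{\tau_2}{\tau_1}\tilde{D}^s\sum_{j=0}^{m-1}\theta^j + \frac{1 - \tau_1 - \tau_2}{\tau_1}D_{sm} + \frac{1}{2\alpha}\|z_{sm} - x^*\|^2\Big).
\end{aligned}$$

If we telescope the above inequality over all epochs $s = 0, 1, \dots, S - 1$, we obtain

$$\begin{aligned}
\mathbb{E}\big[F(\tilde{x}^S) - F(x^*)\big] = \mathbb{E}\big[\tilde{D}^S\big] &\overset{①}{\le} \theta^{-Sm} \cdot O\Big(\tilde{D}^0 + D_0 + \frac{\tau_1}{\alpha m}\|x_0 - x^*\|^2\Big) \\
&\overset{②}{\le} \theta^{-Sm} \cdot O\Big(1 + \frac{\tau_1}{\alpha m\sigma}\Big) \cdot \big(F(x_0) - F(x^*)\big) \\
&\overset{③}{=} O\big((1 + \alpha\sigma)^{-Sm}\big) \cdot \big(F(x_0) - F(x^*)\big). \qquad (4.2)
\end{aligned}$$

Above, ① uses the fact that $\sum_{j=0}^{m-1}\theta^j \ge m$ and $\tau_2 = \frac{1}{2}$; ② uses the strong convexity of $F(\cdot)$ which implies $F(x_0) - F(x^*) \ge \frac{\sigma}{2}\|x_0 - x^*\|^2$; and ③ uses our choice of $\tau_1$.

Case 2. Suppose that $\frac{m\sigma}{L} > \frac{3}{4}$. In this case, we choose $\tau_1 = \frac{1}{2}$ and $\alpha = \frac{1}{3\tau_1 L} = \frac{2}{3L}$ as in Katyusha. Our parameter choices help us simplify (4.1) as

$$2\,\mathbb{E}\big[\tilde{D}^{s+1}\big]\sum_{j=0}^{m-1}\theta^j \le \tilde{D}^s\sum_{j=0}^{m-1}\theta^j + \frac{1}{2\alpha}\|z_{sm} - x^*\|^2 - \frac{\theta^m}{2\alpha}\mathbb{E}\big[\|z_{(s+1)m} - x^*\|^2\big].$$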

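Since $\theta^m = (1 + \alpha\sigma)^m \ge 1 + \alpha\sigma m = 1 + \frac{2\sigma m}{3L} \ge \frac{3}{2}$, the above inequality implies

$$\frac{3}{2}\mathbb{E}\big[\tilde{D}^{s+1}\big]\sum_{j=0}^{m-1}\theta^j + \frac{9L}{8}\mathbb{E}\big[\|z_{(s+1)m} - x^*\|^2\big] \le \tilde{D}^s\sum_{j=0}^{m-1}\theta^j + \frac{3L}{4}\|z_{sm} - x^*\|^2.$$

If we telescope this inequality over all the epochs $s = 0, 1, \dots, S - 1$, we immediately have

$$\mathbb{E}\Big[\tilde{D}^S\sum_{j=0}^{m-1}\theta^j + \frac{3L}{4}\|z_{Sm} - x^*\|^2\Big] \le \Big(\frac{3}{2}\Big)^{-S}\Big(\tilde{D}^0\sum_{j=0}^{m-1}\theta^j + \frac{3L}{4}\|z_0 - x^*\|^2\Big).$$

Finally, since $\sum_{j=0}^{m-1}\theta^j \ge m$ and $\frac{\sigma}{2}\|z_0 - x^*\|^2 \le F(x_0) - F(x^*)$ owing to the strong convexity of $F(\cdot)$, we conclude that

$$\mathbb{E}\big[F(\tilde{x}^S) - F(x^*)\big] \le O\big(1.5^{-S}\big) \cdot \big(F(x_0) - F(x^*)\big). \qquad (4.3)$$

Combining (4.2) and (4.3) we finish the proof of Theorem 4.1.

5 Corollaries Via Reduction

It is immediately clear from Theorem 4.1 that Katyusha satisfies the HOOD property:

Corollary 5.1. Katyusha satisfies the HOOD property with $\mathrm{Time}(L, \sigma) = O\big(n + \frac{\sqrt{nL}}{\sqrt{\sigma}}\big)$ iterations.

Remark 5.2. Notice that existing accelerated stochastic methods (even only for solving the simpler problem (1.2)) either do not satisfy the HOOD property or satisfy HOOD with an additional factor $\log(L/\sigma)$ in the number of iterations. This is why they cannot be combined with the reductions in [1] to get the optimal convergence rates.

Based on the HOOD property, we can apply Theorems 2.3, 2.4 and 2.5 in Section 2 to deduce the following.

Corollary 5.3. If each $f_i(x)$ is convex and $L$-smooth, and $\psi(\cdot)$ is not necessarily strongly convex in (1.1), then by applying AdaptReg on Katyusha with a starting vector $x_0$, we obtain an output $x$ satisfying $\mathbb{E}[F(x)] - F(x^*) \le \varepsilon$ in at most

$$O\Big(n\log\frac{F(x_0) - F(x^*)}{\varepsilon} + \frac{\sqrt{nL}\,\|x_0 - x^*\|}{\sqrt{\varepsilon}}\Big) \text{ iterations}.$$

In contrast, the best known convergence rate was only $\frac{\log^2(1/\varepsilon)}{\sqrt{\varepsilon}}$ due to Catalyst, or $\frac{\log(1/\varepsilon)}{\sqrt{\varepsilon}}$ by applying the new AdaptReg reduction on Catalyst.

Corollary 5.4. If each $f_i(x)$ is $G$-Lipschitz continuous and $\psi(x)$ is $\sigma$-strongly convex in (1.2), then by applying AdaptSmooth on Katyusha with a starting vector $x_0$, we obtain an output $x$ satisfying $\mathbb{E}[F(x)] - F(x^*) \le \varepsilon$ in at most

$$O\Big(n\log\frac{F(x_0) - F(x^*)}{\varepsilon} + \frac{\sqrt{n}\,G}{\sqrt{\sigma\varepsilon}}\Big) \text{ iterations}.$$

In contrast, the best known convergence rate was only $\frac{\log(1/\varepsilon)}{\sqrt{\varepsilon}}$ due to APCG and SPDC.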

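Algorithm 2 Katyusha$^{\mathrm{ns}}(x_0, S, \sigma, L)$
1: $m\leftarrow 2n$; (the time window for re-computing the snapshot $\tilde{x}$)
2: $\tau_2\leftarrow\frac{1}{2}$;
3: $y_0=z_0=\tilde{x}^{0}\leftarrow x_0$; (initial vectors)
4: for $s\leftarrow 0$ to $S-1$ do
5: $\tau_{1,s}\leftarrow\frac{2}{s+4}$, $\alpha_s\leftarrow\frac{1}{3\tau_{1,s}L}$; (different parameter choices compared to Katyusha)
6: $\mu^{s}\leftarrow\nabla f(\tilde{x}^{s})$; (compute the full gradient only once every $m$ iterations)
7: for $j\leftarrow 0$ to $m-1$ do
8: $k\leftarrow sm+j$;
9: $x_{k+1}\leftarrow\tau_{1,s}z_k+\tau_2\tilde{x}^{s}+(1-\tau_{1,s}-\tau_2)y_k$;
10: $\tilde{\nabla}_{k+1}\leftarrow\mu^{s}+\nabla f_i(x_{k+1})-\nabla f_i(\tilde{x}^{s})$ where $i$ is randomly chosen from $\{1,2,\dots,n\}$;
11: $z_{k+1}=\arg\min_z\big\{\frac{1}{2\alpha_s}\|z-z_k\|^2+\langle\tilde{\nabla}_{k+1},z\rangle+\psi(z)\big\}$;
12: Option I: $y_{k+1}\leftarrow\arg\min_y\big\{\frac{3L}{2}\|y-x_{k+1}\|^2+\langle\tilde{\nabla}_{k+1},y\rangle+\psi(y)\big\}$;
13: Option II: $y_{k+1}\leftarrow x_{k+1}+\tau_{1,s}(z_{k+1}-z_k)$ (we analyze only Option I in this paper, but Option II also works)
14: end for
15: $\tilde{x}^{s+1}\leftarrow\frac{1}{m}\sum_{j=1}^{m}y_{sm+j}$;
16: end for
17: return $\tilde{x}^{S}$.

Corollary 5.5. If each $f_i(x)$ is $G$-Lipschitz continuous and $\psi(x)$ is not necessarily strongly convex in (1.2), then by applying JointAdaptRegSmooth on Katyusha with a starting vector $x_0$, we obtain an output $x$ satisfying $\mathbb{E}[F(x)]-F(x^*)\le\varepsilon$ in at most

$$O\Big(n\log\frac{F(x_0)-F(x^*)}{\varepsilon}+\frac{\sqrt{n}\,G\,\|x_0-x^*\|}{\varepsilon}\Big)\ \text{iterations.}$$

In contrast, the best known convergence rate was only $\frac{\log(1/\varepsilon)}{\varepsilon}$ due to APCG and SPDC.

6 Non-Strongly Convex Case

In this section we consider a variant of Katyusha that works directly on non-strongly convex objectives $F$ for problem (1.1). We call this algorithm Katyusha$^{\mathrm{ns}}$; see Algorithm 2. As in the strongly convex case, we set $\tau_2=\frac{1}{2}$ throughout the algorithm, but choose $\tau_1=\tau_{1,s}$ to be a parameter that depends on the epoch index $s$, and accordingly $\alpha_s=\frac{1}{3L\tau_{1,s}}$. These parameter choices satisfy the presumptions in Lemma 3.4. We prove the following theorem in this section:

Theorem 6.1. If each $f_i(x)$ is convex, $L$-smooth in (1.1) and $\psi$ is not necessarily strongly convex, then Katyusha$^{\mathrm{ns}}(x_0,S,L)$ satisfies

$$\mathbb{E}\big[F(\tilde{x}^{S})\big]-F(x^*)\le O\Big(\frac{F(x_0)-F(x^*)}{S^2}+\frac{L\|x_0-x^*\|^2}{nS^2}\Big).$$

In other words, Katyusha$^{\mathrm{ns}}$ achieves an $\varepsilon$-additive error (i.e., $\mathbb{E}[F(\tilde{x}^{S})]-F(x^*)\le\varepsilon$) using at most $O\Big(\frac{n\sqrt{F(x_0)-F(x^*)}}{\sqrt{\varepsilon}}+\frac{\sqrt{nL}\,\|x_0-x^*\|}{\sqrt{\varepsilon}}\Big)$ stochastic gradients and the same number of iterations.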
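To make the update rules of Algorithm 2 concrete, here is a minimal sketch in plain Python under several stated assumptions that are not part of the paper: the objective is the least-squares instance $f_i(x)=\frac{1}{2}(\langle a_i,x\rangle-b_i)^2$ with $\psi\equiv 0$, so both proximal steps reduce to plain gradient-style steps (Option I); $L$ is taken as $\max_i\|a_i\|^2$; the epoch length is $m=2n$; and the snapshot is the average of the $y$ iterates, as the convexity step in the analysis requires. The helper names `dot`, `objective`, `full_gradient` and `katyusha_ns` are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(A, b, x):
    """F(x) = (1/2n) * sum_i (<a_i, x> - b_i)^2, with psi = 0."""
    return sum((dot(a, x) - bi) ** 2 for a, bi in zip(A, b)) / (2 * len(A))

def full_gradient(A, b, x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    n, d = len(A), len(x)
    g = [0.0] * d
    for a, bi in zip(A, b):
        r = dot(a, x) - bi
        for t in range(d):
            g[t] += r * a[t] / n
    return g

def katyusha_ns(A, b, S, seed=0):
    """Sketch of Katyusha^ns (Option I) for least squares with psi = 0."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    L = max(dot(a, a) for a in A)   # each f_i is ||a_i||^2-smooth
    m = 2 * n                        # assumed epoch length
    tau2 = 0.5
    x_tilde = [0.0] * d              # snapshot, started at x_0 = 0
    y = list(x_tilde)
    z = list(x_tilde)
    for s in range(S):
        tau1 = 2.0 / (s + 4)
        alpha = 1.0 / (3.0 * tau1 * L)
        mu = full_gradient(A, b, x_tilde)   # one full gradient per epoch
        y_sum = [0.0] * d
        for _ in range(m):
            x = [tau1 * z[t] + tau2 * x_tilde[t] + (1 - tau1 - tau2) * y[t]
                 for t in range(d)]
            a = A[rng.randrange(n)]
            # grad f_i(x) - grad f_i(x_tilde) = <a_i, x - x_tilde> * a_i
            coef = dot(a, x) - dot(a, x_tilde)
            g = [mu[t] + coef * a[t] for t in range(d)]
            z = [z[t] - alpha * g[t] for t in range(d)]       # prox step, psi = 0
            y = [x[t] - g[t] / (3.0 * L) for t in range(d)]   # Option I, psi = 0
            y_sum = [y_sum[t] + y[t] for t in range(d)]
        x_tilde = [v / m for v in y_sum]    # average of y iterates
    return x_tilde
```

With a non-trivial $\psi$ (for example an $\ell_1$ penalty), the two lines updating `z` and `y` would be replaced by the corresponding proximal operators; the rest of the loop is unchanged. The term $\tau_2\tilde{x}^{s}$ in the definition of `x` is the negative momentum that pulls each iterate back toward the snapshot.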

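Remark 6.2. Katyusha$^{\mathrm{ns}}$ is a direct, accelerated solver for the non-strongly convex case of problem (1.1). One should compare it with the convergence result of a direct, non-accelerated solver for the same setting, which can usually be written as follows when translated to our notation (see for instance SAGA [10]):

$$\mathbb{E}\big[F(x)\big]-F(x^*)\le O\Big(\frac{F(x_0)-F(x^*)}{S}+\frac{L\|x_0-x^*\|^2}{nS}\Big).$$

It is clear at this point that Katyusha$^{\mathrm{ns}}$ is at least a factor $S$ faster than non-accelerated methods such as SAGA. The convergence of the non-accelerated method can also be written in terms of the number of iterations, which is $O\Big(\frac{n(F(x_0)-F(x^*))+L\|x_0-x^*\|^2}{\varepsilon}\Big)$.

Remark 6.3. Our stated convergence result in Theorem 6.1 is slightly worse than the desired complexity $O\Big(n\log\frac{F(x_0)-F(x^*)}{\varepsilon}+\frac{\sqrt{nL}\,\|x_0-x^*\|}{\sqrt{\varepsilon}}\Big)$ obtained from using the optimal reduction; see Corollary 5.3. This can be fixed by making some non-trivial changes to the epoch lengths, and is omitted in the current version of this paper.⁹

⁹ More precisely, recall that a similar issue has also happened in the non-accelerated world: the iteration complexity $O\big(\frac{n+L}{\varepsilon}\big)$ in SAGA can be improved to $O\big(n\log\frac{1}{\varepsilon}+\frac{L}{\varepsilon}\big)$ using varying epoch length [8]. The technique in [8] also applies to this paper.

Proof of Theorem 6.1. Again by defining $D_k\overset{\mathrm{def}}{=}F(y_k)-F(x^*)$ and $\tilde{D}^{s}\overset{\mathrm{def}}{=}F(\tilde{x}^{s})-F(x^*)$, we can rewrite Lemma 3.5 as follows:

$$0\le\frac{\alpha_s(1-\tau_{1,s}-\tau_2)}{\tau_{1,s}}D_k-\frac{\alpha_s}{\tau_{1,s}}\,\mathbb{E}\big[D_{k+1}\big]+\frac{\alpha_s\tau_2}{\tau_{1,s}}\tilde{D}^{s}+\frac{1}{2}\|z_k-x^*\|^2-\frac{1}{2}\,\mathbb{E}\big[\|z_{k+1}-x^*\|^2\big].$$

Summing up the above inequality for all the iterations $k=sm,sm+1,\dots,sm+m-1$, we have

$$\mathbb{E}\Big[\frac{\alpha_s(1-\tau_{1,s}-\tau_2)}{\tau_{1,s}}D_{(s+1)m}+\frac{\alpha_s(\tau_{1,s}+\tau_2)}{\tau_{1,s}}\sum_{j=1}^{m}D_{sm+j}\Big]\le\frac{\alpha_s(1-\tau_{1,s}-\tau_2)}{\tau_{1,s}}D_{sm}+\frac{\alpha_s\tau_2}{\tau_{1,s}}\,m\,\tilde{D}^{s}+\frac{1}{2}\|z_{sm}-x^*\|^2-\frac{1}{2}\,\mathbb{E}\big[\|z_{(s+1)m}-x^*\|^2\big]. \qquad (6.1)$$

Note that in the above inequality we have assumed that all the randomness in the first $s-1$ epochs is fixed and the only source of randomness comes from epoch $s$.

If we define $\tilde{x}^{s}=\frac{1}{m}\sum_{j=1}^{m}y_{(s-1)m+j}$, then by the convexity of the function $F$ we have $m\,\tilde{D}^{s}\le\sum_{j=1}^{m}D_{(s-1)m+j}$. Therefore, using $\alpha_s=\frac{1}{3\tau_{1,s}L}$, for every $s\ge 1$ we can derive from (6.1) that

$$\mathbb{E}\Big[\frac{1}{\tau_{1,s}^2}D_{(s+1)m}+\frac{\tau_{1,s}+\tau_2}{\tau_{1,s}^2}\sum_{j=1}^{m-1}D_{sm+j}\Big]\le\frac{1-\tau_{1,s}}{\tau_{1,s}^2}D_{sm}+\frac{\tau_2}{\tau_{1,s}^2}\sum_{j=1}^{m-1}D_{(s-1)m+j}+\frac{3L}{2}\|z_{sm}-x^*\|^2-\frac{3L}{2}\,\mathbb{E}\big[\|z_{(s+1)m}-x^*\|^2\big]. \qquad (6.2)$$

For the base case $s=0$, we can also rewrite (6.1) as

$$\mathbb{E}\Big[\frac{1}{\tau_{1,0}^2}D_{m}+\frac{\tau_{1,0}+\tau_2}{\tau_{1,0}^2}\sum_{j=1}^{m-1}D_{j}\Big]\le\frac{1-\tau_{1,0}-\tau_2}{\tau_{1,0}^2}D_{0}+\frac{\tau_2\,m}{\tau_{1,0}^2}\tilde{D}^{0}+\frac{3L}{2}\|z_{0}-x^*\|^2-\frac{3L}{2}\,\mathbb{E}\big[\|z_{m}-x^*\|^2\big]. \qquad (6.3)$$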

At this point, if we choose $\tau_{1,s}=\frac{2}{s+4}\le\frac{1}{2}$, it satisfies

$$\frac{1}{\tau_{1,s}^2}\ge\frac{1-\tau_{1,s+1}}{\tau_{1,s+1}^2}\quad\text{and}\quad\frac{\tau_{1,s}+\tau_2}{\tau_{1,s}^2}\ge\frac{\tau_2}{\tau_{1,s+1}^2}.$$

Using these two inequalities, we can telescope (6.3) and (6.2) for all $s=0,1,\dots,S-1$. We obtain in the end that

$$\mathbb{E}\Big[\frac{1}{\tau_{1,S-1}^2}D_{Sm}+\frac{\tau_{1,S-1}+\tau_2}{\tau_{1,S-1}^2}\sum_{j=1}^{m-1}D_{(S-1)m+j}+\frac{3L}{2}\|z_{Sm}-x^*\|^2\Big]\le\frac{1-\tau_{1,0}-\tau_2}{\tau_{1,0}^2}D_{0}+\frac{\tau_2\,m}{\tau_{1,0}^2}\tilde{D}^{0}+\frac{3L}{2}\|z_{0}-x^*\|^2. \qquad (6.4)$$

Since we have $\tilde{D}^{S}\le\frac{1}{m}\sum_{j=1}^{m}D_{(S-1)m+j}$, which is no greater than $\frac{2\tau_{1,S-1}^2}{m}$ times the left hand side of (6.4), we conclude that

$$\mathbb{E}\big[F(\tilde{x}^{S})-F(x^*)\big]=\mathbb{E}\big[\tilde{D}^{S}\big]\le O\Big(\frac{\tau_{1,S}^2}{m}\Big)\cdot\Big(\frac{1-\tau_{1,0}-\tau_2}{\tau_{1,0}^2}D_{0}+\frac{\tau_2\,m}{\tau_{1,0}^2}\tilde{D}^{0}+\frac{3L}{2}\|z_{0}-x^*\|^2\Big)=O\Big(\frac{1}{mS^2}\Big)\cdot\Big(m\big(F(x_0)-F(x^*)\big)+L\|x_0-x^*\|^2\Big). \qquad\square$$

References

[1] Zeyuan Allen-Zhu and Elad Hazan. Optimal Black-Box Reductions Between Optimization Objectives. ArXiv e-prints, March 2016.
[2] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. ArXiv e-prints, March 2016.
[3] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the 27th ACM-SIAM Symposium on Discrete Algorithms, SODA '16, 2016.
[4] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ArXiv e-prints, July 2014.
[5] Zeyuan Allen-Zhu and Lorenzo Orecchia. Nearly-Linear Time Positive LP Solver with Faster Convergence Rate. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, STOC '15, 2015.
[6] Zeyuan Allen-Zhu and Lorenzo Orecchia. Using optimization to break the epsilon barrier: A faster and simpler width-independent algorithm for solving positive linear programs in parallel. In Proceedings of the 26th ACM-SIAM Symposium on Discrete Algorithms, SODA '15, 2015.
[7] Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. ArXiv e-prints, December 2015.
[8] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. ArXiv e-prints, June 2015.
[9] Léon Bottou. Stochastic gradient descent.

[10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems, NIPS 2014, 2014.
[11] Aaron J. Defazio, Tibério S. Caetano, and Justin Domke. Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 2014.
[12] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4), 2015. First appeared on ArXiv in 2013.
[13] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, volume 37, pages 1-28, 2015.
[14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489-2512, January 2014.
[15] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, NIPS 2013, 2013.
[16] Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. ArXiv e-prints, abs/1212.2002, 2012.
[17] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. ArXiv e-prints, October 2015.
[18] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on. IEEE, 2013.
[19] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In NIPS, 2015.
[20] Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. In Advances in Neural Information Processing Systems, NIPS 2014, 2014.
[21] Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, pages 1-28, 2013.
[22] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, 2013.
[23] Julien Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2):829-855, April 2015. Preliminary version appeared in ICML.

[24] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269.
[25] Yurii Nesterov. Introductory Lectures on Convex Programming Volume: A Basic course, volume I. Kluwer Academic Publishers, 2004.
[26] Yurii Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2):341-362, January 2012.
[27] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML '12, 2012.
[28] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, pages 1-45, 2013. Preliminary version appeared in NIPS 2012.
[29] Shai Shalev-Shwartz and Tong Zhang. Proximal Stochastic Dual Coordinate Ascent. arXiv preprint, pages 1-18, 2012.
[30] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14, 2013.
[31] Shai Shalev-Shwartz and Tong Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages 64-72, 2014.
[32] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4), 2014.
[33] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.
[34] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning, ICML 2004, 2004.
[35] Yuchen Zhang and Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015.

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods Journal of Machine Learning Research 8 8-5 Submitted 8/6; Revised 5/7; Published 6/8 : The First Direct Acceleration of Stochastic Gradient Methods Zeyuan Allen-Zhu Microsoft Research AI Redmond, WA 985,

More information

arxiv: v5 [math.oc] 2 May 2017

arxiv: v5 [math.oc] 2 May 2017 : The First Direct Acceleration of Stochastic Gradient Methods version 5 arxiv:3.5953v5 [math.oc] May 7 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March 8,

More information

arxiv: v6 [math.oc] 24 Sep 2018

arxiv: v6 [math.oc] 24 Sep 2018 : The First Direct Acceleration of Stochastic Gradient Methods version 6 arxiv:63.5953v6 [math.oc] 4 Sep 8 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information

A Universal Catalyst for Gradient-Based Optimization

A Universal Catalyst for Gradient-Based Optimization A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Adaptive restart of accelerated gradient methods under local quadratic growth condition

Adaptive restart of accelerated gradient methods under local quadratic growth condition Adaptive restart of accelerated gradient methods under local quadratic growth condition Olivier Fercoq Zheng Qu September 6, 08 arxiv:709.0300v [math.oc] 7 Sep 07 Abstract By analyzing accelerated proximal

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Variance-Reduced and Projection-Free Stochastic Optimization

Variance-Reduced and Projection-Free Stochastic Optimization Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

arxiv: v2 [cs.lg] 14 Sep 2017

arxiv: v2 [cs.lg] 14 Sep 2017 Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU arxiv:160.0101v [cs.lg] 14 Sep 017

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Importance Sampling for Minibatches

Importance Sampling for Minibatches Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Coordinate Descent Faceoff: Primal or Dual?

Coordinate Descent Faceoff: Primal or Dual? JMLR: Workshop and Conference Proceedings 83:1 22, 2018 Algorithmic Learning Theory 2018 Coordinate Descent Faceoff: Primal or Dual? Dominik Csiba peter.richtarik@ed.ac.uk School of Mathematics University

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms
