Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization


Tomoya Murata (NTT DATA Mathematical Systems Inc., Tokyo, Japan)    Taiji Suzuki (Graduate School of Information Science and Technology, The University of Tokyo)
Correspondence to: Tomoya Murata <murata@msi.co.jp>, Taiji Suzuki <taiji@mist.i.u-tokyo.ac.jp>

Abstract

In this paper, we develop a new accelerated stochastic gradient method for efficiently solving the convex regularized empirical risk minimization problem in mini-batch settings. The use of mini-batches has become a de facto standard in the machine learning community, because mini-batch settings stabilize the gradient estimate and can easily exploit parallel computing. The core of our proposed method is the incorporation of our new double acceleration technique and a variance reduction technique. We theoretically analyze our proposed method and show that it substantially improves the mini-batch efficiency of previous accelerated stochastic methods, and that it essentially needs only mini-batches of size O(√n) to achieve the optimal iteration complexities for both non-strongly and strongly convex objectives, where n is the training set size. Further, we show that even in non-mini-batch settings, our method achieves the best known convergence rates for both non-strongly and strongly convex objectives.

1 Introduction

We consider a composite convex optimization problem associated with regularized empirical risk minimization, which often arises in machine learning. In particular, our goal is to minimize the sum of finitely many smooth convex functions plus a relatively simple, possibly non-differentiable, convex function by using first-order methods in mini-batch settings. The use of mini-batches has become a de facto standard in the machine learning community, because it is generally more efficient to execute matrix-vector multiplications over a mini-batch than an equivalent number of vector-vector ones, each over a single instance; and, more importantly, mini-batch settings can easily exploit parallel computing.

Traditional and effective methods for solving the above problem are the proximal gradient (PG) method and the accelerated proximal gradient (APG) method (Nesterov et al., 2007; Beck & Teboulle, 2009; Tseng, 2008). These methods are well known to achieve linear convergence for strongly convex objectives. In particular, APG achieves the optimal iteration complexities for both non-strongly and strongly convex objectives. However, these methods need a per-iteration cost of O(nd), where n denotes the number of components of the finite sum and d is the dimension of the solution space. In typical machine learning tasks, n and d correspond to the number of instances and features, respectively, which can be very large; the per-iteration cost of these methods can then be considerably high.

A popular alternative is the stochastic gradient descent (SGD) method (Singer & Duchi, 2009; Hazan et al., 2007; Shalev-Shwartz & Singer, 2007). As the per-iteration cost of SGD is only O(d) in non-mini-batch settings, SGD is suitable for many machine learning tasks. However, SGD only achieves sublinear rates and is ultimately slower than PG and APG. Recently, a number of stochastic gradient methods have been proposed that use a variance reduction technique exploiting the finite-sum structure of the problem: the stochastic averaged gradient (SAG) method (Roux et al., 2012; Schmidt et al., 2013), the stochastic variance reduced gradient
(SVRG) method (Johnson & Zhang, 2013; Xiao & Zhang, 2014), and SAGA (Defazio et al., 2014). Even though the per-iteration costs of these methods are the same as that of SGD, they achieve linear convergence for strongly convex objectives. Consequently, these methods dramatically improve on the total computational cost of PG. However, in size-b mini-batch settings, the rate is essentially b times worse than in non-mini-batch settings. This means that there is little benefit in applying the mini-batch scheme to these methods.
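As a point of reference for the methods compared below, here is a minimal sketch of one mini-batch proximal SGD step for the composite problem min_x F(x) + R(x) with F(x) = (1/n) Σ_i f_i(x); the function names and the prox interface are illustrative assumptions, not part of the paper.

```python
import numpy as np

def prox_sgd_step(x, grad_fi, prox_R, batch, eta):
    """One mini-batch proximal SGD step: x+ = prox_{eta R}(x - eta * g),
    where g averages the component gradients over the sampled mini-batch."""
    g = np.mean([grad_fi(i, x) for i in batch], axis=0)  # O(b d) cost per iteration
    return prox_R(x - eta * g, eta)
```

Here g is an unbiased estimator of the full gradient, but its variance does not vanish, which is why plain SGD is limited to sublinear rates; the variance reduction methods above replace g with a corrected estimator.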

More recently, several authors have proposed accelerated stochastic methods for the composite finite-sum problem: the accelerated stochastic dual coordinate ascent (ASDCA) method (Shalev-Shwartz & Zhang, 2013), Universal Catalyst (UC) (Lin et al., 2015), the accelerated proximal coordinate gradient (APCG) method (Lin et al., 2014a), the stochastic primal-dual coordinate (SPDC) method (Zhang & Xiao, 2015), and Katyusha (Allen-Zhu, 2016). ASDCA (UC), APCG, SPDC and Katyusha essentially achieve the optimal total computational cost for strongly convex objectives in non-mini-batch settings.¹ ² However, in size-b mini-batch settings, the rate is essentially b times worse than in non-mini-batch settings, and these methods need size O(n) mini-batches to achieve the optimal iteration complexity, which is essentially the same as APG. In addition, Nitanda (2014; 2015) has proposed the accelerated mini-batch proximal stochastic variance reduced gradient (AccProxSVRG) method and its variant, the accelerated efficient mini-batch stochastic variance reduced gradient (AMSVRG) method. In non-mini-batch settings, AccProxSVRG only achieves the same rate as SVRG. However, in mini-batch settings, AccProxSVRG significantly improves the mini-batch efficiency of non-accelerated variance reduction methods, and, surprisingly, AccProxSVRG essentially only needs size O(√κ) mini-batches to achieve the optimal iteration complexity for strongly convex objectives, where κ is the condition number of the problem. However, the necessary mini-batch size depends on the condition number: it gradually increases as the condition number increases and ultimately matches O(n) for a large condition number.

Main contribution. We propose a new accelerated stochastic variance reduction method that achieves better convergence than existing methods and that particularly utilizes mini-batch settings well; it is called the doubly accelerated stochastic variance reduced dual averaging (DASVRDA) method. Our method significantly improves the mini-batch efficiencies of state-of-the-art methods, and it essentially only needs size O(√n) mini-batches³ for achieving the optimal iteration complexities⁴ for both non-strongly and strongly convex objectives. We list the comparisons of our method with several preceding methods in Table 1.

¹ More precisely, the rate of ASDCA (UC) carries extra log factors and is near but worse than that of APCG, SPDC and Katyusha; this means that ASDCA (UC) cannot be optimal.
² Katyusha also achieves a near-optimal total computational cost for non-strongly convex objectives.
³ Actually, when L/ε ≤ n log(1/ε) and L/µ ≤ n, our method needs size Õ(n√(ε/L)) and O(n√(µ/L)) mini-batches, respectively, which are larger than O(√n) but smaller than O(n). Achieving the optimal iteration complexity for high-accuracy and badly conditioned problems is much more important than for low-accuracy and well-conditioned ones, because the former needs more overall computational cost than the latter.

2 Preliminary

In this section, we provide several notations and definitions used in this paper, and then state the assumptions for our analysis. We use ‖·‖ to denote the Euclidean L2 norm, ‖x‖ = √(Σ_i x_i²). For a natural number n, [n] denotes the set {1, ..., n}.

Definition 1. We say that a function f: R^d → R is L-smooth (L > 0) if f is differentiable and satisfies

  ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖  for all x, y ∈ R^d.  (1)

If f is convex, (1) is equivalent to the following (see Nesterov, 2013):

  f(x) + ⟨∇f(x), y − x⟩ + (1/(2L))‖∇f(x) − ∇f(y)‖² ≤ f(y)  for all x, y ∈ R^d.

Definition 2. A convex function f: R^d → R is called µ-strongly convex (µ ≥ 0) if f satisfies

  (µ/2)‖x − y‖² + ⟨ξ, y − x⟩ + f(x) ≤ f(y)  for all x, y ∈ R^d, ξ ∈ ∂f(x),

where ∂f(x) denotes the set of subgradients of f at x. Note that if f is µ-strongly convex (µ > 0),
then f has a unique minimizer.

Definition 3. We say that a function f: R^d → R is µ-optimally strongly convex (µ ≥ 0) if f has a minimizer and satisfies

  (µ/2)‖x − x*‖² ≤ f(x) − f(x*)  for all x ∈ R^d, x* ∈ argmin_{x ∈ R^d} f(x).

If a function f is µ-strongly convex, then f is clearly µ-optimally strongly convex.

Definition 4. We say that a convex function f: R^d → R is relatively simple if computing the proximal mapping of f at y,

  prox_f(y) = argmin_{x ∈ R^d} { (1/2)‖x − y‖² + f(x) },

takes at most O(d) computational cost for any y ∈ R^d.

⁴ We refer to the iteration complexity of the deterministic Nesterov acceleration method (Nesterov, 2013) as the optimal iteration complexity.

[Table 1: total computational costs and necessary mini-batch sizes of SVRG, SVRG++, ASDCA (UC), APCG, SPDC, Katyusha, AccProxSVRG, and DASVRDA in size-b mini-batch settings, for µ-strongly convex and non-strongly convex objectives.]

Table 1: Comparisons of our method with SVRG, SVRG++ (Allen-Zhu & Yuan, 2016), ASDCA (UC), APCG, SPDC, Katyusha and AccProxSVRG. n is the number of components of the finite sum, d is the dimension of the solution space, b is the mini-batch size, L is the smoothness parameter of the finite sum, µ is the strong convexity parameter of the objective (see Definitions 1 and 2 for their definitions), and ε is the accuracy. "Necessary size of mini-batches" indicates the order of the mini-batch size needed to achieve the optimal iteration complexities O(√(L/µ) log(1/ε)) and O(√(L/ε)) for strongly convex and non-strongly convex objectives, respectively. We regard one computation of a full gradient as n/b iterations in size-b mini-batch settings, for a fair comparison. "Unattainable" means that the algorithm cannot achieve the optimal iteration complexity even with size-n mini-batches. Õ hides extra log factors. The results marked in red denote the main contributions of this paper.

As f is convex, the function (1/2)‖x − y‖² + f(x) is 1-strongly convex, and prox_f is well-defined.

Now, we formally describe the problem considered in this paper and the assumptions for our theory. We consider the following composite convex minimization problem:

  min_{x ∈ R^d} { P(x) = F(x) + R(x) },  (3)

where F(x) = (1/n) Σ_{i=1}^n f_i(x). Here each f_i: R^d → R is an L_i-smooth convex function and R: R^d → R is a relatively simple, possibly non-differentiable, convex function. Problems of this form often arise in machine learning and fall under regularized empirical risk minimization (ERM). In ERM problems, we are given n training examples {(a_i, b_i)}_{i=1}^n, where each a_i ∈ R^d is the feature vector of example i and each b_i ∈ R is the label of example i. The following regression and classification problems are typical examples of ERM in our setting:

Lasso: f_i(x) = (1/2)(a_i^T x − b_i)² and R(x) = λ‖x‖_1.

Ridge logistic regression: f_i(x) = log(1 + exp(−b_i a_i^T x)) and R(x) = (λ/2)‖x‖².

Support vector machines: f_i(x) = h_i^ν(a_i^T x) and R(x) = (λ/2)‖x‖². Here, h_i^ν is a smoothed variant of the hinge loss (for the definition see, for example, Shalev-Shwartz & Zhang, 2013).

We make the following assumptions for our analysis:

Assumption 1. There exists a minimizer x* of (3).

Assumption 2. Each f_i is convex and L_i-smooth. The above examples satisfy this assumption with L_i = ‖a_i‖² for the squared loss, L_i = ‖a_i‖²/4 for the logistic loss, and L_i = ‖a_i‖²/ν for the smoothed hinge loss.

Assumption 3. The regularization function R is convex and relatively simple. For example, the Elastic Net regularizer R(x) = λ_1‖x‖_1 + (λ_2/2)‖x‖² (λ_1, λ_2 ≥ 0) satisfies this assumption. Indeed, we can analytically compute the proximal mapping of R by prox_R(z) = (1/(1 + λ_2)) (sign(z_j) max{|z_j| − λ_1, 0})_{j=1}^d, and this costs only O(d).

We always assume Assumptions 1, 2, and 3 in this paper.
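To make Assumption 3 concrete, here is a small sketch of the Elastic Net proximal mapping given above, written for prox_{ηR} (with the step size folded into the thresholds); the function name is ours.

```python
import numpy as np

def prox_elastic_net(z, eta, lam1, lam2):
    """prox_{eta*R}(z) for R(x) = lam1*||x||_1 + (lam2/2)*||x||^2:
    coordinate-wise soft-thresholding followed by shrinkage, O(d) cost."""
    soft = np.sign(z) * np.maximum(np.abs(z) - eta * lam1, 0.0)
    return soft / (1.0 + eta * lam2)
```

Setting eta = 1 recovers the formula for prox_R stated in Assumption 3.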
Assumption 4. There exists µ > 0 such that the objective function P is µ-optimally strongly convex.

If each f_i is convex, then for the Elastic Net regularization function R(x) = λ_1‖x‖_1 + (λ_2/2)‖x‖² (λ_1 ≥ 0 and λ_2 > 0), Assumption 4 holds with µ = λ_2. We additionally impose Assumption 4 only when we deal with strongly convex objectives.

3 Our Approach: Double Acceleration

In this section, we provide the high-level ideas of our main contribution, called double acceleration. First, we consider deterministic PG (Algorithm 1) and non-mini-batch SVRG (Algorithm 3). PG is an extension of steepest descent to proximal settings. SVRG is a stochastic gradient method using the variance reduction technique, which utilizes the finite-sum structure of the problem, and it achieves a faster convergence rate than PG does.

Algorithm 1 PG(x̃_0, η, S)
  Input: x̃_0 ∈ R^d, η > 0, S ∈ N
  for s = 1 to S do
    x̃_s = One Stage PG(x̃_{s−1}, η)
  end for
  Output: (1/S) Σ_{s=1}^S x̃_s

Algorithm 2 One Stage PG(x̃, η)
  Input: x̃ ∈ R^d, η > 0
  x+ = prox_{ηR}(x̃ − η∇F(x̃))
  Output: x+

Algorithm 3 SVRG(x̃_0, η, m, S)
  Input: x̃_0 ∈ R^d, η > 0, m, S ∈ N
  for s = 1 to S do
    x̃_s = One Stage SVRG(x̃_{s−1}, η, m)
  end for
  Output: (1/S) Σ_{s=1}^S x̃_s

Algorithm 4 One Stage SVRG(x̃, η, m)
  Input: x̃ ∈ R^d, η > 0, m ∈ N
  x_0 = x̃
  for k = 1 to m do
    Pick i_k ∈ {1, ..., n} uniformly at random
    g_k = ∇f_{i_k}(x_{k−1}) − ∇f_{i_k}(x̃) + ∇F(x̃)
    x_k = prox_{ηR}(x_{k−1} − η g_k)
  end for
  Output: (1/m) Σ_{k=1}^m x_k

As SVRG (Algorithm 3) matches PG (Algorithm 1) when the number of inner iterations is m = 1, SVRG can be seen as a generalization of PG. The key element of SVRG is a simple but powerful variance reduction technique for the gradient estimate. The variance reduction of the gradient is realized by setting g_k = ∇f_{i_k}(x_{k−1}) − ∇f_{i_k}(x̃) + ∇F(x̃) rather than the vanilla stochastic gradient ∇f_{i_k}(x_{k−1}). Generally, the stochastic gradient ∇f_{i_k}(x_{k−1}) is an unbiased estimator of ∇F(x_{k−1}), but it may have high variance. In contrast, g_k is also unbiased, and one can show that its variance is reduced; that is, the variance converges to zero as x_{k−1} and x̃ converge to x*.

Next, we explain how to accelerate SVRG and obtain an even faster convergence rate based on our new but quite natural idea, outer acceleration. First, recall that the procedure of deterministic APG is given in Algorithm 5. APG uses the famous momentum scheme and achieves the optimal iteration complexity. Our natural idea is to replace One Stage PG in Algorithm 5 with One Stage SVRG. With slight modifications, we can show that this algorithm improves the rates of PG, SVRG and APG, and is optimal. We call this new algorithm outerly accelerated SVRG. Note that the algorithm matches APG when m = 1 and thus can be seen as a direct generalization of APG. However, this algorithm has poor mini-batch efficiency, because in size-b mini-batch settings its rate is essentially b times worse than in non-mini-batch settings. The state-of-the-art methods APCG, SPDC, and Katyusha also suffer from the same problem in the mini-batch setting.

Now, we illustrate that the inner acceleration technique is beneficial for improving the mini-batch efficiency. Nitanda (2014) has proposed AccProxSVRG in mini-batch settings. He applied the momentum scheme to One Stage SVRG, and we call this technique inner acceleration. He showed that inner acceleration can significantly improve the mini-batch efficiency of vanilla SVRG. This fact indicates that inner acceleration is essential to fully utilize mini-batch settings. However, AccProxSVRG is not a truly accelerated method, because in non-mini-batch settings the rate of AccProxSVRG is the same as that of vanilla SVRG.

In this way, we arrive at our main high-level idea, called double acceleration, which involves applying the momentum scheme to both the outer and inner algorithms. This enables us not only to attain the optimal total computational cost in non-mini-batch settings, but also to improve the mini-batch efficiency of vanilla acceleration methods. We have considered SVRG and its accelerations so far; however, we actually adopt stochastic variance reduced dual averaging (SVRDA) rather than SVRG itself, because we can construct lazy update rules of innerly accelerated SVRDA for sparse data (see Section 6). In Section F of the supplementary material, we briefly discuss an SVRG version of our proposed method and provide its convergence analysis.
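For concreteness, here is a minimal sketch of the mini-batch variance-reduced gradient estimator described above; full_grad corresponds to ∇F(x̃) computed once per stage, and the function names are illustrative.

```python
import numpy as np

def vr_gradient(grad_fi, x, x_snap, full_grad, batch):
    """SVRG-style estimator over a mini-batch:
    g = (1/b) sum_{i in batch} (grad f_i(x) - grad f_i(x_snap)) + grad F(x_snap).
    Unbiased for grad F(x); its variance vanishes as x and x_snap approach the minimizer."""
    correction = np.mean([grad_fi(i, x) - grad_fi(i, x_snap) for i in batch], axis=0)
    return correction + full_grad
```

One Stage SVRG then takes the proximal step x_k = prox_{ηR}(x_{k−1} − η g_k) with this estimator in place of the plain stochastic gradient.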
4 Algorithm Description

In this section, we describe the concrete procedure of the proposed algorithm in detail.

DASVRDA for non-strongly convex objectives. We provide the details of the doubly accelerated SVRDA (DASVRDA) method for non-strongly convex objectives in Algorithm 6. Our momentum step is slightly different from that of vanilla deterministic accelerated methods: we not only add the momentum term ((θ_{s−1} − 1)/θ_s)(x̃_{s−1} − x̃_{s−2}) to the current solution x̃_{s−1}, but also add the term (θ_{s−1}/θ_s)(z̃_{s−1} − x̃_{s−1}), where z̃_{s−1} is a more aggressively updated solution than x̃_{s−1}; thus, this term can also be interpreted as momentum.⁵
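The doubly-extrapolated point just described can be sketched as follows; the θ values are passed in as parameters because the exact schedule is set through γ in Algorithm 6, and the variable names here are ours.

```python
def outer_momentum(x_prev, x_prev2, z_prev, theta_prev, theta_s):
    """Extrapolated point fed to the inner stage:
    y = x_prev + ((theta_prev - 1)/theta_s)*(x_prev - x_prev2)
             + (theta_prev/theta_s)*(z_prev - x_prev)."""
    return (x_prev
            + (theta_prev - 1.0) / theta_s * (x_prev - x_prev2)
            + theta_prev / theta_s * (z_prev - x_prev))
```

If z_prev equals x_prev, the second term vanishes and the step reduces to the usual APG extrapolation, consistent with the fact, noted below, that Algorithm 6 matches deterministic APG when m = 1.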

Algorithm 5 APG(x̃_0, η, S)
  Input: x̃_0 ∈ R^d, η > 0, S ∈ N
  x̃_{−1} = x̃_0, θ_0 = 0
  for s = 1 to S do
    θ_s = (s + 1)/2
    ỹ_s = x̃_{s−1} + ((θ_{s−1} − 1)/θ_s)(x̃_{s−1} − x̃_{s−2})
    x̃_s = One Stage PG(ỹ_s, η)
  end for
  Output: x̃_S

Then, we feed ỹ_s to One Stage Accelerated SVRDA (Algorithm 7) as an initial point. Note that Algorithm 6 can be seen as a direct generalization of APG, because if we set m = 1, One Stage Accelerated SVRDA is essentially the same as one iteration of PG with initial point ỹ_s; then we can see that z̃_s = x̃_s, and Algorithm 6 essentially matches deterministic APG.

Next, we move to One Stage Accelerated SVRDA (Algorithm 7). Algorithm 7 is essentially a combination of the accelerated regularized dual averaging (AccSDA) method (Xiao, 2009) with the variance reduction technique of SVRG. It updates z_k by using the weighted average ḡ_k of all past variance reduced gradients instead of only the single variance reduced gradient g_k. Note that for constructing the variance reduced gradient g_k, we use the full gradient of F at x̃_{s−1} rather than at the initial point ỹ_s.

Remark. In Algorithm 7, we pick b indexes according to the i.i.d. non-uniform distribution Q = {q_i = L_i/(nL̄)}_{i=1}^n, where L̄ = (1/n) Σ_{i=1}^n L_i. Instead of that, we can pick b indexes such that each index i_k^l is uniformly picked from B_l, where {B_l}_{l=1}^b is a predefined disjoint partition of [n] with sizes |B_l| = n/b. If we adopt this scheme, when we parallelize the algorithm using b machines, each machine only needs to store the corresponding partition of the data set rather than the whole data set, which reduces communication and memory costs. In this setting, the convergence analysis in Section 5 can easily be revised by simply replacing L̄ with L_max = max_{i ∈ [n]} L_i.

DASVRDA for strongly convex objectives. Algorithm 8 is our proposed method for strongly convex objectives. Instead of directly accelerating the algorithm using a constant momentum rate, we restart Algorithm 6. The restarting scheme has several advantages both theoretically and practically. First, the restarting scheme only requires the optimal strong convexity of the objective (Def. 3) instead of the ordinary strong convexity (Def. 2), whereas non-restarting accelerated algorithms essentially require the ordinary strong convexity of the objective. Second, for restarting algorithms, we can utilize adaptive restart schemes (O'Donoghue & Candes, 2015). The adaptive restart schemes were originally proposed for deterministic cases; the schemes are heuristic but quite effective empirically. The most fascinating property of these schemes is that we need not prespecify the strong convexity parameter µ: the algorithms adaptively determine the restart timings. O'Donoghue & Candes (2015) have proposed two heuristic adaptive restart schemes: the function scheme and the gradient scheme. We can easily apply these ideas to our method, because our method is a direct generalization of deterministic APG. For the function scheme, we restart Algorithm 6 if P(x̃_s) > P(x̃_{s−1}). For the gradient scheme, we restart the algorithm if ⟨ỹ_s − x̃_s, ỹ_{s+1} − x̃_s⟩ > 0. Here, ỹ_s − x̃_s can be interpreted as (minus) a one-stage gradient mapping of P at ỹ_s. As ỹ_{s+1} − x̃_s is the momentum, this scheme can be interpreted as restarting whenever the momentum and the negative one-stage gradient mapping form an obtuse angle (which means the momentum direction seems to be bad).

⁵ This form also arises in Monotone APG (Li & Lin, 2015). In Algorithm 7, x̄ = x_m can be rewritten as (2/(m(m+1))) Σ_{k=1}^m k z_k, which is a weighted average of the z_k; thus, we can say that z is updated more aggressively than x. For the outerly accelerated SVRG that combines Algorithm 6 with vanilla SVRG (see Section 3), z and x correspond to x_m and (1/m) Σ_{k=1}^m x_k in (Xiao & Zhang, 2014), respectively; thus, we can also see that z is updated more aggressively than x.
We numerically demonstrate the effectiveness of these schemes in Section 7.

DASVRDA^ns with warm start. Algorithm 9 is a combination of DASVRDA^ns with a warm start scheme. In the warm start phase, we repeatedly run One Stage AccSVRDA and increase the number of its inner iterations m_u exponentially until m_u reaches n/b. After that, we run vanilla DASVRDA^ns. We can show that this algorithm gives a faster rate than vanilla DASVRDA^ns.

Remark. For DASVRDA^sc, the warm start scheme for DASVRDA^ns is not needed, because the theoretical rate is identical to the one without warm start.

Parameter tunings. For DASVRDA^ns, only the learning rate η needs to be tuned, because we can theoretically obtain the optimal choice of γ, and we can naturally use m = n/b as the default epoch length (see Section 5). For DASVRDA^sc, both the learning rate η and the fixed restart interval S need to be tuned.
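The two adaptive restart tests described above can be sketched as follows; this is a schematic check under the notation of the outer loop, not a reference implementation from the paper.

```python
import numpy as np

def should_restart(P, x_cur, x_prev, y_cur, y_next, scheme="gradient"):
    """Adaptive restart tests of O'Donoghue & Candes (2015) applied to the outer loop.
    Function scheme: restart if the objective increased, P(x_s) > P(x_{s-1}).
    Gradient scheme: restart if the momentum (y_{s+1} - x_s) forms an obtuse angle
    with the negative one-stage gradient mapping -(y_s - x_s)."""
    if scheme == "function":
        return P(x_cur) > P(x_prev)
    return np.dot(y_cur - x_cur, y_next - x_cur) > 0.0
```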

Algorithm 6 DASVRDA^ns(x̃_0, z̃_0, γ, {L_i}_{i=1}^n, m, b, S)
  Input: x̃_0, z̃_0 ∈ R^d, γ > 1, {L_i > 0}_{i=1}^n, m ∈ N, S ∈ N, b ∈ [n]
  x̃_{−1} = z̃_0, θ_0 = 1/γ, L̄ = (1/n) Σ_{i=1}^n L_i, Q = {q_i = L_i/(nL̄)}_{i=1}^n, η = 1/((1 + γ(m+1)/b) L̄)
  for s = 1 to S do
    θ_s = (s + 1)/γ
    ỹ_s = x̃_{s−1} + ((θ_{s−1} − 1)/θ_s)(x̃_{s−1} − x̃_{s−2}) + (θ_{s−1}/θ_s)(z̃_{s−1} − x̃_{s−1})
    (x̃_s, z̃_s) = One Stage AccSVRDA(ỹ_s, x̃_{s−1}, η, m, b, Q)
  end for
  Output: x̃_S

Algorithm 7 One Stage AccSVRDA(ỹ, x̃, η, m, b, Q)
  Input: ỹ, x̃ ∈ R^d, η > 0, m ∈ N, b ∈ [n], Q
  x_0 = z_0 = ỹ, ḡ_0 = 0, θ_0 = 1
  for k = 1 to m do
    Pick i_k^1, ..., i_k^b ∼ Q independently; I_k = {i_k^l}_{l=1}^b; θ_k = (k + 1)/2
    y_k = (1 − 1/θ_k) x_{k−1} + (1/θ_k) z_{k−1}
    g_k = (1/b) Σ_{i ∈ I_k} (∇f_i(y_k) − ∇f_i(x̃))/(n q_i) + ∇F(x̃)
    ḡ_k = (Σ_{l=1}^k θ_l g_l) / (Σ_{l=1}^k θ_l)
    z_k = argmin_{z ∈ R^d} { ⟨ḡ_k, z⟩ + R(z) + (1/(2η Σ_{l=1}^k θ_l)) ‖z − z_0‖² } = prox_{(η Σ_{l=1}^k θ_l) R}(z_0 − η (Σ_{l=1}^k θ_l) ḡ_k)
    x_k = (1 − 1/θ_k) x_{k−1} + (1/θ_k) z_k
  end for
  Output: (x_m, z_m)

Algorithm 8 DASVRDA^sc(ˇx_0, γ, {L_i}_{i=1}^n, m, b, S, T)
  Input: ˇx_0 ∈ R^d, γ, {L_i > 0}_{i=1}^n, m ∈ N, b ∈ [n], S, T ∈ N
  for t = 1 to T do
    ˇx_t = DASVRDA^ns(ˇx_{t−1}, ˇx_{t−1}, γ, {L_i}_{i=1}^n, m, b, S)
  end for
  Output: ˇx_T

Algorithm 9 DASVRDA^ns with warm start(x̃_0, γ, {L_i}_{i=1}^n, m_0, m, b, U, S)
  z̃_0 = x̃_0, L̄ = (1/n) Σ_{i=1}^n L_i, Q = {q_i = L_i/(nL̄)}_{i=1}^n
  for u = 1 to U do
    m_u = γ(m_{u−1} + 1) + m_{u−1}
  end for
  m_U = ((m_U + m_{U−1})/2)/(1 − 1/γ)
  η = 1/((1 + γ(m_U + 1)/b) L̄)
  for u = 1 to U do
    (x̃_u, z̃_u) = One Stage AccSVRDA(z̃_{u−1}, x̃_{u−1}, η, m_u, b, Q)
  end for
  Output: DASVRDA^ns(x̃_U, z̃_U, γ, {L_i}_{i=1}^n, m_U, b, S)

5 Convergence Analysis of the DASVRDA Method

In this section, we provide the convergence analysis of our algorithms. First, we consider the DASVRDA^ns algorithm.

Theorem 5.1. Suppose that Assumptions 1, 2 and 3 hold. Let x̃_0, z̃_0 ∈ R^d, γ ≥ 3, m ∈ N, b ∈ [n] and S ∈ N. Then DASVRDA^ns(x̃_0, z̃_0, γ, {L_i}_{i=1}^n, m, b, S) satisfies

  E[P(x̃_S) − P(x*)] ≤ (4/(S+1)²)(P(x̃_0) − P(x*)) + (8/((1 − 1/γ)² η (S+1)²(m+1)(m+2))) ‖z̃_0 − x*‖².

The proof of Theorem 5.1 is found in the supplementary material (Section A). We can easily see that the optimal choice of γ is O(1 + √(b/m)) (see Section A of the supplementary material); we denote this value by γ*. Using Theorem 5.1, we can establish the convergence rate of DASVRDA^ns with warm start (Algorithm 9).

Theorem 5.2. Suppose that Assumptions 1, 2 and 3 hold. Let x̃_0 ∈ R^d, γ = γ*, m ∈ N, m_0 = min{ √((1 + γ(m+1)/b) L̄ ‖x̃_0 − x*‖² / (P(x̃_0) − P(x*))), m }, b ∈ [n], U = ⌈log_γ(m/m_0)⌉ and S ∈ N. Then DASVRDA^ns with warm start(x̃_0, γ, {L_i}_{i=1}^n, m_0, m, b, U, S) satisfies

  E[P(x̃_S) − P(x*)] = O( L̄‖x̃_0 − x*‖² / (S²m²) + L̄‖x̃_0 − x*‖² / (S²mb) ).

The proof of Theorem 5.2 is found in the supplementary material (Section B). From Theorem 5.2, we obtain the following corollary:

Corollary 5.3. Suppose that Assumptions 1, 2 and 3 hold. Let x̃_0 ∈ R^d, γ = γ*, m = n/b, m_0 = min{ √((1 + γ(m+1)/b) L̄ ‖x̃_0 − x*‖² / (P(x̃_0) − P(x*))), m }, b ∈ [n] and U = ⌈log_γ(m/m_0)⌉. If we appropriately choose S = O(1 + (1/m + 1/√(mb)) √(L̄‖x̃_0 − x*‖²/ε)), then the total computational cost of DASVRDA^ns with warm start(x̃_0, γ, {L_i}_{i=1}^n, m_0, m, b, U, S) for achieving E[P(x̃_S) − P(x*)] ≤ ε is

  O( d [ n log((P(x̃_0) − P(x*))/ε) + (b + √n) √(L̄‖x̃_0 − x*‖²/ε) ] ).
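As a quick sanity check on Corollary 5.3 (a worked instance of the bound, not taken from the paper): with b = √n and hence m = n/b = √n, the cost specializes to

```latex
O\!\Big(d\Big[n\log\tfrac{P(\tilde{x}_0)-P(x^*)}{\epsilon}
    + (b+\sqrt{n})\sqrt{\tfrac{\bar{L}\|\tilde{x}_0-x^*\|^2}{\epsilon}}\Big]\Big)
= O\!\Big(d\Big[n\log\tfrac{P(\tilde{x}_0)-P(x^*)}{\epsilon}
    + \sqrt{\tfrac{n\bar{L}\|\tilde{x}_0-x^*\|^2}{\epsilon}}\Big]\Big),
```

which matches the non-mini-batch total cost quoted in the remark below, so distributing the work over b = O(√n) machines comes at no loss in total computational cost.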

For the proof of Corollary 5.3, see Section C of the supplementary material.

Remark. Corollary 5.3 implies that if the mini-batch size b is O(√n), DASVRDA^ns with warm start(x̃_0, γ, {L_i}_{i=1}^n, m_0, n/b, b, U, S) still achieves the total computational cost of O(d(n log(1/ε) + √(nL/ε))), which is better than the O(d(n log(1/ε) + √(nbL/ε))) of Katyusha.

Remark. Corollary 5.3 also implies that DASVRDA^ns with warm start only needs size O(√n) mini-batches for achieving the optimal iteration complexity of O(√(L/ε)) when L/ε ≥ n log(1/ε). In contrast, Katyusha needs size O(n) mini-batches for achieving the optimal iteration complexity. Note that even when L/ε ≤ n log(1/ε), our method only needs size Õ(n√(ε/L)) mini-batches,⁶ which is typically smaller than the O(n) of Katyusha.

Next, we analyze the DASVRDA^sc algorithm for optimally strongly convex objectives. Combining Theorem 5.1 with the optimal strong convexity of the objective function immediately yields the following theorem, which implies that the DASVRDA^sc algorithm achieves linear convergence.

Theorem 5.4. Suppose that Assumptions 1, 2, 3 and 4 hold. Let ˇx_0 ∈ R^d, γ = γ*, m ∈ N, b ∈ [n] and T ∈ N. Define ρ = 4{(1 − 1/γ)² + 4/(η(m+1)(m+2)µ)} / {(1 − 1/γ)²(S+1)²}. If S is sufficiently large that ρ ∈ (0, 1), then DASVRDA^sc(ˇx_0, γ, {L_i}_{i=1}^n, m, b, S, T) satisfies

  E[P(ˇx_T) − P(x*)] ≤ ρ^T [P(ˇx_0) − P(x*)].

From Theorem 5.4, we have the following corollary.

Corollary 5.5. Suppose that Assumptions 1, 2, 3 and 4 hold. Let ˇx_0 ∈ R^d, γ = γ*, m = n/b, b ∈ [n]. There exists S = O(1 + (b/n + 1/√n)√(L/µ)) such that 1/log(1/ρ) = O(1). Moreover, if we appropriately choose T = O(log((P(ˇx_0) − P(x*))/ε)), then the total computational cost of DASVRDA^sc(ˇx_0, γ, {L_i}_{i=1}^n, m, b, S, T) for achieving E[P(ˇx_T) − P(x*)] ≤ ε is

  O( d ( n + (b + √n) √(L/µ) ) log((P(ˇx_0) − P(x*))/ε) ).

Remark. Corollary 5.5 implies that if the mini-batch size b is O(√n), DASVRDA^sc(ˇx_0, γ, {L_i}_{i=1}^n, n/b, b, S, T) still achieves the total computational cost of O(d(n + √(nL/µ)) log(1/ε)), which is much better than the O(d(n + √(nbL/µ)) log(1/ε)) of APCG, SPDC, and Katyusha.

Remark. Corollary 5.5 also implies that DASVRDA^sc only needs size O(√n) mini-batches for achieving the optimal iteration complexity O(√(L/µ) log(1/ε)) when L/µ ≥ n. In contrast, APCG, SPDC and Katyusha need size O(n) mini-batches, and AccProxSVRG needs size O(√(L/µ)) mini-batches, for achieving the optimal iteration complexity. Note that even when L/µ ≤ n, our method only needs size O(n√(µ/L)) mini-batches.⁷ This size is smaller than the O(n) of APCG, SPDC, and Katyusha, and the same as that of AccProxSVRG.

⁶ Note that we regard one computation of a full gradient as n/b iterations in size-b mini-batch settings.
⁷ Note that the required size is O(n√(µ/L)) ≤ O(n), which is not the case for O(n√(L/µ)) ≥ O(n).

6 Efficient Implementation for Sparse Data: Lazy Update

In this section, we briefly comment on sparse implementations of our algorithms. Originally, lazy updates were proposed in online settings (Duchi et al., 2011). Generally, it is difficult for accelerated stochastic variance reduction methods to construct lazy update rules, because (i) variance reduced gradients are generally not sparse even if stochastic gradients are sparse; (ii) if we adopt the momentum scheme, the updated solution becomes a convex combination of previous solutions; and (iii) for non-strongly convex objectives, the momentum rate must not be constant. Konečný et al. (2016) have tackled problem (i) in non-accelerated settings and derived lazy update rules for the mini-batch semi-stochastic gradient descent (mS2GD) method. Allen-Zhu (2016) has only mentioned that lazy updates can be applied to Katyusha but did not give explicit lazy update rules for Katyusha; in particular, for non-strongly convex objectives it seems difficult to derive lazy update rules owing to difficulty (iii). The reason we adopt the stochastic dual averaging scheme (Xiao, 2009) rather than stochastic gradient descent for our method is to be able to overcome the difficulties in (i), (ii), and (iii). The lazy update rules of our method support both non-strongly and strongly convex objectives. The formal lazy update algorithms of our method can be found in the supplementary material (Section D).
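The property that makes lazy updates natural for dual averaging is that the iterate z_k depends on the history only through z_0 and the accumulated (weighted) gradients, so a coordinate untouched by recent mini-batches can be caught up on demand. The sketch below illustrates this catch-up for an ℓ1 regularizer; it assumes the simplified dual-averaging update z_k = prox_{ηΘ_k λ1‖·‖_1}(z_0 − ηΘ_k ḡ_k) with cumulative weight Θ_k, and it is only an illustration of the idea, not the paper's Algorithm 10 (which additionally maintains lazy recursions for x_k and y_k).

```python
import numpy as np

def lazy_z_coordinate(j, k, z0, S, full_grad, Theta, eta, lam1):
    # Assumption: z_k = prox_{eta*Theta[k]*lam1*||.||_1}(z0 - eta*Theta[k]*gbar_k), and for a
    # coordinate j untouched since its last update, gbar_{k,j} = S[j]/Theta[k] + full_grad[j],
    # where S[j] stores the accumulated weighted sparse corrections for coordinate j.
    # Everything needed is O(1) per coordinate.
    u = z0[j] - eta * (S[j] + Theta[k] * full_grad[j])
    return np.sign(u) * max(abs(u) - eta * Theta[k] * lam1, 0.0)
```

Proposition D.1 in the supplementary material gives the corresponding closed forms for x_k and y_k, which is what brings the expected per-iteration cost down to O(b d̂) for data with d̂ expected nonzeros per feature vector, rather than O(bd).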
7 Numerical Experiments

In this section, we provide numerical experiments to demonstrate the performance of DASVRDA. We numerically compare our method with several well-known stochastic gradient methods in mini-batch settings: SVRG (Xiao & Zhang, 2014), SVRG++ (Allen-Zhu & Yuan, 2016), AccProxSVRG (Nitanda, 2014), Universal Catalyst (Lin et al., 2015), APCG (Lin et al., 2014a), and Katyusha (Allen-Zhu, 2016). The details of the implemented algorithms and their parameter tunings are found in the supplementary material. In the experiments, we focus on the regularized logistic regression problem for binary classification, with regularizer λ_1‖x‖_1 + (λ_2/2)‖x‖².
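As a concrete instance of the smooth part of this objective, here is a minimal sketch of the logistic loss and its component gradient (the elastic-net term is handled separately by the proximal mapping); the function names are ours.

```python
import numpy as np

def logistic_fi(x, a_i, b_i):
    # f_i(x) = log(1 + exp(-b_i * <a_i, x>)); L_i-smooth with L_i = ||a_i||^2 / 4.
    return np.logaddexp(0.0, -b_i * a_i.dot(x))

def logistic_grad_fi(x, a_i, b_i):
    # grad f_i(x) = -b_i * a_i / (1 + exp(b_i * <a_i, x>)), written with tanh for stability.
    t = b_i * a_i.dot(x)
    return -b_i * a_i * 0.5 * (1.0 - np.tanh(t / 2.0))
```

The full objective is then P(x) = (1/n) Σ_i f_i(x) + λ_1‖x‖_1 + (λ_2/2)‖x‖².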

We used three publicly available data sets in the experiments. Their sizes n and dimensions d are listed in Table 2; a common mini-batch size b was used for all implemented algorithms on each data set.

  Data set   n        d
  a9a        32,561   123
  rcv1       20,242   47,236
  sido0      12,678   4,932

Table 2: Summary of the data sets used in our numerical experiments.

For the regularization parameters, we used three settings: (λ_1, λ_2) = (10^{-4}, 0), (10^{-4}, 10^{-6}), and (0, 10^{-6}). For the first setting, the objective is non-strongly convex; for the latter two settings, the objectives are strongly convex. Note that for the latter two settings, the strong convexity of the objectives is µ = 10^{-6}, which is relatively small; this makes acceleration methods beneficial.

Figure 1 shows the comparisons of our method with the different methods described above in several settings. "Objective Gap" means P(x) − P(x*) for the output solution x. "Gradient Evaluations / n" is the number of computations of stochastic gradients ∇f_i divided by n. "Restart DASVRDA" means DASVRDA with heuristic adaptive restarting.

[Figure 1: Comparisons on the a9a (top), rcv1 (middle) and sido0 (bottom) data sets, for regularization parameters (λ_1, λ_2) = (10^{-4}, 0) (left), (10^{-4}, 10^{-6}) (middle) and (0, 10^{-6}) (right).]

We can observe the following from these results. Our proposed DASVRDA and Restart DASVRDA significantly outperformed all the other methods overall. DASVRDA with the heuristic adaptive restart scheme efficiently made use of the local strong convexity of non-strongly convex objectives and significantly outperformed vanilla DASVRDA on the a9a and rcv1 data sets; for the other settings, the algorithm was still comparable to vanilla DASVRDA.

UC+SVRG did not work as well as it does in theory, and its performance was almost the same as that of vanilla SVRG. UC+AccProxSVRG sometimes outperformed vanilla AccProxSVRG but was always outperformed by our methods. APCG sometimes performed unstably and was outperformed by vanilla SVRG; on the sido0 data set, for the Elastic Net setting, APCG significantly outperformed all other methods. Katyusha outperformed vanilla SVRG overall; however, Katyusha was sometimes slower than vanilla SVRG for the Elastic Net settings. This is probably because SVRG is almost adaptive to the local strong convexity of the loss functions, whereas Katyusha is not (see the remark in the supplementary material).

8 Conclusion

In this paper, we developed a new accelerated stochastic variance reduced gradient method for regularized empirical risk minimization problems in mini-batch settings: DASVRDA. We have shown that DASVRDA achieves total computational costs of O(d(n log(1/ε) + √(nL/ε) + b√(L/ε))) and O(d(n + √(nL/µ) + b√(L/µ)) log(1/ε)) in size-b mini-batch settings for non-strongly and strongly convex objectives, respectively. In addition, DASVRDA essentially achieves the optimal iteration complexities with only size O(√n) mini-batches in both settings. In the numerical experiments, our method significantly outperformed state-of-the-art methods, including Katyusha and AccProxSVRG.

Acknowledgment

This work was partially supported by MEXT kakenhi , 500, , 5H0678 and 5H05707, JST-PRESTO and JST-CREST.

References

Agarwal, Alekh, Negahban, Sahand, and Wainwright, Martin J. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. Advances in Neural Information Processing Systems, pp. 37-45, 2010.

Agarwal, Alekh, Negahban, Sahand, and Wainwright, Martin J. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. Advances in Neural Information Processing Systems, 2012.

Allen-Zhu, Zeyuan. Katyusha: The first direct acceleration of stochastic gradient methods. ArXiv e-prints, abs/1603.05953, 2016.

Allen-Zhu, Zeyuan and Yuan, Yang. UniVR: A universal variance reduction framework for proximal stochastic gradient method. arXiv preprint arXiv:1506.01972, 2015.

Allen-Zhu, Zeyuan and Yuan, Yang. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016.

Beck, Amir and Teboulle, Marc. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 2014.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Hazan, Elad, Agarwal, Amit, and Kale, Satyen. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Konečný, Jakub, Liu, Jie, Richtárik, Peter, and Takáč, Martin. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242-255, 2016.

Li, Huan and Lin, Zhouchen. Accelerated proximal gradient methods for nonconvex programming. Advances in Neural Information Processing Systems, 2015.
Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. Advances in Neural Information Processing Systems, 2015.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximal coordinate gradient method. Advances in Neural Information Processing Systems, 2014a.

Lin, Qihang, Xiao, Lin, et al. An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. In ICML, pp. 73-81, 2014b.

Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, volume 87, 2013.

Nesterov, Yurii et al. Gradient methods for minimizing composite objective function. Technical report, UCL, 2007.

Nitanda, Atsushi. Stochastic proximal gradient descent with acceleration techniques. Advances in Neural Information Processing Systems, 2014.

Nitanda, Atsushi. Accelerated stochastic gradient descent for minimizing finite sums. arXiv preprint, 2015.

O'Donoghue, Brendan and Candes, Emmanuel. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715-732, 2015.

Roux, Nicolas L., Schmidt, Mark, and Bach, Francis R. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 2012.

Schmidt, Mark, Roux, Nicolas Le, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.

Shalev-Shwartz, Shai and Singer, Yoram. Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, 2007.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(Feb):567-599, 2013.

Singer, Yoram and Duchi, John C. Efficient learning using forward-backward splitting. Advances in Neural Information Processing Systems, 2009.

Tseng, Paul. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.

Xiao, Lin. Dual averaging method for regularized stochastic learning and online optimization. Advances in Neural Information Processing Systems, pp. 2116-2124, 2009.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Zhang, Yuchen and Xiao, Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Supplementary material: Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

In this supplementary material, we give the proofs of Theorem 5.1 and the optimality of γ* (Section A), of Theorem 5.2 (Section B) and of Corollary 5.3 (Section C), the lazy update algorithm of our method (Section D), and the experimental details (Section E). Finally, we briefly discuss the DASVRG method, which is a variant of the DASVRDA method (Section F).

A Proof of Theorem 5.1

In this section, we give the comprehensive proof of Theorem 5.1. First we analyze the One Stage Accelerated SVRDA algorithm.

Lemma A.1. The sequence {θ_k}_k defined in Algorithm 7 satisfies ... for k ≥ 1, where θ_0 = 1.

Proof. Since θ_k = (k+1)/2 for k ≥ 0, we have that ... .

Lemma A.2. The sequence {θ_k}_k defined in Algorithm 7 satisfies Σ_{k=1}^m θ_k = ... .

Proof. Observe that ... = (m(m+1))/4 + ... = Σ_{k=1}^m θ_k.

Lemma A.3. For every x, y ∈ R^d,

  F(y) + ⟨∇F(y), x − y⟩ + R(x) ≤ P(x) − (1/(2L̄n²)) Σ_{i=1}^n (1/q_i) ‖∇f_i(x) − ∇f_i(y)‖².

Proof. Since f_i is convex and L_i-smooth, we have (see Nesterov, 2013)

  f_i(y) + ⟨∇f_i(y), x − y⟩ + (1/(2L_i)) ‖∇f_i(x) − ∇f_i(y)‖² ≤ f_i(x).

By the definition of {q_i}, summing this inequality from i = 1 to n and dividing it by n results in

  F(y) + ⟨∇F(y), x − y⟩ ≤ F(x) − (1/(2L̄n²)) Σ_{i=1}^n (1/q_i) ‖∇f_i(x) − ∇f_i(y)‖².

Adding R(x) to both sides of this inequality gives the desired result.

12 Lemma A4 DASVRDA for Regularized ERM ḡ k k g k k k Proof For k, ḡ g k g k by the inition of θ 0 Assume that the claim holds for some k Then ḡ k+ + k + ḡ k + 4 k + k + g k+ k k k+ 4 k + k + g k k + k+ k g k g k + k + g k+ The first equality follows from the inition of ḡ k+ Second equality is due to the assumption of the induction This finishes the proof for Lemma A4 Next we prove the following main lemma for One Stage Accelerated SVRDA The proof is inspired by the one of AccSDA given in Xiao, 009 Lemma A5 Let η < / L For One Stage Accelerated SVRDA, we have that E[P x m P x] ηm + m z 0 x ηm + m E z m x m + k + ke g k F y k m + m k 4 k L n for any x R d, where the expectations are taken with respect to I k k m Proof We ine n E f i y k f i x, i l k x F y k + F y k, x y k + Rx, ˆl k x F y k + g k, x y k + Rx Observe that l k, ˆl k is convex and l k P by the convexity of F and R Moreover, for k we have that k ˆl k z k k F y k + k k g k, z y k + k ḡ k, z + Rz + k Rz k k F y k k k g k, y k The second equality follows from Lemma A4 and Lemma A Thus we see that z k k argmin k ˆl k z + η z z 0 } Observe that F is convex and L-smooth Thus we have that z R d F x k F y k + F y k, x k y k + L x k y k 4 k

13 Hence we see that P x k l k x k + L x k y k l k x k + z k + L θk l k x k + l k z k + θk P x k + θk x k + θk z k y k L θ k z k z k ˆlk z k + L z k z k g k F y k, z k y k θk P x k + ˆlk z k + η z k z k z k z k g k F y k, z k z k g k F y k, z k y k The first inequality follows from 4 The first equality is due to the inition of x k The second inequality is due to the convexity of l k and the inition of y k The third inequality holds because l k P and θk Since η > L, we have that z k z k g k F y k, z k z k g k F y k g k F y k The first inequality is due to Young s inequality The second inequality holds because Using this inequality, we get that P x k θk P x k + ˆlk z k + η z k z k + g k F y k g k F y k, z k y k Multiplying the both sides of the above inequality by yields P x k P x k + ˆlk z k + η z k z k + g k F y k g k F y k, z k y k 5 By the fact that k k ˆl k z+ η z z 0 is η -strongly convex and z k is the minimizer of k k ˆl k z+ η z z 0 for k, we have that k k ˆl k z k + η z k z 0 + k η z k z k k ˆl k z k + η z k z 0

14 for k and, for k, we ine 0 k 0 Using this inequality, we obtain DASVRDA for Regularized ERM P x k k k P x k ˆl k z k η z k z 0 k k g k F y k, z k y k ˆl k z k η z k z 0 + g k F y k Here, the inequality follows from Lemma A we ined θ 0 Summing the above inequality from k to m results in θ m θ m P x m m k ˆlk z m η z m z 0 m g k F y k m k g k F y k, z k y k k Using η -strongly convexity of the function m k ˆl k z + η z z 0 and the optimality of z m, we have that m k ˆlk z m + η z m z 0 m k ˆlk x + η z 0 x η z m x From this inequality, we see that θ m θ m P x m m ˆlk x + η z 0 x η z m x k m g k F y k m + k g k F y k, z k y k k m l k x + η z 0 x η z m x k m g k F y k m + k g k F y k, z k x k By Lemma A3 with x x and y y k, we have that l k x P x L n n i f i x f i y k

15 Applying this inequality to the above inequality yields θ m θ m P x m DASVRDA for Regularized ERM m P x k η z 0 x η z m x m + g k F y k k L n m g k F y k, z k x k n f i x f i y k Using Lemma A and dividing the both sides of the above inequality by θ m θ m result in i P x m P x z 0 x z m x ηθ m θ m ηθ m θ m m + g k F y k θ m θ m k L n m g k F y k, z k x θ m θ m k n f i x f i y k Taking the expectations with respect to I k k m on the both sides of this inequality yields E[P x m P x] z 0 x E z m x ηθ m θ m ηθ m θ m m + E g k F y k θ m θ m k L n i n E f i x f i y k Here we used the fact that E[g k F y k ] 0 for k,, m This finishes the proof of Lemma A5 Now we need the following lemma Lemma A6 For every x R d, n n i Proof From the argument of the proof of Lemma A3, we have n n i i f i x f i x LP x P x f i x f i x LF x F x, x x F x By the optimality of x, there exists ξ Rx such that F x + ξ 0 Then we have F x, x x ξ, x x Rx Rx, and hence n n i f i x f i x LP x P x

16 Proposition A7 Let γ > and η / + γm + /b L For One Pass Accelerated SVRDA, it follows that and E[P x m P x] E[P x m P x ] γ P x P x + where the expectations are taken with respect to I k k m ηm + m ỹ x ηm + m E z m x, ηm + m ỹ x ηm + m E z m x, Proof We bound the variance of the averaged stochastic gradient E g k F y k : E g k F y k E [ E Ik g k F y k [k ] ] b E [ E i Q f i y k f i x/ + F x F y k [k ] ] b E [ E i Q f i y k f i x/ [k ] ] [ ] b E n f i y k f i x n b E [ n i n i [ + b E n n i [ b E n n i f i y k f i x ] f i x f i x f i y k f i x ] ] L b P x P x 7 The second equality follows from the independency of the random variables i,, i b } and the unbiasedness of f i y k f i x/ + F x The first inequality is due to the fact that E X E[X] E X The second inequality follows from Young s inequality The final inequality is due to Lemma A6 Since η + γm+ L and γ >, using 6 yields By Lemma A5 with x x we have b [ k + k 4 E g k F y k k L E n E[P x m P x] Similarly, combining Lemma A5 with x x with 7 results in These are the desired results E[P x m P x ] γ P x P x + n i ] f i y k f i x 0 ηm + m ỹ x ηm + m E z m x ηm + m ỹ x ηm + m E z m x

17 Lemma A8 The sequence θ s } s ined in Algorithm 6 satisfies θ s θ s + γ θ s for any s Proof Since θ s s+ γ for s 0, we have This finishes the proof of Lemma A8 θ s θ s + γ s + γ γ θ s Now we are ready to proof Theorem 5 Proof of Theorem 5 By Proposition A7, we have and E[P x s P x s ] E[P x s P x ] γ E[P x s P x ] + ss + 4 s + γ + γ ηm + m E ỹ s x s ηm + m E z s x s, ηm + m E ỹ s x ηm + m E z s x, where the expectations are taken with respect to the history of all random variables Hence we have and E[P x s P x s ] E[P x s P x ] γ E[P x s P x ] + 4 ηm + m E z s ỹ s, x s ỹ s ηm + m E z s ỹ s, 8 4 ηm + m E z s ỹ s, x ỹ s ηm + m E z s ỹ s 9 Since γ 3, we have θ s for s Multiplying 8 by θ s θ s 0 and adding θ s 9 yield θ se[p x s P x ] θ s θ s + E[P x s P x ] γ 4 ηm + m E θ s z s ỹ s, θ s x s θ s ỹ s + x ηm + m E θ s z s ỹ s By Lemma A8, we have θ s θ s + γ θ s for s

18 Thus we get θ se[p x s P x ] θ s E[P x s P x ] 4 ηm + m E θ s z s ỹ s, θ s x s θ s ỹ s + x ηm + m E θ s z s ỹ s E θ s x s ηm + m θ s ỹ s + x E θ s x s θ s z s + x Since ỹ s x s + θ s θ s x s x s + θ s z s x s, we have θ s θ s x s θ s ỹ s + x θ s x s θ s z s + x Therefore summing the above inequality from s to S, we obtain θ se[p x S P x ] θ 0P x 0 P x + γ P x 0 P x + ηm + m θ 0 x θ 0 z 0 + x ηm + m z 0 x Dividing both sides by θ s finishes the proof of Theorem 5 Optimal choice of γ We can choose the optimal value of γ based on the following lemma Lemma A9 Define gγ + γm+ b γ for γ > Then, γ argmin gγ γ> b m + Proof First observe that Hence we have g γ m+ b γ + γm+ b γ γ γ g γ 0 m + b γ m + γ γ b b + + γ 3γ m + 0 γ b m + γm + b γm + 0 b γ γ γ 0 Here the second and last equivalencies hold from γ > Moreover observe that g γ > 0 for γ > γ and g γ < 0 for < γ < γ This means that γ is the minimizer of g on the region γ >

19 B Proof of Theorem 5 In this section, we give a proof of Theorem 5 Proof of Theorem 5 Since η / + γm U + /b L / + γm u + /b L, from Proposition A7, we have E[P x u P x ] + γ P x u P x + P x u P x + γ Since m u γm u + m u, we have Using this inequality, we obtain that γ ηm u + m u E z u x ηm u + m u z u x γ ηm u + m u z u x γ ηm u + m u ηm u + m u E[P x U P x ] + P x U P x + γ U P x 0 P x + γ U O P x 0 P x + γ U P x 0 P x ηm U + m U E z U x ηm U + m U E z U x z 0 x ηm 0 + m 0 x 0 x ηm 0 + m 0 The last equality is due to the initions of m 0 and η, and the fact m U Om U O γ U m 0 Om see the arguments in the proof of Corollary 53 Since we get m U + m U m U + m U, γ E[P x U P x ] + E z U x γ ηm U + m U O γ U P x 0 P x Using the initions of U and m 0 and combining this inequality with Theorem 5, we obtain that desired result C Proof of Corollary 53 In this section, we give a proof of Corollary 53 Proof of Corollary 53 Observe that the total computational cost at the warm start phase becomes U O dnu + db m u u

20 Since m u γm u + γ + γm u + γ γ m u + γ + γ γ u m 0 + u u γ u O γ u m 0, we have U O dnu + db m u O dnu + db γ U m 0 u Suppose that m m 0 P x0 P x / Then, this condition implies U log γm/m 0 log γ P x 0 P x / Hence we only need to run u Olog γ P x 0 P x / U iterations at the warm start phase and running DASVRDA ns is not needed Then the total computational cost becomes O d nlog P x 0 P x P x0 P x + bm 0 O d nlog P x 0 P x, here we used mb On Next, suppose that m m 0 P x0 P x / In this case, the total computational cost at the warm start phase with full U iterations becomes O d nlog m + mb O d nlog P x 0 P x m 0 Finally, using Theorem 5 yields the desired total computational cost D Lazy Update Algorithm of DASVRDA Method In this section, we discuss how to efficiently compute the updates of the DASVRDA algorithm for sparse data Specifically, we derive lazy update rules of One Stage Accelerated SVRDA for the following empirical risk minimization problem: n n i ψ i a i x + λ x + λ x, λ, λ 0 For the sake of simplicity, we ine the one dimensional soft-thresholding operator as follows: softz, λ sign z max z λ, 0}, for z R Moreover, in this section, we denote [z, z ] as z Z z z z } for integers z, z Z The explicit algorithm of the lazy updates for One Stage Accelerated SVRDA is given by Algorithm 0 Let us analyze the iteration cost of the algorithm Suppose that each feature vector a i is sparse and the expected number of the nonzero elements is Od First note that A k Obd expectedly if d d For updating x k, by Proposition D, we need to compute k K ± / + η λ and j k K ± θk j / + η λ for each j A k For this, we first make lists S k } m k k k / + η λ } m k and S k }m k k k θk / + η λ } m k before running the algorithm This needs only Om Note that these lists are not depend on coordinate j Since K ± j are sets of continuous integers in [k j +, k] or unions of two sets of continuous integers in [k j +, k], we can efficiently compute the above sums For example, if K + j [k j +, s ] [s +, k] for some integers s ± [k j +, k], we can compute k K + / + η λ as S s S kj+ + S k S s+ j and this costs only O Thus, for computing x k and y k, we need only Obd computational cost For computing g k, we need to compute the inner product a i y k for each i I k and this costs Obd expectedly The expected cost of the rest of the updates is apparently Obd Hence, the total expected iteration cost of our algorithm in serial settings becomes Obd rather than Obd Furthermore, we can extend our algorithm to parallel computing settings Indeed, if we have b processors, processor b runs on the set A b k j [d] a ib,j 0} Then the total iteration cost per processor becomes ideally Od Generally the overlap among the sets A b k may cause latency, however for sufficiently sparse data, this latency is negligible The following proposition guarantees that Algorithm 0 is equivalent to Algorithm 7 when Rx λ x + λ / x

21 Algorithm 0 Lazy Updates for One Stage AccSVRDA ỹ, x, η, m, b, Q Input: ỹ, x, η > 0, m N, b [n], Q x 0 z 0 ỹ g0,j sum 0 j [d] θ 0 k j 0 j [d] F x for k to m do Sample independently i,, i b Q I k i,, i b } A k j [d] b [b] : a ib,j 0} k+ for j A k do Update x k,j, y k,j as in Proposition D end for for j A k do g k,j b i I t ψ i a i y ka i,j ψ i a i xa i,j + j g sum k,j gk sum + θ j,j k g k,j + j j j +η λ softz 0,j ηgk,j sum, η λ z k,j x k,j k j k end for end for Output: x m, z m x k,j + z k,j Proposition D Suppose that Rx λ x + λ x with λ, λ 0 Let j [d], k j [m] 0} and k k j + Assume that j f i y k j f i x 0 for any i [b] and k [k j +, k ] In Algorithm 7, the following results hold: x 0,j k j j θ x k,j x kj,j + k k K + j +η λ z 0,j M + k,j k, θ + k k K j +η λ z 0,j M k,j x 0,j k y k,j x k,j + +η λ k, soft z 0,j ηgk sum ηθ j,j k j j j, η λ and where z k,j M ± k,j + η λ softz 0,j ηg sum k,j, η λ, η j ± λ + ηg sum k j,j ηj j j, and K ± j [k j +, k] are ined as follows: Let c η j 4, c ηλ 4 and c 3 ηgk sum ηθ j,j k j θ kj j to simplify the notation Note that c 0 Moreover, we ine D ± c ± c + 4c ± c z 0,j c 3, s + ± s ± c + c ± D +, c + c c c ± D c c,

22 where if s ± ± are not well ined, we simply assign 0 or any number to s ± ± If c > c, then K + j K j D + 0 [k j +, k] [ s +, s + + ] D + > 0 [k j +, k] D 0 [k j +, s ] [ s +, k] D > 0, If c c, then K + j K j c 0 z 0,j c 3 [k j +, k] c 0 z 0,j > c 3 c > 0 D + 0 [k j +, k] [ s +, s + + ] c > 0 D + > 0 [k j +, k] z 0,j < c 3 z 0,j c 3, 3 If c < c, then K + j K j D + 0 [k j +, k] [ s +, s + + ] D + > 0 D 0 [k j +, k] [ s, s + ] D > 0, 4 If c c, then K + j K j z 0,j c 3 [k j +, k] z 0,j > c 3 [k j +, k] c 0 z 0,j < c 3 c 0 z 0,j c 3 c > 0 D 0 [k j +, k] [ s, s + ] c > 0 D > 0 5 If c < c, then K + j K j [k j +, k] D + 0 [k j +, s + ] [ s + +, k] D + > 0 D 0 [k j +, k] [ s, s + ] D > 0, Proof First we consider the case k Observe that y,j θ x 0,j + θ z 0,j z 0,j x 0,j,


More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Yossi Arjevani Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot 7610001,

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

arxiv: v5 [cs.lg] 23 Dec 2017 Abstract

arxiv: v5 [cs.lg] 23 Dec 2017 Abstract Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction Xingguo Li, Raman Arora, Han Liu, Jarvis Haupt, and Tuo Zhao arxiv:1605.02711v5 [cs.lg] 23 Dec 2017 Abstract We

More information

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods Journal of Machine Learning Research 8 8-5 Submitted 8/6; Revised 5/7; Published 6/8 : The First Direct Acceleration of Stochastic Gradient Methods Zeyuan Allen-Zhu Microsoft Research AI Redmond, WA 985,

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Stochastic Optimization: First order method

Stochastic Optimization: First order method Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization Linbo Qiao, and Tianyi Lin 3 and Yu-Gang Jiang and Fan Yang 5 and Wei Liu 6 and Xicheng Lu, Abstract We consider

More information

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Noname manuscript No. (will be inserted by the editor) Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Jason D. Lee Qihang Lin Tengyu Ma Tianbao

More information

Composite Objective Mirror Descent

Composite Objective Mirror Descent Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

arxiv: v5 [math.oc] 2 May 2017

arxiv: v5 [math.oc] 2 May 2017 : The First Direct Acceleration of Stochastic Gradient Methods version 5 arxiv:3.5953v5 [math.oc] May 7 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March 8,

More information

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

Stochastic Dual Coordinate Ascent with Adaptive Probabilities Dominik Csiba Zheng Qu Peter Richtárik University of Edinburgh CDOMINIK@GMAIL.COM ZHENG.QU@ED.AC.UK PETER.RICHTARIK@ED.AC.UK Abstract This paper introduces AdaSDCA: an adaptive variant of stochastic dual

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

arxiv: v2 [stat.ml] 16 Jun 2015

arxiv: v2 [stat.ml] 16 Jun 2015 Semi-Stochastic Gradient Descent Methods Jakub Konečný Peter Richtárik arxiv:1312.1666v2 [stat.ml] 16 Jun 2015 School of Mathematics University of Edinburgh United Kingdom June 15, 2015 (first version:

More information

arxiv: v1 [stat.ml] 20 May 2017

arxiv: v1 [stat.ml] 20 May 2017 Stochastic Recursive Gradient Algorithm for Nonconvex Optimization arxiv:1705.0761v1 [stat.ml] 0 May 017 Lam M. Nguyen Industrial and Systems Engineering Lehigh University, USA lamnguyen.mltd@gmail.com

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

arxiv: v6 [math.oc] 24 Sep 2018

arxiv: v6 [math.oc] 24 Sep 2018 : The First Direct Acceleration of Stochastic Gradient Methods version 6 arxiv:63.5953v6 [math.oc] 4 Sep 8 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March

More information

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM

More information

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19 Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

arxiv: v2 [stat.ml] 3 Jun 2017

arxiv: v2 [stat.ml] 3 Jun 2017 : A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient Lam M. Nguyen Jie Liu Katya Scheinberg Martin Takáč arxiv:703.000v [stat.ml] 3 Jun 07 Abstract In this paper, we propose

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Second-Order Stochastic Optimization for Machine Learning in Linear Time

Second-Order Stochastic Optimization for Machine Learning in Linear Time Journal of Machine Learning Research 8 (207) -40 Submitted 9/6; Revised 8/7; Published /7 Second-Order Stochastic Optimization for Machine Learning in Linear Time Naman Agarwal Computer Science Department

More information

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

A Proximal Stochastic Gradient Method with Progressive Variance Reduction A Proximal Stochastic Gradient Method with Progressive Variance Reduction arxiv:403.4699v [math.oc] 9 Mar 204 Lin Xiao Tong Zhang March 8, 204 Abstract We consider the problem of minimizing the sum of

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

arxiv: v1 [math.oc] 5 Feb 2018

arxiv: v1 [math.oc] 5 Feb 2018 for Convex-Concave Saddle Point Problems without Strong Convexity Simon S. Du * 1 Wei Hu * arxiv:180.01504v1 math.oc] 5 Feb 018 Abstract We consider the convex-concave saddle point problem x max y fx +

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

Importance Sampling for Minibatches

Importance Sampling for Minibatches Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

SEGA: Variance Reduction via Gradient Sketching

SEGA: Variance Reduction via Gradient Sketching SEGA: Variance Reduction via Gradient Sketching Filip Hanzely 1 Konstantin Mishchenko 1 Peter Richtárik 1,2,3 1 King Abdullah University of Science and Technology, 2 University of Edinburgh, 3 Moscow Institute

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate 58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

Stochastic and Adversarial Online Learning without Hyperparameters

Stochastic and Adversarial Online Learning without Hyperparameters Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

Notes on AdaGrad. Joseph Perla 2014

Notes on AdaGrad. Joseph Perla 2014 Notes on AdaGrad Joseph Perla 2014 1 Introduction Stochastic Gradient Descent (SGD) is a common online learning algorithm for optimizing convex (and often non-convex) functions in machine learning today.

More information

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms On the Iteration Complexity of Oblivious First-Order Optimization Algorithms Yossi Arjevani Weizmann Institute of Science, Rehovot 7610001, Israel Ohad Shamir Weizmann Institute of Science, Rehovot 7610001,

More information

arxiv: v2 [cs.lg] 10 Oct 2018

arxiv: v2 [cs.lg] 10 Oct 2018 Journal of Machine Learning Research 9 (208) -49 Submitted 0/6; Published 7/8 CoCoA: A General Framework for Communication-Efficient Distributed Optimization arxiv:6.0289v2 [cs.lg] 0 Oct 208 Virginia Smith

More information

Gradient Projection Iterative Sketch for Large-Scale Constrained Least-Squares

Gradient Projection Iterative Sketch for Large-Scale Constrained Least-Squares Gradient Projection Iterative Sketch for Large-Scale Constrained Least-Squares Junqi Tang 1 Mohammad Golbabaee 1 Mike E. Davies 1 Abstract We propose a randomized first order optimization algorithm Gradient

More information

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to

More information