Adaptive restart of accelerated gradient methods under local quadratic growth condition


Olivier Fercoq (Télécom ParisTech, Université Paris-Saclay, Paris, France, olivier.fercoq@telecom-paristech.fr)
Zheng Qu (Department of Mathematics, The University of Hong Kong, Hong Kong, China, zhengqu@hku.hk)

September 6, 2018

Abstract

By analyzing accelerated proximal gradient methods under a local quadratic growth condition, we show that restarting these algorithms at any frequency gives a globally linearly convergent algorithm. This result was previously known only for long enough frequencies. Then, as the rate of convergence depends on the match between the frequency and the quadratic error bound, we design a scheme to automatically adapt the frequency of restart from the observed decrease of the norm of the gradient mapping. Our algorithm has a better theoretical bound than previously proposed methods for the adaptation to the quadratic error bound of the objective. We illustrate the efficiency of the algorithm on a Lasso problem and on a regularized logistic regression problem.

1 Introduction

1.1 Motivation

The proximal gradient method aims at minimizing composite convex functions of the form

    F(x) = f(x) + ψ(x),   x ∈ R^n,

where f is differentiable with Lipschitz gradient and ψ may be nonsmooth but has an easily computable proximal operator. For a mild additional computational cost, accelerated gradient methods transform the proximal gradient method, for which the optimality gap F(x_k) − F* decreases as O(1/k), into an algorithm with optimal O(1/k²) complexity [Nes83]. Accelerated variants include the dual accelerated proximal gradient [Nes05, Nes3], the accelerated proximal gradient method (APG) [Tse08] and FISTA [BT09]. Gradient-type methods, also called first-order methods, are often used to solve large-scale problems because of their good scalability and ease of implementation, which facilitates parallel and distributed computations.

When solving a convex problem whose objective function satisfies a local quadratic error bound (a generalization of strong convexity), classical non-accelerated gradient and coordinate descent methods automatically have a linear rate of convergence, i.e. F(x_k) − F* ∈ O((1 − µ)^k) for a problem-dependent 0 < µ < 1 [NNG, DL6]. In contrast, one needs to know the strong convexity parameter explicitly in order to set accelerated gradient and accelerated coordinate descent methods so that they attain a linear rate of convergence; see for instance [LS3, LMH5, LLX4, Nes, Nes3]. Setting the algorithm with an incorrect parameter may result in a slower algorithm, sometimes even slower than if we had not tried to set up an acceleration scheme at all [OC]. This is a major drawback of the method because, in general, the strong convexity parameter is difficult to estimate.

In the context of accelerated gradient methods with unknown strong convexity parameter, Nesterov [Nes3] proposed a restarting scheme which adaptively approximates the strong convexity parameter. The same idea

was exploited by Lin and Xiao [LX5] for sparse optimization. Nesterov [Nes3] also showed that, instead of deriving a new method designed to work better for strongly convex functions, one can restart the accelerated gradient method and get a linear convergence rate. However, the restarting frequency he proposed still depends explicitly on the strong convexity of the function, and so O'Donoghue and Candès [OC] introduced heuristics to adaptively restart the algorithm, which give good results in practice.

1.2 Contributions

In this paper, we show that, if the objective function is convex and satisfies a local quadratic error bound, we can restart accelerated gradient methods at any frequency and get a linearly convergent algorithm. The rate depends on an estimate of the quadratic error bound and we show that for a wide range of this parameter, one obtains a faster rate than without acceleration. In particular, we do not require this estimate to be smaller than the actual value. In that way, our result supports and explains the practical success of arbitrary periodic restart for accelerated gradient methods.

Then, as the rate of convergence depends on the match between the frequency and the quadratic error bound, we design a scheme to automatically adapt the frequency of restart from the observed decrease of the norm of the gradient mapping. The approach follows the lines of [Nes3, LX5, LY3]. We prove that, if our current estimate of the local error bound were correct, the norm of the gradient mapping would decrease at a prescribed rate. We just need to check this decrease and, when the test fails, we have a certificate that the estimate was too large. Our algorithm has a better theoretical bound than previously proposed methods for the adaptation to the quadratic error bound of the objective. In particular, we can make use of the fact that the norm of the gradient mapping will decrease even when we had a wrong estimate of the local error bound.

In Section 2 we recall the main convergence results for accelerated gradient methods and show that a fixed restart leads to a linear convergence rate. In Section 3, we present our adaptive restarting rule. Finally, we present numerical experiments on the Lasso and logistic regression problems in Section 4.

2 Accelerated gradient schemes

2.1 Problem and assumptions

We consider the following optimization problem:

    min_{x ∈ R^n} F(x) := f(x) + ψ(x),    (1)

where f : R^n → R is a differentiable convex function and ψ : R^n → R ∪ {+∞} is a proper, closed and convex function. We denote by F* the optimal value of (1) and assume that the optimal solution set X* is nonempty. Throughout the paper, ‖·‖ denotes the Euclidean norm. For any positive vector v ∈ R^n_+, we denote by ‖·‖_v the weighted Euclidean norm

    ‖x‖_v := ( Σ_{i=1}^n v_i x_i² )^{1/2},

and by dist_v(x, X*) the distance of x to the closed convex set X* with respect to the norm ‖·‖_v. In addition, we assume that ψ is simple, in the sense that the proximal operator defined as

    prox_{v,ψ}(x) := argmin_y { (1/2) ‖y − x‖_v² + ψ(y) }    (2)

is easy to compute for any positive vector v ∈ R^n_+. We also make the following smoothness and local quadratic error bound assumptions.

Assumption 1. There is a positive vector L ∈ R^n_+ such that

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (1/2) ‖x − y‖_L²,   for all x, y ∈ R^n.    (3)

Assumption 2. For any x_0 ∈ dom(F), there is µ_F(L, x_0) > 0 such that

    F(x) ≥ F* + (µ_F(L, x_0)/2) dist_L²(x, X*),   for all x ∈ [F ≤ F(x_0)],    (4)

where [F ≤ F(x_0)] denotes the set of all x such that F(x) ≤ F(x_0).

Assumption 2 is also referred to as a local quadratic growth condition. It is known that Assumptions 1 and 2 guarantee the linear convergence of the proximal gradient method [NNG, DL6] with complexity bound O(log(1/ɛ)/µ_F(L, x_0)).

2.2 Accelerated gradient schemes

We first recall in Algorithms 1 and 2 two classical accelerated proximal gradient schemes. For identification purposes, we refer to them respectively as FISTA (Fast Iterative Soft Thresholding Algorithm) [BT09] and APG (Accelerated Proximal Gradient) [Tse08]. As pointed out in [Tse08], the accelerated schemes were first proposed by Nesterov [Nes04].

Algorithm 1: x̂ ← FISTA(x_0, K) [BT09]
1: Set θ_0 = 1 and z_0 = x_0.
2: for k = 0, 1, ..., K do
3:   y_k = (1 − θ_k) x_k + θ_k z_k
4:   x_{k+1} = argmin_{x ∈ R^n} { ⟨∇f(y_k), x − y_k⟩ + (1/2) ‖x − y_k‖_L² + ψ(x) }
5:   z_{k+1} = z_k + (1/θ_k)(x_{k+1} − y_k)
6:   θ_{k+1} = ( sqrt(θ_k⁴ + 4θ_k²) − θ_k² ) / 2
7: end for
8: x̂ ← x_{K+1}

Algorithm 2: x̂ ← APG(x_0, K) [Tse08]
1: Set θ_0 = 1 and z_0 = x_0.
2: for k = 0, 1, ..., K do
3:   y_k = (1 − θ_k) x_k + θ_k z_k
4:   z_{k+1} = argmin_{z ∈ R^n} { ⟨∇f(y_k), z − y_k⟩ + (θ_k/2) ‖z − z_k‖_L² + ψ(z) }
5:   x_{k+1} = y_k + θ_k (z_{k+1} − z_k)
6:   θ_{k+1} = ( sqrt(θ_k⁴ + 4θ_k²) − θ_k² ) / 2
7: end for
8: x̂ ← x_{K+1}

Remark 1. We have written the algorithms in a unified framework to emphasize their similarities. Practical implementations usually consider only two variables: (x_k, y_k) for FISTA and (y_k, z_k) for APG.

Remark 2. The vector L used in the proximal operation step (line 4 of Algorithms 1 and 2) should satisfy Assumption 1. If such an L is not known a priori, we can incorporate a line search procedure, see for example [Nes07].

The simplest restarted accelerated gradient method has a fixed restarting frequency. This is Algorithm 3, which restarts Algorithm 1 or Algorithm 2 periodically. Here, the restarting period is fixed to be some integer K. In Section 3 we will also consider an adaptive restarting frequency with a varying restarting period.
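Before turning to restart strategies, the following minimal Python sketch makes Algorithm 1 concrete. It assumes the caller supplies the gradient of f and the weighted proximal operator prox_{v,ψ}; the names grad_f and prox_psi and the calling convention are illustrative, not part of the paper.

    import numpy as np

    def fista(x0, grad_f, prox_psi, L, K):
        """One run of Algorithm 1 (FISTA-type scheme), written as a sketch.

        grad_f(x)      -- gradient of the smooth part f at x
        prox_psi(x, v) -- weighted proximal operator prox_{v,psi}(x),
                          i.e. argmin_y 0.5*||y - x||_v^2 + psi(y)
        L              -- positive vector satisfying Assumption 1
        K              -- number of inner iterations before returning
        """
        x, z = np.array(x0, dtype=float), np.array(x0, dtype=float)
        theta = 1.0
        for _ in range(K):
            y = (1.0 - theta) * x + theta * z
            # line 4: proximal gradient step from y with weights L
            x_new = prox_psi(y - grad_f(y) / L, L)
            # line 5: update of the momentum variable z
            z = z + (x_new - y) / theta
            x = x_new
            # line 6: update of theta
            theta = (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2) / 2.0
        return x

The APG variant (Algorithm 2) differs only in lines 4-5: the proximal step is taken at z_k with weights θ_k L, i.e. z_{k+1} = prox_{θ_k L, ψ}(z_k − ∇f(y_k)/(θ_k L)), and x_{k+1} = y_k + θ_k(z_{k+1} − z_k).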

4 Algorithm 3 FixedRESx 0, K : for t = 0,,, do : x t+k APGx tk, K or x t+k x tk, K 3: end for.3 Convergence results for accelerated gradients methods.3. Basic results In this section we gather a list of known results, shared by and APG, which will be used later to build restarted methods. Although all the results presented in this subsection have been proved or can be derived easily from existing results, for completeness all the proof is given in Appendix. We first recall the following properties on the sequence {θ k }. θ 4 Lemma. The sequence θ k defined by θ 0 = and θ k+ = k +4θk θ k satisfies We shall also need the following relation of the sequences. k + θ k 5 k + θ k+ θk+ = θk, k = 0,,... 6 θ k+ θ k, k = 0,,..., 7 Lemma. The iterates of Algorithm and Algorithm satisfy for all k, x k+ = θ k x k + θ k z k+. 8 It is known that the sequence of objective values {F x k } generated by accelerated schemes, in contrast to that generated by proximal gradient schemes, does not decrease monotonically. However, this sequence is always upper bounded by the initial value F x 0. Following [LX5], we refer to this result as the non-blowout property of accelerated schemes. Proposition. The iterates of Algorithms and satisfy for all k, F x k F x 0. The non-blowout property of accelerated schemes can be found in many papers, see for example [LX5, WCP7]. It will be repeatedly used in this paper to derive the linear convergence rate of restarted methods. Finally recall the following fundamental property for accelerated schemes. Proposition. The iterates of Algorithms and satisfy for all k, θ k where x is an arbitrary point in X. F x k F + z k x L x 0 x L. 9 Again, Proposition is not new and can be easily derived from existing results, see Appendix for a proof. We derive from Proposition a direct corollary. Corollary. The iterates of Algorithm and Algorithm satisfy for all k, θ k F x k F + dist Lz k, X dist Lx 0, X. 0 Remark 3. All the results presented above hold without Assumption. 4
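Algorithm 3 itself is a one-line wrapper around the accelerated scheme. A minimal sketch, reusing the fista function above (the APG variant could be substituted):

    def fixed_restart(x0, grad_f, prox_psi, L, K, n_restarts):
        """Algorithm 3 sketch: restart the accelerated scheme every K iterations."""
        x = x0
        for _ in range(n_restarts):
            # each call starts again from theta_0 = 1, i.e. a restart
            x = fista(x, grad_f, prox_psi, L, K)
        return x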

5 .3. Conditional linear convergence We first derive directly from Corollary and Assumption the following decreasing property. Corollary. If ˆx is the output of Algorithm or Algorithm with input x 0, K and Assumptions and hold, then and F ˆx F dist L ˆx, X Proof. By Proposition and Assumption, µ F L, ˆx µ F L, x 0. Therefore, θ K θ K µ F L, x 0 F x 0 F, θ K µ F L, x 0 dist Lx 0, X. F ˆx F 0 dist Lx 0, X µ F L, x 0 F x 0 F, and µ F L, x 0 θ K dist L ˆx, X θk F ˆx F 0 dist Lx 0, X. Proposition 3. Let {x tk } t 0 be the sequence generated by Algorithm 3 with K. If Assumptions and hold, then we have θ F x tk F t K F x 0 F, t, µ F L, x 0 and Proof. From Corollary, we have and dist L x tk, X F t + K F dist L t + K, X By the non-blowout property stated in Proposition, Therefore, under Assumption, It follows that and F t + K F dist L t + K, X θ t K dist L x 0, X, t. µ F L, x 0 θ K µ F L, x tk F tk F, t 0, θ K µ F L, x tk dist LtK, X, t 0. F x tk F x t K F x 0. µ F L, x tk µ F L, x 0, t 0. θ K µ F L, x 0 F tk F, t 0, θ K µ F L, x 0 dist LtK, X, t 0. 5
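The contraction factor in Proposition 3 is driven by the sequence θ_k. As an aid to reading the bounds, here is a small helper (reused in later sketches) that evaluates θ_k numerically from its recursion; the sanity check only uses a loose, safe form of the Θ(1/k) bounds of Lemma 1, not the exact constants, which are hard to recover from this transcription.

    import math

    def theta_after(k):
        """theta_k obtained from theta_0 = 1 via the update rule of Algorithms 1 and 2."""
        theta = 1.0
        for _ in range(k):
            theta = (math.sqrt(theta**4 + 4.0 * theta**2) - theta**2) / 2.0
        return theta

    # loose sanity check (a weaker form of the O(1/k) bounds of Lemma 1)
    for k in (1, 5, 50):
        assert 1.0 / (k + 2) <= theta_after(k) <= 2.0 / (k + 1)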

6 Remark 4. Although Proposition 3 provides a guarantee of linear convergence when θ K /µ F L, x 0 <, it does not give any information in the case when K is too small to satisfy θ K /µ F L, x 0 <. For any µ > 0. Define: Kµ := e µ. 3 By the right inequality of 5, letting the restarting period K Kµ F L, x 0 implies θ K µ F L, x 0 e. 4 Therefore, if we know in advance µ F L, x 0 and restart Algorithm or every Kµ F L, x 0 iterations, then after computing Kµ F L, x 0 ln dist Lx 0, X = O µ F ɛ L, x 0 ln ɛ 5 number of proximal gradient mappings, we get a point x such that dist L x, X ɛ..3.3 Unconditional linear convergence We now prove a contraction result on the distance to the optimal solution set. Theorem. If ˆx is the output of Algorithm or Algorithm with input x 0, K and Assumptions and hold, then dist L ˆx, X min θ K µ F L, x 0, dist + µ F L,x L x 0, X. 6 0 Proof. For simplicity we denote µ F = µ F L, x 0. Note that by Assumption and Proposition we have for all k N, θ K F x k F µ F dist Lx k, X. 7 For all k N let us denote x k the projection of x k onto X. For all k N and σ k [0, ], we have xk+ x k+ = σ k xk+ x L k+ + σ k xk+ x L k+ σ k xk+ x k+ + σ k x L k+ θ k x 0 θ k x k L σ k F x k+ F + σ k θ k z k+ x µ F 0 L + σ k θ k x k x k L, where the first inequality follows from the convexity of X, and the second inequality follows from 8 and 7. Let us choose σ k = +θ k /µ F so that L σ k = σ k θ k. µ F θ k 6

7 We proceed as dist Lx k+, X σ k θ k θk F x k+ F + z k+ x 0 L 9 σ k Denote k = dist L x k, X. Remark that = + σ k θ k x k x k L θk x 0 x 0 L + θ k x k x k L θk + µ F /θ k dist Lx 0, X + θ k dist L x k, X 8 + µ F /θ µ F /θ 0 so that we may prove that k +0.5µ F /θk 0 by induction. Let us assume that k +0.5µ F /θk 0. Then using 8 Using 6 one can then easily check that 0 k+ θ k 0 + θ k k + µ F /θ k θ k θ k µ F /θ k + 0.5µ F /θk 0 = θ k + 0.5µ F /θk + θ k + µ F /θ k + 0.5µ F /θk 0 θ k + 0.5µ F /θ k + θ k + µ F /θ k + 0.5µ F /θ k + 0.5µ F /θ k 0 θ 3 k + µ F θ k and so Inequality 6 comes by combining this with Proposition 3. Theorem allows us to derive immediately an explicit linear convergence rate of Algorithm 3, regardless of the choice of K. Corollary 3. Let {x tk } t 0 be the sequence generated by Algorithm 3 with fixed restarting period K. If Assumptions and hold, then t dist L x tk, X min θ K µ F L, x 0, dist + µ F L,x L x 0, X. 0 By Corollary 3, the number of proximal mappings needed to reach an ɛ-accuracy on the distance is bounded by K log + µ F L,x 0 θ K θ K log dist Lx 0, X. ɛ In particular, if we choose K / µ F, then we get an iteration complexity bound O/ µ F log/ɛ. 9 7
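The restart period K(µ) attached to an estimate µ of the quadratic error bound is defined above; the display is garbled in this transcription, but the surrounding derivation (θ_K of order 2/K and the target θ_K²/µ ≤ e⁻²) suggests a period proportional to 1/√µ, roughly ⌈2e/√µ⌉. The helper below implements that reading and should be treated as an assumption; it reuses theta_after from the previous sketch.

    import math

    def restart_period(mu):
        """Restart period K(mu), reconstructed as ceil(2*e/sqrt(mu)) so that
        theta_K**2 / mu <= exp(-2). Assumption: the exact constant is not fully
        recoverable from the text."""
        K = math.ceil(2.0 * math.e / math.sqrt(mu))
        # optional safeguard: increase K until the target contraction really holds
        while theta_after(K)**2 / mu > math.exp(-2.0):
            K += 1
        return K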

8 Remark 5. In [WCP7], the local linear convergence of the sequence generated by with arbitrary fixed or adaptive restarting frequency was proved. Our Theorem not only yields the global linear convergence of such sequence, but also gives an explicit bound on the convergence rate. Also, note that although an asymptotic linear convergence rate can be derived from the proof of Lemma 3.6 in [WCP7], it can be checked that the asymptotic rate in [WCP7] is not as good as ours. In fact, an easy calculation shows that even restarting with optimal period K / µ F, their asymptotic rate only leads to the complexity bound O/µ F log/ɛ. Moreover, our restarting scheme is more flexible, because the internal block in Algorithm 3 can be replaced by any scheme which satisfies all the properties presented in Section Adaptive restarting of accelerated gradient schemes Although Theorem guarantees a linear convergence of the restarted method Algorithm 3, it requires the knowledge of µ F L, x 0 to attain the complexity bound 9. In this section, we show how to combine Corollary 3 with Nesterov s adaptive restart method, first proposed in [Nes07], in order to obtain a complexity bound close to 9 that does not depend on a guess on µ F L, x Bounds on gradient mapping norm We first show the following inequalities that generalize similar ones in [Nes07]. Hereinafter we define the proximal mapping: { T x := arg min fx, y x + } y R n y x L + ψy. Proposition 4. If Assumption and hold, then for any x R n, we have T x x L F x F T x F x F, 0 and dist L T x, X 4 µ F L, x x T x L, F T x F 8 x T x L. µ F L, x Proof. The inequality 0 follows directly from [Nes07, Theorem ]. By the convexity of F, for any q F T x and x X, q, T x x F T x F, q F T x, x X. Furthermore, by [Nes07, Theorem ], for any q F T x and x X, Therefore, x T x L T x x L q, T x x. By 0, we know that F T x F x and in view of 4, x T x L dist L T x, X F T x F. 3 F T x F µ F L, x Combining the latter two inequalities, we get: Plugging 4 back to 3 we get. x T x L µ F L, x dist LT x, X. dist L T x, X. 4 8
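The adaptive test of the next section is driven by the norm of the gradient mapping, i.e. the distance between x and T(x). A direct sketch, reusing the conventions of the earlier FISTA sketch (grad_f, prox_psi and the weight vector L):

    import numpy as np

    def prox_grad_map(x, grad_f, prox_psi, L):
        """T(x): one proximal gradient step from x, as defined above."""
        return prox_psi(x - grad_f(x) / L, L)

    def weighted_norm(u, L):
        """||u||_L = sqrt(sum_i L_i * u_i**2)."""
        return float(np.sqrt(np.sum(L * u**2)))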

9 Remark 6. Condition is usually referred to as an error bound condition. It was proved in [DL6] that under Assumption, the error bound condition is equivalent to the quadratic growth condition 4. Corollary 4. Suppose Assumptions and hold and denote µ F = µ F L, x 0. The iterates of Algorithm 3 satisfy for all t, T x tk x tk L θ K min θ K µ F, + µ F θ K Moreover, if there is u 0 R n such that x 0 = T u 0 and denote µ F = µ F L, u 0, then T x tk x tk L 6 µ F Proof. By 0 and 0 we have θ K min µ F θ K µ, F t dist L x 0, X. 5 + µ F θ K T x tk x tk L F x tk F θ K dist L x t K, X. Now applying Corollary 3 we get 5. If in addition x 0 = T u 0, then in view of, 4 dist L x 0, X µ F x 0 u 0 L. t x 0 u 0 L. 6 The second inequality 6 then follows from combining the latter inequality, 5 and the fact that µ F µ F. 3. Adaptively restarted algorithm Inequality 6 provides a way to test whether our guess on µ F L, x 0 is too large. Indeed, let x 0 = T u 0. Then if µ µ F L, u 0 µ F L, x 0 and we run Algorithm 3 with K = Kµ, then necessarily we have T x tk x tk L 6 µ θ t K θ min K µ µ, + µ x 0 u 0 L. 7 θk It is essential that both sides of 7 are computable, so that we can check this inequality for each estimate µ. If 7 does not hold, then we know that µ > µ F L, u 0. The idea was originally proposed by Nesterov in [Nes07] and later generalized in [LX5], where instead of restarting, they incorporate the estimate µ into the update of θ k. As a result, the complexity analysis only works for strongly convex objective function and seems not to hold under Assumption. Our adatively restarted algorithm is described in Algorithm 4. We start from an initial estimate µ 0, and restart Algorithm or with period Kµ s defined in 3. Note that by 4, min θ Ks µ s, + µs θks = θ K s µ s. At the end of each restarting period, we test condition 7, the opposite of which is given by the first inequality at Line 3. If it holds then we continue with the same estimate µ s and thus the same restarting period, otherwise we decrease µ s by one half and repeat. Our stopping criteria is based on the norm of proximal gradient, same as in related work [LX5, LY3]. 9
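Putting the pieces together, here is a schematic Python version of the adaptive restart procedure described above; the formal listing, Algorithm 4, is given next. It reuses fista, theta_after, restart_period, prox_grad_map and weighted_norm from the earlier sketches. The constant C_s and the exact form of the decrease test are reconstructed from the garbled listing (we read C_s = 16‖x_{s,0} − x_{s−1,t_{s−1}}‖_L²/µ_s and a test on the squared norm), so the details are assumptions rather than a verbatim transcription.

    import numpy as np

    def adaptive_restart(x0, grad_f, prox_psi, L, mu0, eps, max_halvings=60):
        """Schematic version of the adaptively restarted accelerated gradient method."""
        mu = mu0
        x_prev = np.array(x0, dtype=float)               # plays the role of x_{s-1,t_{s-1}}
        x = prox_grad_map(x_prev, grad_f, prox_psi, L)   # x_{s,0} = T(previous point)
        for _ in range(max_halvings):
            C = 16.0 * weighted_norm(x - x_prev, L)**2 / mu   # reconstructed constant C_s
            K = restart_period(mu)
            rate = theta_after(K)**2 / mu                     # expected per-restart contraction
            t = 0
            while True:
                x_next = fista(x, grad_f, prox_psi, L, K)     # one restart period
                t += 1
                Tx = prox_grad_map(x_next, grad_f, prox_psi, L)
                g = weighted_norm(Tx - x_next, L)
                if g <= eps:                                  # stopping test on ||T(x) - x||_L
                    return Tx
                if g**2 > C * rate**t:                        # decrease test failed: mu too large
                    break
                x = x_next
            # restart the outer loop with a halved estimate of the error bound
            x_prev, x = x_next, prox_grad_map(x_next, grad_f, prox_psi, L)
            mu /= 2.0
        return x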

10 Algorithm 4 ˆx, ŝ, ˆN x 0 : Parameters: ɛ, µ 0 : s, t s 0 3: x s,ts x 0 4: x s+,0 T x s,ts 5: repeat 6: s s + 7: C s 6 xs,0 xs,t s L µ s 8: K s Kµ s 9: t 0 0: repeat : x s,t+ Algorithm x s,t, K s or x s,t+ Algorithm x s,t, K s : t t + 3: until T x s,t x s,t L > C sθk /µ s s t or T x s,t x s,t L ɛ 4: t s t 5: x s+,0 T x s,ts 6: µ s+ µ s / 7: until x s+,0 x s,ts L ɛ ŝ 8: ˆx T x s,ts, ŝ s, ˆN + t s K s + s=0 Remark 7. Although Line 3 of Algorithm 4 requires to compute the proximal gradient mapping T x s,t, one should remark that this T x s,t is in fact given by the first iteration of Algorithm x s,t, K s or Algorithm x s,t, K s. Hence, except for t = t s, the computation of T x s,t does not incur additional computational cost. Therefore, the output ˆN of Algorithm 4 records the total number of proximal gradient mappings needed to get T x x L ɛ. We first show the following non-blowout property for Algorithm 4. Lemma 3. For any s ŝ and 0 t t s we have F T x s,t F x s,t F x 0. Proof. Since x s,t+ is the output of Algorithm or with input x s,t, we know by Proposition that By 0, F x s,ts F x s,ts... F x s,0. F x s+,0 = F T x s,ts F x s,ts F x s,0. The right inequality then follows by induction since x,0 = x 0. The left inequality follows from 0. Lemma 4. For any 0 s ŝ, if µ s µ F L, x 0, then θ t F x s,t F Ks F x s,0 F, t t s. 8 Proof. This is a direct application of Proposition 3 and Lemma 3. µ s From Lemma 3 and 4, we obtain immediately the following key results for the complexity bound of Algorithm 4. 0

11 Corollary 5. Let C s be the constant defined in Line 7 of Algorithm 4. We have If for any 0 s ŝ we have µ s µ F L, x 0, then and C s 3F x 0 F µ s, s 0. 9 θ t T x s,t x s,t L C Ks s, 0 t t s, 30 µ s T x s,t x s,t L e t F x 0 F, 0 t t s. 3 Proof. The bound on C s follows from 0 applied at x s,ts and Lemma 3. The second bound can be derived from 0, and Lemma 4. For the third one, it suffices to apply Lemma 3, Lemma 4 together with 0 and the fact that θ K s µ s e, s 0. 3 Proposition 5. Consider Algorithm 4. If for any 0 s ŝ we have µ s µ F L, x 0, then ŝ = s and t s ln F x 0 F. 33 ɛ Proof. If µ s µ F L, x 0, by 30, when the inner loop terminates we necessarily have Therefore ŝ = s. Then, 33 is derived from 3. T x s,t x s,t L ɛ. Theorem. Suppose Assumptions and hold. Consider the adaptively restarted accelerated gradient Algorithm 4. If the initial estimate of the local quadratic error bound satisfies µ 0 µ F L, x 0 then the number of iterations ˆN is bounded by e ˆN ln F x 0 F µ 0 ɛ If µ 0 > µ F L, x 0, then µ 0 ˆN + log + µ F L, x 0 e µf L, x 0 + e ln F x 0 F ɛ µ F L, x 0 ln 3F x 0 F µ 0 ɛµ F L, x Proof. The first case is a direct application of Proposition 5. Let us now concentrate on the case µ 0 > µ F L, x 0. For simplicity we denote µ F = µ F L, x 0. Define l = log µ 0 log µ F. Then µ l µ F and by Proposition 5, we know that ŝ l and if ŝ = l, t l ln F x 0 F. 36 ɛ

12 Now we consider t s for 0 s l. Note that ɛ < T x s,t x s,t L C sθ K s /µ s t cannot hold for t satisfying By 9, In view of 3 and 38, 37 holds for any t 0 such that Therefore, By the definition 3, Therefore, C s θ K s /µ s t ɛ. 37 C s 3F x 0 F µ F, 0 s l. 38 t s 3F x 0 F e t ɛµ F. ln 3F x 0 F, 0 s ɛµ l. 39 F l l e s e K K l = µ s s=0 = e/µ 0 l l t s K s + e µ F s=0 Combining 36 and 40 we get ˆN + l k=0 + l + + t s K s + e e µ l Then 35 follows by noting that µ l µ F /. s=0 e µ F µ 0 µ F µ 0 ln F x 0 F ɛ µ 0 µ 0 ln 3F x 0 F + ɛµ l 40 F ln 3F x 0 F ɛµ F In Table 3. we compare the worst-case complexity bound of four algorithms which adaptively restart accelerated algorithms. Note that the algorithms proposed by Nesterov [Nes3] and by Lin & Xiao [LX5] require strong convexity of F. However, the algorithm of Liu & Yang [LY3] also applies to the case when local Hölderian error bound condition holds. The latter condition requires the existence of θ and a constant µ F L, x 0, θ > 0 such that F x F + µ F L, x 0, θ dist L x, X θ, x [F F x 0 ]. 4 When θ =, we recover the local quadratic growth condition. In this case, the algorithm of Liu & Yang [LY3] has a complexity bound Õ/ µ F where Õ hides logarithm terms. We can see that the analysis of our algorithm leads to a worst case complexity that is log/µ F s better than previous work. This will be illustrated also in the experiment section.

13 Algorithm Complexity bound Assumption Nesterov [Nes3] O µf ln µ F ln µ F ɛ strong convexity Lin & Xiao [LX5] O µf ln µ F ln µ F ɛ strong convexity Liu & Yang [LY3] O µf ln µ F ln ɛ Hölderian error bound 4 This work O µf ln µ F ɛ local quadratic error bound 4 Table : Comparison of the iteration complexity of accelerated gradient methods with adaptation to the local error bound. 3.3 A stricter test condition The test condition Line 3 in Algorithm 4 can be further strengthened as follows. For simplicity denote µ F = µ F L, x 0 and θ α s µ := min Ks µ, + µ. θks Let any 0 s s ŝ, and t t s. Then for the same reason as 6, we have T x s,t x s,t L 6 µ F θ Ks µ F α s µ F t s j=s α j µ F tj x s,0 x s,t s L. This suggests to replace Line 7 of Algorithm 4 by C s 6 s min α µ s j µ s tj x s,0 x s,t s L : 0 s s. 4 j=s As we only decrease the value of C s, all the theoretical analysis holds and Theorem is still true with the new C s defined in 4. Moreover, this change allows to identify more quickly a too large µ s and thus can improve the practical performance of the algorithm. Furthermore, if we find that T x s,t x s,t L > C s θ K s /µ s t, then before running Algorithm or Algorithm with µ s+ := µ s /, we can first do a test on µ s+, i.e., check the condition T x s,t x s,t L { 6 θ min Ks α s µ s+ t µ s+ s j=s µ s+ α j µ s+ tj x s,0 x s,t s L : 0 s s. 43 If 43 holds, then we go to Line 0. Otherwise, µ s+ is still too large and we decrease it further by one half. 3.4 Looking for an ɛ-solution Instead of an x such that T x x ɛ, we may be interested in an x such that F x F ɛ, that is an ɛ -solution. 3

14 In view of, if x T x L ɛ µ F L,x 8, then F T x F ɛ. As a result, except from the fact that we cannot terminate the algorithm in Line 7, Algorithm 4 is applicable with ɛ = ɛ µ F L,x 8. We then will obtain an ɛ -solution after a number of iterations at most equal to e + µ F L, x 0 ln 8 F x 0 F ɛ µ F L, x e µf L, x 0 µ 0 ln F x 0 F ɛ µ F l, x 0 Note that compared to the result of Theorem, we only add a constant factor. 4 Numerical experiments In this section we present some numerical results to demonstrate the effectiveness of the proposed algorithms. We apply Algorithm 4 to solve regression and classification problems which typically take the form of min F x := gax + ψx 44 x Rn where A R m n, g : R m R has Lipschitz continuous gradient and ψ : R n R {+ } is simple. The model includes in particular the L -regularized least squares problem Lasso and the L -L -regularized logistic regression problem. Note that the following problem is dual to 44, max Gy := y R ψ A y g y, m where g resp. ψ denotes the convex conjugate function of g resp. ψ. We define the primal dual gap associated to a point x domψ as: F x G αxa gax 45 where αx [0, ] is chosen as the largest α [0, ] such that G αa gax < +. Note that x domψ is an optimal solution of 44 if and only if the associated primal dual gap 45 equals 0. We compare five methods: Gradient Descent, [BT09], [LX5], [LY3], and Algorithm 4 using in Line. We plot the primal dual gap 45 versus running. Note that and do not depend on the initial guess of the value µ F L, x Lasso problem We present in Figure the experimental results for solving the L -regularised least squares problem Lasso: min x R n Ax b + A b x λ, 46 on the dataset cpusmall scale with n = and m = 89. The value L is set to be tracea A. We test with λ = 0 4, λ = 0 5 and λ = 0 6. For each value of λ, we vary the initial guess µ 0 from 0. to 0 5. Compared with AdaAPG and, Algorithm 4 seems to be more efficient and less sensitive to the guess of µ F. 4
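For concreteness, the Lasso experiment can be wired to the earlier sketches as follows. The objective is the ℓ1-regularized least squares problem of (46); the exact scaling of the regularization weight used in the paper is not fully recoverable from this transcription, so a plain λ‖x‖₁ term is used here, and the coordinate-wise constant is taken as L_i = trace(AᵀA) as stated above. The function names, the value of λ and the random data standing in for cpusmall are illustrative.

    import numpy as np

    def make_lasso(A, b, lam):
        """Pieces of problem (46): f(x) = 0.5*||Ax - b||^2, psi(x) = lam*||x||_1."""
        L = np.full(A.shape[1], np.trace(A.T @ A))   # value L = trace(A^T A) used in the text

        def grad_f(x):
            return A.T @ (A @ x - b)

        def prox_psi(x, v):
            # weighted soft-thresholding: argmin_y 0.5*||y - x||_v^2 + lam*||y||_1
            return np.sign(x) * np.maximum(np.abs(x) - lam / v, 0.0)

        return grad_f, prox_psi, L

    # illustrative usage with random data standing in for the cpusmall dataset
    A = np.random.randn(200, 50)
    b = np.random.randn(200)
    grad_f, prox_psi, L = make_lasso(A, b, lam=0.1)
    x = adaptive_restart(np.zeros(50), grad_f, prox_psi, L, mu0=0.1, eps=1e-6)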

15 4. Logistic regression We present in Figure the experimental results for solving the L -L regularized logistic regression problem: λ m min x R n A log + expb j a j x + x b + λ x, 47 j= on the dataset dorothea with n = 00, 000 and m = 800. In our experiments, we set where λ λ = L := 8 A b L 0n n j= i= m b j A ij is an upper bound of the Lipschitz constant of the function λ m fx A log + expb j a j x. b j= Thus µ F /0n = 0 6. We test with λ = 0, λ = 00 and λ = 000. For each value of λ, we vary the initial guess µ 0 from 0.0 to 0 6. On this problem Algorithm 4 also outperforms AdaAPG and in all the cases. 5 Conclusion In this work, we show that global linear convergence is guaranteed if we restart at any frequency accelerated gradient methods under a local quadratic growth condition. We then propose an adaptive restarting strategy based on the decrease of the norm of proximal gradient mapping. Compared with similar methods dealing with unknown local error bound condition number, our algorithm has a better worst-case complexity bound and practical performance. Our algorithm can be further extended to a more general setting when Hölderian error bound 4 is satisfied. Another avenue of research is that the accelerated coordinate descent method [FR5] faces the same issue as full gradient methods: to get an accelerated rate of convergence, on needs to estimate the strong convexity coefficient [LLX4]. In [FQ6], an algorithm with fixed periodic restart was proposed. We may also consider adaptive restart for the accelerated coordinate descent method, to get more efficiency in large-scale computation. A proof of Lemma, Lemma, Proposition and Proposition proof of Lemma. The equation 6 holds because θ k+ is the unique positive square root to the polynomial P X = X + θk X θ k. 7 is a direct consequence of 6. Let us prove 5 by induction. It is clear that θ Assume that θ k k+. We know that P θ k+ = 0 and that P is an increasing function on [0, + ]. So we just need to show that P k P = k + + k k + + θ k θk As θ k k+ and k++ 0, P k k k k + 4 = k + + k

[Figure 1: grid of plots of the primal-dual gap (log scale) against time for the Lasso problem (46) on the dataset cpusmall, for several values of λ and of the initial estimate µ_0.]

Figure 1: Experimental results on the Lasso problem (46) and the dataset cpusmall. Column-wise: we solve the same problem with a different a priori on the quadratic error bound. Row-wise: we use the same a priori on the quadratic error bound but the weight of the ℓ1-norm is varying.

[Figure 2: grid of plots of the primal-dual gap (log scale) against time for the logistic regression problem (47) on the dataset dorothea, for several values of λ and of the initial estimate µ_0.]

Figure 2: Experimental results on the logistic regression problem (47) and the dataset dorothea. Column-wise: we solve the same problem with a different a priori on the quadratic error bound. Row-wise: we use the same a priori on the quadratic error bound but the weight of the regularization is varying.

18 For the other inequality, 0+ θ 0. We now assume that θ k k+ but that θ k+ < k++. Remark that x x/x is strictly decreasing for x 0,. Then, using 6, we have This is equivalent to k + + k + + < θ k+ θ k+ k + + < which obviously does not hold for any k 0. So θ k+ k++. 6 = θ k k +. proof of Lemma. The relation 8 follows by combining line 3 and line 5 of Algorithm and Algorithm. proof of Proposition. can be written equivalently using only x k, y k and where β k = θ k θ k. y k = x k + β k x k x k F x k+ = fy k + x k+ y k + ψx k+ fy k + fy k, x k+ y k + x k+ y k v + ψx k+ = fy k + fy k, x k y k + fy k, x k+ x k + x k+ y k v + ψx k+ fx k + fy k, x k+ x k + x k+ y k v + ψx k+ The update of implies the existence of ξ ψx k+ such that Therefore, fy k + v x k+ y k + ξ = 0. F x k+ fx k + ξ + v x k+ y k, x k x k+ + x k+ y k v + ψx k+ F x k + v x k+ y k, x k x k+ + x k+ y k v = F x k + x k y k v x k x k+ v = F x k + β k x k x k v x k x k+ v F x k + x k x k v x k x k+ v. By applying the last inequality recursively we get F x k F x 0. Next consider the iterates of APG. By [Tse08, Proposition ], for any x R n, the iterates of APG satisfy the following property: By taking x = x k we obtain: F x k+ F x + θ k x z k+ v θ k F x k F x + θ k x z k v. F x k+ F x k θ k x k z k v θ k x k z k+ v 8

19 The update of APG implies: which yields Therefore, for any k, x k+ = θ k x k + θ k z k+, x k+ z k+ = θ k x k z k+. F x k+ F x k θ k x k z k θk v θ k x k+ z k+ v 48 Applying 48 recursively and using the decreasing property 7 we obtain: F x k+ F x 0 x 0 z v 0. proof of Proposition. We first prove 9 for. Let any x X. Since z k+ = z k + θ k x k+ y k = θ k x k+ θ k x k, Inequality 9 is a simple consequence of Lemma 4. in [BT09] Note that we have a shift of indices for our variables x k+, z k+ vs x k, u k in [BT09]. To prove 9 for APG, we notice that it is a special case of Theorem 3 in [FR5]. Acknowledgement The first author s work was supported by the EPSRC Grant EP/K035X/ Accelerated Coordinate Descent Methods for Big Data Optimization, the Centre for Numerical Algorithms and Intelligent Software funded by EPSRC grant EP/G03636/ and the Scottish Funding Council, the Orange/Telecom ParisTech think tank Phi-TAB and the ANR grant ANR--LABX-0056-LMH, LabEx LMH as part of the Investissement d avenir project. The second author s work was supported by Hong Kong Research Grants Council Early Career Scheme This research was conducted using the HKU Information Technology Services research computing facilities that are supported in part by the Hong Kong UGC Special Equipment Grant SEG HKU09. References [BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, :83 0, 009. [DL6] D. Drusvyatskiy and A.S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. arxiv: , 06. [FQ6] Olivier Fercoq and Zheng Qu. Restarting accelerated gradient methods with a rough strong convexity estimate. arxiv preprint arxiv: , 06. [FR5] [LLX4] Olivier Fercoq and Peter Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization, 54:997 03, 05. Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages , 04. [LMH5] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 8, pages Curran Associates, Inc., 05. 9

20 [LS3] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Foundations of Computer Science FOCS, 03 IEEE 54th Annual Symposium on, pages IEEE, 03. [LX5] [LY3] Qihang Lin and Lin Xiao. An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Computational Optimization and Applications, 603: , 05. Mingrui Liu and Tianbao Yang. Adaptive accelerated gradient converging methods under holderian error bound condition. Preprint arxiv: , 03. [Nes83] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O/k. Soviet Mathematics Doklady, 7:37 376, 983. [Nes04] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course Applied Optimization. Kluwer Academic Publishers, 004. [Nes05] Yurii Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 03:7 5, 005. [Nes07] [Nes] [Nes3] Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical Report , CORE Discussion Papers, September 007. Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, :34 36, 0. Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 40:5 6, 03. [NNG] Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 0. [OC] [Tse08] Brendan O Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, pages 8, 0. Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 008. [WCP7] Bo Wen, Xiaojun Chen, and Ting Kei Pong. Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. SIAM Journal on Optimization, 7:4 45, 07. 0


Lecture 1: September 25 0-725: Optimization Fall 202 Lecture : September 25 Lecturer: Geoff Gordon/Ryan Tibshirani Scribes: Subhodeep Moitra Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

SHARPNESS, RESTART AND ACCELERATION

SHARPNESS, RESTART AND ACCELERATION SHARPNESS, RESTART AND ACCELERATION VINCENT ROULET AND ALEXANDRE D ASPREMONT ABSTRACT. The Łojasievicz inequality shows that sharpness bounds on the minimum of convex optimization problems hold almost

More information

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve

More information

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Global convergence of the Heavy-ball method for convex optimization

Global convergence of the Heavy-ball method for convex optimization Noname manuscript No. will be inserted by the editor Global convergence of the Heavy-ball method for convex optimization Euhanna Ghadimi Hamid Reza Feyzmahdavian Mikael Johansson Received: date / Accepted:

More information

arxiv: v2 [math.oc] 21 Nov 2017

arxiv: v2 [math.oc] 21 Nov 2017 Proximal Gradient Method with Extrapolation and Line Search for a Class of Nonconvex and Nonsmooth Problems arxiv:1711.06831v [math.oc] 1 Nov 017 Lei Yang Abstract In this paper, we consider a class of

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization Qihang Lin Lin Xiao February 4, 014 Abstract We consider optimization problems with an objective function

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

Complexity of gradient descent for multiobjective optimization

Complexity of gradient descent for multiobjective optimization Complexity of gradient descent for multiobjective optimization J. Fliege A. I. F. Vaz L. N. Vicente July 18, 2018 Abstract A number of first-order methods have been proposed for smooth multiobjective optimization

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

Random coordinate descent algorithms for. huge-scale optimization problems. Ion Necoara

Random coordinate descent algorithms for. huge-scale optimization problems. Ion Necoara Random coordinate descent algorithms for huge-scale optimization problems Ion Necoara Automatic Control and Systems Engineering Depart. 1 Acknowledgement Collaboration with Y. Nesterov, F. Glineur ( Univ.

More information

On the acceleration of the double smoothing technique for unconstrained convex optimization problems

On the acceleration of the double smoothing technique for unconstrained convex optimization problems On the acceleration of the double smoothing technique for unconstrained convex optimization problems Radu Ioan Boţ Christopher Hendrich October 10, 01 Abstract. In this article we investigate the possibilities

More information

ON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD

ON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD ON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD TIM HOHEISEL, MAXIME LABORDE, AND ADAM OBERMAN Abstract. In this article we study the connection

More information

1. Introduction. We consider the following constrained optimization problem:

1. Introduction. We consider the following constrained optimization problem: SIAM J. OPTIM. Vol. 26, No. 3, pp. 1465 1492 c 2016 Society for Industrial and Applied Mathematics PENALTY METHODS FOR A CLASS OF NON-LIPSCHITZ OPTIMIZATION PROBLEMS XIAOJUN CHEN, ZHAOSONG LU, AND TING

More information

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

5. Subgradient method

5. Subgradient method L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information