On Acceleration with Noise-Corrupted Gradients


A. Omitted Proofs from Section 3

Proof of Lemma 3.2. Let $m_k(u) = \sum_{i=1}^k a_i \langle \widehat{\nabla} f(x_i), u - x_i \rangle + D_\psi(u, x_0)$ denote the function under the minimum in the lower bound. By Proposition 3.1, $v_k = \nabla \psi^*(z_k) = \arg\min_{u \in X} m_k(u)$. Observe that $m_k(u) = a_k \langle \widehat{\nabla} f(x_k), u - x_k \rangle + m_{k-1}(u)$. By the definition of Bregman divergence:

$$m_{k-1}(v_k) = m_{k-1}(v_{k-1}) + \langle \nabla m_{k-1}(v_{k-1}), v_k - v_{k-1} \rangle + D_{m_{k-1}}(v_k, v_{k-1}).$$

As Bregman divergence is blind to linear and zero-order terms, we have that $D_{m_{k-1}}(v_k, v_{k-1}) = D_\psi(v_k, v_{k-1})$. By Proposition 3.1, $v_{k-1} = \arg\min_{u \in X} m_{k-1}(u)$, and hence $\langle \nabla m_{k-1}(v_{k-1}), v_k - v_{k-1} \rangle \geq 0$. Therefore,

$$m_k(v_k) \geq m_{k-1}(v_{k-1}) + a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle + D_\psi(v_k, v_{k-1}).$$

Using the definition of $\widehat{\nabla} f(x_k) = \nabla f(x_k) + \eta_k$, the change in the lower bound is:

$$A_k L_k - A_{k-1} L_{k-1} \geq a_k f(x_k) + a_k \langle \nabla f(x_k), v_k - x_k \rangle + D_\psi(v_k, v_{k-1}) - a_k \langle \eta_k, x^* - v_k \rangle. \tag{A.1}$$

For the change in the upper bound, we have:

$$A_k U_k - A_{k-1} U_{k-1} = A_k f(y_k) - A_{k-1} f(y_{k-1}) = a_k f(x_k) + A_k \big( f(y_k) - f(x_k) \big) + A_{k-1} \big( f(x_k) - f(y_{k-1}) \big). \tag{A.2}$$

By convexity of $f(\cdot)$:

$$f(x_k) - f(y_{k-1}) \leq \langle \nabla f(x_k), x_k - y_{k-1} \rangle. \tag{A.3}$$

Combining (A.1)-(A.3) and (AGD+):

$$A_k G_k - A_{k-1} G_{k-1} \leq a_k \langle \eta_k, x^* - v_k \rangle - D_\psi(v_k, v_{k-1}) + A_k \big( f(y_k) - f(x_k) - \langle \nabla f(x_k), y_k - x_k \rangle \big),$$

as claimed.

B. AGD+ for Smooth and Strongly Convex Minimization

Here we show that AGD+ can be extended to the setting of smooth and strongly convex minimization. As is customary (Nesterov, 2013), in this setting we assume that $\|\cdot\| = \|\cdot\|_2$, so that $f(\cdot)$ is $L$-smooth and $\mu$-strongly convex w.r.t. the $\ell_2$ norm, for $L < \infty$ and $\mu > 0$. To distinguish from the non-strongly-convex case, we refer to AGD+ for smooth and strongly convex minimization as $\mu$AGD+.

To analyze AGD+ in this setting, we need to use a stronger lower bound $L_k$, which is constructed by the same arguments as before, but now using strong convexity instead of regular convexity. Such a construction gives:

$$L_k = \frac{\sum_{i=1}^k a_i f(x_i) + \min_{u \in X} m_k(u) - \sum_{i=1}^k a_i \langle \eta_i, x^* - x_i \rangle - D_\psi(x^*, x_0)}{A_k}, \tag{B.1}$$

where $m_k(u) = \sum_{i=1}^k a_i \big( \langle \widehat{\nabla} f(x_i), u - x_i \rangle + \frac{\mu}{2}\|u - x_i\|^2 \big) + D_\psi(u, x_0)$. While it suffices to have $\psi$ be an arbitrary function that is strongly convex w.r.t. $\|\cdot\|$, for simplicity, we take $\psi(x) = \frac{\mu_0}{2}\|x\|^2$, where $\mu_0$ will be specified later. For $\theta_k = a_k / A_k$, the algorithm can now be stated as follows:

$$v_k = \arg\min_{u \in X} m_k(u), \qquad x_k = \frac{1}{1 + \theta_k}\, y_{k-1} + \frac{\theta_k}{1 + \theta_k}\, v_{k-1}, \qquad y_k = (1 - \theta_k)\, y_{k-1} + \theta_k v_k, \tag{$\mu$AGD+}$$

where $x_1 = x_0 = y_0 = v_0$ is an arbitrary initial point from $X$.
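For concreteness, the following is a minimal Python sketch of the ($\mu$AGD+) iteration in the unconstrained Euclidean case, where $\arg\min_u m_k(u)$ has a closed form (obtained by setting $\nabla m_k(u) = 0$). The step-weight recursion enforcing $a_k/A_k = \sqrt{\mu/L}$, the Gaussian noise model, and the toy quadratic are our illustration choices; this is a sketch, not the paper's reference implementation.

```python
import numpy as np

def mu_agd_plus(grad, x0, L, mu, K, noise_std=0.0, rng=None):
    """Minimal sketch of the (muAGD+) iteration: unconstrained X, Euclidean
    psi(x) = (mu0/2)||x||^2 with mu0 = a_1 (L - mu), and a noisy oracle
    modeled as grad f(x) + eta, eta ~ N(0, noise_std^2 I). Weights follow
    a_1 = 1 and a_k / A_k = gamma = sqrt(mu / L) for k >= 2."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.sqrt(mu / L)
    mu0 = L - mu                       # a_1 = 1
    y, v = x0.copy(), x0.copy()        # y_0 = v_0 = x_0
    A = 0.0
    sum_ag = np.zeros_like(x0)         # running sum of a_i * ghat_i
    sum_ax = np.zeros_like(x0)         # running sum of a_i * x_i
    for k in range(1, K + 1):
        a = 1.0 if k == 1 else gamma * A / (1.0 - gamma)  # gives a/(A+a) = gamma
        A += a
        theta = a / A
        x = (y + theta * v) / (1.0 + theta)               # x_k; equals x_0 at k = 1
        ghat = grad(x) + noise_std * rng.standard_normal(x.shape)
        sum_ag += a * ghat
        sum_ax += a * x
        # Closed-form v_k = argmin_u m_k(u) for unconstrained X:
        v = (mu * sum_ax + mu0 * x0 - sum_ag) / (mu * A + mu0)
        y = (1.0 - theta) * y + theta * v                 # y_k
    return y

# Example on a toy quadratic f(x) = 0.5 * x' Q x (assumed test problem):
Q = np.diag(np.linspace(0.1, 1.0, 20))
yK = mu_agd_plus(lambda x: Q @ x, np.ones(20), L=1.0, mu=0.1, K=200)
print(0.5 * yK @ Q @ yK)   # should be close to 0
```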

As before, the main convergence argument is to show that $A_k G_k - A_{k-1} G_{k-1} \leq E^\eta_k$, and to combine it with a bound on the initial gap $G_1$. We start with bounding the initial gap, as follows.

Proposition B.1. If $\psi(x) = \frac{\mu_0}{2}\|x\|^2$, where $\mu_0 = a_1 (L - \mu)$, then

$$A_1 G_1 \leq \frac{a_1 (L - \mu)}{2}\|x^* - x_0\|^2 + E^\eta_1, \quad \text{where } E^\eta_1 = a_1 \langle \eta_1, x^* - v_1 \rangle.$$

Proof. As $x_1 = x_0$, the initial lower bound is:

$$A_1 L_1 = a_1 f(x_1) + a_1 \langle \widehat{\nabla} f(x_1), v_1 - x_1 \rangle + \frac{a_1 \mu + \mu_0}{2}\|v_1 - x_1\|^2 - \frac{\mu_0}{2}\|x^* - x_0\|^2 - a_1 \langle \eta_1, x^* - x_1 \rangle.$$

As $a_1 = A_1$, it follows that $y_1 = v_1$, and hence:

$$A_1 U_1 = a_1 f(v_1) \leq a_1 f(x_1) + a_1 \langle \nabla f(x_1), v_1 - x_1 \rangle + \frac{a_1 L}{2}\|v_1 - x_1\|^2 = a_1 f(x_1) + a_1 \langle \widehat{\nabla} f(x_1), v_1 - x_1 \rangle + \frac{a_1 L}{2}\|v_1 - x_1\|^2 - a_1 \langle \eta_1, v_1 - x_1 \rangle,$$

where the inequality is by the smoothness of $f(\cdot)$. Combining the bounds on the initial upper and lower bounds, it follows that:

$$A_1 G_1 \leq \frac{\mu_0}{2}\|x^* - x_0\|^2 + \frac{a_1 L - a_1 \mu - \mu_0}{2}\|v_1 - x_1\|^2 + a_1 \langle \eta_1, x^* - v_1 \rangle = \frac{a_1 (L - \mu)}{2}\|x^* - x_0\|^2 + E^\eta_1,$$

as, by the initial assumption, $\mu_0 = a_1 (L - \mu)$, and $E^\eta_1 = a_1 \langle \eta_1, x^* - v_1 \rangle$.

To bound the change in the lower bound, it is useful to first bound $m_k(v_k) - m_{k-1}(v_{k-1})$, as in the following technical proposition.

Proposition B.2. Let $\psi(x) = \frac{a_1 (L - \mu)}{2}\|x\|^2$. Then:

$$m_k(v_k) - m_{k-1}(v_{k-1}) \geq a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle + \frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2 + \frac{a_k \mu}{2}\|v_k - x_k\|^2.$$

Proof. Observe that, by the definition of $m_k(\cdot)$,

$$m_k(v_k) = m_{k-1}(v_k) + a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle + \frac{a_k \mu}{2}\|v_k - x_k\|^2.$$

The rest of the proof bounds $m_{k-1}(v_k) - m_{k-1}(v_{k-1})$. Observe that, as $v_{k-1} = \arg\min_{u \in X} m_{k-1}(u)$, it must be that $\langle \nabla m_{k-1}(v_{k-1}), u - v_{k-1} \rangle \geq 0$, $\forall u \in X$. As Bregman divergence is blind to linear terms, and as $m_{k-1}(\cdot)$ is $\big(A_{k-1}\mu + a_1(L - \mu)\big)$-strongly convex:

$$m_{k-1}(v_k) - m_{k-1}(v_{k-1}) = \langle \nabla m_{k-1}(v_{k-1}), v_k - v_{k-1} \rangle + D_{m_{k-1}}(v_k, v_{k-1}) \geq \frac{A_{k-1}\mu + a_1(L - \mu)}{2}\|v_k - v_{k-1}\|^2 \geq \frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2.$$
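As a quick numerical sanity check of this construction (our own, on hypothetical random data), the closed-form minimizer of $m_k(\cdot)$ for the quadratic $\psi$ above can be compared against a black-box minimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Sanity check (ours, on random data): for the quadratic psi above, the
# minimizer of
#   m_k(u) = sum_i a_i (<g_i, u - x_i> + (mu/2)||u - x_i||^2) + (mu0/2)||u - x0||^2
# has the closed form
#   v_k = (mu * sum_i a_i x_i + mu0 * x0 - sum_i a_i g_i) / (mu * A_k + mu0).
rng = np.random.default_rng(0)
n, k, mu, mu0 = 5, 4, 0.3, 2.0
a = rng.uniform(0.5, 2.0, size=k)
xs = rng.standard_normal((k, n))   # stand-ins for the query points x_i
gs = rng.standard_normal((k, n))   # stand-ins for the noisy gradients
x0 = rng.standard_normal(n)

def m(u):
    terms = sum(a[i] * (gs[i] @ (u - xs[i]) + 0.5 * mu * np.sum((u - xs[i]) ** 2))
                for i in range(k))
    return terms + 0.5 * mu0 * np.sum((u - x0) ** 2)

v_closed = (mu * (a[:, None] * xs).sum(axis=0) + mu0 * x0
            - (a[:, None] * gs).sum(axis=0)) / (mu * a.sum() + mu0)
v_num = minimize(m, x0).x
print(np.allclose(v_closed, v_num, atol=1e-5))   # expect: True
```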

We are now ready to move to the main part of the convergence argument, namely, to show that $A_k G_k - A_{k-1} G_{k-1} \leq E^\eta_k$ for a certain choice of $a_k$.

Lemma B.3. Let $\psi(x) = \frac{a_1 (L - \mu)}{2}\|x\|^2$ and $0 < \frac{a_k}{A_k} \leq \sqrt{\frac{\mu}{L}}$. Then:

$$A_k G_k - A_{k-1} G_{k-1} \leq E^\eta_k, \quad \text{where } E^\eta_k = a_k \langle \eta_k, x^* - v_k \rangle.$$

Proof. As $U_k = f(y_k)$, we have that:

$$A_k U_k - A_{k-1} U_{k-1} = A_k f(y_k) - A_{k-1} f(y_{k-1}). \tag{B.2}$$

Using Proposition B.2, the change in the lower bound is:

$$A_k L_k - A_{k-1} L_{k-1} \geq a_k f(x_k) + a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle - a_k \langle \eta_k, x^* - x_k \rangle + \frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2 + \frac{a_k \mu}{2}\|v_k - x_k\|^2. \tag{B.3}$$

Denote $w_k = \frac{A_{k-1} v_{k-1} + a_k x_k}{A_k}$. By Jensen's inequality:

$$\frac{A_{k-1}\mu}{2}\|v_k - v_{k-1}\|^2 + \frac{a_k \mu}{2}\|v_k - x_k\|^2 \geq \frac{A_k \mu}{2}\|v_k - w_k\|^2. \tag{B.4}$$

Write $a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle$ as:

$$a_k \langle \widehat{\nabla} f(x_k), v_k - x_k \rangle = a_k \langle \nabla f(x_k), v_k - w_k \rangle + \frac{a_k}{A_k} \big\langle \nabla f(x_k), A_{k-1}(v_{k-1} - x_k) \big\rangle + a_k \langle \eta_k, v_k - x_k \rangle. \tag{B.5}$$

As $\frac{a_k}{A_k} \leq \sqrt{\frac{\mu}{L}}$ and by smoothness of $f(\cdot)$:

$$a_k \langle \nabla f(x_k), v_k - w_k \rangle \geq A_k \Big( f\big(x_k + \tfrac{a_k}{A_k}(v_k - w_k)\big) - f(x_k) \Big) - \frac{A_k \mu}{2}\|v_k - w_k\|^2. \tag{B.6}$$

Combining (B.3)-(B.6), we have the following bound for the change in the lower bound:

$$A_k L_k - A_{k-1} L_{k-1} \geq a_k f(x_k) + A_k \Big( f\big(x_k + \tfrac{a_k}{A_k}(v_k - w_k)\big) - f(x_k) \Big) + A_{k-1} \big\langle \nabla f(x_k), \tfrac{a_k}{A_k}(v_{k-1} - x_k) \big\rangle - a_k \langle \eta_k, x^* - v_k \rangle.$$

Using the definition of $w_k$, $\theta_k = \frac{a_k}{A_k}$, and ($\mu$AGD+), it is not hard to show that $y_k = x_k + \frac{a_k}{A_k}(v_k - w_k)$ and $\frac{a_k}{A_k}(v_{k-1} - x_k) = x_k - y_{k-1}$, which, using the convexity of $f(\cdot)$, gives:

$$A_k L_k - A_{k-1} L_{k-1} \geq A_k f(y_k) - A_{k-1} f(y_{k-1}) - a_k \langle \eta_k, x^* - v_k \rangle. \tag{B.7}$$

Combining (B.7) and (B.2), the proof follows.

Theorem B.4. Let $\psi(x) = \frac{L - \mu}{2}\|x\|^2$, $a_1 = 1$, $\frac{a_i}{A_i} = \gamma_i \leq \sqrt{\frac{\mu}{L}}$ for $i \geq 2$, and let $y_k, x_k$ evolve according to ($\mu$AGD+). Then, $\forall k \geq 1$:

$$f(y_k) - f(x^*) \leq \frac{1}{A_k} \Big( \frac{L - \mu}{2}\|x^* - x_0\|^2 + \sum_{i=1}^k a_i \langle \eta_i, x^* - v_i \rangle \Big) = \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \Big( \frac{L - \mu}{2}\|x^* - x_0\|^2 + \sum_{i=1}^k a_i \langle \eta_i, x^* - v_i \rangle \Big).$$

Proof. Applying Lemma B.3, it follows that $A_k G_k \leq A_1 G_1 + \sum_{i=2}^k E^\eta_i$. As $\frac{A_{i-1}}{A_i} = 1 - \gamma_i$, we have $\frac{A_1}{A_k} = \frac{A_1}{A_2} \cdot \frac{A_2}{A_3} \cdots \frac{A_{k-1}}{A_k} = \prod_{i=2}^k (1 - \gamma_i)$, and hence, using $A_1 = a_1 = 1$:

$$G_k \leq \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \Big( G_1 + \sum_{i=2}^k E^\eta_i \Big).$$

The rest of the proof is by applying Proposition B.1 and using that $f(y_k) - f(x^*) \leq G_k$.
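The two identities invoked at the end of the proof of Lemma B.3 can be checked symbolically from the ($\mu$AGD+) update as reconstructed above. Scalar symbols suffice, since both identities are affine and hold coordinatewise; this is our own verification sketch:

```python
import sympy as sp

# Symbolic check of the two identities used in the proof of Lemma B.3.
theta, A = sp.symbols('theta A', positive=True)
y_prev, v_prev, v = sp.symbols('y_prev v_prev v')
a = theta * A                                   # a_k = theta_k * A_k
x = (y_prev + theta * v_prev) / (1 + theta)     # x_k from (muAGD+)
y = (1 - theta) * y_prev + theta * v            # y_k from (muAGD+)
w = ((A - a) * v_prev + a * x) / A              # w_k = (A_{k-1} v_{k-1} + a_k x_k) / A_k
print(sp.simplify(y - (x + theta * (v - w))))            # 0:  y_k = x_k + (a_k/A_k)(v_k - w_k)
print(sp.simplify((x - y_prev) - theta * (v_prev - x)))  # 0:  (a_k/A_k)(v_{k-1} - x_k) = x_k - y_{k-1}
```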

Using the same arguments for bounding the noise term as in the case of smooth minimization (Section 3), we have the following corollary.

Corollary B.5 (of Theorem B.4). If $\eta_i \equiv 0$ (the noiseless gradient case), setting $\gamma_i = \sqrt{\mu/L}$, we recover the standard convergence result for accelerated smooth and strongly convex minimization:

$$f(y_k) - f(x^*) \leq \Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k-1} \frac{L - \mu}{2}\|x^* - x_0\|^2.$$

If $\mathbb{E}[\|\eta_i\|] \leq M_i$ and $\max_{u \in X}\|x^* - u\| \leq R_x$, then setting $\gamma_i = \sqrt{\mu/L}$:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k-1} \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{R_x \sum_{i=1}^k a_i M_i}{A_k}.$$

If the $\eta_i$'s are zero-mean and independent and $\mathbb{E}[\|\eta_i\|^2] \leq \sigma^2$, then:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{\sigma^2}{\mu} \cdot \frac{\sum_{i=1}^k a_i^2 / A_i}{A_k}.$$

In particular, setting $\gamma_i = \sqrt{\mu/L}$:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k-1} \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{\sigma^2}{\sqrt{\mu L}},$$

while setting $a_i = i^p$ for $p \in \mathbb{Z}_+$:

$$\mathbb{E}[f(y_k) - f(x^*)] = O\Big( \frac{(L - \mu)\|x^* - x_0\|^2}{k^{p+1}} + \frac{(p+1)\,\sigma^2}{k\,\mu} \Big).$$

Proof. The bounds for $\eta_i \equiv 0$ and for $\mathbb{E}[\|\eta_i\|] \leq M_i$, $\max_{u \in X}\|x^* - u\| \leq R_x$ are straightforward. Assume that the $\eta_i$'s are zero-mean and independent, and denote $\tilde\psi_k(x) = \sum_{i=1}^k \frac{a_i \mu}{2}\|x - x_i\|^2 + \frac{\mu_0}{2}\|x - x_0\|^2$. Observe that the strong convexity parameter of $\tilde\psi_k$ is $A_k \mu + \mu_0 > A_k \mu$. From Fact 4, $v_k = \nabla \tilde\psi_k^*(z_k)$. Similarly as for the case of smooth minimization, let $\hat v_k = \nabla \tilde\psi_k^*(z_k + a_k \eta_k)$. Then $\hat v_k$ is independent of $\eta_k$, and, using Fact 5, we have:

$$\mathbb{E}[a_k \langle \eta_k, x^* - v_k \rangle] = \mathbb{E}[a_k \langle \eta_k, x^* - \hat v_k \rangle] + \mathbb{E}[a_k \langle \eta_k, \hat v_k - v_k \rangle] \leq \frac{a_k^2}{A_k \mu}\, \mathbb{E}[\|\eta_k\|^2].$$

Combining with Theorem B.4, we get:

$$\mathbb{E}[f(y_k) - f(x^*)] \leq \Big( \prod_{i=2}^k (1 - \gamma_i) \Big) \frac{L - \mu}{2}\|x^* - x_0\|^2 + \frac{\sigma^2}{\mu} \cdot \frac{\sum_{i=1}^k a_i^2 / A_i}{A_k}.$$

The rest of the proof follows by plugging in the particular choices of $a_i$.

Let us make a few more remarks here. When the $\eta_i$'s are zero-mean, independent, and $\mathbb{E}[\|\eta_i\|^2] \leq \sigma^2$, even the vanilla version of $\mu$AGD+ does not accumulate noise (the noise averages out). Under the same assumptions and when $a_i = i$, we recover the asymptotic bound from (Ghadimi & Lan, 2012).⁷ More generally, $a_i = i^p$ for any constant integer $p$ gives a convergence bound in which the deterministic term vanishes at rate $1/k^{p+1}$ while the noise term vanishes at rate $1/k$. When $p = \Theta(\log k)$ for a fixed number of iterations $k$ of $\mu$AGD+, we get

$$\mathbb{E}[f(y_k) - f(x^*)] = O\Big( \Big(\frac{\log k}{k}\Big)^{\log k} (L - \mu)\|x^* - x_0\|^2 + \frac{\log k}{k} \cdot \frac{\sigma^2}{\mu} \Big),$$

i.e., the deterministic term (independent of noise) decreases super-polynomially with the iteration count, while the noise term decreases at rate $\log(k)/k$. This is a much stronger bound than the one from (Ghadimi & Lan, 2012), and closer to the theoretical lower bound $\Omega\big( \big(1 - \sqrt{\mu/L}\big)^k (L - \mu)\|x^* - x_0\|^2 + \frac{\sigma^2}{k\mu} \big)$ from (Nemirovskii & Yudin, 1983). Note that in the setting of constrained (bounded-diameter) minimization, (Ghadimi & Lan, 2013) obtained the optimal convergence bound $O\big( \big(1 - \sqrt{\mu/L}\big)^k (L - \mu)\|x^* - x_0\|^2 + \frac{\sigma^2}{k\mu} \big)$ by coupling the algorithm from (Ghadimi & Lan, 2012) with a domain-shrinking procedure, resulting in a multi-stage algorithm. We expect it is possible to obtain a similar result for $\mu$AGD+ by coupling it with the domain-shrinking procedure from (Ghadimi & Lan, 2013).

⁷Note that the independence of the $\eta_i$'s is a stronger assumption than the one used in (Ghadimi & Lan, 2012). Nevertheless, we can obtain the same bounds as stated in Corollary B.5 under the same assumptions as in (Ghadimi & Lan, 2012), using the same arguments as for the case of smooth minimization (see Appendix C).
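To see the trade-off between the two terms of Corollary B.5 numerically, the following sketch (ours) computes the deterministic factor $1/A_k$ and the accumulated-noise factor $\frac{1}{A_k}\sum_{i} a_i^2 / A_i$ for the polynomial weights $a_i = i^p$:

```python
import numpy as np

# Numerical illustration (our sketch) of the trade-off behind Corollary B.5:
# with polynomial weights a_i = i^p, the deterministic term of the bound is
# proportional to 1/A_k = Theta(1/k^{p+1}), while the accumulated-noise term
# is proportional to (1/A_k) * sum_i a_i^2 / A_i = Theta(1/k), with a
# constant that grows with p.
k = 10**4
for p in [1, 2, 4]:
    a = np.arange(1, k + 1, dtype=float) ** p
    A = np.cumsum(a)
    print(f"p={p}:  1/A_k = {1/A[-1]:.2e}  ~ k^-(p+1) = {k**-(p+1.0):.2e},"
          f"  noise factor = {np.sum(a**2 / A) / A[-1]:.2e}  ~ C(p)/k")
```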

C. Different Models of Inexact Oracle

C.1. Adversarial Models

There are two main adversarial models of inexact gradient oracles that have been used in the convergence analysis of accelerated methods: the approximate gradient model of (d'Aspremont, 2008) and the inexact first-order oracle of (Devolder et al., 2014).

The approximate gradient model (d'Aspremont, 2008) defines the inexact oracle by a deterministic perturbation satisfying the following condition for all queries:

$$|\langle \eta_k, y - z \rangle| \leq \delta, \quad \forall y, z \in X.$$

Hence, this model is only applicable to constrained optimization with a bounded-diameter domain and bounded (adversarial) additive noise. Under these assumptions, (d'Aspremont, 2008) proves that it is possible to approximate $f(x^*)$ up to an error of $\delta$ while achieving an accelerated rate. We can show the same asymptotic bound⁸ by applying the assumption to Equation (3) in Proposition 3.5. This yields:

$$G_k \leq \frac{D_\psi(x^*, x_0)}{A_k} + \delta$$

whenever $\frac{a_k^2}{A_k} \leq \frac{\mu}{L}$ for all $k$. Setting $a_k = \frac{\mu}{L} \cdot \frac{k+1}{2}$ yields $A_k = \Omega\big(\frac{\mu k^2}{L}\big)$, which establishes the accelerated decrease of the first term in the error bound above.

The inexact first-order oracle (Devolder et al., 2014) is a generalization of the model from (d'Aspremont, 2008) that defines the inexact oracle by:

$$f(y) \leq f(x) + \langle \widehat{\nabla} f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 + \delta.$$

As stated in (Devolder et al., 2014), this model does not apply to noise-corrupted gradients per se, but rather to non-smooth and weakly smooth convex problems. In other words, the model was introduced to characterize the behavior of accelerated methods on objective functions that are non-smooth, but close to smooth. Our results agree with those of (Devolder et al., 2014) and lead to the same kind of error accumulation. To see this, observe that we only use the definition of smoothness when bounding $E_k$ in Theorem 3.4. Thus, the error from the inexact oracle would only appear as an additional $A_k \delta$ term in $E_k$, leading to:

$$f(y_k) - f(x^*) \leq \frac{D_\psi(x^*, x_0)}{A_k} + k\delta.$$

This is exactly the same bound as in Theorem 4 of (Devolder et al., 2014), but we obtain it through a generic algorithm with a simple analysis.

C.2. Generalized Stochastic Models

It is possible to generalize the results from Lemma 3.7 to the noise model from (Lan, 2012; Ghadimi & Lan, 2012). In such a model, $\eta_k = G(x_k, \xi_k) - \nabla f(x_k)$, where the $\{\xi_i\}$'s are i.i.d. random vectors, $\mathbb{E}[\eta_k] = 0$, and $\mathbb{E}[\|\eta_k\|^2] \leq \sigma^2$. Let $\mathcal{F}_k$ denote the natural filtration up to (and including) iteration $k$. Then $\hat v_k = \nabla \psi^*(z_k + a_k \eta_k)$ is measurable w.r.t. $\mathcal{F}_{k-1}$ (as $\{x_i\}_{i=1}^k$ and $\{\xi_i\}_{i=1}^{k-1}$ are measurable w.r.t. $\mathcal{F}_{k-1}$). It follows that:

$$\mathbb{E}[E^\eta_k \mid \mathcal{F}_{k-1}] = a_k \mathbb{E}[\langle \eta_k, x^* - v_k \rangle \mid \mathcal{F}_{k-1}] = a_k \mathbb{E}[\langle \eta_k, x^* - \hat v_k \rangle \mid \mathcal{F}_{k-1}] + a_k \mathbb{E}[\langle \eta_k, \hat v_k - v_k \rangle \mid \mathcal{F}_{k-1}] \leq \frac{a_k^2}{\mu} \mathbb{E}[\|\eta_k\|^2 \mid \mathcal{F}_{k-1}] \leq \frac{a_k^2 \sigma^2}{\mu}. \tag{C.1}$$

Define $\Gamma_k \stackrel{\mathrm{def}}{=} A_k G_k - \sum_{i=1}^k \frac{a_i^2 \sigma^2}{\mu}$. Then, using the results from Section 3 and (C.1):

$$\mathbb{E}[\Gamma_k - \Gamma_{k-1} \mid \mathcal{F}_{k-1}] = \mathbb{E}\Big[ A_k G_k - A_{k-1} G_{k-1} - \frac{a_k^2 \sigma^2}{\mu} \,\Big|\, \mathcal{F}_{k-1} \Big] \leq \mathbb{E}\Big[ E^\eta_k - \frac{a_k^2 \sigma^2}{\mu} \,\Big|\, \mathcal{F}_{k-1} \Big] \leq 0,$$

i.e., $\Gamma_k$ is a supermartingale. Hence, we can conclude that $\mathbb{E}[\Gamma_k] \leq \mathbb{E}[\Gamma_1]$, implying $\mathbb{E}[A_k G_k] \leq A_1 \mathbb{E}[G_1] + \sum_{i=2}^k \frac{a_i^2 \sigma^2}{\mu}$, and we recover the same bound as in Lemma 3.7:

$$\mathbb{E}[f(y_k)] - f(x^*) \leq \frac{D_\psi(x^*, x_0) + \frac{\sigma^2}{\mu} \sum_{i=1}^k a_i^2}{A_k}.$$

⁸We actually obtain better constants than those in the corresponding theorem of (d'Aspremont, 2008).
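A minimal sketch of the two oracle models above, as we read them: the adversarial oracle keeps $|\langle \eta, y - z \rangle| \leq \delta$ over a bounded domain via Cauchy-Schwarz, while the stochastic oracle returns $G(x, \xi) = \nabla f(x) + \eta$ with zero-mean $\eta$ and $\mathbb{E}\|\eta\|^2 = \sigma^2$. The wrapper names and noise distributions are our illustration choices, not part of the original models:

```python
import numpy as np

def adversarial_oracle(grad_f, delta, diameter):
    """Approximate gradient model: any eta with ||eta|| <= delta / diameter
    satisfies |<eta, y - z>| <= delta for all y, z in a set of the given
    diameter (Cauchy-Schwarz); here the direction is chosen at random."""
    def oracle(x, rng):
        d = rng.standard_normal(x.shape)
        return grad_f(x) + (delta / diameter) * d / np.linalg.norm(d)
    return oracle

def stochastic_oracle(grad_f, sigma):
    """Stochastic model of (Lan, 2012; Ghadimi & Lan, 2012): zero-mean eta
    with E||eta||^2 = sigma^2 (isotropic Gaussian scaling)."""
    def oracle(x, rng):
        eta = rng.standard_normal(x.shape) * (sigma / np.sqrt(x.size))
        return grad_f(x) + eta
    return oracle

# Example: corrupt the gradient of f(x) = 0.5 ||x||^2.
rng = np.random.default_rng(0)
g_adv = adversarial_oracle(lambda x: x, delta=0.1, diameter=2.0)
g_sto = stochastic_oracle(lambda x: x, sigma=0.1)
print(g_adv(np.ones(3), rng), g_sto(np.ones(3), rng))
```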

Figure 2: Median performance of gradient descent (GD) and accelerated algorithms (AGD, AXGD, AGD+ with restart and slow-down semi-heuristics, and TO-AGD+) over 50 repeated runs on a hard instance for unconstrained smooth minimization, for $\eta \sim N(0, \sigma_\eta^2 I)$ over $\mathbb{R}^n$ with $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$, for: (a)-(d) 500 iterations; (e)-(h) 100 iterations.

D. Additional Experiments

D.1. Hard Instance for Unconstrained Smooth Minimization

Unconstrained Minimization. In Section 5, we compared various accelerated algorithms with gradient descent (GD) when the restart and slow-down semi-heuristics are disabled and enabled. We also included a comparison with AGD+ when its parameters are set according to Corollary 3.9 (TO-AGD+). Since the step sizes of TO-AGD+ depend on the number of iterations, one may hope that TO-AGD+ could outperform the other accelerated algorithms when the number of iterations is small. However, this is not true: for a smaller number of iterations, in the best case TO-AGD+ matches the performance of the other algorithms with restart and slow-down. When all algorithms have the same performance, they all essentially run their vanilla versions, without restart and slow-down and with their standard accelerated step sizes. This is illustrated in Fig. 2.

Hard Instance over Simplex. The set of experiments in Figure 3 corresponds to the minimization of the hard instance function for smooth optimization, constrained over the probability simplex. It should be compared to the unconstrained version in Figure 1. As predicted, we observe that the presence of constraints decreases the effect of error accumulation, as the boundary of the feasible set limits the variance. Given the low variance due to the constraints, the effect of RESTARTSLOWDOWN is less evident in this batch of experiments.

Figure 3: Performance of gradient descent (GD) and accelerated algorithms (AGD, AXGD, AGD+) on a hard instance for smooth minimization, for $\eta \sim N(0, \sigma_\eta^2 I)$ with $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$, over the probability simplex: (a)-(d) without and (i)-(l) with RESTARTSLOWDOWN and RESTARTSLOWDOWN-.
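The experiments in this subsection can be reproduced in outline as follows. This is a sketch under our assumptions: the excerpt does not define the hard instance, so we use Nesterov's standard worst-case smooth function, and we show plain (projected) gradient descent; the accelerated variants are run analogously with the updates from Appendices A-B.

```python
import numpy as np

# Sketch of the D.1 experiments (Figures 2 and 3), assuming the hard
# instance is Nesterov's worst-case smooth function
#   f(x) = (L/4) * (x' A x / 2 - x[0]),  A = tridiagonal Laplacian.
# Gradients are corrupted as grad f(x) + eta, eta ~ N(0, sigma_eta^2 I);
# the simplex variant projects each iterate onto the probability simplex.
def hard_instance(n, L=1.0):
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    e1 = np.zeros(n); e1[0] = 1.0
    f = lambda x: 0.25 * L * (0.5 * x @ A @ x - x[0])
    grad = lambda x: 0.25 * L * (A @ x - e1)
    return f, grad

def project_simplex(y):
    # Euclidean projection onto {x >= 0, sum x = 1} (Held et al.; Duchi et al., 2008).
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, y.size + 1) > (css - 1.0))[0][-1]
    return np.maximum(y - (css[rho] - 1.0) / (rho + 1.0), 0.0)

def noisy_gd(grad, x0, step, iters, sigma_eta, rng, project=None):
    x = x0.copy()
    for _ in range(iters):
        x = x - step * (grad(x) + sigma_eta * rng.standard_normal(x.shape))
        if project is not None:
            x = project(x)
    return x

n, L = 100, 1.0
f, grad = hard_instance(n, L)
for sigma_eta in [0.0, 1e-5, 1e-3, 1e-1]:
    finals = [f(noisy_gd(grad, np.ones(n) / n, 1.0 / L, 500, sigma_eta,
                         np.random.default_rng(seed), project=project_simplex))
              for seed in range(50)]
    print(f"sigma_eta={sigma_eta:g}: median f = {np.median(finals):.3e}")
```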

D.2. Regression on Epileptic Seizure Dataset

For the second set of experiments, we used the Epileptic Seizure Recognition Dataset (Andrzejak et al., 2001), obtained from (Lichman, 2013). The dataset consists of brain-activity EEG recordings for 100 patients at different time points and in five different states, of which only one indicates an epileptic seizure. Each row of the dataset has 179 columns, of which the first 178 are features, while the last column indicates whether the patient was in seizure. Before running the experiments, we standardize the data using a standard preprocessing function from the Python scikit-learn library.

Linear regression and LASSO. We performed linear regression on this dataset mainly to illustrate the performance of the algorithms in the bounded regime, for $\ell_1$-constraints (LASSO), which we discuss here. The results for standard, unconstrained linear regression are similar to the results for logistic regression discussed below and are omitted.

Fig. 4(a)-4(d) shows the performance of GD and the accelerated algorithms for $\ell_1$-constrained linear regression (LASSO) on the Epileptic Seizure Recognition Dataset. Interestingly, the experiments suggest that there are cases when AGD performs better than AXGD and AGD+. In particular, in the noiseless (Fig. 4(a)) and low-noise (Fig. 4(b)) settings, AGD performs much better than its worst case and converges faster than AXGD and AGD+. However, the faster convergence comes at the expense of lower stability as the noise becomes higher. Specifically, as the noise is increased, AGD performs only marginally better, and with higher variance, than AXGD and AGD+ (Fig. 4(c)), and it stabilizes at a much higher mean and variance in the very high-noise setting (Fig. 4(d)). Intuitively, the greedy gradient steps that AGD takes may reduce the function value significantly and lead to faster convergence in the noiseless and low-noise settings, while making the convergence very sensitive to the noise from the last iteration, as the gradient steps depend only on the last seen (noisy) gradient. In contrast, AXGD and AGD+ are more stable to noise, since both of their per-iteration steps depend on the aggregate gradient (and, thus, aggregate noise) information.

As expected from the analytical results of Section 3, restart and slow-down does not noticeably improve the mean error of the algorithms (Fig. 4(e)-4(h), 4(i)-4(l)). However, in agreement with the analysis, it can reduce the error variance in the high-noise-variance setting (Fig. 4(l)). We also note that TO-AGD+ is more stable over the repeated executions of the methods, at the expense of slower initial convergence.

Logistic regression. Finally, we evaluated the performance of the accelerated algorithms and GD for (unregularized) logistic regression on the Epileptic Seizure Recognition Dataset. The results are shown in Fig. 5. Similarly to the case of unconstrained minimization from the beginning of this section, in the noiseless and low-noise settings (Fig. 5(a), 5(b)) all accelerated algorithms perform similarly, and restart and slow-down does not lead to any noticeable improvement or degradation (Fig. 5(e), 5(i), 5(f), 5(j)). Once the noise is high enough (Fig. 5(c), 5(d)), all accelerated algorithms begin to accumulate noise, while restart and slow-down stabilizes their performance at a low error mean and variance. Interestingly, in all the experiments, when RESTARTSLOWDOWN and RESTARTSLOWDOWN- are employed, all accelerated algorithms perform at least as well as GD in terms of the error mean and variance.
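The $\ell_1$-constrained (LASSO) runs above additionally require a Euclidean projection onto the $\ell_1$-ball at each step. A standard reduction to simplex projection (our illustration, not necessarily the implementation used in the experiments) is:

```python
import numpy as np

def project_l1_ball(y, r=1.0):
    """Euclidean projection onto {x : ||x||_1 <= r}: project |y| onto the
    scaled simplex and restore signs (standard reduction)."""
    if np.abs(y).sum() <= r:
        return y.copy()
    u = np.sort(np.abs(y))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, y.size + 1) > (css - r))[0][-1]
    tau = (css[rho] - r) / (rho + 1.0)
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

w = project_l1_ball(np.array([0.6, -0.8, 0.1]), r=1.0)
print(w, np.abs(w).sum())   # l1 norm equals the radius r = 1
```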

Figure 4: Performance of GD and accelerated algorithms (AGD, AXGD, AGD+, TO-AGD+) for linear regression over the $\ell_1$-ball (LASSO) on the Epileptic Seizure Recognition Dataset (Andrzejak et al., 2001), for $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$: (a)-(d) without noise reduction; (e)-(h) sample run with TO-AGD+, RESTARTSLOWDOWN, and RESTARTSLOWDOWN-; and (i)-(l) 50 repeated runs and the median with TO-AGD+, RESTARTSLOWDOWN, and RESTARTSLOWDOWN-.

Figure 5: Performance of gradient descent (GD) and accelerated algorithms (AGD, AXGD, AGD+, TO-AGD+) for logistic regression on the Epileptic Seizure Recognition Dataset (Andrzejak et al., 2001), for $\sigma_\eta \in \{0, 10^{-5}, 10^{-3}, 10^{-1}\}$: (a)-(d) without noise reduction; (e)-(h) sample run with noise reduction; and (i)-(l) 50 repeated runs and the median run with noise reduction.
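For completeness, a minimal sketch (our reconstruction, with random stand-ins for the EEG data) of the noisy-gradient logistic-regression setup behind Figure 5:

```python
import numpy as np

# Random stand-ins for the standardized EEG features and +/-1 seizure labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 178))
ylab = rng.integers(0, 2, size=500) * 2 - 1

def loss(w):
    # Unregularized logistic loss: mean of log(1 + exp(-y_i <x_i, w>)).
    return np.mean(np.log1p(np.exp(-ylab * (X @ w))))

def noisy_grad(w, sigma_eta=0.0):
    s = 1.0 / (1.0 + np.exp(ylab * (X @ w)))       # sigmoid(-y_i <x_i, w>)
    g = -(X.T @ (ylab * s)) / X.shape[0]
    return g + sigma_eta * rng.standard_normal(g.shape)

w = np.zeros(178)
for _ in range(200):                                # plain GD with noisy oracle
    w -= 0.5 * noisy_grad(w, sigma_eta=1e-3)
print(loss(w))
```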
