arXiv: v3 [math.OC] 23 Jan 2018


Nonconvex Finite-Sum Optimization Via SCSG Methods

Lihua Lei, UC Berkeley
Cheng Ju, UC Berkeley
Michael I. Jordan, UC Berkeley, jordan@stat.berkeley.edu
Jianbo Chen, UC Berkeley, jianbochen@berkeley.edu

Abstract

We develop a class of algorithms, as variants of the stochastically controlled stochastic gradient (SCSG) methods [19], for the smooth non-convex finite-sum optimization problem. Assuming the smoothness of each component, the complexity of SCSG to reach a stationary point with $\mathbb{E}\|\nabla f(x)\|^2 \le \varepsilon$ is $O(\min\{\varepsilon^{-5/3}, \varepsilon^{-1} n^{2/3}\})$, which strictly outperforms stochastic gradient descent. Moreover, SCSG is never worse than the state-of-the-art methods based on variance reduction and it significantly outperforms them when the target accuracy is low. A similar acceleration is also achieved when the functions satisfy the Polyak-Lojasiewicz condition. Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layer neural networks in terms of both training and validation loss.

1 Introduction

We study smooth non-convex finite-sum optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) \qquad (1)$$

where each component $f_i(x)$ is possibly non-convex with a Lipschitz gradient. This generic form captures numerous statistical learning problems, ranging from generalized linear models [20] to deep neural networks [17].

In contrast to the convex case, the non-convex case is comparatively under-studied. Early work focused on the asymptotic performance of algorithms [11, 7, 27], with non-asymptotic complexity bounds emerging more recently []. In recent years, complexity results have been derived for both gradient methods [,, 8, 9] and stochastic gradient methods [12, 13, 6, 4, 24, 25, 3]. Unlike in the convex case, in the non-convex case one cannot expect a gradient-based algorithm to converge to the global minimum if only smoothness is assumed. As a consequence, instead of measuring function-value suboptimality $\mathbb{E} f(x) - \inf_x f(x)$ as in the convex case, convergence is generally measured in

terms of the squared norm of the gradient; i.e., $\mathbb{E}\|\nabla f(x)\|^2$ (see footnote 1). We summarize the best achievable rates in Table 1. We also list the rates for Polyak-Lojasiewicz (P-L) functions, which will be defined in Section 2. The accuracy for minimizing P-L functions is measured by $\mathbb{E} f(x) - \inf_x f(x)$.

Table 1: Computation complexity of gradient methods and stochastic gradient methods for the finite-sum non-convex optimization problem (1). The second and third columns summarize the rates in the smooth and P-L cases respectively. $\mu$ is the P-L constant and $H^*$ is the variance of a stochastic gradient. These quantities are defined in Section 2. The final column gives additional required assumptions beyond smoothness or the P-L condition. The symbol $\wedge$ denotes a minimum and $\tilde{O}(\cdot)$ is the usual Landau big-O notation with logarithmic terms hidden.

Method | Smooth | Polyak-Lojasiewicz | Additional cond.
Gradient methods:
GD | $O(n/\varepsilon)$ [, ] | $\tilde{O}(n/\mu)$ [23, 15] | -
Best achievable | $\tilde{O}(n/\varepsilon^{7/8})$ [9] | - | smooth gradient
Best achievable | $\tilde{O}(n/\varepsilon^{5/6})$ [9] | - | smooth Hessian
Stochastic gradient methods:
SGD | $O(1/\varepsilon^{2})$ [, 4] | $O(1/(\mu^{2}\varepsilon))$ [15] | $H^* = O(1)$
Best achievable | $O(n + n^{2/3}/\varepsilon)$ [24, 25] | $\tilde{O}(n + n^{2/3}/\mu)$ [24, 25] | -
SCSG | $\tilde{O}\big(\frac{1}{\varepsilon^{5/3}} \wedge \frac{n^{2/3}}{\varepsilon}\big)$ | $\tilde{O}\big(\big(\frac{1}{\mu\varepsilon} \wedge n\big) + \frac{1}{\mu}\big(\frac{1}{\mu\varepsilon} \wedge n\big)^{2/3}\big)$ | $H^* = O(1)$

As in the convex case, gradient methods have better dependence on $\varepsilon$ in the non-convex case but worse dependence on $n$. This is due to the requirement of computing a full gradient. Comparing the complexity of SGD and the best achievable rate for stochastic gradient methods, achieved via variance-reduction methods, the dependence on $\varepsilon$ is significantly improved in the latter case. However, unless $\varepsilon \ll n^{-1/2}$, SGD has similar or even better theoretical complexity than gradient methods and existing variance-reduction methods. In practice, it is often the case that $n$ is very large while the target accuracy is moderate ($10^{-1} \sim 10^{-3}$). In this case, SGD has a meaningful advantage over other methods, deriving from the fact that it does not require a full gradient computation. This motivates the following research question: Is there an algorithm that

- achieves/beats the theoretical complexity of SGD in the regime of modest target accuracy; and
- achieves/beats the theoretical complexity of existing variance-reduction methods in the regime of high target accuracy?

The question has been partially answered in the convex case by [19] in their formulation of the stochastically controlled stochastic gradient (SCSG) methods. When the target accuracy is low, SCSG has the same $O(\varepsilon^{-2})$ rate as SGD but with a much smaller data-dependent constant factor (which does not even require bounded gradients). When the target accuracy is high, SCSG achieves the same rate as the best non-accelerated methods, $O(n/\varepsilon)$. Despite the gap between this and the optimal rate, SCSG is the first known algorithm that provably achieves the desired performance in both regimes.

Footnote 1: It is also common to use $\mathbb{E}\|\nabla f(x)\|$ to measure convergence; see, e.g. [, 8, 9, ]. Our results can be readily transferred to this alternative measure by using the Cauchy-Schwarz inequality, $\mathbb{E}\|\nabla f(x)\| \le \sqrt{\mathbb{E}\|\nabla f(x)\|^2}$, although not vice versa. The rates under this alternative can be made comparable to ours by replacing $\varepsilon$ by $\sqrt{\varepsilon}$.

In this paper, we show how to generalize SCSG to the non-convex setting and, surprisingly, provide a completely affirmative answer to the question raised above. Even though we only assume smoothness of each component, we show that SCSG is always $O(\varepsilon^{-1/3})$ faster than SGD and is never worse than recently developed variance-reduction methods. When $\varepsilon \gg 1/n$, SCSG is at least $O((\varepsilon n)^{2/3})$ faster than the best variance-reduction algorithm. Comparing with gradient methods, SCSG has a better convergence rate provided $\varepsilon \gg n^{-6/5}$, which is the common setting in practice. Interestingly, there is a parallel to recent advances in gradient methods; [9] improved the classical $O(\varepsilon^{-1})$ rate of gradient descent to $O(\varepsilon^{-5/6})$; this parallels the improvement of SCSG over SGD from $O(\varepsilon^{-2})$ to $O(\varepsilon^{-5/3})$.

Beyond the theoretical advantages of SCSG, we also show that SCSG yields good empirical performance for the training of multi-layer neural networks. It is worth emphasizing that the mechanism by which SCSG achieves acceleration (variance reduction) is qualitatively different from other speed-up techniques, including momentum [26] and adaptive stepsizes [16]. It will be of interest in future work to explore combinations of these various approaches in the training of deep neural networks.

The rest of the paper is organized as follows: In Section 2 we discuss our notation and assumptions and we state the basic SCSG algorithm. We present the theoretical convergence analysis in Section 3. Experimental results are presented in Section 4. All the technical proofs are relegated to the Appendices. Our code is available at

2 Notation, Assumptions and Algorithm

We use $\|\cdot\|$ to denote the Euclidean norm and write $\min\{a, b\}$ as $a \wedge b$ for brevity throughout the paper. The notation $\tilde{O}$, which hides logarithmic terms, will only be used to maximize readability in our presentation but will not be used in the formal analysis.

We define computation cost using the IFO framework of [1], which assumes that sampling an index $i$ and accessing the pair $(f_i(x), \nabla f_i(x))$ incur a unit of cost. For brevity, we write $\nabla f_I(x)$ for $\frac{1}{|I|}\sum_{i \in I} \nabla f_i(x)$. Note that calculating $\nabla f_I(x)$ incurs $|I|$ units of computational cost. $x$ is called an $\varepsilon$-accurate solution iff $\mathbb{E}\|\nabla f(x)\|^2 \le \varepsilon$. The minimum IFO complexity to reach an $\varepsilon$-accurate solution is denoted by $C_{comp}(\varepsilon)$.

Recall that a random variable $N$ has a geometric distribution, $N \sim \mathrm{Geom}(\gamma)$, if $N$ is supported on the non-negative integers (see footnote 2) with

$$P(N = k) = \gamma^k (1 - \gamma), \quad \forall k = 0, 1, \ldots$$

An elementary calculation shows that

$$\mathbb{E}_{N \sim \mathrm{Geom}(\gamma)} N = \frac{\gamma}{1 - \gamma}. \qquad (2)$$

To formulate our complexity bounds, we define

$$f^* = \inf_x f(x), \quad \Delta_f = f(\tilde{x}_0) - f^*.$$

Further we define $H^*$ as an upper bound on the variance of the stochastic gradients:

$$H^* = \sup_x \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2. \qquad (3)$$

Assumption A1 on the smoothness of individual functions will be made throughout the paper.

Footnote 2: Here we allow $N$ to be zero to facilitate the analysis.
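The geometric distribution above starts at zero, which differs from the convention used by some numerical libraries. The following is a minimal sketch, assuming NumPy is available; the helper names `sample_geom` and `geom_mean` are illustrative and not part of the paper. It draws from $\mathrm{Geom}(\gamma)$ as defined in this section and compares the empirical mean with $\gamma/(1-\gamma)$ from (2). The choice $\gamma = B/(B+b)$ with $B = 64$, $b = 1$ is only an example value.

```python
import numpy as np

def sample_geom(gamma, size, rng):
    # NumPy's geometric counts trials up to the first success (support 1, 2, ...)
    # with success probability p. Taking p = 1 - gamma and subtracting 1 gives
    # P(N = k) = gamma**k * (1 - gamma) on k = 0, 1, 2, ...
    return rng.geometric(1.0 - gamma, size=size) - 1

def geom_mean(gamma):
    # Closed form from equation (2).
    return gamma / (1.0 - gamma)

B, b = 64, 1                      # example values (assumption)
gamma = B / (B + b)
rng = np.random.default_rng(0)
draws = sample_geom(gamma, 200_000, rng)
print(draws.mean(), geom_mean(gamma))
```

With this parametrization the mean equals $B/b$, so the empirical average should be close to 64 in the example.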

A1 $f_i$ is differentiable with

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|,$$

for some $L < \infty$ and for all $i \in \{1, \ldots, n\}$.

As a direct consequence of assumption A1, it holds for any $x, y \in \mathbb{R}^d$ that

$$-\frac{L}{2}\|x - y\|^2 \le f_i(x) - f_i(y) - \langle \nabla f_i(y), x - y \rangle \le \frac{L}{2}\|x - y\|^2. \qquad (4)$$

In this paper, we also consider the Polyak-Lojasiewicz (P-L) condition [23]. It is weaker than strong convexity as well as other popular conditions that appear in the optimization literature; see [15] for an extensive discussion.

A2 $f(x)$ satisfies the P-L condition with $\mu > 0$ if

$$\|\nabla f(x)\|^2 \ge \mu (f(x) - f(x^*)),$$

where $x^*$ is the global minimum of $f$.

2.1 Generic form of SCSG methods

The algorithm we propose in this paper is similar to that of [14] except (critically) the number of inner loops is a geometric random variable. This is an essential component in the analysis of SCSG, and, as we will show below, it is key in allowing us to extend the complexity analysis for SCSG to the non-convex case. Moreover, the algorithm that we present here employs a mini-batch procedure in the inner loop and outputs a random sample instead of an average of the iterates. The pseudo-code is shown in Algorithm 1.

Algorithm 1 (Mini-Batch) Stochastically Controlled Stochastic Gradient (SCSG) method for smooth non-convex finite-sum objectives
Inputs: Number of stages $T$, initial iterate $\tilde{x}_0$, stepsizes $(\eta_j)_{j=1}^T$, batch sizes $(B_j)_{j=1}^T$, mini-batch sizes $(b_j)_{j=1}^T$.
Procedure
 1: for $j = 1, 2, \ldots, T$ do
 2:   Uniformly sample a batch $I_j \subset \{1, \ldots, n\}$ with $|I_j| = B_j$;
 3:   $g_j \leftarrow \nabla f_{I_j}(\tilde{x}_{j-1})$;
 4:   $x_0^{(j)} \leftarrow \tilde{x}_{j-1}$;
 5:   Generate $N_j \sim \mathrm{Geom}(B_j/(B_j + b_j))$;
 6:   for $k = 1, 2, \ldots, N_j$ do
 7:     Randomly pick $\tilde{I}_{k-1} \subset [n]$ with $|\tilde{I}_{k-1}| = b_j$;
 8:     $\nu_{k-1}^{(j)} \leftarrow \nabla f_{\tilde{I}_{k-1}}(x_{k-1}^{(j)}) - \nabla f_{\tilde{I}_{k-1}}(x_0^{(j)}) + g_j$;
 9:     $x_k^{(j)} \leftarrow x_{k-1}^{(j)} - \eta_j \nu_{k-1}^{(j)}$;
10:   end for
11:   $\tilde{x}_j \leftarrow x_{N_j}^{(j)}$;
12: end for
Output: (Smooth case) Sample $\tilde{x}_T^*$ from $(\tilde{x}_j)_{j=1}^T$ with $P(\tilde{x}_T^* = \tilde{x}_j) \propto \eta_j B_j / b_j$; (P-L case) $\tilde{x}_T$.

As seen in the pseudo-code, the SCSG method consists of multiple epochs. In the $j$-th epoch, a mini-batch of size $B_j$ is drawn uniformly from the data and a sequence of mini-batch SVRG-type updates are implemented, with the total number of updates being randomly generated from a geometric distribution, with mean equal to the batch size. Finally it outputs a random sample from $(\tilde{x}_j)_{j=1}^T$.
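The steps of Algorithm 1 can be sketched in a few lines. The following is a minimal illustration, assuming NumPy, constant $\eta_j$, $B_j$, $b_j$, and a synthetic least-squares objective; the function `scsg`, the callback `grad_batch`, and the toy data are illustrative names and not the authors' implementation. The stepsize $1/(6LB^{2/3})$ follows the form used later in the corollaries.

```python
import numpy as np

def scsg(grad_batch, x0, n, T, B, b, eta, rng):
    """Sketch of Algorithm 1 with constant eta, B, b.
    grad_batch(x, idx) must return the average gradient over indices idx."""
    x_tilde = np.asarray(x0, dtype=float)
    iterates = []
    for _ in range(T):
        I = rng.choice(n, size=B, replace=False)     # line 2: batch I_j
        g = grad_batch(x_tilde, I)                    # line 3: batch gradient g_j
        x_ref = x_tilde.copy()                        # line 4: x_0^{(j)}
        x = x_tilde.copy()
        # line 5: Geom(B/(B+b)) supported on {0, 1, ...}; NumPy's version starts at 1
        N = rng.geometric(b / (B + b)) - 1
        for _ in range(N):
            J = rng.choice(n, size=b, replace=False)  # line 7: mini-batch
            nu = grad_batch(x, J) - grad_batch(x_ref, J) + g   # line 8
            x = x - eta * nu                          # line 9
        x_tilde = x                                   # line 11
        iterates.append(x_tilde.copy())
    return iterates

# Toy least-squares problem: f_i(x) = 0.5 * (a_i . x - y_i)^2 (assumption, for illustration).
rng = np.random.default_rng(0)
n, d = 500, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
y = A @ x_true + 0.1 * rng.normal(size=n)

def grad_batch(x, idx):
    Ai = A[idx]
    return Ai.T @ (Ai @ x - y[idx]) / len(idx)

def f(x):
    return 0.5 * np.mean((A @ x - y) ** 2)

L = np.max(np.sum(A ** 2, axis=1))        # smoothness constant of each f_i
B, b = 32, 1
eta = 1.0 / (6.0 * L * B ** (2.0 / 3.0))
iterates = scsg(grad_batch, np.zeros(d), n, T=100, B=B, b=b, eta=eta, rng=rng)
f_init, f_final = f(np.zeros(d)), f(iterates[-1])
# Smooth-case output: with constant eta, B, b the sampling weights are uniform.
x_out = iterates[rng.integers(len(iterates))]
print(f_init, f_final)
```

Because the inner-loop length is random, an epoch may perform zero updates; this is allowed by the definition of the geometric distribution used here.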

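This is the standard way, proposed by [], as opposed to computing $\arg\min_{j \le T} \|\nabla f(\tilde{x}_j)\|$ which requires additional overhead. By (2), the average total cost is

$$\sum_{j=1}^{T} \left(B_j + b_j \cdot \mathbb{E} N_j\right) = \sum_{j=1}^{T} \left(B_j + b_j \cdot \frac{B_j}{b_j}\right) = 2 \sum_{j=1}^{T} B_j. \qquad (5)$$

Define $T(\varepsilon)$ as the minimum number of epochs such that all outputs afterwards are $\varepsilon$-accurate solutions, i.e.

$$T(\varepsilon) = \min\{T : \mathbb{E}\|\nabla f(\tilde{x}_{T'}^*)\|^2 \le \varepsilon \text{ for all } T' \ge T\}.$$

Recalling the definition of $C_{comp}(\varepsilon)$ at the beginning of this section, the average IFO complexity to reach an $\varepsilon$-accurate solution is

$$\mathbb{E} C_{comp}(\varepsilon) \le 2 \sum_{j=1}^{T(\varepsilon)} B_j.$$

2.2 Parameter settings

The generic form (Algorithm 1) allows for flexibility in both stepsize, $\eta_j$, and batch/mini-batch size, $(B_j, b_j)$. In order to minimize the amount of tuning needed in practice, we provide several default settings which have theoretical support. The settings and the corresponding complexity results are summarized in Table 2.

Table 2: Parameter settings analyzed in this paper.

Setting | $\eta_j$ | $B_j$ | Type of Objectives | $\mathbb{E} C_{comp}(\varepsilon)$
Version 1 | $\frac{1}{2LB^{2/3}}$ | $O\big(\frac{1}{\varepsilon} \wedge n\big)$ | Smooth | $O\big(\frac{1}{\varepsilon^{5/3}} \wedge \frac{n^{2/3}}{\varepsilon}\big)$
Version 2 | $\frac{1}{2LB_j^{2/3}}$ | $j^{3/2} \wedge n$ | Smooth | $\tilde{O}\big(\frac{1}{\varepsilon^{5/3}} \wedge \frac{n^{2/3}}{\varepsilon}\big)$
Version 3 | $\frac{1}{2LB_j^{2/3}}$ | $O\big(\frac{1}{\mu\varepsilon} \wedge n\big)$ | Polyak-Lojasiewicz | $\tilde{O}\big(\big(\frac{1}{\mu\varepsilon} \wedge n\big) + \frac{1}{\mu}\big(\frac{1}{\mu\varepsilon} \wedge n\big)^{2/3}\big)$

Note that all settings fix $b_j = 1$ since this yields the best rate, as will be shown in Section 3. However, in practice a reasonably large mini-batch size $b_j$ might be favorable due to the acceleration that could be achieved by vectorization; see Section 4 for more discussion on this point.

3 Convergence Analysis

3.1 One-epoch analysis

First we present the analysis for a single epoch. Given $j$, we define

$$e_j = \nabla f_{I_j}(\tilde{x}_{j-1}) - \nabla f(\tilde{x}_{j-1}). \qquad (6)$$

As shown in [14], the gradient update $\nu_k^{(j)}$ is a biased estimate of the gradient $\nabla f(x_k^{(j)})$ conditioning on the current random index $i_k$. Specifically, within the $j$-th epoch,

$$\mathbb{E}_{\tilde{I}_k} \nu_k^{(j)} = \nabla f(x_k^{(j)}) + \nabla f_{I_j}(x_0^{(j)}) - \nabla f(x_0^{(j)}) = \nabla f(x_k^{(j)}) + e_j.$$

This reveals the basic qualitative difference between SVRG and SCSG. Most of the novelty in our analysis lies in dealing with the extra term $e_j$. Unlike [14], we do not assume $\|x_k^{(j)} - x^*\|$ to be bounded since this is invalid in unconstrained problems, even in convex cases.

By careful analysis of primal and dual gaps [5], we find that the stepsize $\eta_j$ should scale as $(B_j/b_j)^{-2/3}$. The same phenomenon has also been observed in [24, 25, 4] when $b_j = 1$ and $B_j = n$.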
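The bias identity $\mathbb{E}_{\tilde{I}_k} \nu_k^{(j)} = \nabla f(x_k^{(j)}) + e_j$ can be checked numerically. The sketch below assumes NumPy, a synthetic least-squares objective, and $b_j = 1$, so that the conditional expectation is an exact average over all $n$ indices; every name in it (`grad_i`, `full_grad`, the toy data) is illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 40, 3, 8
A = rng.normal(size=(n, d))
y = rng.normal(size=n)

def grad_i(x, i):
    # Gradient of the toy component f_i(x) = 0.5 * (a_i . x - y_i)**2.
    return (A[i] @ x - y[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - y) / n

x0 = rng.normal(size=d)              # epoch anchor x_0^{(j)}
x = x0 + 0.1 * rng.normal(size=d)    # some later inner iterate x_k^{(j)}
I = rng.choice(n, size=B, replace=False)
g = np.mean([grad_i(x0, i) for i in I], axis=0)   # batch gradient g_j
e = g - full_grad(x0)                              # e_j as in (6)

# With b_j = 1, averaging nu over every index i gives E[nu | I_j, x] exactly.
mean_nu = np.mean([grad_i(x, i) - grad_i(x0, i) + g for i in range(n)], axis=0)
bias = mean_nu - full_grad(x)
print(np.allclose(bias, e))
```

The bias depends only on the batch drawn at the start of the epoch, not on the inner iterate, which is why it can be controlled through the batch size $B_j$.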

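Theorem 3.1 Let $\eta_j L = \gamma (b_j/B_j)^{2/3}$. Suppose $\gamma \le \frac{1}{3}$ and $B_j \ge 8 b_j$ for all $j$, then under Assumption A1,

$$\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 \le \frac{5L}{\gamma} \cdot \left(\frac{b_j}{B_j}\right)^{1/3} \mathbb{E}\left(f(\tilde{x}_{j-1}) - f(\tilde{x}_j)\right) + \frac{6 I(B_j < n)}{B_j} \cdot H^*. \qquad (7)$$

The proof is presented in Appendix B. It is not surprising that a large mini-batch size will increase the theoretical complexity, as in the analysis of mini-batch SGD. For this reason we restrict most of our subsequent analysis to $b_j \equiv 1$.

3.2 Convergence analysis for smooth non-convex objectives

When only assuming smoothness, the output $\tilde{x}_T^*$ is a random element from $(\tilde{x}_j)_{j=1}^T$. Telescoping (7) over all epochs, we easily obtain the following result.

Theorem 3.2 Under the specifications of Theorem 3.1 and Assumption A1,

$$\mathbb{E}\|\nabla f(\tilde{x}_T^*)\|^2 \le \frac{\frac{5L}{\gamma}\Delta_f + 6\left(\sum_{j=1}^{T} b_j^{-1/3} B_j^{-2/3} I(B_j < n)\right) H^*}{\sum_{j=1}^{T} b_j^{-1/3} B_j^{1/3}}.$$

This theorem covers many existing results. When $B_j = n$ and $b_j = 1$, Theorem 3.2 implies that $\mathbb{E}\|\nabla f(\tilde{x}_T^*)\|^2 = O\big(\frac{L\Delta_f}{T n^{1/3}}\big)$ and hence $T(\varepsilon) = O\big(1 + \frac{L\Delta_f}{\varepsilon n^{1/3}}\big)$. This yields the same complexity bound $\mathbb{E} C_{comp}(\varepsilon) = O\big(n + \frac{n^{2/3} L\Delta_f}{\varepsilon}\big)$ as SVRG [24]. On the other hand, when $b_j = B_j \equiv B$ for some $B < n$, Theorem 3.2 implies that $\mathbb{E}\|\nabla f(\tilde{x}_T^*)\|^2 = O\big(\frac{L\Delta_f}{T} + \frac{H^*}{B}\big)$. The second term can be made $O(\varepsilon)$ by setting $B = O\big(\frac{H^*}{\varepsilon}\big)$. Under this setting $T(\varepsilon) = O\big(\frac{L\Delta_f}{\varepsilon}\big)$ and $\mathbb{E} C_{comp}(\varepsilon) = O\big(\frac{L\Delta_f H^*}{\varepsilon^2}\big)$. This is the same rate as in [24] for SGD.

However, both of the above settings are suboptimal since they either set the batch sizes $B_j$ too large or set the mini-batch sizes $b_j$ too large. By Theorem 3.2, SCSG can be regarded as an interpolation between SGD and SVRG. By leveraging these two parameters, SCSG is able to outperform both methods.

We start from considering a constant batch/mini-batch size $B_j \equiv B$, $b_j \equiv 1$. Similar to SGD and SCSG, $B$ should be at least $O\big(\frac{H^*}{\varepsilon}\big)$. In applications like the training of neural networks, the required accuracy is moderate and hence a small batch size suffices. This is particularly important since the gradient can be computed without communication overhead, which is the bottleneck of SVRG-type algorithms. As shown in Corollary 3.3 below, the complexity of SCSG beats both SGD and SVRG.

Corollary 3.3 (Constant batch sizes) Set

$$b_j \equiv 1, \quad B_j \equiv B = \min\left\{\frac{12 H^*}{\varepsilon}, n\right\}, \quad \eta_j \equiv \eta = \frac{1}{6 L B^{2/3}}.$$

Then it holds that

$$\mathbb{E} C_{comp}(\varepsilon) = O\left(\left(\frac{H^*}{\varepsilon} \wedge n\right) + \frac{L\Delta_f}{\varepsilon} \cdot \left(\frac{H^*}{\varepsilon} \wedge n\right)^{2/3}\right).$$

Assuming that $L\Delta_f, H^* = O(1)$, the above bound can be simplified to

$$\mathbb{E} C_{comp}(\varepsilon) = O\left(\left(\frac{1}{\varepsilon} \wedge n\right) + \frac{1}{\varepsilon} \cdot \left(\frac{1}{\varepsilon} \wedge n\right)^{2/3}\right) = O\left(\frac{1}{\varepsilon^{5/3}} \wedge \frac{n^{2/3}}{\varepsilon}\right).$$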
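To get a feel for the regimes discussed above, the following minimal sketch evaluates the three rates with all constants and logarithmic factors dropped, i.e. assuming $L\Delta_f, H^* = O(1)$. The function names are illustrative, and the numbers are orders of magnitude only, not predictions of actual runtime.

```python
def ifo_sgd(eps):
    # SGD rate O(1 / eps^2), constants dropped.
    return eps ** (-2)

def ifo_svrg(eps, n):
    # Variance-reduction rate O(n + n^(2/3) / eps), constants dropped.
    return n + n ** (2 / 3) / eps

def ifo_scsg(eps, n):
    # SCSG rate O(eps^(-5/3) ^ n^(2/3) / eps), where ^ denotes the minimum.
    return min(eps ** (-5 / 3), n ** (2 / 3) / eps)

n = 10 ** 6
for eps in (1e-1, 1e-2, 1e-4, 1e-8):
    print(f"eps={eps:g}  SGD={ifo_sgd(eps):.3g}  "
          f"SVRG={ifo_svrg(eps, n):.3g}  SCSG={ifo_scsg(eps, n):.3g}")
```

For moderate accuracy the first term of the minimum is active and the bound does not depend on $n$; for high accuracy the second term takes over and matches the variance-reduction rate.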

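When the target accuracy is high, one might consider a sequence of increasing batch sizes. Heuristically, a large batch is wasteful at the early stages when the iterates are inaccurate. Fixing the batch size to be $n$ as in SVRG is obviously suboptimal. Via an involved analysis, we find that $B_j \sim j^{3/2}$ gives the best complexity among the class of SCSG algorithms.

Corollary 3.4 (Time-varying batch sizes) Set

$$b_j \equiv 1, \quad B_j = \min\left\{\lceil j^{3/2} \rceil, n\right\}, \quad \eta_j = \frac{1}{6 L B_j^{2/3}}.$$

Then it holds that

$$\mathbb{E} C_{comp}(\varepsilon) = O\left(\min\left\{\frac{1}{\varepsilon^{5/3}}\left[(L\Delta_f)^{5/3} + (H^*)^{5/3} \log^5\left(\frac{H^*}{\varepsilon}\right)\right], n^{5/3}\right\} + \frac{n^{2/3}}{\varepsilon} \cdot (L\Delta_f + H^* \log n)\right). \qquad (8)$$

The proofs of both Corollary 3.3 and Corollary 3.4 are presented in Appendix C. To simplify the bound (8), we assume that $L\Delta_f, H^* = O(1)$ in order to highlight the dependence on $\varepsilon$ and $n$. Then (8) can be simplified to

$$\mathbb{E} C_{comp}(\varepsilon) = O\left(\frac{\log^5(1/\varepsilon)}{\varepsilon^{5/3}} \wedge n^{5/3} + \frac{n^{2/3} \log n}{\varepsilon}\right) = \tilde{O}\left(\frac{1}{\varepsilon^{5/3}} \wedge n^{5/3} + \frac{n^{2/3}}{\varepsilon}\right) = \tilde{O}\left(\frac{1}{\varepsilon^{5/3}} \wedge \frac{n^{2/3}}{\varepsilon}\right).$$

The log-factor $\log^5(1/\varepsilon)$ is purely an artifact of our proof. It can be reduced to a smaller power of the logarithm by setting $B_j \sim j^{3/2} (\log j)^{3/2 + \mu}$ for any $\mu > 0$; see the remark in Appendix C.

3.3 Convergence analysis for P-L objectives

When the component $f_i(x)$ satisfies the P-L condition, it is known that the global minimum can be found efficiently by SGD [15] and SVRG-type algorithms [24, 4]. Similarly, SCSG can also achieve this. As in the last subsection, we start from a generic result to bound $\mathbb{E}(f(\tilde{x}_T) - f^*)$ and then consider specific settings of the parameters as well as their complexity bounds.

Theorem 3.5 Let $\lambda_j = \frac{5 L b_j^{1/3}}{\mu\gamma B_j^{1/3} + 5 L b_j^{1/3}}$. Then under the same settings of Theorem 3.2,

$$\mathbb{E}(f(\tilde{x}_T) - f^*) \le \lambda_T \lambda_{T-1} \cdots \lambda_1 \cdot \Delta_f + 6\gamma H^* \cdot \sum_{j=1}^{T} \frac{\lambda_T \lambda_{T-1} \cdots \lambda_{j+1} \cdot I(B_j < n)}{\mu\gamma B_j + 5 L b_j^{1/3} B_j^{2/3}}.$$

The proofs and additional discussion are presented in Appendix D. Again, Theorem 3.5 covers existing complexity bounds for both SGD and SVRG. In fact, when $B_j = b_j \equiv B$ as in SGD, via some calculation, we obtain that

$$\mathbb{E}(f(\tilde{x}_T) - f^*) = O\left(\left(\frac{L}{\mu + L}\right)^T \Delta_f + \frac{H^*}{\mu B}\right).$$

The second term can be made $O(\varepsilon)$ by setting $B = O\big(\frac{H^*}{\mu\varepsilon}\big)$, in which case $T(\varepsilon) = O\big(\frac{L}{\mu} \log\frac{\Delta_f}{\varepsilon}\big)$. As a result, the average cost to reach an $\varepsilon$-accurate solution is $\mathbb{E} C_{comp}(\varepsilon) = O\big(\frac{L H^*}{\mu^2 \varepsilon}\big)$, which is the same as [15]. On the other hand, when $B_j \equiv n$ and $b_j \equiv 1$ as in SVRG, Theorem 3.5 implies that

$$\mathbb{E}(f(\tilde{x}_T) - f^*) = O\left(\left(\frac{L}{\mu n^{1/3} + L}\right)^T \Delta_f\right).$$

This entails that $T(\varepsilon) = O\big(\big(1 + \frac{1}{\mu n^{1/3}}\big) \log\frac{1}{\varepsilon}\big)$ and hence $\mathbb{E} C_{comp}(\varepsilon) = O\big(\big(n + \frac{n^{2/3}}{\mu}\big) \log\frac{1}{\varepsilon}\big)$, which is the same as [24].

By leveraging the batch and mini-batch sizes, we obtain a counterpart of Corollary 3.3 as below.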

Corollary 3.6 Set

$$b_j \equiv 1, \quad B_j \equiv B = \min\left\{\frac{12 H^*}{\mu\varepsilon}, n\right\}, \quad \eta_j \equiv \eta = \frac{1}{6 L B^{2/3}}.$$

Then it holds that

$$\mathbb{E} C_{comp}(\varepsilon) = O\left(\left\{\left(\frac{H^*}{\mu\varepsilon} \wedge n\right) + \frac{1}{\mu}\left(\frac{H^*}{\mu\varepsilon} \wedge n\right)^{2/3}\right\} \log\frac{\Delta_f}{\varepsilon}\right).$$

Recalling the results from Table 1, SCSG is $O\big(\frac{1}{\mu + (\mu\varepsilon)^{1/3}}\big)$ faster than SGD and is never worse than SVRG. When both $\mu$ and $\varepsilon$ are moderate, the acceleration of SCSG over SVRG is significant. Unlike the smooth case, we do not find any possible choice of setting that can achieve a better rate than Corollary 3.6.

4 Experiments

We evaluate SCSG and mini-batch SGD on the MNIST dataset with (1) a three-layer fully-connected neural network with 512 neurons in each layer (FCN for short) and (2) a standard convolutional neural network LeNet [18] (CNN for short), which has two convolutional layers with 32 and 64 filters of size 5 x 5 respectively, followed by two fully-connected layers with output size 1024 and 10. Max pooling is applied after each convolutional layer. The MNIST dataset of handwritten digits has 50,000 training examples and 10,000 test examples. The digits have been size-normalized and centered in a fixed-size image. Each image is 28 pixels by 28 pixels. All experiments were carried out on an Amazon p2.xlarge node with an NVIDIA GK210 GPU with algorithms implemented in TensorFlow 1.0.

Due to memory issues, sampling a chunk of data is costly. We avoid this by modifying the inner loop: instead of sampling mini-batches from the whole dataset, we split the batch $I_j$ into $B_j/b_j$ mini-batches and run SVRG-type updates sequentially on each. Despite the theoretical advantage of setting $b_j = 1$, we consider practical settings $b_j > 1$ to take advantage of the acceleration obtained by vectorization. We initialized parameters by TensorFlow's default Xavier uniform initializer. In all experiments below, we show the results corresponding to the best-tuned stepsizes.

We consider three algorithms: (1) SGD with a fixed batch size $B \in \{512, 1024\}$; (2) SCSG with a fixed batch size $B \in \{512, 1024\}$ and a fixed mini-batch size $b = 32$; (3) SCSG with time-varying batch sizes $B_j = \lceil j^{3/2} \wedge n \rceil$ and $b_j = \lceil B_j/32 \rceil$. To be clear, given $T$ epochs, the IFO complexity of the three algorithms are $TB$, $2TB$ and $2\sum_{j=1}^{T} B_j$, respectively. We run each algorithm with 20 passes of data. It is worth mentioning that the largest batch size in Algorithm 3 is $\lceil 275^{1.5} \rceil = 4561$, which is relatively small compared to the sample size.

We plot in Figure 1 the training and the validation loss against the IFO complexity (i.e., the number of passes of data) for fair comparison. In all cases, both versions of SCSG outperform SGD, especially in terms of training loss. SCSG with time-varying batch sizes always has the best performance and it is more stable than SCSG with a fixed batch size. For the latter, the acceleration is more significant after increasing the batch size to 1024. Both versions of SCSG provide strong evidence that variance reduction can be achieved efficiently without evaluating the full gradient.

Given $2B$ IFO calls, SGD implements updates on two fresh batches while SCSG replaces the second batch by a sequence of variance-reduced updates. Thus, Figure 1 shows that the gain due to variance reduction is significant when the batch size is fixed. To further explore this, we compare SCSG with time-varying batch sizes to SGD with the same sequence of batch sizes. The results corresponding to

the best-tuned constant stepsizes are plotted in Figure 2a. It is clear that the benefit from variance reduction is more significant when using time-varying batch sizes. We also compare the performance of SGD with that of SCSG with time-varying batch sizes against wall clock time, when both algorithms are implemented in TensorFlow and run on an Amazon p2.xlarge node with an NVIDIA GK210 GPU. Due to the cost of computing variance reduction terms in SCSG, each update of SCSG is slower per iteration compared to SGD. However, SCSG makes faster progress in terms of both training loss and validation loss compared to SGD in wall clock time. The results are shown in Figure 3.

[Figure 1 (plot not preserved). Panels: CNN with B = 512, CNN with B = 1024, FCN with B = 512, FCN with B = 1024. Curves in each panel: SGD; SCSG with fixed B and b = 32; SCSG with $B = j^{1.5}$, B/b = 32. Vertical axes: training log-loss (top row), validation log-loss (bottom row).]
Figure 1: Comparison between two versions of SCSG and mini-batch SGD of training loss (top row) and validation loss (bottom row) against the number of IFO calls. The loss is plotted on a log-scale. Each column represents an experiment with the setup printed on the top.

[Figure 3 (plot not preserved). Panels: CNN training log-loss and CNN validation log-loss against wall clock time (in seconds). Curves: SCSG with $B = j^{1.5}$, B/b = 16; SGD with $B = j^{1.5}$.]
Figure 3: Comparison between SCSG and mini-batch SGD of training loss and validation loss with a CNN loss, against wall clock time. The loss is plotted on a log-scale.

Finally, we examine the effect of $B_j/b_j$, namely the number of mini-batches within an iteration, since it affects the efficiency in practice where the computation time is not proportional to the batch size. Figure 2b shows the results for SCSG with $B_j = \lceil j^{3/2} \wedge n \rceil$ and $\lceil B_j/b_j \rceil \in \{2, 5, 10, 16, 32\}$. In general, larger $B_j/b_j$ yields better performance. It would be interesting to explore the tradeoff between computation efficiency and this ratio on different platforms.

5 Discussion

We have presented the SCSG method for smooth, non-convex, finite-sum optimization problems. SCSG is the first algorithm that achieves a uniformly better rate than SGD and is never worse than SVRG-type algorithms. When the target accuracy is low, SCSG significantly outperforms the SVRG-type algorithms. Unlike various other variants of SVRG, SCSG is clean in terms of both

implementation and analysis. Empirically, SCSG outperforms SGD in the training of multi-layer neural networks.

Although we only consider the finite-sum objective in this paper, it is straightforward to extend SCSG to the general stochastic optimization problems where the objective can be written as $\mathbb{E}_{\xi \sim F} f(x; \xi)$: at the beginning of the $j$-th epoch a batch of i.i.d. samples $(\xi_1, \ldots, \xi_{B_j})$ is drawn from the distribution $F$ and

$$g_j = \frac{1}{B_j} \sum_{i=1}^{B_j} \nabla f(\tilde{x}_{j-1}; \xi_i) \quad \text{(see line 3 of Algorithm 1)};$$

at the $k$-th step, a fresh sample $(\tilde{\xi}_1^{(k)}, \ldots, \tilde{\xi}_{b_j}^{(k)})$ is drawn from the distribution $F$ and

$$\nu_{k-1}^{(j)} = \frac{1}{b_j} \sum_{i=1}^{b_j} \nabla f(x_{k-1}^{(j)}; \tilde{\xi}_i^{(k)}) - \frac{1}{b_j} \sum_{i=1}^{b_j} \nabla f(x_0^{(j)}; \tilde{\xi}_i^{(k)}) + g_j \quad \text{(see line 8 of Algorithm 1)}.$$

Our proof directly carries over to this case, by simply suppressing the term $I(B_j < n)$, and yields the bound $\tilde{O}(\varepsilon^{-5/3})$ for smooth non-convex objectives and the bound $\tilde{O}(\mu^{-1}\varepsilon^{-1} + \mu^{-5/3}\varepsilon^{-2/3})$ for P-L objectives. These bounds are simply obtained by setting $n = \infty$ in our convergence analysis.

Compared to momentum-based methods [26] and methods with adaptive stepsizes [10, 16], the mechanism whereby SCSG achieves acceleration is qualitatively different: while momentum aims at balancing primal and dual gaps [5], adaptive stepsizes aim at balancing the scale of each coordinate, and variance reduction aims at removing the noise. We believe that an algorithm that combines these three techniques is worthy of further study, especially in the training of deep neural networks where the target accuracy is moderate.

[Figure 2 (plot not preserved). (a) SCSG and SGD with increasing batch sizes: training log-loss for CNN and FCN. (b) SCSG with different $B_j/b_j$: training log-loss for CNN and FCN, with B/b in {2, 5, 10, 16, 32}.]

Acknowledgments

The authors thank Zeyuan Allen-Zhu, Chi Jin, Nilesh Tripuraneni, Yi Xu, Tianbao Yang, Shenyi Zhao and anonymous reviewers for helpful discussions.

References

[1] Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. ArXiv e-prints abs/1410.0723, 2014.
[2] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
[3] Zeyuan Allen-Zhu. Natasha: Faster stochastic non-convex optimization via strongly non-convex parameter. arXiv preprint arXiv:1702.00763, 2017.
[4] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. ArXiv e-prints abs/1603.05643, 2016.

[5] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.
[6] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. ArXiv e-prints, abs/1506.01972, 2015.
[7] Dimitri P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913-926, 1997.
[8] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.
[9] Yair Carmon, Oliver Hinder, John C. Duchi, and Aaron Sidford. Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. arXiv preprint, 2017.
[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
[11] Alexei A. Gaivoronski. Convergence properties of backpropagation for neural nets via theory of stochastic gradient methods. Part 1. Optimization Methods and Software, 4(2):117-134, 1994.
[12] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[13] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59-99, 2016.
[14] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečnỳ, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2242-2250, 2015.
[15] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Lojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016.
[16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[19] Lihua Lei and Michael I. Jordan. Less than a single pass: Stochastically controlled stochastic gradient method. arXiv preprint arXiv:1609.03261, 2016.
[20] Peter McCullagh and John A. Nelder. Generalized Linear Models. CRC Press, 1989.
[21] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009.
[22] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.
[23] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.
[24] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1603.06160, 2016.

[25] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159, 2016.
[26] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. ICML, 28:1139-1147, 2013.
[27] Paul Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506-531, 1998.

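A Technical Lemmas

In this section we present several technical lemmas that facilitate the proofs of our main results. We start with a lemma on the variance of the sample mean (without replacement).

Lemma A.1 Let $x_1, \ldots, x_M \in \mathbb{R}^d$ be an arbitrary population of $M$ vectors with $\sum_{j=1}^{M} x_j = 0$. Further let $J$ be a uniform random subset of $\{1, \ldots, M\}$ with size $m$. Then

$$\mathbb{E}\left\|\frac{1}{m}\sum_{j \in J} x_j\right\|^2 = \frac{M - m}{(M - 1)m} \cdot \frac{1}{M}\sum_{j=1}^{M}\|x_j\|^2 \le \frac{I(m < M)}{m} \cdot \frac{1}{M}\sum_{j=1}^{M}\|x_j\|^2.$$

Proof Let $W_j = I(j \in J)$, then it is easy to see that

$$\mathbb{E} W_j^2 = \mathbb{E} W_j = \frac{m}{M}, \quad \mathbb{E} W_j W_{j'} = \frac{m(m-1)}{M(M-1)}. \qquad (9)$$

Then the sample mean can be rewritten as

$$\frac{1}{m}\sum_{j \in J} x_j = \frac{1}{m}\sum_{j=1}^{M} W_j x_j.$$

This implies that

$$\mathbb{E}\left\|\frac{1}{m}\sum_{j \in J} x_j\right\|^2 = \frac{1}{m^2}\left(\sum_{j=1}^{M} \mathbb{E} W_j^2 \|x_j\|^2 + \sum_{j \ne j'} \mathbb{E} W_j W_{j'} \langle x_j, x_{j'} \rangle\right)$$
$$= \frac{1}{m^2}\left(\frac{m}{M}\sum_{j=1}^{M}\|x_j\|^2 + \frac{m(m-1)}{M(M-1)}\sum_{j \ne j'}\langle x_j, x_{j'} \rangle\right)$$
$$= \frac{1}{m^2}\left(\left(\frac{m}{M} - \frac{m(m-1)}{M(M-1)}\right)\sum_{j=1}^{M}\|x_j\|^2 + \frac{m(m-1)}{M(M-1)}\left\|\sum_{j=1}^{M} x_j\right\|^2\right)$$
$$= \frac{1}{m^2}\left(\frac{m}{M} - \frac{m(m-1)}{M(M-1)}\right)\sum_{j=1}^{M}\|x_j\|^2 = \frac{M - m}{(M-1)m} \cdot \frac{1}{M}\sum_{j=1}^{M}\|x_j\|^2.$$

Since the geometric random variable $N_j$ plays an important role in our analysis, we recall its key properties below.

Lemma A.2 Let $N \sim \mathrm{Geom}(\gamma)$ for some $\gamma \in (0, 1)$. Then for any sequence $D_0, D_1, \ldots$,

$$\mathbb{E}(D_N - D_{N+1}) = \left(\frac{1}{\gamma} - 1\right)(D_0 - \mathbb{E} D_N).$$

Proof By definition,

$$\mathbb{E}(D_N - D_{N+1}) = \sum_{n \ge 0}(D_n - D_{n+1}) \cdot \gamma^n (1 - \gamma)$$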

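$$= (1 - \gamma)\left(D_0 - \sum_{n \ge 1} D_n(\gamma^{n-1} - \gamma^n)\right) = (1 - \gamma)\left(\frac{1}{\gamma} D_0 - \sum_{n \ge 0} D_n(\gamma^{n-1} - \gamma^n)\right)$$
$$= (1 - \gamma)\left(\frac{1}{\gamma} D_0 - \frac{1}{\gamma}\sum_{n \ge 0} D_n \gamma^n (1 - \gamma)\right) = \left(\frac{1}{\gamma} - 1\right)(D_0 - \mathbb{E} D_N).$$

Lemma A3 For any η > and z > 0, define g η x and xz as Then g η x = + log x x η, xz = z η η log z η g η x z x xz Proof For any x xz, denote α = x/z η Then α η log z η and g η x = + log α + η log + log α z η log z α η z z α η Taking the logarithm of both sides, we obtain that log g η x log z log + log α η log α + log η log z log η log z η log α 0

B One-Epoch Analysis

As in the standard analysis of stochastic gradient methods, we start by establishing a bound on $\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2$ and $\mathbb{E}_{I_j}\|e_j\|^2$.

Lemma B.1 Under Assumption A1,

$$\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2 \le \frac{L^2}{b_j}\|x_k^{(j)} - x_0^{(j)}\|^2 + 2\|\nabla f(x_k^{(j)})\|^2 + 2\|e_j\|^2.$$

Proof Using the fact that $\mathbb{E}\|Z\|^2 = \mathbb{E}\|Z - \mathbb{E}Z\|^2 + \|\mathbb{E}Z\|^2$ for any random variable $Z$, we have

$$\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2 = \mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)} - \mathbb{E}_{\tilde{I}_k}\nu_k^{(j)}\|^2 + \|\mathbb{E}_{\tilde{I}_k}\nu_k^{(j)}\|^2$$
$$= \mathbb{E}_{\tilde{I}_k}\left\|\nabla f_{\tilde{I}_k}(x_k^{(j)}) - \nabla f_{\tilde{I}_k}(x_0^{(j)}) - \left(\nabla f(x_k^{(j)}) - \nabla f(x_0^{(j)})\right)\right\|^2 + \|\nabla f(x_k^{(j)}) + e_j\|^2$$
$$\le \mathbb{E}_{\tilde{I}_k}\left\|\nabla f_{\tilde{I}_k}(x_k^{(j)}) - \nabla f_{\tilde{I}_k}(x_0^{(j)}) - \left(\nabla f(x_k^{(j)}) - \nabla f(x_0^{(j)})\right)\right\|^2 + 2\|\nabla f(x_k^{(j)})\|^2 + 2\|e_j\|^2.$$

By Lemma A.1,

$$\mathbb{E}_{\tilde{I}_k}\left\|\nabla f_{\tilde{I}_k}(x_k^{(j)}) - \nabla f_{\tilde{I}_k}(x_0^{(j)}) - \left(\nabla f(x_k^{(j)}) - \nabla f(x_0^{(j)})\right)\right\|^2$$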

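$$\le \frac{1}{b_j} \cdot \frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_i(x_k^{(j)}) - \nabla f_i(x_0^{(j)}) - \left(\nabla f(x_k^{(j)}) - \nabla f(x_0^{(j)})\right)\right\|^2$$
$$= \frac{1}{b_j}\left(\frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_i(x_k^{(j)}) - \nabla f_i(x_0^{(j)})\right\|^2 - \left\|\nabla f(x_k^{(j)}) - \nabla f(x_0^{(j)})\right\|^2\right)$$
$$\le \frac{1}{b_j} \cdot \frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_i(x_k^{(j)}) - \nabla f_i(x_0^{(j)})\right\|^2 \le \frac{L^2}{b_j}\|x_k^{(j)} - x_0^{(j)}\|^2,$$

where the last line uses Assumption A1. Therefore,

$$\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2 \le \frac{L^2}{b_j}\|x_k^{(j)} - x_0^{(j)}\|^2 + 2\|\nabla f(x_k^{(j)})\|^2 + 2\|e_j\|^2.$$

Lemma B.2

$$\mathbb{E}_{I_j}\|e_j\|^2 \le \frac{I(B_j < n)}{B_j} \cdot H^*.$$

Proof Since $\tilde{x}_{j-1}$ is independent of $I_j$, conditioning on $\tilde{x}_{j-1}$ and applying Lemma A.1, we have

$$\mathbb{E}_{I_j}\|e_j\|^2 = \frac{n - B_j}{(n-1)B_j} \cdot \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(\tilde{x}_{j-1}) - \nabla f(\tilde{x}_{j-1})\|^2 \le \frac{n - B_j}{(n-1)B_j} \cdot H^* \le \frac{I(B_j < n)}{B_j} \cdot H^*.$$

Based on Lemma A.2, Lemma B.1 and Lemma B.2, we can derive bounds for primal and dual gaps respectively.

Lemma B.3 Suppose $\eta_j L < 1$, then under Assumption A1,

$$\eta_j B_j (1 - \eta_j L)\,\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 + \eta_j B_j\,\mathbb{E}\langle e_j, \nabla f(\tilde{x}_j)\rangle \le b_j\,\mathbb{E}\left(f(\tilde{x}_{j-1}) - f(\tilde{x}_j)\right) + \frac{L^3 \eta_j^2 B_j}{2 b_j}\,\mathbb{E}\|\tilde{x}_j - \tilde{x}_{j-1}\|^2 + L \eta_j^2 B_j\,\mathbb{E}\|e_j\|^2, \qquad (10)$$

where $\mathbb{E}$ denotes the expectation with respect to all randomness.

Proof By (4),

$$\mathbb{E}_{\tilde{I}_k} f(x_{k+1}^{(j)}) \le f(x_k^{(j)}) - \eta_j\langle \mathbb{E}_{\tilde{I}_k}\nu_k^{(j)}, \nabla f(x_k^{(j)})\rangle + \frac{L\eta_j^2}{2}\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2$$
$$= f(x_k^{(j)}) - \eta_j\|\nabla f(x_k^{(j)})\|^2 - \eta_j\langle e_j, \nabla f(x_k^{(j)})\rangle + \frac{L\eta_j^2}{2}\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2$$
$$\le f(x_k^{(j)}) - \eta_j(1 - \eta_j L)\|\nabla f(x_k^{(j)})\|^2 - \eta_j\langle e_j, \nabla f(x_k^{(j)})\rangle + \frac{L^3\eta_j^2}{2 b_j}\|x_k^{(j)} - x_0^{(j)}\|^2 + L\eta_j^2\|e_j\|^2 \quad \text{(Lemma B.1)}. \qquad (11)$$

Let $\mathbb{E}_j$ denote the expectation over $\tilde{I}_0, \tilde{I}_1, \ldots$, given $N_j$. Note that $\mathbb{E}_j$ is equivalent to the expectation over $\tilde{I}_0, \tilde{I}_1, \ldots$ as $N_j$ is independent of them. Since $\tilde{I}_{k+1}, \tilde{I}_{k+2}, \ldots$ are independent of $x_k^{(j)}$, the above inequality implies that

$$\eta_j(1 - \eta_j L)\,\mathbb{E}_j\|\nabla f(x_k^{(j)})\|^2 + \eta_j\,\mathbb{E}_j\langle e_j, \nabla f(x_k^{(j)})\rangle$$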

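$$\le \mathbb{E}_j f(x_k^{(j)}) - \mathbb{E}_j f(x_{k+1}^{(j)}) + \frac{L^3\eta_j^2}{2 b_j}\mathbb{E}_j\|x_k^{(j)} - x_0^{(j)}\|^2 + L\eta_j^2\|e_j\|^2. \qquad (12)$$

Let $k = N_j$ in (12). By taking expectation with respect to $N_j$ and using Fubini's theorem, we arrive at

$$\eta_j(1 - \eta_j L)\,\mathbb{E}_{N_j}\mathbb{E}_j\|\nabla f(x_{N_j}^{(j)})\|^2 + \eta_j\,\mathbb{E}_{N_j}\mathbb{E}_j\langle e_j, \nabla f(x_{N_j}^{(j)})\rangle$$
$$\le \mathbb{E}_{N_j}\left(\mathbb{E}_j f(x_{N_j}^{(j)}) - \mathbb{E}_j f(x_{N_j+1}^{(j)})\right) + \frac{L^3\eta_j^2}{2 b_j}\mathbb{E}_{N_j}\mathbb{E}_j\|x_{N_j}^{(j)} - x_0^{(j)}\|^2 + L\eta_j^2\|e_j\|^2$$
$$= \frac{b_j}{B_j}\left(f(x_0^{(j)}) - \mathbb{E}_j\mathbb{E}_{N_j} f(x_{N_j}^{(j)})\right) + \frac{L^3\eta_j^2}{2 b_j}\mathbb{E}_j\mathbb{E}_{N_j}\|x_{N_j}^{(j)} - x_0^{(j)}\|^2 + L\eta_j^2\|e_j\|^2 \quad \text{(Lemma A.2)}. \qquad (13)$$

The lemma is then proved by substituting $x_{N_j}^{(j)}$, $x_0^{(j)}$ by $\tilde{x}_j$, $\tilde{x}_{j-1}$, and taking an expectation over all past randomness.

Lemma B.4 Suppose $\eta_j^2 L^2 B_j < b_j^2$, then under Assumption A1,

$$\left(b_j - \frac{\eta_j^2 L^2 B_j}{b_j}\right)\mathbb{E}\|\tilde{x}_j - \tilde{x}_{j-1}\|^2 + 2\eta_j B_j\,\mathbb{E}\langle e_j, \tilde{x}_j - \tilde{x}_{j-1}\rangle \le -2\eta_j B_j\,\mathbb{E}\langle \nabla f(\tilde{x}_j), \tilde{x}_j - \tilde{x}_{j-1}\rangle + 2\eta_j^2 B_j\,\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 + 2\eta_j^2 B_j\,\mathbb{E}\|e_j\|^2. \qquad (14)$$

Proof Since $x_{k+1}^{(j)} = x_k^{(j)} - \eta_j \nu_k^{(j)}$, we have

$$\mathbb{E}_{\tilde{I}_k}\|x_{k+1}^{(j)} - x_0^{(j)}\|^2 = \|x_k^{(j)} - x_0^{(j)}\|^2 - 2\eta_j\langle \mathbb{E}_{\tilde{I}_k}\nu_k^{(j)}, x_k^{(j)} - x_0^{(j)}\rangle + \eta_j^2\,\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2$$
$$= \|x_k^{(j)} - x_0^{(j)}\|^2 - 2\eta_j\langle \nabla f(x_k^{(j)}), x_k^{(j)} - x_0^{(j)}\rangle - 2\eta_j\langle e_j, x_k^{(j)} - x_0^{(j)}\rangle + \eta_j^2\,\mathbb{E}_{\tilde{I}_k}\|\nu_k^{(j)}\|^2$$
$$\le \left(1 + \frac{\eta_j^2 L^2}{b_j}\right)\|x_k^{(j)} - x_0^{(j)}\|^2 - 2\eta_j\langle \nabla f(x_k^{(j)}), x_k^{(j)} - x_0^{(j)}\rangle - 2\eta_j\langle e_j, x_k^{(j)} - x_0^{(j)}\rangle + 2\eta_j^2\|\nabla f(x_k^{(j)})\|^2 + 2\eta_j^2\|e_j\|^2 \quad \text{(Lemma B.1)}.$$

Using the same notation $\mathbb{E}_j$ as in the proof of Lemma B.3, we have

$$2\eta_j\,\mathbb{E}_j\langle \nabla f(x_k^{(j)}), x_k^{(j)} - x_0^{(j)}\rangle + 2\eta_j\,\mathbb{E}_j\langle e_j, x_k^{(j)} - x_0^{(j)}\rangle \le \left(1 + \frac{\eta_j^2 L^2}{b_j}\right)\mathbb{E}_j\|x_k^{(j)} - x_0^{(j)}\|^2 - \mathbb{E}_j\|x_{k+1}^{(j)} - x_0^{(j)}\|^2 + 2\eta_j^2\,\mathbb{E}_j\|\nabla f(x_k^{(j)})\|^2 + 2\eta_j^2\|e_j\|^2. \qquad (15)$$

Let $k = N_j$ in (15). By taking expectation with respect to $N_j$ and using Fubini's theorem, we arrive at

$$2\eta_j\,\mathbb{E}_{N_j}\mathbb{E}_j\langle \nabla f(x_{N_j}^{(j)}), x_{N_j}^{(j)} - x_0^{(j)}\rangle + 2\eta_j\,\mathbb{E}_{N_j}\mathbb{E}_j\langle e_j, x_{N_j}^{(j)} - x_0^{(j)}\rangle$$
$$\le \left(1 + \frac{\eta_j^2 L^2}{b_j}\right)\mathbb{E}_{N_j}\mathbb{E}_j\|x_{N_j}^{(j)} - x_0^{(j)}\|^2 - \mathbb{E}_{N_j}\mathbb{E}_j\|x_{N_j+1}^{(j)} - x_0^{(j)}\|^2 + 2\eta_j^2\,\mathbb{E}_{N_j}\mathbb{E}_j\|\nabla f(x_{N_j}^{(j)})\|^2 + 2\eta_j^2\|e_j\|^2$$
$$= \left(-\frac{b_j}{B_j} + \frac{\eta_j^2 L^2}{b_j}\right)\mathbb{E}_{N_j}\mathbb{E}_j\|x_{N_j}^{(j)} - x_0^{(j)}\|^2 + 2\eta_j^2\,\mathbb{E}_{N_j}\mathbb{E}_j\|\nabla f(x_{N_j}^{(j)})\|^2 + 2\eta_j^2\|e_j\|^2 \quad \text{(Lemma A.2)}. \qquad (16)$$

The lemma is then proved by substituting $x_{N_j}^{(j)}$, $x_0^{(j)}$ by $\tilde{x}_j$, $\tilde{x}_{j-1}$ and taking a further expectation with respect to the past randomness.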

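Lemma B.5

$$b_j\,\mathbb{E}\langle e_j, \tilde{x}_j - \tilde{x}_{j-1}\rangle = -\eta_j B_j\,\mathbb{E}\langle e_j, \nabla f(\tilde{x}_j)\rangle - \eta_j B_j\,\mathbb{E}\|e_j\|^2. \qquad (17)$$

Proof Let $M_k^{(j)} = \langle e_j, x_k^{(j)} - x_0^{(j)}\rangle$. By definition, we have $\mathbb{E}_{N_j}\langle e_j, \tilde{x}_j - \tilde{x}_{j-1}\rangle = \mathbb{E}_{N_j} M_{N_j}^{(j)}$. Since $N_j$ is independent of $(x_0^{(j)}, e_j)$, this implies that $\mathbb{E}\langle e_j, \tilde{x}_j - \tilde{x}_{j-1}\rangle = \mathbb{E} M_{N_j}^{(j)}$. Also we have $M_0^{(j)} = 0$. On the other hand,

$$\mathbb{E}_{\tilde{I}_k}\left(M_{k+1}^{(j)} - M_k^{(j)}\right) = \mathbb{E}_{\tilde{I}_k}\langle e_j, x_{k+1}^{(j)} - x_k^{(j)}\rangle = -\eta_j\langle e_j, \mathbb{E}_{\tilde{I}_k}\nu_k^{(j)}\rangle = -\eta_j\langle e_j, \nabla f(x_k^{(j)})\rangle - \eta_j\|e_j\|^2.$$

Using the same notation $\mathbb{E}_j$ as in the proof of Lemma B.3 and Lemma B.4, we have

$$\mathbb{E}_j\left(M_{k+1}^{(j)} - M_k^{(j)}\right) = -\eta_j\langle e_j, \mathbb{E}_j\nabla f(x_k^{(j)})\rangle - \eta_j\|e_j\|^2. \qquad (18)$$

Let $k = N_j$ in (18). By taking an expectation with respect to $N_j$ and using Lemma A.2, we obtain that

$$\frac{b_j}{B_j}\,\mathbb{E}_{N_j} M_{N_j}^{(j)} = -\eta_j\langle e_j, \mathbb{E}_{N_j}\mathbb{E}_j\nabla f(x_{N_j}^{(j)})\rangle - \eta_j\|e_j\|^2.$$

The lemma is then proved by substituting $x_{N_j}^{(j)}$, $x_0^{(j)}$ by $\tilde{x}_j$, $\tilde{x}_{j-1}$ and taking a further expectation with respect to the past randomness.

Proof [Theorem 3.1] Multiplying equation (10) by 2, equation (14) by $\frac{b_j}{\eta_j B_j}$ and summing them, we obtain that

$$2\eta_j B_j\left(1 - \eta_j L - \frac{b_j}{B_j}\right)\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 + \frac{b_j^3 - \eta_j^2 L^2 b_j B_j - \eta_j^3 L^3 B_j^2}{\eta_j b_j B_j}\,\mathbb{E}\|\tilde{x}_j - \tilde{x}_{j-1}\|^2$$
$$+ 2\eta_j B_j\,\mathbb{E}\langle e_j, \nabla f(\tilde{x}_j)\rangle + 2 b_j\,\mathbb{E}\langle e_j, \tilde{x}_j - \tilde{x}_{j-1}\rangle$$
$$\le -2 b_j\,\mathbb{E}\langle \nabla f(\tilde{x}_j), \tilde{x}_j - \tilde{x}_{j-1}\rangle + 2 b_j\,\mathbb{E}\left(f(\tilde{x}_{j-1}) - f(\tilde{x}_j)\right) + \left(2 L\eta_j^2 B_j + 2\eta_j b_j\right)\mathbb{E}\|e_j\|^2. \qquad (19)$$

By Lemma B.5, the second row can be simplified as

$$2\eta_j B_j\,\mathbb{E}\langle e_j, \nabla f(\tilde{x}_j)\rangle + 2 b_j\,\mathbb{E}\langle e_j, \tilde{x}_j - \tilde{x}_{j-1}\rangle = -2\eta_j B_j\,\mathbb{E}\|e_j\|^2.$$

Using the fact that $2\langle a, b\rangle \le \beta\|a\|^2 + \frac{1}{\beta}\|b\|^2$ for any $\beta > 0$, we have

$$-2 b_j\,\mathbb{E}\langle \nabla f(\tilde{x}_j), \tilde{x}_j - \tilde{x}_{j-1}\rangle \le \frac{\eta_j b_j^3 B_j}{b_j^3 - \eta_j^2 L^2 b_j B_j - \eta_j^3 L^3 B_j^2}\,\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 + \frac{b_j^3 - \eta_j^2 L^2 b_j B_j - \eta_j^3 L^3 B_j^2}{\eta_j b_j B_j}\,\mathbb{E}\|\tilde{x}_j - \tilde{x}_{j-1}\|^2.$$

Putting the pieces together, we conclude that

$$\eta_j B_j\left(2 - \frac{2 b_j}{B_j} - 2\eta_j L - \frac{b_j^3}{b_j^3 - \eta_j^2 L^2 b_j B_j - \eta_j^3 L^3 B_j^2}\right)\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 \le 2 b_j\,\mathbb{E}\left(f(\tilde{x}_{j-1}) - f(\tilde{x}_j)\right) + 2\eta_j B_j\left(1 + \eta_j L + \frac{b_j}{B_j}\right)\mathbb{E}\|e_j\|^2. \qquad (20)$$

Since $\eta_j L = \gamma (b_j/B_j)^{2/3}$ and $b_j \ge 1$, $B_j \ge 8 b_j \ge 8$,

$$b_j^3 - \eta_j^2 L^2 b_j B_j - \eta_j^3 L^3 B_j^2 = b_j^3\left(1 - \gamma^2 b_j^{-2/3} B_j^{-1/3} - \gamma^3 b_j^{-1}\right) \ge b_j^3\left(1 - \frac{\gamma^2}{2} - \gamma^3\right).$$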

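Then (20) can be simplified, using the bound $\mathbb{E}\|e_j\|^2 \le I(B_j < n)H^*/B_j$ from Appendix B, as

$$\gamma\Big(\frac{B_j}{b_j}\Big)^{1/3}\Big(1 - \gamma\Big(\frac{b_j}{B_j}\Big)^{2/3} - \frac{b_j}{B_j} - \frac{1}{2(1 - \gamma^2/2 - \gamma^3)}\Big)\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 \le L\,\mathbb{E}\big(f(\tilde{x}_{j-1}) - f(\tilde{x}_j)\big) + \gamma\Big(\frac{B_j}{b_j}\Big)^{1/3}\Big(1 + \gamma\Big(\frac{b_j}{B_j}\Big)^{2/3} + \frac{b_j}{B_j}\Big)\frac{I(B_j < n)}{B_j}H^*. \tag{21}$$

Since $B_j \ge 8b_j$ and $\gamma \le 1/3$, we have

$$1 - \gamma\Big(\frac{b_j}{B_j}\Big)^{2/3} - \frac{b_j}{B_j} - \frac{1}{2(1 - \gamma^2/2 - \gamma^3)} \ge \frac{7}{8} - \frac{\gamma}{4} - \frac{1}{2(1 - \gamma^2/2 - \gamma^3)} \ge 0.24$$

and

$$1 + \gamma\Big(\frac{b_j}{B_j}\Big)^{2/3} + \frac{b_j}{B_j} \le 1 + \frac{\gamma}{4} + \frac{1}{8} < 1.21.$$

Thus, (21) implies that

$$\Big(\frac{B_j}{b_j}\Big)^{1/3}\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 \le \frac{5L}{\gamma}\,\mathbb{E}\big(f(\tilde{x}_{j-1}) - f(\tilde{x}_j)\big) + \frac{6\,I(B_j < n)}{b_j^{1/3}B_j^{2/3}}H^*. \tag{22}$$

C Convergence Analysis for Smooth Objectives

Proof [Theorem 3.2]. Since $\tilde{x}_T^*$ is a random element from $(\tilde{x}_j)_{j=1}^T$ with

$$P(\tilde{x}_T^* = \tilde{x}_j) \propto \frac{\eta_jB_j}{b_j} \propto \Big(\frac{B_j}{b_j}\Big)^{1/3},$$

summing (22) over $j = 1, \ldots, T$ gives

$$\mathbb{E}\|\nabla f(\tilde{x}_T^*)\|^2 \le \frac{\frac{5L}{\gamma}\,\mathbb{E}\big(f(\tilde{x}_0) - f(\tilde{x}_T)\big) + 6\Big(\sum_{j=1}^T b_j^{-1/3}B_j^{-2/3}I(B_j < n)\Big)H^*}{\sum_{j=1}^T b_j^{-1/3}B_j^{1/3}} \le \frac{\frac{5L}{\gamma}\big(f(\tilde{x}_0) - f^*\big) + 6\Big(\sum_{j=1}^T b_j^{-1/3}B_j^{-2/3}I(B_j < n)\Big)H^*}{\sum_{j=1}^T b_j^{-1/3}B_j^{1/3}}.$$

Proof [Corollary 3.3]. Write $\Delta_f = f(\tilde{x}_0) - f^*$. By Theorem 3.2 with $b_j \equiv 1$, $B_j \equiv B$ and $\gamma = 1/6$,

$$\mathbb{E}\|\nabla f(\tilde{x}_T^*)\|^2 \le \frac{30L\Delta_f + 6TB^{-2/3}I(B < n)H^*}{TB^{1/3}} = \frac{30L\Delta_f}{TB^{1/3}} + \frac{6H^*I(B < n)}{B}.$$

Let $T_\epsilon$ be the minimum number of epochs such that

$$\frac{30L\Delta_f}{TB^{1/3}} \le \frac{\epsilon}{2}.$$

Then under the setting of the corollary, for any $T \ge T_\epsilon$,

$$\mathbb{E}\|\nabla f(\tilde{x}_T^*)\|^2 \le \frac{\epsilon}{2} + \frac{6H^*I(B < n)}{B} \le \epsilon.$$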

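By definition, we know that $T(\epsilon) \le T_\epsilon$. Noticing that

$$T_\epsilon = O\Big(1 + \frac{L\Delta_f}{\epsilon B^{1/3}}\Big),$$

we conclude that

$$\mathbb{E}C_{\mathrm{comp}}(\epsilon) = O(T(\epsilon)B) = O(T_\epsilon B) = O\Big(B + \frac{L\Delta_f B^{2/3}}{\epsilon}\Big).$$

The corollary is then proved by substituting the value of $B$.

Proof [Corollary 3.4]. Write $\Delta_f = f(\tilde{x}_0) - f^*$. By Theorem 3.2 with $b_j \equiv 1$, $B_j = j^{3/2} \wedge n$ and $\gamma = 1/6$,

$$\mathbb{E}\|\nabla f(\tilde{x}_T^*)\|^2 \le \frac{30L\Delta_f + 6\Big(\sum_{j=1}^T B_j^{-2/3}I(B_j < n)\Big)H^*}{\sum_{j=1}^T \big(j^{1/2} \wedge n^{1/3}\big)} \triangleq W(T). \tag{23}$$

Let $T_* = \lfloor n^{2/3}\rfloor$. First we prove that $W(T)$ is strictly decreasing. When $T \ge T_*$, the numerator is a constant and the denominator is strictly increasing. Thus, $W(T)$ is strictly decreasing on $[T_*, \infty)$. When $T < T_*$, let $a_j = 6H^*/j$ and $\tilde{a}_j = j^{1/2}$. Further let

$$U(T) = \frac{30L\Delta_f}{\sum_{j=1}^T \tilde{a}_j}, \qquad V(T) = \frac{\sum_{j=1}^T a_j}{\sum_{j=1}^T \tilde{a}_j};$$

then $W(T) = U(T) + V(T)$. It is obvious that $U(T)$ is strictly decreasing. Noticing that $a_j/\tilde{a}_j = 6H^*/j^{3/2}$ is strictly decreasing, we also conclude that $V(T)$, a weighted average of these ratios, is strictly decreasing. Therefore, $W(T)$ is strictly decreasing on $[1, T_*]$. In summary, $W(T)$ is strictly decreasing.

Now we derive the bound for $T(\epsilon)$. To do so, we distinguish two cases to analyze $W(T)$.

If $T \le T_*$, then

$$W(T) = \frac{30L\Delta_f + 6\Big(\sum_{j=1}^T j^{-1}\Big)H^*}{\sum_{j=1}^T j^{1/2}}.$$

Since $1/x$ is decreasing, we have

$$\sum_{j=1}^T \frac{1}{j} \le 1 + \int_1^T \frac{dx}{x} = 1 + \log T.$$

Similarly, since $\sqrt{x}$ is increasing, we have

$$\sum_{j=1}^T j^{1/2} \ge \int_0^T \sqrt{x}\,dx = \frac{2}{3}T^{3/2}.$$

Therefore,

$$W(T) \le \frac{45L\Delta_f + 9(1 + \log T)H^*}{T^{3/2}}.$$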
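The monotonicity of $W(T)$ and the first-case bound can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the authors' code: it takes $b_j = 1$ and $B_j = \min(j^{3/2}, n)$, which is the schedule the sums above correspond to, and uses placeholder values $L = \Delta_f = H^* = 1$.

```python
import math

def W(T, n, L=1.0, delta_f=1.0, H=1.0):
    """Right-hand side of the Theorem 3.2 bound for b_j = 1, B_j = min(j^{3/2}, n), gamma = 1/6."""
    num = 30.0 * L * delta_f
    den = 0.0
    for j in range(1, T + 1):
        Bj = min(j ** 1.5, n)
        if Bj < n:                      # indicator I(B_j < n)
            num += 6.0 * H * Bj ** (-2.0 / 3.0)
        den += Bj ** (1.0 / 3.0)
    return num / den

def W1(T, L=1.0, delta_f=1.0, H=1.0):
    """Closed-form upper bound for the case T <= T_*."""
    return (45.0 * L * delta_f + 9.0 * (1.0 + math.log(T)) * H) / T ** 1.5

n = 10 ** 6                             # here T_* is about n^{2/3} = 10^4
values = [W(T, n) for T in range(1, 201)]
is_decreasing = all(values[i] > values[i + 1] for i in range(len(values) - 1))
is_bounded = all(W(T, n) <= W1(T) + 1e-12 for T in range(1, 201))
```

The bound is loose for small $T$ (for example $W(1) = 36$ against $W_1(1) = 54$ with these placeholder constants), which is harmless because only the order in $T$ matters for the complexity result.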

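If $T > T_*$, then $B_j = n$ for all $j > T_*$, so

$$W(T) = \frac{30L\Delta_f + 6\Big(\sum_{j=1}^{T_*} j^{-1}\Big)H^*}{\sum_{j=1}^{T_*} j^{1/2} + n^{1/3}(T - T_*)} = W(T_*)\cdot\frac{\sum_{j=1}^{T_*} j^{1/2}}{\sum_{j=1}^{T_*} j^{1/2} + n^{1/3}(T - T_*)}.$$

Similar to the first case, we have

$$\sum_{j=1}^{T_*} j^{1/2} \le \int_0^{T_*}\sqrt{x}\,dx + T_*^{1/2} = \frac{2}{3}T_*^{3/2} + T_*^{1/2}.$$

Since $T_* \le n^{2/3}$, we obtain that

$$\sum_{j=1}^{T_*} j^{1/2} \le \frac{2}{3}n + n^{1/3}.$$

As a result, because $x \mapsto x/(x + c)$ is increasing for any $c \ge 0$, (23) implies that

$$W(T) \le W(T_*)\cdot\frac{\frac{2}{3}n + n^{1/3}}{\frac{2}{3}n + n^{1/3} + n^{1/3}(T - T_*)}.$$

Putting the pieces together, we obtain that

$$W(T) \le \begin{cases} \dfrac{45L\Delta_f + 9(1 + \log T)H^*}{T^{3/2}} \triangleq W_1(T) & (T \le T_*),\\[2ex] W(T_*)\cdot\dfrac{\frac{2}{3}n + n^{1/3}}{\frac{2}{3}n + n^{1/3} + n^{1/3}(T - T_*)} \triangleq W_2(T) & (T > T_*). \end{cases} \tag{24}$$

It is easy to see that both $W_1(T)$ and $W_2(T)$ are strictly decreasing and

$$\lim_{T\to\infty} W_1(T) = \lim_{T\to\infty} W_2(T) = 0.$$

Let

$$T_1(\epsilon) = \min\{T : W_1(T) \le \epsilon\}, \qquad T_2(\epsilon) = \min\{T \ge T_* : W_2(T) \le \epsilon\}.$$

Recalling that $W(T)$ is also strictly decreasing, we have

$$T(\epsilon) \le \begin{cases} T_1(\epsilon) & (W(T_*) \le \epsilon),\\ T_2(\epsilon) & (W(T_*) > \epsilon). \end{cases}$$

More concisely,

$$T(\epsilon) \le T_1(\epsilon) \wedge T_* + \big(T_2(\epsilon) - T_*\big). \tag{25}$$

To derive a bound for $T_1(\epsilon)$, let $\tilde{T}_1(\epsilon)$ be the minimum $T$ such that

$$\frac{45L\Delta_f}{T^{3/2}} \le \frac{\epsilon}{2}, \qquad \frac{9H^*(1 + \log T)}{T^{3/2}} \le \frac{\epsilon}{2}.$$

Then by the corresponding lemma in Appendix A, we have

$$T_1(\epsilon) \le \tilde{T}_1(\epsilon) = O\Big(\Big(\frac{L\Delta_f}{\epsilon}\Big)^{2/3} + \Big(\frac{H^*}{\epsilon}\Big)^{2/3}\log^2\Big(\frac{H^*}{\epsilon}\Big)\Big).$$

On the other hand, it is straightforward to derive a bound for $T_2(\epsilon)$ as

$$T_2(\epsilon) - T_* = O\Big(\frac{n^{2/3}W(T_*)}{\epsilon}\Big) = O\Big(\frac{n^{2/3}W_1(T_*)}{\epsilon}\Big) = O\Big(\frac{L\Delta_f + H^*\log n}{\epsilon n^{1/3}}\Big).$$

Therefore, we conclude that

$$T(\epsilon) = O\Big(\min\Big\{\frac{1}{\epsilon^{2/3}}\Big[(L\Delta_f)^{2/3} + (H^*)^{2/3}\log^2\Big(\frac{H^*}{\epsilon}\Big)\Big],\ n^{2/3}\Big\} + \frac{L\Delta_f + H^*\log n}{\epsilon n^{1/3}}\Big). \tag{26}$$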

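Finally, to obtain the bound for the computation complexity, we notice that

$$\sum_{j=1}^T j^{3/2} = O(T^{5/2}).$$

Let $(x)_+$ denote the positive part of $x$; i.e., $(x)_+ = \max\{x, 0\}$. Therefore,

$$\begin{aligned}
\mathbb{E}C_{\mathrm{comp}}(\epsilon) &= O\Big(\sum_{j=1}^{T(\epsilon)} B_j\Big) = O\Big(\big(T(\epsilon) \wedge T_*\big)^{5/2} + n\big(T(\epsilon) - T_*\big)_+\Big)\\
&= O\Big(\big(T_1(\epsilon) \wedge T_*\big)^{5/2} + n\big(T_2(\epsilon) - T_*\big)\Big)\\
&= O\Big(\min\Big\{\frac{1}{\epsilon^{5/3}}\Big[(L\Delta_f)^{5/3} + (H^*)^{5/3}\log^5\Big(\frac{H^*}{\epsilon}\Big)\Big],\ n^{5/3}\Big\} + \frac{n^{2/3}}{\epsilon}\big(L\Delta_f + H^*\log n\big)\Big).
\end{aligned}$$

Remark. The log-factor $\log^5(1/\epsilon)$ can be reduced to $\log^{3/2+\mu}(1/\epsilon)$ for any $\mu > 0$ by setting $B_j = j^{3/2}(\log j)^{3/2+\mu} \wedge n$. In this case, for any $\mu > 0$,

$$\sum_{j \le T} B_j^{-2/3}I(B_j < n) \le \sum_{j \le T} \frac{I(B_j < n)}{j(\log j)^{1+2\mu/3}},$$

which is bounded uniformly in $T$ because $\int^{\infty}\frac{dx}{x(\log x)^{1+2\mu/3}} < \infty$. On the other hand, as proved above, $\sum_{j \le T} B_j^{1/3}$ is of order at least $T^{3/2}$ when $B_T < n$. Thus,

$$W(T) = O\Big(\frac{L\Delta_f + H^*}{T^{3/2}}\Big).$$

Using similar arguments and treating $L$, $\Delta_f$ and $H^*$ as $O(1)$ for simplicity, we can obtain that

$$T(\epsilon) = O\Big(\epsilon^{-2/3} \wedge n^{2/3} + \frac{1}{\epsilon n^{1/3}}\Big).$$

If $B_{T(\epsilon)} < n$, then

$$\mathbb{E}C_{\mathrm{comp}}(\epsilon) = O\Big(\sum_{j=1}^{T(\epsilon)} j^{3/2}(\log j)^{3/2+\mu}\Big) = O\Big(T(\epsilon)^{5/2}\big(\log T(\epsilon)\big)^{3/2+\mu}\Big) = O\Big(\epsilon^{-5/3}\log^{3/2+\mu}\Big(\frac{1}{\epsilon}\Big)\Big).$$

If $B_{T(\epsilon)} \ge n$, we obtain the same bound as in Corollary 3.4.

D Convergence Analysis for P-L Objectives

Proof [Theorem 3.5]. By the final inequality in the proof of Theorem 3.1 and the P-L condition,

$$\mu\Big(\frac{B_j}{b_j}\Big)^{1/3}\mathbb{E}\big(f(\tilde{x}_j) - f^*\big) \le \Big(\frac{B_j}{b_j}\Big)^{1/3}\mathbb{E}\|\nabla f(\tilde{x}_j)\|^2 \le \frac{5L}{\gamma}\,\mathbb{E}\big(f(\tilde{x}_{j-1}) - f(\tilde{x}_j)\big) + 6b_j^{-1/3}B_j^{-2/3}I(B_j < n)H^*.$$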

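For brevity, we write $F_j$ for $\mathbb{E}f(\tilde{x}_j) - f^*$. Then

$$\big(\mu\gamma B_j^{1/3} + 5Lb_j^{1/3}\big)F_j \le 5Lb_j^{1/3}F_{j-1} + 6\gamma B_j^{-2/3}I(B_j < n)H^*. \tag{27}$$

By definition of $\lambda_j$, that is, $\lambda_j = 5Lb_j^{1/3}/\big(\mu\gamma B_j^{1/3} + 5Lb_j^{1/3}\big)$, this can be reformulated as

$$F_j \le \lambda_jF_{j-1} + \frac{6\gamma H^*I(B_j < n)}{\mu\gamma B_j + 5Lb_j^{1/3}B_j^{2/3}}.$$

Applying the above inequality iteratively for $j = T, T-1, \ldots, 1$, we prove the result.

Proof [Corollary 3.6]. Write $\Delta_f = f(\tilde{x}_0) - f^* = F_0$. When $B_j \equiv B$, $b_j \equiv 1$ and $\gamma = 1/6$, (27) in the proof of Theorem 3.5 can be reformulated as

$$\big(\mu B^{1/3} + 30L\big)\Big(F_j - \frac{6H^*I(B < n)}{\mu B}\Big) \le 30L\Big(F_{j-1} - \frac{6H^*I(B < n)}{\mu B}\Big).$$

This implies that

$$F_T \le \Big(\frac{30L}{\mu B^{1/3} + 30L}\Big)^T\Delta_f + \frac{6H^*I(B < n)}{\mu B}.$$

Under the setting of the corollary,

$$\frac{6H^*I(B < n)}{\mu B} \le \frac{\epsilon}{2}.$$

By definition of $T(\epsilon)$, we have

$$T(\epsilon) \le \frac{\log(2\Delta_f/\epsilon)}{\log\big(1 + \mu B^{1/3}/(30L)\big)} = O\Big(\Big(1 + \frac{L}{\mu B^{1/3}}\Big)\log\frac{\Delta_f}{\epsilon}\Big).$$

As a consequence,

$$\mathbb{E}C_{\mathrm{comp}}(\epsilon) = O(T(\epsilon)B) = O\Big(\Big(B + \frac{LB^{2/3}}{\mu}\Big)\log\frac{\Delta_f}{\epsilon}\Big).$$

Plugging in the value of $B$, we end up with

$$\mathbb{E}C_{\mathrm{comp}}(\epsilon) = O\Big(\Big\{\Big(\frac{H^*}{\mu\epsilon} \wedge n\Big) + \frac{L}{\mu}\Big(\frac{H^*}{\mu\epsilon} \wedge n\Big)^{2/3}\Big\}\log\frac{\Delta_f}{\epsilon}\Big).$$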
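The epoch count in the proof of Corollary 3.6 can be illustrated by iterating the recursion with equality, which is the worst case it allows. This is a hedged sketch: the function names, the parameter values, and the batch size $B = \min\{12H^*/(\mu\epsilon), n\}$ (picked so that $6H^*I(B<n)/(\mu B) \le \epsilon/2$) are illustrative assumptions, not values taken from the paper.

```python
import math

def epochs_needed(mu, L, B, n, H, delta_f, eps):
    """Iterate F_j - c = rate * (F_{j-1} - c) from F_0 = delta_f until F_j <= eps.

    c = 6 H I(B < n) / (mu B) is the noise floor and
    rate = 30 L / (mu B^{1/3} + 30 L), i.e. b_j = 1 and gamma = 1/6.
    Terminates only when c < eps.
    """
    c = 6.0 * H * (1.0 if B < n else 0.0) / (mu * B)
    rate = 30.0 * L / (mu * B ** (1.0 / 3.0) + 30.0 * L)
    F, T = delta_f, 0
    while F > eps:
        F = c + rate * (F - c)
        T += 1
    return T

def epoch_bound(mu, L, B, delta_f, eps):
    """Upper bound log(2 delta_f / eps) / log(1 + mu B^{1/3} / (30 L)) on T(eps)."""
    return math.log(2.0 * delta_f / eps) / math.log1p(mu * B ** (1.0 / 3.0) / (30.0 * L))

# Illustrative values only.
mu, L, H, delta_f, eps, n = 0.1, 1.0, 1.0, 1.0, 1e-3, 10 ** 9
B = min(12.0 * H / (mu * eps), n)
T_sim = epochs_needed(mu, L, B, n, H, delta_f, eps)
T_max = math.ceil(epoch_bound(mu, L, B, delta_f, eps))
```

Under these assumptions the simulated count `T_sim` never exceeds `T_max`, and the total cost scales as `T_sim * B`, matching the $O(T(\epsilon)B)$ accounting above.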


Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University.

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University. Nonlinear Models Numerical Methods for Deep Learning Lars Ruthotto Departments of Mathematics and Computer Science, Emory University Intro 1 Course Overview Intro 2 Course Overview Lecture 1: Linear Models

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

arxiv: v1 [cs.lg] 17 Nov 2017

arxiv: v1 [cs.lg] 17 Nov 2017 Neon: Finding Local Minima via First-Order Oracles (version ) Zeyuan Allen-Zhu zeyuan@csail.mit.edu Microsoft Research Yuanzhi Li yuanzhil@cs.princeton.edu Princeton University arxiv:7.06673v [cs.lg] 7

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

arxiv: v2 [cs.lg] 4 Oct 2016

arxiv: v2 [cs.lg] 4 Oct 2016 Appearing in 2016 IEEE International Conference on Data Mining (ICDM), Barcelona. Efficient Distributed SGD with Variance Reduction Soham De and Tom Goldstein Department of Computer Science, University

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4 Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information