Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Jialei Wang (University of Chicago)    Tong Zhang (Tencent AI Lab)

OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning (NIPS 2017).

Abstract

We present a novel minibatch stochastic optimization method for empirical risk minimization problems. The method efficiently leverages variance reduced first-order and sub-sampled higher-order information to accelerate the convergence. We prove improved iteration complexity over state-of-the-art methods under suitable conditions.

1 Introduction

We consider the following optimization problem of finite sums:

    min_w f(w) := (1/n) ∑_{i=1}^n f_i(w),    (1)

which arises in many machine learning problems. In the context of regularized loss minimization of linear predictors, we have f_i(w) = ℓ(w⊤x_i; b_i) + (λ/2)‖w‖². Let x_1, x_2, ..., x_n ∈ R^d be the feature vectors of n data samples, and b_1, ..., b_n ∈ R or {−1, +1} be the corresponding target variables of interest. We note that (1) covers many popular models used in machine learning: for example, when ℓ(w⊤x_i; b_i) = log(1 + exp(−b_i⟨w, x_i⟩)) we obtain ℓ2-regularized logistic regression. The ubiquitousness of such finite-sum optimization problems and the massive scale of modern datasets motivate significant research effort on efficient optimization algorithms for solving (1). Throughout the paper, we assume each component function is L-smooth and λ-strongly convex.

In recent years there have been significant advances in developing fast optimization algorithms for (1), and we refer the reader to Bottou et al. (2016) for a comprehensive survey of these developments. For large-scale problems of the form (1), randomized methods are particularly efficient because of their low per-iteration computation. Below we briefly review two lines of research: i) randomized variance reduced first-order methods; ii) randomized methods leveraging second-order information.

Variance Reduced First-order Methods

The key technique for developing fast stochastic first-order methods is variance reduction, which makes the variance of the randomized update direction approach zero as the iterate gets closer to the optimum. Representative methods of this category include SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SDCA (Shalev-Shwartz and Zhang, 2013), and SAGA (Defazio et al., 2014). In order to find a solution that reaches ɛ-suboptimality, these methods require O((n + L/λ) log(1/ɛ)) calls to the first-order oracle that computes the gradient of an individual function. In the large condition number regime (e.g., L/λ > n), by using an acceleration technique (such as the accelerated proximal point frameworks behind Catalyst (Shalev-Shwartz and Zhang, 2016; Frostig et al., 2015; Lin et al., 2015a), APCG (Lin et al., 2015b), SPDC (Zhang and Xiao, 2015), Katyusha (Allen-Zhu, 2016), etc.), one can further improve the iteration complexity to O((n + √(nL/λ)) log(1/ɛ)).
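To make the variance-reduction idea above concrete, here is a minimal NumPy sketch of plain SVRG (in the spirit of Johnson and Zhang, 2013, and not the method proposed in this paper) applied to the ℓ2-regularized logistic-regression instance of (1); the data arrays X, b, the regularization parameter lam, and the stepsize eta are placeholder names introduced only for this illustration.

```python
import numpy as np

def grad_f_i(w, x_i, b_i, lam):
    """Gradient of f_i(w) = log(1 + exp(-b_i <w, x_i>)) + (lam/2) ||w||^2."""
    s = 1.0 / (1.0 + np.exp(b_i * (x_i @ w)))   # equals sigmoid(-b_i <w, x_i>)
    return -b_i * s * x_i + lam * w

def svrg(X, b, lam, eta, n_stages=20, m=None, seed=0):
    """Plain SVRG on the finite sum (1), using the reduced-variance direction
    grad f_i(w) - grad f_i(w_tilde) + grad f(w_tilde)."""
    n, d = X.shape
    m = m if m is not None else 2 * n            # inner-loop length (common heuristic)
    rng = np.random.default_rng(seed)
    w_tilde = np.zeros(d)
    for _ in range(n_stages):
        # Full gradient at the reference point w_tilde (one pass over the data).
        g_tilde = np.mean([grad_f_i(w_tilde, X[i], b[i], lam) for i in range(n)], axis=0)
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            # Unbiased direction whose variance vanishes as w, w_tilde approach w*.
            v = grad_f_i(w, X[i], b[i], lam) - grad_f_i(w_tilde, X[i], b[i], lam) + g_tilde
            w -= eta * v
        w_tilde = w                              # next stage restarts from the last iterate
    return w_tilde

# Tiny synthetic run.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
b = np.sign(X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200))
w_hat = svrg(X, b, lam=0.01, eta=0.1)
```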

Leveraging Second-order Information

Second-order information is often useful for improving the convergence of optimization algorithms. However, for large-scale problems, obtaining and inverting the exact Hessian matrix is often computationally expensive, which makes vanilla Newton methods ill-suited for solving (1). Therefore, there have been emerging studies on designing randomized algorithms that effectively utilize approximate Hessian information. One line of such research is the sub-sampled Newton method, which approximates the Hessian matrix based on a sub-sampled minibatch of data. Several works (Byrd et al., 2011; Erdogdu and Montanari, 2015; Roosta-Khorasani and Mahoney, 2016a,b; Bollapragada et al., 2016) established local linear convergence rates for different variants of sub-sampled Newton methods when the size of the sub-sampled data is large enough. All of the aforementioned methods employ the full first-order gradient, and require calling the second-order oracle to compute the Hessian and solving a resulting linear system at every iteration. Therefore their computational complexity (as compared in Table 2 of Xu et al. (2016)) is often worse than that of stochastic algorithms such as SVRG. Another line of research considers randomness in both first- and second-order information in order to design algorithms with lower per-iteration cost. In particular, Byrd et al. (2016) considered combining minibatch SGD with limited-memory BFGS (L-BFGS) (Liu and Nocedal, 1989) type updates. Inspired by this, Moritz et al. (2016), Gower et al. (2016), and Wang et al. (2017b) proposed to combine variance reduced stochastic gradients with L-BFGS, and proved linear convergence for strongly convex and smooth objectives. However, it is theoretically hard to guarantee the quality of the Hessian approximation produced by L-BFGS updates, and thus the iteration complexities obtained in these works can be pessimistic, and can be much worse than that of vanilla SVRG.

In this paper, we make an effort in this direction and make the following contributions: i) we propose a novel approach that combines the advantages of variance-reduced first-order methods and sub-sampled Newton methods in an efficient way that does not require expensive Hessian matrix computation and inversion; the method can be naturally extended to solve composite optimization problems with non-smooth regularization; ii) we theoretically show that under certain conditions the proposed approach improves the state-of-the-art iteration complexity, and empirically demonstrate that it can substantially improve the convergence of existing methods.

2 Minibatch Stochastic Variance Reduced Proximal Iterations

In this section we present the proposed approach for minimizing (1). The high-level description is given in Algorithm 1, which consists of two main innovations: i) simultaneous incorporation of first-order and higher-order information via sub-sampling; ii) allowing larger minibatch sizes through acceleration with inexact minimization. We explain these features below.

Algorithm 1 MB-SVRP: Minibatch Stochastic Variance Reduced Proximal Iterations
    Parameters: η, σ, b, m, ν, ε.
    Initialize w̃_0 = 0.
    Sampling: sample b items from [n] to form a minibatch B.
    for s = 1, 2, ... do
        Calculate ṽ = (1/n) ∑_{i=1}^n ∇f_i(w̃_{s−1}).
        Initialize y_0 = w_0 = w̃_{s−1}.
        for t = 1, 2, ..., m do
            Sample b items from [n] to form a minibatch B_t.
            Find w_t that approximately solves (2), in the sense that f̃_t(w_t) − min_w f̃_t(w) ≤ ε, where

                f̃_t(w) := (η/(2b)) (w − w_{t−1})⊤ [ ∑_{i∈B} ∇²f_i(w_{t−1}) ] (w − w_{t−1})
                           + ⟨ (η/b) ∑_{i∈B_t} ∇f_i(y_{t−1}) + ηṽ − (η/b) ∑_{i∈B_t} ∇f_i(w̃_{s−1}), w ⟩
                           + (σ/2) ‖w − y_{t−1}‖².                                        (2)

            Update: y_t = w_t + ν(w_t − w_{t−1}).
        end for
        Update w̃_s = w_m.
    end for
    Return w̃_s.
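The Python skeleton below is only a schematic rendering of the loop structure of Algorithm 1 as reconstructed above, not the authors' implementation; the argument names (b, m, nu, n_stages) and the solve_subproblem callback, which is expected to return an approximate minimizer of (2), are assumptions made for the sketch.

```python
import numpy as np

def mb_svrp(grad_f_i, X, targets, lam, b, m, nu, n_stages, solve_subproblem, seed=0):
    """Schematic outer/inner loops of Algorithm 1 (MB-SVRP)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    B = rng.choice(n, size=b, replace=False)       # fixed Hessian minibatch B
    w_tilde = np.zeros(d)
    for _ in range(n_stages):
        # Anchor gradient v_tilde = (1/n) sum_i grad f_i(w_tilde) for variance reduction.
        v_tilde = np.mean([grad_f_i(w_tilde, X[i], targets[i], lam) for i in range(n)],
                          axis=0)
        w_prev, y_prev = w_tilde.copy(), w_tilde.copy()
        for _ in range(m):
            B_t = rng.choice(n, size=b, replace=False)   # fresh gradient minibatch B_t
            # Approximately minimize the subproblem (2) built from B, B_t, y_prev, v_tilde.
            w_new = solve_subproblem(w_prev, y_prev, w_tilde, v_tilde, B, B_t)
            # Two-sequence momentum update: y_t = w_t + nu * (w_t - w_{t-1}).
            y_prev = w_new + nu * (w_new - w_prev)
            w_prev = w_new
        w_tilde = w_prev                               # w_tilde_s = w_m
    return w_tilde
```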

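One simple, deliberately naive way to instantiate the solve_subproblem callback above is to run a few gradient-descent steps on the strongly convex subproblem (2); the paper points to SVRG-type solvers instead, so this is only a sketch, and the component gradient/Hessian callables grad_f_i and hess_f_i, as well as eta and sigma, are assumed to be supplied by the caller.

```python
import numpy as np

def make_subproblem_solver(grad_f_i, hess_f_i, X, targets, lam, eta, sigma,
                           inner_steps=50):
    """Build a callback that approximately minimizes the subproblem (2) by gradient descent."""
    def solve(w_prev, y_prev, w_tilde, v_tilde, B, B_t):
        b = len(B)
        # Averaged sub-sampled Hessian at w_prev over the fixed minibatch B.
        H = sum(hess_f_i(w_prev, X[i], targets[i], lam) for i in B) / b
        # Variance-reduced linear term over B_t (the inner-product part of (2), without eta).
        g = (np.mean([grad_f_i(y_prev, X[i], targets[i], lam) for i in B_t], axis=0)
             + v_tilde
             - np.mean([grad_f_i(w_tilde, X[i], targets[i], lam) for i in B_t], axis=0))
        # Stepsize 1/(smoothness of (2)): the subproblem is (eta*||H|| + sigma)-smooth.
        step = 1.0 / (eta * np.linalg.norm(H, 2) + sigma)
        w = y_prev.copy()
        for _ in range(inner_steps):
            # Gradient of (2): eta*H (w - w_prev) + eta*g + sigma*(w - y_prev).
            w -= step * (eta * (H @ (w - w_prev)) + eta * g + sigma * (w - y_prev))
        return w
    return solve
```

Plugging the returned callback into mb_svrp above closes the loop.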
Sampling Both First-order and Higher-order Information

We first discuss the main building block of the algorithm. Suppose at iteration t, given the previous iterate w_{t−1}, we consider the following update rule:

    w_t ← argmin_w  (η/(2b)) (w − w_{t−1})⊤ [ ∑_{i∈B} ∇²f_i(w_{t−1}) ] (w − w_{t−1}) + (σ/2) ‖w − w_{t−1}‖²
                    + ⟨ (η/b) ∑_{i∈B_t} ∇f_i(w_{t−1}) + η ∇f(w̃) − (η/b) ∑_{i∈B_t} ∇f_i(w̃), w ⟩            (3)

        = w_{t−1} − η [ (η/b) ∑_{i∈B} ∇²f_i(w_{t−1}) + σI ]^{−1} [ (1/b) ∑_{i∈B_t} ∇f_i(w_{t−1}) + ∇f(w̃) − (1/b) ∑_{i∈B_t} ∇f_i(w̃) ],    (4)

where B and B_t are randomly sampled minibatches from {1, ..., n}, both with minibatch size b; η and σ are stepsize parameters, and w̃ is a reference predictor used for reducing variance. The main feature of update rule (3) is that it considers both noisy first-order and noisy higher-order information via minibatch sampling. The quadratic term built from ∑_{i∈B} ∇²f_i(w_{t−1}) in (3) incorporates noisy higher-order information about the objective function, so the scheme can be treated as a variant of the sub-sampled Newton method combined with a minibatch stochastic gradient with variance reduction. Equivalently, by (4), the update rule (3) can be viewed as a preconditioned minibatch SVRG update, with ( (η/b) ∑_{i∈B} ∇²f_i(w_{t−1}) + σI ) as the preconditioning matrix.

Large Minibatch Size via Acceleration with Inexact Minimization

Using (3) as the building block, we propose the MB-SVRP (minibatch stochastic variance reduced proximal iterations) method, which is detailed in Algorithm 1. At the beginning of the algorithm, we form a minibatch B by sampling from {1, ..., n} and fix it for the whole optimization process. Then, following the SVRG method (Johnson and Zhang, 2013), the algorithm is divided into multiple stages, indexed by s. At each stage, we iteratively solve a minimization problem of the form (2) based on the randomly sampled minibatch B_t. Compared with the simple update rule in (3), the major difference in Algorithm 1 is that we consider a momentum scheme by maintaining two sequences {w_t, y_t}, which is inspired by Nesterov's acceleration technique (Nesterov, 2004) and its recent SVRG variant (Nitanda, 2014). The main theoretical advantage over minibatch SVRG without momentum (Konečný et al., 2016) is that such acceleration allows us to use a much larger minibatch size (up to a size of O(√n)) without slowing down the convergence. Since it is often expensive to find the exact minimizer of (2), we settle for an approximate minimizer with objective suboptimality ε. When we choose an appropriate σ so that f̃_t(w) has enough strong convexity, the subproblems can be solved to high accuracy efficiently. For example, using SVRG-type algorithms (Johnson and Zhang, 2013) to solve (2) allows us to find an approximate solution with a small suboptimality ε within a constant number of passes over the data in the minibatch B ∪ B_t, if σ is set appropriately.

3 Theoretical Results

Besides strong convexity and smoothness, we also need the following Hessian Lipschitz condition, together with the notions of effective dimension and statistical leverage.

Condition 1. For each component function f_i(w), i ∈ [n], the Hessian matrix is M-Lipschitz, i.e.,

    ‖∇²f_i(w) − ∇²f_i(w′)‖ ≤ M ‖w − w′‖,   ∀ w, w′ ∈ R^d, i ∈ [n].

Definition 1 (Effective dimension). Let λ_1, ..., λ_d be the top-d eigenvalues of H_0(w*) = (1/n) ∑_{i=1}^n ℓ″(w*⊤x_i; b_i) x_i x_i⊤, and define the effective dimension d_λ (for some λ > 0) of H_0(w*) as d_λ = ∑_{j=1}^d λ_j/(λ_j + λ).
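As a quick illustration of Definition 1 (my own, not taken from the paper), the effective dimension d_λ can be computed directly from the eigenvalues of H_0(w*); the diagonal matrix H0 below is a synthetic stand-in with fast eigenvalue decay.

```python
import numpy as np

def effective_dimension(H0, lam):
    """d_lambda = sum_j lambda_j / (lambda_j + lam) over the eigenvalues of H0 (Definition 1)."""
    eigvals = np.linalg.eigvalsh(H0)            # H0 is symmetric positive semidefinite
    return float(np.sum(eigvals / (eigvals + lam)))

# Example: eigenvalues decaying like 1/j^2 give d_lambda far smaller than d.
d = 200
H0 = np.diag(1.0 / (1.0 + np.arange(d)) ** 2)
print(effective_dimension(H0, lam=1e-3))        # much smaller than d = 200
```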

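Returning to the update rule (4): because the model in (3) is quadratic, the preconditioned variance-reduced step can also be taken in closed form by solving a single b-sample linear system. The sketch below illustrates this under the reconstruction above; the array arguments (stacked component gradients and Hessians) are placeholder inputs for the example.

```python
import numpy as np

def preconditioned_vr_step(w_prev, full_grad_tilde, grads_at_w, grads_at_tilde,
                           hessians_B, eta, sigma):
    """One closed-form step of the reconstructed update (4).

    grads_at_w, grads_at_tilde: (b, d) component gradients over B_t, at w_prev and w_tilde.
    hessians_B: (b, d, d) component Hessians over the fixed minibatch B, at w_prev.
    """
    b_h, d = hessians_B.shape[0], w_prev.shape[0]
    # Variance-reduced minibatch gradient: (1/b) sum_{B_t} [grad f_i(w) - grad f_i(w~)] + grad f(w~).
    g = grads_at_w.mean(axis=0) - grads_at_tilde.mean(axis=0) + full_grad_tilde
    # Sub-sampled preconditioner: (eta/b) sum_{B} hess f_i(w) + sigma * I.
    P = (eta / b_h) * hessians_B.sum(axis=0) + sigma * np.eye(d)
    # Solve the linear system rather than forming the matrix inverse explicitly.
    return w_prev - eta * np.linalg.solve(P, g)
```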
Definition 2 (Statistical leverage at λ). Let H_λ(w*) = (1/n) ∑_{i=1}^n ℓ″(w*⊤x_i; b_i) x_i x_i⊤ + λI. We say the statistical leverage of the data matrix X is bounded by ρ at λ if

    max_{i∈[n]} ‖H_λ^{−1/2}(w*) ℓ″(w*⊤x_i; b_i)^{1/2} x_i‖ / sqrt( (1/n) ∑_{j=1}^n ‖H_λ^{−1/2}(w*) ℓ″(w*⊤x_j; b_j)^{1/2} x_j‖² ) ≤ ρ.

Theorem 3. Consider Algorithm 1 on problems with L-smooth and λ-strongly convex component functions that satisfy Condition 1 with Hessian Lipschitz parameter M. Suppose we sample the minibatch B uniformly from {1, ..., n}, and set the tuning parameters σ, b, and η appropriately as functions of n, L, λ, ρ, and d_λ (with σ at least λ, the minibatch size b at most n, and the stepsize η sufficiently small). Suppose further that we start from an initialization point that is sufficiently close to the optimum, ‖w_0 − w*‖ ≤ R, where the radius R is determined by λ, L, and M. Given ɛ > 0, let the subproblem accuracy ε be at most ε_0, a sufficiently small multiple of ɛ whose factor is polynomial in λ/L. Then, for the MB-SVRP algorithm to find an approximate solution w̃_s of (1) that reaches the expected ɛ-objective suboptimality E f(w̃_s) − min_w f(w) ≤ ɛ, the total number of oracle calls to solve (2) is no more than

    O( max{ (L/λ)^{1/3}, ρ²d_λ L/(λ n²) } log(1/ɛ) ).

Note that in Algorithm 1 we can use SVRG to solve each subproblem in (2). This leads to the following result.

Corollary 4. Assume that the conditions of Theorem 3 hold. Moreover, if we use SVRG to solve each subproblem in (2) up to suboptimality ε, then the total number of gradient evaluations used in the whole MB-SVRP algorithm can be upper bounded by

    O( max{ ρ²d_λ (L/λ)^{2/3}, ρ²d_λ L/(λ √n) } log(L/λ) log(1/ɛ) + n log(1/ɛ) ).

Acceleration

Corollary 4 states that if the condition number is not too large, L/λ ≤ n^{5/4}, then MB-SVRP requires only a logarithmic number of passes over the data to find a solution of high accuracy. When the condition number is larger, MB-SVRP can still be slow. By using the accelerated proximal point framework proposed in Shalev-Shwartz and Zhang (2016), Frostig et al. (2015), and Lin et al. (2015a), it is possible to obtain an accelerated convergence rate with a milder dependence on the condition number. Table 1 summarizes the main results.

Table 1: Comparison of iteration complexity of various finite-sum quadratic optimization algorithms when the effective dimension and statistical leverage are bounded. We compare different relative scales of the condition number κ = L/λ and the sample size n, ignoring logarithmic factors.

                    κ ≤ n              κ = n^{4/3}        κ = n^{3/2}        κ = n²                  κ = n³
    SVRG            n                  n^{4/3}            n^{3/2}            n²                      n³
    MB-SVRP         n ρ²d_λ            n ρ²d_λ            n ρ²d_λ            n^{3/2} ρ²d_λ           n^{5/2}
    Acc-SVRG        n                  n^{7/6}            n^{5/4}            n^{3/2}                 n²
    Acc-MB-SVRP     n (ρ²d_λ)^{1/2}    n (ρ²d_λ)^{1/2}    n (ρ²d_λ)^{1/2}    n^{5/4} (ρ²d_λ)^{1/2}   n^{7/4} (ρ²d_λ)^{1/2}
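The quantities ρ and d_λ that appear in Theorem 3 and Table 1 can be evaluated directly for a given problem. The sketch below computes the statistical leverage bound of Definition 2; it is an illustration only, with a generic weights array standing in for the curvatures ℓ″(w*⊤x_i; b_i).

```python
import numpy as np

def statistical_leverage(X, weights, lam):
    """Leverage bound rho of Definition 2: max_i over root-mean-square of ||H^{-1/2} l''^{1/2} x_i||."""
    n, d = X.shape
    Xw = X * np.sqrt(weights)[:, None]           # rows are l''(w*^T x_i; b_i)^{1/2} x_i
    H = Xw.T @ Xw / n + lam * np.eye(d)          # H_lambda(w*)
    # ||H^{-1/2} l''^{1/2} x_i||^2 = weighted x_i^T H^{-1} x_i, computed via a linear solve.
    sq_norms = np.einsum('ij,ij->i', Xw, np.linalg.solve(H, Xw.T).T)
    return float(np.sqrt(sq_norms.max() / sq_norms.mean()))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
print(statistical_leverage(X, weights=np.ones(500), lam=0.1))
```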

References

[Allen-Zhu(2016)] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arXiv preprint, 2016.
[Bollapragada et al.(2016)] R. Bollapragada, R. Byrd, and J. Nocedal. Exact and inexact subsampled Newton methods for optimization. arXiv preprint, 2016.
[Bottou et al.(2016)] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint, 2016.
[Byrd et al.(2011)] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3), 2011.
[Byrd et al.(2016)] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2), 2016.
[Cotter et al.(2011)] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, 2011.
[Defazio et al.(2014)] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
[Dekel et al.(2012)] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165-202, 2012.
[Devolder et al.(2014)] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37-75, 2014.
[Erdogdu and Montanari(2015)] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, 2015.
[Frostig et al.(2015)] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[Gower et al.(2016)] R. Gower, D. Goldfarb, and P. Richtárik. Stochastic block BFGS: squeezing more curvature out of data. In International Conference on Machine Learning, 2016.
[Johnson and Zhang(2013)] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
[Konečný et al.(2016)] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2), 2016.
[Lan(2012)] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1), 2012.
[Lin et al.(2015a)] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015a.
[Lin et al.(2015b)] Q. Lin, Z. Lu, and L. Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4), 2015b.
[Liu and Nocedal(1989)] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1), 1989.
[Moritz et al.(2016)] P. Moritz, R. Nishihara, and M. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, 2016.

[Nesterov(2004)] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2004.
[Nitanda(2014)] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, 2014.
[Roosta-Khorasani and Mahoney(2016a)] F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods I: globally convergent algorithms. arXiv preprint, 2016a.
[Roosta-Khorasani and Mahoney(2016b)] F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods II: local convergence rates. arXiv preprint, 2016b.
[Roux et al.(2012)] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.
[Schmidt et al.(2011)] M. Schmidt, N. L. Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, 2011.
[Shalev-Shwartz and Zhang(2013)] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb), 2013.
[Shalev-Shwartz and Zhang(2016)] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105-145, 2016.
[Wang et al.(2017a)] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Proceedings of the 30th Conference on Learning Theory, 2017a.
[Wang et al.(2017b)] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2), 2017b.
[Xu et al.(2016)] P. Xu, J. Yang, F. Roosta-Khorasani, C. Ré, and M. W. Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, 2016.
[Zhang and Xiao(2015)] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
