Stochastic Gradient Descent with Variance Reduction

1 Stochastic Gradient Descent with Variance Reduction. Rie Johnson, Tong Zhang. Presenter: Jiawen Yao. March 17, 2015.

2 Outline: (1) Problem; (2) Stochastic Average Gradient (SAG); (3) Accelerating SGD using Predictive Variance Reduction (SVRG); (4) Conclusion.

3 Outline, Part 1: Problem.

4 Problem: Preliminaries. Recall a few definitions from convex analysis.

Definition 1. A function $f(x)$ is an $L$-Lipschitz continuous function if
$$\|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\| \quad (1)$$
for all $x_1, x_2 \in \mathrm{dom}(f)$.

Definition 2. A convex function $f(x)$ is $\beta$-strongly convex if there exists a constant $\beta > 0$ such that for any $\alpha \in [0, 1]$,
$$f(\alpha x_1 + (1-\alpha)x_2) \le \alpha f(x_1) + (1-\alpha)f(x_2) - \tfrac{1}{2}\alpha(1-\alpha)\beta \|x_1 - x_2\|^2. \quad (2)$$

5 Problem: Preliminaries. When $f(x)$ is differentiable, strong convexity is equivalent to
$$f(x_1) \ge f(x_2) + \langle \nabla f(x_2), x_1 - x_2 \rangle + \tfrac{\beta}{2}\|x_1 - x_2\|^2. \quad (3)$$
Typically, we use the standard Euclidean norm to define Lipschitz continuous and strongly convex functions.

6 Problem: Minimizing a finite average of convex functions. Let $\psi_1, \ldots, \psi_n$ be a sequence of vector functions from $\mathbb{R}^d$ to $\mathbb{R}$. The problem is
$$\min_{\omega} P(\omega), \qquad P(\omega) = \frac{1}{n} \sum_{i=1}^{n} \psi_i(\omega). \quad (4)$$
Assumptions:
- each $\psi_i(\omega)$ is convex and differentiable;
- each $\psi_i(\omega)$ is smooth with Lipschitz constant $L$:
$$\|\nabla \psi_i(\omega) - \nabla \psi_i(\omega')\| \le L \|\omega - \omega'\|; \quad (5)$$
- $P(\omega)$ is strongly convex:
$$P(\omega) \ge P(\omega') + \nabla P(\omega')^\top (\omega - \omega') + \frac{\gamma}{2}\|\omega - \omega'\|^2. \quad (6)$$
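
As a concrete (and purely hypothetical) instance of objective (4), the sketch below sets up an L2-regularized logistic regression loss, where each $\psi_i$ is the loss on one training example; the synthetic data, names, and constants are illustrative and not taken from the slides.

```python
import numpy as np

# Hypothetical finite-sum objective of the form (4): L2-regularized logistic
# regression on synthetic data. Each psi_i is the loss on example i plus the
# regularizer, which makes P strongly convex.
rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))
lam = 1e-2  # regularization weight (assumed value)

def psi_grad(i, w):
    """Gradient of psi_i at w (logistic loss on example i plus L2 term)."""
    margin = y[i] * X[i].dot(w)
    coef = -y[i] / (1.0 + np.exp(margin))
    return coef * X[i] + lam * w

def P_grad(w):
    """Full gradient of P(w) = (1/n) sum_i psi_i(w)."""
    return np.mean([psi_grad(i, w) for i in range(n)], axis=0)
```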

7 Problem: Gradient Descent.
$$\omega^{(t)} = \omega^{(t-1)} - \eta_t \nabla P(\omega^{(t-1)}) = \omega^{(t-1)} - \frac{\eta_t}{n} \sum_{i=1}^{n} \nabla \psi_i(\omega^{(t-1)}). \quad (7)$$
Stochastic Gradient Descent: draw $i_t$ randomly from $\{1, \ldots, n\}$ and update
$$\omega^{(t)} = \omega^{(t-1)} - \eta_t \nabla \psi_{i_t}(\omega^{(t-1)}). \quad (8)$$
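
A minimal sketch of updates (7) and (8), reusing the hypothetical psi_grad / P_grad defined above; the step sizes and iteration counts are assumed values, not the slides' settings.

```python
def gradient_descent(w0, eta=0.1, T=100):
    """Full-gradient update (7): one pass over all n examples per step."""
    w = w0.copy()
    for _ in range(T):
        w -= eta * P_grad(w)
    return w

def sgd(w0, eta=0.05, T=100_000):
    """Stochastic update (8): one random example per step."""
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)        # draw i_t uniformly from {0, ..., n-1}
        w -= eta * psi_grad(i, w)
    return w
```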

8 Problem: SGD. A more general version of SGD is
$$\omega^{(t)} = \omega^{(t-1)} - \eta_t\, g_t(\omega^{(t-1)}, \xi_t), \quad (9)$$
where $\xi_t$ is a random variable that may depend on $\omega^{(t-1)}$, and the expectation satisfies $\mathbb{E}[g_t(\omega^{(t-1)}, \xi_t) \mid \omega^{(t-1)}] = \nabla P(\omega^{(t-1)})$.

9 Problem: Variance. For general convex optimization, stochastic gradient descent methods can obtain an $O(1/\sqrt{T})$ convergence rate in expectation. Randomness introduces variance: if the variance of $g_t(\omega^{(t-1)}, \xi_t)$ is large, it slows down convergence.
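
The following snippet (continuing the hypothetical setup above) estimates that variance empirically; the point is that $\mathbb{E}\|g_t - \nabla P(\omega)\|^2$ stays bounded away from zero even near a minimizer, which is what forces plain SGD to use decaying step sizes.

```python
def sgd_gradient_variance(w, num_samples=2000):
    """Monte Carlo estimate of E||psi_grad(i, w) - grad P(w)||^2 over random i."""
    full = P_grad(w)
    sq_dev = [np.sum((psi_grad(rng.integers(n), w) - full) ** 2)
              for _ in range(num_samples)]
    return float(np.mean(sq_dev))

print(sgd_gradient_variance(np.zeros(d)))
```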

10 Outline, Part 2: Stochastic Average Gradient (SAG).

11 Stochastic Average Gradient. The SAG method (Le Roux, Schmidt, Bach 2012) updates
$$\omega_t = \omega_{t-1} - \frac{\eta}{n} \sum_{i=1}^{n} g_t^{(i)}, \quad (10)$$
where
$$g_t^{(i)} = \begin{cases} \nabla \psi_i(\omega_{t-1}), & \text{if } i = i_t, \\ g_{t-1}^{(i)}, & \text{otherwise.} \end{cases} \quad (11)$$
It needs to store all $n$ gradients, which is not practical in some cases.
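
A minimal sketch of the SAG update (10)-(11) under the same hypothetical setup; the $n \times d$ table of stored gradients is exactly the memory cost that SVRG later removes. The step size and iteration count are assumed values.

```python
def sag(w0, eta=0.01, T=100_000):
    """SAG: keep one stored gradient per example and step with their average."""
    w = w0.copy()
    g_table = np.zeros((n, d))        # g^(i): last seen gradient of psi_i
    g_sum = np.zeros(d)               # running sum of the table rows
    for _ in range(T):
        i = rng.integers(n)
        g_new = psi_grad(i, w)
        g_sum += g_new - g_table[i]   # keep the sum consistent with the table
        g_table[i] = g_new
        w -= (eta / n) * g_sum        # update (10)
    return w
```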

12 Outline, Part 3: Accelerating SGD using Predictive Variance Reduction (SVRG).

13 SVRG: Motivation and contribution. Motivation: reduce the variance. Stochastic gradient descent has slow asymptotic convergence due to the inherent variance, and SAG needs to store all gradients. Contribution: no need to store the intermediate gradients; the same convergence rate as SAG; under mild assumptions, it even works in nonconvex cases.

14 Stochastic variance reduced gradient (SVRG). The SVRG (Johnson & Zhang, NIPS 2013) update is
$$\omega^{(t)} = \omega^{(t-1)} - \eta_t \big(\nabla \psi_{i_t}(\omega^{(t-1)}) - \nabla \psi_{i_t}(\tilde\omega) + \nabla P(\tilde\omega)\big), \quad (12)$$
where the snapshot $\tilde\omega$ is updated periodically (every $m$ SGD iterations).
Figure: Intuition of variance reduction.

15 Procedure SVRG.
Input: update frequency $m$ and learning rate $\eta$.
Initialization: $\tilde\omega_0$.
For $s = 1, 2, \ldots$:
  $\tilde\omega = \tilde\omega_{s-1}$
  $\tilde\mu = \nabla P(\tilde\omega) = \frac{1}{n} \sum_{i=1}^{n} \nabla \psi_i(\tilde\omega)$
  $\omega_0 = \tilde\omega$
  Repeat $m$ times: randomly pick $i_t \in \{1, \ldots, n\}$ and update
    $\omega_t = \omega_{t-1} - \eta\,(\nabla \psi_{i_t}(\omega_{t-1}) - \nabla \psi_{i_t}(\tilde\omega) + \tilde\mu)$
  Option I: set $\tilde\omega_s = \omega_m$
  Option II: set $\tilde\omega_s = \omega_t$ for randomly chosen $t \in \{0, \ldots, m-1\}$
End for.
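
A minimal sketch of this procedure with option I, continuing the hypothetical setup above; the values of $m$ and $\eta$ below are assumed, not the slides' settings.

```python
def svrg(w0, eta=0.02, m=2000, num_stages=20):
    """SVRG with option I: recompute the full gradient once per stage."""
    w_snap = w0.copy()                      # snapshot, omega tilde
    for _ in range(num_stages):
        mu = P_grad(w_snap)                 # full gradient at the snapshot
        w = w_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            v = psi_grad(i, w) - psi_grad(i, w_snap) + mu   # update (12)
            w -= eta * v
        w_snap = w                          # option I: keep the last iterate
    return w_snap
```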

16 Convergence of SVRG.
Theorem. Consider SVRG with option II. Assume that all $\psi_i(\omega)$ are convex and smooth and that $P(\omega)$ is strongly convex. Let $\omega_* = \arg\min_\omega P(\omega)$. Assume that $m$ is sufficiently large so that
$$\alpha = \frac{1}{\gamma \eta (1 - 2L\eta) m} + \frac{2L\eta}{1 - 2L\eta} < 1.$$
Then we have geometric convergence in expectation for SVRG:
$$\mathbb{E}\, P(\tilde\omega_s) \le \mathbb{E}\, P(\omega_*) + \alpha^s \,[P(\tilde\omega_0) - P(\omega_*)].$$
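
To make the rate concrete, the short computation below evaluates $\alpha$ for one assumed choice of parameters (condition number $L/\gamma = 100$, $\eta = 0.1/L$, $m = 50\,L/\gamma$); with these values $\alpha = 0.5$, so each outer stage roughly halves the expected suboptimality.

```python
def svrg_rate(L, gamma, eta, m):
    """The constant alpha from the theorem."""
    return 1.0 / (gamma * eta * (1 - 2 * L * eta) * m) + (2 * L * eta) / (1 - 2 * L * eta)

L_smooth, gamma = 1.0, 0.01                      # assumed: condition number 100
print(svrg_rate(L_smooth, gamma, eta=0.1 / L_smooth, m=int(50 * L_smooth / gamma)))  # 0.5
```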

17 Proof. Given any $i$, consider
$$g_i(\omega) = \psi_i(\omega) - \psi_i(\omega_*) - \nabla \psi_i(\omega_*)^\top (\omega - \omega_*). \quad (13)$$
Since $\nabla g_i(\omega_*) = 0$, $\omega_*$ minimizes $g_i$, i.e., $g_i(\omega_*) = \min_\omega g_i(\omega)$. Therefore
$$0 = g_i(\omega_*) \le \min_{\eta}\, g_i\big(\omega - \eta \nabla g_i(\omega)\big) \le \min_{\eta}\, \big[g_i(\omega) - \eta \|\nabla g_i(\omega)\|^2 + \tfrac{1}{2} L \eta^2 \|\nabla g_i(\omega)\|^2\big]. \quad (14)$$
Here the last step uses the well-known inequality for a function with an $L$-Lipschitz continuous gradient:
$$f(x) - f(y) \le \langle \nabla f(y), x - y \rangle + \tfrac{L}{2}\|x - y\|^2.$$

18 Proof. In (14) the minimum over $\eta$ is attained at $\eta = 1/L$, which gives
$$0 = g_i(\omega_*) \le g_i(\omega) - \frac{1}{2L} \|\nabla g_i(\omega)\|^2. \quad (15)$$
This can be rewritten as
$$\|\nabla g_i(\omega)\|^2 \le 2L\, g_i(\omega). \quad (16)$$
Using the definition of $g_i(\omega)$ and $\nabla g_i(\omega) = \nabla \psi_i(\omega) - \nabla \psi_i(\omega_*)$, (16) becomes
$$\|\nabla \psi_i(\omega) - \nabla \psi_i(\omega_*)\|^2 \le 2L\, [\psi_i(\omega) - \psi_i(\omega_*) - \nabla \psi_i(\omega_*)^\top (\omega - \omega_*)]. \quad (17)$$

19 Proof. Summing inequality (17) over $i = 1, \ldots, n$ and using the facts that $P(\omega) = \frac{1}{n}\sum_{i=1}^{n} \psi_i(\omega)$ and $\nabla P(\omega_*) = 0$, we get
$$\frac{1}{n} \sum_{i=1}^{n} \|\nabla \psi_i(\omega) - \nabla \psi_i(\omega_*)\|^2 \le 2L\, [P(\omega) - P(\omega_*)]. \quad (18)$$
Let $\tilde\mu = \nabla P(\tilde\omega)$ and $v_t = \nabla \psi_{i_t}(\omega_{t-1}) - \nabla \psi_{i_t}(\tilde\omega) + \tilde\mu$; $v_t$ is the SVRG gradient estimate.

20 Proof. Taking the expectation with respect to $i_t$, we obtain
$$\begin{aligned}
\mathbb{E}\|v_t\|^2 &= \mathbb{E}\|\nabla \psi_{i_t}(\omega_{t-1}) - \nabla \psi_{i_t}(\tilde\omega) + \tilde\mu\|^2 \\
&\le 2\,\mathbb{E}\|\nabla \psi_{i_t}(\omega_{t-1}) - \nabla \psi_{i_t}(\omega_*)\|^2 + 2\,\mathbb{E}\big\|[\nabla \psi_{i_t}(\tilde\omega) - \nabla \psi_{i_t}(\omega_*)] - \tilde\mu\big\|^2 \\
&= 2\,\mathbb{E}\|\nabla \psi_{i_t}(\omega_{t-1}) - \nabla \psi_{i_t}(\omega_*)\|^2 + 2\,\mathbb{E}\big\|[\nabla \psi_{i_t}(\tilde\omega) - \nabla \psi_{i_t}(\omega_*)] - \mathbb{E}[\nabla \psi_{i_t}(\tilde\omega) - \nabla \psi_{i_t}(\omega_*)]\big\|^2 \\
&\le 2\,\mathbb{E}\|\nabla \psi_{i_t}(\omega_{t-1}) - \nabla \psi_{i_t}(\omega_*)\|^2 + 2\,\mathbb{E}\|\nabla \psi_{i_t}(\tilde\omega) - \nabla \psi_{i_t}(\omega_*)\|^2 \\
&\le 4L\,[P(\omega_{t-1}) - P(\omega_*) + P(\tilde\omega) - P(\omega_*)]. \quad (19)
\end{aligned}$$
The first inequality uses $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$; the second uses $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X\|^2$; the third uses (18).

21 Proof. The SVRG update is $\omega_t = \omega_{t-1} - \eta v_t$. Conditioned on $\omega_{t-1}$,
$$\mathbb{E}\|\omega_t - \omega_*\|^2 = \mathbb{E}\|\omega_{t-1} - \omega_* - \eta v_t\|^2 = \|\omega_{t-1} - \omega_*\|^2 - 2\eta\,(\omega_{t-1} - \omega_*)^\top \mathbb{E} v_t + \eta^2\, \mathbb{E}\|v_t\|^2.$$
Here $\mathbb{E} v_t = \mathbb{E}[\nabla \psi_{i_t}(\omega_{t-1}) - \nabla \psi_{i_t}(\tilde\omega) + \tilde\mu] = \nabla P(\omega_{t-1})$. Using (19), we get
$$\mathbb{E}\|\omega_t - \omega_*\|^2 \le \|\omega_{t-1} - \omega_*\|^2 - 2\eta\,(\omega_{t-1} - \omega_*)^\top \nabla P(\omega_{t-1}) + 4L\eta^2\,[P(\omega_{t-1}) - P(\omega_*) + P(\tilde\omega) - P(\omega_*)]. \quad (20)$$
By convexity of $P(\omega)$,
$$-(\omega_{t-1} - \omega_*)^\top \nabla P(\omega_{t-1}) \le P(\omega_*) - P(\omega_{t-1}). \quad (21)$$

22 Proof. Combining (20) and (21),
$$\begin{aligned}
\mathbb{E}\|\omega_t - \omega_*\|^2 &\le \|\omega_{t-1} - \omega_*\|^2 - 2\eta\,[P(\omega_{t-1}) - P(\omega_*)] + 4L\eta^2\,[P(\omega_{t-1}) - P(\omega_*) + P(\tilde\omega) - P(\omega_*)] \\
&= \|\omega_{t-1} - \omega_*\|^2 - 2\eta(1 - 2L\eta)\,[P(\omega_{t-1}) - P(\omega_*)] + 4L\eta^2\,[P(\tilde\omega) - P(\omega_*)].
\end{aligned}$$
Consider a fixed stage $s$, so that $\tilde\omega = \tilde\omega_{s-1}$ and $\tilde\omega_s$ is selected after all of the updates have completed. Summing the previous inequality over $t = 1, \ldots, m$ and taking the expectation with respect to the whole history (with option II, $\sum_{t=1}^{m} \mathbb{E}[P(\omega_{t-1}) - P(\omega_*)] = m\, \mathbb{E}[P(\tilde\omega_s) - P(\omega_*)]$), we obtain
$$\mathbb{E}\|\omega_m - \omega_*\|^2 + 2\eta(1 - 2L\eta)m\, \mathbb{E}[P(\tilde\omega_s) - P(\omega_*)] \le \mathbb{E}\|\tilde\omega - \omega_*\|^2 + 4Lm\eta^2\, \mathbb{E}[P(\tilde\omega) - P(\omega_*)] \quad (22)$$
$$\le \frac{2}{\gamma}\, \mathbb{E}[P(\tilde\omega) - P(\omega_*)] + 4Lm\eta^2\, \mathbb{E}[P(\tilde\omega) - P(\omega_*)], \quad (23)$$
where the last inequality uses the strong convexity (6) of $P$.

23 Proof. From the above inequality we have
$$2\eta(1 - 2L\eta)m\, \mathbb{E}[P(\tilde\omega_s) - P(\omega_*)] \le \frac{2}{\gamma}\, \mathbb{E}[P(\tilde\omega) - P(\omega_*)] + 4Lm\eta^2\, \mathbb{E}[P(\tilde\omega) - P(\omega_*)], \quad (24)$$
which can be rewritten as
$$\mathbb{E}[P(\tilde\omega_s) - P(\omega_*)] \le \alpha\, \mathbb{E}[P(\tilde\omega_{s-1}) - P(\omega_*)], \quad (25)$$
where
$$\alpha = \frac{1}{\gamma \eta (1 - 2L\eta) m} + \frac{2L\eta}{1 - 2L\eta}. \quad (26)$$

24 Proof. Applying (25) recursively gives the desired bound in the Theorem:
$$\mathbb{E}[P(\tilde\omega_s) - P(\omega_*)] \le \alpha^s\, \mathbb{E}[P(\tilde\omega_0) - P(\omega_*)]. \quad (27)$$
The bound in Theorem 1 is comparable to those of Le Roux et al. [2012] and Shalev-Shwartz and Zhang [2012]. The convergence rate of SVRG is $O(1/T)$, which improves on the standard SGD convergence rate of $O(1/\sqrt{T})$.

25 Experiments. Figure: (a) Training loss comparison with SGD with fixed learning rates. (b) Training loss residual $P(\omega) - P(\omega_*)$. (c) Variance of the weight update.
It is hard to find a good $\eta$ for SGD. With a single, relatively large value of $\eta$, SVRG decreases smoothly and faster than SGD.

26 Experiments. Figure: More convex-case results: loss residual $P(\omega) - P(\omega_*)$ (top) and test error rates (bottom).
SVRG is competitive with SDCA and better than the best-tuned SGD.

27 Figure: Neural net results (nonconvex).
For nonconvex problems, it is useful to start with an initial vector $\tilde\omega_0$ that is close to a local minimum. The results show that SVRG reduces the variance and converges smoothly and faster than the best-tuned SGD.

28 Conclusion. For smooth and strongly convex functions, we prove that SVRG enjoys the same fast convergence rate as SAG. Unlike SAG, it does not require storing gradients, and it is more easily applicable to complex problems.

29 Thank you!
