Stochastic Gradient Descent with Variance Reduction


Stochastic Gradient Descent with Variance Reduction
Rie Johnson, Tong Zhang
Presenter: Jiawen Yao
March 17, 2015

Outline
1 Problem
2 Stochastic Average Gradient (SAG)
3 Accelerating SGD using Predictive Variance Reduction (SVRG)
4 Conclusion

Problem

Problem: Preliminaries

Recall a few definitions from convex analysis.

Definition 1. A function f(x) is L-Lipschitz continuous if
    |f(x_1) - f(x_2)| \le L \|x_1 - x_2\|    (1)
for all x_1, x_2 in dom(f).

Definition 2. A convex function f(x) is β-strongly convex if there exists a constant β > 0 such that for any α in [0, 1],
    f(\alpha x_1 + (1-\alpha) x_2) \le \alpha f(x_1) + (1-\alpha) f(x_2) - \tfrac{1}{2}\alpha(1-\alpha)\beta \|x_1 - x_2\|^2.    (2)

Problem: Preliminaries

When f(x) is differentiable, strong convexity is equivalent to
    f(x_1) \ge f(x_2) + \langle \nabla f(x_2), x_1 - x_2 \rangle + \tfrac{\beta}{2}\|x_1 - x_2\|^2.    (3)
Typically, the standard Euclidean norm is used to define Lipschitz continuity and strong convexity.
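
As a quick illustration (an example added here, not from the slides), a positive definite quadratic satisfies (3) with an explicit β:
    f(x) = \tfrac{1}{2} x^\top A x,\; A \succ 0:\quad
    f(x_1) - f(x_2) - \langle \nabla f(x_2), x_1 - x_2 \rangle = \tfrac{1}{2}(x_1 - x_2)^\top A (x_1 - x_2) \ge \tfrac{\lambda_{\min}(A)}{2}\|x_1 - x_2\|^2,
so f is strongly convex with β = λ_min(A); its gradient ∇f(x) = Ax is Lipschitz continuous with constant λ_max(A).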

Problem: Minimizing a finite average of convex functions

Let ψ_1, ..., ψ_n be a sequence of functions from R^d to R. The goal is to solve
    \min_\omega P(\omega), \qquad P(\omega) = \frac{1}{n}\sum_{i=1}^{n} \psi_i(\omega).    (4)

Assumptions:
each ψ_i(ω) is convex and differentiable
each ψ_i(ω) is smooth with Lipschitz constant L:
    \|\nabla\psi_i(\omega) - \nabla\psi_i(\omega')\| \le L \|\omega - \omega'\|    (5)
P(ω) is strongly convex with parameter γ:
    P(\omega) \ge P(\omega') + \tfrac{\gamma}{2}\|\omega - \omega'\|^2 + \nabla P(\omega')^\top(\omega - \omega')    (6)
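
For concreteness (an illustrative instance, not from the slides), ridge-regularized least squares has exactly this finite-sum form:
    \psi_i(\omega) = \tfrac{1}{2}(x_i^\top \omega - y_i)^2 + \tfrac{\lambda}{2}\|\omega\|^2,
    \qquad
    P(\omega) = \frac{1}{n}\sum_{i=1}^{n}\psi_i(\omega),
where each ψ_i satisfies (5) with L = max_i ||x_i||^2 + λ and P satisfies (6) with γ = λ.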

Problem: Gradient Descent and Stochastic Gradient Descent

Gradient Descent:
    \omega^{(t)} = \omega^{(t-1)} - \eta_t \nabla P(\omega^{(t-1)}) = \omega^{(t-1)} - \frac{\eta_t}{n}\sum_{i=1}^{n}\nabla\psi_i(\omega^{(t-1)})    (7)

Stochastic Gradient Descent: draw i_t randomly from {1, ..., n} and update
    \omega^{(t)} = \omega^{(t-1)} - \eta_t \nabla\psi_{i_t}(\omega^{(t-1)})    (8)
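
A minimal Python sketch of updates (7) and (8) on the ridge-regression instance above. The data, step sizes, iteration counts, and helper names (grad_psi, full_grad) are illustrative assumptions added here, not from the slides:

import numpy as np

# illustrative synthetic data for the ridge-regression finite sum above
rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_psi(w, i):
    # gradient of psi_i(w) = 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2
    return X[i] * (X[i] @ w - y[i]) + lam * w

def full_grad(w):
    # gradient of P(w) = (1/n) * sum_i psi_i(w)
    return X.T @ (X @ w - y) / n + lam * w

# Gradient descent, update (7): one full pass over the data per step
w_gd = np.zeros(d)
for t in range(200):
    w_gd -= 0.1 * full_grad(w_gd)

# Stochastic gradient descent, update (8): one sampled gradient per step,
# with a decaying step size eta_t
w_sgd = np.zeros(d)
for t in range(10 * n):
    i_t = rng.integers(n)
    eta_t = 0.01 / (1 + 1e-3 * t)
    w_sgd -= eta_t * grad_psi(w_sgd, i_t)

print(np.linalg.norm(full_grad(w_gd)), np.linalg.norm(full_grad(w_sgd)))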

Problem: SGD, general form

A more general version of SGD is
    \omega^{(t)} = \omega^{(t-1)} - \eta_t\, g_t(\omega^{(t-1)}, \xi_t)    (9)
where ξ_t is a random variable that may depend on ω^{(t-1)}, and the expectation satisfies
    E[g_t(\omega^{(t-1)}, \xi_t) \mid \omega^{(t-1)}] = \nabla P(\omega^{(t-1)}).
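
For example (added for illustration), a mini-batch B_t of size b drawn uniformly at random gives one such g_t, since it is an unbiased estimate of the full gradient:
    g_t(\omega^{(t-1)}, \xi_t) = \frac{1}{b}\sum_{i \in B_t} \nabla\psi_i(\omega^{(t-1)}),
    \qquad
    E[g_t(\omega^{(t-1)}, \xi_t) \mid \omega^{(t-1)}] = \nabla P(\omega^{(t-1)}).
Plain SGD (8) is the special case b = 1.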

Problem: Variance

For general convex optimization, stochastic gradient descent methods obtain an O(1/\sqrt{T}) convergence rate in expectation. The randomness introduces variance: if g_t(ω^{(t-1)}, ξ_t) is very large, the variance is large and it slows down convergence.

Stochastic Average Gradient (SAG)

Stochastic Average Gradient (SAG)

SAG method (Le Roux, Schmidt, Bach 2012):
    \omega_t = \omega_{t-1} - \frac{\eta}{n}\sum_{i=1}^{n} g_t^{(i)}    (10)
where
    g_t^{(i)} = \begin{cases} \nabla\psi_i(\omega_{t-1}) & \text{if } i = i_t \\ g_{t-1}^{(i)} & \text{otherwise} \end{cases}    (11)
SAG needs to store all n gradients, which is not practical in some cases.
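
A minimal sketch of (10)-(11) in Python, reusing the ridge-regression setup and helper names from the earlier sketch (X, y, n, d, rng, grad_psi); the step size and iteration count are illustrative. The table g of size n x d is exactly the per-example gradient storage that the slide points out can be impractical:

# SAG: keep a table with the most recent gradient seen for each example
w_sag = np.zeros(d)
g = np.zeros((n, d))          # stored gradients g^(i), one row per example
g_sum = g.sum(axis=0)         # running sum, so update (10) costs O(d) per step
eta = 0.01

for t in range(10 * n):
    i_t = rng.integers(n)
    new_g = grad_psi(w_sag, i_t)   # gradient of psi_{i_t} at the current iterate
    g_sum += new_g - g[i_t]        # refresh the stored entry for example i_t
    g[i_t] = new_g
    w_sag -= eta / n * g_sum       # update (10)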

Accelerating SGD using Predictive Variance Reduction (SVRG)

SVRG: Motivation and contributions

Motivation:
Reduce the variance: stochastic gradient descent converges slowly in the asymptotic regime due to the inherent variance.
SAG needs to store all gradients.

Contributions:
No need to store the intermediate gradients.
The same convergence rate as SAG.
Under mild assumptions, it even works in nonconvex cases.

Stochastic variance reduced gradient (SVRG)

SVRG (Johnson & Zhang, NIPS 2013) update form:
    \omega^{(t)} = \omega^{(t-1)} - \eta_t\left(\nabla\psi_{i_t}(\omega^{(t-1)}) - \nabla\psi_{i_t}(\tilde\omega) + \nabla P(\tilde\omega)\right)    (12)
The snapshot ω̃ is updated periodically (every m SGD iterations).

Figure: Intuition of variance reduction.
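
One way to see the variance reduction (a standard argument, spelling out what the figure illustrates): the correction term anchored at the snapshot makes the stochastic direction vanish at the optimum,
    v_t = \nabla\psi_{i_t}(\omega^{(t-1)}) - \nabla\psi_{i_t}(\tilde\omega) + \nabla P(\tilde\omega)
    \;\to\;
    \nabla\psi_{i_t}(\omega_*) - \nabla\psi_{i_t}(\omega_*) + \nabla P(\omega_*) = 0
    \quad \text{as } \omega^{(t-1)}, \tilde\omega \to \omega_*,
whereas the plain SGD direction ∇ψ_{i_t}(ω^{(t-1)}) tends to ∇ψ_{i_t}(ω_*), which is generally nonzero, so its variance does not shrink even near the optimum. This is why SVRG can use a constant step size.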

Procedure SVRG

input: update frequency m and learning rate η
initialization: ω̃_0
for s = 1, 2, ... do
    ω̃ = ω̃_{s-1}
    μ̃ = ∇P(ω̃) = (1/n) Σ_{i=1}^{n} ∇ψ_i(ω̃)
    ω_0 = ω̃
    for t = 1, ..., m do
        randomly pick i_t ∈ {1, ..., n} and update the weight:
        ω_t = ω_{t-1} - η (∇ψ_{i_t}(ω_{t-1}) - ∇ψ_{i_t}(ω̃) + μ̃)
    end for
    option I: set ω̃_s = ω_m
    option II: set ω̃_s = ω_t for randomly chosen t ∈ {0, ..., m-1}
end for
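
The procedure translates almost line-for-line into Python. This sketch reuses the ridge-regression instance and helpers defined earlier (X, y, n, d, rng, grad_psi, full_grad); the choices of m, η, the number of stages, and Option I for the snapshot are illustrative assumptions:

# smoothness constant of the psi_i for this instance, used to set the step size
L_max = np.max(np.sum(X ** 2, axis=1)) + lam

def svrg(w0, n_stages=30, m=2 * n, eta=0.1 / L_max):
    """SVRG with Option I: the snapshot is the last inner iterate."""
    w_tilde = w0.copy()
    for s in range(n_stages):
        mu = full_grad(w_tilde)                    # full gradient at the snapshot
        w = w_tilde.copy()
        for t in range(m):
            i_t = rng.integers(n)
            v = grad_psi(w, i_t) - grad_psi(w_tilde, i_t) + mu   # update (12)
            w = w - eta * v
        w_tilde = w                                # Option I
    return w_tilde

w_svrg = svrg(np.zeros(d))
print("||grad P(w)|| =", np.linalg.norm(full_grad(w_svrg)))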

Convergence of SVRG

Theorem. Consider SVRG with option II. Assume that all ψ_i(ω) are convex and smooth and that P(ω) is strongly convex. Let ω_* = argmin_ω P(ω). Assume that m is sufficiently large so that
    \alpha = \frac{1}{\gamma\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} < 1.
Then we have geometric convergence in expectation for SVRG:
    E\,P(\tilde\omega_s) \le P(\omega_*) + \alpha^s\left[P(\tilde\omega_0) - P(\omega_*)\right].
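
As a quick sanity check of the condition α < 1 (a worked example; the specific constants are a commonly quoted illustrative choice, not taken from these slides):
    \eta = \frac{0.1}{L}, \quad m = \frac{50L}{\gamma}
    \;\Longrightarrow\;
    2L\eta = 0.2, \quad
    \alpha = \frac{1}{\gamma\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta}
    = \frac{1}{0.1 \cdot 0.8 \cdot 50} + \frac{0.2}{0.8} = 0.25 + 0.25 = 0.5,
so the expected suboptimality halves every outer stage, and m only needs to grow like the condition number L/γ.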

Proof

Given any i, consider
    g_i(\omega) = \psi_i(\omega) - \psi_i(\omega_*) - \nabla\psi_i(\omega_*)^\top(\omega - \omega_*).    (13)
Since ∇g_i(ω_*) = 0, ω_* is a minimizer of g_i, i.e. g_i(ω_*) = min_ω g_i(ω). Therefore
    0 = g_i(\omega_*) \le \min_\eta\, g_i\big(\omega - \eta\nabla g_i(\omega)\big)
    \le \min_\eta\left[g_i(\omega) - \eta\|\nabla g_i(\omega)\|^2 + \tfrac{L}{2}\eta^2\|\nabla g_i(\omega)\|^2\right].    (14)
Here the last step uses the well-known inequality for a function f with L-Lipschitz continuous gradient:
    f(x) - f(y) \le \langle \nabla f(y), x - y \rangle + \tfrac{L}{2}\|x - y\|^2.

Proof (continued)

Setting η = 1/L in (14) gives
    0 = g_i(\omega_*) \le g_i(\omega) - \frac{1}{2L}\|\nabla g_i(\omega)\|^2,    (15)
which can be rewritten as
    \|\nabla g_i(\omega)\|^2 \le 2L\, g_i(\omega).    (16)
Using the definition of g_i(ω) and ∇g_i(ω) = ∇ψ_i(ω) - ∇ψ_i(ω_*), inequality (16) becomes
    \|\nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)\|^2 \le 2L\left[\psi_i(\omega) - \psi_i(\omega_*) - \nabla\psi_i(\omega_*)^\top(\omega - \omega_*)\right].    (17)

Proof (continued)

Summing inequality (17) over i = 1, ..., n, and using P(ω) = (1/n) Σ_{i=1}^{n} ψ_i(ω) and ∇P(ω_*) = 0, we get
    \frac{1}{n}\sum_{i=1}^{n}\|\nabla\psi_i(\omega) - \nabla\psi_i(\omega_*)\|^2 \le 2L\left[P(\omega) - P(\omega_*)\right].    (18)
Let μ̃ = ∇P(ω̃) and v_t = ∇ψ_{i_t}(ω_{t-1}) - ∇ψ_{i_t}(ω̃) + μ̃; v_t is the variance-reduced stochastic gradient used by SVRG.

Proof (continued)

Taking expectation with respect to i_t:
    E\|v_t\|^2 = E\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\tilde\omega) + \tilde\mu\|^2
    \le 2E\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\|^2 + 2E\|[\nabla\psi_{i_t}(\tilde\omega) - \nabla\psi_{i_t}(\omega_*)] - \tilde\mu\|^2
    = 2E\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\|^2 + 2E\|[\nabla\psi_{i_t}(\tilde\omega) - \nabla\psi_{i_t}(\omega_*)] - E[\nabla\psi_{i_t}(\tilde\omega) - \nabla\psi_{i_t}(\omega_*)]\|^2
    \le 2E\|\nabla\psi_{i_t}(\omega_{t-1}) - \nabla\psi_{i_t}(\omega_*)\|^2 + 2E\|\nabla\psi_{i_t}(\tilde\omega) - \nabla\psi_{i_t}(\omega_*)\|^2
    \le 4L\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde\omega) - P(\omega_*)\right].    (19)
The first inequality uses \|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2, the second uses E\|X - EX\|^2 \le E\|X\|^2, and the third uses (18).

Proof (continued)

The SVRG update is ω_t = ω_{t-1} - η v_t. Conditioning on ω_{t-1},
    E\|\omega_t - \omega_*\|^2 = E\|\omega_{t-1} - \omega_* - \eta v_t\|^2
    = \|\omega_{t-1} - \omega_*\|^2 - 2\eta(\omega_{t-1} - \omega_*)^\top E v_t + \eta^2 E\|v_t\|^2.
Here E v_t = E[∇ψ_{i_t}(ω_{t-1}) - ∇ψ_{i_t}(ω̃) + μ̃] = ∇P(ω_{t-1}). Using (19), we get
    E\|\omega_t - \omega_*\|^2 \le \|\omega_{t-1} - \omega_*\|^2 - 2\eta(\omega_{t-1} - \omega_*)^\top\nabla P(\omega_{t-1}) + 4L\eta^2\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde\omega) - P(\omega_*)\right].    (20)
By convexity of P(ω),
    -(\omega_{t-1} - \omega_*)^\top\nabla P(\omega_{t-1}) \le P(\omega_*) - P(\omega_{t-1}).    (21)

Proof (continued)

Combining (20) and (21),
    E\|\omega_t - \omega_*\|^2 \le \|\omega_{t-1} - \omega_*\|^2 - 2\eta\left[P(\omega_{t-1}) - P(\omega_*)\right] + 4L\eta^2\left[P(\omega_{t-1}) - P(\omega_*) + P(\tilde\omega) - P(\omega_*)\right]
    = \|\omega_{t-1} - \omega_*\|^2 - 2\eta(1 - 2L\eta)\left[P(\omega_{t-1}) - P(\omega_*)\right] + 4L\eta^2\left[P(\tilde\omega) - P(\omega_*)\right].
In each fixed stage s, ω̃ = ω̃_{s-1}, and ω̃_s is selected after all m updates have completed. Summing the previous inequality over t = 1, ..., m and taking expectation over the history gives
    E\|\omega_m - \omega_*\|^2 + 2\eta(1 - 2L\eta)m\, E\left[P(\tilde\omega_s) - P(\omega_*)\right]
    \le E\|\tilde\omega - \omega_*\|^2 + 4Lm\eta^2 E\left[P(\tilde\omega) - P(\omega_*)\right]    (22)
    \le \frac{2}{\gamma}E\left[P(\tilde\omega) - P(\omega_*)\right] + 4Lm\eta^2 E\left[P(\tilde\omega) - P(\omega_*)\right],    (23)
where the last step uses strong convexity (6) at ω' = ω_*: \|\tilde\omega - \omega_*\|^2 \le \tfrac{2}{\gamma}\left[P(\tilde\omega) - P(\omega_*)\right].

Proof (continued)

The above inequality implies
    2\eta(1 - 2L\eta)m\, E\left[P(\tilde\omega_s) - P(\omega_*)\right] \le \frac{2}{\gamma}E\left[P(\tilde\omega) - P(\omega_*)\right] + 4Lm\eta^2 E\left[P(\tilde\omega) - P(\omega_*)\right],    (24)
which can be rewritten as
    E\left[P(\tilde\omega_s) - P(\omega_*)\right] \le \alpha\, E\left[P(\tilde\omega_{s-1}) - P(\omega_*)\right]    (25)
where
    \alpha = \frac{1}{\gamma\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta}.    (26)

Proof (continued)

Applying (25) recursively over the stages yields the desired bound of the Theorem:
    E\left[P(\tilde\omega_s) - P(\omega_*)\right] \le \alpha^s E\left[P(\tilde\omega_0) - P(\omega_*)\right].    (27)
The bound in Theorem 1 is comparable to those of Le Roux et al. [2012] and Shalev-Shwartz and Zhang [2012]. SVRG thus converges geometrically (linearly) in the number of stages, which improves on the standard SGD convergence rate of O(1/\sqrt{T}).

Experiments

Figure: (a) Training loss comparison with SGD with fixed learning rates. (b) Training loss residual P(ω) - P(ω_*). (c) Variance of the weight update.

It is hard to find a good η for SGD. With a single, relatively large value of η, SVRG decreases the training loss more smoothly and faster than SGD.

Experiments

Figure: More convex-case results. Loss residual P(ω) - P(ω_*) (top) and test error rates (bottom).

SVRG is competitive with SDCA and better than the best-tuned SGD.

Figure: Neural net results (nonconvex).

For nonconvex problems, it is useful to start with an initial vector ω̃_0 that is close to a local minimum. The results show that SVRG reduces the variance and converges more smoothly and faster than the best-tuned SGD.

Conclusion

For smooth and strongly convex functions, SVRG provably enjoys the same fast convergence rate as SAG.
Unlike SAG, SVRG does not require storing gradients.
Unlike SAG, SVRG is more easily applicable to complex problems.

Thank you!