Stochastic Gradient Descent
Ryan Tibshirani, Convex Optimization 10-725

Last time: proximal gradient descent. Consider the problem

$$\min_x \; g(x) + h(x)$$

with $g, h$ convex, $g$ differentiable, and $h$ "simple" in so much as

$$\mathrm{prox}_t(x) = \operatorname*{argmin}_z \; \frac{1}{2t}\|x - z\|_2^2 + h(z)$$

is computable. Proximal gradient descent: let $x^{(0)} \in \mathbb{R}^n$, repeat:

$$x^{(k)} = \mathrm{prox}_{t_k}\big(x^{(k-1)} - t_k \nabla g(x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$$

Step sizes $t_k$ are chosen to be fixed and small, or via backtracking. If $\nabla g$ is Lipschitz with constant $L$, then this has convergence rate $O(1/\epsilon)$. Lastly, we can accelerate this, to the optimal rate $O(1/\sqrt{\epsilon})$.

Outline. Today:
- Stochastic gradient descent
- Convergence rates
- Mini-batches
- Early stopping

Stochastic gradient descent. Consider minimizing an average of functions:

$$\min_x \; \frac{1}{m}\sum_{i=1}^m f_i(x)$$

Since $\nabla \frac{1}{m}\sum_{i=1}^m f_i(x) = \frac{1}{m}\sum_{i=1}^m \nabla f_i(x)$, gradient descent would repeat:

$$x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{m}\sum_{i=1}^m \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

In comparison, stochastic gradient descent or SGD (also called incremental gradient descent) repeats:

$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f_{i_k}(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

where $i_k \in \{1, \ldots, m\}$ is some chosen index at iteration $k$.
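To make the update concrete, here is a minimal NumPy sketch of SGD on an average of functions; the function names and the least-squares toy problem are illustrative choices, not from the slides.

import numpy as np

def sgd(grad_fi, x0, m, t=0.01, n_iters=1000, seed=0):
    # Minimal SGD: at each iteration, step along the negative gradient
    # of one randomly chosen component f_i
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_iters):
        i = rng.integers(m)           # i_k drawn uniformly from {0, ..., m-1}
        x = x - t * grad_fi(i, x)     # x^(k) = x^(k-1) - t_k * grad f_{i_k}(x^(k-1))
    return x

# Toy problem: f_i(x) = (a_i^T x - b_i)^2 / 2, so grad f_i(x) = (a_i^T x - b_i) a_i
rng = np.random.default_rng(1)
m, p = 100, 5
A = rng.standard_normal((m, p))
b = rng.standard_normal(m)
x_hat = sgd(lambda i, x: (A[i] @ x - b[i]) * A[i], np.zeros(p), m)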

Two rules for choosing index $i_k$ at iteration $k$:
- Randomized rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random
- Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$

The randomized rule is more common in practice (a small sketch of both rules follows below). For the randomized rule, note that

$$\mathbb{E}[\nabla f_{i_k}(x)] = \nabla f(x)$$

so we can view SGD as using an unbiased estimate of the gradient at each step.

Main appeal of SGD:
- Iteration cost is independent of $m$ (the number of functions)
- Can also be a big savings in terms of memory usage
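As a small illustration (helper names are hypothetical), the two index rules can be written as generators:

import itertools
import numpy as np

def randomized_rule(m, seed=0):
    # Draw i_k uniformly at random from {0, ..., m-1}, forever
    rng = np.random.default_rng(seed)
    while True:
        yield rng.integers(m)

def cyclic_rule(m):
    # i_k = 0, 1, ..., m-1, 0, 1, ..., m-1, ...
    yield from itertools.cycle(range(m))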

Example: stochastic logistic regression. Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \ldots, n$, recall logistic regression:

$$\min_\beta \; f(\beta) = \frac{1}{n}\sum_{i=1}^n \underbrace{\big(-y_i x_i^T \beta + \log(1 + \exp(x_i^T \beta))\big)}_{f_i(\beta)}$$

The gradient computation

$$\nabla f(\beta) = \frac{1}{n}\sum_{i=1}^n \big(p_i(\beta) - y_i\big) x_i, \quad \text{where } p_i(\beta) = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}$$

is doable when $n$ is moderate, but not when $n$ is huge. Full gradient (also called batch) versus stochastic gradient:
- One batch update costs $O(np)$
- One stochastic update costs $O(p)$

Clearly, e.g., 10K stochastic steps are much more affordable.
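The cost gap is visible directly in code. A sketch, assuming a design matrix X of shape n × p and labels y in {0, 1} (function names are illustrative): the batch gradient touches every row of X, while a stochastic gradient touches only one.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_grad(beta, X, y):
    # Batch gradient: touches all n rows of X, O(np) per call
    probs = sigmoid(X @ beta)                # p_i(beta) for all i
    return X.T @ (probs - y) / len(y)

def stoch_grad(beta, X, y, i):
    # Stochastic gradient of a single f_i: touches one row of X, O(p) per call
    return (sigmoid(X[i] @ beta) - y[i]) * X[i]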

Small example with $n = 10$, $p = 2$ to show the classic picture for batch versus stochastic methods:

[Figure: batch and stochastic (random) descent paths converging toward the optimum, marked *. Blue: batch steps, $O(np)$; red: stochastic steps, $O(p)$.]

Rule of thumb for stochastic methods:
- they generally thrive far from the optimum
- they generally struggle close to the optimum

Step sizes. Standard in SGD is to use diminishing step sizes, e.g., $t_k = 1/k$, for $k = 1, 2, 3, \ldots$

Why not fixed step sizes? Here's some intuition. Suppose we take the cyclic rule for simplicity. Setting $t_k = t$ for $m$ updates in a row, we get:

$$x^{(k+m)} = x^{(k)} - t\sum_{i=1}^m \nabla f_i(x^{(k+i-1)})$$

Meanwhile, full gradient descent with step size $t$ would give:

$$x^{(k+1)} = x^{(k)} - t\sum_{i=1}^m \nabla f_i(x^{(k)})$$

The difference here is $t\sum_{i=1}^m \big[\nabla f_i(x^{(k+i-1)}) - \nabla f_i(x^{(k)})\big]$, and if we hold $t$ constant, this difference will not generally go to zero.
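Plugging a diminishing schedule $t_k = c/k$ into the earlier SGD loop is a one-line change; a sketch (the scale constant c is a tuning knob, an assumption rather than something from the slides):

import numpy as np

def sgd_diminishing(grad_fi, x0, m, c=1.0, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(1, n_iters + 1):
        t_k = c / k                    # diminishing step size t_k = c/k
        i = rng.integers(m)
        x = x - t_k * grad_fi(i, x)
    return x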

Convergence rates. Recall: for convex $f$, gradient descent with diminishing step sizes satisfies

$$f(x^{(k)}) - f^\star = O(1/\sqrt{k})$$

When $f$ is differentiable with Lipschitz gradient, we get, for suitable fixed step sizes,

$$f(x^{(k)}) - f^\star = O(1/k)$$

What about SGD? For convex $f$, SGD with diminishing step sizes satisfies [1]

$$\mathbb{E}[f(x^{(k)})] - f^\star = O(1/\sqrt{k})$$

Unfortunately, this does not improve when we further assume $f$ has Lipschitz gradient.

[1] E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"

Even worse is the following discrepancy! When $f$ is strongly convex and has a Lipschitz gradient, gradient descent satisfies

$$f(x^{(k)}) - f^\star = O(c^k)$$

where $c < 1$. But under the same conditions, SGD gives us [2]

$$\mathbb{E}[f(x^{(k)})] - f^\star = O(1/k)$$

So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity. What can we do to improve SGD?

[2] E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"

Mini-batches. Also common is mini-batch stochastic gradient descent, where we choose a random subset $I_k \subseteq \{1, \ldots, m\}$ of size $|I_k| = b \ll m$, and repeat:

$$x^{(k)} = x^{(k-1)} - t_k \cdot \frac{1}{b}\sum_{i \in I_k} \nabla f_i(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

Again, we are approximating the full gradient by an unbiased estimate:

$$\mathbb{E}\Big[\frac{1}{b}\sum_{i \in I_k} \nabla f_i(x)\Big] = \nabla f(x)$$

Using mini-batches reduces the variance of our gradient estimate by a factor of $1/b$, but is also $b$ times more expensive.
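A sketch of the mini-batch update (names are illustrative): sampling $I_k$ without replacement and averaging the b component gradients keeps the estimate unbiased.

import numpy as np

def minibatch_sgd(grad_fi, x0, m, b=10, t=0.01, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_iters):
        I = rng.choice(m, size=b, replace=False)          # I_k: random subset with |I_k| = b
        g = np.mean([grad_fi(i, x) for i in I], axis=0)   # (1/b) * sum_{i in I_k} grad f_i(x)
        x = x - t * g
    return x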

Back to logistic regression, let's now consider a regularized version:

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \big(-y_i x_i^T \beta + \log(1 + e^{x_i^T \beta})\big) + \frac{\lambda}{2}\|\beta\|_2^2$$

Write the criterion as

$$f(\beta) = \frac{1}{n}\sum_{i=1}^n f_i(\beta), \quad f_i(\beta) = -y_i x_i^T \beta + \log(1 + e^{x_i^T \beta}) + \frac{\lambda}{2}\|\beta\|_2^2$$

The full gradient computation is $\nabla f(\beta) = \frac{1}{n}\sum_{i=1}^n \big(p_i(\beta) - y_i\big) x_i + \lambda \beta$. Comparison between methods:
- One batch update costs $O(np)$
- One mini-batch update costs $O(bp)$
- One stochastic update costs $O(p)$
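In code, the regularized component gradient just picks up an extra $\lambda\beta$ term relative to the unregularized sketch earlier (names illustrative):

import numpy as np

def stoch_grad_reg(beta, X, y, i, lam):
    # Gradient of f_i(beta) = -y_i x_i^T beta + log(1 + exp(x_i^T beta))
    #                         + (lam/2) * ||beta||_2^2, still O(p) per call
    p_i = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))
    return (p_i - y[i]) * X[i] + lam * beta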

Example with $n = 10{,}000$, $p = 20$, all methods using fixed step sizes:

[Figure: criterion value $f(\beta^{(k)})$ versus iteration number $k$ over 50 iterations, comparing full, stochastic, and mini-batch ($b = 10$ and $b = 100$) methods.]

What's happening? Now let's parametrize by flops:

[Figure: criterion value $f(\beta^{(k)})$ versus flop count (log scale), for the same four methods.]

Finally, looking at the suboptimality gap (on a log scale):

[Figure: suboptimality gap $f(\beta^{(k)}) - f^\star$ versus iteration number $k$, on a log scale, for the same four methods.]

End of the story? Short story:
- SGD can be super effective in terms of iteration cost and memory
- But SGD is slow to converge, and can't adapt to strong convexity
- And mini-batches seem to be a wash in terms of flops (though they can still be useful in practice)

Is this the end of the story for SGD? For a while, the answer was believed to be yes. Slow convergence for strongly convex functions was believed inevitable, as Nemirovski and others established matching lower bounds... but this was for a more general stochastic problem, where

$$f(x) = \int F(x, \xi) \, dP(\xi)$$

A new wave of variance reduction work shows that we can modify SGD to converge much faster for finite sums (more later?)

SGD in large-scale ML. SGD has really taken off in large-scale machine learning:
- In many ML problems we don't care about optimizing to high accuracy; it doesn't pay off in terms of statistical performance
- Thus (in contrast to what classic theory says) fixed step sizes are commonly used in ML applications
- One trick is to experiment with step sizes on a small fraction of the training data before running SGD on the full data set... many other heuristics are common [3]
- Many variants provide better practical stability and convergence: momentum, acceleration, averaging, coordinate-adapted step sizes, variance reduction (a sketch of one of these, iterate averaging, follows below)... See AdaGrad, Adam, AdaMax, SVRG, SAG, SAGA... (more later?)

[3] E.g., Bottou (2012), "Stochastic gradient descent tricks"
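None of these variants are developed in the slides; as one hypothetical illustration, the following sketches Polyak-Ruppert iterate averaging, among the simplest of the stabilizers listed above: run plain SGD, but return the running average of the iterates rather than the last iterate.

import numpy as np

def sgd_averaged(grad_fi, x0, m, t=0.01, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    x_bar = x.copy()
    for k in range(1, n_iters + 1):
        i = rng.integers(m)
        x = x - t * grad_fi(i, x)
        x_bar = x_bar + (x - x_bar) / (k + 1)   # running average of x^(0), ..., x^(k)
    return x_bar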

Early stopping. Suppose $p$ is large and we want to fit (say) a logistic regression model to data $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$, $i = 1, \ldots, n$.

We could solve (say) $\ell_2$ regularized logistic regression:

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \big(-y_i x_i^T \beta + \log(1 + e^{x_i^T \beta})\big) \quad \text{subject to } \|\beta\|_2 \le t$$

We could also run gradient descent on the unregularized problem:

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \big(-y_i x_i^T \beta + \log(1 + e^{x_i^T \beta})\big)$$

and stop early, i.e., terminate gradient descent well short of the global minimum.

Consider the following, for a very small constant step size $\epsilon$:
- Start at $\beta^{(0)} = 0$, the solution to the regularized problem at $t = 0$
- Perform gradient descent on the unregularized criterion:

$$\beta^{(k)} = \beta^{(k-1)} - \epsilon \cdot \frac{1}{n}\sum_{i=1}^n \big(p_i(\beta^{(k-1)}) - y_i\big) x_i, \quad k = 1, 2, 3, \ldots$$

(we could equally well consider SGD)
- Treat $\beta^{(k)}$ as an approximate solution to the regularized problem with $t = \|\beta^{(k)}\|_2$

This is called early stopping for gradient descent. Why would we ever do this? It's both more convenient and potentially much more efficient than using explicit regularization.
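A sketch of this recipe (function name and defaults are illustrative): run small fixed-step gradient descent from zero, recording each iterate together with its norm, so that iterate $k$ stands in for the constrained solution at $t = \|\beta^{(k)}\|_2$.

import numpy as np

def early_stopping_path(X, y, eps=1e-3, n_iters=500):
    # Gradient descent on the unregularized logistic criterion,
    # starting at zero and recording the whole path
    n, p = X.shape
    beta = np.zeros(p)                                 # beta^(0) = 0: the t = 0 solution
    path, norms = [beta.copy()], [0.0]
    for _ in range(n_iters):
        probs = 1.0 / (1.0 + np.exp(-(X @ beta)))      # p_i(beta) for all i
        beta = beta - eps * X.T @ (probs - y) / n      # small fixed-step gradient update
        path.append(beta.copy())
        norms.append(np.linalg.norm(beta))             # beta^(k) ~ solution at t = ||beta^(k)||_2
    return path, norms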

An intriguing connection. When we solve the $\ell_2$ regularized logistic problem for varying $t$... the solution path looks quite similar to the gradient descent path! Example with $p = 8$, solution and gradient descent paths side by side:

[Figure: two panels, "Ridge logistic path" and "Stagewise path", each plotting the eight coordinates of the solution against $g(\hat\beta(t))$ and $g(\beta^{(k)})$, respectively; the two sets of paths look strikingly similar.]

Lots left to explore:
- The connection holds beyond logistic regression, for arbitrary losses
- In general, the gradient descent path will not coincide with the $\ell_2$ regularized path (as $\epsilon \to 0$), though in practice it seems to give competitive statistical performance
- We can extend the early stopping idea to mimic a generic regularizer (beyond $\ell_2$) [4]
- There is a lot of literature on early stopping, but it's still not as well understood as it should be
- Early stopping is just one instance of implicit or algorithmic regularization... many others are effective in large-scale ML, and they all should be better understood

[4] Tibshirani (2015), "A general framework for fast stagewise algorithms"

References and further reading:
- D. Bertsekas (2010), "Incremental gradient, subgradient, and proximal methods for convex optimization: a survey"
- A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009), "Robust stochastic approximation approach to stochastic programming"
- R. Tibshirani (2015), "A general framework for fast stagewise algorithms"