Beyond stochastic gradient descent for large-scale machine learning


Beyond stochastic gradient descent for large-scale machine learning
Francis Bach, INRIA - Ecole Normale Supérieure, Paris, France
Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt. CAP, July 2014

Big data revolution? A new scientific context
- Data everywhere: size does not (always) matter
- Science and industry
- Size and variety
- Learning from examples: n observations in dimension p

- Search engines, advertising
- Marketing, personalized recommendation
- Visual object recognition
- Personal photos
- Bioinformatics: proteins are crucial elements of cell life; massive data (2 million for humans); complex data

Context: machine learning for big data
- Large-scale machine learning: large p, large n (p: dimension of each observation (input); n: number of observations)
- Examples: computer vision, bioinformatics, advertising
- Ideal running-time complexity: O(pn)
Going back to simple methods:
- Stochastic gradient methods (Robbins and Monro, 1951)
- Mixing statistics and optimization
- Using smoothness to go beyond stochastic gradient descent


Outline
- Introduction: stochastic approximation algorithms
- Supervised machine learning and convex optimization
- Stochastic gradient and averaging; strongly convex vs. non-strongly convex
- Fast convergence through smoothness and constant step-sizes: online Newton steps (Bach and Moulines, 2013), O(1/n) convergence rate for all convex functions
- More than a single pass through the data: stochastic average gradient (Le Roux, Schmidt, and Bach, 2012), linear (exponential) convergence rate for strongly convex functions

Supervised machine learning
- Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.
- Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^p$
- (Regularized) empirical risk minimization: find $\hat\theta$ solution of
  $\min_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu\,\Omega(\theta)$
  (convex data-fitting term + regularizer)


Usual losses
- Regression: $y \in \mathbb{R}$, prediction $\hat y = \langle \theta, \Phi(x) \rangle$; quadratic loss $\frac{1}{2}(y - \hat y)^2 = \frac{1}{2}(y - \langle \theta, \Phi(x) \rangle)^2$
- Classification: $y \in \{-1, 1\}$, prediction $\hat y = \mathrm{sign}(\langle \theta, \Phi(x) \rangle)$; losses of the form $\ell(y \langle \theta, \Phi(x) \rangle)$
- True 0-1 loss: $\ell(y \langle \theta, \Phi(x) \rangle) = 1_{y \langle \theta, \Phi(x) \rangle < 0}$
- Usual convex losses: hinge, square, logistic
[Figure: 0-1, hinge, square, and logistic losses as functions of the margin $y\langle\theta,\Phi(x)\rangle$]
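To make these losses concrete, here is a minimal NumPy sketch (illustrative, not part of the original slides) of the 0-1 loss and its three convex surrogates as functions of the margin $y\langle\theta,\Phi(x)\rangle$:

```python
import numpy as np

def zero_one_loss(margin):
    # True 0-1 loss: 1 when the margin y <theta, Phi(x)> is negative
    return (margin < 0).astype(float)

def hinge_loss(margin):
    # Convex surrogate max(0, 1 - margin) (support vector machine)
    return np.maximum(0.0, 1.0 - margin)

def square_loss(margin):
    # For y in {-1, 1}, (y - yhat)^2 / 2 equals (1 - margin)^2 / 2
    return 0.5 * (1.0 - margin) ** 2

def logistic_loss(margin):
    # log(1 + exp(-margin)), computed stably
    return np.logaddexp(0.0, -margin)

margins = np.linspace(-3.0, 4.0, 8)
for loss in (zero_one_loss, hinge_loss, square_loss, logistic_loss):
    print(f"{loss.__name__:>14}:", np.round(loss(margins), 2))
```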


Supervised machine learning (continued)
- Empirical risk: $\hat f(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i)\rangle)$ (training cost)
- Expected risk: $f(\theta) = \mathbb{E}_{(x,y)}\,\ell(y, \langle \theta, \Phi(x)\rangle)$ (testing cost)
- Two fundamental questions: (1) computing $\hat\theta$ and (2) analyzing $\hat\theta$; the two may be tackled simultaneously



Smoothness and strong convexity
- A function $g: \mathbb{R}^p \to \mathbb{R}$ is L-smooth if and only if it is twice differentiable and $\forall \theta \in \mathbb{R}^p$, $g''(\theta) \preceq L \cdot \mathrm{Id}$
- Machine learning with $g(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i)\rangle)$: Hessian $\approx$ covariance matrix $\frac{1}{n}\sum_{i=1}^n \Phi(x_i)\Phi(x_i)^\top$; bounded data gives smoothness


Smoothness and strong convexity
- A twice differentiable function $g: \mathbb{R}^p \to \mathbb{R}$ is µ-strongly convex if and only if $\forall \theta \in \mathbb{R}^p$, $g''(\theta) \succeq \mu \cdot \mathrm{Id}$
- Machine learning with $g(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i)\rangle)$: Hessian $\approx$ covariance matrix $\frac{1}{n}\sum_{i=1}^n \Phi(x_i)\Phi(x_i)^\top$; strong convexity requires data with invertible covariance matrix (low correlation/dimension)
- Adding regularization by $\frac{\mu}{2}\|\theta\|^2$ creates additional bias unless µ is small
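As an illustrative aside (not from the slides), for the least-squares objective both constants can be read off the spectrum of the empirical covariance matrix; a minimal sketch on synthetic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 20
Phi = rng.standard_normal((n, p))       # synthetic features Phi(x_i)

# For least-squares, g(theta) = (1/(2n)) sum_i (y_i - <theta, Phi(x_i)>)^2,
# the Hessian is the empirical covariance H = (1/n) Phi^T Phi.
H = Phi.T @ Phi / n
eigs = np.linalg.eigvalsh(H)            # eigenvalues in ascending order

L, mu = eigs[-1], eigs[0]
print(f"L-smooth with L ~= {L:.2f}; mu-strongly convex with mu ~= {mu:.2f}")
# A ridge penalty (mu0/2) ||theta||^2 shifts every eigenvalue up by mu0,
# restoring strong convexity at the price of the bias noted above.
```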


Iterative methods for minimizing smooth functions
- Assumption: g convex and smooth on $\mathbb{R}^p$
- Gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1})$; O(1/t) convergence rate for convex functions, $O(e^{-\rho t})$ for strongly convex functions
- Newton method: $\theta_t = \theta_{t-1} - g''(\theta_{t-1})^{-1} g'(\theta_{t-1})$; $O(e^{-\rho 2^t})$ convergence rate
Key insights from Bottou and Bousquet (2008):
1. In machine learning, no need to optimize below statistical error
2. In machine learning, cost functions are averages ⇒ stochastic approximation
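A minimal sketch contrasting the two recursions on a toy quadratic objective (illustrative; on a quadratic a single Newton step is exact, the limiting case of the error-squaring behavior on general smooth functions):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 20
A = rng.standard_normal((p, p))
H = A.T @ A / p + 0.1 * np.eye(p)          # positive definite Hessian
b = rng.standard_normal(p)
theta_star = np.linalg.solve(H, b)         # minimizer of the quadratic g

def grad(theta):
    # g'(theta) for g(theta) = theta^T H theta / 2 - b^T theta
    return H @ theta - b

# Gradient descent with constant step 1/L
L = np.linalg.eigvalsh(H)[-1]
theta = np.zeros(p)
for t in range(100):
    theta -= (1.0 / L) * grad(theta)
print("gradient descent error:", np.linalg.norm(theta - theta_star))

# Newton step: exact in one iteration on a quadratic objective
theta_newton = np.zeros(p) - np.linalg.solve(H, grad(np.zeros(p)))
print("Newton error:", np.linalg.norm(theta_newton - theta_star))
```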


Stochastic approximation
- Goal: minimize a function f defined on $\mathbb{R}^p$, given only unbiased estimates $f'_n(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathbb{R}^p$
Machine learning - statistics:
- $f(\theta) = \mathbb{E} f_n(\theta) = \mathbb{E}\,\ell(y_n, \langle \theta, \Phi(x_n)\rangle)$ = generalization error
- Loss for a single pair of observations: $f_n(\theta) = \ell(y_n, \langle \theta, \Phi(x_n)\rangle)$
- Expected gradient: $f'(\theta) = \mathbb{E} f'_n(\theta) = \mathbb{E}\{\ell'(y_n, \langle \theta, \Phi(x_n)\rangle)\,\Phi(x_n)\}$
- Beyond convex optimization: see, e.g., Benveniste et al. (2012)


Convex stochastic approximation
- Key assumption: smoothness and/or strong convexity
- Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro): $\theta_n = \theta_{n-1} - \gamma_n f'_n(\theta_{n-1})$
- Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n+1}\sum_{k=0}^n \theta_k$
- Which learning-rate sequence $\gamma_n$? Classical setting: $\gamma_n = Cn^{-\alpha}$
- Running time = O(np): single pass through the data, one line of code among many
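A minimal sketch of the Robbins-Monro recursion with Polyak-Ruppert averaging, here on synthetic least-squares data with the classical step size $\gamma_n = Cn^{-\alpha}$ and the illustrative choice $C = 1/(2R^2)$ (not prescribed by the slides):

```python
import numpy as np

def averaged_sgd(Phi, y, alpha=0.5):
    """Single pass of theta_n = theta_{n-1} - gamma_n f'_n(theta_{n-1}) on
    least-squares, returning the last iterate and the Polyak-Ruppert average."""
    n, p = Phi.shape
    R2 = np.max(np.sum(Phi ** 2, axis=1))
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    for k in range(n):
        gamma = 1.0 / (2.0 * R2 * (k + 1) ** alpha)   # gamma_n = C n^{-alpha}
        theta -= gamma * (Phi[k] @ theta - y[k]) * Phi[k]
        theta_bar += (theta - theta_bar) / (k + 1)    # running average of iterates
    return theta, theta_bar

rng = np.random.default_rng(1)
n, p = 20000, 20
Phi = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = Phi @ theta_star + 0.1 * rng.standard_normal(n)
theta, theta_bar = averaged_sgd(Phi, y)
print("last iterate error:", np.linalg.norm(theta - theta_star))
print("averaged error:    ", np.linalg.norm(theta_bar - theta_star))
```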


Convex stochastic approximation: existing work
- Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012):
  - Strongly convex: $O((\mu n)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$
  - Non-strongly convex: $O(n^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$
- Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes $\gamma_n = Cn^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$ for smooth strongly convex problems
- A single algorithm for smooth problems with convergence rate O(1/n) in all situations?


Least-mean-square algorithm
- Least-squares: $f(\theta) = \frac{1}{2}\mathbb{E}\big[(y_n - \langle \Phi(x_n), \theta\rangle)^2\big]$ with $\theta \in \mathbb{R}^p$
- SGD = least-mean-square algorithm (see, e.g., Macchi, 1995), usually studied without averaging and decreasing step-sizes, with the strong convexity assumption $\mathbb{E}[\Phi(x_n)\Phi(x_n)^\top] = H \succeq \mu \cdot \mathrm{Id}$
New analysis for averaging and constant step-size $\gamma = 1/(4R^2)$:
- Assume $\|\Phi(x_n)\| \le R$ and $|y_n - \langle \Phi(x_n), \theta_*\rangle| \le \sigma$ almost surely
- No assumption regarding lowest eigenvalues of H
- Main result: $\mathbb{E} f(\bar\theta_{n-1}) - f(\theta_*) \le \frac{4\sigma^2 p}{n} + \frac{4R^2 \|\theta_0 - \theta_*\|^2}{n}$
- Matches the statistical lower bound (Tsybakov, 2003)
- Non-asymptotic robust version of Györfi and Walk (1996)
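A sketch of the recursion analyzed above: constant step size $\gamma = 1/(4R^2)$, with $R^2$ estimated from the data, plus iterate averaging (synthetic data; names are illustrative):

```python
import numpy as np

def averaged_lms(Phi, y):
    """Constant-step-size LMS with iterate averaging, gamma = 1/(4 R^2)."""
    n, p = Phi.shape
    R2 = np.max(np.sum(Phi ** 2, axis=1))   # R^2 with ||Phi(x_n)|| <= R
    gamma = 1.0 / (4.0 * R2)
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    for k in range(n):
        theta -= gamma * (Phi[k] @ theta - y[k]) * Phi[k]   # LMS recursion
        theta_bar += (theta - theta_bar) / (k + 1)          # averaged iterate
    return theta_bar

rng = np.random.default_rng(1)
n, p = 20000, 20
Phi = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = Phi @ theta_star + 0.1 * rng.standard_normal(n)
print("error of averaged iterate:",
      np.linalg.norm(averaged_lms(Phi, y) - theta_star))
```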


Markov chain interpretation of constant step sizes
- LMS recursion for $f_n(\theta) = \frac{1}{2}(y_n - \langle \Phi(x_n), \theta\rangle)^2$: $\theta_n = \theta_{n-1} - \gamma(\langle \Phi(x_n), \theta_{n-1}\rangle - y_n)\Phi(x_n)$
- The sequence $(\theta_n)_n$ is a homogeneous Markov chain: convergence to a stationary distribution $\pi_\gamma$ with expectation $\bar\theta_\gamma \stackrel{\text{def}}{=} \int \theta\,\pi_\gamma(d\theta)$
- For least-squares, $\bar\theta_\gamma = \theta_*$: $\theta_n$ does not converge to $\theta_*$ but oscillates around it, with oscillations of order $\sqrt\gamma$
- Ergodic theorem: averaged iterates converge to $\bar\theta_\gamma = \theta_*$ at rate O(1/n)

Simulations - synthetic examples (Gaussian distributions, p = 20)
[Figure: synthetic square loss, $\log_{10}[f(\theta) - f(\theta_*)]$ vs. $\log_{10}(n)$ for constant step sizes $1/(2R^2)$, $1/(8R^2)$, $1/(32R^2)$ and decaying step size $1/(2R^2 n^{1/2})$]

Simulations - benchmarks: alpha (p = 500, n = 500 000) and news (p = 1 300 000, n = 20 000)
[Figure: four panels (alpha/news, square loss, C = 1 and C = opt), test performance $\log_{10}[f(\theta) - f(\theta_*)]$ vs. $\log_{10}(n)$; methods: constant step $C/R^2$, decaying $C/(R^2 n^{1/2})$, and SAG]


Beyond least-squares - Markov chain interpretation
- Recursion $\theta_n = \theta_{n-1} - \gamma f'_n(\theta_{n-1})$ also defines a Markov chain
- Stationary distribution $\pi_\gamma$ such that $\int f'(\theta)\,\pi_\gamma(d\theta) = 0$
- When $f'$ is not linear, $f'\big(\int \theta\,\pi_\gamma(d\theta)\big) \ne \int f'(\theta)\,\pi_\gamma(d\theta) = 0$
- $\theta_n$ oscillates around the wrong value $\bar\theta_\gamma \ne \theta_*$; moreover, $\|\theta_* - \theta_n\| = O_p(\sqrt\gamma)$
- Ergodic theorem: averaged iterates converge to $\bar\theta_\gamma \ne \theta_*$ at rate O(1/n); moreover, $\|\theta_* - \bar\theta_\gamma\| = O(\gamma)$ (Bach, 2013)

Simulations - synthetic examples (Gaussian distributions, p = 20)
[Figure: synthetic logistic loss, $\log_{10}[f(\theta) - f(\theta_*)]$ vs. $\log_{10}(n)$ for constant step sizes $1/(2R^2)$, $1/(8R^2)$, $1/(32R^2)$ and decaying step size $1/(2R^2 n^{1/2})$]


Restoring convergence through online Newton steps
Known facts:
1. Averaged SGD with $\gamma_n \propto n^{-1/2}$ leads to robust rate $O(n^{-1/2})$ for all convex functions
2. Averaged SGD with $\gamma_n$ constant leads to robust rate $O(n^{-1})$ for all convex quadratic functions
3. Newton's method squares the error at each iteration for smooth functions: $O((n^{-1/2})^2)$
4. A single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion
Online Newton step: rate $O((n^{-1/2})^2 + n^{-1}) = O(n^{-1})$; complexity O(p) per iteration


Restoring convergence through online Newton steps
- The Newton step for $f = \mathbb{E} f_n(\theta) \stackrel{\text{def}}{=} \mathbb{E}[\ell(y_n, \langle \theta, \Phi(x_n)\rangle)]$ at $\tilde\theta$ is equivalent to minimizing the quadratic approximation
  $g(\theta) = f(\tilde\theta) + \langle f'(\tilde\theta), \theta - \tilde\theta\rangle + \frac{1}{2}\langle \theta - \tilde\theta, f''(\tilde\theta)(\theta - \tilde\theta)\rangle$
  $\quad\;\; = f(\tilde\theta) + \langle \mathbb{E} f'_n(\tilde\theta), \theta - \tilde\theta\rangle + \frac{1}{2}\langle \theta - \tilde\theta, \mathbb{E} f''_n(\tilde\theta)(\theta - \tilde\theta)\rangle$
  $\quad\;\; = \mathbb{E}\big[f(\tilde\theta) + \langle f'_n(\tilde\theta), \theta - \tilde\theta\rangle + \frac{1}{2}\langle \theta - \tilde\theta, f''_n(\tilde\theta)(\theta - \tilde\theta)\rangle\big]$
- Complexity of the least-mean-square recursion for g is O(p): $\theta_n = \theta_{n-1} - \gamma\big[f'_n(\tilde\theta) + f''_n(\tilde\theta)(\theta_{n-1} - \tilde\theta)\big]$, since $f''_n(\tilde\theta) = \ell''(y_n, \langle \tilde\theta, \Phi(x_n)\rangle)\,\Phi(x_n)\Phi(x_n)^\top$ has rank one
- New online Newton step without computing/inverting Hessians


Choice of support point for online Newton step
Two-stage procedure:
(1) Run n/2 iterations of averaged SGD to obtain $\tilde\theta$
(2) Run n/2 iterations of averaged constant-step-size LMS
- Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)
- Provable convergence rate of O(p/n) for logistic regression
- Additional assumptions but no strong convexity
Update at each iteration using the current averaged iterate:
- Recursion: $\theta_n = \theta_{n-1} - \gamma\big[f'_n(\bar\theta_{n-1}) + f''_n(\bar\theta_{n-1})(\theta_{n-1} - \bar\theta_{n-1})\big]$
- No provable convergence rate (yet) but best practical behavior
- Note (dis)similarity with regular SGD: $\theta_n = \theta_{n-1} - \gamma f'_n(\theta_{n-1})$
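A hedged sketch of the second variant (support point = current averaged iterate) for logistic regression; the rank-one Hessians keep each update at O(p) cost, and the constant step size $\gamma = 1/(4R^2)$ is an illustrative choice, not prescribed by the slide:

```python
import numpy as np

def online_newton_logistic(Phi, y):
    """Online Newton step with the current averaged iterate as support point:
    theta_n = theta_{n-1} - gamma [f'_n(bar) + f''_n(bar)(theta_{n-1} - bar)]."""
    n, p = Phi.shape
    R2 = np.max(np.sum(Phi ** 2, axis=1))   # R^2 bound on ||Phi(x_n)||^2
    gamma = 1.0 / (4.0 * R2)                # illustrative constant step size
    theta = np.zeros(p)
    bar = np.zeros(p)                       # running averaged iterate
    for k in range(n):
        x = Phi[k]
        s = 1.0 / (1.0 + np.exp(y[k] * (x @ bar)))   # sigmoid(-y <bar, x>)
        grad = -y[k] * s * x                          # f'_k at the support point
        curv = s * (1.0 - s)                          # l'' at the support point
        theta -= gamma * (grad + curv * (x @ (theta - bar)) * x)  # rank-one Hessian
        bar += (theta - bar) / (k + 1)
    return bar

rng = np.random.default_rng(2)
n, p = 50000, 20
Phi = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p) / np.sqrt(p)
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-(Phi @ theta_star))), 1.0, -1.0)
print("error:", np.linalg.norm(online_newton_logistic(Phi, y) - theta_star))
```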

Simulations - synthetic examples (Gaussian distributions, p = 20)
[Figure: two panels, synthetic logistic loss, $\log_{10}[f(\theta) - f(\theta_*)]$ vs. $\log_{10}(n)$. Left: constant step sizes $1/(2R^2)$, $1/(8R^2)$, $1/(32R^2)$ and decaying $1/(2R^2 n^{1/2})$. Right: online Newton variants (support point updated every 2p iterations, every iteration, 2-step, 2-step with doubling)]

Simulations - benchmarks: alpha (p = 500, n = 500 000) and news (p = 1 300 000, n = 20 000)
[Figure: four panels (alpha/news, logistic loss, C = 1 and C = opt), test performance $\log_{10}[f(\theta) - f(\theta_*)]$ vs. $\log_{10}(n)$; methods: constant step $C/R^2$, decaying $C/(R^2 n^{1/2})$, SAG, Adagrad, and the online Newton step]


Going beyond a single pass over the data
Stochastic approximation:
- Assumes an infinite data stream
- Observations are used only once
- Directly minimizes the testing cost $\mathbb{E}_{(x,y)}\,\ell(y, \langle \theta, \Phi(x)\rangle)$
Machine learning practice:
- Finite data set $(x_1, y_1, \dots, x_n, y_n)$
- Multiple passes
- Minimizes the training cost $\frac{1}{n}\sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i)\rangle)$
- Need to regularize (e.g., by the $\ell_2$-norm) to avoid overfitting
Goal: minimize $g(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta)$


Stochastic vs. deterministic methods
- Minimizing $g(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta)$ with $f_i(\theta) = \ell(y_i, \langle \theta, \Phi(x_i)\rangle) + \mu\,\Omega(\theta)$
- Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n}\sum_{i=1}^n f'_i(\theta_{t-1})$; linear (e.g., exponential) convergence rate in $O(e^{-\alpha t})$; iteration complexity is linear in n (with line search)
- Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t f'_{i(t)}(\theta_{t-1})$; sampling with replacement, i(t) a random element of $\{1, \dots, n\}$; convergence rate in O(1/t); iteration complexity is independent of n (step size selection?)


Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)
Stochastic average gradient (SAG) iteration:
- Keep in memory the gradients of all functions $f_i$, $i = 1, \dots, n$
- Random selection $i(t) \in \{1, \dots, n\}$ with replacement
- Iteration: $\theta_t = \theta_{t-1} - \frac{\gamma_t}{n}\sum_{i=1}^n y_i^t$, with $y_i^t = f'_i(\theta_{t-1})$ if $i = i(t)$ and $y_i^t = y_i^{t-1}$ otherwise
- Stochastic version of the incremental average gradient method (Blatt et al., 2008)
- Extra memory requirement: for supervised machine learning, if $f_i(\theta) = \ell_i(y_i, \langle \Phi(x_i), \theta\rangle)$, then $f'_i(\theta) = \ell'_i(y_i, \langle \Phi(x_i), \theta\rangle)\,\Phi(x_i)$, so only n real numbers need to be stored
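A minimal sketch of the SAG iteration for $\ell_2$-regularized logistic regression, storing one scalar $\ell'_i$ per example as noted above; the step size $\gamma = 1/(16L)$ anticipates the convergence analysis on the next slide (data handling and names are illustrative):

```python
import numpy as np

def sag_logistic(Phi, y, mu=1e-4, n_passes=20, seed=0):
    """SAG for g(theta) = (1/n) sum_i log(1 + exp(-y_i <theta, Phi(x_i)>))
    + (mu/2) ||theta||^2, keeping one scalar l'_i per example in memory."""
    n, p = Phi.shape
    L = 0.25 * np.max(np.sum(Phi ** 2, axis=1)) + mu   # smoothness constant
    gamma = 1.0 / (16.0 * L)                           # step size from the analysis
    theta = np.zeros(p)
    scalars = np.zeros(n)     # memorized l'_i; gradient f'_i = scalars[i] * Phi[i]
    grad_avg = np.zeros(p)    # running (1/n) sum_i of memorized gradients
    rng = np.random.default_rng(seed)
    for _ in range(n * n_passes):
        i = rng.integers(n)   # random selection with replacement
        new = -y[i] / (1.0 + np.exp(y[i] * (Phi[i] @ theta)))  # fresh l'_i
        grad_avg += (new - scalars[i]) * Phi[i] / n            # refresh one gradient
        scalars[i] = new
        theta -= gamma * (grad_avg + mu * theta)               # SAG step
    return theta
```

Handling the regularizer deterministically outside the gradient memory is a common variant; only the n stored scalars are needed to reconstruct the memorized gradients $f'_i(\theta) = \ell'_i \cdot \Phi(x_i)$.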


Stochastic average gradient - convergence analysis
Assumptions:
- Each $f_i$ is L-smooth, $i = 1, \dots, n$
- $g = \frac{1}{n}\sum_{i=1}^n f_i$ is µ-strongly convex (with potentially µ = 0)
- Constant step size $\gamma_t = 1/(16L)$
- Initialization with one pass of averaged SGD
Strongly convex case (Le Roux et al., 2012, 2013):
$\mathbb{E}\big[g(\theta_t) - g(\theta_*)\big] \le \Big(\frac{8\sigma^2}{n\mu} + \frac{4L\|\theta_0 - \theta_*\|^2}{n}\Big)\exp\Big(-t\,\min\Big\{\frac{1}{8n}, \frac{\mu}{16L}\Big\}\Big)$
- Linear (exponential) convergence rate with O(1) iteration cost
- After one pass, reduction of cost by $\exp\big(-\min\big\{\frac{1}{8}, \frac{n\mu}{16L}\big\}\big)$

Non-strongly convex case (Le Roux et al., 2013), same assumptions:
$\mathbb{E}\big[g(\theta_t) - g(\theta_*)\big] \le 48\,\frac{\sigma^2 + L\|\theta_0 - \theta_*\|^2}{\sqrt n}\cdot\frac{\sqrt n}{t}$
- Improvement over regular batch and stochastic gradient
- Adaptivity to potentially hidden strong convexity

spam dataset (n = 92 189, p = 823 470)

Conclusions
Constant-step-size averaged stochastic gradient descent:
- Reaches convergence rate O(1/n) in all regimes
- Improves on the $O(1/\sqrt n)$ lower bound of non-smooth problems
- Efficient online Newton step for non-quadratic problems
Going beyond a single pass through the data:
- Keep a memory of all gradients for finite training sets
- Randomization leads to easier analysis and faster rates
- Relationship with Shalev-Shwartz and Zhang (2012); Mairal (2013)
Extensions:
- Non-differentiable terms, kernels, line search, parallelization, etc.
- Beyond supervised learning, beyond convex problems

References

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249, 2012.

F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Technical Report 00804431, HAL, 2013.

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.

A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 2012.

D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29-51, 2008.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31-61, 1996.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report 00674995, HAL, 2013.

O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, 1995.

J. Mairal. Optimization with first-order surrogate functions. arXiv preprint arXiv:1305.3120, 2013.

A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.

H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.

D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Technical Report 1209.1873, arXiv, 2012.

A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.

A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.