Stochastic gradient descent and robustness to ill-conditioning


Francis Bach, INRIA - École Normale Supérieure, Paris, France. Joint work with Aymeric Dieuleveut, Nicolas Flammarion, and Éric Moulines. Venezia, May 2016.

Big data revolution? A new scientific context
- Data everywhere: size does not (always) matter
- Science and industry
- Size and variety
- Learning from examples: n observations in dimension p

Search engines - Advertising

Visual object recognition

Bioinformatics
- Proteins: crucial elements of cell life
- Massive data: 2 million for humans
- Complex data

Context: machine learning for big data
Large-scale machine learning: large p, large n
- p: dimension of each observation (input)
- n: number of observations
- Examples: computer vision, bioinformatics, advertising
- Ideal running-time complexity: O(pn)
Going back to simple methods
- Stochastic gradient methods (Robbins and Monro, 1951)
- Mixing statistics and optimization
- Using smoothness to go beyond stochastic gradient descent

Supervised machine learning
- Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n, i.i.d.
- Prediction as a linear function ⟨θ, Φ(x)⟩ of features Φ(x) ∈ R^p
- Explicit features adapted to inputs (can be learned as well)
- Using Hilbert spaces for non-linear / non-parametric estimation
- (Regularized) empirical risk minimization: find θ̂ solution of
  $$\min_{\theta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta)$$
  (convex data-fitting term + regularizer)

Usual losses
- Regression: y ∈ R, prediction ŷ = ⟨θ, Φ(x)⟩; quadratic loss $\frac{1}{2}(y - \hat y)^2 = \frac{1}{2}\big(y - \langle \theta, \Phi(x) \rangle\big)^2$
- Classification: y ∈ {−1, 1}, prediction ŷ = sign(⟨θ, Φ(x)⟩); loss of the form ℓ(y⟨θ, Φ(x)⟩)
- True 0-1 loss: ℓ(y⟨θ, Φ(x)⟩) = 1_{y⟨θ, Φ(x)⟩ < 0}
- Usual convex surrogates: hinge, square, logistic
[Figure: 0-1, hinge, square, and logistic losses plotted as functions of the margin y⟨θ, Φ(x)⟩]
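A minimal Python sketch of these margin-based losses; the exact scaling of the square loss on the margin is an assumption made for illustration.

```python
import numpy as np

# Margin-based losses: each takes the margin m = y * <theta, Phi(x)>.
def zero_one(m):  return (m < 0).astype(float)      # true 0-1 loss (non-convex)
def hinge(m):     return np.maximum(0.0, 1.0 - m)   # hinge (SVM)
def square(m):    return 0.5 * (1.0 - m) ** 2       # square loss on the margin
def logistic(m):  return np.log1p(np.exp(-m))       # logistic loss

margins = np.linspace(-3.0, 4.0, 8)
for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("square", square), ("logistic", logistic)]:
    print(name, np.round(loss(margins), 2))
```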

Supervised machine learning
- Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$ (training cost)
- Expected risk: $f(\theta) = \mathbb{E}_{(x,y)}\, \ell(y, \langle \theta, \Phi(x) \rangle)$ (testing cost)
- Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂
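To make the training/testing distinction concrete, a small sketch of regularized ERM for the quadratic loss with Ω(θ) = ½‖θ‖², which has a closed-form solution; the synthetic data and the value of µ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu, noise = 200, 10, 1e-2, 0.5
theta_star = rng.standard_normal(p)
X = rng.standard_normal((n, p))
y = X @ theta_star + noise * rng.standard_normal(n)
X_test = rng.standard_normal((10_000, p))
y_test = X_test @ theta_star + noise * rng.standard_normal(10_000)

# Regularized ERM: min_theta (1/n) sum 0.5*(y_i - x_i^T theta)^2 + 0.5*mu*||theta||^2.
# Setting the gradient to zero gives (X^T X / n + mu I) theta = X^T y / n.
theta_hat = np.linalg.solve(X.T @ X / n + mu * np.eye(p), X.T @ y / n)

train_cost = 0.5 * np.mean((y - X @ theta_hat) ** 2)             # empirical risk
test_cost = 0.5 * np.mean((y_test - X_test @ theta_hat) ** 2)    # expected risk (estimate)
print(f"training cost {train_cost:.3f}, testing cost {test_cost:.3f}")
```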

Smoothness and strong convexity
- A function g: R^p → R is L-smooth if and only if it is twice differentiable and, for all θ ∈ R^p, eigenvalues[g''(θ)] ≤ L
[Figure: a smooth function vs. a non-smooth function]
Machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$
- Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$
- Bounded data: ‖Φ(x)‖ ≤ R ⇒ L = O(R²)

Smoothness and strong convexity
- A twice differentiable function g: R^p → R is µ-strongly convex if and only if, for all θ ∈ R^p, eigenvalues[g''(θ)] ≥ µ
[Figure: a convex vs. a strongly convex function; well-conditioned (large µ/L) vs. ill-conditioned (small µ/L) level sets]
Machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$
- Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$
- Data with invertible covariance matrix (low correlation/dimension)
- Adding regularization by µ/2 ‖θ‖² creates additional bias unless µ is small
(A small numerical check of L and µ follows.)
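A quick numerical sketch with assumed synthetic data: for the quadratic loss, the smoothness and strong-convexity constants are the extreme eigenvalues of the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 20
# Anisotropic features make the problem ill-conditioned (small mu/L).
Phi = rng.standard_normal((n, p)) * rng.uniform(0.1, 1.0, p)

# For the quadratic loss, the Hessian is the covariance (1/n) sum Phi_i Phi_i^T.
H = Phi.T @ Phi / n
eigs = np.linalg.eigvalsh(H)
L, mu = eigs[-1], eigs[0]
R2 = np.max(np.sum(Phi ** 2, axis=1))   # max squared feature norm

# lambda_max(H) <= (1/n) sum ||Phi_i||^2 <= R^2, i.e., L = O(R^2).
print(f"L = {L:.3f} (<= R^2 = {R2:.3f}), mu = {mu:.4f}, L/mu = {L/mu:.1f}")
```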

Iterative methods for minimizing smooth functions
- Assumption: g convex and smooth on R^p
- Gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1})$
  - O(1/t) convergence rate for convex functions
  - $O(e^{-(\mu/L)t})$ convergence rate for strongly convex functions
- Newton method: $\theta_t = \theta_{t-1} - g''(\theta_{t-1})^{-1} g'(\theta_{t-1})$
  - $O(e^{-\rho 2^t})$ convergence rate
Key insights from Bottou and Bousquet (2008):
1. In machine learning, no need to optimize below statistical error
2. In machine learning, cost functions are averages
⇒ Stochastic approximation
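A minimal sketch of the two recursions on a logistic-regression objective; the synthetic data and the step size 1/L are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 5
X = rng.standard_normal((n, p))
y = np.sign(X @ rng.standard_normal(p) + 0.3 * rng.standard_normal(n))

def grad(theta):
    # Gradient of the average logistic loss; sigmoid(-m) = 0.5*(1 - tanh(m/2)).
    s = 0.5 * (1.0 - np.tanh(0.5 * y * (X @ theta)))
    return -(X.T @ (y * s)) / n

def hess(theta):
    # Hessian of the average logistic loss: per-sample weights sigma(z)*(1-sigma(z)).
    w = 0.25 * (1.0 - np.tanh(0.5 * (X @ theta)) ** 2)
    return (X.T * w) @ X / n

L = np.linalg.eigvalsh(X.T @ X / n).max() / 4   # logistic curvature is <= 1/4
theta_gd, theta_nt = np.zeros(p), np.zeros(p)
for t in range(20):
    theta_gd -= (1.0 / L) * grad(theta_gd)                        # gradient step
    theta_nt -= np.linalg.solve(hess(theta_nt), grad(theta_nt))   # Newton step
print("||g'|| after 20 iterations: GD %.2e, Newton %.2e"
      % (np.linalg.norm(grad(theta_gd)), np.linalg.norm(grad(theta_nt))))
```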

Stochastic approximation
- Goal: minimize a function f defined on R^p, given only unbiased estimates f'_n(θ_n) of its gradients f'(θ_n) at certain points θ_n ∈ R^p
Machine learning - statistics
- $f(\theta) = \mathbb{E} f_n(\theta) = \mathbb{E}\, \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$ = generalization error
- Loss for a single pair of observations: $f_n(\theta) = \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$
- Expected gradient: $f'(\theta) = \mathbb{E} f'_n(\theta) = \mathbb{E}\big\{ \ell'(y_n, \langle \theta, \Phi(x_n) \rangle)\, \Phi(x_n) \big\}$
- Beyond convex optimization: see, e.g., Benveniste et al. (2012)

Convex stochastic approximation
- Key assumption: smoothness and/or strong convexity
- Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro):
  $$\theta_n = \theta_{n-1} - \gamma_n f'_n(\theta_{n-1})$$
- Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n+1} \sum_{k=0}^n \theta_k$
- Which learning-rate sequence γ_n? Classical setting: γ_n = C n^{−α}
- Running time = O(np): a single pass through the data, one line of code among many (see the sketch below)
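A minimal sketch of this recursion with Polyak-Ruppert averaging on a toy least-squares stream; the data, the constant C, and α = 1/2 are assumptions, and C must be small enough relative to the feature scale for the first iterations to be stable.

```python
import numpy as np

def averaged_sgd(sgrad, theta0, n, C, alpha=0.5):
    """SGD theta_n = theta_{n-1} - gamma_n * f'_n(theta_{n-1}) with
    gamma_n = C * n**(-alpha), plus Polyak-Ruppert averaging."""
    theta, theta_bar = theta0.copy(), theta0.copy()
    for k in range(1, n + 1):
        theta -= C * k ** (-alpha) * sgrad(theta)
        theta_bar += (theta - theta_bar) / (k + 1)   # running average of iterates
    return theta, theta_bar

rng = np.random.default_rng(3)
p = 10
theta_star = np.ones(p)

def sgrad(theta):                      # unbiased gradient from one fresh sample
    x = rng.standard_normal(p)
    y = x @ theta_star + 0.5 * rng.standard_normal()
    return (x @ theta - y) * x

# Small C keeps gamma_n * ||x||^2 below the stability threshold early on.
theta, theta_bar = averaged_sgd(sgrad, np.zeros(p), n=100_000, C=0.05)
print("last-iterate error:", np.linalg.norm(theta - theta_star))
print("averaged error:    ", np.linalg.norm(theta_bar - theta_star))
```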

Convex stochastic approximation: existing analysis
- Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
  - Strongly convex: O((µn)^{−1}), attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}
  - Non-strongly convex: O(n^{−1/2}), attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}
- Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)
  - All step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems
- A single algorithm for smooth problems with global convergence rate O(1/n) in all situations?

Least-mean-square algorithm
- Least-squares: $f(\theta) = \frac{1}{2} \mathbb{E}\big[(y_n - \langle \Phi(x_n), \theta \rangle)^2\big]$ with θ ∈ R^p
- SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
  - usually studied without averaging and with decreasing step sizes
  - with the strong convexity assumption $\mathbb{E}[\Phi(x_n) \Phi(x_n)^\top] = H \succcurlyeq \mu\, \mathrm{Id}$
New analysis for averaging and constant step size γ = 1/(4R²)
- Assume ‖Φ(x_n)‖ ≤ R and |y_n − ⟨Φ(x_n), θ_*⟩| ≤ σ almost surely
- No assumption regarding the lowest eigenvalues of H
- Main result: $\mathbb{E} f(\bar\theta_n) - f(\theta_*) \le \frac{4\sigma^2 p}{n} + \frac{4 R^2 \|\theta_0 - \theta_*\|^2}{n}$
- Matches the statistical lower bound (Tsybakov, 2003)
- Non-asymptotic robust version of Györfi and Walk (1996)
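A minimal sketch of this constant-step-size averaged LMS recursion, on assumed synthetic data chosen to satisfy the slide's conditions (bounded features, residual at θ_* bounded by σ).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 200_000, 20, 0.5
theta_star = rng.standard_normal(p)

X = rng.uniform(-1, 1, (n, p))                       # bounded features, ||Phi(x)|| <= R
R2 = np.max(np.sum(X ** 2, axis=1))
y = X @ theta_star + sigma * rng.uniform(-1, 1, n)   # |y - <x, theta_*>| <= sigma

gamma = 1.0 / (4.0 * R2)                       # constant step size from the slide
theta, theta_bar = np.zeros(p), np.zeros(p)
for k in range(n):
    x = X[k]
    theta -= gamma * (x @ theta - y[k]) * x    # LMS recursion
    theta_bar += (theta - theta_bar) / (k + 2) # Polyak-Ruppert average

H = X.T @ X / n                                # empirical proxy for H = E[x x^T]
def excess_risk(t):                            # f(t) - f(theta_*) = 0.5 d^T H d
    d = t - theta_star
    return 0.5 * d @ H @ d

bound = 4 * sigma**2 * p / n + 4 * R2 * (theta_star @ theta_star) / n
print("excess risk (last iterate):", excess_risk(theta))
print("excess risk (averaged):    ", excess_risk(theta_bar))
print("bound 4*sigma^2*p/n + 4*R^2*||theta_0 - theta_*||^2/n:", bound)
```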

Markov chain interpretation of constant step sizes
- LMS recursion for $f_n(\theta) = \frac{1}{2}(y_n - \langle \Phi(x_n), \theta \rangle)^2$:
  $$\theta_n = \theta_{n-1} - \gamma \big( \langle \Phi(x_n), \theta_{n-1} \rangle - y_n \big)\, \Phi(x_n)$$
- The sequence (θ_n)_n is a homogeneous Markov chain
  - convergence to a stationary distribution π_γ with expectation $\bar\theta_\gamma \overset{\mathrm{def}}{=} \int \theta\, \pi_\gamma(\mathrm{d}\theta)$
  - for least-squares, θ̄_γ = θ_*
- θ_n does not converge to θ_* but oscillates around it, with oscillations of order √γ
- Ergodic theorem: averaged iterates converge to θ̄_γ = θ_* at rate O(1/n)
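A small numerical illustration of this picture, with assumed synthetic data (magnitudes are only indicative): the last iterate hovers at a distance roughly proportional to √γ, while the averaged iterate keeps improving.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 5
theta_star = np.ones(p)

def lms(gamma, n=200_000):
    """Constant-step LMS: return the typical distance of the last iterates
    to theta_* (the oscillation level) and the error of the average."""
    theta, theta_bar, tail = np.zeros(p), np.zeros(p), []
    for k in range(n):
        x = rng.uniform(-1, 1, p)
        y = x @ theta_star + 0.3 * rng.uniform(-1, 1)
        theta -= gamma * (x @ theta - y) * x
        theta_bar += (theta - theta_bar) / (k + 2)
        if k >= n - 1000:
            tail.append(np.linalg.norm(theta - theta_star))
    return np.mean(tail), np.linalg.norm(theta_bar - theta_star)

for gamma in [0.1, 0.025]:   # dividing gamma by 4 should roughly halve oscillation
    osc, avg = lms(gamma)
    print(f"gamma={gamma}: oscillation ~ {osc:.3f}, averaged error {avg:.5f}")
```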

Simulations - synthetic examples
- Gaussian distributions, p = 20
[Plot: log₁₀(f(θ̄_n) − f(θ_*)) versus log₁₀(n) for constant step sizes 1/(2R²), 1/(8R²), 1/(32R²), and the decaying step size 1/(2R²n^{1/2})]

Simulations - benchmarks
- alpha dataset (p = 500, n = 500,000) and news dataset (p = 1,300,000, n = 20,000)
[Plots: test excess risk log₁₀(f(θ) − f(θ_*)) versus log₁₀(n) for the square loss, with step sizes 1/R² and 1/(R²n^{1/2}) (left: C = 1; right: best-tuned constants C/R² and C/(R²n^{1/2})), compared with SAG]

Isn't least-squares regression a regression?
Least-squares regression:
- simpler to analyze and understand
- explicit relationship to bias/variance trade-offs (next slides)
Many important loss functions are not quadratic:
- beyond least-squares with online Newton steps
- complexity of O(p) per iteration, with rate O(p/n)
- see Bach and Moulines (2013) for details

Optimal bounds for least-squares?
- Least-squares: cannot beat σ²p/n (Tsybakov, 2003). Really?
Refined analysis (Défossez and Bach, 2015):
$$\mathbb{E} f(\bar\theta_n) - f(\theta_*) \lesssim \frac{\sigma^2 p}{n} + \frac{\|H^{-1/2}(\theta_0 - \theta_*)\|^2}{\gamma^2 n^2}$$
- In practice: the bias may be larger than the variance, so σ²p/n is pessimistic
Refined assumptions with adaptivity (Dieuleveut and Bach, 2014):
$$\mathbb{E} f(\bar\theta_n) - f(\theta_*) \lesssim \frac{\sigma^2\, \gamma^{1/\alpha}\, \mathrm{tr}\, H^{1/\alpha}}{n^{1 - 1/\alpha}} + \frac{\|H^{1/2 - r}(\theta_0 - \theta_*)\|^2}{\gamma^{2r}\, n^{2\min\{r,1\}}}$$
- SGD is adaptive to the decay of the covariance-matrix eigenvalues
- Leads to optimal rates for non-parametric regression

Achieving optimal bias and variance terms
Current results with averaged SGD (ill-conditioned problems):
- Variance (starting from the optimal θ_*) = σ²p/n
- Bias (no noise) = min{ R²‖θ_0 − θ_*‖²/n, R⁴⟨θ_0 − θ_*, H^{−1}(θ_0 − θ_*)⟩/n² }
Bias/variance trade-offs of the different methods:
- Averaged gradient descent (Bach and Moulines, 2013): bias R²‖θ_0 − θ_*‖²/n, variance σ²p/n
- Accelerated gradient descent (Nesterov, 1983): bias R²‖θ_0 − θ_*‖²/n², variance σ²p
- Between averaging and acceleration (Flammarion and Bach, 2015): bias R²‖θ_0 − θ_*‖²/n^{1+α}, variance σ²p/n^{1−α}
- Averaging and acceleration (Dieuleveut, Flammarion, and Bach, 2016): bias R²‖θ_0 − θ_*‖²/n², variance σ²p/n
Notes:
- Acceleration is notoriously non-robust to noise (d'Aspremont, 2008; Schmidt et al., 2011)
- For non-structured noise, see Lan (2012)

Conclusions
Constant-step-size averaged stochastic gradient descent:
- reaches the convergence rate O(1/n) in all regimes
- improves on the O(1/√n) lower bound for non-smooth problems
- efficient online Newton step for non-quadratic problems
- robustness to step-size selection and adaptivity
Extensions and future work:
- going beyond a single pass (Le Roux, Schmidt, and Bach, 2012; Defazio, Bach, and Lacoste-Julien, 2014)
- proximal extensions for non-differentiable terms
- kernels and non-parametric estimation (Dieuleveut and Bach, 2014)
- parallelization
- non-convex problems

References
A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249, 2012.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.
A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 2012.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
A. d'Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171-1183, 2008.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Adv. NIPS, pages 1646-1654, 2014.
A. Défossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal sampling distributions. 2015.
A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. Technical report, arXiv, 2014.
A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Technical Report 1602.05419, arXiv, 2016.
N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. arXiv preprint arXiv:1504.01577, 2015.
L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31-61, 1996.
G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2, Ser. A):365-397, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012.
O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, 1995.
A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543-547, 1983.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400-407, 1951.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Adv. NIPS, 2011.
A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.