Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm
Joint work with Nicolas Le Roux and Francis Bach
University of British Columbia

Context: Machine Learning for Big Data

Large-scale machine learning: large N, large P.
N: number of observations (inputs). P: dimension of each observation.
Regularized empirical risk minimization: find x solving

$$\min_{x \in \mathbb{R}^P} \ \frac{1}{N} \sum_{i=1}^N l(x^\top a_i) + \lambda r(x),$$

a data-fitting term plus a regularizer.

Applications to any data-oriented field: vision, bioinformatics, speech, natural language, web.
Main practical challenges:
Choosing the regularizer r and the data-fitting term l.
Designing/learning good features a_i.
Efficiently solving the problem when N or P is very large.

This talk: Big-N Problems

We want to minimize the sum of a finite set of smooth functions:

$$\min_{x \in \mathbb{R}^P} \ g(x) := \frac{1}{N} \sum_{i=1}^N f_i(x).$$

We are interested in cases where N is very large.
We will focus on strongly-convex functions g. The simplest example is $\ell_2$-regularized least squares,

$$f_i(x) := (a_i^\top x - b_i)^2 + \frac{\lambda}{2}\|x\|^2.$$

Other examples include any $\ell_2$-regularized convex loss: logistic regression, Huber regression, smooth SVMs, CRFs, etc.

Stochastic vs. Deterministic Gradient Methods

We consider minimizing $g(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$.

Deterministic gradient method [Cauchy, 1847]:

$$x^{t+1} = x^t - \alpha_t\, g'(x^t) = x^t - \frac{\alpha_t}{N} \sum_{i=1}^N f_i'(x^t).$$

Linear convergence rate: $O(\rho^t)$. Iteration cost is linear in N. Fancier methods exist, but are still O(N).

Stochastic gradient method [Robbins & Monro, 1951]: random selection of i(t) from {1, 2, ..., N},

$$x^{t+1} = x^t - \alpha_t\, f_{i(t)}'(x^t).$$

Iteration cost is independent of N. Sublinear convergence rate: O(1/t).
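
To make the cost contrast concrete, here is a minimal NumPy sketch (not from the talk; the data, loss, and step size are illustrative assumptions) of one deterministic step versus one stochastic step on a least-squares instance of g:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 20
A = rng.standard_normal((N, P))
b = rng.standard_normal(N)

def grad_fi(x, i):
    """Gradient of f_i(x) = (a_i^T x - b_i)^2."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

x = np.zeros(P)
alpha = 1e-3

# Deterministic (full) gradient step: touches all N examples, O(N) cost.
x_fg = x - alpha * np.mean([grad_fi(x, i) for i in range(N)], axis=0)

# Stochastic gradient step: touches one random example, O(1) cost in N.
i = rng.integers(N)
x_sg = x - alpha * grad_fi(x, i)
```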

Motivation for New Methods

FG method has O(N) cost with $O(\rho^t)$ rate.
SG method has O(1) cost with O(1/t) rate.
Goal = best of both worlds: O(1) iteration cost with a linear $O(\rho^t)$ rate.

[Figure: log(excess cost) vs. time for the stochastic and deterministic methods, with the hoped-for hybrid following the stochastic method early and the deterministic (linear) rate later.]

Prior Work on Speeding up SG Methods

A variety of methods have been proposed to speed up SG methods:
Step-size strategies, momentum, gradient/iterate averaging:
Polyak & Juditsky (1992), Tseng (1998), Kushner & Yin (2003), Nesterov (2009), Xiao (2010), Hazan & Kale (2011), Rakhlin et al. (2012).
Stochastic versions of accelerated and Newton-like methods:
Bordes et al. (2009), Sunehag et al. (2009), Ghadimi and Lan (2010), Martens (2010), Xiao (2010), Duchi et al. (2011).
None of these methods improve on the O(1/t) rate.

Prior Work on Speeding up SG Methods

Existing linear convergence results:
Constant step-size SG, accelerated SG:
Kesten (1958), Delyon and Juditsky (1993), Nedic and Bertsekas (2000).
Linear convergence, but only up to a fixed tolerance: $O(\rho^t) + O(\alpha)$.
Hybrid methods, incremental average gradient:
Bertsekas (1997), Blatt et al. (2007), Friedlander and Schmidt (2012).
Linear rate, but iterations make full passes through the data.
Special problem classes:
Collins et al. (2008), Strohmer & Vershynin (2009), Schmidt and Le Roux (2012), Shalev-Shwartz and Zhang (2012).
Linear rate, but limited choice for the $f_i$'s.

Stochastic Average Gradient

Assume only that each $f_i$ is convex, each $f_i'$ is L-Lipschitz continuous, and g is µ-strongly convex.
Is it possible to have an $O(\rho^t)$ rate with an O(1) cost? YES!

The stochastic average gradient (SAG) algorithm:
Randomly select i(t) from {1, 2, ..., N} and compute $f_{i(t)}'(x^t)$.

$$x^{t+1} = x^t - \frac{\alpha_t}{N} \sum_{i=1}^N y_i^t$$

Memory: $y_i^t = f_i'(x^t)$ from the last iteration t where i was selected.
This assumes that the gradients of the other examples don't change; the assumption becomes accurate as $\|x^{t+1} - x^t\| \to 0$.
SAG is a stochastic variant of the incremental aggregated gradient (IAG) method. [Blatt et al., 2007]
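
A minimal runnable sketch of the SAG update above, assuming an $\ell_2$-regularized least-squares instance; the random data, the crude Lipschitz bound, and the iteration budget are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 1000, 20, 0.1
A = rng.standard_normal((N, P))
b = rng.standard_normal(N)

def grad_fi(x, i):
    """Gradient of f_i(x) = (a_i^T x - b_i)^2 + (lam/2)*||x||^2."""
    return 2.0 * (A[i] @ x - b[i]) * A[i] + lam * x

L = 2 * np.max(np.sum(A**2, axis=1)) + lam  # crude bound on the Lipschitz constants
alpha = 1.0 / (16 * L)                      # the step size from the theory

x = np.zeros(P)
y = np.zeros((N, P))  # y_i = gradient of f_i at the last iterate where i was picked
d = np.zeros(P)       # running sum d = sum_i y_i

for t in range(50 * N):
    i = rng.integers(N)
    g = grad_fi(x, i)
    d += g - y[i]     # swap the old stored gradient for the new one in the sum
    y[i] = g
    x -= (alpha / N) * d
```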

Convergence Rate of SAG: Attempt 1

Proposition 1. With $\alpha_t = \frac{1}{2NL}$ the SAG iterations satisfy

$$\mathbb{E}[g(x^t) - g(x^*)] \le \left(1 - \frac{\mu}{8LN}\right)^t C.$$

Convergence rate of $O(\rho^t)$ with cost of O(1). Is this useful?!?
This rate is very slow: performance is similar to the cyclic method.

Convergence Rate of SAG: Attempt 2

Proposition 2. With $\alpha_t \in \left[\frac{1}{2N\mu}, \frac{1}{16L}\right]$ and $N \ge \frac{8L}{\mu}$, the SAG iterations satisfy

$$\mathbb{E}[g(x^t) - g(x^*)] \le \left(1 - \frac{1}{8N}\right)^t C.$$

Much bigger step sizes: since $\mu \ll L$ and $L \ll NL$, these steps are far larger than $\frac{1}{2NL}$ (and large enough to make the cyclic algorithm diverge).
Gives a constant, non-trivial reduction per pass:

$$\left(1 - \frac{1}{8N}\right)^N \le \exp\left(-\frac{1}{8}\right) = 0.8825.$$

The requirement $N \ge O\!\left(\frac{L}{\mu}\right)$ has been called the big-data condition.

Convergence Rate of SAG: Attempt 3

Theorem. With $\alpha_t = \frac{1}{16L}$ the SAG iterations satisfy

$$\mathbb{E}[g(x^t) - g(x^*)] \le \left(1 - \min\left\{\frac{\mu}{16L}, \frac{1}{8N}\right\}\right)^t C.$$

This rate is very fast:
Well-conditioned problems: constant, non-trivial reduction per pass.
Badly-conditioned problems: almost the same as the deterministic method, which achieves $g(x^t) - g(x^*) \le \left(1 - \frac{\mu}{L}\right)^{2t} C$ with $\alpha_t = \frac{1}{L}$, but each SAG iteration is N times cheaper.
We still get a linear rate for any $\alpha_t \le \frac{1}{16L}$.

Rate of Convergence Comparison

Assume that N = 700000, L = 0.25, µ = 1/N:
Gradient method has rate $\left(\frac{L-\mu}{L+\mu}\right)^2 = 0.99998$.
Accelerated gradient method has rate $1 - \sqrt{\frac{\mu}{L}} = 0.99761$.
SAG (N iterations) has rate $\left(1 - \min\left\{\frac{\mu}{16L}, \frac{1}{8N}\right\}\right)^N = 0.88250$.
Fastest possible first-order method: $\left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2 = 0.99048$.

SAG beats two lower bounds:
Stochastic gradient bound (of O(1/t)).
Deterministic gradient bound (for typical L, µ, and N).

Number of $f_i'$ evaluations to reach accuracy $\epsilon$:
Stochastic: $O\!\left(\frac{L}{\mu} \cdot \frac{1}{\epsilon}\right)$.
Gradient: $O\!\left(N \frac{L}{\mu} \log \frac{1}{\epsilon}\right)$.
Accelerated: $O\!\left(N \sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon}\right)$.
SAG: $O\!\left(\max\left\{N, \frac{L}{\mu}\right\} \log \frac{1}{\epsilon}\right)$.
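
These constants can be checked directly; a quick sketch (assuming only the formulas above):

```python
# Rate constants for N = 700000, L = 0.25, mu = 1/N.
N, L = 700000, 0.25
mu = 1.0 / N

gradient    = ((L - mu) / (L + mu)) ** 2                      # ~0.99998
accelerated = 1 - (mu / L) ** 0.5                             # ~0.99761
sag_N_iters = (1 - min(mu / (16 * L), 1 / (8 * N))) ** N      # ~0.88250
lower_bound = ((L**0.5 - mu**0.5) / (L**0.5 + mu**0.5)) ** 2  # ~0.99048
print(gradient, accelerated, sag_N_iters, lower_bound)
```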

Proof Technique: Lyapunov Function

We define a Lyapunov function of the form

$$\mathcal{L}(\theta^t) = 2h\,[g(x^t + d\, e^\top y^t) - g(x^*)] + (\theta^t - \theta^*)^\top \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} (\theta^t - \theta^*),$$

with

$$\theta^t = \begin{bmatrix} y_1^t \\ \vdots \\ y_N^t \\ x^t \end{bmatrix}, \quad \theta^* = \begin{bmatrix} f_1'(x^*) \\ \vdots \\ f_N'(x^*) \\ x^* \end{bmatrix}, \quad e = \begin{bmatrix} I \\ \vdots \\ I \end{bmatrix}, \quad A = a_1 ee^\top + a_2 I, \quad B = be, \quad C = cI.$$

The proof involves finding $\{\alpha, a_1, a_2, b, c, d, h, \delta, \gamma\}$ such that

$$\mathbb{E}[\mathcal{L}(\theta^t) \mid \mathcal{F}^{t-1}] \le (1-\delta)\,\mathcal{L}(\theta^{t-1}), \qquad \mathcal{L}(\theta^t) \ge \gamma\,[g(x^t) - g(x^*)].$$

Applying this recursively, the initial Lyapunov function gives the constant C.

Constants in Convergence Rate

What are the constants?
If we initialize with $y_i^0 = 0$ we have

$$C = [g(x^0) - g(x^*)] + \frac{4L}{N}\|x^0 - x^*\|^2 + \frac{\sigma^2}{16L}.$$

If we initialize with $y_i^0 = f_i'(x^0) - g'(x^0)$ we have

$$C = \frac{3}{2}[g(x^0) - g(x^*)] + \frac{4L}{N}\|x^0 - x^*\|^2.$$

If we initialize with N stochastic gradient iterations, $[g(x^0) - g(x^*)] = O(1/N)$ and $\|x^0 - x^*\|^2 = O(1/N)$.

Convergence Rate in Convex Case

Assume only that each $f_i$ is convex, each $f_i'$ is L-Lipschitz continuous, and some minimizer $x^*$ exists.
Theorem. With $\alpha_t \le \frac{1}{16L}$ the SAG iterations satisfy

$$\mathbb{E}[g(x^t) - g(x^*)] = O(1/t).$$

Faster than the SG lower bound of $O(1/\sqrt{t})$.
Same algorithm and step size as the strongly-convex case:
The algorithm is adaptive to strong convexity.
Faster convergence rate if µ is locally bigger around $x^*$.
The same algorithm could be used in the non-convex case.
Contrast with stochastic dual coordinate ascent:
Requires an explicit strongly-convex regularizer.
Not adaptive to µ; does not allow µ = 0.

Comparing FG and SG Methods

[Figure: objective minus optimum vs. effective passes on quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236), comparing SG, ASG, AFG, IAG, and L-BFGS.]
Comparison of competitive deterministic and stochastic methods.

SAG Compared to FG and SG Methods

[Figure: objective minus optimum vs. effective passes on quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236), adding SAG-LS to SG, ASG, AFG, IAG, and L-BFGS.]
SAG starts fast and stays fast.

SAG Compared to Coordinate-Based Methods

[Figure: objective minus optimum vs. effective passes on quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236), comparing SAG-LS with DCA, PCD-L, ASG, and L-BFGS.]
PCD/DCA are similar on some problems, much worse on others.

SAG Implementation Issues

The basic algorithm:

while(1):
    Sample i from {1, 2, ..., N}.
    Compute f_i'(x).
    d = d - y_i + f_i'(x).
    y_i = f_i'(x).
    x = x - (α/N) d.

Issues:
Should we normalize by N?
Can we reduce the memory?
Can we handle sparse data?
How should we set the step size?
When should we stop?
Can we use mini-batches?
Can we handle constraints or non-smooth problems?
Should we shuffle the data?

Implementation Issues: Normalization

Should we normalize by N in the early iterations? The parameter update becomes

$$x = x - \frac{\alpha}{M}\, d,$$

where we normalize by the number of examples seen so far (M).
This gives better performance in the early iterations; it is similar to doing one pass of SG.

Implementation Issues: Memory Requirements

Can we reduce the memory? The memory update for losses of the form $f_i(a_i^\top x)$:
Compute $f_i'(\delta)$, with $\delta = a_i^\top x$.
$d = d + a_i\,(f_i'(\delta) - y_i)$.
$y_i = f_i'(\delta)$.
This uses that $\nabla f_i(a_i^\top x) = a_i\, f_i'(\delta)$, so we only store the scalars $f_i'(\delta)$.
Reduces the memory from O(NP) to O(N).
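
A sketch of this O(N)-memory variant; the logistic loss $f_i(\delta) = \log(1 + \exp(-b_i \delta))$, the data, and the step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 20
A = rng.standard_normal((N, P))
b = rng.choice([-1.0, 1.0], size=N)

def loss_deriv(delta, i):
    """Scalar derivative of f_i(delta) = log(1 + exp(-b_i * delta))."""
    return -b[i] / (1.0 + np.exp(b[i] * delta))

x = np.zeros(P)
y = np.zeros(N)   # scalars f_i'(delta): O(N) memory instead of O(NP)
d = np.zeros(P)
alpha = 1e-2      # illustrative step size

for t in range(10 * N):
    i = rng.integers(N)
    delta = A[i] @ x
    g = loss_deriv(delta, i)
    d += A[i] * (g - y[i])   # uses grad f_i(a_i^T x) = a_i * f_i'(delta)
    y[i] = g
    x -= (alpha / N) * d
```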

Implementation Issues: Memory Requirements

Can we reduce the memory in general? We can re-write the SAG iteration as

$$x^{t+1} = x^t - \frac{\alpha_t}{N}\Big[f_i'(x^t) - f_i'(x^{[i]}) + \sum_{j=1}^N f_j'(x^{[j]})\Big],$$

where $x^{[j]}$ is the last iterate at which j was selected. The SVRG/MixedGrad methods ("memory-free" methods) use

$$x^{t+1} = x^t - \alpha\Big[f_i'(x^t) - f_i'(\tilde{x}) + \frac{1}{N}\sum_{j=1}^N f_j'(\tilde{x})\Big],$$

where we occasionally update $\tilde{x}$. [Mahdavi et al., 2013, Johnson and Zhang, 2013, Zhang et al., 2013, Konecny and Richtarik, 2013, Xiao and Zhang, 2014]
Gets rid of the memory, in exchange for two $f_i'$ evaluations per iteration.
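
A minimal SVRG sketch under the same illustrative least-squares setup as before (snapshot schedule, step size, and epoch count are assumptions): the full gradient at the snapshot $\tilde{x}$ replaces the stored table of $y_i$'s.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 1000, 20, 0.1
A = rng.standard_normal((N, P))
b = rng.standard_normal(N)

def grad_fi(x, i):
    return 2.0 * (A[i] @ x - b[i]) * A[i] + lam * x

def full_grad(x):
    return 2.0 * A.T @ (A @ x - b) / N + lam * x

x = np.zeros(P)
alpha = 1e-4  # illustrative step size

for epoch in range(30):
    x_tilde = x.copy()
    mu_tilde = full_grad(x_tilde)   # occasional full pass at the snapshot
    for t in range(N):              # inner loop: two f_i' evaluations per step
        i = rng.integers(N)
        x -= alpha * (grad_fi(x, i) - grad_fi(x_tilde, i) + mu_tilde)
```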

Implementation Issues: Sparsity

Can we handle sparse data? The parameter update for each variable j is

$$x_j = x_j - \frac{k\alpha}{M}\, d_j.$$

For sparse data, $d_j$ is typically constant between the times variable j is touched, so we can apply the previous k updates lazily, in one step, when $d_j$ changes (see the sketch after this slide).
Reduces the iteration cost from O(P) to $O(\|\nabla f_i(x)\|_0)$.
Standard tricks allow $\ell_2$-regularization and $\ell_1$-regularization.
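
A sketch of the just-in-time update, under simplifying assumptions (fixed normalization by N rather than M, constant step size): coordinate j is only brought up to date when the sampled example touches it, applying the k = t - last[j] skipped updates at once. The helper names here are hypothetical.

```python
def sag_sparse_step(x, d, y, last, t, i, idx, val, loss_deriv, alpha, N):
    """One lazy SAG step for f_i(a_i^T x); (idx, val) is the sparse a_i."""
    # Catch up the touched coordinates: d_j was constant over the skipped steps.
    for j in idx:
        x[j] -= (t - last[j]) * (alpha / N) * d[j]
        last[j] = t
    delta = sum(v * x[j] for j, v in zip(idx, val))  # a_i^T x on the support only
    g = loss_deriv(delta, i)
    for j, v in zip(idx, val):
        d[j] += v * (g - y[i])                       # update the running sum
    y[i] = g
    # x itself stays stale off the support: the next time a coordinate is
    # touched (or at termination), its skipped steps are applied in one shot.
```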

Implementation Issues: Step-Size

How should we set the step size?
Theory: α = 1/16L. Practice: α = 1/L.
What if L is unknown, or smaller near $x^*$?
Start with a small L.
Increase L until we satisfy

$$f_i\Big(x - \tfrac{1}{L} f_i'(x)\Big) \le f_i(x) - \frac{1}{2L}\|f_i'(x)\|^2$$

(assuming $\|f_i'(x)\|^2 \ge \epsilon$).
Decrease L between iterations.
(This is the Lipschitz approximation procedure from FISTA.)
For $f_i(a_i^\top x)$, this costs O(1) in N and P:

$$f_i\Big(a_i^\top x - \frac{\|a_i\|^2}{L} f_i'(\delta)\Big).$$
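
A sketch of this backtracking procedure; the doubling factor and the $2^{-1/N}$ decay between iterations are illustrative choices, not prescribed by the talk:

```python
def estimate_L(f_i, grad_i, x, L, N, eps=1e-10):
    """Update the Lipschitz estimate L using one sampled function f_i."""
    g = grad_i(x)
    g2 = g @ g
    if g2 >= eps:                  # only test when the gradient is non-negligible
        # Increase L until the instantiated descent condition holds.
        while f_i(x - g / L) > f_i(x) - g2 / (2 * L):
            L *= 2.0
    return L * 2.0 ** (-1.0 / N)   # slow decay, so L can track a smaller local constant
```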

Implementation Issues: Termination Criterion

When should we stop? Normally we check the size of $g'(x)$. SAG has $y_i \approx f_i'(x)$, so we can check the size of

$$\frac{1}{N}\, d = \frac{1}{N}\sum_{i=1}^N y_i \approx g'(x).$$

SAG Implementation Issues: Mini-Batches

Can we use mini-batches? Yes: define each $f_i$ to include more than one example.
Reduces the memory requirements.
Allows parallelization.
But we must decrease L for good performance; the mini-batch constant satisfies

$$L_B \le \frac{1}{|B|}\sum_{i \in B} L_i \le \max_{i \in B}\{L_i\}.$$

In practice, run the Lipschitz approximation procedure on the mini-batch to determine $L_B$.
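
A small sketch of the bound (the per-example constants and the batch are illustrative): the batch gradient is an average, so its constant is at most the mean of the $L_i$, which permits a proportionally larger step.

```python
import numpy as np

L_i = np.array([4.0, 1.0, 0.5, 8.0])  # illustrative per-example Lipschitz constants
B = [0, 2]                            # one mini-batch
L_B_bound = L_i[B].mean()             # valid upper bound on L_B
alpha = 1.0 / L_B_bound               # larger step than the 1/max(L_i[B]) fallback
```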

Implementation Issues: Constraints/Non-Smooth

For composite problems with a non-smooth regularizer r,

$$\min_x \ \frac{1}{N}\sum_{i=1}^N f_i(x) + r(x).$$

We can do a proximal-gradient variant when r is simple:

$$x = \operatorname{prox}_{\alpha r(\cdot)}\Big[x - \frac{\alpha}{N}\, d\Big].$$

[Mairal, 2013, 2014, Xiao and Zhang, 2014, Defazio et al., 2014]
E.g., with $r(x) = \lambda\|x\|_1$: stochastic iterative soft-thresholding (see the sketch below).
Same convergence rate as in the smooth case.
ADMM variants exist when r is simple after a linear composition, $r(Ax)$. [Wong et al., 2013]
If the $f_i$ are non-smooth, we could smooth them or use dual methods. [Nesterov, 2005, Lacoste-Julien et al., 2013, Shalev-Shwartz and Zhang, 2013]
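
A sketch of the proximal step for $r(x) = \lambda\|x\|_1$: the usual SAG step followed by soft-thresholding (the prox of the $\ell_1$ norm). Here `d` is the running gradient sum from the SAG loop; the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau*||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_sag_step(x, d, alpha, N, lam):
    """One proximal-SAG update: gradient step, then the l1 prox."""
    return soft_threshold(x - (alpha / N) * d, alpha * lam)
```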

SAG Implementation Issues: Non-Uniform Sampling

Does re-shuffling and doing full passes work better? NO!
Performance is intermediate between IAG and SAG.
Can non-uniform sampling help? Bias the sampling towards the Lipschitz constants $L_i$.
Justification: duplicate each example in proportion to $L_i$:

$$\frac{1}{N}\sum_{i=1}^N f_i(x) = \frac{1}{N L_{\text{mean}}}\sum_{i=1}^N \sum_{j=1}^{L_i} \frac{L_{\text{mean}}}{L_i}\, f_i(x),$$

so each duplicated copy has Lipschitz constant $L_{\text{mean}}$, and the convergence rate depends on $L_{\text{mean}}$ instead of $L_{\max}$.
Combine with the line-search for adaptive sampling (see the paper/code for details, and the sketch below).
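
A sketch of the biased draw only (the constants are illustrative): sample i with probability proportional to $L_i$. The full method in the paper mixes this with uniform sampling and estimates the $L_i$ adaptively via the line-search.

```python
import numpy as np

rng = np.random.default_rng(0)
L_i = np.array([4.0, 1.0, 0.5, 8.0])  # illustrative per-example Lipschitz constants
p = L_i / L_i.sum()                   # sampling probabilities proportional to L_i
i = rng.choice(len(L_i), p=p)         # biased draw used in place of uniform sampling
```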

SAG with Non-Uniform Sampling

[Figure: objective minus optimum vs. effective passes on protein (n = 145751, p = 74) and sido (n = 12678, p = 4932), comparing AGD, L-BFGS, SG-C, ASG-C, IAG, PCD-L, DCA, and SAG.]
Datasets where SAG had the worst relative performance.

SAG with Non-Uniform Sampling

[Figure: the same protein (n = 145751, p = 74) and sido (n = 12678, p = 4932) datasets, adding SAG-LS (Lipschitz) to the comparison.]
Lipschitz sampling helps a lot.

Newton-like Methods

Can we use a scaling matrix H?

$$x = x - \frac{\alpha}{N}\, H^{-1} d.$$

This didn't help in my experiments with a diagonal H, and a non-diagonal H loses sparsity.
A quasi-Newton variant has been proposed with empirically faster convergence, but much overhead. [Sohl-Dickstein et al., 2014]

Conclusion and Discussion

Faster theoretical convergence using only the sum structure.
Simple algorithm, empirically better than the theory predicts.
Black-box stochastic gradient algorithm:
Adaptivity to problem difficulty, line-search, termination criterion.
Recent/current developments:
Memory-free variants.
Non-smooth variants.
Improved constants.
Relaxed strong-convexity.
Non-uniform sampling.
Quasi-Newton variants.
Future developments:
Non-convex analysis.
Parallel/distributed methods.
Thank you for the invitation.