Stochastic Optimization Algorithms Beyond SG


Stochastic Optimization Algorithms Beyond SG
Frank E. Curtis (1), Lehigh University
involving joint work with Léon Bottou, Facebook AI Research, and Jorge Nocedal, Northwestern University
Optimization Methods for Large-Scale Machine Learning, http://arxiv.org/abs/1606.04838
Google NYC, 3 November 2016
(1) Mike's older brother
Stochastic Optimization Algorithms Beyond SG 1 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 2 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 3 of 42

Stochastic optimization
Over a parameter vector w ∈ R^d and given l(·; y) (loss w.r.t. true label y) and h(x; w) (prediction w.r.t. features x), consider the unconstrained optimization problem
  min_{w ∈ R^d} f(w), where f(w) = E_{(x,y)}[l(h(w; x), y)].
Stochastic Optimization Algorithms Beyond SG 4 of 42

Stochastic optimization
Over a parameter vector w ∈ R^d and given l(·; y) (loss w.r.t. true label y) and h(x; w) (prediction w.r.t. features x), consider the unconstrained optimization problem
  min_{w ∈ R^d} f(w), where f(w) = E_{(x,y)}[l(h(w; x), y)].
Given a training set {(x_i, y_i)}_{i=1}^n, the approximate problem is given by
  min_{w ∈ R^d} f_n(w), where f_n(w) = (1/n) Σ_{i=1}^n l(h(x_i; w), y_i).
Stochastic Optimization Algorithms Beyond SG 4 of 42

Stochastic optimization
Over a parameter vector w ∈ R^d and given l(·; y) (loss w.r.t. true label y) and h(x; w) (prediction w.r.t. features x), consider the unconstrained optimization problem
  min_{w ∈ R^d} f(w), where f(w) = E_{(x,y)}[l(h(w; x), y)].
Given a training set {(x_i, y_i)}_{i=1}^n, the approximate problem is given by
  min_{w ∈ R^d} f_n(w), where f_n(w) = (1/n) Σ_{i=1}^n l(h(x_i; w), y_i).
For this talk, let's assume
  f is continuously differentiable, bounded below, and potentially nonconvex;
  ∇f is L-Lipschitz continuous, i.e., ||∇f(w) - ∇f(w̄)||_2 ≤ L ||w - w̄||_2.
Focus on optimization algorithms, not data fitting issues, regularization, etc.
Stochastic Optimization Algorithms Beyond SG 4 of 42
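To make the finite-sum objective concrete, here is a minimal Python sketch of f_n and its gradient for a binary logistic-regression loss; the choice of loss, the function names, and the use of NumPy are illustrative assumptions, not part of the talk.

import numpy as np

def empirical_risk_and_grad(w, X, y):
    # f_n(w) = (1/n) sum_i l(h(x_i; w), y_i), with h(x; w) = w^T x and
    # l(z, y) = log(1 + exp(-y z)) for labels y in {-1, +1} (assumed loss)
    z = X @ w                               # predictions h(x_i; w)
    fn = np.mean(np.log1p(np.exp(-y * z)))  # empirical risk f_n(w)
    coeff = -y / (1.0 + np.exp(y * z))      # dl/dz for each example
    grad = X.T @ coeff / len(y)             # grad f_n(w)
    return fn, grad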

Gradient descent
Algorithm GD : Gradient Descent
1: choose an initial point w_0 ∈ R^n and stepsize α > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α ∇f(w_k)
4: end for
[Figure: f near the current iterate w_k]
Stochastic Optimization Algorithms Beyond SG 5 of 42

Gradient descent
Algorithm GD : Gradient Descent
1: choose an initial point w_0 ∈ R^n and stepsize α > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α ∇f(w_k)
4: end for
[Figure: at w_k, what does f(w) look like away from the iterate?]
Stochastic Optimization Algorithms Beyond SG 5 of 42

Gradient descent
Algorithm GD : Gradient Descent
1: choose an initial point w_0 ∈ R^n and stepsize α > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α ∇f(w_k)
4: end for
[Figure: upper quadratic model at w_k: f(w_k) + ∇f(w_k)^T (w - w_k) + (1/2) L ||w - w_k||_2^2]
Stochastic Optimization Algorithms Beyond SG 5 of 42

Gradient descent
Algorithm GD : Gradient Descent
1: choose an initial point w_0 ∈ R^n and stepsize α > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α ∇f(w_k)
4: end for
[Figure: f bracketed at w_k by the quadratic models
  f(w_k) + ∇f(w_k)^T (w - w_k) + (1/2) L ||w - w_k||_2^2 (above) and
  f(w_k) + ∇f(w_k)^T (w - w_k) + (1/2) c ||w - w_k||_2^2 (below)]
Stochastic Optimization Algorithms Beyond SG 5 of 42
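Algorithm GD maps directly onto a few lines of code. The following Python sketch is illustrative (the function names and the quadratic test problem are assumptions); it uses a fixed stepsize α ≤ 1/L, matching the assumption in Theorem GD below.

import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters=100):
    # Fixed-stepsize GD: w_{k+1} <- w_k - alpha * grad f(w_k)
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Example: f(w) = 0.5 w^T A w, so grad f(w) = A w and L = largest eigenvalue of A.
A = np.diag([1.0, 10.0])
w_final = gradient_descent(lambda w: A @ w, w0=[5.0, 5.0], alpha=1.0 / 10.0)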

GD theory
Theorem GD: If α ∈ (0, 1/L], then Σ_{k=0}^∞ ||∇f(w_k)||_2^2 < ∞, which implies {||∇f(w_k)||_2} → 0.
Proof.
  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} - w_k) + (1/2) L ||w_{k+1} - w_k||_2^2
            ≤ f(w_k) - (1/2) α ||∇f(w_k)||_2^2
Stochastic Optimization Algorithms Beyond SG 6 of 42

GD theory
Theorem GD: If α ∈ (0, 1/L], then Σ_{k=0}^∞ ||∇f(w_k)||_2^2 < ∞, which implies {||∇f(w_k)||_2} → 0.
If, in addition, f is c-strongly convex, then for all k ≥ 1:
  f(w_k) - f_* ≤ (1 - αc)^k (f(w_0) - f_*).
Proof.
  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} - w_k) + (1/2) L ||w_{k+1} - w_k||_2^2
            ≤ f(w_k) - (1/2) α ||∇f(w_k)||_2^2
            ≤ f(w_k) - αc (f(w_k) - f_*)
  ⟹ f(w_{k+1}) - f_* ≤ (1 - αc)(f(w_k) - f_*).
Stochastic Optimization Algorithms Beyond SG 6 of 42

GD illustration Figure: GD with fixed stepsize Stochastic Optimization Algorithms Beyond SG 7 of 42

Stochastic gradient descent
Approximate gradient only; e.g., random i_k and ∇_w l(h(w; x_{i_k}), y_{i_k}) ≈ ∇f(w).
Algorithm SG : Stochastic Gradient
1: choose an initial point w_0 ∈ R^n and stepsizes {α_k} > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α_k g_k, where g_k ≈ ∇f(w_k)
4: end for
Stochastic Optimization Algorithms Beyond SG 8 of 42

Stochastic gradient descent
Approximate gradient only; e.g., random i_k and ∇_w l(h(w; x_{i_k}), y_{i_k}) ≈ ∇f(w).
Algorithm SG : Stochastic Gradient
1: choose an initial point w_0 ∈ R^n and stepsizes {α_k} > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α_k g_k, where g_k ≈ ∇f(w_k)
4: end for
Not a descent method!... but can guarantee eventual descent in expectation (with E_k[g_k] = ∇f(w_k)):
  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} - w_k) + (1/2) L ||w_{k+1} - w_k||_2^2
            = f(w_k) - α_k ∇f(w_k)^T g_k + (1/2) α_k^2 L ||g_k||_2^2
  ⟹ E_k[f(w_{k+1})] ≤ f(w_k) - α_k ||∇f(w_k)||_2^2 + (1/2) α_k^2 L E_k[||g_k||_2^2].
Markov process: w_{k+1} depends only on w_k and the random choice at iteration k.
Stochastic Optimization Algorithms Beyond SG 8 of 42
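Algorithm SG differs from GD only in that each step uses a sampled gradient estimate. A minimal Python sketch follows; the helper grad_fi, which returns the gradient of the i-th loss term, and the stepsize rule are assumptions for illustration.

import numpy as np

def stochastic_gradient(grad_fi, n, w0, stepsize, num_iters=1000, seed=0):
    # SG: w_{k+1} <- w_k - alpha_k g_k, with g_k the gradient of one sampled term,
    # so that E_k[g_k] = (1/n) sum_i grad_fi(w_k, i) = grad f_n(w_k) (unbiased).
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for k in range(num_iters):
        i = rng.integers(n)                 # uniformly sampled index
        w = w - stepsize(k) * grad_fi(w, i)
    return w

# Diminishing stepsizes alpha_k = O(1/k), as analyzed in Theorem SG below:
# w = stochastic_gradient(grad_fi, n, w0, stepsize=lambda k: 1.0 / (k + 10))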

SG theory
Theorem SG: If E_k[||g_k||_2^2] ≤ M + ||∇f(w_k)||_2^2, then
  α_k = 1/L     ⟹  E[(1/k) Σ_{j=1}^k ||∇f(w_j)||_2^2] ≲ M
  α_k = O(1/k)  ⟹  E[Σ_{j=1}^k α_j ||∇f(w_j)||_2^2] < ∞.
(*Assumed unbiased gradient estimates; see paper for more generality.)
Stochastic Optimization Algorithms Beyond SG 9 of 42

SG theory
Theorem SG: If E_k[||g_k||_2^2] ≤ M + ||∇f(w_k)||_2^2, then
  α_k = 1/L     ⟹  E[(1/k) Σ_{j=1}^k ||∇f(w_j)||_2^2] ≲ M
  α_k = O(1/k)  ⟹  E[Σ_{j=1}^k α_j ||∇f(w_j)||_2^2] < ∞.
If, in addition, f is c-strongly convex, then
  α_k = 1/L     ⟹  E[f(w_k) - f_*] ⇝ (M/c)/2
  α_k = O(1/k)  ⟹  E[f(w_k) - f_*] = O((L/c)(M/c)/k).
(*Assumed unbiased gradient estimates; see paper for more generality.)
Stochastic Optimization Algorithms Beyond SG 9 of 42

SG illustration Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right) Stochastic Optimization Algorithms Beyond SG 10 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 11 of 42

Why SG over GD for large-scale machine learning?
We have seen:
  GD: E[f_n(w_k) - f_{n,*}] = O(ρ^k)   (linear convergence)
  SG: E[f_n(w_k) - f_{n,*}] = O(1/k)   (sublinear convergence)
So why SG?
Stochastic Optimization Algorithms Beyond SG 12 of 42

Why SG over GD for large-scale machine learning?
We have seen:
  GD: E[f_n(w_k) - f_{n,*}] = O(ρ^k)   (linear convergence)
  SG: E[f_n(w_k) - f_{n,*}] = O(1/k)   (sublinear convergence)
So why SG?
  Motivation    Explanation
  Intuitive     data redundancy
  Practical     SG vs. L-BFGS with batch gradient (below)
  Theoretical   E[f_n(w_k) - f_{n,*}] = O(1/k) and E[f(w_k) - f_*] = O(1/k)
[Figure: empirical risk vs. accessed data points (x 10^5), comparing SGD and batch L-BFGS]
Stochastic Optimization Algorithms Beyond SG 12 of 42

Work complexity
Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
  Convergence rate                       Cost per iteration   Cost for ε-optimality
  GD: E[f_n(w_k) - f_{n,*}] = O(ρ^k)     O(n)                 n log(1/ε)
  SG: E[f_n(w_k) - f_{n,*}] = O(1/k)     O(1)                 1/ε
Stochastic Optimization Algorithms Beyond SG 13 of 42

Work complexity
Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010).
  Convergence rate                       Cost per iteration   Cost for ε-optimality
  GD: E[f_n(w_k) - f_{n,*}] = O(ρ^k)     O(n)                 n log(1/ε)
  SG: E[f_n(w_k) - f_{n,*}] = O(1/k)     O(1)                 1/ε
Considering the total (estimation + optimization) error as
  E = E[f(w̄_n) - f(w_*)] + E[f(w̃_n) - f(w̄_n)] ≈ 1/n + ε
and a time budget T, one finds:
  SG: Process as many samples as possible (n ∝ T), leading to E ≈ 1/T.
  GD: With n ∝ T/log(1/ε), minimizing E yields ε ≈ 1/T and E ≈ 1/T + log(T)/T.
Stochastic Optimization Algorithms Beyond SG 13 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 14 of 42

End of the story?
SG is great! Let's keep proving how great it is!
  Stability of SG; Hardt, Recht, Singer (2015)
  SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)
Stochastic Optimization Algorithms Beyond SG 15 of 42

End of the story?
SG is great! Let's keep proving how great it is!
  Stability of SG; Hardt, Recht, Singer (2015)
  SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)
No, we should want more...
  SG requires a lot of tuning
  Sublinear convergence is not satisfactory... a linearly convergent method eventually wins
  ... with a higher budget, faster computation, parallel?, distributed?
Also, any gradient-based method is not scale invariant.
Stochastic Optimization Algorithms Beyond SG 15 of 42

What can be improved?
[Schematic: starting from stochastic gradient, one direction gives a better rate, another a better constant]
Stochastic Optimization Algorithms Beyond SG 16 of 42

What can be improved?
[Schematic: starting from stochastic gradient, one direction gives a better rate, another a better constant, and combining them gives a better rate and a better constant]
Stochastic Optimization Algorithms Beyond SG 16 of 42

Two-dimensional schematic of methods
[Schematic: horizontal axis "noise reduction" (stochastic gradient → batch gradient), vertical axis "second-order" (stochastic gradient → stochastic Newton), with batch Newton in the far corner]
Stochastic Optimization Algorithms Beyond SG 17 of 42

2D schematic: Noise reduction methods
[Schematic: along the noise-reduction axis from stochastic gradient toward batch gradient: dynamic sampling, gradient aggregation, iterate averaging; stochastic Newton on the second-order axis]
Stochastic Optimization Algorithms Beyond SG 18 of 42

2D schematic: Second-order methods
[Schematic: along the second-order axis from stochastic gradient toward stochastic Newton: diagonal scaling, natural gradient, Gauss-Newton, quasi-Newton, Hessian-free Newton; batch gradient on the noise-reduction axis]
Stochastic Optimization Algorithms Beyond SG 19 of 42

Even more...
  momentum
  acceleration
  (dual) coordinate descent
  trust region / step normalization
  exploring negative curvature
Stochastic Optimization Algorithms Beyond SG 20 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 21 of 42

Scale invariance
Neither SG nor GD is invariant to linear transformations.
  min_{w ∈ R^d} f(w)      ⟹  w_{k+1} ← w_k - α_k ∇f(w_k)
  min_{w̃ ∈ R^d} f(Bw̃)    ⟹  w̃_{k+1} ← w̃_k - α_k B ∇f(Bw̃_k)   (for given B ≻ 0)
Stochastic Optimization Algorithms Beyond SG 22 of 42

Scale invariance
Neither SG nor GD is invariant to linear transformations.
  min_{w ∈ R^d} f(w)      ⟹  w_{k+1} ← w_k - α_k ∇f(w_k)
  min_{w̃ ∈ R^d} f(Bw̃)    ⟹  w̃_{k+1} ← w̃_k - α_k B ∇f(Bw̃_k)   (for given B ≻ 0)
Scaling the latter by B and defining {w_k} = {Bw̃_k} yields
  w_{k+1} ← w_k - α_k B^2 ∇f(w_k)
  The algorithm is clearly affected by the choice of B
  Surely, some choices may be better than others
Stochastic Optimization Algorithms Beyond SG 22 of 42

Newton scaling Consider the function below and suppose that w_k = (0, 3): Stochastic Optimization Algorithms Beyond SG 23 of 42

Newton scaling GD step along -∇f(w_k) ignores curvature of the function: Stochastic Optimization Algorithms Beyond SG 23 of 42

Newton scaling Newton scaling (B = (∇^2 f(w_k))^{-1/2}): gradient step moves to the minimizer: Stochastic Optimization Algorithms Beyond SG 23 of 42

Newton scaling
... corresponds to minimizing a quadratic model of f in the original space:
  w_{k+1} ← w_k + α_k s_k, where ∇^2 f(w_k) s_k = -∇f(w_k)
Stochastic Optimization Algorithms Beyond SG 23 of 42
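The Newton-scaled step amounts to one linear solve per iteration; a minimal sketch (illustrative names, dense solve) is:

import numpy as np

def newton_step(grad_f, hess_f, w, alpha=1.0):
    # Solve grad^2 f(w) s = -grad f(w), then take w_+ = w + alpha * s.
    s = np.linalg.solve(hess_f(w), -grad_f(w))
    return w + alpha * s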

Deterministic case
What is known about Newton's method for deterministic optimization?
  local rescaling based on inverse Hessian information
  unit steps are good near a strong minimizer (no tuning!)... locally quadratically convergent
  global convergence rate better than gradient method (when regularized)
Stochastic Optimization Algorithms Beyond SG 24 of 42

Deterministic case to stochastic case
What is known about Newton's method for deterministic optimization?
  local rescaling based on inverse Hessian information
  unit steps are good near a strong minimizer (no tuning!)... locally quadratically convergent
  global convergence rate better than gradient method (when regularized)
However, it is way too expensive. But all is not lost: scaling can be practical.
  A wide variety of scaling techniques improve performance.
  ... one could hope to remove the condition number (L/c) from the convergence rate!
  Added costs can be minimal when coupled with noise reduction.
Stochastic Optimization Algorithms Beyond SG 24 of 42

Quasi-Newton
Only approximate second-order information with gradient displacements:
[Figure: iterates w_k and w_{k+1} along w]
Secant equation H_k v_k = s_k to match the gradient of f at w_k, where
  s_k := w_{k+1} - w_k and v_k := ∇f(w_{k+1}) - ∇f(w_k)
Stochastic Optimization Algorithms Beyond SG 25 of 42

Balance between extremes
For deterministic, smooth optimization, a nice balance is achieved by quasi-Newton methods:
  w_{k+1} ← w_k - α_k M_k g_k,
where α_k > 0 is a stepsize; g_k ≈ ∇f(w_k); {M_k} is updated dynamically.
Background on quasi-Newton:
  local rescaling of step (overcome ill-conditioning)
  only first-order derivatives required
  no linear system solves required
  global convergence guarantees (say, with line search)
  superlinear local convergence rate
How can the idea be carried over to a stochastic setting?
Stochastic Optimization Algorithms Beyond SG 26 of 42

Previous work: BFGS-type methods
Much focus on the secant equation (H_{k+1} ≈ Hessian approximation)
  H_{k+1} s_k = y_k, where s_k := w_{k+1} - w_k and y_k := ∇f(w_{k+1}) - ∇f(w_k),
and an appropriate replacement for the gradient displacement:
  y_k ← ∇f(w_{k+1}; ξ_k) - ∇f(w_k; ξ_k)   (use same seed)
    oLBFGS, Schraudolph et al. (2007); SGD-QN, Bordes et al. (2009); RES, Mokhtari & Ribeiro (2014)
or
  y_k ← (Σ_{i ∈ S_k^H} ∇^2 f(w_{k+1}; ξ_{k+1,i})) s_k   (use action of step on subsampled Hessian)
    SQN, Byrd et al. (2015)
Is this the right focus? Is there a better way (especially for nonconvex f)?
Stochastic Optimization Algorithms Beyond SG 27 of 42
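The two replacements for the gradient displacement can be sketched as follows; the helpers stoch_grad and subsampled_hessian and their signatures are assumptions for illustration, not the APIs of the cited codes.

import numpy as np

def yk_same_seed(stoch_grad, w_k, w_k1, xi_k):
    # oLBFGS / SGD-QN / RES style: difference of stochastic gradients
    # evaluated with the same sample (seed) xi_k at both iterates.
    return stoch_grad(w_k1, xi_k) - stoch_grad(w_k, xi_k)

def yk_hessian_action(subsampled_hessian, w_k1, batch, s_k):
    # SQN style: action of a subsampled Hessian at w_{k+1} on the step s_k.
    return subsampled_hessian(w_k1, batch) @ s_k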

Proposal
Propose a quasi-Newton method for stochastic (nonconvex) optimization
  exploit self-correcting properties of BFGS-type updates
    Powell (1976); Ritter (1979, 1981); Werner (1978); Byrd, Nocedal (1989)
  properties of Hessians offer useful bounds for inverse Hessians
  motivating convergence theory for convex and nonconvex objectives
  dynamic noise reduction strategy
  limited memory variant
Observed stable behavior and overall good performance
Stochastic Optimization Algorithms Beyond SG 28 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 29 of 42

BFGS-type updates
Inverse Hessian and Hessian approximation updating formulas (requiring s_k^T v_k > 0):
  M_{k+1} ← (I - (v_k s_k^T)/(s_k^T v_k))^T M_k (I - (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k)
  H_{k+1} ← (I - (s_k s_k^T H_k)/(s_k^T H_k s_k))^T H_k (I - (s_k s_k^T H_k)/(s_k^T H_k s_k)) + (v_k v_k^T)/(s_k^T v_k)
These satisfy the secant-type equations M_{k+1} v_k = s_k and H_{k+1} s_k = v_k, but those are not relevant for our purposes here.
Choosing v_k ← y_k := g_{k+1} - g_k yields standard BFGS, but in this talk
  v_k ← β_k s_k + (1 - β_k) α_k y_k for some β_k ∈ [0, 1].
This scheme is important to preserve self-correcting properties.
Stochastic Optimization Algorithms Beyond SG 30 of 42
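In code, the inverse-Hessian update and the damped choice of v_k look as follows; this is a minimal NumPy sketch with illustrative names, and it assumes s_k^T v_k > 0 as stated above.

import numpy as np

def bfgs_inverse_update(M, s, v):
    # M_+ = (I - v s^T / s^T v)^T M (I - v s^T / s^T v) + s s^T / s^T v
    sv = s @ v                                   # requires s^T v > 0
    W = np.eye(len(s)) - np.outer(v, s) / sv
    return W.T @ M @ W + np.outer(s, s) / sv

def damped_displacement(s, alpha, y, beta):
    # v = beta * s + (1 - beta) * alpha * y, the combination used in this talk
    return beta * s + (1.0 - beta) * alpha * y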

Geometric properties of Hessian update
Consider the matrices (which only depend on s_k and H_k, not g_k!)
  P_k := (s_k s_k^T H_k)/(s_k^T H_k s_k) and Q_k := I - P_k.
Both are H_k-orthogonal projection matrices (i.e., idempotent and H_k-self-adjoint).
  P_k yields the H_k-orthogonal projection onto span(s_k).
  Q_k yields the H_k-orthogonal projection onto the H_k-orthogonal complement of span(s_k).
Stochastic Optimization Algorithms Beyond SG 31 of 42

Geometric properties of Hessian update
Consider the matrices (which only depend on s_k and H_k, not g_k!)
  P_k := (s_k s_k^T H_k)/(s_k^T H_k s_k) and Q_k := I - P_k.
Both are H_k-orthogonal projection matrices (i.e., idempotent and H_k-self-adjoint).
  P_k yields the H_k-orthogonal projection onto span(s_k).
  Q_k yields the H_k-orthogonal projection onto the H_k-orthogonal complement of span(s_k).
Returning to the Hessian update:
  H_{k+1} ← (I - (s_k s_k^T H_k)/(s_k^T H_k s_k))^T H_k (I - (s_k s_k^T H_k)/(s_k^T H_k s_k)) + (v_k v_k^T)/(s_k^T v_k),
where the first term has rank n-1 and the second has rank 1.
  Curvature is projected out along span(s_k).
  Curvature is corrected by
    (v_k v_k^T)/(s_k^T v_k) = ((v_k v_k^T)/||v_k||_2^2) (||v_k||_2^2/(v_k^T M_{k+1} v_k))   (inverse Rayleigh quotient).
Stochastic Optimization Algorithms Beyond SG 31 of 42
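A quick numerical check of these projection properties (the random positive definite H_k and step s_k below are chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5.0 * np.eye(5)                  # a positive definite H_k
s = rng.standard_normal(5)
P = np.outer(s, s) @ H / (s @ H @ s)           # P_k = s s^T H / (s^T H s)
Q = np.eye(5) - P                              # Q_k = I - P_k
assert np.allclose(P @ P, P)                   # idempotent
assert np.allclose(H @ P, (H @ P).T)           # H-self-adjoint: H P = P^T H
assert np.allclose(P @ s, s)                   # projects onto span(s_k)
assert np.allclose(s @ H @ (Q @ rng.standard_normal(5)), 0.0)  # range(Q_k) is H-orthogonal to s_k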

Self-correcting properties of Hessian update Since curvature is constantly projected out, what happens after many updates? Stochastic Optimization Algorithms Beyond SG 32 of 42

Self-correcting properties of Hessian update
Since curvature is constantly projected out, what happens after many updates?
Theorem SC (Byrd, Nocedal (1989))
Suppose that, for all k, there exist {η, θ} ⊂ R_{++} such that
  η ≤ (s_k^T v_k)/||s_k||_2^2 and ||v_k||_2^2/(s_k^T v_k) ≤ θ.   (KEY)
Then, for any p ∈ (0, 1), there exist constants {ι, κ, λ} ⊂ R_{++} such that, for any K ≥ 2, the following relations hold for at least pK values of k ∈ {1, ..., K}:
  ι ≤ (s_k^T H_k s_k)/(||s_k||_2 ||H_k s_k||_2) and κ ≤ ||H_k s_k||_2/||s_k||_2 ≤ λ.
Proof technique.
Building on work of Powell (1976), etc.; involves bounding growth of γ(H_k) = tr(H_k) - ln(det(H_k)).
Stochastic Optimization Algorithms Beyond SG 32 of 42

Self-correcting properties of inverse Hessian update
Rather than focus on superlinear convergence results, we care about the following.
Corollary SC
Suppose the conditions of Theorem SC hold. Then, for any p ∈ (0, 1), there exist constants {µ, ν} ⊂ R_{++} such that, for any K ≥ 2, the following relations hold for at least pK values of k ∈ {1, ..., K}:
  µ ||g_k||_2^2 ≤ g_k^T M_k g_k and ||M_k g_k||_2^2 ≤ ν ||g_k||_2^2.
Proof sketch.
Follows simply, after algebraic manipulations, from the result of Theorem SC, using the facts that s_k = -α_k M_k g_k and M_k = H_k^{-1} for all k.
Stochastic Optimization Algorithms Beyond SG 33 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 34 of 42

Algorithm SC : Self-Correcting BFGS Algorithm
1: Choose w_1 ∈ R^d.
2: Set g_1 ≈ ∇f(w_1).
3: Choose a symmetric positive definite M_1 ∈ R^{d×d}.
4: Choose a positive scalar sequence {α_k}.
5: for k = 1, 2, ... do
6:   Set s_k ← -α_k M_k g_k.
7:   Set w_{k+1} ← w_k + s_k.
8:   Set g_{k+1} ≈ ∇f(w_{k+1}).
9:   Set y_k ← g_{k+1} - g_k.
10:  Set β_k ← min{β ∈ [0, 1] : v(β) := β s_k + (1 - β) α_k y_k satisfies (KEY)}.
11:  Set v_k ← v(β_k).
12:  Set M_{k+1} ← (I - (v_k s_k^T)/(s_k^T v_k))^T M_k (I - (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k).
13: end for
Stochastic Optimization Algorithms Beyond SG 35 of 42
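A minimal Python sketch of Algorithm SC follows. It is illustrative only: the helper stoch_grad, the stepsize rule, the constants η and θ in (KEY), and the grid search used to pick β_k are all assumptions, not the reference implementation.

import numpy as np

def sc_bfgs(stoch_grad, w1, stepsize, eta=0.125, theta=8.0, num_iters=100):
    # Self-correcting BFGS sketch: steps 1-13 of Algorithm SC, with a simple
    # grid search for beta_k in step 10 (the talk leaves the selection rule abstract).
    w = np.asarray(w1, dtype=float)
    d = len(w)
    M = np.eye(d)                        # symmetric positive definite M_1
    g = stoch_grad(w, 0)                 # g_1 ~ grad f(w_1)
    for k in range(1, num_iters + 1):
        alpha = stepsize(k)
        s = -alpha * (M @ g)             # s_k = -alpha_k M_k g_k
        if not np.any(s):                # degenerate step; stop early
            break
        w = w + s
        g_new = stoch_grad(w, k)
        y = g_new - g
        # Step 10: smallest beta in [0, 1] with v = beta*s + (1-beta)*alpha*y
        # satisfying (KEY): eta <= s^T v / ||s||^2 and ||v||^2 / s^T v <= theta.
        for beta in np.linspace(0.0, 1.0, 101):
            v = beta * s + (1.0 - beta) * alpha * y
            sv = s @ v
            if sv > 0.0 and sv / (s @ s) >= eta and (v @ v) / sv <= theta:
                break                    # beta = 1 (v = s) qualifies for eta <= 1 <= theta
        sv = s @ v
        W = np.eye(d) - np.outer(v, s) / sv
        M = W.T @ M @ W + np.outer(s, s) / sv   # step 12: BFGS-type inverse update
        g = g_new
    return w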

Global convergence theorem
Theorem (Bottou, Curtis, Nocedal (2016))
Suppose that, for all k, there exists a scalar constant ρ > 0 such that
  ∇f(w_k)^T E_{ξ_k}[M_k g_k] ≥ ρ ||∇f(w_k)||_2^2,
and there exist scalars σ > 0 and τ > 0 such that
  E_{ξ_k}[||M_k g_k||_2^2] ≤ σ + τ ||∇f(w_k)||_2^2.
Then {E[f(w_k)]} converges to a finite limit and lim inf_{k→∞} E[||∇f(w_k)||_2] = 0.
Proof technique.
Follows from the critical inequality
  E_{ξ_k}[f(w_{k+1})] ≤ f(w_k) - α_k ∇f(w_k)^T E_{ξ_k}[M_k g_k] + α_k^2 L E_{ξ_k}[||M_k g_k||_2^2].
Stochastic Optimization Algorithms Beyond SG 36 of 42

Reality
The conditions in this theorem cannot be verified in practice.
  They require knowing ∇f(w_k).
  They require knowing E_{ξ_k}[M_k g_k] and E_{ξ_k}[||M_k g_k||_2^2]... but M_k and g_k are not independent!
That said, Corollary SC ensures that they hold with g_k = ∇f(w_k); recall
  µ ||g_k||_2^2 ≤ g_k^T M_k g_k and ||M_k g_k||_2^2 ≤ ν ||g_k||_2^2.
Stochastic Optimization Algorithms Beyond SG 37 of 42

Reality
The conditions in this theorem cannot be verified in practice.
  They require knowing ∇f(w_k).
  They require knowing E_{ξ_k}[M_k g_k] and E_{ξ_k}[||M_k g_k||_2^2]... but M_k and g_k are not independent!
That said, Corollary SC ensures that they hold with g_k = ∇f(w_k); recall
  µ ||g_k||_2^2 ≤ g_k^T M_k g_k and ||M_k g_k||_2^2 ≤ ν ||g_k||_2^2.
Stabilized variant (SC-s): loop over the (stochastic) gradient computation until
  ρ ||ĝ_{k+1}||_2^2 ≤ ĝ_{k+1}^T M_{k+1} g_{k+1} and ||M_{k+1} g_{k+1}||_2^2 ≤ σ + τ ||ĝ_{k+1}||_2^2.
Recompute g_{k+1}, ĝ_{k+1}, and M_{k+1} until these hold.
Stochastic Optimization Algorithms Beyond SG 37 of 42
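The stabilization test itself is easy to state in code; the sketch below simply checks the two inequalities above for given estimates (illustrative names; ρ, σ, τ are the constants from the convergence theorem).

import numpy as np

def sc_s_conditions_hold(M, g, g_hat, rho, sigma, tau):
    # Accept the current M_{k+1}, g_{k+1}, and check-gradient g_hat_{k+1} when
    #   rho * ||g_hat||^2 <= g_hat^T M g   and   ||M g||^2 <= sigma + tau * ||g_hat||^2.
    Mg = M @ g
    return (rho * (g_hat @ g_hat) <= g_hat @ Mg) and (Mg @ Mg <= sigma + tau * (g_hat @ g_hat))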

Numerical Experiments: a1a logistic regression, data a1a, diminishing stepsizes Stochastic Optimization Algorithms Beyond SG 38 of 42

Numerical Experiments: rcv1 SC-L and SC-L-s: limited memory variants of SC and SC-s, respectively: logistic regression, data rcv1, diminishing stepsizes Stochastic Optimization Algorithms Beyond SG 39 of 42

Numerical Experiments: mnist deep neural network, data mnist, diminishing stepsizes Stochastic Optimization Algorithms Beyond SG 40 of 42

Outline GD and SG GD vs. SG Beyond SG Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Stochastic Optimization Algorithms Beyond SG 41 of 42

Contributions
Proposed a quasi-Newton method for stochastic (nonconvex) optimization
  exploited self-correcting properties of BFGS-type updates
  properties of Hessians offer useful bounds for inverse Hessians
  motivating convergence theory for convex and nonconvex objectives
  dynamic noise reduction strategy
  limited memory variant
Observed stable behavior and overall good performance
F. E. Curtis. A Self-Correcting Variable-Metric Algorithm for Stochastic Optimization. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR.
Stochastic Optimization Algorithms Beyond SG 42 of 42