Stochastic Optimization Algorithms Beyond SG

Frank E. Curtis¹, Lehigh University

involving joint work with
Léon Bottou, Facebook AI Research
Jorge Nocedal, Northwestern University

Optimization Methods for Large-Scale Machine Learning
http://arxiv.org/abs/1606.04838

Google NYC, 3 November 2016

¹ Mike's older brother
Outline

GD and SG
GD vs. SG
Beyond SG
Stochastic Quasi-Newton
Self-Correcting Properties of BFGS
Proposed Algorithm: SC-BFGS
Summary
Stochastic optimization

Over a parameter vector w ∈ R^d, and given a loss l(·, y) (w.r.t. the true label y) and a prediction h(x; w) (w.r.t. the features x), consider the unconstrained optimization problem

  min_{w ∈ R^d} f(w), where f(w) = E_{(x,y)}[l(h(w; x), y)].

Given a training set {(x_i, y_i)}_{i=1}^n, an approximate problem is given by

  min_{w ∈ R^d} f_n(w), where f_n(w) = (1/n) Σ_{i=1}^n l(h(x_i; w), y_i).

For this talk, let's assume:
- f is continuously differentiable, bounded below, and potentially nonconvex;
- ∇f is L-Lipschitz continuous, i.e., ||∇f(w) - ∇f(w̄)||_2 ≤ L ||w - w̄||_2.

Focus on optimization algorithms, not data fitting issues, regularization, etc.
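As a concrete (hypothetical) instance of the empirical risk f_n, the sketch below implements binary logistic regression with a linear prediction h(x; w) = wᵀx; the data, function names, and gradient formula are illustrative additions, not from the talk.

```python
import numpy as np

def logistic_loss(z, y):
    # l(h, y) = log(1 + exp(-y h)) for labels y in {-1, +1}, prediction h = w^T x
    return np.log1p(np.exp(-y * z))

def empirical_risk(w, X, Y):
    # f_n(w) = (1/n) * sum_i l(h(x_i; w), y_i)
    return np.mean(logistic_loss(X @ w, Y))

def empirical_grad(w, X, Y):
    # grad f_n(w) = (1/n) * sum_i (-y_i / (1 + exp(y_i w^T x_i))) * x_i
    s = -Y / (1.0 + np.exp(Y * (X @ w)))
    return X.T @ s / len(Y)
```

Everything that follows (GD, SG, quasi-Newton) can be read as different ways of driving f_n (or f) downhill using this gradient or a cheap estimate of it.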
Gradient descent

Algorithm GD: Gradient Descent
1: choose an initial point w_0 ∈ R^d and stepsize α > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α ∇f(w_k)
4: end for

[Figure: around w_k, f is sandwiched between the quadratic models
  f(w_k) + ∇f(w_k)^T (w - w_k) + (1/2) L ||w - w_k||_2^2   (upper bound)
  f(w_k) + ∇f(w_k)^T (w - w_k) + (1/2) c ||w - w_k||_2^2   (lower bound, under c-strong convexity).]
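A minimal sketch of Algorithm GD on a small strongly convex quadratic; the problem data and iteration count are illustrative assumptions.

```python
import numpy as np

# f(w) = 0.5 w^T A w - b^T w, so grad f(w) = A w - b; here c = 1 and L = 10.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])

def grad_f(w):
    return A @ w - b

w = np.zeros(2)
alpha = 1.0 / 10.0            # fixed stepsize alpha in (0, 1/L]
for _ in range(500):
    w = w - alpha * grad_f(w)
```

For a quadratic like this, the iterates converge linearly to the unique minimizer w* = A⁻¹b.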
GD theory

Theorem GD. If α ∈ (0, 1/L], then
  Σ_{k=0}^∞ ||∇f(w_k)||_2^2 < ∞, which implies {∇f(w_k)} → 0.
If, in addition, f is c-strongly convex, then for all k ≥ 1:
  f(w_k) - f_* ≤ (1 - αc)^k (f(w_0) - f_*).

Proof.
  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} - w_k) + (1/2) L ||w_{k+1} - w_k||_2^2
            ≤ f(w_k) - (1/2) α ||∇f(w_k)||_2^2
            ≤ f(w_k) - αc (f(w_k) - f_*)
  ⇒ f(w_{k+1}) - f_* ≤ (1 - αc)(f(w_k) - f_*).
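The per-iteration contraction in Theorem GD can be checked numerically on a small c-strongly convex quadratic; the problem data below are illustrative assumptions, not from the talk.

```python
import numpy as np

# f(w) = 0.5 w^T A w - b^T w with A = diag(1, 10), so c = 1, L = 10.
A = np.diag([1.0, 10.0])
b = np.array([1.0, -1.0])
w_star = np.linalg.solve(A, b)

def f(w):
    return 0.5 * w @ A @ w - b @ w

f_star = f(w_star)
alpha, c = 1.0 / 10.0, 1.0     # alpha = 1/L

w = np.array([5.0, 5.0])
gaps = []
for _ in range(50):
    gaps.append(f(w) - f_star)
    w = w - alpha * (A @ w - b)  # one GD step

# The theorem predicts gaps[k+1] <= (1 - alpha*c) * gaps[k] for every k.
```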
GD illustration

Figure: GD with fixed stepsize
Stochastic gradient descent

Approximate the gradient only; e.g., pick a random index i_k and use ∇_w l(h(w; x_{i_k}), y_{i_k}) ≈ ∇f(w).

Algorithm SG: Stochastic Gradient
1: choose an initial point w_0 ∈ R^d and stepsizes {α_k} > 0
2: for k ∈ {0, 1, 2, ...} do
3:   set w_{k+1} ← w_k - α_k g_k, where g_k ≈ ∇f(w_k)
4: end for

Not a descent method! ... but one can guarantee eventual descent in expectation (with E_k[g_k] = ∇f(w_k)):
  f(w_{k+1}) ≤ f(w_k) + ∇f(w_k)^T (w_{k+1} - w_k) + (1/2) L ||w_{k+1} - w_k||_2^2
            = f(w_k) - α_k ∇f(w_k)^T g_k + (1/2) α_k^2 L ||g_k||_2^2
  ⇒ E_k[f(w_{k+1})] ≤ f(w_k) - α_k ||∇f(w_k)||_2^2 + (1/2) α_k^2 L E_k[||g_k||_2^2].

Markov process: w_{k+1} depends only on w_k and the random choice at iteration k.
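A minimal SG sketch on a toy finite sum whose minimizer is the sample mean; the problem and stepsize schedule are illustrative assumptions.

```python
import numpy as np

# f_n(w) = (1/n) sum_i 0.5 (w - x_i)^2, minimized at w = mean(x).
rng = np.random.default_rng(1)
x = rng.standard_normal(1000) + 3.0

w = 0.0
for k in range(1, 20001):
    i = rng.integers(len(x))      # random sample i_k
    g = w - x[i]                  # unbiased: E_k[g] = w - mean(x) = f_n'(w)
    w -= (1.0 / k) * g            # diminishing stepsizes alpha_k = 1/k
```

With α_k = 1/k this particular iteration reduces to a running average of the sampled x_i, illustrating how diminishing stepsizes suppress the gradient noise.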
SG theory

Theorem SG. If E_k[||g_k||_2^2] ≤ M + ||∇f(w_k)||_2^2, then:
  α_k = 1/L     ⇒  E[(1/k) Σ_{j=1}^k ||∇f(w_j)||_2^2] ≤ M;
  α_k = O(1/k)  ⇒  E[Σ_{j=1}^k α_j ||∇f(w_j)||_2^2] < ∞.

If, in addition, f is c-strongly convex, then:
  α_k = 1/L     ⇒  E[f(w_k) - f_*] → (M/c)/2;
  α_k = O(1/k)  ⇒  E[f(w_k) - f_*] = O((L/c)(M/c)/k).

(*Assumed unbiased gradient estimates; see the paper for more generality.)
SG illustration

Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)
Why SG over GD for large-scale machine learning?

We have seen:
  GD: E[f_n(w_k) - f_{n,*}] = O(ρ^k)   (linear convergence)
  SG: E[f_n(w_k) - f_{n,*}] = O(1/k)   (sublinear convergence)

So why SG?

  Motivation    | Explanation
  Intuitive     | data redundancy
  Practical     | SG vs. L-BFGS with batch gradient (figure below)
  Theoretical   | E[f_n(w_k) - f_{n,*}] = O(1/k) and E[f(w_k) - f_*] = O(1/k)

[Figure: empirical risk (0 to 0.6) vs. accessed data points (0 to 4×10^5); SGD drives the risk down far faster than batch L-BFGS.]
Work complexity

Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010).

        Convergence rate                    Cost per iteration    Cost for ε-optimality
  GD:   E[f_n(w_k) - f_{n,*}] = O(ρ^k)     O(n)                  ⇒ n log(1/ε)
  SG:   E[f_n(w_k) - f_{n,*}] = O(1/k)     O(1)                  ⇒ 1/ε

Considering the total (estimation + optimization) error as
  E = E[f(w̄_n) - f(w_*)] + E[f(w̃_n) - f(w̄_n)] ≈ 1/n + ε
and a time budget T, one finds:
  SG: Process as many samples as possible (n ∝ T), leading to E ≈ 1/T.
  GD: With n ∝ T/log(1/ε), minimizing E yields ε ≈ 1/T and E ≈ 1/T + log(T)/T.
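The ε-optimality column can be made concrete with some hypothetical numbers; n and ε below are assumptions chosen only for illustration.

```python
import math

n = 10**6          # training-set size (hypothetical)
eps = 1e-3         # target optimality gap (hypothetical)

gd_cost = n * math.log(1.0 / eps)   # ~ n log(1/eps): few iterations, each costing O(n)
sg_cost = 1.0 / eps                 # ~ 1/eps: many iterations, each costing O(1)
```

Here gd_cost ≈ 6.9×10⁶ while sg_cost = 10³, illustrating why SG wins when n is large and only moderate accuracy is needed.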
End of the story?

SG is great! Let's keep proving how great it is!
- Stability of SG; Hardt, Recht, Singer (2015)
- SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)

No, we should want more...
- SG requires a lot of tuning
- Sublinear convergence is not satisfactory... a linearly convergent method eventually wins... with a higher budget, faster computation, parallel?, distributed?
- Also, any "gradient"-based method is not scale invariant.
What can be improved?

[Schematic: starting from the stochastic gradient method, one can aim for a better rate, a better constant, or both a better rate and a better constant.]
Two-dimensional schematic of methods

[Schematic: one axis for noise reduction (stochastic gradient → batch gradient), one axis for second-order information (stochastic gradient → stochastic Newton), with batch Newton in the far corner.]
2D schematic: Noise reduction methods

Along the noise-reduction axis (stochastic gradient → batch gradient):
- dynamic sampling
- gradient aggregation
- iterate averaging
2D schematic: Second-order methods

Along the second-order axis (stochastic gradient → stochastic Newton):
- diagonal scaling
- natural gradient
- Gauss-Newton
- quasi-Newton
- Hessian-free Newton
Even more...
- momentum
- acceleration
- (dual) coordinate descent
- trust region / step normalization
- exploring negative curvature
Scale invariance

Neither SG nor GD is invariant to linear transformations (for a given B ≻ 0):
  min_{w ∈ R^d} f(w)     ⇒  w_{k+1} ← w_k - α_k ∇f(w_k)
  min_{w̄ ∈ R^d} f(Bw̄)   ⇒  w̄_{k+1} ← w̄_k - α_k B ∇f(Bw̄_k)

Scaling the latter by B and defining {w_k} = {Bw̄_k} yields
  w_{k+1} ← w_k - α_k B^2 ∇f(w_k)

- The algorithm is clearly affected by the choice of B.
- Surely, some choices may be better than others.
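The effect of B can be seen on a two-dimensional quadratic; this is a sketch with illustrative data, where choosing B² = A⁻¹ corresponds to the Newton scaling discussed next.

```python
import numpy as np

# f(w) = 0.5 w^T A w, grad f(w) = A w, minimizer w* = 0.
A = np.diag([1.0, 100.0])

def grad_f(w):
    return A @ w

w0 = np.array([2.0, 3.0])

# B = I: a unit step of plain GD overshoots badly in the stiff direction.
w_gd = w0 - 1.0 * grad_f(w0)

# B^2 = A^{-1}: the rescaled step w - alpha B^2 grad f(w) with alpha = 1
# lands exactly on the minimizer.
w_scaled = w0 - 1.0 * np.linalg.inv(A) @ grad_f(w0)
```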
Newton scaling

Consider the function below and suppose that w_k = (0, 3):
- A GD step along -∇f(w_k) ignores the curvature of the function.
- With Newton scaling (B = (∇²f(w_k))^{-1/2}), the gradient step moves to the minimizer.
- ... this corresponds to minimizing a quadratic model of f in the original space:
    w_{k+1} ← w_k + α_k s_k, where ∇²f(w_k) s_k = -∇f(w_k).
Deterministic case to stochastic case

What is known about Newton's method for deterministic optimization?
- local rescaling based on inverse Hessian information
- unit steps are good near a strong minimizer (no tuning!)... locally quadratically convergent
- global convergence rate better than the gradient method (when regularized)

However, it is way too expensive. But all is not lost: scaling can be practical.
- A wide variety of scaling techniques improve performance.
- ... one could hope to remove the condition number (L/c) from the convergence rate!
- Added costs can be minimal when coupled with noise reduction.
Quasi-Newton

Only approximate second-order information with gradient displacements:

[Figure: step from w_k to w_{k+1}]

Secant equation H_k v_k = s_k to match the gradient of f at w_k, where
  s_k := w_{k+1} - w_k and v_k := ∇f(w_{k+1}) - ∇f(w_k)
Balance between extremes

For deterministic, smooth optimization, a nice balance is achieved by quasi-Newton:
  w_{k+1} ← w_k - α_k M_k g_k,
where α_k > 0 is a stepsize; g_k ≈ ∇f(w_k); {M_k} is updated dynamically.

Background on quasi-Newton:
- local rescaling of the step (overcomes ill-conditioning)
- only first-order derivatives required
- no linear system solves required
- global convergence guarantees (say, with line search)
- superlinear local convergence rate

How can the idea be carried over to a stochastic setting?
Previous work: BFGS-type methods

Much focus on the secant equation (with H_{k+1} the Hessian approximation)
  H_{k+1} s_k = y_k, where s_k := w_{k+1} - w_k and y_k := ∇f(w_{k+1}) - ∇f(w_k),
and an appropriate replacement for the gradient displacement: either
  y_k ← ∇f(w_{k+1}, ξ_k) - ∇f(w_k, ξ_k)   (use the same seed)
    oLBFGS, Schraudolph et al. (2007)
    SGD-QN, Bordes et al. (2009)
    RES, Mokhtari & Ribeiro (2014)
or
  y_k ← (Σ_{i ∈ S_k^H} ∇²f(w_{k+1}, ξ_{k+1,i})) s_k   (use the action of the step on a subsampled Hessian)
    SQN, Byrd et al. (2015)

Is this the right focus? Is there a better way (especially for nonconvex f)?
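The "same seed" displacement can be sketched as follows. The quadratic per-sample objective and all names here are illustrative assumptions, chosen so that the same-sample gradient difference equals the subsampled Hessian acting on s_k exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 4
X = rng.standard_normal((n, d))

def stoch_grad(w, idx):
    # Subsampled gradient of f_n(w) = (1/n) sum_i 0.5 (x_i^T w)^2 (illustrative)
    Xi = X[idx]
    return Xi.T @ (Xi @ w) / len(idx)

w_k = rng.standard_normal(d)
s_k = 0.1 * rng.standard_normal(d)
w_next = w_k + s_k

idx = rng.integers(n, size=64)    # one sample xi_k, reused at BOTH points
y_k = stoch_grad(w_next, idx) - stoch_grad(w_k, idx)
# Reusing xi_k makes y_k reflect curvature rather than sampling noise;
# here s_k^T y_k = s_k^T (subsampled Hessian) s_k > 0.
```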
Proposal

Propose a quasi-Newton method for stochastic (nonconvex) optimization:
- exploit self-correcting properties of BFGS-type updates
    Powell (1976); Ritter (1979, 1981); Werner (1978); Byrd, Nocedal (1989)
- properties of Hessians offer useful bounds for inverse Hessians
- motivating convergence theory for convex and nonconvex objectives
- dynamic noise reduction strategy
- limited memory variant

Observed stable behavior and overall good performance.
BFGS-type updates

Inverse Hessian and Hessian approximation updating formulas (requiring s_k^T v_k > 0):

  M_{k+1} ← (I - (v_k s_k^T)/(s_k^T v_k))^T M_k (I - (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k)

  H_{k+1} ← (I - (s_k s_k^T H_k)/(s_k^T H_k s_k))^T H_k (I - (s_k s_k^T H_k)/(s_k^T H_k s_k)) + (v_k v_k^T)/(s_k^T v_k)

These satisfy the secant-type equations M_{k+1} v_k = s_k and H_{k+1} s_k = v_k, but these are not relevant for our purposes here.

Choosing v_k ← y_k := g_{k+1} - g_k yields standard BFGS, but in this talk
  v_k ← β_k s_k + (1 - β_k) α_k y_k for some β_k ∈ [0, 1].
This scheme is important to preserve self-correcting properties.
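The inverse-Hessian update and the modified displacement v_k can be sketched directly; the random data are purely illustrative, and the secant-type equation M_{k+1} v_k = s_k then holds by construction.

```python
import numpy as np

def update_M(M, s, v):
    # M_{k+1} = (I - v s^T / (s^T v))^T M (I - v s^T / (s^T v)) + s s^T / (s^T v)
    W = np.eye(len(s)) - np.outer(v, s) / (s @ v)
    return W.T @ M @ W + np.outer(s, s) / (s @ v)

rng = np.random.default_rng(3)
d = 5
M = np.eye(d)
s = rng.standard_normal(d)
y = s + 0.1 * rng.standard_normal(d)    # a displacement with s^T y > 0
alpha, beta = 0.5, 0.7

v = beta * s + (1 - beta) * alpha * y   # the talk's interpolated choice, beta in [0, 1]
M_next = update_M(M, s, v)
```

Because s^T v > 0 and M is symmetric positive definite, the updated matrix stays symmetric positive definite.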
Geometric properties of the Hessian update

Consider the matrices (which only depend on s_k and H_k, not g_k!)
  P_k := (s_k s_k^T H_k)/(s_k^T H_k s_k) and Q_k := I - P_k.
Both are H_k-orthogonal projection matrices (i.e., idempotent and H_k-self-adjoint).
- P_k yields the H_k-orthogonal projection onto span(s_k).
- Q_k yields the H_k-orthogonal projection onto span(s_k)^{⊥_{H_k}}.

Returning to the Hessian update:
  H_{k+1} ← (I - (s_k s_k^T H_k)/(s_k^T H_k s_k))^T H_k (I - (s_k s_k^T H_k)/(s_k^T H_k s_k))   [rank n-1]
            + (v_k v_k^T)/(s_k^T v_k)   [rank 1]
- Curvature is projected out along span(s_k).
- Curvature is corrected by
    (v_k v_k^T)/(s_k^T v_k) = (v_k v_k^T/||v_k||_2^2)(||v_k||_2^2/(v_k^T M_{k+1} v_k))
  (an inverse Rayleigh quotient, since s_k = M_{k+1} v_k).
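These projection properties are easy to verify numerically; the random s_k and random SPD matrix H_k below are purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
Z = rng.standard_normal((d, d))
H = Z @ Z.T + d * np.eye(d)            # a symmetric positive definite H_k
s = rng.standard_normal(d)

P = np.outer(s, s) @ H / (s @ H @ s)   # P_k = s s^T H / (s^T H s)
Q = np.eye(d) - P                      # Q_k = I - P_k
```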
Self-correcting properties of the Hessian update

Since curvature is constantly projected out, what happens after many updates?

Theorem SC (Byrd, Nocedal (1989)). Suppose that, for all k, there exist {η, θ} ⊂ R_{++} such that
  η ≤ (s_k^T v_k)/||s_k||_2^2 and ||v_k||_2^2/(s_k^T v_k) ≤ θ.   (KEY)
Then, for any p ∈ (0, 1), there exist constants {ι, κ, λ} ⊂ R_{++} such that, for any K ≥ 2, the following relations hold for at least pK values of k ∈ {1, ..., K}:
  ι ≤ (s_k^T H_k s_k)/(||s_k||_2 ||H_k s_k||_2) and κ ≤ ||H_k s_k||_2/||s_k||_2 ≤ λ.

Proof technique. Building on work of Powell (1976), etc.; involves bounding the growth of
  γ(H_k) = tr(H_k) - ln(det(H_k)).
Self-correcting properties of the inverse Hessian update

Rather than focus on superlinear convergence results, we care about the following.

Corollary SC. Suppose the conditions of Theorem SC hold. Then, for any p ∈ (0, 1), there exist constants {µ, ν} ⊂ R_{++} such that, for any K ≥ 2, the following relations hold for at least pK values of k ∈ {1, ..., K}:
  µ ||g_k||_2^2 ≤ g_k^T M_k g_k and ||M_k g_k||_2^2 ≤ ν ||g_k||_2^2.

Proof sketch. Follows simply, after algebraic manipulations, from the result of Theorem SC, using the facts that s_k = -α_k M_k g_k and M_k = H_k^{-1} for all k.
Algorithm SC: Self-Correcting BFGS Algorithm
1: Choose w_1 ∈ R^d.
2: Set g_1 ≈ ∇f(w_1).
3: Choose a symmetric positive definite M_1 ∈ R^{d×d}.
4: Choose a positive scalar sequence {α_k}.
5: for k = 1, 2, ... do
6:   Set s_k ← -α_k M_k g_k.
7:   Set w_{k+1} ← w_k + s_k.
8:   Set g_{k+1} ≈ ∇f(w_{k+1}).
9:   Set y_k ← g_{k+1} - g_k.
10:  Set β_k ← min{β ∈ [0, 1] : v(β) := β s_k + (1 - β) α_k y_k satisfies (KEY)}.
11:  Set v_k ← v(β_k).
12:  Set M_{k+1} ← (I - (v_k s_k^T)/(s_k^T v_k))^T M_k (I - (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k).
13: end for
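A compact sketch of Algorithm SC on a noisy quadratic. Everything numeric here (the objective, noise level, stepsizes, the grid search for β_k, and the (KEY) constants η, θ) is an illustrative assumption; in particular, the paper derives β_k in closed form rather than by grid search.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
A = np.diag([1.0, 2.0, 4.0])              # f(w) = 0.5 w^T A w (illustrative)

def stoch_grad(w):
    return A @ w + 0.01 * rng.standard_normal(d)   # noisy gradient estimate g_k

def choose_v(s, ay, eta=0.125, theta=8.0):
    # Smallest beta in [0, 1] (on a grid) with v(beta) satisfying (KEY):
    #   eta <= s^T v / ||s||^2  and  ||v||^2 / (s^T v) <= theta.
    for beta in np.linspace(0.0, 1.0, 101):
        v = beta * s + (1.0 - beta) * ay
        stv = s @ v
        if stv >= eta * (s @ s) and (v @ v) <= theta * stv:
            return v
    return s   # beta = 1 always satisfies (KEY) when eta <= 1 <= theta

w = np.ones(d)
M = np.eye(d)
g = stoch_grad(w)
for k in range(1, 201):
    alpha = 0.2 / k                       # diminishing stepsizes
    s = -alpha * (M @ g)                  # step s_k = -alpha_k M_k g_k
    w = w + s
    g_new = stoch_grad(w)
    v = choose_v(s, alpha * (g_new - g))  # v_k = beta s_k + (1 - beta) alpha_k y_k
    W = np.eye(d) - np.outer(v, s) / (s @ v)
    M = W.T @ M @ W + np.outer(s, s) / (s @ v)   # self-correcting BFGS update
    g = g_new
```

Enforcing (KEY) through the choice of v_k is what keeps M_k symmetric positive definite and bounded despite the gradient noise.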
Global convergence theorem

Theorem (Bottou, Curtis, Nocedal (2016)). Suppose that, for all k, there exists a scalar constant ρ > 0 such that
  ∇f(w_k)^T E_{ξ_k}[M_k g_k] ≥ ρ ||∇f(w_k)||_2^2,
and there exist scalars σ > 0 and τ > 0 such that
  E_{ξ_k}[||M_k g_k||_2^2] ≤ σ + τ ||∇f(w_k)||_2^2.
Then, {E[f(w_k)]} converges to a finite limit and lim inf_{k→∞} E[||∇f(w_k)||] = 0.

Proof technique. Follows from the critical inequality
  E_{ξ_k}[f(w_{k+1})] ≤ f(w_k) - α_k ∇f(w_k)^T E_{ξ_k}[M_k g_k] + α_k^2 L E_{ξ_k}[||M_k g_k||_2^2].
Reality

The conditions in this theorem cannot be verified in practice.
- They require knowing ∇f(w_k).
- They require knowing E_{ξ_k}[M_k g_k] and E_{ξ_k}[||M_k g_k||_2^2]... but M_k and g_k are not independent!

That said, Corollary SC ensures that they hold with g_k = ∇f(w_k); recall
  µ ||g_k||_2^2 ≤ g_k^T M_k g_k and ||M_k g_k||_2^2 ≤ ν ||g_k||_2^2.

Stabilized variant (SC-s): Loop over the (stochastic) gradient computation until
  ρ ||ĝ_{k+1}||_2^2 ≤ ĝ_{k+1}^T M_{k+1} g_{k+1} and ||M_{k+1} g_{k+1}||_2^2 ≤ σ + τ ||ĝ_{k+1}||_2^2,
i.e., recompute g_{k+1}, ĝ_{k+1}, and M_{k+1} until these hold.
Numerical experiments

a1a: logistic regression, data a1a, diminishing stepsizes.

rcv1: logistic regression, data rcv1, diminishing stepsizes. SC-L and SC-L-s are limited memory variants of SC and SC-s, respectively.

mnist: deep neural network, data mnist, diminishing stepsizes.
Contributions

Proposed a quasi-Newton method for stochastic (nonconvex) optimization:
- exploited self-correcting properties of BFGS-type updates
- properties of Hessians offer useful bounds for inverse Hessians
- motivating convergence theory for convex and nonconvex objectives
- dynamic noise reduction strategy
- limited memory variant

Observed stable behavior and overall good performance.

F. E. Curtis. "A Self-Correcting Variable-Metric Algorithm for Stochastic Optimization." In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR.