Second-Order Methods for Stochastic Optimization


1 Second-Order Methods for Stochastic Optimization Frank E. Curtis, Lehigh University, involving joint work with Léon Bottou (Facebook AI Research) and Jorge Nocedal (Northwestern University). Optimization Methods for Large-Scale Machine Learning, Columbia University, Department of IEOR, 16 October 2017. Second-Order Methods for Stochastic Optimization 1 of 49

2 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 2 of 49

3 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 3 of 49

4 Problem statement Consider the problem to find $w \in \mathbb{R}^d$ that minimizes $f$ subject to being in $W \subseteq \mathbb{R}^d$: $\min_{w \in W} f(w)$. (P) Interested in algorithms for solving (P) when $f$ might not be convex. Nonconvex optimization is experiencing a heyday! nonlinear least squares, training deep neural networks, subspace clustering, ... Second-Order Methods for Stochastic Optimization 4 of 49

5 History Nonlinear optimization has had parallel developments: the convexity world (Rockafellar, Fenchel, Nemirovski, Nesterov; built on the subgradient inequality; convergence and complexity guarantees) and the smoothness world (Powell, Fletcher, Goldfarb, Nocedal; built on sufficient decrease; convergence and fast local convergence). These worlds are (finally) colliding! Where should emphasis be placed? Second-Order Methods for Stochastic Optimization 5 of 49

6 First- versus second-order First-order methods follow a steepest descent methodology: $w_{k+1} \leftarrow w_k - \alpha_k \nabla f(w_k)$. Second-order methods follow Newton's methodology: $w_{k+1} \leftarrow w_k - \alpha_k [\nabla^2 f(w_k)]^{-1} \nabla f(w_k)$, which one should view as minimizing a quadratic model of $f$ at $w_k$: $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} (w - w_k)^T \nabla^2 f(w_k) (w - w_k)$. Second-Order Methods for Stochastic Optimization 6 of 49

7 First- versus quasi-second-order First-order methods follow a steepest descent methodology: $w_{k+1} \leftarrow w_k - \alpha_k \nabla f(w_k)$. Second-order methods follow Newton's methodology: $w_{k+1} \leftarrow w_k - \alpha_k M_k \nabla f(w_k)$, which one should view as minimizing a quadratic model of $f$ at $w_k$: $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} (w - w_k)^T H_k (w - w_k)$. Might also replace the Hessian with an approximation $H_k$ with inverse $M_k$. Second-Order Methods for Stochastic Optimization 6 of 49
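A minimal Python sketch of the three update rules just described, assuming callables grad_f and hess_f that return the gradient and Hessian of f, and an inverse Hessian approximation M (all illustrative names, not from the slides):

```python
import numpy as np

def gradient_step(w, grad_f, alpha):
    """First-order step: w_{k+1} = w_k - alpha_k * grad f(w_k)."""
    return w - alpha * grad_f(w)

def newton_step(w, grad_f, hess_f, alpha=1.0):
    """Second-order step: w_{k+1} = w_k - alpha_k * [Hessian f(w_k)]^{-1} grad f(w_k)."""
    return w - alpha * np.linalg.solve(hess_f(w), grad_f(w))

def quasi_newton_step(w, grad_f, M, alpha=1.0):
    """Quasi-second-order step with an inverse Hessian approximation M_k."""
    return w - alpha * (M @ grad_f(w))
```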

8 Why second-order??? Traditional motivation: Fast local convergence guarantees Recent motivation: Better complexity properties Second-Order Methods for Stochastic Optimization 7 of 49

9 Why second-order??? Traditional motivation: Fast local convergence guarantees Recent motivation: Better complexity properties However, I believe these convey the wrong message, especially when problems... involve stochasticity / randomness... involve nonsmoothness I believe there are other more appropriate motivations Second-Order Methods for Stochastic Optimization 7 of 49

10 Early 2010's Complexity guarantees for nonconvex optimization algorithms: iterations or function/derivative evaluations to achieve $\|\nabla f(w_k)\|_2 \leq \epsilon$. Steepest descent (first-order): $O(\epsilon^{-2})$. Line search (second-order): $O(\epsilon^{-2})$. Trust region (second-order): $O(\epsilon^{-2})$. Cubic regularization (second-order): $O(\epsilon^{-3/2})$. Cubic regularization has a longer history, but picks up steam in the early 2010's: Griewank (1981); Nesterov & Polyak (2006); Weiser, Deuflhard, Erdmann (2007); Cartis, Gould, Toint (2011), the ARC method. Second-Order Methods for Stochastic Optimization 8 of 49

11 Theory vs. practice Researchers have been gravitating to adopt and build on cubic regularization: Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017) Carmon, Duchi (2017) Kohler, Lucchi (2017) Peng, Roosta-Khorasani, Mahoney (2017) However, there remains a large gap between theory and practice! Second-Order Methods for Stochastic Optimization 9 of 49

12 Theory vs. practice Researchers have been gravitating to adopt and build on cubic regularization: Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017) Carmon, Duchi (2017) Kohler, Lucchi (2017) Peng, Roosta-Khorasani, Mahoney (2017) However, there remains a large gap between theory and practice! Little evidence that cubic regularization methods offer improved performance: Trust region (TR) methods remain the state-of-the-art TR-like methods can achieve the same complexity guarantees Second-Order Methods for Stochastic Optimization 9 of 49

13 Trust region methods with optimal complexity Second-Order Methods for Stochastic Optimization 10 of 49

14 So, why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) Second-Order Methods for Stochastic Optimization 11 of 49

15 So, why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) For fast local convergence guarantees? Eh, probably not... Hard to achieve, especially in large-scale, nonsmooth, or stochastic settings Second-Order Methods for Stochastic Optimization 11 of 49

16 So, why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) For fast local convergence guarantees? Eh, probably not... Hard to achieve, especially in large-scale, nonsmooth, or stochastic settings Then why? Adaptive, natural scaling (gradient descent stepsizes $\sim 1/L$ while Newton's $\sim 1$) Mitigate effects of ill-conditioning Easier to tune parameters(?) Better at avoiding saddle points(?) Better trade-off in parallel and distributed computing settings (Also, opportunities for NEW algorithms! Not analyzing the same old... ) Second-Order Methods for Stochastic Optimization 11 of 49

17 Message of this talk People want to solve more complicated, nonconvex problems... involving stochasticity / randomness... involving nonsmoothness We might waste this spotlight on nonconvex optimization if we do not... Make clear the gap between theory and practice (and close it!) Learn from advances that have already been made... and adapt them appropriately for modern problems Second-Order Methods for Stochastic Optimization 12 of 49

18 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 13 of 49

19 Stochastic optimization Over a parameter vector $w \in \mathbb{R}^d$ and given a loss $\ell(\cdot\,; y)$ (loss w.r.t. true label $y$) and a prediction function $h(w; x)$ (prediction w.r.t. features $x$), consider the unconstrained optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where $f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)]$. Second-Order Methods for Stochastic Optimization 14 of 49

20 Stochastic optimization Over a parameter vector $w \in \mathbb{R}^d$ and given a loss $\ell(\cdot\,; y)$ (loss w.r.t. true label $y$) and a prediction function $h(w; x)$ (prediction w.r.t. features $x$), consider the unconstrained optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where $f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)]$. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, the approximate problem is given by $\min_{w \in \mathbb{R}^d} f_n(w)$, where $f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i)$. Second-Order Methods for Stochastic Optimization 14 of 49

21 Stochastic optimization Over a parameter vector $w \in \mathbb{R}^d$ and given a loss $\ell(\cdot\,; y)$ (loss w.r.t. true label $y$) and a prediction function $h(w; x)$ (prediction w.r.t. features $x$), consider the unconstrained optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where $f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)]$. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, the approximate problem is given by $\min_{w \in \mathbb{R}^d} f_n(w)$, where $f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i)$. For this talk, let's assume $f$ is continuously differentiable, bounded below, and potentially nonconvex; $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(w) - \nabla f(\bar w)\|_2 \leq L \|w - \bar w\|_2$. Second-Order Methods for Stochastic Optimization 14 of 49
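To make $f_n$ and its gradient concrete, here is a minimal sketch for the logistic loss (the loss used in the experiments later in the talk); the function name and the $\pm 1$ label encoding are illustrative assumptions, not from the slides:

```python
import numpy as np

def empirical_risk_and_grad(w, X, y):
    """f_n(w) = (1/n) sum_i log(1 + exp(-y_i * x_i^T w)) and its gradient,
    for a feature matrix X (n x d) and labels y in {-1, +1}^n (logistic loss)."""
    z = y * (X @ w)                        # margins y_i * x_i^T w
    loss = np.mean(np.logaddexp(0.0, -z))  # numerically stable log(1 + exp(-z))
    grad = -(X.T @ (y / (1.0 + np.exp(z)))) / len(y)
    return loss, grad
```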

22 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f(w_k)$ at the current iterate $w_k$]
Second-Order Methods for Stochastic Optimization 15 of 49

23 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f(w_k)$ at $w_k$; what does $f(w)$ look like elsewhere?]
Second-Order Methods for Stochastic Optimization 15 of 49

24 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f$ at $w_k$ together with the quadratic upper bound $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{L}{2} \|w - w_k\|_2^2$]
Second-Order Methods for Stochastic Optimization 15 of 49

25 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f$ at $w_k$ together with the quadratic upper bound $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{L}{2} \|w - w_k\|_2^2$ and the quadratic lower bound $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{c}{2} \|w - w_k\|_2^2$]
Second-Order Methods for Stochastic Optimization 15 of 49
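A minimal Python sketch of Algorithm GD, assuming a callable grad_f for $\nabla f$ (the name is illustrative):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Algorithm GD with a fixed stepsize; alpha in (0, 1/L] gives the descent
    guarantee stated in the theorem on the next slide."""
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w -= alpha * grad_f(w)
    return w
```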

26 GD theory Theorem GD. If $\alpha \in (0, 1/L]$, then $\sum_{k=0}^{\infty} \|\nabla f(w_k)\|_2^2 < \infty$, which implies $\{\nabla f(w_k)\} \to 0$. Proof. $f(w_{k+1}) \leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{L}{2} \|w_{k+1} - w_k\|_2^2 \leq f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2$. Second-Order Methods for Stochastic Optimization 16 of 49

27 GD theory Theorem GD. If $\alpha \in (0, 1/L]$, then $\sum_{k=0}^{\infty} \|\nabla f(w_k)\|_2^2 < \infty$, which implies $\{\nabla f(w_k)\} \to 0$. If, in addition, $f$ is $c$-strongly convex, then for all $k \geq 1$: $f(w_k) - f_* \leq (1 - \alpha c)^k (f(w_0) - f_*)$. Proof. $f(w_{k+1}) \leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{L}{2} \|w_{k+1} - w_k\|_2^2 \leq f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2 \leq f(w_k) - \alpha c (f(w_k) - f_*)$, so $f(w_{k+1}) - f_* \leq (1 - \alpha c)(f(w_k) - f_*)$. Second-Order Methods for Stochastic Optimization 16 of 49

28 GD illustration Figure: GD with fixed stepsize Second-Order Methods for Stochastic Optimization 17 of 49

29 Stochastic gradient descent Approximate the gradient only; e.g., pick a random $i_k$ and use $\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \approx \nabla f(w)$.
Algorithm SG: Stochastic Gradient
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsizes $\{\alpha_k\} > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha_k g_k$, where $g_k \approx \nabla f(w_k)$
4: end for
Second-Order Methods for Stochastic Optimization 18 of 49

30 Stochastic gradient descent Approximate the gradient only; e.g., pick a random $i_k$ and use $\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \approx \nabla f(w)$.
Algorithm SG: Stochastic Gradient
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsizes $\{\alpha_k\} > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha_k g_k$, where $g_k \approx \nabla f(w_k)$
4: end for
Not a descent method!... but one can guarantee eventual descent in expectation (with $\mathbb{E}_k[g_k] = \nabla f(w_k)$): $f(w_{k+1}) \leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{L}{2} \|w_{k+1} - w_k\|_2^2 = f(w_k) - \alpha_k \nabla f(w_k)^T g_k + \tfrac{1}{2} \alpha_k^2 L \|g_k\|_2^2$, so $\mathbb{E}_k[f(w_{k+1})] \leq f(w_k) - \alpha_k \|\nabla f(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_k[\|g_k\|_2^2]$. Markov process: $w_{k+1}$ depends only on $w_k$ and the random choice at iteration $k$.
Second-Order Methods for Stochastic Optimization 18 of 49
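A minimal Python sketch of Algorithm SG, assuming a callable grad_f_i(w, i) returning $\nabla_w \ell(h(w; x_i), y_i)$ and a stepsize schedule given as a callable (names are illustrative):

```python
import numpy as np

def stochastic_gradient(grad_f_i, n, w0, stepsize, num_iters, seed=0):
    """Algorithm SG: sample i_k uniformly from {0,...,n-1} and step along
    -alpha_k * g_k, where g_k = grad_f_i(w_k, i_k) is an unbiased estimate
    of grad f_n(w_k); stepsize(k) returns alpha_k (fixed or diminishing)."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(num_iters):
        i_k = int(rng.integers(n))
        w -= stepsize(k) * grad_f_i(w, i_k)
    return w

# Example of a diminishing-stepsize schedule alpha_k = O(1/k):
# stepsize = lambda k: 1.0 / (k + 1)
```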

31 SG theory Theorem SG. If $\mathbb{E}_k[\|g_k\|_2^2] \leq M + \|\nabla f(w_k)\|_2^2$, then: $\alpha_k = \frac{1}{L}$ $\Longrightarrow$ $\mathbb{E}\big[\frac{1}{k}\sum_{j=1}^{k} \|\nabla f(w_j)\|_2^2\big] \leq M$ (asymptotically in $k$); $\alpha_k = O(\frac{1}{k})$ $\Longrightarrow$ $\mathbb{E}\big[\sum_{j=1}^{\infty} \alpha_j \|\nabla f(w_j)\|_2^2\big] < \infty$. (*Assumed unbiased gradient estimates; see paper for more generality.) Second-Order Methods for Stochastic Optimization 19 of 49

32 SG theory Theorem SG. If $\mathbb{E}_k[\|g_k\|_2^2] \leq M + \|\nabla f(w_k)\|_2^2$, then: $\alpha_k = \frac{1}{L}$ $\Longrightarrow$ $\mathbb{E}\big[\frac{1}{k}\sum_{j=1}^{k} \|\nabla f(w_j)\|_2^2\big] \leq M$ (asymptotically in $k$); $\alpha_k = O(\frac{1}{k})$ $\Longrightarrow$ $\mathbb{E}\big[\sum_{j=1}^{\infty} \alpha_j \|\nabla f(w_j)\|_2^2\big] < \infty$. If, in addition, $f$ is $c$-strongly convex, then: $\alpha_k = \frac{1}{L}$ $\Longrightarrow$ $\mathbb{E}[f(w_k) - f_*] \to \frac{M}{2c}$; $\alpha_k = O(\frac{1}{k})$ $\Longrightarrow$ $\mathbb{E}[f(w_k) - f_*] = O\big(\frac{(L/c)(M/c)}{k}\big)$. (*Assumed unbiased gradient estimates; see paper for more generality.) Second-Order Methods for Stochastic Optimization 19 of 49

33 SG illustration Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right) Second-Order Methods for Stochastic Optimization 20 of 49

34 Why SG over GD for large-scale machine learning? We have seen: GD: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$ (linear convergence); SG: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ (sublinear convergence). So why SG? Second-Order Methods for Stochastic Optimization 21 of 49

35 Why SG over GD for large-scale machine learning? We have seen: GD: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$ (linear convergence); SG: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ (sublinear convergence). So why SG? Motivation and explanation: Intuitive: data redundancy. Empirical: SG works well in practice vs. GD. Theoretical: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ and $\mathbb{E}[f(w_k) - f_*] = O(1/k)$. Second-Order Methods for Stochastic Optimization 21 of 49

36 Work complexity Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010). GD: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$, cost per iteration $O(n)$, cost for $\epsilon$-optimality $n \log(1/\epsilon)$. SG: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$, cost per iteration $O(1)$, cost for $\epsilon$-optimality $1/\epsilon$. Second-Order Methods for Stochastic Optimization 22 of 49

37 Work complexity Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010). GD: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$, cost per iteration $O(n)$, cost for $\epsilon$-optimality $n \log(1/\epsilon)$. SG: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$, cost per iteration $O(1)$, cost for $\epsilon$-optimality $1/\epsilon$. Considering the total (estimation + optimization) error as $\mathcal{E} = \mathbb{E}[f(w_n) - f(w_*)] + \mathbb{E}[f(\tilde w_n) - f(w_n)] \approx \frac{1}{n} + \epsilon$ and a time budget $T$, one finds: SG: process as many samples as possible ($n \propto T$), leading to $\mathcal{E} \approx \frac{1}{T}$. GD: with $n \propto T/\log(1/\epsilon)$, minimizing $\mathcal{E}$ yields $\epsilon \approx 1/T$ and $\mathcal{E} \approx \frac{1}{T} + \frac{\log(T)}{T}$. Second-Order Methods for Stochastic Optimization 22 of 49
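A back-of-the-envelope sketch of the $n \log(1/\epsilon)$ versus $1/\epsilon$ comparison, with all constants ignored (the numbers in the comments are only illustrative):

```python
import math

def gd_work(n, eps):
    """Total GD work: O(log(1/eps)) iterations, each costing O(n) sample gradients."""
    return n * math.log(1.0 / eps)

def sg_work(eps):
    """Total SG work: O(1/eps) iterations, each costing O(1) sample gradients."""
    return 1.0 / eps

# With n = 1e6 samples and eps = 1e-3 (constants ignored):
# gd_work(1e6, 1e-3) ~ 6.9e6 sample gradients, while sg_work(1e-3) ~ 1e3,
# illustrating why SG wins when n is large relative to 1/eps.
```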

38 End of the story? SG is great! Let's keep proving how great it is! Stability of SG; Hardt, Recht, Singer (2015) SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)... (many more) Second-Order Methods for Stochastic Optimization 23 of 49

39 End of the story? SG is great! Let's keep proving how great it is! Stability of SG; Hardt, Recht, Singer (2015) SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)... (many more) No, we should want more... SG requires a lot of tuning Sublinear convergence is not satisfactory... a linearly convergent method eventually wins... with higher budget, faster computation, parallel?, distributed? Also, any gradient-based method is not scale invariant. Second-Order Methods for Stochastic Optimization 23 of 49

40 Stochastic Optimization: No Parameter Tuning Limited memory stochastic gradient method (extends Barzilai-Borwein): $x_{k+1} \leftarrow x_k - \alpha_k g_k$, where $\alpha_k > 0$ is chosen adaptively. Minimizing logistic loss for binary classification with the RCV1 dataset Second-Order Methods for Stochastic Optimization 24 of 49

41 Stochastic Optimization: Avoiding Saddle Points / Stagnation Training a convolutional neural network for classifying digits in mnist: Stochastic-gradient-type method versus one that follows negative curvature: Overcomes slow initial progress by SG-type method... Second-Order Methods for Stochastic Optimization 25 of 49

42 Stochastic Optimization: Avoiding Saddle Points / Stagnation Training a convolutional neural network for classifying digits in mnist: Stochastic-gradient-type method versus one that follows negative curvature:... while still yielding good behavior in terms of testing accuracy Second-Order Methods for Stochastic Optimization 25 of 49

43 What can be improved? stochastic gradient better rate better constant Second-Order Methods for Stochastic Optimization 26 of 49

44 What can be improved? stochastic gradient better rate better constant better rate and better constant Second-Order Methods for Stochastic Optimization 26 of 49

45 Two-dimensional schematic of methods stochastic gradient batch gradient second-order noise reduction stochastic Newton batch Newton Second-Order Methods for Stochastic Optimization 27 of 49

46 2D schematic: Noise reduction methods stochastic gradient batch gradient noise reduction dynamic sampling gradient aggregation iterate averaging stochastic Newton Second-Order Methods for Stochastic Optimization 28 of 49

47 2D schematic: Second-order methods stochastic gradient batch gradient stochastic Newton second-order diagonal scaling natural gradient Gauss-Newton quasi-newton Hessian-free Newton Second-Order Methods for Stochastic Optimization 29 of 49

48 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 30 of 49

49 Quasi-Newton Only approximate second-order information with gradient displacements: [Figure: iterates $x_k$ and $x_{k+1}$] Secant equation $H_k v_k = s_k$ to match the gradient of $f$ at $w_k$, where $s_k := w_{k+1} - w_k$ and $v_k := \nabla f(w_{k+1}) - \nabla f(w_k)$ Second-Order Methods for Stochastic Optimization 31 of 49

50 Previous work: BFGS-type methods Much focus on the secant equation ($H_{k+1}$ a Hessian approximation) $H_{k+1} s_k = y_k$, where $s_k := w_{k+1} - w_k$ and $y_k := \nabla f(w_{k+1}) - \nabla f(w_k)$, and on an appropriate replacement for the gradient displacement: either $y_k \leftarrow \nabla f(w_{k+1}, \xi_k) - \nabla f(w_k, \xi_k)$ (use same seed: oLBFGS, Schraudolph et al. (2007); SGD-QN, Bordes et al. (2009); RES, Mokhtari & Ribeiro (2014)) or $y_k \leftarrow \big(\sum_{i \in S^H_k} \nabla^2 f(w_{k+1}, \xi_{k+1,i})\big) s_k$ (use action of step on subsampled Hessian: SQN, Byrd et al. (2015); Goldfarb et al. (2016)). Is this the right focus? Is there a better way (especially for nonconvex $f$)? Second-Order Methods for Stochastic Optimization 32 of 49

51 Proposal Propose a quasi-Newton method for stochastic (nonconvex) optimization that exploits the self-correcting properties of BFGS-type updates: Powell (1976) Ritter (1979, 1981) Werner (1978) Byrd, Nocedal (1989) Second-Order Methods for Stochastic Optimization 33 of 49

52 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 34 of 49

53 BFGS-type updates Inverse Hessian and Hessian approximation updating formulas (requiring $s_k^T v_k > 0$): $M_{k+1} \leftarrow \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big)^T M_k \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big) + \frac{s_k s_k^T}{s_k^T v_k}$ and $H_{k+1} \leftarrow \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big)^T H_k \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big) + \frac{v_k v_k^T}{s_k^T v_k}$. These satisfy secant-type equations $M_{k+1} v_k = s_k$ and $H_{k+1} s_k = v_k$, but these are not critical for this talk. Second-Order Methods for Stochastic Optimization 35 of 49
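A minimal sketch of the two updating formulas above; the expanded form used for the $H_k$ update is algebraically equivalent to the projected expression on the slide:

```python
import numpy as np

def bfgs_update(M, H, s, v):
    """BFGS-type updates of the inverse Hessian approximation M_k and Hessian
    approximation H_k from a pair (s_k, v_k) with s_k^T v_k > 0."""
    sv = float(s @ v)
    assert sv > 0.0, "curvature condition s^T v > 0 required"
    V = np.eye(len(s)) - np.outer(v, s) / sv
    M_new = V.T @ M @ V + np.outer(s, s) / sv
    Hs = H @ s
    H_new = H - np.outer(Hs, Hs) / float(s @ Hs) + np.outer(v, v) / sv
    # Secant-type equations hold: M_new @ v == s and H_new @ s == v.
    return M_new, H_new
```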

54 Geometric properties of Hessian update: Burke, Lewis, Overton (2007) Consider the matrices (which only depend on $s_k$ and $H_k$, not $g_k$!) $P_k := \frac{s_k s_k^T H_k}{s_k^T H_k s_k}$ and $Q_k := I - P_k$. Both are $H_k$-orthogonal projection matrices (i.e., idempotent and $H_k$-self-adjoint). $P_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)$. $Q_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)^{\perp_{H_k}}$. Second-Order Methods for Stochastic Optimization 36 of 49

55 Geometric properties of Hessian update: Burke, Lewis, Overton (2007) Consider the matrices (which only depend on $s_k$ and $H_k$, not $g_k$!) $P_k := \frac{s_k s_k^T H_k}{s_k^T H_k s_k}$ and $Q_k := I - P_k$. Both are $H_k$-orthogonal projection matrices (i.e., idempotent and $H_k$-self-adjoint). $P_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)$. $Q_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)^{\perp_{H_k}}$. Returning to the Hessian update $H_{k+1} \leftarrow \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big)^T H_k \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big) + \frac{v_k v_k^T}{s_k^T v_k}$: the first term has rank $n-1$ and projects curvature out along $\mathrm{span}(s_k)$; the second term has rank $1$ and corrects the curvature, since $\frac{v_k v_k^T}{s_k^T v_k} = \big(\frac{v_k v_k^T}{\|v_k\|_2^2}\big) \frac{\|v_k\|_2^2}{v_k^T M_{k+1} v_k}$ (an inverse Rayleigh quotient). Second-Order Methods for Stochastic Optimization 36 of 49
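A quick numerical sanity check, for a randomly generated positive definite $H_k$ and step $s_k$, that $P_k$ and $Q_k$ behave as stated (idempotent, $H_k$-self-adjoint, and projecting onto $\mathrm{span}(s_k)$ and its $H_k$-orthogonal complement); this is only an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)                # a symmetric positive definite H_k
s = rng.standard_normal(d)                 # a step s_k

P = np.outer(s, s) @ H / float(s @ H @ s)  # P_k = s_k s_k^T H_k / (s_k^T H_k s_k)
Q = np.eye(d) - P                          # Q_k = I - P_k

assert np.allclose(P @ P, P) and np.allclose(Q @ Q, Q)  # idempotent
assert np.allclose(H @ P, P.T @ H)                      # H_k-self-adjoint
x = rng.standard_normal(d)
assert np.allclose(P @ x, s * float(s @ H @ x) / float(s @ H @ s))  # range(P) = span(s_k)
assert abs(float(s @ H @ (Q @ x))) < 1e-8               # Q_k x is H_k-orthogonal to s_k
```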

56 Self-correcting properties of Hessian update Since curvature is constantly projected out, what happens after many updates? Second-Order Methods for Stochastic Optimization 37 of 49

57 Self-correcting properties of Hessian update Since curvature is constantly projected out, what happens after many updates? Theorem 3 (Byrd, Nocedal (1989)). Suppose that, for all $k$, there exist $\{\eta, \theta\} \subset \mathbb{R}_{++}$ such that $\eta \leq \frac{s_k^T v_k}{\|s_k\|_2^2}$ and $\frac{\|v_k\|_2^2}{s_k^T v_k} \leq \theta$. (*) Then, for any $p \in (0, 1)$, there exist constants $\{\iota, \kappa, \lambda\} \subset \mathbb{R}_{++}$ such that, for any $K \geq 2$, the following relations hold for at least $pK$ values of $k \in \{1, \dots, K\}$: $\iota \leq \frac{s_k^T H_k s_k}{\|s_k\|_2 \|H_k s_k\|_2}$ and $\kappa \leq \frac{\|H_k s_k\|_2}{\|s_k\|_2} \leq \lambda$. Proof technique. Building on work of Powell (1976), involves bounding growth of $\gamma(H_k) = \mathrm{tr}(H_k) - \ln(\det(H_k))$. Second-Order Methods for Stochastic Optimization 37 of 49

58 Self-correcting properties of inverse Hessian update Rather than focus on superlinear convergence results, we care about the following. Corollary 4. Suppose the conditions of Theorem 3 hold. Then, for any $p \in (0, 1)$, there exist constants $\{\mu, \nu\} \subset \mathbb{R}_{++}$ such that, for any $K \geq 2$, the following relations hold for at least $pK$ values of $k \in \{1, \dots, K\}$: $\mu \|g_k\|_2^2 \leq g_k^T M_k g_k$ and $\|M_k g_k\|_2^2 \leq \nu \|g_k\|_2^2$. Here $g_k$ is the vector such that the iterate displacement is $w_{k+1} - w_k = s_k = -M_k g_k$. Proof sketch. Follows simply after algebraic manipulations from the result of Theorem 3, using the facts that $s_k = -M_k g_k$ and $M_k = H_k^{-1}$ for all $k$. Second-Order Methods for Stochastic Optimization 38 of 49

59 Summary Our main idea is to use a carefully selected type of damping: choosing $v_k \leftarrow y_k := g_{k+1} - g_k$ yields standard BFGS, but we consider $v_k \leftarrow \beta_k H_k s_k + (1 - \beta_k) \tilde y_k$ for some $\beta_k \in [0, 1]$ and $\tilde y_k \in \mathbb{R}^n$. This scheme preserves the self-correcting properties of BFGS. Second-Order Methods for Stochastic Optimization 39 of 49
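A sketch of one way to pick the damping parameter: scan for the smallest $\beta$ such that $v(\beta)$ satisfies the bounds (*) from Theorem 3. The grid scan and the form $v(\beta) = \beta s_k + (1-\beta)\tilde y_k$ (matching Algorithm SC on the next slide with $\tilde y_k = \alpha_k y_k$) are illustrative; the paper's actual selection rule may differ:

```python
import numpy as np

def select_beta(s, y_tilde, eta, theta, num_grid=101):
    """Pick (approximately) the smallest beta in [0, 1] such that
    v(beta) = beta*s + (1-beta)*y_tilde satisfies the bounds (*) from Theorem 3:
        eta <= s^T v / ||s||^2   and   ||v||^2 / (s^T v) <= theta.
    Note: beta = 1 gives v = s, which satisfies (*) whenever eta <= 1 <= theta."""
    for beta in np.linspace(0.0, 1.0, num_grid):
        v = beta * s + (1.0 - beta) * y_tilde
        sv = float(s @ v)
        if sv > 0.0 and eta <= sv / float(s @ s) and float(v @ v) / sv <= theta:
            return beta, v
    return 1.0, s.copy()
```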

60 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 40 of 49

61 Algorithm SC: Self-Correcting BFGS Algorithm
1: Choose $w_1 \in \mathbb{R}^d$.
2: Set $g_1 \approx \nabla f(w_1)$.
3: Choose a symmetric positive definite $M_1 \in \mathbb{R}^{d \times d}$.
4: Choose a positive scalar sequence $\{\alpha_k\}$.
5: for $k = 1, 2, \dots$ do
6: Set $s_k \leftarrow -\alpha_k M_k g_k$.
7: Set $w_{k+1} \leftarrow w_k + s_k$.
8: Set $g_{k+1} \approx \nabla f(w_{k+1})$.
9: Set $y_k \leftarrow g_{k+1} - g_k$.
10: Set $\beta_k \leftarrow \min\{\beta \in [0, 1] : v(\beta) := \beta s_k + (1 - \beta) \alpha_k y_k$ satisfies (*)$\}$.
11: Set $v_k \leftarrow v(\beta_k)$.
12: Set $M_{k+1} \leftarrow \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big)^T M_k \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big) + \frac{s_k s_k^T}{s_k^T v_k}$.
13: end for
Second-Order Methods for Stochastic Optimization 41 of 49
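A minimal Python sketch of Algorithm SC, reusing the select_beta helper sketched above; stoch_grad, eta, and theta are illustrative assumptions (the slides do not specify these values):

```python
import numpy as np

def sc_bfgs(stoch_grad, w1, alphas, num_iters, eta=1e-2, theta=1e2):
    """Sketch of Algorithm SC: stochastic gradients g_k, steps s_k = -alpha_k M_k g_k,
    damped pairs v_k = beta_k s_k + (1 - beta_k) alpha_k y_k chosen to satisfy (*),
    and the BFGS-type inverse Hessian update of M_k."""
    d = len(w1)
    w, M = np.array(w1, dtype=float), np.eye(d)
    g = stoch_grad(w)                        # stochastic gradient estimate at w_1
    for k in range(num_iters):
        s = -alphas[k] * (M @ g)
        w_next = w + s
        g_next = stoch_grad(w_next)
        y = g_next - g
        _, v = select_beta(s, alphas[k] * y, eta, theta)   # damping, see sketch above
        sv = float(s @ v)
        V = np.eye(d) - np.outer(v, s) / sv
        M = V.T @ M @ V + np.outer(s, s) / sv              # BFGS-type inverse update
        w, g = w_next, g_next
    return w
```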

62 Global convergence theorem Theorem (Bottou, Curtis, Nocedal (2016)). Suppose that, for all $k$, there exists a scalar constant $\rho > 0$ such that $\nabla f(w_k)^T \mathbb{E}_{\xi_k}[M_k g_k] \geq \rho \|\nabla f(w_k)\|_2^2$, and there exist scalars $\sigma > 0$ and $\tau > 0$ such that $\mathbb{E}_{\xi_k}[\|M_k g_k\|_2^2] \leq \sigma + \tau \|\nabla f(w_k)\|_2^2$. Then, $\{\mathbb{E}[f(w_k)]\}$ converges to a finite limit and $\liminf_{k \to \infty} \mathbb{E}[\|\nabla f(w_k)\|] = 0$. Proof technique. Follows from the critical inequality $\mathbb{E}_{\xi_k}[f(w_{k+1})] \leq f(w_k) - \alpha_k \nabla f(w_k)^T \mathbb{E}_{\xi_k}[M_k g_k] + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|M_k g_k\|_2^2]$. Second-Order Methods for Stochastic Optimization 42 of 49

63 Numerical Experiments: a1a logistic regression, data a1a, diminishing stepsizes Second-Order Methods for Stochastic Optimization 43 of 49

64 Numerical Experiments: rcv1 SC-L and SC-L-s: limited memory variants of SC and SC-s, respectively: logistic regression, data rcv1, diminishing stepsizes Second-Order Methods for Stochastic Optimization 44 of 49

65 Numerical Experiments: mnist deep neural network, data mnist, diminishing stepsizes Second-Order Methods for Stochastic Optimization 45 of 49

66 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 46 of 49

67 Why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) For fast local convergence guarantees? Eh, probably not... Hard to achieve, especially in large-scale, nonsmooth, or stochastic settings Then why? Adaptive, natural scaling (gradient descent stepsizes $\sim 1/L$ while Newton's $\sim 1$) Mitigate effects of ill-conditioning Easier to tune parameters(?) Better at avoiding saddle points(?) Better trade-off in parallel and distributed computing settings (Also, opportunities for NEW algorithms! Not analyzing the same old... ) Second-Order Methods for Stochastic Optimization 47 of 49

68 Message of this talk People want to solve more complicated, nonconvex problems... involving stochasticity / randomness... involving nonsmoothness We might waste this spotlight on nonconvex optimization if we do not... Make clear the gap between theory and practice (and close it!) Learn from advances that have already been made... and adapt them appropriately for modern problems Second-Order Methods for Stochastic Optimization 48 of 49

69 References For references, please see Please also visit the Lehigh website! Second-Order Methods for Stochastic Optimization 49 of 49
