Second-Order Methods for Stochastic Optimization


1 Second-Order Methods for Stochastic Optimization Frank E. Curtis, Lehigh University, involving joint work with Léon Bottou (Facebook AI Research) and Jorge Nocedal (Northwestern University). Optimization Methods for Large-Scale Machine Learning, Columbia University, Department of IEOR, 16 October 2017. Second-Order Methods for Stochastic Optimization 1 of 49

2 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 2 of 49

3 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 3 of 49

4 Problem statement Consider the problem to find $w \in \mathbb{R}^d$ that minimizes $f$ subject to being in $W \subseteq \mathbb{R}^d$: $\min_{w \in W} f(w)$. (P) Interested in algorithms for solving (P) when $f$ might not be convex. Nonconvex optimization is experiencing a heyday! nonlinear least squares, training deep neural networks, subspace clustering, ... Second-Order Methods for Stochastic Optimization 4 of 49

5 History Nonlinear optimization has had parallel developments: the convexity world (Rockafellar, Fenchel, Nemirovski, Nesterov; built on the subgradient inequality; convergence and complexity guarantees) and the smoothness world (Powell, Fletcher, Goldfarb, Nocedal; built on sufficient decrease; convergence and fast local convergence). These worlds are (finally) colliding! Where should emphasis be placed? Second-Order Methods for Stochastic Optimization 5 of 49

6 First- versus second-order First-order methods follow a steepest descent methodology: $w_{k+1} \leftarrow w_k - \alpha_k \nabla f(w_k)$. Second-order methods follow Newton's methodology: $w_{k+1} \leftarrow w_k - \alpha_k [\nabla^2 f(w_k)]^{-1} \nabla f(w_k)$, which one should view as minimizing a quadratic model of $f$ at $w_k$: $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} (w - w_k)^T \nabla^2 f(w_k) (w - w_k)$. Second-Order Methods for Stochastic Optimization 6 of 49

7 First- versus quasi-second-order First-order methods follow a steepest descent methodology: $w_{k+1} \leftarrow w_k - \alpha_k \nabla f(w_k)$. Second-order methods follow Newton's methodology: $w_{k+1} \leftarrow w_k - \alpha_k M_k \nabla f(w_k)$, which one should view as minimizing a quadratic model of $f$ at $w_k$: $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} (w - w_k)^T H_k (w - w_k)$. Might also replace the Hessian with an approximation $H_k$ with inverse $M_k$. Second-Order Methods for Stochastic Optimization 6 of 49
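A minimal Python sketch of the three update rules just described, assuming callables grad_f and hess_f that return the gradient and Hessian of f, and an inverse Hessian approximation M (all illustrative names, not from the slides):

```python
import numpy as np

def gradient_step(w, grad_f, alpha):
    """First-order step: w_{k+1} = w_k - alpha_k * grad f(w_k)."""
    return w - alpha * grad_f(w)

def newton_step(w, grad_f, hess_f, alpha=1.0):
    """Second-order step: w_{k+1} = w_k - alpha_k * [Hessian f(w_k)]^{-1} grad f(w_k)."""
    return w - alpha * np.linalg.solve(hess_f(w), grad_f(w))

def quasi_newton_step(w, grad_f, M, alpha=1.0):
    """Quasi-second-order step with an inverse Hessian approximation M_k."""
    return w - alpha * (M @ grad_f(w))
```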

8 Why second-order??? Traditional motivation: Fast local convergence guarantees Recent motivation: Better complexity properties Second-Order Methods for Stochastic Optimization 7 of 49

9 Why second-order??? Traditional motivation: Fast local convergence guarantees Recent motivation: Better complexity properties However, I believe these convey the wrong message, especially when problems... involve stochasticity / randomness... involve nonsmoothness I believe there are other more appropriate motivations Second-Order Methods for Stochastic Optimization 7 of 49

10 Early 2010's Complexity guarantees for nonconvex optimization algorithms: iterations or function/derivative evaluations to achieve $\|\nabla f(w_k)\|_2 \leq \epsilon$. Steepest descent (first-order): $O(\epsilon^{-2})$. Line search (second-order): $O(\epsilon^{-2})$. Trust region (second-order): $O(\epsilon^{-2})$. Cubic regularization (second-order): $O(\epsilon^{-3/2})$. Cubic regularization has a longer history, but picks up steam in the early 2010's: Griewank (1981); Nesterov & Polyak (2006); Weiser, Deuflhard, Erdmann (2007); Cartis, Gould, Toint (2011), the ARC method. Second-Order Methods for Stochastic Optimization 8 of 49

11 Theory vs. practice Researchers have been gravitating to adopt and build on cubic regularization: Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017) Carmon, Duchi (2017) Kohler, Lucchi (2017) Peng, Roosta-Khorasani, Mahoney (2017) However, there remains a large gap between theory and practice! Second-Order Methods for Stochastic Optimization 9 of 49

12 Theory vs. practice Researchers have been gravitating to adopt and build on cubic regularization: Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017) Carmon, Duchi (2017) Kohler, Lucchi (2017) Peng, Roosta-Khorasani, Mahoney (2017) However, there remains a large gap between theory and practice! Little evidence that cubic regularization methods offer improved performance: Trust region (TR) methods remain the state-of-the-art TR-like methods can achieve the same complexity guarantees Second-Order Methods for Stochastic Optimization 9 of 49

13 Trust region methods with optimal complexity Second-Order Methods for Stochastic Optimization 10 of 49

14 So, why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) Second-Order Methods for Stochastic Optimization 11 of 49

15 So, why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) For fast local convergence guarantees? Eh, probably not... Hard to achieve, especially in large-scale, nonsmooth, or stochastic settings Second-Order Methods for Stochastic Optimization 11 of 49

16 So, why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) For fast local convergence guarantees? Eh, probably not... Hard to achieve, especially in large-scale, nonsmooth, or stochastic settings Then why? Adaptive, natural scaling (gradient descent stepsizes $\sim 1/L$ while Newton's $\sim 1$) Mitigate effects of ill-conditioning Easier to tune parameters(?) Better at avoiding saddle points(?) Better trade-off in parallel and distributed computing settings (Also, opportunities for NEW algorithms! Not analyzing the same old... ) Second-Order Methods for Stochastic Optimization 11 of 49

17 Message of this talk People want to solve more complicated, nonconvex problems... involving stochasticity / randomness... involving nonsmoothness We might waste this spotlight on nonconvex optimization if we do not... Make clear the gap between theory and practice (and close it!) Learn from advances that have already been made... and adapt them appropriately for modern problems Second-Order Methods for Stochastic Optimization 12 of 49

18 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 13 of 49

19 Stochastic optimization Over a parameter vector $w \in \mathbb{R}^d$ and given a loss $\ell(\cdot\,; y)$ (loss w.r.t. true label $y$) and a prediction function $h(w; x)$ (prediction w.r.t. features $x$), consider the unconstrained optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where $f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)]$. Second-Order Methods for Stochastic Optimization 14 of 49

20 Stochastic optimization Over a parameter vector $w \in \mathbb{R}^d$ and given a loss $\ell(\cdot\,; y)$ (loss w.r.t. true label $y$) and a prediction function $h(w; x)$ (prediction w.r.t. features $x$), consider the unconstrained optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where $f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)]$. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, the approximate problem is given by $\min_{w \in \mathbb{R}^d} f_n(w)$, where $f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i)$. Second-Order Methods for Stochastic Optimization 14 of 49

21 Stochastic optimization Over a parameter vector $w \in \mathbb{R}^d$ and given a loss $\ell(\cdot\,; y)$ (loss w.r.t. true label $y$) and a prediction function $h(w; x)$ (prediction w.r.t. features $x$), consider the unconstrained optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where $f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)]$. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, the approximate problem is given by $\min_{w \in \mathbb{R}^d} f_n(w)$, where $f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(w; x_i), y_i)$. For this talk, let's assume $f$ is continuously differentiable, bounded below, and potentially nonconvex; $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(w) - \nabla f(\bar w)\|_2 \leq L \|w - \bar w\|_2$. Second-Order Methods for Stochastic Optimization 14 of 49
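To make $f_n$ and its gradient concrete, here is a minimal sketch for the logistic loss (the loss used in the experiments later in the talk); the function name and the $\pm 1$ label encoding are illustrative assumptions, not from the slides:

```python
import numpy as np

def empirical_risk_and_grad(w, X, y):
    """f_n(w) = (1/n) sum_i log(1 + exp(-y_i * x_i^T w)) and its gradient,
    for a feature matrix X (n x d) and labels y in {-1, +1}^n (logistic loss)."""
    z = y * (X @ w)                        # margins y_i * x_i^T w
    loss = np.mean(np.logaddexp(0.0, -z))  # numerically stable log(1 + exp(-z))
    grad = -(X.T @ (y / (1.0 + np.exp(z)))) / len(y)
    return loss, grad
```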

22 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f(w_k)$ at the current iterate $w_k$]
Second-Order Methods for Stochastic Optimization 15 of 49

23 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f(w_k)$ at $w_k$; what does $f(w)$ look like elsewhere?]
Second-Order Methods for Stochastic Optimization 15 of 49

24 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f$ at $w_k$ together with the quadratic upper bound $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{L}{2} \|w - w_k\|_2^2$]
Second-Order Methods for Stochastic Optimization 15 of 49

25 Gradient descent
Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for
[Figure: $f$ at $w_k$ together with the quadratic upper bound $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{L}{2} \|w - w_k\|_2^2$ and the quadratic lower bound $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{c}{2} \|w - w_k\|_2^2$]
Second-Order Methods for Stochastic Optimization 15 of 49
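A minimal Python sketch of Algorithm GD, assuming a callable grad_f for $\nabla f$ (the name is illustrative):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Algorithm GD with a fixed stepsize; alpha in (0, 1/L] gives the descent
    guarantee stated in the theorem on the next slide."""
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w -= alpha * grad_f(w)
    return w
```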

26 GD theory Theorem GD. If $\alpha \in (0, 1/L]$, then $\sum_{k=0}^{\infty} \|\nabla f(w_k)\|_2^2 < \infty$, which implies $\{\nabla f(w_k)\} \to 0$. Proof. $f(w_{k+1}) \leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{L}{2} \|w_{k+1} - w_k\|_2^2 \leq f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2$. Second-Order Methods for Stochastic Optimization 16 of 49

27 GD theory Theorem GD. If $\alpha \in (0, 1/L]$, then $\sum_{k=0}^{\infty} \|\nabla f(w_k)\|_2^2 < \infty$, which implies $\{\nabla f(w_k)\} \to 0$. If, in addition, $f$ is $c$-strongly convex, then for all $k \geq 1$: $f(w_k) - f_* \leq (1 - \alpha c)^k (f(w_0) - f_*)$. Proof. $f(w_{k+1}) \leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{L}{2} \|w_{k+1} - w_k\|_2^2 \leq f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2 \leq f(w_k) - \alpha c (f(w_k) - f_*)$, so $f(w_{k+1}) - f_* \leq (1 - \alpha c)(f(w_k) - f_*)$. Second-Order Methods for Stochastic Optimization 16 of 49

28 GD illustration Figure: GD with fixed stepsize Second-Order Methods for Stochastic Optimization 17 of 49

29 Stochastic gradient descent Approximate the gradient only; e.g., pick a random $i_k$ and use $\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \approx \nabla f(w)$.
Algorithm SG: Stochastic Gradient
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsizes $\{\alpha_k\} > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha_k g_k$, where $g_k \approx \nabla f(w_k)$
4: end for
Second-Order Methods for Stochastic Optimization 18 of 49

30 Stochastic gradient descent Approximate the gradient only; e.g., pick a random $i_k$ and use $\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \approx \nabla f(w)$.
Algorithm SG: Stochastic Gradient
1: choose an initial point $w_0 \in \mathbb{R}^n$ and stepsizes $\{\alpha_k\} > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3: set $w_{k+1} \leftarrow w_k - \alpha_k g_k$, where $g_k \approx \nabla f(w_k)$
4: end for
Not a descent method!... but one can guarantee eventual descent in expectation (with $\mathbb{E}_k[g_k] = \nabla f(w_k)$): $f(w_{k+1}) \leq f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{L}{2} \|w_{k+1} - w_k\|_2^2 = f(w_k) - \alpha_k \nabla f(w_k)^T g_k + \tfrac{1}{2} \alpha_k^2 L \|g_k\|_2^2$, so $\mathbb{E}_k[f(w_{k+1})] \leq f(w_k) - \alpha_k \|\nabla f(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_k[\|g_k\|_2^2]$. Markov process: $w_{k+1}$ depends only on $w_k$ and the random choice at iteration $k$.
Second-Order Methods for Stochastic Optimization 18 of 49
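A minimal Python sketch of Algorithm SG, assuming a callable grad_f_i(w, i) returning $\nabla_w \ell(h(w; x_i), y_i)$ and a stepsize schedule given as a callable (names are illustrative):

```python
import numpy as np

def stochastic_gradient(grad_f_i, n, w0, stepsize, num_iters, seed=0):
    """Algorithm SG: sample i_k uniformly from {0,...,n-1} and step along
    -alpha_k * g_k, where g_k = grad_f_i(w_k, i_k) is an unbiased estimate
    of grad f_n(w_k); stepsize(k) returns alpha_k (fixed or diminishing)."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(num_iters):
        i_k = int(rng.integers(n))
        w -= stepsize(k) * grad_f_i(w, i_k)
    return w

# Example of a diminishing-stepsize schedule alpha_k = O(1/k):
# stepsize = lambda k: 1.0 / (k + 1)
```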

31 SG theory Theorem SG. If $\mathbb{E}_k[\|g_k\|_2^2] \leq M + \|\nabla f(w_k)\|_2^2$, then: $\alpha_k = \frac{1}{L}$ $\Longrightarrow$ $\mathbb{E}\big[\frac{1}{k}\sum_{j=1}^{k} \|\nabla f(w_j)\|_2^2\big] \leq M$ (asymptotically in $k$); $\alpha_k = O(\frac{1}{k})$ $\Longrightarrow$ $\mathbb{E}\big[\sum_{j=1}^{\infty} \alpha_j \|\nabla f(w_j)\|_2^2\big] < \infty$. (*Assumed unbiased gradient estimates; see paper for more generality.) Second-Order Methods for Stochastic Optimization 19 of 49

32 SG theory Theorem SG. If $\mathbb{E}_k[\|g_k\|_2^2] \leq M + \|\nabla f(w_k)\|_2^2$, then: $\alpha_k = \frac{1}{L}$ $\Longrightarrow$ $\mathbb{E}\big[\frac{1}{k}\sum_{j=1}^{k} \|\nabla f(w_j)\|_2^2\big] \leq M$ (asymptotically in $k$); $\alpha_k = O(\frac{1}{k})$ $\Longrightarrow$ $\mathbb{E}\big[\sum_{j=1}^{\infty} \alpha_j \|\nabla f(w_j)\|_2^2\big] < \infty$. If, in addition, $f$ is $c$-strongly convex, then: $\alpha_k = \frac{1}{L}$ $\Longrightarrow$ $\mathbb{E}[f(w_k) - f_*] \to \frac{M}{2c}$; $\alpha_k = O(\frac{1}{k})$ $\Longrightarrow$ $\mathbb{E}[f(w_k) - f_*] = O\big(\frac{(L/c)(M/c)}{k}\big)$. (*Assumed unbiased gradient estimates; see paper for more generality.) Second-Order Methods for Stochastic Optimization 19 of 49

33 SG illustration Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right) Second-Order Methods for Stochastic Optimization 20 of 49

34 Why SG over GD for large-scale machine learning? We have seen: GD: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$ (linear convergence); SG: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ (sublinear convergence). So why SG? Second-Order Methods for Stochastic Optimization 21 of 49

35 Why SG over GD for large-scale machine learning? We have seen: GD: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$ (linear convergence); SG: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ (sublinear convergence). So why SG? Motivation and explanation: Intuitive: data redundancy. Empirical: SG works well in practice vs. GD. Theoretical: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ and $\mathbb{E}[f(w_k) - f_*] = O(1/k)$. Second-Order Methods for Stochastic Optimization 21 of 49

36 Work complexity Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010). GD: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$, cost per iteration $O(n)$, cost for $\epsilon$-optimality $n \log(1/\epsilon)$. SG: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$, cost per iteration $O(1)$, cost for $\epsilon$-optimality $1/\epsilon$. Second-Order Methods for Stochastic Optimization 22 of 49

37 Work complexity Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010). GD: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$, cost per iteration $O(n)$, cost for $\epsilon$-optimality $n \log(1/\epsilon)$. SG: convergence rate $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$, cost per iteration $O(1)$, cost for $\epsilon$-optimality $1/\epsilon$. Considering the total (estimation + optimization) error as $\mathcal{E} = \mathbb{E}[f(w_n) - f(w_*)] + \mathbb{E}[f(\tilde w_n) - f(w_n)] \approx \frac{1}{n} + \epsilon$ and a time budget $T$, one finds: SG: process as many samples as possible ($n \propto T$), leading to $\mathcal{E} \approx \frac{1}{T}$. GD: with $n \propto T/\log(1/\epsilon)$, minimizing $\mathcal{E}$ yields $\epsilon \approx 1/T$ and $\mathcal{E} \approx \frac{1}{T} + \frac{\log(T)}{T}$. Second-Order Methods for Stochastic Optimization 22 of 49
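A back-of-the-envelope sketch of the $n \log(1/\epsilon)$ versus $1/\epsilon$ comparison, with all constants ignored (the numbers in the comments are only illustrative):

```python
import math

def gd_work(n, eps):
    """Total GD work: O(log(1/eps)) iterations, each costing O(n) sample gradients."""
    return n * math.log(1.0 / eps)

def sg_work(eps):
    """Total SG work: O(1/eps) iterations, each costing O(1) sample gradients."""
    return 1.0 / eps

# With n = 1e6 samples and eps = 1e-3 (constants ignored):
# gd_work(1e6, 1e-3) ~ 6.9e6 sample gradients, while sg_work(1e-3) ~ 1e3,
# illustrating why SG wins when n is large relative to 1/eps.
```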

38 End of the story? SG is great! Let's keep proving how great it is! Stability of SG; Hardt, Recht, Singer (2015) SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)... (many more) Second-Order Methods for Stochastic Optimization 23 of 49

39 End of the story? SG is great! Let's keep proving how great it is! Stability of SG; Hardt, Recht, Singer (2015) SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)... (many more) No, we should want more... SG requires a lot of tuning Sublinear convergence is not satisfactory... a linearly convergent method eventually wins... with higher budget, faster computation, parallel?, distributed? Also, any gradient-based method is not scale invariant. Second-Order Methods for Stochastic Optimization 23 of 49

40 Stochastic Optimization: No Parameter Tuning Limited memory stochastic gradient method (extends Barzilai-Borwein): $x_{k+1} \leftarrow x_k - \alpha_k g_k$, where $\alpha_k > 0$ is chosen adaptively. Minimizing logistic loss for binary classification with the RCV1 dataset Second-Order Methods for Stochastic Optimization 24 of 49

41 Stochastic Optimization: Avoiding Saddle Points / Stagnation Training a convolutional neural network for classifying digits in mnist: Stochastic-gradient-type method versus one that follows negative curvature: Overcomes slow initial progress by SG-type method... Second-Order Methods for Stochastic Optimization 25 of 49

42 Stochastic Optimization: Avoiding Saddle Points / Stagnation Training a convolutional neural network for classifying digits in mnist: Stochastic-gradient-type method versus one that follows negative curvature:... while still yielding good behavior in terms of testing accuracy Second-Order Methods for Stochastic Optimization 25 of 49

43 What can be improved? stochastic gradient better rate better constant Second-Order Methods for Stochastic Optimization 26 of 49

44 What can be improved? stochastic gradient better rate better constant better rate and better constant Second-Order Methods for Stochastic Optimization 26 of 49

45 Two-dimensional schematic of methods stochastic gradient batch gradient second-order noise reduction stochastic Newton batch Newton Second-Order Methods for Stochastic Optimization 27 of 49

46 2D schematic: Noise reduction methods stochastic gradient batch gradient noise reduction dynamic sampling gradient aggregation iterate averaging stochastic Newton Second-Order Methods for Stochastic Optimization 28 of 49

47 2D schematic: Second-order methods stochastic gradient batch gradient stochastic Newton second-order diagonal scaling natural gradient Gauss-Newton quasi-newton Hessian-free Newton Second-Order Methods for Stochastic Optimization 29 of 49

48 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 30 of 49

49 Quasi-Newton Only approximate second-order information with gradient displacements: [Figure: iterates $x_k$ and $x_{k+1}$] Secant equation $H_k v_k = s_k$ to match the gradient of $f$ at $w_k$, where $s_k := w_{k+1} - w_k$ and $v_k := \nabla f(w_{k+1}) - \nabla f(w_k)$ Second-Order Methods for Stochastic Optimization 31 of 49

50 Previous work: BFGS-type methods Much focus on the secant equation ($H_{k+1}$ a Hessian approximation) $H_{k+1} s_k = y_k$, where $s_k := w_{k+1} - w_k$ and $y_k := \nabla f(w_{k+1}) - \nabla f(w_k)$, and on an appropriate replacement for the gradient displacement: either $y_k \leftarrow \nabla f(w_{k+1}, \xi_k) - \nabla f(w_k, \xi_k)$ (use same seed: oLBFGS, Schraudolph et al. (2007); SGD-QN, Bordes et al. (2009); RES, Mokhtari & Ribeiro (2014)) or $y_k \leftarrow \big(\sum_{i \in S^H_k} \nabla^2 f(w_{k+1}, \xi_{k+1,i})\big) s_k$ (use action of step on subsampled Hessian: SQN, Byrd et al. (2015); Goldfarb et al. (2016)). Is this the right focus? Is there a better way (especially for nonconvex $f$)? Second-Order Methods for Stochastic Optimization 32 of 49

51 Proposal Propose a quasi-Newton method for stochastic (nonconvex) optimization that exploits the self-correcting properties of BFGS-type updates: Powell (1976) Ritter (1979, 1981) Werner (1978) Byrd, Nocedal (1989) Second-Order Methods for Stochastic Optimization 33 of 49

52 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 34 of 49

53 BFGS-type updates Inverse Hessian and Hessian approximation updating formulas (requiring $s_k^T v_k > 0$): $M_{k+1} \leftarrow \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big)^T M_k \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big) + \frac{s_k s_k^T}{s_k^T v_k}$ and $H_{k+1} \leftarrow \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big)^T H_k \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big) + \frac{v_k v_k^T}{s_k^T v_k}$. These satisfy secant-type equations $M_{k+1} v_k = s_k$ and $H_{k+1} s_k = v_k$, but these are not critical for this talk. Second-Order Methods for Stochastic Optimization 35 of 49
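A minimal sketch of the two updating formulas above; the expanded form used for the $H_k$ update is algebraically equivalent to the projected expression on the slide:

```python
import numpy as np

def bfgs_update(M, H, s, v):
    """BFGS-type updates of the inverse Hessian approximation M_k and Hessian
    approximation H_k from a pair (s_k, v_k) with s_k^T v_k > 0."""
    sv = float(s @ v)
    assert sv > 0.0, "curvature condition s^T v > 0 required"
    V = np.eye(len(s)) - np.outer(v, s) / sv
    M_new = V.T @ M @ V + np.outer(s, s) / sv
    Hs = H @ s
    H_new = H - np.outer(Hs, Hs) / float(s @ Hs) + np.outer(v, v) / sv
    # Secant-type equations hold: M_new @ v == s and H_new @ s == v.
    return M_new, H_new
```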

54 Geometric properties of Hessian update: Burke, Lewis, Overton (2007) Consider the matrices (which only depend on $s_k$ and $H_k$, not $g_k$!) $P_k := \frac{s_k s_k^T H_k}{s_k^T H_k s_k}$ and $Q_k := I - P_k$. Both are $H_k$-orthogonal projection matrices (i.e., idempotent and $H_k$-self-adjoint). $P_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)$. $Q_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)^{\perp_{H_k}}$. Second-Order Methods for Stochastic Optimization 36 of 49

55 Geometric properties of Hessian update: Burke, Lewis, Overton (2007) Consider the matrices (which only depend on $s_k$ and $H_k$, not $g_k$!) $P_k := \frac{s_k s_k^T H_k}{s_k^T H_k s_k}$ and $Q_k := I - P_k$. Both are $H_k$-orthogonal projection matrices (i.e., idempotent and $H_k$-self-adjoint). $P_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)$. $Q_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)^{\perp_{H_k}}$. Returning to the Hessian update $H_{k+1} \leftarrow \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big)^T H_k \big(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\big) + \frac{v_k v_k^T}{s_k^T v_k}$: the first term has rank $n-1$ and projects curvature out along $\mathrm{span}(s_k)$; the second term has rank $1$ and corrects the curvature, since $\frac{v_k v_k^T}{s_k^T v_k} = \big(\frac{v_k v_k^T}{\|v_k\|_2^2}\big) \frac{\|v_k\|_2^2}{v_k^T M_{k+1} v_k}$ (an inverse Rayleigh quotient). Second-Order Methods for Stochastic Optimization 36 of 49
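A quick numerical sanity check, for a randomly generated positive definite $H_k$ and step $s_k$, that $P_k$ and $Q_k$ behave as stated (idempotent, $H_k$-self-adjoint, and projecting onto $\mathrm{span}(s_k)$ and its $H_k$-orthogonal complement); this is only an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)                # a symmetric positive definite H_k
s = rng.standard_normal(d)                 # a step s_k

P = np.outer(s, s) @ H / float(s @ H @ s)  # P_k = s_k s_k^T H_k / (s_k^T H_k s_k)
Q = np.eye(d) - P                          # Q_k = I - P_k

assert np.allclose(P @ P, P) and np.allclose(Q @ Q, Q)  # idempotent
assert np.allclose(H @ P, P.T @ H)                      # H_k-self-adjoint
x = rng.standard_normal(d)
assert np.allclose(P @ x, s * float(s @ H @ x) / float(s @ H @ s))  # range(P) = span(s_k)
assert abs(float(s @ H @ (Q @ x))) < 1e-8               # Q_k x is H_k-orthogonal to s_k
```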

56 Self-correcting properties of Hessian update Since curvature is constantly projected out, what happens after many updates? Second-Order Methods for Stochastic Optimization 37 of 49

57 Self-correcting properties of Hessian update Since curvature is constantly projected out, what happens after many updates? Theorem 3 (Byrd, Nocedal (1989)). Suppose that, for all $k$, there exist $\{\eta, \theta\} \subset \mathbb{R}_{++}$ such that $\eta \leq \frac{s_k^T v_k}{\|s_k\|_2^2}$ and $\frac{\|v_k\|_2^2}{s_k^T v_k} \leq \theta$. (*) Then, for any $p \in (0, 1)$, there exist constants $\{\iota, \kappa, \lambda\} \subset \mathbb{R}_{++}$ such that, for any $K \geq 2$, the following relations hold for at least $pK$ values of $k \in \{1, \dots, K\}$: $\iota \leq \frac{s_k^T H_k s_k}{\|s_k\|_2 \|H_k s_k\|_2}$ and $\kappa \leq \frac{\|H_k s_k\|_2}{\|s_k\|_2} \leq \lambda$. Proof technique. Building on work of Powell (1976), involves bounding growth of $\gamma(H_k) = \mathrm{tr}(H_k) - \ln(\det(H_k))$. Second-Order Methods for Stochastic Optimization 37 of 49

58 Self-correcting properties of inverse Hessian update Rather than focus on superlinear convergence results, we care about the following. Corollary 4. Suppose the conditions of Theorem 3 hold. Then, for any $p \in (0, 1)$, there exist constants $\{\mu, \nu\} \subset \mathbb{R}_{++}$ such that, for any $K \geq 2$, the following relations hold for at least $pK$ values of $k \in \{1, \dots, K\}$: $\mu \|g_k\|_2^2 \leq g_k^T M_k g_k$ and $\|M_k g_k\|_2^2 \leq \nu \|g_k\|_2^2$. Here $g_k$ is the vector such that the iterate displacement is $w_{k+1} - w_k = s_k = -M_k g_k$. Proof sketch. Follows simply after algebraic manipulations from the result of Theorem 3, using the facts that $s_k = -M_k g_k$ and $M_k = H_k^{-1}$ for all $k$. Second-Order Methods for Stochastic Optimization 38 of 49

59 Summary Our main idea is to use a carefully selected type of damping: choosing $v_k \leftarrow y_k := g_{k+1} - g_k$ yields standard BFGS, but we consider $v_k \leftarrow \beta_k H_k s_k + (1 - \beta_k) \tilde y_k$ for some $\beta_k \in [0, 1]$ and $\tilde y_k \in \mathbb{R}^n$. This scheme preserves the self-correcting properties of BFGS. Second-Order Methods for Stochastic Optimization 39 of 49
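A sketch of one way to pick the damping parameter: scan for the smallest $\beta$ such that $v(\beta)$ satisfies the bounds (*) from Theorem 3. The grid scan and the form $v(\beta) = \beta s_k + (1-\beta)\tilde y_k$ (matching Algorithm SC on the next slide with $\tilde y_k = \alpha_k y_k$) are illustrative; the paper's actual selection rule may differ:

```python
import numpy as np

def select_beta(s, y_tilde, eta, theta, num_grid=101):
    """Pick (approximately) the smallest beta in [0, 1] such that
    v(beta) = beta*s + (1-beta)*y_tilde satisfies the bounds (*) from Theorem 3:
        eta <= s^T v / ||s||^2   and   ||v||^2 / (s^T v) <= theta.
    Note: beta = 1 gives v = s, which satisfies (*) whenever eta <= 1 <= theta."""
    for beta in np.linspace(0.0, 1.0, num_grid):
        v = beta * s + (1.0 - beta) * y_tilde
        sv = float(s @ v)
        if sv > 0.0 and eta <= sv / float(s @ s) and float(v @ v) / sv <= theta:
            return beta, v
    return 1.0, s.copy()
```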

60 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 40 of 49

61 Algorithm SC: Self-Correcting BFGS Algorithm
1: Choose $w_1 \in \mathbb{R}^d$.
2: Set $g_1 \approx \nabla f(w_1)$.
3: Choose a symmetric positive definite $M_1 \in \mathbb{R}^{d \times d}$.
4: Choose a positive scalar sequence $\{\alpha_k\}$.
5: for $k = 1, 2, \dots$ do
6: Set $s_k \leftarrow -\alpha_k M_k g_k$.
7: Set $w_{k+1} \leftarrow w_k + s_k$.
8: Set $g_{k+1} \approx \nabla f(w_{k+1})$.
9: Set $y_k \leftarrow g_{k+1} - g_k$.
10: Set $\beta_k \leftarrow \min\{\beta \in [0, 1] : v(\beta) := \beta s_k + (1 - \beta) \alpha_k y_k$ satisfies (*)$\}$.
11: Set $v_k \leftarrow v(\beta_k)$.
12: Set $M_{k+1} \leftarrow \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big)^T M_k \big(I - \frac{v_k s_k^T}{s_k^T v_k}\big) + \frac{s_k s_k^T}{s_k^T v_k}$.
13: end for
Second-Order Methods for Stochastic Optimization 41 of 49
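A minimal Python sketch of Algorithm SC, reusing the select_beta helper sketched above; stoch_grad, eta, and theta are illustrative assumptions (the slides do not specify these values):

```python
import numpy as np

def sc_bfgs(stoch_grad, w1, alphas, num_iters, eta=1e-2, theta=1e2):
    """Sketch of Algorithm SC: stochastic gradients g_k, steps s_k = -alpha_k M_k g_k,
    damped pairs v_k = beta_k s_k + (1 - beta_k) alpha_k y_k chosen to satisfy (*),
    and the BFGS-type inverse Hessian update of M_k."""
    d = len(w1)
    w, M = np.array(w1, dtype=float), np.eye(d)
    g = stoch_grad(w)                        # stochastic gradient estimate at w_1
    for k in range(num_iters):
        s = -alphas[k] * (M @ g)
        w_next = w + s
        g_next = stoch_grad(w_next)
        y = g_next - g
        _, v = select_beta(s, alphas[k] * y, eta, theta)   # damping, see sketch above
        sv = float(s @ v)
        V = np.eye(d) - np.outer(v, s) / sv
        M = V.T @ M @ V + np.outer(s, s) / sv              # BFGS-type inverse update
        w, g = w_next, g_next
    return w
```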

62 Global convergence theorem Theorem (Bottou, Curtis, Nocedal (2016)). Suppose that, for all $k$, there exists a scalar constant $\rho > 0$ such that $\nabla f(w_k)^T \mathbb{E}_{\xi_k}[M_k g_k] \geq \rho \|\nabla f(w_k)\|_2^2$, and there exist scalars $\sigma > 0$ and $\tau > 0$ such that $\mathbb{E}_{\xi_k}[\|M_k g_k\|_2^2] \leq \sigma + \tau \|\nabla f(w_k)\|_2^2$. Then, $\{\mathbb{E}[f(w_k)]\}$ converges to a finite limit and $\liminf_{k \to \infty} \mathbb{E}[\|\nabla f(w_k)\|] = 0$. Proof technique. Follows from the critical inequality $\mathbb{E}_{\xi_k}[f(w_{k+1})] \leq f(w_k) - \alpha_k \nabla f(w_k)^T \mathbb{E}_{\xi_k}[M_k g_k] + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|M_k g_k\|_2^2]$. Second-Order Methods for Stochastic Optimization 42 of 49

63 Numerical Experiments: a1a logistic regression, data a1a, diminishing stepsizes Second-Order Methods for Stochastic Optimization 43 of 49

64 Numerical Experiments: rcv1 SC-L and SC-L-s: limited memory variants of SC and SC-s, respectively: logistic regression, data rcv1, diminishing stepsizes Second-Order Methods for Stochastic Optimization 44 of 49

65 Numerical Experiments: mnist deep neural network, data mnist, diminishing stepsizes Second-Order Methods for Stochastic Optimization 45 of 49

66 Outline Perspectives on Nonconvex Optimization GD, SG, and Beyond Stochastic Quasi-Newton Self-Correcting Properties of BFGS Proposed Algorithm: SC-BFGS Summary Second-Order Methods for Stochastic Optimization 46 of 49

67 Why second-order? For better complexity properties? Eh, not really... Many are no better than first-order methods in terms of complexity... and ones with better complexity aren't necessarily best in practice (yet) For fast local convergence guarantees? Eh, probably not... Hard to achieve, especially in large-scale, nonsmooth, or stochastic settings Then why? Adaptive, natural scaling (gradient descent stepsizes $\sim 1/L$ while Newton's $\sim 1$) Mitigate effects of ill-conditioning Easier to tune parameters(?) Better at avoiding saddle points(?) Better trade-off in parallel and distributed computing settings (Also, opportunities for NEW algorithms! Not analyzing the same old... ) Second-Order Methods for Stochastic Optimization 47 of 49

68 Message of this talk People want to solve more complicated, nonconvex problems... involving stochasticity / randomness... involving nonsmoothness We might waste this spotlight on nonconvex optimization if we do not... Make clear the gap between theory and practice (and close it!) Learn from advances that have already been made... and adapt them appropriately for modern problems Second-Order Methods for Stochastic Optimization 48 of 49

69 References For references, please see Please also visit the Lehigh website! Second-Order Methods for Stochastic Optimization 49 of 49
