Stochastic Optimization Algorithms Beyond SG
Frank E. Curtis¹, Lehigh University
involving joint work with Léon Bottou (Facebook AI Research) and Jorge Nocedal (Northwestern University)
Optimization Methods for Large-Scale Machine Learning, Google NYC, 3 November
¹ Mike's older brother
Outline
- GD and SG
- GD vs. SG
- Beyond SG
- Stochastic Quasi-Newton
- Self-Correcting Properties of BFGS
- Proposed Algorithm: SC-BFGS
- Summary
Stochastic optimization

Over a parameter vector $w \in \mathbb{R}^d$, and given a loss $\ell(\cdot\,; y)$ (w.r.t. true label $y$) and a prediction function $h(x; \cdot)$ (w.r.t. features $x$), consider the unconstrained optimization problem
$\min_{w \in \mathbb{R}^d} f(w)$, where $f(w) = \mathbb{E}_{(x,y)}[\ell(h(w; x), y)]$.

Given a training set $\{(x_i, y_i)\}_{i=1}^n$, the approximate (empirical) problem is given by
$\min_{w \in \mathbb{R}^d} f_n(w)$, where $f_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(h(x_i; w), y_i)$.

For this talk, let's assume:
- $f$ is continuously differentiable, bounded below, and potentially nonconvex;
- $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(w) - \nabla f(\bar w)\|_2 \le L \|w - \bar w\|_2$.

Focus on optimization algorithms, not data-fitting issues, regularization, etc.
Gradient descent

Algorithm GD: Gradient Descent
1: choose an initial point $w_0 \in \mathbb{R}^d$ and stepsize $\alpha > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3:   set $w_{k+1} \leftarrow w_k - \alpha \nabla f(w_k)$
4: end for

(Illustration: at $w_k$, the function lies between the upper quadratic model $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} L \|w - w_k\|_2^2$ and, under strong convexity with parameter $c$, the lower quadratic model $f(w_k) + \nabla f(w_k)^T (w - w_k) + \tfrac{1}{2} c \|w - w_k\|_2^2$.)
GD theory

Theorem GD. If $\alpha \in (0, 1/L]$, then $\sum_{k=0}^\infty \|\nabla f(w_k)\|_2^2 < \infty$, which implies $\{\nabla f(w_k)\} \to 0$. If, in addition, $f$ is $c$-strongly convex, then for all $k \ge 1$:
$f(w_k) - f_* \le (1 - \alpha c)^k (f(w_0) - f_*)$.

Proof.
$f(w_{k+1}) \le f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2$
$\phantom{f(w_{k+1})} \le f(w_k) - \tfrac{1}{2} \alpha \|\nabla f(w_k)\|_2^2$
$\phantom{f(w_{k+1})} \le f(w_k) - \alpha c (f(w_k) - f_*)$,
which gives $f(w_{k+1}) - f_* \le (1 - \alpha c)(f(w_k) - f_*)$.
GD illustration
Figure: GD with fixed stepsize
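The GD iteration and its linear rate can be sketched on a toy strongly convex quadratic (an assumed example, not from the talk): with $f(w) = \tfrac{1}{2}(3 w_1^2 + w_2^2)$ we have $L = 3$ and $c = 1$, and the fixed stepsize $\alpha = 1/L$ drives the iterates to the minimizer geometrically.

```python
# A minimal sketch of Algorithm GD on an assumed strongly convex quadratic:
# f(w) = 0.5 * (3*w[0]**2 + w[1]**2), so L = 3 and c = 1.
def grad(w):
    return [3.0 * w[0], 1.0 * w[1]]

def gd(w0, alpha, iters):
    w = list(w0)
    for _ in range(iters):
        g = grad(w)
        w = [w[i] - alpha * g[i] for i in range(2)]  # w <- w - alpha * grad f(w)
    return w

w = gd([2.0, -4.0], alpha=1.0 / 3.0, iters=50)  # alpha = 1/L
# Each step contracts f(w) - f_* by at least the factor (1 - alpha*c) = 2/3.
```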
Stochastic gradient descent

Approximate the gradient only; e.g., sample a random index $i_k$ and use $\nabla_w \ell(h(w; x_{i_k}), y_{i_k}) \approx \nabla f(w)$.

Algorithm SG: Stochastic Gradient
1: choose an initial point $w_0 \in \mathbb{R}^d$ and stepsizes $\{\alpha_k\} > 0$
2: for $k \in \{0, 1, 2, \dots\}$ do
3:   set $w_{k+1} \leftarrow w_k - \alpha_k g_k$, where $g_k \approx \nabla f(w_k)$
4: end for

Not a descent method!... but one can guarantee eventual descent in expectation (with $\mathbb{E}_k[g_k] = \nabla f(w_k)$):
$f(w_{k+1}) \le f(w_k) + \nabla f(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 = f(w_k) - \alpha_k \nabla f(w_k)^T g_k + \tfrac{1}{2} \alpha_k^2 L \|g_k\|_2^2$,
so $\mathbb{E}_k[f(w_{k+1})] \le f(w_k) - \alpha_k \|\nabla f(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_k[\|g_k\|_2^2]$.

Markov process: $w_{k+1}$ depends only on $w_k$ and the random choice at iteration $k$.
SG theory

Theorem SG. If $\mathbb{E}_k[\|g_k\|_2^2] \le M + \|\nabla f(w_k)\|_2^2$, then:
- $\alpha_k = \frac{1}{L}$ implies $\mathbb{E}\!\left[\frac{1}{k}\sum_{j=1}^{k} \|\nabla f(w_j)\|_2^2\right] \le M$;
- $\alpha_k = O\!\left(\frac{1}{k}\right)$ implies $\mathbb{E}\!\left[\sum_{j=1}^{k} \alpha_j \|\nabla f(w_j)\|_2^2\right] < \infty$.

If, in addition, $f$ is $c$-strongly convex, then:
- $\alpha_k = \frac{1}{L}$ implies $\mathbb{E}[f(w_k) - f_*] \to \frac{(M/c)}{2}$;
- $\alpha_k = O\!\left(\frac{1}{k}\right)$ implies $\mathbb{E}[f(w_k) - f_*] = O\!\left(\frac{(L/c)(M/c)}{k}\right)$.

(*Unbiased gradient estimates assumed; see the paper for more generality.)
SG illustration
Figure: SG with fixed stepsize (left) vs. diminishing stepsizes (right)
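The diminishing-stepsize variant can be sketched on a toy scalar problem (an assumed example): minimize $f(w) = \mathbb{E}[\tfrac{1}{2}(w - \xi)^2]$, whose minimizer is the mean of $\xi$, using the unbiased stochastic gradient $w - \xi$.

```python
import random

# A minimal SG sketch (assumed toy example): minimize f(w) = E[0.5*(w - xi)^2]
# over scalar w; the stochastic gradient w - xi is unbiased for grad f(w).
def sg(samples, w0, iters, seed=0):
    rng = random.Random(seed)
    w = w0
    for k in range(1, iters + 1):
        xi = rng.choice(samples)
        g = w - xi            # stochastic gradient estimate
        alpha = 1.0 / k       # diminishing stepsizes, alpha_k = O(1/k)
        w -= alpha * g
    return w

samples = [1.0, 2.0, 3.0, 4.0]
w = sg(samples, w0=0.0, iters=5000)  # minimizer of f is mean(samples) = 2.5
```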
Why SG over GD for large-scale machine learning?

We have seen:
- GD: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$ — linear convergence
- SG: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ — sublinear convergence

So why SG?
- Intuitive motivation: data redundancy.
- Practical motivation: SG vs. L-BFGS with a batch gradient (figure below).
- Theoretical motivation: $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$ and $\mathbb{E}[f(w_k) - f_*] = O(1/k)$.

(Figure: empirical risk vs. accessed data points (×10⁵); SGD vs. batch L-BFGS.)
Work complexity

Time, not data, as the limiting factor; Bottou, Bousquet (2008) and Bottou (2010).

  Method | Convergence rate                              | Cost per iteration | Cost for ε-optimality
  GD     | $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(\rho^k)$  | $O(n)$             | $n \log(1/\epsilon)$
  SG     | $\mathbb{E}[f_n(w_k) - f_{n,*}] = O(1/k)$     | $O(1)$             | $1/\epsilon$

Considering the total (estimation + optimization) error
$\mathcal{E} = \mathbb{E}[f(\bar w_n) - f(w_*)] + \mathbb{E}[f(\tilde w_n) - f(\bar w_n)] \approx \frac{1}{n} + \epsilon$
and a time budget $T$, one finds:
- SG: process as many samples as possible ($n \approx T$), leading to $\mathcal{E} \approx \frac{1}{T}$.
- GD: with $n \approx T / \log(1/\epsilon)$, minimizing $\mathcal{E}$ yields $\epsilon \approx 1/T$ and $\mathcal{E} \approx \frac{1}{T} + \frac{\log(T)}{T}$.
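The per-row costs in the table can be compared with a tiny back-of-the-envelope calculation (the numbers below are assumed for illustration only): at large $n$, GD's $n \log(1/\epsilon)$ cost dwarfs SG's $1/\epsilon$ cost.

```python
import math

# Illustrative arithmetic (assumed numbers): gradient-evaluation cost to
# reach epsilon-optimality, per the work-complexity table above.
def gd_cost(n, eps):
    return n * math.log(1.0 / eps)   # n * log(1/eps) for GD

def sg_cost(eps):
    return 1.0 / eps                 # 1/eps for SG

n, eps = 10**6, 1e-3
# With a million training points and eps = 1e-3, GD needs ~6.9e6 units of
# work while SG needs ~1e3, illustrating why SG wins at scale.
```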
End of the story?

SG is great! Let's keep proving how great it is!
- Stability of SG; Hardt, Recht, Singer (2015)
- SG avoids steep minima; Keskar, Mudigere, Nocedal, Smelyanskiy (2016)

No, we should want more...
- SG requires a lot of tuning.
- Sublinear convergence is not satisfactory:
  ... a linearly convergent method eventually wins
  ... with a higher budget: faster computation, parallel?, distributed?
- Also, any "gradient"-based method is not scale invariant.
What can be improved?

stochastic gradient →
- better rate
- better constant
- better rate and better constant
Two-dimensional schematic of methods

(Schematic: one axis runs from stochastic gradient to batch gradient via noise reduction; the other runs from stochastic gradient to stochastic Newton via second-order information; batch Newton sits at the far corner.)

Noise reduction methods (stochastic gradient → batch gradient):
- dynamic sampling
- gradient aggregation
- iterate averaging

Second-order methods (stochastic gradient → stochastic Newton):
- diagonal scaling
- natural gradient
- Gauss-Newton
- quasi-Newton
- Hessian-free Newton
Even more...
- momentum / acceleration
- (dual) coordinate descent
- trust region / step normalization
- exploring negative curvature
Scale invariance

Neither SG nor GD is invariant to linear transformations (for given $B \succ 0$):
$\min_{w \in \mathbb{R}^d} f(w) \implies w_{k+1} \leftarrow w_k - \alpha_k \nabla f(w_k)$
$\min_{\bar w \in \mathbb{R}^d} f(B \bar w) \implies \bar w_{k+1} \leftarrow \bar w_k - \alpha_k B \nabla f(B \bar w_k)$

Scaling the latter by $B$ and defining $\{w_k\} = \{B \bar w_k\}$ yields
$w_{k+1} \leftarrow w_k - \alpha_k B^2 \nabla f(w_k)$.
- The algorithm is clearly affected by the choice of $B$.
- Surely, some choices may be better than others.
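The equivalence between reparameterized GD and the $B^2$-preconditioned iteration can be checked numerically on a small assumed example (2D quadratic, diagonal $B$):

```python
# A small 2D check (assumed example): gradient descent on the reparameterized
# problem min_wbar f(B*wbar), mapped back through w = B*wbar, matches the
# preconditioned iteration w_+ = w - alpha * B^2 * grad f(w).
def gradf(w):                      # f(w) = 0.5*(3*w[0]**2 + w[1]**2)
    return [3.0 * w[0], w[1]]

B = [2.0, 0.5]                     # a diagonal B > 0, stored as its diagonal
alpha = 0.1
w = [1.0, -2.0]

# Path 1: step in wbar-space, then map back with w = B*wbar.
wbar = [w[i] / B[i] for i in range(2)]
gw = gradf([B[0] * wbar[0], B[1] * wbar[1]])      # grad f at B*wbar
gbar = [B[i] * gw[i] for i in range(2)]           # chain rule: B * grad f(B*wbar)
wbar_next = [wbar[i] - alpha * gbar[i] for i in range(2)]
w_path1 = [B[i] * wbar_next[i] for i in range(2)]

# Path 2: preconditioned step directly in w-space.
g = gradf(w)
w_path2 = [w[i] - alpha * B[i] * B[i] * g[i] for i in range(2)]
# w_path1 and w_path2 agree up to rounding.
```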
Newton scaling

Consider the function in the figure and suppose that $w_k = (0, 3)$:
- A GD step along $-\nabla f(w_k)$ ignores the curvature of the function.
- With Newton scaling ($B = (\nabla^2 f(w_k))^{-1/2}$), the gradient step moves to the minimizer.
- This corresponds to minimizing a quadratic model of $f$ in the original space:
$w_{k+1} \leftarrow w_k + \alpha_k s_k$, where $\nabla^2 f(w_k) s_k = -\nabla f(w_k)$.
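On a quadratic, a unit Newton step lands exactly on the minimizer; a sketch on an assumed 2D quadratic (not the talk's exact figure):

```python
# A minimal sketch of a Newton step on an assumed 2D quadratic:
# f(w) = 0.5 * w^T A w - b^T w, so grad f(w) = A w - b and hess f(w) = A.
A = [[3.0, 0.0],
     [0.0, 1.0]]
b = [0.0, 0.0]

def grad(w):
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

def solve2x2(M, rhs):
    # Cramer's rule for the Newton system M s = rhs.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(rhs[0] * M[1][1] - rhs[1] * M[0][1]) / det,
            (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / det]

w = [0.0, 3.0]                       # starting point from the slide
g = grad(w)
s = solve2x2(A, [-g[0], -g[1]])      # Newton system: A s = -grad f(w)
w = [w[0] + s[0], w[1] + s[1]]       # unit step (alpha_k = 1)
# For a quadratic, one unit Newton step reaches the minimizer exactly.
```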
Deterministic case to stochastic case

What is known about Newton's method for deterministic optimization?
- local rescaling based on inverse Hessian information
- unit steps are good near a strong minimizer (no tuning!)... locally quadratically convergent
- global convergence rate better than the gradient method (when regularized)

However, it is way too expensive. But all is not lost: scaling can be practical.
- A wide variety of scaling techniques improve performance.
- ... one could hope to remove the condition number $(L/c)$ from the convergence rate!
- Added costs can be minimal when coupled with noise reduction.
Quasi-Newton

Approximate second-order information using only gradient displacements from $w_k$ to $w_{k+1}$.

Secant equation $H_{k+1} s_k = v_k$ to match the gradient displacement of $f$ at $w_k$, where
$s_k := w_{k+1} - w_k$ and $v_k := \nabla f(w_{k+1}) - \nabla f(w_k)$.
Balance between extremes

For deterministic, smooth optimization, a nice balance is achieved by quasi-Newton methods:
$w_{k+1} \leftarrow w_k - \alpha_k M_k g_k$,
where $\alpha_k > 0$ is a stepsize, $g_k \approx \nabla f(w_k)$, and $\{M_k\}$ is updated dynamically.

Background on quasi-Newton:
- local rescaling of the step (overcomes ill-conditioning)
- only first-order derivatives required
- no linear system solves required
- global convergence guarantees (say, with a line search)
- superlinear local convergence rate

How can the idea be carried over to a stochastic setting?
Previous work: BFGS-type methods

Much focus on the secant equation ($H_{k+1} \approx$ Hessian approximation)
$H_{k+1} s_k = y_k$, where $s_k := w_{k+1} - w_k$ and $y_k := \nabla f(w_{k+1}) - \nabla f(w_k)$,
and an appropriate replacement for the gradient displacement:
- $y_k \approx \nabla f(w_{k+1}, \xi_k) - \nabla f(w_k, \xi_k)$ (use the same seed):
  oLBFGS, Schraudolph et al. (2007); SGD-QN, Bordes et al. (2009); RES, Mokhtari & Ribeiro (2014)
- $y_k \approx \left(\sum_{i \in S^H_k} \nabla^2 f(w_{k+1}, \xi_{k+1,i})\right) s_k$ (use the action of the step on a subsampled Hessian):
  SQN, Byrd et al. (2015)

Is this the right focus? Is there a better way (especially for nonconvex $f$)?
Proposal

Propose a quasi-Newton method for stochastic (nonconvex) optimization that exploits the self-correcting properties of BFGS-type updates: Powell (1976), Ritter (1979, 1981), Werner (1978), Byrd, Nocedal (1989).
- properties of Hessians offer useful bounds for inverse Hessians
- motivating convergence theory for convex and nonconvex objectives
- dynamic noise reduction strategy
- limited memory variant

Observed stable behavior and overall good performance.
BFGS-type updates

Inverse Hessian and Hessian approximation updating formulas (requiring $s_k^T v_k > 0$):
$M_{k+1} \leftarrow \left(I - \frac{v_k s_k^T}{s_k^T v_k}\right)^T M_k \left(I - \frac{v_k s_k^T}{s_k^T v_k}\right) + \frac{s_k s_k^T}{s_k^T v_k}$
$H_{k+1} \leftarrow \left(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\right)^T H_k \left(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\right) + \frac{v_k v_k^T}{s_k^T v_k}$

These satisfy the secant-type equations $M_{k+1} v_k = s_k$ and $H_{k+1} s_k = v_k$, but these are not relevant for our purposes here.

Choosing $v_k \leftarrow y_k := g_{k+1} - g_k$ yields standard BFGS, but in this talk
$v_k \leftarrow \beta_k s_k + (1 - \beta_k) \alpha_k y_k$ for some $\beta_k \in [0, 1]$.
This scheme is important to preserve the self-correcting properties.
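The secant-type property $M_{k+1} v_k = s_k$ of the inverse update can be verified numerically; a minimal 2×2 sketch with assumed data ($M$, $s$, $v$ chosen only for illustration, with $s^T v > 0$):

```python
# A minimal 2x2 check that the BFGS inverse update
#   M_new = (I - v s^T / s^T v)^T M (I - v s^T / s^T v) + s s^T / s^T v
# satisfies the secant-type equation M_new v = s.
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

def bfgs_inverse_update(M, s, v):
    stv = s[0] * v[0] + s[1] * v[1]          # s^T v, must be positive
    assert stv > 0.0
    E = [[(1.0 if i == j else 0.0) - v[i] * s[j] / stv
          for j in range(2)] for i in range(2)]    # E = I - v s^T / s^T v
    M_new = matmul(matmul(transpose(E), M), E)     # E^T M E
    return [[M_new[i][j] + s[i] * s[j] / stv for j in range(2)] for i in range(2)]

M = [[1.0, 0.0], [0.0, 1.0]]                 # e.g., M_1 = I (assumed)
s = [1.0, 2.0]                               # step s_k (assumed)
v = [0.5, 1.0]                               # chosen so that s^T v > 0 (assumed)
M_new = bfgs_inverse_update(M, s, v)
Mv = [M_new[0][0] * v[0] + M_new[0][1] * v[1],
      M_new[1][0] * v[0] + M_new[1][1] * v[1]]   # equals s up to rounding
```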
Geometric properties of the Hessian update

Consider the matrices (which depend only on $s_k$ and $H_k$, not $g_k$!)
$P_k := \frac{s_k s_k^T H_k}{s_k^T H_k s_k}$ and $Q_k := I - P_k$.
Both are $H_k$-orthogonal projection matrices (i.e., idempotent and $H_k$-self-adjoint).
- $P_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)$.
- $Q_k$ yields the $H_k$-orthogonal projection onto $\mathrm{span}(s_k)^{\perp_{H_k}}$.

Returning to the Hessian update:
$H_{k+1} \leftarrow \underbrace{\left(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\right)^T H_k \left(I - \frac{s_k s_k^T H_k}{s_k^T H_k s_k}\right)}_{\text{rank } d-1} + \underbrace{\frac{v_k v_k^T}{s_k^T v_k}}_{\text{rank } 1}$
- Curvature is projected out along $\mathrm{span}(s_k)$.
- Curvature is corrected by $\frac{v_k v_k^T}{s_k^T v_k} = \left(\frac{v_k v_k^T}{\|v_k\|_2^2}\right) \frac{\|v_k\|_2^2}{v_k^T M_{k+1} v_k}$ (inverse Rayleigh quotient).
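Idempotency of $P_k$ (the defining property of a projection) is easy to confirm numerically; a 2×2 sketch with an assumed symmetric positive definite $H$ and step $s$:

```python
# A small 2x2 sanity check that P = s (H s)^T / (s^T H s) is idempotent
# (P P = P), i.e., a projection matrix (here, H-orthogonal onto span(s)).
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

H = [[2.0, 0.0], [0.0, 1.0]]                 # symmetric positive definite (assumed)
s = [1.0, 1.0]                               # step direction (assumed)
Hs = [H[0][0] * s[0] + H[0][1] * s[1],
      H[1][0] * s[0] + H[1][1] * s[1]]       # H s
sHs = s[0] * Hs[0] + s[1] * Hs[1]            # s^T H s
P = [[s[i] * Hs[j] / sHs for j in range(2)] for i in range(2)]  # s (H s)^T / s^T H s
PP = matmul(P, P)                            # equals P, so P is idempotent
```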
Self-correcting properties of the Hessian update

Since curvature is constantly projected out, what happens after many updates?

Theorem SC (Byrd, Nocedal (1989)). Suppose that, for all $k$, there exist $\{\eta, \theta\} \subset \mathbb{R}_{++}$ such that
$\eta \le \frac{s_k^T v_k}{\|s_k\|_2^2}$ and $\frac{\|v_k\|_2^2}{s_k^T v_k} \le \theta$.   (KEY)
Then, for any $p \in (0, 1)$, there exist constants $\{\iota, \kappa, \lambda\} \subset \mathbb{R}_{++}$ such that, for any $K \ge 2$, the following relations hold for at least $pK$ values of $k \in \{1, \dots, K\}$:
$\iota \le \frac{s_k^T H_k s_k}{\|s_k\|_2 \|H_k s_k\|_2}$ and $\kappa \le \frac{\|H_k s_k\|_2}{\|s_k\|_2} \le \lambda$.

Proof technique. Building on work of Powell (1976), etc.; involves bounding the growth of $\gamma(H_k) = \mathrm{tr}(H_k) - \ln(\det(H_k))$.
Self-correcting properties of the inverse Hessian update

Rather than focus on superlinear convergence results, we care about the following.

Corollary SC. Suppose the conditions of Theorem SC hold. Then, for any $p \in (0, 1)$, there exist constants $\{\mu, \nu\} \subset \mathbb{R}_{++}$ such that, for any $K \ge 2$, the following relations hold for at least $pK$ values of $k \in \{1, \dots, K\}$:
$\mu \|g_k\|_2^2 \le g_k^T M_k g_k$ and $\|M_k g_k\|_2^2 \le \nu \|g_k\|_2^2$.

Proof sketch. Follows simply, after algebraic manipulations, from the result of Theorem SC, using the facts that $s_k = -\alpha_k M_k g_k$ and $M_k = H_k^{-1}$ for all $k$.
Algorithm SC: Self-Correcting BFGS Algorithm
1: Choose $w_1 \in \mathbb{R}^d$.
2: Set $g_1 \approx \nabla f(w_1)$.
3: Choose a symmetric positive definite $M_1 \in \mathbb{R}^{d \times d}$.
4: Choose a positive scalar sequence $\{\alpha_k\}$.
5: for $k = 1, 2, \dots$ do
6:   Set $s_k \leftarrow -\alpha_k M_k g_k$.
7:   Set $w_{k+1} \leftarrow w_k + s_k$.
8:   Set $g_{k+1} \approx \nabla f(w_{k+1})$.
9:   Set $y_k \leftarrow g_{k+1} - g_k$.
10:  Set $\beta_k \leftarrow \min\{\beta \in [0, 1] : v(\beta) := \beta s_k + (1 - \beta) \alpha_k y_k \text{ satisfies (KEY)}\}$.
11:  Set $v_k \leftarrow v(\beta_k)$.
12:  Set $M_{k+1} \leftarrow \left(I - \frac{v_k s_k^T}{s_k^T v_k}\right)^T M_k \left(I - \frac{v_k s_k^T}{s_k^T v_k}\right) + \frac{s_k s_k^T}{s_k^T v_k}$.
13: end for
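Step 10's damping choice can be sketched as follows. This is a hedged illustration, not the paper's exact implementation: it searches a grid over $[0, 1]$ rather than solving for the minimal $\beta$ exactly, and the constants $\eta$, $\theta$, and the test vectors are assumed.

```python
# A hedged sketch of step 10 of Algorithm SC: choose the smallest beta in
# [0, 1] (on a grid here, for simplicity) such that
#   v(beta) = beta * s + (1 - beta) * alpha * y
# satisfies the (KEY) bounds:
#   eta <= s^T v / ||s||^2   and   ||v||^2 / (s^T v) <= theta.
def choose_beta(s, alpha, y, eta=0.125, theta=8.0, grid=101):
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    for i in range(grid):
        beta = i / (grid - 1)
        v = [beta * si + (1.0 - beta) * alpha * yi for si, yi in zip(s, y)]
        stv = dot(s, v)
        if stv > 0.0 and eta <= stv / dot(s, s) and dot(v, v) / stv <= theta:
            return beta, v
    # beta = 1 gives v = s, which satisfies (KEY) whenever eta <= 1 <= theta.
    return 1.0, list(s)

# Assumed example inputs: a step s and a gradient displacement y for which
# pure y fails the curvature bound, so some damping toward s is needed.
beta, v = choose_beta(s=[1.0, 0.0], alpha=1.0, y=[-0.5, 0.0])
```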
Global convergence theorem

Theorem (Bottou, Curtis, Nocedal (2016)). Suppose that, for all $k$, there exists a scalar constant $\rho > 0$ such that
$\nabla f(w_k)^T \mathbb{E}_{\xi_k}[M_k g_k] \ge \rho \|\nabla f(w_k)\|_2^2$,
and there exist scalars $\sigma > 0$ and $\tau > 0$ such that
$\mathbb{E}_{\xi_k}[\|M_k g_k\|_2^2] \le \sigma + \tau \|\nabla f(w_k)\|_2^2$.
Then, $\{\mathbb{E}[f(w_k)]\}$ converges to a finite limit and $\liminf_{k \to \infty} \mathbb{E}[\|\nabla f(w_k)\|] = 0$.

Proof technique. Follows from the critical inequality
$\mathbb{E}_{\xi_k}[f(w_{k+1})] \le f(w_k) - \alpha_k \nabla f(w_k)^T \mathbb{E}_{\xi_k}[M_k g_k] + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|M_k g_k\|_2^2]$.
Reality

The conditions in this theorem cannot be verified in practice.
- They require knowing $\nabla f(w_k)$.
- They require knowing $\mathbb{E}_{\xi_k}[M_k g_k]$ and $\mathbb{E}_{\xi_k}[\|M_k g_k\|_2^2]$... but $M_k$ and $g_k$ are not independent!

That said, Corollary SC ensures that they hold with $g_k = \nabla f(w_k)$; recall
$\mu \|g_k\|_2^2 \le g_k^T M_k g_k$ and $\|M_k g_k\|_2^2 \le \nu \|g_k\|_2^2$.

Stabilized variant (SC-s): loop over the (stochastic) gradient computation until
$\rho \|\hat g_{k+1}\|_2^2 \le \hat g_{k+1}^T M_{k+1} g_{k+1}$ and $\|M_{k+1} g_{k+1}\|_2^2 \le \sigma + \tau \|\hat g_{k+1}\|_2^2$:
recompute $g_{k+1}$, $\hat g_{k+1}$, and $M_{k+1}$ until these hold.
Numerical experiments: a1a
(Figure: logistic regression, data set a1a, diminishing stepsizes.)

Numerical experiments: rcv1
SC-L and SC-L-s: limited memory variants of SC and SC-s, respectively.
(Figure: logistic regression, data set rcv1, diminishing stepsizes.)

Numerical experiments: mnist
(Figure: deep neural network, data set mnist, diminishing stepsizes.)
Contributions

Proposed a quasi-Newton method for stochastic (nonconvex) optimization:
- exploited self-correcting properties of BFGS-type updates
- properties of Hessians offer useful bounds for inverse Hessians
- motivating convergence theory for convex and nonconvex objectives
- dynamic noise reduction strategy
- limited memory variant

Observed stable behavior and overall good performance.

F. E. Curtis. "A Self-Correcting Variable-Metric Algorithm for Stochastic Optimization." In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR.
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationNewton s Method. Ryan Tibshirani Convex Optimization /36-725
Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x
More informationABSTRACT 1. INTRODUCTION
A DIAGONAL-AUGMENTED QUASI-NEWTON METHOD WITH APPLICATION TO FACTORIZATION MACHINES Aryan Mohtari and Amir Ingber Department of Electrical and Systems Engineering, University of Pennsylvania, PA, USA Big-data
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More information2. Quasi-Newton methods
L. Vandenberghe EE236C (Spring 2016) 2. Quasi-Newton methods variable metric methods quasi-newton methods BFGS update limited-memory quasi-newton methods 2-1 Newton method for unconstrained minimization
More informationSecond order machine learning
Second order machine learning Michael W. Mahoney ICSI and Department of Statistics UC Berkeley Michael W. Mahoney (UC Berkeley) Second order machine learning 1 / 96 Outline Machine Learning s Inverse Problem
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationIntroduction to gradient descent
6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More information5 Quasi-Newton Methods
Unconstrained Convex Optimization 26 5 Quasi-Newton Methods If the Hessian is unavailable... Notation: H = Hessian matrix. B is the approximation of H. C is the approximation of H 1. Problem: Solve min
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationOPER 627: Nonlinear Optimization Lecture 14: Mid-term Review
OPER 627: Nonlinear Optimization Lecture 14: Mid-term Review Department of Statistical Sciences and Operations Research Virginia Commonwealth University Oct 16, 2013 (Lecture 14) Nonlinear Optimization
More informationAdaptive Gradient Methods AdaGrad / Adam
Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationNumerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen
Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen
More informationBlock stochastic gradient update method
Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic
More informationMultilevel Optimization Methods: Convergence and Problem Structure
Multilevel Optimization Methods: Convergence and Problem Structure Chin Pang Ho, Panos Parpas Department of Computing, Imperial College London, United Kingdom October 8, 206 Abstract Building upon multigrid
More informationSub-Sampled Newton Methods I: Globally Convergent Algorithms
Sub-Sampled Newton Methods I: Globally Convergent Algorithms arxiv:1601.04737v3 [math.oc] 26 Feb 2016 Farbod Roosta-Khorasani February 29, 2016 Abstract Michael W. Mahoney Large scale optimization problems
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationChapter 4. Unconstrained optimization
Chapter 4. Unconstrained optimization Version: 28-10-2012 Material: (for details see) Chapter 11 in [FKS] (pp.251-276) A reference e.g. L.11.2 refers to the corresponding Lemma in the book [FKS] PDF-file
More informationLine Search Methods for Unconstrained Optimisation
Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic
More informationMultipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations
Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Oleg Burdakov a,, Ahmad Kamandi b a Department of Mathematics, Linköping University,
More informationQuasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno February 6, / 25 (BFG. Limited memory BFGS (L-BFGS)
Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno (BFGS) Limited memory BFGS (L-BFGS) February 6, 2014 Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb
More informationComplexity analysis of second-order algorithms based on line search for smooth nonconvex optimization
Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Clément Royer - University of Wisconsin-Madison Joint work with Stephen J. Wright MOPTA, Bethlehem,
More informationLBFGS. John Langford, Large Scale Machine Learning Class, February 5. (post presentation version)
LBFGS John Langford, Large Scale Machine Learning Class, February 5 (post presentation version) We are still doing Linear Learning Features: a vector x R n Label: y R Goal: Learn w R n such that ŷ w (x)
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationTutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer
Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationLarge-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima J. Nocedal with N. Keskar Northwestern University D. Mudigere INTEL P. Tang INTEL M. Smelyanskiy INTEL 1 Initial Remarks SGD
More information5 Handling Constraints
5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationEfficient Methods for Large-Scale Optimization
Efficient Methods for Large-Scale Optimization Aryan Mokhtari Department of Electrical and Systems Engineering University of Pennsylvania aryanm@seas.upenn.edu Ph.D. Proposal Advisor: Alejandro Ribeiro
More informationMethods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent
Nonlinear Optimization Steepest Descent and Niclas Börlin Department of Computing Science Umeå University niclas.borlin@cs.umu.se A disadvantage with the Newton method is that the Hessian has to be derived
More informationMini-Course 1: SGD Escapes Saddle Points
Mini-Course 1: SGD Escapes Saddle Points Yang Yuan Computer Science Department Cornell University Gradient Descent (GD) Task: min x f (x) GD does iterative updates x t+1 = x t η t f (x t ) Gradient Descent
More informationNetwork Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)
Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu
More informationProgramming, numerics and optimization
Programming, numerics and optimization Lecture C-3: Unconstrained optimization II Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More information