Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Frank E. Curtis, Lehigh University
Beyond Convexity Workshop, Oaxaca, Mexico, 26 October 2017

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Unconstrained nonconvex optimization
Given f : R^n → R, consider the unconstrained optimization problem
    min_{x ∈ R^n} f(x).
What type of method to use? First-order or second-order? One motivation for second-order: improved complexity bounds.
Main message of this talk: For nonconvex optimization... complexity bounds should be taken with a grain of salt (for now).
Parting words: There are other, better motivations for second-order methods.

Methods of interest
Trust region methods: decades of algorithmic development; Levenberg (1944); Marquardt (1963); Powell (1970); many more!
- Global convergence, local quadratic rate when ∇²f(x_*) ≻ 0
- O(ɛ^{-2}) complexity to first-order ɛ-criticality, O(ɛ^{-3}) to second-order
Cubic regularization methods: relatively recent algorithmic development; fewer variants; Griewank (1981); Nesterov & Polyak (2006); Cartis, Gould, & Toint (2011)
- Global convergence, local quadratic rate when ∇²f(x_*) ≻ 0
- O(ɛ^{-3/2}) complexity to first-order ɛ-criticality, O(ɛ^{-3}) to second-order
Theoretical guarantees to assess a nonconvex optimization algorithm:
- Global convergence, i.e., ∇f(x_k) → 0 and maybe lim inf min(eig(∇²f(x_k))) ≥ 0
- Local convergence rate, i.e., ‖∇f(x_{k+1})‖_2 / ‖∇f(x_k)‖_2 → 0 (or more)
- Worst-case complexity, i.e., an upper bound on the number of iterations¹ to achieve ‖∇f(x_k)‖_2 ≤ ɛ and perhaps min(eig(∇²f(x_k))) ≥ −ɛ for some ɛ > 0
¹ ...or function evaluations, subproblem solves, etc.

Theory vs. practice
Researchers have been gravitating to adopt and build on cubic regularization: Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017); Carmon, Duchi (2017); Kohler, Lucchi (2017); Peng, Roosta-Khorasani, Mahoney (2017).
However, there remains a large gap between theory and practice! There is little evidence that cubic regularization methods offer improved performance:
- Trust region (TR) methods remain the state-of-the-art
- TR-like methods can achieve the same complexity guarantees

Trust region methods with optimal complexity

Closer look
Let's understand these complexity bounds, then look closely at them. I'll refer to three algorithms:
- ttr: Traditional Trust Region algorithm
- arc: Adaptive Regularisation algorithm using Cubics [Cartis, Gould, & Toint (2011)]
- trace: Trust Region Algorithm with Contractions and Expansions [Curtis, Robinson, & Samadi (2017)]

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Algorithm basics
ttr:
1: Solve to compute s_k:  min_{s ∈ R^n} q_k(s) := f_k + g_k^T s + (1/2) s^T H_k s  s.t. ‖s‖_2 ≤ δ_k  (dual: λ_k)
2: Compute ratio:  ρ_k^q ← (f_k − f(x_k + s_k)) / (f_k − q_k(s_k))
3: Update radius:  if ρ_k^q ≥ η, accept and increase δ_k; if ρ_k^q < η, reject and decrease δ_k
arc:
1: Solve to compute s_k:  min_{s ∈ R^n} c_k(s) := f_k + g_k^T s + (1/2) s^T H_k s + (1/3) σ_k ‖s‖_2^3
2: Compute ratio:  ρ_k^c ← (f_k − f(x_k + s_k)) / (f_k − c_k(s_k))
3: Update regularization:  if ρ_k^c ≥ η, accept and decrease σ_k; if ρ_k^c < η, reject and increase σ_k
Subproblem solution correspondence:  σ_k = λ_k / δ_k,  δ_k = ‖s_k‖_2
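To make the shared skeleton concrete, here is a minimal Python sketch (my own illustration, not code from the talk) of the ttr loop above. For simplicity the subproblem is solved only approximately via the Cauchy point, whereas ttr/arc as discussed assume (near-)exact subproblem solutions; arc would replace the radius constraint with the cubic term and flip the update direction for σ_k.

```python
# Minimal sketch of the ttr skeleton (illustration only; assumptions: Cauchy-point
# subproblem solver, simple halving/doubling radius update, double-well test problem).
import numpy as np

def cauchy_point(g, H, delta):
    """Minimize g^T s + 0.5 s^T H s along s = -t*g subject to ||s||_2 <= delta."""
    gHg = g @ H @ g
    gnorm = np.linalg.norm(g)
    t_max = delta / gnorm
    t = t_max if gHg <= 0 else min(gnorm**2 / gHg, t_max)  # negative curvature: go to boundary
    return -t * g

def ttr(f, grad, hess, x, delta=1.0, eta=0.1, gamma=0.5, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_point(g, H, delta)
        pred = -(g @ s + 0.5 * s @ H @ s)       # f_k - q_k(s_k), positive for a Cauchy step
        rho = (f(x) - f(x + s)) / pred          # step-acceptance ratio rho_k^q
        if rho >= eta:
            x, delta = x + s, delta / gamma     # accept and increase the radius
        else:
            delta = delta * gamma               # reject and decrease the radius
    return x

# Hypothetical nonconvex test problem: f(x, y) = x^4/4 - x^2/2 + y^2/2 (double well)
f = lambda z: 0.25 * z[0]**4 - 0.5 * z[0]**2 + 0.5 * z[1]**2
grad = lambda z: np.array([z[0]**3 - z[0], z[1]])
hess = lambda z: np.array([[3 * z[0]**2 - 1.0, 0.0], [0.0, 1.0]])
print(ttr(f, grad, hess, np.array([2.0, 1.0])))   # converges to the local minimizer (1, 0)
```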

Discussion
What are the similarities? The algorithmic frameworks are almost identical, and there is a one-to-one correspondence (except when λ_k = 0) between subproblem solutions.
What are the key differences? The step acceptance criteria, and the trust region vs. regularization coefficient updates.
Recall that a solution s_k of the TR subproblem is also a solution of
    min_{s ∈ R^n} f_k + g_k^T s + (1/2) s^T (H_k + λ_k I) s,
so the dual variable λ_k can be viewed as a quadratic regularization coefficient.
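The correspondence can be checked numerically. The following is a small sketch of my own (random data and a simple bisection solver, not from the talk): solve the TR subproblem through its dual variable, then verify that the same step is stationary for the cubic model with σ = λ/‖s‖_2.

```python
# Numerical check of the correspondence sigma_k = lambda_k / ||s_k||_2 (my own sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = (A + A.T) / 2 - 1.5 * np.eye(n)    # symmetric, likely indefinite "Hessian"
g = rng.standard_normal(n)
delta = 0.8                             # trust-region radius

lam_min = np.linalg.eigvalsh(H)[0]
lo = max(0.0, -lam_min) + 1e-12         # lambda must make H + lambda*I positive semidefinite
hi = lo + 1.0
while np.linalg.norm(np.linalg.solve(H + hi * np.eye(n), -g)) > delta:
    hi *= 2.0                           # grow hi until the step fits inside the radius

for _ in range(100):                    # bisection on ||s(lambda)||_2 = delta
    lam = 0.5 * (lo + hi)
    s = np.linalg.solve(H + lam * np.eye(n), -g)
    if np.linalg.norm(s) > delta:
        lo = lam
    else:
        hi = lam

sigma = lam / np.linalg.norm(s)         # the correspondence from the slide
cubic_grad = g + H @ s + sigma * np.linalg.norm(s) * s   # gradient of the cubic model at s
print("||s|| =", np.linalg.norm(s), " lambda =", lam)
print("cubic-model gradient norm:", np.linalg.norm(cubic_grad))  # approximately zero
```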

Regularization/stepsize trade-off: ttr
[Figure: at a given iterate x_k, a curve of the step norm ‖s_k(δ)‖_2 (= δ) plotted against the dual variable λ_k(δ); the horizontal axis runs from 0 past max{0, −min(eig(H_k))}, and (λ_k(δ_k), ‖s_k(δ_k)‖_2) marks the current point.]
At a given iterate x_k, the curve illustrates the dual variable (i.e., quadratic regularization) and the norm of the corresponding step as a function of the TR radius.
After a rejected step (i.e., with x_{k+1} = x_k) we set δ_{k+1} ← γδ_k (a linear rate of decrease) while λ_{k+1} > λ_k.
In fact, the increase in the dual can be quite severe in some cases! (We have no direct control over this.)

Intuition, please!
Intuitively, what is so important about λ_k / ‖s_k‖_2 = λ_k / δ_k?
- Large δ_k implies s_k may not yield objective decrease. Small δ_k prohibits long steps.
- Small λ_k suggests the TR is not restricting us too much. Large λ_k suggests more objective decrease is possible.
So what is so bad (for complexity's sake) with the following?
    λ_k / δ_k ≈ 0   and   λ_{k+1} / δ_{k+1} ≫ 0.
It's that we may go from a large, but unproductive step to a productive, but (too) short step!

arc magic
So what's the magic of arc? It's not the types of steps you compute (since the TR subproblem gives the same). It's that a simple update for σ_k gives a good regularization/stepsize balance.
In arc, restricting σ_k ≥ σ_min for all k and proving that σ_k ≤ σ_max for all k ensures that all accepted steps satisfy
    f_k − f_{k+1} ≥ c_1 σ_min ‖s_k‖_2^3   and   ‖s_k‖_2 ≥ (c_2 σ_max + c_3)^{-1/2} ‖g_{k+1}‖_2^{1/2}.
One can also show that, at any point, the number of rejected steps that can occur consecutively is bounded above by a constant (independent of k and ɛ). It is important to note that arc always has the regularization on.

Regularization/stepsize trade-off: arc
[Figure: the same curve of ‖s_k(σ)‖_2 against λ_k(σ), now intersected by a dashed line through the origin with slope 1/σ_k; successive rejected steps correspond to dashed lines of slopes 1/σ_{k+1}, 1/σ_{k+2}, ... .]
All points on the dashed line yield the same ratio σ = λ/‖s‖_2, so, given σ_k, the properties of s_k are determined by the intersection of the dashed line and the curve.
A sequence of rejected steps follows the curve much differently than in ttr; in particular, for sufficiently large σ, the rate of decrease in ‖s‖ is sublinear.

From ttr to trace
trace involves three key modifications of ttr:
1: A different step acceptance ratio
2: A new expansion step: may reject a step while increasing the TR radius
3: A new contraction procedure: explicit or implicit (through an update of λ)
With these, we obtain the same complexity properties as arc. We also recently proposed and analyzed a framework that generalizes both and allows for inexact subproblem solutions (maintaining complexity):
    min_{(s,λ) ∈ R^n × R} f_k + g_k^T s + (1/2) s^T (H_k + λI) s   s.t.   σ_k^L ‖s‖_2 ≤ λ ≤ σ_k^U ‖s‖_2

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Complexity for arc and trace
Suppose that f is twice continuously differentiable, bounded below by f_min, with g and H both Lipschitz continuous (constants L_g and L_H).
Combine the inequalities
    f_k − f_{k+1} ≥ η ‖s_k‖_2^3   and   ‖s_k‖_2 ≥ (L_H + σ_max)^{-1/2} ‖g_{k+1}‖_2^{1/2}.
Then the cardinality of the set {k ∈ N : ‖g_k‖_2 > ɛ} is bounded above by
    ( (f_0 − f_min) / η ) (L_H + σ_max)^{3/2} ɛ^{-3/2}.
But these bounds are very pessimistic...
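For completeness, here is the combination step written out (my reconstruction of the argument the slide sketches; it ignores the index shift between g_k and g_{k+1} and the bounded number of consecutive rejected steps, which only affect constants):

```latex
% For any accepted iteration k with \|g_{k+1}\|_2 > \epsilon,
\[
  f_k - f_{k+1}
    \;\ge\; \eta \|s_k\|_2^3
    \;\ge\; \eta\,(L_H + \sigma_{\max})^{-3/2}\,\|g_{k+1}\|_2^{3/2}
    \;>\;   \eta\,(L_H + \sigma_{\max})^{-3/2}\,\epsilon^{3/2}.
\]
% Summing over the K_epsilon such iterations and telescoping against f_0 - f_min:
\[
  f_0 - f_{\min} \;\ge\; K_\epsilon\,\eta\,(L_H + \sigma_{\max})^{-3/2}\,\epsilon^{3/2}
  \quad\Longrightarrow\quad
  K_\epsilon \;\le\; \Bigl(\tfrac{f_0 - f_{\min}}{\eta}\Bigr)(L_H + \sigma_{\max})^{3/2}\,\epsilon^{-3/2}.
\]
```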

Simpler setting
Consider the iteration x_{k+1} ← x_k − (1/L_g) g_k. This type of complexity analysis considers the set
    G(ɛ_g) := {x ∈ R^n : ‖g(x)‖_2 > ɛ_g}
and aims to find an upper bound on the cardinality of
    K_g(ɛ_g) := {k ∈ N : x_k ∈ G(ɛ_g)}.

Upper bound on K_g(ɛ_g)
Using s_k = −(1/L_g) g_k and the upper bound
    f_{k+1} ≤ f_k + g_k^T s_k + (1/2) L_g ‖s_k‖_2^2,
one finds
    f_k − f_{k+1} ≥ (1/(2L_g)) ‖g_k‖_2^2
    ⟹ f_0 − f_* ≥ (1/(2L_g)) |K_g(ɛ_g)| ɛ_g^2
    ⟹ |K_g(ɛ_g)| ≤ 2L_g (f_0 − f_*) ɛ_g^{-2}.
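A quick numerical illustration of how loose this bound can be (my own sketch; the 1-D double-well test function, its Lipschitz constant on the region visited, and the tolerance are all my choices, not from the talk):

```python
# Run x_{k+1} = x_k - (1/L_g) g_k on a simple 1-D nonconvex function and compare the
# number of iterates with |g(x_k)| > eps_g against the bound 2 L_g (f_0 - f_*) eps_g^{-2}.
import numpy as np

f = lambda x: 0.25 * x**4 - 0.5 * x**2      # double well; f_* = f(+-1) = -0.25
g = lambda x: x**3 - x
L_g = 8.0                                    # Lipschitz constant of g on the region visited here
f_star = -0.25

eps_g = 1e-2
x, count = 1.7, 0
f0 = f(x)
for _ in range(100000):
    if abs(g(x)) > eps_g:
        count += 1
    else:
        break
    x = x - g(x) / L_g                       # the gradient iteration from the slide

bound = 2 * L_g * (f0 - f_star) / eps_g**2
print("iterates with |g| > eps_g:", count, "   worst-case bound:", bound)
```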

Nice f
But what if f is nice?... e.g., satisfying the Polyak-Łojasiewicz (PL) condition (for c ∈ R_{++}), i.e.,
    f(x) − f_* ≤ (1/(2c)) ‖g(x)‖_2^2   for all x ∈ R^n.
Now consider the set
    F(ɛ_f) := {x ∈ R^n : f(x) − f_* > ɛ_f}
and consider an upper bound on the cardinality of
    K_f(ɛ_f) := {k ∈ N : x_k ∈ F(ɛ_f)}.
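As one concrete family of "nice" functions (a standard fact, my addition, not from the slides): any c-strongly convex f satisfies the PL inequality above with the same constant c.

```latex
% Strong convexity with modulus c gives, for all x and y,
%   f(y) >= f(x) + g(x)^T (y - x) + (c/2) ||y - x||_2^2 .
% Minimizing both sides over y (the right-hand side is minimized at y = x - g(x)/c) yields
\[
  f_* \;\ge\; f(x) - \tfrac{1}{2c}\,\|g(x)\|_2^2
  \quad\Longleftrightarrow\quad
  f(x) - f_* \;\le\; \tfrac{1}{2c}\,\|g(x)\|_2^2 ,
\]
% which is exactly the PL condition on this slide. PL also holds for some nonconvex
% functions, e.g., f(x) = x^2 + 3 \sin^2(x) (Karimi, Nutini, Schmidt, 2016), so it is
% strictly weaker than strong convexity.
```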

Upper bound on K_f(ɛ_f)
Using s_k = −(1/L_g) g_k and the upper bound
    f_{k+1} ≤ f_k + g_k^T s_k + (1/2) L_g ‖s_k‖_2^2,
one finds
    f_k − f_{k+1} ≥ (1/(2L_g)) ‖g_k‖_2^2 ≥ (c/L_g)(f_k − f_*)
    ⟹ f_{k+1} − f_* ≤ (1 − c/L_g)(f_k − f_*)
    ⟹ f_k − f_* ≤ (1 − c/L_g)^k (f_0 − f_*)
    ⟹ |K_f(ɛ_f)| ≤ log( (f_0 − f_*)/ɛ_f ) / log( L_g/(L_g − c) ).

For the first step...
In the general nonconvex analysis, the expected decrease for the first step is much more pessimistic:
    general nonconvex:  f_0 − f_1 ≥ (1/(2L_g)) ɛ_g^2
    PL condition:       f_1 − f_* ≤ (1 − c/L_g)(f_0 − f_*)
...and it remains more pessimistic throughout!

Upper bounds on K_f(ɛ_f) versus K_g(ɛ_g)
Let f(x) = (1/2) x^2, meaning that g(x) = x.
Let ɛ_f = (1/2) ɛ_g^2, meaning that F(ɛ_f) = G(ɛ_g).
Let x_0 = 10, c = 1, and L = 2. (Similar pictures for any L > 1.)

Upper bounds on K_f(ɛ_f) versus |{k ∈ N : (1/2)‖g_k‖_2^2 > ɛ_g}|
Let f(x) = (1/2) x^2, meaning that (1/2) g(x)^2 = (1/2) x^2.
Let ɛ_f = ɛ_g, meaning that F(ɛ_f) = G(ɛ_g).
Let x_0 = 10, c = 1, and L = 2. (Similar pictures for any L > 1.)
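The comparison shown in these plots is easy to reproduce; the following is my own sketch using the slide's constants (x_0 = 10, c = 1, L = 2) and the first scaling ɛ_f = (1/2)ɛ_g^2, so only the qualitative picture, ɛ^{-2} growth versus logarithmic growth, should be read off it.

```python
# The toy example f(x) = x^2/2 with x_{k+1} = x_k - (1/L) g(x_k) = x_k / 2:
# compare the actual iteration counts with the two worst-case upper bounds.
import math

x0, c, L = 10.0, 1.0, 2.0
f = lambda x: 0.5 * x * x
f_star = 0.0

for eps_g in (1e-1, 1e-2, 1e-3, 1e-4):
    eps_f = 0.5 * eps_g**2                       # so that F(eps_f) = G(eps_g)

    # Worst-case upper bounds from the two analyses
    bound_Kg = 2 * L * (f(x0) - f_star) / eps_g**2
    bound_Kf = math.log((f(x0) - f_star) / eps_f) / math.log(L / (L - c))

    # Actual count for the iteration x_{k+1} = x_k / 2
    x, actual = x0, 0
    while abs(x) > eps_g:                        # equivalently f(x) - f_star > eps_f here
        actual += 1
        x *= 1.0 - 1.0 / L

    print(f"eps_g={eps_g:7.0e}  actual={actual:3d}  PL bound={bound_Kf:6.1f}  "
          f"nonconvex bound={bound_Kg:12.1f}")
```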

Bad worst-case!
Worst-case complexity bounds in the general nonconvex case are very pessimistic. The analysis immediately admits a large gap when the function is nice. The essentially tight examples for the worst-case bounds are... weird.²
² Cartis, Gould, Toint (2010)

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Summary
We have seen that cubic regularization admits good complexity properties. But we have also seen that tweaking trust region methods yields the same. And the empirical performance of trust region remains better (I claim!).
These complexity bounds are extremely pessimistic. If your function is nice, these bounds are way off. How can we close the gap between theory and practice? Nesterov & Polyak (2006) consider different phases... but can this be generalized (for non-strongly convex)?
Complexity properties should probably not (yet) drive algorithm selection. Then why second-order? To mitigate effects of ill-conditioning; easier to tune; parallel and distributed; etc.
Going back to the machine learning (ML) connection...

Optimization for ML: Beyond SG
[Schematic: starting from the stochastic gradient (SG) method, one can seek a better rate, a better constant, or both a better rate and a better constant.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

Two-dimensional schematic of methods
[Schematic: stochastic gradient, batch gradient, stochastic Newton, and batch Newton placed at the corners of a two-dimensional map, with "noise reduction" and "second-order" labeling the two directions.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

2D schematic: Noise reduction methods
[Schematic: along the noise-reduction direction, from stochastic gradient toward batch gradient, lie dynamic sampling, gradient aggregation, and iterate averaging methods; stochastic Newton marks the second-order direction.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

2D schematic: Second-order methods
[Schematic: along the second-order direction, from stochastic gradient toward stochastic Newton, lie diagonal scaling, natural gradient, Gauss-Newton, quasi-Newton, and Hessian-free Newton methods; batch gradient marks the noise-reduction direction.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

Fairly comparing algorithms
How should we compare optimization algorithms for machine learning? A fair comparison would...
- not ignore time spent tuning parameters
- demonstrate speed and reliability
- involve many problems (test sets?)
- involve many runs of each algorithm
What about testing accuracy?