Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Frank E. Curtis, Lehigh University
Beyond Convexity Workshop, Oaxaca, Mexico, 26 October 2017

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Unconstrained nonconvex optimization
Given f : R^n → R, consider the unconstrained optimization problem
    min_{x ∈ R^n} f(x).
What type of method to use? First-order or second-order? One motivation for second-order: improved complexity bounds.
Main message of this talk: For nonconvex optimization... complexity bounds should be taken with a grain of salt (for now).
Parting words: There are other, better motivations for second-order methods.

Methods of interest
Trust region methods: decades of algorithmic development; Levenberg (1944); Marquardt (1963); Powell (1970); many more!
- Global convergence, local quadratic rate when ∇²f(x_*) ≻ 0
- O(ɛ^{-2}) complexity to first-order ɛ-criticality, O(ɛ^{-3}) to second-order
Cubic regularization methods: relatively recent algorithmic development; fewer variants; Griewank (1981); Nesterov & Polyak (2006); Cartis, Gould, & Toint (2011)
- Global convergence, local quadratic rate when ∇²f(x_*) ≻ 0
- O(ɛ^{-3/2}) complexity to first-order ɛ-criticality, O(ɛ^{-3}) to second-order
Theoretical guarantees to assess a nonconvex optimization algorithm:
- Global convergence, i.e., ∇f(x_k) → 0 and maybe lim inf min(eig(∇²f(x_k))) ≥ 0
- Local convergence rate, i.e., ‖∇f(x_{k+1})‖_2 / ‖∇f(x_k)‖_2 → 0 (or more)
- Worst-case complexity, i.e., an upper bound on the number of iterations¹ to achieve ‖∇f(x_k)‖_2 ≤ ɛ and perhaps min(eig(∇²f(x_k))) ≥ −ɛ for some ɛ > 0
¹ ...or function evaluations, subproblem solves, etc.

Theory vs. practice
Researchers have been gravitating to adopt and build on cubic regularization: Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017); Carmon, Duchi (2017); Kohler, Lucchi (2017); Peng, Roosta-Khorasani, Mahoney (2017).
However, there remains a large gap between theory and practice! There is little evidence that cubic regularization methods offer improved performance:
- Trust region (TR) methods remain the state-of-the-art
- TR-like methods can achieve the same complexity guarantees

Trust region methods with optimal complexity

Closer look
Let's understand these complexity bounds, then look closely at them. I'll refer to three algorithms:
- ttr: Traditional Trust Region algorithm
- arc: Adaptive Regularisation algorithm using Cubics [Cartis, Gould, & Toint (2011)]
- trace: Trust Region Algorithm with Contractions and Expansions [Curtis, Robinson, & Samadi (2017)]

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Algorithm basics
ttr:
1: Solve to compute s_k:  min_{s ∈ R^n} q_k(s) := f_k + g_k^T s + (1/2) s^T H_k s  s.t. ‖s‖_2 ≤ δ_k  (dual: λ_k)
2: Compute ratio:  ρ_k^q ← (f_k − f(x_k + s_k)) / (f_k − q_k(s_k))
3: Update radius:  if ρ_k^q ≥ η, accept and increase δ_k; if ρ_k^q < η, reject and decrease δ_k
arc:
1: Solve to compute s_k:  min_{s ∈ R^n} c_k(s) := f_k + g_k^T s + (1/2) s^T H_k s + (1/3) σ_k ‖s‖_2^3
2: Compute ratio:  ρ_k^c ← (f_k − f(x_k + s_k)) / (f_k − c_k(s_k))
3: Update regularization:  if ρ_k^c ≥ η, accept and decrease σ_k; if ρ_k^c < η, reject and increase σ_k
Subproblem solution correspondence:  σ_k = λ_k / δ_k,  δ_k = ‖s_k‖_2
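To make the shared skeleton concrete, here is a minimal Python sketch (my own illustration, not code from the talk) of the ttr loop above. For simplicity the subproblem is solved only approximately via the Cauchy point, whereas ttr/arc as discussed assume (near-)exact subproblem solutions; arc would replace the radius constraint with the cubic term and flip the update direction for σ_k.

```python
# Minimal sketch of the ttr skeleton (illustration only; assumptions: Cauchy-point
# subproblem solver, simple halving/doubling radius update, double-well test problem).
import numpy as np

def cauchy_point(g, H, delta):
    """Minimize g^T s + 0.5 s^T H s along s = -t*g subject to ||s||_2 <= delta."""
    gHg = g @ H @ g
    gnorm = np.linalg.norm(g)
    t_max = delta / gnorm
    t = t_max if gHg <= 0 else min(gnorm**2 / gHg, t_max)  # negative curvature: go to boundary
    return -t * g

def ttr(f, grad, hess, x, delta=1.0, eta=0.1, gamma=0.5, tol=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_point(g, H, delta)
        pred = -(g @ s + 0.5 * s @ H @ s)       # f_k - q_k(s_k), positive for a Cauchy step
        rho = (f(x) - f(x + s)) / pred          # step-acceptance ratio rho_k^q
        if rho >= eta:
            x, delta = x + s, delta / gamma     # accept and increase the radius
        else:
            delta = delta * gamma               # reject and decrease the radius
    return x

# Hypothetical nonconvex test problem: f(x, y) = x^4/4 - x^2/2 + y^2/2 (double well)
f = lambda z: 0.25 * z[0]**4 - 0.5 * z[0]**2 + 0.5 * z[1]**2
grad = lambda z: np.array([z[0]**3 - z[0], z[1]])
hess = lambda z: np.array([[3 * z[0]**2 - 1.0, 0.0], [0.0, 1.0]])
print(ttr(f, grad, hess, np.array([2.0, 1.0])))   # converges to the local minimizer (1, 0)
```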

Discussion
What are the similarities? The algorithmic frameworks are almost identical, and there is a one-to-one correspondence (except when λ_k = 0) between subproblem solutions.
What are the key differences? The step acceptance criteria, and the trust region vs. regularization coefficient updates.
Recall that a solution s_k of the TR subproblem is also a solution of
    min_{s ∈ R^n} f_k + g_k^T s + (1/2) s^T (H_k + λ_k I) s,
so the dual variable λ_k can be viewed as a quadratic regularization coefficient.
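The correspondence can be checked numerically. The following is a small sketch of my own (random data and a simple bisection solver, not from the talk): solve the TR subproblem through its dual variable, then verify that the same step is stationary for the cubic model with σ = λ/‖s‖_2.

```python
# Numerical check of the correspondence sigma_k = lambda_k / ||s_k||_2 (my own sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = (A + A.T) / 2 - 1.5 * np.eye(n)    # symmetric, likely indefinite "Hessian"
g = rng.standard_normal(n)
delta = 0.8                             # trust-region radius

lam_min = np.linalg.eigvalsh(H)[0]
lo = max(0.0, -lam_min) + 1e-12         # lambda must make H + lambda*I positive semidefinite
hi = lo + 1.0
while np.linalg.norm(np.linalg.solve(H + hi * np.eye(n), -g)) > delta:
    hi *= 2.0                           # grow hi until the step fits inside the radius

for _ in range(100):                    # bisection on ||s(lambda)||_2 = delta
    lam = 0.5 * (lo + hi)
    s = np.linalg.solve(H + lam * np.eye(n), -g)
    if np.linalg.norm(s) > delta:
        lo = lam
    else:
        hi = lam

sigma = lam / np.linalg.norm(s)         # the correspondence from the slide
cubic_grad = g + H @ s + sigma * np.linalg.norm(s) * s   # gradient of the cubic model at s
print("||s|| =", np.linalg.norm(s), " lambda =", lam)
print("cubic-model gradient norm:", np.linalg.norm(cubic_grad))  # approximately zero
```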

Regularization/stepsize trade-off: ttr
[Figure: at a given iterate x_k, a curve of the step norm ‖s_k(δ)‖_2 (= δ) plotted against the dual variable λ_k(δ); the horizontal axis runs from 0 past max{0, −min(eig(H_k))}, and (λ_k(δ_k), ‖s_k(δ_k)‖_2) marks the current point.]
At a given iterate x_k, the curve illustrates the dual variable (i.e., quadratic regularization) and the norm of the corresponding step as a function of the TR radius.
After a rejected step (i.e., with x_{k+1} = x_k) we set δ_{k+1} ← γδ_k (a linear rate of decrease) while λ_{k+1} > λ_k.
In fact, the increase in the dual can be quite severe in some cases! (We have no direct control over this.)

Intuition, please!
Intuitively, what is so important about λ_k / ‖s_k‖_2 = λ_k / δ_k?
- Large δ_k implies s_k may not yield objective decrease. Small δ_k prohibits long steps.
- Small λ_k suggests the TR is not restricting us too much. Large λ_k suggests more objective decrease is possible.
So what is so bad (for complexity's sake) with the following?
    λ_k / δ_k ≈ 0   and   λ_{k+1} / δ_{k+1} ≫ 0.
It's that we may go from a large, but unproductive step to a productive, but (too) short step!

arc magic
So what's the magic of arc? It's not the types of steps you compute (since the TR subproblem gives the same). It's that a simple update for σ_k gives a good regularization/stepsize balance.
In arc, restricting σ_k ≥ σ_min for all k and proving that σ_k ≤ σ_max for all k ensures that all accepted steps satisfy
    f_k − f_{k+1} ≥ c_1 σ_min ‖s_k‖_2^3   and   ‖s_k‖_2 ≥ (c_2 σ_max + c_3)^{-1/2} ‖g_{k+1}‖_2^{1/2}.
One can also show that, at any point, the number of rejected steps that can occur consecutively is bounded above by a constant (independent of k and ɛ). It is important to note that arc always has the regularization on.

Regularization/stepsize trade-off: arc
[Figure: the same curve of ‖s_k(σ)‖_2 against λ_k(σ), now intersected by a dashed line through the origin with slope 1/σ_k; successive rejected steps correspond to dashed lines of slopes 1/σ_{k+1}, 1/σ_{k+2}, ... .]
All points on the dashed line yield the same ratio σ = λ/‖s‖_2, so, given σ_k, the properties of s_k are determined by the intersection of the dashed line and the curve.
A sequence of rejected steps follows the curve much differently than in ttr; in particular, for sufficiently large σ, the rate of decrease in ‖s‖ is sublinear.

From ttr to trace
trace involves three key modifications of ttr:
1: A different step acceptance ratio
2: A new expansion step: may reject a step while increasing the TR radius
3: A new contraction procedure: explicit or implicit (through an update of λ)
With these, we obtain the same complexity properties as arc. We also recently proposed and analyzed a framework that generalizes both and allows for inexact subproblem solutions (maintaining complexity):
    min_{(s,λ) ∈ R^n × R} f_k + g_k^T s + (1/2) s^T (H_k + λI) s   s.t.   σ_k^L ‖s‖_2 ≤ λ ≤ σ_k^U ‖s‖_2

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Complexity for arc and trace
Suppose that f is twice continuously differentiable, bounded below by f_min, with g and H both Lipschitz continuous (constants L_g and L_H).
Combine the inequalities
    f_k − f_{k+1} ≥ η ‖s_k‖_2^3   and   ‖s_k‖_2 ≥ (L_H + σ_max)^{-1/2} ‖g_{k+1}‖_2^{1/2}.
Then the cardinality of the set {k ∈ N : ‖g_k‖_2 > ɛ} is bounded above by
    ( (f_0 − f_min) / η ) (L_H + σ_max)^{3/2} ɛ^{-3/2}.
But these bounds are very pessimistic...
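For completeness, here is the combination step written out (my reconstruction of the argument the slide sketches; it ignores the index shift between g_k and g_{k+1} and the bounded number of consecutive rejected steps, which only affect constants):

```latex
% For any accepted iteration k with \|g_{k+1}\|_2 > \epsilon,
\[
  f_k - f_{k+1}
    \;\ge\; \eta \|s_k\|_2^3
    \;\ge\; \eta\,(L_H + \sigma_{\max})^{-3/2}\,\|g_{k+1}\|_2^{3/2}
    \;>\;   \eta\,(L_H + \sigma_{\max})^{-3/2}\,\epsilon^{3/2}.
\]
% Summing over the K_epsilon such iterations and telescoping against f_0 - f_min:
\[
  f_0 - f_{\min} \;\ge\; K_\epsilon\,\eta\,(L_H + \sigma_{\max})^{-3/2}\,\epsilon^{3/2}
  \quad\Longrightarrow\quad
  K_\epsilon \;\le\; \Bigl(\tfrac{f_0 - f_{\min}}{\eta}\Bigr)(L_H + \sigma_{\max})^{3/2}\,\epsilon^{-3/2}.
\]
```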

Simpler setting
Consider the iteration x_{k+1} ← x_k − (1/L_g) g_k. This type of complexity analysis considers the set
    G(ɛ_g) := {x ∈ R^n : ‖g(x)‖_2 > ɛ_g}
and aims to find an upper bound on the cardinality of
    K_g(ɛ_g) := {k ∈ N : x_k ∈ G(ɛ_g)}.

Upper bound on K_g(ɛ_g)
Using s_k = −(1/L_g) g_k and the upper bound
    f_{k+1} ≤ f_k + g_k^T s_k + (1/2) L_g ‖s_k‖_2^2,
one finds
    f_k − f_{k+1} ≥ (1/(2L_g)) ‖g_k‖_2^2
    ⟹ f_0 − f_* ≥ (1/(2L_g)) |K_g(ɛ_g)| ɛ_g^2
    ⟹ |K_g(ɛ_g)| ≤ 2L_g (f_0 − f_*) ɛ_g^{-2}.
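A quick numerical illustration of how loose this bound can be (my own sketch; the 1-D double-well test function, its Lipschitz constant on the region visited, and the tolerance are all my choices, not from the talk):

```python
# Run x_{k+1} = x_k - (1/L_g) g_k on a simple 1-D nonconvex function and compare the
# number of iterates with |g(x_k)| > eps_g against the bound 2 L_g (f_0 - f_*) eps_g^{-2}.
import numpy as np

f = lambda x: 0.25 * x**4 - 0.5 * x**2      # double well; f_* = f(+-1) = -0.25
g = lambda x: x**3 - x
L_g = 8.0                                    # Lipschitz constant of g on the region visited here
f_star = -0.25

eps_g = 1e-2
x, count = 1.7, 0
f0 = f(x)
for _ in range(100000):
    if abs(g(x)) > eps_g:
        count += 1
    else:
        break
    x = x - g(x) / L_g                       # the gradient iteration from the slide

bound = 2 * L_g * (f0 - f_star) / eps_g**2
print("iterates with |g| > eps_g:", count, "   worst-case bound:", bound)
```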

Nice f
But what if f is nice?... e.g., satisfying the Polyak-Łojasiewicz (PL) condition (for c ∈ R_{++}), i.e.,
    f(x) − f_* ≤ (1/(2c)) ‖g(x)‖_2^2   for all x ∈ R^n.
Now consider the set
    F(ɛ_f) := {x ∈ R^n : f(x) − f_* > ɛ_f}
and consider an upper bound on the cardinality of
    K_f(ɛ_f) := {k ∈ N : x_k ∈ F(ɛ_f)}.
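As one concrete family of "nice" functions (a standard fact, my addition, not from the slides): any c-strongly convex f satisfies the PL inequality above with the same constant c.

```latex
% Strong convexity with modulus c gives, for all x and y,
%   f(y) >= f(x) + g(x)^T (y - x) + (c/2) ||y - x||_2^2 .
% Minimizing both sides over y (the right-hand side is minimized at y = x - g(x)/c) yields
\[
  f_* \;\ge\; f(x) - \tfrac{1}{2c}\,\|g(x)\|_2^2
  \quad\Longleftrightarrow\quad
  f(x) - f_* \;\le\; \tfrac{1}{2c}\,\|g(x)\|_2^2 ,
\]
% which is exactly the PL condition on this slide. PL also holds for some nonconvex
% functions, e.g., f(x) = x^2 + 3 \sin^2(x) (Karimi, Nutini, Schmidt, 2016), so it is
% strictly weaker than strong convexity.
```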

Upper bound on K_f(ɛ_f)
Using s_k = −(1/L_g) g_k and the upper bound
    f_{k+1} ≤ f_k + g_k^T s_k + (1/2) L_g ‖s_k‖_2^2,
one finds
    f_k − f_{k+1} ≥ (1/(2L_g)) ‖g_k‖_2^2 ≥ (c/L_g)(f_k − f_*)
    ⟹ f_{k+1} − f_* ≤ (1 − c/L_g)(f_k − f_*)
    ⟹ f_k − f_* ≤ (1 − c/L_g)^k (f_0 − f_*)
    ⟹ |K_f(ɛ_f)| ≤ log( (f_0 − f_*)/ɛ_f ) / log( L_g/(L_g − c) ).

For the first step...
In the general nonconvex analysis, the expected decrease for the first step is much more pessimistic:
    general nonconvex:  f_0 − f_1 ≥ (1/(2L_g)) ɛ_g^2
    PL condition:       f_1 − f_* ≤ (1 − c/L_g)(f_0 − f_*)
...and it remains more pessimistic throughout!

Upper bounds on K_f(ɛ_f) versus K_g(ɛ_g)
Let f(x) = (1/2) x^2, meaning that g(x) = x.
Let ɛ_f = (1/2) ɛ_g^2, meaning that F(ɛ_f) = G(ɛ_g).
Let x_0 = 10, c = 1, and L = 2. (Similar pictures for any L > 1.)

Upper bounds on K_f(ɛ_f) versus |{k ∈ N : (1/2)‖g_k‖_2^2 > ɛ_g}|
Let f(x) = (1/2) x^2, meaning that (1/2) g(x)^2 = (1/2) x^2.
Let ɛ_f = ɛ_g, meaning that F(ɛ_f) = G(ɛ_g).
Let x_0 = 10, c = 1, and L = 2. (Similar pictures for any L > 1.)
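The comparison shown in these plots is easy to reproduce; the following is my own sketch using the slide's constants (x_0 = 10, c = 1, L = 2) and the first scaling ɛ_f = (1/2)ɛ_g^2, so only the qualitative picture, ɛ^{-2} growth versus logarithmic growth, should be read off it.

```python
# The toy example f(x) = x^2/2 with x_{k+1} = x_k - (1/L) g(x_k) = x_k / 2:
# compare the actual iteration counts with the two worst-case upper bounds.
import math

x0, c, L = 10.0, 1.0, 2.0
f = lambda x: 0.5 * x * x
f_star = 0.0

for eps_g in (1e-1, 1e-2, 1e-3, 1e-4):
    eps_f = 0.5 * eps_g**2                       # so that F(eps_f) = G(eps_g)

    # Worst-case upper bounds from the two analyses
    bound_Kg = 2 * L * (f(x0) - f_star) / eps_g**2
    bound_Kf = math.log((f(x0) - f_star) / eps_f) / math.log(L / (L - c))

    # Actual count for the iteration x_{k+1} = x_k / 2
    x, actual = x0, 0
    while abs(x) > eps_g:                        # equivalently f(x) - f_star > eps_f here
        actual += 1
        x *= 1.0 - 1.0 / L

    print(f"eps_g={eps_g:7.0e}  actual={actual:3d}  PL bound={bound_Kf:6.1f}  "
          f"nonconvex bound={bound_Kg:12.1f}")
```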

Bad worst-case!
Worst-case complexity bounds in the general nonconvex case are very pessimistic. The analysis immediately admits a large gap when the function is nice. The essentially tight examples for the worst-case bounds are... weird.²
² Cartis, Gould, Toint (2010)

Outline: Motivation; Trust Region vs. Cubic Regularization; Complexity Bounds; Summary/Questions

Summary
We have seen that cubic regularization admits good complexity properties. But we have also seen that tweaking trust region methods yields the same. And the empirical performance of trust region remains better (I claim!).
These complexity bounds are extremely pessimistic. If your function is nice, these bounds are way off. How can we close the gap between theory and practice? Nesterov & Polyak (2006) consider different phases... but can this be generalized (for non-strongly convex)?
Complexity properties should probably not (yet) drive algorithm selection. Then why second-order? To mitigate effects of ill-conditioning; easier to tune; parallel and distributed; etc.
Going back to the machine learning (ML) connection...

Optimization for ML: Beyond SG
[Schematic: starting from the stochastic gradient (SG) method, one can seek a better rate, a better constant, or both a better rate and a better constant.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

Two-dimensional schematic of methods
[Schematic: stochastic gradient, batch gradient, stochastic Newton, and batch Newton placed at the corners of a two-dimensional map, with "noise reduction" and "second-order" labeling the two directions.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

2D schematic: Noise reduction methods
[Schematic: along the noise-reduction direction, from stochastic gradient toward batch gradient, lie dynamic sampling, gradient aggregation, and iterate averaging methods; stochastic Newton marks the second-order direction.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

2D schematic: Second-order methods
[Schematic: along the second-order direction, from stochastic gradient toward stochastic Newton, lie diagonal scaling, natural gradient, Gauss-Newton, quasi-Newton, and Hessian-free Newton methods; batch gradient marks the noise-reduction direction.]
Bottou et al., Optimization Methods for Large-Scale Machine Learning, SIAM Review (to appear)

Fairly comparing algorithms
How should we compare optimization algorithms for machine learning? A fair comparison would...
- not ignore time spent tuning parameters
- demonstrate speed and reliability
- involve many problems (test sets?)
- involve many runs of each algorithm
What about testing accuracy?