Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization
1 Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization
Frank E. Curtis, Lehigh University
Beyond Convexity Workshop, Oaxaca, Mexico, 26 October 2017
Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization (1 of 33)
2 Outline
- Motivation
- Trust Region vs. Cubic Regularization
- Complexity Bounds
- Summary/Questions
4 Unconstrained nonconvex optimization
Given $f : \mathbb{R}^n \to \mathbb{R}$, consider the unconstrained optimization problem
$$\min_{x \in \mathbb{R}^n} f(x).$$
What type of method should one use? First-order or second-order? One motivation for second-order: improved complexity bounds.
Main message of this talk: For nonconvex optimization, complexity bounds should be taken with a grain of salt (for now).
Parting words: There are other, better motivations for second-order methods.
7 Methods of interest
Trust region methods
- Decades of algorithmic development: Levenberg (1944); Marquardt (1963); Powell (1970); many more!
- Global convergence, local quadratic rate when $\nabla^2 f(x_*) \succ 0$
- $O(\epsilon^{-2})$ complexity to first-order $\epsilon$-criticality, $O(\epsilon^{-3})$ to second-order
Cubic regularization methods
- Relatively recent algorithmic development; fewer variants: Griewank (1981); Nesterov & Polyak (2006); Cartis, Gould, & Toint (2011)
- Global convergence, local quadratic rate when $\nabla^2 f(x_*) \succ 0$
- $O(\epsilon^{-3/2})$ complexity to first-order $\epsilon$-criticality, $O(\epsilon^{-3})$ to second-order
Theoretical guarantees used to assess a nonconvex optimization algorithm:
- Global convergence, i.e., $\nabla f(x_k) \to 0$ and maybe $\liminf_k \min(\mathrm{eig}(\nabla^2 f(x_k))) \ge 0$
- Local convergence rate, i.e., $\|\nabla f(x_{k+1})\|_2 / \|\nabla f(x_k)\|_2 \to 0$ (or faster)
- Worst-case complexity, i.e., an upper bound on the number of iterations (or function evaluations, subproblem solves, etc.) required to achieve $\|\nabla f(x_k)\|_2 \le \epsilon$ and perhaps $\min(\mathrm{eig}(\nabla^2 f(x_k))) \ge -\epsilon$ for some $\epsilon > 0$
10 Theory vs. practice
Researchers have been gravitating to adopt and build on cubic regularization:
- Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017)
- Carmon, Duchi (2017)
- Kohler, Lucchi (2017)
- Peng, Roosta-Khorasani, Mahoney (2017)
However, there remains a large gap between theory and practice! There is little evidence that cubic regularization methods offer improved performance:
- Trust region (TR) methods remain the state-of-the-art.
- TR-like methods can achieve the same complexity guarantees.
11 Trust region methods with optimal complexity
12 Closer look
Let's understand these complexity bounds, then look closely at them. I'll refer to three algorithms:
- ttr: Traditional Trust Region algorithm
- arc: Adaptive Regularisation algorithm using Cubics, Cartis, Gould, & Toint (2011)
- trace: Trust Region Algorithm with Contractions and Expansions, Curtis, Robinson, & Samadi (2017)
13 Outline
- Motivation
- Trust Region vs. Cubic Regularization
- Complexity Bounds
- Summary/Questions
14 Algorithm basics (and subproblem solution correspondence)
ttr:
1: Solve to compute $s_k$:
$$\min_{s \in \mathbb{R}^n} q_k(s) := f_k + g_k^T s + \tfrac{1}{2} s^T H_k s \quad \text{s.t.} \quad \|s\|_2 \le \delta_k \quad (\text{dual: } \lambda_k)$$
2: Compute ratio: $\rho_k^q \gets \frac{f_k - f(x_k + s_k)}{f_k - q_k(s_k)}$
3: Update radius: if $\rho_k^q \ge \eta$, accept and increase $\delta_k$; if $\rho_k^q < \eta$, reject and decrease $\delta_k$
arc:
1: Solve to compute $s_k$:
$$\min_{s \in \mathbb{R}^n} c_k(s) := f_k + g_k^T s + \tfrac{1}{2} s^T H_k s + \tfrac{\sigma_k}{3} \|s\|_2^3$$
2: Compute ratio: $\rho_k^c \gets \frac{f_k - f(x_k + s_k)}{f_k - c_k(s_k)}$
3: Update regularization: if $\rho_k^c \ge \eta$, accept and decrease $\sigma_k$; if $\rho_k^c < \eta$, reject and increase $\sigma_k$
The subproblem solutions correspond via $\sigma_k = \lambda_k / \delta_k$ with $\delta_k = \|s_k\|_2$.
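To make the parallel between the two frameworks concrete, here is a minimal one-dimensional sketch (not the authors' implementations) of both accept/reject loops on the nonconvex function $f(x) = \tfrac{1}{4}x^4 - \tfrac{1}{2}x^2$. The scalar subproblems are solved exactly; the update factors, tolerances, and the use of the quadratic-model decrease in both ratios are simplifying assumptions made for illustration.

```python
import math

def f(x):  return 0.25 * x**4 - 0.5 * x**2   # nonconvex test function
def g(x):  return x**3 - x                   # gradient
def h(x):  return 3.0 * x**2 - 1.0           # Hessian (a scalar in 1-D)

def tr_step(gk, hk, delta):
    """Exact 1-D trust-region step: minimize gk*s + 0.5*hk*s^2 over |s| <= delta."""
    if hk > 0.0 and abs(gk / hk) <= delta:
        return -gk / hk                      # interior Newton step (dual variable = 0)
    return -math.copysign(delta, gk)         # otherwise the minimizer is on the boundary

def arc_step(gk, hk, sigma):
    """Exact 1-D cubic-regularization step: minimize gk*s + 0.5*hk*s^2 + (sigma/3)*|s|^3."""
    r = (-hk + math.sqrt(hk * hk + 4.0 * sigma * abs(gk))) / (2.0 * sigma)
    return -math.copysign(r, gk)             # step of length r, directed against the gradient

def minimize(x, step, param, on_accept, on_reject, eta=0.1, tol=1e-6, max_iter=200):
    """Shared accept/reject loop; only the step rule and the parameter update differ."""
    for _ in range(max_iter):
        gk, hk = g(x), h(x)
        if abs(gk) <= tol:
            break
        s = step(gk, hk, param)
        decrease_pred = -(gk * s + 0.5 * hk * s * s)     # quadratic-model decrease (> 0 when gk != 0)
        rho = (f(x) - f(x + s)) / decrease_pred
        if rho >= eta:
            x, param = x + s, max(param * on_accept, 1e-4)   # accept the step
        else:
            param = max(param * on_reject, 1e-4)             # reject; keep x, adjust the parameter
    return x

# ttr: param is the radius delta (grow on accept, shrink on reject);
# arc: param is the coefficient sigma (shrink on accept, grow on reject).
x_tr  = minimize(2.0, tr_step,  param=1.0, on_accept=2.0, on_reject=0.5)
x_arc = minimize(2.0, arc_step, param=1.0, on_accept=0.5, on_reject=2.0)
```

Both runs drive the gradient below the tolerance at the local minimizer $x = 1$; as on the slide, the frameworks differ only in which quantity ($\delta_k$ or $\sigma_k$) is adapted.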
16 Discussion
What are the similarities?
- The algorithmic frameworks are almost identical.
- There is a one-to-one correspondence (except when $\lambda_k = 0$) between subproblem solutions.
What are the key differences?
- step acceptance criteria
- trust region vs. regularization coefficient updates
Recall that a solution $s_k$ of the TR subproblem is also a solution of
$$\min_{s \in \mathbb{R}^n} f_k + g_k^T s + \tfrac{1}{2} s^T (H_k + \lambda_k I) s,$$
so the dual variable $\lambda_k$ can be viewed as a quadratic regularization coefficient.
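This correspondence can be checked numerically. The one-dimensional data below (gradient, curvature, radius) are illustrative values chosen for the check, not numbers from the talk: with negative curvature the TR solution lies on the boundary, and the dual variable turns the step into the unconstrained minimizer of a convex regularized model.

```python
# Illustrative 1-D data: gradient 1, curvature -1 (nonconvex), radius 0.5.
gk, Hk, delta = 1.0, -1.0, 0.5

# With Hk < 0, the exact minimizer of gk*s + 0.5*Hk*s^2 over |s| <= delta
# lies on the boundary, opposite the sign of the gradient:
s = -delta if gk > 0 else delta

# Dual variable from the stationarity condition (Hk + lam) * s = -gk:
lam = -gk / s - Hk

# KKT conditions: lam >= max{0, -Hk} (so that Hk + lam >= 0) and |s| = delta.
assert lam >= max(0.0, -Hk)
assert abs(abs(s) - delta) < 1e-12

# The same s is the unconstrained minimizer of the quadratically
# regularized model gk*s + 0.5*(Hk + lam)*s^2:
s_reg = -gk / (Hk + lam)
assert abs(s_reg - s) < 1e-12
```

Here the regularized curvature $H_k + \lambda_k = 2 > 0$ even though $H_k < 0$, which is exactly why $\lambda_k$ behaves like a regularization coefficient.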
18 Regularization/stepsize trade-off: ttr
[Figure: curve of the step norm $\|s_k(\delta)\|_2\ (= \delta)$ versus the dual variable $\lambda_k(\delta)$, which ranges over $[\max\{0, -\min(\mathrm{eig}(H_k))\}, \infty)$.]
At a given iterate $x_k$, the curve illustrates the dual variable (i.e., the quadratic regularization) and the norm of the corresponding step as a function of the TR radius.
After a rejected step (i.e., with $x_{k+1} = x_k$), we set $\delta_{k+1} \gets \gamma \delta_k$ (a linear rate of decrease) while $\lambda_{k+1} > \lambda_k$.
In fact, the increase in the dual can be quite severe in some cases! (We have no direct control over this.)
21 Intuition, please!
Intuitively, what is so important about the ratio $\lambda_k / \delta_k$ (with $\delta_k = \|s_k\|_2$)?
- Large $\delta_k$ implies $s_k$ may not yield objective decrease; small $\delta_k$ prohibits long steps.
- Small $\lambda_k$ suggests the TR is not restricting us too much; large $\lambda_k$ suggests more objective decrease is possible.
So what is so bad (for complexity's sake) with the following?
$$\lambda_k / \delta_k \approx 0 \quad \text{and} \quad \lambda_{k+1} / \delta_{k+1} \gg 0.$$
It's that we may go from a large, but unproductive step to a productive, but (too) short step!
22 arc magic
So what's the magic of arc?
It's not the types of steps you compute (since the TR subproblem gives the same).
It's that a simple update for $\sigma_k$ gives a good regularization/stepsize balance.
In arc, restricting $\sigma_k \ge \sigma_{\min}$ for all $k$, and proving that $\sigma_k \le \sigma_{\max}$ for all $k$, ensures that all accepted steps satisfy
$$f_k - f_{k+1} \ge c_1 \sigma_{\min} \|s_k\|_2^3 \quad \text{and} \quad \|s_k\|_2 \ge \left(\frac{c_2}{\sigma_{\max} + c_3}\right)^{1/2} \|g_{k+1}\|_2^{1/2}.$$
One can also show that, at any point, the number of rejected steps that can occur consecutively is bounded above by a constant (independent of $k$ and $\epsilon$).
It is important to note that arc always has the regularization on.
24 Regularization/stepsize trade-off: arc
[Figure: the curve of $\|s_k(\sigma)\|_2$ versus $\lambda_k(\sigma)$, intersected by a dashed line through the origin with slope $1/\sigma_k$.]
All points on the dashed line yield the same ratio $\sigma = \lambda / \|s\|_2$, so, given $\sigma_k$, the properties of $s_k$ are determined by the intersection of the dashed line and the curve.
A sequence of rejected steps follows the curve much differently than ttr: the slopes $1/\sigma_{k+1}, 1/\sigma_{k+2}, \ldots$ decrease, and, in particular, for sufficiently large $\sigma$, the rate of decrease in $\|s\|$ is sublinear.
27 From ttr to trace
trace involves three key modifications of ttr:
1: A different step acceptance ratio
2: A new expansion step: may reject a step while increasing the TR radius
3: A new contraction procedure: explicit or implicit (through an update of $\lambda$)
With these, we obtain the same complexity properties as arc.
We also recently proposed and analyzed a framework that generalizes both... and allows for inexact subproblem solutions (maintaining complexity):
$$\min_{(s,\lambda) \in \mathbb{R}^n \times \mathbb{R}} f_k + g_k^T s + \tfrac{1}{2} s^T (H_k + \lambda I) s \quad \text{s.t.} \quad \sigma_k^L \|s\|_2 \le \lambda \le \sigma_k^U \|s\|_2
$$
28 Outline
- Motivation
- Trust Region vs. Cubic Regularization
- Complexity Bounds
- Summary/Questions
29 Complexity for arc and trace
Suppose that $f$ is:
- twice continuously differentiable
- bounded below by $f_{\min}$
- with $g$ and $H$ both Lipschitz continuous (constants $L_g$ and $L_H$)
Combine the inequalities
$$f_k - f_{k+1} \ge \eta \|s_k\|_2^3 \quad \text{and} \quad \|s_k\|_2 \ge (L_H + \sigma_{\max})^{-1/2} \|g_{k+1}\|_2^{1/2}.$$
Then the cardinality of the set $\{k \in \mathbb{N} : \|g_k\|_2 > \epsilon\}$ is bounded above by
$$\left(\frac{f_0 - f_{\min}}{\eta}\right) (L_H + \sigma_{\max})^{3/2} \epsilon^{-3/2}.$$
But these bounds are very pessimistic...
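As a sanity check on how this bound scales, a tiny helper (with arbitrarily chosen constants, not values from the talk) confirms the $\epsilon^{-3/2}$ behavior: shrinking $\epsilon$ by a factor of 4 inflates the bound by $4^{3/2} = 8$.

```python
def arc_trace_bound(f0, f_min, eta, L_H, sigma_max, eps):
    """Upper bound on the number of iterates with ||g_k|| > eps for arc/trace."""
    return (f0 - f_min) / eta * (L_H + sigma_max) ** 1.5 * eps ** -1.5

# Arbitrary illustrative constants:
b1 = arc_trace_bound(f0=10.0, f_min=0.0, eta=0.25, L_H=1.0, sigma_max=4.0, eps=1e-2)
b2 = arc_trace_bound(f0=10.0, f_min=0.0, eta=0.25, L_H=1.0, sigma_max=4.0, eps=1e-2 / 4.0)
# The ratio b2 / b1 is 4**1.5 = 8, the epsilon^{-3/2} scaling.
```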
31 Simpler setting
Consider the iteration $x_{k+1} \gets x_k - \tfrac{1}{L} g_k$.
This type of complexity analysis considers the set
$$\mathcal{G}(\epsilon_g) := \{x \in \mathbb{R}^n : \|g(x)\|_2 > \epsilon_g\}$$
and aims to find an upper bound on the cardinality of
$$\mathcal{K}_g(\epsilon_g) := \{k \in \mathbb{N} : x_k \in \mathcal{G}(\epsilon_g)\}.$$
32 Upper bound on $|\mathcal{K}_g(\epsilon_g)|$
Using $s_k = -\tfrac{1}{L_g} g_k$ and the upper bound $f_{k+1} \le f_k + g_k^T s_k + \tfrac{L_g}{2} \|s_k\|_2^2$, one finds
$$f_k - f_{k+1} \ge \tfrac{1}{2 L_g} \|g_k\|_2^2 \implies f_0 - f_* \ge \tfrac{1}{2 L_g} |\mathcal{K}_g(\epsilon_g)|\, \epsilon_g^2 \implies |\mathcal{K}_g(\epsilon_g)| \le 2 L_g (f_0 - f_*)\, \epsilon_g^{-2}.$$
33 Nice f
But what if $f$ is nice?... e.g., satisfying the Polyak-Łojasiewicz (PL) condition (for some $c \in \mathbb{R}_{++}$), i.e.,
$$f(x) - f_* \le \tfrac{1}{2c} \|g(x)\|_2^2 \quad \text{for all } x \in \mathbb{R}^n.$$
Now consider the set
$$\mathcal{F}(\epsilon_f) := \{x \in \mathbb{R}^n : f(x) - f_* > \epsilon_f\}$$
and consider an upper bound on the cardinality of
$$\mathcal{K}_f(\epsilon_f) := \{k \in \mathbb{N} : x_k \in \mathcal{F}(\epsilon_f)\}.$$
34 Upper bound on $|\mathcal{K}_f(\epsilon_f)|$
Using $s_k = -\tfrac{1}{L_g} g_k$ and the upper bound $f_{k+1} \le f_k + g_k^T s_k + \tfrac{L_g}{2} \|s_k\|_2^2$, one finds
$$f_k - f_{k+1} \ge \tfrac{1}{2 L_g} \|g_k\|_2^2 \ge \tfrac{c}{L_g} (f_k - f_*)$$
$$\implies f_{k+1} - f_* \le \left(1 - \tfrac{c}{L_g}\right)(f_k - f_*) \implies f_k - f_* \le \left(1 - \tfrac{c}{L_g}\right)^k (f_0 - f_*)$$
$$\implies |\mathcal{K}_f(\epsilon_f)| \le \log\left(\frac{f_0 - f_*}{\epsilon_f}\right) \Big/ \log\left(\frac{L_g}{L_g - c}\right).$$
35 For the first step...
In the general nonconvex analysis, the expected decrease for the first step is much more pessimistic:
- general nonconvex: $f_0 - \tfrac{1}{2 L_g} \epsilon_g^2 \ge f_1$
- PL condition: $\left(1 - \tfrac{c}{L_g}\right)(f_0 - f_*) \ge f_1 - f_*$
... and it remains more pessimistic throughout!
36 Upper bounds on $|\mathcal{K}_f(\epsilon_f)|$ versus $|\mathcal{K}_g(\epsilon_g)|$
Let $f(x) = \tfrac{1}{2} x^2$, meaning that $g(x) = x$.
Let $\epsilon_f = \tfrac{1}{2} \epsilon_g^2$, meaning that $\mathcal{F}(\epsilon_f) = \mathcal{G}(\epsilon_g)$.
Let $x_0 = 10$, $c = 1$, and $L = 2$. (Similar pictures for any $L > 1$.)
37 Upper bounds on $|\mathcal{K}_f(\epsilon_f)|$ versus $|\{k \in \mathbb{N} : \tfrac{1}{2} \|g_k\|_2^2 > \epsilon_g\}|$
Let $f(x) = \tfrac{1}{2} x^2$, meaning that $\tfrac{1}{2} g(x)^2 = \tfrac{1}{2} x^2$.
Let $\epsilon_f = \epsilon_g$, meaning that $\mathcal{F}(\epsilon_f) = \mathcal{G}(\epsilon_g)$.
Let $x_0 = 10$, $c = 1$, and $L = 2$. (Similar pictures for any $L > 1$.)
38 Bad worst-case!
Worst-case complexity bounds in the general nonconvex case are very pessimistic.
The analysis immediately admits a large gap when the function is nice.
The essentially tight examples for the worst-case bounds are... weird. (Cartis, Gould, Toint (2010))
39 Outline
- Motivation
- Trust Region vs. Cubic Regularization
- Complexity Bounds
- Summary/Questions
40 Summary
We have seen that cubic regularization admits good complexity properties.
But we have also seen that tweaking trust region methods yields the same.
And the empirical performance of trust region remains better (I claim!).
These complexity bounds are extremely pessimistic:
- If your function is nice, these bounds are way off.
- How can we close the gap between theory and practice?
- Nesterov & Polyak (2006) consider different phases, but can this be generalized (for non-strongly convex)?
Complexity properties should probably not (yet) drive algorithm selection.
Then why second-order? Mitigate effects of ill-conditioning, easier to tune, parallel and distributed, etc.
Going back to the machine learning (ML) connection...
41 Optimization for ML: Beyond SG
[Schematic: starting from stochastic gradient, one can seek a better rate, a better constant, or both a better rate and a better constant.]
Bottou et al., "Optimization Methods for Large-Scale Machine Learning," SIAM Review (to appear)
43 Two-dimensional schematic of methods
[Schematic: one axis runs from stochastic gradient toward batch gradient (noise reduction); the other runs toward stochastic and batch Newton (second-order).]
Noise reduction methods: dynamic sampling, gradient aggregation, iterate averaging.
Second-order methods: diagonal scaling, natural gradient, Gauss-Newton, quasi-Newton, Hessian-free Newton.
Bottou et al., "Optimization Methods for Large-Scale Machine Learning," SIAM Review (to appear)
46 Fairly comparing algorithms
How should we compare optimization algorithms for machine learning? A fair comparison would...
- not ignore time spent tuning parameters
- demonstrate speed and reliability
- involve many problems (test sets?)
- involve many runs of each algorithm
What about testing accuracy?
More informationarxiv: v1 [cs.lg] 6 Sep 2018
arxiv:1809.016v1 [cs.lg] 6 Sep 018 Aryan Mokhtari Asuman Ozdaglar Ali Jadbabaie Laboratory for Information and Decision Systems Massachusetts Institute of Technology Cambridge, MA 0139 Abstract aryanm@mit.edu
More informationOptimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23
Optimization: Nonlinear Optimization without Constraints Nonlinear Optimization without Constraints 1 / 23 Nonlinear optimization without constraints Unconstrained minimization min x f(x) where f(x) is
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationAn Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal
An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve
More informationTaylor-like models in nonsmooth optimization
Taylor-like models in nonsmooth optimization Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with Ioffe (Technion), Lewis (Cornell), and Paquette (UW) SIAM Optimization 2017 AFOSR,
More informationA globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications
A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth
More informationOptimisation non convexe avec garanties de complexité via Newton+gradient conjugué
Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Clément Royer (Université du Wisconsin-Madison, États-Unis) Toulouse, 8 janvier 2019 Nonconvex optimization via Newton-CG
More informationKeywords: Nonlinear least-squares problems, regularized models, error bound condition, local convergence.
STRONG LOCAL CONVERGENCE PROPERTIES OF ADAPTIVE REGULARIZED METHODS FOR NONLINEAR LEAST-SQUARES S. BELLAVIA AND B. MORINI Abstract. This paper studies adaptive regularized methods for nonlinear least-squares
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More information8 Numerical methods for unconstrained problems
8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields
More informationLecture 17: October 27
0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationQuasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization
Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationAccelerating the cubic regularization of Newton s method on convex problems
Accelerating the cubic regularization of Newton s method on convex problems Yu. Nesterov September 005 Abstract In this paper we propose an accelerated version of the cubic regularization of Newton s method
More informationWorst Case Complexity of Direct Search
Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease
More informationNOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained
NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationParallel Coordinate Optimization
1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a
More informationChapter 3 Numerical Methods
Chapter 3 Numerical Methods Part 2 3.2 Systems of Equations 3.3 Nonlinear and Constrained Optimization 1 Outline 3.2 Systems of Equations 3.3 Nonlinear and Constrained Optimization Summary 2 Outline 3.2
More information5 Handling Constraints
5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest
More informationContents. Preface. 1 Introduction Optimization view on mathematical models NLP models, black-box versus explicit expression 3
Contents Preface ix 1 Introduction 1 1.1 Optimization view on mathematical models 1 1.2 NLP models, black-box versus explicit expression 3 2 Mathematical modeling, cases 7 2.1 Introduction 7 2.2 Enclosing
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationAn Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization
An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with Travis Johnson, Northwestern University Daniel P. Robinson, Johns
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationCORE 50 YEARS OF DISCUSSION PAPERS. Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable 2016/28
26/28 Globally Convergent Second-order Schemes for Minimizing Twicedifferentiable Functions YURII NESTEROV AND GEOVANI NUNES GRAPIGLIA 5 YEARS OF CORE DISCUSSION PAPERS CORE Voie du Roman Pays 4, L Tel
More informationGeometry optimization
Geometry optimization Trygve Helgaker Centre for Theoretical and Computational Chemistry Department of Chemistry, University of Oslo, Norway European Summer School in Quantum Chemistry (ESQC) 211 Torre
More informationConvex Optimization. Newton s method. ENSAE: Optimisation 1/44
Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)
More informationMethods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent
Nonlinear Optimization Steepest Descent and Niclas Börlin Department of Computing Science Umeå University niclas.borlin@cs.umu.se A disadvantage with the Newton method is that the Hessian has to be derived
More informationEvaluation complexity for nonlinear constrained optimization using unscaled KKT conditions and high-order models by E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos and Ph. L. Toint Report NAXYS-08-2015
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationUnconstrained minimization of smooth functions
Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and
More informationOn Lagrange multipliers of trust-region subproblems
On Lagrange multipliers of trust-region subproblems Ladislav Lukšan, Ctirad Matonoha, Jan Vlček Institute of Computer Science AS CR, Prague Programy a algoritmy numerické matematiky 14 1.- 6. června 2008
More informationZero-Order Methods for the Optimization of Noisy Functions. Jorge Nocedal
Zero-Order Methods for the Optimization of Noisy Functions Jorge Nocedal Northwestern University Simons Institute, October 2017 1 Collaborators Albert Berahas Northwestern University Richard Byrd University
More informationSub-Sampled Newton Methods I: Globally Convergent Algorithms
Sub-Sampled Newton Methods I: Globally Convergent Algorithms arxiv:1601.04737v3 [math.oc] 26 Feb 2016 Farbod Roosta-Khorasani February 29, 2016 Abstract Michael W. Mahoney Large scale optimization problems
More informationProgramming, numerics and optimization
Programming, numerics and optimization Lecture C-3: Unconstrained optimization II Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428
More information10. Unconstrained minimization
Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation
More informationImproving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationNon-convex optimization. Issam Laradji
Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective
More informationIteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact
Iteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact E. G. Birgin N. Krejić J. M. Martínez September 5, 017 Abstract In many cases in which one
More informationComplexity of gradient descent for multiobjective optimization
Complexity of gradient descent for multiobjective optimization J. Fliege A. I. F. Vaz L. N. Vicente July 18, 2018 Abstract A number of first-order methods have been proposed for smooth multiobjective optimization
More informationInterpolation-Based Trust-Region Methods for DFO
Interpolation-Based Trust-Region Methods for DFO Luis Nunes Vicente University of Coimbra (joint work with A. Bandeira, A. R. Conn, S. Gratton, and K. Scheinberg) July 27, 2010 ICCOPT, Santiago http//www.mat.uc.pt/~lnv
More informationGradient Methods Using Momentum and Memory
Chapter 3 Gradient Methods Using Momentum and Memory The steepest descent method described in Chapter always steps in the negative gradient direction, which is orthogonal to the boundary of the level set
More information