Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization


Clément Royer, University of Wisconsin-Madison. Joint work with Stephen J. Wright. MOPTA, Bethlehem, Pennsylvania, USA, August 17, 2017.

Smooth nonconvex optimization

We consider an unconstrained smooth problem:
$$\min_{x \in \mathbb{R}^n} f(x).$$

Assumptions on f:
- f is bounded from below.
- f is twice continuously differentiable.
- f is not convex.

Optimality conditions

Second-order necessary point: $x^*$ satisfies the second-order necessary conditions if
$$\nabla f(x^*) = 0, \qquad \nabla^2 f(x^*) \succeq 0.$$

Basic paradigm: if $x$ is not a second-order necessary point, there exists $d$ such that
1. $d^\top \nabla f(x) < 0$: gradient-type direction, and/or
2. $d^\top \nabla^2 f(x)\, d < 0$: negative curvature direction, specific to nonconvex problems.

Motivation

Example: nonconvex formulation of low-rank matrix problems. For common classes of problems:
$$\min_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r}} f(UV^\top).$$
- Second-order necessary points are global minimizers (or close to them).
- Saddle points have negative curvature.

Renewed interest: second-order necessary points of nonconvex problems.
Needed: efficient algorithms.

Second-order complexity

Principle: for a given method and two tolerances $\epsilon_g, \epsilon_H \in (0,1)$, bound the worst-case cost of reaching $x_k$ such that
$$\|\nabla f(x_k)\| \le \epsilon_g, \qquad \lambda_k := \lambda_{\min}(\nabla^2 f(x_k)) \ge -\epsilon_H.$$
Focus: bound the dependencies on $\epsilon_g, \epsilon_H$.

Definition of cost? Best rates?

Existing complexity results

Nonconvex optimization literature:
- Classical cost: number of (expensive) iterations.
- Best methods: Newton-type frameworks.

Algorithm                 Bound
Classical trust region    $O\!\left(\max\{\epsilon_g^{-2}\epsilon_H^{-1},\ \epsilon_H^{-3}\}\right)$
Cubic regularization      $O\!\left(\max\{\epsilon_g^{-3/2},\ \epsilon_H^{-3}\}\right)$
TRACE trust region        $O\!\left(\max\{\epsilon_g^{-3/2},\ \epsilon_H^{-3}\}\right)$

Existing complexity results (2)

Learning/statistics community:
- Specific setting: $\epsilon_g = \epsilon$, $\epsilon_H = O(\sqrt{\epsilon})$.
- Best Newton-type bound: $O(\epsilon^{-3/2})$.
- Gradient-based methods have cheaper iterations.
- Cost measure: Hessian-vector products/gradient evaluations.

Algorithm                                             Bound
Gradient descent methods with random noise            $\tilde{O}(\epsilon^{-2})$
Accelerated gradient methods for nonconvex problems   $\tilde{O}(\epsilon^{-7/4})$

$\tilde{O}(\cdot)$: hides logarithmic factors. Results hold with high probability.

Our objective

Illustrate all the possible complexities...
- In terms of iterations, evaluations, etc.
- For arbitrary $\epsilon_g, \epsilon_H$.
- Deterministic and high-probability results.

...in a single framework:
- Based on line search.
- Matrix-free: only requires Hessian-vector products.
- Good complexity guarantees.

Outline

1. Our algorithm
2. Complexity analysis
3. Inexact variants

1. Our algorithm

Basic framework

Parameters: $x_0 \in \mathbb{R}^n$, $\theta \in (0,1)$, $\eta > 0$, $\epsilon_g \in (0,1)$, $\epsilon_H \in (0,1)$.

For k = 0, 1, 2, ...
1. Compute a search direction $d_k$.
2. Perform a backtracking line search to compute $\alpha_k = \theta^{j_k}$ such that
   $$f(x_k + \alpha_k d_k) < f(x_k) - \frac{\eta}{6}\,\alpha_k^3 \|d_k\|^3.$$
3. Set $x_{k+1} = x_k + \alpha_k d_k$.
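To make the loop concrete, here is a minimal Python sketch of the framework under the stated decrease condition. The helper names (direction, backtracking_step, max_backtracks) are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def backtracking_step(f, x, d, eta=0.1, theta=0.5, max_backtracks=60):
    """Find alpha = theta**j satisfying the cubic decrease condition
    f(x + alpha d) < f(x) - (eta/6) * alpha**3 * ||d||**3."""
    fx, dnorm = f(x), np.linalg.norm(d)
    alpha = 1.0
    for _ in range(max_backtracks):
        if f(x + alpha * d) < fx - (eta / 6.0) * alpha**3 * dnorm**3:
            return alpha
        alpha *= theta
    return alpha  # fallback; in theory the loop terminates much earlier

def line_search_method(f, direction, x0, max_iter=100, **kwargs):
    """Outer loop of the framework: direction(x) returns d_k (or None at an
    approximate second-order point), then a backtracking line search is run."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = direction(x)
        if d is None:          # (eps_g, eps_H)-point reached
            break
        alpha = backtracking_step(f, x, d, **kwargs)
        x = x + alpha * d
    return x
```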

Selecting the search direction $d_k$

Step 1: use gradient-related information. Compute
$$g_k = \nabla f(x_k), \qquad R_k = \frac{g_k^\top \nabla^2 f(x_k)\, g_k}{\|g_k\|^2}.$$
- If $R_k < -\epsilon_H$, set $d_k = R_k \dfrac{g_k}{\|g_k\|}$.
- Else if $R_k \in [-\epsilon_H, \epsilon_H]$ and $\|g_k\| > \epsilon_g$, set $d_k = -\dfrac{g_k}{\|g_k\|^{1/2}}$.
- Otherwise, perform Step 2.
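A small Python sketch of Step 1, assuming access to a gradient routine grad(x) and a Hessian-vector product routine hvp(x, v); returning None is just a convention here to signal that Step 2 is needed.

```python
import numpy as np

def step1_direction(grad, hvp, x, eps_g, eps_H):
    """Sketch of Step 1: try to build d_k from gradient information only.
    hvp(x, v) should return the Hessian-vector product (nabla^2 f(x)) v."""
    g = grad(x)
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return None                        # nothing to exploit at first order
    R = g @ hvp(x, g) / gnorm**2           # Rayleigh quotient of the gradient
    if R < -eps_H:
        return R * g / gnorm               # gradient is a negative curvature direction
    if -eps_H <= R <= eps_H and gnorm > eps_g:
        return -g / np.sqrt(gnorm)         # scaled steepest-descent step
    return None                            # defer to Step 2
```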

Selecting the search direction $d_k$ (2)

Step 2: use eigenvalue information. Compute an eigenpair $(v_k, \lambda_k)$ such that
$$\lambda_k = \lambda_{\min}(\nabla^2 f(x_k)), \quad \nabla^2 f(x_k)\, v_k = \lambda_k v_k, \quad v_k^\top g_k \le 0, \quad \|v_k\| = 1.$$
- Case $\lambda_k < -\epsilon_H$: $d_k = -\lambda_k v_k$;
- Case $\lambda_k > \epsilon_H$ (Newton step): $d_k = d_k^n$, where $\nabla^2 f(x_k)\, d_k^n = -g_k$;
- Case $\lambda_k \in [-\epsilon_H, \epsilon_H]$ (regularized Newton step): $d_k = d_k^r$, where $\left(\nabla^2 f(x_k) + 2\epsilon_H I\right) d_k^r = -g_k$.
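A matching sketch of Step 2 in its exact form, using a dense eigendecomposition and direct solves for simplicity (the method itself only assumes Hessian-vector products); grad and hess are assumed helpers, not the paper's code.

```python
import numpy as np

def step2_direction(grad, hess, x, eps_H):
    """Sketch of Step 2 (exact version): use eigenvalue information."""
    g = grad(x)
    H = hess(x)
    lam, V = np.linalg.eigh(H)            # eigenvalues in ascending order
    lam_min, v = lam[0], V[:, 0]
    if v @ g > 0:
        v = -v                            # enforce the sign convention v^T g <= 0
    if lam_min < -eps_H:
        return -lam_min * v               # negative curvature step, ||d|| = |lam_min|
    if lam_min > eps_H:
        return np.linalg.solve(H, -g)     # Newton step
    # |lam_min| <= eps_H: regularized Newton step
    return np.linalg.solve(H + 2 * eps_H * np.eye(H.shape[0]), -g)
```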

2. Complexity analysis

Assumptions and notations

Assumptions:
- The level set $\mathcal{L}_f(x_0) = \{x : f(x) \le f(x_0)\}$ is compact.
- f is twice continuously differentiable on an open set containing $\mathcal{L}_f(x_0)$, with Lipschitz continuous Hessian.

Notations:
- $L_H$: Lipschitz constant for $\nabla^2 f$.
- $f_{\mathrm{low}}$: lower bound on $\{f(x_k)\}$.
- $U_H$: upper bound on $\|\nabla^2 f(x_k)\|$.

Criterion

Approximate solution: $x_k$ is an $(\epsilon_g, \epsilon_H)$-point if
$$\min\{\|g_k\|, \|g_{k+1}\|\} \le \epsilon_g, \qquad \lambda_k \ge -\epsilon_H.$$

Other possibilities:
- Remove gradient directions and use $\|g_{k+1}\|$: no cheap gradient steps.
- Add a stopping criterion and use $\|g_k\|$: no global/local convergence.

Analysis of the method

Key principle: bound the decrease produced at every step while an $(\epsilon_g, \epsilon_H)$-point has not been reached.

Five possible directions:
- Two ways of scaling $g_k$: by its (negative) curvature, or by its norm;
- Negative eigenvector;
- Newton step;
- Regularized Newton step.

One proof technique, typical of backtracking line search:
- If the unit step is accepted, guaranteed decrease;
- Otherwise, lower bound on the accepted step size.

Example: when $d_k = -g_k / \|g_k\|^{1/2}$

In that case:
$$\frac{g_k^\top \nabla^2 f(x_k)\, g_k}{\|g_k\|^2} \in [-\epsilon_H, \epsilon_H], \qquad \|g_k\| > \epsilon_g.$$

Unit step accepted:
$$f(x_k) - f(x_{k+1}) \ge \frac{\eta}{6}\|d_k\|^3 \ge \frac{\eta}{6}\,\epsilon_g^{3/2}.$$

Unit step rejected: by Taylor expansion, there exists a step $\alpha_k = \theta^{j_k}$ that is accepted such that
$$\theta^{j_k} \ge \theta \min\left\{ \left(\frac{3}{L_H + \eta}\right)^{1/2},\ \epsilon_g^{1/2}\epsilon_H^{-1} \right\},$$
so the line search terminates and
$$f(x_k) - f(x_{k+1}) \ge \frac{\eta}{6}\,\alpha_k^3\|d_k\|^3 \ge O\!\left(\epsilon_g^{3}\epsilon_H^{-3}\right).$$

Final decrease:
$$f(x_k) - f(x_{k+1}) \ge c_g \min\left\{ \epsilon_g^{3}\epsilon_H^{-3},\ \epsilon_g^{3/2} \right\}.$$
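For completeness, a sketch of the rejected-step argument behind the lower bound on $\theta^{j_k}$, written out under the Lipschitz-Hessian assumption:

```latex
% If the decrease condition fails at trial step alpha, combining it with the
% Lipschitz-Hessian Taylor bound gives
\[
f(x_k) - \tfrac{\eta}{6}\alpha^3\|d_k\|^3
  \;\le\; f(x_k + \alpha d_k)
  \;\le\; f(x_k) + \alpha\, g_k^\top d_k
     + \tfrac{\alpha^2}{2}\, d_k^\top \nabla^2 f(x_k)\, d_k
     + \tfrac{L_H}{6}\,\alpha^3 \|d_k\|^3 .
\]
% With d_k = -g_k/\|g_k\|^{1/2}: g_k^\top d_k = -\|g_k\|^{3/2},
% d_k^\top \nabla^2 f(x_k) d_k \le \epsilon_H \|g_k\|, and \|d_k\|^3 = \|g_k\|^{3/2}.
% Dividing by \alpha\|g_k\|^{3/2} and rearranging yields
\[
\frac{\alpha\,\epsilon_H}{2\,\|g_k\|^{1/2}} + \frac{L_H + \eta}{6}\,\alpha^2 \;\ge\; 1,
\]
% so at least one of the two terms is >= 1/2, forcing
% alpha >= min{ \|g_k\|^{1/2}\epsilon_H^{-1}, (3/(L_H+\eta))^{1/2} },
% which gives the stated lower bound on the accepted step \theta^{j_k}.
```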

Decrease bound

General decrease lemma: if at the k-th iteration an $(\epsilon_g, \epsilon_H)$-point has not been reached, then
$$f(x_k) - f(x_{k+1}) \ge c \min\left\{ \epsilon_g^{3/2},\ \epsilon_H^3,\ \epsilon_g^3\epsilon_H^{-3},\ \phi(\epsilon_g, \epsilon_H)^3 \right\},$$
where
$$\phi(\epsilon_g, \epsilon_H) = L_H^{-1}\epsilon_H\left(-2 + \sqrt{4 + 2 L_H \epsilon_g/\epsilon_H^2}\right)$$
and c depends on $L_H, \eta, \theta$.

Iteration complexity

Iteration complexity bound: the method reaches an $(\epsilon_g, \epsilon_H)$-point in at most
$$\frac{f_0 - f_{\mathrm{low}}}{c}\,\max\left\{ \epsilon_g^{-3/2},\ \epsilon_H^{-3},\ \epsilon_g^{-3}\epsilon_H^{3},\ \phi(\epsilon_g, \epsilon_H)^{-3} \right\}$$
iterations.

Specific rates:
- $\epsilon_g = \epsilon$, $\epsilon_H = \sqrt{\epsilon}$: $O(\epsilon^{-3/2})$.
- $\epsilon_g = \epsilon_H = \epsilon$: $O(\epsilon^{-3})$.

Optimal bounds for Newton-type methods.
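The iteration bound follows from the decrease lemma by a standard telescoping argument, sketched below:

```latex
% Summing the per-iteration decrease over the first K iterations during which
% no (eps_g, eps_H)-point has been reached and telescoping,
\[
f(x_0) - f_{\mathrm{low}}
  \;\ge\; \sum_{k=0}^{K-1}\bigl(f(x_k) - f(x_{k+1})\bigr)
  \;\ge\; K\, c \min\left\{ \epsilon_g^{3/2},\ \epsilon_H^3,\ \epsilon_g^3\epsilon_H^{-3},\ \phi(\epsilon_g,\epsilon_H)^3 \right\},
\]
% so K is at most (f_0 - f_low)/c times the corresponding maximum of inverse factors.
```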

Function evaluation complexity

- #Iterations = #gradient/#Hessian evaluations.
- #Iterations <= #function evaluations.

Line-search iterations: if $x_k$ is not an $(\epsilon_g, \epsilon_H)$-point, the line search takes at most
$$O\!\left(\left|\log_\theta\!\left(\min\{\epsilon_g^{1/2}\epsilon_H^{-1},\ \epsilon_H^{2}\}\right)\right|\right)$$
iterations.

Evaluation complexity bound: the method reaches an $(\epsilon_g, \epsilon_H)$-point in at most
$$\tilde{O}\!\left(\max\left\{\epsilon_g^{-3/2},\ \epsilon_H^{-3},\ \epsilon_g^{-3}\epsilon_H^{3},\ \phi(\epsilon_g, \epsilon_H)^{-3}\right\}\right)$$
function evaluations.

3. Inexact variants

Motivation

Algorithmic cost: the method should be matrix-free. We use matrix-related operations:
- Linear system solves;
- Eigenvalue/eigenvector computations.

Inexactness: perform the matrix operations inexactly. Main cost unit: matrix-vector products/gradient evaluations.

Conjugate gradient for linear systems

We solve systems of the form $Hd = -g$, with $H \succeq \epsilon_H I$.

Conjugate Gradient (CG): we apply the conjugate gradient algorithm with stopping criterion
$$\|Hd + g\| \le \frac{\xi}{2}\min\{\|g\|,\ \epsilon_H\|d\|\}, \qquad \xi \in (0,1).$$
If $\kappa = \lambda_{\max}(H)/\lambda_{\min}(H)$, the CG method will find such a vector in at most
$$\min\left\{n,\ \tfrac{1}{2}\sqrt{\kappa}\,\log\!\left(4\kappa^{5/2}/\xi\right)\right\} = \min\left\{n,\ O\!\left(\sqrt{\kappa}\log(\kappa/\xi)\right)\right\}$$
matrix-vector products.
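A minimal Python sketch of CG with this stopping test, assuming a Hessian-vector product oracle hvp(v); it is a plain CG recursion written for illustration, not the paper's exact implementation.

```python
import numpy as np

def cg_inexact_newton(hvp, g, eps_H, xi=0.5, max_iter=None):
    """Sketch of CG for H d = -g with the relative stopping test
    ||H d + g|| <= (xi/2) * min(||g||, eps_H * ||d||).
    hvp(v) should return H v (a Hessian-vector product)."""
    n = g.shape[0]
    max_iter = n if max_iter is None else max_iter
    d = np.zeros(n)
    r = g.copy()                 # residual H d + g, since d = 0 initially
    p = -r
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= 0.5 * xi * min(np.linalg.norm(g),
                                         eps_H * np.linalg.norm(d)):
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)    # assumes H is (sufficiently) positive definite
        d += alpha * p
        r += alpha * Hp
        rs_new = r @ r
        p = -r + (rs_new / rs) * p
        rs = rs_new
    return d
```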

Lanczos for eigenvalue computation

- Lanczos method to compute a minimum eigenvector.
- Can fail if deterministic: use a random start.
- Standard results hold for matrices $A \succeq 0$: change (shift) the Hessian.

Lanczos iterations: let $H \in \mathbb{R}^{n \times n}$ be symmetric with $\|H\| \le U_H$, $\epsilon > 0$, $\delta \in (0,1)$. With probability at least $1 - \delta$, the Lanczos procedure applied to $U_H I - H$ outputs a vector v such that
$$v^\top H v \le \lambda_{\min}(H) + \epsilon$$
in at most
$$\min\left\{n,\ \frac{\ln(n/\delta^2)}{2\sqrt{2}}\sqrt{\frac{U_H}{\epsilon}}\right\}$$
iterations/matrix-vector products.
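A compact Python sketch of Lanczos with a random start for estimating the minimum eigenpair from Hessian-vector products. It operates directly on H rather than on the shifted matrix used in the analysis, adds full reorthogonalization for numerical stability, and its helper names and iteration cap are assumptions.

```python
import numpy as np

def lanczos_min_eig(hvp, n, num_iters=50, seed=0):
    """Estimate the smallest eigenvalue and a unit eigenvector of a symmetric
    operator available only through products hvp(v) = H v."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q = [q]                                   # Lanczos basis vectors
    alphas, betas = [], []
    for _ in range(min(num_iters, n)):
        w = hvp(Q[-1])
        alpha = Q[-1] @ w                     # Rayleigh quotient of current basis vector
        alphas.append(alpha)
        for qj in Q:                          # full reorthogonalization (stability)
            w -= (qj @ w) * qj
        beta = np.linalg.norm(w)
        if beta < 1e-10:                      # invariant subspace found
            break
        betas.append(beta)
        Q.append(w / beta)
    m = len(alphas)
    T = np.diag(alphas)
    if m > 1:
        T += np.diag(betas[:m - 1], 1) + np.diag(betas[:m - 1], -1)
    theta, S = np.linalg.eigh(T)              # Ritz values/vectors of the tridiagonal T
    v = np.column_stack(Q[:m]) @ S[:, 0]      # Ritz vector for the smallest Ritz value
    return theta[0], v / np.linalg.norm(v)
```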

Selecting the search direction $d_k$ - inexact version

Step 1: use gradient-related information. Compute
$$g_k = \nabla f(x_k), \qquad R_k = \frac{g_k^\top \nabla^2 f(x_k)\, g_k}{\|g_k\|^2}.$$
- If $R_k < -\epsilon_H$, set $d_k = R_k \dfrac{g_k}{\|g_k\|}$.
- Else if $R_k \in [-\epsilon_H, \epsilon_H]$ and $\|g_k\| > \epsilon_g$, set $d_k = -\dfrac{g_k}{\|g_k\|^{1/2}}$.
- Otherwise, perform the Inexact Step 2.

Selecting the direction $d_k$ - inexact version (2)

Inexact Step 2: use (inexact) eigenvalue information. Compute an approximate eigenpair $(v_k^i, \lambda_k^i)$ such that, with probability $1 - \delta$,
$$\lambda_k^i = [v_k^i]^\top \nabla^2 f(x_k)\, v_k^i \le \lambda_k + \frac{\epsilon_H}{2}, \qquad [v_k^i]^\top g_k \le 0, \qquad \|v_k^i\| = 1.$$
- Case $\lambda_k^i < -\frac{1}{2}\epsilon_H$: $d_k = -\lambda_k^i v_k^i$;
- Case $\lambda_k^i > \frac{3}{2}\epsilon_H$ (inexact Newton): use CG to obtain $d_k = d_k^{in}$ with
  $$\left\|\nabla^2 f(x_k)\, d_k^{in} + g_k\right\| \le \frac{\xi}{2}\min\left\{\|g_k\|,\ \epsilon_H\|d_k^{in}\|\right\};$$
- Case $\lambda_k^i \in [-\frac{1}{2}\epsilon_H, \frac{3}{2}\epsilon_H]$ (inexact regularized Newton): use CG to obtain $d_k = d_k^{ir}$ with
  $$\left\|\left[\nabla^2 f(x_k) + 2\epsilon_H I\right] d_k^{ir} + g_k\right\| \le \frac{\xi}{2}\min\left\{\|g_k\|,\ \epsilon_H\|d_k^{ir}\|\right\}.$$
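A sketch wiring the Inexact Step 2 together, reusing the lanczos_min_eig and cg_inexact_newton sketches above; the scaling of the negative curvature step mirrors the exact case, and all helper names are assumptions rather than the paper's code.

```python
import numpy as np

def inexact_step2_direction(grad, hvp, x, eps_H, xi=0.5, num_lanczos=50):
    """Sketch of the Inexact Step 2: approximate eigenpair via Lanczos,
    then a negative curvature, inexact Newton, or regularized Newton step."""
    g = grad(x)
    n = g.shape[0]
    Hv = lambda v: hvp(x, v)
    lam_i, v = lanczos_min_eig(Hv, n, num_iters=num_lanczos)
    if v @ g > 0:
        v = -v                                # enforce v^T g <= 0
    if lam_i < -0.5 * eps_H:
        return -lam_i * v                     # (approximate) negative curvature step
    if lam_i > 1.5 * eps_H:
        return cg_inexact_newton(Hv, g, eps_H, xi)      # inexact Newton step
    # lam_i in [-eps_H/2, 3*eps_H/2]: inexact regularized Newton step
    Hreg = lambda p: Hv(p) + 2 * eps_H * p
    return cg_inexact_newton(Hreg, g, eps_H, xi)
```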

Complexity analysis of the inexact method

Identical reasoning: 5 steps, 1 proof.
- Using Lanczos with a random start, the negative curvature decrease only holds with probability $1 - \delta$.
- With CG, the inexact Newton and regularized Newton steps give slightly different formulas.

Decrease lemma: for any iteration k, if $x_k$ is not an $(\epsilon_g, \epsilon_H)$-point,
$$f(x_k) - f(x_{k+1}) \ge \hat{c}\,\min\left\{ \epsilon_g^3\epsilon_H^{-3},\ \epsilon_g^{3/2},\ \epsilon_H^3,\ \phi\!\left(\epsilon_g, \tfrac{\xi}{2}\epsilon_H\right)^{3},\ \phi\!\left(\epsilon_g, \tfrac{4+\xi}{2}\epsilon_H\right)^{3} \right\}$$
with probability at least $1 - \delta$, where $\hat{c}$ only depends on $L_H, \eta, \theta$.

Complexity results

Iteration complexity: an $(\epsilon_g, \epsilon_H)$-point is reached in at most
$$\hat{K} := \frac{f_0 - f_{\mathrm{low}}}{\hat{c}}\,\max\left\{ \epsilon_g^{-3}\epsilon_H^{3},\ \epsilon_g^{-3/2},\ \epsilon_H^{-3},\ \phi\!\left(\epsilon_g, \tfrac{\xi}{2}\epsilon_H\right)^{-3},\ \phi\!\left(\epsilon_g, \tfrac{4+\xi}{2}\epsilon_H\right)^{-3} \right\}$$
iterations, with probability at least $1 - \hat{K}\delta$.

Cost complexity: the number of Hessian-vector products or gradient evaluations needed to reach an $(\epsilon_g, \epsilon_H)$-point is at most
$$\min\left\{ n,\ O\!\left(U_H^{1/2}\epsilon_H^{-1/2}\log(\epsilon_H^{-1}/\xi)\right),\ O\!\left(U_H^{1/2}\epsilon_H^{-1/2}\log(n/\delta^2)\right) \right\}\,\hat{K},$$
with probability at least $1 - \hat{K}\delta$.

Complexity results (simplied) Setting: ɛ g = ɛ, ɛ H = ɛ. An (ɛ, ɛ)-point is reached in at most O(ɛ 3 2 ) iterations, ( ) Õ ɛ 7 4 Hessian-vector products/gradient evaluations, with probability 1 O(ɛ 3 2 δ). Complexity of second order line search 30

Complexity results (simplied) Setting: ɛ g = ɛ, ɛ H = ɛ. An (ɛ, ɛ)-point is reached in at most O(ɛ 3 2 ) iterations, ( ) Õ ɛ 7 4 Hessian-vector products/gradient evaluations, with probability 1 O(ɛ 3 2 δ). Setting δ = 0 gives results in probability 1: Iterations: O(ɛ 3 2 ). Hessian-vector/gradients: O ( ) nɛ 3 2. Complexity of second order line search 30

Summary

Our proposal:
- A class of second-order line-search methods.
- Best known complexity guarantees.
- Features gradient steps and inexactness.
- Can be implemented matrix-free.

For more details: "Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization", C. W. Royer and S. J. Wright, arXiv:1706.03131. Also contains local convergence results.

Follow-up

Perspectives:
- Numerical testing of our class of methods.
- Extension to constrained problems.

Thank you for your attention! croyer2@wisc.edu