Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization

Clément Royer (University of Wisconsin-Madison)
Joint work with Stephen J. Wright
MOPTA, Bethlehem, Pennsylvania, USA - August 17, 2017
Smooth nonconvex optimization

We consider an unconstrained smooth problem:
$\min_{x \in \mathbb{R}^n} f(x)$.

Assumptions on $f$:
- $f$ bounded from below.
- $f$ twice continuously differentiable.
- $f$ is not convex.
Optimality conditions

Second-order necessary point: $x^*$ satisfies the second-order necessary conditions if
$\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$.

Basic paradigm: if $x$ is not a second-order necessary point, there exists $d$ such that
1. $d^\top \nabla f(x) < 0$: gradient-type direction;
and/or
2. $d^\top \nabla^2 f(x) d < 0$: negative curvature direction, specific to nonconvex problems.
A numerical check of these conditions is sketched below.
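As an illustration (ours, not from the slides), a minimal numerical check of approximate second-order necessity for a toy function whose gradient and Hessian are available in closed form; the function name and tolerances are hypothetical:

    import numpy as np

    def is_second_order_necessary(grad, hess, x, eps_g=1e-6, eps_H=1e-6):
        """Check approximate second-order necessary conditions at x:
        small gradient norm and (nearly) positive semidefinite Hessian."""
        g = grad(x)
        lam_min = np.linalg.eigvalsh(hess(x)).min()  # smallest Hessian eigenvalue
        return np.linalg.norm(g) <= eps_g and lam_min >= -eps_H

    # Toy example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
    grad = lambda x: np.array([2 * x[0], -2 * x[1]])
    hess = lambda x: np.diag([2.0, -2.0])
    print(is_second_order_necessary(grad, hess, np.zeros(2)))  # False: negative curvature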
Motivation

Example: nonconvex formulation of low-rank matrix problems. For common classes of problems:
$\min_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r}} f(U V^\top)$.
- Second-order necessary points are global minimizers (or close).
- Saddle points have negative curvature.

Renewed interest: second-order necessary points of nonconvex problems.
Needed: efficient algorithms.
Second-order complexity

Principle: for a given method and two tolerances $\epsilon_g, \epsilon_H \in (0, 1)$:
Objective: bound the worst-case cost of reaching $x_k$ such that
$\|\nabla f(x_k)\| \leq \epsilon_g$, $\quad \lambda_k = \lambda_{\min}(\nabla^2 f(x_k)) \geq -\epsilon_H$.
Focus: bound dependencies on $\epsilon_g, \epsilon_H$.

Definition of cost? Best rates?
Existing complexity results

Nonconvex optimization literature:
- Classical cost: number of (expensive) iterations.
- Best methods: Newton-type frameworks.

Algorithm                 Iteration bound
Classical trust region    $O(\max\{\epsilon_g^{-2}\epsilon_H^{-1},\ \epsilon_H^{-3}\})$
Cubic regularization      $O(\max\{\epsilon_g^{-3/2},\ \epsilon_H^{-3}\})$
TRACE trust region        $O(\max\{\epsilon_g^{-3/2},\ \epsilon_H^{-3}\})$
Existing complexity results (2)

Learning/statistics community:
- Specific setting: $\epsilon_g = \epsilon$, $\epsilon_H = O(\sqrt{\epsilon})$. Best Newton-type bound: $O(\epsilon^{-3/2})$.
- Gradient-based, cheaper iterations.
- Cost measure: Hessian-vector products/gradient evaluations.

Algorithm                                              Bound
Gradient descent methods with random noise             $\tilde{O}(\epsilon^{-2})$
Accelerated gradient methods for nonconvex problems    $\tilde{O}(\epsilon^{-7/4})$

$\tilde{O}(\cdot)$: hides logarithmic factors. Results hold with high probability.
Our objective

Illustrate all the possible complexities...
- In terms of iterations, evaluations, etc.
- For arbitrary $\epsilon_g, \epsilon_H$.
- Deterministic and high-probability results.

...in a single framework:
- Based on line search.
- Matrix-free: only requires Hessian-vector products.
- Good complexity guarantees.
Outline
1. Our algorithm
2. Complexity analysis
3. Inexact variants
1. Our algorithm
Basic framework

Parameters: $x_0 \in \mathbb{R}^n$, $\theta \in (0,1)$, $\eta > 0$, $\epsilon_g \in (0,1)$, $\epsilon_H \in (0,1)$.

For $k = 0, 1, 2, \dots$
1. Compute a search direction $d_k$.
2. Perform a backtracking line search to compute $\alpha_k = \theta^{j_k}$ such that
   $f(x_k + \alpha_k d_k) < f(x_k) - \frac{\eta}{6} \alpha_k^3 \|d_k\|^3$.
3. Set $x_{k+1} = x_k + \alpha_k d_k$.
A sketch of the backtracking step follows.
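A minimal sketch of the backtracking step (step 2 above) in Python, assuming $f$ is a callable and the direction $d$ has already been computed; the function name and the max_backtracks cap are ours, not the paper's:

    import numpy as np

    def backtracking(f, x, d, eta=0.1, theta=0.5, max_backtracks=50):
        """Find alpha = theta**j satisfying the cubic decrease condition
        f(x + alpha*d) < f(x) - (eta/6) * alpha^3 * ||d||^3."""
        fx, dnorm3 = f(x), np.linalg.norm(d) ** 3
        alpha = 1.0  # start from the unit step (j = 0)
        for _ in range(max_backtracks):
            if f(x + alpha * d) < fx - (eta / 6.0) * alpha ** 3 * dnorm3:
                return alpha
            alpha *= theta  # backtrack
        raise RuntimeError("line search failed (ruled out under the assumptions)")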
Selecting the search direction $d_k$

Step 1: use gradient-related information. Compute
$g_k = \nabla f(x_k)$, $\quad R_k = \frac{g_k^\top \nabla^2 f(x_k) g_k}{\|g_k\|^2}$.
- If $R_k < -\epsilon_H$, set $d_k = R_k \frac{g_k}{\|g_k\|}$ (a negative gradient direction scaled by the curvature $|R_k|$, since $R_k < 0$).
- Else if $R_k \in [-\epsilon_H, \epsilon_H]$ and $\|g_k\| > \epsilon_g$, set $d_k = -\frac{g_k}{\|g_k\|^{1/2}}$.
- Otherwise perform Step 2.
Selecting the search direction $d_k$ (2)

Step 2: use eigenvalue information. Compute an eigenpair $(v_k, \lambda_k)$ such that
$\lambda_k = \lambda_{\min}(\nabla^2 f(x_k))$, $\quad \nabla^2 f(x_k) v_k = \lambda_k v_k$, $\quad v_k^\top g_k \leq 0$, $\quad \|v_k\| = 1$.
- Case $\lambda_k < -\epsilon_H$: $d_k = -\lambda_k v_k$ (a positive multiple of $v_k$, since $\lambda_k < 0$);
- Case $\lambda_k > \epsilon_H$ (Newton step): $d_k = d_k^n$, where $\nabla^2 f(x_k) d_k^n = -g_k$;
- Case $\lambda_k \in [-\epsilon_H, \epsilon_H]$ (regularized Newton step): $d_k = d_k^r$, where $\left(\nabla^2 f(x_k) + 2\epsilon_H I\right) d_k^r = -g_k$.
A sketch combining Steps 1 and 2 follows.
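A minimal sketch of the direction selection (Steps 1 and 2), where exact dense linear algebra via NumPy stands in for whatever solver is used; assumes $g_k \neq 0$ on entry, and the helper name is ours:

    import numpy as np

    def select_direction(g, H, eps_g, eps_H):
        """Return a search direction following Steps 1-2 (exact version)."""
        R = g @ H @ g / (g @ g)              # curvature along the gradient
        if R < -eps_H:                       # Step 1: negative curvature along g
            return R * g / np.linalg.norm(g)
        if abs(R) <= eps_H and np.linalg.norm(g) > eps_g:
            return -g / np.sqrt(np.linalg.norm(g))   # scaled gradient step
        # Step 2: eigenvalue information
        lam, V = np.linalg.eigh(H)           # eigenvalues in ascending order
        lam_min, v = lam[0], V[:, 0]
        if v @ g > 0:
            v = -v                           # enforce v^T g <= 0
        if lam_min < -eps_H:
            return -lam_min * v              # scaled negative curvature direction
        if lam_min > eps_H:
            return np.linalg.solve(H, -g)    # Newton step
        return np.linalg.solve(H + 2 * eps_H * np.eye(len(g)), -g)  # regularized Newton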
2. Complexity analysis
Assumptions and notation

Assumptions:
- $\mathcal{L}_f(x_0) = \{x \mid f(x) \leq f(x_0)\}$ is compact.
- $f$ is twice continuously differentiable on an open set containing $\mathcal{L}_f(x_0)$, with Lipschitz continuous Hessian.

Notation:
- $L_H$: Lipschitz constant of $\nabla^2 f$.
- $f_{\mathrm{low}}$: lower bound on $\{f(x_k)\}$.
- $U_H$: upper bound on $\|\nabla^2 f(x_k)\|$.
Criterion

Approximate solution: $x_k$ is an $(\epsilon_g, \epsilon_H)$-point if
$\min\{\|g_k\|, \|g_{k+1}\|\} \leq \epsilon_g$, $\quad \lambda_k \geq -\epsilon_H$.

Other possibilities:
- Remove gradient directions and use $\|g_{k+1}\|$: but then no cheap gradient steps.
- Add a stopping criterion and use $\|g_k\|$: but then no global/local convergence guarantees.
Analysis of the method

Key principle: bound the decrease produced at every step while an $(\epsilon_g, \epsilon_H)$-point has not been reached.

Five possible directions:
- Two ways of scaling $g_k$: by its (negative) curvature; by its norm;
- Negative eigenvector;
- Newton step;
- Regularized Newton step.

One proof technique, typical of backtracking line search:
- If the unit step is accepted: guaranteed decrease;
- Otherwise: lower bound on the accepted step size.
Example: when $d_k = -g_k / \|g_k\|^{1/2}$

In that case: $\frac{g_k^\top \nabla^2 f(x_k) g_k}{\|g_k\|^2} \in [-\epsilon_H, \epsilon_H]$ and $\|g_k\| > \epsilon_g$.

Unit step accepted:
$f(x_k) - f(x_{k+1}) \geq \frac{\eta}{6} \|d_k\|^3 \geq \frac{\eta}{6} \epsilon_g^{3/2}$.

Unit step rejected: by a Taylor expansion, there exists an accepted step $\alpha_k = \theta^{j_k}$ with
$\theta^{j_k} \geq \theta \min\left\{ \frac{3}{L_H + \eta},\ \frac{1}{2}\, \epsilon_g^{1/2} \epsilon_H^{-1} \right\}$,
so the line search terminates with
$f(x_k) - f(x_{k+1}) \geq \frac{\eta}{6} \alpha_k^3 \|d_k\|^3$, a decrease of order $\epsilon_g^3 \epsilon_H^{-3}$.

Final decrease:
$f(x_k) - f(x_{k+1}) \geq c_g \min\left\{ \epsilon_g^3 \epsilon_H^{-3},\ \epsilon_g^{3/2} \right\}$.
Decrease bound

General decrease lemma: if at the $k$-th iteration an $(\epsilon_g, \epsilon_H)$-point has not been reached, then
$f(x_k) - f(x_{k+1}) \geq c \min\left\{ \epsilon_g^{3/2},\ \epsilon_H^3,\ \epsilon_g^3 \epsilon_H^{-3},\ \varphi(\epsilon_g, \epsilon_H)^3 \right\}$,
where
$\varphi(\epsilon_g, \epsilon_H) = L_H^{-1} \epsilon_H \left( -2 + \sqrt{4 + 2 L_H \epsilon_g / \epsilon_H^2} \right)$
and $c$ depends on $L_H$, $\eta$, $\theta$.
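To interpret the lemma, it helps to note the two regimes of $\varphi$ implied by a Taylor expansion of the square root in the reconstructed formula above (our computation, not a statement from the slides):

    \[
      \varphi(\epsilon_g, \epsilon_H) \approx
      \begin{cases}
        \dfrac{\epsilon_g}{2\epsilon_H}, & \text{if } L_H \epsilon_g \ll \epsilon_H^2
          \quad (\sqrt{4 + t} \approx 2 + t/4 \text{ for small } t), \\[1ex]
        \sqrt{2\epsilon_g / L_H}, & \text{if } L_H \epsilon_g \gg \epsilon_H^2,
      \end{cases}
      \qquad t := 2 L_H \epsilon_g / \epsilon_H^2 .
    \]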
Iteration complexity

Iteration complexity bound: the method reaches an $(\epsilon_g, \epsilon_H)$-point in at most
$\frac{f_0 - f_{\mathrm{low}}}{c} \max\left\{ \epsilon_g^{-3/2},\ \epsilon_H^{-3},\ \epsilon_g^{-3} \epsilon_H^{3},\ \varphi(\epsilon_g, \epsilon_H)^{-3} \right\}$
iterations.

Specific rates:
- $\epsilon_g = \epsilon$, $\epsilon_H = \sqrt{\epsilon}$: $O(\epsilon^{-3/2})$.
- $\epsilon_g = \epsilon_H = \epsilon$: $O(\epsilon^{-3})$.
These match the optimal bounds for Newton-type methods.
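As a quick sanity check of the first rate (our arithmetic, plugged into the bound above): with $\epsilon_g = \epsilon$ and $\epsilon_H = \sqrt{\epsilon}$,

    \[
      \epsilon_g^{-3/2} = \epsilon^{-3/2}, \qquad
      \epsilon_H^{-3} = \epsilon^{-3/2}, \qquad
      \epsilon_g^{-3}\epsilon_H^{3} = \epsilon^{-3}\,\epsilon^{3/2} = \epsilon^{-3/2},
    \]

and since $L_H \epsilon_g / \epsilon_H^2 = L_H$ is constant, $\varphi(\epsilon, \sqrt{\epsilon}) = \Theta(\sqrt{\epsilon})$, so $\varphi^{-3} = \Theta(\epsilon^{-3/2})$. All four terms balance, giving $O(\epsilon^{-3/2})$ iterations.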
Function evaluation complexity

#Iterations = #gradient/#Hessian evaluations.
#Iterations ≤ #function evaluations.

Line-search iterations: if $x_k$ is not an $(\epsilon_g, \epsilon_H)$-point, the line search takes at most
$O\left( \left| \log_\theta\left( \min\{ \epsilon_g^{1/2} \epsilon_H^{-1},\ \epsilon_H^2 \} \right) \right| \right)$
iterations.

Evaluation complexity bound: the method reaches an $(\epsilon_g, \epsilon_H)$-point in at most
$\tilde{O}\left( \max\left\{ \epsilon_g^{-3/2},\ \epsilon_H^{-3},\ \epsilon_g^{-3} \epsilon_H^{3},\ \varphi(\epsilon_g, \epsilon_H)^{-3} \right\} \right)$
function evaluations.
3. Inexact variants
Motivation

Algorithmic cost: the method should be matrix-free, yet we use matrix-related operations:
- Linear system solves;
- Eigenvalue/eigenvector computations.

Inexactness: perform the matrix operations inexactly.
Main cost unit: matrix-vector product/gradient evaluation.
Conjugate gradient for linear systems

We solve systems of the form $Hd = -g$, with $H \succeq \epsilon_H I$.

Conjugate Gradient (CG): we apply the conjugate gradient algorithm with stopping criterion
$\|Hd + g\| \leq \frac{\xi}{2} \min\left\{ \|g\|,\ \epsilon_H \|d\| \right\}$, $\quad \xi \in (0, 1)$.
If $\kappa = \lambda_{\max}(H)/\lambda_{\min}(H)$, the CG method finds such a vector in at most
$\min\left\{ n,\ \frac{\sqrt{\kappa}}{2} \log\left( 4\kappa^{5/2}/\xi \right) \right\} = \min\left\{ n,\ O\left( \sqrt{\kappa} \log(\kappa/\xi) \right) \right\}$
matrix-vector products.
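A minimal CG sketch with this stopping rule, assuming the Hessian is only available through matrix-vector products (hessvec) and $H \succeq \epsilon_H I$; this is textbook CG with the capped residual test, not the paper's exact implementation:

    import numpy as np

    def cg_with_inexact_stop(hessvec, g, eps_H, xi=0.5, max_iter=None):
        """Approximately solve H d = -g, stopping once
        ||H d + g|| <= (xi/2) * min(||g||, eps_H * ||d||)."""
        n = g.shape[0]
        max_iter = max_iter or n
        d = np.zeros(n)
        r = -g.copy()                  # residual of H d = -g at d = 0
        p = r.copy()
        rs = r @ r
        for _ in range(max_iter):
            if np.sqrt(rs) <= 0.5 * xi * min(np.linalg.norm(g),
                                             eps_H * np.linalg.norm(d)):
                break                  # residual small enough: stop early
            Hp = hessvec(p)
            alpha = rs / (p @ Hp)
            d += alpha * p
            r -= alpha * Hp
            rs_new = r @ r
            p = r + (rs_new / rs) * p
            rs = rs_new
        return d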
Lanczos for eigenvalue computation

Lanczos method to compute a minimum eigenvector:
- Can fail if deterministic → random start.
- Results hold for matrices $A \succeq 0$ → change the Hessian (work with $U_H I - H$).

Lanczos iterations: let $H \in \mathbb{R}^{n \times n}$ be symmetric with $\|H\| \leq U_H$, $\epsilon > 0$, $\delta \in (0, 1)$. With probability at least $1 - \delta$, the Lanczos procedure applied to $U_H I - H$ outputs a vector $v$ such that
$v^\top H v \leq \lambda_{\min}(H) + \epsilon$
in at most
$\min\left\{ n,\ \frac{\ln(n/\delta^2)}{2\sqrt{2}} \sqrt{\frac{U_H}{\epsilon}} \right\}$
iterations/matrix-vector products.
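A minimal sketch of this computation, where SciPy's Lanczos-based eigsh on the shifted operator $U_H I - H$ stands in for the dedicated Lanczos procedure (the helper name and random-start handling are ours):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, eigsh

    def approx_min_eigpair(hessvec, n, U_H, seed=None):
        """Estimate (lambda_min, v) of the Hessian via Lanczos on U_H*I - H,
        whose largest eigenvalue is U_H - lambda_min(H)."""
        rng = np.random.default_rng(seed)
        shifted = LinearOperator((n, n),
                                 matvec=lambda x: U_H * x - hessvec(x),
                                 dtype=np.float64)
        v0 = rng.standard_normal(n)               # random start vector
        mu, V = eigsh(shifted, k=1, which='LA', v0=v0)
        return U_H - mu[0], V[:, 0]

In the algorithm, the output vector is then sign-flipped if necessary so that $v^\top g_k \leq 0$.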
Selecting the search direction $d_k$ - inexact version

Step 1: use gradient-related information. Compute
$g_k = \nabla f(x_k)$, $\quad R_k = \frac{g_k^\top \nabla^2 f(x_k) g_k}{\|g_k\|^2}$.
- If $R_k < -\epsilon_H$, set $d_k = R_k \frac{g_k}{\|g_k\|}$.
- Else if $R_k \in [-\epsilon_H, \epsilon_H]$ and $\|g_k\| > \epsilon_g$, set $d_k = -\frac{g_k}{\|g_k\|^{1/2}}$.
- Otherwise perform the Inexact Step 2.
Selecting the direction $d_k$ - inexact version (2)

Inexact Step 2: use (inexact) eigenvalue information. Compute an approximate eigenpair $(v_k^i, \lambda_k^i)$ such that, with probability $1 - \delta$,
$\lambda_k^i = (v_k^i)^\top \nabla^2 f(x_k)\, v_k^i \leq \lambda_k + \frac{\epsilon_H}{2}$, $\quad (v_k^i)^\top g_k \leq 0$, $\quad \|v_k^i\| = 1$.
- Case $\lambda_k^i < -\frac{1}{2}\epsilon_H$: $d_k = -\lambda_k^i v_k^i$;
- Case $\lambda_k^i > \frac{3}{2}\epsilon_H$ (inexact Newton): use CG to obtain $d_k = d_k^{in}$ with
$\|\nabla^2 f(x_k) d_k^{in} + g_k\| \leq \frac{\xi}{2} \min\left\{ \|g_k\|,\ \epsilon_H \|d_k^{in}\| \right\}$;
- Case $\lambda_k^i \in [-\frac{1}{2}\epsilon_H, \frac{3}{2}\epsilon_H]$ (inexact regularized Newton): use CG to obtain $d_k = d_k^{ir}$ with
$\left\| \left[\nabla^2 f(x_k) + 2\epsilon_H I\right] d_k^{ir} + g_k \right\| \leq \frac{\xi}{2} \min\left\{ \|g_k\|,\ \epsilon_H \|d_k^{ir}\| \right\}$.
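Wiring the pieces together, a sketch of Inexact Step 2; it reuses the illustrative helpers cg_with_inexact_stop and approx_min_eigpair from the previous sketches, so it is only runnable with those in scope:

    import numpy as np

    def inexact_step2(hessvec, g, eps_H, U_H, xi=0.5):
        """Inexact Step 2: pick a direction from the approximate eigenpair."""
        n = g.shape[0]
        lam_i, v = approx_min_eigpair(hessvec, n, U_H)
        if v @ g > 0:
            v = -v                           # enforce v^T g <= 0
        if lam_i < -0.5 * eps_H:
            return -lam_i * v                # scaled negative curvature direction
        if lam_i > 1.5 * eps_H:
            return cg_with_inexact_stop(hessvec, g, eps_H, xi)   # inexact Newton
        reg = lambda x: hessvec(x) + 2 * eps_H * x               # shifted Hessian
        return cg_with_inexact_stop(reg, g, eps_H, xi)  # inexact regularized Newton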
Complexity analysis of the inexact method

Identical reasoning: 5 steps, 1 proof.
- Using Lanczos with a random start, the negative curvature decrease only holds with probability $1 - \delta$.
- With CG, the inexact Newton and regularized Newton steps give slightly different formulas.

Decrease lemma: for any iteration $k$, if $x_k$ is not an $(\epsilon_g, \epsilon_H)$-point,
$f(x_k) - f(x_{k+1}) \geq \hat{c} \min\left\{ \epsilon_g^3 \epsilon_H^{-3},\ \epsilon_g^{3/2},\ \epsilon_H^3,\ \varphi\left(\epsilon_g, \tfrac{\xi}{2}\epsilon_H\right)^3,\ \varphi\left(\epsilon_g, \tfrac{2}{4+\xi}\epsilon_H\right)^3 \right\}$
with probability at least $1 - \delta$, where $\hat{c}$ only depends on $L_H$, $\eta$, $\theta$.
Complexity results

Iteration complexity: an $(\epsilon_g, \epsilon_H)$-point is reached in at most
$\hat{K} := \frac{f_0 - f_{\mathrm{low}}}{\hat{c}} \max\left\{ \epsilon_g^{-3} \epsilon_H^{3},\ \epsilon_g^{-3/2},\ \epsilon_H^{-3},\ \varphi\left(\epsilon_g, \tfrac{\xi}{2}\epsilon_H\right)^{-3},\ \varphi\left(\epsilon_g, \tfrac{2}{4+\xi}\epsilon_H\right)^{-3} \right\}$
iterations, with probability at least $1 - \hat{K}\delta$.

Cost complexity: the number of Hessian-vector products or gradient evaluations needed to reach an $(\epsilon_g, \epsilon_H)$-point is at most
$\min\left\{ n,\ O\left( U_H^{1/2} \epsilon_H^{-1/2} \log(\epsilon_H^{-1}/\xi) \right),\ O\left( U_H^{1/2} \epsilon_H^{-1/2} \log(n/\delta^2) \right) \right\} \cdot \hat{K}$,
with probability at least $1 - \hat{K}\delta$.
Complexity results (simplified)

Setting: $\epsilon_g = \epsilon$, $\epsilon_H = \sqrt{\epsilon}$. An $(\epsilon, \sqrt{\epsilon})$-point is reached in at most
- $O(\epsilon^{-3/2})$ iterations,
- $\tilde{O}(\epsilon^{-7/4})$ Hessian-vector products/gradient evaluations,
with probability $1 - O(\epsilon^{-3/2}\delta)$.

Setting $\delta = 0$ gives results with probability 1:
- Iterations: $O(\epsilon^{-3/2})$.
- Hessian-vector products/gradients: $O(n\, \epsilon^{-3/2})$.
Summary

Our proposal:
- A class of second-order line-search methods.
- Best known complexity guarantees.
- Features gradient steps and inexactness.
- Can be implemented matrix-free.

For more details: Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization, C. W. Royer and S. J. Wright, arXiv:1706.03131. Also contains local convergence results.
Follow-up

Perspectives:
- Numerical testing of our class of methods.
- Extension to constrained problems.

Thank you for your attention! croyer2@wisc.edu