Nonconvex optimization with complexity guarantees via Newton + conjugate gradient


1 Nonconvex optimization with complexity guarantees via Newton + conjugate gradient Clément Royer (University of Wisconsin-Madison, USA) Toulouse, January 8, 2019 Nonconvex optimization via Newton-CG 1

2 Where are you at? Nonconvex optimization via Newton-CG 2

3 Where are you at? Wisconsin Institute for Discovery (WID) Part of UW-Madison; Multi-disciplinary institute created in 2010; Currently organized around hubs. Me and WID Optimization theme, group of Stephen Wright; Affiliated with the Data Science hub and institute. Nonconvex optimization via Newton-CG 2

4 Where are you at? IFDS: Institute for Foundations of Data Science Hosted at WID, funded by NSF (13 centers US-wide, 3-4 larger institutes selected in 2020); Led by Stephen Wright; Gathering Math, Stat and CS expertise in Data Science. Nonconvex optimization via Newton-CG 3

5 Context Nonconvex optimization Many data science problems are convex: linear classification, logistic regression,... Yet there is a shift of focus from convex to nonconvex: Because of deep learning; But also in many other problems: matrix/tensor completion, robust statistics, etc. Nonconvex optimization via Newton-CG 4

6 Context Nonconvex optimization Many data science problems are convex: linear classification, logistic regression,... Yet there is a shift of focus from convex to nonconvex: Because of deep learning; But also in many other problems: matrix/tensor completion, robust statistics, etc. Example: Nonconvex formulation of low-rank matrix completion: $\min_{X \in \mathbb{R}^{n \times m},\, \mathrm{rank}(X)=r} \|P_\Omega(X - M)\|_F^2$, with $M \in \mathbb{R}^{n \times m}$, $\Omega \subseteq [n] \times [m]$. Factored reformulation (Burer and Monteiro, 2003): $\min_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r}} \|P_\Omega(U V^\top - M)\|_F^2$, nonconvex in U and V! Nonconvex optimization via Newton-CG 4
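To make the factored reformulation concrete, here is a minimal numpy sketch (not code from the talk) of the Burer-Monteiro objective $\|P_\Omega(UV^\top - M)\|_F^2$ and its gradients with respect to U and V; the function name, the boolean-mask encoding of $\Omega$, and the synthetic data are illustrative assumptions.

```python
# Minimal numpy sketch (not code from the talk) of the Burer-Monteiro factored
# objective f(U, V) = ||P_Omega(U V^T - M)||_F^2 and its gradients.
import numpy as np

def factored_objective(U, V, M, mask):
    """Objective and gradients for min_{U,V} ||P_Omega(U V^T - M)||_F^2.

    mask is a boolean array with the shape of M encoding the observed set Omega.
    """
    R = mask * (U @ V.T - M)      # residual restricted to observed entries
    f = np.sum(R ** 2)            # squared Frobenius norm over Omega
    gU = 2.0 * R @ V              # gradient with respect to U
    gV = 2.0 * R.T @ U            # gradient with respect to V
    return f, gU, gV

# Tiny usage example on synthetic data (illustrative sizes).
rng = np.random.default_rng(0)
n, m, r = 20, 15, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
mask = rng.random((n, m)) < 0.3   # roughly 30% observed entries
U0, V0 = rng.standard_normal((n, r)), rng.standard_normal((m, r))
f0, gU0, gV0 = factored_objective(U0, V0, M, mask)
```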

7 Nonconvex smooth optimization We consider the smooth unconstrained problem: $\min_{x \in \mathbb{R}^n} f(x)$. Assumptions on f: $f \in C^2(\mathbb{R}^n)$, bounded below, nonconvex. Nonconvex optimization via Newton-CG 5

8 Nonconvex smooth optimization We consider the smooth unconstrained problem: $\min_{x \in \mathbb{R}^n} f(x)$. Assumptions on f: $f \in C^2(\mathbb{R}^n)$, bounded below, nonconvex. Definitions in smooth nonconvex minimization First-order stationary point: $\nabla f(x) = 0$; Second-order stationary point: $\nabla f(x) = 0$, $\nabla^2 f(x) \succeq 0$. If x does not satisfy these conditions, there exists d such that (1) $d^\top \nabla f(x) < 0$: gradient-related direction, and/or (2) $d^\top \nabla^2 f(x)\, d < 0$: negative curvature direction, specific to nonconvex problems. Nonconvex optimization via Newton-CG 5
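These definitions translate directly into a numerical test. The following sketch (illustrative, using dense linear algebra) checks approximate first- and second-order stationarity and returns a gradient-related or negative curvature direction when one exists; the tolerances and function name are assumptions.

```python
# Illustrative check of approximate first- and second-order stationarity using
# dense linear algebra; tolerances and names are assumptions for the sketch.
import numpy as np

def stationarity_report(grad, hess, eps_g=1e-5, eps_H=1e-3):
    """grad: gradient vector at x; hess: dense Hessian matrix at x."""
    gnorm = np.linalg.norm(grad)
    lam, vecs = np.linalg.eigh(hess)       # eigenvalues in ascending order
    lam_min, v_min = lam[0], vecs[:, 0]
    first_order = gnorm <= eps_g
    second_order = first_order and lam_min >= -eps_H
    # A gradient-related or negative curvature direction when x is not stationary:
    if gnorm > eps_g:
        d = -grad
    elif lam_min < -eps_H:
        d = v_min                          # d^T hess d = lam_min * ||d||^2 < 0
    else:
        d = None
    return {"grad_norm": gnorm, "lambda_min": lam_min,
            "first_order": first_order, "second_order": second_order,
            "direction": d}
```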

9 Examples Nonconvex formulations for low-rank matrix problems (Bhojanapalli et al. 2016, Ge et al. 2017): $\min_{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{m \times r}} f(U V^\top)$. Points that satisfy second-order necessary conditions are global minima (or are close in function value); Strict saddle property: any first-order stationary point that is not a local minimum possesses negative curvature. Our goal: develop efficient algorithms to obtain second-order necessary points; We measure efficiency based on complexity. Nonconvex optimization via Newton-CG 6

10 Worst-case complexity analysis Complexity bounds Bound the cost of an algorithm in the worst case; Ubiquitous in theoretical computer science; Major impact in convex optimization (Nemirovski and Yudin 1983). The accelerated gradient case Consider $f^* = \min_{x \in \mathbb{R}^n} f(x)$ where f is convex and let $\epsilon \in (0, 1)$. Then, to find $\bar{x}$ such that $f(\bar{x}) - f^* \le \epsilon$: Gradient descent needs at most $O(\epsilon^{-1})$ iterations; Accelerated methods (Nesterov's method, heavy ball, etc.) require at most $O(\epsilon^{-1/2})$ iterations. Nonconvex optimization via Newton-CG 7

11 Accelerated methods: illustration Several methods applied to $\min_{x \in \mathbb{R}^{100}} x^\top A x$. Nonconvex optimization via Newton-CG 8
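The figure itself is not reproduced in this transcription; the sketch below sets up the kind of experiment it illustrates, comparing plain gradient descent with Nesterov's accelerated method on a strongly convex quadratic. The dimensions, conditioning, and step-size choices are illustrative assumptions.

```python
# Illustrative reconstruction of the kind of experiment behind this figure:
# gradient descent versus Nesterov's accelerated method on a strongly convex
# quadratic f(x) = 0.5 x^T A x in dimension 100 (conditioning and step sizes
# are assumptions).
import numpy as np

rng = np.random.default_rng(1)
n = 100
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(np.linspace(1.0, 100.0, n)) @ Q.T   # eigenvalues in [1, 100]
L, mu = 100.0, 1.0
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))  # momentum

x_gd = x_prev = x_nes = rng.standard_normal(n)
hist_gd, hist_nes = [], []
for k in range(200):
    # Gradient descent with step 1/L.
    x_gd = x_gd - (1.0 / L) * (A @ x_gd)
    # Nesterov's method for strongly convex quadratics (constant momentum).
    y = x_nes + beta * (x_nes - x_prev)
    x_prev, x_nes = x_nes, y - (1.0 / L) * (A @ y)
    hist_gd.append(0.5 * x_gd @ A @ x_gd)
    hist_nes.append(0.5 * x_nes @ A @ x_nes)
# hist_nes decreases markedly faster than hist_gd, in line with the
# O(eps^{-1/2}) versus O(eps^{-1}) iteration bounds quoted above.
```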

12 Complexity in nonconvex optimization For an algorithm applied to $\min_{x \in \mathbb{R}^n} f(x)$: Definition (first order) Let $\{x_k\}$ be a sequence of iterates generated by the algorithm and $\epsilon \in (0, 1)$. Worst-case cost to obtain an $\epsilon$-point $x_K$ such that $\|\nabla f(x_K)\| \le \epsilon$. Focus: dependency on $\epsilon$. Definition (second order) Let $\{x_k\}$ be the iterate sequence generated by the algorithm, with two tolerances $\epsilon_g, \epsilon_H \in (0, 1)$: Worst-case cost to obtain an $(\epsilon_g, \epsilon_H)$-point $x_K$ such that $\|\nabla f(x_K)\| \le \epsilon_g$, $\lambda_{\min}(\nabla^2 f(x_K)) \ge -\epsilon_H$. Focus: dependencies on $\epsilon_g, \epsilon_H$. Nonconvex optimization via Newton-CG 9

13 Complexity results From nonconvex optimization (2006-) Cost measure: number of iterations (but those may be expensive); Two types of guarantees: (1) $\|\nabla f(x)\| \le \epsilon_g$; (2) $\|\nabla f(x)\| \le \epsilon_g$ and $\nabla^2 f(x) \succeq -\epsilon_H I$. Best methods: second-order methods, deterministic variations on Newton's iteration involving Hessians. Nonconvex optimization via Newton-CG 10

14 Complexity results From nonconvex optimization (2006-) Cost measure: number of iterations (but those may be expensive); Two types of guarantees: (1) $\|\nabla f(x)\| \le \epsilon_g$; (2) $\|\nabla f(x)\| \le \epsilon_g$ and $\nabla^2 f(x) \succeq -\epsilon_H I$. Best methods: second-order methods, deterministic variations on Newton's iteration involving Hessians. Trust region: (1) $O(\epsilon_g^{-2})$, (2) $O(\max\{\epsilon_g^{-2}\epsilon_H^{-1}, \epsilon_H^{-3}\})$. Gradient descent + negative curvature: (1) $O(\epsilon_g^{-2})$, (2) $O(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$. Cubic regularization: (1) $O(\epsilon_g^{-3/2})$, (2) $O(\max\{\epsilon_g^{-3/2}\epsilon_H^{-1}, \epsilon_H^{-3}\})$. Nonconvex optimization via Newton-CG 10

15 Complexity results (2) Influenced by convex optimization (e.g. learning) Cost measure: gradient evaluations + Hessian-vector products (the main iteration cost). Two types of guarantees: (1) $\|\nabla f(x)\| \le \epsilon$; (2) $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\epsilon^{1/2} I$. Best methods: developed from accelerated gradient, assume knowledge of Lipschitz constants. Nonconvex optimization via Newton-CG 11

16 Complexity results (2) Influenced by convex optimization (e.g. learning) Cost measure: gradient evaluations + Hessian-vector products (the main iteration cost). Two types of guarantees: (1) $\|\nabla f(x)\| \le \epsilon$; (2) $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\epsilon^{1/2} I$. Best methods: developed from accelerated gradient, assume knowledge of Lipschitz constants. Gradient descent + random perturbation: (1), (2) $\tilde{O}(\epsilon^{-2})$ (high probability). Accelerated gradient descent + random perturbation: (1), (2) $\tilde{O}(\epsilon^{-7/4})$ (high probability). Accelerated gradient descent with nonconvexity detection: (1) $\tilde{O}(\epsilon^{-7/4})$ (deterministic). Nonconvex optimization via Newton-CG 11

17 In this talk Cover the range of complexity results... Iterations, evaluations, computation; Different choices for ɛ g, ɛ H ; Deterministic, high probability. Nonconvex optimization via Newton-CG 12

18 In this talk Cover the range of complexity results... Iterations, evaluations, computation; Different choices for ɛ g, ɛ H ; Deterministic, high probability....through a single framework... Newton-type iterations, with line search; Main cost: gradient/hessian-vector product; Nonconvex optimization via Newton-CG 12

19 In this talk Cover the range of complexity results... Iterations, evaluations, computation; Different choices for ɛ g, ɛ H ; Deterministic, high probability....through a single framework... Newton-type iterations, with line search; Main cost: gradient/hessian-vector product;...with best complexity guarantees Revisit the Conjugate Gradient algorithm; Exploit its relationship with accelerated gradient methods. Nonconvex optimization via Newton-CG 12

20 Outline 1 Newton-type methods with negative curvature General framework Inexact variants 2 Newton-Capped Conjugate Gradient Conjugate gradient and nonconvex quadratics Newton-Capped CG algorithms 3 Numerical results Nonconvex optimization via Newton-CG 13

21 Outline 1 Newton-type methods with negative curvature General framework Inexact variants 2 Newton-Capped Conjugate Gradient 3 Numerical results Nonconvex optimization via Newton-CG 14

22 Line-search framework Inputs: $x_0 \in \mathbb{R}^n$, $\theta \in (0, 1)$, $\eta > 0$, $\epsilon_g \in (0, 1)$, $\epsilon_H \in (0, 1)$. For k = 0, 1, 2,...: (1) Compute a direction $d_k = d_k(\epsilon_g, \epsilon_H)$. (2) Backtracking line search: compute the largest $\alpha_k \in \{\theta^j\}_{j \in \mathbb{N}}$ such that $f(x_k + \alpha_k d_k) < f(x_k) - \frac{\eta}{6} \alpha_k^3 \|d_k\|^3$. (3) Set $x_{k+1} = x_k + \alpha_k d_k$. Nonconvex optimization via Newton-CG 15

23 Line-search framework Inputs: $x_0 \in \mathbb{R}^n$, $\theta \in (0, 1)$, $\eta > 0$, $\epsilon_g \in (0, 1)$, $\epsilon_H \in (0, 1)$. For k = 0, 1, 2,...: (1) Compute a direction $d_k = d_k(\epsilon_g, \epsilon_H)$. (2) Backtracking line search: compute the largest $\alpha_k \in \{\theta^j\}_{j \in \mathbb{N}}$ such that $f(x_k + \alpha_k d_k) < f(x_k) - \frac{\eta}{6} \alpha_k^3 \|d_k\|^3$. (3) Set $x_{k+1} = x_k + \alpha_k d_k$. About the line search Guarantee of cubic decrease; Simplest one giving complexity guarantees. Nonconvex optimization via Newton-CG 15
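A minimal sketch of the backtracking step of this framework, assuming access to a function handle f; the cap on the number of backtracks is an added safeguard, not part of the framework as stated.

```python
# Minimal sketch of the backtracking line search above: take the largest
# alpha in {theta^j} satisfying f(x + alpha d) < f(x) - (eta/6) alpha^3 ||d||^3.
# The cap on the number of backtracks is an added safeguard.
import numpy as np

def cubic_backtracking(f, x, d, theta=0.5, eta=0.1, max_backtracks=60):
    fx, dnorm3 = f(x), np.linalg.norm(d) ** 3
    alpha = 1.0
    for _ in range(max_backtracks):
        if f(x + alpha * d) < fx - (eta / 6.0) * alpha ** 3 * dnorm3:
            return alpha
        alpha *= theta                 # try the next power of theta
    return 0.0                         # no acceptable step found (safeguard)
```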

24 Newton's iteration Basics Iteration k: compute $d_k$ by solving the linear system $\nabla^2 f(x_k) d_k = -\nabla f(x_k)$ and set $x_{k+1} = x_k + d_k$; Unique direction when $\nabla^2 f(x_k) \succ 0$; Can guarantee global convergence with (e.g.) line search. For nonconvex problems Use a threshold $\epsilon_H$ for $\lambda_{\min}(\nabla^2 f(x_k))$; Regularize to ensure $\nabla^2 f(x_k) + \alpha I \succeq \epsilon_H I$; Second order: leverage negative curvature directions d such that $d^\top \nabla^2 f(x_k)\, d \le -\epsilon_H \|d\|^2$. Nonconvex optimization via Newton-CG 16

25 Second-order Newton method with line search Inputs: $x_0 \in \mathbb{R}^n$, $\theta \in (0, 1)$, $\eta > 0$, $\epsilon_H \in (0, 1)$. For k = 0, 1, 2,...: (1) Computation of a search direction $d_k$: Compute $\lambda = \lambda_{\min}(\nabla^2 f(x_k))$; If $\lambda < -\epsilon_H$, choose $d_k$ as a negative curvature direction such that $d_k^\top \nabla f(x_k) \le 0$, $d_k^\top \nabla^2 f(x_k) d_k = \lambda \|d_k\|^2$; Otherwise, choose $d_k$ as a Newton direction (possibly regularized) by solving $(\nabla^2 f(x_k) + 2\epsilon_H I)\, d_k = -\nabla f(x_k)$. (2) Backtracking line search (unchanged). (3) Set $x_{k+1} = x_k + \alpha_k d_k$. Nonconvex optimization via Newton-CG 17
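For reference, here is a hedged sketch of this exact (dense linear algebra) variant: eigenvalue test, negative curvature or regularized Newton direction, then the cubic backtracking line search. The default tolerances and the scaling of the negative curvature direction are illustrative choices, not necessarily those of the paper.

```python
# Hedged sketch of this exact variant with dense linear algebra: eigenvalue
# test, negative curvature or regularized Newton direction, then the cubic
# backtracking line search. Tolerances and the curvature-direction scaling
# are illustrative choices, not necessarily the paper's.
import numpy as np

def newton_ls_second_order(f, grad, hess, x0, eps_g=1e-5, eps_H=1e-3,
                           theta=0.5, eta=0.1, max_iter=500):
    x = x0.copy()
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        lam, V = np.linalg.eigh(H)
        if np.linalg.norm(g) <= eps_g and lam[0] >= -eps_H:
            return x                                  # (eps_g, eps_H)-point
        if lam[0] < -eps_H:
            d = V[:, 0].copy()                        # unit negative curvature direction
            if d @ g > 0:
                d = -d                                # ensure d^T grad f <= 0
            d *= abs(lam[0])                          # scale by |lambda| (assumed scaling)
        else:
            d = np.linalg.solve(H + 2.0 * eps_H * np.eye(x.size), -g)
        # Backtracking with the cubic decrease condition of the framework.
        fx, alpha, dn3 = f(x), 1.0, np.linalg.norm(d) ** 3
        while f(x + alpha * d) >= fx - (eta / 6.0) * alpha ** 3 * dn3:
            alpha *= theta
            if alpha < 1e-12:                         # safeguard
                break
        x = x + alpha * d
    return x
```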

26 Complexity of the second-order Newton method Theorem (Royer and Wright 2018) The method returns $x_k$ such that $\|\nabla f(x_k)\| \le \epsilon_g$ and $\nabla^2 f(x_k) \succeq -\epsilon_H I$ in at most $O(\max\{\epsilon_g^{-3}\epsilon_H^3, \epsilon_H^{-3}\})$ iterations. Nonconvex optimization via Newton-CG 18

27 Complexity of the second-order Newton method Theorem (Royer and Wright 2018) The method returns $x_k$ such that $\|\nabla f(x_k)\| \le \epsilon_g$ and $\nabla^2 f(x_k) \succeq -\epsilon_H I$ in at most $O(\max\{\epsilon_g^{-3}\epsilon_H^3, \epsilon_H^{-3}\})$ iterations. With $\epsilon_H = \epsilon_g^{1/2}$: bound in $O(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\}) = O(\epsilon_g^{-3/2})$; Optimal over a class of second-order methods (Cartis, Gould and Toint 2018). Nonconvex optimization via Newton-CG 18

28 Outline 1 Newton-type methods with negative curvature General framework Inexact variants 2 Newton-Capped Conjugate Gradient 3 Numerical results Nonconvex optimization via Newton-CG 19

29 Introducing inexactness We are concerned with inexactness in the step computation; Inexactness in the function/derivatives requires a different treatment, e.g. Bergou, Diouane, Kungurtsev and Royer (2018) when $f(x) = \sum_i f_i(x)$. Nonconvex optimization via Newton-CG 20

30 Introducing inexactness We are concerned with inexactness in the step computation; Inexactness in the function/derivatives requires a different treatment, e.g. Bergou, Diouane, Kungurtsev and Royer (2018) when $f(x) = \sum_i f_i(x)$. Our framework uses matrix operations to compute a search direction: Eigenvalue/eigenvector calculation; Linear system solve. Inexact strategies Iterative linear algebra (with/without randomness) for matrix operations. Main cost: matrix-vector products. Nonconvex optimization via Newton-CG 20

31 From optimization to linear algebra Two types of direction Depending on $\epsilon_H > 0$ (minimum eigenvalue estimate): Regularized Newton direction: $(\nabla^2 f(x_k) + 2\epsilon_H I)\, d = -\nabla f(x_k)$, with $\nabla^2 f(x_k) + 2\epsilon_H I \succeq \epsilon_H I$. Sufficient negative curvature direction: $d^\top \nabla f(x_k) \le 0$, $d^\top \nabla^2 f(x_k)\, d \le -\epsilon_H \|d\|^2$. Nonconvex optimization via Newton-CG 21

32 From optimization to linear algebra Two types of direction Depending on $\epsilon_H > 0$ (minimum eigenvalue estimate): Regularized Newton direction: $(\nabla^2 f(x_k) + 2\epsilon_H I)\, d = -\nabla f(x_k)$, with $\nabla^2 f(x_k) + 2\epsilon_H I \succeq \epsilon_H I$. Sufficient negative curvature direction: $d^\top \nabla f(x_k) \le 0$, $d^\top \nabla^2 f(x_k)\, d \le -\epsilon_H \|d\|^2$. Related linear algebra problems Let $H = H^\top \in \mathbb{R}^{n \times n}$, $g \in \mathbb{R}^n$ and $\epsilon_H > 0$: Solve $(H + 2\epsilon_H I)\, d = -g$ where $\lambda_{\min}(H) > -\epsilon_H$; Compute d such that $d^\top H d \le -\epsilon_H \|d\|^2$ otherwise. Nonconvex optimization via Newton-CG 21

33 From optimization to linear algebra Two types of direction Depending on $\epsilon_H > 0$ (minimum eigenvalue estimate): Approximate regularized Newton direction: $(\nabla^2 f(x_k) + 2\epsilon_H I)\, d \approx -\nabla f(x_k)$, with $\nabla^2 f(x_k) + 2\epsilon_H I \succeq \epsilon_H I$. Sufficient negative curvature direction: $d^\top \nabla f(x_k) \le 0$, $d^\top \nabla^2 f(x_k)\, d \le -\epsilon_H \|d\|^2$. Related linear algebra problems Let $H = H^\top \in \mathbb{R}^{n \times n}$, $g \in \mathbb{R}^n$ and $\epsilon_H > 0$: Approximate the solution of $(H + 2\epsilon_H I)\, d = -g$ where $\lambda > -\epsilon_H$, $\lambda \approx \lambda_{\min}(H)$; Compute d such that $d^\top H d \le -\epsilon_H \|d\|^2$ otherwise. Nonconvex optimization via Newton-CG 21

34 Conjugate gradient (CG) for symmetric linear systems Problem: $Hd = -g$, where $H = H^\top \succeq \epsilon_H I$. Nonconvex optimization via Newton-CG 22

35 Conjugate gradient (CG) for symmetric linear systems Problem: $Hd = -g$, where $H = H^\top \succeq \epsilon_H I$. Conjugate Gradient (CG) properties Applied with the stopping criterion: $\|Hd + g\| \le \frac{\xi}{2} \min\{\|g\|, \epsilon_H \|d\|\}$, $\xi \in (0, 1)$. If $\kappa = \lambda_{\max}(H)/\lambda_{\min}(H)$, CG terminates in at most $\min\{n, \frac{1}{2}\sqrt{\kappa} \log(4\kappa^{5/2}/\xi)\} = \min\{n, O(\sqrt{\kappa} \log(\kappa/\xi))\}$ iterations/matrix-vector products. CG does not explicitly use eigenvalues of H! Nonconvex optimization via Newton-CG 22
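The sketch below shows CG applied to $Hd = -g$ using only matrix-vector products, with the inexactness test quoted above as stopping criterion; it assumes H is positive definite (as on this slide), and the handle name hess_vec is an illustrative assumption.

```python
# Sketch of CG applied to H d = -g using only Hessian-vector products, with
# the inexactness test ||H d + g|| <= (xi/2) min(||g||, eps_H ||d||) quoted
# above; H is assumed positive definite, and hess_vec is an illustrative
# handle returning products v -> H v.
import numpy as np

def cg_inexact_newton(hess_vec, g, eps_H, xi=0.5, max_iter=None):
    n = g.shape[0]
    max_iter = n if max_iter is None else max_iter
    d = np.zeros(n)
    r = g.copy()                  # residual H d + g at d = 0
    p = -g.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) <= 0.5 * xi * min(np.linalg.norm(g),
                                               eps_H * np.linalg.norm(d)):
            break
        Hp = hess_vec(p)
        alpha = (r @ r) / (p @ Hp)
        d = d + alpha * p
        r_new = r + alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p, r = -r_new + beta * p, r_new
    return d
```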

36 Lanczos for minimum eigenvalue estimation Key idea (Kuczyński and Woźniakowski 1992): use a random starting vector uniformly distributed on the unit sphere. Nonconvex optimization via Newton-CG 23

37 Lanczos for minimum eigenvalue estimation Key idea (Kuczyński and Woźniakowski 1992): use a random starting vector uniformly distributed on the unit sphere. Lanczos with a random start Let $H \in \mathbb{R}^{n \times n}$ be symmetric with $\|H\| \le M$, $\epsilon_H > 0$, $\delta \in (0, 1)$. With probability at least $1 - \delta$, the Lanczos process returns a unit vector v such that $v^\top H v \le \lambda_{\min}(H) + \frac{\epsilon_H}{2}$ in at most $\min\{n, \frac{\ln(3n/\delta^2)}{2}\sqrt{\frac{M}{\epsilon_H}}\}$ iterations/matrix-vector products. Corollary: if $\lambda_{\min}(H) \le -\epsilon_H$, then $v^\top H v \le -\frac{\epsilon_H}{2}$. Nonconvex optimization via Newton-CG 23
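A hand-rolled sketch of the Lanczos procedure with a random start, returning a Ritz estimate of $\lambda_{\min}(H)$ and an approximate eigenvector from Hessian-vector products only; the iteration budget is passed in directly rather than derived from the bound above, no reorthogonalization is performed, and all names are illustrative.

```python
# Hand-rolled sketch of Lanczos with a random start, returning a Ritz estimate
# of lambda_min(H) and an approximate eigenvector from Hessian-vector products
# only. The iteration budget is a plain argument, not the bound quoted above.
import numpy as np

def lanczos_min_eig(hess_vec, n, num_iter, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)                       # random start on the unit sphere
    Q, alphas, betas = [q], [], []
    beta, q_prev = 0.0, np.zeros(n)
    for _ in range(min(num_iter, n)):
        w = hess_vec(q) - beta * q_prev          # three-term Lanczos recurrence
        alpha = q @ w
        w = w - alpha * q
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        if beta < 1e-12:                         # Krylov subspace exhausted
            break
        q_prev, q = q, w / beta
        betas.append(beta)
        Q.append(q)
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
    vals, vecs = np.linalg.eigh(T)               # spectrum of the tridiagonal T_k
    theta, u = vals[0], vecs[:, 0]
    v = np.column_stack(Q[:k]) @ u               # Ritz vector approximating the eigenvector
    v /= np.linalg.norm(v)
    return theta, v
```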

38 Conjugate gradient for minimum eigenvalue estimation? Conjugate gradient and Lanczos work on the same Krylov subspaces (invariant by translation) when started from the same point; If they detect negative curvature, it will be at the same iteration. Nonconvex optimization via Newton-CG 24

39 Conjugate gradient for minimum eigenvalue estimation? Conjugate gradient and Lanczos work on the same Krylov subspaces (invariant by translation) when started from the same point; If they detect negative curvature, it will be at the same iteration. Theorem (Royer, O'Neill, Wright 2018) Let $H \in \mathbb{R}^{n \times n}$ be symmetric with $\|H\| \le M$, $\delta \in [0, 1)$, and let CG be applied to $(H + \frac{\epsilon_H}{2} I)\, d = b$ with $b \sim U(S^{n-1})$. Then, if $\lambda_{\min}(H) < -\epsilon_H$, CG outputs a direction of (negative) curvature at most $-\frac{\epsilon_H}{2}$ in at most $J = \min\{n, \frac{\ln(3n/\delta^2)}{2}\sqrt{\frac{M}{\epsilon_H}}\}$ iterations, with probability at least $1 - \delta$. Nonconvex optimization via Newton-CG 24

40 Minimum eigenvalue oracles Corollary For the matrix $\nabla^2 f(x_k)$, consider: Either CG applied to $(\nabla^2 f(x_k) + \frac{\epsilon_H}{2} I)\, d = b$, with $b \in S^{n-1}$; Or Lanczos applied to $\nabla^2 f(x_k)$, starting from $b \in S^{n-1}$. Then, for every $\delta \in [0, 1)$, we obtain one of the two outcomes below: (1) a direction of negative curvature at most $-\epsilon_H/2$, (2) a certificate that $\nabla^2 f(x_k) \succeq -\epsilon_H I$, using at most $\tilde{O}(\min\{n, \epsilon_H^{-1/2}\})$ gradients/Hessian-vector products, with probability at least $1 - \delta$. Nonconvex optimization via Newton-CG 25

41 Minimum eigenvalue oracles Corollary For the matrix $\nabla^2 f(x_k)$, consider: Either CG applied to $(\nabla^2 f(x_k) + \frac{\epsilon_H}{2} I)\, d = b$, with $b \in S^{n-1}$; Or Lanczos applied to $\nabla^2 f(x_k)$, starting from $b \in S^{n-1}$. Then, for every $\delta \in [0, 1)$, we obtain one of the two outcomes below: (1) a direction of negative curvature at most $-\epsilon_H/2$, (2) a certificate that $\nabla^2 f(x_k) \succeq -\epsilon_H I$, using at most $\tilde{O}(\min\{n, \epsilon_H^{-1/2}\})$ gradients/Hessian-vector products, with probability at least $1 - \delta$. We say that those methods are (minimum) eigenvalue oracles. Nonconvex optimization via Newton-CG 25

42 Inexact Newton-type variants Inputs: $x_0 \in \mathbb{R}^n$, $\theta \in (0, 1)$, $\eta > 0$, $\epsilon_H \in (0, 1)$. For k = 0, 1, 2,...: (1) Computation of a search direction $d_k$: Compute $\lambda \approx \lambda_{\min}(\nabla^2 f(x_k))$ using an eigenvalue oracle; If $\lambda < -\epsilon_H$, choose $d_k$ as a negative curvature direction such that $d_k^\top \nabla f(x_k) \le 0$, $d_k^\top \nabla^2 f(x_k) d_k = \lambda \|d_k\|^2$; Otherwise, choose $d_k$ as a Newton direction (possibly regularized) by CG, such that $\|(\nabla^2 f(x_k) + 2\epsilon_H I)\, d_k + \nabla f(x_k)\| \le \frac{\xi}{2} \min\{\|\nabla f(x_k)\|, \epsilon_H \|d_k\|\}$. (2) Backtracking line search (unchanged). (3) Set $x_{k+1} = x_k + \alpha_k d_k$. Nonconvex optimization via Newton-CG 26

43 Complexity result for inexact variants Set $\epsilon_g = \epsilon$, $\epsilon_H = \sqrt{\epsilon}$. The method returns an $(\epsilon, \sqrt{\epsilon})$-point using at most $O(\epsilon^{-3/2})$ outer iterations and $\tilde{O}(\min\{n\epsilon^{-3/2}, \epsilon^{-7/4}\})$ gradients/Hessian-vector products, with probability at least $1 - O(\epsilon^{-3/2}\delta)$. Nonconvex optimization via Newton-CG 27

44 Complexity result for inexact variants Set $\epsilon_g = \epsilon$, $\epsilon_H = \sqrt{\epsilon}$. The method returns an $(\epsilon, \sqrt{\epsilon})$-point using at most $O(\epsilon^{-3/2})$ outer iterations and $\tilde{O}(\min\{n\epsilon^{-3/2}, \epsilon^{-7/4}\})$ gradients/Hessian-vector products, with probability at least $1 - O(\epsilon^{-3/2}\delta)$. Setting $\delta = 0$ and assuming that $n \gg \epsilon^{-1/2}$ yields almost-sure results: Iterations: $O(\epsilon^{-3/2})$. Gradients/Hessian-vector products: $O(\epsilon^{-7/4})$. Nonconvex optimization via Newton-CG 27

45 Outline 1 Newton-type methods with negative curvature 2 Newton-Capped Conjugate Gradient Conjugate gradient and nonconvex quadratics Newton-Capped CG algorithms 3 Numerical results Nonconvex optimization via Newton-CG 28

46 Revisiting conjugate gradient Idea Consider applying CG to a linear system $Hd = -g$, where H may not be positive definite. Equivalently, apply CG to the quadratic $\min_d q(d) = \frac{1}{2} d^\top H d + g^\top d$ without knowing if q is (strongly) convex. Motivation Rich convergence theory for CG in the positive definite/strongly convex case; When applied to an indefinite system: May break down... but then reveals negative curvature. Nonconvex optimization via Newton-CG 29

47 Conjugate gradient for $Hy = -g$ Algorithm Init: set $y_0 = 0 \in \mathbb{R}^n$, $r_0 = g$, $p_0 = -g$, $j = 0$. While $p_j^\top H p_j > 0$: compute $y_{j+1} = y_j + \alpha_j p_j$, $r_{j+1} = H y_{j+1} + g$ and $p_{j+1}$; set $j = j + 1$; terminate if $\|r_j\| \le \zeta \|r_0\|$. Nonconvex optimization via Newton-CG 30

48 Conjugate gradient for $Hy = -g$ Algorithm Init: set $y_0 = 0 \in \mathbb{R}^n$, $r_0 = g$, $p_0 = -g$, $j = 0$. While $p_j^\top H p_j > 0$: compute $y_{j+1} = y_j + \alpha_j p_j$, $r_{j+1} = H y_{j+1} + g$ and $p_{j+1}$; set $j = j + 1$; terminate if $\|r_j\| \le \zeta \|r_0\|$. If $H \succ 0$, $r_n = 0$; If $\epsilon_H I \preceq H \preceq M I$, $\|r_j\|^2 \le 4\kappa \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2j} \|r_0\|^2$, $\kappa = \frac{M}{\epsilon_H}$. Nonconvex optimization via Newton-CG 30

49 Conjugate gradient for $Hy = -g$ Algorithm assuming $\epsilon_H I \preceq H \preceq MI$ Init: set $y_0 = 0 \in \mathbb{R}^n$, $r_0 = g$, $p_0 = -g$, $j = 0$. While $p_j^\top H p_j > \epsilon_H \|p_j\|^2$ and $\|r_j\|^2 \le 4\kappa \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2j} \|r_0\|^2$: compute $y_{j+1} = y_j + \alpha_j p_j$, $r_{j+1} = H y_{j+1} + g$ and $p_{j+1}$; set $j = j + 1$; terminate if $\|r_j\| \le \zeta \|r_0\|$. If $H \succ 0$, $r_n = 0$; If $\epsilon_H I \preceq H \preceq M I$, $\|r_j\|^2 \le 4\kappa \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2j} \|r_0\|^2$, $\kappa = \frac{M}{\epsilon_H}$. Nonconvex optimization via Newton-CG 30

50 Conjugate gradient for $Hy = -g$ Algorithm assuming $\epsilon_H I \preceq H \preceq MI$ Init: set $y_0 = 0 \in \mathbb{R}^n$, $r_0 = g$, $p_0 = -g$, $j = 0$. While $p_j^\top H p_j > \epsilon_H \|p_j\|^2$ and $\|r_j\|^2 \le 4\kappa \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2j} \|r_0\|^2$: compute $y_{j+1} = y_j + \alpha_j p_j$, $r_{j+1} = H y_{j+1} + g$ and $p_{j+1}$; set $j = j + 1$; terminate if $\|r_j\| \le \zeta \|r_0\|$. If $H \succ 0$, $r_n = 0$; If $\epsilon_H I \preceq H \preceq M I$, $\|r_j\|^2 \le 4\kappa \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2j} \|r_0\|^2$, $\kappa = \frac{M}{\epsilon_H}$. What if $H \not\succ 0$? Nonconvex optimization via Newton-CG 30

51 Conjugate gradient for possibly indefinite systems Capped Conjugate Gradient Init: set $y_0 = 0 \in \mathbb{R}^n$, $r_0 = g$, $p_0 = -g$, $j = 0$. While $p_j^\top H p_j > \epsilon_H \|p_j\|^2$ and $\|r_j\|^2 \le T \tau^j \|r_0\|^2$: compute $y_{j+1}$, $r_{j+1} = H y_{j+1} + g$ and $p_{j+1}$; set $j = j + 1$; terminate if $\|r_j\| \le \zeta \|r_0\|$. Nonconvex optimization via Newton-CG 31

52 Conjugate gradient for possibly indefinite systems Capped Conjugate Gradient Init: set $y_0 = 0 \in \mathbb{R}^n$, $r_0 = g$, $p_0 = -g$, $j = 0$. While $p_j^\top H p_j > \epsilon_H \|p_j\|^2$ and $\|r_j\|^2 \le T \tau^j \|r_0\|^2$: compute $y_{j+1}$, $r_{j+1} = H y_{j+1} + g$ and $p_{j+1}$; set $j = j + 1$; terminate if $\|r_j\| \le \zeta \|r_0\|$. Properties of Capped CG For any matrix $H \preceq MI$: As long as $r_j$ is computed, $\|r_j\|^2 \le T \tau^j \|r_0\|^2$, with $T = 16\kappa^5$, $\tau = \frac{\sqrt{\kappa}}{\sqrt{\kappa}+1}$, $\kappa = \frac{M}{\epsilon_H}$. The method runs at most $\min\{n, \tilde{O}(\sqrt{M/\epsilon_H})\}$ iterations (the "cap") before terminating or violating one condition. Nonconvex optimization via Newton-CG 31
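The following is a simplified sketch of the capped-CG idea, not the paper's full algorithm: CG is run on the shifted system $(H + 2\epsilon I)d = -g$ and the curvature of each search direction is monitored; the residual-growth test against $T\tau^j$ and the explicit iteration cap are omitted here for brevity.

```python
# Simplified sketch of the capped-CG idea (not the paper's full algorithm):
# run CG on the shifted system (H + 2 eps I) d = -g and monitor the curvature
# of the search directions. A direction p with p^T(H + 2 eps I)p <= eps ||p||^2
# certifies p^T H p <= -eps ||p||^2, i.e. negative curvature for H. The
# residual-growth test against T tau^j and the explicit cap are omitted here.
import numpy as np

def capped_cg_sketch(hess_vec, g, eps, zeta=0.5, max_iter=None):
    n = g.shape[0]
    max_iter = n if max_iter is None else max_iter
    Hbar = lambda v: hess_vec(v) + 2.0 * eps * v   # products with H + 2 eps I
    y = np.zeros(n)
    r = g.copy()                                   # residual Hbar y + g at y = 0
    p = -g.copy()
    for _ in range(max_iter):
        Hp = Hbar(p)
        if p @ Hp <= eps * (p @ p):                # insufficient positive curvature
            return "negative_curvature", p
        alpha = (r @ r) / (p @ Hp)
        y = y + alpha * p
        r_new = r + alpha * Hp
        if np.linalg.norm(r_new) <= zeta * np.linalg.norm(g):
            return "newton_step", y                # inexact regularized Newton step
        beta = (r_new @ r_new) / (r @ r)
        p, r = -r_new + beta * p, r_new
    return "newton_step", y                        # budget exhausted (simplified handling)
```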

53 Main result - Violating conditions in Capped CG Theorem (Royer, O'Neill, Wright 2018) If Capped CG applied to $Hd = -g$ runs for J iterations and $\|r_J\| > \zeta \|r_0\|$, then either $p_J^\top H p_J \le -\epsilon_H \|p_J\|^2$, or $\|r_J\|^2 > T \tau^J \|r_0\|^2$, $y_{J+1}$ can be computed, and there exists $j \in \{0,\dots,J-1\}$ such that $(y_{J+1} - y_j)^\top H (y_{J+1} - y_j) \le -\epsilon_H \|y_{J+1} - y_j\|^2$. Nonconvex optimization via Newton-CG 32

54 Main result - Violating conditions in Capped CG Theorem (Royer, O'Neill, Wright 2018) If Capped CG applied to $(H + 2\epsilon_H I)\, d = -g$ runs for J iterations and $\|r_J\| > \zeta \|r_0\|$, then either $p_J^\top H p_J \le -\epsilon_H \|p_J\|^2$, or $\|r_J\|^2 > T \tau^J \|r_0\|^2$, $y_{J+1}$ can be computed, and there exists $j \in \{0,\dots,J-1\}$ such that $(y_{J+1} - y_j)^\top H (y_{J+1} - y_j) \le -\epsilon_H \|y_{J+1} - y_j\|^2$. Nonconvex optimization via Newton-CG 32

55 Main result - Violating conditions in Capped CG Theorem (Royer, O'Neill, Wright 2018) If Capped CG applied to $(H + 2\epsilon_H I)\, d = -g$ runs for J iterations and $\|r_J\| > \zeta \|r_0\|$, then either $p_J^\top H p_J \le -\epsilon_H \|p_J\|^2$, or $\|r_J\|^2 > T \tau^J \|r_0\|^2$, $y_{J+1}$ can be computed, and there exists $j \in \{0,\dots,J-1\}$ such that $(y_{J+1} - y_j)^\top H (y_{J+1} - y_j) \le -\epsilon_H \|y_{J+1} - y_j\|^2$. Proof: follows a proof of accelerated methods from Bubeck (2014) and its variant for nonconvex accelerated gradient (Carmon et al 2017), but applied to quadratic functions. But in our case, we only use intrinsic properties of CG and look at quadratics, so we directly obtain negative curvature directions! Nonconvex optimization via Newton-CG 32

56 Capped Conjugate Gradient - summary Running Capped CG Applying Capped CG to $(\nabla^2 f(x_k) + 2\epsilon_H I)\, d = -\nabla f(x_k)$ yields one of the two following outcomes: (1) a regularized Newton step $d_k$ with $\|(\nabla^2 f(x_k) + 2\epsilon_H I)\, d_k + \nabla f(x_k)\| \le \zeta \|r_0\|$; (2) a direction of negative curvature at most $-\epsilon_H$; in at most $\tilde{O}(\min\{n, \epsilon_H^{-1/2}\})$ iterations/Hessian-vector products. Nonconvex optimization via Newton-CG 33

57 Outline 1 Newton-type methods with negative curvature 2 Newton-Capped Conjugate Gradient Conjugate gradient and nonconvex quadratics Newton-Capped CG algorithms 3 Numerical results Nonconvex optimization via Newton-CG 34

58 Building on Capped CG Two new instances of our generic method Phase One: when the gradient norm is large, use Capped CG only to compute search directions; Phase Two: when the gradient norm is small, use standard CG to estimate the smallest eigenvalue. We no longer compute $\lambda_{\min}(\nabla^2 f)$ at the beginning of the iteration. Nonconvex optimization via Newton-CG 35

59 Newton-Capped CG Inputs: $x_0 \in \mathbb{R}^n$, $\theta \in (0, 1)$, $\eta > 0$, $\epsilon_g \in (0, 1)$, $\epsilon_H \in (0, 1)$, $\delta \in (0, 1]$. For k = 0, 1, 2,...: (1) If $\|\nabla f(x_k)\| > \epsilon_g$, compute $d_k$ via Capped CG. (2) Otherwise, apply CG as an eigenvalue oracle. If this oracle returns a certificate that $\nabla^2 f(x_k) \succeq -\epsilon_H I$, terminate; otherwise use its output as $d_k$. (3) Backtracking line search (unchanged). (4) Set $x_{k+1} = x_k + \alpha_k d_k$. Probabilistic analysis We may terminate at a non-stationary point... yet the method is always well defined. Nonconvex optimization via Newton-CG 36
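A hedged sketch of this two-phase outer loop, reusing the capped_cg_sketch and lanczos_min_eig helpers from the sketches above; using Lanczos rather than CG as the Phase Two eigenvalue oracle, and the default parameters, are illustrative substitutions rather than the paper's exact choices.

```python
# Hedged sketch of this two-phase outer loop, reusing capped_cg_sketch and
# lanczos_min_eig from the sketches above. Using Lanczos instead of CG as the
# Phase Two eigenvalue oracle, and the default parameters, are illustrative
# substitutions rather than the paper's exact choices.
import numpy as np

def newton_capped_cg(f, grad, hess_vec, x0, eps_g=1e-5, eps_H=1e-3,
                     theta=0.5, eta=0.1, max_iter=500, oracle_iters=50):
    """hess_vec(x, v) should return the Hessian-vector product at x."""
    x = x0.copy()
    n = x.size
    for _ in range(max_iter):
        g = grad(x)
        Hv = lambda v: hess_vec(x, v)
        if np.linalg.norm(g) > eps_g:
            # Phase One: large gradient, Capped CG supplies the step.
            kind, d = capped_cg_sketch(Hv, g, eps_H)
            if kind == "negative_curvature" and d @ g > 0:
                d = -d                              # keep the curvature direction non-ascent
        else:
            # Phase Two: small gradient, call the eigenvalue oracle.
            lam_est, v = lanczos_min_eig(Hv, n, oracle_iters)
            if lam_est >= -eps_H / 2.0:
                return x                            # simplified certificate: stop here
            d = -v if v @ g > 0 else v              # negative curvature, non-ascent
        # Cubic backtracking line search, as in the framework above.
        fx, alpha, dn3 = f(x), 1.0, np.linalg.norm(d) ** 3
        while f(x + alpha * d) >= fx - (eta / 6.0) * alpha ** 3 * dn3:
            alpha *= theta
            if alpha < 1e-12:                       # safeguard
                break
        x = x + alpha * d
    return x
```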

60 Complexity results (order one) Theorem - Number of iterations The method finds $x_k$ such that $\|\nabla f(x_k)\| \le \epsilon_g$ (an $\epsilon_g$-point) in at most $O(\max\{\epsilon_g^{-3}\epsilon_H^3, \epsilon_H^{-3}\})$ iterations; Each iteration corresponds to at most $\tilde{O}(\min\{n, \epsilon_H^{-1/2}\})$ Capped CG iterations. Theorem - Computational complexity The method reaches an $\epsilon_g$-point using at most $\tilde{O}(\min\{n, \epsilon_H^{-1/2}\} \max\{\epsilon_g^{-3}\epsilon_H^3, \epsilon_H^{-3}\})$ gradients/Hessian-vector products. With $\epsilon_H = \epsilon_g^{1/2}$: $\tilde{O}(\min\{n\epsilon_g^{-3/2}, \epsilon_g^{-7/4}\})$. Best known bound without direct Hessian calculation. Nonconvex optimization via Newton-CG 37

61 Second-order complexity results (general) Theorem Goal: reach an $(\epsilon_g, \epsilon_H)$-point $x_k$ such that $\|\nabla f(x_k)\| \le \epsilon_g$, $\lambda_k = \lambda_{\min}(\nabla^2 f(x_k)) \ge -\epsilon_H$. Use CG as eigenvalue oracle with $\delta \in [0, 1)$. An $(\epsilon_g, \epsilon_H)$-point is reached using at most $O(\max\{\epsilon_g^{-3}\epsilon_H^3, \epsilon_H^{-3}\})$ iterations and $\tilde{O}(\min\{n, \epsilon_H^{-1/2}\} \max\{\epsilon_g^{-3}\epsilon_H^3, \epsilon_H^{-3}\})$ gradients/Hessian-vector products, with probability at least $(1 - \delta)^{O(\min\{\epsilon_g^{-3}\epsilon_H^3, \epsilon_H^{-3}\})}$. Nonconvex optimization via Newton-CG 38

62 Second-order complexity results (specific) Goal: reach an $(\epsilon, \sqrt{\epsilon})$-point $x_k$ such that $\|\nabla f(x_k)\| \le \epsilon$, $\lambda_k = \lambda_{\min}(\nabla^2 f(x_k)) \ge -\sqrt{\epsilon}$. Use CG as eigenvalue oracle with $\delta \in [0, 1)$. Theorem An $(\epsilon, \sqrt{\epsilon})$-point is reached using at most $O(\epsilon^{-3/2})$ iterations and $\tilde{O}(\min\{n\epsilon^{-3/2}, \epsilon^{-7/4}\})$ gradients/Hessian-vector products, with probability at least $(1 - \delta)^{O(\epsilon^{-3/2})}$. Nonconvex optimization via Newton-CG 39

63 Outline 1 Newton-type methods with negative curvature 2 Newton-Capped Conjugate Gradient 3 Numerical results Nonconvex optimization via Newton-CG 40

64 Testing framework Part of an ongoing numerical study; Focus on Newton+Capped CG (best performance among our variants); Comparison with other methods popular in: Large-scale optimization: Nonlinear CG, L-BFGS; Data science: Variants of (accelerated) gradient descent. Nonconvex optimization via Newton-CG 41

65 A classical optimization benchmark Setup 61 nonconvex problems from CUTEst, dimensions from 2 to 500; $\epsilon_g = 10^{-5}$, $\epsilon_H = \sqrt{\epsilon_g}$; Algorithms Newton-Capped CG; Nonlinear CG (Polak-Ribière); Gradient descent + negative curvature (2 versions). Nonconvex optimization via Newton-CG 42

66 A classical optimization benchmark Setup 61 nonconvex problems from CUTEst, dimensions from 2 to 500; $\epsilon_g = 10^{-5}$, $\epsilon_H = \sqrt{\epsilon_g}$; Nonconvex optimization via Newton-CG 42

67 A nonconvex estimation problem Tukey loss function (Carmon et al., ICML 2017): $\min_{x \in \mathbb{R}^n} f(x) = \sum_{i=1}^{30} h(a_i^\top x - b_i)$ where $h(\theta) = \theta^2/(1 + \theta^2)$, with $a_i \sim \mathcal{N}(0, I_n)$ and $b_i = a_i^\top x + $ non-Gaussian noise. Stopping criterion: $\|\nabla f(x)\| \le \epsilon_g$. Four algorithms Newton+Capped CG; Nonlinear CG (Polak-Ribière); L-BFGS; Heavy ball. Nonconvex optimization via Newton-CG 43
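For concreteness, a minimal numpy sketch of this Tukey-type objective and its gradient, together with an illustrative synthetic instance; the noise model and dimensions are assumptions, not the exact setup of the experiment.

```python
# Minimal numpy sketch of this Tukey-type objective and its gradient; the
# synthetic instance (noise model, dimensions) is an illustrative assumption,
# not the exact setup of the experiment.
import numpy as np

def tukey_objective(x, A, b):
    """Objective and gradient of sum_i h(a_i^T x - b_i), h(t) = t^2/(1+t^2)."""
    t = A @ x - b
    f = np.sum(t ** 2 / (1.0 + t ** 2))
    grad = A.T @ (2.0 * t / (1.0 + t ** 2) ** 2)   # h'(t) = 2t / (1 + t^2)^2
    return f, grad

# Synthetic instance loosely following the slide: 30 Gaussian rows, heavy-tailed noise.
rng = np.random.default_rng(0)
n, m = 50, 30
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + rng.standard_t(df=1, size=m)      # non-Gaussian (Cauchy-like) noise
f0, g0 = tukey_objective(np.zeros(n), A, b)
```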

68 Nonconvex estimation problem: results Nonconvex optimization via Newton-CG 44

69 A matrix optimization problem Matrix problem $\min_{U,V} \frac{1}{2} \|P_\Omega(UV^\top - M)\|_F^2$, with $M \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, $|\Omega| = 15\%\, mn$. Fraction of the MNIST dataset (0-1 digits): find the first principal component (r = 1). Comparison General-purpose optimization methods: Newton+Capped CG; Nonlinear CG (Polak-Ribière); Dedicated solvers: Alternating gradient descent (Tanner and Wei 2016); LMaFit (Wen et al. 2012). Nonconvex optimization via Newton-CG 45

70 A matrix completion problem: results Nonconvex optimization via Newton-CG 46

71 Conclusion Newton-CG: standard wisdom Useful for large-scale optimization; No specific complexity guarantees... nor justification for its ability to detect negative curvature through CG. Newton-CG: our point of view Conjugate gradient: Analysis for indefinite quadratics; Can be used as an eigenvalue oracle; Newton + Capped CG: Best known complexity bounds; First order: deterministic $\tilde{O}(\epsilon_g^{-7/4})$ complexity; Probabilistic results for second order. Nonconvex optimization via Newton-CG 47

72 To be continued For more information... Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization, C. W. Royer and S. J. Wright, SIAM J. Optim. 28(2). A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization, C. W. Royer, M. O'Neill and S. J. Wright, arXiv preprint; accepted in Math. Prog. Ongoing work Numerical study; Trust-region framework; Extension to constrained problems. Nonconvex optimization via Newton-CG 48

73 To be continued For more information... Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization, C. W. Royer and S. J. Wright, SIAM J. Optim. 28(2). A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization, C. W. Royer, M. O'Neill and S. J. Wright, arXiv preprint; accepted in Math. Prog. Ongoing work Numerical study; Trust-region framework; Extension to constrained problems. Thank you for your attention! croyer2@wisc.edu Nonconvex optimization via Newton-CG 48
