Second Order Optimization Algorithms I

Second Order Optimization Algorithms I

Yinyu Ye
Department of Management Science and Engineering
Stanford University, Stanford, CA 94305, U.S.A.
http://www.stanford.edu/~yyye

Chapters 7, 8, 9 and 10

The 1.5-Order Algorithm: Conjugate Gradient Method I

Second-order information is used, but there is no need to invert it.

0) Initialization: Given an initial solution $x^0$, let $g^0 = \nabla f(x^0)$, $d^0 = -g^0$ and $k = 0$.
1) Iterate Update: $x^{k+1} = x^k + \alpha_k d^k$, where $\alpha_k = -\frac{(g^k)^T d^k}{(d^k)^T \nabla^2 f(x^k) d^k}$.
2) Compute Conjugate Direction: Compute $g^{k+1} = \nabla f(x^{k+1})$. Unless $k = n-1$: set $d^{k+1} = -g^{k+1} + \beta_k d^k$, where $\beta_k = \frac{(g^{k+1})^T \nabla^2 f(x^k) d^k}{(d^k)^T \nabla^2 f(x^k) d^k}$; set $k = k + 1$ and go to Step 1.
3) Restart: Replace $x^0$ by $x^n$ and go to Step 0.

For convex quadratic minimization, this process ends in no more than one round.
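As a concrete illustration, here is a minimal NumPy sketch of the method above applied to a convex quadratic $f(x) = \frac{1}{2}x^TQx + c^Tx$, whose Hessian is the constant matrix $Q$; the function and variable names are illustrative, not part of the slides.

```python
import numpy as np

def cg_quadratic(Q, c, x0, tol=1e-10):
    """Conjugate gradient for f(x) = 0.5 x^T Q x + c^T x with Q symmetric PD.
    Uses alpha_k = -(g^T d)/(d^T Q d) and beta_k = (g_new^T Q d)/(d^T Q d)."""
    x, n = x0.astype(float), len(x0)
    g = Q @ x + c            # gradient of the quadratic
    d = -g                   # initial (steepest-descent) direction
    for k in range(n):       # at most n conjugate steps for a quadratic
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)       # exact line search along d
        x = x + alpha * d
        g = Q @ x + c                     # new gradient
        if np.linalg.norm(g) < tol:
            break
        beta = (g @ Qd) / (d @ Qd)        # Hessian-conjugacy coefficient
        d = -g + beta * d
    return x

# Example: the minimizer solves Q x = -c.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, 2.0])
print(cg_quadratic(Q, c, np.zeros(2)), np.linalg.solve(Q, -c))
```

On a quadratic the loop terminates after at most $n$ conjugate steps, which is the "one round" claimed above.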

The 1.5-Order Algorithm: Conjugate Gradient Method II

The information of the Hessian is learned (more on this later):

0) Initialization: Given an initial solution $x^0$, let $g^0 = \nabla f(x^0)$, $d^0 = -g^0$ and $k = 0$.
1) Iterate Update: $x^{k+1} = x^k + \alpha_k d^k$, where $\alpha_k$ is determined by a one-dimensional line search.
2) Compute Conjugate Direction: Compute $g^{k+1} = \nabla f(x^{k+1})$. Unless $k = n-1$: set $d^{k+1} = -g^{k+1} + \beta_k d^k$, where
$$\beta_k = \frac{\|g^{k+1}\|^2}{\|g^k\|^2} \quad \text{or} \quad \beta_k = \frac{(g^{k+1} - g^k)^T g^{k+1}}{\|g^k\|^2};$$
set $k = k + 1$ and go to Step 1.
3) Restart: Replace $x^0$ by $x^n$ and go to Step 0.
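A minimal sketch of this Hessian-free variant, substituting a backtracking Armijo line search for the exact one-dimensional search and using the second (Polak-Ribière) choice of $\beta_k$ with a common nonnegativity safeguard; all names and the Rosenbrock test function are illustrative assumptions, not from the slides.

```python
import numpy as np

def nonlinear_cg(f, grad, x0, iters=200, tol=1e-8):
    """Nonlinear CG with a Polak-Ribiere beta and a backtracking line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    n = len(x)
    for k in range(iters):
        if np.linalg.norm(g) < tol:
            break
        if g @ d >= 0:                  # safeguard: ensure a descent direction
            d = -g
        alpha, fx = 1.0, f(x)
        for _ in range(60):             # backtracking (Armijo) line search
            if f(x + alpha * d) <= fx + 1e-4 * alpha * (g @ d):
                break
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        if (k + 1) % n == 0:            # periodic restart, as in Step 3
            d = -g_new
        else:
            beta = max(0.0, (g_new - g) @ g_new / (g @ g))  # PR+ coefficient
            d = -g_new + beta * d
        x, g = x_new, g_new
    return x

# Example on the Rosenbrock function.
f = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
grad = lambda z: np.array([-2*(1 - z[0]) - 400*z[0]*(z[1] - z[0]**2),
                           200*(z[1] - z[0]**2)])
print(nonlinear_cg(f, grad, np.array([-1.2, 1.0])))
```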

Bisection Method: First Order Method

For a one-variable problem, a KKT point is a root of $g(x) := f'(x) = 0$. Assume we know an interval $[a, b]$ with $a < b$ and $g(a)g(b) < 0$. Then there exists an $x^*$, $a < x^* < b$, such that $g(x^*) = 0$; that is, the interval $[a, b]$ contains a root of $g$. How do we find $x^*$ within an error tolerance $\epsilon$, that is, find $x$ with $|x - x^*| \le \epsilon$?

0) Initialization: let $x_l = a$, $x_r = b$.
1) Let $x_m = (x_l + x_r)/2$ and evaluate $g(x_m)$.
2) If $g(x_m) = 0$ or $x_r - x_l < \epsilon$, stop and output $x = x_m$. Otherwise, if $g(x_l)g(x_m) > 0$ set $x_l = x_m$; else set $x_r = x_m$; and return to Step 1.

The length of the new interval containing a root after one bisection step is $1/2$ of the previous one, which gives a linear convergence rate of $1/2$ (see the sketch below).

Figure 1: Illustration of Bisection
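A direct transcription of the bisection loop into Python might look as follows; the names and the example derivative are illustrative.

```python
def bisection(g, a, b, eps=1e-8):
    """Bisection for a root of g in [a, b], assuming g(a) * g(b) < 0."""
    xl, xr = a, b
    while xr - xl >= eps:
        xm = (xl + xr) / 2.0
        gm = g(xm)
        if gm == 0.0:
            return xm
        if g(xl) * gm > 0:   # root lies in the right half
            xl = xm
        else:                # root lies in the left half
            xr = xm
    return (xl + xr) / 2.0

# Example: the stationary point of f(x) = x^2 - 2x is the root of g(x) = 2x - 2.
print(bisection(lambda x: 2 * x - 2, 0.0, 5.0))
```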

Golden Section Method: Zero Order Method

Assume that the one-variable function $f(x)$ is unimodal on the interval $[a, b]$, that is, for any point $x \in [a_r, b_l]$ with $a \le a_r < b_l \le b$, we have $f(x) \le \max\{f(a_r), f(b_l)\}$. How do we find $x^*$ within an error tolerance $\epsilon$?

0) Initialization: let $x_l = a$, $x_r = b$, and choose a constant $0 < r < 0.5$.
1) Let two other points be $\hat{x}_l = x_l + r(x_r - x_l)$ and $\hat{x}_r = x_l + (1-r)(x_r - x_l)$, and evaluate their function values.
2) If $f(\hat{x}_l) < f(\hat{x}_r)$, update the triple $x_r = \hat{x}_r$, $\hat{x}_r = \hat{x}_l$, $x_l = x_l$; otherwise update the triple $x_l = \hat{x}_l$, $\hat{x}_l = \hat{x}_r$, $x_r = x_r$; and return to Step 1.

In either case, the length of the new interval after one golden-section step is $(1-r)$ times the old one. If we set $(1-2r)/(1-r) = r$, then only one point is new in each step and needs to be evaluated. This gives $r \approx 0.382$ and a linear convergence rate of $1 - r \approx 0.618$ (see the sketch below).

Figure 2: Illustration of Golden Section
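A minimal Python sketch of golden-section search, exploiting the fact that with $r = (3 - \sqrt{5})/2$ one interior point carries over between steps, so only one new function evaluation is needed per step; names are illustrative.

```python
import math

def golden_section(f, a, b, eps=1e-8):
    """Golden-section search for the minimizer of a unimodal f on [a, b]."""
    r = (3.0 - math.sqrt(5.0)) / 2.0          # r ~ 0.382, so 1 - r ~ 0.618
    xl, xr = a, b
    hl = xl + r * (xr - xl)                   # interior points
    hr = xl + (1 - r) * (xr - xl)
    fl, fr = f(hl), f(hr)
    while xr - xl >= eps:
        if fl < fr:                           # minimizer is in [xl, hr]
            xr, hr, fr = hr, hl, fl
            hl = xl + r * (xr - xl)
            fl = f(hl)
        else:                                 # minimizer is in [hl, xr]
            xl, hl, fl = hl, hr, fr
            hr = xl + (1 - r) * (xr - xl)
            fr = f(hr)
    return (xl + xr) / 2.0

# Example: minimize f(x) = (x - 2)^2 on [0, 5].
print(golden_section(lambda x: (x - 2.0)**2, 0.0, 5.0))
```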

Newton's Method: A Second Order Method

For functions of a single real variable $x$, the KKT condition is $g(x) := f'(x) = 0$. When $f$ is twice continuously differentiable, $g$ is once continuously differentiable, and Newton's method can be a very effective way to solve such equations and hence to locate a root of $g$. Given a starting point $x^0$, Newton's method for solving the equation $g(x) = 0$ generates the sequence of iterates
$$x^{k+1} = x^k - \frac{g(x^k)}{g'(x^k)}.$$
The iteration is well defined provided that $g'(x^k) \neq 0$ at each step.

For multiple variables, Newton's method for minimizing $f(x)$ is defined as
$$x^{k+1} = x^k - (\nabla^2 f(x^k))^{-1} \nabla f(x^k).$$

We now introduce the second-order $\beta$-Lipschitz condition: for any point $x$ and direction vector $d$,
$$\|\nabla f(x+d) - \nabla f(x) - \nabla^2 f(x) d\| \le \beta \|d\|^2.$$
In the following, for notational simplicity, we write $g(x) = \nabla f(x)$ and $\nabla g(x) = \nabla^2 f(x)$.
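A minimal NumPy sketch of the multivariate Newton iteration above; the example function, and the choice to solve the Newton system rather than form the inverse Hessian explicitly, are illustrative assumptions.

```python
import numpy as np

def newton_minimize(grad, hess, x0, iters=50, tol=1e-10):
    """Multivariate Newton's method: x+ = x - (hess f(x))^{-1} grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # solve the Newton system
    return x

# Example: f(x) = x1^4 + x2^2, minimized at the origin.
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, [1.0, 1.0]))
```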

Local Convergence Theorem of Newton's Method

Theorem 1. Let $f(x)$ be second-order $\beta$-Lipschitz and let the smallest absolute eigenvalue of its Hessian be uniformly bounded below by $\lambda_{\min} > 0$. Then, provided that $\|x^0 - x^*\|$ is sufficiently small, the sequence generated by Newton's method converges quadratically to $x^*$, a KKT solution with $g(x^*) = 0$.

Indeed,
$$
\begin{aligned}
\|x^{k+1} - x^*\| &= \|x^k - x^* - \nabla g(x^k)^{-1} g(x^k)\| \\
&= \|\nabla g(x^k)^{-1}\big(g(x^k) - \nabla g(x^k)(x^k - x^*)\big)\| \\
&= \|\nabla g(x^k)^{-1}\big(g(x^k) - g(x^*) - \nabla g(x^k)(x^k - x^*)\big)\| \\
&\le \|\nabla g(x^k)^{-1}\| \, \|g(x^k) - g(x^*) - \nabla g(x^k)(x^k - x^*)\| \\
&\le \|\nabla g(x^k)^{-1}\| \, \beta \|x^k - x^*\|^2 \\
&\le \frac{\beta}{\lambda_{\min}} \|x^k - x^*\|^2.
\end{aligned}
\qquad (1)
$$
Thus, when $\frac{\beta}{\lambda_{\min}} \|x^0 - x^*\| < 1$, quadratic convergence takes place:
$$\frac{\beta}{\lambda_{\min}} \|x^{k+1} - x^*\| \le \left(\frac{\beta}{\lambda_{\min}} \|x^k - x^*\|\right)^2.$$
Such a starting solution $x^0$ is called an approximate root of $g(x)$.

How to Check a Point Being an Approximate Root

Theorem 2 (Smale 86). Let $g(x)$ be an analytic function. If $x$ in the domain of $g$ satisfies
$$\sup_{k>1} \left| \frac{g^{(k)}(x)}{k! \, g'(x)} \right|^{1/(k-1)} \le \frac{1}{8} \left| \frac{g'(x)}{g(x)} \right|,$$
then $x$ is an approximate root of $g$.

In the following, for simplicity, let the root lie in the interval $[0, R]$.

Corollary 1 (Y. 92). Let $g(x)$ be an analytic function on $\mathbb{R}_{++}$, and let $g$ be convex and monotonically decreasing. Furthermore, for $x \in \mathbb{R}_{++}$ and $k > 1$ let
$$\left| \frac{g^{(k)}(x)}{k! \, g'(x)} \right|^{1/(k-1)} \le \frac{\alpha}{8} \, x^{-1}$$
for some constant $\alpha > 0$. Then, if the root $x^* \in [\hat{x}, (1 + 1/\alpha)\hat{x}] \subset \mathbb{R}_{++}$, $\hat{x}$ is an approximate root of $g$.

Hybrid of Bisection and Newton I

Note that the interval $[\hat{x}, (1 + 1/\alpha)\hat{x}]$ becomes wider at a geometric rate as $\hat{x}$ increases. Thus, we may symbolically construct a sequence of points
$$\hat{x}_0 = \epsilon, \quad \hat{x}_1 = (1 + 1/\alpha)\hat{x}_0, \quad \ldots, \quad \hat{x}_j = (1 + 1/\alpha)\hat{x}_{j-1}, \quad \ldots$$
until $\hat{x}_J \ge R$. Clearly the total number of points, $J$, is bounded by $O(\log(R/\epsilon))$.

Moreover, define a sequence of intervals $I_j = [\hat{x}_{j-1}, \hat{x}_j] = [\hat{x}_{j-1}, (1 + 1/\alpha)\hat{x}_{j-1}]$. Then, if the root $x^*$ of $g$ lies in any one of these intervals, say $I_j$, the front point $\hat{x}_{j-1}$ of that interval is an approximate root of $g$, so that starting from it Newton's method generates an $x$ with $|x - x^*| \le \epsilon$ in $O(\log\log(1/\epsilon))$ iterations.

Hybrid of Bisection and Newton II

Now the question is how to identify the interval that contains $x^*$. This time we bisect the number of intervals, that is, we evaluate the function value at the point $\hat{x}_{j_m}$ with $j_m = \lfloor J/2 \rfloor$. Each bisection thus cuts the total number of intervals in half. Since the total number of intervals is $O(\log(R/\epsilon))$, in at most $O(\log\log(R/\epsilon))$ bisection steps we locate the interval that contains $x^*$. Hence the total number of iterations, counting both the bisection and the Newton steps, is $O(\log\log(R/\epsilon))$.

Here we take advantage of the global convergence property of bisection and the local quadratic convergence property of Newton's method; we will see more of these features later. A sketch of the combined procedure follows.
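A rough Python sketch of the hybrid scheme under the assumptions of Corollary 1 (a convex, monotonically decreasing $g$ on $\mathbb{R}_{++}$): build the geometric grid, locate the bracketing interval by bisecting on the grid index (testing the sign of $g$), then run Newton from its front point. The example $g$ and the value of $\alpha$ are illustrative.

```python
import math

def hybrid_root(g, gprime, alpha, R, eps=1e-6):
    """Hybrid bisection/Newton sketch for a convex, decreasing g on (0, R]."""
    ratio = 1.0 + 1.0 / alpha
    J = max(1, math.ceil(math.log(R / eps) / math.log(ratio)))
    grid = [eps * ratio**j for j in range(J + 1)]   # geometric grid of points

    # Bisection on the grid index: g is decreasing, so the root is to the
    # right of points where g > 0 and to the left of points where g < 0.
    lo, hi = 0, J
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if g(grid[mid]) > 0:
            lo = mid
        else:
            hi = mid

    # Newton's method from the front point of the bracketing interval.
    x = grid[lo]
    for _ in range(50):
        step = g(x) / gprime(x)
        x -= step
        if abs(step) < eps:
            break
    return x

# Example (illustrative): g(t) = 1/t^2 - 1, root at t = 1.
print(hybrid_root(lambda t: 1.0 / t**2 - 1.0, lambda t: -2.0 / t**3,
                  alpha=12.0, R=10.0))
```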

Spherical Constrained Nonconvex Quadratic Minimization I

$$\min \ \tfrac{1}{2}x^TQx + c^Tx, \quad \text{s.t.} \ \|x\|^2 = 1,$$
where $Q \in S^n$ is any symmetric data matrix. If $c = 0$, this problem becomes finding the least eigenvalue of $Q$.

The necessary and sufficient condition (provable using SDP) for $x$ to be a global minimizer of the problem is
$$(Q + \lambda I)x = -c, \quad (Q + \lambda I) \succeq 0, \quad \|x\|_2^2 = 1,$$
which implies $\lambda \ge -\lambda_{\min}(Q) > 0$ when $Q$ is not positive semidefinite, where $\lambda_{\min}(Q)$ is the least eigenvalue of $Q$. If the optimal $\lambda = -\lambda_{\min}(Q)$, then $c$ must be orthogonal to the $\lambda_{\min}(Q)$-eigenvector, which can be checked using the power algorithm.

The minimal objective value satisfies
$$\tfrac{1}{2}x^TQx + c^Tx = -\tfrac{1}{2}x^T(Q + \lambda I)x - \tfrac{1}{2}\lambda\|x\|^2 \le -\frac{\lambda}{2}. \qquad (2)$$

Sphere Constrained Nonconvex Quadratic Minimization II

WLOG, let us assume that the least eigenvalue of $Q$ is $0$. Then we must have $\lambda \ge 0$. If the optimal $\lambda = 0$, then $c$ must be orthogonal to the $0$-eigenvectors of $Q$, which can be checked using the power algorithm to find such an eigenvector. Therefore, we assume that the optimal $\lambda > 0$. Furthermore, there is an upper bound on $\lambda$:
$$\lambda = \lambda\|x\|^2 \le x^T(Q + \lambda I)x = -c^Tx \le \|c\|\,\|x\| = \|c\|.$$

Now let $x(\lambda) = -(Q + \lambda I)^{-1}c$; the problem becomes finding the root of $\|x(\lambda)\|^2 = 1$.

Lemma 1. The analytic function $\|x(\lambda)\|^2$ is convex and monotonically decreasing, with $\alpha = 12$ in Corollary 1.

Theorem 3. The 1-spherical constrained quadratic minimization problem can be computed in $O(\log\log(\|c\|/\epsilon))$ iterations, where each iteration costs $O(n^3)$ arithmetic operations.

What about 2-spherical constrained quadratic minimization, that is, quadratic minimization with two ellipsoidal constraints?
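A minimal numerical sketch of this root-finding view: it brackets $\lambda$ and uses plain bisection on $\|x(\lambda)\| - 1$ rather than the hybrid Newton scheme, and it assumes the easy case in which $c$ is not orthogonal to the $\lambda_{\min}(Q)$-eigenvector; all names are illustrative.

```python
import numpy as np

def sphere_qp(Q, c, tol=1e-10):
    """Minimize 0.5 x^T Q x + c^T x subject to ||x|| = 1 (easy case only).

    Finds lambda >= max(0, -lambda_min(Q)) with ||(Q + lambda I)^{-1} c|| = 1
    by bisection, then recovers x(lambda) = -(Q + lambda I)^{-1} c."""
    n = Q.shape[0]
    lam_min = np.linalg.eigvalsh(Q)[0]
    lo = max(0.0, -lam_min) + 1e-12          # (Q + lambda I) must be PSD
    hi = lo + np.linalg.norm(c) + 1.0        # lambda <= ||c|| above the shift
    norm_x = lambda lam: np.linalg.norm(
        np.linalg.solve(Q + lam * np.eye(n), c))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if norm_x(mid) > 1.0:                # ||x(lambda)|| decreases in lambda
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return -np.linalg.solve(Q + lam * np.eye(n), c), lam

# Example with an indefinite Q.
Q = np.array([[1.0, 0.0], [0.0, -2.0]])
c = np.array([1.0, 1.0])
x, lam = sphere_qp(Q, c)
print(x, lam, np.linalg.norm(x))
```

The $O(n^3)$ cost per iteration in Theorem 3 corresponds to the linear solve with $Q + \lambda I$ inside the loop.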

Second Order Method for Minimizing Lipschitz f(x)

Recall the second-order $\beta$-Lipschitz condition: for any point $x$ and direction vector $d$,
$$\|g(x + d) - g(x) - \nabla g(x)d\| \le \beta\|d\|^2,$$
which further implies
$$f(x + d) - f(x) \le g(x)^Td + \tfrac{1}{2}d^T\nabla g(x)d + \tfrac{\beta}{3}\|d\|^3.$$

The second-order method, at the $k$th iterate, would let $x^{k+1} = x^k + d^k$, where
$$d^k = \arg\min_d \ (c^k)^Td + \tfrac{1}{2}d^TQ^kd + \tfrac{\beta}{3}\alpha^3 \quad \text{s.t.} \ \|d\| \le \alpha,$$
with $c^k = g(x^k)$ and $Q^k = \nabla g(x^k)$.

One typically fixes $\alpha$ to a trusted radius $\alpha^k$, so that it becomes a sphere-constrained problem (the inequality is normally active if the Hessian is not PSD):
$$(Q^k + \lambda^k I)d^k = -c^k, \quad (Q^k + \lambda^k I) \succeq 0, \quad \|d^k\|_2^2 = (\alpha^k)^2.$$
For fixed $\alpha^k$, the method is generally called the trust-region method.
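For orientation, a schematic fixed-radius trust-region loop is sketched below. It is not the slides' algorithm verbatim: the subproblem solver is passed in as a callable, the radius is fixed to $\sqrt{\epsilon}/\beta$ as in the analysis on the next slide, and the stopping test is an assumed approximate second-order condition.

```python
import numpy as np

def trust_region_loop(grad, hess, solve_tr, x0, beta, eps=1e-6, max_iter=100):
    """Schematic fixed-radius trust-region loop.

    solve_tr(Q, c, alpha) should return a step d with ||d|| = alpha that
    minimizes c^T d + 0.5 d^T Q d over the sphere of radius alpha."""
    x = np.asarray(x0, dtype=float)
    alpha = np.sqrt(eps) / beta
    for _ in range(max_iter):
        c, Q = grad(x), hess(x)
        # Stop at an approximate second-order point: small gradient and
        # nearly positive semidefinite Hessian.
        if np.linalg.norm(c) <= eps and np.linalg.eigvalsh(Q)[0] >= -np.sqrt(eps):
            break
        x = x + solve_tr(Q, c, alpha)
    return x
```

With the unit-sphere solver sketched earlier, one could pass, for instance, `solve_tr = lambda Q, c, a: a * sphere_qp(a * a * Q, a * c)[0]`, since rescaling $d = \alpha u$ turns the radius-$\alpha$ subproblem into a unit-sphere one.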

Convergence Speed of the Second Order Method

Is there a trusted radius for which the method converges? A naive choice would be $\alpha^k = \sqrt{\epsilon}/\beta$. Then from reduction (2),
$$f(x^{k+1}) - f(x^k) \le -\frac{\lambda^k}{2}\|d^k\|^2 + \frac{\beta}{3}(\alpha^k)^3 = -\frac{\lambda^k(\alpha^k)^2}{2} + \frac{\beta}{3}(\alpha^k)^3 = -\frac{\lambda^k\epsilon}{2\beta^2} + \frac{\epsilon^{3/2}}{3\beta^2}.$$
Also,
$$
\begin{aligned}
\|g(x^{k+1})\| &= \|g(x^{k+1}) - (c^k + Q^kd^k) + (c^k + Q^kd^k)\| \\
&\le \|g(x^{k+1}) - (c^k + Q^kd^k)\| + \|c^k + Q^kd^k\| \\
&\le \beta\|d^k\|^2 + \lambda^k\|d^k\| = \beta(\alpha^k)^2 + \lambda^k\alpha^k = \frac{\epsilon}{\beta} + \frac{\lambda^k\sqrt{\epsilon}}{\beta}.
\end{aligned}
$$
Thus, one can stop the algorithm as soon as $\lambda^k \le \sqrt{\epsilon}$, so that the inequality becomes $\|g(x^{k+1})\| \le \frac{2\epsilon}{\beta}$. Furthermore, $\lambda_{\min}(\nabla g(x^k)) \ge -\lambda^k \ge -\sqrt{\epsilon}$. While the algorithm has not stopped ($\lambda^k > \sqrt{\epsilon}$), each iteration decreases $f$ by at least $\frac{\epsilon^{1.5}}{6\beta^2}$, which yields the bound below.

Theorem 4. Let $p^* = \inf f(x)$ be finite. Then in $O\!\left(\frac{\beta^2(f(x^0) - p^*)}{\epsilon^{1.5}}\right)$ iterations of the second-order method, the norm of the gradient vector is less than $\epsilon$ and the Hessian is $\sqrt{\epsilon}$-positive semidefinite, where each iteration solves a spherical-constrained quadratic minimization as discussed earlier.

Adaptive Trust-Region Second Order Method

One can treat $\alpha$ as a variable:
$$d^k = \arg\min_{(d,\alpha)} \ (c^k)^Td + \tfrac{1}{2}d^TQ^kd + \tfrac{\beta}{3}\alpha^3 \quad \text{s.t.} \ \|d\| \le \alpha.$$
Then, besides
$$(Q^k + \lambda I)d^k = -c^k, \quad (Q^k + \lambda I) \succeq 0, \quad \|d\|_2^2 = \alpha^2,$$
the optimality conditions also include $\alpha = \lambda/\beta$.

Thus, letting $d(\lambda) = -(Q^k + \lambda I)^{-1}c^k$, the problem becomes finding the root of
$$\|d(\lambda)\|^2 = \frac{\lambda^2}{\beta^2},$$
where $\lambda \ge -\lambda_{\min}(Q^k) > 0$ (assuming the current Hessian is not yet PSD), as in the hybrid of bisection and Newton method discussed earlier.
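A minimal sketch of this adaptive subproblem: find $\lambda$ with $\|d(\lambda)\| = \lambda/\beta$ by simple bisection (the slides suggest the hybrid bisection/Newton scheme instead) and recover $d = -(Q^k + \lambda I)^{-1}c^k$; it assumes the non-degenerate case where $c^k$ has a component along the most negative eigenvector, and all names are illustrative.

```python
import numpy as np

def adaptive_step(Q, c, beta, tol=1e-10):
    """Adaptive trust-region step: find lambda with ||(Q + lambda I)^{-1} c|| = lambda/beta,
    then return d = -(Q + lambda I)^{-1} c."""
    n = Q.shape[0]
    lam_min = np.linalg.eigvalsh(Q)[0]
    lo = max(0.0, -lam_min) + 1e-12
    # phi(lambda) = ||d(lambda)|| - lambda/beta is decreasing; find hi with phi(hi) < 0.
    phi = lambda lam: np.linalg.norm(
        np.linalg.solve(Q + lam * np.eye(n), c)) - lam / beta
    hi = lo + 1.0
    while phi(hi) > 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return -np.linalg.solve(Q + lam * np.eye(n), c), lam

# Example with an indefinite Hessian.
Q = np.array([[2.0, 0.0], [0.0, -1.0]])
c = np.array([1.0, 0.5])
d, lam = adaptive_step(Q, c, beta=1.0)
print(d, lam, np.linalg.norm(d))   # ||d|| should equal lam / beta
```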

Would Convexity Help?

Before we answer this question, let us summarize a generic form of one iteration of the second-order method for solving $\nabla f(x) = g(x) = 0$:
$$(\nabla g(x^k) + \lambda I)(x - x^k) = -\gamma g(x^k),$$
or
$$g(x^k) + \nabla g(x^k)(x - x^k) + \lambda(x - x^k) = (1 - \gamma)g(x^k).$$
Many interpretations:
$\gamma = 1$, $\lambda = 0$: pure Newton;
$\gamma$ and $\lambda$ sufficiently large: SDM (steepest descent);
$\gamma = 1$ and $\lambda$ decreasing to $0$: homotopy or path-following method.

The Quasi-Newton Method. More generally,
$$x = x^k - \alpha_k S^k g(x^k),$$
for a symmetric matrix $S^k$ with a step size $\alpha_k$.

The Quasi-Newton Method

For convex quadratic minimization, the convergence rate becomes
$$\left(\frac{\lambda_{\max}(S^kQ) - \lambda_{\min}(S^kQ)}{\lambda_{\max}(S^kQ) + \lambda_{\min}(S^kQ)}\right)^2,$$
where $\lambda_{\max}$ and $\lambda_{\min}$ represent the largest and smallest eigenvalues of a matrix.

$S^k$ can be viewed as a preconditioner, typically an approximation of the inverse of the Hessian matrix, and it can be learned from a regression model:
$$q^k := g(x^{k+1}) - g(x^k) = Q(x^{k+1} - x^k) = Qd^k, \quad k = 0, 1, \ldots$$
We actually learn $Q^{-1}$ from
$$Q^{-1}q^k = d^k, \quad k = 0, 1, \ldots$$
The process builds matrices $H^k$, $k = 0, 1, \ldots$, where the rank of $H^k$ is $k$; that is, at each step we learn a rank-one update: given $H^{k-1}$, $q^k$ and $d^k$, we solve
$$(H^{k-1} + h^k(h^k)^T)q^k = d^k$$
for the vector $h^k$. Then, after $n$ iterations, we build up $H^n = Q^{-1}$.

One can also "learn while doing":
$$x^{k+1} = x^k - \alpha_k\left(\frac{n-k}{n}I + \frac{k}{n}H^k\right)g(x^k),$$
which is similar to the Conjugate Gradient method.

We now give an affirmative answer: convexity helps a lot in second-order methods.
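A small numerical sketch of the rank-one "learning" step: solving $(H^{k-1} + h(h)^T)q^k = d^k$ for $h$ forces $h$ to be proportional to the residual $d^k - H^{k-1}q^k$, i.e. the classical symmetric rank-one (SR1) correction. The random test data are illustrative.

```python
import numpy as np

def learn_inverse_hessian(Q, steps):
    """Rank-one learning of Q^{-1} from pairs (d^k, q^k = Q d^k)."""
    n = Q.shape[0]
    H = np.zeros((n, n))
    for d in steps:
        q = Q @ d                      # gradient difference for a quadratic
        r = d - H @ q                  # residual of the secant equation
        if abs(r @ q) > 1e-12:         # skip a degenerate update
            H = H + np.outer(r, r) / (r @ q)   # (H + h h^T) q = d with h = r/sqrt(r^T q)
    return H

# Example: with n linearly independent steps, H should recover Q^{-1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)            # symmetric positive definite
steps = [rng.standard_normal(4) for _ in range(4)]
H = learn_inverse_hessian(Q, steps)
print(np.allclose(H, np.linalg.inv(Q), atol=1e-8))
```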

A Path-Following Algorithm for Unconstrained Optimization I

We assume that $f$ is convex and meets a local Lipschitz condition: for any point $x$ and some $\beta \ge 1$,
$$\|g(x + d) - g(x) - \nabla g(x)d\| \le \beta \, d^T\nabla g(x)d, \quad \text{whenever } \|d\| \le O(1) \qquad (3)$$
and $x + d$ is in the function domain.

We start from a solution $x^k$ that approximately satisfies
$$g(x) + \lambda x = 0, \quad \text{with } \lambda = \lambda^k > 0. \qquad (4)$$
Such a solution $x(\lambda)$ exists for any $\lambda > 0$ because it is the (unique) optimal solution of the problem
$$x(\lambda) = \arg\min_x \ f(x) + \frac{\lambda}{2}\|x\|^2,$$
and these solutions form a path down to $x(0)$.

Let the approximate path error at $x^k$ with $\lambda = \lambda^k$ satisfy
$$\|g(x^k) + \lambda^k x^k\| \le \frac{1}{2\beta}\lambda^k.$$
Then we would like to compute a new iterate $x^{k+1}$ such that
$$\|g(x^{k+1}) + \lambda^{k+1}x^{k+1}\| \le \frac{1}{2\beta}\lambda^{k+1}, \quad \text{where } 0 \le \lambda^{k+1} < \lambda^k.$$

A Path-Following Algorithm for Unconstrained Optimization II

When $\lambda^k$ is replaced by $\lambda^{k+1}$, say $(1-\eta)\lambda^k$ for some $\eta \in (0, 1]$, we aim to find a solution $x$ such that
$$g(x) + (1-\eta)\lambda^k x = 0.$$
We start from $x^k$ and apply the Newton iteration:
$$g(x^k) + \nabla g(x^k)d + (1-\eta)\lambda^k(x^k + d) = 0, \quad \text{or} \quad \nabla g(x^k)d + (1-\eta)\lambda^k d = -g(x^k) - (1-\eta)\lambda^k x^k. \qquad (5)$$

From the second expression, we have
$$
\begin{aligned}
\|\nabla g(x^k)d + (1-\eta)\lambda^k d\| &= \|g(x^k) + (1-\eta)\lambda^k x^k\| = \|g(x^k) + \lambda^k x^k - \eta\lambda^k x^k\| \\
&\le \|g(x^k) + \lambda^k x^k\| + \eta\lambda^k\|x^k\| \le \frac{1}{2\beta}\lambda^k + \eta\lambda^k\|x^k\|.
\end{aligned}
\qquad (6)
$$

On the other hand,
$$\|\nabla g(x^k)d + (1-\eta)\lambda^k d\|^2 = \|\nabla g(x^k)d\|^2 + 2(1-\eta)\lambda^k d^T\nabla g(x^k)d + ((1-\eta)\lambda^k)^2\|d\|^2.$$
From convexity, $d^T\nabla g(x^k)d \ge 0$; together with (6) we have
$$((1-\eta)\lambda^k)^2\|d\|^2 \le \left(\frac{1}{2\beta} + \eta\|x^k\|\right)^2(\lambda^k)^2$$
and
$$2(1-\eta)\lambda^k d^T\nabla g(x^k)d \le \left(\frac{1}{2\beta} + \eta\|x^k\|\right)^2(\lambda^k)^2.$$
The first inequality implies
$$\|d\|^2 \le \left(\frac{1}{2\beta(1-\eta)} + \frac{\eta}{1-\eta}\|x^k\|\right)^2.$$
Let the new iterate be $x^+ = x^k + d$. The second inequality implies
$$
\begin{aligned}
\|g(x^+) + (1-\eta)\lambda^k x^+\| &= \|g(x^+) - (g(x^k) + \nabla g(x^k)d) + (g(x^k) + \nabla g(x^k)d) + (1-\eta)\lambda^k(x^k + d)\| \\
&= \|g(x^+) - g(x^k) - \nabla g(x^k)d\| \le \beta \, d^T\nabla g(x^k)d \le \frac{\beta}{2(1-\eta)}\left(\frac{1}{2\beta} + \eta\|x^k\|\right)^2\lambda^k.
\end{aligned}
$$

We now just need to choose $\eta \in (0, 1)$ such that
$$\left(\frac{1}{2\beta(1-\eta)} + \frac{\eta}{1-\eta}\|x^k\|\right)^2 \le 1 \quad \text{and} \quad \frac{\beta\lambda^k}{2(1-\eta)}\left(\frac{1}{2\beta} + \eta\|x^k\|\right)^2 \le \frac{1}{2\beta}(1-\eta)\lambda^k = \frac{1}{2\beta}\lambda^{k+1}.$$
For example, given $\beta \ge 1$,
$$\eta = \frac{1}{2\beta(1 + \|x^k\|)}$$
would suffice. A schematic implementation is sketched below.

This gives linear convergence, since $\|x^k\|$ is typically bounded while following the path to optimality, whereas the convergence in the non-convex case is only arithmetic. Convexity, together with some types of second-order methods, makes convex optimization solvers into practical technologies.
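A schematic NumPy sketch of the path-following loop under assumption (3): shrink $\lambda$ by the factor $(1-\eta)$ with the $\eta$ chosen above, and take one Newton step of the regularized system per value of $\lambda$. The starting pair and the test function are illustrative, and no check is made that the initial point satisfies the path-error condition.

```python
import numpy as np

def path_following(grad, hess, x0, beta=1.0, lam0=1.0, lam_tol=1e-8):
    """Schematic path-following method for convex f: follow the solutions of
    grad f(x) + lam * x = 0 while shrinking lam geometrically."""
    x, lam = np.asarray(x0, dtype=float), lam0
    n = len(x)
    while lam > lam_tol:
        eta = 1.0 / (2.0 * beta * (1.0 + np.linalg.norm(x)))  # slide's choice of eta
        lam_new = (1.0 - eta) * lam
        # One Newton step for g(x) + lam_new * x = 0, starting from the current x:
        # (hess f(x) + lam_new I) d = -(grad f(x) + lam_new x)
        rhs = -(grad(x) + lam_new * x)
        d = np.linalg.solve(hess(x) + lam_new * np.eye(n), rhs)
        x, lam = x + d, lam_new
    return x

# Example on the smooth convex function f(x) = sum(exp(x_i) - x_i), minimized at 0.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
print(path_following(grad, hess, np.array([2.0, -1.0])))
```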