Line Search Methods for Unconstrained Optimisation


Line Search Methods for Unconstrained Optimisation. Lecture 8, Numerical Linear Algebra and Optimisation. Oxford University Computing Laboratory, MT 2007. Dr Raphael Hauser (hauser@comlab.ox.ac.uk)

The Generic Framework: For the purposes of this lecture we consider the unconstrained minimisation problem (UCM) min_{x in R^n} f(x), where f is in C^1(R^n, R) with Lipschitz continuous gradient g(x). In practice these smoothness assumptions are sometimes violated, but the algorithms we will develop are still observed to work well. The algorithms we will construct share a common feature: starting from an initial educated guess x_0 in R^n for a solution of (UCM), they produce a sequence of iterates (x_k)_{k in N} in R^n such that x_k converges to a point x* in R^n at which the first and second order necessary optimality conditions g(x*) = 0 and H(x*) >= 0 (positive semidefiniteness) are satisfied.

We usually wish to make progress towards solving (UCM) in every iteration; that is, we construct x_{k+1} so that f(x_{k+1}) < f(x_k) (descent methods). In practice we cannot usually compute x* precisely (i.e., give a symbolic representation of it; see the LP lecture!), but have to stop at an x_k sufficiently close to x*. The optimality conditions are still useful, in that they serve as a stopping criterion once they are satisfied to within a predetermined error tolerance. Finally, we wish to construct (x_k)_{k in N} so that convergence to x* takes place at a rapid rate, so that few iterations are needed until the stopping criterion is satisfied. This has to be counterbalanced against the computational cost per iteration, as there is typically a tradeoff: faster convergence usually comes at a higher computational cost per iteration. We write f_k = f(x_k), g_k = g(x_k), and H_k = H(x_k).

Generic Line Search Method:
1. Pick an initial iterate x_0 by educated guess, set k = 0.
2. Until x_k has converged,
i) calculate a search direction p_k from x_k, ensuring that this direction is a descent direction, that is, [g_k]^T p_k < 0 if g_k != 0, so that for small enough steps away from x_k in the direction p_k the objective function will be reduced;
ii) calculate a suitable steplength α_k > 0 so that f(x_k + α_k p_k) < f_k. The computation of α_k is called line search, and is usually an inner iterative loop;
iii) set x_{k+1} = x_k + α_k p_k and increment k by 1.
Actual methods differ from one another in how steps i) and ii) are carried out.

Computing a Step Length α_k: The challenges in finding a good α_k lie both in avoiding step lengths that are too long,

[Figure: the objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the alternating descent directions p_k = (-1)^{k+1} and steps α_k = 2 + 3/2^{k+1} from x_0 = 2; the iterates hop back and forth across the minimiser]

and ones that are too short,

[Figure: the objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = -1 and steps α_k = 1/2^{k+1} from x_0 = 2; the step lengths are summable, and the iterates stall before reaching the minimiser].

Exact Line Search: In the early days, α_k was picked to minimise (ELS) min_{α >= 0} f(x_k + α p_k). Although usable, this method is not considered cost effective. Inexact Line Search Methods:
- Formulate a criterion that ensures that steps are neither too long nor too short.
- Pick a good initial stepsize.
- Construct a sequence of updates that satisfies the above criterion after very few steps.
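For a quadratic objective, (ELS) has a closed form: with f(x) = (1/2) x^T A x + b^T x, setting the derivative of f(x_k + α p_k) with respect to α to zero gives α* = -[g_k]^T p_k / (p_k^T A p_k). A minimal sketch in Python (the function name and the diagonal test matrix are illustrative, not from the lecture):

```python
import numpy as np

def exact_step_quadratic(A, g, p):
    """Exact line search for f(x) = 0.5 x^T A x + b^T x:
    d/d(alpha) f(x + alpha*p) = p^T (g + alpha*A p) = 0 gives
    alpha* = -g^T p / (p^T A p), positive for a descent direction."""
    return -g.dot(p) / p.dot(A.dot(p))

# One steepest-descent step on f(x) = 0.5 x^T A x from x = (1, 1).
A = np.diag([1.0, 10.0])
x = np.array([1.0, 1.0])
g = A.dot(x)        # gradient of the quadratic at x
p = -g              # steepest-descent direction
alpha = exact_step_quadratic(A, g, p)
x_new = x + alpha * p
```

At the exact minimiser along the line, the new gradient is orthogonal to the search direction, which is what makes exact-line-search steepest descent zigzag.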

Backtracking Line Search:
1. Given α_init > 0 (e.g., α_init = 1), let α^(0) = α_init and l = 0.
2. Until f(x_k + α^(l) p_k) < f_k,
i) set α^(l+1) = τ α^(l), where τ in (0, 1) is fixed (e.g., τ = 1/2);
ii) increment l by 1.
3. Set α_k = α^(l).
This method prevents the step from getting too small, but it does not prevent steps that are too long relative to the decrease in f. To improve the method, we need to tighten the requirement f(x_k + α^(l) p_k) < f_k.

To prevent long steps relative to the decrease in f, we require the Armijo condition f(x_k + α_k p_k) <= f(x_k) + α_k β [g_k]^T p_k for some fixed β in (0, 1) (e.g., β = 0.1 or even β = 0.0001). That is to say, we require that the achieved reduction in f be at least a fixed fraction β of the reduction promised by the first-order Taylor approximation of f at x_k.

Backtracking-Armijo Line Search:
1. Given α_init > 0 (e.g., α_init = 1), let α^(0) = α_init and l = 0.
2. Until f(x_k + α^(l) p_k) <= f(x_k) + α^(l) β [g_k]^T p_k,
i) set α^(l+1) = τ α^(l), where τ in (0, 1) is fixed (e.g., τ = 1/2);
ii) increment l by 1.
3. Set α_k = α^(l).
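The loop above translates almost line by line into code. A minimal sketch (the function name and defaults are illustrative; τ = 1/2 and β = 0.0001 follow the examples in the text):

```python
import numpy as np

def backtracking_armijo(f, x, g, p, alpha_init=1.0, tau=0.5, beta=1e-4):
    """Shrink alpha by the factor tau until the Armijo condition
    f(x + alpha*p) <= f(x) + alpha*beta*g^T p holds."""
    alpha = alpha_init
    fx = f(x)
    slope = beta * g.dot(p)          # negative for a descent direction
    while f(x + alpha * p) > fx + alpha * slope:
        alpha *= tau
    return alpha

# One step on f(x) = x^2 from x = 2 with the descent direction p = -g.
f = lambda x: float(x[0] ** 2)
x, g = np.array([2.0]), np.array([4.0])
p = -g
alpha = backtracking_armijo(f, x, g, p)
```

Here the full step alpha = 1 overshoots past the minimiser and fails the Armijo test, so one halving to alpha = 1/2 is accepted.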

Theorem 1 (Termination of Backtracking-Armijo). Let f be in C^1 with gradient g(x) that is Lipschitz continuous with constant γ_k at x_k, and let p_k be a descent direction at x_k. Then, for fixed β in (0, 1),
i) the Armijo condition f(x_k + α p_k) <= f_k + α β [g_k]^T p_k is satisfied for all α in [0, α_max^k], where α_max^k = 2(β - 1)[g_k]^T p_k / (γ_k ||p_k||_2^2),
ii) and furthermore, for fixed τ in (0, 1), the backtracking-Armijo line search terminates with α_k >= min(α_init, 2τ(β - 1)[g_k]^T p_k / (γ_k ||p_k||_2^2)).
Note that both bounds are positive, since β - 1 < 0 and [g_k]^T p_k < 0. We remark that in practice γ_k is not known. Therefore we cannot simply compute α_max^k and α_k via the explicit formulas given by the theorem, and we still need the backtracking algorithm itself.

Theorem 2 (Convergence of Generic LSM with B-A Steps). Let the gradient g of f in C^1 be uniformly Lipschitz continuous on R^n. Then, for the iterates generated by the Generic Line Search Method with backtracking-Armijo step lengths, one of the following situations occurs:
i) g_k = 0 for some finite k,
ii) lim_{k -> inf} f_k = -inf,
iii) lim_{k -> inf} min( |[g_k]^T p_k|, |[g_k]^T p_k| / ||p_k||_2 ) = 0.

Computing a Search Direction p_k. Method of Steepest Descent: The most straightforward choice of a search direction, p_k = -g_k, is called the steepest-descent direction.
- p_k is a descent direction (since [g_k]^T p_k = -||g_k||_2^2 < 0 when g_k != 0).
- p_k solves the problem min_{p in R^n} m_L^k(x_k + p) = f_k + [g_k]^T p subject to ||p||_2 = ||g_k||_2.
- p_k is cheap to compute.
Any method that uses the steepest-descent direction as a search direction is a method of steepest descent. Intuitively, it would seem that p_k is the best search direction one could find. If that were true, much of optimisation theory would not exist!

Theorem 3 (Global Convergence of Steepest Descent). Let the gradient g of f in C^1 be uniformly Lipschitz continuous on R^n. Then, for the iterates generated by the Generic LSM with B-A steps and steepest-descent search directions, one of the following situations occurs:
i) g_k = 0 for some finite k,
ii) lim_{k -> inf} f_k = -inf,
iii) lim_{k -> inf} ||g_k|| = 0.
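Combining the Generic Line Search Method, backtracking-Armijo steps, and p_k = -g_k gives a complete steepest descent solver. A sketch under illustrative assumptions (all names and the diagonal test problem are made up for the example), minimising a simple convex quadratic:

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=5000,
                     alpha_init=1.0, tau=0.5, beta=1e-4):
    """Generic LSM with p_k = -g_k and backtracking-Armijo step lengths."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:        # stopping criterion: ||g_k|| small
            break
        p = -g                              # steepest-descent direction
        alpha, fx, slope = alpha_init, f(x), beta * g.dot(p)
        while f(x + alpha * p) > fx + alpha * slope:  # Armijo backtracking
            alpha *= tau
        x = x + alpha * p
    return x

# Minimise f(x) = 0.5 x^T A x with A = diag(1, 10); the minimiser is 0.
A = np.diag([1.0, 10.0])
x_star = steepest_descent(lambda x: 0.5 * x.dot(A).dot(x),
                          lambda x: A.dot(x),
                          np.array([1.0, 1.0]))
```

Even on this mildly ill-conditioned quadratic (condition number 10), the iterates zigzag, which is the slow linear convergence discussed below.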

Advantages and disadvantages of steepest descent:
- Globally convergent (converges to a local minimiser from any starting point x_0).
- Many other methods switch to steepest descent when they fail to make sufficient progress.
- Not scale invariant (changing the inner product on R^n changes the notion of gradient!).
- Convergence is usually very (very!) slow (linear).
- Numerically, it is often not convergent at all.

[Figure: contours of the objective function f(x, y) = 10(y - x^2)^2 + (x - 1)^2 (Rosenbrock function), and the iterates generated by the generic line search steepest-descent method.]

More General Descent Methods: Let B_k be a symmetric, positive definite matrix, and define the search direction p_k as the solution of the linear system B_k p_k = -g_k.
- p_k is a descent direction, since [g_k]^T p_k = -[g_k]^T [B_k]^{-1} g_k < 0.
- p_k solves the problem min_{p in R^n} m_Q^k(x_k + p) = f_k + [g_k]^T p + (1/2) p^T B_k p.

- p_k corresponds to the steepest-descent direction if the norm ||x||_{B_k} := sqrt(x^T B_k x) is used on R^n instead of the canonical Euclidean norm. This change of metric can be seen as a preconditioning, which can be chosen so as to speed up the steepest descent method.
- If the Hessian H_k of f at x_k is positive definite and B_k = H_k, this is Newton's method.
- If B_k changes with the iterate x_k, a method based on the search direction p_k is called a variable metric method. In particular, Newton's method is a variable metric method.
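As an illustration of the Newton choice B_k = H_k, the sketch below takes pure Newton steps (α_k = 1 throughout) on a hypothetical smooth convex function whose Hessian is positive definite everywhere; the test function and names are illustrative, not from the lecture:

```python
import numpy as np

def newton_direction(H, g):
    """Newton direction n_k: solve H_k n_k = -g_k."""
    return np.linalg.solve(H, -g)

# Illustrative test function f(x) = x1^2 + x1^4 + x2^2 with minimiser (0, 0);
# its Hessian diag(2 + 12 x1^2, 2) is positive definite for all x.
grad = lambda x: np.array([2 * x[0] + 4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[2 + 12 * x[0] ** 2, 0.0],
                           [0.0, 2.0]])

x = np.array([1.0, 1.0])
for _ in range(10):                    # quadratic convergence to (0, 0)
    x = x + newton_direction(hess(x), grad(x))
```

The number of correct digits roughly doubles per iteration, so ten steps already drive the iterate far below any practical tolerance; the quadratic component x2 is solved exactly in one step.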

Theorem 4 (Global Convergence of More General Descent Direction Methods). Let the gradient g of f in C^1 be uniformly Lipschitz continuous on R^n. Then, for the iterates generated by the Generic LSM with B-A steps and search directions defined by B_k p_k = -g_k, one of the following situations occurs:
i) g_k = 0 for some finite k,
ii) lim_{k -> inf} f_k = -inf,
iii) lim_{k -> inf} ||g_k|| = 0,
provided that the eigenvalues of B_k are uniformly bounded above and uniformly bounded away from zero.

Theorem 5 (Local Convergence of Newton's Method). Let the Hessian H of f in C^2 be uniformly Lipschitz continuous on R^n. Let the iterates x_k be generated via the Generic LSM with B-A steps using α_init = 1 and β < 1/2, and using the Newton search direction n_k, defined by H_k n_k = -g_k. If (x_k)_{k in N} has an accumulation point x* where H(x*) is positive definite, then
i) α_k = 1 for all k large enough,
ii) lim_{k -> inf} x_k = x*,
iii) the sequence converges Q-quadratically, that is, there exists κ > 0 such that lim_{k -> inf} ||x_{k+1} - x*|| / ||x_k - x*||^2 <= κ.
The mechanism that makes Theorem 5 work is that once the sequence (x_k)_{k in N} enters a certain domain of attraction of x*, it cannot escape again, and quadratic convergence to x* commences. Note that this is only a local convergence result: Newton's method is not guaranteed to converge to a local minimiser from all starting points.

The fast convergence of Newton's method becomes apparent when we apply it to the Rosenbrock function:
[Figure: contours of the objective function f(x, y) = 10(y - x^2)^2 + (x - 1)^2, and the iterates generated by the Generic Linesearch Newton method.]

Modified Newton Methods: The choice B_k = H_k only makes sense at iterates x_k where H_k is positive definite. Since this is usually not guaranteed to be the case everywhere, we modify the method as follows:
- choose M_k positive semidefinite so that H_k + M_k is sufficiently positive definite, with M_k = 0 if H_k itself is sufficiently positive definite;
- set B_k = H_k + M_k and solve B_k p_k = -g_k.

The regularisation term M_k is typically chosen as one of the following:
- If H_k has the spectral decomposition H_k = Q_k Λ_k [Q_k]^T, take M_k = max(0, -λ_min(H_k)) I, or use the spectral modification H_k + M_k = Q_k max(εI, Λ_k) [Q_k]^T.
- Modified Cholesky method:
1. Compute a factorisation P H_k P^T = L B L^T, where P is a permutation matrix, L a unit lower triangular matrix, and B a block diagonal matrix with blocks of size 1 or 2.
2. Choose a matrix F such that B + F is sufficiently positive definite.
3. Let H_k + M_k = P^T L (B + F) L^T P.
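The first option above, shifting by a multiple of the identity, can be sketched as follows. The sketch adds a small ε so that H_k + M_k is sufficiently, not just barely, positive definite; the function name and test matrix are illustrative:

```python
import numpy as np

def spectral_shift(H, eps=1e-8):
    """Return H + M with M = max(0, eps - lambda_min(H)) * I, so that
    every eigenvalue of the result is at least eps."""
    shift = max(0.0, eps - np.linalg.eigvalsh(H).min())
    return H + shift * np.eye(H.shape[0])

H = np.array([[0.0, 2.0],
              [2.0, 0.0]])             # eigenvalues -2 and 2: indefinite
H_mod = spectral_shift(H)              # now positive definite
```

If H is already sufficiently positive definite the shift is zero and H is returned unchanged, matching the requirement M_k = 0 in that case.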

Other Modifications of Newton's Method:
1. Build a cheap approximation B_k to H_k: quasi-Newton approximations (BFGS, SR1, etc.), or finite-difference approximations.
2. Instead of solving B_k p_k = -g_k exactly, if B_k is positive definite, approximately solve the convex quadratic programming problem (QP) p_k ≈ arg min_{p in R^n} f_k + p^T g_k + (1/2) p^T B_k p.

The conjugate gradient method is a good solver for step 2:
1. Set p^(0) = 0, g^(0) = g_k, d^(0) = -g_k, and i = 0.
2. Until ||g^(i)|| is sufficiently small or i = n, repeat
i) α^(i) = ||g^(i)||_2^2 / ([d^(i)]^T B_k d^(i)),
ii) p^(i+1) = p^(i) + α^(i) d^(i),
iii) g^(i+1) = g^(i) + α^(i) B_k d^(i),
iv) β^(i) = ||g^(i+1)||_2^2 / ||g^(i)||_2^2,
v) d^(i+1) = -g^(i+1) + β^(i) d^(i),
vi) increment i by 1.
3. Output p_k = p^(i).
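The steps above can be sketched as follows; here r tracks the residual g^(i) = B_k p^(i) + g_k, and the test problem is illustrative:

```python
import numpy as np

def cg_direction(B, g, tol=1e-10):
    """Conjugate gradient method for B p = -g with B symmetric positive
    definite, following steps i)-vi) above."""
    n = g.shape[0]
    p = np.zeros(n)
    r = g.copy()                          # g^(0) = g_k
    d = -g.copy()                         # d^(0) = -g_k
    for _ in range(n):
        if np.linalg.norm(r) <= tol:
            break
        Bd = B.dot(d)
        alpha = r.dot(r) / d.dot(Bd)      # step i)
        p = p + alpha * d                 # step ii)
        r_new = r + alpha * Bd            # step iii)
        beta = r_new.dot(r_new) / r.dot(r)  # step iv)
        d = -r_new + beta * d             # step v)
        r = r_new
    return p

B = np.array([[4.0, 1.0],
              [1.0, 3.0]])
g = np.array([1.0, 2.0])
p = cg_direction(B, g)                    # approximates solve(B, -g)
```

In exact arithmetic the loop recovers the exact solution of B p = -g after at most n iterations, but as noted below it is usually stopped much earlier.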

Important features of the conjugate gradient method:
- [g_k]^T p^(i) < 0 for all i; that is, the algorithm always stops with a descent direction as its approximation to p_k.
- Each iteration is cheap, as it only requires the computation of matrix-vector and vector-vector products.
- Usually p^(i) is a good approximation of p_k well before i = n.