Nonlinear Programming


Nonlinear Programming
Kees Roos
e-mail: C.Roos@ewi.tudelft.nl
URL: http://www.isa.ewi.tudelft.nl/~roos
LNMB Course, De Uithof, Utrecht, February 6 - May 8, A.D. 2006
Optimization Group

Outline for week 7: Algorithms for unconstrained minimization

- A generic algorithm
- Rate of convergence
- Line search methods: dichotomous and golden section search, bisection, Newton's method
- Search directions: gradient method, Newton's method, methods of conjugate directions (Powell's method, Fletcher-Reeves method), quasi-Newton methods (DFP update, BFGS update)
- Stopping criteria

Generic algorithm for min_{x ∈ C} f(x)

Input: an accuracy parameter ε > 0 and a given (relative interior) feasible point x^0.

Step 0: x := x^0, k := 0.
Step 1: Find a search direction s^k such that δf(x^k, s^k) < 0. (In the constrained case this should be a descending feasible direction.)
Step 1a: If no such direction exists, STOP: optimum found.
Step 2: Line search: find λ_k = argmin_λ f(x^k + λ s^k).
Step 3: x^{k+1} = x^k + λ_k s^k, k := k + 1.
Step 4: If the stopping criteria are satisfied, STOP; else go to Step 1.

Algorithms: rate of convergence

Definition: Let α_1, α_2, ..., α_k, ... → α be a convergent sequence. The order of convergence is

  p* = sup { p : limsup_{k→∞} |α_{k+1} − α| / |α_k − α|^p < ∞ }.

The larger p* is, the faster the convergence. Let

  β = limsup_{k→∞} |α_{k+1} − α| / |α_k − α|^{p*}.

The rate of convergence is:
- linear if p* = 1 and 0 < β < 1;
- super-linear if p* = 1 and β = 0;
- quadratic if p* = 2;
- sub-linear if p* = 1 and β = 1.

Examples: order of convergence

Example 1: The sequence α_k = a^k, where 0 < a < 1, converges linearly to zero, with β = a.
Example 2: The sequence α_k = a^(2^k), where 0 < a < 1, converges quadratically to zero.
Example 3: The sequence α_k = 1/k converges sub-linearly to zero.
Example 4: The sequence α_k = (1/k)^k converges super-linearly to zero.
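The four examples can be checked numerically. Below is a small sketch (the helper `conv_beta` and the sequence names are mine, not from the course notes) that estimates β from the tail of each sequence:

```python
def conv_beta(seq, p, limit=0.0):
    """Estimate beta = limsup |a_{k+1} - limit| / |a_k - limit|^p from the tail."""
    tail = range(len(seq) - 3, len(seq) - 1)
    return max(abs(seq[k + 1] - limit) / abs(seq[k] - limit) ** p for k in tail)

a = 0.5
linear_seq = [a ** k for k in range(1, 12)]            # Example 1: beta = a
quadratic_seq = [a ** (2 ** k) for k in range(1, 6)]   # Example 2: order p = 2
sublinear_seq = [1.0 / k for k in range(1, 2000)]      # Example 3: beta -> 1
superlinear_seq = [(1.0 / k) ** k for k in range(1, 15)]  # Example 4: beta -> 0
```

For the linear sequence the ratio is exactly a = 0.5; for the quadratic one the p = 2 ratio is 1; for 1/k the p = 1 ratio tends to 1 (sub-linear), and for (1/k)^k it tends to 0 (super-linear).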

Line search methods

We assume throughout that f is a convex function. We are given a (feasible) search direction s at a feasible point x and we want to find

  λ* = argmin_{λ ≥ 0} f(x + λs).

So we are minimizing φ(λ) := f(x + λs) for λ ≥ 0. This is a one-dimensional problem. We deal with four different line search methods, which require different levels of information about φ(λ):

- dichotomous search and golden section search, which use only function evaluations of φ;
- bisection, which evaluates φ′(λ) (φ has to be continuously differentiable);
- Newton's method, which evaluates both φ′(λ) and φ″(λ).

Line search: dichotomous search

We assume that φ is convex and has a minimizer on the interval [a, b]. Our aim is to reduce the size of this interval of uncertainty by evaluating φ at points in [a, b].

Lemma 1 (Exercise 4.7) Let a < ā < b̄ < b. If φ(ā) < φ(b̄), then the minimum of φ occurs in the interval [a, b̄]; if φ(ā) ≥ φ(b̄), then the minimum of φ occurs in the interval [ā, b].

The lemma suggests a simple algorithm to reduce the interval of uncertainty.

Line search: dichotomous search

Input: an accuracy parameter ε > 0 and a_0, b_0 such that [a_0, b_0] contains the minimizer of φ(λ); k := 0.

Step 1: If b_k − a_k < ε, STOP.
Step 2: Choose ā_k, b̄_k ∈ (a_k, b_k) such that ā_k < b̄_k.
Step 3a: If φ(ā_k) < φ(b̄_k), set a_{k+1} = a_k, b_{k+1} = b̄_k.
Step 3b: If φ(ā_k) ≥ φ(b̄_k), set a_{k+1} = ā_k, b_{k+1} = b_k.
Step 4: Set k := k + 1 and go to Step 1.

We have not yet specified how to choose the values ā_k and b̄_k in iteration k (Step 2 of the algorithm). There are many ways to do this. One is to choose

  ā_k = ½(a_k + b_k) − δ and b̄_k = ½(a_k + b_k) + δ,

where δ > 0 is a (very) small fixed constant. Then the interval of uncertainty is reduced by a factor (½ + δ)^{t/2} after t function evaluations (Exercise 4.8).
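A minimal Python sketch of the dichotomous search above (the function name and defaults are mine, not from the notes):

```python
def dichotomous_search(phi, a, b, eps=1e-6, delta=1e-8):
    """Shrink the interval of uncertainty [a, b] for a convex phi."""
    while b - a >= eps:
        mid = 0.5 * (a + b)
        lo, hi = mid - delta, mid + delta   # the two probe points around the midpoint
        if phi(lo) < phi(hi):
            b = hi                          # minimizer lies in [a, hi] (Lemma 1)
        else:
            a = lo                          # minimizer lies in [lo, b]
    return 0.5 * (a + b)

x = dichotomous_search(lambda t: (t - 0.3) ** 2, 0.0, 1.0)
```

Each iteration costs two evaluations of φ and roughly halves the interval, in line with the (½ + δ)^{t/2} reduction stated above.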

Line search: golden section search

This is a variant of the dichotomous search method in which δ is no longer constant but depends on k. In the k-th iteration we take δ = δ_k, where

  δ_k = (α − ½)(b_k − a_k),  α = ½(√5 − 1) ≈ 0.618.

Here α is the golden ratio, i.e. the root of α² + α − 1 = 0 with α ∈ [0, 1]. We now have

  ā_k = ½(a_k + b_k) − δ_k = b_k − α(b_k − a_k),
  b̄_k = ½(a_k + b_k) + δ_k = a_k + α(b_k − a_k).

If φ(ā_k) < φ(b̄_k), then we set a_{k+1} = a_k and b_{k+1} = b̄_k. In that case

  b̄_{k+1} = a_{k+1} + α(b_{k+1} − a_{k+1}) = a_k + α(b̄_k − a_k) = a_k + α²(b_k − a_k) = a_k + (1 − α)(b_k − a_k) = b_k − α(b_k − a_k) = ā_k.

So in the next iteration we only need to compute φ(ā_{k+1}). Similarly, if φ(ā_k) ≥ φ(b̄_k), then we set a_{k+1} = ā_k and b_{k+1} = b_k, and it follows in a similar way that ā_{k+1} = b̄_k; so in the next iteration we only need to compute φ(b̄_{k+1}). In both cases one needs to evaluate φ only once. See the course notes for graphical illustrations. When using golden section search, each iteration reduces the interval of uncertainty by a factor α ≈ 0.618 (Exercise 4.9).
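The reuse of one interior point per iteration can be made explicit in code. Below is a sketch (names are mine); note that each loop iteration calls φ exactly once, as argued above:

```python
import math

def golden_section(phi, a, b, eps=1e-8):
    alpha = 0.5 * (math.sqrt(5.0) - 1.0)        # the golden ratio, ~0.618
    lo, hi = b - alpha * (b - a), a + alpha * (b - a)
    f_lo, f_hi = phi(lo), phi(hi)
    while b - a >= eps:
        if f_lo < f_hi:                          # minimum lies in [a, hi]
            b, hi, f_hi = hi, lo, f_lo           # old lo becomes the new hi
            lo = b - alpha * (b - a)
            f_lo = phi(lo)                       # the single new evaluation
        else:                                    # minimum lies in [lo, b]
            a, lo, f_lo = lo, hi, f_hi           # old hi becomes the new lo
            hi = a + alpha * (b - a)
            f_hi = phi(hi)
    return 0.5 * (a + b)

x = golden_section(lambda t: (t - 0.25) ** 2 + 1.0, -1.0, 2.0)
```

The interval shrinks by the factor α per iteration, so reaching accuracy ε from an interval of length L takes about log(ε/L)/log(α) evaluations.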

Line search: golden section search (why the golden ratio?)

Suppose a_k = 0 and b_k = 1. We choose a fixed α ∈ (½, 1) and define

  ā_k = a_k + (1 − α)(b_k − a_k) = 1 − α,  b̄_k = b_k − (1 − α)(b_k − a_k) = α.

Suppose φ(1 − α) < φ(α). Then we set a_{k+1} = 0 and b_{k+1} = α, and

  ā_{k+1} = a_{k+1} + (1 − α)(b_{k+1} − a_{k+1}) = (1 − α)α,
  b̄_{k+1} = b_{k+1} − (1 − α)(b_{k+1} − a_{k+1}) = α − (1 − α)α.

We want one of these two points to be ā_k = 1 − α, because we already know φ(1 − α). This gives either

  (1 − α)α = 1 − α or α − α(1 − α) = 1 − α,

or, equivalently, α = 1 or α² + α − 1 = 0. Since α ∈ (½, 1), the only possible value is

  α = ½(√5 − 1) ≈ 0.618,

which is the golden ratio!

Line search: bisection (or Bolzano's method)

We assume that φ(λ) is differentiable (and convex). We wish to find λ̄ such that φ′(λ̄) = 0.

Input: an accuracy parameter ε > 0 and a_0, b_0 such that φ′(a_0) < 0 and φ′(b_0) > 0; k := 0.

Step 1: If b_k − a_k < ε, STOP.
Step 2: Let λ = ½(a_k + b_k).
Step 3a: If φ′(λ) < 0, set a_{k+1} = λ, b_{k+1} = b_k.
Step 3b: If φ′(λ) > 0, set a_{k+1} = a_k, b_{k+1} = λ.
Step 4: Set k := k + 1 and go to Step 1.

The algorithm needs ⌈log₂((b_0 − a_0)/ε)⌉ evaluations of φ′ (Exercise 4.11).
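A minimal sketch of the bisection algorithm above (names are mine; `dphi` is the derivative φ′, supplied by the caller):

```python
def bisection(dphi, a, b, eps=1e-10):
    """Find a zero of dphi, assuming dphi(a) < 0 < dphi(b) and phi convex."""
    while b - a >= eps:
        lam = 0.5 * (a + b)
        if dphi(lam) < 0.0:
            a = lam          # zero of phi' lies to the right of lam
        else:
            b = lam          # zero of phi' lies to the left of lam
    return 0.5 * (a + b)

# phi(t) = (t - 2)^2 has derivative 2(t - 2), which vanishes at t = 2
x = bisection(lambda t: 2.0 * (t - 2.0), 0.0, 5.0)
```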

Line search using the Newton-Raphson method

The quadratic approximation of φ at λ_k is

  q(λ) = φ(λ_k) + φ′(λ_k)(λ − λ_k) + ½ φ″(λ_k)(λ − λ_k)².

The minimum of q is attained when q′(λ) = 0, which gives

  λ_{k+1} = λ_k − φ′(λ_k)/φ″(λ_k).

Input: an accuracy parameter ε > 0 and a given initial point λ_0; k := 0.

Step 1: Let λ_{k+1} = λ_k − φ′(λ_k)/φ″(λ_k).
Step 2: If |λ_{k+1} − λ_k| < ε, STOP.
Step 3: k := k + 1, go to Step 1.
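A sketch of this iteration in Python (names are mine; `dphi` and `ddphi` are φ′ and φ″). The test function is φ(λ) = λ − log(1 + λ) from Example 4.3 on the next slide, whose iterates satisfy λ_{k+1} = −λ_k²:

```python
def newton_1d(dphi, ddphi, lam0, eps=1e-12, max_iter=100):
    lam = lam0
    for _ in range(max_iter):
        step = dphi(lam) / ddphi(lam)     # Newton step on phi'
        lam, prev = lam - step, lam
        if abs(lam - prev) < eps:         # stopping criterion from Step 2
            break
    return lam

# phi(t) = t - log(1 + t): phi'(t) = t/(1+t), phi''(t) = 1/(1+t)^2
x = newton_1d(lambda t: t / (1.0 + t),
              lambda t: 1.0 / (1.0 + t) ** 2,
              0.5)
```

Starting at λ_0 = 0.5 the iterates are 0.5, −0.25, −0.0625, ..., converging quadratically to the minimizer 0.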

The Newton-Raphson method: Example 4.3

Let φ(λ) = λ − log(1 + λ). The domain of φ is (−1, ∞). The first and second derivatives of φ are

  φ′(λ) = λ/(1 + λ),  φ″(λ) = 1/(1 + λ)².

This makes clear that φ is strictly convex on its domain, and minimal at λ = 0. The iterates satisfy the recursive relation

  λ_{k+1} = λ_k − φ′(λ_k)/φ″(λ_k) = λ_k − λ_k(1 + λ_k) = −λ_k².

This implies quadratic convergence if |λ_0| < 1 (see Exercise 4.12). On the other hand, Newton's method fails if |λ_0| ≥ 1. For example, if λ_0 = 1 then λ_1 = −1, which is not in the domain of φ!

In general the method converges quadratically if the following conditions are met:
1. the starting point is sufficiently close to the minimizer;
2. in addition to being convex, the function φ has a property called self-concordance, which is introduced later.

Search directions: the gradient method

Search direction: s = −∇f(x^k), the steepest descent direction:

  δf(x, −∇f(x)) = −∇f(x)^T ∇f(x) = min_{‖s‖ = ‖∇f(x)‖} ∇f(x)^T s.

The (negative) gradient is orthogonal to the level curves (Exercise 4.14). The gradient method is not a finite algorithm, not even for linear or quadratic functions. Convergence is slow ("zigzagging", Figure 4.4); the order of convergence is only linear.
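The slow linear convergence can be observed on a convex quadratic, where exact line search has the closed form λ_k = g^T g / (g^T A g). A small 2×2 sketch (names are mine; plain lists, no libraries):

```python
def steepest_descent_quadratic(A, b, x, iters=500):
    """Minimize 0.5 x^T A x - b^T x by steepest descent with exact line search."""
    for _ in range(iters):
        # gradient g = A x - b
        g = [A[0][0] * x[0] + A[0][1] * x[1] - b[0],
             A[1][0] * x[0] + A[1][1] * x[1] - b[1]]
        Ag = [A[0][0] * g[0] + A[0][1] * g[1],
              A[1][0] * g[0] + A[1][1] * g[1]]
        gg = g[0] * g[0] + g[1] * g[1]
        if gg == 0.0:
            break
        lam = gg / (g[0] * Ag[0] + g[1] * Ag[1])   # exact line search step
        x = [x[0] - lam * g[0], x[1] - lam * g[1]]
    return x

# ill-conditioned A forces the zigzagging: hundreds of steps for a 2D quadratic
x = steepest_descent_quadratic([[5.0, 2.0], [2.0, 1.0]], [0.0, 0.0], [1.0, 2.0])
```

With b = 0 the minimizer is the origin; the error shrinks only by a fixed factor per step, illustrating the linear order of convergence claimed above.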

Convergence of the gradient method

Let f be continuously differentiable. Starting from the initial point x^0 and using exact line search, the gradient method produces a sequence {x^k} such that f(x^k) > f(x^{k+1}) for k = 0, 1, 2, .... Assume that the level set D = {x : f(x) ≤ f(x^0)} is compact. Then any accumulation point x̄ of the sequence {x^k} is a stationary point of f (i.e. ∇f(x̄) = 0). If the function f is convex, then x̄ is a global minimizer of f. If f is not convex, stationarity is all that is guaranteed: x̄ is at best a local minimizer.

Newton's method

Newton's method is based on minimizing the second-order approximation of f at x^k:

  q(x) := f(x^k) + ∇f(x^k)^T (x − x^k) + ½ (x − x^k)^T ∇²f(x^k)(x − x^k).

We assume that q(x) is strictly convex, so the Hessian ∇²f(x^k) is positive definite. Hence the minimum is attained when

  ∇q(x) = ∇f(x^k) + ∇²f(x^k)(x − x^k) = 0.

We can solve x from ∇²f(x^k)(x − x^k) = −∇f(x^k), which gives the next iterate

  x^{k+1} = x^k − (∇²f(x^k))^{-1} ∇f(x^k).

So the Newton direction is s^k = −(∇²f(x^k))^{-1} ∇f(x^k).

- Exact when f is quadratic.
- Local quadratic convergence with full Newton steps (α = 1, so without any line search!).
- A good starting point is essential.
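A full-step Newton sketch for the 2×2 case, solving the Newton system directly by Cramer's rule (names are mine; `grad` and `hess` are supplied by the caller):

```python
def newton_2d(grad, hess, x, iters=20):
    for _ in range(iters):
        g, H = grad(x), hess(x)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # Newton step: solve H s = -g (Cramer's rule for the 2x2 system)
        s0 = (-g[0] * H[1][1] + g[1] * H[0][1]) / det
        s1 = (-g[1] * H[0][0] + g[0] * H[1][0]) / det
        x = [x[0] + s0, x[1] + s1]              # full step, alpha = 1
    return x

# f(x) = 5 x1^2 + 2 x1 x2 + x2^2 + 7 (the quadratic of the Powell
# illustration later in these notes): one Newton step is exact.
x = newton_2d(lambda x: [10 * x[0] + 2 * x[1], 2 * x[0] + 2 * x[1]],
              lambda x: [[10.0, 2.0], [2.0, 2.0]],
              [1.0, 2.0])
```

Since f is quadratic, the first iteration already lands on the minimizer (the origin), matching the "exact when f is quadratic" bullet above.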

Trust region method

If the function f(x) is not strictly convex, or if the Hessian is ill-conditioned, then the Hessian is not (or hardly) invertible. Remedy: the trust region method, in which ∇²f(x) is replaced by ∇²f(x) + αI:

  s^k = −(∇²f(x^k) + αI)^{-1} ∇f(x^k).

The parameter α is dynamically increased and decreased in order to avoid exact line search. If α = 0 we have the Newton step; as α → ∞ we approach a (small) multiple of the negative gradient.

Newton's method for solving nonlinear equations

Find a solution of F(x) = 0, where F : R^n → R^n. Linearize at x^k:

  F(x) ≈ F(x^k) + JF(x^k)(x − x^k),

where JF(x)_{ij} = ∂F_i(x)/∂x_j is the Jacobian of F. Solve x^{k+1} from

  JF(x^k)(x^{k+1} − x^k) = −F(x^k).

Minimizing f(x) is equivalent to solving ∇f(x) = 0:

  ∇²f(x^k)(x^{k+1} − x^k) = −∇f(x^k).

The Jacobian of the gradient is exactly the Hessian of the function f(x); it is positive definite when f is strictly convex, and we have

  x^{k+1} = x^k − (∇²f(x^k))^{-1} ∇f(x^k),

as we have seen above. Conclusion: Newton's optimization method = Newton's method for nonlinear equations applied to the system ∇f(x) = 0.

Methods using conjugate directions (1)

Let A be an n×n symmetric positive definite matrix and b ∈ R^n. We consider

  min { q(x) = ½ x^T A x − b^T x : x ∈ R^n }.

The minimizer is uniquely determined by ∇q(x) = Ax − b = 0. But finding the minimizer this way requires inverting the matrix A. If n is large this is computationally expensive, and we want to avoid it. This can be done by using so-called conjugate search directions. If the subsequent search directions are s^0, ..., s^k, then the iterates have the form

  x^{k+1} = x^k + λ_k s^k, k = 0, 1, 2, ....

If we use exact line search, then we automatically have

  ∇q(x^{k+1})^T s^k = 0, k = 0, 1, 2, ....

By requiring a little more, namely that the search vectors s^i are linearly independent and

  ∇q(x^{k+1})^T s^i = 0, 0 ≤ i ≤ k,

we can guarantee termination of the algorithm in a finite number of steps. Because then ∇q(x^n)^T s^i = 0 for all i < n, whence, since the vectors s^i are linearly independent, ∇q(x^n) = 0. So no more than n steps are required.

Methods using conjugate directions (2)

We denote ∇q(x^k) by g^k. Note that when using exact line search, we automatically have g^{j+1,T} s^j = 0, j = 0, 1, 2, ....

Lemma 2 Let k ∈ {1, ..., n}. The following two statements are equivalent:
(i) g^{j+1,T} s^i = 0, 0 ≤ i < j ≤ k;
(ii) s^{i,T} A s^j = 0, 0 ≤ i < j ≤ k.

Proof: Since ∇q(x) = Ax − b, we have

  g^{j+1} = ∇q(x^{j+1}) = A(x^j + λ_j s^j) − b = g^j + λ_j A s^j, j = 0, 1, ....

Therefore, for each i ≥ 0,

  g^{j+1,T} s^i = g^{j,T} s^i + λ_j s^{i,T} A s^j, j = 0, 1, ....

The proof can now be easily completed by induction on k, since λ_j > 0 for each j. ∎

If (ii) holds then the vectors s^0, ..., s^k ∈ R^n are called conjugate (or A-conjugate). Note that if A = I then conjugate means orthogonal, and then s^0, ..., s^k are linearly independent. This also holds for A-conjugate vectors, since A is positive definite (Exercise 4.20). As we established before, if one uses A-conjugate directions to minimize the quadratic form q, then the minimizer of q is found in at most n iterations.

Easy method to generate conjugate directions

Let s^0 = −∇q(x^0) = −g^0. Then we can get subsequent conjugate directions by taking

  s^k = −g^k + α_k s^{k−1}, k = 1, 2, ...

for suitable values of α_k. In order to make s^k and s^{k−1} A-conjugate, we must have s^{k,T} A s^{k−1} = 0 for k ≥ 1. This determines the coefficients α_k uniquely:

  α_k = g^{k,T} A s^{k−1} / (s^{k−1,T} A s^{k−1}), k ≥ 1.

We proceed with induction on k. So we assume that s^0, ..., s^{k−1} are conjugate. Using g^{k,T} s^{k−1} = 0 we find

  g^{k,T} s^k = g^{k,T}(−g^k + α_k s^{k−1}) = −‖g^k‖² < 0,

proving that s^k is a descent direction, provided g^k ≠ 0. Our choice of α_k implies s^{k,T} A s^{k−1} = 0. So it remains to show that s^{k,T} A s^i = 0 for i < k − 1. The induction hypothesis implies

  s^{k,T} A s^i = (−g^k + α_k s^{k−1})^T A s^i = −g^{k,T} A s^i.

Since g^i = ∇q(x^i) = A x^i − b and x^{i+1} = x^i + λ_i s^i, we have

  λ_i A s^i = g^{i+1} − g^i = (α_{i+1} s^i − s^{i+1}) − (α_i s^{i−1} − s^i).

Hence, due to Lemma 2,

  λ_i g^{k,T} A s^i = g^{k,T}(α_{i+1} s^i − s^{i+1} − α_i s^{i−1} + s^i) = 0.

This proves that s^0, ..., s^k are conjugate, provided g^k ≠ 0 (otherwise x^k is optimal!).

The case of nonquadratic functions

In the case where f is (convex) quadratic, finite termination is guaranteed if

  α_k = g^{k,T} A s^{k−1} / (s^{k−1,T} A s^{k−1}) = g^{k,T}(g^k − g^{k−1}) / (s^{k−1,T}(g^k − g^{k−1})) = g^{k,T}(g^k − g^{k−1}) / ‖g^{k−1}‖² = ‖g^k‖² / ‖g^{k−1}‖², k ≥ 1.

Here we used that g^k ⊥ g^{k−1}, as follows from g^{k−1} = −s^{k−1} + α_{k−1} s^{k−2} together with g^{k,T} s^{k−2} = 0, by Lemma 2(i), and g^{k,T} s^{k−1} = 0, by the choice of λ_{k−1}. The algorithm is:

Step 0: Let s^0 = −∇f(x^0) and x^1 := argmin_λ f(x^0 + λ s^0).
Step k: Set s^k = −∇f(x^k) + α_k s^{k−1} and x^{k+1} := argmin_λ f(x^k + λ s^k).

If f is not quadratic, there is no guarantee that the method stops after a finite number of steps. Several choices for α_k have been proposed (which are equivalent in the quadratic case):

Hestenes-Stiefel (1952): α_k = g^{k,T}(g^k − g^{k−1}) / (s^{k−1,T}(g^k − g^{k−1})).
Fletcher-Reeves (1964): α_k = ‖g^k‖² / ‖g^{k−1}‖².
Polak-Ribière (1969): α_k = g^{k,T}(g^k − g^{k−1}) / ‖g^{k−1}‖².

Solving a linear system with the conjugate gradient method

Assume we want to solve Ax = b with A positive definite. The solution is precisely the minimizer of q(x) = ½ x^T A x − b^T x, and hence can be found by the conjugate gradient method. If A is not positive definite, but nonsingular, then A^T A is positive definite. Hence we can solve Ax = b by minimizing

  q(x) = ‖Ax − b‖² = x^T A^T A x − 2 b^T A x + b^T b.
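The conjugate gradient iteration for Ax = b, in its usual residual form (a sketch with plain lists, no external libraries; names are mine). Note that the residual r = b − Ax is exactly −∇q(x), and the β used to update s is the Fletcher-Reeves ratio ‖g^k‖²/‖g^{k−1}‖² from the previous slide:

```python
def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def conjugate_gradient(A, b, x=None, tol=1e-12):
    n = len(b)
    x = [0.0] * n if x is None else x[:]
    r = [bi - ri for bi, ri in zip(b, matvec(A, x))]   # residual = -grad q
    s = r[:]                                           # first direction: steepest descent
    rr = sum(ri * ri for ri in r)
    for _ in range(n):                                 # at most n steps (conjugacy)
        if rr < tol:
            break
        As = matvec(A, s)
        lam = rr / sum(si * qi for si, qi in zip(s, As))   # exact line search
        x = [xi + lam * si for xi, si in zip(x, s)]
        r = [ri - lam * qi for ri, qi in zip(r, As)]
        rr_new = sum(ri * ri for ri in r)
        s = [ri + (rr_new / rr) * si for ri, si in zip(r, s)]
        rr = rr_new
    return x

x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

For this 2×2 system the method terminates in two steps at the exact solution (1/11, 7/11).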

Powell's method

We now deal with a conjugate direction method using only function values (no gradients!).

Input: a starting point x^0 and a set of linearly independent vectors t^1, ..., t^n.
Initialization: Set t^{(1,i)} = t^i, i = 1, ..., n.
For k = 1, ..., n do (cycle k):
  Let z^{(k,1)} = x^{k−1} and z^{(k,i+1)} := argmin_λ q(z^{(k,i)} + λ t^{(k,i)}), i = 1, ..., n.
  Let x^k := argmin_λ q(z^{(k,n+1)} + λ s^k), where s^k := z^{(k,n+1)} − x^{k−1}.
  Let t^{(k+1,i)} = t^{(k,i+1)}, i = 1, ..., n − 1, and t^{(k+1,n)} := s^k.

The algorithm consists of n cycles and terminates at the minimizer of q(x). Each cycle consists of n + 1 line searches and yields a search direction s^k; the k-th direction s^k is constructed at the end of cycle k. The search directions s^1, ..., s^n are conjugate (for a proof, see the course notes). Note that only function values are evaluated (no derivatives are used, unless the line searches use them). The number of line searches is n(n + 1). Therefore, Powell's method is attractive for minimizing black box functions where gradient and Hessian information is not available (or too expensive to compute).
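A derivative-free sketch of one run of Powell's cycles, using a golden-section line minimization over a fixed symmetric bracket (all helper names and the bracket are my own choices, not from the notes):

```python
import math

def line_min(f, x, d, lo=-5.0, hi=5.0, eps=1e-11):
    """Return the minimizer of f along x + lam*d, lam in [lo, hi] (f convex)."""
    r = 0.5 * (math.sqrt(5.0) - 1.0)
    a, b = lo, hi
    while b - a > eps:
        u, v = b - r * (b - a), a + r * (b - a)
        fu = f([xi + u * di for xi, di in zip(x, d)])
        fv = f([xi + v * di for xi, di in zip(x, d)])
        if fu < fv:
            b = v
        else:
            a = u
    lam = 0.5 * (a + b)
    return [xi + lam * di for xi, di in zip(x, d)]

def powell(f, x0):
    n = len(x0)
    dirs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    x = x0[:]
    for _ in range(n):                       # n cycles
        z = x[:]
        for d in dirs:                       # n inner line searches
            z = line_min(f, z, d)
        s = [zi - xi for zi, xi in zip(z, x)]
        x = line_min(f, z, s)                # the (n+1)-th search, along s
        dirs = dirs[1:] + [s]                # rotate the new direction in
    return x

x = powell(lambda x: 5 * x[0] ** 2 + 2 * x[0] * x[1] + x[1] ** 2 + 7, [1.0, 2.0])
```

On the quadratic f(x) = 5x₁² + 2x₁x₂ + x₂² + 7 of the illustration below, the two cycles produce conjugate directions and the final iterate lands at the minimizer, the origin.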

Illustration of Powell's method

[Figure: iterates of Powell's algorithm for f(x) = 5x₁² + 2x₁x₂ + x₂² + 7, starting at x⁰ = (1, 2); the plot shows the initial directions t¹, t² and the iterates x⁰, x¹, x², with x² optimal.]

Quasi-Newton methods

Recall that the Newton direction at iteration k is given by

  s^k = −[∇²f(x^k)]^{-1} ∇f(x^k) = −[∇²f(x^k)]^{-1} g^k.

Quasi-Newton methods use a positive definite approximation H_k to [∇²f(x^k)]^{-1}. The approximation H_k is updated at each iteration, say H_{k+1} = H_k + D_k, where D_k denotes the update. The algorithm has the following generic form.

Step 0: Let x^0 be given and set H_0 = I.
Step k: s^k = −H_k g^k and x^{k+1} = argmin_λ f(x^k + λ s^k) = x^k + λ_k s^k; H_{k+1} = H_k + D_k and k := k + 1.

Defining

  y^k := g^{k+1} − g^k,  σ^k := x^{k+1} − x^k = λ_k s^k,

we require for each k that:
I. H_{k+1} is symmetric positive definite;
II. σ^k = H_{k+1} y^k (quasi-Newton property);
III. σ^i = H_{k+1} y^i, i = 0, ..., k − 1 (hereditary property).

The quasi-Newton property and hereditary property

Let A be an n×n symmetric positive definite matrix, and let q be the strictly convex quadratic function q(x) = ½ x^T A x − b^T x. Then g^k = ∇q(x^k) = A x^k − b, and hence

  y^k = g^{k+1} − g^k = ∇q(x^{k+1}) − ∇q(x^k) = A(x^{k+1} − x^k) = A σ^k,

whence σ^k = A^{-1} y^k. Recall that each H_k should be a good approximation of the inverse of ∇²q(x^k), which is A^{-1}. Therefore we require that σ^k = H_{k+1} y^k, which is the quasi-Newton property, and even more, that our approximation satisfies

  σ^i = H_{k+1} y^i, i = 0, ..., k,

which is the hereditary property. Note that the hereditary property implies σ^i = H_n y^i, i = 0, ..., n − 1. If the σ^i (i = 0, ..., n − 1) are linearly independent, this implies H_n = A^{-1}. But then the (n+1)-th iteration is simply the Newton step at x^n. Since q is quadratic, this yields the minimizer of q, and hence we find the minimum of q in no more than n + 1 iterations.

A generic update D_k (1)

First consider the case where D_k is a (possibly indefinite) matrix of rank 2, whence D_k = α u u^T + β v v^T for suitable vectors u and v and scalars α, β. Then the quasi-Newton property implies

  H_{k+1} y^k = H_k y^k + α u u^T y^k + β v v^T y^k = σ^k.

Davidon, Fletcher and Powell (1963) recognized that this condition is satisfied if

  u = σ^k = λ_k s^k,  α = 1/(σ^{k,T} y^k),  v = H_k y^k,  β = −1/(y^{k,T} H_k y^k),

which yields the so-called DFP update:

  D_k = λ_k s^k s^{k,T} / (s^{k,T} y^k) − H_k y^k y^{k,T} H_k / (y^{k,T} H_k y^k).

In the following we consider a slightly more general update, namely

  D_k = α u u^T + β v v^T + µ(u v^T + v u^T) = [u v] [α µ; µ β] [u^T; v^T].

Exercise A: Show that D_k has rank at most 2.

A generic update D_k (2)

  D_k = α u u^T + β v v^T + µ(u v^T + v u^T) = [u v] [α µ; µ β] [u^T; v^T],  u = σ^k, v = H_k y^k.

Lemma 3 If the above update D_k satisfies the quasi-Newton property, then the subsequent directions are conjugate. (So a quasi-Newton method is a conjugate gradient method!)

Proof: We show by induction on k that

  H_k y^i = σ^i = λ_i s^i,  s^{k,T} A s^i = 0 and g^{k,T} s^i = 0,  0 ≤ i < k. (1)

This trivially holds if k = 0 (the condition is void). Assuming the quasi-Newton property and (1) for k ≥ 0, and using y^i = A σ^i and σ^i = λ_i s^i for all i, we write for i < k:

  v^T y^i = y^{k,T} H_k y^i = y^{k,T} σ^i = σ^{k,T} A σ^i = λ_k λ_i s^{k,T} A s^i = 0.

Also u^T y^i = σ^{k,T} y^i = σ^{k,T} A σ^i = 0. Hence we obtain, for all i < k,

  D_k y^i = [u v] [α µ; µ β] [u^T y^i; v^T y^i] = 0,

whence H_{k+1} y^i = H_k y^i + D_k y^i = H_k y^i = σ^i. Together with the quasi-Newton property this gives H_{k+1} y^i = σ^i for 0 ≤ i < k + 1. Because λ_i ≠ 0, s^{k+1} = −H_{k+1} g^{k+1} and H_{k+1} y^i = σ^i, we observe next that

  λ_i s^{k+1,T} A s^i = s^{k+1,T} A σ^i = s^{k+1,T} y^i = −g^{k+1,T} H_{k+1} y^i = −g^{k+1,T} σ^i.

Hence it suffices for the rest of the proof that g^{k+1,T} s^i = 0 for 0 ≤ i < k + 1. This certainly holds if i = k, because we use exact line search. For i < k we use the induction hypothesis again, and

  g^{k+1} = A(x^k + λ_k s^k) − b = g^k + λ_k A s^k,

which gives

  g^{k+1,T} s^i = g^{k,T} s^i + λ_k s^{k,T} A s^i = 0.

This completes the proof. ∎

The Broyden family of updates (1)

  D_k = α u u^T + β v v^T + µ(u v^T + v u^T) = [u v] [α µ; µ β] [u^T; v^T],  u = σ^k, v = H_k y^k.

We now determine conditions on the parameters α, β and µ that guarantee that each H_k is positive definite. The quasi-Newton property (σ^k = H_{k+1} y^k = H_k y^k + D_k y^k for k ≥ 0) amounts to

  u = v + α u u^T y^k + β v v^T y^k + µ(u v^T + v u^T) y^k = v(1 + β v^T y^k + µ u^T y^k) + u(α u^T y^k + µ v^T y^k).

To satisfy this condition it suffices if

  α u^T y^k + µ v^T y^k = 1,  β v^T y^k + µ u^T y^k = −1.

This linear system has multiple solutions. Introducing ρ = −µ u^T y^k ∈ R, the solution is

  α = (1/(u^T y^k)) (1 + ρ v^T y^k / (u^T y^k)),  β = (ρ − 1)/(v^T y^k),  ρ ∈ R.

Since v^T y^k = y^{k,T} H_k y^k > 0 and

  u^T y^k = σ^{k,T}(g^{k+1} − g^k) = λ_k s^{k,T}(g^{k+1} − g^k) = −λ_k s^{k,T} g^k = λ_k g^{k,T} H_k g^k > 0,

the above expressions are well defined.

The Broyden family of updates (2)

  D_k = α u u^T + β v v^T + µ(u v^T + v u^T),  u = σ^k, v = H_k y^k, with α and β as on the previous slide.

Substituting the values of u, v, α and β we find

  D_k = λ_k s^k s^{k,T} / (s^{k,T} y^k) − H_k y^k y^{k,T} H_k / (y^{k,T} H_k y^k) + ρ w w^T,

where

  w = √(y^{k,T} H_k y^k) ( s^k / (s^{k,T} y^k) − H_k y^k / (y^{k,T} H_k y^k) ).

This class of updates is known as the Broyden family. Note that if ρ = 0, we get the DFP update that we have seen before.

Lemma 4 If ρ ≥ 0, then H_k is positive definite for each k ≥ 0.

Proof: It suffices that H_k − H_k y^k y^{k,T} H_k / (y^{k,T} H_k y^k) is positive semidefinite, since the other two terms forming H_{k+1} are positive semidefinite. This, however, is an (almost) immediate consequence of the inequality of Cauchy-Schwarz. ∎

The choice ρ = 1 was proposed by Broyden, Fletcher, Goldfarb and Shanno (1970). This BFGS update is the most popular.
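A plain-Python sketch of the generic quasi-Newton loop with the BFGS (ρ = 1) update of the inverse-Hessian approximation H_k, using a golden-section line search (helper names, bracket and tolerances are my own choices):

```python
import math

def line_min_step(f, x, s, hi=5.0, eps=1e-12):
    """Golden-section search for argmin_{lam in [0, hi]} f(x + lam*s)."""
    r = 0.5 * (math.sqrt(5.0) - 1.0)
    a, b = 0.0, hi
    while b - a > eps:
        u, v = b - r * (b - a), a + r * (b - a)
        if f([xi + u * si for xi, si in zip(x, s)]) < f([xi + v * si for xi, si in zip(x, s)]):
            b = v
        else:
            a = u
    return 0.5 * (a + b)

def bfgs(f, grad, x, iters=30):
    n = len(x)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # H_0 = I
    g = grad(x)
    for _ in range(iters):
        s = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]  # s = -H g
        lam = line_min_step(f, x, s)
        x_new = [xi + lam * si for xi, si in zip(x, s)]
        g_new = grad(x_new)
        sigma = [a - b for a, b in zip(x_new, x)]       # sigma^k = x^{k+1} - x^k
        y = [a - b for a, b in zip(g_new, g)]           # y^k = g^{k+1} - g^k
        sy = sum(si * yi for si, yi in zip(sigma, y))
        if abs(sy) < 1e-14:                             # converged (sigma^T y ~ 0)
            break
        Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
        yHy = sum(yi * hi_ for yi, hi_ in zip(y, Hy))
        # BFGS update of H; the quasi-Newton property sigma = H_{k+1} y holds
        c = (1.0 + yHy / sy) / sy
        for i in range(n):
            for j in range(n):
                H[i][j] += (c * sigma[i] * sigma[j]
                            - (sigma[i] * Hy[j] + Hy[i] * sigma[j]) / sy)
        x, g = x_new, g_new
    return x

quad = lambda x: 5 * x[0] ** 2 + 2 * x[0] * x[1] + x[1] ** 2 + 7
quad_grad = lambda x: [10 * x[0] + 2 * x[1], 2 * x[0] + 2 * x[1]]
x = bfgs(quad, quad_grad, [1.0, 2.0])
```

On this strictly convex quadratic, BFGS with (near-)exact line search behaves as a conjugate gradient method (Lemma 3) and reaches the minimizer, the origin, in n = 2 iterations.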

Stopping criteria

The stopping criterion is a relatively simple but essential part of all algorithms. If both primal and dual feasible solutions are generated, then we use the duality gap

  primal objective value − dual objective value

as a criterion: we stop the algorithm if the duality gap is smaller than some prescribed accuracy parameter ε. In unconstrained optimization one often uses a primal algorithm, and then there is no such obvious measure for the distance to the optimum. We then stop if there is no sufficient improvement in the objective value, if the iterates stay too close to each other, or if the length of the gradient or the length of the Newton step (in an appropriate norm) is small. All these criteria can be scaled relative to some characteristic number describing the dimensions of the problem. For example, the relative improvement in the objective value at two subsequent iterates x^k, x^{k+1} is usually measured by

  (f(x^k) − f(x^{k+1})) / (1 + |f(x^k)|),

and we may stop if it is smaller than a prescribed accuracy parameter ε.