Search Directions for Unconstrained Optimization

CHAPTER 8

Search Directions for Unconstrained Optimization

In this chapter we study the choice of search directions used in our basic updating scheme

    x^{k+1} = x^k + t_k d^k

for solving

    P :  min_{x ∈ ℝⁿ} f(x).

All of the search directions considered can be classified as Newton-like since they are all of the form d^k = −H_k ∇f(x^k), where H_k is a symmetric n × n matrix. If H_k = μ_k I for all k, the resulting search directions are scaled steepest descent directions with scale factors μ_k. More generally, we choose H_k to approximate ∇²f(x^k)⁻¹ in order to approximate Newton's method for optimization. Newton's method is important since it possesses rapid local convergence properties and can be shown to be scale independent. We precede our discussion of search directions by making precise a useful notion of the speed, or rate, of convergence.

1. Rate of Convergence

We focus on notions of quotient rates of convergence, or Q-convergence rates. Let {x^ν} ⊂ ℝⁿ and x̄ ∈ ℝⁿ be such that x^ν → x̄. We say that x^ν → x̄ at a linear rate if

    lim sup_{ν→∞} ‖x^{ν+1} − x̄‖ / ‖x^ν − x̄‖ < 1.

The convergence is said to be superlinear if this lim sup is 0. The convergence is said to be quadratic if

    lim sup_{ν→∞} ‖x^{ν+1} − x̄‖ / ‖x^ν − x̄‖² < ∞.

For example, given γ ∈ (0, 1), the sequence {γ^ν} converges linearly to zero, but not superlinearly. The sequence {γ^{ν²}} converges superlinearly to 0, but not quadratically. Finally, the sequence {γ^{2^ν}} converges quadratically to zero. Superlinear convergence is much faster than linear convergence, but quadratic convergence is much, much faster than superlinear convergence.

2. Newton's Method for Solving Equations

Newton's method is an iterative scheme designed to solve nonlinear equations of the form

(89)    g(x) = 0,

where g : ℝⁿ → ℝⁿ is assumed to be continuously differentiable. Many problems of importance can be posed in this way. In the context of the optimization problem P, we wish to locate critical points, that is, points at which ∇f(x) = 0. We begin our discussion of Newton's method in the usual context of equation solving.
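The three model sequences above can be checked numerically. The following is a minimal sketch, assuming γ = 1/2; the helper `ratios` is ours, not part of the notes:

```python
# Empirically check the Q-convergence rates of the three model sequences
# from the text: gamma^nu (linear), gamma^(nu^2) (superlinear), and
# gamma^(2^nu) (quadratic), all of which converge to 0.
gamma = 0.5

def ratios(seq, power):
    """Return |x_{nu+1}| / |x_nu|^power for consecutive terms."""
    return [seq[i + 1] / seq[i] ** power for i in range(len(seq) - 1)]

linear = [gamma ** nu for nu in range(1, 8)]
superlinear = [gamma ** (nu ** 2) for nu in range(1, 8)]
quadratic = [gamma ** (2 ** nu) for nu in range(1, 8)]

# Linear: the ratio |x_{nu+1}| / |x_nu| is constant at gamma < 1.
assert all(abs(r - gamma) < 1e-12 for r in ratios(linear, 1))
# Superlinear: that same ratio tends to 0.
assert ratios(superlinear, 1)[-1] < 1e-3
# Quadratic: |x_{nu+1}| / |x_nu|^2 stays bounded (here it equals 1).
assert all(abs(r - 1.0) < 1e-12 for r in ratios(quadratic, 2))
```

Note that {γ^{ν²}} is superlinear but not quadratic: its quadratic ratio γ^{(ν+1)² − 2ν²} grows without bound.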
Assume that the function g in (89) is continuously differentiable and that we have an approximate solution x⁰ ∈ ℝⁿ. We now wish to improve on this approximation. If x̄ is a solution to (89), then

    0 = g(x̄) = g(x⁰) + g′(x⁰)(x̄ − x⁰) + o(‖x̄ − x⁰‖).

Thus, if x⁰ is close to x̄, it is reasonable to suppose that the solution to the linearized system

(90)    0 = g(x⁰) + g′(x⁰)(x − x⁰)

is even closer. This is Newton's method for finding the roots of the equation g(x) = 0. It has one obvious pitfall: equation (90) may not be consistent. That is, there may not exist a solution to (90).

For the sake of the present argument, we assume that g′(x⁰) is invertible, i.e. that g′(x⁰)⁻¹ exists. Under this assumption (90) defines the iteration scheme

(91)    x^{k+1} := x^k − [g′(x^k)]⁻¹ g(x^k),

called the Newton iteration. The associated direction

(92)    d^k := −[g′(x^k)]⁻¹ g(x^k)

is called the Newton direction. We analyze the convergence behavior of this scheme under the additional assumption that only an approximation to g′(x^k)⁻¹ is available. We denote this approximation by J_k. The resulting iteration scheme is

(93)    x^{k+1} := x^k − J_k g(x^k).

Methods of this type are called Newton-like methods.

Theorem 2.1. Let g : ℝⁿ → ℝⁿ be differentiable, x⁰ ∈ ℝⁿ, and J₀ ∈ ℝ^{n×n}. Suppose that there exist x̄ ∈ ℝⁿ and ε > 0 with ‖x⁰ − x̄‖ < ε such that
(1) g(x̄) = 0,
(2) g′(x)⁻¹ exists for x ∈ B(x̄; ε) := {x ∈ ℝⁿ : ‖x − x̄‖ < ε} with sup{‖g′(x)⁻¹‖ : x ∈ B(x̄; ε)} ≤ M₁,
(3) g′ is Lipschitz continuous on cl B(x̄; ε) with Lipschitz constant L, and
(4) θ₀ := (LM₁/2)‖x⁰ − x̄‖ + M₀K < 1, where K ≥ ‖(g′(x⁰)⁻¹ − J₀)y⁰‖, y⁰ := g(x⁰)/‖g(x⁰)‖, and M₀ = max{‖g′(x)‖ : x ∈ B(x̄; ε)}.
Further suppose that iteration (93) is initiated at x⁰, where the J_k's are chosen to satisfy one of the following conditions:
(i) ‖(g′(x^k)⁻¹ − J_k)y^k‖ ≤ K,
(ii) ‖(g′(x^k)⁻¹ − J_k)y^k‖ ≤ θ₁^k K for some θ₁ ∈ (0, 1),
(iii) ‖(g′(x^k)⁻¹ − J_k)y^k‖ ≤ min{M₂‖x^k − x^{k−1}‖, K} for some M₂ > 0, or
(iv) ‖(g′(x^k)⁻¹ − J_k)y^k‖ ≤ min{M₃‖g(x^k)‖, K} for some M₃ > 0,
where for each k = 1, 2, ..., y^k := g(x^k)/‖g(x^k)‖.
These hypotheses on the accuracy of the approximations J_k yield the following conclusions about the rate of convergence of the iterates x^k.
(a) If (i) holds, then x^k → x̄ linearly.
(b) If (ii) holds, then x^k → x̄ superlinearly.
(c) If (iii) holds, then x^k → x̄ two-step quadratically.
(d) If (iv) holds, then x^k → x̄ quadratically.

Proof. We begin by inductively establishing the basic inequalities

(94)    ‖x^{k+1} − x̄‖ ≤ (LM₁/2)‖x^k − x̄‖² + ‖(g′(x^k)⁻¹ − J_k)g(x^k)‖

and

(95)    ‖x^{k+1} − x̄‖ ≤ θ₀‖x^k − x̄‖,

as well as the inclusion

(96)    x^{k+1} ∈ B(x̄; ε)

for k = 0, 1, 2, ....
For k = 0 we have

    x¹ − x̄ = x⁰ − x̄ − g′(x⁰)⁻¹ g(x⁰) + [g′(x⁰)⁻¹ − J₀] g(x⁰)
           = g′(x⁰)⁻¹ [g(x̄) − (g(x⁰) + g′(x⁰)(x̄ − x⁰))] + [g′(x⁰)⁻¹ − J₀] g(x⁰),

since g′(x⁰)⁻¹ exists by the hypotheses. Consequently, hypotheses (1)–(4) plus the quadratic bound lemma imply that

    ‖x¹ − x̄‖ ≤ ‖g′(x⁰)⁻¹‖ ‖g(x̄) − (g(x⁰) + g′(x⁰)(x̄ − x⁰))‖ + ‖(g′(x⁰)⁻¹ − J₀)g(x⁰)‖
             ≤ (M₁L/2)‖x⁰ − x̄‖² + K‖g(x⁰) − g(x̄)‖
             ≤ (M₁L/2)‖x⁰ − x̄‖² + M₀K‖x⁰ − x̄‖
             ≤ θ₀‖x⁰ − x̄‖ < ε,

whereby (94)–(96) are established for k = 0.

Next suppose that (94)–(96) hold for k = 0, 1, ..., s − 1. We show that (94)–(96) hold at k = s. Since x^s ∈ B(x̄; ε) and hypotheses (2)–(4) hold at x^s, one can proceed exactly as in the case k = 0 to obtain (94). Now if any one of (i)–(iv) holds, then (i) holds. Thus, by (94), we find that

    ‖x^{s+1} − x̄‖ ≤ (M₁L/2)‖x^s − x̄‖² + ‖(g′(x^s)⁻¹ − J_s)g(x^s)‖
                  ≤ [(M₁L/2)‖x^s − x̄‖ + M₀K] ‖x^s − x̄‖
                  ≤ [(M₁L/2)θ₀^s‖x⁰ − x̄‖ + M₀K] ‖x^s − x̄‖
                  ≤ [(M₁L/2)‖x⁰ − x̄‖ + M₀K] ‖x^s − x̄‖
                  = θ₀‖x^s − x̄‖.

Hence ‖x^{s+1} − x̄‖ ≤ θ₀‖x^s − x̄‖ ≤ θ₀ε < ε, and so x^{s+1} ∈ B(x̄; ε).

We now proceed to establish (a)–(d).

(a) This clearly holds since the induction above established that ‖x^{k+1} − x̄‖ ≤ θ₀‖x^k − x̄‖.

(b) From (94), we have

    ‖x^{k+1} − x̄‖ ≤ (LM₁/2)‖x^k − x̄‖² + ‖(g′(x^k)⁻¹ − J_k)g(x^k)‖
                  ≤ (LM₁/2)‖x^k − x̄‖² + θ₁^k K‖g(x^k)‖
                  ≤ [(LM₁/2)θ₀^k‖x⁰ − x̄‖ + θ₁^k M₀K] ‖x^k − x̄‖.

Hence x^k → x̄ superlinearly.

(c) From (94) and the fact that x^k → x̄, we eventually have

    ‖x^{k+1} − x̄‖ ≤ (LM₁/2)‖x^k − x̄‖² + ‖(g′(x^k)⁻¹ − J_k)g(x^k)‖
                  ≤ (LM₁/2)‖x^k − x̄‖² + M₂‖x^k − x^{k−1}‖ ‖g(x^k)‖
                  ≤ [(LM₁/2)‖x^k − x̄‖ + M₀M₂(‖x^k − x̄‖ + ‖x̄ − x^{k−1}‖)] ‖x^k − x̄‖
                  ≤ [(LM₁/2)θ₀‖x^{k−1} − x̄‖ + M₀M₂(1 + θ₀)‖x^{k−1} − x̄‖] θ₀‖x^{k−1} − x̄‖
                  = θ₀[(LM₁/2)θ₀ + M₀M₂(1 + θ₀)] ‖x^{k−1} − x̄‖².

Hence x^k → x̄ two-step quadratically.
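The gap between conclusions (a) and (d) is easy to see numerically. The sketch below compares the Newton-like iteration (93) with J_k = g′(x^k)⁻¹ (the exact Newton choice, for which condition (iv) holds trivially) against the "chord" choice J_k = g′(x⁰)⁻¹ held fixed, which only guarantees condition (i). The 2×2 test system and starting point are our own illustrative choices, not from the text:

```python
import numpy as np

# Newton-like iteration x_{k+1} = x_k - J_k g(x_k), run with two choices
# of J_k on g(x) = (x1^2 + x2^2 - 1, x1 - x2), whose root near the start
# is xbar = (1/sqrt(2), 1/sqrt(2)).

def g(x):
    return np.array([x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]])

def g_prime(x):
    return np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])

def newton_like(x0, fixed_jacobian, iters=8):
    x = x0.copy()
    J0 = np.linalg.inv(g_prime(x0))
    xbar = np.array([1.0, 1.0]) / np.sqrt(2.0)
    errs = []
    for _ in range(iters):
        J = J0 if fixed_jacobian else np.linalg.inv(g_prime(x))
        x = x - J @ g(x)                 # iteration (93)
        errs.append(np.linalg.norm(x - xbar))
    return errs

x0 = np.array([1.0, 0.5])
newton_errs = newton_like(x0, fixed_jacobian=False)  # quadratic rate
chord_errs = newton_like(x0, fixed_jacobian=True)    # linear rate

# Newton reaches roundoff level in a handful of steps; the chord method
# still converges, but noticeably more slowly.
assert newton_errs[5] < 1e-10
assert chord_errs[5] > newton_errs[5]
```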

(d) Again by (94) and the fact that x^k → x̄, we eventually have

    ‖x^{k+1} − x̄‖ ≤ (LM₁/2)‖x^k − x̄‖² + ‖(g′(x^k)⁻¹ − J_k)g(x^k)‖
                  ≤ (LM₁/2)‖x^k − x̄‖² + M₃‖g(x^k)‖²
                  ≤ [(LM₁/2) + M₃M₀²] ‖x^k − x̄‖².  □

Note that the conditions required of the approximations to the inverse Jacobians g′(x^k)⁻¹ given in (i)–(iv) do not imply that J_k → g′(x̄)⁻¹. The stronger conditions

(i′) ‖g′(x^k)⁻¹ − J_k‖ ≤ ‖g′(x⁰)⁻¹ − J₀‖,
(ii′) ‖g′(x^{k+1})⁻¹ − J_{k+1}‖ ≤ θ₁‖g′(x^k)⁻¹ − J_k‖ for some θ₁ ∈ (0, 1),
(iii′) ‖g′(x^k)⁻¹ − J_k‖ ≤ min{M₂‖x^{k+1} − x^k‖, ‖g′(x⁰)⁻¹ − J₀‖} for some M₂ > 0, or
(iv′) g′(x^k)⁻¹ = J_k,

which imply conditions (i) through (iv) of Theorem 2.1, respectively, all imply the convergence of the inverse Jacobian approximates to g′(x̄)⁻¹. The conditions (i′)–(iv′) are less desirable since they require greater expense and care in the construction of the inverse Jacobian approximates.

3. Newton's Method for Minimization

We now translate the results of the previous section to the optimization setting. The underlying problem is

    P :  min_{x ∈ ℝⁿ} f(x).

The Newton-like iterations take the form

    x^{k+1} = x^k − H_k ∇f(x^k),

where H_k is an approximation to the inverse of the Hessian matrix ∇²f(x^k).

Theorem 3.1. Let f : ℝⁿ → ℝ be twice continuously differentiable, x⁰ ∈ ℝⁿ, and H₀ ∈ ℝ^{n×n}. Suppose that
(1) there exists x̄ ∈ ℝⁿ and ε > ‖x⁰ − x̄‖ such that f(x̄) ≤ f(x) whenever ‖x − x̄‖ ≤ ε,
(2) there is a δ > 0 such that δ‖z‖² ≤ zᵀ∇²f(x)z for all x ∈ B(x̄; ε),
(3) ∇²f is Lipschitz continuous on cl B(x̄; ε) with Lipschitz constant L, and
(4) θ₀ := (L/2δ)‖x⁰ − x̄‖ + M₀K < 1, where M₀ > 0 satisfies zᵀ∇²f(x)z ≤ M₀‖z‖² for all x ∈ B(x̄; ε), and K ≥ ‖(∇²f(x⁰)⁻¹ − H₀)y⁰‖ with y⁰ = ∇f(x⁰)/‖∇f(x⁰)‖.
Further, suppose that the iteration

(97)    x^{k+1} := x^k − H_k ∇f(x^k)

is initiated at x⁰, where the H_k's are chosen to satisfy one of the following conditions:
(i) ‖(∇²f(x^k)⁻¹ − H_k)y^k‖ ≤ K,
(ii) ‖(∇²f(x^k)⁻¹ − H_k)y^k‖ ≤ θ₁^k K for some θ₁ ∈ (0, 1),
(iii) ‖(∇²f(x^k)⁻¹ − H_k)y^k‖ ≤ min{M₂‖x^k − x^{k−1}‖, K} for some M₂ > 0, or
(iv) ‖(∇²f(x^k)⁻¹ − H_k)y^k‖ ≤ min{M₃‖∇f(x^k)‖, K} for some M₃ > 0,
where for each k = 1, 2, ..., y^k := ∇f(x^k)/‖∇f(x^k)‖.
These hypotheses on the accuracy of the approximations H_k yield the following conclusions about the rate of convergence of the iterates x^k.
(a) If (i) holds, then x^k → x̄ linearly.
(b) If (ii) holds, then x^k → x̄ superlinearly.
(c) If (iii) holds, then x^k → x̄ two-step quadratically.
(d) If (iv) holds, then x^k → x̄ quadratically.

To more fully understand the convergence behavior described in this theorem, let us examine the nature of the controlling parameters L, δ, and M₀. Since L is a Lipschitz constant for ∇²f, it loosely corresponds to a bound on the third-order behavior of f. Thus the assumptions for convergence make implicit demands on the third derivative. The constant δ is a local lower bound on the eigenvalues of ∇²f near x̄. That is, f behaves locally as if it were a strongly convex function (see exercises) with modulus δ. Finally, M₀ can be interpreted as a local Lipschitz constant for ∇f, and it only plays a role when ∇²f(x^k)⁻¹ is approximated inexactly by the H_k's.

We now illustrate the performance differences between the method of steepest descent and Newton's method on a simple one-dimensional problem. Let f(x) = x² + eˣ. Clearly, f is a strongly convex function with

    f′(x) = 2x + eˣ,   f″(x) = 2 + eˣ > 0,   f‴(x) = eˣ.

If we apply the steepest descent algorithm with backtracking (γ = 1/2, c = 0.01) initiated at x⁰ = 1, we get the following table:

    k     x_k          f(x_k)       f′(x_k)
    0      1.0         3.7182818    4.7182818
    1      0.0         1.0000000    1.0000000
    2     −0.5         0.8565307   −0.3934693
    3     −0.25        0.8413008    0.2788008
    4     −0.375       0.8279143   −0.0627107
    5     −0.34075     0.8273473    0.0297367
    6     −0.356375    0.8272131   −0.0125400
    7     −0.348565    0.8271976    0.0085768
    8     −0.3524688   0.8271848   −0.0019876
    9     −0.35149     0.8271841    0.0006589
    10    −0.3517364   0.8271840   −0.0000073

If we apply Newton's method from the same starting point and take a unit step at each iteration, we obtain a dramatically different table:

    k     x_k          f′(x_k)
    0      1.0         4.7182818
    1      0.0         1.0
    2     −1/3         0.0498646
    3     −0.3516893   0.00012
    4     −0.3517337   0.00000000064

In addition, one more iteration gives |f′(x⁵)| ≈ 10⁻²⁰. This is a stunning improvement in performance and shows why one always uses Newton's method (or an approximation to it) whenever possible. Our next objective is to develop numerically viable methods for approximating Jacobians and Hessians in Newton-like methods.

4. Matrix Secant Methods

Let us return to the problem of finding x̄ ∈ ℝⁿ such that g(x̄) = 0, where g : ℝⁿ → ℝⁿ is continuously differentiable.
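The comparison in the tables is easy to reproduce. The sketch below runs both methods on f(x) = x² + eˣ from x⁰ = 1; the exact backtracking conventions in the notes may differ slightly from ours, so individual iterates need not match the table digit for digit:

```python
import math

# Steepest descent with backtracking (gamma = 1/2, c = 0.01) versus
# Newton's method with unit steps, both applied to f(x) = x^2 + e^x
# starting from x0 = 1.  The minimizer is x* ~ -0.3517337.

def f(x):
    return x ** 2 + math.exp(x)

def f1(x):          # f'
    return 2 * x + math.exp(x)

def f2(x):          # f''
    return 2 + math.exp(x)

def steepest_descent(x, iters=25, c=0.01, gamma=0.5):
    for _ in range(iters):
        d = -f1(x)
        t = 1.0
        # Backtrack until the Armijo sufficient-decrease test holds.
        while f(x + t * d) > f(x) + c * t * f1(x) * d:
            t *= gamma
        x += t * d
    return x

def newton(x, iters=5):
    for _ in range(iters):
        x -= f1(x) / f2(x)   # unit Newton step
    return x

x_sd = steepest_descent(1.0)
x_nt = newton(1.0)

assert abs(x_sd + 0.3517337) < 1e-4   # steepest descent: slow but steady
assert abs(f1(x_nt)) < 1e-12          # Newton: gradient at roundoff in 5 steps
```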
In this section we consider Newton-like methods of a special type. Recall that in a Newton-like method the iteration scheme takes the form

(98)    x^{k+1} := x^k − J_k g(x^k),

where J_k is meant to approximate the inverse of g′(x^k). In the one-dimensional case, a method proposed by the Babylonians 3700 years ago is of particular significance. Today we call it the secant method:

(99)    J_k = (x^k − x^{k−1}) / (g(x^k) − g(x^{k−1})).

With this approximation one has

    g′(x^k)⁻¹ − J_k = [g(x^{k−1}) − (g(x^k) + g′(x^k)(x^{k−1} − x^k))] / [g′(x^k)(g(x^{k−1}) − g(x^k))].

Near a point x̄ at which g′(x̄) ≠ 0, one can use the mean value theorem to show that there exists an α > 0 such that α|x − y| ≤ |g(x) − g(y)| for all x, y near x̄. Consequently, by the quadratic bound lemma,

    |g′(x^k)⁻¹ − J_k| ≤ (L/2)|x^{k−1} − x^k|² / (α|g′(x^k)| |x^{k−1} − x^k|) ≤ K|x^{k−1} − x^k|

for some constant K > 0 whenever x^k and x^{k−1} are sufficiently close to x̄. Therefore, by our convergence theorem for Newton-like methods, the secant method is locally two-step quadratically convergent to a nonsingular solution of the equation g(x) = 0. An additional advantage of this approach is that no extra function evaluations are required to obtain the approximation J_k.

4.0.2. Matrix Secant Methods for Equations. Unfortunately, the secant approximation (99) is meaningless if the dimension n is greater than 1, since division by vectors is undefined. But this can be rectified by multiplying (99) on the right by (g(x^k) − g(x^{k−1})) and writing

(100)    J_k (g(x^k) − g(x^{k−1})) = x^k − x^{k−1}.

Equation (100) is called the quasi-Newton equation (QNE), or matrix secant equation (MSE), at x^k. Here the matrix J_k is unknown but is required to satisfy the n linear equations of the MSE. These equations determine an (n² − n)-dimensional affine manifold in ℝ^{n×n}. Since J_k contains n² unknowns, the n linear equations in (100) are not sufficient to uniquely determine J_k. To nail down a specific J_k, further conditions on the update J_k must be given. What conditions should these be?

To develop sensible conditions on J_k, let us consider an overall iteration scheme based on (98). For convenience, let us denote J_k⁻¹ by B_k (i.e. B_k = J_k⁻¹). Using the B_k's, the MSE (100) becomes

(101)    B_k (x^k − x^{k−1}) = g(x^k) − g(x^{k−1}).

At every iteration we have (x^k, B_k) and compute x^{k+1} by (98). Then B_{k+1} is constructed to satisfy (101) at x^{k+1}. If B_k is close to g′(x^k) and x^{k+1} is close to x^k, then B_{k+1} should be chosen not only to satisfy (101) but also to be as close to B_k as possible.
With this in mind, we must now decide what we mean by "close." From a computational perspective, we prefer "close" to mean easy to compute. That is, B_{k+1} should be algebraically close to B_k in the sense that B_{k+1} is only a rank-1 modification of B_k. Since we are assuming that B_{k+1} is a rank-1 modification of B_k, there are vectors u, v ∈ ℝⁿ such that

(102)    B_{k+1} = B_k + uvᵀ.

We now use the matrix secant equation (101) to derive conditions on the choice of u and v. In this setting, the MSE becomes B_{k+1}s^k = y^k, where

    s^k := x^{k+1} − x^k  and  y^k := g(x^{k+1}) − g(x^k).

Multiplying (102) on the right by s^k gives

    y^k = B_{k+1}s^k = B_k s^k + uvᵀs^k.

Hence, if vᵀs^k ≠ 0, we obtain u = (y^k − B_k s^k)/(vᵀs^k) and

(103)    B_{k+1} = B_k + ((y^k − B_k s^k)vᵀ) / (vᵀs^k).

Equation (103) determines a whole class of rank-one updates that satisfy the MSE, where one is allowed to choose v ∈ ℝⁿ as long as vᵀs^k ≠ 0. If s^k ≠ 0, then an obvious choice for v is s^k, yielding the update

(104)    B_{k+1} = B_k + ((y^k − B_k s^k)(s^k)ᵀ) / ((s^k)ᵀs^k).

This is known as Broyden's update. It turns out that the Broyden update is also analytically close.

Theorem 4.1. Let A ∈ ℝ^{n×n} and s, y ∈ ℝⁿ with s ≠ 0. Then for any pair of matrix norms ‖·‖ and |||·||| satisfying

    ‖AB‖ ≤ ‖A‖ |||B|||  and  |||vvᵀ/(vᵀv)||| ≤ 1 for all v ≠ 0,

the solution to

(105)    min{‖B − A‖ : Bs = y}

is

(106)    A⁺ = A + ((y − As)sᵀ) / (sᵀs).

In particular, (106) solves (105) when ‖·‖ is the ℓ₂ matrix norm, and (106) solves (105) uniquely when ‖·‖ is the Frobenius norm.

Proof. Let B ∈ {B ∈ ℝ^{n×n} : Bs = y}. Then

    ‖A⁺ − A‖ = ‖((y − As)sᵀ)/(sᵀs)‖ = ‖(B − A)(ssᵀ)/(sᵀs)‖ ≤ ‖B − A‖ |||ssᵀ/(sᵀs)||| ≤ ‖B − A‖.

Note that if |||·||| is the ℓ₂ operator norm, then

    |||vvᵀ/(vᵀv)||| = sup{‖vvᵀx‖₂/(vᵀv) : ‖x‖₂ = 1} = sup{|vᵀx| ‖v‖₂/(vᵀv) : ‖x‖₂ = 1} = 1,

so the conclusion of the result is not vacuous. For uniqueness, observe that the Frobenius norm is the Euclidean norm on ℝ^{n×n}, whose square is strictly convex, so B ↦ ‖B − A‖_F has a unique minimizer over the affine set {B : Bs = y}.  □

Therefore, the Broyden update (104) is both algebraically and analytically close to B_k. These properties indicate that it should perform well in practice, and indeed it does.

Algorithm: Broyden's Method

Initialization: x⁰ ∈ ℝⁿ, B₀ ∈ ℝ^{n×n}.
Having (x^k, B_k), compute (x^{k+1}, B_{k+1}) as follows:

    Solve B_k s^k = −g(x^k) for s^k and set
        x^{k+1} := x^k + s^k,
        y^k := g(x^{k+1}) − g(x^k),
        B_{k+1} := B_k + ((y^k − B_k s^k)(s^k)ᵀ) / ((s^k)ᵀs^k).

We would prefer to write the Broyden update in terms of the matrices J_k = B_k⁻¹ so that the step computation becomes s^k = −J_k g(x^k), avoiding the need to solve an equation. To obtain the formula for J_{k+1} we use the following important lemma for matrix inversion.

Lemma 4.1 (Sherman–Morrison–Woodbury). Suppose A ∈ ℝ^{n×n} and U, V ∈ ℝ^{n×k} are such that both A⁻¹ and (I + VᵀA⁻¹U)⁻¹ exist. Then

    (A + UVᵀ)⁻¹ = A⁻¹ − A⁻¹U(I + VᵀA⁻¹U)⁻¹VᵀA⁻¹.
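The boxed algorithm can be sketched in a few lines. The 2×2 test system below is our own illustrative choice, and we initialize with B₀ = g′(x⁰), as in the local convergence theory (Theorem 4.2 below assumes B₀ is close to g′(x⁰)):

```python
import numpy as np

# A minimal sketch of Broyden's method: solve B_k s = -g(x_k), step,
# then apply the rank-one Broyden update (104) to B_k.

def g(x):
    return np.array([x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]])

def g_prime(x):
    return np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])

def broyden(x, B, iters=20, tol=1e-12):
    for _ in range(iters):
        r = g(x)
        if np.linalg.norm(r) <= tol:
            break                          # converged; avoid a 0/0 update
        s = np.linalg.solve(B, -r)         # B_k s^k = -g(x^k)
        y = g(x + s) - g(x)
        B = B + np.outer(y - B @ s, s) / (s @ s)   # Broyden update (104)
        x = x + s
    return x

x0 = np.array([1.0, 0.5])
root = broyden(x0, g_prime(x0))

# The iterates converge to the root (1/sqrt(2), 1/sqrt(2)) near x0.
assert np.linalg.norm(g(root)) < 1e-10
```

Note that only one new evaluation of g is needed per iteration, and the Jacobian is never recomputed after initialization.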

The above lemma verifies that if B_k⁻¹ = J_k exists and (s^k)ᵀJ_k y^k = (s^k)ᵀB_k⁻¹y^k ≠ 0, then

(107)    J_{k+1} = [B_k + ((y^k − B_k s^k)(s^k)ᵀ)/((s^k)ᵀs^k)]⁻¹
                 = B_k⁻¹ + ((s^k − B_k⁻¹y^k)(s^k)ᵀB_k⁻¹) / ((s^k)ᵀB_k⁻¹y^k)
                 = J_k + ((s^k − J_k y^k)(s^k)ᵀJ_k) / ((s^k)ᵀJ_k y^k).

In this case, it is possible to directly update the inverses J_k. It should be cautioned, though, that this process can become numerically unstable if (s^k)ᵀJ_k y^k is small. Therefore, in practice, the value (s^k)ᵀJ_k y^k must be monitored to avoid numerical instability.

Although we do not pause to establish the convergence rates here, we do give the following result due to Dennis and Moré (1974).

Theorem 4.2. Let g : ℝⁿ → ℝⁿ be continuously differentiable in an open convex set D ⊂ ℝⁿ. Assume that there exist x* ∈ ℝⁿ and r, β > 0 such that x* + rB ⊂ D (where B is the closed unit ball), g(x*) = 0, g′(x*)⁻¹ exists with ‖g′(x*)⁻¹‖ ≤ β, and g′ is Lipschitz continuous on x* + rB with Lipschitz constant γ > 0. Then there exist positive constants ε and δ such that if ‖x⁰ − x*‖ ≤ ε and ‖B₀ − g′(x⁰)‖ ≤ δ, then the sequence {x^k} generated by the iteration

    x^{k+1} := x^k + s^k, where s^k solves 0 = g(x^k) + B_k s^k,
    y^k := g(x^{k+1}) − g(x^k),
    B_{k+1} := B_k + ((y^k − B_k s^k)(s^k)ᵀ) / ((s^k)ᵀs^k),

is well defined with x^k → x* superlinearly.

4.0.3. Matrix Secant Methods for Minimization. We now extend these matrix secant ideas to optimization, specifically minimization. The underlying problem we consider is

    P :  minimize_{x ∈ ℝⁿ} f(x),

where f : ℝⁿ → ℝ is assumed to be twice continuously differentiable. In this setting we wish to solve the equation ∇f(x) = 0, and the MSE (101) becomes

(108)    H_{k+1} y^k = s^k,

where s^k := x^{k+1} − x^k and y^k := ∇f(x^{k+1}) − ∇f(x^k). Here the matrix H_k is intended to be an approximation to the inverse of the Hessian matrix ∇²f(x^k). Writing M_k = H_k⁻¹, a straightforward application of Broyden's method gives the update

    M_{k+1} = M_k + ((y^k − M_k s^k)(s^k)ᵀ) / ((s^k)ᵀs^k).

However, this is unsatisfactory for two reasons:
(1) Since M_k approximates ∇²f(x^k), it must be symmetric.
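The identity (107) is easy to verify numerically with the Sherman–Morrison–Woodbury formula. In the sketch below the matrix and vectors are random test data, not from the text:

```python
import numpy as np

# Check (107): updating the inverse J_k directly agrees with inverting
# the Broyden-updated B_k, provided s^T J y is safely away from zero.

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # well-conditioned B_k
J = np.linalg.inv(B)
s = rng.standard_normal(4)
y = rng.standard_normal(4)

B_new = B + np.outer(y - B @ s, s) / (s @ s)           # Broyden update (104)
J_new = J + np.outer(s - J @ y, s @ J) / (s @ J @ y)   # inverse update (107)

assert np.allclose(J_new, np.linalg.inv(B_new))
```

In a practical implementation one would test the magnitude of `s @ J @ y` before applying the inverse update, exactly as cautioned above.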
(2) Since we are minimizing, M_k must be positive definite to ensure that s^k = −M_k⁻¹∇f(x^k) is a direction of descent for f at x^k.

To address problem (1) above, one could return to equation (103) and find an update that preserves symmetry. Such an update is uniquely obtained by setting v = (y^k − M_k s^k). This is called the symmetric rank-1 update, or SR1. Although this update can on occasion exhibit problems with numerical stability, it has recently received a great deal of renewed interest. The stability problems occur whenever

(109)    vᵀs^k = (y^k − M_k s^k)ᵀs^k

has small magnitude. The inverse SR1 update is given by

    H_{k+1} = H_k + ((s^k − H_k y^k)(s^k − H_k y^k)ᵀ) / ((s^k − H_k y^k)ᵀy^k),

which exists whenever (s^k − H_k y^k)ᵀy^k ≠ 0.

We now approach the question of how to update M_k in a way that addresses both the issue of symmetry and positive definiteness while still using the Broyden updating ideas. Given a symmetric positive definite matrix M and two vectors s and y, our goal is to find a symmetric positive definite matrix M̄ such that M̄s = y. Since M is symmetric and positive definite, there is a nonsingular n × n matrix L such that M = LLᵀ. Indeed, L can be

chosen to be the lower triangular Cholesky factor of M. If M̄ is also symmetric and positive definite, then there is a matrix J ∈ ℝ^{n×n} such that M̄ = JJᵀ. The MSE M̄s = y implies that if

(110)    Jᵀs = v,

then

(111)    Jv = y.

Let us apply the Broyden update technique to (111), J, and L. That is, suppose that

(112)    J = L + ((y − Lv)vᵀ) / (vᵀv).

Then by (110),

(113)    v = Jᵀs = Lᵀs + (v(y − Lv)ᵀs) / (vᵀv).

This expression implies that v must have the form v = αLᵀs for some α ∈ ℝ. Substituting this back into (113), we get

    αLᵀs = Lᵀs + (αLᵀs (y − αLLᵀs)ᵀs) / (α²sᵀLLᵀs).

Hence

(114)    α² = (sᵀy) / (sᵀMs).

Consequently, such a matrix J satisfying (113) exists only if sᵀy > 0, in which case

    J = L + ((y − αMs)sᵀL) / (αsᵀMs)  with  α = [(sᵀy)/(sᵀMs)]^{1/2},

yielding

(115)    M̄ = M + (yyᵀ)/(yᵀs) − (MssᵀM)/(sᵀMs).

Moreover, the Cholesky factorization of M̄ can be obtained directly from the matrix J. Specifically, if the QR factorization of Jᵀ is Jᵀ = QR, we can set L̄ = Rᵀ, yielding

    M̄ = JJᵀ = RᵀQᵀQR = L̄L̄ᵀ.

The formula for updating the inverses is again given by applying the Sherman–Morrison–Woodbury formula to obtain

(116)    H̄ = H + ((s + Hy)ᵀy) ssᵀ / (sᵀy)² − (Hysᵀ + syᵀH) / (sᵀy),

where H = M⁻¹ and H̄ = M̄⁻¹. The update (115) is called the BFGS update and (116) the inverse BFGS update. The letters BFGS stand for Broyden, Fletcher, Goldfarb, and Shanno.

We have shown that, beginning with a symmetric positive definite matrix M_k, we can obtain a symmetric positive definite update M_{k+1} that satisfies the MSE M_{k+1}s^k = y^k by applying formula (115) whenever (s^k)ᵀy^k > 0. We must now address the question of how to choose x^{k+1} so that (s^k)ᵀy^k > 0. Recall that

    y^k = ∇f(x^{k+1}) − ∇f(x^k)  and  s^k = x^{k+1} − x^k = t_k d^k,

where

    d^k = −H_k ∇f(x^k)

is the matrix secant search direction and t_k is the stepsize. Hence

    (y^k)ᵀs^k = ∇f(x^{k+1})ᵀs^k − ∇f(x^k)ᵀs^k = t_k (∇f(x^k + t_k d^k)ᵀd^k − ∇f(x^k)ᵀd^k),

where d^k := −H_k ∇f(x^k). Since H_k is positive definite, the direction d^k is a descent direction for f at x^k, and so t_k > 0. Therefore, to ensure that (s^k)ᵀy^k > 0, we need only show that t_k > 0 can be chosen so that

(117)    ∇f(x^k + t_k d^k)ᵀd^k ≥ β ∇f(x^k)ᵀd^k

for some β ∈ (0, 1), since in this case

    ∇f(x^k + t_k d^k)ᵀd^k − ∇f(x^k)ᵀd^k ≥ (β − 1)∇f(x^k)ᵀd^k > 0.

But this is precisely the second condition in the weak Wolfe conditions with β = c₂. Hence a successful BFGS update can always be obtained. The BFGS update is currently considered the best matrix secant update for minimization.

BFGS Updating

    σ_k := ((s^k)ᵀy^k)^{1/2}
    ŝ^k := s^k / σ_k
    ŷ^k := y^k / σ_k
    H_{k+1} := H_k + (ŝ^k − H_k ŷ^k)(ŝ^k)ᵀ + ŝ^k (ŝ^k − H_k ŷ^k)ᵀ − ((ŝ^k − H_k ŷ^k)ᵀŷ^k) ŝ^k (ŝ^k)ᵀ
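The boxed inverse update can be sketched and checked directly: given H_k positive definite and a pair (s, y) satisfying the curvature condition sᵀy > 0, the updated matrix satisfies the MSE H_{k+1}y = s and remains symmetric positive definite. The data below are illustrative, not from the text:

```python
import numpy as np

# The boxed BFGS inverse update, written with the normalized vectors
# s_hat = s / sigma and y_hat = y / sigma, where sigma = sqrt(s^T y).

def bfgs_inverse_update(H, s, y):
    sigma = np.sqrt(s @ y)            # requires the curvature condition s^T y > 0
    s_hat, y_hat = s / sigma, y / sigma
    u = s_hat - H @ y_hat
    return (H + np.outer(u, s_hat) + np.outer(s_hat, u)
              - (u @ y_hat) * np.outer(s_hat, s_hat))

rng = np.random.default_rng(1)
H = np.eye(3)                          # symmetric positive definite start
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)   # y near s, so s^T y > 0

H_new = bfgs_inverse_update(H, s, y)

assert np.allclose(H_new @ y, s)              # matrix secant equation (108)
assert np.allclose(H_new, H_new.T)            # symmetry is preserved
assert np.all(np.linalg.eigvalsh(H_new) > 0)  # positive definiteness is preserved
```

In a line-search implementation, the weak Wolfe condition (117) guarantees that the pair (s^k, y^k) fed to this update always satisfies (s^k)ᵀy^k > 0.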