CHAPTER 8

Search Directions for Unconstrained Optimization

In this chapter we study the choice of search directions used in our basic updating scheme

x_{k+1} = x_k + t_k d_k

for solving

P :  min_{x ∈ R^n} f(x).

All of the search directions considered can be classified as Newton-like since they are all of the form d_k = −H_k ∇f(x_k), where H_k is a symmetric n × n matrix. If H_k = μ_k I for all k, the resulting search directions are scaled steepest descent directions with scale factors μ_k. More generally, we choose H_k to approximate ∇²f(x_k)^{−1} in order to approximate Newton's method for optimization. Newton's method is important since it possesses rapid local convergence properties and can be shown to be scale independent. We precede our discussion of search directions by making precise a useful notion of the speed, or rate, of convergence.

1. Rate of Convergence

We focus on notions of quotient rates of convergence, or Q-convergence rates. Let {x^ν} ⊂ R^n and x̄ ∈ R^n be such that x^ν → x̄. We say that x^ν → x̄ at a linear rate if

lim sup_{ν→∞} ‖x^{ν+1} − x̄‖ / ‖x^ν − x̄‖ < 1.

The convergence is said to be superlinear if this lim sup is 0. The convergence is said to be quadratic if

lim sup_{ν→∞} ‖x^{ν+1} − x̄‖ / ‖x^ν − x̄‖² < ∞.

For example, given γ ∈ (0, 1), the sequence {γ^ν} converges linearly to zero, but not superlinearly. The sequence {γ^{ν²}} converges superlinearly to 0, but not quadratically. Finally, the sequence {γ^{2^ν}} converges quadratically to zero. Superlinear convergence is much faster than linear convergence, but quadratic convergence is much, much faster than superlinear convergence.

2. Newton's Method for Solving Equations

Newton's method is an iterative scheme designed to solve nonlinear equations of the form

(89) g(x) = 0,

where g : R^n → R^n is assumed to be continuously differentiable. Many problems of importance can be posed in this way. In the context of the optimization problem P, we wish to locate critical points, that is, points at which ∇f(x) = 0. We begin our discussion of Newton's method in the usual context of equation solving.
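The three model sequences above are easy to check numerically. The following sketch (our own illustration, with γ = 1/2) computes the ratio tests that define each rate:

```python
# Numerical check of the three model sequences from the rate-of-convergence
# discussion: {gamma^nu} is linear, {gamma^(nu^2)} superlinear, and
# {gamma^(2^nu)} quadratic.  Here gamma = 1/2.

gamma = 0.5

linear    = [gamma ** nu        for nu in range(1, 11)]
superlin  = [gamma ** (nu * nu) for nu in range(1, 11)]
quadratic = [gamma ** (2 ** nu) for nu in range(1, 7)]

# Linear rate: the ratio |x_{nu+1}| / |x_nu| is constant at gamma < 1.
lin_ratios = [linear[i + 1] / linear[i] for i in range(9)]

# Superlinear rate: the same ratio tends to 0 (here it equals gamma^(2*nu+1)).
sup_ratios = [superlin[i + 1] / superlin[i] for i in range(9)]

# Quadratic rate: |x_{nu+1}| / |x_nu|^2 stays bounded (here identically 1).
quad_ratios = [quadratic[i + 1] / quadratic[i] ** 2 for i in range(5)]

print(lin_ratios[0], sup_ratios[0], sup_ratios[5], quad_ratios[0])
```

Note that for {γ^{ν²}} the quadratic ratio γ^{(ν+1)² − 2ν²} blows up as ν grows, which is exactly why that sequence is superlinear but not quadratic.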
Assume that the function g in (89) is continuously differentiable and that we have an approximate solution x⁰ ∈ R^n. We now wish to improve on this approximation. If x̄ is a solution to (89), then

0 = g(x̄) = g(x⁰) + g′(x⁰)(x̄ − x⁰) + o(‖x̄ − x⁰‖).

Thus, if x⁰ is close to x̄, it is reasonable to suppose that the solution to the linearized system

(90) 0 = g(x⁰) + g′(x⁰)(x − x⁰)

is even closer. This is Newton's method for finding the roots of the equation g(x) = 0. It has one obvious pitfall: equation (90) may not be consistent. That is, there may not exist a solution to (90).
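When g′ is invertible near the solution, the linearization (90) can be solved repeatedly, each solution becoming the next point of linearization. As a sketch, here is that idea applied to a small 2 × 2 system of our own choosing (not from the text), g(x, y) = (x² + y² − 1, x − y), whose solution in the positive quadrant is x = y = 1/√2:

```python
import math

# Repeatedly solving the linearized system (90) for the 2x2 system
#   g1(x, y) = x^2 + y^2 - 1,   g2(x, y) = x - y.
# The example system is our own choice, used only for illustration.

def g(v):
    x, y = v
    return [x * x + y * y - 1.0, x - y]

def g_prime(v):
    x, y = v
    return [[2.0 * x, 2.0 * y],   # Jacobian of g at v
            [1.0, -1.0]]

def newton_step(v):
    """Solve 0 = g(v) + g'(v)(w - v) for w (2x2 Cramer's rule)."""
    (a, b), (c, d) = g_prime(v)
    r1, r2 = g(v)
    det = a * d - b * c           # assumed nonzero near the root
    dx = (-r1 * d + b * r2) / det
    dy = (-a * r2 + c * r1) / det
    return [v[0] + dx, v[1] + dy]

v = [1.0, 0.5]                    # a nearby starting guess
for _ in range(6):
    v = newton_step(v)

root = 1.0 / math.sqrt(2.0)
print(v, [root, root])
```

Six iterations from the starting guess (1, 0.5) already agree with the true root to machine precision, a first glimpse of the quadratic convergence established below.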
For the sake of the present argument, we assume that g′(x⁰)^{−1} exists. Under this assumption (90) defines the iteration scheme

(91) x_{k+1} := x_k − [g′(x_k)]^{−1} g(x_k),

called the Newton iteration. The associated direction

(92) d_k := −[g′(x_k)]^{−1} g(x_k)

is called the Newton direction. We analyze the convergence behavior of this scheme under the additional assumption that only an approximation to g′(x_k)^{−1} is available. We denote this approximation by J_k. The resulting iteration scheme is

(93) x_{k+1} := x_k − J_k g(x_k).

Methods of this type are called Newton-like methods.

Theorem 2.1. Let g : R^n → R^n be differentiable, x⁰ ∈ R^n, and J₀ ∈ R^{n×n}. Suppose that there exist x̄ ∈ R^n and ε > 0 with ‖x⁰ − x̄‖ < ε such that
(1) g(x̄) = 0,
(2) g′(x)^{−1} exists for x ∈ B(x̄; ε) := {x ∈ R^n : ‖x − x̄‖ < ε} with sup{‖g′(x)^{−1}‖ : x ∈ B(x̄; ε)} ≤ M₁,
(3) g′ is Lipschitz continuous on cl B(x̄; ε) with Lipschitz constant L, and
(4) θ₀ := (LM₁/2)‖x⁰ − x̄‖ + M₀K < 1, where K ≥ ‖(g′(x⁰)^{−1} − J₀)y⁰‖, y⁰ := g(x⁰)/‖g(x⁰)‖, and M₀ = max{‖g′(x)‖ : x ∈ B(x̄; ε)}.
Further suppose that iteration (93) is initiated at x⁰ where the J_k's are chosen to satisfy one of the following conditions:
(i) ‖(g′(x_k)^{−1} − J_k)y_k‖ ≤ K,
(ii) ‖(g′(x_k)^{−1} − J_k)y_k‖ ≤ θ₁^k K for some θ₁ ∈ (0, 1),
(iii) ‖(g′(x_k)^{−1} − J_k)y_k‖ ≤ min{M₂‖x_k − x_{k−1}‖, K} for some M₂ > 0, or
(iv) ‖(g′(x_k)^{−1} − J_k)y_k‖ ≤ min{M₃‖g(x_k)‖, K} for some M₃ > 0,
where for each k = 1, 2, . . . , y_k := g(x_k)/‖g(x_k)‖.
These hypotheses on the accuracy of the approximations J_k yield the following conclusions about the rate of convergence of the iterates x_k.
(a) If (i) holds, then x_k → x̄ linearly.
(b) If (ii) holds, then x_k → x̄ superlinearly.
(c) If (iii) holds, then x_k → x̄ two-step quadratically.
(d) If (iv) holds, then x_k → x̄ quadratically.

Proof. We begin by inductively establishing the basic inequalities

(94) ‖x_{k+1} − x̄‖ ≤ (LM₁/2)‖x_k − x̄‖² + ‖(g′(x_k)^{−1} − J_k)g(x_k)‖

and

(95) ‖x_{k+1} − x̄‖ ≤ θ₀‖x_k − x̄‖,

as well as the inclusion

(96) x_{k+1} ∈ B(x̄; ε), for k = 0, 1, 2, . . . .
For k = 0 we have

x¹ − x̄ = x⁰ − x̄ − g′(x⁰)^{−1} g(x⁰) + [g′(x⁰)^{−1} − J₀] g(x⁰)
       = −g′(x⁰)^{−1} [ g(x̄) − (g(x⁰) + g′(x⁰)(x̄ − x⁰)) ] + [g′(x⁰)^{−1} − J₀] g(x⁰),
since g′(x⁰)^{−1} exists by the hypotheses. Consequently, hypotheses (1)–(4) plus the quadratic bound lemma imply that

‖x¹ − x̄‖ ≤ ‖g′(x⁰)^{−1}‖ ‖g(x̄) − g(x⁰) − g′(x⁰)(x̄ − x⁰)‖ + ‖(g′(x⁰)^{−1} − J₀) g(x⁰)‖
         ≤ (M₁L/2)‖x⁰ − x̄‖² + K‖g(x⁰) − g(x̄)‖
         ≤ (M₁L/2)‖x⁰ − x̄‖² + M₀K‖x⁰ − x̄‖
         ≤ θ₀‖x⁰ − x̄‖ < ε,

whereby (94)–(96) are established for k = 0.

Next suppose that (94)–(96) hold for k = 0, 1, . . . , s − 1. We show that (94)–(96) hold at k = s. Since x_s ∈ B(x̄; ε) and hypotheses (2)–(4) hold at x_s, one can proceed exactly as in the case k = 0 to obtain (94). Now if any one of (i)–(iv) holds, then (i) holds. Thus, by (94), we find that

‖x_{s+1} − x̄‖ ≤ (M₁L/2)‖x_s − x̄‖² + ‖(g′(x_s)^{−1} − J_s)g(x_s)‖
             ≤ [ (M₁L/2)θ₀^s ‖x⁰ − x̄‖ + M₀K ] ‖x_s − x̄‖
             ≤ [ (M₁L/2)‖x⁰ − x̄‖ + M₀K ] ‖x_s − x̄‖
             = θ₀‖x_s − x̄‖.

Hence ‖x_{s+1} − x̄‖ ≤ θ₀‖x_s − x̄‖ ≤ θ₀ε < ε, and so x_{s+1} ∈ B(x̄; ε).

We now proceed to establish (a)–(d).

(a) This clearly holds since the induction above established that ‖x_{k+1} − x̄‖ ≤ θ₀‖x_k − x̄‖ with θ₀ < 1.

(b) From (94), we have

‖x_{k+1} − x̄‖ ≤ (LM₁/2)‖x_k − x̄‖² + ‖(g′(x_k)^{−1} − J_k)g(x_k)‖
             ≤ (LM₁/2)‖x_k − x̄‖² + θ₁^k K ‖g(x_k)‖
             ≤ [ (LM₁/2)θ₀^k ‖x⁰ − x̄‖ + θ₁^k M₀K ] ‖x_k − x̄‖.

Hence x_k → x̄ superlinearly.

(c) From (94) and the fact that x_k → x̄, we eventually have

‖x_{k+1} − x̄‖ ≤ (LM₁/2)‖x_k − x̄‖² + ‖(g′(x_k)^{−1} − J_k)g(x_k)‖
             ≤ (LM₁/2)‖x_k − x̄‖² + M₂‖x_k − x_{k−1}‖ ‖g(x_k)‖
             ≤ [ (LM₁/2)‖x_k − x̄‖ + M₀M₂( ‖x_k − x̄‖ + ‖x_{k−1} − x̄‖ ) ] ‖x_k − x̄‖
             ≤ [ (LM₁/2)θ₀‖x_{k−1} − x̄‖ + M₀M₂(1 + θ₀)‖x_{k−1} − x̄‖ ] θ₀‖x_{k−1} − x̄‖
             = θ₀ [ (LM₁/2)θ₀ + M₀M₂(1 + θ₀) ] ‖x_{k−1} − x̄‖².

Hence x_k → x̄ two-step quadratically.
(d) Again by (94) and the fact that x_k → x̄, we eventually have

‖x_{k+1} − x̄‖ ≤ (LM₁/2)‖x_k − x̄‖² + ‖(g′(x_k)^{−1} − J_k)g(x_k)‖
             ≤ (LM₁/2)‖x_k − x̄‖² + M₃‖g(x_k)‖²
             ≤ [ (LM₁/2) + M₃M₀² ] ‖x_k − x̄‖².

Hence x_k → x̄ quadratically. □

Note that the conditions required of the approximations J_k to the inverse Jacobian matrices g′(x_k)^{−1} given in (i)–(iv) do not imply that J_k → g′(x̄)^{−1}. The stronger conditions

(i′) ‖g′(x_k)^{−1} − J_k‖ ≤ ‖g′(x⁰)^{−1} − J₀‖,
(ii′) ‖g′(x_{k+1})^{−1} − J_{k+1}‖ ≤ θ₁‖g′(x_k)^{−1} − J_k‖ for some θ₁ ∈ (0, 1),
(iii′) ‖g′(x_k)^{−1} − J_k‖ ≤ min{M₂‖x_k − x_{k−1}‖, ‖g′(x⁰)^{−1} − J₀‖} for some M₂ > 0, or
(iv′) g′(x_k)^{−1} = J_k,

which imply the conditions (i) through (iv) of Theorem 2.1 respectively, all imply the convergence of the inverse Jacobian approximates to g′(x̄)^{−1}. The conditions (i′)–(iv′) are less desirable since they require greater expense and care in the construction of the inverse Jacobian approximates.

3. Newton's Method for Minimization

We now translate the results of the previous section to the optimization setting. The underlying problem is

P :  min_{x ∈ R^n} f(x).

The Newton-like iterations take the form

x_{k+1} = x_k − H_k ∇f(x_k),

where H_k is an approximation to the inverse of the Hessian matrix ∇²f(x_k).

Theorem 3.1. Let f : R^n → R be twice continuously differentiable, x⁰ ∈ R^n, and H₀ ∈ R^{n×n}. Suppose that
(1) there exists x̄ ∈ R^n and ε > ‖x⁰ − x̄‖ such that f(x̄) ≤ f(x) whenever ‖x − x̄‖ ≤ ε,
(2) there is a δ > 0 such that δ‖z‖² ≤ zᵀ∇²f(x)z for all z ∈ R^n and x ∈ B(x̄; ε),
(3) ∇²f is Lipschitz continuous on cl B(x̄; ε) with Lipschitz constant L, and
(4) θ₀ := (L/(2δ))‖x⁰ − x̄‖ + M₀K < 1, where M₀ > 0 satisfies zᵀ∇²f(x)z ≤ M₀‖z‖² for all z ∈ R^n and x ∈ B(x̄; ε), and K ≥ ‖(∇²f(x⁰)^{−1} − H₀)y⁰‖ with y⁰ = ∇f(x⁰)/‖∇f(x⁰)‖.
Further suppose that the iteration

(97) x_{k+1} := x_k − H_k ∇f(x_k)

is initiated at x⁰ where the H_k's are chosen to satisfy one of the following conditions:
(i) ‖(∇²f(x_k)^{−1} − H_k)y_k‖ ≤ K,
(ii) ‖(∇²f(x_k)^{−1} − H_k)y_k‖ ≤ θ₁^k K for some θ₁ ∈ (0, 1),
(iii) ‖(∇²f(x_k)^{−1} − H_k)y_k‖ ≤ min{M₂‖x_k − x_{k−1}‖, K} for some M₂ > 0, or
(iv) ‖(∇²f(x_k)^{−1} − H_k)y_k‖ ≤ min{M₃‖∇f(x_k)‖, K} for some M₃ > 0,
where for each k = 1, 2, . . . , y_k := ∇f(x_k)/‖∇f(x_k)‖.
These hypotheses on the accuracy of the approximations H_k yield the following conclusions about the rate of convergence of the iterates x_k.
(a) If (i) holds, then x_k → x̄ linearly.
(b) If (ii) holds, then x_k → x̄ superlinearly.
(c) If (iii) holds, then x_k → x̄ two-step quadratically.
(d) If (iv) holds, then x_k → x̄ quadratically.
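The contrast between conclusions (a) and (d) is easy to observe numerically. The sketch below runs iteration (97) on a simple strongly convex function of our own choosing, f(x) = eˣ − 2x with minimizer x̄ = ln 2, once with H_k = f″(x_k)^{−1} exactly (Newton's method, quadratic rate) and once with H_k frozen at its initial value (a "chord" method consistent with condition (i), linear rate):

```python
import math

# Two choices of H_k in the Newton-like iteration (97) on the strongly
# convex 1-d function f(x) = exp(x) - 2x (our own example), minimized at
# x = ln 2.  Exact H_k = f''(x_k)^{-1} gives Newton's method; freezing
# H_k = f''(x_0)^{-1} for all k gives only linear convergence.

fp  = lambda x: math.exp(x) - 2.0      # f'(x)
fpp = lambda x: math.exp(x)            # f''(x)

xstar = math.log(2.0)

def run(update, x0, iters):
    """Iterate x <- x - H(x) f'(x) and record the errors |x - xstar|."""
    errs, x = [], x0
    for _ in range(iters):
        x = x - update(x) * fp(x)
        errs.append(abs(x - xstar))
    return errs

newton_errs = run(lambda x: 1.0 / fpp(x), 1.5, 6)     # exact inverse Hessian
chord_errs  = run(lambda x: 1.0 / fpp(1.5), 1.5, 6)   # H frozen at x0

print(newton_errs)   # collapses quadratically
print(chord_errs)    # shrinks by a roughly constant factor per step
```

After six steps the Newton errors are at machine-precision level while the frozen-H errors have only decreased by a modest constant factor per iteration.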
To more fully understand the convergence behavior described in this theorem, let us examine the nature of the controlling parameters L, δ, and M₀. Since L is a Lipschitz constant for ∇²f, it loosely corresponds to a bound on the third-order behavior of f. Thus the assumptions for convergence make implicit demands on the third derivative. The constant δ is a local lower bound on the eigenvalues of ∇²f near x̄. That is, f behaves locally as if it were a strongly convex function (see exercises) with modulus δ. Finally, M₀ can be interpreted as a local Lipschitz constant for ∇f and only plays a role when ∇²f(x_k)^{−1} is approximated inexactly by the H_k's.

We now illustrate the performance differences between the method of steepest descent and Newton's method on a simple one dimensional problem. Let f(x) = x² + eˣ. Clearly, f is a strongly convex function with

f′(x) = 2x + eˣ,   f″(x) = 2 + eˣ > 2,   f‴(x) = eˣ.

If we apply the steepest descent algorithm with backtracking (γ = 1/2, c = 0.01) initiated at x⁰ = 1, we get the following table:

  k     x_k          f(x_k)       f′(x_k)      s_k
  0     1            3.7182818     4.7182818    0
  1     0            1             1            0
  2    −0.5          0.8565307    −0.3934693    1
  3    −0.25         0.8413008     0.2788008    2
  4    −0.375        0.8279143    −0.0627107    3
  5    −0.34075      0.8273473     0.0297367    5
  6    −0.356375     0.8272131    −0.0125402    6
  7    −0.348565     0.8271976     0.0085768    7
  8    −0.3524688    0.8271848    −0.0019872    8
  9    −0.35149      0.8271841     0.0006588   10
 10    −0.3517364    0.827184     −0.0000072   12

If we apply Newton's method from the same starting point and take a unit step at each iteration, we obtain a dramatically different table:

  k     x_k          f′(x_k)
  0     1            4.7182818
  1     0            1
  2    −1/3          0.0498646
  3    −0.3516893    0.00012
  4    −0.3517337    0.00000000064

In addition, one more iteration gives |f′(x₅)| ≈ 10⁻²⁰. This is a stunning improvement in performance and shows why one always uses Newton's method (or an approximation to it) whenever possible. Our next objective is to develop numerically viable methods for approximating Jacobians and Hessians in Newton-like methods.

4. Matrix Secant Methods

Let us return to the problem of finding x̄ ∈ R^n such that g(x̄) = 0, where g : R^n → R^n is continuously differentiable.
In this section we consider Newton-like methods of a special type. Recall that in a Newton-like method the iteration scheme takes the form

(98) x_{k+1} := x_k − J_k g(x_k),

where J_k is meant to approximate the inverse of g′(x_k). In the one dimensional case, a method proposed by the Babylonians 3700 years ago is of particular significance. Today we call it the secant method:

(99) J_k = (x_k − x_{k−1}) / (g(x_k) − g(x_{k−1})).
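As a quick illustration, here is the secant iteration (98)–(99) applied to a scalar equation of our own choosing, g(x) = x³ − 2, whose root is 2^{1/3}:

```python
# The scalar secant iteration (98)-(99) for g(x) = x^3 - 2 = 0
# (our own example); the root is 2^(1/3).

g = lambda x: x ** 3 - 2.0

x_prev, x = 1.0, 2.0                        # two starting points
for _ in range(8):
    if g(x) == g(x_prev):                   # guard once the iterates coincide
        break
    J = (x - x_prev) / (g(x) - g(x_prev))   # the secant approximation (99)
    x_prev, x = x, x - J * g(x)

print(x, 2.0 ** (1.0 / 3.0))
```

Eight iterations from the bracketing pair (1, 2) suffice for full double-precision accuracy, and no derivative of g is ever evaluated.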
With this approximation one has

g′(x_k)^{−1} − J_k = [ g(x_{k−1}) − (g(x_k) + g′(x_k)(x_{k−1} − x_k)) ] / ( g′(x_k)[g(x_{k−1}) − g(x_k)] ).

Near a point x̄ at which g′(x̄) ≠ 0, one can use the mean value theorem to show that there exists an α > 0 such that α|x − y| ≤ |g(x) − g(y)| for x and y sufficiently close to x̄. Consequently, by the Quadratic Bound Lemma,

|g′(x_k)^{−1} − J_k| ≤ (L/2)|x_{k−1} − x_k|² / ( α|g′(x_k)| |x_{k−1} − x_k| ) ≤ K|x_{k−1} − x_k|

for some constant K > 0 whenever x_k and x_{k−1} are sufficiently close to x̄. Therefore, by our convergence theorem for Newton-like methods, the secant method is locally two-step quadratically convergent to a non-singular solution of the equation g(x) = 0. An additional advantage of this approach is that no extra function evaluations are required to obtain the approximation J_k.

4.0.2. Matrix Secant Methods for Equations. Unfortunately, the secant approximation (99) is meaningless if the dimension n is greater than 1 since division by vectors is undefined. But this can be rectified by multiplying (99) on the right by (g(x_k) − g(x_{k−1})) and writing

(100) J_k (g(x_k) − g(x_{k−1})) = x_k − x_{k−1}.

Equation (100) is called the quasi-Newton equation (QNE), or matrix secant equation (MSE), at x_k. Here the matrix J_k is unknown, but is required to satisfy the n linear equations of the MSE. These equations determine an (n² − n)-dimensional affine manifold in R^{n×n}. Since J_k contains n² unknowns, the n linear equations in (100) are not sufficient to uniquely determine J_k. To nail down a specific J_k, further conditions on the update J_k must be given. What conditions should these be?

To develop sensible conditions on J_k, let us consider an overall iteration scheme based on (98). For convenience, let us denote J_k^{−1} by B_k (i.e., B_k = J_k^{−1}). Using the B_k's, the MSE (100) becomes

(101) B_k(x_k − x_{k−1}) = g(x_k) − g(x_{k−1}).

At every iteration we have (x_k, B_k) and compute x_{k+1} by (98). Then B_{k+1} is constructed to satisfy (101). If B_k is close to g′(x_k) and x_{k+1} is close to x_k, then B_{k+1} should be chosen not only to satisfy (101) but also to be as close to B_k as possible.
With this in mind, we must now decide what we mean by close. From a computational perspective, we prefer close to mean easy to compute. That is, B_{k+1} should be algebraically close to B_k in the sense that B_{k+1} is only a rank 1 modification of B_k. Since we are assuming that B_{k+1} is a rank 1 modification of B_k, there are vectors u, v ∈ R^n such that

(102) B_{k+1} = B_k + uvᵀ.

We now use the matrix secant equation (101) to derive conditions on the choice of u and v. In this setting, the MSE becomes B_{k+1} s_k = y_k, where s_k := x_{k+1} − x_k and y_k := g(x_{k+1}) − g(x_k). Applying (102) to s_k gives

y_k = B_{k+1} s_k = B_k s_k + uvᵀs_k.

Hence, if vᵀs_k ≠ 0, we obtain u = (y_k − B_k s_k)/(vᵀs_k) and

(103) B_{k+1} = B_k + (y_k − B_k s_k)vᵀ / (vᵀs_k).

Equation (103) determines a whole class of rank one updates that satisfy the MSE, where one is allowed to choose any v ∈ R^n as long as vᵀs_k ≠ 0. If s_k ≠ 0, then an obvious choice for v is s_k, yielding the update

(104) B_{k+1} = B_k + (y_k − B_k s_k)s_kᵀ / (s_kᵀ s_k).
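The update (104) can be embedded directly in iteration (98), written in terms of B_k = J_k^{−1}: solve B_k s_k = −g(x_k), step, then apply the rank one correction. A minimal sketch on a 2 × 2 test system of our own choosing (g(x, y) = (x² + y² − 1, x − y), solution x = y = 1/√2), with B₀ taken to be the exact Jacobian at the starting point:

```python
import math

# Iteration (98) maintained through B_k = J_k^{-1} together with the
# rank one update (104).  The 2x2 test system is our own choice; B_0 is
# set to g'(x_0) and is never recomputed afterwards.

def g(v):
    return [v[0] ** 2 + v[1] ** 2 - 1.0, v[0] - v[1]]

def solve2(B, r):
    """Solve the 2x2 system B s = r by Cramer's rule."""
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    return [(r[0] * B[1][1] - B[0][1] * r[1]) / det,
            (B[0][0] * r[1] - r[0] * B[1][0]) / det]

x = [1.0, 0.5]
B = [[2.0, 1.0], [1.0, -1.0]]                    # B_0 = g'(x_0)

for _ in range(25):
    gx = g(x)
    if abs(gx[0]) + abs(gx[1]) < 1e-13:          # converged
        break
    s = solve2(B, [-gx[0], -gx[1]])              # B_k s_k = -g(x_k)
    xn = [x[0] + s[0], x[1] + s[1]]
    gn = g(xn)
    y = [gn[0] - gx[0], gn[1] - gx[1]]           # y_k = g(x_{k+1}) - g(x_k)
    sts = s[0] * s[0] + s[1] * s[1]
    Bs = [B[0][0] * s[0] + B[0][1] * s[1],
          B[1][0] * s[0] + B[1][1] * s[1]]
    for i in range(2):
        for j in range(2):                       # rank one update (104)
            B[i][j] += (y[i] - Bs[i]) * s[j] / sts
    x = xn

print(x, [1.0 / math.sqrt(2.0)] * 2)
```

Only one evaluation of g per iteration is needed after the start, and no Jacobians are recomputed; the iterates still converge rapidly to the root.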
This is known as Broyden's update. It turns out that the Broyden update is also analytically close.

Theorem 4.1. Let A ∈ R^{n×n} and s, y ∈ R^n with s ≠ 0. Then for any matrix norms ‖·‖ and ‖·‖′ such that

‖AB‖ ≤ ‖A‖ ‖B‖′   and   ‖ vvᵀ/(vᵀv) ‖′ ≤ 1,

the solution to

(105) min{ ‖B − A‖ : Bs = y }

is

(106) A⁺ = A + (y − As)sᵀ / (sᵀs).

In particular, (106) solves (105) when ‖·‖ is the ℓ₂ matrix norm, and (106) solves (105) uniquely when ‖·‖ is the Frobenius norm.

Proof. Let B ∈ {B ∈ R^{n×n} : Bs = y}. Then

‖A⁺ − A‖ = ‖ (y − As)sᵀ/(sᵀs) ‖ = ‖ (B − A)(ssᵀ/(sᵀs)) ‖ ≤ ‖B − A‖ ‖ ssᵀ/(sᵀs) ‖′ ≤ ‖B − A‖.

Note that if ‖·‖′ = ‖·‖₂, then

‖ vvᵀ/(vᵀv) ‖₂ = sup{ ‖(vvᵀ/(vᵀv))x‖₂ : ‖x‖₂ = 1 } = sup{ |vᵀx| ‖v‖₂/(vᵀv) : ‖x‖₂ = 1 } = 1,

so that the conclusion of the result is not vacuous. For uniqueness, observe that the Frobenius norm is strictly convex and the constraint set {B : Bs = y} is affine. □

Therefore, the Broyden update (104) is both algebraically and analytically close to B_k. These properties indicate that it should perform well in practice, and indeed it does.

Algorithm: Broyden's Method

Initialization: x⁰ ∈ R^n, B₀ ∈ R^{n×n}.
Having (x_k, B_k), compute (x_{k+1}, B_{k+1}) as follows:
Solve B_k s_k = −g(x_k) for s_k and set
  x_{k+1} := x_k + s_k
  y_k := g(x_{k+1}) − g(x_k)
  B_{k+1} := B_k + (y_k − B_k s_k)s_kᵀ / (s_kᵀ s_k).

We would prefer to write the Broyden update in terms of the matrices J_k = B_k^{−1} so that we can write the step computation as s_k = −J_k g(x_k), avoiding the need to solve an equation. To obtain the formula for J_k we use the following important lemma for matrix inversion.

Lemma 4.1 (Sherman–Morrison–Woodbury). Suppose A ∈ R^{n×n} and U, V ∈ R^{n×k} are such that both A^{−1} and (I + VᵀA^{−1}U)^{−1} exist. Then

(A + UVᵀ)^{−1} = A^{−1} − A^{−1}U(I + VᵀA^{−1}U)^{−1}VᵀA^{−1}.
The above lemma verifies that if B_k^{−1} = J_k exists and s_kᵀJ_k y_k = s_kᵀB_k^{−1}y_k ≠ 0, then

(107) J_{k+1} = [ B_k + (y_k − B_k s_k)s_kᵀ / (s_kᵀs_k) ]^{−1}
             = B_k^{−1} + (s_k − B_k^{−1}y_k)s_kᵀB_k^{−1} / (s_kᵀB_k^{−1}y_k)
             = J_k + (s_k − J_k y_k)s_kᵀJ_k / (s_kᵀJ_k y_k).

In this case, it is possible to directly update the inverses J_k. It should be cautioned, though, that this process can become numerically unstable if s_kᵀJ_k y_k is small. Therefore, in practice, the value s_kᵀJ_k y_k must be monitored to avoid numerical instability.

Although we do not pause to establish the convergence rates here, we do give the following result due to Dennis and Moré (1974).

Theorem 4.2. Let g : R^n → R^n be continuously differentiable in an open convex set D ⊂ R^n. Assume that there exist x̄ ∈ R^n and r, β > 0 such that x̄ + rB ⊂ D, g(x̄) = 0, g′(x̄)^{−1} exists with ‖g′(x̄)^{−1}‖ ≤ β, and g′ is Lipschitz continuous on x̄ + rB with Lipschitz constant γ > 0. Then there exist positive constants ε and δ such that if ‖x⁰ − x̄‖ ≤ ε and ‖B₀ − g′(x⁰)‖ ≤ δ, then the sequence {x_k} generated by the iteration

x_{k+1} := x_k + s_k, where s_k solves 0 = g(x_k) + B_k s_k,
B_{k+1} := B_k + (y_k − B_k s_k)s_kᵀ / (s_kᵀs_k), where y_k = g(x_{k+1}) − g(x_k),

is well-defined with x_k → x̄ superlinearly.

4.0.3. Matrix Secant Methods for Minimization. We now extend these matrix secant ideas to optimization, specifically minimization. The underlying problem we consider is

P : minimize_{x ∈ R^n} f(x),

where f : R^n → R is assumed to be twice continuously differentiable. In this setting, we wish to solve the equation ∇f(x) = 0, and the MSE (101) becomes

(108) H_{k+1} y_k = s_k, where s_k := x_{k+1} − x_k and y_k := ∇f(x_{k+1}) − ∇f(x_k).

Here the matrix H_k is intended to be an approximation to the inverse of the Hessian matrix ∇²f(x_k). Writing M_k = H_k^{−1}, a straightforward application of Broyden's method gives the update

M_{k+1} = M_k + (y_k − M_k s_k)s_kᵀ / (s_kᵀs_k).

However, this is unsatisfactory for two reasons:

(1) Since M_k approximates ∇²f(x_k), it must be symmetric.
(2) Since we are minimizing, M_k must be positive definite to ensure that s_k = −M_k^{−1}∇f(x_k) is a direction of descent for f at x_k.

To address problem (1) above, one could return to equation (103) and find an update that preserves symmetry. Such an update is uniquely obtained by setting v = (y_k − M_k s_k), giving

M_{k+1} = M_k + (y_k − M_k s_k)(y_k − M_k s_k)ᵀ / ( (y_k − M_k s_k)ᵀ s_k ).

This is called the symmetric rank 1 update, or SR1. Although this update can on occasion exhibit problems with numerical stability, it has recently received a great deal of renewed interest. The stability problems occur whenever

(109) vᵀs_k = (y_k − M_k s_k)ᵀ s_k

has small magnitude. The inverse SR1 update is given by

H_{k+1} = H_k + (s_k − H_k y_k)(s_k − H_k y_k)ᵀ / ( (s_k − H_k y_k)ᵀ y_k ),

which exists whenever (s_k − H_k y_k)ᵀ y_k ≠ 0.

We now approach the question of how to update M_k in a way that addresses both the issue of symmetry and positive definiteness while still using the Broyden updating ideas. Given a symmetric positive definite matrix M and two vectors s and y, our goal is to find a symmetric positive definite matrix M̄ such that M̄s = y. Since M is symmetric and positive definite, there is a non-singular n × n matrix L such that M = LLᵀ. Indeed, L can be
chosen to be the lower triangular Cholesky factor of M. If M̄ is also symmetric and positive definite, then there is a matrix J ∈ R^{n×n} such that M̄ = JJᵀ. The MSE M̄s = y implies that if

(110) Jᵀs = v,

then

(111) Jv = y.

Let us apply the Broyden update technique to (111), J, and L. That is, suppose that

(112) J = L + (y − Lv)vᵀ / (vᵀv).

Then by (110),

(113) v = Jᵀs = Lᵀs + v(y − Lv)ᵀs / (vᵀv).

This expression implies that v must have the form v = αLᵀs for some α ∈ R. Substituting this back into (113), we get

αLᵀs = Lᵀs + αLᵀs (y − αLLᵀs)ᵀ s / ( α² sᵀLLᵀs ).

Hence

(114) α² = sᵀy / sᵀMs.

Consequently, such a matrix J satisfying (113) exists only if sᵀy > 0, in which case

J = L + (y − αMs)sᵀL / (α sᵀMs),   with   α = [ sᵀy / sᵀMs ]^{1/2},

yielding

(115) M̄ = M + yyᵀ/(yᵀs) − MssᵀM/(sᵀMs).

Moreover, the Cholesky factorization of M̄ can be obtained directly from the matrix J. Specifically, if the QR factorization of Jᵀ is Jᵀ = QR, we can set L̄ := Rᵀ, yielding M̄ = JJᵀ = RᵀQᵀQR = L̄L̄ᵀ. The formula for updating the inverses is again given by applying the Sherman–Morrison–Woodbury formula to obtain

(116) H̄ = H + ( (s + Hy)ᵀy ) ssᵀ / (sᵀy)² − ( Hysᵀ + syᵀH ) / (sᵀy),

where H = M^{−1}. The update (115) is called the BFGS update and (116) the inverse BFGS update. The letters BFGS stand for Broyden, Fletcher, Goldfarb, and Shanno.

We have shown that, beginning with a symmetric positive definite matrix M_k, we can obtain a symmetric and positive definite update M_{k+1} that satisfies the MSE M_{k+1}s_k = y_k by applying the formula (115) whenever s_kᵀy_k > 0. We must now address the question of how to choose x_{k+1} so that s_kᵀy_k > 0. Recall that

y_k = ∇f(x_{k+1}) − ∇f(x_k)   and   s_k = x_{k+1} − x_k = t_k d_k,

where d_k = −H_k ∇f(x_k)
is the matrix secant search direction and t_k is the stepsize. Hence

y_kᵀs_k = ∇f(x_{k+1})ᵀs_k − ∇f(x_k)ᵀs_k = t_k ( ∇f(x_k + t_k d_k)ᵀd_k − ∇f(x_k)ᵀd_k ),

where d_k := −H_k ∇f(x_k). Since H_k is positive definite, the direction d_k is a descent direction for f at x_k, and so t_k > 0. Therefore, to ensure that s_kᵀy_k > 0, we need only show that t_k > 0 can be chosen so that

(117) ∇f(x_k + t_k d_k)ᵀd_k ≥ β ∇f(x_k)ᵀd_k

for some β ∈ (0, 1), since in this case

∇f(x_k + t_k d_k)ᵀd_k − ∇f(x_k)ᵀd_k ≥ (β − 1)∇f(x_k)ᵀd_k > 0.

But this is precisely the second condition in the weak Wolfe conditions with β = c₂. Hence a successful BFGS update can always be obtained. The BFGS update is currently considered the best matrix secant update for minimization.

BFGS Updating

σ_k := (s_kᵀ y_k)^{1/2}
ŝ_k := s_k / σ_k
ŷ_k := y_k / σ_k
H_{k+1} := H_k + (ŝ_k − H_k ŷ_k)ŝ_kᵀ + ŝ_k(ŝ_k − H_k ŷ_k)ᵀ − [ (ŝ_k − H_k ŷ_k)ᵀŷ_k ] ŝ_k ŝ_kᵀ.
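Putting the pieces together, the following sketch (our own illustration, not code from the text) combines the inverse BFGS update (116) with a line search enforcing the weak Wolfe conditions, so that s_kᵀy_k > 0 holds at every step; the test function f(x, y) = (x − 1)² + 2(y + 2)², with minimizer (1, −2), is our own choice:

```python
# Inverse BFGS update (116) driven by a weak Wolfe line search.  The
# search direction is d_k = -H_k grad f(x_k); the line search enforces
# sufficient decrease and the curvature condition (117), which guarantees
# s_k^T y_k > 0.  The quadratic test function below is hypothetical.

def f(v):
    return (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 2.0) ** 2

def grad(v):
    return [2.0 * (v[0] - 1.0), 4.0 * (v[1] + 2.0)]

def weak_wolfe(x, d, c1=1e-4, c2=0.9):
    """Bisection search for a stepsize satisfying the weak Wolfe conditions."""
    g0 = grad(x)
    gd = g0[0] * d[0] + g0[1] * d[1]              # directional derivative < 0
    t, lo, hi = 1.0, 0.0, float("inf")
    for _ in range(100):
        xn = [x[0] + t * d[0], x[1] + t * d[1]]
        gn = grad(xn)
        if f(xn) > f(x) + c1 * t * gd:            # sufficient decrease fails
            hi = t
            t = 0.5 * (lo + hi)
        elif gn[0] * d[0] + gn[1] * d[1] < c2 * gd:   # curvature (117) fails
            lo = t
            t = 2.0 * t if hi == float("inf") else 0.5 * (lo + hi)
        else:
            break
    return t, xn, gn

x = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]                      # H_0 = I
g = grad(x)
for _ in range(30):
    if abs(g[0]) + abs(g[1]) < 1e-12:
        break
    d = [-(H[0][0] * g[0] + H[0][1] * g[1]),
         -(H[1][0] * g[0] + H[1][1] * g[1])]
    t, xn, gn = weak_wolfe(x, d)
    s = [xn[0] - x[0], xn[1] - x[1]]
    y = [gn[0] - g[0], gn[1] - g[1]]
    sy = s[0] * y[0] + s[1] * y[1]                # > 0 by the Wolfe conditions
    Hy = [H[0][0] * y[0] + H[0][1] * y[1],
          H[1][0] * y[0] + H[1][1] * y[1]]
    shy = (s[0] + Hy[0]) * y[0] + (s[1] + Hy[1]) * y[1]
    for i in range(2):
        for j in range(2):                        # inverse BFGS update (116)
            H[i][j] += shy * s[i] * s[j] / sy ** 2 \
                       - (Hy[i] * s[j] + s[i] * Hy[j]) / sy
    x, g = xn, gn

print(x)   # close to the minimizer (1, -2)
```

Because sy > 0 at every accepted step, H stays symmetric positive definite, and the iterates reach the minimizer in a handful of steps; on non-quadratic functions the same loop exhibits the superlinear behavior described above.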