Quasi-Newton Methods

Werner C. Rheinboldt

These are excerpts of material relating to the books [OR00] and [Rhe98] and of write-ups prepared for courses held at the University of Pittsburgh. Some further references are [Kel95], [Kel99], [DS98].

1 Broyden's Method

Let

    F(x) = 0,  F : R^n → R^n,                                                (1)

be a given system of nonlinear equations defined by a sufficiently smooth function F. A linearization method for the numerical solution of (1) has the general form

    x^{k+1} = x^k − B_k^{−1} F(x^k),  k = 0, 1, ...,                         (2)

where the n × n matrices B_k are suitably chosen. The integral mean value theorem for F states that

    [ ∫_0^1 DF(x + t(y − x)) dt ] (y − x) = F(y) − F(x),  x, y ∈ R^n.

The matrix in brackets can be interpreted as the average of the Jacobian matrix on the line segment between the points x and y. This suggests requiring the matrices B_k to satisfy the so-called quasi-Newton condition

    B_{k+1}(x^{k+1} − x^k) = F(x^{k+1}) − F(x^k).

In several papers around 1967, C. G. Broyden suggested that it is numerically advantageous to choose the matrices in (2) such that rank(B_{k+1} − B_k) is small. This led to the development of the so-called quasi-Newton methods, which can be characterized by the following three properties:

    (a) B_k(x^{k+1} − x^k) + F(x^k) = 0,
    (b) B_{k+1}(x^{k+1} − x^k) = F(x^{k+1}) − F(x^k),                        (3)
    (c) B_{k+1} = B_k + ΔB_k,  rank ΔB_k = m,  k = 0, 1, ....

Up to now only the values m = 1 or m = 2 have been used in the design of quasi-Newton methods. From (3) we obtain some frequently used relations

    (a) (B_{k+1} − B_k) s^k = F(x^{k+1}),  s^k = x^{k+1} − x^k,
    (b) F(x^{k+1}) = y^k − B_k s^k,  y^k = F(x^{k+1}) − F(x^k).              (4)
C. G. Broyden himself developed two quasi-Newton methods with m = 1 and called one of them his "good" method. This terminology has persisted. The good method uses

    B_{k+1} := B_k + F(x^{k+1})(s^k)^T / ((s^k)^T s^k),                      (5)

or, in view of (4)(b),

    B_{k+1} := B_k + (y^k − B_k s^k)(s^k)^T / ((s^k)^T s^k).                 (6)

As for all standard linearization methods, the matrices B_k should be invertible. Recall the well-known Sherman–Morrison formula:

1.1. For u, v ∈ R^n the matrix I + uv^T is invertible if and only if 1 + v^T u ≠ 0, and in that case

    (I + uv^T)^{−1} = I − uv^T / (1 + v^T u).

If in the Broyden method the matrix B_k is nonsingular, then 1.1 shows that

    B_{k+1} = B_k [ I + (B_k^{−1} F(x^{k+1}))(s^k)^T / ((s^k)^T s^k) ]       (7)

is again nonsingular, provided that

    (s^k)^T s^k + (B_k^{−1} F(x^{k+1}))^T s^k ≠ 0,

in which case the inverse is

    B_{k+1}^{−1} = [ I − (B_k^{−1} F(x^{k+1}))(s^k)^T / ((s^k)^T s^k + (B_k^{−1} F(x^{k+1}))^T s^k) ] B_k^{−1}.   (8)

With H_k = B_k^{−1} and H_{k+1} = B_{k+1}^{−1} this can be written in the form

    H_{k+1} = H_k + (s^k − H_k y^k)(s^k)^T H_k / ((s^k)^T H_k y^k).          (9)

Various convergence results for Broyden's method have been proved. We refer to the cited references and cite only a simplified version of such a result:

1.2. Let F : Ω ⊂ R^n → R^n be continuously differentiable on an open set Ω. Suppose that x* ∈ Ω is a solution of F(x) = 0 where DF(x*) is invertible and

    ‖DF(x) − DF(x*)‖ ≤ γ ‖x − x*‖,  x ∈ Ω.

Then there exist δ, η > 0 such that for ‖x^0 − x*‖ < δ and ‖B_0 − DF(x*)‖ < η Broyden's method converges to x* and the rate of convergence is superlinear in the sense that

    lim_{k→∞} ‖x^{k+1} − x*‖ / ‖x^k − x*‖ = 0.                               (10)
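The Sherman–Morrison formula 1.1 is easy to confirm numerically. The following short sketch (ours, not part of the original notes) checks the identity for a random pair u, v:

```python
import numpy as np

# Numerical check of the Sherman-Morrison formula 1.1:
# (I + u v^T)^{-1} = I - u v^T / (1 + v^T u), valid whenever 1 + v^T u != 0.
rng = np.random.default_rng(0)
n = 5
u = rng.standard_normal(n)
v = rng.standard_normal(n)
assert 1.0 + v @ u != 0.0            # invertibility condition of 1.1

A = np.eye(n) + np.outer(u, v)
A_inv = np.eye(n) - np.outer(u, v) / (1.0 + v @ u)

# The product should be the identity up to rounding error.
err = np.linalg.norm(A @ A_inv - np.eye(n))
print(err)
```

The printed residual is at the level of machine precision, confirming that the rank-one correction of the inverse costs only a few vector operations.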
2 Recursive Implementation

With the notation

    w^k = B_k^{−1} F(x^{k+1}),                                               (11)

the inverse formula (8) is

    B_{k+1}^{−1} = [ I − w^k (s^k)^T / ((s^k)^T s^k + (w^k)^T s^k) ] B_k^{−1},   (12)

and the next step equals

    s^{k+1} = −B_{k+1}^{−1} F(x^{k+1})
            = −[ I − w^k (s^k)^T / ((s^k)^T s^k + (w^k)^T s^k) ] w^k
            = −[ (s^k)^T s^k / ((s^k)^T s^k + (w^k)^T s^k) ] w^k.            (13)

From (13) it follows that

    w^k (s^k)^T / ((s^k)^T s^k + (w^k)^T s^k) = −s^{k+1} (s^k)^T / ((s^k)^T s^k),

whence (12) becomes

    B_{k+1}^{−1} = [ I + s^{k+1}(s^k)^T / ((s^k)^T s^k) ] B_k^{−1}
                 = ∏_{j=0}^{k} [ I + s^{j+1}(s^j)^T / ((s^j)^T s^j) ] B_0^{−1},   (14)

while (13) can be written as

    s^{k+1} = −[ I + s^{k+1}(s^k)^T / ((s^k)^T s^k) ] w^k,                   (15)

that is,

    [ 1 + (s^k)^T w^k / ((s^k)^T s^k) ] s^{k+1} = −w^k.                      (16)

Suppose now that the steps s^j, j = 0, 1, ..., k, and their norms have been stored. Then (14) and (16) imply that

    w^k = ∏_{j=0}^{k−1} [ I + s^{j+1}(s^j)^T / ((s^j)^T s^j) ] B_0^{−1} F(x^{k+1}),
    s^{k+1} = −[1/(1 + τ_k)] w^k,  τ_k = (s^k)^T w^k / ((s^k)^T s^k),        (17)

which can be evaluated by the recursive algorithm

    w := B_0^{−1} F(x^{k+1});
    for j = 0, ..., k−1
        τ := [(s^j)^T w] / ((s^j)^T s^j);
        w := w + τ s^{j+1};
    τ := [(s^k)^T w] / ((s^k)^T s^k);
    s^{k+1} := −[1/(1 + τ)] w;                                               (18)
In order to complete this algorithm, we need some divergence and convergence criteria. In the convergence proof a controlling quantity is the quotient

    Θ_k := ‖B_k^{−1} F(x^{k+1})‖ / ‖s^k‖,  k ≥ 0,                            (19)

and it turns out that we should declare divergence if the condition

    Θ_k < 1/2                                                                (20)

is violated. In view of the superlinear convergence it suffices to declare convergence as soon as ‖s^{k+1}‖ ≤ tol. Altogether the Broyden algorithm can now be formulated as follows, where in contrast to (18) we work with v = −w:

    input: x^0, B_0, kmax, tol;
    solve B_0 s^0 = −F(x^0); ξ_0 := ‖s^0‖; store ξ_0, s^0;
    for k = 0, 1, ..., kmax
        x^{k+1} := x^k + s^k;
        solve B_0 v = −F(x^{k+1});
        if k > 0
            for j = 1, ..., k
                τ := [(s^{j−1})^T v] / ξ_{j−1}²;
                v := v + τ s^j;
        endif
        τ := [(s^k)^T v] / ξ_k²;
        Θ_k := ‖v‖ / ξ_k;
        if Θ_k ≥ 1/2 then return {divergence};
        s^{k+1} := v / (1 − τ);
        ξ_{k+1} := ‖s^{k+1}‖; store ξ_{k+1}, s^{k+1};
        if ξ_{k+1} ≤ tol then return {x* := x^{k+1} + s^{k+1}};
    return {maximal number of steps}

An implementation of this algorithm is the FORTRAN program NLEQ1 of P. Deuflhard, U. Nowak, and L. Weimann, available in the ZIB-Elib library. There exists also a Matlab version. A somewhat different Matlab program is brsol.m by C. T. Kelley [Kel03].

The recursive form of the Broyden method has shown itself to be very economical in practice. But it has been observed occasionally that the condition of the matrices may deteriorate over several steps, causing the method to become unstable. For any matrix

    A = I + u v^T / (v^T v),  u, v ∈ R^n,  κ = ‖u‖ / ‖v‖ < 1,
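The complete algorithm above fits in a few lines of code. The following Python version is a sketch with our own naming (`broyden_recursive`, and a dense B_0 handed to NumPy's solver in place of a stored factorization); it stores the steps s^k and their norms exactly as described:

```python
import numpy as np

def broyden_recursive(F, x0, B0, kmax=50, tol=1e-10):
    """Good Broyden method in recursive form, working with v = -w as in the text."""
    x = np.asarray(x0, dtype=float)
    S = [np.linalg.solve(B0, -F(x))]           # stored steps s^0, s^1, ...
    xi = [np.linalg.norm(S[0])]                # stored norms xi_k = ||s^k||
    for k in range(kmax):
        x = x + S[k]
        v = np.linalg.solve(B0, -F(x))
        for j in range(1, k + 1):              # apply the stored product of updates
            v = v + (S[j - 1] @ v) / xi[j - 1]**2 * S[j]
        tau = (S[k] @ v) / xi[k]**2
        if np.linalg.norm(v) / xi[k] >= 0.5:   # monotonicity test Theta_k < 1/2 violated
            raise RuntimeError("divergence")
        S.append(v / (1.0 - tau))
        xi.append(np.linalg.norm(S[k + 1]))
        if xi[k + 1] <= tol:
            return x + S[k + 1]
    raise RuntimeError("maximal number of steps")

# Example system: F(x) = (x_1^2 + x_2^2 - 2, x_1 - x_2), with solution (1, 1).
F = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])
B0 = np.array([[2.2, 1.9], [1.0, -1.0]])       # Jacobian at the initial guess
x_star = broyden_recursive(F, [1.1, 0.95], B0)
print(x_star)  # ≈ [1. 1.]
```

Note that only one linear system with the fixed matrix B_0 is solved per step; all later updates enter through the stored vectors, which is what makes the recursive form economical.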
we have ‖u v^T / (v^T v)‖ ≤ ‖u‖ / ‖v‖ = κ and therefore

    1 − κ ≤ ‖A‖ ≤ 1 + κ.

This shows that ‖A^{−1}‖ ≤ (1 − κ)^{−1} and

    cond(A) := ‖A‖ ‖A^{−1}‖ ≤ (1 + κ)/(1 − κ).

Hence, for the Broyden matrices (7) it follows from the convergence conditions (19), (20) that

    cond(B_{k+1}) ≤ [(1 + Θ_k)/(1 − Θ_k)] cond(B_k) < 3 cond(B_k),

and hence that the growth of the condition numbers is not unduly fast and can be controlled by means of these estimates.

3 Linear Equations

The recursive form of the Broyden method also provides a very useful iterative method for linear problems

    Ax = b,  A ∈ GL(R^n).

In that case (5) has the form

    B_{k+1} := B_k + (Ax^{k+1} − b)(s^k)^T / ((s^k)^T s^k),

and with (B_k − A)s^k = b − Ax^{k+1} it follows that

    B_{k+1} − A = (B_k − A)(I − P_k),  P_k = s^k (s^k)^T / ((s^k)^T s^k).    (21)

Here I − P_k is the orthogonal projection onto the orthogonal complement of the linear space spanned by s^k. We introduce now the matrices E_j = A^{−1} B_j − I. Then it follows from (21) that ‖E_{j+1}‖ ≤ ‖E_j‖, j ≥ 0. Moreover,

    B_j s^j = −(A x^j − b) = −A(x^j − x*),  x* = A^{−1} b,

implies that

    x^j − x* = −A^{−1} B_j s^j = −(E_j + I) s^j,

and hence that

    ( 1 − ‖E_j s^j‖/‖s^j‖ ) ‖s^j‖ ≤ ‖x^j − x*‖ ≤ ( 1 + ‖E_j s^j‖/‖s^j‖ ) ‖s^j‖.
Under the conditions of the local convergence theorem 1.2 one can show that lim_{j→∞} ‖E_j s^j‖/‖s^j‖ = 0. This leads to the asymptotic error estimate ‖x^j − x*‖ ≈ ‖s^j‖. In order to smooth any possible erratic behavior, it is here useful to work with the average of several steps and to declare convergence if

    ε_j := (1/2)[ ‖s^{j−1}‖² + ‖s^j‖² + ‖s^{j+1}‖² ]^{1/2} ≤ β ‖x^j‖ tol,    (22)

with some given safety factor β < 1. Then the algorithm has the form:

    input: A, b, y, B_0, β, kmax, τ_min, tol;
    x^0 := y; r := b − Ax^0; solve B_0 s^0 = r; η_0 := (s^0)^T s^0; store s^0, η_0;
    for k = 0, 1, ..., kmax
        q := A s^k; solve B_0 z = q;
        if k > 0
            for j = 1, ..., k
                z := z + [(s^{j−1})^T z / η_{j−1}] s^j;
        endif
        τ := η_k / [(s^k)^T z];
        if τ < τ_min then return {restart};
        x^{k+1} := x^k + s^k;
        s^{k+1} := τ (s^k − z);
        η_{k+1} := (s^{k+1})^T s^{k+1}; store s^{k+1}, η_{k+1};
        if k > 0
            ε_k := (1/2)[ η_{k−1} + η_k + η_{k+1} ]^{1/2};
            if ε_k ≤ β ‖x^{k+1}‖ tol then return {x* := x^{k+1} + s^{k+1}};
        endif
    return {maximal number of steps}

A FORTRAN implementation is the GBITR program in the ZIB-Elib library. Note that for the matrix A only a facility for computing the product Ax, x ∈ R^n, has to be provided.

4 Rank-Two Updates

The variety of possible methods increases considerably in the rank-two case. Many of these methods have been developed for application in optimization problems. In that case the interest centers on update formulas which preserve the symmetry of the matrices. Evidently, the direct updates should then have the form

    B_{k+1} = B_k + ( b  c ) Σ ( b  c )^T,  Σ = [ σ_1  σ_2 ; σ_2  σ_3 ],     (23)

where ( b  c ) denotes the n × 2 matrix with columns b, c ∈ R^n. Some examples show that here the matrix Σ should be nonsingular with a negative determinant; otherwise there may be convergence problems.
Since the vectors b, c are essentially free, some suitable basis in R² may be chosen in which Σ assumes a simpler form. In particular, because of det Σ < 0 we may transform Σ such that either σ_1 or σ_3 is zero. In fact, if, say, σ_3 ≠ 0 then a simple calculation shows that

    Σ = [ 1  µ ; 0  1 ] [ 0  δ ; δ  σ_3 ] [ 1  0 ; µ  1 ],  µ = (σ_2 − δ)/σ_3,  δ = (−det Σ)^{1/2}.

Thus, there is no loss of generality to assume that σ_1 = 0 in (23). As before we use the abbreviations

    s^k = x^{k+1} − x^k,  y^k = F(x^{k+1}) − F(x^k).                         (24)

The quasi-Newton condition (4) requires y^k − B_k s^k to be in the subspace spanned by b and c, and hence it is no restriction to set b = y^k − B_k s^k. Then, for any c ∈ R^n such that c^T s^k ≠ 0 it follows that σ_2 = 1/(c^T s^k) and σ_3 = −(y^k − B_k s^k)^T s^k / (c^T s^k)². In other words, all symmetric direct update formulas with negative determinant can be written in the form

    B_{k+1} = B_k + [ (y^k − B_k s^k) c^T + c (y^k − B_k s^k)^T ] / (c^T s^k)
                  − [ (y^k − B_k s^k)^T s^k / (c^T s^k)² ] c c^T,            (25)

provided, of course, that c^T s^k ≠ 0.

For c = s^k, (25) becomes the Powell-symmetric-Broyden (PSB) update formula

    B_{k+1} = B_k + [ (y^k − B_k s^k)(s^k)^T + s^k (y^k − B_k s^k)^T ] / ((s^k)^T s^k)
                  − [ (y^k − B_k s^k)^T s^k / ((s^k)^T s^k)² ] s^k (s^k)^T   (26)

of M. J. D. Powell, while for c = y^k we obtain the Davidon–Fletcher–Powell (DFP) update formula

    B_{k+1} = B_k + [ (y^k − B_k s^k)(y^k)^T + y^k (y^k − B_k s^k)^T ] / ((y^k)^T s^k)
                  − [ (y^k − B_k s^k)^T s^k / ((y^k)^T s^k)² ] y^k (y^k)^T   (27)

given by W. C. Davidon and independently by R. Fletcher and M. J. D. Powell.

Instead of working with the direct update (23) we may consider updating the inverses H_k = B_k^{−1} such that H_{k+1} − H_k has rank two. Here we can begin with H_{k+1} − H_k in a form analogous to (23) and then proceed as before. We will not go into details, but mention only one of the formulas that can be obtained in this way. It was independently suggested by C. G. Broyden, R. Fletcher, D. Goldfarb, and D. F. Shanno, and is generally called the BFGS formula, reflecting the first letters of the names of the four authors:

    H_{k+1} = ( I − s^k (y^k)^T / ((y^k)^T s^k) ) H_k ( I − y^k (s^k)^T / ((y^k)^T s^k) )
              + s^k (s^k)^T / ((y^k)^T s^k).                                 (28)

This is widely considered the most effective update formula for minimization problems.
As before, we can apply here the Sherman–Morrison formula 1.1 and then obtain the direct-update form of the BFGS update

    B_{k+1} = B_k + y^k (y^k)^T / ((y^k)^T s^k) − B_k s^k (B_k s^k)^T / ((s^k)^T B_k s^k).   (29)
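Since (29) arises from (28) via the Sherman–Morrison formula, the two updates must produce matrices that are inverses of one another, and both must satisfy the secant condition B_{k+1} s^k = y^k. A quick numerical consistency check (ours, with random data):

```python
import numpy as np

# Check that the direct BFGS update (29) and the inverse update (28) agree:
# B_new and H_new should be inverse matrices, and B_new s = y (secant condition).
rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                  # symmetric positive definite B_k
H = np.linalg.inv(B)
s = rng.standard_normal(n)
y = rng.standard_normal(n)
y = y + ((1.0 - y @ s) / (s @ s)) * s        # enforce (y^k)^T s^k = 1 > 0

I = np.eye(n)
B_new = B + np.outer(y, y) / (y @ s) - np.outer(B @ s, B @ s) / (s @ B @ s)
H_new = (I - np.outer(s, y) / (y @ s)) @ H @ (I - np.outer(y, s) / (y @ s)) \
        + np.outer(s, s) / (y @ s)

secant_err = np.linalg.norm(B_new @ s - y)
inverse_err = np.linalg.norm(B_new @ H_new - I)
print(secant_err, inverse_err)               # both at rounding-error level
```

That B_new s = y holds exactly is immediate from (29): the last two terms contribute y − B_k s^k when applied to s^k.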
5 The BFGS Method in Optimization

Extremal problems are of foremost importance in almost all applications of mathematics. Many boundary value problems of mathematical physics may be phrased as variational problems. For instance, holonomic equilibrium problems in Lagrangian mechanics derive from the minimization of a suitable energy function. Similarly, the determination of a geodesic between two points on a manifold is a minimization problem, and so are optimal control problems in engineering, or problems involving the optimal determination of unknown parameters of a technical process.

There are close connections between such extremal problems and the solution of nonlinear equations, as is readily seen in the finite-dimensional case. Let g : Ω ⊂ R^n → R^1 be some functional on some set Ω. A point x* ∈ Ω is a local minimizer of g in Ω if there exists an open neighborhood U of x* in R^n such that

    g(x) ≥ g(x*),  x ∈ U ∩ Ω,                                               (30)

and a global minimizer on Ω if the inequality (30) holds for all x ∈ Ω. A point x* in the interior int(Ω) of Ω is a critical point of g if g has a derivative at x* and Dg(x*) = 0. A well-known result states that if x* ∈ int(Ω) is a local minimizer where g is differentiable, then x* is a critical point of g. Of course, a critical point need not be a local minimizer. But if g has a continuous second derivative at a critical point x* ∈ int(Ω) and the Hessian matrix D²g(x*) is positive definite, then x* is a proper local minimizer; that is, strict inequality holds in (30) for all x ∈ U ∩ Ω, x ≠ x*. Conversely, at a local minimizer x*, D²g(x*) is positive semi-definite.

For a differentiable functional g : Ω ⊂ R^n → R^1 we call the transposed first derivative ∇g(x) = Dg(x)^T ∈ R^n the gradient of g at x ∈ Ω. The problem of finding critical points of g is precisely that of solving the gradient system

    ∇g(x) = 0,  x ∈ Ω.                                                      (31)

Conversely, a differentiable mapping F : Ω ⊂ R^n → R^n is called a gradient or potential mapping on Ω if there exists a differentiable functional g : Ω ⊂ R^n → R^1 such that F(x) = ∇g(x) for all x ∈ Ω. A continuously differentiable mapping F on an open convex set Ω is a gradient mapping on Ω if and only if DF(x) is symmetric for all x ∈ Ω. This is called Kerner's theorem. For any gradient mapping the problem of solving F(x) = 0 may be replaced by that of minimizing the functional g, provided, of course, we keep in mind that a local minimizer of g need not be a critical point, nor that a critical point is necessarily a minimizer.

Let g : Ω ⊂ R^n → R^1 be a (sufficiently smooth) functional for which we want to compute a minimizer. Many of the iterative methods for this purpose have the general form

    x^{k+1} = x^k − λ_k d^k,  k = 0, 1, ...,                                 (32)

involving a direction vector d^k ∈ R^n and a steplength λ_k ≥ 0 chosen such that

    g(x^k) > g(x^{k+1}),  k = 0, 1, ....                                     (33)
Obviously, it will not suffice to ensure only a decrease of the value of g; rather, we have to require that the decrease (33) is sufficiently large. Thus, at the k-th step of the method the major tasks are the selection of a suitable direction vector d^k and the construction of an appropriate steplength λ_k. The literature in this area is very extensive; see, e.g., [Kel99] and [Rhe98] for an introduction and further references.

Clearly, given a current point x ∈ Ω, we want to use a (nonzero) direction vector d such that for some δ > 0 we have g(x − td) ≤ g(x) for t ∈ [0, δ). From

    lim_{t→0} (1/t)[g(x) − g(x − td)] = Dg(x)d

it follows that, in order for this to hold, it is sufficient that Dg(x)d > 0 and necessary that Dg(x)d ≥ 0. Accordingly, we call a vector d ≠ 0 an admissible direction of g at a point x if Dg(x)d > 0.

In accordance with the linearization methods (2) we consider now methods of the form

    x^{k+1} = x^k − λ_k B_k^{−1} ∇g(x^k),  k = 0, 1, ....                    (34)

Hence the direction vectors are here

    d^k := B_k^{−1} ∇g(x^k),  k = 0, 1, ....                                 (35)

If the matrices B_k are assumed to be symmetric, positive definite, then we have

    Dg(x^k) d^k = Dg(x^k) B_k^{−1} ∇g(x^k) = (∇g(x^k))^T B_k^{−1} ∇g(x^k) > 0  if ∇g(x^k) ≠ 0,   (36)

that is, the directions (35) are admissible. This is the reason why in Section 4 the emphasis was placed on the construction of update formulas that preserve symmetry. Actually, many of these update methods also preserve positive definiteness. In particular, this holds for the BFGS formula:

5.1. With the abbreviations (24) suppose that B_k is symmetric, positive definite, and that (y^k)^T s^k > 0. Then B_{k+1} given by (29) is also symmetric, positive definite.

Proof. By (28) we have

    H_{k+1} = ( I − s^k (y^k)^T / ((y^k)^T s^k) ) H_k ( I − y^k (s^k)^T / ((y^k)^T s^k) ) + s^k (s^k)^T / ((y^k)^T s^k),   (37)

so that B_{k+1} is well defined and symmetric together with B_k. Furthermore, under the stated conditions, the Cauchy–Schwarz inequality for the inner product induced by B_k gives

    (z^T B_k s^k)² ≤ ((s^k)^T B_k s^k)(z^T B_k z),  z ∈ R^n,

with equality only if z and s^k are linearly dependent. Moreover, it follows from (29) that

    z^T B_{k+1} z = (z^T y^k)² / ((y^k)^T s^k) + z^T B_k z − (z^T B_k s^k)² / ((s^k)^T B_k s^k),

whence

    z^T B_{k+1} z ≥ (z^T y^k)² / ((y^k)^T s^k) ≥ 0.

If z ≠ 0 is not a multiple of s^k, the first of these inequalities is strict, while for z = α s^k, α ≠ 0, we obtain z^T B_{k+1} z ≥ α² (y^k)^T s^k > 0. Hence B_{k+1} is positive definite, as claimed.
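Result 5.1 can be illustrated numerically. In the sketch below (ours), the curvature condition (y^k)^T s^k > 0 is guaranteed by taking y^k = Q s^k with a random symmetric positive definite Q, mimicking the gradient difference of a convex quadratic:

```python
import numpy as np

# Numerical illustration of 5.1: for symmetric positive definite B_k and
# (y^k)^T s^k > 0, the direct BFGS update (29) is again positive definite.
rng = np.random.default_rng(3)
n = 6
min_eigs = []
for trial in range(50):
    M = rng.standard_normal((n, n))
    B = M @ M.T + 0.5 * np.eye(n)          # B_k: symmetric positive definite
    W = rng.standard_normal((n, n))
    Q = W @ W.T + 0.5 * np.eye(n)          # mimics the Hessian of a convex g
    s = rng.standard_normal(n)
    y = Q @ s                              # then (y^k)^T s^k = s^T Q s > 0
    B_new = B + np.outer(y, y) / (y @ s) \
              - np.outer(B @ s, B @ s) / (s @ B @ s)
    min_eigs.append(np.linalg.eigvalsh(B_new).min())

print(min(min_eigs) > 0)  # True: every updated matrix remained positive definite
```

In a minimization method the condition (y^k)^T s^k > 0 must of course be checked, not assumed; when a line search produces a step violating it, the update is typically skipped or the search is repeated.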
A step of a descent method of the form (34) with the BFGS update formula now has the generic form:

    Compute the search direction d^k = H_k ∇g(x^k);
    Determine a suitable λ_k > 0 such that g(x^k) − g(x^k − λ_k d^k) > 0;
    s^k := −λ_k d^k; x^{k+1} := x^k + s^k;
    y^k := ∇g(x^{k+1}) − ∇g(x^k);
    If (y^k)^T s^k ≤ 0 then return;
    Update H_k to H_{k+1} by means of the BFGS formula (28).

Numerous algorithms have been proposed for constructing an acceptable steplength λ_k. One of the simplest is the so-called Armijo rule, where we search along the line t > 0 → x^k − t d^k for a point such that

    g(x^k) − g(x^k − t d^k) > t α ‖∇g(x^k)‖²,                                (38)

where, say, α = 10^{−4}. More specifically, we use a backtracking approach and test (38) first with t = 1 and then with successively smaller t = β^j, j = 0, 1, ..., j_max, where 0 < β < 1. In other words, the algorithm has the generic form:

    input: g, x^k, d^k, α, β, j_max;
    g_c := g(x^k); p := ∇g(x^k); γ := α ‖p‖²; t := 1;
    for j = 0 : j_max
        if g_c − g(x^k − t d^k) > t γ then return {λ_k := t};
        t := β t;
    return {failure}

For the implementation of the overall algorithm one has to decide on the storage of all needed data and on a strategy for a more effective handling of the error case (y^k)^T s^k ≤ 0. These issues are discussed, e.g., in Chapter 4 of [Kel99]. The simplest approach is to store the entire matrix H_k, which then allows for the computation of the update once the vectors s^k and y^k are available. Clearly, this is costly in storage for large dimensions. A second possibility is to store the sequences {s^k} and {y^k} and then to recompute the matrices recursively by means of (29) when they are needed. It turns out that, with only a modest increase in complexity, the required storage can be decreased to one vector per iteration step. We will not enter into the details, but refer to the discussion in Section 4.2.1 of [Kel99], where a Matlab implementation bfgsopt involving the above Armijo algorithm is also given. Certainly the BFGS updates are not the only possible choice.
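The generic descent step and the backtracking line search combine into a complete method in a few dozen lines. The following Python sketch uses our own naming (`bfgs_armijo`); it is not the bfgsopt code of [Kel99], its sufficient-decrease test uses the directional derivative p^T d^k in place of ‖∇g(x^k)‖², and updates with (y^k)^T s^k ≤ 0 are simply skipped rather than treated as an error:

```python
import numpy as np

def bfgs_armijo(g, grad, x0, alpha=1e-4, beta=0.5, jmax=40, kmax=500, tol=1e-6):
    """Descent method with the inverse BFGS update and Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H, I = np.eye(n), np.eye(n)                # H_0 = B_0^{-1} = I
    for k in range(kmax):
        p = grad(x)
        if np.linalg.norm(p) <= tol:
            return x
        d = H @ p                              # search direction d^k = H_k grad g
        t, gx = 1.0, g(x)
        for j in range(jmax + 1):              # backtracking Armijo line search
            if gx - g(x - t * d) > t * alpha * (p @ d):
                break
            t *= beta
        else:
            raise RuntimeError("line search failure")
        s = -t * d
        x = x + s
        y = grad(x) - p
        if y @ s > 0:                          # skip update if curvature condition fails
            rho = 1.0 / (y @ s)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
    raise RuntimeError("maximal number of steps")

# Minimize the Rosenbrock function; the minimizer is (1, 1).
g = lambda x: (1 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2.0 * (1 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
                           200.0 * (x[1] - x[0]**2)])
print(bfgs_armijo(g, grad, [-1.2, 1.0]))       # ≈ [1. 1.]
```

Since H_0 = I, the first step is a steepest-descent step; the stored matrix H_k then accumulates curvature information, which is what produces the superlinear behavior near the minimizer.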
In fact, numerous other software packages that implement quasi-Newton methods for minimization problems have been written; see, e.g., [MW93].
References

[Deu04] P. Deuflhard, Newton Methods for Nonlinear Problems, Springer Verlag, Heidelberg, New York, 2004.

[DS98] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Classics in Applied Mathematics, Vol. 16, SIAM Publications, Philadelphia, PA, 1998. Originally published by Prentice Hall, 1983.

[Kel95] C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, Frontiers in Applied Mathematics, Vol. 16, SIAM Publications, Philadelphia, PA, 1995.

[Kel99] C. T. Kelley, Iterative Methods for Optimization, Frontiers in Applied Mathematics, Vol. 18, SIAM Publications, Philadelphia, PA, 1999.

[Kel03] C. T. Kelley, Solving Nonlinear Equations with Newton's Method, Fundamentals of Algorithms, SIAM Publications, Philadelphia, PA, 2003.

[MW93] J. J. Moré and S. J. Wright, Optimization Software Guide, SIAM Publications, Philadelphia, PA, 1993.

[OR00] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Classics in Applied Mathematics, Vol. 30, SIAM Publications, Philadelphia, PA, 2000. Originally published by Academic Press, 1970; Russian translation 1976, Chinese translation 1982.

[Rhe98] W. C. Rheinboldt, Methods for Solving Systems of Nonlinear Equations, CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 70, SIAM Publications, Philadelphia, PA, 1998.