BFGS WITH UPDATE SKIPPING AND VARYING MEMORY. July 9, 1996

BFGS WITH UPDATE SKIPPING AND VARYING MEMORY

TAMARA GIBSON†, DIANNE P. O'LEARY‡, AND LARRY NAZARETH§

July 9, 1996

Abstract. We give conditions under which limited-memory quasi-Newton methods with exact line searches will terminate in n steps when minimizing n-dimensional quadratic functions. We show that although all Broyden family methods terminate in n steps in their full-memory versions, only BFGS does so with limited memory. Additionally, we show that full-memory Broyden family methods with exact line searches terminate in at most n + p steps when p matrix updates are skipped. We introduce new limited-memory BFGS variants and test them on nonquadratic minimization problems.

Key words. minimization, quasi-Newton, BFGS, limited memory, update skipping, Broyden family

1. Introduction. The quasi-Newton family of algorithms remains a standard workhorse for minimization. Many of these methods share the properties of finite termination on strictly convex quadratic functions, a linear or superlinear rate of convergence on general convex functions, and no need to store or evaluate the second derivative matrix. In general, an approximation to the second derivative matrix is built by accumulating the results of earlier steps. Descriptions of many quasi-Newton algorithms can be found in books by Luenberger [16], Dennis and Schnabel [7], and Golub and Van Loan [11]. Although there are an infinite number of quasi-Newton methods, one method surpasses the others in popularity: the BFGS algorithm of Broyden, Fletcher, Goldfarb, and Shanno; see, e.g., Dennis and Schnabel [7]. This method exhibits more robust behavior than its relatives. Many attempts have been made to explain this robustness, but a complete understanding is yet to be obtained [23].
One result of the work in this paper is a small step toward this understanding, since we investigate the question of how much and which information can be dropped in BFGS and other quasi-Newton methods without destroying the property of quadratic termination. We answer this question in the context of exact line search methods, those that find a minimizer on a one-dimensional subspace at every iteration. (In practice, inexact line searches that satisfy side conditions such as those proposed by Wolfe, see §4.3, are substituted for exact line searches.) We focus on modifications of well-known quasi-Newton algorithms resulting from limiting the memory, either by discarding the results of early steps (§2) or by skipping some updates to the second derivative approximation (§3). We give conditions under which quasi-Newton methods will terminate in n steps when minimizing quadratic functions of n variables. Although all Broyden family methods (see §2) terminate in n steps in their full-memory versions, we show that only BFGS has n-step termination under limited memory. We also show that the methods from the Broyden family terminate in n + p steps even if p updates are skipped, but termination is lost if we both skip updates and limit the memory.

† Applied Mathematics Program, University of Maryland, College Park, MD. gibson@math.umd.edu. This work was supported in part by the National Physical Science Consortium, the National Security Agency, and the University of Maryland.
‡ Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD. oleary@cs.umd.edu. This work was supported by the National Science Foundation under grant NSF.
§ Department of Pure and Applied Mathematics, Washington State University, Pullman, WA. nazareth@amath.washington.edu.

In §4, we report the results of experiments with new limited-memory BFGS variants on problems taken from the CUTE [3] test set, showing that some savings in time can be achieved.

Notation. Matrices and vectors are denoted by boldface upper-case and lower-case letters respectively. Scalars are denoted by Greek or Roman letters. The superscript "T" denotes transpose. Subscripts denote iteration number. Products are always taken from left to right. The notation span{x_1, x_2, ..., x_k} denotes the subspace spanned by the vectors x_1, x_2, ..., x_k. Whenever we refer to an n-dimensional strictly convex quadratic function, we assume it is of the form

    f(x) = (1/2) x^T A x − x^T b,

where A is a positive definite n × n matrix and b is an n-vector.

2. Limited-Memory Variations of Quasi-Newton Algorithms. In this section we characterize full-memory and limited-memory methods that terminate in n iterations on n-dimensional strictly convex quadratic minimization problems using exact line searches. Most full-memory versions of the methods we will discuss are known to terminate in n iterations. Limited-memory BFGS (L-BFGS) was shown by Nocedal [22] to terminate in n steps. The preconditioned conjugate gradient method, which can be cast as a limited-memory quasi-Newton method, is also known to terminate in n iterations; see, e.g., Luenberger [16]. Little else is known about termination of limited-memory methods.

Let f(x) denote the strictly convex quadratic function to be minimized, and let g(x) denote the gradient of f. We define g_k ≡ g(x_k), where x_k is the kth iterate. Let

    s_k = x_{k+1} − x_k

denote the change in the current iterate and

    y_k = g_{k+1} − g_k

denote the change in gradient.

Let x_0 be the starting point, and let H_0 be the initial inverse Hessian approximation.
For k = 0, 1, ...:
  1. Compute d_k = −H_k g_k.
  2. Choose α_k > 0 such that f(x_k + α_k d_k) ≤ f(x_k + α d_k) for all α > 0.
  3. Set s_k = α_k d_k.
  4. Set x_{k+1} = x_k + s_k.
  5. Compute g_{k+1}.
  6. Set y_k = g_{k+1} − g_k.
  7. Choose H_{k+1}.
Fig. 2.1. General Quasi-Newton Method.

We present a general result that characterizes quasi-Newton methods (see Figure 2.1) that terminate in n iterations. We restrict ourselves to methods with an update of the form

(2.1)    H_{k+1} = γ_k P_k^T H_0 Q_k + Σ_{i=1}^{m_k} w_{ik} z_{ik}^T.

Here,

1. H_0 is an n × n symmetric positive definite matrix that remains constant for all k, and γ_k is a nonzero scalar that can be thought of as an iterative rescaling of H_0;
2. P_k is an n × n matrix that is the product of projection matrices of the form

(2.2)    I − (u v^T)/(v^T u),

where u ∈ span{y_0, ..., y_k} and v ∈ span{s_0, ..., s_{k+1}}, and Q_k is an n × n matrix that is the product of projection matrices of the same form where u is any n-vector and v ∈ span{s_0, ..., s_k};
3. m_k is a nonnegative integer, w_{ik} (i = 1, 2, ..., m_k) is any n-vector, and z_{ik} (i = 1, 2, ..., m_k) is any vector in span{s_0, ..., s_k}.

We refer to this form as the general form. The general form fits many known quasi-Newton methods, including the Broyden family and the limited-memory BFGS method. We do not assume that these quasi-Newton methods satisfy the secant condition, H_{k+1} y_k = s_k, nor that H_{k+1} is positive definite and symmetric. Symmetric positive definite updates are desirable since this guarantees that the quasi-Newton method produces descent directions. Note that if the update is not positive definite, we may produce a d_k such that d_k^T g_k > 0, in which case we choose α_k by minimizing over all negative α rather than all positive α.

Example. The method of steepest descent [16] fits the general form (2.1). For each k we define

(2.3)    γ_k = 1, m_k = 0, and P_k = Q_k = H_0 = I.

Note that neither w nor z vectors are specified since m_k = 0.

Example. The (k+1)st update for the conjugate gradient method with preconditioner H_0 fits the general form (2.1) with

(2.4)    γ_k = 1, m_k = 0, P_k = I − (y_k s_k^T)/(s_k^T y_k), and Q_k = I.

Example. The L-BFGS update (see Nocedal [22]) with limited-memory constant m can be written as

(2.5)    H_{k+1} = V_{k−m̃+1,k}^T H_0 V_{k−m̃+1,k} + Σ_{i=k−m̃+1}^{k} V_{i+1,k}^T (s_i s_i^T)/(s_i^T y_i) V_{i+1,k},

where m̃ = min{k+1, m} and

    V_{i,k} = Π_{j=i}^{k} ( I − (y_j s_j^T)/(s_j^T y_j) ).

L-BFGS fits the general form (2.1) if at iteration k we choose

(2.6)    γ_k = 1, m_k = m̃ = min{k+1, m}, P_k = Q_k = V_{k−m̃+1,k}, and w_{ik} = z_{ik} = V_{k−m̃+i+1,k}^T s_{k−m̃+i} / sqrt(s_{k−m̃+i}^T y_{k−m̃+i}).

Observe that P_k, Q_k, and z_{ik} all obey the constraints imposed on their construction. BFGS is related to L-BFGS in the following way: if we were to use every (s, y) pair in the formation of each update (i.e., we have unlimited memory), we would be creating the same updates as BFGS. In practice, however, one would never do that because it would take more memory than storing the BFGS matrix.

Example. We will define limited-memory DFP (L-DFP). Our definition is consistent with the definition of limited-memory BFGS given in Nocedal [22]. Let m ≥ 1 and let m̃ = min{k+1, m}. In order to define the L-DFP update we need to create a sequence of auxiliary matrices for i = 1, ..., m̃:

    Ĥ^{(0)}_{k+1} = H_0, and Ĥ^{(i)}_{k+1} = Ĥ^{(i−1)}_{k+1} + U_DFP(Ĥ^{(i−1)}_{k+1}, s_{k−m̃+i}, y_{k−m̃+i}),

where

    U_DFP(H, s, y) = − (H y y^T H)/(y^T H y) + (s s^T)/(s^T y).

The matrix Ĥ^{(m̃)}_{k+1} is the result of applying the DFP update m̃ times to the matrix H_0 with the m̃ most recent (s, y) pairs. Thus, the (k+1)st L-DFP matrix is given by

    H_{k+1} = Ĥ^{(m̃)}_{k+1}.

To simplify our description, note that Ĥ^{(i)}_{k+1} can be rewritten as

    Ĥ^{(i)}_{k+1} = ( I − (Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} y_{k−m̃+i}^T)/(y_{k−m̃+i}^T Ĥ^{(i−1)}_{k+1} y_{k−m̃+i}) ) Ĥ^{(i−1)}_{k+1} + (s_{k−m̃+i} s_{k−m̃+i}^T)/(s_{k−m̃+i}^T y_{k−m̃+i}), for i ≥ 1,

and hence unrolled as

    Ĥ^{(i)}_{k+1} = (V̂^{(i)}_0)^T H_0 + Σ_{j=1}^{i} (V̂^{(i)}_j)^T (s_{k−m̃+j} s_{k−m̃+j}^T)/(s_{k−m̃+j}^T y_{k−m̃+j}),

where

    V̂^{(i)}_j = Π_{l=j+1}^{i} ( I − (y_{k−m̃+l} y_{k−m̃+l}^T Ĥ^{(l−1)}_{k+1})/(y_{k−m̃+l}^T Ĥ^{(l−1)}_{k+1} y_{k−m̃+l}) ).

Thus H_{k+1} can be written as

(2.7)    H_{k+1} = V̂_0^T H_0 + Σ_{i=1}^{m̃} V̂_i^T (s_{k−m̃+i} s_{k−m̃+i}^T)/(s_{k−m̃+i}^T y_{k−m̃+i}),

where V̂_i ≡ V̂^{(m̃)}_i = Π_{j=i+1}^{m̃} ( I − (y_{k−m̃+j} y_{k−m̃+j}^T Ĥ^{(j−1)}_{k+1})/(y_{k−m̃+j}^T Ĥ^{(j−1)}_{k+1} y_{k−m̃+j}) ).
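The L-DFP recursion above is just repeated application of U_DFP to H_0 with the most recent pairs. A small sketch (assumed helper names; the secant condition H_{k+1} y_k = s_k can be checked directly, since DFP does satisfy it on the newest pair):

```python
import numpy as np

def dfp_update(H, s, y):
    """One application of U_DFP:  H - (H y y'H)/(y'H y) + (s s')/(s'y)."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (s @ y)

def ldfp_matrix(H0, pairs, m):
    """L-DFP matrix H_{k+1} = Hhat^(m~): apply the DFP update to H0
    with the min(m, len(pairs)) most recent (s, y) pairs, oldest first."""
    H = H0.copy()
    for s, y in pairs[-m:]:
        H = dfp_update(H, s, y)
    return H
```

Note that, unlike BFGS, the DFP update has no symmetric product factorization, which is why (2.7) carries the auxiliary matrices Ĥ^{(i)}_{k+1} inside the projection factors.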

Equation (2.7) looks very much like the general form given in (2.1). L-DFP fits the general form with the following choices:

(2.8)    γ_k = 1, P_k = V̂_0, Q_k = I, w_{ik} = V̂_i^T s_{k−m̃+i}/(s_{k−m̃+i}^T y_{k−m̃+i}), and z_{ik} = s_{k−m̃+i}.

Except for the choice of P_k, it is trivial to verify that these choices satisfy the general form. To prove that P_k satisfies the requirements, we need to show

(2.9)    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_{k+1}}, for i = 1, ..., m̃ and all k.

Proposition 2.1. For limited-memory DFP, the following two conditions hold for each value of k:

(2.10)    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_k} for i = 1, ..., m̃ − 1, and
          Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_k, H_0 g_{k+1}} for i = m̃; and

(2.11)    span{H_0 g_0, ..., H_0 g_{k+1}} ⊆ span{s_0, ..., s_{k+1}}.

Proof. We will prove this via induction. Suppose k = 0. Then m̃_0 = 1. We have

    Ĥ^{(0)}_1 y_0 = H_0 y_0 = H_0 g_1 − H_0 g_0 ∈ span{s_0, H_0 g_1}.

(Recall that span{s_0} is trivially equal to span{H_0 g_0}.) Furthermore,

    s_1 = −α_1 H_1 g_1 = −α_1 ( H_0 g_1 − (y_0^T H_0 g_1)/(y_0^T H_0 y_0) (H_0 g_1 − H_0 g_0) + (s_0^T g_1)/(y_0^T s_0) s_0 ).

So we can conclude

    ( 1 − (y_0^T H_0 g_1)/(y_0^T H_0 y_0) ) H_0 g_1 = −(1/α_1) s_1 − (y_0^T H_0 g_1)/(y_0^T H_0 y_0) H_0 g_0 − (s_0^T g_1)/(y_0^T s_0) s_0.

Hence H_0 g_1 ∈ span{s_0, s_1}, and so the base case holds.

Assume that

    Ĥ^{(i−1)}_k y_{k−1−m̃_{k−1}+i} ∈ span{s_0, ..., s_{k−1}} for i = 1, ..., m̃_{k−1} − 1, and
    Ĥ^{(i−1)}_k y_{k−1−m̃_{k−1}+i} ∈ span{s_0, ..., s_{k−1}, H_0 g_k} for i = m̃_{k−1}, and
    span{H_0 g_0, ..., H_0 g_k} ⊆ span{s_0, ..., s_k}.

We will use induction on i to show (2.10) for the (k+1)st case. For i = 1,

    Ĥ^{(0)}_{k+1} y_{k−m̃+1} = H_0 y_{k−m̃+1} = H_0 g_{k−m̃+2} − H_0 g_{k−m̃+1}.

Using the induction assumptions from the induction on k, we get that

    Ĥ^{(0)}_{k+1} y_{k−m̃+1} ∈ span{s_0, ..., s_k, H_0 g_{k+1}}, if m̃ = 1;
    Ĥ^{(0)}_{k+1} y_{k−m̃+1} ∈ span{s_0, ..., s_k}, otherwise.

Assume that Ĥ^{(i−2)}_{k+1} y_{k−m̃+i−1} ∈ span{s_0, ..., s_k} (induction assumption for i). Next,

    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} = (V̂^{(i−1)}_0)^T H_0 y_{k−m̃+i} + Σ_{j=1}^{i−1} (s_{k−m̃+j}^T y_{k−m̃+i})/(s_{k−m̃+j}^T y_{k−m̃+j}) (V̂^{(i−1)}_j)^T s_{k−m̃+j}.

For values of i ≤ m̃ − 1, (V̂^{(i−1)}_j)^T maps any vector v into

    span{v, Ĥ^{(0)}_{k+1} y_{k−m̃+1}, ..., Ĥ^{(i−2)}_{k+1} y_{k−m̃+i−1}},

and so Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} is in

    span{H_0 y_{k−m̃+i}, Ĥ^{(0)}_{k+1} y_{k−m̃+1}, ..., Ĥ^{(i−2)}_{k+1} y_{k−m̃+i−1}, s_{k−m̃+1}, ..., s_{k−m̃+i−1}}.

Using the induction assumptions for both i and k, we get

    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_k},

and we can continue the induction on i. If i = m̃, then

    Ĥ^{(m̃−1)}_{k+1} y_k ∈ span{H_0 y_k, Ĥ^{(0)}_{k+1} y_{k−m̃+1}, ..., Ĥ^{(m̃−2)}_{k+1} y_{k−1}, s_{k−m̃+1}, ..., s_{k−1}},

so

    Ĥ^{(m̃−1)}_{k+1} y_k ∈ span{s_0, ..., s_k, H_0 g_{k+1}}.

Hence the induction on i is complete and this proves (2.10) in the (k+1)st case.

Now, consider

    s_{k+1} = −α_{k+1} H_{k+1} g_{k+1} = −α_{k+1} ( V̂_0^T H_0 g_{k+1} + Σ_{i=1}^{m̃} (s_{k−m̃+i}^T g_{k+1})/(s_{k−m̃+i}^T y_{k−m̃+i}) V̂_i^T s_{k−m̃+i} ).

Using the structure of V̂_j and (2.10), we see that

    H_0 g_{k+1} ∈ span{s_0, ..., s_{k+1}}.

Hence, (2.11) also holds in the (k+1)st case.

Example. The Broyden Class or Broyden Family is the class of quasi-Newton methods whose matrices are linear combinations of the DFP and BFGS matrices:

    H_{k+1} = φ H^{BFGS}_{k+1} + (1 − φ) H^{DFP}_{k+1}, φ ∈ R;

see, e.g., Luenberger [16, Chap. 9]. The parameter φ is usually restricted to values that are guaranteed to produce a positive definite update, although recent work with SR1, a Broyden Class method, by Khalfan, Byrd, and Schnabel [14] may change this practice. No restriction on φ is necessary for the development of our theory. The Broyden Class update can be expressed as

    H_{k+1} = H_k + (s_k s_k^T)/(s_k^T y_k) − (H_k y_k y_k^T H_k)/(y_k^T H_k y_k)
              + φ (y_k^T H_k y_k) ( s_k/(s_k^T y_k) − (H_k y_k)/(y_k^T H_k y_k) ) ( s_k/(s_k^T y_k) − (H_k y_k)/(y_k^T H_k y_k) )^T.

We sketch the explanation of how the full-memory version fits the general form given in (2.1); the limited-memory case is similar. We can rewrite the Broyden Class update in the factored form

    H_{k+1} = ( I − (u_k y_k^T)/(y_k^T u_k) ) H_k ( I − (y_k s_k^T)/(s_k^T y_k) ) + (s_k s_k^T)/(s_k^T y_k),

where

    u_k = (1 − φ)(s_k^T y_k) H_k y_k + φ (y_k^T H_k y_k) s_k,

as can be verified by expanding the product. Hence, unrolling the recursion,

    H_{k+1} = P_k^T H_0 Q_k + Σ_{i=1}^{k+1} w_{ik} z_{ik}^T,

where

    P_k^T = A_k^T A_{k−1}^T ⋯ A_0^T with A_j^T = I − (u_j y_j^T)/(y_j^T u_j),
    Q_k = B_0 B_1 ⋯ B_k with B_j = I − (y_j s_j^T)/(s_j^T y_j),
    w_{ik} = A_k^T ⋯ A_i^T s_{i−1}/(s_{i−1}^T y_{i−1}), and z_{ik} = B_k^T ⋯ B_i^T s_{i−1}.

It is left to the reader to show that H_k y_k is in span{s_0, ..., s_{k+1}}, and thus the Broyden Class updates fit the form in (2.1).

2.1. Termination of Limited-Memory Methods. In this section we show that methods fitting the general form (2.1) produce conjugate search directions (Theorem 2.2) and terminate in n iterations (Corollary 2.3) if and only if P_k maps span{y_0, ..., y_k} into span{y_0, ..., y_{k−1}} for each k = 1, 2, ..., n. Furthermore, this condition on P_k is satisfied only if y_k is used in its formation (Corollary 2.4).

Theorem 2.2. Suppose that we apply a quasi-Newton method (Figure 2.1) with an update of the form (2.1) to minimize an n-dimensional strictly convex quadratic function. Then for each k before termination (i.e., g_{k+1} ≠ 0),

(2.12)    g_{k+1}^T s_j = 0, for all j = 0, 1, ..., k;
(2.13)    s_{k+1}^T A s_j = 0, for all j = 0, 1, ..., k; and
(2.14)    span{s_0, ..., s_{k+1}} = span{H_0 g_0, ..., H_0 g_{k+1}},

if and only if

(2.15)    P_j y_i ∈ span{y_0, ..., y_{j−1}}, for all i = 0, 1, ..., j, j = 0, 1, ..., k.

Proof. (⇐) Assume that (2.15) holds. We will prove (2.12)–(2.14) by induction. Since the line searches are exact, g_1 is orthogonal to s_0. Using the fact that P_0 y_0 = 0

from (2.15), and the fact that z_{i0} ∈ span{s_0} implies g_1^T z_{i0} = 0, i = 1, ..., m_0, we see that s_1 is conjugate to s_0 since

    s_1^T A s_0 = α_1 d_1^T y_0 = −α_1 g_1^T H_1^T y_0
                = −α_1 g_1^T ( γ_0 Q_0^T H_0 P_0 + Σ_{i=1}^{m_0} z_{i0} w_{i0}^T ) y_0
                = −α_1 γ_0 g_1^T Q_0^T H_0 P_0 y_0 − α_1 Σ_{i=1}^{m_0} g_1^T z_{i0} w_{i0}^T y_0
                = 0.

Lastly, span{s_0} = span{H_0 g_0}, and so the base case is established.

We will assume that claims (2.12)–(2.14) hold for k = 0, 1, ..., k̂ − 1 and prove that they also hold for k = k̂. The vector g_{k̂+1} is orthogonal to s_{k̂} since the line search is exact. Using the induction hypotheses that g_{k̂} is orthogonal to {s_0, ..., s_{k̂−1}} and s_{k̂} is conjugate to {s_0, ..., s_{k̂−1}}, we see that for j < k̂,

    g_{k̂+1}^T s_j = (g_{k̂} + y_{k̂})^T s_j = (g_{k̂} + A s_{k̂})^T s_j = 0.

Hence, (2.12) holds for k = k̂. To prove (2.13), we note that

    s_{k̂+1}^T A s_j = −α_{k̂+1} g_{k̂+1}^T H_{k̂+1}^T y_j,

so it is sufficient to prove that g_{k̂+1}^T H_{k̂+1}^T y_j = 0 for j = 0, 1, ..., k̂. We will use the following facts:
(i) g_{k̂+1}^T Q_{k̂}^T = g_{k̂+1}^T, since the v in each of the projections used to form Q_{k̂} is in span{s_0, ..., s_{k̂}} and g_{k̂+1} is orthogonal to that span.
(ii) g_{k̂+1}^T z_{ik̂} = 0 for i = 1, ..., m_{k̂}, since each z_{ik̂} is in span{s_0, ..., s_{k̂}} and g_{k̂+1} is orthogonal to that span.
(iii) Since we are assuming that (2.15) holds true, for each j = 0, 1, ..., k̂ there exist β_0, ..., β_{k̂−1} such that P_{k̂} y_j can be expressed as Σ_{i=0}^{k̂−1} β_i y_i.
(iv) For i = 0, 1, ..., k̂ − 1, g_{k̂+1} is orthogonal to H_0 y_i, because g_{k̂+1} is orthogonal to span{s_0, ..., s_{k̂}} and H_0 y_i ∈ span{s_0, ..., s_{k̂}} from (2.14).

Thus,

    g_{k̂+1}^T H_{k̂+1}^T y_j = g_{k̂+1}^T ( γ_{k̂} Q_{k̂}^T H_0 P_{k̂} + Σ_{i=1}^{m_{k̂}} z_{ik̂} w_{ik̂}^T ) y_j
        = γ_{k̂} g_{k̂+1}^T Q_{k̂}^T H_0 P_{k̂} y_j + Σ_{i=1}^{m_{k̂}} g_{k̂+1}^T z_{ik̂} w_{ik̂}^T y_j
        = γ_{k̂} g_{k̂+1}^T H_0 P_{k̂} y_j
        = γ_{k̂} g_{k̂+1}^T H_0 ( Σ_{i=0}^{k̂−1} β_i y_i )

        = γ_{k̂} Σ_{i=0}^{k̂−1} β_i g_{k̂+1}^T H_0 y_i
        = 0.

Thus, (2.13) holds for k = k̂. Lastly, using (i) and (ii) from above,

    s_{k̂+1} = −α_{k̂+1} H_{k̂+1} g_{k̂+1}
            = −α_{k̂+1} ( γ_{k̂} P_{k̂}^T H_0 Q_{k̂} g_{k̂+1} + Σ_{i=1}^{m_{k̂}} w_{ik̂} z_{ik̂}^T g_{k̂+1} )
            = −α_{k̂+1} γ_{k̂} P_{k̂}^T H_0 g_{k̂+1}.

Since P_{k̂}^T maps any vector v into span{v, s_0, ..., s_{k̂+1}} by construction, there exist β_0, ..., β_{k̂+1} such that

    s_{k̂+1} = −α_{k̂+1} γ_{k̂} ( H_0 g_{k̂+1} + Σ_{i=0}^{k̂+1} β_i s_i ).

Hence,

    H_0 g_{k̂+1} ∈ span{s_0, ..., s_{k̂+1}},

so

    span{H_0 g_0, ..., H_0 g_{k̂+1}} ⊆ span{s_0, ..., s_{k̂+1}}.

To show equality of the sets, we will show that H_0 g_{k̂+1} is linearly independent of {H_0 g_0, ..., H_0 g_{k̂}}. (We already know that the set {H_0 g_0, ..., H_0 g_{k̂}} is linearly independent since it spans the same space as the linearly independent set {s_0, ..., s_{k̂}} and has the same number of elements.) Suppose that H_0 g_{k̂+1} is not linearly independent. Then there exist σ_0, ..., σ_{k̂}, not all zero, such that

    H_0 g_{k̂+1} = Σ_{i=0}^{k̂} σ_i H_0 g_i.

Recall that g_{k̂+1} is orthogonal to {s_0, ..., s_{k̂}}. By our induction assumption, this implies that g_{k̂+1} is also orthogonal to {H_0 g_0, ..., H_0 g_{k̂}}. Thus for any j between 0 and k̂,

    0 = g_{k̂+1}^T H_0 g_j = ( Σ_{i=0}^{k̂} σ_i H_0 g_i )^T g_j = Σ_{i=0}^{k̂} σ_i g_i^T H_0 g_j = σ_j g_j^T H_0 g_j.

Since H_0 is positive definite and g_j is nonzero, we conclude that σ_j must be zero. Since this is true for every j between zero and k̂, we have a contradiction. Thus, the set {H_0 g_0, ..., H_0 g_{k̂+1}} is linearly independent. Hence, (2.14) holds for k = k̂.

(⇒) Assume that (2.12)–(2.14) hold for all k such that g_{k+1} ≠ 0, but that (2.15) does not hold; i.e., there exist j and k such that g_{k+1} ≠ 0, j is between 0 and k, and

(2.16)    P_k y_j ∉ span{y_0, ..., y_{k−1}}.

This will lead to a contradiction. By construction of P_k, there exist β_0, ..., β_k such that

(2.17)    P_k y_j = Σ_{i=0}^{k} β_i y_i.

By assumption (2.16), β_k must be nonzero. From (2.13), it follows that g_{k+1}^T H_{k+1}^T y_j = 0. Using facts (i), (ii), and (iv) from before, (2.14), and (2.17), we get

    0 = g_{k+1}^T H_{k+1}^T y_j
      = g_{k+1}^T ( γ_k Q_k^T H_0 P_k + Σ_{i=1}^{m_k} z_{ik} w_{ik}^T ) y_j
      = γ_k g_{k+1}^T Q_k^T H_0 P_k y_j + Σ_{i=1}^{m_k} g_{k+1}^T z_{ik} w_{ik}^T y_j
      = γ_k g_{k+1}^T H_0 P_k y_j
      = γ_k g_{k+1}^T H_0 ( Σ_{i=0}^{k} β_i y_i )
      = γ_k Σ_{i=0}^{k} β_i g_{k+1}^T H_0 y_i
      = γ_k β_k g_{k+1}^T H_0 y_k
      = γ_k β_k ( g_{k+1}^T H_0 g_{k+1} − g_{k+1}^T H_0 g_k )
      = γ_k β_k g_{k+1}^T H_0 g_{k+1}.

Thus, since neither γ_k nor β_k is zero, we must have g_{k+1}^T H_0 g_{k+1} = 0; but this is a contradiction since H_0 is positive definite and g_{k+1} was assumed to be nonzero.

When a method produces conjugate search directions, we can say something about termination.

Corollary 2.3. Suppose we have a method of the type described in Theorem 2.2 satisfying (2.15). Suppose further that H_j g_j ≠ 0 whenever g_j ≠ 0. Then the scheme reproduces the iterates from the conjugate gradient method with preconditioner H_0 and terminates in no more than n iterations.

Proof. Let k be such that g_0, ..., g_k are all nonzero and such that H_i g_i ≠ 0 for i = 0, ..., k. Since we have a method of the type described in Theorem 2.2 satisfying (2.15), conditions (2.12)–(2.14) hold. We claim that the (k+1)st subspace of search directions, span{s_0, ..., s_k}, is equivalent to the (k+1)st Krylov subspace, span{H_0 g_0, ..., (H_0 A)^k H_0 g_0}. From (2.14), we know that span{s_0, ..., s_k} = span{H_0 g_0, ..., H_0 g_k}. We will show via induction that span{H_0 g_0, ..., H_0 g_k} = span{H_0 g_0, ..., (H_0 A)^k H_0 g_0}. The base case is trivial, so assume that

    span{H_0 g_0, ..., H_0 g_i} = span{H_0 g_0, ..., (H_0 A)^i H_0 g_0}

for some i < k. Now,

    g_{i+1} = A x_{i+1} − b = A (x_i + s_i) − b = A s_i + g_i,

and from (2.14) and the induction hypothesis,

    s_i ∈ span{H_0 g_0, ..., H_0 g_i} = span{H_0 g_0, ..., (H_0 A)^i H_0 g_0},

which implies that

    H_0 A s_i ∈ span{(H_0 A) H_0 g_0, ..., (H_0 A)^{i+1} H_0 g_0}.

So,

    H_0 g_{i+1} ∈ span{H_0 g_0, ..., (H_0 A)^{i+1} H_0 g_0}.

Hence, the search directions span the Krylov subspace. Since the search directions are conjugate (2.13) and span the Krylov subspace, the iterates are the same as those produced by conjugate gradients with preconditioner H_0. Since we produce the same iterates as the conjugate gradient method, and the conjugate gradient method is well known to terminate within n iterations, we can conclude that this scheme terminates in at most n iterations.

Note that we require that H_j g_j be nonzero whenever g_j is nonzero; this requirement is necessary since not all the methods produce positive definite updates, and it is possible to construct an update that maps g_j to zero. If this were to happen, we would have a breakdown in the method.

The next corollary defines the role that the latest information (s_k and y_k) plays in the formation of the kth H-update.

Corollary 2.4. Suppose we have a method of the type described in Theorem 2.2 satisfying (2.15). Suppose further that at the kth iteration P_k is composed of p projections of the form in (2.2). Then at least one of the projections must have u = Σ_{i=0}^{k} η_i y_i with η_k ≠ 0. Furthermore, if P_k is a single projection (p = 1), then v must be of the form v = ν_k s_k + ν_{k+1} s_{k+1} with ν_k ≠ 0.

Proof. Consider the case of p = 1. We have

    P_k = I − (u v^T)/(v^T u),

where u ∈ span{y_0, ..., y_k} and v ∈ span{s_0, ..., s_{k+1}}. We will assume that

    u = Σ_{i=0}^{k} η_i y_i and v = Σ_{i=0}^{k+1} ν_i s_i

for some scalars η_i and ν_i. By (2.15), there exist β_0, ..., β_{k−1} such that

    P_k y_k = Σ_{i=0}^{k−1} β_i y_i.

Then

    (v^T y_k)/(v^T u) u = y_k − Σ_{i=0}^{k−1} β_i y_i,

and so

(2.18)    (v^T y_k)/(v^T u) Σ_{i=0}^{k} η_i y_i = y_k − Σ_{i=0}^{k−1} β_i y_i.

From (2.13), the set {s_0, ..., s_k} is conjugate and thus linearly independent. Since we are working with a quadratic, y_i = A s_i for all i; and since A is symmetric positive definite, the set {y_0, ..., y_k} is also linearly independent. So the coefficient of y_k on the left-hand side of (2.18) must match that on the right-hand side; thus

(2.19)    η_k (v^T y_k)/(v^T u) = 1.

Hence η_k ≠ 0, and y_k must make a nontrivial contribution to P_k.

Next we will show that ν_0 = ν_1 = ⋯ = ν_{k−1} = 0. Assume that j is between 0 and k − 1. Then

    P_k y_j = y_j − (v^T y_j)/(v^T u) u = y_j − ( Σ_{i=0}^{k+1} ν_i s_i^T y_j )/(v^T u) u
            = y_j − ( Σ_{i=0}^{k+1} ν_i s_i^T A s_j )/(v^T u) u
            = y_j − ν_j (s_j^T A s_j)/(v^T u) u.

Now s_j^T A s_j is nonzero because A is positive definite. If ν_j is nonzero, then the coefficient of u is nonzero, and so y_k must make a nontrivial contribution to P_k y_j, implying that P_k y_j ∉ span{y_0, ..., y_{k−1}}. This is a contradiction. Hence, ν_j = 0.

To show that ν_k ≠ 0, consider P_k y_k. Suppose that ν_k = 0. Then

    v^T y_k = ν_{k+1} y_k^T s_{k+1} + ν_k y_k^T s_k = ν_{k+1} s_k^T A s_{k+1} = 0,

and so

    P_k y_k = y_k − (v^T y_k)/(v^T u) u = y_k.

This contradicts P_k y_k ∈ span{y_0, ..., y_{k−1}}, so ν_k must be nonzero.

Now we will discuss the p > 1 case. Label the u-components of the p projections as u_1 through u_p. Then

    P_k y_k = y_k + Σ_{i=1}^{p} τ_i u_i

for some scalars τ_1 through τ_p. We know that

    P_k y_k ∈ span{y_0, ..., y_{k−1}},

and that

    y_k ∉ span{y_0, ..., y_{k−1}}.

Thus

    y_k ∈ span{u_1, ..., u_p},

and since u_i ∈ span{y_0, ..., y_k} for i = 1, ..., p, we can conclude that at least one u_i must have a nontrivial contribution from y_k.

2.2. Examples of Methods that Reproduce the CG Iterates. Here are some specific examples of methods that fit the general form, satisfy condition (2.15) of Theorem 2.2, and thus terminate in at most n iterations.

Example. The conjugate gradient method with preconditioner H_0, see (2.4), satisfies condition (2.15) of Theorem 2.2 since

    P_k y_j = ( I − (y_k s_k^T)/(s_k^T y_k) ) y_j = y_j − (s_k^T y_j)/(s_k^T y_k) y_k,

which equals 0 for j = k and equals y_j for j = 0, ..., k − 1 (since s_k^T y_j = s_k^T A s_j = 0).

Example. Limited-memory BFGS, see (2.6), satisfies condition (2.15) of Theorem 2.2 since

    P_k y_j = 0 for j = k − m̃ + 1, ..., k, and P_k y_j = y_j for j = 0, ..., k − m̃.

Example. DFP (with full memory), see (2.8), satisfies condition (2.15) of Theorem 2.2. Consider P_k in the full-memory case. We have

    P_k = Π_{i=0}^{k} ( I − (y_i y_i^T H_i^T)/(y_i^T H_i y_i) ).

For full-memory DFP, H_i y_j = s_j for j = 0, ..., i − 1. Using this fact, one can easily verify that P_k y_j = 0 for j = 0, ..., k. Therefore, full-memory DFP satisfies condition (2.15) of Theorem 2.2. The same reasoning does not apply to the limited-memory case, as we shall show in §2.3.

The next corollary gives some ideas for other methods that are related to L-BFGS and terminate in at most n iterations on strictly convex quadratics.

Corollary 2.5. The L-BFGS method (2.5) will terminate in n iterations on an n-dimensional strictly convex quadratic function even if any combination of the following modifications is made to the update:
1. Vary the limited-memory constant, keeping m_k ≥ 1.
2. Form the projections used in V from the most recent (s_k, y_k) pair along with any set of m − 1 other pairs from {(s_0, y_0), ..., (s_{k−1}, y_{k−1})}.
3. Form the projections used in V from the most recent (s_k, y_k) pair along with any m − 1 other linear combinations of pairs from {(s_0, y_0), ..., (s_{k−1}, y_{k−1})}.
4. Iteratively rescale H_0.

Proof. For each variant, we show that the method fits the general form in (2.1) and satisfies condition (2.15) of Theorem 2.2, and hence terminates by Corollary 2.3.

1. Let m_k > 0 be any value, which may change from iteration to iteration, and define

    V_{i,k} = Π_{j=i}^{k} ( I − (y_j s_j^T)/(s_j^T y_j) ).

Choose

    γ_k = 1, m̃ = min{k+1, m_k}, P_k = Q_k = V_{k−m̃+1,k}, and w_{ik} = z_{ik} = V_{k−m̃+i+1,k}^T s_{k−m̃+i} / sqrt(s_{k−m̃+i}^T y_{k−m̃+i}).

These choices fit the general form. Furthermore,

    P_k y_j = 0 if j = k − m̃ + 1, ..., k, and P_k y_j = y_j if j = 0, 1, ..., k − m̃,

so this variation satisfies condition (2.15) of Theorem 2.2.

2. This is a special case of the next variant.

3. At iteration k, let (ŝ^{(i)}_k, ŷ^{(i)}_k) denote the ith (i = 1, ..., m̃ − 1) choice of any linear combination from the span of the set {(s_0, y_0), ..., (s_{k−1}, y_{k−1})}, and let (ŝ^{(m̃)}_k, ŷ^{(m̃)}_k) = (s_k, y_k). Define

    V_{i,k} = Π_{j=i}^{m̃} ( I − (ŷ^{(j)}_k)(ŝ^{(j)}_k)^T / ((ŝ^{(j)}_k)^T ŷ^{(j)}_k) ).

Choose

    γ_k = 1, m_k = m̃ = min{k+1, m}, P_k = Q_k = V_{1,k}, and w_{ik} = z_{ik} = V_{i+1,k}^T ŝ^{(i)}_k / sqrt((ŝ^{(i)}_k)^T ŷ^{(i)}_k).

These choices satisfy the general form (2.1). Furthermore,

    P_k y_j = 0 if y_j = ŷ^{(i)}_k for some i, and P_k y_j = y_j otherwise.

Hence, this variation satisfies condition (2.15) of Theorem 2.2.

4. Let γ_k in (2.1) be the scaling constant, and choose the other vectors and matrices as in L-BFGS (2.6).

Combinations of variants are left to the reader.

Remark. Part 3 of the previous corollary shows that the "accumulated step" method of Gill and Murray [10] terminates on quadratics.

Remark. Part 4 of the previous corollary shows that scaling does not affect termination in L-BFGS. In fact, for any method that fits the general form, it is easy to see that scaling will not affect termination on quadratics.

2.3. Examples of Methods that Do Not Reproduce the CG Iterates. We will discuss several methods that fit the general form given in (2.1) but do not satisfy the conditions of Theorem 2.2.

Example. Steepest descent, see (2.3), does not satisfy condition (2.15) of Theorem 2.2 and thus does not produce conjugate search directions. This fact is well known; see, e.g., Luenberger [16].

Example. Limited-memory DFP, see (2.8), with m < n does not satisfy the condition on P_k (2.15) for all k, and so the method will not produce conjugate directions. For example, for a particular convex quadratic with data A and b (given in [9]), using a limited-memory constant of m = 1 and exact arithmetic, it can be seen that the iteration does not terminate within the first 20 iterations of limited-memory DFP with H_0 = I. The MAPLE notebook file used to compute this example is available on the World Wide Web [9].

Remark. Using the above example, we can easily see that no limited-memory Broyden class method except limited-memory BFGS terminates within the first n iterations.

3. Update-Skipping Variations for Broyden Class Quasi-Newton Algorithms. The previous section discussed limited-memory methods that behave like conjugate gradients on n-dimensional strictly convex quadratic functions. In this section, we are concerned with methods that skip some updates in order to reduce the memory demands. We establish conditions under which finite termination is preserved but delayed for the Broyden Class.

3.1. Termination when Updates are Skipped. It was shown by Powell [26] that if we skip every other update and take direct prediction steps (i.e., steps of length one) in a Broyden class method, then the procedure will terminate in no more than 2n + 1 iterations on an n-dimensional strictly convex quadratic function. An alternate proof of this result is given by Nazareth [21]. We will prove a related result. Suppose that we are doing exact line searches using a Broyden Class quasi-Newton method on a strictly convex quadratic function and decide to "skip" p updates to H (i.e., choose H_{k+1} = H_k on p occasions). Then the algorithm terminates in no more than n + p iterations.

In contrast to Powell's result, it does not matter which updates are skipped or if multiple updates are skipped in a row.

Theorem 3.1. Suppose that a Broyden Class method using exact line searches is applied to an n-dimensional strictly convex quadratic function and p updates are skipped. Let

    J(k) = { j ≤ k : the update at iteration j is not skipped }.

Then for all k = 0, 1, ...,

(3.1)    g_{k+1}^T s_j = 0, for all j ∈ J(k); and
(3.2)    s_{k+1}^T A s_j = 0, for all j ∈ J(k).

Furthermore, the method terminates in at most n + p iterations at the exact minimizer.

Proof. We will use induction on k to show (3.1) and

(3.3)    H_{k+1} y_j = s_j, for all j ∈ J(k).

Then (3.2) follows easily since for all j ∈ J(k),

    s_{k+1}^T A s_j = −α_{k+1} g_{k+1}^T H_{k+1} y_j = −α_{k+1} g_{k+1}^T s_j = 0.

Let k_0 be the least value of k such that J(k) is nonempty; i.e., J(k_0) = {k_0}. Then g_{k_0+1} is orthogonal to s_{k_0} since line searches are exact, and H_{k_0+1} y_{k_0} = s_{k_0} since all members of the Broyden Family satisfy the secant condition. Hence, the base case is true.

Now assume that (3.1) and (3.3) hold for all values of k = 0, 1, ..., k̂ − 1. We will show that they also hold for k = k̂.

Case I. Suppose that k̂ ∉ J(k̂). Then H_{k̂+1} = H_{k̂} and J(k̂) = J(k̂ − 1), so for any j ∈ J(k̂),

(3.4)    g_{k̂+1}^T s_j = (g_{k̂} + A s_{k̂})^T s_j = g_{k̂}^T s_j + s_{k̂}^T A s_j = 0,

and

    H_{k̂+1} y_j = H_{k̂} y_j = s_j.

Case II. Suppose that k̂ ∈ J(k̂). Then H_{k̂+1} satisfies the secant condition and J(k̂) = J(k̂ − 1) ∪ {k̂}. Now g_{k̂+1} is orthogonal to s_{k̂} since the line searches are exact, and it is orthogonal to the older s_j by the argument in (3.4). The secant condition guarantees that H_{k̂+1} y_{k̂} = s_{k̂}, and for j ∈ J(k̂) with j ≠ k̂ we have

    H_{k̂+1} y_j = H_{k̂} y_j + (s_{k̂} s_{k̂}^T y_j)/(s_{k̂}^T y_{k̂}) − (H_{k̂} y_{k̂} y_{k̂}^T H_{k̂} y_j)/(y_{k̂}^T H_{k̂} y_{k̂})
                 + φ (y_{k̂}^T H_{k̂} y_{k̂}) ( s_{k̂}/(s_{k̂}^T y_{k̂}) − (H_{k̂} y_{k̂})/(y_{k̂}^T H_{k̂} y_{k̂}) ) ( s_{k̂}/(s_{k̂}^T y_{k̂}) − (H_{k̂} y_{k̂})/(y_{k̂}^T H_{k̂} y_{k̂}) )^T y_j
                 = s_j,

since every correction term annihilates y_j: s_{k̂}^T y_j = s_{k̂}^T A s_j = 0 and y_{k̂}^T H_{k̂} y_j = y_{k̂}^T s_j = s_{k̂}^T A s_j = 0.

In either case, the induction result follows.

Suppose that we skip p updates. Then the set J(n − 1 + p) has cardinality n. Without loss of generality, assume that the set {s_i}_{i ∈ J(n−1+p)} has no zero elements. From (3.2), the vectors are linearly independent. By (3.1),

    g_{n+p}^T s_j = 0, for all j ∈ J(n − 1 + p),

and so g_{n+p} must be zero. This implies that x_{n+p} is the exact minimizer of f.

17 L-BFGS Variations Loss of Termination for Update Sipping with Limited-Memory. Unfortunately, updates that use both limited-memory and repeated update-sipping do not produce n conjugate search directions for n-dimensional strictly convex quadratics, and the termination property is lost. We will show a simple example, limitedmemory BFGS with m = 1, sipping every other update. Note that according to Corollary 2.4, we would still be guaranteed termination if we used the most recent information in each update. Example. Suppose that we have a convex quadratic with A = ; and b = We apply limited-memory BFGS with limited-memory constant m = 1 and H 0 = I and sip every-other update to H. Using exact arithmetic in MAPLE, we observe that the process does not terminate even after 100 iterations [9]. 4. Experimental Results. The results of x2 and x3 lead to a number of ideas for new methods for unconstrained optimization. In this section, we motivate, develop, and test these ideas. We describe the collection of test problems in x4.2. The test environment is described in x4.3. Section outlines the implementation of the L-BFGS method (our base for all comparisons) and xx4.4.2{4.4.7 describe the variations. Pseudo-code for L-BFGS and its variations is given in Appendix B. Complete numerical results, many graphs of the numerical results, and the original FORTRAN code are available [9] Motivation. So far we have only given results for convex quadratic functions. While termination on quadratics is beautiful in theory, it does not necessarily yield insight into how these methods will do in practice. We will not present any new results relating to convergence of these algorithms on general functions; however, many of these can be shown to converge using the convergence analysis presented in x7 of [15]. 
In [15], Liu and Nocedal show that a limited-memory BFGS method implemented with a line search that satisfies the strong Wolfe conditions (see §4.3 for a definition) is R-linearly convergent on a convex function that satisfies a few modest conditions.

4.2. Test Problems. For our test problems, we used the Constrained and Unconstrained Testing Environment (CUTE) by Bongartz, Conn, Gould and Toint. The package is documented in [3] and can be obtained via the world wide web [2] or via ftp [1]. The package contains a large collection of test problems as well as the interfaces necessary for using the problems. The test problems are stored as "SIF" files. We chose a collection of 22 unconstrained problems. The problems ranged in size from 10 to 10,000 variables, but each took L-BFGS with limited-memory constant m = 5 at least 60 iterations to solve. Table 4.1 enumerates the problems, giving the SIF file name, the dimension (n), and a description for each problem. The CUTE package also provides a starting point (x_0) for each problem.

4.3. Test Environment. We used FORTRAN77 code on an SGI Indigo 2 to run the algorithms, with FORTRAN BLAS routines from NETLIB. We used the compiler's default optimization level. Figure 2.1 outlines the general quasi-Newton implementation that we followed. For the line search, we use the routines cvsrch and cstep written by Jorge J. Moré

No.  SIF Name  n  Description & Reference
1  EXTROSNB  10  Extended Rosenbrock function (nonseparable version) [30, Problem 10].
2  WATSONS  31  Watson problem [17, Problem 20].
3  TOINTGOR  50  Toint's operations research problem [29].
4  TOINTPSP  50  Toint's PSP operations research problem [29].
5  CHNROSNB  50  Chained Rosenbrock function [29].
6  ERRINROS  50  Nonlinear problem similar to CHNROSNB [28].
7  FLETCHBV  100  Fletcher's boundary value problem [8, Problem 1].
8  FLETCHCR  100  Fletcher's chained Rosenbrock function [8, Problem 2].
9  PENALTY2  100  Second penalty problem [17, Problem 24].
10  GENROSE  500  Generalized Rosenbrock function [18, Problem 5].
11  BDQRTIC  1000  Quartic with a banded Hessian with bandwidth = 9 [5, Problem 61].
12  BROYDN7D  1000  Seven-diagonal variant of the Broyden tridiagonal system with a band away from the diagonal [29].
13  PENALTY1  1000  First penalty problem [17, Problem 23].
14  POWER  1000  Power problem by Oren [25].
15  MSQRTALS  1024  The dense matrix square root problem by Nocedal and Liu (case 0), seen as a nonlinear equation problem [4, Problem 204].
16  MSQRTBLS  1025  The dense matrix square root problem by Nocedal and Liu (case 1), seen as a nonlinear equation problem [4, Problem 201].
17  CRAGGLVY  5000  Extended Cragg & Levy problem [30, Problem 32].
18  NONDQUAR  Nondiagonal quartic test problem [5, Problem 57].
19  POWELLSG  Extended Powell singular function [17, Problem 13].
20  SINQUAD  Another function with nontrivial groups and repetitious elements [12].
21  SPMSRTLS  Liu and Nocedal tridiagonal matrix square root problem [4, Problem 151].
22  TRIDIA  Shanno's TRIDIA quadratic tridiagonal problem [30, Problem 8].
Table 4.1. Test problem collection. Each problem was chosen from the CUTE package.

and David Thuente from a 1983 version of MINPACK. This line search routine finds an α that meets the strong Wolfe conditions,

(4.1)  f(x + αd) ≤ f(x) + ω_1 α g(x)^T d,
(4.2)  |g(x + αd)^T d| ≤ ω_2 |g(x)^T d|;

see, e.g., Nocedal [23]. We used ω_1 = 1.0 × 10^{-4} and ω_2 = 0.9. Except for the first iteration, we always attempt a step length of 1.0 first and only use an alternate value if 1.0 does not satisfy the Wolfe conditions. In the first iteration, we initially try a step length equal to ||g_0||^{-1}. The remaining line search parameters are detailed in Appendix A. We generate the matrix H by either the limited-memory update or one of the variations described in §4.4, storing the matrix implicitly in order to save both memory and computation time. We terminate the iterations if any of the following conditions are met at iteration k:

1. the inequality ||g_k|| / ||x_k|| < ε is satisfied for a fixed tolerance ε,
2. the line search fails, or
3. the number of iterations exceeds 3000.
We say that the iterates have converged if the first condition is satisfied. Otherwise, the method has failed.

4.4. L-BFGS and its variations. We tried a number of variations on the standard L-BFGS algorithm. L-BFGS and these variations are described in this subsection and summarized in Table 4.2.

4.4.1. L-BFGS: Algorithm 0. The limited-memory BFGS update is given in (2.5) and described fully by Byrd, Nocedal and Schnabel [22]. Our implementation and the following description come essentially from [22]. Let H_k^0 be symmetric and positive definite, and assume that the m̃ pairs {s_i, y_i}, i = k − m̃, ..., k − 1, each satisfy s_i^T y_i > 0. We will let

S_k = [s_{k−m̃}  s_{k−m̃+1}  ...  s_{k−1}]  and  Y_k = [y_{k−m̃}  y_{k−m̃+1}  ...  y_{k−1}],

where m̃ = min{k, m} and m is some positive integer. We will assume that H_k^0 = γ_k I; that is, the initial matrix is iteratively rescaled by a constant, as is commonly done in practice. Then the matrix H_k obtained by m̃ applications of the limited-memory BFGS update can be expressed as

H_k = γ_k I + [S_k  Y_k] [ U_k^{-T} (D_k + γ_k Y_k^T Y_k) U_k^{-1}    −γ_k U_k^{-T} ;
                           −γ_k U_k^{-1}                              0 ] [ S_k^T ; Y_k^T ],

where U_k and D_k are the m̃ × m̃ matrices given by

(U_k)_{ij} = s_{k−m̃−1+i}^T y_{k−m̃−1+j}  if i ≤ j,  and 0 otherwise,

and

D_k = diag{ s_{k−m̃}^T y_{k−m̃}, ..., s_{k−1}^T y_{k−1} }.

We will describe how to compute d_k = −H_k g_k in the case that k > 0. Let x_k be the current iterate, and let m̃ = min{k, m}. Given s_{k−1}, y_{k−1}, g_k, the matrices S_{k−1}, Y_{k−1}, U_{k−1}^{-1}, Y_{k−1}^T Y_{k−1}, D_{k−1}, and the vectors S_{k−1}^T g_{k−1} and Y_{k−1}^T g_{k−1}:

1. Update S_{k−1} and Y_{k−1} to get the n × m̃ matrices S_k and Y_k using s_{k−1} and y_{k−1} (discarding the oldest pair if the memory is full).
2. Compute the m̃-vectors S_k^T g_k and Y_k^T g_k.
3. Compute the m̃-vectors S_k^T y_{k−1} and Y_k^T y_{k−1} by using the fact that y_{k−1} = g_k − g_{k−1}: we already know m̃ − 1 components of S_k^T g_{k−1} from S_{k−1}^T g_{k−1}, and likewise for Y_k^T g_{k−1}. We need only compute s_{k−1}^T g_{k−1} and y_{k−1}^T g_{k−1} and do the subtractions.

No.  Reference  Brief Description
0  §4.4.1  L-BFGS with no options.
1  §4.4.2, Variation 1  Allow m to vary iteratively, basing the choice of m_k on ||g|| and not allowing m_k to decrease.
2  §4.4.2, Variation 2  Allow m to vary iteratively, basing the choice of m_k on ||g|| and allowing m_k to decrease.
3  §4.4.2, Variation 3  Allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and not allowing m_k to decrease.
4  §4.4.2, Variation 4  Allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and allowing m_k to decrease.
5  §4.4.3  Dispose of old information if the step length is greater than one.
6  §4.4.4, Variation 1  Back up if the current iteration is odd.
7  §4.4.4, Variation 2  Back up if the current iteration is even.
8  §4.4.4, Variation 3  Back up if a step length of 1.0 was used in the last iteration.
9  §4.4.4, Variation 4  Back up if ||g_k|| > ||g_{k−1}||.
10  §4.4.4, Variation 3*  Back up if a step length of 1.0 was used in the last iteration and we did not back up on the last iteration.
11  §4.4.4, Variation 4*  Back up if ||g_k|| > ||g_{k−1}|| and we did not back up on the last iteration.
12  §4.4.5, Variation 1  Merge if neither of the two vectors to be merged is itself the result of a merge and the 2nd and 3rd most recent steps taken were of length 1.0.
13  §4.4.5, Variation 2  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge.
14  §4.4.6, Variation 1  Skip update on odd iterations.
15  §4.4.6, Variation 2  Skip update on even iterations.
16  §4.4.6, Variation 3  Skip update if ||g_{k+1}|| > ||g_k||.
17  Alg. 5 & Alg. 8  Dispose of old information, and back up on the next iteration if the step length is greater than one.
18  Alg. 13 & Alg. 1  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g|| and not allowing m_k to decrease.
19  Alg. 13 & Alg. 3  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and not allowing m_k to decrease.
20  Alg. 13 & Alg. 2  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g|| and allowing m_k to decrease.
21  Alg. 13 & Alg. 4  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and allowing m_k to decrease.
Table 4.2. Description of the numerical algorithms.

4. Compute U_k^{-1}. Rather than recomputing U_k, we update the matrix from the previous iteration by deleting the leftmost column and topmost row (if the memory is already full) and appending a new column on the right and a new row on the bottom. Let ρ_{k−1} = 1 / (s_{k−1}^T y_{k−1}), let (U_{k−1}^{-1})' be the (m̃ − 1) × (m̃ − 1) lower right submatrix of U_{k−1}^{-1}, and let (S_k^T y_{k−1})' be the upper m̃ − 1 elements of S_k^T y_{k−1}. Then

U_k^{-1} = [ (U_{k−1}^{-1})'    −ρ_{k−1} (U_{k−1}^{-1})' (S_k^T y_{k−1})' ;
             0                  ρ_{k−1} ].

Note that s_{k−1}^T y_{k−1} = (S_k^T y_{k−1})_{m̃}, and so it is already computed.
5. Assemble Y_k^T Y_k. We have already computed all of the components.
6. Update D_k using D_{k−1} and s_{k−1}^T y_{k−1} = (S_k^T y_{k−1})_{m̃}.
7. Compute γ_k = y_{k−1}^T s_{k−1} / y_{k−1}^T y_{k−1}. Note that both y_{k−1}^T s_{k−1} and y_{k−1}^T y_{k−1} have already been computed.
8. Compute the two intermediate values

p_1 = U_k^{-1} S_k^T g_k,
p_2 = U_k^{-T} ( γ_k Y_k^T Y_k p_1 + D_k p_1 − γ_k Y_k^T g_k ).

9. Compute d_k = γ_k Y_k p_1 − S_k p_2 − γ_k g_k.

The storage costs for this method are very low. In order to reconstruct H_k, we need to store S_k, Y_k, U_k^{-1}, Y_k^T Y_k, and D_k (a diagonal matrix), plus a few m-vectors. This requires only 2mn + 2m^2 + O(m) storage. Assuming m << n, this is much less than the n^2 storage required for a typical implementation of BFGS.

Step  Operation Count
2     4mn
3     4n + 2m
4     2m^2 + 4m
8     m^2 + 2m
9     4m^2 + 2m
Table 4.3. Operation count for the computation of H_k g_k. Steps with no operations are not shown.

The computation of H_k g_k takes at most O(mn) operations, assuming n >> m (see Table 4.3). This is much less than the O(n^2) time normally needed to compute H_k g_k when the whole matrix H_k is stored. We are using L-BFGS as our basis for comparison. For information on the performance of L-BFGS, see Liu and Nocedal [15] and Nash and Nocedal [19].

4.4.2. Varying m iteratively: Algorithms 1-4. In typical implementations of L-BFGS, m is fixed throughout the iterations: once m updates have accumulated, m updates are always used. We considered the possibility of varying m iteratively while preserving finite termination on convex quadratics. Using an argument similar to that presented in [15], we can also prove that this algorithm has a linear rate of convergence on a convex function that satisfies a few modest conditions. We tried four different variations on this theme. All were based on the following formula, which scales m_k in relation to the size of a convergence measure φ_k derived from ||g_k||. Let m_k be the number of iterates saved at the kth iteration, with m_0 = 1, and think of m̄ as the maximum allowable value of m_k. Let the convergence test be given by ||g_k|| / ||x_k|| < ε.
Then the formula for m_k at iteration k is

m_k = min{ m_{k−1} + 1,  (m̄ − 1) (log φ_0 − log φ_k) / (log φ_0 − log(100ε)) + 1 },

where φ_k is the convergence measure specified below.
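The memory-sizing rule can be sketched as follows. This is a reconstruction: the transcription of the formula is damaged, so the clamping to [1, m̄], the ceiling, and the exact constants are our assumptions; what is clear is that m_k interpolates between 1 and m̄ on a log scale of the convergence measure and grows by at most one pair per iteration:

```python
import math

def memory_size(phi_k, phi_0, eps, m_bar, m_prev):
    """Log-scaled memory rule in the spirit of x4.4.2: m grows from 1 toward
    m_bar as the convergence measure phi falls from phi_0 toward eps, with
    growth limited to one new pair per iteration.  The clamping, ceiling,
    and the 100*eps endpoint are our own assumptions."""
    frac = (math.log(phi_0) - math.log(phi_k)) / (math.log(phi_0) - math.log(100 * eps))
    frac = min(max(frac, 0.0), 1.0)            # clamp progress to [0, 1]
    target = math.ceil((m_bar - 1) * frac) + 1  # interpolate between 1 and m_bar
    return min(m_prev + 1, target)              # grow by at most one per iteration

if __name__ == "__main__":
    # phi starts at 1.0; tolerance eps = 1e-5; maximum memory m_bar = 10.
    print(memory_size(1.0, 1.0, 1e-5, 10, m_prev=5))    # start: m = 1
    print(memory_size(1e-3, 1.0, 1e-5, 10, m_prev=20))  # near eps: m = m_bar
```

The min with m_prev + 1 reproduces the paper's restriction that the memory can accumulate only one new (s, y) pair per iteration; the non-decreasing variants would additionally take a max with m_prev.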

Alg. No.  m = 5  m = 10  m = 15  m = 50
Table 4.4. The number of failures of the algorithms on the 22 test problems. An algorithm is said to have "failed" on a particular problem if a line search fails or the maximum allowable number of iterations (3000 in our case) is exceeded.

We have two choices for φ, and a choice of whether or not we will allow m_k to decrease as well as increase. The four variations are:
1. φ = ||g||, requiring m_k ≥ m_{k−1};
2. φ = ||g||;
3. φ = ||g||/||x||, requiring m_k ≥ m_{k−1}; and
4. φ = ||g||/||x||.
We used four values of m (5, 10, 15, and 50) for each algorithm. The results are summarized in Tables 4.4-4.8. More extensive results can be obtained [9]. Table 4.4 shows that these algorithms had roughly the same number of failures as L-BFGS. Table 4.5 compares each algorithm to L-BFGS in terms of function evaluations. For each algorithm and each value of m, the number of times that the algorithm used as few or fewer function evaluations than L-BFGS is listed relative to the total number of admissible problems. Problems are admissible if at least one of the two methods solved them. We observe that in all but three cases, the algorithm used as few or fewer function evaluations than L-BFGS on over half the test problems. Table 4.6 compares each algorithm to L-BFGS in terms of time. The entries are similar to those in Table 4.5. Observe that Algorithms 1-4 did very well in terms of time, doing as well as or better than L-BFGS in nearly every case. For each problem and each algorithm, we computed the ratio of the number of function evaluations for the algorithm to the number of function evaluations for L-BFGS. Table 4.7 lists the means of these ratios. A mean below 1.0 implies that the algorithm does better than L-BFGS on average. The average is better for the algorithms in 6 out of 16 cases for the first four algorithms. Observe, however, that all the means are close to one.
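Before turning to the remaining variations, the compact representation of §4.4.1 is easy to sanity-check numerically: applying m explicit BFGS inverse updates to H_0 = γI must produce the same direction as the p_1, p_2 formulas of steps 8 and 9. A non-incremental sketch (the random test data, the sizes n = 6 and m = 3, and γ = 0.7 are our own choices; this is not the authors' FORTRAN implementation):

```python
import numpy as np

def compact_hg(S, Y, gamma, g):
    """Compute d = -H g via the compact representation of x4.4.1 (steps 8-9),
    where H is built from the columns of S and Y (oldest first) on H0 = gamma*I."""
    U = np.triu(S.T @ Y)                 # (U)_ij = s_i' y_j for i <= j
    D = np.diag(np.diag(S.T @ Y))
    Uinv = np.linalg.inv(U)
    p1 = Uinv @ (S.T @ g)
    p2 = Uinv.T @ (gamma * (Y.T @ Y) @ p1 + D @ p1 - gamma * (Y.T @ g))
    return gamma * (Y @ p1) - S @ p2 - gamma * g

def recursive_hg(S, Y, gamma, g):
    """Reference: apply the BFGS inverse update explicitly, oldest pair first."""
    n = S.shape[0]
    H = gamma * np.eye(n)
    for i in range(S.shape[1]):
        s, y = S[:, i], Y[:, i]
        rho = 1.0 / (s @ y)
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
    return -H @ g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.standard_normal((6, 3))
    Y = S + 0.1 * rng.standard_normal((6, 3))   # keeps s_i' y_i > 0
    g = rng.standard_normal(6)
    print(np.allclose(compact_hg(S, Y, 0.7, g), recursive_hg(S, Y, 0.7, g)))
```

Agreement of the two routines reflects the compact-representation result quoted from [22]; the production code instead maintains U_k^{-1}, Y_k^T Y_k, and D_k incrementally, as in steps 1-7.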

Alg. No.  m = 5  m = 10  m = 15  m = 50
1 /22 10/22 17/22 17/22 2 7/22 13/22 13/22 19/ /21 14/22 12/22 15/ /21 17/22 15/22 16/ /22 20/22 20/22 21/ /21 22/22 22/22 21/21 7 8/22 12/22 10/22 10/ /22 14/22 12/22 15/22 9 6/22 13/22 12/22 16/ /22 14/22 12/22 15/ /22 10/22 11/22 14/ /21 22/22 22/22 21/ /22 4/22 4/22 4/ /21 2/22 2/22 2/ /22 1/22 1/22 1/ /22 1/22 1/22 1/ /22 13/22 12/22 14/ /22 4/22 5/22 4/ /22 3/22 4/22 4/ /22 4/22 4/22 5/ /22 2/22 4/22 4/22
Table 4.5. Function Evaluations Comparison. The first number in each entry is the number of times the algorithm did as well as or better than normal L-BFGS in terms of function evaluations. The second number is the total number of problems solved by at least one of the two methods (the algorithm and/or L-BFGS).

We experience savings in terms of time for the first four algorithms. These algorithms tend to save fewer vectors than L-BFGS, since m_k is typically less than m, and so less work is done computing H_k g_k in these algorithms. Table 4.8 gives the mean of the ratios of time to solve for each value of m in each algorithm. Note that most of the ratios are far below one in this case. These variations did particularly well on problem 7. See [9] for more information.

4.4.3. Disposing of old information: Algorithm 5. We may decide that we are storing too much old information and that we should stop using it. For example, we may choose to throw away everything except the most recent information whenever we take a big step, since the old information may not be relevant to the new neighborhood. We use the following test: if the last step length was bigger than 1, dispose of the old information. The algorithm performed nearly the same as L-BFGS. There was substantial deviation on only one or two problems for each value of m, and this seemed evenly divided between better and worse. From Table 4.4, we see that this algorithm successfully converged on every problem.
Table 4.5 shows that it almost always did as well as or better than L-BFGS in terms of function evaluations. However, Table 4.7 shows that the differences were minor. In terms of time, we observe that the algorithm was generally faster than L-BFGS (Table 4.6), but again, considering the mean ratios of time (Table 4.8), the differences were minor. The method also does particularly well on problem 7 [9].

4.4.4. Backing Up in the Update to H: Algorithms 6-11. As discussed in §2.2, if we always use the most recent s and y in the update, we preserve quadratic termination regardless of which older values of s and y we use.
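This property is easy to illustrate numerically. The sketch below (our own choices: m = 2, a diag(1,...,5) quadratic, H_0 = I, and a back-up trigger on every iteration; not the authors' code) discards the second most recent pair at each step, so the memory holds the newest pair plus one stale pair, yet n-step termination on a strictly convex quadratic with exact line searches is retained:

```python
import numpy as np

def lbfgs_backup(A, b, x0, back_up=True, tol=1e-8, max_iter=200):
    """L-BFGS with m = 2 and exact line searches on f = 0.5 x'Ax - b'x.
    With back_up=True, the second most recent (s, y) pair is discarded
    every iteration; the newest pair is always used."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b
    pairs = []                                   # stored pairs, oldest first
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            return k
        q = g.copy()                             # two-loop recursion, H0 = I
        coeffs = []
        for s, y in reversed(pairs):             # newest pair first
            rho = 1.0 / (s @ y)
            a = rho * (s @ q)
            coeffs.append((a, rho, s, y))
            q -= a * y
        r = q
        for a, rho, s, y in reversed(coeffs):    # oldest pair first
            r += (a - rho * (y @ r)) * s
        d = -r
        alpha = -(g @ d) / (d @ (A @ d))         # exact line search
        s_new = alpha * d
        x = x + s_new
        g_new = A @ x - b
        if back_up and len(pairs) == 2:
            pairs.pop()                          # drop the 2nd most recent pair
        pairs.append((s_new, g_new - g))
        pairs = pairs[-2:]                       # memory constant m = 2
        g = g_new
    return max_iter

if __name__ == "__main__":
    A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
    b = np.ones(5)
    print(lbfgs_backup(A, b, np.zeros(5), back_up=False))  # n = 5 steps
    print(lbfgs_backup(A, b, np.zeros(5), back_up=True))   # still n = 5 steps
```

Both runs terminate within n iterations, consistent with the statement above; contrast this with the skipping example of §3, where the most recent pair is not always used and termination is lost.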

Alg. No.  m = 5  m = 10  m = 15  m = 50
1 17/22 18/22 20/22 18/ /22 19/22 18/22 18/ /21 14/22 15/22 15/ /21 18/22 20/22 18/ /22 13/22 14/22 15/ /21 19/22 15/22 15/ /22 11/22 10/22 7/ /22 7/22 6/22 5/22 9 9/22 10/22 7/22 8/ /22 8/22 5/22 5/ /22 8/22 9/22 5/ /21 12/22 8/22 11/ /22 10/22 13/22 17/ /21 2/22 2/22 3/ /22 6/22 9/22 10/ /22 2/22 3/22 3/ /22 8/22 5/22 4/ /22 14/22 19/22 20/ /22 11/22 17/22 19/ /22 14/22 17/22 19/ /22 16/22 16/22 18/22
Table 4.6. Time Comparison. The first number in each entry is the number of times the algorithm did as well as or better than normal L-BFGS in terms of time. The second number is the total number of problems solved by at least one of the two methods (the algorithm and/or L-BFGS).

Using this idea, we created some algorithms. Under certain conditions, we discard the next most recent values of s and y in forming H, although we still use the most recent s and y vectors and any other vectors that have been saved from previous iterations. We call this "backing up" because it is as if we back up over the next most recent values of s and y. These algorithms used the following four tests to trigger backing up:
1. The current iteration is odd.
2. The current iteration is even.
3. A step length of 1.0 was used in the last iteration.
4. ||g_k|| > ||g_{k−1}||.
In two additional algorithms, we varied situations 3 and 4 by not allowing a back-up if a back-up was performed on the previous iteration. The backing-up strategy seemed robust in terms of failures: in 4 out of the 6 variations we tried for this algorithm, there were no failures at all. See Table 4.4 for more information. It is interesting to observe that backing up on odd iterations (Algorithm 6) and backing up on even iterations (Algorithm 7) produced very different results. Backing up on odd iterations seemed to have almost no effect on the number of function evaluations (Table 4.7) and little effect on the time (Table 4.8).
However, backing up on even iterations causes much different behavior from L-BFGS: it does worse than L-BFGS on most problems, but better on a few. Algorithms 8 and 10 were two variations of the same idea: backing up if the previous step length was one. This wipes out the data from the previous iteration after it has been used in one update. Both show improvement over L-BFGS in terms of function evaluations; in fact, these two algorithms have the best function-evaluation ratio for the m = 50 case (Table 4.7). Unfortunately, these algorithms did not compete with L-BFGS in terms of time (Table 4.8). There is little difference between


1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

Linear Algebra (part 1) : Matrices and Systems of Linear Equations (by Evan Dummit, 2016, v. 2.02)

Linear Algebra (part 1) : Matrices and Systems of Linear Equations (by Evan Dummit, 2016, v. 2.02) Linear Algebra (part ) : Matrices and Systems of Linear Equations (by Evan Dummit, 206, v 202) Contents 2 Matrices and Systems of Linear Equations 2 Systems of Linear Equations 2 Elimination, Matrix Formulation

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS

ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS JOHANNES BRUST, OLEG BURDAKOV, JENNIFER B. ERWAY, ROUMMEL F. MARCIA, AND YA-XIANG YUAN Abstract. We present

More information

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition) Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational

More information

The Conjugate Gradient Method for Solving Linear Systems of Equations

The Conjugate Gradient Method for Solving Linear Systems of Equations The Conjugate Gradient Method for Solving Linear Systems of Equations Mike Rambo Mentor: Hans de Moor May 2016 Department of Mathematics, Saint Mary s College of California Contents 1 Introduction 2 2

More information

Optimization and Root Finding. Kurt Hornik

Optimization and Root Finding. Kurt Hornik Optimization and Root Finding Kurt Hornik Basics Root finding and unconstrained smooth optimization are closely related: Solving ƒ () = 0 can be accomplished via minimizing ƒ () 2 Slide 2 Basics Root finding

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

MATH 4211/6211 Optimization Quasi-Newton Method

MATH 4211/6211 Optimization Quasi-Newton Method MATH 4211/6211 Optimization Quasi-Newton Method Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 Quasi-Newton Method Motivation:

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

Unconstrained Multivariate Optimization

Unconstrained Multivariate Optimization Unconstrained Multivariate Optimization Multivariate optimization means optimization of a scalar function of a several variables: and has the general form: y = () min ( ) where () is a nonlinear scalar-valued

More information

4.1 Eigenvalues, Eigenvectors, and The Characteristic Polynomial

4.1 Eigenvalues, Eigenvectors, and The Characteristic Polynomial Linear Algebra (part 4): Eigenvalues, Diagonalization, and the Jordan Form (by Evan Dummit, 27, v ) Contents 4 Eigenvalues, Diagonalization, and the Jordan Canonical Form 4 Eigenvalues, Eigenvectors, and

More information

MA106 Linear Algebra lecture notes

MA106 Linear Algebra lecture notes MA106 Linear Algebra lecture notes Lecturers: Diane Maclagan and Damiano Testa 2017-18 Term 2 Contents 1 Introduction 3 2 Matrix review 3 3 Gaussian Elimination 5 3.1 Linear equations and matrices.......................

More information

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM JIAYI GUO AND A.S. LEWIS Abstract. The popular BFGS quasi-newton minimization algorithm under reasonable conditions converges globally on smooth

More information

x k+1 = x k + α k p k (13.1)

x k+1 = x k + α k p k (13.1) 13 Gradient Descent Methods Lab Objective: Iterative optimization methods choose a search direction and a step size at each iteration One simple choice for the search direction is the negative gradient,

More information

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St Structured Lower Rank Approximation by Moody T. Chu (NCSU) joint with Robert E. Funderlic (NCSU) and Robert J. Plemmons (Wake Forest) March 5, 1998 Outline Introduction: Problem Description Diculties Algebraic

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Optimization II: Unconstrained

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Gradient-Based Optimization

Gradient-Based Optimization Multidisciplinary Design Optimization 48 Chapter 3 Gradient-Based Optimization 3. Introduction In Chapter we described methods to minimize (or at least decrease) a function of one variable. While problems

More information

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion:

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: Unconstrained Convex Optimization 21 4 Newton Method H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: f(x + p) f(x)+p T f(x)+ 1 2 pt H(x)p ˆf(p) In general, ˆf(p) won

More information

A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality constraints

A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality constraints Journal of Computational and Applied Mathematics 161 (003) 1 5 www.elsevier.com/locate/cam A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality

More information

Optimization Methods

Optimization Methods Optimization Methods Decision making Examples: determining which ingredients and in what quantities to add to a mixture being made so that it will meet specifications on its composition allocating available

More information

Preconditioned conjugate gradient algorithms with column scaling

Preconditioned conjugate gradient algorithms with column scaling Proceedings of the 47th IEEE Conference on Decision and Control Cancun, Mexico, Dec. 9-11, 28 Preconditioned conjugate gradient algorithms with column scaling R. Pytla Institute of Automatic Control and

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations Linear Algebra (part ) : Matrices and Systems of Linear Equations (by Evan Dummit, 207, v 260) Contents Matrices and Systems of Linear Equations Systems of Linear Equations Elimination, Matrix Formulation

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

1. Introduction In this paper we study a new type of trust region method for solving the unconstrained optimization problem, min f(x): (1.1) x2< n Our

1. Introduction In this paper we study a new type of trust region method for solving the unconstrained optimization problem, min f(x): (1.1) x2< n Our Combining Trust Region and Line Search Techniques Jorge Nocedal y and Ya-xiang Yuan z March 14, 1998 Report OTC 98/04 Optimization Technology Center Abstract We propose an algorithm for nonlinear optimization

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Introduction to Nonlinear Optimization Paul J. Atzberger

Introduction to Nonlinear Optimization Paul J. Atzberger Introduction to Nonlinear Optimization Paul J. Atzberger Comments should be sent to: atzberg@math.ucsb.edu Introduction We shall discuss in these notes a brief introduction to nonlinear optimization concepts,

More information

IE 5531: Engineering Optimization I

IE 5531: Engineering Optimization I IE 5531: Engineering Optimization I Lecture 15: Nonlinear optimization Prof. John Gunnar Carlsson November 1, 2010 Prof. John Gunnar Carlsson IE 5531: Engineering Optimization I November 1, 2010 1 / 24

More information

Institute for Advanced Computer Studies. Department of Computer Science. On the Convergence of. Multipoint Iterations. G. W. Stewart y.

Institute for Advanced Computer Studies. Department of Computer Science. On the Convergence of. Multipoint Iterations. G. W. Stewart y. University of Maryland Institute for Advanced Computer Studies Department of Computer Science College Park TR{93{10 TR{3030 On the Convergence of Multipoint Iterations G. W. Stewart y February, 1993 Reviseed,

More information

Matrix Derivatives and Descent Optimization Methods

Matrix Derivatives and Descent Optimization Methods Matrix Derivatives and Descent Optimization Methods 1 Qiang Ning Department of Electrical and Computer Engineering Beckman Institute for Advanced Science and Techonology University of Illinois at Urbana-Champaign

More information

2 Chapter 1 rely on approximating (x) by using progressively ner discretizations of [0; 1] (see, e.g. [5, 7, 8, 16, 18, 19, 20, 23]). Specically, such

2 Chapter 1 rely on approximating (x) by using progressively ner discretizations of [0; 1] (see, e.g. [5, 7, 8, 16, 18, 19, 20, 23]). Specically, such 1 FEASIBLE SEQUENTIAL QUADRATIC PROGRAMMING FOR FINELY DISCRETIZED PROBLEMS FROM SIP Craig T. Lawrence and Andre L. Tits ABSTRACT Department of Electrical Engineering and Institute for Systems Research

More information

Cubic regularization in symmetric rank-1 quasi-newton methods

Cubic regularization in symmetric rank-1 quasi-newton methods Math. Prog. Comp. (2018) 10:457 486 https://doi.org/10.1007/s12532-018-0136-7 FULL LENGTH PAPER Cubic regularization in symmetric rank-1 quasi-newton methods Hande Y. Benson 1 David F. Shanno 2 Received:

More information

1 Numerical optimization

1 Numerical optimization Contents Numerical optimization 5. Optimization of single-variable functions.............................. 5.. Golden Section Search..................................... 6.. Fibonacci Search........................................

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Fran E. Curtis and Wei Guo Department of Industrial and Systems Engineering, Lehigh University, USA COR@L Technical Report 14T-011-R1

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x)

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x) Solving Nonlinear Equations & Optimization One Dimension Problem: or a unction, ind 0 such that 0 = 0. 0 One Root: The Bisection Method This one s guaranteed to converge at least to a singularity, i not

More information

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for Life beyond Newton Notes for 2016-04-08 Newton s method has many attractive properties, particularly when we combine it with a globalization strategy. Unfortunately, Newton steps are not cheap. At each

More information

Handling nonpositive curvature in a limited memory steepest descent method

Handling nonpositive curvature in a limited memory steepest descent method IMA Journal of Numerical Analysis (2016) 36, 717 742 doi:10.1093/imanum/drv034 Advance Access publication on July 8, 2015 Handling nonpositive curvature in a limited memory steepest descent method Frank

More information

Extra-Updates Criterion for the Limited Memory BFGS Algorithm for Large Scale Nonlinear Optimization 1

Extra-Updates Criterion for the Limited Memory BFGS Algorithm for Large Scale Nonlinear Optimization 1 journal of complexity 18, 557 572 (2002) doi:10.1006/jcom.2001.0623 Extra-Updates Criterion for the Limited Memory BFGS Algorithm for Large Scale Nonlinear Optimization 1 M. Al-Baali Department of Mathematics

More information

Adaptive two-point stepsize gradient algorithm

Adaptive two-point stepsize gradient algorithm Numerical Algorithms 27: 377 385, 2001. 2001 Kluwer Academic Publishers. Printed in the Netherlands. Adaptive two-point stepsize gradient algorithm Yu-Hong Dai and Hongchao Zhang State Key Laboratory of

More information

The speed of Shor s R-algorithm

The speed of Shor s R-algorithm IMA Journal of Numerical Analysis 2008) 28, 711 720 doi:10.1093/imanum/drn008 Advance Access publication on September 12, 2008 The speed of Shor s R-algorithm J. V. BURKE Department of Mathematics, University

More information

Linear Algebra: Linear Systems and Matrices - Quadratic Forms and Deniteness - Eigenvalues and Markov Chains

Linear Algebra: Linear Systems and Matrices - Quadratic Forms and Deniteness - Eigenvalues and Markov Chains Linear Algebra: Linear Systems and Matrices - Quadratic Forms and Deniteness - Eigenvalues and Markov Chains Joshua Wilde, revised by Isabel Tecu, Takeshi Suzuki and María José Boccardi August 3, 3 Systems

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

R-Linear Convergence of Limited Memory Steepest Descent

R-Linear Convergence of Limited Memory Steepest Descent R-Linear Convergence of Limited Memory Steepest Descent Fran E. Curtis and Wei Guo Department of Industrial and Systems Engineering, Lehigh University, USA COR@L Technical Report 16T-010 R-Linear Convergence

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization Journal of Computational and Applied Mathematics 155 (2003) 285 305 www.elsevier.com/locate/cam Nonmonotonic bac-tracing trust region interior point algorithm for linear constrained optimization Detong

More information

Jorge Nocedal. select the best optimization methods known to date { those methods that deserve to be

Jorge Nocedal. select the best optimization methods known to date { those methods that deserve to be Acta Numerica (1992), {199:242 Theory of Algorithms for Unconstrained Optimization Jorge Nocedal 1. Introduction A few months ago, while preparing a lecture to an audience that included engineers and numerical

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient

More information