BFGS WITH UPDATE SKIPPING AND VARYING MEMORY. July 9, 1996

BFGS WITH UPDATE SKIPPING AND VARYING MEMORY

TAMARA GIBSON†, DIANNE P. O'LEARY‡, AND LARRY NAZARETH§

July 9, 1996

Abstract. We give conditions under which limited-memory quasi-Newton methods with exact line searches will terminate in n steps when minimizing n-dimensional quadratic functions. We show that although all Broyden family methods terminate in n steps in their full-memory versions, only BFGS does so with limited memory. Additionally, we show that full-memory Broyden family methods with exact line searches terminate in at most n + p steps when p matrix updates are skipped. We introduce new limited-memory BFGS variants and test them on nonquadratic minimization problems.

Key words. minimization, quasi-Newton, BFGS, limited memory, update skipping, Broyden family

1. Introduction. The quasi-Newton family of algorithms remains a standard workhorse for minimization. Many of these methods share the properties of finite termination on strictly convex quadratic functions, a linear or superlinear rate of convergence on general convex functions, and no need to store or evaluate the second derivative matrix. In general, an approximation to the second derivative matrix is built by accumulating the results of earlier steps. Descriptions of many quasi-Newton algorithms can be found in books by Luenberger [16], Dennis and Schnabel [7], and Golub and Van Loan [11]. Although there are an infinite number of quasi-Newton methods, one method surpasses the others in popularity: the BFGS algorithm of Broyden, Fletcher, Goldfarb, and Shanno; see, e.g., Dennis and Schnabel [7]. This method exhibits more robust behavior than its relatives. Many attempts have been made to explain this robustness, but a complete understanding is yet to be obtained [23].
One result of the work in this paper is a small step toward this understanding, since we investigate the question of how much and which information can be dropped in BFGS and other quasi-Newton methods without destroying the property of quadratic termination. We answer this question in the context of exact line search methods, those that find a minimizer on a one-dimensional subspace at every iteration. (In practice, inexact line searches that satisfy side conditions such as those proposed by Wolfe, see §4.3, are substituted for exact line searches.) We focus on modifications of well-known quasi-Newton algorithms resulting from limiting the memory, either by discarding the results of early steps (§2) or by skipping some updates to the second derivative approximation (§3). We give conditions under which quasi-Newton methods will terminate in n steps when minimizing quadratic functions of n variables. Although all Broyden family methods (see §2) terminate in n steps in their full-memory versions, we show that only BFGS has n-step termination under limited memory. We also show that the methods from the Broyden family terminate in n + p steps even if p updates are skipped, but termination is lost if we both skip updates and limit the memory.

† Applied Mathematics Program, University of Maryland, College Park, MD. gibson@math.umd.edu. This work was supported in part by the National Physical Science Consortium, the National Security Agency, and the University of Maryland.
‡ Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD. oleary@cs.umd.edu. This work was supported by the National Science Foundation under grant NSF.
§ Department of Pure and Applied Mathematics, Washington State University, Pullman, WA. nazareth@amath.washington.edu.

In §4, we report the results of experiments with new limited-memory BFGS variants on problems taken from the CUTE [3] test set, showing that some savings in time can be achieved.

Notation. Matrices and vectors are denoted by boldface upper-case and lower-case letters respectively. Scalars are denoted by Greek or Roman letters. The superscript "T" denotes transpose. Subscripts denote iteration number. Products are always taken from left to right. The notation span{x_1, x_2, ..., x_k} denotes the subspace spanned by the vectors x_1, x_2, ..., x_k. Whenever we refer to an n-dimensional strictly convex quadratic function, we assume it is of the form

    f(x) = (1/2) x^T A x − x^T b,

where A is a positive definite n × n matrix and b is an n-vector.

2. Limited-Memory Variations of Quasi-Newton Algorithms. In this section we characterize full-memory and limited-memory methods that terminate in n iterations on n-dimensional strictly convex quadratic minimization problems using exact line searches. Most full-memory versions of the methods we will discuss are known to terminate in n iterations. Limited-memory BFGS (L-BFGS) was shown by Nocedal [22] to terminate in n steps. The preconditioned conjugate gradient method, which can be cast as a limited-memory quasi-Newton method, is also known to terminate in n iterations; see, e.g., Luenberger [16]. Little else is known about termination of limited-memory methods.

Let f(x) denote the strictly convex quadratic function to be minimized, and let g(x) denote the gradient of f. We define g_k ≡ g(x_k), where x_k is the kth iterate. Let

    s_k = x_{k+1} − x_k

denote the change in the current iterate and

    y_k = g_{k+1} − g_k

denote the change in gradient.

Let x_0 be the starting point, and let H_0 be the initial inverse Hessian approximation.
For k = 0, 1, ...:
  1. Compute d_k = −H_k g_k.
  2. Choose α_k > 0 such that f(x_k + α_k d_k) ≤ f(x_k + α d_k) for all α > 0.
  3. Set s_k = α_k d_k.
  4. Set x_{k+1} = x_k + s_k.
  5. Compute g_{k+1}.
  6. Set y_k = g_{k+1} − g_k.
  7. Choose H_{k+1}.
Fig. 2.1. General Quasi-Newton Method.

We present a general result that characterizes quasi-Newton methods (see Figure 2.1) that terminate in n iterations. We restrict ourselves to methods with an update of the form

(2.1)    H_{k+1} = γ_k P_k^T H_0 Q_k + Σ_{i=1}^{m_k} w_{ik} z_{ik}^T.

Here,

1. H_0 is an n × n symmetric positive definite matrix that remains constant for all k, and γ_k is a nonzero scalar that can be thought of as an iterative rescaling of H_0;
2. P_k is an n × n matrix that is the product of projection matrices of the form

(2.2)    I − (u v^T)/(v^T u),

where u ∈ span{y_0, ..., y_k} and v ∈ span{s_0, ..., s_{k+1}}, and Q_k is an n × n matrix that is the product of projection matrices of the same form where u is any n-vector and v ∈ span{s_0, ..., s_k};
3. m_k is a nonnegative integer, w_{ik} (i = 1, 2, ..., m_k) is any n-vector, and z_{ik} (i = 1, 2, ..., m_k) is any vector in span{s_0, ..., s_k}.

We refer to this form as the general form. The general form fits many known quasi-Newton methods, including the Broyden family and the limited-memory BFGS method. We do not assume that these quasi-Newton methods satisfy the secant condition, H_{k+1} y_k = s_k, nor that H_{k+1} is positive definite and symmetric. Symmetric positive definite updates are desirable since this guarantees that the quasi-Newton method produces descent directions. Note that if the update is not positive definite, we may produce a d_k such that d_k^T g_k > 0, in which case we choose α_k by minimizing over all negative α rather than all positive α.

Example. The method of steepest descent [16] fits the general form (2.1). For each k we define

(2.3)    γ_k = 1, m_k = 0, and P_k = Q_k = H_0 = I.

Note that neither w nor z vectors are specified since m_k = 0.

Example. The (k+1)st update for the conjugate gradient method with preconditioner H_0 fits the general form (2.1) with

(2.4)    γ_k = 1, m_k = 0, P_k = I − (y_k s_k^T)/(s_k^T y_k), and Q_k = I.

Example. The L-BFGS update (see Nocedal [22]) with limited-memory constant m can be written as

(2.5)    H_{k+1} = V_{k−m̃+1,k}^T H_0 V_{k−m̃+1,k} + Σ_{i=k−m̃+1}^{k} V_{i+1,k}^T (s_i s_i^T)/(s_i^T y_i) V_{i+1,k},

where m̃ = min{k+1, m} and

    V_{i,k} = Π_{j=i}^{k} ( I − (y_j s_j^T)/(s_j^T y_j) ).

L-BFGS fits the general form (2.1) if at iteration k we choose

(2.6)    γ_k = 1, m_k = m̃ = min{k+1, m}, P_k = Q_k = V_{k−m̃+1,k}, and w_{ik} = z_{ik} = V_{k−m̃+i+1,k}^T s_{k−m̃+i} / sqrt(s_{k−m̃+i}^T y_{k−m̃+i}).

Observe that P_k, Q_k, and z_{ik} all obey the constraints imposed on their construction. BFGS is related to L-BFGS in the following way: if we were to use every (s, y) pair in the formation of each update (i.e., we have unlimited memory), we would be creating the same updates as BFGS. In practice, however, one would never do that because it would take more memory than storing the BFGS matrix.

Example. We will define limited-memory DFP (L-DFP). Our definition is consistent with the definition of limited-memory BFGS given in Nocedal [22]. Let m ≥ 1 and let m̃ = min{k+1, m}. In order to define the L-DFP update we need to create a sequence of auxiliary matrices for i = 1, ..., m̃:

    Ĥ^{(0)}_{k+1} = H_0, and Ĥ^{(i)}_{k+1} = Ĥ^{(i−1)}_{k+1} + U_DFP(Ĥ^{(i−1)}_{k+1}, s_{k−m̃+i}, y_{k−m̃+i}),

where

    U_DFP(H, s, y) = − (H y y^T H)/(y^T H y) + (s s^T)/(s^T y).

The matrix Ĥ^{(m̃)}_{k+1} is the result of applying the DFP update m̃ times to the matrix H_0 with the m̃ most recent (s, y) pairs. Thus, the (k+1)st L-DFP matrix is given by

    H_{k+1} = Ĥ^{(m̃)}_{k+1}.

To simplify our description, note that Ĥ^{(i)}_{k+1} can be rewritten as

    Ĥ^{(i)}_{k+1} = ( I − (Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} y_{k−m̃+i}^T)/(y_{k−m̃+i}^T Ĥ^{(i−1)}_{k+1} y_{k−m̃+i}) ) Ĥ^{(i−1)}_{k+1} + (s_{k−m̃+i} s_{k−m̃+i}^T)/(s_{k−m̃+i}^T y_{k−m̃+i}), for i ≥ 1,

and hence unrolled as

    Ĥ^{(i)}_{k+1} = (V̂^{(i)}_0)^T H_0 + Σ_{j=1}^{i} (V̂^{(i)}_j)^T (s_{k−m̃+j} s_{k−m̃+j}^T)/(s_{k−m̃+j}^T y_{k−m̃+j}),

where

    V̂^{(i)}_j = Π_{l=j+1}^{i} ( I − (y_{k−m̃+l} y_{k−m̃+l}^T Ĥ^{(l−1)}_{k+1})/(y_{k−m̃+l}^T Ĥ^{(l−1)}_{k+1} y_{k−m̃+l}) ).

Thus H_{k+1} can be written as

(2.7)    H_{k+1} = V̂_0^T H_0 + Σ_{i=1}^{m̃} V̂_i^T (s_{k−m̃+i} s_{k−m̃+i}^T)/(s_{k−m̃+i}^T y_{k−m̃+i}),

where V̂_i ≡ V̂^{(m̃)}_i = Π_{j=i+1}^{m̃} ( I − (y_{k−m̃+j} y_{k−m̃+j}^T Ĥ^{(j−1)}_{k+1})/(y_{k−m̃+j}^T Ĥ^{(j−1)}_{k+1} y_{k−m̃+j}) ).
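The L-DFP recursion above is just repeated application of U_DFP to H_0 with the most recent pairs. A small sketch (assumed helper names; the secant condition H_{k+1} y_k = s_k can be checked directly, since DFP does satisfy it on the newest pair):

```python
import numpy as np

def dfp_update(H, s, y):
    """One application of U_DFP:  H - (H y y'H)/(y'H y) + (s s')/(s'y)."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (s @ y)

def ldfp_matrix(H0, pairs, m):
    """L-DFP matrix H_{k+1} = Hhat^(m~): apply the DFP update to H0
    with the min(m, len(pairs)) most recent (s, y) pairs, oldest first."""
    H = H0.copy()
    for s, y in pairs[-m:]:
        H = dfp_update(H, s, y)
    return H
```

Note that, unlike BFGS, the DFP update has no symmetric product factorization, which is why (2.7) carries the auxiliary matrices Ĥ^{(i)}_{k+1} inside the projection factors.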

Equation (2.7) looks very much like the general form given in (2.1). L-DFP fits the general form with the following choices:

(2.8)    γ_k = 1, P_k = V̂_0, Q_k = I, w_{ik} = V̂_i^T s_{k−m̃+i}/(s_{k−m̃+i}^T y_{k−m̃+i}), and z_{ik} = s_{k−m̃+i}.

Except for the choice of P_k, it is trivial to verify that these choices satisfy the general form. To prove that P_k satisfies the requirements, we need to show

(2.9)    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_{k+1}}, for i = 1, ..., m̃ and all k.

Proposition 2.1. For limited-memory DFP, the following two conditions hold for each value of k:

(2.10)    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_k} for i = 1, ..., m̃ − 1, and
          Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_k, H_0 g_{k+1}} for i = m̃; and

(2.11)    span{H_0 g_0, ..., H_0 g_{k+1}} ⊆ span{s_0, ..., s_{k+1}}.

Proof. We will prove this via induction. Suppose k = 0. Then m̃_0 = 1. We have

    Ĥ^{(0)}_1 y_0 = H_0 y_0 = H_0 g_1 − H_0 g_0 ∈ span{s_0, H_0 g_1}.

(Recall that span{s_0} is trivially equal to span{H_0 g_0}.) Furthermore,

    s_1 = −α_1 H_1 g_1 = −α_1 ( H_0 g_1 − (y_0^T H_0 g_1)/(y_0^T H_0 y_0) (H_0 g_1 − H_0 g_0) + (s_0^T g_1)/(y_0^T s_0) s_0 ).

So we can conclude

    ( 1 − (y_0^T H_0 g_1)/(y_0^T H_0 y_0) ) H_0 g_1 = −(1/α_1) s_1 − (y_0^T H_0 g_1)/(y_0^T H_0 y_0) H_0 g_0 − (s_0^T g_1)/(y_0^T s_0) s_0.

Hence H_0 g_1 ∈ span{s_0, s_1}, and so the base case holds.

Assume that

    Ĥ^{(i−1)}_k y_{k−1−m̃_{k−1}+i} ∈ span{s_0, ..., s_{k−1}} for i = 1, ..., m̃_{k−1} − 1, and
    Ĥ^{(i−1)}_k y_{k−1−m̃_{k−1}+i} ∈ span{s_0, ..., s_{k−1}, H_0 g_k} for i = m̃_{k−1}, and
    span{H_0 g_0, ..., H_0 g_k} ⊆ span{s_0, ..., s_k}.

We will use induction on i to show (2.10) for the (k+1)st case. For i = 1,

    Ĥ^{(0)}_{k+1} y_{k−m̃+1} = H_0 y_{k−m̃+1} = H_0 g_{k−m̃+2} − H_0 g_{k−m̃+1}.

Using the induction assumptions from the induction on k, we get that

    Ĥ^{(0)}_{k+1} y_{k−m̃+1} ∈ span{s_0, ..., s_k, H_0 g_{k+1}}, if m̃ = 1;
    Ĥ^{(0)}_{k+1} y_{k−m̃+1} ∈ span{s_0, ..., s_k}, otherwise.

Assume that Ĥ^{(i−2)}_{k+1} y_{k−m̃+i−1} ∈ span{s_0, ..., s_k} (induction assumption for i). Next,

    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} = (V̂^{(i−1)}_0)^T H_0 y_{k−m̃+i} + Σ_{j=1}^{i−1} (s_{k−m̃+j}^T y_{k−m̃+i})/(s_{k−m̃+j}^T y_{k−m̃+j}) (V̂^{(i−1)}_j)^T s_{k−m̃+j}.

For values of i ≤ m̃ − 1, (V̂^{(i−1)}_j)^T maps any vector v into

    span{v, Ĥ^{(0)}_{k+1} y_{k−m̃+1}, ..., Ĥ^{(i−2)}_{k+1} y_{k−m̃+i−1}},

and so Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} is in

    span{H_0 y_{k−m̃+i}, Ĥ^{(0)}_{k+1} y_{k−m̃+1}, ..., Ĥ^{(i−2)}_{k+1} y_{k−m̃+i−1}, s_{k−m̃+1}, ..., s_{k−m̃+i−1}}.

Using the induction assumptions for both i and k, we get

    Ĥ^{(i−1)}_{k+1} y_{k−m̃+i} ∈ span{s_0, ..., s_k},

and we can continue the induction on i. If i = m̃, then

    Ĥ^{(m̃−1)}_{k+1} y_k ∈ span{H_0 y_k, Ĥ^{(0)}_{k+1} y_{k−m̃+1}, ..., Ĥ^{(m̃−2)}_{k+1} y_{k−1}, s_{k−m̃+1}, ..., s_{k−1}},

so

    Ĥ^{(m̃−1)}_{k+1} y_k ∈ span{s_0, ..., s_k, H_0 g_{k+1}}.

Hence the induction on i is complete and this proves (2.10) in the (k+1)st case.

Now, consider

    s_{k+1} = −α_{k+1} H_{k+1} g_{k+1} = −α_{k+1} ( V̂_0^T H_0 g_{k+1} + Σ_{i=1}^{m̃} (s_{k−m̃+i}^T g_{k+1})/(s_{k−m̃+i}^T y_{k−m̃+i}) V̂_i^T s_{k−m̃+i} ).

Using the structure of V̂_j and (2.10), we see that

    H_0 g_{k+1} ∈ span{s_0, ..., s_{k+1}}.

Hence, (2.11) also holds in the (k+1)st case.

Example. The Broyden Class or Broyden Family is the class of quasi-Newton methods whose matrices are linear combinations of the DFP and BFGS matrices:

    H_{k+1} = φ H^{BFGS}_{k+1} + (1 − φ) H^{DFP}_{k+1}, φ ∈ R;

see, e.g., Luenberger [16, Chap. 9]. The parameter φ is usually restricted to values that are guaranteed to produce a positive definite update, although recent work with SR1, a Broyden Class method, by Khalfan, Byrd, and Schnabel [14] may change this practice. No restriction on φ is necessary for the development of our theory. The Broyden Class update can be expressed as

    H_{k+1} = H_k + (s_k s_k^T)/(s_k^T y_k) − (H_k y_k y_k^T H_k)/(y_k^T H_k y_k)
              + φ (y_k^T H_k y_k) ( s_k/(s_k^T y_k) − (H_k y_k)/(y_k^T H_k y_k) ) ( s_k/(s_k^T y_k) − (H_k y_k)/(y_k^T H_k y_k) )^T.

We sketch the explanation of how the full-memory version fits the general form given in (2.1); the limited-memory case is similar. We can rewrite the Broyden Class update in the factored form

    H_{k+1} = ( I − (u_k y_k^T)/(y_k^T u_k) ) H_k ( I − (y_k s_k^T)/(s_k^T y_k) ) + (s_k s_k^T)/(s_k^T y_k),

where

    u_k = (1 − φ)(s_k^T y_k) H_k y_k + φ (y_k^T H_k y_k) s_k,

as can be verified by expanding the product. Hence, unrolling the recursion,

    H_{k+1} = P_k^T H_0 Q_k + Σ_{i=1}^{k+1} w_{ik} z_{ik}^T,

where

    P_k^T = A_k^T A_{k−1}^T ⋯ A_0^T with A_j^T = I − (u_j y_j^T)/(y_j^T u_j),
    Q_k = B_0 B_1 ⋯ B_k with B_j = I − (y_j s_j^T)/(s_j^T y_j),
    w_{ik} = A_k^T ⋯ A_i^T s_{i−1}/(s_{i−1}^T y_{i−1}), and z_{ik} = B_k^T ⋯ B_i^T s_{i−1}.

It is left to the reader to show that H_k y_k is in span{s_0, ..., s_{k+1}}, and thus the Broyden Class updates fit the form in (2.1).

2.1. Termination of Limited-Memory Methods. In this section we show that methods fitting the general form (2.1) produce conjugate search directions (Theorem 2.2) and terminate in n iterations (Corollary 2.3) if and only if P_k maps span{y_0, ..., y_k} into span{y_0, ..., y_{k−1}} for each k = 1, 2, ..., n. Furthermore, this condition on P_k is satisfied only if y_k is used in its formation (Corollary 2.4).

Theorem 2.2. Suppose that we apply a quasi-Newton method (Figure 2.1) with an update of the form (2.1) to minimize an n-dimensional strictly convex quadratic function. Then for each k before termination (i.e., g_{k+1} ≠ 0),

(2.12)    g_{k+1}^T s_j = 0, for all j = 0, 1, ..., k;
(2.13)    s_{k+1}^T A s_j = 0, for all j = 0, 1, ..., k; and
(2.14)    span{s_0, ..., s_{k+1}} = span{H_0 g_0, ..., H_0 g_{k+1}},

if and only if

(2.15)    P_j y_i ∈ span{y_0, ..., y_{j−1}}, for all i = 0, 1, ..., j, j = 0, 1, ..., k.

Proof. (⇐) Assume that (2.15) holds. We will prove (2.12)–(2.14) by induction. Since the line searches are exact, g_1 is orthogonal to s_0. Using the fact that P_0 y_0 = 0

from (2.15), and the fact that z_{i0} ∈ span{s_0} implies g_1^T z_{i0} = 0, i = 1, ..., m_0, we see that s_1 is conjugate to s_0 since

    s_1^T A s_0 = α_1 d_1^T y_0 = −α_1 g_1^T H_1^T y_0
                = −α_1 g_1^T ( γ_0 Q_0^T H_0 P_0 + Σ_{i=1}^{m_0} z_{i0} w_{i0}^T ) y_0
                = −α_1 γ_0 g_1^T Q_0^T H_0 P_0 y_0 − α_1 Σ_{i=1}^{m_0} g_1^T z_{i0} w_{i0}^T y_0
                = 0.

Lastly, span{s_0} = span{H_0 g_0}, and so the base case is established.

We will assume that claims (2.12)–(2.14) hold for k = 0, 1, ..., k̂ − 1 and prove that they also hold for k = k̂. The vector g_{k̂+1} is orthogonal to s_{k̂} since the line search is exact. Using the induction hypotheses that g_{k̂} is orthogonal to {s_0, ..., s_{k̂−1}} and s_{k̂} is conjugate to {s_0, ..., s_{k̂−1}}, we see that for j < k̂,

    g_{k̂+1}^T s_j = (g_{k̂} + y_{k̂})^T s_j = (g_{k̂} + A s_{k̂})^T s_j = 0.

Hence, (2.12) holds for k = k̂. To prove (2.13), we note that

    s_{k̂+1}^T A s_j = −α_{k̂+1} g_{k̂+1}^T H_{k̂+1}^T y_j,

so it is sufficient to prove that g_{k̂+1}^T H_{k̂+1}^T y_j = 0 for j = 0, 1, ..., k̂. We will use the following facts:
(i) g_{k̂+1}^T Q_{k̂}^T = g_{k̂+1}^T, since the v in each of the projections used to form Q_{k̂} is in span{s_0, ..., s_{k̂}} and g_{k̂+1} is orthogonal to that span.
(ii) g_{k̂+1}^T z_{ik̂} = 0 for i = 1, ..., m_{k̂}, since each z_{ik̂} is in span{s_0, ..., s_{k̂}} and g_{k̂+1} is orthogonal to that span.
(iii) Since we are assuming that (2.15) holds true, for each j = 0, 1, ..., k̂ there exist β_0, ..., β_{k̂−1} such that P_{k̂} y_j can be expressed as Σ_{i=0}^{k̂−1} β_i y_i.
(iv) For i = 0, 1, ..., k̂ − 1, g_{k̂+1} is orthogonal to H_0 y_i, because g_{k̂+1} is orthogonal to span{s_0, ..., s_{k̂}} and H_0 y_i ∈ span{s_0, ..., s_{k̂}} from (2.14).

Thus,

    g_{k̂+1}^T H_{k̂+1}^T y_j = g_{k̂+1}^T ( γ_{k̂} Q_{k̂}^T H_0 P_{k̂} + Σ_{i=1}^{m_{k̂}} z_{ik̂} w_{ik̂}^T ) y_j
        = γ_{k̂} g_{k̂+1}^T Q_{k̂}^T H_0 P_{k̂} y_j + Σ_{i=1}^{m_{k̂}} g_{k̂+1}^T z_{ik̂} w_{ik̂}^T y_j
        = γ_{k̂} g_{k̂+1}^T H_0 P_{k̂} y_j
        = γ_{k̂} g_{k̂+1}^T H_0 ( Σ_{i=0}^{k̂−1} β_i y_i )

        = γ_{k̂} Σ_{i=0}^{k̂−1} β_i g_{k̂+1}^T H_0 y_i
        = 0.

Thus, (2.13) holds for k = k̂. Lastly, using (i) and (ii) from above,

    s_{k̂+1} = −α_{k̂+1} H_{k̂+1} g_{k̂+1}
            = −α_{k̂+1} ( γ_{k̂} P_{k̂}^T H_0 Q_{k̂} g_{k̂+1} + Σ_{i=1}^{m_{k̂}} w_{ik̂} z_{ik̂}^T g_{k̂+1} )
            = −α_{k̂+1} γ_{k̂} P_{k̂}^T H_0 g_{k̂+1}.

Since P_{k̂}^T maps any vector v into span{v, s_0, ..., s_{k̂+1}} by construction, there exist β_0, ..., β_{k̂+1} such that

    s_{k̂+1} = −α_{k̂+1} γ_{k̂} ( H_0 g_{k̂+1} + Σ_{i=0}^{k̂+1} β_i s_i ).

Hence,

    H_0 g_{k̂+1} ∈ span{s_0, ..., s_{k̂+1}},

so

    span{H_0 g_0, ..., H_0 g_{k̂+1}} ⊆ span{s_0, ..., s_{k̂+1}}.

To show equality of the sets, we will show that H_0 g_{k̂+1} is linearly independent of {H_0 g_0, ..., H_0 g_{k̂}}. (We already know that the set {H_0 g_0, ..., H_0 g_{k̂}} is linearly independent since it spans the same space as the linearly independent set {s_0, ..., s_{k̂}} and has the same number of elements.) Suppose that H_0 g_{k̂+1} is not linearly independent. Then there exist σ_0, ..., σ_{k̂}, not all zero, such that

    H_0 g_{k̂+1} = Σ_{i=0}^{k̂} σ_i H_0 g_i.

Recall that g_{k̂+1} is orthogonal to {s_0, ..., s_{k̂}}. By our induction assumption, this implies that g_{k̂+1} is also orthogonal to {H_0 g_0, ..., H_0 g_{k̂}}. Thus for any j between 0 and k̂,

    0 = g_{k̂+1}^T H_0 g_j = ( Σ_{i=0}^{k̂} σ_i H_0 g_i )^T g_j = Σ_{i=0}^{k̂} σ_i g_i^T H_0 g_j = σ_j g_j^T H_0 g_j.

Since H_0 is positive definite and g_j is nonzero, we conclude that σ_j must be zero. Since this is true for every j between zero and k̂, we have a contradiction. Thus, the set {H_0 g_0, ..., H_0 g_{k̂+1}} is linearly independent. Hence, (2.14) holds for k = k̂.

(⇒) Assume that (2.12)–(2.14) hold for all k such that g_{k+1} ≠ 0, but that (2.15) does not hold; i.e., there exist j and k such that g_{k+1} ≠ 0, j is between 0 and k, and

(2.16)    P_k y_j ∉ span{y_0, ..., y_{k−1}}.

This will lead to a contradiction. By construction of P_k, there exist β_0, ..., β_k such that

(2.17)    P_k y_j = Σ_{i=0}^{k} β_i y_i.

By assumption (2.16), β_k must be nonzero. From (2.13), it follows that g_{k+1}^T H_{k+1}^T y_j = 0. Using facts (i), (ii), and (iv) from before, (2.14), and (2.17), we get

    0 = g_{k+1}^T H_{k+1}^T y_j
      = g_{k+1}^T ( γ_k Q_k^T H_0 P_k + Σ_{i=1}^{m_k} z_{ik} w_{ik}^T ) y_j
      = γ_k g_{k+1}^T Q_k^T H_0 P_k y_j + Σ_{i=1}^{m_k} g_{k+1}^T z_{ik} w_{ik}^T y_j
      = γ_k g_{k+1}^T H_0 P_k y_j
      = γ_k g_{k+1}^T H_0 ( Σ_{i=0}^{k} β_i y_i )
      = γ_k Σ_{i=0}^{k} β_i g_{k+1}^T H_0 y_i
      = γ_k β_k g_{k+1}^T H_0 y_k
      = γ_k β_k ( g_{k+1}^T H_0 g_{k+1} − g_{k+1}^T H_0 g_k )
      = γ_k β_k g_{k+1}^T H_0 g_{k+1}.

Thus, since neither γ_k nor β_k is zero, we must have g_{k+1}^T H_0 g_{k+1} = 0; but this is a contradiction since H_0 is positive definite and g_{k+1} was assumed to be nonzero.

When a method produces conjugate search directions, we can say something about termination.

Corollary 2.3. Suppose we have a method of the type described in Theorem 2.2 satisfying (2.15). Suppose further that H_j g_j ≠ 0 whenever g_j ≠ 0. Then the scheme reproduces the iterates from the conjugate gradient method with preconditioner H_0 and terminates in no more than n iterations.

Proof. Let k be such that g_0, ..., g_k are all nonzero and such that H_i g_i ≠ 0 for i = 0, ..., k. Since we have a method of the type described in Theorem 2.2 satisfying (2.15), conditions (2.12)–(2.14) hold. We claim that the (k+1)st subspace of search directions, span{s_0, ..., s_k}, is equivalent to the (k+1)st Krylov subspace, span{H_0 g_0, ..., (H_0 A)^k H_0 g_0}. From (2.14), we know that span{s_0, ..., s_k} = span{H_0 g_0, ..., H_0 g_k}. We will show via induction that span{H_0 g_0, ..., H_0 g_k} = span{H_0 g_0, ..., (H_0 A)^k H_0 g_0}. The base case is trivial, so assume that

    span{H_0 g_0, ..., H_0 g_i} = span{H_0 g_0, ..., (H_0 A)^i H_0 g_0}

for some i < k. Now,

    g_{i+1} = A x_{i+1} − b = A (x_i + s_i) − b = A s_i + g_i,

and from (2.14) and the induction hypothesis,

    s_i ∈ span{H_0 g_0, ..., H_0 g_i} = span{H_0 g_0, ..., (H_0 A)^i H_0 g_0},

which implies that

    H_0 A s_i ∈ span{(H_0 A) H_0 g_0, ..., (H_0 A)^{i+1} H_0 g_0}.

So,

    H_0 g_{i+1} ∈ span{H_0 g_0, ..., (H_0 A)^{i+1} H_0 g_0}.

Hence, the search directions span the Krylov subspace. Since the search directions are conjugate (2.13) and span the Krylov subspace, the iterates are the same as those produced by conjugate gradients with preconditioner H_0. Since we produce the same iterates as the conjugate gradient method, and the conjugate gradient method is well known to terminate within n iterations, we can conclude that this scheme terminates in at most n iterations.

Note that we require that H_j g_j be nonzero whenever g_j is nonzero; this requirement is necessary since not all the methods produce positive definite updates, and it is possible to construct an update that maps g_j to zero. If this were to happen, we would have a breakdown in the method.

The next corollary defines the role that the latest information (s_k and y_k) plays in the formation of the kth H-update.

Corollary 2.4. Suppose we have a method of the type described in Theorem 2.2 satisfying (2.15). Suppose further that at the kth iteration P_k is composed of p projections of the form in (2.2). Then at least one of the projections must have u = Σ_{i=0}^{k} η_i y_i with η_k ≠ 0. Furthermore, if P_k is a single projection (p = 1), then v must be of the form v = ν_k s_k + ν_{k+1} s_{k+1} with ν_k ≠ 0.

Proof. Consider the case of p = 1. We have

    P_k = I − (u v^T)/(v^T u),

where u ∈ span{y_0, ..., y_k} and v ∈ span{s_0, ..., s_{k+1}}. We will assume that

    u = Σ_{i=0}^{k} η_i y_i and v = Σ_{i=0}^{k+1} ν_i s_i

for some scalars η_i and ν_i. By (2.15), there exist β_0, ..., β_{k−1} such that

    P_k y_k = Σ_{i=0}^{k−1} β_i y_i.

Then

    (v^T y_k)/(v^T u) u = y_k − Σ_{i=0}^{k−1} β_i y_i,

and so

(2.18)    (v^T y_k)/(v^T u) Σ_{i=0}^{k} η_i y_i = y_k − Σ_{i=0}^{k−1} β_i y_i.

From (2.13), the set {s_0, ..., s_k} is conjugate and thus linearly independent. Since we are working with a quadratic, y_i = A s_i for all i; and since A is symmetric positive definite, the set {y_0, ..., y_k} is also linearly independent. So the coefficient of y_k on the left-hand side of (2.18) must match that on the right-hand side; thus

(2.19)    η_k (v^T y_k)/(v^T u) = 1.

Hence η_k ≠ 0, and y_k must make a nontrivial contribution to P_k.

Next we will show that ν_0 = ν_1 = ⋯ = ν_{k−1} = 0. Assume that j is between 0 and k − 1. Then

    P_k y_j = y_j − (v^T y_j)/(v^T u) u = y_j − ( Σ_{i=0}^{k+1} ν_i s_i^T y_j )/(v^T u) u
            = y_j − ( Σ_{i=0}^{k+1} ν_i s_i^T A s_j )/(v^T u) u
            = y_j − ν_j (s_j^T A s_j)/(v^T u) u.

Now s_j^T A s_j is nonzero because A is positive definite. If ν_j is nonzero, then the coefficient of u is nonzero, and so y_k must make a nontrivial contribution to P_k y_j, implying that P_k y_j ∉ span{y_0, ..., y_{k−1}}. This is a contradiction. Hence, ν_j = 0.

To show that ν_k ≠ 0, consider P_k y_k. Suppose that ν_k = 0. Then

    v^T y_k = ν_{k+1} y_k^T s_{k+1} + ν_k y_k^T s_k = ν_{k+1} s_k^T A s_{k+1} = 0,

and so

    P_k y_k = y_k − (v^T y_k)/(v^T u) u = y_k.

This contradicts P_k y_k ∈ span{y_0, ..., y_{k−1}}, so ν_k must be nonzero.

Now we will discuss the p > 1 case. Label the u-components of the p projections as u_1 through u_p. Then

    P_k y_k = y_k + Σ_{i=1}^{p} τ_i u_i

for some scalars τ_1 through τ_p. We know that

    P_k y_k ∈ span{y_0, ..., y_{k−1}},

and that

    y_k ∉ span{y_0, ..., y_{k−1}}.

Thus

    y_k ∈ span{u_1, ..., u_p},

and since u_i ∈ span{y_0, ..., y_k} for i = 1, ..., p, we can conclude that at least one u_i must have a nontrivial contribution from y_k.

2.2. Examples of Methods that Reproduce the CG Iterates. Here are some specific examples of methods that fit the general form, satisfy condition (2.15) of Theorem 2.2, and thus terminate in at most n iterations.

Example. The conjugate gradient method with preconditioner H_0, see (2.4), satisfies condition (2.15) of Theorem 2.2 since

    P_k y_j = ( I − (y_k s_k^T)/(s_k^T y_k) ) y_j = y_j − (s_k^T y_j)/(s_k^T y_k) y_k,

which equals 0 for j = k and equals y_j for j = 0, ..., k − 1 (since s_k^T y_j = s_k^T A s_j = 0).

Example. Limited-memory BFGS, see (2.6), satisfies condition (2.15) of Theorem 2.2 since

    P_k y_j = 0 for j = k − m̃ + 1, ..., k, and P_k y_j = y_j for j = 0, ..., k − m̃.

Example. DFP (with full memory), see (2.8), satisfies condition (2.15) of Theorem 2.2. Consider P_k in the full-memory case. We have

    P_k = Π_{i=0}^{k} ( I − (y_i y_i^T H_i^T)/(y_i^T H_i y_i) ).

For full-memory DFP, H_i y_j = s_j for j = 0, ..., i − 1. Using this fact, one can easily verify that P_k y_j = 0 for j = 0, ..., k. Therefore, full-memory DFP satisfies condition (2.15) of Theorem 2.2. The same reasoning does not apply to the limited-memory case, as we shall show in §2.3.

The next corollary gives some ideas for other methods that are related to L-BFGS and terminate in at most n iterations on strictly convex quadratics.

Corollary 2.5. The L-BFGS method (2.5) will terminate in n iterations on an n-dimensional strictly convex quadratic function even if any combination of the following modifications is made to the update:
1. Vary the limited-memory constant, keeping m_k ≥ 1.
2. Form the projections used in V from the most recent (s_k, y_k) pair along with any set of m − 1 other pairs from {(s_0, y_0), ..., (s_{k−1}, y_{k−1})}.
3. Form the projections used in V from the most recent (s_k, y_k) pair along with any m − 1 other linear combinations of pairs from {(s_0, y_0), ..., (s_{k−1}, y_{k−1})}.
4. Iteratively rescale H_0.

Proof. For each variant, we show that the method fits the general form in (2.1) and satisfies condition (2.15) of Theorem 2.2, and hence terminates by Corollary 2.3.

1. Let m_k > 0 be any value, which may change from iteration to iteration, and define

    V_{i,k} = Π_{j=i}^{k} ( I − (y_j s_j^T)/(s_j^T y_j) ).

Choose

    γ_k = 1, m̃ = min{k+1, m_k}, P_k = Q_k = V_{k−m̃+1,k}, and w_{ik} = z_{ik} = V_{k−m̃+i+1,k}^T s_{k−m̃+i} / sqrt(s_{k−m̃+i}^T y_{k−m̃+i}).

These choices fit the general form. Furthermore,

    P_k y_j = 0 if j = k − m̃ + 1, ..., k, and P_k y_j = y_j if j = 0, 1, ..., k − m̃,

so this variation satisfies condition (2.15) of Theorem 2.2.

2. This is a special case of the next variant.

3. At iteration k, let (ŝ^{(i)}_k, ŷ^{(i)}_k) denote the ith (i = 1, ..., m̃ − 1) choice of any linear combination from the span of the set {(s_0, y_0), ..., (s_{k−1}, y_{k−1})}, and let (ŝ^{(m̃)}_k, ŷ^{(m̃)}_k) = (s_k, y_k). Define

    V_{i,k} = Π_{j=i}^{m̃} ( I − (ŷ^{(j)}_k)(ŝ^{(j)}_k)^T / ((ŝ^{(j)}_k)^T ŷ^{(j)}_k) ).

Choose

    γ_k = 1, m_k = m̃ = min{k+1, m}, P_k = Q_k = V_{1,k}, and w_{ik} = z_{ik} = V_{i+1,k}^T ŝ^{(i)}_k / sqrt((ŝ^{(i)}_k)^T ŷ^{(i)}_k).

These choices satisfy the general form (2.1). Furthermore,

    P_k y_j = 0 if y_j = ŷ^{(i)}_k for some i, and P_k y_j = y_j otherwise.

Hence, this variation satisfies condition (2.15) of Theorem 2.2.

4. Let γ_k in (2.1) be the scaling constant, and choose the other vectors and matrices as in L-BFGS (2.6).

Combinations of variants are left to the reader.

Remark. Part 3 of the previous corollary shows that the "accumulated step" method of Gill and Murray [10] terminates on quadratics.

Remark. Part 4 of the previous corollary shows that scaling does not affect termination in L-BFGS. In fact, for any method that fits the general form, it is easy to see that scaling will not affect termination on quadratics.

2.3. Examples of Methods that Do Not Reproduce the CG Iterates. We will discuss several methods that fit the general form given in (2.1) but do not satisfy the conditions of Theorem 2.2.

Example. Steepest descent, see (2.3), does not satisfy condition (2.15) of Theorem 2.2 and thus does not produce conjugate search directions. This fact is well known; see, e.g., Luenberger [16].

Example. Limited-memory DFP, see (2.8), with m < n does not satisfy the condition on P_k (2.15) for all k, and so the method will not produce conjugate directions. For example, for a particular convex quadratic with data A and b (given in [9]), using a limited-memory constant of m = 1 and exact arithmetic, it can be seen that the iteration does not terminate within the first 20 iterations of limited-memory DFP with H_0 = I. The MAPLE notebook file used to compute this example is available on the World Wide Web [9].

Remark. Using the above example, we can easily see that no limited-memory Broyden class method except limited-memory BFGS terminates within the first n iterations.

3. Update-Skipping Variations for Broyden Class Quasi-Newton Algorithms. The previous section discussed limited-memory methods that behave like conjugate gradients on n-dimensional strictly convex quadratic functions. In this section, we are concerned with methods that skip some updates in order to reduce the memory demands. We establish conditions under which finite termination is preserved but delayed for the Broyden Class.

3.1. Termination when Updates are Skipped. It was shown by Powell [26] that if we skip every other update and take direct prediction steps (i.e., steps of length one) in a Broyden class method, then the procedure will terminate in no more than 2n + 1 iterations on an n-dimensional strictly convex quadratic function. An alternate proof of this result is given by Nazareth [21]. We will prove a related result. Suppose that we are doing exact line searches using a Broyden Class quasi-Newton method on a strictly convex quadratic function and decide to "skip" p updates to H (i.e., choose H_{k+1} = H_k on p occasions). Then the algorithm terminates in no more than n + p iterations.

In contrast to Powell's result, it does not matter which updates are skipped or if multiple updates are skipped in a row.

Theorem 3.1. Suppose that a Broyden Class method using exact line searches is applied to an n-dimensional strictly convex quadratic function and p updates are skipped. Let

    J(k) = { j ≤ k : the update at iteration j is not skipped }.

Then for all k = 0, 1, ...,

(3.1)    g_{k+1}^T s_j = 0, for all j ∈ J(k); and
(3.2)    s_{k+1}^T A s_j = 0, for all j ∈ J(k).

Furthermore, the method terminates in at most n + p iterations at the exact minimizer.

Proof. We will use induction on k to show (3.1) and

(3.3)    H_{k+1} y_j = s_j, for all j ∈ J(k).

Then (3.2) follows easily since for all j ∈ J(k),

    s_{k+1}^T A s_j = −α_{k+1} g_{k+1}^T H_{k+1} y_j = −α_{k+1} g_{k+1}^T s_j = 0.

Let k_0 be the least value of k such that J(k) is nonempty; i.e., J(k_0) = {k_0}. Then g_{k_0+1} is orthogonal to s_{k_0} since line searches are exact, and H_{k_0+1} y_{k_0} = s_{k_0} since all members of the Broyden Family satisfy the secant condition. Hence, the base case is true.

Now assume that (3.1) and (3.3) hold for all values of k = 0, 1, ..., k̂ − 1. We will show that they also hold for k = k̂.

Case I. Suppose that k̂ ∉ J(k̂). Then H_{k̂+1} = H_{k̂} and J(k̂) = J(k̂ − 1), so for any j ∈ J(k̂),

(3.4)    g_{k̂+1}^T s_j = (g_{k̂} + A s_{k̂})^T s_j = g_{k̂}^T s_j + s_{k̂}^T A s_j = 0,

and

    H_{k̂+1} y_j = H_{k̂} y_j = s_j.

Case II. Suppose that k̂ ∈ J(k̂). Then H_{k̂+1} satisfies the secant condition and J(k̂) = J(k̂ − 1) ∪ {k̂}. Now g_{k̂+1} is orthogonal to s_{k̂} since the line searches are exact, and it is orthogonal to the older s_j by the argument in (3.4). The secant condition guarantees that H_{k̂+1} y_{k̂} = s_{k̂}, and for j ∈ J(k̂) with j ≠ k̂ we have

    H_{k̂+1} y_j = H_{k̂} y_j + (s_{k̂} s_{k̂}^T y_j)/(s_{k̂}^T y_{k̂}) − (H_{k̂} y_{k̂} y_{k̂}^T H_{k̂} y_j)/(y_{k̂}^T H_{k̂} y_{k̂})
                 + φ (y_{k̂}^T H_{k̂} y_{k̂}) ( s_{k̂}/(s_{k̂}^T y_{k̂}) − (H_{k̂} y_{k̂})/(y_{k̂}^T H_{k̂} y_{k̂}) ) ( s_{k̂}/(s_{k̂}^T y_{k̂}) − (H_{k̂} y_{k̂})/(y_{k̂}^T H_{k̂} y_{k̂}) )^T y_j
                 = s_j,

since every correction term annihilates y_j: s_{k̂}^T y_j = s_{k̂}^T A s_j = 0 and y_{k̂}^T H_{k̂} y_j = y_{k̂}^T s_j = s_{k̂}^T A s_j = 0.

In either case, the induction result follows.

Suppose that we skip p updates. Then the set J(n − 1 + p) has cardinality n. Without loss of generality, assume that the set {s_i}_{i ∈ J(n−1+p)} has no zero elements. From (3.2), the vectors are linearly independent. By (3.1),

    g_{n+p}^T s_j = 0, for all j ∈ J(n − 1 + p),

and so g_{n+p} must be zero. This implies that x_{n+p} is the exact minimizer of f.

17 L-BFGS Variations Loss of Termination for Update Sipping with Limited-Memory. Unfortunately, updates that use both limited-memory and repeated update-sipping do not produce n conjugate search directions for n-dimensional strictly convex quadratics, and the termination property is lost. We will show a simple example, limitedmemory BFGS with m = 1, sipping every other update. Note that according to Corollary 2.4, we would still be guaranteed termination if we used the most recent information in each update. Example. Suppose that we have a convex quadratic with A = ; and b = We apply limited-memory BFGS with limited-memory constant m = 1 and H 0 = I and sip every-other update to H. Using exact arithmetic in MAPLE, we observe that the process does not terminate even after 100 iterations [9]. 4. Experimental Results. The results of x2 and x3 lead to a number of ideas for new methods for unconstrained optimization. In this section, we motivate, develop, and test these ideas. We describe the collection of test problems in x4.2. The test environment is described in x4.3. Section outlines the implementation of the L-BFGS method (our base for all comparisons) and xx4.4.2{4.4.7 describe the variations. Pseudo-code for L-BFGS and its variations is given in Appendix B. Complete numerical results, many graphs of the numerical results, and the original FORTRAN code are available [9] Motivation. So far we have only given results for convex quadratic functions. While termination on quadratics is beautiful in theory, it does not necessarily yield insight into how these methods will do in practice. We will not present any new results relating to convergence of these algorithms on general functions; however, many of these can be shown to converge using the convergence analysis presented in x7 of [15]. 
In [15], Liu and Nocedal show that a limited-memory BFGS method implemented with a line search that satisfies the strong Wolfe conditions (see §4.3 for a definition) is R-linearly convergent on a convex function that satisfies a few modest conditions.

4.2. Test Problems. For our test problems, we used the Constrained and Unconstrained Testing Environment (CUTE) by Bongartz, Conn, Gould and Toint. The package is documented in [3] and can be obtained via the world wide web [2] or via ftp [1]. The package contains a large collection of test problems as well as the interfaces necessary for using the problems. The test problems are stored as "SIF" files. We chose a collection of 22 unconstrained problems. The problems ranged in size from 10 to 10,000 variables, but each took L-BFGS with limited-memory constant m = 5 at least 60 iterations to solve. Table 4.1 enumerates the problems, giving the SIF file name, the dimension (n), and a description for each problem. The CUTE package also provides a starting point (x_0) for each problem.

4.3. Test Environment. We used FORTRAN77 code on an SGI Indigo 2 to run the algorithms, with FORTRAN BLAS routines from NETLIB. We used the compiler's default optimization level. Figure 2.1 outlines the general quasi-Newton implementation that we followed. For the line search, we use the routines cvsrch and cstep written by Jorge J. Moré

No.  SIF Name  n  Description & Reference
1  EXTROSNB  10  Extended Rosenbrock function (nonseparable version) [30, Problem 10].
2  WATSONS  31  Watson problem [17, Problem 20].
3  TOINTGOR  50  Toint's operations research problem [29].
4  TOINTPSP  50  Toint's PSP operations research problem [29].
5  CHNROSNB  50  Chained Rosenbrock function [29].
6  ERRINROS  50  Nonlinear problem similar to CHNROSNB [28].
7  FLETCHBV  100  Fletcher's boundary value problem [8, Problem 1].
8  FLETCHCR  100  Fletcher's chained Rosenbrock function [8, Problem 2].
9  PENALTY2  100  Second penalty problem [17, Problem 24].
10  GENROSE  500  Generalized Rosenbrock function [18, Problem 5].
11  BDQRTIC  1000  Quartic with a banded Hessian with bandwidth = 9 [5, Problem 61].
12  BROYDN7D  1000  Seven-diagonal variant of the Broyden tridiagonal system with a band away from the diagonal [29].
13  PENALTY1  1000  First penalty problem [17, Problem 23].
14  POWER  1000  Power problem by Oren [25].
15  MSQRTALS  1024  The dense matrix square root problem by Nocedal and Liu (case 0), seen as a nonlinear equation problem [4, Problem 204].
16  MSQRTBLS  1025  The dense matrix square root problem by Nocedal and Liu (case 1), seen as a nonlinear equation problem [4, Problem 201].
17  CRAGGLVY  5000  Extended Cragg & Levy problem [30, Problem 32].
18  NONDQUAR  Nondiagonal quartic test problem [5, Problem 57].
19  POWELLSG  Extended Powell singular function [17, Problem 13].
20  SINQUAD  Another function with nontrivial groups and repetitious elements [12].
21  SPMSRTLS  Liu and Nocedal tridiagonal matrix square root problem [4, Problem 151].
22  TRIDIA  Shanno's TRIDIA quadratic tridiagonal problem [30, Problem 8].
Table 4.1. Test problem collection. Each problem was chosen from the CUTE package.

and David Thuente from a 1983 version of MINPACK. This line search routine finds an α that meets the strong Wolfe conditions,

(4.1)  f(x + αd) ≤ f(x) + ω_1 α g(x)^T d,
(4.2)  |g(x + αd)^T d| ≤ ω_2 |g(x)^T d|;

see, e.g., Nocedal [23]. We used ω_1 = 1.0 × 10^{-4} and ω_2 = 0.9. Except for the first iteration, we always attempt a step length of 1.0 first and only use an alternate value if 1.0 does not satisfy the Wolfe conditions. In the first iteration, we initially try a step length equal to ||g_0||^{-1}. The remaining line search parameters are detailed in Appendix A. We generate the matrix H by either the limited-memory update or one of the variations described in §4.4, storing the matrix implicitly in order to save both memory and computation time. We terminate the iterations if any of the following conditions are met at iteration k:

1. the inequality ||g_k|| / ||x_k|| < ε is satisfied for a fixed tolerance ε,
2. the line search fails, or
3. the number of iterations exceeds 3000.
We say that the iterates have converged if the first condition is satisfied. Otherwise, the method has failed.

4.4. L-BFGS and its variations. We tried a number of variations on the standard L-BFGS algorithm. L-BFGS and these variations are described in this subsection and summarized in Table 4.2.

4.4.1. L-BFGS: Algorithm 0. The limited-memory BFGS update is given in (2.5) and described fully by Byrd, Nocedal and Schnabel [22]. Our implementation and the following description come essentially from [22]. Let H_k^0 be symmetric and positive definite, and assume that the m̃ pairs {s_i, y_i}, i = k − m̃, ..., k − 1, each satisfy s_i^T y_i > 0. We will let

S_k = [s_{k−m̃}  s_{k−m̃+1}  ...  s_{k−1}]  and  Y_k = [y_{k−m̃}  y_{k−m̃+1}  ...  y_{k−1}],

where m̃ = min{k, m} and m is some positive integer. We will assume that H_k^0 = γ_k I; that is, the initial matrix is iteratively rescaled by a constant, as is commonly done in practice. Then the matrix H_k obtained by m̃ applications of the limited-memory BFGS update can be expressed as

H_k = γ_k I + [S_k  Y_k] [ U_k^{-T} (D_k + γ_k Y_k^T Y_k) U_k^{-1}    −γ_k U_k^{-T} ;
                           −γ_k U_k^{-1}                              0 ] [ S_k^T ; Y_k^T ],

where U_k and D_k are the m̃ × m̃ matrices given by

(U_k)_{ij} = s_{k−m̃−1+i}^T y_{k−m̃−1+j}  if i ≤ j,  and 0 otherwise,

and

D_k = diag{ s_{k−m̃}^T y_{k−m̃}, ..., s_{k−1}^T y_{k−1} }.

We will describe how to compute d_k = −H_k g_k in the case that k > 0. Let x_k be the current iterate, and let m̃ = min{k, m}. Given s_{k−1}, y_{k−1}, g_k, the matrices S_{k−1}, Y_{k−1}, U_{k−1}^{-1}, Y_{k−1}^T Y_{k−1}, D_{k−1}, and the vectors S_{k−1}^T g_{k−1} and Y_{k−1}^T g_{k−1}:

1. Update S_{k−1} and Y_{k−1} to get the n × m̃ matrices S_k and Y_k using s_{k−1} and y_{k−1} (discarding the oldest pair if the memory is full).
2. Compute the m̃-vectors S_k^T g_k and Y_k^T g_k.
3. Compute the m̃-vectors S_k^T y_{k−1} and Y_k^T y_{k−1} by using the fact that y_{k−1} = g_k − g_{k−1}: we already know m̃ − 1 components of S_k^T g_{k−1} from S_{k−1}^T g_{k−1}, and likewise for Y_k^T g_{k−1}. We need only compute s_{k−1}^T g_{k−1} and y_{k−1}^T g_{k−1} and do the subtractions.

No.  Reference  Brief Description
0  §4.4.1  L-BFGS with no options.
1  §4.4.2, Variation 1  Allow m to vary iteratively, basing the choice of m_k on ||g|| and not allowing m_k to decrease.
2  §4.4.2, Variation 2  Allow m to vary iteratively, basing the choice of m_k on ||g|| and allowing m_k to decrease.
3  §4.4.2, Variation 3  Allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and not allowing m_k to decrease.
4  §4.4.2, Variation 4  Allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and allowing m_k to decrease.
5  §4.4.3  Dispose of old information if the step length is greater than one.
6  §4.4.4, Variation 1  Back up if the current iteration is odd.
7  §4.4.4, Variation 2  Back up if the current iteration is even.
8  §4.4.4, Variation 3  Back up if a step length of 1.0 was used in the last iteration.
9  §4.4.4, Variation 4  Back up if ||g_k|| > ||g_{k−1}||.
10  §4.4.4, Variation 3*  Back up if a step length of 1.0 was used in the last iteration and we did not back up on the last iteration.
11  §4.4.4, Variation 4*  Back up if ||g_k|| > ||g_{k−1}|| and we did not back up on the last iteration.
12  §4.4.5, Variation 1  Merge if neither of the two vectors to be merged is itself the result of a merge and the 2nd and 3rd most recent steps taken were of length 1.0.
13  §4.4.5, Variation 2  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge.
14  §4.4.6, Variation 1  Skip update on odd iterations.
15  §4.4.6, Variation 2  Skip update on even iterations.
16  §4.4.6, Variation 3  Skip update if ||g_{k+1}|| > ||g_k||.
17  Alg. 5 & Alg. 8  Dispose of old information, and back up on the next iteration if the step length is greater than one.
18  Alg. 13 & Alg. 1  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g|| and not allowing m_k to decrease.
19  Alg. 13 & Alg. 3  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and not allowing m_k to decrease.
20  Alg. 13 & Alg. 2  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g|| and allowing m_k to decrease.
21  Alg. 13 & Alg. 4  Merge if we did not do a merge the last iteration and there are at least two old s vectors to merge, and allow m to vary iteratively, basing the choice of m_k on ||g||/||x|| and allowing m_k to decrease.
Table 4.2. Description of the numerical algorithms.

4. Compute U_k^{-1}. Rather than recomputing U_k, we update the matrix from the previous iteration by deleting the leftmost column and topmost row (if the memory is already full) and appending a new column on the right and a new row on the bottom. Let ρ_{k−1} = 1 / (s_{k−1}^T y_{k−1}), let (U_{k−1}^{-1})' be the (m̃ − 1) × (m̃ − 1) lower right submatrix of U_{k−1}^{-1}, and let (S_k^T y_{k−1})' be the upper m̃ − 1 elements of S_k^T y_{k−1}. Then

U_k^{-1} = [ (U_{k−1}^{-1})'    −ρ_{k−1} (U_{k−1}^{-1})' (S_k^T y_{k−1})' ;
             0                  ρ_{k−1} ].

Note that s_{k−1}^T y_{k−1} = (S_k^T y_{k−1})_{m̃}, and so it is already computed.
5. Assemble Y_k^T Y_k. We have already computed all of the components.
6. Update D_k using D_{k−1} and s_{k−1}^T y_{k−1} = (S_k^T y_{k−1})_{m̃}.
7. Compute γ_k = y_{k−1}^T s_{k−1} / y_{k−1}^T y_{k−1}. Note that both y_{k−1}^T s_{k−1} and y_{k−1}^T y_{k−1} have already been computed.
8. Compute the two intermediate values

p_1 = U_k^{-1} S_k^T g_k,
p_2 = U_k^{-T} ( γ_k Y_k^T Y_k p_1 + D_k p_1 − γ_k Y_k^T g_k ).

9. Compute d_k = γ_k Y_k p_1 − S_k p_2 − γ_k g_k.

The storage costs for this method are very low. In order to reconstruct H_k, we need to store S_k, Y_k, U_k^{-1}, Y_k^T Y_k, and D_k (a diagonal matrix), plus a few m-vectors. This requires only 2mn + 2m^2 + O(m) storage. Assuming m << n, this is much less than the n^2 storage required for a typical implementation of BFGS.

Step  Operation Count
2     4mn
3     4n + 2m
4     2m^2 + 4m
8     m^2 + 2m
9     4m^2 + 2m
Table 4.3. Operation count for the computation of H_k g_k. Steps with no operations are not shown.

The computation of H_k g_k takes at most O(mn) operations, assuming n >> m (see Table 4.3). This is much less than the O(n^2) time normally needed to compute H_k g_k when the whole matrix H_k is stored. We are using L-BFGS as our basis for comparison. For information on the performance of L-BFGS, see Liu and Nocedal [15] and Nash and Nocedal [19].

4.4.2. Varying m iteratively: Algorithms 1-4. In typical implementations of L-BFGS, m is fixed throughout the iterations: once m updates have accumulated, m updates are always used. We considered the possibility of varying m iteratively while preserving finite termination on convex quadratics. Using an argument similar to that presented in [15], we can also prove that this algorithm has a linear rate of convergence on a convex function that satisfies a few modest conditions. We tried four different variations on this theme. All were based on the following formula, which scales m_k in relation to the size of a convergence measure φ_k derived from ||g_k||. Let m_k be the number of iterates saved at the kth iteration, with m_0 = 1, and think of m̄ as the maximum allowable value of m_k. Let the convergence test be given by ||g_k|| / ||x_k|| < ε.
Then the formula for m_k at iteration k is

m_k = min{ m_{k−1} + 1,  (m̄ − 1) (log φ_0 − log φ_k) / (log φ_0 − log(100ε)) + 1 },

where φ_k is the convergence measure specified below.
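The memory-sizing rule can be sketched as follows. This is a reconstruction: the transcription of the formula is damaged, so the clamping to [1, m̄], the ceiling, and the exact constants are our assumptions; what is clear is that m_k interpolates between 1 and m̄ on a log scale of the convergence measure and grows by at most one pair per iteration:

```python
import math

def memory_size(phi_k, phi_0, eps, m_bar, m_prev):
    """Log-scaled memory rule in the spirit of x4.4.2: m grows from 1 toward
    m_bar as the convergence measure phi falls from phi_0 toward eps, with
    growth limited to one new pair per iteration.  The clamping, ceiling,
    and the 100*eps endpoint are our own assumptions."""
    frac = (math.log(phi_0) - math.log(phi_k)) / (math.log(phi_0) - math.log(100 * eps))
    frac = min(max(frac, 0.0), 1.0)            # clamp progress to [0, 1]
    target = math.ceil((m_bar - 1) * frac) + 1  # interpolate between 1 and m_bar
    return min(m_prev + 1, target)              # grow by at most one per iteration

if __name__ == "__main__":
    # phi starts at 1.0; tolerance eps = 1e-5; maximum memory m_bar = 10.
    print(memory_size(1.0, 1.0, 1e-5, 10, m_prev=5))    # start: m = 1
    print(memory_size(1e-3, 1.0, 1e-5, 10, m_prev=20))  # near eps: m = m_bar
```

The min with m_prev + 1 reproduces the paper's restriction that the memory can accumulate only one new (s, y) pair per iteration; the non-decreasing variants would additionally take a max with m_prev.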

Alg. No.  m = 5  m = 10  m = 15  m = 50
Table 4.4. The number of failures of the algorithms on the 22 test problems. An algorithm is said to have "failed" on a particular problem if a line search fails or the maximum allowable number of iterations (3000 in our case) is exceeded.

We have two choices for φ, and a choice of whether or not we will allow m_k to decrease as well as increase. The four variations are:
1. φ = ||g||, requiring m_k ≥ m_{k−1};
2. φ = ||g||;
3. φ = ||g||/||x||, requiring m_k ≥ m_{k−1}; and
4. φ = ||g||/||x||.
We used four values of m (5, 10, 15, and 50) for each algorithm. The results are summarized in Tables 4.4-4.8. More extensive results can be obtained [9]. Table 4.4 shows that these algorithms had roughly the same number of failures as L-BFGS. Table 4.5 compares each algorithm to L-BFGS in terms of function evaluations. For each algorithm and each value of m, the number of times that the algorithm used as few or fewer function evaluations than L-BFGS is listed relative to the total number of admissible problems. Problems are admissible if at least one of the two methods solved them. We observe that in all but three cases, the algorithm used as few or fewer function evaluations than L-BFGS on over half the test problems. Table 4.6 compares each algorithm to L-BFGS in terms of time. The entries are similar to those in Table 4.5. Observe that Algorithms 1-4 did very well in terms of time, doing as well as or better than L-BFGS in nearly every case. For each problem and each algorithm, we computed the ratio of the number of function evaluations for the algorithm to the number of function evaluations for L-BFGS. Table 4.7 lists the means of these ratios. A mean below 1.0 implies that the algorithm does better than L-BFGS on average. The average is better for the algorithms in 6 out of 16 cases for the first four algorithms. Observe, however, that all the means are close to one.
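Before turning to the remaining variations, the compact representation of §4.4.1 is easy to sanity-check numerically: applying m explicit BFGS inverse updates to H_0 = γI must produce the same direction as the p_1, p_2 formulas of steps 8 and 9. A non-incremental sketch (the random test data, the sizes n = 6 and m = 3, and γ = 0.7 are our own choices; this is not the authors' FORTRAN implementation):

```python
import numpy as np

def compact_hg(S, Y, gamma, g):
    """Compute d = -H g via the compact representation of x4.4.1 (steps 8-9),
    where H is built from the columns of S and Y (oldest first) on H0 = gamma*I."""
    U = np.triu(S.T @ Y)                 # (U)_ij = s_i' y_j for i <= j
    D = np.diag(np.diag(S.T @ Y))
    Uinv = np.linalg.inv(U)
    p1 = Uinv @ (S.T @ g)
    p2 = Uinv.T @ (gamma * (Y.T @ Y) @ p1 + D @ p1 - gamma * (Y.T @ g))
    return gamma * (Y @ p1) - S @ p2 - gamma * g

def recursive_hg(S, Y, gamma, g):
    """Reference: apply the BFGS inverse update explicitly, oldest pair first."""
    n = S.shape[0]
    H = gamma * np.eye(n)
    for i in range(S.shape[1]):
        s, y = S[:, i], Y[:, i]
        rho = 1.0 / (s @ y)
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
    return -H @ g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.standard_normal((6, 3))
    Y = S + 0.1 * rng.standard_normal((6, 3))   # keeps s_i' y_i > 0
    g = rng.standard_normal(6)
    print(np.allclose(compact_hg(S, Y, 0.7, g), recursive_hg(S, Y, 0.7, g)))
```

Agreement of the two routines reflects the compact-representation result quoted from [22]; the production code instead maintains U_k^{-1}, Y_k^T Y_k, and D_k incrementally, as in steps 1-7.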

Alg. No.  m = 5  m = 10  m = 15  m = 50
1 /22 10/22 17/22 17/22 2 7/22 13/22 13/22 19/ /21 14/22 12/22 15/ /21 17/22 15/22 16/ /22 20/22 20/22 21/ /21 22/22 22/22 21/21 7 8/22 12/22 10/22 10/ /22 14/22 12/22 15/22 9 6/22 13/22 12/22 16/ /22 14/22 12/22 15/ /22 10/22 11/22 14/ /21 22/22 22/22 21/ /22 4/22 4/22 4/ /21 2/22 2/22 2/ /22 1/22 1/22 1/ /22 1/22 1/22 1/ /22 13/22 12/22 14/ /22 4/22 5/22 4/ /22 3/22 4/22 4/ /22 4/22 4/22 5/ /22 2/22 4/22 4/22
Table 4.5. Function Evaluations Comparison. The first number in each entry is the number of times the algorithm did as well as or better than normal L-BFGS in terms of function evaluations. The second number is the total number of problems solved by at least one of the two methods (the algorithm and/or L-BFGS).

We experience savings in terms of time for the first four algorithms. These algorithms tend to save fewer vectors than L-BFGS, since m_k is typically less than m, and so less work is done computing H_k g_k in these algorithms. Table 4.8 gives the mean of the ratios of time to solve for each value of m in each algorithm. Note that most of the ratios are far below one in this case. These variations did particularly well on problem 7. See [9] for more information.

4.4.3. Disposing of old information: Algorithm 5. We may decide that we are storing too much old information and that we should stop using it. For example, we may choose to throw away everything except the most recent information whenever we take a big step, since the old information may not be relevant to the new neighborhood. We use the following test: if the last step length was bigger than 1, dispose of the old information. The algorithm performed nearly the same as L-BFGS. There was substantial deviation on only one or two problems for each value of m, and this seemed evenly divided between better and worse. From Table 4.4, we see that this algorithm successfully converged on every problem.
Table 4.5 shows that it almost always did as well as or better than L-BFGS in terms of function evaluations. However, Table 4.7 shows that the differences were minor. In terms of time, we observe that the algorithm was generally faster than L-BFGS (Table 4.6), but again, considering the mean ratios of time (Table 4.8), the differences were minor. The method also does particularly well on problem 7 [9].

4.4.4. Backing Up in the Update to H: Algorithms 6-11. As discussed in §2.2, if we always use the most recent s and y in the update, we preserve quadratic termination regardless of which older values of s and y we use.
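This property is easy to illustrate numerically. The sketch below (our own choices: m = 2, a diag(1,...,5) quadratic, H_0 = I, and a back-up trigger on every iteration; not the authors' code) discards the second most recent pair at each step, so the memory holds the newest pair plus one stale pair, yet n-step termination on a strictly convex quadratic with exact line searches is retained:

```python
import numpy as np

def lbfgs_backup(A, b, x0, back_up=True, tol=1e-8, max_iter=200):
    """L-BFGS with m = 2 and exact line searches on f = 0.5 x'Ax - b'x.
    With back_up=True, the second most recent (s, y) pair is discarded
    every iteration; the newest pair is always used."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b
    pairs = []                                   # stored pairs, oldest first
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            return k
        q = g.copy()                             # two-loop recursion, H0 = I
        coeffs = []
        for s, y in reversed(pairs):             # newest pair first
            rho = 1.0 / (s @ y)
            a = rho * (s @ q)
            coeffs.append((a, rho, s, y))
            q -= a * y
        r = q
        for a, rho, s, y in reversed(coeffs):    # oldest pair first
            r += (a - rho * (y @ r)) * s
        d = -r
        alpha = -(g @ d) / (d @ (A @ d))         # exact line search
        s_new = alpha * d
        x = x + s_new
        g_new = A @ x - b
        if back_up and len(pairs) == 2:
            pairs.pop()                          # drop the 2nd most recent pair
        pairs.append((s_new, g_new - g))
        pairs = pairs[-2:]                       # memory constant m = 2
        g = g_new
    return max_iter

if __name__ == "__main__":
    A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
    b = np.ones(5)
    print(lbfgs_backup(A, b, np.zeros(5), back_up=False))  # n = 5 steps
    print(lbfgs_backup(A, b, np.zeros(5), back_up=True))   # still n = 5 steps
```

Both runs terminate within n iterations, consistent with the statement above; contrast this with the skipping example of §3, where the most recent pair is not always used and termination is lost.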

Alg. No.  m = 5  m = 10  m = 15  m = 50
1 17/22 18/22 20/22 18/ /22 19/22 18/22 18/ /21 14/22 15/22 15/ /21 18/22 20/22 18/ /22 13/22 14/22 15/ /21 19/22 15/22 15/ /22 11/22 10/22 7/ /22 7/22 6/22 5/22 9 9/22 10/22 7/22 8/ /22 8/22 5/22 5/ /22 8/22 9/22 5/ /21 12/22 8/22 11/ /22 10/22 13/22 17/ /21 2/22 2/22 3/ /22 6/22 9/22 10/ /22 2/22 3/22 3/ /22 8/22 5/22 4/ /22 14/22 19/22 20/ /22 11/22 17/22 19/ /22 14/22 17/22 19/ /22 16/22 16/22 18/22
Table 4.6. Time Comparison. The first number in each entry is the number of times the algorithm did as well as or better than normal L-BFGS in terms of time. The second number is the total number of problems solved by at least one of the two methods (the algorithm and/or L-BFGS).

Using this idea, we created some algorithms. Under certain conditions, we discard the next most recent values of s and y in forming H, although we still use the most recent s and y vectors and any other vectors that have been saved from previous iterations. We call this "backing up" because it is as if we back up over the next most recent values of s and y. These algorithms used the following four tests to trigger backing up:
1. The current iteration is odd.
2. The current iteration is even.
3. A step length of 1.0 was used in the last iteration.
4. ||g_k|| > ||g_{k−1}||.
In two additional algorithms, we varied situations 3 and 4 by not allowing a back-up if a back-up was performed on the previous iteration. The backing-up strategy seemed robust in terms of failures: in 4 out of the 6 variations we tried for this algorithm, there were no failures at all. See Table 4.4 for more information. It is interesting to observe that backing up on odd iterations (Algorithm 6) and backing up on even iterations (Algorithm 7) produced very different results. Backing up on odd iterations seemed to have almost no effect on the number of function evaluations (Table 4.7) and little effect on the time (Table 4.8).
However, backing up on even iterations causes much different behavior from L-BFGS: it does worse than L-BFGS on most problems, but better on a few. Algorithms 8 and 10 were two variations of the same idea: backing up if the previous step length was one. This wipes out the data from the previous iteration after it has been used in one update. Both show improvement over L-BFGS in terms of function evaluations; in fact, these two algorithms have the best function-evaluation ratio for the m = 50 case (Table 4.7). Unfortunately, these algorithms did not compete with L-BFGS in terms of time (Table 4.8). There is little difference between


1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

Linear Algebra (part 1) : Matrices and Systems of Linear Equations (by Evan Dummit, 2016, v. 2.02)

Linear Algebra (part 1) : Matrices and Systems of Linear Equations (by Evan Dummit, 2016, v. 2.02) Linear Algebra (part ) : Matrices and Systems of Linear Equations (by Evan Dummit, 206, v 202) Contents 2 Matrices and Systems of Linear Equations 2 Systems of Linear Equations 2 Elimination, Matrix Formulation

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS

ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS JOHANNES BRUST, OLEG BURDAKOV, JENNIFER B. ERWAY, ROUMMEL F. MARCIA, AND YA-XIANG YUAN Abstract. We present

More information

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition) Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational

More information

The Conjugate Gradient Method for Solving Linear Systems of Equations

The Conjugate Gradient Method for Solving Linear Systems of Equations The Conjugate Gradient Method for Solving Linear Systems of Equations Mike Rambo Mentor: Hans de Moor May 2016 Department of Mathematics, Saint Mary s College of California Contents 1 Introduction 2 2

More information

Optimization and Root Finding. Kurt Hornik

Optimization and Root Finding. Kurt Hornik Optimization and Root Finding Kurt Hornik Basics Root finding and unconstrained smooth optimization are closely related: Solving ƒ () = 0 can be accomplished via minimizing ƒ () 2 Slide 2 Basics Root finding

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

MATH 4211/6211 Optimization Quasi-Newton Method

MATH 4211/6211 Optimization Quasi-Newton Method MATH 4211/6211 Optimization Quasi-Newton Method Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 Quasi-Newton Method Motivation:

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

Unconstrained Multivariate Optimization

Unconstrained Multivariate Optimization Unconstrained Multivariate Optimization Multivariate optimization means optimization of a scalar function of a several variables: and has the general form: y = () min ( ) where () is a nonlinear scalar-valued

More information

4.1 Eigenvalues, Eigenvectors, and The Characteristic Polynomial

4.1 Eigenvalues, Eigenvectors, and The Characteristic Polynomial Linear Algebra (part 4): Eigenvalues, Diagonalization, and the Jordan Form (by Evan Dummit, 27, v ) Contents 4 Eigenvalues, Diagonalization, and the Jordan Canonical Form 4 Eigenvalues, Eigenvectors, and

More information

MA106 Linear Algebra lecture notes

MA106 Linear Algebra lecture notes MA106 Linear Algebra lecture notes Lecturers: Diane Maclagan and Damiano Testa 2017-18 Term 2 Contents 1 Introduction 3 2 Matrix review 3 3 Gaussian Elimination 5 3.1 Linear equations and matrices.......................

More information

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM JIAYI GUO AND A.S. LEWIS Abstract. The popular BFGS quasi-newton minimization algorithm under reasonable conditions converges globally on smooth

More information

x k+1 = x k + α k p k (13.1)

x k+1 = x k + α k p k (13.1) 13 Gradient Descent Methods Lab Objective: Iterative optimization methods choose a search direction and a step size at each iteration One simple choice for the search direction is the negative gradient,

More information

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St

Outline Introduction: Problem Description Diculties Algebraic Structure: Algebraic Varieties Rank Decient Toeplitz Matrices Constructing Lower Rank St Structured Lower Rank Approximation by Moody T. Chu (NCSU) joint with Robert E. Funderlic (NCSU) and Robert J. Plemmons (Wake Forest) March 5, 1998 Outline Introduction: Problem Description Diculties Algebraic

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Optimization II: Unconstrained

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Gradient-Based Optimization

Gradient-Based Optimization Multidisciplinary Design Optimization 48 Chapter 3 Gradient-Based Optimization 3. Introduction In Chapter we described methods to minimize (or at least decrease) a function of one variable. While problems

More information

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion:

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: Unconstrained Convex Optimization 21 4 Newton Method H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: f(x + p) f(x)+p T f(x)+ 1 2 pt H(x)p ˆf(p) In general, ˆf(p) won

More information

A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality constraints

A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality constraints Journal of Computational and Applied Mathematics 161 (003) 1 5 www.elsevier.com/locate/cam A new ane scaling interior point algorithm for nonlinear optimization subject to linear equality and inequality

More information

Optimization Methods

Optimization Methods Optimization Methods Decision making Examples: determining which ingredients and in what quantities to add to a mixture being made so that it will meet specifications on its composition allocating available

More information

Preconditioned conjugate gradient algorithms with column scaling

Preconditioned conjugate gradient algorithms with column scaling Proceedings of the 47th IEEE Conference on Decision and Control Cancun, Mexico, Dec. 9-11, 28 Preconditioned conjugate gradient algorithms with column scaling R. Pytla Institute of Automatic Control and

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations Linear Algebra (part ) : Matrices and Systems of Linear Equations (by Evan Dummit, 207, v 260) Contents Matrices and Systems of Linear Equations Systems of Linear Equations Elimination, Matrix Formulation

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

1. Introduction In this paper we study a new type of trust region method for solving the unconstrained optimization problem, min f(x): (1.1) x2< n Our

1. Introduction In this paper we study a new type of trust region method for solving the unconstrained optimization problem, min f(x): (1.1) x2< n Our Combining Trust Region and Line Search Techniques Jorge Nocedal y and Ya-xiang Yuan z March 14, 1998 Report OTC 98/04 Optimization Technology Center Abstract We propose an algorithm for nonlinear optimization

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Introduction to Nonlinear Optimization Paul J. Atzberger

Introduction to Nonlinear Optimization Paul J. Atzberger Introduction to Nonlinear Optimization Paul J. Atzberger Comments should be sent to: atzberg@math.ucsb.edu Introduction We shall discuss in these notes a brief introduction to nonlinear optimization concepts,

More information

IE 5531: Engineering Optimization I

IE 5531: Engineering Optimization I IE 5531: Engineering Optimization I Lecture 15: Nonlinear optimization Prof. John Gunnar Carlsson November 1, 2010 Prof. John Gunnar Carlsson IE 5531: Engineering Optimization I November 1, 2010 1 / 24

More information

Institute for Advanced Computer Studies. Department of Computer Science. On the Convergence of. Multipoint Iterations. G. W. Stewart y.

Institute for Advanced Computer Studies. Department of Computer Science. On the Convergence of. Multipoint Iterations. G. W. Stewart y. University of Maryland Institute for Advanced Computer Studies Department of Computer Science College Park TR{93{10 TR{3030 On the Convergence of Multipoint Iterations G. W. Stewart y February, 1993 Reviseed,

More information

Matrix Derivatives and Descent Optimization Methods

Matrix Derivatives and Descent Optimization Methods Matrix Derivatives and Descent Optimization Methods 1 Qiang Ning Department of Electrical and Computer Engineering Beckman Institute for Advanced Science and Techonology University of Illinois at Urbana-Champaign

More information

2 Chapter 1 rely on approximating (x) by using progressively ner discretizations of [0; 1] (see, e.g. [5, 7, 8, 16, 18, 19, 20, 23]). Specically, such

2 Chapter 1 rely on approximating (x) by using progressively ner discretizations of [0; 1] (see, e.g. [5, 7, 8, 16, 18, 19, 20, 23]). Specically, such 1 FEASIBLE SEQUENTIAL QUADRATIC PROGRAMMING FOR FINELY DISCRETIZED PROBLEMS FROM SIP Craig T. Lawrence and Andre L. Tits ABSTRACT Department of Electrical Engineering and Institute for Systems Research

More information

Cubic regularization in symmetric rank-1 quasi-newton methods

Cubic regularization in symmetric rank-1 quasi-newton methods Math. Prog. Comp. (2018) 10:457 486 https://doi.org/10.1007/s12532-018-0136-7 FULL LENGTH PAPER Cubic regularization in symmetric rank-1 quasi-newton methods Hande Y. Benson 1 David F. Shanno 2 Received:

More information

1 Numerical optimization

1 Numerical optimization Contents Numerical optimization 5. Optimization of single-variable functions.............................. 5.. Golden Section Search..................................... 6.. Fibonacci Search........................................

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Fran E. Curtis and Wei Guo Department of Industrial and Systems Engineering, Lehigh University, USA COR@L Technical Report 14T-011-R1

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x)

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x) Solving Nonlinear Equations & Optimization One Dimension Problem: or a unction, ind 0 such that 0 = 0. 0 One Root: The Bisection Method This one s guaranteed to converge at least to a singularity, i not

More information

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for Life beyond Newton Notes for 2016-04-08 Newton s method has many attractive properties, particularly when we combine it with a globalization strategy. Unfortunately, Newton steps are not cheap. At each

More information

Handling nonpositive curvature in a limited memory steepest descent method

Handling nonpositive curvature in a limited memory steepest descent method IMA Journal of Numerical Analysis (2016) 36, 717 742 doi:10.1093/imanum/drv034 Advance Access publication on July 8, 2015 Handling nonpositive curvature in a limited memory steepest descent method Frank

More information

Extra-Updates Criterion for the Limited Memory BFGS Algorithm for Large Scale Nonlinear Optimization 1

Extra-Updates Criterion for the Limited Memory BFGS Algorithm for Large Scale Nonlinear Optimization 1 journal of complexity 18, 557 572 (2002) doi:10.1006/jcom.2001.0623 Extra-Updates Criterion for the Limited Memory BFGS Algorithm for Large Scale Nonlinear Optimization 1 M. Al-Baali Department of Mathematics

More information

Adaptive two-point stepsize gradient algorithm

Adaptive two-point stepsize gradient algorithm Numerical Algorithms 27: 377 385, 2001. 2001 Kluwer Academic Publishers. Printed in the Netherlands. Adaptive two-point stepsize gradient algorithm Yu-Hong Dai and Hongchao Zhang State Key Laboratory of

More information

The speed of Shor s R-algorithm

The speed of Shor s R-algorithm IMA Journal of Numerical Analysis 2008) 28, 711 720 doi:10.1093/imanum/drn008 Advance Access publication on September 12, 2008 The speed of Shor s R-algorithm J. V. BURKE Department of Mathematics, University

More information

Linear Algebra: Linear Systems and Matrices - Quadratic Forms and Deniteness - Eigenvalues and Markov Chains

Linear Algebra: Linear Systems and Matrices - Quadratic Forms and Deniteness - Eigenvalues and Markov Chains Linear Algebra: Linear Systems and Matrices - Quadratic Forms and Deniteness - Eigenvalues and Markov Chains Joshua Wilde, revised by Isabel Tecu, Takeshi Suzuki and María José Boccardi August 3, 3 Systems

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

R-Linear Convergence of Limited Memory Steepest Descent

R-Linear Convergence of Limited Memory Steepest Descent R-Linear Convergence of Limited Memory Steepest Descent Fran E. Curtis and Wei Guo Department of Industrial and Systems Engineering, Lehigh University, USA COR@L Technical Report 16T-010 R-Linear Convergence

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization Journal of Computational and Applied Mathematics 155 (2003) 285 305 www.elsevier.com/locate/cam Nonmonotonic bac-tracing trust region interior point algorithm for linear constrained optimization Detong

More information

Jorge Nocedal. select the best optimization methods known to date { those methods that deserve to be

Jorge Nocedal. select the best optimization methods known to date { those methods that deserve to be Acta Numerica (1992), {199:242 Theory of Algorithms for Unconstrained Optimization Jorge Nocedal 1. Introduction A few months ago, while preparing a lecture to an audience that included engineers and numerical

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient

More information