ON THE CONNECTION BETWEEN THE CONJUGATE GRADIENT METHOD AND QUASI-NEWTON METHODS ON QUADRATIC PROBLEMS

Anders FORSGREN, Tove ODLAND

Technical Report TRITA-MAT-2013-OS-03, Department of Mathematics, KTH Royal Institute of Technology, February 2013

Optimization and Systems Theory, Department of Mathematics, KTH Royal Institute of Technology, Stockholm, Sweden (andersf@kth.se, odland@kth.se). Research supported by the Swedish Research Council (VR).

Abstract

It is well known that the conjugate gradient method and a quasi-Newton method, using any well-defined update matrix from the one-parameter Broyden family of updates, produce the same iterates on a quadratic problem with positive-definite Hessian. This equivalence does not hold for any quasi-Newton method. We discuss more precisely the conditions on the update matrix that give rise to this behavior, and show that the crucial fact is that the components of each update matrix are chosen in the last two dimensions of the Krylov subspaces defined by the conjugate gradient method. In the framework based on a sufficient condition to obtain mutually conjugate search directions, we show that the one-parameter Broyden family is complete. We also show that the update matrices from the one-parameter Broyden family are almost always well-defined on a quadratic problem with positive-definite Hessian. The only exception is when the symmetric rank-one update is used and the unit steplength is taken in the same iteration; in this case it is the Broyden parameter that becomes undefined.

1 Introduction

In this paper we examine some well-known methods used for solving unconstrained optimization problems, and specifically their behavior on quadratic problems. A motivation why these problems are of interest is that the task of solving a linear system of equations Ax = b,

with the assumption A = A^T ≻ 0, may equivalently be considered as that of solving an unconstrained quadratic programming problem,

  min_{x ∈ R^n} q(x) = min_{x ∈ R^n} (1/2) x^T H x + c^T x,   (QP)

where one lets H = A and c = −b to obtain the usual notation.

Some examples of methods that can be used to solve (QP) are the steepest-descent method, quasi-Newton methods, Newton's method and conjugate direction methods. (Newton's method solves (QP) in one iteration, but requires an explicit expression of the Hessian H.) Given an initial guess x_0, the general idea of these methods is to, in each iteration, generate a search direction p_k and then take a step α_k along that direction to approach the optimal solution. For k ≥ 0, the next iterate is hence obtained as

  x_{k+1} = x_k + α_k p_k.   (1)

The main difference between the methods mentioned above is the manner in which the search direction p_k is generated. For high-dimensional problems it is preferred that only function and gradient values are used in calculations. The gradient of the objective function q(x) is given by g(x) = Hx + c and its value at x_k is denoted by g_k.

The research presented in this paper stems from the desire to better understand the well-known connection between the conjugate gradient method, henceforth CG, and quasi-Newton methods, henceforth QN, namely that, using exact linesearch, these two methods will generate the same sequence of iterates as they approach the optimal solution of (QP). We are interested in understanding this connection especially in light of the choices made in association with the generation of p_k in QN.

QN and CG will be introduced in Section 2, where we also state some background results on the connection between the two methods. In Section 3 we present our results, and some concluding remarks are made in Section 4.

2 Background

On (QP), naturally exact linesearch is used to obtain the steplength in each iteration. Exact linesearch entails that the steplength is chosen such that the unique minimum of the function value along the given search direction p_k is obtained. In iteration k, the optimal steplength is given by

  α_k = − p_k^T g_k / (p_k^T H p_k).   (2)

Since H = H^T ≻ 0, it can be shown that the descent property, q(x_{k+1}) < q(x_k), holds as long as

  p_k^T g_k ≠ 0.   (3)
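The basic iteration (1) with the exact steplength (2) can be sketched in a few lines of Python. This is an illustration only, not code from the report; the function name exact_linesearch_step and the small test problem are assumptions made for the example.

import numpy as np

def exact_linesearch_step(H, c, x, p):
    """Take one step x_{k+1} = x_k + alpha_k p_k with the exact steplength (2).

    Illustrative sketch, assuming H is symmetric positive definite.
    """
    g = H @ x + c                       # gradient g_k = H x_k + c
    alpha = -(p @ g) / (p @ (H @ p))    # exact steplength, equation (2)
    return x + alpha * p, alpha

# Tiny usage example on a 2-by-2 quadratic.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([-1.0, 1.0])
x0 = np.zeros(2)
p0 = -(H @ x0 + c)                      # steepest-descent direction at x_0
x1, alpha0 = exact_linesearch_step(H, c, x0, p0)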

Note that if p_k^T g_k < 0, i.e. if p_k is a descent direction, then α_k > 0 and a step is taken along that direction. If p_k^T g_k > 0, i.e. if p_k is an ascent direction, then α_k < 0 and a step is taken in the opposite direction.

On (QP), one is generally interested in methods for which the optimal solution is found in at most n iterations. It can be shown that a sufficient property for this behavior is that the method generates search directions which are mutually conjugate with respect to H, i.e.

  p_i^T H p_k = 0,  i ≠ k,   (4)

see, e.g., [13, Chapter 5].

A generic way to generate conjugate vectors is by means of the conjugate Gram-Schmidt process. Given a set of linearly independent vectors {a_0, ..., a_{n−1}}, a set of vectors {p_0, ..., p_{n−1}} mutually conjugate with respect to H can be constructed by letting p_0 = a_0 and, for k > 0,

  p_k = a_k + Σ_{i=0}^{k−1} β_{k,i} p_i.   (5)

The values of {β_{k,i}}_{i=0}^{k−1} are uniquely determined in order to make p_k conjugate to {p_0, ..., p_{k−1}} with respect to H. Conjugate direction methods is the common name for all methods which are based on generating search directions in this manner. See, e.g., [16] for an intuitive introduction; a short computational sketch of this process is given below.

2.1 Quasi-Newton methods

In QN methods the search directions are generated by solving

  B_k p_k = −g_k,   (6)

in each iteration. A well-known sufficient condition for this p_k to be conjugate with respect to H to all previous search directions, and hence guarantee the finite termination property, is for B_k to satisfy

  B_k p_i = ρ_i H p_i,  i = 0, ..., k−1,   (7)

where ρ_i is an arbitrary nonzero scaling factor that we for the time being will put to 1. Observe that (7) can be split into the two sets of equations

  B_k p_{k−1} = H p_{k−1},   (8)

and

  B_k p_i = H p_i,  i = 0, ..., k−2.   (9)

Here (8) can be seen as installing the correct curvature for B_k over p_{k−1}, while (9) installs the correct curvature over the rest of the previous search directions. The matrix B_k is in this sense an approximation of the Hessian H. (The choice B_k = H would give Newton's method, whereas the choice B_k = I would give the method of steepest descent.)
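Returning to the conjugate Gram-Schmidt process (5), the construction can be sketched in a few lines of Python (not code from the report; the name conjugate_gram_schmidt and the test matrix are illustrative). The same conjugacy, p_i^T H p_j = 0 for i ≠ j, is what condition (7) asks a quasi-Newton matrix B_k to reproduce.

import numpy as np

def conjugate_gram_schmidt(a_vectors, H):
    """Conjugate Gram-Schmidt process of equation (5).

    Given linearly independent vectors a_0, ..., a_{n-1}, return vectors
    p_0, ..., p_{n-1} that are mutually conjugate with respect to H,
    i.e. p_i^T H p_j = 0 for i != j.
    """
    P = []
    for a in a_vectors:
        p = a.astype(float)
        for q in P:
            # beta_{k,i} is chosen so that the new p satisfies p^T H q = 0
            p -= (a @ (H @ q)) / (q @ (H @ q)) * q
        P.append(p)
    return P

# Usage: conjugating the standard basis of R^3 with respect to H.
H = np.diag([1.0, 2.0, 3.0]) + 0.1 * np.ones((3, 3))
P = conjugate_gram_schmidt(list(np.eye(3)), H)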

In this paper, we will consider symmetric approximations of the Hessian, i.e. B_k = B_k^T. As will be mentioned, one could also consider unsymmetric approximation matrices, see [10].

The first suggestion of a QN method was made by Davidon in 1959 [3], using the term variable metric method. In 1963, in a famous paper by Fletcher and Powell [5], Davidon's method was modified ("We have made both a simplification by which certain orthogonality conditions which are important to the rate of attaining the solution are preserved, and also an improvement in the criterion of convergence." [5]) and this was the starting point for making these QN methods widely known, used and studied. We choose to work with an approximation of the Hessian rather than an approximation of the inverse Hessian, M_k, as many of the earlier papers did, e.g. [5]. Our results can however straightforwardly be derived for the inverse point of view, where (6) is replaced by p_k = −M_k g_k.

The approximation matrix B_k used in iteration k to solve for p_k is obtained by adding an update matrix, U_k, to the previous approximation matrix,

  B_k = B_{k−1} + U_k.   (10)

One often considers the Cholesky factorization of B_k; then (6) can be solved in order of n^2 operations. Also, if in (10) the update matrix U_k is of low rank, one does not need to compute the Cholesky factorization of B_k from scratch in each iteration, see, e.g., [7].

Equation (7) can be seen as a starting point for deriving update schemes. In the literature, much emphasis is placed on that the updates should satisfy (8), the so-called quasi-Newton condition. (This condition alone is not a sufficient condition on B_k to give conjugate directions.) Probably the most famous update scheme is the one using update matrices from the one-parameter Broyden family of updates [1], described by

  U_k = H p_{k−1} p_{k−1}^T H / (p_{k−1}^T H p_{k−1}) − B_{k−1} p_{k−1} p_{k−1}^T B_{k−1} / (p_{k−1}^T B_{k−1} p_{k−1}) + φ (p_{k−1}^T B_{k−1} p_{k−1}) w_k w_k^T,   (11)

with

  w_k = H p_{k−1} / (p_{k−1}^T H p_{k−1}) − B_{k−1} p_{k−1} / (p_{k−1}^T B_{k−1} p_{k−1}),

and where φ is a free parameter, which we will refer to as the Broyden parameter. For all updates in this family, (10) has the property of hereditary symmetry, i.e. if B_{k−1} is symmetric then B_k will be symmetric. The update given by the choice φ = 0 is known as the Broyden-Fletcher-Goldfarb-Shanno update, or BFGS update for short. For this update, when exact linesearch is used, (10) has the property of hereditary positive definiteness, i.e. if B_{k−1} ≻ 0 then B_k ≻ 0. An implication of this is that for all updates given by φ ≥ 0, when exact linesearch is used, (10) has the property of hereditary positive definiteness, see, e.g., [11, Chapter 9].
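The following is a minimal Python sketch of the update (11). It is an illustration, not code from the report; the name broyden_update is assumed, and the Hessian H is passed explicitly even though on (QP) the product H p_{k−1} equals (g_k − g_{k−1}) / α_{k−1} and so is available from gradients alone.

import numpy as np

def broyden_update(B, H, p, phi):
    """One-parameter Broyden family update (11): return B + U for a given phi.

    Illustrative sketch.  B plays the role of B_{k-1} and p that of p_{k-1};
    phi = 0 gives the BFGS update.
    """
    Hp, Bp = H @ p, B @ p
    pHp, pBp = p @ Hp, p @ Bp            # assumed nonzero
    w = Hp / pHp - Bp / pBp
    U = (np.outer(Hp, Hp) / pHp
         - np.outer(Bp, Bp) / pBp
         + phi * pBp * np.outer(w, w))
    return B + U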

Note that there are updates in the one-parameter Broyden family for which (10) does not have this property.

For all values of the Broyden parameter φ in (11), the approximation matrix B_k given by (10) satisfies (7). The one-parameter Broyden family of updates is by no means the only family of updates that gives an approximation matrix B_k satisfying (7). However, on (QP), these updates are linked in a particular way to the conjugate gradient method introduced next.

2.2 Conjugate gradient method

With the choice a_k = −g_k in (5) one obtains the conjugate gradient method, CG, of Hestenes and Stiefel [9]. In effect, in CG let p_0 = −g_0, and for k > 0 the only β-values in (5) that will be non-zero are

  β_{k,k−1} = p_{k−1}^T H g_k / (p_{k−1}^T H p_{k−1}),   (12)

where one may drop the first sub-index. Equation (5) can then be written as

  p_k = −g_k + β_k p_{k−1}.   (13)

From the use of exact linesearch it holds that g_k^T p_i = 0, i < k, which implies

  g_k^T g_i = 0,  i < k,   (14)

i.e. the produced gradients are orthogonal and therefore linearly independent, as required. In CG, one only needs to store the most recent previous search direction, p_{k−1}. This is a reduction in the amount of storage required compared to a general conjugate direction method, where potentially all previous search directions are needed to compute p_k.

Although equations (1), (2), (12) and (13) give a complete description of an iteration of CG, the power and richness of the method is somewhat clouded in notation. An intuitive way to picture what happens in an iteration of CG is to describe it as a Krylov subspace method.

Definition 2.1. Given a matrix A and a vector b, the Krylov subspace generated by A and b is given by K_k(b, A) = span{b, Ab, ..., A^{k−1} b}.

Krylov subspaces are linear subspaces which are expanding, i.e. K_1(b, A) ⊆ K_2(b, A) ⊆ K_3(b, A) ⊆ ..., and dim(K_k(b, A)) ≤ k. Given x ∈ K_k(b, A), then Ax ∈ K_{k+1}(b, A); see, e.g., [8] for an introduction to Krylov space methods.

CG is a Krylov subspace method, and iteration k may be put as the following constrained optimization problem

  min q(x),  s.t.  x ∈ x_0 + K_{k+1}(p_0, H).   (CG_k)

The optimal solution of (CG_k) is x_{k+1} and the corresponding multiplier is given by g_{k+1} = ∇q(x_{k+1}) = H x_{k+1} + c.
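For reference, an iteration of CG built directly from (1), (2), (12) and (13) can be sketched as follows. This is an illustration only, not code from the report; cg_quadratic is an assumed name.

import numpy as np

def cg_quadratic(H, c, x0, tol=1e-12):
    """CG on (QP): minimize q(x) = 0.5 x^T H x + c^T x for H = H^T > 0.

    Illustrative sketch; returns the list of iterates x_0, x_1, ...
    """
    x = np.asarray(x0, dtype=float).copy()
    g = H @ x + c
    p = -g                                   # p_0 = -g_0
    iterates = [x.copy()]
    for _ in range(len(x)):                  # at most n iterations on (QP)
        if np.linalg.norm(g) <= tol:
            break
        Hp = H @ p
        alpha = -(p @ g) / (p @ Hp)          # exact steplength (2)
        x = x + alpha * p                    # iterate update (1)
        g = H @ x + c
        beta = (p @ (H @ g)) / (p @ Hp)      # equation (12)
        p = -g + beta * p                    # equation (13)
        iterates.append(x.copy())
    return iterates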

In each iteration, the dimension of the affine subspace where the optimal solution is sought increases by one. After at most n iterations the optimal solution in an n-dimensional space is found, which will then be the optimal solution of (QP). It may happen, depending on the number of distinct eigenvalues of H and the orientation of p_0, that the optimal solution is found after less than n iterations, see, e.g., [15, Chapter 6]. We will henceforth let n_CG denote the number of iterations that CG will require to solve a given problem on the form (QP).

The search direction p_k belongs to K_{k+1}(p_0, H), and as it is conjugate to all previous search directions with respect to H it holds that span{p_0, p_1, ..., p_k} = K_{k+1}(p_0, H), i.e., the search directions p_0, ..., p_k form an H-orthogonal basis for K_{k+1}(p_0, H). This result is summarized in the following lemma; for a proof see, e.g., [8].

Lemma 2.2. If a set of vectors {p_0, ..., p_k} satisfies p_i ∈ K_{i+1}(p_0, H), i ≤ k, and p_i^T H p_j = 0, i ≠ j, then span{p_0, ..., p_k} = K_{k+1}(p_0, H).

We will henceforth refer to the search direction produced by CG, in iteration k on (QP), as p_k^CG. Since the gradients are mutually orthogonal, and because of the relationship with the search directions in (13), it can be shown that span{g_0, ..., g_k} = K_{k+1}(p_0, H), i.e. the gradients form an orthogonal basis for the subspace K_{k+1}(p_0, H).

General conjugate direction methods can not be described as Krylov subspace methods, since in general one does not have span{p_0, ..., p_k} = K_{k+1}(p_0, H). We will use this special characteristic of CG when investigating the well-known connection to QN. Although our focus is quadratic programming, it may interest the reader to know that CG was extended to general unconstrained problems by Fletcher and Reeves [6].

2.3 Background results

Given two parallel vectors and an initial point x_k, performing exact linesearch from the initial point with respect to a given objective function along these two vectors will yield the same iterate x_{k+1}. Hence, two methods will find the same sequence of iterates if and only if the search directions generated by the two methods are parallel.

In [10], Huang shows that a QN method with B_k in the Huang family of updates produces search directions which are parallel to those of CG on (QP). In the Huang family of updates the additional scaling parameter ρ_i in (7) is allowed to be different from 1; also, this family includes unsymmetric approximation matrices. The symmetric part of the Huang family, with ρ_i = 1 for all i, is the one-parameter Broyden family, see, e.g., [4].

As we focus on symmetric approximations, our interest is in the connection between CG and QN with B_k in the one-parameter Broyden family of updates. However, the following results could straightforwardly be translated to the case of an arbitrary ρ_i in (7), but we keep this parameter fixed to 1 to keep the notation cleaner.

Nazareth [12] shows that a QN method with B_k updated using the BFGS update gives a search direction that, in addition to being parallel to the search direction of CG, has the same length. Buckley, see e.g. [2], uses this special connection to design algorithms that combine CG and QN using the BFGS update. Another interesting result, proved by Dixon [4], is that on any convex function, using perfect line-search (a generalization of exact linesearch for general convex functions, see [4]), the one-parameter Broyden family gives rise to parallel search directions. However, the connection between CG and QN does not hold for general convex functions.

All B_k satisfying (7) can be seen as defining different conjugate direction methods, but only certain choices of B_k will give CG. The one-parameter Broyden family satisfies (7), but some additional condition is required in order to obtain the special conjugate direction method which is CG. What is it about this family of updates that makes the generated search direction p_k parallel to p_k^CG for all k?

In this paper, we provide an answer to this question by turning our attention to the search directions defined by CG and the Krylov subspaces they span. We will investigate the conditions on B_k, and in particular on U_k, that arise in this setting. The conditions we obtain will contribute to a better understanding of the well-known connection between CG and QN.

3 Results

As a reminder for the reader we will, in the following proposition, state the necessary and sufficient conditions for when each vector in the set {p_0, ..., p_k} is parallel to the corresponding vector in the set {p_0^CG, ..., p_k^CG}. Given x_0, one may calculate g_0 = H x_0 + c.

Proposition 3.1. Let p_0 = p_0^CG = −g_0. Then for all k ≥ 0, p_k is parallel to p_k^CG if and only if

(i) p_k ∈ K_{k+1}(p_0, H), and,
(ii) p_k is conjugate to {p_0, ..., p_{k−1}} with respect to H.

Proof. For the sake of completeness we include the proof.

Necessary: Suppose p_k is parallel to p_k^CG for all k ≥ 0. Then p_k = δ_k p_k^CG, for arbitrary nonzero scalars δ_k for k ≥ 1, and δ_0 = 1. As p_k^CG satisfies (i), it holds that p_k = δ_k p_k^CG ∈ K_{k+1}(p_0, H), since K_{k+1}(p_0, H) is a linear subspace. And as p_k^CG satisfies (ii), it follows that

  p_k^T H p_i = δ_k δ_i (p_k^CG)^T H p_i^CG = 0,  i ≤ k−1,

i.e. p_k is conjugate to {p_0, ..., p_{k−1}} with respect to H.

Sufficient: Suppose p_k satisfies (i) and (ii) for all k ≥ 0. It is clear that p_0 and p_0^CG are parallel. From (i) one has that, for all k, p_k and p_k^CG lie in the same subspace of dimension k+1, namely K_{k+1}(p_0, H). By (ii) and Lemma 2.2, both {p_0, ..., p_{k−1}} and {p_0^CG, ..., p_{k−1}^CG} form an H-orthogonal basis for the space K_k(p_0, H). This only leaves one dimension of K_{k+1}(p_0, H) in which p_k and p_k^CG can lie. Hence, p_k and p_k^CG are parallel for each k.

These necessary and sufficient conditions will serve as a foundation for the rest of our results. We will determine conditions, first on B_k, and second on U_k used in QN, in order for the generated search direction p_k to satisfy the conditions of Proposition 3.1. Given the current iterate x_k, one may calculate g_k = H x_k + c. Hence, in iteration k of QN, p_k is determined from (6) and depends entirely on the choice of B_k. Assume that B_k is constructed as

  B_k = I + Σ_{j=1}^{n_k} γ_j v_j v_j^T,   (15)

i.e. an identity matrix plus a linear combination of rank-one matrices. (Any symmetric matrix can be expressed on this form.) We make no assumption on n_k. Let B_0 = I, then p_0 = p_0^CG = −g_0.

Given a sequence {p_0, ..., p_{k−1}} that satisfies the conditions of Proposition 3.1, the following lemma gives sufficient conditions on B_k in order for the sequence {p_0, ..., p_{k−1}, p_k} to satisfy the conditions of Proposition 3.1.

Lemma 3.2. Let B_0 = I, so that p_0 = p_0^CG = −g_0. Given {p_0, ..., p_{k−1}} that satisfies the conditions of Proposition 3.1, if p_k is obtained from (6), and B_k satisfies

(i) v_j ∈ K_{k+1}(p_0, H) for all j = 1, ..., n_k, and,
(ii) B_k p_i = H p_i for all i ≤ k−1,

then {p_0, ..., p_{k−1}, p_k} will satisfy the conditions of Proposition 3.1.

Proof. First we show that p_k satisfies condition (i) of Proposition 3.1. If (15) is inserted into (6) one obtains

  (I + Σ_{j=1}^{n_k} γ_j v_j v_j^T) p_k = −g_k,

so that

  p_k = −g_k − Σ_{j=1}^{n_k} γ_j v_j v_j^T p_k = −g_k + Σ_{j=1}^{n_k} γ̂_j v_j,

where

  γ̂_j = −γ_j v_j^T p_k,

are scalars for all j. Since g_k ∈ K_{k+1}(p_0, H), and since, by assumption, v_j ∈ K_{k+1}(p_0, H) for all j, it holds that p_k ∈ K_{k+1}(p_0, H). Hence, p_k satisfies condition (i) of Proposition 3.1.

Condition (ii) is identical to (7) with ρ_i = 1 for all i, and this is a sufficient condition for p_k to be conjugate to {p_0, ..., p_{k−1}} with respect to H, i.e. p_k satisfies condition (ii) of Proposition 3.1.

If in QN, for each k = 0, ..., n_CG − 1, the matrix B_k satisfies the conditions of Lemma 3.2, then one will obtain a sequence {p_0, ..., p_{n_CG−1}} that satisfies the conditions of Proposition 3.1. Hence, with exact line-search, the same iterates would be obtained as for CG on (QP).

The conditions on B_k given in Lemma 3.2, and the conditions in all the following results, will be sufficient conditions to generate search directions that satisfy condition (ii) of Proposition 3.1. The reason we do not get necessary and sufficient conditions is due to the use of (7) to obtain the desired conjugacy of the search directions.

As B_k is updated according to (10), one would prefer to have conditions, in iteration k, on the update matrix U_k instead of on the entire matrix B_k. Therefore, we now modify Lemma 3.2 by noting that equation (15) may be stated as

  B_k = B_{k−1} + U_k = I + Σ_{j=1}^{n_{k−1}} γ_j v_j v_j^T + Σ_{j=1}^{m_k} η_j^(k) u_j^(k) (u_j^(k))^T,   (16)

where the last sum is U_k, the part that is added in iteration k. We make no assumption on m_k. Lemma 3.2 can be modified, recalling that one may split condition (ii) as

  B_k p_i = H p_i,  i ≤ k−2,   (17)
  B_k p_{k−1} = H p_{k−1}.   (18)

Using (10) one can reformulate (17) and (18) in terms of U_k, see (ii) and (iii) of the following proposition. Given B_i that satisfies Lemma 3.2 for i = 0, ..., k−1, i.e. {p_0, ..., p_{k−1}} satisfies the conditions of Proposition 3.1, the following proposition gives sufficient conditions on the update matrix U_k in order for the sequence {p_0, ..., p_{k−1}, p_k} to satisfy the conditions of Proposition 3.1.

Proposition 3.3. Let B_0 = I, so that p_0 = p_0^CG = −g_0. Given {p_0, ..., p_{k−1}} that satisfies the conditions of Proposition 3.1, if p_k is obtained from (6), B_k is obtained from (10), and U_k satisfies

(i) u_j^(k) ∈ K_{k+1}(p_0, H) for all j = 1, ..., m_k,
(ii) U_k p_i = 0, i ≤ k−2, and,
(iii) U_k p_{k−1} = (H − B_{k−1}) p_{k−1},

then {p_0, ..., p_{k−1}, p_k} will satisfy the conditions of Proposition 3.1.

Proof. The proof is identical to the one of Lemma 3.2. If (16) is inserted into (6) one obtains

  p_k = −g_k − Σ_{j=1}^{n_{k−1}} γ_j v_j v_j^T p_k − Σ_{j=1}^{m_k} η_j^(k) u_j^(k) (u_j^(k))^T p_k = −g_k + Σ_{j=1}^{n_{k−1}} γ̂_j v_j + Σ_{j=1}^{m_k} η̂_j^(k) u_j^(k).

Since g_k ∈ K_{k+1}(p_0, H), v_j ∈ K_k(p_0, H) ⊆ K_{k+1}(p_0, H) for all j = 1, ..., n_{k−1}, and, by assumption, u_j^(k) ∈ K_{k+1}(p_0, H) for all j = 1, ..., m_k, it holds that p_k ∈ K_{k+1}(p_0, H). Hence, p_k satisfies condition (i) of Proposition 3.1. Conditions (ii) and (iii) are merely a reformulation of (7) with ρ_i = 1 for all i, and this is a sufficient condition for p_k to be conjugate to {p_0, ..., p_{k−1}} with respect to H, i.e. p_k satisfies condition (ii) of Proposition 3.1.

If in QN, for each k = 0, ..., n_CG − 1, the matrix U_k satisfies the conditions of Proposition 3.3, then one will obtain a sequence {p_0, ..., p_{n_CG−1}} that satisfies the conditions of Proposition 3.1. Hence, with exact line-search, the same iterates would be obtained as for CG on (QP).

Next, we consider a modification of Proposition 3.3 where condition (ii) is rewritten in terms of the vectors {u_j^(k)} that compose U_k, as

  Σ_{j=1}^{m_k} η_j^(k) u_j^(k) (u_j^(k))^T p_i = 0,  i ≤ k−2.   (19)

Clearly the above holds if, for all j = 1, ..., m_k,

  (u_j^(k))^T p_i = 0,  i ≤ k−2.   (20)

A necessary and sufficient condition for (20) to hold is that u_j^(k) ⊥ K_{k−1}(p_0, H) for all j, since, by Lemma 2.2, span{p_0, ..., p_{k−2}} = K_{k−1}(p_0, H). Note that (20) is a potentially stronger condition on the vectors {u_j^(k)} than (19). Hence, given B_i that satisfies Lemma 3.2 for i = 0, ..., k−1, i.e. {p_0, ..., p_{k−1}} satisfies the conditions of Proposition 3.1, the next proposition potentially gives stronger sufficient conditions on U_k than Proposition 3.3 in order for the sequence {p_0, ..., p_{k−1}, p_k} to satisfy the conditions of Proposition 3.1.

Proposition 3.4. Let B_0 = I, so that p_0 = p_0^CG = −g_0. Given {p_0, ..., p_{k−1}} that satisfies the conditions of Proposition 3.1, if p_k is obtained from (6), B_k is obtained from (10), and U_k defined by (16) satisfies

(i) u_j^(k) ∈ K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H) for all j = 1, ..., m_k, and,
(ii) Σ_{j=1}^{m_k} η_j^(k) u_j^(k) (u_j^(k))^T p_{k−1} = (H − B_{k−1}) p_{k−1},

then {p_0, ..., p_{k−1}, p_k} will satisfy the conditions of Proposition 3.1.

If in QN, for each k = 0, ..., n_CG − 1, the matrix U_k satisfies the conditions of Proposition 3.4, then one will obtain a sequence {p_0, ..., p_{n_CG−1}} that satisfies the conditions of Proposition 3.1. Hence, with exact line-search, the same iterates would be obtained as for CG on (QP). All U_k that satisfy Proposition 3.4 will also satisfy Proposition 3.3.

Next we state a result which will be needed in our further investigation, and which holds for any update scheme with approximation matrix B_k that generates search directions that satisfy the conditions of Proposition 3.1. Hence, it holds for any update scheme defined by Proposition 3.3 or Proposition 3.4.

Proposition 3.5. If B_i for i = 0, ..., k is such that {p_0, ..., p_k} satisfy the conditions of Proposition 3.1, then p_k^T B_k p_k ≠ 0, unless B_k p_k = −g_k = 0.

Proof. Assume that p_k^T B_k p_k = 0, but B_k p_k = −g_k ≠ 0. Since p_k satisfies the conditions of Proposition 3.1 and {g_i}_{i=0}^{k} form an orthogonal basis for K_{k+1}(p_0, H), one may write

  p_k = Σ_{i=0}^{k} β_i g_i = Σ_{i=0}^{k−1} β_i g_i + β_k g_k.

Introduce this expression in p_k^T B_k p_k = −p_k^T g_k = 0:

  0 = p_k^T g_k = (Σ_{i=0}^{k−1} β_i g_i + β_k g_k)^T g_k = β_k g_k^T g_k  ⟹  β_k = 0,

where the last equality follows from the gradients being mutually orthogonal. However, since span{p_0, ..., p_k} = span{g_0, ..., g_k} = K_{k+1}(p_0, H), one can only get β_k = 0 if K_{k+1}(p_0, H) = K_k(p_0, H), i.e. x_k is the optimal solution and g_k = 0, which is a contradiction to our assumption. This implies that one does not get p_k^T B_k p_k = 0 unless B_k p_k = 0.

Hence, on (QP), p_k^T B_k p_k = −p_k^T g_k ≠ 0, unless g_k = 0, for update schemes that generate search directions which satisfy the conditions of Proposition 3.1. Note that this implies that (3) is satisfied for any QN method using an update scheme that satisfies Proposition 3.3 or Proposition 3.4.

We will now investigate the updates defined by Proposition 3.3 and Proposition 3.4 separately, starting with Proposition 3.4 as it gives an intuitive picture of how the vectors {u_j^(k)} have to be chosen in a two-dimensional space.

3.1 Update matrices defined by Proposition 3.4

As, for each k, {g_0, ..., g_k} form an orthogonal basis for K_{k+1}(p_0, H), it holds that span{g_{k−1}, g_k} = K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H). Hence, {u_j^(k)} satisfying condition (i) of Proposition 3.4 is equivalent to expressing {u_j^(k)} as a linear combination of g_{k−1} and g_k.

Due to the linear relationships

  g_{k−1} = −B_{k−1} p_{k−1},   g_k = α_{k−1} H p_{k−1} + g_{k−1},

it follows that

  span{g_{k−1}, g_k} = span{B_{k−1} p_{k−1}, H p_{k−1}} = K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H).

This implies that {u_j^(k)} satisfying condition (i) of Proposition 3.4 is equivalent to expressing {u_j^(k)} as a linear combination of B_{k−1} p_{k−1} and H p_{k−1}. Note that B_{k−1} p_{k−1} and H p_{k−1} are linearly independent if and only if g_{k−1} and g_k are linearly independent, and that this holds until the optimal solution has been found, i.e. only if g_k = 0 does it hold that α_{k−1} H p_{k−1} = B_{k−1} p_{k−1}.

So far, we have made no assumptions on the rank of U_k. By condition (i) of Proposition 3.4 the vectors {u_j^(k)} must be chosen in the 2-dimensional space K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H). Hence, a general construction of U_k is given by

  U_k = η_1^(k) u_1^(k) (u_1^(k))^T + η_2^(k) u_2^(k) (u_2^(k))^T.   (21)

For this U_k, rank(U_k) ≤ 2, where equality holds as long as u_1^(k) and u_2^(k) are linearly independent and η_j^(k) ≠ 0 for j = 1, 2. Hence, we can conclude that the conditions of Proposition 3.4 will define updates of at most rank two.

In the following theorem, we show that Proposition 3.4 defines the one-parameter Broyden family of updates.

Theorem 3.6. Assume g_k ≠ 0. A matrix U_k satisfies (i)-(ii) of Proposition 3.4 if and only if U_k can be expressed according to (11), the one-parameter Broyden family, for some φ.

Proof. We want to express U_k on a general form that satisfies the conditions of Proposition 3.4, and may do so in the following setting. As g_k ≠ 0, the vectors {B_{k−1} p_{k−1}, H p_{k−1}} form a basis for K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H); one may scale them and obtain the following basis,

  span{ H p_{k−1} / (p_{k−1}^T H p_{k−1}),  B_{k−1} p_{k−1} / (p_{k−1}^T B_{k−1} p_{k−1}) } = K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H).

Note that the division by p_{k−1}^T B_{k−1} p_{k−1} in the scaling is well-defined according to Proposition 3.5, since our assumption is that we have used B_i, i ≤ k−1, that generate search directions which satisfy the conditions of Proposition 3.1. For now we will drop the sub-indices on p, U and B. Then one may express (21) as

  U = ( Hp/(p^T Hp)   Bp/(p^T Bp) ) M ( Hp/(p^T Hp)   Bp/(p^T Bp) )^T,   (22)

where M is any symmetric two-by-two matrix. Inserting this expression into condition (ii), U p = (H − B)p, and noting that ( Hp/(p^T Hp)  Bp/(p^T Bp) )^T p = (1, 1)^T while (H − B)p = ( Hp/(p^T Hp)  Bp/(p^T Bp) ) ( p^T Hp, −p^T Bp )^T, yields

  ( Hp/(p^T Hp)   Bp/(p^T Bp) ) ( M (1, 1)^T − ( p^T Hp, −p^T Bp )^T ) = 0.

This implies that, since the first matrix has linearly independent columns,

  M (1, 1)^T = ( p^T Hp, −p^T Bp )^T.

One may express M as

  M = ( p^T Hp   0 ;  0   −p^T Bp ) + T,   (23)

where T is any symmetric matrix such that T (1, 1)^T = (0, 0)^T. Hence, T has one eigenvalue which is zero, and we may conclude that rank(T) ≤ 1. In addition, the eigenvector corresponding to the nonzero eigenvalue is a multiple of (1, −1)^T. Therefore, T may be written as

  T = ϕ (1, −1)^T (1, −1) = ( ϕ   −ϕ ;  −ϕ   ϕ ),   (24)

where ϕ is a free parameter. One may scale ϕ as ϕ := φ p^T Bp, again using the fact that, by Proposition 3.5, it holds that p^T Bp ≠ 0. Combining (22), (23) and (24), U may be expressed as

  U = Hp p^T H / (p^T Hp) − Bp p^T B / (p^T Bp) + φ (p^T Bp) w w^T,

with

  w = Hp / (p^T Hp) − Bp / (p^T Bp).

In effect, U describes the one-parameter Broyden family defined in (11).

The one-parameter Broyden family of updates completely describes the updates defined by Proposition 3.4.
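As a numerical illustration of this equivalence, the following sketch runs a QN method with B_0 = I, exact linesearch and a fixed Broyden parameter φ ≥ 0, and compares its iterates with those of CG on a random positive-definite quadratic. It reuses the hypothetical broyden_update and cg_quadratic sketches given earlier and is not code from the report; by the results above, the two sequences of iterates should agree up to rounding.

import numpy as np

def qn_broyden_quadratic(H, c, x0, phi=0.0, tol=1e-12):
    """QN iteration: p_k from (6) with B_0 = I, exact linesearch (2),
    and B_k = B_{k-1} + U_k with U_k from the Broyden family (11).
    Illustrative sketch only."""
    n = len(x0)
    x, B = np.asarray(x0, dtype=float).copy(), np.eye(n)
    iterates = [x.copy()]
    for _ in range(n):
        g = H @ x + c
        if np.linalg.norm(g) <= tol:
            break
        p = np.linalg.solve(B, -g)           # equation (6)
        alpha = -(p @ g) / (p @ (H @ p))     # exact steplength (2)
        x = x + alpha * p
        B = broyden_update(B, H, p, phi)     # update for the next iteration
        iterates.append(x.copy())
    return iterates

# Compare with CG on a random positive-definite quadratic.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A @ A.T + 6.0 * np.eye(6)
c = rng.standard_normal(6)
x_cg = cg_quadratic(H, c, np.zeros(6))
x_qn = qn_broyden_quadratic(H, c, np.zeros(6), phi=0.25)
max_dev = max(np.linalg.norm(a - b) for a, b in zip(x_cg, x_qn))
print(f"max deviation between CG and QN iterates: {max_dev:.2e}")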

It can be shown that there is a unique value of the Broyden parameter φ for which the update matrix U_k, defined by (11), is of rank one. This value is

  φ = p^T Hp / (p^T (H − B) p),

and the corresponding update is called the symmetric rank-one update, SR1, see, e.g., [13, Chapter 8]. It is well known that the SR1 update is uniquely defined by condition (ii) of Proposition 3.4. This result is summarized in the following lemma; for a proof see, e.g., [11, Chapter 9].

Lemma 3.7. Let U be such that rank(U) = 1, i.e. U = η u u^T. If U satisfies condition (ii) of Proposition 3.4, i.e. η u u^T p = (H − B)p, then U is uniquely determined as

  U = η u u^T = (1 / (p^T (H − B) p)) (H − B)p ( (H − B)p )^T.   (25)

In addition, it holds that u = ±(H − B)p satisfies condition (i) of Proposition 3.4, u ∈ K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H).

We may conclude that for rank(U_k) = 1, condition (ii) of Proposition 3.4 is by itself sufficient to obtain p_k that satisfies Proposition 3.1. For rank(U_k) = 2, both conditions of Proposition 3.4 are needed in order to define the update matrices.

We will now discuss under which circumstances the one-parameter Broyden family is not well-defined. For the SR1 update of (25) we state, in the following lemma, the necessary and sufficient conditions for when this update is not well-defined.

Lemma 3.8. The SR1 update defined by (25) is not well-defined if and only if α_{k−1} = 1 and g_k ≠ 0.

Proof. SR1 is undefined if and only if ( (H − B_{k−1}) p_{k−1} )^T p_{k−1} = 0 and (H − B_{k−1}) p_{k−1} ≠ 0. Since

  p_{k−1}^T H p_{k−1} − p_{k−1}^T B_{k−1} p_{k−1} = 0  ⟺  1 = p_{k−1}^T B_{k−1} p_{k−1} / (p_{k−1}^T H p_{k−1}) = −p_{k−1}^T g_{k−1} / (p_{k−1}^T H p_{k−1}) = α_{k−1},

and, for α_{k−1} = 1,

  0 ≠ (H − B_{k−1}) p_{k−1} = (g_k − g_{k−1}) / α_{k−1} + g_{k−1} = g_k,

the statement of the lemma holds.

If iteration k−1 gives α_{k−1} = 1 and g_k ≠ 0, i.e., x_k is not the optimal solution, then U_k will be undefined if the SR1 update (25) is used to update B_{k−1}. Hence, taking the unit steplength in an iteration is an indication that one needs to choose a different update scheme in that iteration. Note that the undefinedness for SR1 enters in the Broyden parameter

  φ_SR1 := p^T Hp / (p^T (H − B) p).

For all well-defined values of φ, it holds that the one-parameter Broyden family given by (11) is not well-defined if and only if p^T Bp = 0 and Bp ≠ 0. It is clear that requiring B to be positive definite is sufficient to avoid U being undefined. However, from Proposition 3.5 it follows that, on (QP), this undefinedness does not occur for any update scheme that generates search directions that satisfy the conditions of Proposition 3.1.
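A concrete instance of Lemma 3.8 may be helpful (an illustration, not from the report; the data below are chosen only so that the unit steplength occurs in the first iteration). Since B_{k−1} p_{k−1} = −g_{k−1}, one has p_{k−1}^T (H − B_{k−1}) p_{k−1} = (1 − α_{k−1}) p_{k−1}^T H p_{k−1}, so the SR1 denominator vanishes exactly when α_{k−1} = 1.

import numpy as np

# Data chosen so that alpha_0 = 1 while x_1 is not optimal (illustration only).
H = np.diag([0.5, 1.5])
c = np.array([1.0, 1.0])
x0 = np.zeros(2)
B0 = np.eye(2)                                 # B_0 = I, hence p_0 = -g_0
g0 = H @ x0 + c
p0 = -g0
alpha0 = -(p0 @ g0) / (p0 @ (H @ p0))          # equals 1 for this data
sr1_denominator = p0 @ ((H - B0) @ p0)         # = (1 - alpha0) * p_0^T H p_0 = 0
g1 = H @ (x0 + alpha0 * p0) + c                # nonzero, so x_1 is not optimal
print(alpha0, sr1_denominator, g1)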

Hence, based on Theorem 3.6, Proposition 3.5 and Lemma 3.8, we may pose the following corollary, giving necessary and sufficient conditions for when the one-parameter Broyden family is well-defined on (QP).

Corollary 3.9. Unless φ = φ_SR1 in (11) and the unit steplength is taken in the same iteration, U_k defined by (11), the one-parameter Broyden family, is always well-defined on (QP).

Our results hold for the entire one-parameter Broyden family, and it is well known that (10) does not have the property of hereditary positive definiteness for all updates in this family, e.g. SR1. We may therefore conclude that hereditary positive definiteness is not a necessary property of an update scheme in order for it to produce search directions that satisfy Proposition 3.1. Also, as was mentioned above, positive definiteness of B_k is not a necessary property to guarantee that the updates are well-defined on (QP).

3.2 Update matrices defined by Proposition 3.3

We now turn to Proposition 3.3 to determine if the seemingly weaker conditions

(i) u_j^(k) ∈ K_{k+1}(p_0, H) for all j = 1, ..., m_k,
(ii) U_k p_i = 0, i ≤ k−2,
(iii) U_k p_{k−1} = (H − B_{k−1}) p_{k−1},

yield a larger class of updates than the conditions of Proposition 3.4. It turns out that this is not the case, and in the following theorem we show that Proposition 3.3 also defines the one-parameter Broyden family of updates.

Theorem 3.10. Assume g_k ≠ 0. A matrix U_k satisfies (i)-(iii) of Proposition 3.3 if and only if U_k can be expressed according to (11), the one-parameter Broyden family, for some φ.

Proof. As span{g_0, ..., g_k} = K_{k+1}(p_0, H), a general form for U_k that satisfies condition (i) of Proposition 3.3 would be U_k = G M G^T, where G = ( g_0 ⋯ g_k ) ∈ R^{n×(k+1)} and M ∈ R^{(k+1)×(k+1)} is any symmetric matrix. Then conditions (ii) and (iii) may be written as

(ii) G M G^T p_i = 0, i ≤ k−2,
(iii) G M G^T p_{k−1} = (H − B_{k−1}) p_{k−1} = (1 − 1/α_{k−1}) g_{k−1} + (1/α_{k−1}) g_k.

Let

  P = ( p_0 ⋯ p_{k−1} ) ∈ R^{n×k},
  α = ( 0, ..., 0, 1 − 1/α_{k−1}, 1/α_{k−1} )^T ∈ R^{k+1},
  ᾱ = ( 0 ⋯ 0 α ) ∈ R^{(k+1)×k},

then conditions (ii) and (iii) may be written as

  G M G^T P = G ᾱ  ⟺  G ( M G^T P − ᾱ ) = 0.

Since G is a matrix whose columns are linearly independent, it holds that

  M G^T P − ᾱ = 0  ⟺  M G^T P = ᾱ.   (26)

Note that the matrix G^T P, whose entries are g_i^T p_j, is upper triangular with its last row equal to zero, since g_i^T p_j = 0 for j < i, and that ᾱ has only two non-zero components. Hence M in (26) has to be of the form in which the only non-zero entries are in the trailing two-by-two block, with

  m_{k−1,k−1} = (1 − 1/α_{k−1}) / (g_{k−1}^T p_{k−1}),   m_{k,k−1} = m_{k−1,k} = 1 / (α_{k−1} g_{k−1}^T p_{k−1}),   m_{k,k} = ϕ,

where ϕ is a free parameter. Hence, one may write a general U_k that satisfies the conditions of Proposition 3.3 as

  U_k = ( g_{k−1}  g_k ) ( (1 − 1/α_{k−1}) / (g_{k−1}^T p_{k−1})   1 / (α_{k−1} g_{k−1}^T p_{k−1}) ;  1 / (α_{k−1} g_{k−1}^T p_{k−1})   ϕ ) ( g_{k−1}  g_k )^T.   (27)

Recalling that span{g_{k−1}, g_k} = K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H), it turns out that the conditions of Proposition 3.3 restrict the choice of {u_j^(k)} to the subspace K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H). Hence, it follows that the updates defined by Proposition 3.3 satisfy Proposition 3.4, and the statement of the theorem follows from Theorem 3.6.

In effect, the sufficient conditions on U_k of the two Propositions 3.3 and 3.4 are equivalent, and they define the one-parameter Broyden family of updates according to Theorems 3.6 and 3.10. From these results we may draw the conclusion that, in the framework where we use the sufficient condition (7) to guarantee conjugacy of the search directions, the update schemes in QN that give the same sequence of iterates as CG are completely described by the one-parameter Broyden family.

4 Conclusions

According to our analysis, if one in a QN method chooses B_k to satisfy the conditions of Lemma 3.2, or equivalently U_k to satisfy the conditions of Proposition 3.3 or 3.4, then the generated search directions will satisfy the conditions of Proposition 3.1. Hence, with exact line-search, the same iterates would be obtained as for CG on (QP). As is shown in Theorems 3.6 and 3.10, the one-parameter Broyden family of updates is defined by these results.

In the derivation it becomes clear that it is the fact that one chooses the components {u_j^(k)} of the update matrix U_k in the subspace K_{k−1}(p_0, H)^⊥ ∩ K_{k+1}(p_0, H), i.e. the two last added dimensions of the Krylov subspace, that guarantees that the generated p_k will be such that {p_0, ..., p_k} satisfy the conditions of Proposition 3.1. What is often mentioned as a feature of the one-parameter Broyden family, using information only from the current and previous iteration, is in fact the condition that guarantees this behavior. Also, the fact that the update matrices in (11) are of low rank is a consequence of satisfying this condition. This is what distinguishes the one-parameter Broyden family of updates from any B_k satisfying (7).

We believe that this extended knowledge on the connection between QN using the one-parameter Broyden family and CG is valuable, as it gives a deeper understanding of the impact of the choices made in different update schemes of QN. In our framework, where the derivation of conjugate directions is based on (7), the one-parameter Broyden family completely describes the updates which will render QN to produce identical iterates as CG on (QP). It seems as if it is the sufficient requirement to get conjugate directions, B_k p_i = H p_i, i ≤ k−1, that limits the freedom when choosing B_k to the one-parameter Broyden family of updates, as shown by the above results. Since this condition is only sufficient, one may still pose the question whether there are other update schemes that yield the same sequence of iterates as CG on (QP). We believe that in order to understand this possible limitation it will be necessary to take a closer look at what really happens in CG.

We are also able to make the following remarks regarding when the one-parameter Broyden family is not well-defined. The only time when these updates are not well-defined on (QP) is when the steplength is of unit length and in the same iteration the rank-one update is used. This means that one can employ the rank-one update scheme when solving (QP) and only switch to a scheme with some rank-two update from the one-parameter Broyden family (or a rank-one update scheme defined for some ρ_i ≠ 1 in (7)) when the unit steplength is taken.

Furthermore, it is clear from our results that the concept of hereditary positive definiteness, or just hereditary definiteness, for (10) is not relevant when solving (QP). However, on problems with general convex functions this would be a sufficient condition to make sure that B_k does not become undefined for rank two. In fact, the necessary and sufficient condition that needs to be satisfied to avoid undefinedness

for U_k of rank two is p^T Bp = −p^T g ≠ 0. This condition is, as mentioned, identical to (3), the condition that guarantees the descent property in exact linesearch. On (QP), this condition will always be fulfilled, according to Proposition 3.5, for any update scheme that generates search directions which satisfy the conditions of Proposition 3.1.

In this paper we have focused on quadratic programming. Besides being important in its own right, it is also highly important as a subproblem when solving unconstrained optimization problems. For a survey on methods for unconstrained optimization see, e.g., [14]. A further motivation for this research is that the deeper understanding of what is important in the choice of U_k could be implemented in a limited-memory QN method. I.e., can one choose which columns to save based on some other criteria than just picking the most recent ones? See, e.g., [13, Chapter 9], for an introduction to limited-memory QN methods.

References

[1] Broyden, C. G. Quasi-Newton methods and their application to function minimisation. Math. Comp. 21 (1967).

[2] Buckley, A. G. A combined conjugate-gradient quasi-Newton minimization algorithm. Math. Programming 15, 2 (1978).

[3] Davidon, W. C. Variable metric method for minimization. SIAM J. Optim. 1, 1 (1991), 1-17.

[4] Dixon, L. C. W. Quasi-Newton algorithms generate identical points. Math. Programming 2 (1972).

[5] Fletcher, R., and Powell, M. J. D. A rapidly convergent descent method for minimization. Comput. J. 6 (1963/1964).

[6] Fletcher, R., and Reeves, C. M. Function minimization by conjugate gradients. Comput. J. 7 (1964).

[7] Gill, P. E., Murray, W., and Wright, M. H. Practical Optimization. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], London, 1981.

[8] Gutknecht, M. H. A brief introduction to Krylov space methods for solving linear systems. In Frontiers of Computational Science (2007), Y. Kaneda, H. Kawamura, and M. Sasai, Eds., Springer-Verlag, Berlin. International Symposium on Frontiers of Computational Science, Nagoya Univ., Nagoya, Japan.

[9] Hestenes, M. R., and Stiefel, E. Methods of conjugate gradients for solving linear systems. J. Research Nat. Bur. Standards 49 (1952), (1953).

[10] Huang, H. Y. Unified approach to quadratically convergent algorithms for function minimization. J. Optimization Theory Appl. 5 (1970).

[11] Luenberger, D. G. Linear and Nonlinear Programming, second ed. Addison-Wesley Publishing Co., Boston, MA, 1984.

[12] Nazareth, L. A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms. SIAM J. Numer. Anal. 16, 5 (1979).

[13] Nocedal, J., and Wright, S. J. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, New York, 1999.

[14] Powell, M. J. D. Recent advances in unconstrained optimization. Math. Programming 1 (1971).

[15] Saad, Y. Iterative Methods for Sparse Linear Systems, second ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2003.

[16] Shewchuk, J. R. An introduction to the conjugate gradient method without the agonizing pain. Tech. rep., Pittsburgh, PA, USA, 1994.


More information

Numerical Optimization of Partial Differential Equations

Numerical Optimization of Partial Differential Equations Numerical Optimization of Partial Differential Equations Part I: basic optimization concepts in R n Bartosz Protas Department of Mathematics & Statistics McMaster University, Hamilton, Ontario, Canada

More information

BFGS WITH UPDATE SKIPPING AND VARYING MEMORY. July 9, 1996

BFGS WITH UPDATE SKIPPING AND VARYING MEMORY. July 9, 1996 BFGS WITH UPDATE SKIPPING AND VARYING MEMORY TAMARA GIBSON y, DIANNE P. O'LEARY z, AND LARRY NAZARETH x July 9, 1996 Abstract. We give conditions under which limited-memory quasi-newton methods with exact

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Jason E. Hicken Aerospace Design Lab Department of Aeronautics & Astronautics Stanford University 14 July 2011 Lecture Objectives describe when CG can be used to solve Ax

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Optimization II: Unconstrained

More information

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23 Optimization: Nonlinear Optimization without Constraints Nonlinear Optimization without Constraints 1 / 23 Nonlinear optimization without constraints Unconstrained minimization min x f(x) where f(x) is

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

Seminal papers in nonlinear optimization

Seminal papers in nonlinear optimization Seminal papers in nonlinear optimization Nick Gould, CSED, RAL, Chilton, OX11 0QX, England (n.gould@rl.ac.uk) December 7, 2006 The following papers are classics in the field. Although many of them cover

More information

First Published on: 11 October 2006 To link to this article: DOI: / URL:

First Published on: 11 October 2006 To link to this article: DOI: / URL: his article was downloaded by:[universitetsbiblioteet i Bergen] [Universitetsbiblioteet i Bergen] On: 12 March 2007 Access Details: [subscription number 768372013] Publisher: aylor & Francis Informa Ltd

More information

A NOTE ON Q-ORDER OF CONVERGENCE

A NOTE ON Q-ORDER OF CONVERGENCE BIT 0006-3835/01/4102-0422 $16.00 2001, Vol. 41, No. 2, pp. 422 429 c Swets & Zeitlinger A NOTE ON Q-ORDER OF CONVERGENCE L. O. JAY Department of Mathematics, The University of Iowa, 14 MacLean Hall Iowa

More information

Conjugate Gradient algorithm. Storage: fixed, independent of number of steps.

Conjugate Gradient algorithm. Storage: fixed, independent of number of steps. Conjugate Gradient algorithm Need: A symmetric positive definite; Cost: 1 matrix-vector product per step; Storage: fixed, independent of number of steps. The CG method minimizes the A norm of the error,

More information

New hybrid conjugate gradient methods with the generalized Wolfe line search

New hybrid conjugate gradient methods with the generalized Wolfe line search Xu and Kong SpringerPlus (016)5:881 DOI 10.1186/s40064-016-5-9 METHODOLOGY New hybrid conjugate gradient methods with the generalized Wolfe line search Open Access Xiao Xu * and Fan yu Kong *Correspondence:

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 3: Iterative Methods PD

More information

On the acceleration of augmented Lagrangian method for linearly constrained optimization

On the acceleration of augmented Lagrangian method for linearly constrained optimization On the acceleration of augmented Lagrangian method for linearly constrained optimization Bingsheng He and Xiaoming Yuan October, 2 Abstract. The classical augmented Lagrangian method (ALM plays a fundamental

More information

Course Notes: Week 4

Course Notes: Week 4 Course Notes: Week 4 Math 270C: Applied Numerical Linear Algebra 1 Lecture 9: Steepest Descent (4/18/11) The connection with Lanczos iteration and the CG was not originally known. CG was originally derived

More information

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods Quasi-Newton Methods General form of quasi-newton methods: x k+1 = x k α

More information

Conjugate Directions for Stochastic Gradient Descent

Conjugate Directions for Stochastic Gradient Descent Conjugate Directions for Stochastic Gradient Descent Nicol N Schraudolph Thore Graepel Institute of Computational Science ETH Zürich, Switzerland {schraudo,graepel}@infethzch Abstract The method of conjugate

More information

The conjugate gradient method

The conjugate gradient method The conjugate gradient method Michael S. Floater November 1, 2011 These notes try to provide motivation and an explanation of the CG method. 1 The method of conjugate directions We want to solve the linear

More information

Lecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm

Lecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm CS 622 Data-Sparse Matrix Computations September 19, 217 Lecture 9: Krylov Subspace Methods Lecturer: Anil Damle Scribes: David Eriksson, Marc Aurele Gilles, Ariah Klages-Mundt, Sophia Novitzky 1 Introduction

More information

Mathematical optimization

Mathematical optimization Optimization Mathematical optimization Determine the best solutions to certain mathematically defined problems that are under constrained determine optimality criteria determine the convergence of the

More information

ENSIEEHT-IRIT, 2, rue Camichel, Toulouse (France) LMS SAMTECH, A Siemens Business,15-16, Lower Park Row, BS1 5BN Bristol (UK)

ENSIEEHT-IRIT, 2, rue Camichel, Toulouse (France) LMS SAMTECH, A Siemens Business,15-16, Lower Park Row, BS1 5BN Bristol (UK) Quasi-Newton updates with weighted secant equations by. Gratton, V. Malmedy and Ph. L. oint Report NAXY-09-203 6 October 203 0.5 0 0.5 0.5 0 0.5 ENIEEH-IRI, 2, rue Camichel, 3000 oulouse France LM AMECH,

More information

Comparative study of Optimization methods for Unconstrained Multivariable Nonlinear Programming Problems

Comparative study of Optimization methods for Unconstrained Multivariable Nonlinear Programming Problems International Journal of Scientific and Research Publications, Volume 3, Issue 10, October 013 1 ISSN 50-3153 Comparative study of Optimization methods for Unconstrained Multivariable Nonlinear Programming

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

A SUFFICIENTLY EXACT INEXACT NEWTON STEP BASED ON REUSING MATRIX INFORMATION

A SUFFICIENTLY EXACT INEXACT NEWTON STEP BASED ON REUSING MATRIX INFORMATION A SUFFICIENTLY EXACT INEXACT NEWTON STEP BASED ON REUSING MATRIX INFORMATION Anders FORSGREN Technical Report TRITA-MAT-2009-OS7 Department of Mathematics Royal Institute of Technology November 2009 Abstract

More information

NonlinearOptimization

NonlinearOptimization 1/35 NonlinearOptimization Pavel Kordík Department of Computer Systems Faculty of Information Technology Czech Technical University in Prague Jiří Kašpar, Pavel Tvrdík, 2011 Unconstrained nonlinear optimization,

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

New Hybrid Conjugate Gradient Method as a Convex Combination of FR and PRP Methods

New Hybrid Conjugate Gradient Method as a Convex Combination of FR and PRP Methods Filomat 3:11 (216), 383 31 DOI 1.2298/FIL161183D Published by Faculty of Sciences and Mathematics, University of Niš, Serbia Available at: http://www.pmf.ni.ac.rs/filomat New Hybrid Conjugate Gradient

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2018/19 Part 4: Iterative Methods PD

More information

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH Jie Sun 1 Department of Decision Sciences National University of Singapore, Republic of Singapore Jiapu Zhang 2 Department of Mathematics

More information

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell DAMTP 2014/NA02 On fast trust region methods for quadratic models with linear constraints M.J.D. Powell Abstract: Quadratic models Q k (x), x R n, of the objective function F (x), x R n, are used by many

More information

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University

More information

A projected Hessian for full waveform inversion

A projected Hessian for full waveform inversion CWP-679 A projected Hessian for full waveform inversion Yong Ma & Dave Hale Center for Wave Phenomena, Colorado School of Mines, Golden, CO 80401, USA (c) Figure 1. Update directions for one iteration

More information

Chapter 8 Gradient Methods

Chapter 8 Gradient Methods Chapter 8 Gradient Methods An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Introduction Recall that a level set of a function is the set of points satisfying for some constant. Thus, a point

More information

Notes on Numerical Optimization

Notes on Numerical Optimization Notes on Numerical Optimization University of Chicago, 2014 Viva Patel October 18, 2014 1 Contents Contents 2 List of Algorithms 4 I Fundamentals of Optimization 5 1 Overview of Numerical Optimization

More information

Lec10p1, ORF363/COS323

Lec10p1, ORF363/COS323 Lec10 Page 1 Lec10p1, ORF363/COS323 This lecture: Conjugate direction methods Conjugate directions Conjugate Gram-Schmidt The conjugate gradient (CG) algorithm Solving linear systems Leontief input-output

More information

Arc Search Algorithms

Arc Search Algorithms Arc Search Algorithms Nick Henderson and Walter Murray Stanford University Institute for Computational and Mathematical Engineering November 10, 2011 Unconstrained Optimization minimize x D F (x) where

More information

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Fran E. Curtis and Wei Guo Department of Industrial and Systems Engineering, Lehigh University, USA COR@L Technical Report 14T-011-R1

More information