UNIVERSITY OF CALIFORNIA, SAN DIEGO

Reduced Hessian Quasi-Newton Methods for Optimization

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Mathematics

by

Michael Wallace Leonard

Committee in charge:

    Professor Philip E. Gill, Chair
    Professor Randolph E. Bank
    Professor James R. Bunch
    Professor Scott B. Baden
    Professor Pao C. Chau

1995

Copyright © 1995 Michael Wallace Leonard. All rights reserved.

The dissertation of Michael Wallace Leonard is approved, and it is acceptable in quality and form for publication on microfilm:

    Professor Philip E. Gill, Chair

University of California, San Diego

1995

This dissertation is dedicated to my mother and father.

Contents

Signature Page
Dedication
Table of Contents
List of Tables
Preface
Acknowledgements
Curriculum Vita
Abstract

1 Introduction to Unconstrained Optimization
    1.1 Newton's method
    1.2 Quasi-Newton methods
        1.2.1 Minimizing strictly convex quadratic functions
        1.2.2 Minimizing convex objective functions
    1.3 Computation of the search direction
        1.3.1 Notation
        1.3.2 Using Cholesky factors
        1.3.3 Using conjugate-direction matrices
    1.4 Transformed and reduced Hessians

2 Reduced-Hessian Methods for Unconstrained Optimization
    2.1 Fenelon's reduced-Hessian BFGS method
        2.1.1 The Gram-Schmidt process
        2.1.2 The BFGS update to R_Z
    2.2 Reduced inverse Hessian methods
    2.3 An extension of Fenelon's method
    2.4 The effective approximate Hessian
    2.5 Lingering on a subspace
        2.5.1 Updating Z when p = p_r
        2.5.2 Calculating s_Z and y_Z^ɛ
        2.5.3 The form of R_Z when using the BFGS update
        2.5.4 Updating R_Z after the computation of p
        2.5.5 The Broyden update to R_Z

        2.5.6 A reduced-Hessian algorithm with lingering

3 Rescaling Reduced Hessians
    Self-scaling variable metric methods
    Rescaling conjugate-direction matrices
        Definition of p
        Rescaling V
        The conjugate-direction rescaling algorithm
        Convergence properties
    Extending Algorithm RH
        Reinitializing the approximate curvature
        Numerical results
    Rescaling combined with lingering
        Numerical results
        Algorithm RHRL applied to a quadratic

4 Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling
    A search-direction basis for range(V_1)
    A transformed Hessian associated with B
    How rescaling V affects Ū^T B Ū
    The proof of equivalence

5 Reduced-Hessian Methods for Large-Scale Unconstrained Optimization
    Large-scale quasi-Newton methods
    Extending Algorithm RH to large problems
        Imposing a storage limit
        The deletion procedure
        The computation of T
        The updates to ḡ_Z and R_Z
    Gradient-based reduced-Hessian algorithms
        Quadratic termination
        Replacing g with p
        Numerical results
        Algorithm RHR-L-P applied to quadratics

6 Reduced-Hessian Methods for Linearly-Constrained Problems
    Linearly constrained optimization
    A dynamic null-space method for LEP
    Numerical results

Bibliography

List of Tables

2.1  Alternate methods for computing Z
     Alternate values for σ
     Test Problems from Moré et al.
     Results for Algorithm RHR using R1, R4 and R
     Results for Algorithm RHRL on problems
     Results for Algorithm RHRL on problems
     Comparing p from CG and Algorithm RH-L-G on quadratics
     Iterations/Functions for RHR-L-G (m = 5)
     Iterations/Functions for RHR-L-P (m = 5)
     Results for RHR-L-P using R3–R5 (m = 5) on Set #
     Results for RHR-L-P using R3–R5 (m = 5) on Set #
     RHR-L-P using different m with R
     RHR-L-P (R4) for m ranging from 2 to n
     Results for RHR-L-P and L-BFGS-B (m = 5) on Set #
     Results for RHR-L-P and L-BFGS-B (m = 5) on Set #
     Results for LEPs (m_L = 5, δ = 10^{−10}, ‖N^T g‖ ≤ 10^{−6})
     Results for LEPs (m_L = 8, δ = 10^{−10}, ‖N^T g‖ ≤ 10^{−6})

Preface

This thesis consists of seven chapters and a bibliography. Each chapter starts with a review of the literature and proceeds to new material developed by the author under the direction of the Chair of the dissertation committee. All lemmas, theorems, corollaries and algorithms are those of the author unless otherwise stated.

Problems from all areas of science and engineering can be posed as optimization problems. An optimization problem involves a set of independent variables, and often includes constraints or restrictions that define acceptable values of the variables. The solution of an optimization problem is a set of allowed values of the variables for which some objective function achieves its maximum or minimum value. The class of model-based methods forms quadratic approximations of optimization problems using first and sometimes second derivatives of the objective and constraint functions.

If no constraints are present, an optimization problem is said to be unconstrained. The formulation of effective methods for the unconstrained case is the first step towards defining methods for constrained optimization. The unconstrained optimization problem is considered in Chapters 1–5. Methods for problems with linear equality constraints are considered in Chapter 6.

Chapter 1 opens with a discussion of Newton's method for unconstrained optimization. Newton's method is a model-based method that requires both first and second derivatives. In Section 1.2 we move on to quasi-Newton methods, which are intended for the situation when the provision of analytic second derivatives is inconvenient or impossible. Quasi-Newton methods use only first derivatives to build up an approximate Hessian over a number of iterations.

At each iteration of a quasi-Newton method, the approximate Hessian is altered to incorporate new curvature information. This process, which is known as an update, involves the addition of a low-rank matrix (usually of rank one or rank two). This thesis will be concerned with a class of rank-two updates known as the Broyden class. The most important member of this class is the so-called Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula.

In Chapter 2 we consider quasi-Newton methods from a completely different point of view. Quasi-Newton methods that employ updates from the Broyden class are known to accumulate approximate curvature in a sequence of expanding subspaces. It follows that the search direction can be defined using matrices of smaller dimension than the approximate Hessian. In exact arithmetic these so-called reduced Hessians generate the same iterates as the standard quasi-Newton methods. This result is the basis for all of the new algorithms defined in this thesis. Reduced-Hessian and reduced inverse Hessian methods are considered in Sections 2.1 and 2.2 respectively. In Section 2.3 we propose Algorithm RH, which is the template algorithm for this thesis. In Section 2.5 this algorithm is generalized to include a lingering scheme (Algorithm RHL) that allows the iterates to be restricted to certain low-dimensional manifolds.

In practice, the choice of initial approximate Hessian can greatly influence the performance of quasi-Newton methods. In the absence of exact second-derivative information, the approximate Hessian is often initialized to the identity matrix. Several authors have observed that a poor choice of initial approximate Hessian can lead to inefficiencies, especially if the Hessian itself is ill-conditioned. These inefficiencies can lead to a large number of function evaluations in some cases.

Rescaling techniques are intended to address this difficulty and are the subject of Chapter 3. The rescaling methods of Oren and Luenberger [39], Siegel [45] and Lalee and Nocedal [27] are discussed. In particular, the conjugate-direction rescaling method of Siegel (Algorithm CDR), which is also a variant of the BFGS method, is described in some detail. Algorithm CDR (page 48) has been shown to be effective in solving ill-conditioned problems. Algorithm CDR has notable similarities to reduced-Hessian methods, and two new rescaling algorithms follow naturally from the interpretation of Algorithm CDR as a reduced-Hessian method. These algorithms are derived in Sections 3.3 and 3.4. The first (Algorithm RHR) is a modification of Algorithm RH; the second (Algorithm RHRL) is derived from Algorithm RHL. Numerical results are given for both algorithms. Moreover, under certain conditions Algorithm RHRL is shown to converge in a finite number of iterations when applied to a class of quadratic problems. This property, often termed quadratic termination, can be numerically beneficial for quasi-Newton methods.

In Chapter 4, it is shown that if Algorithm RHRL is used in conjunction with a particular rescaling technique of Siegel [45], then it is equivalent to Algorithm CDR in exact arithmetic. Chapter 4 is mostly technical in nature and may be skipped without loss of continuity. However, the convergence results given in Section 4.4 should be reviewed before passing to Chapter 5.

If the problem has many independent variables, it may not be practical to store the Hessian matrix or an approximate Hessian. In Chapter 5, methods for solving large unconstrained problems are reviewed. Conjugate-gradient (CG) methods require storage for only a few vectors and can be used in the large-scale case. However, CG methods can require a large number of iterations relative to the problem size and can be prohibitively expensive in terms of function evaluations.

In an effort to accelerate CG methods, several authors have proposed limited-memory and reduced-Hessian quasi-Newton methods. The limited-memory algorithm of Nocedal [35], the successive affine reduction method of Nazareth [34], the reduced-Hessian method of Fenelon [14] and reduced inverse-Hessian methods due to Siegel [46] are reviewed.

In Chapter 5, new reduced-Hessian rescaling algorithms are derived as extensions of Algorithms RH and RHR. These algorithms (Algorithms RHR-L-G and RHR-L-P) employ the rescaling method of Algorithm RHR. Algorithm RHR-L-P shares features of the methods of Fenelon, Nazareth and Siegel. However, the inclusion of rescaling is demonstrated numerically to be essential for efficiency. Moreover, Algorithm RHR-L-P is shown to enjoy the property of quadratic termination, which is shown to be beneficial when the algorithm is applied to general functions.

Chapter 6 considers the minimization of a function subject to linear equality constraints. Two algorithms (Algorithms RH-LEP and RHR-LEP) extend reduced-Hessian methods to problems with linear constraints. Numerical results are given comparing Algorithm RHR-LEP with a standard method for solving linearly constrained problems.

In summary, a total of seven new reduced-Hessian algorithms are proposed.

Algorithm RH (p. 28)  The algorithm template.
Algorithm RHL (p. 41)  Uses a lingering scheme that constrains the iterates to remain on a manifold.
Algorithm RHR (p. 52)  Rescales when approximate curvature is obtained in a new subspace.

Algorithm RHRL (p. 56)  Exploits the special form of the reduced Hessian resulting from the lingering strategy. This special form allows rescaling on larger subspaces.
Algorithm RHR-L-G (p. 95)  A gradient-based method with rescaling for large-scale optimization.
Algorithm RHR-L-P (p. 95)  A direction-based method with rescaling for large-scale optimization. This algorithm converges in a finite number of iterations when applied to a quadratic function.
Algorithm RHR-LEP (p. 123)  A reduced-Hessian rescaling method for linear equality-constrained problems.

Acknowledgements

I am pleased to acknowledge my advisor, Professor Philip E. Gill. I became interested in doing research while I was a student in the Master of Arts program, but writing a dissertation seemed an unlikely task. However, Professor Gill thought that I had the right stuff. He has helped me hurdle many obstacles, not the least of which was transferring into the Ph.D. program. He introduced me to a very interesting and rewarding problem in numerical optimization. He also supported me as a Research Assistant for several summers and during my last quarter as a graduate student.

I would like to express my gratitude to Professors James R. Bunch, Randolph E. Bank, Scott B. Baden and Pao C. Chao, all of whom served on my thesis committee. My thanks also to Professors Maria E. Ong and Donald R. Smith from whom I learned much in my capacity as a teaching assistant. My special thanks to Professor Carl H. Fitzgerald. His training inspired in me a much deeper appreciation of mathematics and is the basis of my technical knowledge.

My family has always prompted me towards further education. I want to thank my mother and father, my stepmother Maggie and my brother Clif for their encouragement and support while I have been a graduate student. I also want to express my appreciation to all of my friends who have been supportive while I worked on this thesis. My climbing friends Scott Marshall, Michael Smith, Fred Weening and Jeff Gee listened to my ranting and raving and always encouraged me. My friends in the department, Jerome Braunstein, Scott Crass, Sam Eldersveld, Ricardo Fierro, Richard LeBorne, Ned Lucia, Joe Shinnerl, Mark Stankus, Tuan Nguyen and others were all inspirational, informative and helpful.

Vita

1982    Appointed U.C. Regents Scholar. University of California, Santa Barbara
1985    B.S., Mathematical Sciences, Highest Honors. University of California, Santa Barbara
1985    B.S., Mechanical Engineering, Highest Honors. University of California, Santa Barbara
        Associate Engineering Scientist. McDonnell-Douglas Astronautics Corporation
        High School Mathematics Teacher. Vista Unified School District
1988    Mathematics Single Subject Teaching Credential. University of California, San Diego
1991    M.A., Applied Mathematics. University of California, San Diego
        Adjunct Mathematics Instructor. Mesa Community College
        Teaching Assistant. Department of Mathematics, University of California, San Diego
1993    C.Phil., Mathematics. University of California, San Diego
1995    Research Assistant. Department of Mathematics, University of California, San Diego
1995    Ph.D., Mathematics. University of California, San Diego

Major Fields of Study

Major Field: Mathematics

Studies in Numerical Optimization. Professor Philip E. Gill
Studies in Numerical Analysis. Professors Randolph E. Bank, James R. Bunch, Philip E. Gill and Donald R. Smith
Studies in Complex Analysis. Professor Carl H. Fitzgerald
Studies in Applied Algebra. Professors Jeffrey B. Remmel and Adriano M. Garsia

Abstract of the Dissertation

Reduced Hessian Quasi-Newton Methods for Optimization

by

Michael Wallace Leonard

Doctor of Philosophy in Mathematics

University of California, San Diego, 1995

Professor Philip E. Gill, Chair

Many methods for optimization are variants of Newton's method, which requires the specification of the Hessian matrix of second derivatives. Quasi-Newton methods are intended for the situation where the Hessian is expensive or difficult to calculate. Quasi-Newton methods use only first derivatives to build an approximate Hessian over a number of iterations. This approximation is updated each iteration by a matrix of low rank. This thesis is concerned with the Broyden class of updates, with emphasis on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update.

Updates from the Broyden class accumulate approximate curvature in a sequence of expanding subspaces. This allows the approximate Hessians to be represented in compact form using smaller reduced approximate Hessians. These reduced matrices offer computational advantages when the objective function is highly nonlinear or the number of variables is large.

Although the initial approximate Hessian is arbitrary, some choices may cause quasi-Newton methods to fail on highly nonlinear functions. In this case, rescaling can be used to decrease inefficiencies resulting from a poor initial approximate Hessian. Reduced-Hessian methods facilitate a trivial rescaling that implicitly changes the initial curvature as iterations proceed. Methods of this type are shown to have global and superlinear convergence.

Moreover, numerical results indicate that this rescaling is effective in practice.

In the large-scale case, so-called limited-storage reduced-Hessian methods offer advantages over conjugate-gradient methods, with only slightly increased memory requirements. We propose two limited-storage methods that utilize rescaling, one of which can be shown to terminate on quadratics. Numerical results suggest that the method is effective compared with other state-of-the-art limited-storage methods.

Finally, we extend reduced-Hessian methods to problems with linear equality constraints. These methods are the first step towards reduced-Hessian methods for the important class of nonlinearly constrained problems.

Chapter 1

Introduction to Unconstrained Optimization

Problems from all areas of science and engineering can be posed as optimization problems. An optimization problem involves a set of independent variables, and often includes constraints or restrictions that define acceptable values of the variables. The solution of an optimization problem is a set of allowed values of the variables for which some objective function achieves its maximum or minimum value. The class of model-based methods forms quadratic approximations of optimization problems using first and sometimes second derivatives of the objective and constraint functions.

Consider the unconstrained optimization problem

    minimize_{x ∈ IR^n}  f(x),    (1.1)

where f : IR^n → IR is twice-continuously differentiable. Since maximizing f can be achieved by minimizing −f, it suffices to consider only minimization. When no constraints are present, the problem of minimizing f is often called unconstrained optimization. When linear constraints are present, the minimization problem is called linearly-constrained optimization. The unconstrained optimization problem is introduced in the next section.

Linearly constrained optimization is introduced in Chapter 6. Nonlinearly constrained optimization is not considered. However, much of the work given here applies to solving subproblems that might arise in the course of solving nonlinearly constrained problems.

1.1 Newton's method

A local minimizer x* of (1.1) satisfies f(x*) ≤ f(x) for all x in some open neighborhood of x*. The necessary optimality conditions at x* are ∇f(x*) = 0 and ∇²f(x*) ≥ 0, where ∇²f(x*) ≥ 0 means that the Hessian of f at x* is positive semi-definite. Sufficient conditions for a point x* to be a local minimizer are ∇f(x*) = 0 and ∇²f(x*) > 0, where ∇²f(x*) > 0 means that the Hessian of f at x* is positive definite. Since ∇f(x*) = 0, many methods for solving (1.1) attempt to drive the gradient to zero. The methods considered here are iterative and generate search directions by minimizing quadratic approximations to f. In what follows, let x_k denote the kth iterate and p_k the kth search direction.

Newton's method for solving (1.1) minimizes a quadratic model of f each iteration. The function q_k^N(x) given by

    q_k^N(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + ½ (x − x_k)^T ∇²f(x_k) (x − x_k),    (1.2)

is a second-order Taylor-series approximation to f at the point x_k. If ∇²f(x_k) > 0, then q_k^N(x) has a unique minimizer, corresponding to the point at which ∇q_k^N(x) vanishes.

This point is taken as the new estimate x_{k+1} of x*. If the substitution p = x − x_k is made in (1.2), then the resulting quadratic model

    q_k^N(p) = f(x_k) + ∇f(x_k)^T p + ½ p^T ∇²f(x_k) p    (1.3)

can be minimized with respect to p for a search direction p_k. If ∇²f(x_k) > 0, then the vector p_k such that

    ∇q_k^N(p_k) = ∇²f(x_k) p_k + ∇f(x_k) = 0

minimizes q_k^N(p). The new iterate is defined as x_{k+1} = x_k + p_k. This leads to the definition of Newton's method given below.

Algorithm 1.1. Newton's method
    Initialize k = 0 and choose x_0.
    while not converged do
        Solve ∇²f(x_k) p = −∇f(x_k) for p_k.
        x_{k+1} = x_k + p_k.
        k ← k + 1
    end do

We now summarize the convergence properties of Newton's method. It is important to note that the method seeks points at which the gradient vanishes and has no particular affinity for minimizers. In the following theorem we will let x̄ denote a point such that ∇f(x̄) = 0.

Theorem 1.1 Let f : IR^n → IR be a twice-continuously differentiable mapping defined in an open set D, and assume that ∇f(x̄) = 0 for some x̄ ∈ D and that ∇²f(x̄) is nonsingular. Then there is an open set S such that for any x_0 ∈ S the Newton iterates are well defined, remain in S, and converge to x̄.

Proof. See Moré and Sorenson [30].
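To make the iteration concrete, the following Python sketch implements the Newton iteration of Algorithm 1.1 under the assumption that callables returning the gradient and Hessian are available. The convergence test, tolerances and the small quadratic test problem are illustrative placeholders, not part of the thesis.

    import numpy as np

    def newton(grad, hess, x0, tol=1e-8, max_iter=100):
        # Pure Newton iteration (Algorithm 1.1): unit steps, no line search.
        # grad(x) and hess(x) are caller-supplied; tol stands in for the
        # unspecified "not converged" test.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= tol:
                break
            p = np.linalg.solve(hess(x), -g)   # solve grad^2 f(x_k) p = -grad f(x_k)
            x = x + p                          # x_{k+1} = x_k + p_k
        return x

    # On a strictly convex quadratic, a single Newton step reaches the minimizer.
    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    c = np.array([1.0, 1.0])
    x_min = newton(lambda x: H @ x + c, lambda x: H, x0=np.zeros(2))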

The rate or order of convergence of a sequence of iterates is as important as its convergence. If a sequence {x_k} converges to x̄ and

    ‖x_{k+1} − x̄‖ ≤ C ‖x_k − x̄‖^p    (1.4)

for some positive constant C, then {x_k} is said to converge with order p. The special cases of p = 1 and p = 2 correspond to linear and quadratic convergence respectively. In the case of linear convergence, the constant C must satisfy C ∈ (0, 1). Note that if C is close to 1, linear convergence can be unsatisfactory. For example, if C = 0.9 and ‖x_k − x̄‖ = 0.1, then roughly 21 iterations may be required to attain ‖x_k − x̄‖ = 0.01.

A sequence {x_k} that converges to x̄ and satisfies

    ‖x_{k+1} − x̄‖ ≤ β_k ‖x_k − x̄‖,

for some sequence {β_k} that converges to zero, is said to converge superlinearly. Note that a sequence that converges superlinearly also converges linearly. Moreover, a sequence that converges quadratically converges superlinearly. In this sense, superlinear convergence can be considered a middle ground between linear and quadratic convergence.

We now state order of convergence results for Newton's method (for proofs of these results, see Moré and Sorenson [30]). If f satisfies the conditions of Theorem 1.1, the iterates converge to x̄ superlinearly. Moreover, if the Hessian is Lipschitz continuous at x̄, i.e.,

    ‖∇²f(x) − ∇²f(x̄)‖ ≤ κ ‖x − x̄‖    (κ > 0),    (1.5)

then {x_k} converges quadratically. These asymptotic rates of convergence of Newton's method are the benchmark for all other methods that use only first and second derivatives of f.

Note that since x* satisfies ∇f(x*) = 0, these results hold also for minimizers.

If x_0 is far from x*, Newton's method can have several deficiencies. Consider first when ∇²f(x_k) is positive definite. In this case, p_k is a descent direction satisfying ∇f(x_k)^T p_k < 0. However, since the quadratic model q_k^N is only a local approximation of f, it is possible that f(x_k + p_k) > f(x_k). This problem is alleviated by redefining x_{k+1} = x_k + α_k p_k, where α_k is a positive step length. If p_k^T ∇f(x_k) < 0, then the existence of ᾱ > 0 such that α_k ∈ (0, ᾱ) implies f(x_{k+1}) < f(x_k) is guaranteed (see Fletcher [15]). The specific value of α_k is computed using a line search algorithm that approximately minimizes the univariate function f(x_k + αp_k). As a result of the line search, the iterates satisfy f(x_{k+1}) < f(x_k) for all k, which is the defining property associated with all descent methods. This thesis is concerned mainly with descent methods that use a line search.

Another problem with Algorithm 1.1 arises when ∇²f(x_k) is indefinite or singular. In this case, p_k may be undefined, non-uniquely defined, or a non-descent direction. This drawback has been successfully overcome by both modified Newton methods and trust-region methods. Modified Newton methods replace ∇²f(x_k) with a positive-definite approximation whenever the former is indefinite or singular (see Gill et al. [22] for details). Trust-region methods minimize the quadratic model (1.3) in some small region surrounding x_k (see Moré and Sorenson [13] for further details).

Any Newton method requires the definition of O(n²) second derivatives associated with the Hessian. In some cases, for example when f is the solution to a differential or integral equation, it may be inconvenient or expensive to define the Hessian.
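As a simple illustration of the line-search idea discussed above, the sketch below implements Armijo backtracking, a basic scheme that shrinks the step until a sufficient-decrease condition holds. It is not the line search used in the thesis, which imposes the Wolfe conditions introduced in the next section; the function names and constants are illustrative only.

    import numpy as np

    def backtracking_step(f, grad, x, p, alpha0=1.0, nu=1e-4, shrink=0.5):
        # Shrink alpha until f(x + alpha*p) <= f(x) + nu*alpha*grad(x)^T p.
        # p must be a descent direction, i.e. grad(x)^T p < 0.
        fx, gTp = f(x), grad(x) @ p
        assert gTp < 0, "p is not a descent direction"
        alpha = alpha0
        while f(x + alpha * p) > fx + nu * alpha * gTp:
            alpha *= shrink
        return alpha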

In the next section, quasi-Newton methods are introduced that solve the unconstrained problem (1.1) using only gradient information.

1.2 Quasi-Newton methods

The idea of approximating the Hessian with a symmetric positive-definite matrix was first introduced in Davidon's 1959 paper, Variable metric methods for minimization [9]. If B_k denotes an approximate Hessian, then the quadratic model q_k^N is replaced by

    q_k(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + ½ (x − x_k)^T B_k (x − x_k).    (1.6)

In this case, p_k is the solution of the subproblem

    minimize_{p ∈ IR^n}  f(x_k) + ∇f(x_k)^T p + ½ p^T B_k p.    (1.7)

Since B_k is positive definite, p_k satisfies

    B_k p_k = −∇f(x_k)    (1.8)

and p_k is guaranteed to be a descent direction. Approximate second-derivative information obtained in moving from x_k to x_{k+1} is incorporated into B_{k+1} using an update to B_k. Hence, a general quasi-Newton method takes the form given in Algorithm 1.2 below.

Algorithm 1.2. Quasi-Newton method
    Initialize k = 0; choose x_0 and B_0;
    while not converged do
        Solve B_k p_k = −∇f(x_k);
        Compute α_k, and set x_{k+1} = x_k + α_k p_k;

        Compute B_{k+1} by applying an update to B_k;
        k ← k + 1;
    end do

It remains to discuss the form of the update to B_k and the choice of α_k. Define s_k = x_{k+1} − x_k, g_k = ∇f(x_k) and y_k = g_{k+1} − g_k. The definition of x_{k+1} implies that s_k satisfies

    s_k = α_k p_k.    (1.9)

This relationship will be used throughout this thesis.

The curvature of f along s_k at a point x_k is defined as s_k^T ∇²f(x_k) s_k. The gradient of f can be expanded about x_k to give

    g_{k+1} = ∇f(x_k + s_k) = g_k + ( ∫_0^1 ∇²f(x_k + ξ s_k) dξ ) s_k.

It follows from the definition of y_k that

    s_k^T ∇²f(x_k) s_k ≈ s_k^T y_k.    (1.10)

The quantity s_k^T y_k is called the approximate curvature of f at x_k along s_k. Next, we present a class of low-rank changes to B_k that ensure

    s_k^T B_{k+1} s_k = s_k^T y_k,    (1.11)

so that B_{k+1} incorporates the correct approximate curvature. The well-known Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula defined by

    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(s_k^T y_k)    (1.12)

is easily shown to satisfy (1.11). An implementation of Algorithm 1.2 using the BFGS update will be called a BFGS method.
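For reference, here is a minimal NumPy sketch of the BFGS formula (1.12), followed by a quick check that the updated matrix satisfies the curvature condition (1.11). The data and names are illustrative only.

    import numpy as np

    def bfgs_update(B, s, y):
        # B_{k+1} = B - (B s s^T B)/(s^T B s) + (y y^T)/(s^T y); see (1.12).
        Bs = B @ s
        return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

    # The updated matrix reproduces the approximate curvature, as in (1.11).
    B = np.eye(3)
    s, y = np.array([1.0, 0.0, 2.0]), np.array([0.5, 0.1, 1.0])
    assert np.isclose(s @ bfgs_update(B, s, y) @ s, s @ y)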

The Davidon-Fletcher-Powell (DFP) formula is defined by

    B_{k+1} = B_k + (1 + (s_k^T B_k s_k)/(s_k^T y_k)) (y_k y_k^T)/(s_k^T y_k) − (y_k s_k^T B_k + B_k s_k y_k^T)/(s_k^T y_k).    (1.13)

An implementation of Algorithm 1.2 using the DFP update will be called a DFP method.

The approximate Hessians of the so-called Broyden class are defined by the formulae

    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(s_k^T y_k) + φ_k (s_k^T B_k s_k) w_k w_k^T,    (1.14)

where

    w_k = y_k/(s_k^T y_k) − (B_k s_k)/(s_k^T B_k s_k),

and φ_k is a scalar parameter. Note that the BFGS and DFP formulae correspond to the choices φ_k = 0 and φ_k = 1. The convex class of updates is a subclass of the Broyden updates for which φ_k ∈ [0, 1] for all k. The updates from the convex class satisfy (1.11) since they are all elements of the Broyden class.

Several results follow immediately from the definition of the updates in the Broyden class. First, formulae in the Broyden class apply at most rank-two updates to B_k. Second, updates in the Broyden class are such that B_{k+1} is symmetric as long as B_k is symmetric. Third, if B_k is positive definite and φ_k is properly chosen (e.g., any φ_k ≥ 0 is acceptable (see Fletcher [16])), then B_{k+1} is positive definite if and only if s_k^T y_k > 0.

In unconstrained optimization, the value of α_k can ensure that s_k^T y_k > 0. In particular, s_k^T y_k is positive if α_k satisfies the Wolfe [48] conditions

    f(x_k + α_k p_k) ≤ f(x_k) + ν α_k g_k^T p_k   and   g_{k+1}^T p_k ≥ η g_k^T p_k,    (1.15)

where 0 < ν < 1/2 and ν ≤ η < 1. The existence of such α_k is guaranteed if, for example, f is bounded below.
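The Broyden-class formula (1.14) is easy to express directly in code. The sketch below, written for this summary rather than taken from the thesis, gives the update as a function of the parameter φ_k and checks that every member of the class satisfies the curvature condition (1.11), which follows because w_k^T s_k = 0.

    import numpy as np

    def broyden_update(B, s, y, phi):
        # Broyden-class update (1.14): phi = 0 gives BFGS, phi = 1 gives DFP.
        # Assumes s^T y > 0 so that the update is well defined.
        Bs = B @ s
        sBs, sy = s @ Bs, s @ y
        w = y / sy - Bs / sBs
        return B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / sy + phi * sBs * np.outer(w, w)

    B = np.eye(2)
    s, y = np.array([1.0, 2.0]), np.array([0.3, 1.0])
    for phi in (0.0, 0.5, 1.0):
        assert np.isclose(s @ broyden_update(B, s, y, phi) @ s, s @ y)   # (1.11)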

In a practical line search, it is often convenient to require α_k to satisfy the modified Wolfe conditions

    f(x_k + α_k p_k) ≤ f(x_k) + ν α_k g_k^T p_k   and   |g_{k+1}^T p_k| ≤ η |g_k^T p_k|.    (1.16)

The existence of an α_k satisfying these conditions can also be guaranteed theoretically. (See Fletcher [15] for the existence results and further details.)

For theoretical discussion, α_k is sometimes considered to be an exact minimizer of the univariate function Ψ(α) defined by Ψ(α) = f(x_k + αp_k). This choice ensures a positive-definite update since, for such an α_k, g_{k+1}^T p_k = 0, which implies s_k^T y_k > 0. Properties of Algorithm 1.2 when it is applied to a convex quadratic objective function using such an exact line search are given in the next section.

1.2.1 Minimizing strictly convex quadratic functions

Consider the quadratic function

    q(x) = d + c^T x + ½ x^T H x,   where c ∈ IR^n, d ∈ IR, H ∈ IR^{n×n},    (1.17)

and H is symmetric positive definite and independent of x. This quadratic has a unique minimizer x* that satisfies Hx* = −c. If Algorithm 1.2 is used with an exact line search and an update from the Broyden class, then the following properties hold at the kth (0 < k ≤ n) iteration:

    B_k s_i = H s_i,         (1.18)
    s_i^T H s_k = 0,  and    (1.19)
    s_i^T g_k = 0,           (1.20)

for all i < k. Multiplying (1.18) by s_i^T gives s_i^T B_k s_i = s_i^T H s_i, which implies that the curvature of the quadratic model (1.6) along s_i (i < k) is exact.

Define S_k = ( s_0  s_1  ⋯  s_{k−1} ) and assume that s_i ≠ 0 (0 ≤ i ≤ n−1). Under this assumption, note that (1.19) implies that the set {s_i : i ≤ n−1} is linearly independent. At the start of the nth iteration, (1.18) implies that B_n S_n = H S_n, and B_n = H since S_n is nonsingular. It can be shown that x_k minimizes q(x) on the manifold defined by x_0 and range(S_k) (see Fletcher [15]). It follows that x_n minimizes q(x). This implies that Algorithm 1.2 with an exact line search finds the minimizer of the quadratic (1.17) in at most n steps, a property often referred to as quadratic termination.

Further properties of Algorithm 1.2 follow from its well-known equivalence to the conjugate-gradient method when used to minimize convex quadratic functions using an exact line search. If B_0 = I and the updates are from the Broyden class, then for all k ≥ 1 and 0 ≤ i < k,

    g_i^T g_k = 0   and    (1.21)
    p_k = −g_k + β_{k−1} p_{k−1},    (1.22)

where β_{k−1} = ‖g_k‖²/‖g_{k−1}‖² (see Fletcher [15, p. 65] for further details).
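The following short Python experiment, written for this summary with arbitrary test data, checks the quadratic-termination property numerically: with B_0 = I, the BFGS update and an exact line search, the iterates of Algorithm 1.2 minimize a strictly convex quadratic in at most n steps and B_n = H.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((n, n))
    H = A @ A.T + n * np.eye(n)                     # symmetric positive-definite H
    c = rng.standard_normal(n)
    grad = lambda x: H @ x + c                      # gradient of q(x) = d + c^T x + 1/2 x^T H x

    x, B = np.zeros(n), np.eye(n)                   # x_0 and B_0 = I
    for k in range(n):
        g = grad(x)
        p = np.linalg.solve(B, -g)                  # B_k p_k = -g_k
        alpha = -(g @ p) / (p @ H @ p)              # exact line search on the quadratic
        s = alpha * p
        y = grad(x + s) - g                         # equals H s_k on a quadratic
        B = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (s @ y)
        x = x + s

    assert np.allclose(B, H)                        # B_n = H
    assert np.allclose(grad(x), 0, atol=1e-8)       # x_n minimizes q(x)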

1.2.2 Minimizing convex objective functions

Much of the convergence theory for quasi-Newton methods involves convex functions. The theory focuses on two properties of the sequence of iterates. First, given an arbitrary starting point x_0, will the sequence of iterates converge to x*? If so, then the method is said to be globally convergent. Second, what is the order of convergence of the sequence of iterates? In the next two sections, we present some of the results from the literature regarding the convergence properties of quasi-Newton methods.

Global convergence of quasi-Newton methods

Consider the application of Algorithm 1.2 to a convex function. Powell has shown that in this case, the BFGS method with a Wolfe line search is globally convergent with lim inf ‖g_k‖ = 0 (see Powell [40]). Byrd, Nocedal and Yuan have extended Powell's result to a quasi-Newton method using any update from the convex class except the DFP update (see Byrd et al. [6]).

Uniformly convex functions are an important subclass of the set of convex functions. The Hessians of these functions satisfy

    m ‖z‖² ≤ z^T ∇²f(x) z ≤ M ‖z‖²,    (1.23)

for all x and z in IR^n. It follows that a function in this class has a unique minimizer x*. Although the DFP method is on the boundary of the convex class, it has not been shown to be globally convergent, even on uniformly convex functions (see Nocedal [36]).

Order of convergence of quasi-Newton methods

The order of convergence of a sequence has been defined in Section 1.1. The method of steepest descent, which sets p_k = −g_k for all k, is known to converge linearly from any starting point (see, for example, Gill et al. [22, p. 103]). This poor rate of convergence occurs because steepest descent uses no second-derivative information (the method implicitly chooses B_k = I for all k). On the other hand, Newton's method can be shown to converge quadratically for x_0 sufficiently close to x* if ∇²f(x*) is nonsingular and satisfies the Lipschitz condition (1.5) at x*. Since quasi-Newton methods use an approximation to the Hessian,

they might be expected to converge at a rate between linear and quadratic. This is indeed the case.

The following order of convergence results apply to the general quasi-Newton method given in Algorithm 1.2. It has been shown that {x_k} converges superlinearly to x* if and only if

    lim_{k→∞} ‖(B_k − ∇²f(x*)) s_k‖ / ‖s_k‖ = 0    (1.24)

(see Dennis and Moré [11]). Hence, the approximate curvature must converge to the curvature in f along the unit directions s_k/‖s_k‖. In a quasi-Newton method using a Wolfe line search, it has been shown that if the search direction approaches the Newton direction asymptotically, the step length α_k = 1 is acceptable for large enough k (see Dennis and Moré [12]).

Suppose now that a quasi-Newton method using updates from the convex class converges to a point x* such that ∇²f(x*) is nonsingular. In this case, if f is convex, Powell has shown that the BFGS method with a Wolfe line search converges superlinearly as long as the unit step length is taken whenever possible (see [40]). This result has been extended to every member of the convex class of Broyden updates except the DFP update (see Byrd et al. [6]). The DFP method has not been shown to be superlinearly convergent when using a Wolfe line search. However, there are convergence results concerning the application of the DFP method using an exact line search (see Nocedal [36] for further discussion).

In Section 1.2.1, it was noted that if Algorithm 1.2 with an exact line search is applied to a strictly convex quadratic function, and the steps s_k (0 ≤ k ≤ n−1) are nonzero, then B_n = H. When applied to general functions, it should be noted that B_k need not converge to ∇²f(x*) even when {x_k} converges to x* (see Dennis and Moré [11]).

The global and superlinear convergence of Algorithm 1.2 when applied to general f using a Wolfe line search remains an open question.

1.3 Computation of the search direction

Various methods for solving the system B_k p_k = −g_k in a practical implementation of Algorithm 1.2 are discussed in this section.

1.3.1 Notation

For simplicity, the subscript k is suppressed in much of what follows. Bars, tildes and cups are used to define updated quantities obtained during the kth iteration. Underlines are sometimes used to denote quantities associated with x_{k−1}. The use of the subscript will be retained in the definition of sets that contain a sequence of quantities belonging to different iterations, e.g., {g_0, g_1, ..., g_k}. Also, for clarity, the use of subscripts will be retained in the statement of results.

Throughout the thesis, I_j denotes the j × j identity matrix, where j satisfies 1 ≤ j < n. The matrix I is reserved for the n × n identity matrix. The vector e_i denotes the ith column of an identity matrix whose order depends on the context. If u ∈ IR^n and v ∈ IR^m, then (u, v)^T denotes the column vector of order n + m whose components are the components of u and v.

1.3.2 Using Cholesky factors

The equations Bp = −g can be solved if an upper-triangular matrix R is known such that B = R^T R. If B̄ is obtained from B using a Broyden update, then an upper-triangular matrix R̄ satisfying B̄ = R̄^T R̄ can be obtained from a rank-one update to R (see Goldfarb [24], Dennis and Schnabel [10]).

In particular, the BFGS update can be written as

    R̄ = S(R + u(w − R^T u)^T),   where u = Rs/‖Rs‖,  w = y/(y^T s)^{1/2},    (1.25)

and S is an orthogonal matrix that transforms R + u(w − R^T u)^T to upper-triangular form. Since many choices of S yield an upper-triangular R̄, we now describe the particular choice used throughout the paper.

The matrix S is of the form S = S_2 S_1, where S_1 and S_2 are products of Givens matrices. The matrix S_1 is defined by S_1 = P_{n,1} ⋯ P_{n,n−2} P_{n,n−1}, where P_{n,j} (1 ≤ j ≤ n−1) is a Givens matrix in the (j, n) plane designed to annihilate the jth element of P_{n,j+1} ⋯ P_{n,n−1} u. The product S_1 R is upper triangular except for the presence of a row spike in the nth row. Since S_1 u = ±e_n, the matrix S_1(R + u(w − R^T u)^T) is also upper triangular except for a row spike in the nth row. This matrix is restored to upper-triangular form using a second product of Givens matrices. In particular, S_2 = P_{n−1,n} P_{n−2,n} ⋯ P_{1,n}, where P_{i,n} (1 ≤ i ≤ n−1) is a Givens matrix in the (i, n) plane defined to annihilate the (n, i) element of P_{i−1,n} ⋯ P_{1,n} S_1(R + u(w − R^T u)^T).

For simplicity, the BFGS update (1.25) and the Broyden update to R will be written

    R̄ = BFGS(R, s, y)   and   R̄ = Broyden(R, s, y).    (1.26)

The form of S will be as described in the last paragraph. Another choice of S that implies S_1(R + u(w − R^T u)^T) is upper Hessenberg is described by Gill, Golub, Murray and Saunders [17]. Goldfarb prefers to write the update as a product of R and a rank-one modification of the identity. This form of the update is also easily restored to upper-triangular form (see Goldfarb [24]).
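To see that (1.25) really reproduces the BFGS update of B, the following sketch forms the rank-one-modified factor and re-triangularizes it with a dense QR factorization. This is only an illustration with arbitrary test data: the thesis restores triangular form with the two sweeps of Givens rotations described above, which costs O(n²) rather than the O(n³) of a full QR factorization.

    import numpy as np

    def bfgs_update_cholesky(R, s, y):
        # Given B = R^T R, return an upper-triangular Rbar with Bbar = Rbar^T Rbar,
        # where Bbar is the BFGS update (1.12) of B.  Any orthogonal S in (1.25)
        # yields the same Bbar, so a QR factorization is used here for brevity.
        Rs = R @ s
        u = Rs / np.linalg.norm(Rs)               # u = Rs/||Rs||
        w = y / np.sqrt(y @ s)                    # w = y/(y^T s)^{1/2}
        _, Rbar = np.linalg.qr(R + np.outer(u, w - R.T @ u))
        return Rbar

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4))
    B = A @ A.T + 4 * np.eye(4)
    R = np.linalg.cholesky(B).T                   # upper-triangular factor, B = R^T R
    s = rng.standard_normal(4)
    y = B @ s + 0.5 * s                           # any y with s^T y > 0
    Rbar = bfgs_update_cholesky(R, s, y)
    Bs = B @ s
    Bbar = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
    assert np.allclose(Rbar.T @ Rbar, Bbar)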

Some authors reserve the term Cholesky factor of a positive definite matrix B to mean the triangular factor with positive diagonals satisfying B = R^T R. However, throughout this thesis, the diagonal components of R are not restricted in sign, but R will be called the Cholesky factor of B.

1.3.3 Using conjugate-direction matrices

Since B is symmetric positive definite, there exists a nonsingular matrix V such that V^T B V = I. The columns of V are said to be conjugate with respect to B. In terms of V, the approximate Hessian satisfies

    B^{−1} = V V^T,    (1.27)

which implies that the solution of (1.7) may be written as

    p = −V V^T g.    (1.28)

If B̄ is defined by the BFGS formula (1.12), then a formula for V̄ satisfying V̄^T B̄ V̄ = I can be obtained from the product form of the BFGS update (see Brodlie, Gourlay, and Greenstadt [3]). The formula is given by

    V̄ = (I − s u^T) V Ω,   where u = (Bs)/((s^T y)^{1/2} (s^T B s)^{1/2}) + y/(s^T y),    (1.29)

and Ω is an orthogonal matrix.

Powell has proposed that Ω be defined as follows. Let Ṽ denote the product V Ω. The matrix Ω is chosen as a lower-Hessenberg matrix such that the first column of Ṽ is parallel to s (see Powell [42]). Let g_V be defined as

    g_V = V^T g,    (1.30)

and define Ω such that Ω^T = P_{12} P_{23} ⋯ P_{n−1,n}, where P_{i,i+1} is a rotation in the (i, i+1) plane chosen to annihilate the (i+1)th component of P_{i+1,i+2} ⋯ P_{n−1,n} g_V.

Then, Ω is an orthogonal lower-Hessenberg matrix such that Ω^T g_V = ‖g_V‖ e_1. Furthermore, (1.28) and the relation s = αp give

    Ṽ e_1 = −(1/(α ‖g_V‖)) s.    (1.31)

Hence, the first column of Ṽ is parallel to s. With this choice of Ω, Powell shows that the columns of V̄ satisfy

    v̄_i = { s/(s^T y)^{1/2},                  if i = 1;
            ṽ_i − ((ṽ_i^T y)/(s^T y)) s,      otherwise.     (1.32)

Note that the matrix B in the update (1.29) has been eliminated in the formulae (1.32). Formulae have also been derived for matrices V̄ that satisfy V̄^T B̄ V̄ = I, where B̄ is any Broyden update to B (see Siegel [47]).

1.4 Transformed and reduced Hessians

Let Q denote an n × n orthogonal matrix and let B denote a positive-definite approximation to ∇²f(x). The matrix Q^T B Q is called the transformed approximate Hessian. If Q is partitioned as Q = ( Z  W ), the transformed Hessian has a corresponding partition

    Q^T B Q = ( Z^T B Z   Z^T B W )
              ( W^T B Z   W^T B W ).

The positive-definite submatrices Z^T B Z and W^T B W are called reduced approximate Hessians. Transformed Hessians are often used in the solution of constrained optimization problems (see, for example, Gill et al. [21]). In the next chapter, a particular choice of Q will be seen to give block-diagonal structure to the approximate Hessians associated with quasi-Newton methods for unconstrained optimization.
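A small NumPy illustration of this partitioning, with arbitrary data and for orientation only: the leading block of the transformed matrix Q^T B Q is exactly the reduced approximate Hessian Z^T B Z.

    import numpy as np

    rng = np.random.default_rng(4)
    n, r = 6, 2
    A = rng.standard_normal((n, n))
    B = A @ A.T + n * np.eye(n)                       # any positive-definite B
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # an orthogonal Q = ( Z  W )
    Z, W = Q[:, :r], Q[:, r:]
    QtBQ = Q.T @ B @ Q
    assert np.allclose(QtBQ[:r, :r], Z.T @ B @ Z)     # leading block = reduced Hessian
    assert np.allclose(QtBQ[r:, r:], W.T @ B @ W)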

This simplification leads to another technique for solving Bp = −g that involves a reduced Hessian. Reduced-Hessian quasi-Newton methods using this technique are the subject of Chapter 2.

Chapter 2

Reduced-Hessian Methods for Unconstrained Optimization

In her dissertation, Fenelon [14] has shown that the BFGS method accumulates approximate curvature information in a sequence of expanding subspaces. This feature is used to show that the BFGS search direction can often be generated with matrices of smaller dimension than the approximate Hessian. Use of these reduced approximate Hessians leads to a variant of the BFGS method that can be used to solve problems whose Hessians may be too large to store.

In this chapter, reduced-Hessian methods are reviewed from Fenelon's point of view. A reduced inverse Hessian method, due to Siegel [46], is reviewed in Section 2.2. Fenelon's and Siegel's work is extended in Sections 2.3–2.5, giving new reduced-Hessian methods that utilize the Broyden class of updates.

2.1 Fenelon's reduced-Hessian BFGS method

Using the equations B_i p_i = −g_i and s_i = α_i p_i for 0 ≤ i ≤ k, the BFGS updates from B_0 to B_k can be telescoped to give

    B_k = B_0 + Σ_{i=0}^{k−1} ( (g_i g_i^T)/(g_i^T p_i) + (y_i y_i^T)/(s_i^T y_i) ).    (2.1)

If B_0 = σI (σ > 0), then (2.1) can be used to show that the solution of B_k p_k = −g_k is given by

    p_k = −(1/σ) g_k − (1/σ) Σ_{i=0}^{k−1} ( ((g_i^T p_k)/(g_i^T p_i)) g_i + ((y_i^T p_k)/(s_i^T y_i)) y_i ).    (2.2)

Hence, if G_k denotes the set of vectors

    G_k = {g_0, g_1, ..., g_k},    (2.3)

then (2.2) implies that p_k ∈ span(G_k). The following lemma summarizes this result.

Lemma 2.1 (Fenelon) If the BFGS method is used to solve the unconstrained minimization problem (1.1) with B_0 = σI (σ > 0), then p_k ∈ span(G_k) for all k.

Using this result, Fenelon has shown that if Z_k is a full-rank matrix such that range(Z_k) = span(G_k), then p_k = Z_k p_Z, where

    p_Z = −(Z_k^T B_k Z_k)^{−1} Z_k^T g_k.    (2.4)

This form of the search direction implies a reduced-Hessian implementation of the BFGS method employing Z_k and an upper-triangular matrix R_Z such that R_Z^T R_Z = Z_k^T B_k Z_k.
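As a quick sanity check of the telescoped form (2.1), the sketch below applies the BFGS formula (1.12) to synthetic data satisfying B_i p_i = −g_i and s_i = α_i p_i and compares the result with the explicit sum. The data are artificial stand-ins for the quantities a BFGS method would generate, not output of the thesis's algorithms.

    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma, k = 5, 1.5, 3
    B = sigma * np.eye(n)                            # B_0 = sigma I
    B_sum = B.copy()                                 # accumulates the sum in (2.1)
    g = rng.standard_normal(n)
    for i in range(k):
        p = np.linalg.solve(B, -g)                   # B_i p_i = -g_i
        alpha = 0.5 + rng.random()                   # any positive step length
        s = alpha * p                                # s_i = alpha_i p_i
        y = s + 0.01 * rng.standard_normal(n)        # any y_i with s_i^T y_i > 0
        assert s @ y > 0
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)       # (1.12)
        B_sum = B_sum + np.outer(g, g) / (g @ p) + np.outer(y, y) / (s @ y)  # terms of (2.1)
        g = rng.standard_normal(n)                   # next gradient
    assert np.allclose(B, B_sum)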

2.1.1 The Gram-Schmidt process

The matrix Z_k is obtained from G_k using the Gram-Schmidt process. This process gives an orthonormal basis for G_k. The choice of orthonormal basis is motivated by the result cond(Z_k^T B_k Z_k) ≤ cond(B_k) if Z_k^T Z_k = I_{r_k} (see Gill et al. [22, p. 162]). To simplify the description of this process we drop the subscript k, as discussed in Section 1.3.1.

At the start of the first iteration, Z is initialized to g_0/‖g_0‖. During the kth iteration, assume that the columns of Z approximate an orthonormal basis for span(G). The matrix Z̄ is defined so that range(Z̄) = span(G ∪ {ḡ}) as follows. The vector ḡ can be uniquely written as ḡ = ḡ_R + ḡ_N, where ḡ_R ∈ range(Z) and ḡ_N ∈ null(Z^T). The vector ḡ_R satisfies ḡ_R = ZZ^T ḡ, which implies that the component of ḡ orthogonal to range(Z) satisfies ḡ_N = ḡ − ZZ^T ḡ = (I − ZZ^T)ḡ. Let z_ḡ denote the normalized component of ḡ orthogonal to range(Z). If we define ρ_ḡ = ‖ḡ_N‖, then z_ḡ = ḡ_N/ρ_ḡ. Note that if ρ_ḡ = 0, then ḡ ∈ range(Z). In this case, we will define Z̄ = Z.

To summarize, if r denotes the column dimension of Z, we define

    r̄ = { r,       if ρ_ḡ = 0;
          r + 1,   otherwise.     (2.5)

Using r̄, z_ḡ and Z̄ satisfy

    z_ḡ = { 0,                      if r̄ = r;
            (1/ρ_ḡ) (I − ZZ^T) ḡ,   otherwise,     (2.6)

and

    Z̄ = { Z,            if r̄ = r;
          ( Z   z_ḡ ),   otherwise.     (2.7)

It is well known that the Gram-Schmidt process is unstable in the presence of computer round-off error (see Golub and Van Loan [25, p. 218]). Several methods have been proposed to stabilize the process. These methods are given in Table 2.1. The advantages and disadvantages of each method are also given in the table. Note that a flop is defined as a multiplication and an addition. The flop counts given in the table are only approximations of the actual counts. The value of 3.2nr flops for the reorthogonalization process is an average that results if 3 reorthogonalizations are performed every 5 iterations.

Table 2.1: Alternate methods for computing Z̄

    Method                                  Advantage             Disadvantage
    Gram-Schmidt                            Simple; 2nr flops     Unstable
    Modified Gram-Schmidt                   More stable than GS   Z must be recomputed each iteration
    Gram-Schmidt with reorthogonalization   Stable                Expensive, e.g., 3.2nr flops
      (Daniel et al. [76], Fenelon [81])
    Implicitly (Siegel [92])                nr + O(r²) flops      Expensive if r is large

Another technique for stabilizing the process, suggested by Daniel et al. [8] (and used by Siegel [46]), is to ignore the component of ḡ orthogonal to range(Z) if it is small (but possibly nonzero) relative to ‖ḡ‖. In this case, the definition of r̄ satisfies

    r̄ = { r,       if ρ_ḡ ≤ ɛ‖ḡ‖;
          r + 1,   otherwise,     (2.8)

where ɛ ≥ 0 is a preassigned constant.

The matrix Z̄ that results when this definition of r̄ is used has properties that depend on the choice of ɛ. If ɛ = 0, then in exact arithmetic the columns of Z̄ form an orthonormal basis for span(G). Moreover, for any ɛ (ɛ ≥ 0), the columns of Z̄ form an orthonormal basis for the span of a subset of G. If K_ɛ = {k_1, k_2, ..., k_r} denotes the set of indices for which ρ_g > ɛ‖g‖ and G_ɛ = ( g_{k_1}  g_{k_2}  ⋯  g_{k_r} ) is the matrix of corresponding gradients, then the columns of Z̄ form an orthonormal basis for range(G_ɛ). Gradients satisfying ρ_g > ɛ‖g‖ are said to be accepted; otherwise, they are said to be rejected. Hence, G_ɛ is the matrix of accepted gradients associated with a particular choice of ɛ. Note that the dimension of Z is nondecreasing with k.

During iteration k + 1, the vector ḡ_Z̄ (ḡ_Z̄ = Z̄^T ḡ) is needed to compute the next search direction p̄. Since

    ḡ_Z̄ = { Z^T ḡ,             if r̄ = r;
            (Z^T ḡ, ρ_ḡ)^T,     otherwise,     (2.9)

this quantity is a by-product of the computation of Z̄. If r̄, ḡ_Z̄ and Z̄ satisfy (2.8), (2.9) and (2.7), then we will write

    (Z̄, ḡ_Z̄, r̄) = GS(Z, ḡ, r, ɛ).    (2.10)
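A compact Python sketch of the acceptance step (2.8)–(2.10) is given below; it either rejects the new gradient or appends its normalized orthogonal component to Z. The interface and names are illustrative, and classical Gram-Schmidt is used here without the reorthogonalization safeguards listed in Table 2.1.

    import numpy as np

    def gs_accept(Z, g_bar, eps):
        # One step of the procedure GS(Z, g_bar, r, eps) of (2.10).
        # Z has orthonormal columns (pass None at the first iteration).
        # Returns (Z_bar, gZ_bar, accepted).
        if Z is None:                                    # Z initialized to g_0/||g_0||
            norm = np.linalg.norm(g_bar)
            return g_bar[:, None] / norm, np.array([norm]), True
        gZ = Z.T @ g_bar                                 # component in range(Z)
        g_perp = g_bar - Z @ gZ                          # (I - Z Z^T) g_bar
        rho = np.linalg.norm(g_perp)
        if rho <= eps * np.linalg.norm(g_bar):           # reject: r_bar = r, see (2.8)
            return Z, gZ, False
        z_new = g_perp / rho                             # z_gbar of (2.6)
        return np.hstack([Z, z_new[:, None]]), np.append(gZ, rho), True   # (2.7), (2.9)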

2.1.2 The BFGS update to R_Z

If Z, g_Z and R_Z are known during the kth iteration of a reduced-Hessian method, then p is computed using (2.4). Following the calculation of x̄ in the line search, ḡ is either rejected or added to the basis defined by Z. It remains to define a matrix R̄_Z̄ satisfying Z̄^T B̄ Z̄ = R̄_Z̄^T R̄_Z̄, where B̄ is obtained from B using the BFGS update.

Let y_Z denote the quantity Z^T y. If ḡ is rejected, Fenelon employs the method of Gill et al. [17] to obtain R̄_Z from R_Z via two rank-one updates involving g_Z and y_Z. If ḡ is accepted, R̄_Z̄ can be partitioned as

    R̄_Z̄ = ( R̄_Z   R_ḡ )
           ( 0     φ   ),

where φ is a scalar. The matrix R̄_Z is obtained from R_Z using g_Z and y_Z. The following lemma is used to define R_ḡ and φ.

Lemma 2.2 (Fenelon) If z_ḡ denotes the normalized component of g_{k+1} orthogonal to span(G_k), then

    Z_k^T B_{k+1} z_ḡ = ((y^T z_ḡ)/(s^T y)) y_Z   and   z_ḡ^T B_{k+1} z_ḡ = σ + (z_ḡ^T y)²/(s_k^T y_k).    (2.11)

(Although the relation z_ḡ^T g = 0 is used in the proof of Lemma 2.2, it was not used to simplify (2.11).) The solution of an upper-triangular system involving R̄_Z and ((y^T z_ḡ)/(s^T y)) y_Z is used to define R_ḡ. The value φ is then obtained from R_ḡ and z_ḡ^T B̄ z_ḡ.

2.2 Reduced inverse Hessian methods

Many quasi-Newton algorithms are defined in terms of the inverse approximate Hessian H_k = B_k^{−1}. The Broyden update to H_k is

    H_{k+1} = M_k H_k M_k^T + (s_k s_k^T)/(s_k^T y_k),   where
    M_k = I − (s_k y_k^T)/(s_k^T y_k) − ψ_k (y_k^T H_k y_k) r_k r_k^T,   and
    r_k = (H_k y_k)/(y_k^T H_k y_k) − s_k/(s_k^T y_k).    (2.12)

The parameter φ_k is related to ψ_k by the equation

    φ_k (ψ_k − 1)(y_k^T H_k y_k)(s_k^T B_k s_k) = ψ_k (φ_k − 1)(s_k^T y_k)².

Note that the values ψ_k = 0 and ψ_k = 1 correspond to the BFGS and the DFP updates respectively.

Siegel [46] gives a more general result than Lemma 2.1 that applies to the entire Broyden class. The result is stated below without proof.

Lemma 2.3 (Siegel) If Algorithm 1.2 is used to solve the unconstrained minimization problem (1.1) with B_0 = σI (σ > 0) and a Broyden update, then p_k ∈ span(G_k) for all k. Moreover, if z ∈ span(G_k) and w ⊥ span(G_k), then B_k z ∈ span(G_k), H_k z ∈ span(G_k), B_k w = σw and H_k w = σ^{−1} w.

Let G_k denote the matrix of the first k + 1 gradients. For simplicity, assume that these gradients are linearly independent and that k is less than n. Since G_k has full column rank, it has a QR factorization of the form

    G_k = Q_k ( T_k )
              ( 0  ),    (2.13)

where Q_k^T Q_k = I and T_k is nonsingular and upper triangular. Define r_k = dim(span(G_k)), and partition Q_k = ( Z_k  W_k ), where Z_k ∈ IR^{n×r_k}. Note that the product G_k = Z_k T_k defines a skinny QR factorization of G_k (see Golub and Van Loan [25, p. 217]). The columns of Z_k form an orthonormal basis for range(G_k) and the columns of W_k form an orthonormal basis for null(G_k^T). If the first k+1 gradients are not linearly independent, Q_k is defined as in (2.13), except that G_k^0 is used in place of G_k. Hence, the first r_k columns of Q_k are still an orthonormal basis for G_k.

Consider the transformed inverse Hessian Q_k^T H_k Q_k. Lemma 2.3 implies that if H_0 = σ^{−1} I, then Q_k^T H_k Q_k is block diagonal and satisfies

    Q_k^T H_k Q_k = ( Z_k^T H_k Z_k         0
                      0            σ^{−1} I_{n−r_k} ).    (2.14)

As the equation for the search direction in terms of H_k satisfies p_k = −H_k g_k, we have Q_k^T p_k = −(Q_k^T H_k Q_k) Q_k^T g_k. It follows that p_k = −Z_k (Z_k^T H_k Z_k) Z_k^T g_k.
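A small numerical experiment, written for this summary with arbitrary test data, illustrates Lemma 2.3 and (2.14): after a few BFGS iterations started from H_0 = σ^{−1} I, the transformed inverse Hessian is block diagonal and the search direction can be recovered from the reduced matrix Z_k^T H_k Z_k alone.

    import numpy as np

    rng = np.random.default_rng(3)
    n, sigma = 6, 2.0
    A = rng.standard_normal((n, n))
    Hess = A @ A.T + n * np.eye(n)                   # true Hessian of a quadratic
    c = rng.standard_normal(n)
    grad = lambda x: Hess @ x + c

    x = np.zeros(n)
    H = np.eye(n) / sigma                            # H_0 = sigma^{-1} I
    grads = [grad(x)]
    for k in range(3):                               # a few BFGS iterations
        g = grads[-1]
        p = -H @ g
        alpha = -(g @ p) / (p @ Hess @ p)            # exact line search on the quadratic
        s = alpha * p
        x = x + s
        y = grad(x) - g
        rho = 1.0 / (s @ y)                          # BFGS update of the inverse Hessian
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
        grads.append(grad(x))

    G = np.column_stack(grads)                       # matrix of the first k+1 gradients
    r = G.shape[1]
    Q, _ = np.linalg.qr(G, mode='complete')
    Z, W = Q[:, :r], Q[:, r:]
    QHQ = Q.T @ H @ Q
    assert np.allclose(QHQ[r:, r:], np.eye(n - r) / sigma)        # lower block of (2.14)
    assert np.allclose(QHQ[:r, r:], 0)                            # off-diagonal blocks vanish
    g = grads[-1]
    assert np.allclose(-H @ g, -Z @ (Z.T @ H @ Z) @ (Z.T @ g))    # reduced search direction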


MATH 4211/6211 Optimization Quasi-Newton Method

MATH 4211/6211 Optimization Quasi-Newton Method MATH 4211/6211 Optimization Quasi-Newton Method Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 Quasi-Newton Method Motivation:

More information

ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS

ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS ALGORITHM XXX: SC-SR1: MATLAB SOFTWARE FOR SOLVING SHAPE-CHANGING L-SR1 TRUST-REGION SUBPROBLEMS JOHANNES BRUST, OLEG BURDAKOV, JENNIFER B. ERWAY, ROUMMEL F. MARCIA, AND YA-XIANG YUAN Abstract. We present

More information

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Improving the Convergence of Back-Propogation Learning with Second Order Methods the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible

More information

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University

More information

HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION. Darin Griffin Mohr. An Abstract

HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION. Darin Griffin Mohr. An Abstract HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION by Darin Griffin Mohr An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of

More information

Optimization Methods

Optimization Methods Optimization Methods Decision making Examples: determining which ingredients and in what quantities to add to a mixture being made so that it will meet specifications on its composition allocating available

More information

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods Quasi-Newton Methods General form of quasi-newton methods: x k+1 = x k α

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Lecture 5, Continuous Optimisation Oxford University Computing Laboratory, HT 2006 Notes by Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The notion of complexity (per iteration)

More information

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

Comparative study of Optimization methods for Unconstrained Multivariable Nonlinear Programming Problems

Comparative study of Optimization methods for Unconstrained Multivariable Nonlinear Programming Problems International Journal of Scientific and Research Publications, Volume 3, Issue 10, October 013 1 ISSN 50-3153 Comparative study of Optimization methods for Unconstrained Multivariable Nonlinear Programming

More information

Math 273a: Optimization Netwon s methods

Math 273a: Optimization Netwon s methods Math 273a: Optimization Netwon s methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 some material taken from Chong-Zak, 4th Ed. Main features of Newton s method Uses both first derivatives

More information

Lecture 18: November Review on Primal-dual interior-poit methods

Lecture 18: November Review on Primal-dual interior-poit methods 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Lecturer: Javier Pena Lecture 18: November 2 Scribes: Scribes: Yizhu Lin, Pan Liu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Justin Solomon CS 205A: Mathematical Methods Optimization II: Unconstrained Multivariable 1

More information

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL)

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL) Part 4: Active-set methods for linearly constrained optimization Nick Gould RAL fx subject to Ax b Part C course on continuoue optimization LINEARLY CONSTRAINED MINIMIZATION fx subject to Ax { } b where

More information

Handling nonpositive curvature in a limited memory steepest descent method

Handling nonpositive curvature in a limited memory steepest descent method IMA Journal of Numerical Analysis (2016) 36, 717 742 doi:10.1093/imanum/drv034 Advance Access publication on July 8, 2015 Handling nonpositive curvature in a limited memory steepest descent method Frank

More information

1. Search Directions In this chapter we again focus on the unconstrained optimization problem. lim sup ν

1. Search Directions In this chapter we again focus on the unconstrained optimization problem. lim sup ν 1 Search Directions In this chapter we again focus on the unconstrained optimization problem P min f(x), x R n where f : R n R is assumed to be twice continuously differentiable, and consider the selection

More information

Quasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization

Quasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)

More information

Statistics 580 Optimization Methods

Statistics 580 Optimization Methods Statistics 580 Optimization Methods Introduction Let fx be a given real-valued function on R p. The general optimization problem is to find an x ɛ R p at which fx attain a maximum or a minimum. It is of

More information

ENSIEEHT-IRIT, 2, rue Camichel, Toulouse (France) LMS SAMTECH, A Siemens Business,15-16, Lower Park Row, BS1 5BN Bristol (UK)

ENSIEEHT-IRIT, 2, rue Camichel, Toulouse (France) LMS SAMTECH, A Siemens Business,15-16, Lower Park Row, BS1 5BN Bristol (UK) Quasi-Newton updates with weighted secant equations by. Gratton, V. Malmedy and Ph. L. oint Report NAXY-09-203 6 October 203 0.5 0 0.5 0.5 0 0.5 ENIEEH-IRI, 2, rue Camichel, 3000 oulouse France LM AMECH,

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

A COMBINED CLASS OF SELF-SCALING AND MODIFIED QUASI-NEWTON METHODS

A COMBINED CLASS OF SELF-SCALING AND MODIFIED QUASI-NEWTON METHODS A COMBINED CLASS OF SELF-SCALING AND MODIFIED QUASI-NEWTON METHODS MEHIDDIN AL-BAALI AND HUMAID KHALFAN Abstract. Techniques for obtaining safely positive definite Hessian approximations with selfscaling

More information

1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Optimization II: Unconstrained

More information

Arc Search Algorithms

Arc Search Algorithms Arc Search Algorithms Nick Henderson and Walter Murray Stanford University Institute for Computational and Mathematical Engineering November 10, 2011 Unconstrained Optimization minimize x D F (x) where

More information

Conjugate Gradient (CG) Method

Conjugate Gradient (CG) Method Conjugate Gradient (CG) Method by K. Ozawa 1 Introduction In the series of this lecture, I will introduce the conjugate gradient method, which solves efficiently large scale sparse linear simultaneous

More information

Optimization 2. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng

Optimization 2. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng Optimization 2 CS5240 Theoretical Foundations in Multimedia Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (NUS) Optimization 2 1 / 38

More information

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems Outline Scientific Computing: An Introductory Survey Chapter 6 Optimization 1 Prof. Michael. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM

NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM NONSMOOTH VARIANTS OF POWELL S BFGS CONVERGENCE THEOREM JIAYI GUO AND A.S. LEWIS Abstract. The popular BFGS quasi-newton minimization algorithm under reasonable conditions converges globally on smooth

More information

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf. Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion:

4 Newton Method. Unconstrained Convex Optimization 21. H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: Unconstrained Convex Optimization 21 4 Newton Method H(x)p = f(x). Newton direction. Why? Recall second-order staylor series expansion: f(x + p) f(x)+p T f(x)+ 1 2 pt H(x)p ˆf(p) In general, ˆf(p) won

More information

EECS260 Optimization Lecture notes

EECS260 Optimization Lecture notes EECS260 Optimization Lecture notes Based on Numerical Optimization (Nocedal & Wright, Springer, 2nd ed., 2006) Miguel Á. Carreira-Perpiñán EECS, University of California, Merced May 2, 2010 1 Introduction

More information

1. Introduction. We analyze a trust region version of Newton s method for the optimization problem

1. Introduction. We analyze a trust region version of Newton s method for the optimization problem SIAM J. OPTIM. Vol. 9, No. 4, pp. 1100 1127 c 1999 Society for Industrial and Applied Mathematics NEWTON S METHOD FOR LARGE BOUND-CONSTRAINED OPTIMIZATION PROBLEMS CHIH-JEN LIN AND JORGE J. MORÉ To John

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Cubic regularization in symmetric rank-1 quasi-newton methods

Cubic regularization in symmetric rank-1 quasi-newton methods Math. Prog. Comp. (2018) 10:457 486 https://doi.org/10.1007/s12532-018-0136-7 FULL LENGTH PAPER Cubic regularization in symmetric rank-1 quasi-newton methods Hande Y. Benson 1 David F. Shanno 2 Received:

More information

The speed of Shor s R-algorithm

The speed of Shor s R-algorithm IMA Journal of Numerical Analysis 2008) 28, 711 720 doi:10.1093/imanum/drn008 Advance Access publication on September 12, 2008 The speed of Shor s R-algorithm J. V. BURKE Department of Mathematics, University

More information

Conjugate-Gradient. Learn about the Conjugate-Gradient Algorithm and its Uses. Descent Algorithms and the Conjugate-Gradient Method. Qx = b.

Conjugate-Gradient. Learn about the Conjugate-Gradient Algorithm and its Uses. Descent Algorithms and the Conjugate-Gradient Method. Qx = b. Lab 1 Conjugate-Gradient Lab Objective: Learn about the Conjugate-Gradient Algorithm and its Uses Descent Algorithms and the Conjugate-Gradient Method There are many possibilities for solving a linear

More information

Lecture 10: September 26

Lecture 10: September 26 0-725: Optimization Fall 202 Lecture 0: September 26 Lecturer: Barnabas Poczos/Ryan Tibshirani Scribes: Yipei Wang, Zhiguang Huo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations Oleg Burdakov a,, Ahmad Kamandi b a Department of Mathematics, Linköping University,

More information

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES IJMMS 25:6 2001) 397 409 PII. S0161171201002290 http://ijmms.hindawi.com Hindawi Publishing Corp. A PROJECTED HESSIAN GAUSS-NEWTON ALGORITHM FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS AND INEQUALITIES

More information

ECE133A Applied Numerical Computing Additional Lecture Notes

ECE133A Applied Numerical Computing Additional Lecture Notes Winter Quarter 2018 ECE133A Applied Numerical Computing Additional Lecture Notes L. Vandenberghe ii Contents 1 LU factorization 1 1.1 Definition................................. 1 1.2 Nonsingular sets

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation Prof. C. F. Jeff Wu ISyE 8813 Section 1 Motivation What is parameter estimation? A modeler proposes a model M(θ) for explaining some observed phenomenon θ are the parameters

More information

MATRIX AND LINEAR ALGEBR A Aided with MATLAB

MATRIX AND LINEAR ALGEBR A Aided with MATLAB Second Edition (Revised) MATRIX AND LINEAR ALGEBR A Aided with MATLAB Kanti Bhushan Datta Matrix and Linear Algebra Aided with MATLAB Second Edition KANTI BHUSHAN DATTA Former Professor Department of Electrical

More information

1 Numerical optimization

1 Numerical optimization Contents Numerical optimization 5. Optimization of single-variable functions.............................. 5.. Golden Section Search..................................... 6.. Fibonacci Search........................................

More information

Lecture Notes: Geometric Considerations in Unconstrained Optimization

Lecture Notes: Geometric Considerations in Unconstrained Optimization Lecture Notes: Geometric Considerations in Unconstrained Optimization James T. Allison February 15, 2006 The primary objectives of this lecture on unconstrained optimization are to: Establish connections

More information

17 Solution of Nonlinear Systems

17 Solution of Nonlinear Systems 17 Solution of Nonlinear Systems We now discuss the solution of systems of nonlinear equations. An important ingredient will be the multivariate Taylor theorem. Theorem 17.1 Let D = {x 1, x 2,..., x m

More information

Line search methods with variable sample size. Nataša Krklec Jerinkić. - PhD thesis -

Line search methods with variable sample size. Nataša Krklec Jerinkić. - PhD thesis - UNIVERSITY OF NOVI SAD FACULTY OF SCIENCES DEPARTMENT OF MATHEMATICS AND INFORMATICS Nataša Krklec Jerinkić Line search methods with variable sample size - PhD thesis - Novi Sad, 2013 2. 3 Introduction

More information

Trust Regions. Charles J. Geyer. March 27, 2013

Trust Regions. Charles J. Geyer. March 27, 2013 Trust Regions Charles J. Geyer March 27, 2013 1 Trust Region Theory We follow Nocedal and Wright (1999, Chapter 4), using their notation. Fletcher (1987, Section 5.1) discusses the same algorithm, but

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Matrix Derivatives and Descent Optimization Methods

Matrix Derivatives and Descent Optimization Methods Matrix Derivatives and Descent Optimization Methods 1 Qiang Ning Department of Electrical and Computer Engineering Beckman Institute for Advanced Science and Techonology University of Illinois at Urbana-Champaign

More information

A New Low Rank Quasi-Newton Update Scheme for Nonlinear Programming

A New Low Rank Quasi-Newton Update Scheme for Nonlinear Programming A New Low Rank Quasi-Newton Update Scheme for Nonlinear Programming Roger Fletcher Numerical Analysis Report NA/223, August 2005 Abstract A new quasi-newton scheme for updating a low rank positive semi-definite

More information

An Inexact Newton Method for Optimization

An Inexact Newton Method for Optimization New York University Brown Applied Mathematics Seminar, February 10, 2009 Brief biography New York State College of William and Mary (B.S.) Northwestern University (M.S. & Ph.D.) Courant Institute (Postdoc)

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

DENSE INITIALIZATIONS FOR LIMITED-MEMORY QUASI-NEWTON METHODS

DENSE INITIALIZATIONS FOR LIMITED-MEMORY QUASI-NEWTON METHODS DENSE INITIALIZATIONS FOR LIMITED-MEMORY QUASI-NEWTON METHODS by Johannes Brust, Oleg Burdaov, Jennifer B. Erway, and Roummel F. Marcia Technical Report 07-, Department of Mathematics and Statistics, Wae

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

A projected Hessian for full waveform inversion

A projected Hessian for full waveform inversion CWP-679 A projected Hessian for full waveform inversion Yong Ma & Dave Hale Center for Wave Phenomena, Colorado School of Mines, Golden, CO 80401, USA (c) Figure 1. Update directions for one iteration

More information

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23 Optimization: Nonlinear Optimization without Constraints Nonlinear Optimization without Constraints 1 / 23 Nonlinear optimization without constraints Unconstrained minimization min x f(x) where f(x) is

More information

Geometry optimization

Geometry optimization Geometry optimization Trygve Helgaker Centre for Theoretical and Computational Chemistry Department of Chemistry, University of Oslo, Norway European Summer School in Quantum Chemistry (ESQC) 211 Torre

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal

More information