IMPLEMENTING GENERATING SET SEARCH METHODS FOR LINEARLY CONSTRAINED MINIMIZATION


Technical Report WM-CS-2005-01, Department of Computer Science, College of William & Mary, Williamsburg, Virginia 23187. Revised July 18, 2005.

IMPLEMENTING GENERATING SET SEARCH METHODS FOR LINEARLY CONSTRAINED MINIMIZATION

ROBERT MICHAEL LEWIS, ANNE SHEPHERD, AND VIRGINIA TORCZON

R. M. Lewis: Department of Mathematics, College of William & Mary, P.O. Box 8795, Williamsburg, Virginia, 23187-8795; buckaroo@math.wm.edu. This research was supported by the National Aeronautics and Space Administration under Grant NCC-1-02029, by the Computer Science Research Institute at Sandia National Laboratories, and by the National Science Foundation under Grant DMS-0215444.

A. Shepherd: Department of Computer Science, College of William & Mary, P.O. Box 8795, Williamsburg, Virginia, 23187-8795; plshep@cs.wm.edu. This research was conducted under the appointment of a Department of Energy High-Performance Computer Science Graduate Fellowship administered by the Krell Institute and funded by the Computer Science Research Institute at Sandia National Laboratories.

V. Torczon: Department of Computer Science, College of William & Mary, P.O. Box 8795, Williamsburg, Virginia, 23187-8795; va@cs.wm.edu. This research was supported by the Computer Science Research Institute at Sandia National Laboratories.

Abstract. We discuss an implementation of a generating set search method for linearly constrained minimization with no assumption of nondegeneracy placed on the constraints. The convergence guarantees for generating set search methods require that the set of search directions possesses certain geometrical properties that allow it to approximate the feasible region near the current iterate. In the hard case, the calculation of the search directions corresponds to finding the extreme rays of a cone with a degenerate vertex at the origin, a difficult problem. We discuss here how state-of-the-art computational geometry methods make it tractable to solve this problem in connection with generating set search. We also discuss a number of other practical issues of implementation, such as the desirability of augmenting the set of search directions beyond the theoretically minimal set. We illustrate the behavior of the implementation on several problems from the CUTEr test suite.

Key words. nonlinear programming, nonlinear optimization, linear constraints, direct search, pattern search, generating set search, Fourier-Motzkin double description algorithm

1. Introduction. We consider ways to implement direct search methods for solving linearly constrained nonlinear optimization problems. These problems we write in the form

    minimize    f(x)
    subject to  x_L ≤ x ≤ x_U                                   (1.1)
                c_L ≤ A_I x ≤ c_U
                A_E x = b.

The objective function is f : R^n → R with x ∈ R^n. We partition the constraints into three sets: lower and upper bounds on the variables, general linear inequality constraints, and linear equality constraints. We allow for the possibility that decision variables may be unbounded above or below by allowing the components of x_L to be −∞ and components of x_U to be +∞. We also allow general linear inequalities to be unbounded above or below by allowing components of c_L to be −∞ and components of c_U to be +∞.

A key step in generating set search algorithms for linearly constrained minimization is the computation of the requisite search directions. These are computed from a working set of constraints chosen in each iteration to be treated as nearly binding. The search directions are computed as the generators of the polar of the cone formed

by the working set. If the working set defines a cone with a nondegenerate vertex at the origin, we have what we call the easy case. Computational approaches for solving the nondegenerate case have been known for some time [21, 25, 18].

The hard case arises if the working set defines a cone with a degenerate vertex at the origin. Analysis of algorithms that allow degenerate working sets is well-established [18, 20]. On the other hand, the resulting calculation of the search directions quickly becomes too difficult for a naive approach. As noted in [18], sophisticated and efficient computational geometry algorithms are required. A goal here is to show that by using state-of-the-art algorithms from computational geometry the hard case can be dealt with in a computationally effective way, and that the naive bounds on the combinatorial complexity involved in managing the hard case are unduly pessimistic. For example, in Section 9 we illustrate a problem for which the estimate on the number of possible vectors to be considered is more than 4 × 10^17 when, in fact, only 1 vector is needed, and that vector is correctly identified in less than one-seventh of a second on a conventional workstation.

Our specific implementation makes use of the analysis of generating set search found in [16], which synthesizes the analytical results found in [18, 20]. A new feature of the algorithm presented in [16] is the way in which we choose the set of search directions, which we show here to be effective in practice.

We use Ω to denote the feasible region for problem (1.1):

    Ω = { x ∈ R^n : x_L ≤ x ≤ x_U, c_L ≤ A_I x ≤ c_U, and A_E x = b }.

The algorithm we present here, like all of the approaches referenced previously, is a feasible iterates method for solving (1.1): we require the initial iterate x_0 to be feasible and arrange for all subsequent iterates x_k to satisfy x_k ∈ Ω. We assume that f is continuously differentiable on Ω but that gradient information is not computationally available; i.e., no procedure exists for computing the gradient and it cannot be approximated accurately. We do not assume that the constraints are nondegenerate.

We discuss several topics regarding our implementation in detail. The first is how to dynamically identify a set of search directions sufficient to guarantee desirable convergence properties. Generating set search methods for linearly constrained problems achieve convergence without explicit recourse to the gradient or the directional derivative of the objective. They also do not attempt to estimate Lagrange multipliers. Instead, when the search is close to the boundary of the feasible region, the set of search directions must include directions that conform to the geometry of the nearby boundary. The general idea is that the set must contain search directions that comprise a set of generators for the cone of feasible directions, hence the name generating set search. An important decision to be addressed in any implementation of this approach is what is meant by "close to the boundary" (i.e., which constraints belong in the current working set), since the current working set determines the requirements on the set of search directions needed for the iteration. Once the current working set has been identified, the next question becomes how to obtain a set of search directions that contains a set of generators for the cone of feasible directions.
If the current working set is not degenerate, then a straightforward procedure is outlined in [18] (which is equivalent to that outlined in [21]). Equalities in the working set were not explicitly treated previously, so we give an elaboration that deals with this case. In the presence of degeneracy, the situation becomes considerably more complex. Thus obtaining a sufficient set of search directions will be the second of our concerns. We also discuss how augmenting the set of search directions can accelerate progress toward a solution.

Once a set of search directions has been identified, a step of appropriate length along each of the search directions must be found. The analysis makes this straightforward through the use of a scalar step-length control parameter Δ_k. In our implementation, the way in which we choose the working set is tied to Δ_k. Furthermore, our most recent sentiment, in line with the approach used by Lucidi and Sciandrone in [19] and Lucidi, Sciandrone, and Tseng in [20], is that there is value in decoupling the lengths of the steps along each of the search directions to ensure that, as much as possible, the requirement of maintaining feasibility does not overly limit the steps that can be tried.

Finally, there are the usual implementation details of optimization algorithms. These include step acceptance criteria, stopping criteria, scaling of the variables, and data caching.

We begin in Section 2 with an example and then give some preliminary definitions and notation in Section 3. The remainder of the paper, as well as the overall organization of the software, is described in Figure 1.1.

[Fig. 1.1. A map for the organization of the paper and the software: linearly constrained GSS (Section 8, Figure 8.1) draws on finding an initial feasible iterate (Section 8.1), identifying nearby constraints (Section 4), generating core search directions (Section 5), augmenting core search directions (Section 6), choosing the step lengths (Section 7), accepting steps (Section 8.2), stopping criteria (Section 8.3), scaling issues (Section 8.4), and caching data (Section 8.5).]

2. An illustrative example. Figure 2.1 illustrates the first iterations of a greatly simplified linearly constrained GSS method applied to the two-dimensional modified Broyden tridiagonal function [4, 22], to which we have added three linear constraints. Level curves of f are shown in the background. The three constraints form a triangle. In each figure, a dot denotes the current iterate x_k, which is the best point (the feasible point with the lowest value of f) found so far.

We partition the indices of the iterations into two sets: S and U. The set S corresponds to all successful iterations, those for which x_k ≠ x_{k+1} (i.e., at iteration k the search identified a feasible point with a lower value of the objective and that point will become the next iterate). The set U corresponds to all unsuccessful iterations, those for which x_{k+1} = x_k since the search was unable to identify a feasible point with a lower value of the objective.

Each of the six subfigures in Figure 2.1 represents one iteration of a linearly constrained GSS method. The four squares represent the trial points under consideration at that iteration. Taking a step of length Δ_k along each of the four search directions

yields the trial points. In subfigures (b)-(f), the trial points from the previous iteration are shown in the background for comparison. In subfigure (a), the solution to the problem is marked with a star.

[Fig. 2.1. Linearly constrained GSS applied to the modified Broyden tridiagonal function. (a) k = 0: the initial set of search directions; move Northeast; k ∈ S. (b) k = 1: keep the set of search directions; move Northeast; k ∈ S. (c) k = 2: change the set of search directions; contract; k ∈ U. (d) k = 3: change the set of search directions; move Northeast; k ∈ S. (e) k = 4: change the set of search directions; contract; k ∈ U. (f) k = 5: change the set of search directions; move Southeast; k ∈ S.]

Notice in subfigure (a) that the initial iterate x_0 is near two constraints. Thus two of the initial search directions are chosen so that they are parallel to the two nearby constraints. In this instance, the result is two feasible descent directions; i.e., moving along two of the four search directions would allow the search to realize descent on the value of f at x_0 while remaining feasible. (Under appropriate assumptions, the analysis guarantees at least one feasible descent direction.) Either of the two trial points produced by these two feasible descent directions is acceptable; in this instance we choose the one that gives the larger decrease in the value of the objective f to become x_1. Since the step of length Δ_0 from x_0 to x_1 was sufficient to produce decrease, for the purposes of this simple example we set Δ_1 to Δ_0.

In subfigure (b), the same two constraints remain nearby so we leave the set of search directions unchanged. Again, there are two feasible descent directions, each of which leads to a trial point that decreases the value of f at the current iterate. We accept the trial point that yields the larger decrease as x_2 and we set Δ_2 to Δ_1.

In subfigure (c), the set of nearby constraints changes. We drop one of the constraints that had existed in our working set of nearby constraints in favor of a newly identified constraint that yields a different working set. The set of search directions that results is shown. As guaranteed by the analysis, we do have one feasible descent direction, but now the difficulty is that currently the step along that direction takes the search outside the feasible region. Thus, the best iterate is unchanged, so we set x_3 to x_2. Further, we contract by setting Δ_3 to half the value of Δ_2, which has the effect of halving the length of the steps taken at the next iteration.

The result of halving the length of the steps is illustrated in subfigure (d). First notice that in addition to reducing the length of the step allowed, we also have changed the set of working constraints; in fact, we have reduced the set to only one working constraint. (The mechanism by which we identify nearby constraints and its connection to the value of Δ_k are discussed in more detail in Section 4.) Given both a new set of search directions, as well as a reduction in the length of the steps allowed, we now have a feasible trial point that gives decrease on the value of the objective, which we make x_4. In addition, we set Δ_4 to Δ_3.

In subfigure (e) the search moves to the new iterate. Once again the search is near the same two constraints seen in subfigure (c), so we have the same working set and the same set of search directions, though now we search with a reduced step length (i.e., Δ_4 = ½Δ_2). Once again, the length of the steps along the two descent directions is too long. So we set x_5 to x_4 and Δ_5 to ½Δ_4.

The result of halving is illustrated in subfigure (f). In addition to reducing the length of the steps allowed, we have changed the set of working constraints; once again, we have reduced the set to only one working constraint. A new feasible trial point that gives decrease on the value of f is identified (to the Southeast) and at the next iteration the search will proceed from there.

This example illustrates the essential features the implementation must address: how to identify the nearby constraints, how to obtain a set of search directions that contains a set of generators for the cone of feasible directions, how to find a step of an appropriate length, and how to categorize the iteration as either successful (k ∈ S) or unsuccessful (k ∈ U).

3. Notation and definitions. Unless otherwise noted, norms and inner products are assumed to be the Euclidean norm and inner product.

A cone K is a set that is closed under nonnegative scalar multiplication: K is a cone if x ∈ K implies ax ∈ K for all a ≥ 0. A cone is called finitely generated if there exists a finite set of vectors v_1, ..., v_r such that

    K = { λ_1 v_1 + ··· + λ_r v_r : λ_1, ..., λ_r ≥ 0 }.

The vectors v_1, ..., v_r are called the generators of K. Conversely, the cone generated by a set of vectors v_1, ..., v_r is the set of all nonnegative combinations of these vectors.

A special case occurs if K is a vector space. In that case, a set of generators is called a positive spanning set for K [7]. Thus a positive spanning set is like a linear spanning set, but with the additional requirement that all the coefficients be nonnegative. One particular choice of generating set for R^n to which we will have recourse is the set of the positive and negative unit coordinate vectors

    G = { e_1, e_2, ..., e_n, −e_1, −e_2, ..., −e_n }.

Clearly any vector in R^n is a nonnegative combination of the vectors in G.

The maximal linear subspace contained in a cone K is called its lineality space. If L is the lineality space of K, then we call K ∩ L⊥ the pointy part of K. Finally, the polar of a cone K, denoted by K°, is the set

    K° = { v : wᵀv ≤ 0 for all w ∈ K }.

The polar K° is a convex cone, and if K is finitely generated, then so is K°.
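As a concrete illustration of the positive-spanning claim for G, the following minimal Python/NumPy sketch (illustrative only; it is not part of the paper's implementation) expresses an arbitrary vector as a nonnegative combination of the vectors in G:

    import numpy as np

    def coordinate_decomposition(v):
        # Coefficients expressing v as a nonnegative combination of the
        # positive and negative unit coordinate vectors in G.
        pos = np.maximum(v, 0.0)    # coefficients on  e_1, ..., e_n
        neg = np.maximum(-v, 0.0)   # coefficients on -e_1, ..., -e_n
        return pos, neg

    v = np.array([1.5, -2.0, 0.0])
    pos, neg = coordinate_decomposition(v)
    assert np.allclose(pos - neg, v)   # v = sum_i pos_i e_i + sum_i neg_i (-e_i)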

Given a vector v, let v_K and v_{K°} denote the projections (in the Euclidean norm) of v onto K and its polar K°. The polar decomposition [23, 26] says that any vector v can be written as the sum of its projection onto K and its polar K°, and the projections are orthogonal:

    v = v_K + v_{K°},  with  v_Kᵀ v_{K°} = 0.

This means that R^n is positively spanned by a set of generators of K together with a set of generators of K°, a generalization of the direct sum decomposition given by a subspace and its orthogonal complement.

4. Constructing the working set from the nearby constraints. We next discuss how we determine the working set of constraints. From this working set we compute the minimal set of necessary search directions.

For convenience we combine the bounds and general linear constraints. Let

    A = [ I   ]
        [ A_I ],

and let a_iᵀ denote the rows of A, indexed by i ∈ I. Let C_{L,i} and C_{U,i} be the sets where lower and upper bounds on a_iᵀx are binding:

    C_{L,i} = { y : a_iᵀy = (c_L)_i }  and  C_{U,i} = { y : a_iᵀy = (c_U)_i }.

These sets are faces of the feasible polyhedron. Of interest to us are the outward-pointing normals to the faces within a prescribed distance of x. Let N(A_E) denote the nullspace of the equality constraints A_E. Given a point x that is feasible for (1.1) and any set S, we define

    dist(x, S) = min { ‖x − y‖ : y ∈ S ∩ (x + N(A_E)) }   if S ∩ (x + N(A_E)) ≠ ∅,
                 +∞                                       otherwise.

That is, we measure distance from x to S inside x + N(A_E). We then define the sets of inequalities at their bounds that are within distance ε of x:

    I_L(x, ε) = { i ∈ I : dist(x, C_{L,i}) ≤ ε }
    I_U(x, ε) = { i ∈ I : dist(x, C_{U,i}) ≤ ε }.

We now define the working set of constraints. Given feasible x and ε > 0, the corresponding working set W(x, ε) is made up of two pieces. The first piece consists of the vectors W_E that correspond to the equality constraints together with every inequality for which the faces for both its lower and upper bounds are within distance ε of x. If the rows a_eᵀ of the equality constraints A_E are indexed by e ∈ E, then

    W_E(x, ε) = { a_e : e ∈ E } ∪ { a_i : i ∈ I_L(x, ε) ∩ I_U(x, ε) }.

The second piece of the working set consists of the vectors W_I that are outward-pointing normals to the inequalities for which the face of exactly one of their lower or upper bounds is within distance ε of x. Let I_E(x, ε) = I_L(x, ε) ∩ I_U(x, ε); then

    W_I(x, ε) = { −a_i : i ∈ I_L(x, ε) \ I_E(x, ε) } ∪ { a_i : i ∈ I_U(x, ε) \ I_E(x, ε) }.
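To make the construction concrete, here is a minimal Python/NumPy sketch of the working-set computation (illustrative only, with hypothetical names; the paper's implementation is not in Python). It assumes the special case E = ∅, so that dist reduces to ordinary Euclidean distance, and it approximates the distance to each face by the distance to its supporting hyperplane:

    import numpy as np

    def working_set(x, A, c_L, c_U, eps):
        # Rows of A are the combined bound and general-inequality normals a_i^T
        # of Section 4; x is assumed feasible.  Infinite bounds (+/-inf) yield
        # infinite distances and so are never identified as nearby.
        norms = np.linalg.norm(A, axis=1)
        dist_L = (A @ x - c_L) / norms      # distance to each lower-bound face
        dist_U = (c_U - A @ x) / norms      # distance to each upper-bound face
        I_L = set(np.flatnonzero(dist_L <= eps))
        I_U = set(np.flatnonzero(dist_U <= eps))
        I_E = I_L & I_U                     # both faces nearby: treated like equalities
        W_E = [A[i] for i in sorted(I_E)]
        W_I = ([-A[i] for i in sorted(I_L - I_E)] +
               [A[i] for i in sorted(I_U - I_E)])
        return W_E, W_I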

Given x, we define the ε-normal cone N(x, ε) to be the cone generated by the linear span of the vectors in W_E(x, ε) together with the positive span of the vectors in W_I(x, ε). If both the latter sets are empty, then N(x, ε) = {0}. The ε-tangent cone, denoted by T(x, ε), is the polar of N(x, ε):

    T(x, ε) = { v : wᵀv ≤ 0 for all w ∈ N(x, ε) }.

If N(x, ε) = {0}, then T(x, ε) = R^n (in other words, if the boundary is more than distance ε away, and one looks only within distance ε of x, then the problem looks unconstrained). Several examples are illustrated in Figure 4.2.

[Fig. 4.1. The outward-pointing normals when constraints {1, 2} and {2, 3} are the working sets at x_1 and x_2. Since the distance from x_3 to the boundary is greater than ε, the last working set is empty.]

[Fig. 4.2. The cones N(x, ε) and T(x, ε) for the values ε_1, ε_2, and ε_3. Note that for this example, as ε varies from ε_1 to 0, there are only the three distinct pairs of cones (N(x, ε_3) = {0}).]

The significance of N(x, ε) is that for suitable choices of ε, its polar T(x, ε) approximates the polyhedron near x, where "near" is in terms of ε. (More precisely, x + T(x, ε) approximates the feasible region near x.) The cone T(x, ε) is important because if T(x, ε) ≠ {0}, then one can proceed from x along all directions in T(x, ε) for a distance of at least ε and still remain inside the feasible region [18, 16]. Figure 4.2 illustrates this point.

In the algorithm, we allow different values of ε_k at every iteration k. There is a lot of flexibility in determining the choice of ε_k, as discussed in [16]. We have chosen to yoke ε_k to Δ_k. A connection between ε_k and Δ_k makes sense. If every step likely to be considered at the current iteration would remain inside the feasible region, there seems little point in burdening the search with the need to consider nearby constraints. When the length of the steps allowed means that infeasible trial points could be constructed, however, then it makes sense to construct a working set using these nearby constraints and to generate directions that allow the search to move along them.

Revisiting the example in Figure 2.1, in Figure 4.3 we show the choice of ε_k that was used to identify the working set of nearby constraints.

[Fig. 4.3. One strategy for choosing the ε_k used in linearly constrained GSS applied to the modified Broyden tridiagonal function; the six subfigures correspond to iterations k = 0, ..., 5 of Figure 2.1.]

This illustration has a simple way of choosing ε_k: ε_k is the same as Δ_k. In practice we have found that the choice need not be much more complicated, a point we discuss in detail in Section 7.

5. Choosing the set of search directions. Once the nearby constraints for a given iteration have been identified, we can define the set of search directions. Figures 4.1-4.2 contain clues to the process. In order to ensure that the convergence results in [16] hold, we must arrange for the following. The search must conform to the local geometry of Ω in the sense that at each iteration k, there is at least one direction in D_k, the set of search directions at iteration k, along which the search can take an acceptably long step and still remain feasible.

We decompose D_k into G_k and H_k, where G_k is made up of the directions necessary for convergence and H_k contains any additional directions. Several options for choosing G_k have been considered:

1. G_k contains generators for all the cones T(x_k, ε), for all ε ∈ [0, ε*], where ε* > 0 is independent of k [18].
2. G_k is exactly the set of generators for the cone T(x_k, ε_k), where ε_k > 0 is updated according to whether acceptable steps are found, and H_k = ∅ [20].

[Fig. 5.1. Condition 1 is needed to avoid a poor choice when choosing G_k: (a) poor choice, (b) reasonable choice, (c) optimal choice (with respect to κ(G_k)).]

3. G_k contains generators for the cone T(x_k, ε_k), where ε_k = β_max Δ_k, and β_max is described below in Condition 2 [16].

In the work discussed here, we use the third option. As is necessary for any of the three options, the following is required of G_k [15, 16]:

Condition 1. For every k, G_k generates T_k = T(x_k, ε_k). In addition, there exists a constant κ_min > 0, independent of k, such that for all k, if G_k ≠ {0} then

    κ(G_k) ≡ min over v ∈ R^n with v_{T_k} ≠ 0 of max over d ∈ G_k of (vᵀd) / (‖v_{T_k}‖ ‖d‖) ≥ κ_min.    (5.1)

The lower bound in Condition 1 is needed when there is a lineality space present in T(x_k, ε_k). In this case we have freedom in the choice of generators. Figure 5.1 depicts the situation where G_k contains a halfspace. We require a lower bound on κ(G_k) to avoid the situation illustrated in Figure 5.1(a), in which the search directions can be almost orthogonal to the direction of steepest descent. Figure 5.1(c), which uses a minimal number of vectors but ensures as large a value for κ(G_k) as possible, represents an optimal choice in the absence of explicit knowledge of the gradient.

All possible cones N(x, ε) and T(x, ε) can be generated using a finite number of vectors, as formalized in the following proposition from [16].

Proposition 5.1. There exists a finite set A such that A contains generators for every set N(x, ε) with x ∈ Ω and ε ≥ 0. As a consequence, there also exists a finite set G such that G contains generators for every set T(x, ε) with x ∈ Ω and ε ≥ 0.

The definition of A ensures that for every x ∈ Ω and ε ≥ 0 there exists a subset A' of A such that A' generates N(x, ε). Furthermore, there exists a constant ν_min > 0 such that for every x ∈ Ω and ε ≥ 0 the following holds:

    κ(A') ≥ ν_min.    (5.2)

There is also the set H_k of additional directions we choose to add to improve the search. We discuss this further in Section 6.

In addition, as discussed in more detail in [16], we must enforce bounds on the lengths of the search directions themselves:

Condition 2. There exist β_min > 0 and β_max > 0, independent of k, such that for all k, β_min ≤ ‖d‖ ≤ β_max for all d ∈ G_k.

In our implementation we choose to satisfy Condition 2 by normalizing all search directions so that β_min = β_max = 1.

The necessary conditions on the set of search directions all involve the calculation of generators for the polars of cones determined by the current working set. We treat the determination of the search directions in increasing order of difficulty. Figure 5.2 shows the various cases to consider.

[Fig. 5.2. The components involved in the generation of the search directions. Easy cases: empty working set (Section 5.1), only bound constraints in the working set (Section 5.1), only equality constraints in the working set (Section 5.2). Straightforward case: nondegenerate working set (Section 5.3.1, uses LAPACK). Hard case: degenerate working set (Section 5.3.2, uses cddlib).]

5.1. When the working set contains only bound constraints or is empty. If only bound constraints are present, the situation is quite simple since the generators for the ε-normal and ε-tangent cones can be drawn from the set of coordinate directions. We follow the prescription in [17] and include in G_k the unit coordinate vectors ±e_i. If some of the variables do not have bounds, a smaller number of the unit coordinate vectors could be used to form G_k [18], but in our testing we have found that the coordinate directions are a consistently effective set of search directions. This has proven particularly true in engineering problems we have seen, where many of the responses are monotonic (at least locally) with respect to each variable. The coordinate directions have physical meaning for such applications; thus, including them in the set of search directions seems to lead to better overall performance.

A similar situation obtains when the working set is empty, so N(x_k, ε_k) = {0} and T(x_k, ε_k) = R^n. In this case, any positive basis suffices [15]. We rely on the set of unit coordinate vectors ±e_i because it is apt for many engineering problems and because the orthonormality of the vectors ensures a reasonable value for κ_min in (5.1).

5.2. When the working set consists only of equality constraints. If the working set consists only of equality constraints, then the generators of T(x_k, ε_k) correspond to a positive spanning set for the nullspace of A_E. In this situation, by default we compute an orthonormal basis Z for this nullspace and take as our generators the compass-like set consisting of the columns of Z and their negatives. We also allow for a nullspace basis defined by a basic/nonbasic division of the variables, as in the simplex method.

5.3. When the working set contains general linear constraints. In the case shown in Figure 5.3, once the nearby constraints have been identified using x_k and ε_k, their outward-pointing normals a_1 and a_2 are extracted from the rows of the constraint matrix A to form the set of working constraints W_k = {a_1, a_2}. The vectors in W_k are the generators for the ε-normal cone N(x_k, ε_k).

[Fig. 5.3. The outward-pointing normals a_1 and a_2 of the nearby (working) constraints, which are then translated to x_k to form the cone N(x_k, ε_k) and define its polar T(x_k, ε_k). The generators for N(x_k, ε_k) and T(x_k, ε_k) form the set of search directions. In this example, notice that the same vectors are generators for the normal cone N(v_0) and the tangent cone T(v_0) at the vertex v_0.]

5.3.1. When the working set is known to be nondegenerate. Figure 5.3 depicts the computationally straightforward case. The generators of N(x_k, ε_k) are linearly independent. As a consequence, N(x_k, ε_k) is linearly isomorphic to the positive orthant in R^2, and the same isomorphism maps T(x_k, ε_k) to the negative orthant. Computing the generators of T(x_k, ε_k) becomes a simple matter of linear algebra.

More generally, in the easy case N(x_k, ε_k) is linearly isomorphic to a cone in some Euclidean space that is the sum of the positive orthant and a vector space, and T(x_k, ε_k) is linearly isomorphic to the corresponding polar, which is the sum of the negative orthant and a vector space. The following proposition describes this easier case of computing generators for T(x_k, ε_k). Proposition 5.2 treats equality constraints explicitly, an elaboration of the construction given in [18]. We explain shortly how the result is used in the actual algorithm.

Proposition 5.2. Suppose N(x_k, ε_k) is generated by the positive span of the columns of the matrix V_P and the linear span of the columns of the matrix V_L:

    N(x_k, ε_k) = { v : v = V_P λ + V_L α, λ ≥ 0 }.

Let Z be a matrix whose columns are a basis for the nullspace of V_Lᵀ, and N be a matrix whose columns are a basis for the nullspace of V_PᵀZ.

Finally, suppose a right inverse R exists for V_PᵀZ. Then T(x_k, ε_k) is the positive span of the columns of −ZR together with the linear span of the columns of ZN:

    T(x_k, ε_k) = { w : w = −ZRu + ZNξ, u ≥ 0 }.    (5.3)

Proof. We begin with d = −ZRu + ZNξ, where u ≥ 0. Then V_Lᵀd = 0 because V_LᵀZ = 0. Meanwhile, since V_PᵀZR = I and V_PᵀZN = 0, we have

    V_Pᵀd = V_Pᵀ(−ZRu + ZNξ) = −u ≤ 0.

Thus d has a nonpositive inner product with all the generators of N(x_k, ε_k), so d ∈ T(x_k, ε_k).

Conversely, suppose d ∈ T(x_k, ε_k). Observe that because the linear span of the columns of V_L is contained in N(x_k, ε_k), we must have V_Lᵀd = 0. Now consider the vector

    d̃ = ZR V_Pᵀd = (−ZR)(−V_Pᵀd).

Since −V_Pᵀd ≥ 0, we see that d̃ lies in the positive span of the columns of −ZR. Then

    V_Pᵀ(d − d̃) = V_Pᵀd − V_PᵀZR V_Pᵀd = 0.    (5.4)

Meanwhile, we know that V_Lᵀ(d − d̃) = 0, so for some w we have d − d̃ = Zw. From (5.4) we obtain V_PᵀZw = 0. This means that w is in the nullspace of V_PᵀZ, so w = Nξ for some ξ, whence d − d̃ = ZNξ. The result then follows from the identity

    d = −ZRu + ZNξ,  with u = −V_Pᵀd ≥ 0.

The adjective "nondegenerate" comes about because the easy case corresponds to the situation where the cone generated by the columns of V_P is known to have a nondegenerate vertex at the origin.

In applying this construction, we include as columns of V_L all equality constraints along with the inequality constraints where we know that both a_i and −a_i are in the working set. The latter situation can occur if ε_k is large enough that the faces associated with both the lower and upper bound on a_iᵀx are within distance ε_k of x_k. The columns of V_P are the remaining outward-pointing normals for the constraints in the working set. We then compute a basis Z for the nullspace of V_Lᵀ and check whether a right inverse R exists for V_PᵀZ. If so, the generators for T(x_k, ε_k) are given by (5.3). Observe that if R exists, then ZR is a right inverse for V_Pᵀ, so we actually check for the existence of a right inverse for the latter.

By default, our implementation computes the pseudoinverse to serve as the right inverse in Proposition 5.2. This choice is particularly attractive because the columns of the pseudoinverse are orthogonal to any vector in the lineality space of T(x_k, ε_k), which helps improve the conditioning of the resulting search directions (e.g., Condition 1). Within our implementation we include as an option a right inverse found by a basic/nonbasic division of the variables.

To complete the set of generators of T(x_k, ε_k), we construct an orthonormal basis N for the nullspace of V_PᵀZ. We then take as the remaining generators the columns of ±ZN; this corresponds to a compass-like search in the nullspace of V_Pᵀ.
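A minimal dense linear-algebra sketch of this construction follows (illustrative only; it uses NumPy/SciPy in place of the LAPACK routines the paper mentions, and all names are hypothetical). The pseudoinverse is a right inverse of V_PᵀZ exactly when that matrix has full row rank, which is the nondegenerate case:

    import numpy as np
    from scipy.linalg import null_space

    def tangent_cone_generators(V_P, V_L):
        # Proposition 5.2, nondegenerate case: columns of V_P positively span
        # and columns of V_L linearly span N(x_k, eps_k).  Returns the positive
        # generators (columns of -Z R) and the lineality directions (columns of
        # Z N, to be used with both signs) of T(x_k, eps_k).
        n = V_P.shape[0]
        Z = null_space(V_L.T) if V_L.size else np.eye(n)
        VPZ = V_P.T @ Z
        if np.linalg.matrix_rank(VPZ) < VPZ.shape[0]:
            raise ValueError("degenerate working set: defer to cddlib-style code")
        R = np.linalg.pinv(VPZ)      # pseudoinverse is a right inverse: V_P^T Z R = I
        N = null_space(VPZ)          # basis for the nullspace of V_P^T Z
        return -Z @ R, Z @ N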

5.3.2. When the working set may be degenerate. If Proposition 5.2 cannot be applied (the degenerate case), the construction of the search directions becomes more complex, both computationally and algorithmically. In this case the rows of V_Pᵀ are not linearly independent, and N(x_k, ε_k) is a cone with a degenerate vertex at the origin.

The problem of finding generators for T(x_k, ε_k) is a special case of the problem of enumerating the vertices and extreme rays of a polyhedron (the V-representation of the polyhedron) given the description of the polyhedron as an intersection of halfspaces (the H-representation). This is a well-studied problem in computational geometry (see [2] for a survey) and effective algorithms have been developed. The two main classes of algorithms for solving this type of problem are pivoting algorithms and insertion algorithms. Pivoting algorithms move from vertex to vertex along the edges of the polyhedron by generating feasible bases, as in the simplex method for linear programming. The reverse search algorithm [3, 1] is an example of such a method. Insertion algorithms compute the V-description by working with the intersection of an increasing number of halfspaces. Typically, such an algorithm starts with a maximal linearly independent set of normals defining the halfspaces and constructs an initial set of extreme rays as in Proposition 5.2. Additional halfspaces are then introduced and the vertices and extreme rays are recomputed; this process continues until the entire set of halfspaces has been introduced. The Fourier-Motzkin Double Description method [8, 24] is an example of an insertion algorithm.

Others have observed that pivoting algorithms can be inefficient if they encounter a highly degenerate vertex [2]. This was our experience using the lrs reverse search code [1] in the highly degenerate case. For that reason we instead use the Double Description method, an insertion method, computing the necessary search directions using library procedures from Fukuda's cddlib code [9]. This package is based on an efficient implementation of the Double Description algorithm and has been shown to be effective in dealing with highly degenerate vertices [10].

We found it more efficient to retain equalities in the problem rather than to eliminate them by reparameterizing the problem in terms of a basis for the nullspace of the equality constraints. In some cases elimination of equalities led to a significant deterioration in the performance of cddlib. This is due to the fact that the constraint coefficients went from being sparse and frequently integral to being dense arrays of relatively random floating point numbers, which made the Fourier-Motzkin elimination process more difficult. Thus, eliminating equalities had some serious disadvantages which in our testing were not offset by any advantages.

6. Augmenting the core set of search directions. While generators of T(x_k, ε_k) are theoretically sufficient for the convergence analysis, in practice there are significant benefits to including additional search directions. The situation illustrated in Figure 6.1 shows the desirability of additional directions, since the directions in T(x_k, ε_k) point into the interior of the feasible region, while it may be preferable to move toward the boundary.
At the very least, we would like the total set of search directions to be a positive basis for R^n or, if equality constraints are present, a positive basis for the nullspace of the equality constraints. One simple way to accomplish this is to include in H_k the set N_k of outward-pointing normals to the working set of constraints. Since these directions generate N(x_k, ε_k), and T(x_k, ε_k) is polar to N(x_k, ε_k), by the polar decomposition, R^n = N(x_k, ε_k) + T(x_k, ε_k).

Thus, adding the directions generating N(x_k, ε_k) means our search directions constitute a positive spanning set for R^n, as illustrated in Figure 6.2. See Section 9 for a computational illustration of the advantage of including these directions.

[Fig. 6.1. Minimizing the number of function evaluations at a single iteration might require more total function evaluations, since in this instance it will take at least five function evaluations to find descent. (a) The first iteration is necessarily unsuccessful and so Δ_1 is halved for the next iteration. (b) The second iteration is also unsuccessful and so Δ_2 is halved again. (c) Now, locally, the problem looks unconstrained and a successful step toward the boundary is possible.]

[Fig. 6.2. Why including the generators for the ε-normal cone N(x, ε_1) might be a good idea, since descent will be found after at most three function evaluations. (a) The generators of N(x, ε_1) are both descent directions. (b) Limiting the steps along the generators of N(x, ε_1) ensures feasible steps.]

The preceding discussion needs to be made more precise if equality constraints are present. Typically, in this case the outward-pointing normals a_i to constraints in the working set are infeasible directions: they do not lie in the nullspace of the equality constraints. Thus, a step along a direction a_i will lead to violation of the equality constraints. In this situation we replace the outward-pointing normals by their projection into the nullspace of the equality constraint matrix A_E.
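A minimal sketch of this projection (illustrative only; the helper name is hypothetical and this is not the paper's code):

    import numpy as np
    from scipy.linalg import null_space

    def project_normals(A_E, normals):
        # Orthogonally project each outward-pointing normal a_i into the
        # nullspace of the equality constraints, so that steps along the
        # projected directions preserve A_E x = b.
        Z = null_space(A_E)              # orthonormal basis for N(A_E)
        P = Z @ Z.T                      # orthogonal projector onto N(A_E)
        projected = (P @ a for a in normals)
        # Drop directions that project to (numerically) zero.
        return [d for d in projected if np.linalg.norm(d) > 1e-12]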

We have tested other augmentations of the search directions as well, but have found no real advantage. For instance, if the directions in G_k linearly span R^n, then adding the directions −G_k yields a positive basis. This follows from the simple observation that

    c_1 v_1 + ··· + c_r v_r = |c_1| (sign(c_1) v_1) + ··· + |c_r| (sign(c_r) v_r).

We have also tried adding generators for T(x_k, ε) for values of ε less than ε_k. This allows us to anticipate the effect of reducing ε_k in subsequent iterations. The operation of the Fourier-Motzkin Double Description algorithm is such that appending generators to a set of directions and then recomputing the polar can be done efficiently.

7. Determining the length of the step. The step-length control parameter Δ_k determines the length of a step along each direction for a given set of search directions. In the unconstrained case, where the set of directions at iteration k is

    D_k = { d_k^(1), d_k^(2), ..., d_k^(p_k) },

the associated set of trial points would be

    { x_k + Δ_k d_k^(i) : i = 1, ..., p_k }.

When constraints are introduced, however, we must consider the possibility that some of the trial points will be infeasible. With this in mind, we choose the maximum Δ_k^(i) ∈ [0, Δ_k] such that x_k + Δ_k^(i) d_k^(i) ∈ Ω, and define the set of trial points as

    { x_k + Δ_k^(i) d_k^(i) : i = 1, ..., p_k }.

In choosing Δ_k^(i), a full step should be used if feasible, as stated in Condition 3.

Condition 3. If x_k + Δ_k d_k^(i) ∈ Ω, then Δ_k^(i) = Δ_k.

The simplest way of satisfying Condition 3 is to choose Δ_k^(i) = Δ_k if the result will be feasible and Δ_k^(i) = 0 otherwise, thereby simply rejecting infeasible points. We have found it much more efficient, however, to allow steps that stop at the boundary, as in [19, 20]. This is illustrated in Figure 6.2(b). More generally, for the search directions in H_k we choose Δ_k^(i) to be the maximum value in [σ_tol Δ_k, Δ_k] satisfying x_k + Δ_k^(i) d_k^(i) ∈ Ω, where σ_tol ≥ 0. The default value of σ_tol in our implementation is 10^−3. The factor σ_tol prevents excessively small steps. Such steps can yield only a very tiny improvement, which has the effect of retarding decrease in Δ_k.
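The following ratio test sketches this step-length rule (illustrative only; names are hypothetical, constraints are assumed in the combined form c_L ≤ Ax ≤ c_U of Section 4, and equality constraints are assumed to be handled by keeping d in N(A_E)):

    import numpy as np

    def feasible_step(x, d, Delta, A, c_L, c_U, sigma_tol=1e-3):
        # Largest step in [0, Delta] along d that keeps c_L <= A x <= c_U,
        # with steps below sigma_tol * Delta rejected outright.  If the full
        # step is feasible it is returned unchanged (Condition 3).
        Ax, Ad = A @ x, A @ d
        step = Delta
        for ax, ad, lo, hi in zip(Ax, Ad, c_L, c_U):
            if ad > 0 and np.isfinite(hi):       # moving toward the upper face
                step = min(step, (hi - ax) / ad)
            elif ad < 0 and np.isfinite(lo):     # moving toward the lower face
                step = min(step, (lo - ax) / ad)
        return step if step >= sigma_tol * Delta else 0.0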

8. GSS algorithms for problems with linear constraints. The variant of GSS that we have implemented for problems with linear constraints is outlined in Algorithm 8.1. See [15, 16] for a discussion of some of the many other possibilities. The remainder of this section is devoted to further explanation of Algorithm 8.1.

8.1. Finding an initial feasible iterate. GSS methods for linearly constrained problems are feasible iterates methods: all iterates x_k, k = 0, 1, 2, ..., satisfy x_k ∈ Ω. If an infeasible initial iterate is given, then we compute its Euclidean projection onto the feasible polyhedron; this is a quadratic program. For some of the test problems discussed in Section 9 the starting value provided is infeasible, and for these we used this projection strategy. Alternatively, one could use Phase 1 of the simplex method to obtain a starting point at a vertex of the feasible region.

Algorithm 8.1 (Linearly constrained GSS using a sufficient decrease globalization strategy).

Initialization.
    Let f : R^n → R be given. Let Ω be the feasible region.
    Let x_0 ∈ Ω be the initial guess.
    Let Δ_tol > 0 be the step-length convergence tolerance.
    Let Δ_0 > Δ_tol be the initial value of the step-length control parameter.
    Let θ_max, with 0 < θ_max < 1, be an upper bound on the contraction parameter θ_k.
    Let ε_max > β_max Δ_tol be the maximum distance used to identify nearby constraints (ε_max = +∞ is permissible).
    Let ρ(Δ) = αΔ² be the forcing function, with α ∈ (0, 1).

Algorithm. For each iteration k = 0, 1, 2, ...
    Step 1. Let ε_k = min{ε_max, β_max Δ_k}. Choose a set of search directions D_k = G_k ∪ H_k satisfying Conditions 1 and 2.
    Step 2. If there exist d_k ∈ D_k and a corresponding Δ̃_k ∈ [0, Δ_k] satisfying Condition 3 such that x_k + Δ̃_k d_k ∈ Ω and
        f(x_k + Δ̃_k d_k) < f(x_k) − ρ(Δ_k),
    then:
        Set x_{k+1} = x_k + Δ̃_k d_k (a successful step).
        Set Δ_{k+1} = φ_k Δ_k for any choice of φ_k ≥ 1.
    Step 3. Otherwise, for every d ∈ G_k, either x_k + Δ̃_k d ∉ Ω or
        f(x_k + Δ̃_k d) ≥ f(x_k) − ρ(Δ_k).
    In this case:
        Set x_{k+1} = x_k (an unsuccessful step).
        Set Δ_{k+1} = θ_k Δ_k for some θ_k such that 0 < θ_k ≤ θ_max.
        If Δ_{k+1} < Δ_tol, then terminate.

Fig. 8.1. Linearly constrained GSS using a sufficient decrease globalization strategy.

8.2. Step acceptance criteria. An iteration k is successful, i.e., k ∈ S, if it satisfies the decrease condition. That is, there exists a d_k ∈ D_k for which

    f(x_k + Δ̃_k d_k) < f(x_k) − ρ(Δ_k).    (8.1)
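Putting Algorithm 8.1 and the acceptance criterion (8.1) together, a minimal sketch of the main loop might look as follows (illustrative only; `directions` and `max_step` are hypothetical callables standing in for the machinery of Sections 5-7, and φ_k = 1, θ_k = 1/2 as in the tests of Section 9):

    import numpy as np

    def gss_linear(f, x0, directions, max_step, Delta0, Delta_tol,
                   alpha=1e-4, theta=0.5):
        # directions(x, eps) returns the search directions D_k;
        # max_step(x, d, Delta) is the ratio test of Section 7
        # (e.g., feasible_step above).
        x, Delta = np.asarray(x0, dtype=float), float(Delta0)
        while Delta >= Delta_tol:
            eps = Delta                    # eps_k = min(eps_max, beta_max Delta_k)
            rho = alpha * Delta ** 2       # forcing function rho(Delta_k)
            for d in directions(x, eps):
                t = max_step(x, d, Delta)
                if t > 0.0 and f(x + t * d) < f(x) - rho:
                    x = x + t * d          # successful iteration: k in S
                    break
            else:
                Delta *= theta             # unsuccessful iteration: k in U
        return x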

The nonnegative function ρ defined on [0, +∞) is called the forcing function and must satisfy one of the following two requirements. Either

    ρ(Δ) ≡ 0    (8.2)

or

    ρ is continuous, ρ(Δ) = o(Δ) as Δ ↓ 0, and ρ(Δ) ≤ ρ(Δ') for Δ < Δ'.    (8.3)

Choosing ρ to be identically zero as in (8.2) imposes a simple decrease condition on the acceptance of the step. Otherwise, choosing ρ as in (8.3) imposes a sufficient decrease condition. We use ρ(Δ) = αΔ² in the acceptance criterion (8.1). With this choice of ρ, all the terms on the right-hand side of (8.5) are O(Δ_k) [16]. By default, the value of α is 10^−4; however, for any given problem the value of α should be chosen to reflect the relative magnitude of f and Δ.

8.3. Stopping criteria. We stop the algorithm once the step-length parameter Δ_k falls below a specified tolerance Δ_tol. This stopping criterion is motivated by the relationship between Δ_k and χ(x_k), a particular measure of progress toward a Karush-Kuhn-Tucker (KKT) point of (1.1). This connection is stated in Theorem 8.3. For x ∈ Ω, let

    χ(x) ≡ max { −∇f(x)ᵀw : x + w ∈ Ω, ‖w‖ ≤ 1 }.

This quantity is discussed at length in [5, 6]. Of interest to us here are the following properties: (1) χ(x) is continuous on Ω, (2) χ(x) ≥ 0, and (3) χ(x) = 0 if and only if x is a KKT point for (1.1). We therefore wish to see χ(x_k) tend to zero. Theorem 8.3 says that this occurs at unsuccessful iterations as Δ_k tends to zero.

Theorem 8.3 is most simply stated under the following assumptions. Relaxations of these conditions, and a more general version of the theorem, are found in [16].

Assumption 8.1. The set F = { x ∈ Ω : f(x) ≤ f(x_0) } is bounded.

Assumption 8.2. The gradient of f is Lipschitz continuous with constant M on F.

If both assumptions hold, then there exists γ > 0 such that for all x ∈ F,

    ‖∇f(x)‖ < γ.    (8.4)

Theorem 8.3. Suppose that Assumptions 8.1 and 8.2 hold. Consider the linearly constrained GSS algorithm given in Figure 8.1, which satisfies Conditions 1, 2, and 3. If k is an unsuccessful iteration and ε_k = β_max Δ_k, then

    χ(x_k) ≤ (M/κ_min + γ/ν_min) β_max Δ_k + (1/(κ_min β_min)) ρ(Δ_k)/Δ_k.    (8.5)

Here, κ_min is from Condition 1, ν_min is from (5.2), M is from Assumption 8.2, γ is from (8.4), and β_max and β_min are from Condition 2.

Obviously, one can also stop after a specified number of objective evaluations, as we do in our testing in Section 9.

8.4. Scaling. The relative scaling of variables can have a profound effect on the efficiency of optimization algorithms. We allow as an option the common technique of shifting and rescaling the variables so that they have a similar range and size. We work in a computational space whose variables w are related to the original variables x via x = Dw + c, where D is a diagonal matrix with positive diagonal entries and c is a constant vector, both provided by the user. We solve the transformed problem

    minimize    f(Dw + c)
    subject to  D⁻¹(x_L − c) ≤ w ≤ D⁻¹(x_U − c)
                c_L − A_I c ≤ A_I D w ≤ c_U − A_I c
                A_E D w = b − A_E c.

The scaling scheme we use in the numerical tests presented in Section 9 is based on that described in Section 7.5.1 of [12]. The latter scaling is based on the range of the variables and transforms the variables to lie between −1 and 1. We modify the scaling scheme slightly by not scaling variables whose original range is narrow. One advantage of scaling is that it makes it easier to define a reasonable default value for Δ_0. For the results reported in Section 9, whenever scaling is applied to a problem, the default value for Δ_0 is 2.

8.5. Data caching. The potentially regular pattern to the steps we take means that we may revisit points. This is particularly true for unconstrained and bound constrained problems. To avoid reevaluating known values of f, we store the values of x at which we evaluate the objective and the corresponding objective values. Before evaluating f at a trial point x_t, we examine the data cache to see whether we have already evaluated f at a point within some relative distance of x_t. We do not evaluate f at x_t if we have already evaluated f at a point x for which

    ‖x_t − x‖ < r_tol ‖x_t‖.

By default, r_tol = 10⁻⁸.

Caching yields only minor improvements in efficiency if general linear inequality constraints are present. If there are a considerable number of general linear constraints (i.e., constraints other than bounds) in the working set, then there is only a small chance of reevaluating f near a previous iterate. Still, if each evaluation of the objective function is expensive, then it is worth avoiding any reevaluations.
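A minimal sketch of such a cache (illustrative only; a linear scan for clarity, not the paper's implementation):

    import numpy as np

    class EvalCache:
        # Skip evaluating f at x_t if some stored x satisfies
        # ||x_t - x|| < r_tol * ||x_t||, as in Section 8.5.
        def __init__(self, r_tol=1e-8):
            self.r_tol = r_tol
            self.points = []
            self.values = []

        def lookup(self, x_t):
            for x, fx in zip(self.points, self.values):
                if np.linalg.norm(x_t - x) < self.r_tol * np.linalg.norm(x_t):
                    return fx               # reuse the cached objective value
            return None                     # cache miss: evaluate f and insert

        def insert(self, x, fx):
            self.points.append(np.array(x, copy=True))
            self.values.append(fx)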

9. Some numerical illustrations. We present some illustrations of the behavior of our implementation of Algorithm 8.1 using problems from the CUTEr test suite [14, 13]. In several of these problems we must deal repeatedly with highly degenerate instances of the search direction calculation discussed in Section 5.3. As these examples show, the solution of such problems is quite tractable given the power of a good implementation of the Fourier-Motzkin Double Description algorithm.

All of the problems considered here have nonlinear objectives. We start with the value of x_0 specified by CUTEr, though we often had to project into the feasible region (as discussed in Section 8.1). In the tests we present here, we terminated the program when the number of evaluations of the objective reached that which would have been needed for 30 forward-difference evaluations of the gradient if one had used the equality constraints to eliminate variables. When the problem is scaled (as it is for all but one of the examples), we use Δ_0 = 2, as discussed in Section 8.4. While our implementation supports an option to increase Δ_k after a successful iteration, for the results reported here Δ_{k+1} = Δ_k whenever k ∈ S (i.e., φ_k = 1 for all k). Whenever k ∈ U, Δ_{k+1} = ½Δ_k (i.e., θ_k = ½ for all k).

We set ε_max = 2^5 Δ_0 so that it plays no role in the results reported here. Since we normalize all the search directions, β_max = 1. Thus, ε_k = Δ_k for all k. We use ρ(Δ) = αΔ² as our forcing function with α = 10⁻⁴, as noted in Section 8.2. For the results presented here, G_k consists of the generators for the ε-tangent cone T(x_k, ε_k). With one exception (for the purposes of illustrating the potential advantage of allowing a nonempty set H_k), H_k consists of the generators for the ε-normal cone N(x_k, ε_k) (projected into the nullspace of any equality constraints in the working set), as described in Section 6. Recall that by definition, D_k = G_k ∪ H_k.

As discussed in Section 7, we allow steps that stop at the boundary. For the search directions in H_k, we choose Δ_k^(i) to be the maximum value in [σ_tol Δ_k, Δ_k] satisfying x_k + Δ_k^(i) d_k^(i) ∈ Ω, where we use σ_tol = 10⁻³. Since we have occasion to mention run-times in passing, we note that the tests were run on a dual 3.2 GHz Xeon workstation with 1 GB of memory.

[Fig. 9.1. Results for the avion2 test problem: 49 variables, 15 linear equality constraints, and upper and lower bounds on all variables. Left: the value of the objective and of Δ plotted against the number of objective evaluations. Right: the size of the working set and the numbers of core and extra directions plotted against the iteration.]

The first problem is avion2, an aeronautical design problem. In this case the equality constraints have rank 15, leaving 34 degrees of freedom in the problem. The plot on the left in Figure 9.1 gives the iteration history for the value of the objective and the value of Δ, both plotted against the number of evaluations of the objective. Circles represent the value of the objective, with filled circles indicating unsuccessful iterations. The value of Δ is represented by a second marker. The optimal objective value (computed using npsol [11]) is the lowest value on the vertical scale.

The plot on the right in Figure 9.1 shows the size of the working set at each iteration, as well as the number of vectors in G_k, the set of generators for T(x_k, ε_k) (core directions), and the number of vectors in H_k, the extra search directions we include (extra directions). The set H_k has fewer vectors than the working set because we do not include outward-pointing normals for equality constraints in the working set. The solid line indicates the number of variables in the problem; when the size of the working set is above this line the algorithm must deal with the degeneracy discussed in Section 5.3.2. For avion2, a total of 0.12 seconds is spent in cddlib treating the degenerate cases.

The next test problem is spanhyd, a model of a hydroelectric system. This problem includes network constraints. Since the ranges allowed for some of the variables are small, we scale only a subset of the variables, though we still start with Δ_0 = 2. The plot on the left in Figure 9.2 shows that the algorithm makes good progress