
Recent Advances in Nonsmooth Optimization, pp. 57-86
Eds. D.-Z. Du, L. Qi and R.S. Womersley
© 1995 World Scientific Publishers

Projected Gradient Methods for Nonlinear Complementarity Problems via Normal Maps

Michael C. Ferris [1]
University of Wisconsin-Madison, Computer Sciences Department, Madison, WI 53706, USA

Daniel Ralph [2]
University of Melbourne, Department of Mathematics, Melbourne, Australia

Abstract

We present a new approach to solving nonlinear complementarity problems based on the normal map and adaptations of the projected gradient algorithm. We characterize a Gauss-Newton point for nonlinear complementarity problems and show that it is sufficient to check at most two cells of the related normal manifold to determine such points. Our algorithm uses the projected gradient method on one cell and n rays to reduce the normed residual at the current point. Global convergence is shown under very weak assumptions using a property called nonstationary repulsion. A hybrid algorithm maintains global convergence, with quadratic local convergence under appropriate assumptions.

1 Introduction

The nonlinear complementarity problem is to find a vector z ∈ IR^n satisfying

    f(z) ≥ 0,  z ≥ 0,  ⟨f(z), z⟩ = 0,    (NCP)

[1] The work of this author was based on research supported by the National Science Foundation grant CCR-9157632 and the Air Force Office of Scientific Research grant F49620-94-1-0036.
[2] The work of this author was based on research partially supported by the U.S. Army Research Office through the Mathematical Sciences Institute, Cornell University, the National Science Foundation, the Air Force Office of Scientific Research, the Office of Naval Research, under grant DMS-8920550, and the Australian Research Council.

where f: IR^n → IR^n is a smooth function and all vector inequalities are taken component-wise. In this paper, we will describe an algorithm for solving nonlinear complementarity problems that is computationally based on the projected gradient algorithm, and uses a reformulation of (NCP) as a system of nonsmooth equations. The algorithm is conceptually simple to implement and has a low cost per iteration, and we demonstrate its convergence properties assuming only that f is continuously differentiable.

The problem (NCP) can be reformulated using a normal map:

    0 = f_+(x) := f(x_+) + x − x_+,    (NE)

where x_+ is the Euclidean projection of x onto IR^n_+. Note that z solves (NCP) if and only if z − f(z) solves (NE), and x solves (NE) if and only if x_+ solves (NCP). Normal maps were introduced by Robinson in [32] (see also [29, 30]) and we note here simply that the formulation (NE) has some advantages over (NCP). For example, it is an equation rather than a system of inequalities and equalities; hence its examination from the viewpoint of equations may yield insight difficult to obtain otherwise. This has proven to be the case, as demonstrated by recent advances on nonsmooth Newton-like algorithms for (NE) in [5, 4, 12, 28, 34]. Nonsmoothness of the normal map, however, is the difficulty that must be confronted.

In fact, normal maps such as f_+ can be cast in a more general framework, where x_+ is replaced by π_C(x), the projection of x onto a nonempty closed convex set C. In this context, finding a zero of the normal map

    f_C(x) := f(π_C(x)) + x − π_C(x)

is equivalent to a nonlinear variational inequality [11] defined by the set C and the function f. In the special case where C = IR^n_+, f_C = f_+. For polyhedral C, the normal map f_C [31, 33] is intimately related to the normal manifold [32]. This manifold is constructed using the faces of the set C; it is a collection of n-dimensional polyhedral sets (called cells) which partition IR^n.

The normal map f_C is smooth in each cell of IR^n; nondifferentiability can occur only as x moves from one cell to another. A cell is sometimes called a piece of linearity. In the particular example resulting from nonlinear complementarity problems, where C = IR^n_+, the cells of the normal manifold are precisely the orthants of IR^n.

Practical Newton-like methods for (NE) solve a linear or piecewise linear model based at the kth iterate, x^k, to obtain the next iterate x^{k+1}. Unfortunately, this model is not always invertible, and this creates problems for defining algorithms and in computing x^{k+1}. In this paper, we are concerned with defining practical algorithms with strong global convergence properties for finding zeros of normal maps. Our goal is to obtain convergence, at least on a subsequence, to a Gauss-Newton point for normal maps. This generalizes the familiar notion from nonlinear equation theory, where a Gauss-Newton point is a stationary point for the problem of minimizing the Euclidean norm residual of the function.
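The correspondence between (NCP) and (NE) stated above is easy to exercise numerically. The sketch below is an illustration only (the affine test map f and the test point are made up, not from the paper): it evaluates the normal map f_+ for C = IR^n_+ and checks both directions of the equivalence on a tiny problem whose solution is known.

```python
import numpy as np

def f(z):
    # Hypothetical affine test map f(z) = z + q with q = (-1, 1);
    # the (NCP) solution is z* = (1, 0).
    return z + np.array([-1.0, 1.0])

def normal_map(x):
    # f_+(x) = f(x_+) + x - x_+, with x_+ the projection onto IR^n_+.
    xp = np.maximum(x, 0.0)
    return f(xp) + x - xp

# z* solves (NCP): f(z*) >= 0, z* >= 0, <f(z*), z*> = 0.
z_star = np.array([1.0, 0.0])
assert np.all(f(z_star) >= 0) and np.all(z_star >= 0)
assert abs(f(z_star) @ z_star) < 1e-12

# Then x* = z* - f(z*) solves (NE) ...
x_star = z_star - f(z_star)
assert np.allclose(normal_map(x_star), 0.0)

# ... and conversely (x*)_+ recovers the (NCP) solution.
assert np.allclose(np.maximum(x_star, 0.0), z_star)
```

Here x* = (1, −1) lies in the orthant IR_+ × IR_−, illustrating that a zero of the normal map need not itself be nonnegative; only its projection is.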

We are ultimately interested in zeros of f_C, but finding one may be on the level of difficulty of finding zeros of general nonlinear functions. We revert to considering the residual function

    θ(x) := (1/2)‖f_C(x)‖²,

which gives us a measure of the violation of satisfying f_C(x) = 0. Our aim in this paper is to develop a robust algorithm for minimizing θ that has a low cost per iteration. Note that θ is a piecewise smooth function.

In order to motivate our definition of Gauss-Newton points, let us first examine the notion of a Gauss-Newton point for nonlinear equations. This corresponds to the case where C = IR^n and f_C = f. A Gauss-Newton point for the smooth function f is a point x* ∈ IR^n such that x = x* minimizes the first-order model

    (1/2)‖f(x*) + ∇f(x*)(x − x*)‖²

of θ(x) over IR^n. For general C, we construct a piecewise linear model of the residual function based on the directional derivative f_C'(x*; ·).

There are several key ideas on which the development of this paper is based.

(i) The characterization of Gauss-Newton points for normal maps requires the stationarity of the residual function with respect to every cell that contains that Gauss-Newton point. Thus, for complementarity problems, we must examine up to 2^n orthants to determine whether or not x is a Gauss-Newton point of f_+. Our first key result is to show that it is sufficient to check at most two of these cells, independent of the magnitude of n. An alternative characterization given in this paper shows that one cell and at most n rays in neighboring cells need to be examined to verify stationarity of θ (or give a descent direction).

(ii) The inherent difficulty in defining an algorithm to determine a Gauss-Newton point is that one must be sure that the limit point of the algorithm is stationary for θ in each piece of smoothness (orthant) containing that limit point. The second key idea, motivated by the characterizations above, is to apply variants of the projected gradient method [2] simultaneously to a single cell and n rays, to reduce θ. This means that the work performed by the projected gradient algorithm at each step of the Gauss-Newton method is comparable to performing just two projected gradient steps.

(iii) Our algorithm depends heavily on the projected gradient method having Non-Stationary Repulsion or NSR (see Section 3). Simply stated, if an algorithm has NSR, then each nonstationary point has a neighborhood that can be visited by at most one iterate of the algorithm. The third key result is that the projected gradient algorithm and the adaptations that we use in our algorithm have NSR. This property forces our algorithm to generate a better point in a neighboring orthant if the limit point of the sequence is not stationary in such an orthant.

The paper is organized as follows. In Section 2 we define the notion of a Gauss-Newton point for f_+ and θ and prove several equivalent characterizations (Proposition 2.3). We give a testable regularity condition (Definition 2.4) that guarantees that such Gauss-Newton points are solutions of (NE). Section 3 outlines the nonstationary repulsion property and shows that any algorithm having NSR possesses strong global convergence properties (Theorem 3.2). We prove several technical results that are key to the convergence of our algorithms. A special case of these results is used to show that the projected gradient algorithm has NSR (Theorem 3.6). Section 4 contains a description of three algorithms and their convergence properties. Our main convergence result, Theorem 4.3, proves that the Gauss-Newton method we present is extremely robust: assuming only continuous differentiability of f, every limit point of the method is stationary for θ. No regularity assumptions on limit points are required. However, before proving this result, we outline a basic algorithm that can easily be shown to have NSR and hence global convergence under the same assumptions. Theorem 4.3 proves convergence of an extension of the basic algorithm that is motivated by the practical considerations of reducing the number of function and Jacobian evaluations. A Newton-based hybrid method with global and local quadratic convergence is given in Subsection 4.3. Some simple examples of the use of these algorithms conclude the paper.

There have been many other research papers devoted to solving nonlinear complementarity problems. Some of the more recent papers are mentioned below. There are several types of Newton methods for solving nonsmooth equations; see Subsection 4.3 for a brief introduction. Here we mention the following references on Newton methods for nonsmooth equations and extensions: [5, 4, 6, 12, 13, 15, 16, 20, 21, 26, 27, 28, 34]. A feature shared by "pure" Newton methods is the need for an invertible model function at the current iteration; applying the inverse of this model yields the next iterate. However, singularities occur in many problems (for instance, see [12]), causing numerical difficulties for, or outright failure of, these methods. To circumvent the singularity problem, several Gauss-Newton techniques for solving nonlinear complementarity problems have been proposed. These can be found in the references [1, 9, 19, 23, 22, 24]. Alternative techniques can be found in [8, 10, 14, 17, 18, 36].

Most of the notation in this paper is standard. We use IR^n to denote the n-dimensional real vector space, ⟨·,·⟩ for the inner product of two elements in this space, ‖·‖ for the associated Euclidean norm, and IB for the corresponding ball of vectors x such that ‖x‖ ≤ 1. For a differentiable function Φ: IR^n → IR^m, ∇Φ(x) ∈ IR^{m×n} represents the Jacobian of Φ evaluated at x, and ∇Φ(x)^T represents the transpose of this matrix. If Φ is only directionally differentiable, we denote the directional derivative mapping at x by Φ'(x; ·). Calligraphic upper case letters in general represent sets of indices; upper case letters represent sets or operators. If C is a convex set, the normal cone to C at a point x ∈ C is

    N_C(x) := {y: ⟨y, c − x⟩ ≤ 0, ∀c ∈ C}.

The tangent cone at x ∈ C is defined by T_C(x) := N_C(x)°, where for a given convex cone K, the polar cone is defined by

    K° := {y: ⟨y, k⟩ ≤ 0, ∀k ∈ K}.

Both the tangent and normal cones are empty at points x ∉ C. The Euclidean projection of x onto the set C is represented by π_C(x). A function Φ: C → IR^m is C¹ (continuously differentiable) if it is differentiable in the relative interior of C and, for each sequence {x^k} in the relative interior of C that converges (to a general point of C), {∇Φ(x^k)} is also convergent. If C is a polyhedral convex set and F is a face of C, then N_C(x) is the same set for every x in the relative interior of F [32]. We call this set N_C(F). A facet of C is a face that has dimension 1 less than C. Further definitions from convex analysis can be found in [35]. We may abuse notation, when there is no possibility of confusion, by writing θ_O instead of θ|_O to mean the restriction of θ to an orthant O (to be distinguished from a normal map involving π_O). Finally, throughout the paper the function f: C → IR^m is assumed to be C¹, and usually C = IR^n_+.

2 Gauss-Newton Points and Regularity

As we outlined in the introduction, a Gauss-Newton point for the smooth function f is a point x* ∈ IR^n such that x = x* minimizes the first-order model

    (1/2)‖f(x*) + ∇f(x*)(x − x*)‖²

of θ(x) = (1/2)‖f(x)‖² over IR^n. Equivalently, x* is a stationary point of θ, that is,

    ∇θ(x*) = ∇f(x*)^T f(x*) = 0.

Note again that for the remainder of this paper we assume that f is continuously differentiable on its domain (C or IR^n_+). In the general case, we approximate the normal map f_C(x) by the piecewise linear model f_C(x*) + f_C'(x*; x − x*), where the directional derivative f_C'(x*; ·) is a piecewise linear map. We can now define the notion of a Gauss-Newton point of f_C, which is based on this directional derivative.

Definition 2.1 Let x* ∈ IR^n. We say x* is a Gauss-Newton point for f_C if x = x* solves the problem

    min_x (1/2)‖f_C(x*) + f_C'(x*; x − x*)‖².    (1)

Equivalently, x* is a Gauss-Newton point if

    (1/2)‖f_C(x*)‖² ≤ (1/2)‖f_C(x*) + f_C'(x*; x − x*)‖²,  ∀x ∈ IR^n.

For the remainder of this paper we will consider only the special case of nonlinear complementarity problems, where C = IR^n_+. However, many of the results have analogues in the general polyhedral case.
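The piecewise smoothness of the residual θ introduced in Section 1 is easy to see in one dimension: crossing a cell (orthant) boundary of the normal manifold produces a kink. The following sketch is illustrative only (the test map f(z) = 2z + 1 is made up): it shows that θ is continuous at the boundary x = 0 but has different one-sided slopes there.

```python
def f(z):
    return 2.0 * z + 1.0  # hypothetical smooth test map, n = 1

def normal_map(x):
    xp = max(x, 0.0)
    return f(xp) + x - xp

def theta(x):
    # theta(x) = (1/2) ||f_+(x)||^2
    return 0.5 * normal_map(x) ** 2

# On each half-line f_+ is smooth:
#   x > 0: f_+(x) = 2x + 1;  x < 0: f_+(x) = x + 1.
# Both pieces give f_+(0) = 1, so theta is continuous at 0 ...
assert abs(theta(0.0) - 0.5) < 1e-12

# ... but the one-sided slopes of theta at 0 differ:
# slope from the right is f_+(0) * 2 = 2, from the left f_+(0) * 1 = 1.
h = 1e-7
right_slope = (theta(h) - theta(0.0)) / h
left_slope = (theta(0.0) - theta(-h)) / h
assert abs(right_slope - 2.0) < 1e-4
assert abs(left_slope - 1.0) < 1e-4
```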

2.1 Gauss-Newton points of complementarity problems

Using Definition 2.1, we see that x* is a Gauss-Newton point of f_+ if it solves (1) with f_C = f_+. To understand this more fully, we now investigate the directional derivative f_+' in more detail. We can easily calculate the directional derivative of the function x_+ at x in the direction d: it is the vector x_+'(d) in IR^n whose ith component is given by

    [x_+'(d)]_i = { d_i      if x_i > 0,
                    (d_i)_+  if x_i = 0,
                    0        if x_i < 0.

In fact x_+'(d) is exactly the projection of d onto the critical cone K(x) of IR^n_+ at x. This critical cone is the Cartesian product of n intervals in IR, the ith interval being

    K_i = { IR    if x_i > 0,
            IR_+  if x_i = 0,
            {0}   if x_i < 0.

Since f is continuously differentiable, f_+ is directionally differentiable: for x, d ∈ IR^n,

    f_+'(x; d) = ∇f(x_+) π_K(d) + d − π_K(d),

where the notation K = K(x) is used. As a function of d, the mapping on the right is exactly the normal map induced by the matrix ∇f(x_+) and the convex cone K, so

    f_+'(x; d) = (∇f(x_+))_K(d).

As mentioned above, the difficulty in determining whether a point x is a Gauss-Newton point is that we must examine potentially exponentially many pieces of smoothness of f_+, or pieces of linearity of (∇f(x_+))_K. In fact, the number of pieces of linearity of (∇f(x_+))_K is the number of orthants containing x, and is given by 2^m, where m is the number of components of x equal to zero. The next result removes this difficulty by showing that at most two pieces of linearity need to be considered.

We introduce some notation. Given an orthant O, let H_i be the half-line IR_+ or −IR_+, i = 1, ..., n, such that

    O = H_1 × ... × H_n.

The complement of O at a point x ∈ O is the orthant Õ given as the Cartesian product of half-lines H̃_i, where

    H̃_i = { H_i   if x_i ≠ 0,
            −H_i  if x_i = 0.

It may seem odd that the complement of O at an interior point x is O itself. This is actually quite natural in the context of stationary points of θ, because θ is differentiable at each interior point x of an orthant; hence the question of stationarity of θ at x is independent of other orthants. We next introduce the formal definition of a stationary point.
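The componentwise formula for x_+'(d) can be checked directly against the difference quotient ((x + td)_+ − x_+)/t, which is exact once t > 0 is small enough that x + td stays in one cell. This sketch uses made-up test vectors:

```python
import numpy as np

def proj_plus(x):
    return np.maximum(x, 0.0)

def dproj_plus(x, d):
    # [x_+'(d)]_i = d_i if x_i > 0, (d_i)_+ if x_i = 0, 0 if x_i < 0:
    # the projection of d onto the critical cone K(x).
    out = np.where(x > 0, d, 0.0)
    out = np.where(x == 0, np.maximum(d, 0.0), out)
    return out

x = np.array([2.0, 0.0, 0.0, -3.0])
d = np.array([1.5, -1.0, 2.0, 4.0])

expected = np.array([1.5, 0.0, 2.0, 0.0])
assert np.allclose(dproj_plus(x, d), expected)

# Difference quotient of the piecewise linear map x_+:
t = 1e-3
quotient = (proj_plus(x + t * d) - proj_plus(x)) / t
assert np.allclose(quotient, expected)
```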

Definition 2.2 If θ is directionally differentiable and Ω is a nonempty convex set, then x* is a stationary point for min_{x∈Ω} θ(x) if

    θ'(x*; d) ≥ 0,  ∀d ∈ T_Ω(x*).

Note that if Ω = IR^n, then a stationary point satisfies θ'(x*; d) ≥ 0 for all d ∈ IR^n.

Proposition 2.3 Given x* ∈ IR^n, let K be the critical cone to IR^n_+ at x*, O be any orthant containing x*, and Õ be the complement of O at x*. Suppose f is continuously differentiable; then the function θ, defined by

    θ(x) := (1/2)‖f_+(x)‖²,

is directionally differentiable and

    θ'(x*; d) = ⟨f_+(x*), f_+'(x*; d)⟩,  ∀d ∈ IR^n.    (2)

The following statements are equivalent:

1. x* is a Gauss-Newton point of f_+.
2. x* is a stationary point of min{θ(x): x ∈ IR^n}.
3. 0 ∈ ∇f(x*_+)^T f_+(x*) + K° and 0 ∈ f_+(x*) + K.
4. x* is stationary for both min{θ(x): x ∈ O} and min{θ(x): x ∈ Õ}.
5. x* is stationary for min{θ(x): x ∈ O} and for each 1-dimensional problem

    min{θ(x): x ∈ x* + N_O(F)},

where F is a facet of O containing x*.

Proof If statement 1 holds, then we define

    φ(x) := (1/2)‖f_+(x*) + f_+'(x*; x − x*)‖²,

and note that φ'(x*; h) = θ'(x*; h) for all h. Since x* is a Gauss-Newton point, it follows that θ'(x*; d) + o(‖d‖) ≥ 0 for all d, and hence that statement 2 holds by positive homogeneity. Conversely, if statement 2 holds, then for all d and λ > 0, 0 ≤ ⟨f_+(x*), f_+'(x*; d)⟩, so that

    φ(x* + λd) = (1/2)‖f_+(x*) + f_+'(x*; λd)‖²
               = (1/2)‖f_+(x*) + λ f_+'(x*; d)‖²
               = (1/2)‖f_+(x*)‖² + λ⟨f_+(x*), f_+'(x*; d)⟩ + (1/2)λ²‖f_+'(x*; d)‖²
               ≥ (1/2)‖f_+(x*)‖² = φ(x*).

Hence statement 1 holds.

Statement 2 means that ⟨f_+(x*), (∇f(x*_+))_K(d)⟩ ≥ 0 for all d ∈ IR^n. If d ∈ K, then (∇f(x*_+))_K(d) = ∇f(x*_+)d, so that ⟨f_+(x*), ∇f(x*_+)k⟩ ≥ 0 for all k ∈ K. Similarly, ⟨f_+(x*), ν⟩ ≥ 0 for all ν ∈ K°. This is exactly statement 3. Conversely, let d ∈ IR^n, and recall from the Moreau decomposition that d = k + ν, where k = π_K(d) and ν = π_{K°}(d). Using statement 3,

    ⟨f_+(x*), (∇f(x*_+))_K(d)⟩ = ⟨f_+(x*), ∇f(x*_+)k⟩ + ⟨f_+(x*), ν⟩ ≥ 0.

Thus statement 2 holds.

Clearly statement 2 implies statement 4. Suppose statement 4 holds. Consider a facet F of O containing x*. There is a unique index 1 ≤ i ≤ n such that neither e_i nor −e_i lies in F, where e_i is the vector in IR^n with component i equal to 1 and all other components equal to zero. Choose s = ±1 such that se_i ∉ O; then N_O(F) = {λse_i: λ ≥ 0}. Note further that se_i ∈ Õ. Thus statement 4 implies statement 5.

Suppose statement 5 holds and consider e = ±e_i for any index i. If x*_i ≠ 0, then stationarity of x* for min{θ(x): x ∈ O} yields that θ'(x*; e) ≥ 0. If x*_i = 0, then either e ∈ O or e ∈ N_O(F) for some facet F of O containing x*. Therefore θ'(x*; e) ≥ 0. It follows by linearity of θ'(x*; ·) on each orthant that θ'(x*; d) ≥ 0 for each d in each orthant, hence for d ∈ IR^n. This is statement 2.

The proof of the equivalence between statements 1, 2 and 3 in Proposition 2.3 can be immediately adapted to the case of a general polyhedral set C, with K then representing the critical cone to C at the point x*.

2.2 Regularity

We now turn to the question of when a Gauss-Newton point for f_+ is a solution of f_+(x) = 0. This is commonly called regularity, and we introduce a notion of regularity that is pertinent to our Gauss-Newton formulation. Recall from Proposition 2.3 that x is a Gauss-Newton point if and only if

    f_+(x) ∈ −K,  ∇f(x_+)^T f_+(x) ∈ −K°,

where K is the critical cone to IR^n_+ at x.
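For C = IR^n_+ the two cone inclusions in statement 3 are componentwise and therefore cheap to test once we know K_i for each index. The sketch below is illustrative only (the test map is made up, and the tolerance-based membership test is our own device, not an algorithm from the paper):

```python
import numpy as np

def is_gauss_newton_point(f, jac, x, tol=1e-10):
    # Componentwise test of the cone conditions of Proposition 2.3:
    #   0 in grad f(x_+)^T f_+(x) + K°  and  0 in f_+(x) + K,
    # for K the critical cone of IR^n_+ at x.
    xp = np.maximum(x, 0.0)
    r = f(xp) + x - xp          # f_+(x)
    g = jac(xp).T @ r           # grad f(x_+)^T f_+(x)
    for xi, ri, gi in zip(x, r, g):
        if xi > 0:              # K_i = IR: need gi = 0, ri free
            if abs(gi) > tol:
                return False
        elif xi == 0:           # K_i = IR_+: need ri <= 0 and gi >= 0
            if ri > tol or gi < -tol:
                return False
        else:                   # K_i = {0}: need ri = 0, gi free
            if abs(ri) > tol:
                return False
    return True

# Hypothetical test map f(z) = z - 1 (n = 1): z* = 1 solves (NCP),
# so x* = z* - f(z*) = 1 is a zero of f_+ and hence a Gauss-Newton point.
f = lambda z: z - np.ones_like(z)
jac = lambda z: np.eye(len(z))
assert is_gauss_newton_point(f, jac, np.array([1.0]))
# At x = 2 > 0, theta is smooth and grad f^T f_+ = 1 != 0: not a GN point.
assert not is_gauss_newton_point(f, jac, np.array([2.0]))
```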
A simple regularity condition would be

    f_+(x) ∈ −K,  ∇f(x_+)^T f_+(x) ∈ −K°  ⟹  f_+(x) = 0.

However, this condition is difficult to verify in most practical instances. In order to generate a more testable notion of regularity, we follow the development of Moré [19]. Here, f_+(x) is replaced by a general vector z, and extra conditions that

are satisfied by f_+(x) are used to weaken the regularity assumption. Thus we define

    P := {i: x_i > 0, [f_+(x)]_i > 0},
    N := {i: x_i > 0, [f_+(x)]_i < 0},
    C := {i: [f_+(x)]_i = 0},

and we note that [f_+(x)]_P > 0, [f_+(x)]_N < 0 and [f_+(x)]_C = 0.

Definition 2.4 A point x ∈ IR^n is said to be regular if the only z satisfying

    z ∈ −K,  ∇f(x_+)^T z ∈ −K°,  z_P ≥ 0,  z_N ≤ 0,  z_C = 0

is z = 0.

This condition is closely related to [19, Definition 3.1]. This is because x is regular if and only if

    z ≠ 0, z ∈ −K, z_P ≥ 0, z_N ≤ 0, z_C = 0  ⟹  ∇f(x_+)^T z ∉ −K°,

and the condition on the right is equivalent to the existence of p ∈ −K such that z^T ∇f(x_+)p > 0. In contrast to [19, 22], the point x is not constrained to be nonnegative. Using Definition 2.4, we can prove the following result.

Lemma 2.5 x is a regular stationary point for θ if and only if x solves (NE).

Proof If f_+(x) = 0 then C = {1, ..., n}, so z = z_C = 0. Further, using (2), x is stationary for θ. Conversely, if x is stationary, then Proposition 2.3 shows that z = f_+(x) satisfies all the relations required in the definition of regularity, and hence f_+(x) = 0.

We turn to the question of testing whether a point x is regular. [19, 22, 28] give several conditions on the Jacobian of f to ensure that x is regular in the sense defined in the corresponding paper. For brevity we only discuss the s-regularity condition of Pang and Gabriel [22], and do not repeat definitions here. Moré [19] argues that s-regularity is stronger than his regularity condition; a similar comparison between Definition 2.4 and s-regularity can be made. Here we make a new observation about s-regularity. To explain this, recall that the goal of [22] is to solve

    0 = Ψ(x) := (1/2)‖min{f(x), x}‖²,

where the min is taken component-wise; a solution of this equation solves (NCP) and vice versa. If x̄ is nonstationary for Ψ, then s-regularity of x̄ ensures that for some direction y ∈ IR^n and all x near x̄, y is a (strict) descent direction for Ψ at x, i.e. Ψ'(x; y) < 0; see [22, Lemmas 2, 6 and 7]. However, Ψ (like θ) is only piecewise smooth, and may have a nonstationary point which is a local minimum of some piece of smoothness of Ψ, contradicting the existence of such a direction y. So s-regularity is too strong in the context of this investigation.

In what follows, we give conditions that ensure x to be regular in the sense of Definition 2.4. These results are proven by adapting arguments from Moré [19]. A key construct in the results is the matrix

    J(x) := T^{-1} ∇f(x_+) T^{-1},  where T = diag(t_i),  t_i = { 1 if i ∈ P, −1 if i ∉ P }.

T is chosen so that every component of z̃ := Tz is nonnegative. Under this transformation, x is regular if

    0 ≠ z̃ ≥ 0, z̃_C = 0, z̃ ∈ −TK  ⟹  ∃p̃ ∈ −TK, z̃^T J(x)p̃ > 0,

where Δ := {i: [f_+(x)]_i ≠ 0, x_i ≥ 0}. Note that z̃_i = 0 when i ∉ Δ. The results we now give impose conditions on J(x) to guarantee regularity. We note that A ∈ IR^{n×n} is an S-matrix if there is an x > 0 with Ax > 0; see [3].

Theorem 2.6 Let J(x) = T^{-1}∇f(x_+)T^{-1}. If [J(x)]_EE is an S-matrix for some index set E with Δ ⊆ E ⊆ {i: x_i ≥ 0}, then x is regular.

Proof Since [J(x)]_EE is an S-matrix, there is some p̃_E > 0 such that [J(x)]_EE p̃_E > 0. Let p̃ be the vector in IR^n obtained by setting the other elements to zero, so that [J(x)p̃]_E > 0. Now 0 ≠ z̃ ≥ 0 and Δ ⊆ E, so z̃^T J(x)p̃ > 0. Also, −TK is the Cartesian product of

    (−TK)_i = { IR if x_i > 0, IR_+ if x_i = 0, {0} if x_i < 0 },  i = 1, ..., n.    (3)

Thus p̃ ∈ −TK, and hence x is regular.

A is a P-matrix if all its principal minors are positive. P-matrices are S-matrices [3, Corollary 3.3.5]. The following corollary is now immediate.

Corollary 2.7 If [∇f(x_+)]_ΔΔ is a P-matrix, then x is regular.

Proof The hypotheses imply that [J(x)]_ΔΔ is a P-matrix and hence an S-matrix.

To complete our discussion of tests for regularity, we give the following result. Recall that if A is partitioned in the form

    A = [ A_NN  A_NM ]
        [ A_MN  A_MM ],

and the matrix A_NN is nonsingular, then

    (A/A_NN) := A_MM − A_MN A_NN^{-1} A_NM

is called the Schur complement of A_NN in A. The proof of the following result is modeled after [19, Corollary 4.6].
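For small problems the matrix classes used in these regularity tests can be checked directly. The sketch below (with made-up illustrative matrices) verifies the P-matrix property by enumerating all principal minors, and computes a Schur complement exactly as defined above:

```python
import numpy as np
from itertools import combinations

def is_P_matrix(A):
    # A is a P-matrix if every principal minor is positive.
    n = A.shape[0]
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            if np.linalg.det(A[np.ix_(idx, idx)]) <= 0:
                return False
    return True

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
assert is_P_matrix(A)                       # minors: 2, 2, det(A) = 3
assert not is_P_matrix(np.array([[0.0, 1.0],
                                 [1.0, 0.0]]))  # 1x1 minors are 0

# Schur complement of A_NN in A for the partition N = {0}, M = {1}:
# (A/A_NN) = A_MM - A_MN A_NN^{-1} A_NM.
A_NN, A_NM = A[:1, :1], A[:1, 1:]
A_MN, A_MM = A[1:, :1], A[1:, 1:]
schur = A_MM - A_MN @ np.linalg.inv(A_NN) @ A_NM
assert np.allclose(schur, 1.5)              # 2 - (-1)(1/2)(-1) = 3/2

# Every P-matrix is an S-matrix (some x > 0 has Ax > 0): here x = (1, 1).
assert np.all(A @ np.ones(2) > 0)
```

Enumerating principal minors is exponential in n, so this is only a sanity-check device; it is not meant as a practical test for large matrices.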

Theorem 2.8 If [∇f(x_+)]_NN is nonsingular and the Schur complement of [∇f(x_+)]_NN in [J(x)]_ΔΔ is an S-matrix, then x is regular.

Proof Let A = [J(x)]_ΔΔ and partition A into

    [ A_NN  A_NM ]
    [ A_MN  A_MM ],

where A_NN = [∇f(x_+)]_NN and M := Δ \ N. We construct p̃_N, p̃_M such that

    [J(x)]_ΔΔ [p̃_N; p̃_M] > 0.    (4)

Let a > 0; then p̃_N, p̃_M solve

    [ A_NN  A_NM ] [p̃_N]   [a]
    [ A_MN  A_MM ] [p̃_M] = [q]

if and only if p̃_N, p̃_M solve

    [ A_NN  A_NM     ] [p̃_N]   [a                     ]
    [ 0     (A/A_NN) ] [p̃_M] = [q − A_MN A_NN^{-1} a].

Since (A/A_NN) is an S-matrix by assumption, there exists p̃_M > 0 with (A/A_NN)p̃_M > 0. Multiplying p̃_M by an appropriately large number gives (A/A_NN)p̃_M + A_MN A_NN^{-1}a > 0. It follows that q := (A/A_NN)p̃_M + A_MN A_NN^{-1}a > 0, and taking p̃_N = A_NN^{-1}(a − A_NM p̃_M) implies (4). Let p̃ ∈ IR^n be the vector constructed from p̃_M and p̃_N by adding appropriate zeros. Then it is easy to see that p̃ ∈ −TK; see (3). Furthermore, z̃^T J(x)p̃ > 0. Hence x is regular.

Note that [5, 12, 28] all assume that

    [∇f(x_+)]_EE is nonsingular and ([∇f(x_+)]_LL / [∇f(x_+)]_EE) is a P-matrix.    (5)

Here E := {i: x_i > 0} contains N, and L := {i: x_i ≥ 0} contains Δ. Theorem 2.8 requires the nonsingularity of a smaller matrix and a weaker assumption on the Schur complement. However, (5) guarantees regularity in the sense of Definition 2.4, as we now show.

Lemma 2.9 If (5) holds or, equivalently, the B-derivative f_+'(x; ·) is invertible, then

    z ∈ −K,  ∇f(x_+)^T z ∈ −K°  ⟹  z = 0,

and so x is regular.

Proof The equivalence between (5) and the existence of a Lipschitz inverse of $f_+'(\bar x; \cdot)$ is given by [28, Proposition 12]. Since all piecewise linear functions are Lipschitz, the claimed equivalence holds.

Suppose $z \in K$ and $\nabla f(\bar x_+)^T z \in K^\circ$. It follows that $z_i = 0$ for $i \notin L$. Also, $\nabla f(\bar x_+)^T z \in K^\circ$ implies
$$\left[\nabla f(\bar x_+)^T z\right]_E = 0, \qquad \left[\nabla f(\bar x_+)^T z\right]_M \le 0,$$
where $M := L \setminus E$. Using the invertibility assumption from (5) and $z \in K$ again, we see that
$$\left([\nabla f(\bar x_+)^T]_{MM} - [\nabla f(\bar x_+)^T]_{ME}\,[\nabla f(\bar x_+)^T]_{EE}^{-1}\,[\nabla f(\bar x_+)^T]_{EM}\right) z_M \le 0, \qquad z_M \ge 0.$$
The Schur complement is a P-matrix, and hence $z = 0$ follows from [3, Theorem 3.3.4]. □

3 Nonstationary Repulsion (NSR) of the Projected Gradient Method

Let $\Omega$ be a nonempty closed convex set in $\mathbb{R}^n$ and $\phi \colon \Omega \to \mathbb{R}$ be $C^1$. (We are thinking of $\Omega$ being an orthant $O$ and $\phi = \theta|_O$.) We paraphrase the description of the projected gradient (PG) algorithm given by Calamai and Moré [2] for the problem
$$\min_{x \in \Omega}\ \phi(x). \quad (6)$$
For any $\gamma > 0$, the first-order necessary condition for $x$ to be a local minimizer of this problem is that
$$\pi_\Omega(x - \gamma\nabla\phi(x)) = x.$$
When $x^k \in \Omega$ is nonstationary, a step length $\alpha_k > 0$ is chosen by searching the path
$$x^k(\alpha) := \pi_\Omega(x^k - \alpha\nabla\phi(x^k)), \qquad \alpha > 0.$$
Given constants $\gamma_1 > 0$, $\gamma_2 \in (0, 1)$, and $\mu_1$ and $\mu_2$ with $0 < \mu_1 \le \mu_2 < 1$, the step length $\alpha_k$ must satisfy
$$\phi(x^k(\alpha_k)) \le \phi(x^k) + \mu_1\langle \nabla\phi(x^k),\, x^k(\alpha_k) - x^k \rangle \quad (7)$$
and
$$\alpha_k \ge \gamma_1 \quad \text{or} \quad \alpha_k \ge \gamma_2\bar\alpha_k > 0, \quad (8)$$
where $\bar\alpha_k$ satisfies
$$\phi(x^k(\bar\alpha_k)) > \phi(x^k) + \mu_2\langle \nabla\phi(x^k),\, x^k(\bar\alpha_k) - x^k \rangle. \quad (9)$$
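The path search of (7)–(8) can be sketched concretely as follows. The function names and default constants below are our own choices, not the paper's; backtracking by halving makes condition (8) hold with $\gamma_2 = 1/2$.

```python
import numpy as np

def pg_step(phi, grad, project, x, alpha0=1.0, mu1=1e-4, gamma2=0.5, max_halve=30):
    """One projected-gradient step: search x(alpha) = proj(x - alpha*grad(x))
    for the sufficient-decrease condition (7); shrinking alpha geometrically
    keeps the lower-bound condition (8).  An illustrative sketch only."""
    g = grad(x)
    alpha = alpha0
    for _ in range(max_halve):
        xa = project(x - alpha * g)
        if phi(xa) <= phi(x) + mu1 * g.dot(xa - x):   # condition (7)
            return xa, alpha
        alpha *= gamma2                               # condition (8)
    return x, 0.0

# Example: minimize phi(x) = (1/2)||x - c||^2 over the nonnegative orthant.
c = np.array([1.0, -1.0])
phi = lambda x: 0.5 * np.sum((x - c) ** 2)
grad = lambda x: x - c
project = lambda z: np.maximum(z, 0.0)

x1, a1 = pg_step(phi, grad, project, np.array([0.0, 2.0]))
```

On this example the full step $\alpha = 1$ already satisfies (7) and lands on the constrained minimizer $(1, 0)$.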

Condition (7) forces $\alpha_k$ not to be too large; it is the analogue of the condition used in the standard Armijo line search for unconstrained optimization. Condition (8) forces $\alpha_k$ not to be too small; in the case that $\alpha_k < \gamma_1$, this requirement is the analogue of the standard Wolfe–Goldstein [7] condition from unconstrained optimization.

The PG method is a feasible point algorithm in that it requires a starting point $x^0 \in \Omega$ and produces a sequence of iterates $\{x^k\} \subseteq \Omega$. It is also monotonic; that is, if $x^k \in \Omega$ is nonstationary, then $\phi(x^{k+1}) < \phi(x^k)$. We claim that the PG method has NSR:

Definition 3.1 An iterative feasible point algorithm for (6) has nonstationary repulsion (NSR) if for each nonstationary $\bar x \in \Omega$ there exists a neighborhood $V$ of $\bar x$ such that if any iterate $x^k$ lies in $V \cap \Omega$, then $\phi(x^{k+1}) < \phi(\bar x)$.

The fact that the steepest descent method, i.e. the PG method when $\Omega = \mathbb{R}^n$, has NSR is easy to see. Also, Polak [25, Chapter 1] discusses a general descent property that is similar to NSR and provides convergence results like Theorem 3.2 below. It is trivial but important that NSR yields strong global convergence properties:

Theorem 3.2 Suppose $\mathcal{A}$ is a monotonic feasible point algorithm for (6) with NSR. Let $x^0 \in \Omega$.
1. Any limit point of the sequence generated by $\mathcal{A}$ is stationary.
2. Let $\mathcal{B}$ be any monotonic feasible point algorithm for (6). Suppose $\{x^k\}$ is a sequence defined by applying either $\mathcal{A}$ or $\mathcal{B}$ to each $x^k$. Then any limit point of $\{x^k\}_{k \in K}$ is stationary if $\mathcal{A}$ is applied infinitely many times, where $K = \{k : x^{k+1} \text{ is generated by } \mathcal{A}\}$.

Proof
1. This is a corollary of part 2 of the theorem.
2. Let $\bar x \in \Omega$ be nonstationary for (6) and suppose $K$ has infinite cardinality. NSR gives $\delta > 0$ such that $\phi(x^{k+1}) < \phi(\bar x)$ if $x^k \in (\bar x + \delta\mathbb{B}) \cap \Omega$. If the subsequence $\{x^k\}_K$ does not intersect $(\bar x + \delta\mathbb{B}) \cap \Omega$, then $\bar x$ is not a limit point of this subsequence. So we assume that $x^k \in (\bar x + \delta\mathbb{B}) \cap \Omega$ for some $k \in K$, hence $\phi(x^{k+1}) < \phi(\bar x)$.
By continuity of $\phi$ there is $\delta_1 \in (0, \delta)$ such that $\phi(x) > \phi(x^{k+1})$ if $x \in (\bar x + \delta_1\mathbb{B}) \cap \Omega$. By monotonicity of $\mathcal{A}$ and $\mathcal{B}$, $\phi(x^{k+j}) \le \phi(x^{k+1})$ for each $j \ge 1$; hence $\bar x$ is not a limit point of $\{x^{k+j}\}_{j \ge 1}$, or of $\{x^k\}_K$. □

Of course, NSR is not a guarantee of convergence. To guarantee existence of a limit point of a sequence produced by a method with NSR we need additional knowledge, for instance boundedness of the lower level set
$$\{x : \phi(x) \le \phi(x^0)\},$$

where $x^0$ is the starting iterate.

To prove that the PG method has NSR we need to establish that the rate of descent obtained along the path $x(\alpha) = \pi_\Omega(x - \alpha\nabla\phi(x))$ is uniform for feasible $x$ in a neighborhood of a given nonstationary $\bar x \in \Omega$. The lemma below states a uniform descent property for all small perturbations $\psi$ of a given function $\phi$; the reader may consider $\psi = \phi$ for simplicity. In the case where many functions are present we use the notation
$$x_\psi(\alpha) := \pi_\Omega(x - \alpha\nabla\psi(x)).$$

Definition 3.3
1. Let $\bar x \in \mathbb{R}^n$ and $\delta > 0$. If $\phi \colon \Omega \to \mathbb{R}$ is $C^1$, the modulus of continuity of $\nabla\phi$ at $\bar x$ is the function of $\delta$ and $\tau > 0$ (and $\bar x$, $\phi$)
$$\omega_\phi(\delta; \tau) := \sup\{\|\nabla\phi(y) - \nabla\phi(x)\| : x, y \in \Omega,\ \|x - \bar x\| \le \delta,\ \|x - y\| \le \tau\}.$$
2. Let $\bar x \in \mathbb{R}^n$, $\delta > 0$, and $\phi \colon \mathbb{R}^n \to \mathbb{R}$ be $C^1$. Given $\epsilon > 0$, let $U(\epsilon) = U(\epsilon; \delta; \bar x; \phi)$ be the set of all $C^1$ functions $\psi \colon \Omega \to \mathbb{R}$ such that
$$\sup\{|\psi(x) - \phi(x)| + \|\nabla\psi(x) - \nabla\phi(x)\| : x \in (\bar x + \delta\mathbb{B}) \cap \Omega\} < \epsilon \quad (10)$$
and
$$\omega_\psi(\delta; \tau) \le (1 + \epsilon)\,\omega_\phi(\delta; \tau), \qquad \forall \tau \in (0, \delta). \quad (11)$$

Lemma 3.4 Let $\phi \colon \mathbb{R}^n \to \mathbb{R}$ be $C^1$, $\delta > 0$ and $\bar x \in \Omega$ be nonstationary for $\min\{\phi(x) : x \in \Omega\}$. There exist positive constants $\epsilon_1$ and $\rho$ such that for each $\psi \in U(\epsilon_1) = U(\epsilon_1; \delta; \bar x; \phi)$, $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$, and $\alpha \ge 0$,
$$\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle \le -\rho\min\{\alpha, \epsilon_1\}. \quad (12)$$

Proof Let $\psi \colon \Omega \to \mathbb{R}$ be $C^1$, $x \in \Omega$ and $\alpha > 0$. According to [2, (2.4)],
$$\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle \le -\|x_\psi(\alpha) - x\|^2/\alpha, \qquad \forall x \in \Omega,\ \alpha > 0.$$
Moreover, [2, Lemma 2.2] says that, as a function of $\alpha > 0$, $\|x_\psi(\alpha) - x\|/\alpha$ is antitone (nonincreasing); in particular, for any $\bar\alpha > 0$,
$$\|x_\psi(\alpha) - x\|/\alpha \ge \|x_\psi(\bar\alpha) - x\|/\bar\alpha, \qquad \forall \alpha \in (0, \bar\alpha).$$
Using this with the previous inequality, we deduce for any $\bar\alpha > 0$ that
$$\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle \le -\left(\|x_\psi(\alpha) - x\|/\alpha\right)^2\alpha, \qquad \forall x \in \Omega,\ \alpha > 0,$$
$$\phantom{\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle} \le -\left(\|x_\psi(\bar\alpha) - x\|/\bar\alpha\right)^2\alpha, \qquad \forall x \in \Omega,\ 0 < \alpha < \bar\alpha. \quad (13)$$

Fix $\bar\alpha > 0$. By hypothesis, the point $\bar x$ is such that $\|\bar x_\phi(\bar\alpha) - \bar x\| > 0$. Also, by (10), if $x \to \bar x$ and $\psi \to \phi$, where convergence of $\psi$ means that $\psi \in U(\epsilon)$ and $\epsilon \downarrow 0$, then $x_\psi(\bar\alpha)$ converges to $\bar x_\phi(\bar\alpha)$. Hence there are $\epsilon_1 > 0$, $\rho > 0$ such that for $\psi \in U(\epsilon_1)$ and $x \in \bar x + \epsilon_1\mathbb{B}$,
$$\|x_\psi(\bar\alpha) - x\| \ge \bar\alpha\sqrt{\rho}.$$
Together with (13), this yields
$$\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle \le -\rho\alpha, \qquad \forall x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega,\ \alpha \in [0, \bar\alpha].$$
Replacing $\epsilon_1$ by $\min\{\epsilon_1, \bar\alpha\}$, (12) holds for $0 \le \alpha \le \epsilon_1$. Using the well-known antitone property of $\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle$ in $\alpha \ge 0$, see [2, (2.6)], we see that (12) also holds for $\alpha > \epsilon_1$. □

The following result gives some technical properties of the PG method that will be important for our main algorithm. We use it later to prove that the PG method has NSR, though in this case NSR follows from the simpler case in which $\psi$ is a fixed function.

Proposition 3.5 Let $\phi \colon \mathbb{R}^n \to \mathbb{R}$ be $C^1$, $\delta > 0$ and $\bar x \in \Omega$ be nonstationary for $\min\{\phi(x) : x \in \Omega\}$. Then there are positive constants $\epsilon_1$ and $\bar\alpha$ such that for each $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$ and $\psi \in U(\epsilon_1) = U(\epsilon_1; \delta; \bar x; \phi)$:
1. For each $\alpha \in [0, \bar\alpha]$,
$$\psi(x_\psi(\alpha)) \le \psi(x) + \mu_1\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle.$$
2. One step of PG on (6) from $x$ generates $x_\psi(\alpha)$ with $\alpha \ge \bar\alpha$.

Proof Suppose $\phi$, $\bar x$ and $\delta$ are as stated. Let $\gamma_1 > 0$, $\gamma_2 \in (0, 1)$ and $0 < \mu_1 \le \mu_2 < 1$ be the constants of the PG method. Let $\epsilon_1$, $\rho$ be given by Lemma 3.4, let $\psi \in U(\epsilon_1)$, and assume without loss of generality that $\epsilon_1 \in (0, \delta]$, i.e. (10) and (11) hold with $\epsilon = \delta = \epsilon_1$. We estimate the error term
$$\varepsilon(\alpha; x, y) := \psi(y) - \psi(x) - \langle \nabla\psi(x),\, y - x \rangle,$$
where $y, x \in \Omega$. By choice of $\psi \in U(\epsilon_1)$, specifically (11) with $\epsilon = \delta = \epsilon_1$, for each $\tau \in (0, \epsilon_1)$, $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$ and $y \in (x + \tau\mathbb{B}) \cap \Omega$,
$$\|\nabla\psi(x) - \nabla\psi(y)\| \le (1 + \epsilon_1)\,\omega_\phi(\epsilon_1; \tau).$$

Thus, for $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$ and $y \in (x + \epsilon_1\mathbb{B}) \cap \Omega$,
$$|\varepsilon(\alpha; y, x)| = \left|\int_0^1 \langle \nabla\psi(x + t(y - x)) - \nabla\psi(x),\, y - x \rangle\, dt\right| \le (1 + \epsilon_1)\,\omega_\phi(\epsilon_1; \|x - y\|)\,\|y - x\|/2. \quad (14)$$
By continuity of $\nabla\phi$ on the compact set $(\bar x + \epsilon_1\mathbb{B}) \cap \Omega$, there is a finite upper bound $\lambda$ on $\|\nabla\phi(x)\|$ for $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$. Define $\bar\lambda = \lambda + \epsilon_1$; by choice of $\psi \in U(\epsilon_1)$, specifically (10) with $\epsilon = \delta = \epsilon_1$, $\|\nabla\psi(x)\| \le \bar\lambda$ for $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$. It follows for such $x$ and any $\alpha \ge 0$ that
$$\|x_\psi(\alpha) - x\| \le \alpha\bar\lambda, \quad (15)$$
because $\pi_\Omega$ is Lipschitz of modulus 1. Furthermore, since $\nabla\phi$ is uniformly continuous on compact sets, $\omega_\phi(\epsilon_1; \tau) \downarrow 0$ as $\tau \downarrow 0$. Thus, using the fact that $\omega_\phi(\epsilon_1; \cdot)$ is nondecreasing, there exists $\epsilon_2 \in (0, \epsilon_1)$ such that for $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$ and $\alpha \in (0, \epsilon_2)$, both
$$\alpha\bar\lambda \le \epsilon_1 \qquad \text{and} \qquad (1 + \epsilon_1)\,\omega_\phi(\epsilon_1; \alpha\bar\lambda)\,\bar\lambda \le 2\rho(1 - \mu_2).$$
From these inequalities and the inequalities (14) and (15), we see that for $x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$ and $\alpha \in (0, \epsilon_2)$,
$$\varepsilon(\alpha; x_\psi(\alpha), x) \le \rho(1 - \mu_2)\,\alpha. \quad (16)$$
Now, for such $x$ and $\alpha$,
$$\begin{aligned} \psi(x_\psi(\alpha)) &= \psi(x) + [\mu_2 + (1 - \mu_2)]\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle + \varepsilon(\alpha; x_\psi(\alpha), x) \\ &\le \psi(x) + \mu_2\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle - (1 - \mu_2)\rho\alpha + (1 - \mu_2)\rho\alpha, \end{aligned}$$
where the inequality relies on the uniform descent property of Lemma 3.4 and on (16). Thus
$$\psi(x_\psi(\alpha)) \le \psi(x) + \mu_2\langle \nabla\psi(x),\, x_\psi(\alpha) - x \rangle, \qquad \forall x \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega,\ \alpha \in (0, \epsilon_2),$$
and this inequality with $\mu_1$ replacing $\mu_2$ also holds. Finally, for any $x^k \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$, the auxiliary scalar $\bar\alpha_k$ satisfying (9) is bounded below by $\epsilon_2$; hence the step size $\alpha_k$ is bounded below by $\bar\alpha := \min\{\gamma_1, \gamma_2\epsilon_2\}$. Since $0 < \gamma_2 < 1$, we have $\bar\alpha \le \epsilon_2 < \epsilon_1$, and parts 1 and 2 of the proposition hold. □

Theorem 3.6 The PG method applied to (6) has NSR.

Proof Let $\bar x \in \Omega$ be nonstationary. According to Proposition 3.5 and Lemma 3.4, if $x^k \in (\bar x + \epsilon_1\mathbb{B}) \cap \Omega$ then $\alpha_k \ge \bar\alpha$ and
$$\phi(x^{k+1}) \le \phi(x^k) + \mu_1\langle \nabla\phi(x^k),\, x^k(\alpha_k) - x^k \rangle \le \phi(x^k) - 2\bar\epsilon, \quad (17)$$
where $\bar\epsilon = \mu_1\rho\min\{\bar\alpha, \epsilon_1\}/2 > 0$. Now, by continuity of $\phi$ there is $\delta \in (0, \epsilon_1)$ such that
$$|\phi(x) - \phi(\bar x)| \le \bar\epsilon, \qquad \forall x \in (\bar x + \delta\mathbb{B}) \cap \Omega.$$
Take $V := \bar x + \delta\mathbb{B}$, and use the above inequality with (17) to see that for any $x^k \in V \cap \Omega$,
$$\phi(x^{k+1}) \le \phi(\bar x) - \bar\epsilon.$$
The NSR property of Definition 3.1 follows. □

4 Projected Gradient Algorithms for NCP

Our main goal here is to present a method for minimizing $\theta$ that has a low computational cost and has NSR. Before proceeding we make a few comments on guaranteeing convergence, at least on a subsequence. Existence of a (stationary) limit point of a sequence produced by a method with NSR follows from boundedness of the lower level set
$$\{x \in \mathbb{R}^n : \|f_+(x)\| \le \|f_+(x^0)\|\},$$
where $x^0$ is the initial point. This boundedness property holds in many cases, for instance if $f$ is a uniform P-function, see Harker and Xiao [12]; hence if $f$ is strongly monotone. However, the uniform P-function property implies that $f_+'(x; \cdot)$ is invertible for each $x$, a condition that we believe is too strong in general (cf. Lemma 2.9). A weaker condition yielding boundedness of the above level set is that $f_+$ is proper, namely that the inverse image $f_+^{-1}(S)$ of any compact set $S \subseteq \mathbb{R}^n$ is compact.

4.1 A simple globally convergent algorithm

Given statement 4 of Proposition 2.3, it is tempting to use the following steepest descent idea in algorithms for minimizing $\theta$. Given the $k$th iterate $x^k \in \mathbb{R}^n$, an orthant $O^k$ containing $x^k$ and the complement $\tilde O^k$ of $O^k$ at $x^k$, let $d^k$ solve
$$\min_d\ \theta'(x^k; d) \quad \text{subject to} \quad \|d\| \le 1,\ d \in O^k \cup \tilde O^k.$$
This essentially requires two $n$-dimensional convex quadratic programs to be solved (a polyhedral norm on $d$ may be used), one for each orthant. If $d = 0$ is a solution,

then $x^k$ is stationary for $\theta$. Otherwise $\theta'(x^k; d^k) < 0$, and we can perform a line search to establish $\alpha_k > 0$ such that for $x^{k+1} = x^k + \alpha_k d^k$, $\theta(x^k + \alpha_k d^k)$ is strictly less than $\theta(x^k)$. However, if $\theta$ is nonsmooth there seems to be little global convergence theory for algorithms based on this idea. For instance, it is not known whether the step length $\alpha_k$ can be chosen to be uniformly large in a neighborhood of a nonstationary point, while still retaining a certain rate of descent; hence it is hard to show that the sequence produced will not accumulate at a nonstationary point. Pang, Han and Rangaraj [23, Corollary 1] give an additional smoothness assumption at a limit point that is required to prove stationarity.

Alternatively, given the stationarity characterization of Proposition 2.3.5, we can design a naive steepest descent algorithm for minimizing $\theta$, each iteration of which is based on a projected gradient step over an orthant $O^k$ containing the current iterate $x^k$, and an additional $m$ projected gradient steps on 1-dimensional problems corresponding to moving in directions normal to the $m$ facets of $O^k$ that contain $x^k$ (so $m$ is the number of zero components of $x^k$). It is significant that, to obtain global convergence, we only need to increase the number of 1-dimensional subproblems at each iteration from $m$ to $n$, i.e. normals to all facets of $O^k$ must be examined.

The algorithm below introduces notation not strictly required for its statement; this notation is presented in preparation for the main algorithm, Algorithm 2, which appears in the next subsection. By $\theta_O$ we mean the restriction $\theta|_O$ of $\theta$ to $O$.

Algorithm 1. Let $x^0 \in \mathbb{R}^n$. Given $k \in \{0, 1, 2, \ldots\}$ and $x^k \in \mathbb{R}^n$, define $x^{k+1}$ as follows.

Choose any orthant $O^k$ containing $x^k$, let $y^0(\alpha) := \pi_{O^k}[x^k - \alpha\nabla\theta_{O^k}(x^k)]$, and let $\alpha_0$ be the step size determined by one step of the projected gradient algorithm applied to $\min\{\theta(x) : x \in O^k\}$ from $x^k$. Suppose $F_1, \ldots, F_n$ are the facets of $O^k$.
For $j = 1, \ldots, n$, let $y^j := \pi_{F_j}(x^k)$, $N_j := N_{O^k}(F_j)$, $y^j(\alpha) := y^j + \pi_{N_j}[-\alpha\nabla\theta(y^j)]$, and let $\alpha_j$ be the step size determined by one step of the projected gradient algorithm applied to
$$\min\{\theta(x) : x \in y^j + N_j\}, \quad (18)$$
starting from $y^j$. Let
$$x^{k+1} := y^{\hat\jmath}(\alpha_{\hat\jmath}), \qquad \text{where } \hat\jmath \in \operatorname{argmin}\{\theta(y^j(\alpha_j)) : j = 0, 1, \ldots, n\}.$$
If $\theta(x^{k+1}) = \theta(x^k)$ then STOP; $x^k$ is a Gauss–Newton point of $f_+$.

Remark. In Algorithm 1 the projected gradient method is used as a subroutine. Therefore we assume that if the starting point of a subproblem is stationary, then the projected gradient method merely returns this point; the decision of whether or not the main algorithm should continue is made elsewhere.
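The structure of Algorithm 1 — one PG step on the orthant subproblem plus $n$ one-dimensional PG steps along facet normals, keeping the best candidate — can be illustrated on a small affine example. Everything below (the data $M$, $q$ and the helper names) is our own illustration, not code from the paper; we take $f(z) = Mz + q$ with $M = I$, so $\theta$ is piecewise quadratic and the orthant-restricted gradients are easy to write down.

```python
import numpy as np

# Sketch of Algorithm 1 for the affine NCP f(z) = M z + q (an LCP), using the
# normal map f_+(x) = f(x_+) + x - x_+ and theta(x) = (1/2)||f_+(x)||^2.

M = np.eye(2)
q = np.array([-1.0, 1.0])
n = 2

def f_plus(x):
    xp = np.maximum(x, 0.0)
    return M @ xp + q + x - xp

def theta(x):
    r = f_plus(x)
    return 0.5 * r @ r

def grad_theta(sign, x):
    # Gradient of theta restricted to the orthant with sign pattern `sign`:
    # there x_+ = D x with D = diag(max(sign_i, 0)), so the Jacobian of the
    # normal map on that orthant is J = M D + I - D.
    D = np.diag((sign > 0).astype(float))
    J = M @ D + np.eye(n) - D
    return J.T @ f_plus(x)

def armijo(path, g, base, mu1=1e-4, beta=0.5, tries=30):
    # One PG step: search the path for the sufficient-decrease condition (7).
    alpha, th0 = 1.0, theta(base)
    for _ in range(tries):
        y = path(alpha)
        if theta(y) <= th0 + mu1 * g @ (y - base):
            return y
        alpha *= beta
    return base

def algorithm1_step(x):
    s = np.where(x >= 0, 1.0, -1.0)              # orthant O^k containing x^k
    g = grad_theta(s, x)
    proj_O = lambda z: s * np.maximum(s * z, 0.0)
    cands = [armijo(lambda a: proj_O(x - a * g), g, x)]   # orthant subproblem
    for j in range(n):                            # 1-D facet-normal subproblems
        y = x.copy(); y[j] = 0.0                  # y^j = projection of x^k on F_j
        sj = s.copy(); sj[j] = -s[j]              # orthant across facet F_j
        gj = grad_theta(sj, y)
        d = -s[j]                                 # direction spanning N_{O^k}(F_j)

        def path(a, y=y, gj=gj, d=d, j=j):
            z = y - a * gj                        # unconstrained gradient point
            out = y.copy()
            out[j] = d * max(d * z[j], 0.0)       # project onto the ray y^j + N_j
            return out

        cands.append(armijo(path, gj, y))
    return min(cands, key=theta)                  # best of the n+1 candidates

x = np.array([0.0, 0.0])
vals = [theta(x)]
for _ in range(10):
    x = algorithm1_step(x)
    vals.append(theta(x))
```

Each iterate is the best of $n + 1$ candidates, so $\theta$ is nonincreasing; on this example the iterates reach the normal-map zero $x^* = (1, -1)$, whose positive part $x^*_+ = (1, 0)$ solves the NCP.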

Theorem 4.1 Algorithm 1 is well defined and has NSR.

Proof Since the projected gradient method is well defined, for each $k$ and $x^k$ the algorithm produces $x^{k+1}$. If $\theta(x^{k+1}) = \theta(x^k)$, then none of the subproblems of the form (18) produced a point with a lower function value than $\theta(x^k)$. So $x^k$ is stationary for each subproblem for which $F_j$ is a facet of $O^k$ containing $x^k$, and by Proposition 2.3, $x^k$ is also a Gauss–Newton point of $f_+$. Thus Algorithm 1 is well defined.

We show that the algorithm has NSR. Suppose $\bar x$ is not a Gauss–Newton point of $f_+$. For $x^k$ sufficiently close to $\bar x$, $\bar x \in O^k$. So consider the case when $O^k = \bar O$ for some fixed orthant $\bar O$ containing $\bar x$. By Proposition 2.3, $\bar x$ is nonstationary either for $\min\{\theta(x) : x \in \bar O\}$ or for $\min\{\theta(x) : x \in \bar x + N_{\bar O}(\bar F)\}$, where $\bar F$ is some facet of $\bar O$ containing $\bar x$. In the former case, for some $\bar\delta = \bar\delta(\bar O) > 0$ and each $x^k \in \bar x + \bar\delta\mathbb{B}$, we have from Theorem 3.6, with $\phi = \theta_{\bar O}$ and $\Omega = \bar O$, that the candidate $y^0(\alpha_0)$ for the next iterate $x^{k+1}$ yields $\theta(y^0(\alpha_0)) < \theta(\bar x)$. Hence our choice of $x^{k+1}$ also yields $\theta(x^{k+1}) < \theta(\bar x)$. In the latter case, we can apply Proposition 3.5 by reformulating the subproblem (18) as $\min\{\theta(y^j + d) : d \in N_{\bar O}(\bar F)\}$, i.e. define $\psi(d) = \theta(y^j + d)$, $\phi(d) = \theta(\bar x + d)$, $\Omega = N_{\bar O}(\bar F)$, and $\delta$ as any positive constant, and let $\epsilon_1 > 0$ be the constant given by Proposition 3.5. Given the simple form of $\Omega$, it is easy to check that there is $\bar\delta = \bar\delta(\bar O) > 0$ such that if $x^k \in \bar x + \bar\delta\mathbb{B}$, then $\psi \in U(\epsilon_1; \delta; \bar x; \phi)$. For such $x^k$, Proposition 3.5 says that the candidate iterate $y^j(\alpha_j)$ yields $\theta(y^j(\alpha_j)) < \theta(\bar x)$, hence $\theta(x^{k+1}) < \theta(\bar x)$. Since there are only finitely many orthants, we conclude that for some $\bar\delta > 0$ independent of $O^k$, and each $x^k \in \bar x + \bar\delta\mathbb{B}$, we have $\theta(x^{k+1}) < \theta(\bar x)$. □

This algorithm is extremely robust: under the single assumption that $f$ is $C^1$ on $\mathbb{R}^n_+$, the method is well defined and accumulation points are always Gauss–Newton points. It is also reasonably simple, using the projected gradient method as the workhorse.
A serious drawback of Algorithm 1 is that we need at least $n + 1$ function and Jacobian evaluations per iteration, in order to carry out the projected gradient method on the $n + 1$ subproblems. By contrast, the use of 1-dimensional subproblems means the linear algebra performed by Algorithm 1 is only around twice as expensive as the linear algebra needed to perform one projected gradient step on an orthant.

4.2 An efficient globally convergent algorithm

We present a globally convergent method for finding Gauss–Newton points of $f_+$ based on the PG method. It is efficient in the sense that, per iteration, the number of function evaluations is comparable to that needed for the PG method applied to minimizing a smooth function over an orthant, and the linear algebra computation involves about double the work required for linear algebra in the PG method.

At each iteration, we approximate $\theta$ by linearizing $f$ about $x^k_+$. Let
$$A^k(x) := \tfrac{1}{2}\|L^k_+(x)\|^2, \qquad \text{where} \qquad L^k_+(x) := f(x^k_+) + \nabla f(x^k_+)(x_+ - x^k_+) + x - x_+. \quad (19)$$
The "linearization" $L^k_+$ is a local point-based approximation [34] when $\nabla f$ is locally Lipschitz, and more generally a uniform first-order approximation near $x^k$ [28]; such approximations are more powerful than directional derivatives in that they approximate $f_+$ uniformly well for all $x$ near $x^k$. In [5, 4, 28, 34] these approximation properties have been exploited to give strong convergence results for Newton methods applied to nonsmooth equations like $f_+(x) = 0$. Our main algorithm, below, and its extremely robust convergence behavior also rely on these approximation properties.

Lemma 4.2 Let $\bar x \in \mathbb{R}^n$ and $\delta > 0$. There is a non-decreasing function $\hat\varepsilon \colon \mathbb{R}_+ \to \mathbb{R}_+$ such that $\hat\varepsilon(\tau) = o(\tau)$ as $\tau \downarrow 0$, and for each $x^k, x \in \bar x + \delta\mathbb{B}$,
$$|\theta(x) - A^k(x)| \le \hat\varepsilon(\|x - x^k\|).$$

Proof We have
$$|\theta(x) - A^k(x)| = \tfrac{1}{2}\left|\left\langle f_+(x) + L^k_+(x),\ f_+(x) - L^k_+(x) \right\rangle\right| \le c\,\|f_+(x) - L^k_+(x)\|,$$
where $c$ is the maximum value of $\tfrac{1}{2}\|f_+(x) + L^k_+(x)\|$ for $x^k, x \in \bar x + \delta\mathbb{B}$. Let $\omega$ be the modulus of continuity of $\nabla f$ on $\mathbb{R}^n_+ \cap (\bar x_+ + \delta\mathbb{B})$ (see Definition 3.3). Similarly to (14) in the proof of Proposition 3.5,
$$\|f_+(x) - L^k_+(x)\| \le \omega(\|x - x^k\|)\,\|x - x^k\|/2,$$
where $\omega(\tau) \to 0$ as $\tau \downarrow 0$. Take $\hat\varepsilon(\tau) := c\,\omega(\tau)\,\tau/2$. □

We will search several paths during an iteration but, unlike Algorithm 1, our criteria for choosing the path parameter will use derivatives of the approximation $A^k$ rather than of $\theta$. Let $\sigma_0 \in (0, \mu_1)$ and $\beta \in (0, 1)$. Suppose we are also given an orthant $O$ containing a point $y$ (but not necessarily $x^k$), and a path $y(\cdot) \colon [0, \infty) \to \mathbb{R}^n$ with $y(0) = y$. Given $\alpha > 0$, $y(\alpha)$ is a candidate for $x^{k+1}$ if
$$\theta(y(\alpha)) \le A^k(y) + \sigma_0\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle. \quad (20)$$
Here $A^k_O$ is the restriction $A^k|_O$. If $\alpha$ fails the above test, we can try $\beta\alpha$. Note that if $y = x^k$ then $\nabla A^k_O(y) = \nabla\theta_O(x^k)$, and the obvious choice for $y(\alpha)$ is $\pi_O(x^k - \alpha\nabla\theta_O(x^k))$.
In this case (20) is equivalent to (7) with $\phi = \theta_O$ and $\Omega = O$. The first part of Algorithm 2 is a single step of Algorithm 1 applied to $A^k$ instead of $\theta$. The second part determines the path and the corresponding step length that will define the next iterate $x^{k+1}$.
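Lemma 4.2 is the key approximation property behind Algorithm 2: $A^k$ agrees with $\theta$ at $x^k$ and approximates it to first order nearby. The following numeric sketch checks this on a made-up $C^1$ map $f$ of our own choosing (none of the data below is from the paper).

```python
import numpy as np

# Numeric check that the linearization A^k of (19) approximates theta to first
# order: |theta(x) - A^k(x)| = o(||x - x^k||), as asserted by Lemma 4.2.

def f(z):
    return np.array([np.exp(z[0]) - 1.0 + z[1], z[0] + z[1] ** 2])

def jac_f(z):
    return np.array([[np.exp(z[0]), 1.0], [1.0, 2.0 * z[1]]])

def theta(x):
    xp = np.maximum(x, 0.0)
    r = f(xp) + x - xp          # normal map f_+(x)
    return 0.5 * r @ r

def A(xk, x):
    # A^k(x) = (1/2)||L^k_+(x)||^2: f is linearized about x^k_+ while the
    # nonsmooth part x - x_+ is kept exactly, as in (19).
    xkp = np.maximum(xk, 0.0)
    xp = np.maximum(x, 0.0)
    L = f(xkp) + jac_f(xkp) @ (xp - xkp) + x - xp
    return 0.5 * L @ L

xk = np.array([0.5, 0.5])
d = np.array([1.0, -0.3]); d /= np.linalg.norm(d)
errs = [abs(theta(xk + h * d) - A(xk, xk + h * d)) for h in (1e-2, 1e-3)]
```

Here $A^k(x^k) = \theta(x^k)$ exactly, and shrinking the step by a factor of 10 shrinks the approximation error by roughly a factor of 100, consistent with $\hat\varepsilon(\tau) = o(\tau)$ (indeed $O(\tau^2)$, since this particular $f$ is $C^2$).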

Algorithm 2. Let $x^0 \in \mathbb{R}^n$ and (in addition to the constants used for the PG method) $\sigma_0 \in (0, \mu_1)$, $\beta \in (0, 1)$. Given $k \in \{0, 1, 2, \ldots\}$ and $x^k \in \mathbb{R}^n$, define $x^{k+1}$ as follows.

Part I. Choose any orthant $O^k$ containing $x^k$, let $y^0 := x^k$,
$$y^0(\alpha) := \pi_{O^k}[x^k - \alpha\nabla A^k_{O^k}(x^k)],$$
and let $\bar\alpha_0$ be the step size determined by one step of the projected gradient algorithm applied to $\min\{A^k(x) : x \in O^k\}$ from $y^0$. Suppose $F_1, \ldots, F_n$ are the facets of $O^k$. For $j = 1, \ldots, n$, let $y^j := \pi_{F_j}(x^k)$, $N_j := N_{O^k}(F_j)$, $O_j := F_j + N_j$,
$$y^j(\alpha) := y^j + \pi_{N_j}[-\alpha\nabla A^k_{O_j}(y^j)],$$
and let $\bar\alpha_j$ be the step size determined by one step of the projected gradient algorithm applied to $\min\{A^k(x) : x \in y^j + N_j\}$ from $y^j$.

Part II. Path search: Let $M := \{0, \ldots, n\}$, $\hat\jmath := 0$ and $\bar\alpha_0 := \bar\alpha_0/\beta$.
REPEAT
 Let $\bar\alpha_{\hat\jmath} := \beta\bar\alpha_{\hat\jmath}$. If $\bar\alpha_{\hat\jmath} \le \|y^{\hat\jmath} - x^k\|$ then $M := M \setminus \{\hat\jmath\}$.
 If $M = \emptyset$, STOP; $x^k$ is a Gauss–Newton point of $f_+$.
 Else let $\hat\jmath \in \operatorname{argmin}\{A^k(y^j) + \sigma_0\langle \nabla A^k_{O_j}(y^j),\, y^j(\bar\alpha_j) - y^j \rangle : j \in M\}$.
UNTIL (20) holds for $y(\alpha) = y^{\hat\jmath}(\bar\alpha_{\hat\jmath})$, $y = y^{\hat\jmath}$ and $O = O_{\hat\jmath}$.
Let $x^{k+1} := y^{\hat\jmath}(\bar\alpha_{\hat\jmath})$.

Remark. For the algorithm to work properly, we assume that Part I returns $\bar\alpha_j = 0$ if $y^j$ is already stationary for the corresponding subproblem.

Theorem 4.3 Algorithm 2 is well defined and has NSR.

Proof First we show that each step of the algorithm is well defined. Consider one step of the algorithm given $k \in \{0, 1, 2, \ldots\}$ and $x^k \in \mathbb{R}^n$. Part I is well defined because the projected gradient method is well defined. For Part II, we see that each iteration of the REPEAT loop is well defined; we claim that the loop terminates after finitely many iterations. Certainly if $j \in \{0, \ldots, n\}$ and $y^j \ne x^k$, then after a finite number of loop iterations in which $\hat\jmath = j$ and $\bar\alpha_j := \beta\bar\alpha_j$, we have
$$\bar\alpha_j \le \|y^j - x^k\|, \quad (21)$$

hence in any subsequent loop iteration $j \notin M$ and $\hat\jmath \ne j$. Instead, suppose $j$ is such that $y^j = x^k$. Either $y^j$ is stationary for the $j$th subproblem, hence $\bar\alpha_j = 0$ and, by construction of $M$, $\hat\jmath$ equals $j$ for at most one loop iteration; or, using Proposition 3.5.1, initially $\bar\alpha_j > 0$ and after finitely many loop iterations in which $\hat\jmath = j$ and $\bar\alpha_j := \beta\bar\alpha_j$, (20) holds, terminating the loop.

It is only left to check that $x^k$ is a Gauss–Newton point of $f_+$ if $M = \emptyset$. In this case, $\bar\alpha_j \le \|y^j - x^k\|$ for each $j$; in particular $\bar\alpha_j = 0$ if $y^j = x^k$, i.e. for $j = 0$ and each $j$ in $M_0 := \{j : 1 \le j \le n,\ x^k \in F_j\}$. This is only possible if $x^k$ is stationary for each subproblem $\min\{A^k(x) : x \in O^k\}$ and $\min\{A^k(x) : x \in x^k + N_{O^k}(F_j)\}$, $j \in M_0$. Since for each orthant $O$ containing $x^k$ we have $\nabla A^k_O(x^k) = \nabla\theta_O(x^k)$, it follows that $x^k$ is also stationary for $\min\{\theta(x) : x \in O^k\}$ and $\min\{\theta(x) : x \in x^k + N_{O^k}(F_j)\}$, $j \in M_0$. Proposition 2.3 says $x^k$ is indeed a Gauss–Newton point of $f_+$.

We now prove the NSR property. Suppose that $\bar x$ is nonstationary for $\theta$. As in the proof of Theorem 4.1, we assume $O^k = \bar O$ for some fixed orthant $\bar O$ containing $\bar x$. Observe from Proposition 2.3 that either $\bar x$ is nonstationary for $\min\{\theta(x) : x \in \bar O\}$ or $\bar x$ is nonstationary for $\min\{\theta(x) : x \in \bar x + N_{\bar O}(\bar F)\}$, for some facet $\bar F$ of $\bar O$ containing $\bar x$. Below we assume the latter, and deduce for $x^k$ near $\bar x$ that $\theta(x^{k+1}) < \theta(\bar x)$.

Let $\bar O$ be an orthant containing $\bar x$ and $\bar F$ be a facet of $\bar O$ containing $\bar x$. Assume $\bar x$ is nonstationary for $\min\{\theta(x) : x \in \bar x + N_{\bar O}(\bar F)\}$. Assume further that $x^k$ is some iterate with $O^k = \bar O$, so if $F_1, \ldots, F_n$ are the facets of $O^k$, then $\bar F = F_{\tilde\jmath}$ for some index $\tilde\jmath$. To simplify notation we omit the superscript or subscript $\tilde\jmath$ where possible. Let $N = N_{\bar O}(\bar F)$, $O = \bar F + N$, $y = \pi_{\bar F}(x^k)$, $y(\alpha) = y + \pi_N(-\alpha\nabla A^k_O(y))$, and
$$\bar A(x) := \tfrac{1}{2}\|f(\bar x_+) + \nabla f(\bar x_+)(x_+ - \bar x_+) + x - x_+\|^2.$$
Observe, since $\nabla\theta_O(\bar x) = \nabla\bar A_O(\bar x)$, that $\bar x$ is nonstationary for $\min\{\bar A(x) : x \in y + N\}$.
Rewriting the $\tilde\jmath$th subproblem, $\min\{A^k_O(x) : x \in y + N\}$, as
$$\min\{A^k_O(y + d) : d \in N\},$$
defining $\psi(d) = A^k_N(y + d)$, $\phi(d) = \bar A_N(y + d)$, $\Omega = N$ and choosing $\delta > 0$, enables us to apply Lemma 3.4 and Proposition 3.5. Then there exist $\epsilon_1 > 0$, $\rho > 0$ such that if $\|x^k - \bar x\| \le \epsilon_1$ and $\psi \in U(\epsilon_1) = U(\epsilon_1; \delta; \bar x; \phi)$ (see Definition 3.3), then
$$A^k(y(\alpha)) \le A^k(y) + \mu_1\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle, \quad (22)$$
$$\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle \le -\rho\min\{\alpha, \epsilon_1\}, \quad (23)$$
and the initial step size $\bar\alpha_{\tilde\jmath}$ chosen in Part I of the algorithm is bounded below by the constant $\bar\alpha > 0$ of Proposition 3.5. Now $\bar A_N$ and $A^k_N$ are quadratic functions defined on the half-line $N$; hence, by continuity of $\nabla f$, it follows easily that there exists $\epsilon_2 \in (0, \epsilon_1]$ such that $\psi \in U(\epsilon_1)$ if $\|x^k - \bar x\| \le \epsilon_2$. Thus (22) and (23) hold for such $x^k$ and $\alpha \in [0, \epsilon_2]$.

Let $\|x^k - \bar x\| \le \epsilon_2$ and $0 \le \alpha \le \epsilon_2$. We have
$$\begin{aligned} \theta(y(\alpha)) - \theta(x^k) &= [A^k(y(\alpha)) - A^k(y)] + [A^k(y) - \theta(x^k)] + [\theta(y(\alpha)) - A^k(y(\alpha))] \\ &\le \mu_1\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle + [A^k(y) - \theta(x^k)] + [\theta(y(\alpha)) - A^k(y(\alpha))], \end{aligned} \quad (24)$$
using (22). Let $L$ be an upper bound on $\|\nabla A^k_O(y)\|$ for $x^k \in \bar x + \epsilon_2\mathbb{B}$, and observe
$$\|y(\alpha) - x^k\| \le \|y(\alpha) - y\| + \|y - x^k\| \le L\alpha + \|y - x^k\|.$$
Also $y = \pi_{\bar F}(x^k)$ is bounded on $\bar x + \epsilon_2\mathbb{B}$; therefore Lemma 4.2 provides a non-decreasing error bound $\hat\varepsilon(t) = o(t)$ such that for each $x^k \in \bar x + \epsilon_2\mathbb{B}$ and $\alpha \in [0, \epsilon_2]$,
$$|\theta(y(\alpha)) - A^k(y(\alpha))| \le \hat\varepsilon(L\alpha + \|y - x^k\|). \quad (25)$$
Let $\hat\rho := (\mu_1 - \sigma_0)\rho/2$ and choose $\delta^* \in (0, \min\{\epsilon_2, \bar\alpha\})$ such that $\hat\varepsilon(2L\alpha) \le \hat\rho\alpha$ for $\alpha \in (0, \delta^*]$; set $\alpha^* := \beta\delta^*$. Now choose $\epsilon_3 \in (0, \epsilon_2)$ such that if $\|x^k - \bar x\| \le \epsilon_3$, then both
$$\|y - x^k\| < \min\{\alpha^*, L\alpha^*\} \qquad \text{and} \qquad |A^k(y) - \theta(x^k)| \le \hat\rho\alpha^*.$$
Let $x^k \in \bar x + \epsilon_3\mathbb{B}$. For $\alpha \in [\alpha^*, \delta^*]$, (24) and (25) yield
$$\begin{aligned} \theta(y(\alpha)) - \theta(x^k) &\le \mu_1\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle + \hat\rho\alpha^* + \hat\varepsilon(L\alpha + L\alpha^*) \\ &\le \mu_1\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle + 2\hat\rho\alpha \\ &= \mu_1\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle + (\mu_1 - \sigma_0)\rho\alpha. \end{aligned}$$
From (23), if $\alpha \le \epsilon_1$ then $(\mu_1 - \sigma_0)\rho\alpha \le -(\mu_1 - \sigma_0)\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle$; therefore
$$\theta(y(\alpha)) - \theta(x^k) \le \sigma_0\langle \nabla A^k_O(y),\, y(\alpha) - y \rangle, \qquad \forall \alpha \in [\alpha^*, \delta^*]. \quad (26)$$
From the above, the initial step size $\bar\alpha_{\tilde\jmath}$ and the point $y^{\tilde\jmath} = y$ are such that $\bar\alpha_{\tilde\jmath} \ge \bar\alpha \ge \delta^*$ and $\alpha^* > \|y - x^k\|$. We claim it follows from (26) that, during the REPEAT loop of Part II, $\tilde\jmath \in M$ and $\bar\alpha_{\tilde\jmath} \ge \alpha^*$. To see this, suppose that $\bar\alpha_{\tilde\jmath}$ decreases in some loop iteration after the first. Then at the end of the previous loop iteration ($\hat\jmath = \tilde\jmath$ and) the condition (20) fails for $y(\alpha) = y^{\tilde\jmath}(\bar\alpha_{\tilde\jmath})$, $y = y^{\tilde\jmath}$ and $O = O_{\tilde\jmath}$; so it follows from (26) that $\bar\alpha_{\tilde\jmath} > \delta^*$. Thus the new value $\beta\bar\alpha_{\tilde\jmath}$ of $\bar\alpha_{\tilde\jmath}$ is bounded below by $\beta\delta^* = \alpha^*$; hence also $\bar\alpha_{\tilde\jmath} > \|y - x^k\|$ and $\tilde\jmath$ is not deleted from $M$. Therefore, after the REPEAT loop terminates, $\tilde\jmath \in M$ and $\bar\alpha_{\tilde\jmath} \ge \alpha^*$; and the selection of $x^{k+1}$, whether or not using $y^{\tilde\jmath}(\cdot)$, satisfies
$$\begin{aligned} \theta(x^{k+1}) &\le \min\{A^k(y^j) + \sigma_0\langle \nabla A^k_{O_j}(y^j),\, y^j(\bar\alpha_j) - y^j \rangle : j \in M\} \\ &\le A^k(y) + \sigma_0\langle \nabla A^k_O(y),\, y(\bar\alpha_{\tilde\jmath}) - y \rangle \\ &\le A^k(y) - \sigma_0\rho\min\{\bar\alpha_{\tilde\jmath}, \epsilon_1\} \quad \text{(from (23))} \\ &\le A^k(y) - \bar\rho, \end{aligned}$$

where $\bar\rho := \sigma_0\rho\min\{\alpha^*, \epsilon_1\}$ is a positive constant independent of $x^k$. As noted above, $A^k(y) \to \theta(\bar x)$ as $x^k \to \bar x$, so $\theta(x^{k+1}) < \theta(\bar x)$ for $x^k$ sufficiently close to $\bar x$.

A similar argument can be made for the case when $O^k = \bar O$ and $\bar x$ is nonstationary for $\min\{\theta(x) : x \in \bar O\}$. In this case $\tilde\jmath = 0$, $y = x^k$ and $y(\alpha) = \pi_{\bar O}(x^k - \alpha\nabla A^k_{\bar O}(x^k))$. We do not give details, but only note that this case is somewhat simpler than that above because the inequality corresponding to (24) has only two summands on the right:
$$\theta(y(\alpha)) - \theta(x^k) \le \mu_1\langle \nabla A^k_{\bar O}(x^k),\, y(\alpha) - x^k \rangle + [\theta(y(\alpha)) - A^k(y(\alpha))].$$
Since there are only finitely many choices of $\bar O$, the NSR property of Algorithm 2 is established. □

4.3 A hybrid algorithm with quadratic local convergence

Both of the algorithms given above have at best a linear rate of convergence, because the projected gradient method is only a first-order method. However, if an algorithm for finding a Gauss–Newton point of $f_+$ has NSR (such as Algorithms 1 and 2), then it lends itself to hybrid methods that alternate between steps of the original algorithm and Newton-like steps, and that therefore admit the possibility of quadratic local convergence. For such a hybrid algorithm, let $K$ be the set of indices $k$ for which the original algorithm determines $x^{k+1}$. If $K$ has infinitely many elements and monotonicity of the algorithm is maintained, accumulation points of the subsequence $\{x^k\}_{k \in K}$ are Gauss–Newton points of $f_+$. If such a limit point $\bar x$ is in fact a point of attraction of a Newton method, and a Newton step is taken every $\ell$th iteration, then convergence will be $\ell$-step superlinear, or $\ell$-step quadratic if $\nabla f$ is Lipschitz. See [2] for details on a related hybrid algorithm in the context of quadratic programming.

We briefly sketch three popular Newton methods for solving the nonsmooth equation
$$f_+(x) = 0,$$
which often produce Q-quadratically convergent sequences of iterates.
To make comparisons easy, we use the general notion of a Newton path [28] which, given the iterate $x^k$, is some function $p^k \colon [0, 1] \to \mathbb{R}^n$ with $p^k(0) = x^k$; the next iterate $x^{k+1}$ is defined as $p^k(\tau)$ for some $\tau \in [0, 1]$ (details are given below). We say a Newton iterate or Newton step is taken if $x^{k+1} = p^k(1)$. We may not take a Newton step, however, if it does not yield "sufficient progress". A simple damping strategy is used to ensure sufficient progress: recall the constants $\sigma_0, \beta \in (0, 1)$, and define $\tau_k$ as the largest member of $\{1, \beta, \beta^2, \ldots\}$ such that
$$\|f_+(p^k(\tau_k))\| \le (1 - \sigma_0\tau_k)\,\|f_+(x^k)\|. \quad (27)$$
Then $x^{k+1} := p^k(\tau_k)$; this is the damped Newton iterate.

Newton path 1. Given $k$ and $x^k$, let $O^k$ be an orthant containing $x^k$,
$$M^k = \nabla(f_+|_{O^k})(x^k), \qquad d^k = -(M^k)^{-1}f_+(x^k), \qquad p^k(\tau) = x^k + \tau d^k.$$
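The damping rule (27) along Newton path 1 is straightforward to implement. The sketch below uses our own affine test data $M$, $q$ (so the normal map is piecewise affine and the full Newton step is exact once the iterate lies in the orthant of the solution); the helper names are ours.

```python
import numpy as np

# Damped Newton step on the normal-map equation f_+(x) = 0, following (27) and
# Newton path 1, for the affine example f(z) = M z + q.

M = np.array([[2.0, 1.0], [0.0, 2.0]])
q = np.array([-1.0, 1.0])

def f_plus(x):
    xp = np.maximum(x, 0.0)
    return M @ xp + q + x - xp

def damped_newton_step(x, sigma0=0.5, beta=0.5, tries=30):
    s = np.where(x >= 0, 1.0, -1.0)
    D = np.diag((s > 0).astype(float))
    Mk = M @ D + np.eye(len(x)) - D          # M^k = Jacobian of f_+ on O^k
    d = -np.linalg.solve(Mk, f_plus(x))      # Newton direction d^k
    r0 = np.linalg.norm(f_plus(x))
    tau = 1.0
    for _ in range(tries):                   # largest tau in {1, beta, ...}
        p = x + tau * d                      # Newton path p^k(tau)
        if np.linalg.norm(f_plus(p)) <= (1.0 - sigma0 * tau) * r0:
            return p
        tau *= beta
    return x

x1 = damped_newton_step(np.array([1.0, -0.5]))
```

For this data the NCP solution is $z = (1/2, 0)$, so the normal-map zero is $x^* = (1/2, -1)$; starting in the same orthant, the undamped step $\tau = 1$ satisfies (27) and reaches $x^*$ exactly.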