
A VIEW ON NONLINEAR OPTIMIZATION

JOSE HERSKOVITS
Mechanical Engineering Program, COPPE / Federal University of Rio de Janeiro (1)
Caixa Postal 68503, Rio de Janeiro, BRAZIL

1. Introduction

Once the concepts of a material object are established, the act of designing consists of choosing the values of the quantities that prescribe the object, or dimensioning. These quantities are called Design Variables. A particular value assumed by the design variables defines a configuration. The design must meet Constraints given by physical or other limitations. We have a feasible configuration if all the constraints are satisfied. A better design is obtained if an appropriate cost or objective function can be reduced. The objective is a function of the design variables and quantifies the quality of the material object to be designed. The design is optimum when the cost is the lowest among all the feasible designs. If we call x ≡ [x_1, x_2, ..., x_n] the design variables, f(x) the objective function and Ω the feasible set, the optimization problem can be denoted

    minimize f(x), subject to x ∈ Ω.    (1.1)

This problem is said to be a Mathematical Program and the discipline that studies the numerical techniques to solve Problem (1.1), Mathematical Programming. Mathematical programs arise naturally in optimization problems related to a wide set of disciplines that employ mathematical models; moreover, several physical phenomena can themselves be modeled by means of mathematical programs. This is the case when "equilibrium" is attained at the minimum of an energy function.

(1) Partially written at INRIA, Institut National de Recherche en Informatique et en Automatique, Rocquencourt, France.

The first stage in getting an optimal design is to define the Optimization Model; that is, to select appropriate design variables, an objective function and the feasible set. In engineering design, we generally have

    Ω ≡ {x ∈ R^n / g_i(x) ≤ 0, i = 1, 2, ..., m;  h_i(x) = 0, i = 1, 2, ..., p},

where x ≡ [x_1, x_2, ..., x_n]^t and g_i and h_i are the Inequality and Equality Constraints respectively. Since it involves nonlinear functions, Problem (1.1) is a Nonlinear Program. The optimization problem is said to be Unconstrained when Ω ≡ R^n.

Mathematical Programming provides a general and flexible formulation for engineering design problems. Once the optimization model is established, nonlinear programming algorithms only require the computation of f, g_i, h_i and their derivatives at each iteration. Nowadays, strong and efficient mathematical programming techniques for several kinds of problems, based on solid theoretical results and extensive numerical studies, are available. Approximate functions, derivatives and optimal solutions can be employed together with optimization algorithms to reduce the computer time.

The aim of this paper is not to describe the state of the art of nonlinear programming, but to explain in a simple way most of the modern techniques applied in this discipline. With this objective, we include the corresponding algorithms in the framework of a general approach, based on Newton-like iterations for nonlinear systems. These iterations are used to find points verifying first order optimality conditions. Some favorable characteristics of optimization problems, conveniently explored, lead to strong algorithms with global convergence.

There exists a very large bibliography on nonlinear programming. We only mention the books by Bazaraa and Shetty [7], Dennis and Schnabel [10], Fletcher [14], Gill et al. [19], Hiriart-Urruty and Lemarechal [32], Luenberger [35], [36] and Minoux [39]. Several books written by engineering and/or structural designers include numerical optimization techniques in the framework of optimal design, such as Arora [1], Haftka et al. [20], Haug and Arora [22], Kirsch [34], Fox [15] and Vanderplaats [32].

We discuss some basic concepts of mathematical programming in the next section and, in the following one, optimality conditions are studied. A view of Newton-like algorithms for nonlinear systems is given in Section 4. Unconstrained and equality constrained optimization techniques are studied in Sections 5 and 6. The Sequential Quadratic Programming method is discussed in Section 7 where, in addition to a classical algorithm based on this method, a feasible directions algorithm is presented. A Newton-like

approach for interior point algorithms, proposed by the author [30], [31], is presented in the last section.

2. Some basic concepts

We deal with the nonlinear optimization problem

    minimize f(x)
    subject to g_i(x) ≤ 0, i = 1, 2, ..., m,
    and h_i(x) = 0, i = 1, 2, ..., p,    (2.1)

where f, g and h are smooth functions in R^n and at least one of these functions is nonlinear. An inequality constraint is said to be Active if g_i(x) = 0 and Inactive if g_i(x) < 0. Denoting g(x) = [g_1(x), g_2(x), ..., g_m(x)]^t and h(x) = [h_1(x), h_2(x), ..., h_p(x)]^t, we have

    minimize f(x)
    subject to g(x) ≤ 0 and h(x) = 0.    (2.2)

We now introduce the auxiliary variables λ ∈ R^m and μ ∈ R^p, called Dual Variables or Lagrange Multipliers, and define the Lagrangian Function associated with Problem (2.1) as

    l(x, λ, μ) ≡ f(x) + λ^t g(x) + μ^t h(x).

In what follows, some definitions concerning Problem (2.1) and the methods to solve this problem are presented. First, the meaning of the statement of the optimization problem is discussed.

DEFINITION 2.1. A point x* ∈ Ω is a Local Minimum (or Relative Minimum) of f over Ω if there exists a neighborhood N ≡ {x ∈ Ω / ‖x − x*‖ ≤ ε} such that f(x) ≥ f(x*) for any x ∈ N. If f(x) > f(x*) for any x ∈ N, x ≠ x*, then x* is a Strict Local Minimum.

DEFINITION 2.2. A point x* ∈ Ω is a Global Minimum (or Absolute Minimum) of f over Ω if f(x) ≥ f(x*) for any x ∈ Ω.

Note that a global minimum is also a local minimum. The nature of optimization implies the search for the global minimum. Unfortunately, global minima can be characterized only in some particular cases, as in Convex Programming.

Nonlinear Programming methods are usually iterative. Given an initial point x^0, a sequence of points {x^k} is obtained by repeated applications of an algorithmic rule. This sequence must converge to a solution x* of the

problem. The convergence is said to be asymptotic when the solution is not achieved after a finite number of iterations. Except in some particular cases, like linear or quadratic programming, such is the case in nonlinear optimization.

DEFINITION 2.3. An iterative algorithm is said to be Globally Convergent if for any initial point x^0 ∈ R^n (or x^0 ∈ Ω) it generates a sequence of points converging to a solution of the problem.

DEFINITION 2.4. An iterative algorithm is Locally Convergent if there exists a positive ε such that for any initial point x^0 ∈ R^n (or x^0 ∈ Ω) verifying ‖x^0 − x*‖ ≤ ε, it generates a sequence of points converging to a solution of the problem.

Modern Mathematical Programming techniques seek globally convergent methods. Locally convergent algorithms are not useful in practice, since the neighborhood of convergence is not known in advance. The major objective of numerical techniques is to have strong methods with global convergence. Once this is obtained, engineers worry about efficiency. In Design Optimization, evaluation of functions and derivatives generally takes more computer time than the internal computations of the algorithm itself. Then, the number of iterations gives a good idea of the computer time required to solve the problem. The following definition introduces a criterion to evaluate the speed of convergence of asymptotically convergent iterative methods.

DEFINITION 2.5. The Order of Convergence of a sequence {x^k} → x* is the largest of the nonnegative numbers p satisfying

    lim_{k→∞} ‖x^{k+1} − x*‖ / ‖x^k − x*‖^p = β < ∞.

When p = 1 we have Linear Convergence with Convergence Ratio β < 1. If β = 0 the convergence is said to be Superlinear. The convergence is Quadratic in the case when p = 2.

Since they involve the limit when k → ∞, p and β are a measure of the asymptotic speed of convergence. Unfortunately, a sequence with a good order of convergence may be very "slow" far from the solution. The convergence is faster when p is larger and β is smaller. Near the solution, if the convergence is linear the error is multiplied by β at each iteration, while the error is squared for quadratic convergence. The methods that will be studied here have rates varying between linear and quadratic.
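As a quick numerical illustration of Definition 2.5 (our sketch, not an example from the paper), compare an error sequence with linear convergence and ratio β = 0.5 against a quadratically convergent one: after five iterations the linear error has only shrunk by 2^5, while the quadratic error has collapsed far below machine precision.

```python
# Sketch: linear vs. quadratic convergence of error sequences e_k -> 0.
# Linear:    e_{k+1} = beta * e_k   (error multiplied by beta each iteration)
# Quadratic: e_{k+1} = e_k**2       (number of correct digits roughly doubles)
beta = 0.5
e_lin, e_quad = 0.1, 0.1
for k in range(5):
    e_lin, e_quad = beta * e_lin, e_quad ** 2

print(e_lin)   # 0.1 * 0.5**5 = 0.003125
print(e_quad)  # 0.1**(2**5) = 1e-32
```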

Figure 1. (a) Descent directions (left); (b) Feasible directions (right)

In general, globally convergent algorithms define at each point a search direction and look for a new point on that direction.

DEFINITION 2.6. A vector d ∈ R^n is a Descent Direction of a real function f at x ∈ R^n if there exists a δ > 0 such that f(x + td) < f(x) for any t ∈ (0, δ).

If f is differentiable at x and d^t ∇f(x) < 0, it is easy to prove [7] that d is a descent direction of f. In Figure 1.a, constant value contours of f(x) are represented. We note that f(x) decreases in any direction that makes an angle greater than 90 degrees with ∇f(x). The set of all descent directions constitutes the half space D.

DEFINITION 2.7. A vector d ∈ R^n is a Feasible Direction of Problem (2.1), at x ∈ Ω, if for some θ > 0 we have x + td ∈ Ω for all t ∈ [0, θ].

In Figure 1.b the feasible region of an inequality constrained problem is represented. The vector d is a feasible direction, since it supports a non-zero segment [x, x + θd]. Any direction is feasible at an interior point. At the boundary, the feasible directions constitute a cone F that we call the Cone of Feasible Directions. This cone is not necessarily closed.

3. Optimality conditions

A first requirement to solve optimization problems is to characterize the solutions by conditions that are easy to verify. These conditions will be useful to identify a minimum point and, frequently, will be at the heart of numerical techniques to solve the problem.
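Before proceeding, a small numerical illustration of Definition 2.6 (our own example, not the paper's): when d^t ∇f(x) < 0, moving a short distance along d does reduce f.

```python
# Numerical illustration of Definition 2.6: if d^t grad_f(x) < 0, then d is a
# descent direction, i.e. f(x + t*d) < f(x) for small enough t > 0.
# Illustrative function (ours): f(x) = x1^2 + 3*x2^2 at x = (1, 1).
def f(x):       return x[0] ** 2 + 3.0 * x[1] ** 2
def grad_f(x):  return [2.0 * x[0], 6.0 * x[1]]

x = [1.0, 1.0]
d = [-1.0, 0.0]                       # makes an angle > 90 deg with grad f(x)
slope = sum(gi * di for gi, di in zip(grad_f(x), d))
print(slope < 0.0)                    # d^t grad_f(x) = -2 < 0
t = 0.1
print(f([x[0] + t * d[0], x[1] + t * d[1]]) < f(x))
```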

Figure 2. Illustration of the first order optimality condition

Optimality conditions are based on differential calculus, and they are said to be of first order if they involve only first derivatives. They are of second order if second derivatives are also required. In what follows we describe a series of optimality conditions for unconstrained and constrained problems based on Luenberger [36], where they are proved. Since differential calculus gives only local information about the problem and we do not include convexity assumptions, only relative minima can be characterized.

All these conditions can be proved by considering that any feasible curve x(t) passing through a local minimum x* of Problem (2.1) has a local minimum of f[x(t)] at x*. The results are obtained by applying optimality conditions of one-dimensional problems to f[x(t)]. The following theorem gives a geometric interpretation of the optimality conditions of a large class of problems.

THEOREM 3.1. First and second order necessary conditions. If x* ∈ Ω is a local minimum of f over Ω then, for any feasible direction d ∈ R^n, it is
i) d^t ∇f(x*) ≥ 0;
ii) if d^t ∇f(x*) = 0, then d^t ∇²f(x*) d ≥ 0. □

The first result means that no improving direction of f at x* is a feasible direction, that is, F ∩ D ≡ ∅. This is the case in Figure 2, where x* is a minimum while x̄ is not. In fact, walking along d from x̄, we can get a new feasible point with lower f. In what follows, optimality conditions for unconstrained, equality constrained and inequality constrained optimization are discussed.

3.1. UNCONSTRAINED OPTIMIZATION

Since any d ∈ R^n is a feasible direction, the well known optimality conditions for unconstrained optimization are easily derived from Theorem 3.1.

COROLLARY 3.2. First and second order necessary conditions. If x* is a local minimum of f over R^n, then
i) ∇f(x*) = 0;
ii) for all d ∈ R^n, d^t ∇²f(x*) d ≥ 0. That is, ∇²f(x*) is positive semidefinite. □

A sufficient local optimality condition for this problem is stated below. It requires the calculation of two derivatives of f.

THEOREM 3.3. Sufficient optimality conditions. Let f be a twice continuously differentiable scalar function in R^n and x* such that
i) ∇f(x*) = 0;
ii) ∇²f(x*) is positive definite.
Then, x* is a strict local minimum point of f. □

3.2. EQUALITY CONSTRAINED OPTIMIZATION

We consider now the equality constrained problem

    minimize f(x) subject to h(x) = 0    (3.1)

and introduce some definitions concerning the constraints of this problem.

DEFINITION 3.4. Let x be a point in Ω and consider all the continuously differentiable curves in Ω that pass through x. The collection of all the vectors tangent to these curves at x is said to be the Tangent Set to Ω at x.

DEFINITION 3.5. A point x ∈ Ω is a Regular Point of the constraints if the vectors ∇h_i(x), for i = 1, 2, ..., p, are linearly independent.
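As a concrete instance of Definitions 3.4 and 3.5 (our example): for a single constraint (p = 1), a point is regular exactly when its constraint gradient is nonzero, and the tangent set is the subspace orthogonal to that gradient.

```python
# Sketch for Definitions 3.4-3.5 (our example): for the single constraint
# h(x) = x1^2 + x2^2 - 1 = 0 (the unit circle), a point is regular when
# grad_h(x) = (2*x1, 2*x2) != 0, which holds everywhere on the circle.
# At x = (1, 0) the tangent space T = {y : grad_h(x)^t y = 0} is spanned by (0, 1).
def grad_h(x):
    return [2.0 * x[0], 2.0 * x[1]]

x = [1.0, 0.0]
gh = grad_h(x)
regular = any(abs(c) > 0.0 for c in gh)      # nonzero gradient => regular (p = 1)
y = [0.0, 1.0]                               # candidate tangent vector
in_T = (gh[0] * y[0] + gh[1] * y[1]) == 0.0  # grad_h(x)^t y = 0
print(regular, in_T)
```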

Regularity will be a requirement for most of the theoretical results and numerical methods in constrained optimization. At a regular point x, it is proved that
i) the Tangent Set to Ω at x constitutes a subspace, called the Tangent Space;
ii) the tangent space at x is T ≡ {y / ∇h^t(x) y = 0}.

For a given constraint geometry, even in the case when the tangent set constitutes a subspace, a point may or may not be regular depending on how h is defined [36]. The following optimality conditions are stated.

LEMMA 3.6. Let x*, a regular point of the constraints h(x) = 0, be a local minimum of Problem (3.1). Then ∇f(x*) is orthogonal to the tangent space. □

Since ∇f(x*) is orthogonal to the tangent space of the constraints, it can be expressed as a linear combination of the constraint gradients. This gives the following result.

THEOREM 3.7. First Order Necessary Conditions. Let x*, a regular point of the constraints h(x) = 0, be a local minimum of Problem (3.1). Then, there is a vector μ* ∈ R^p such that

    ∇f(x*) + ∇h(x*) μ* = 0
    h(x*) = 0. □

This theorem is valid even in the trivial case when p = n. Introducing now the Lagrangian, we see that a feasible point satisfying the first order optimality conditions can be obtained by solving the nonlinear system in (x, μ)

    ∇_x l(x, μ) = 0
    ∇_μ l(x, μ) = 0.    (3.2)

Then, the Lagrangian plays a role similar to that of the objective function in unconstrained optimization. This is also true when we consider the following second order conditions.

THEOREM 3.8. Second Order Necessary Conditions. Let x*, a regular point of the constraints h(x) = 0, be a local minimum of Problem (3.1).

Then there is a vector μ* ∈ R^p such that the result of Theorem 3.7 is true and the matrix

    H(x*, μ*) = ∇²f(x*) + Σ_{i=1}^{p} μ*_i ∇²h_i(x*)

is positive semidefinite on the tangent space, that is, y^t H(x*, μ*) y ≥ 0 for all y ∈ T. □

The matrix H(x*, μ*) plays a very important role in constrained optimization. When the constraints are linear, we have H(x*, μ*) = ∇²f(x*), and it follows from the previous theorem that ∇²f(x*) is positive semidefinite on the space defined by the constraints. This is a natural result in view of Theorem 3.3. When there are nonlinear constraints, H(x*, μ*) takes into account their curvature.

THEOREM 3.9. Second Order Sufficiency Conditions. Let the point x* satisfy h(x*) = 0. Let μ* ∈ R^p be a vector such that

    ∇f(x*) + ∇h(x*) μ* = 0

and H(x*, μ*) be positive definite on the tangent space. Then x* is a Strict Local Minimum of Problem (3.1). □

3.3. INEQUALITY CONSTRAINED OPTIMIZATION

Let us consider the inequality constrained problem

    minimize f(x) subject to g(x) ≤ 0.    (3.3)

We call I(x) ≡ {i / g_i(x) = 0} the Set of Active Constraints at x and say that x is a Regular Point if the vectors ∇g_i(x) for i ∈ I(x) are linearly independent. The Number of Active Constraints at x is Card[I(x)]. It is easy to prove that, if d^t ∇g_i(x) < 0 for i ∈ I(x), then d is a feasible direction of the constraints at x.

Suppose now that x*, a regular point, is a local minimum of Problem (3.3). It is clear that x* is also a local minimum of f(x) subject to g_i(x) = 0, for i ∈ I(x*). Then, it follows from Theorem 3.7 that there is a vector λ* ∈ R^m such that

    ∇f(x*) + ∇g(x*) λ* = 0,    (3.4)

where λ*_i = 0 for i ∉ I(x*).

The condition λ*_i = 0 for i ∉ I(x*) is called the Complementarity Condition and can be represented by means of the following equalities:

    λ*_i g_i(x*) = 0, for i = 1, 2, ..., m.

If we define G(x) ≡ diag[g(x)], a diagonal matrix such that G_ii(x) = g_i(x), the complementarity condition is expressed as

    G(x*) λ* = 0.

An additional necessary condition,

    λ* ≥ 0,

is obtained as a consequence of the first result of Theorem 3.1. In effect, let us assume that the condition is not true. Then, for some l ∈ I(x*), it is λ*_l < 0. As x* is a regular point, given σ_i ≤ 0, i ∈ I(x*), we can find a feasible direction d verifying d^t ∇g_i(x*) = σ_i. It follows from (3.4) that

    d^t ∇f(x*) = − Σ_{i ∈ I(x*), i ≠ l} λ*_i σ_i − λ*_l σ_l.

Taking now |σ_i|, for i ∈ I(x*) and i ≠ l, small enough, we can get a feasible direction d such that d^t ∇f(x*) < 0. Then, d is a descent direction of f, but this conclusion is in contradiction with Theorem 3.1. These results constitute the Karush-Kuhn-Tucker optimality conditions.

THEOREM 3.10. First Order Necessary Conditions. Let x*, a regular point of the constraints g(x) ≤ 0, be a Local Minimum of Problem (3.3). Then, there is a vector λ* ∈ R^m such that

    ∇f(x*) + ∇g(x*) λ* = 0
    G(x*) λ* = 0
    λ* ≥ 0
    g(x*) ≤ 0. □

In Figure 3 we have the convex cone F*, defined by all the positive linear combinations of the gradients of the active constraints [7]. The previous theorem implies that, if x* is a local minimum, then −∇f(x*) ∈ F*.

3.4. GENERAL CONSTRAINED OPTIMIZATION

The optimality conditions discussed above are easily generalized to the optimization problem (2.1), with equality and inequality constraints.
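The inequality constrained Karush-Kuhn-Tucker conditions above can be checked numerically at a candidate point. The following sketch uses an illustrative problem of our own (not an example from the paper) whose solution and multipliers can be found by hand: at x* = (1, 1) both constraints are active and λ* = (2/3, 2/3).

```python
# Numerical check of the Karush-Kuhn-Tucker conditions at a candidate point,
# for the illustrative problem (ours, not the paper's):
#   minimize (x1 - 2)^2 + (x2 - 1)^2
#   s.t.     g1 = x1^2 - x2   <= 0
#            g2 = x1 + x2 - 2 <= 0
def grad_f(x):  return [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]
def g(x):       return [x[0] ** 2 - x[1], x[0] + x[1] - 2.0]
def grad_g(x):  return [[2.0 * x[0], -1.0], [1.0, 1.0]]  # rows = constraint gradients

def kkt_residuals(x, lam, tol=1e-9):
    gf, gx, J = grad_f(x), g(x), grad_g(x)
    # Stationarity: grad f(x) + sum_i lam_i * grad g_i(x) = 0
    stat = [gf[j] + sum(lam[i] * J[i][j] for i in range(2)) for j in range(2)]
    comp = [lam[i] * gx[i] for i in range(2)]      # complementarity G(x) lam = 0
    feas = all(gi <= tol for gi in gx)             # primal feasibility g(x) <= 0
    sign = all(li >= -tol for li in lam)           # dual feasibility lam >= 0
    return stat, comp, feas, sign

stat, comp, feas, sign = kkt_residuals([1.0, 1.0], [2.0 / 3.0, 2.0 / 3.0])
print(max(abs(s) for s in stat), feas, sign)       # residual ~ 0, True, True
```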

Figure 3. Illustration of Karush-Kuhn-Tucker conditions

DEFINITION 3.11. A point x ∈ Ω is a Regular Point of the constraints if the vectors ∇h_i(x), for i = 1, 2, ..., p, and ∇g_i(x), for i ∈ I(x), are linearly independent.

The tangent space at x is now

    T ≡ {y / ∇g_i^t(x) y = 0 for i ∈ I(x) and ∇h^t(x) y = 0}.

THEOREM 3.12. Karush-Kuhn-Tucker First Order Necessary Conditions. Let x*, a regular point of the constraints g(x) ≤ 0 and h(x) = 0, be a Local Minimum of Problem (2.1). Then, there is a vector λ* ∈ R^m and a vector μ* ∈ R^p such that

    ∇f(x*) + ∇g(x*) λ* + ∇h(x*) μ* = 0    (3.5)
    G(x*) λ* = 0    (3.6)
    h(x*) = 0    (3.7)
    g(x*) ≤ 0    (3.8)
    λ* ≥ 0.    (3.9) □

THEOREM 3.13. Second Order Necessary Conditions. Let x*, a regular point of the constraints g(x) ≤ 0 and h(x) = 0, be a local minimum of Problem (2.1). Then there is a vector λ* ∈ R^m and a vector μ* ∈ R^p such that the result of Theorem 3.12 is true and the matrix

    H(x*, λ*, μ*) = ∇²f(x*) + Σ_{i=1}^{m} λ*_i ∇²g_i(x*) + Σ_{i=1}^{p} μ*_i ∇²h_i(x*)

is positive semidefinite on the tangent space, that is, y^t H(x*, λ*, μ*) y ≥ 0 for all y ∈ T. □

THEOREM 3.14. Second Order Sufficiency Conditions. Let the point x* satisfy g(x*) ≤ 0 and h(x*) = 0. Let there be a vector λ* ∈ R^m, λ* ≥ 0, and a vector μ* ∈ R^p such that

    ∇f(x*) + ∇g(x*) λ* + ∇h(x*) μ* = 0

and H(x*, λ*, μ*) be positive definite on the tangent space. Then x* is a Strict Local Minimum of Problem (2.1). □

4. Newton-like Algorithms for Nonlinear Systems

Figure 4. Iterations of the successive approximations method

In this section we discuss iterative methods for solving

    Φ(y) = 0,    (4.1)

where Φ : R^n → R^n is continuously differentiable. Let us write (4.1) as follows:

    y = y − Φ(y).    (4.2)

To find a solution of (4.1) by Successive Approximations, we give an initial trial point and repeatedly substitute in the right side of (4.2), obtaining the sequence

    y^{k+1} = y^k − Φ(y^k).    (4.3)

Under appropriate conditions, this sequence converges to a solution of the system.
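Before stating the method formally, here is a one-dimensional sketch of iteration (4.3) with an example of our own: Φ(y) = y − cos y, so the update y := y − Φ(y) reduces to y := cos y, and the derivative of y − Φ(y) has magnitude |sin y| < 1 near the solution, which is what drives linear convergence.

```python
import math

# Successive approximations for Phi(y) = y - cos(y) = 0 (our example).
# The update y := y - Phi(y) reduces to y := cos(y); |d/dy cos(y)| < 1
# near the solution, so the fixed-point iteration converges linearly.
y = 1.0
for k in range(100):
    d = -(y - math.cos(y))   # step: d = -Phi(y)
    y = y + d                # update: y := y + d
print(y)  # converges to the root of y = cos(y), approximately 0.739085
```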

ALGORITHM 4.1. Successive Approximations Method

Data. Initial y ∈ R^n.
Step 1. Computation of the step d. Set
    d = −Φ(y).    (4.4)
Step 2. Update. Set
    y := y + d.
Step 3. Go back to Step 1. □

The above algorithm is said to be a Fixed Point Algorithm because, if a solution is attained, then d = 0 and the rest of the sequence stays unchanged. We now define the function Ψ(y) ≡ y − Φ(y). The theorem that follows gives conditions for global convergence and results about the speed of convergence of Algorithm 4.1. These results are proved in [35].

THEOREM 4.1. Let ‖∇Ψ(y)‖ ≤ γ < 1 on R^n. Then there is a unique solution y* of (4.1) and the sequence generated by the Successive Approximations Method converges to y* for any initial y^0 ∈ R^n. The order of convergence is linear and the rate is equal to γ. □

The assumptions of the previous theorem restrict the application of the successive approximations method to a particular class of problems. In Figure 4 the process of solving a one-dimensional equation is illustrated. This is equivalent to finding the point of intersection of z = Ψ(y) with z = y. In Figure 4.a the slope of Ψ(y) is less than unity and the iterates converge, while in 4.b the process diverges. Since the order of convergence is linear, Algorithm 4.1 is also called the Linear Iterations Algorithm.

Let y^k be an estimate of y*. In a neighborhood of y^k, we have

    Φ(y) ≈ Φ(y^k) + ∇Φ(y^k)^t (y − y^k).

Then, a better estimate y^{k+1} can be obtained by making

    Φ(y^k) + ∇Φ(y^k)^t (y^{k+1} − y^k) = 0,    (4.5)

which defines Newton's Method iteration.

Figure 5. Newton's iterations

ALGORITHM 4.2. Newton's Method

Data. Initial y ∈ R^n.
Step 1. Computation of the step d. Solve the linear system for d:
    [∇Φ(y)]^t d = −Φ(y).    (4.6)
Step 2. Update. Set
    y := y + d.
Step 3. Go back to Step 1. □

The process of Newton's method is illustrated in Figure 5. At a given estimate of y* the function is approximated by its tangent. A new estimate is then taken at the point where the tangent crosses the y axis. The theorem below gives the conditions for local convergence and studies the speed of convergence. It can be proved in a similar way as in [9], [35] and [36].

THEOREM 4.2. Let Φ(y) be twice continuously differentiable and y* a solution of Φ(y) = 0. Assume that ∇Φ(y*)^{−1} exists. Then, if started close enough to y*, Algorithm 4.2 does not fail and it generates a sequence that

converges to y*. The convergence is at least quadratic. □

The algorithm does not fail if (4.6) has a unique solution. This method's major advantage comes from its speed of convergence. However, it requires the evaluation of the Jacobian ∇Φ and the solution of a linear system at each iteration, which can be very expensive in terms of computer effort. Moreover, global convergence is not assured. The analytic Jacobian can be replaced by a finite-difference approximation, but this is also costly, since n additional evaluations of the function per iteration are required.

With the objective of reducing computational effort, quasi-Newton methods generate an approximation of the Jacobian or of its inverse. The basic idea of most quasi-Newton techniques is to construct this approximation using information gathered as the iterates progress. Let B^k be the current approximation of ∇Φ(y^k). A new approximation B^{k+1} is obtained from

    B^{k+1} = B^k + ΔB^k.    (4.7)

Since

    Φ(y^{k+1}) − Φ(y^k) ≈ ∇Φ(y^k)^t (y^{k+1} − y^k),

ΔB^k is defined in such a way that

    Φ(y^{k+1}) − Φ(y^k) = [B^{k+1}]^t (y^{k+1} − y^k).    (4.8)

Substitution of (4.7) in (4.8) gives n conditions to be satisfied by ΔB^k. Since ΔB^k has n² elements, these conditions are not enough to determine it. Several updating rules for B^{k+1} were proposed [10], Broyden's Rule being the most successful:

    B^{k+1} = B^k + δ (γ − [B^k]^t δ)^t / (δ^t δ),    (4.9)

where δ = y^{k+1} − y^k and γ = Φ(y^{k+1}) − Φ(y^k).

ALGORITHM 4.3. Quasi-Newton Method

Data. Initial y ∈ R^n and B ∈ R^{n×n}.
Step 1. Computation of the step d. Solve the linear system for d:
    B^t d = −Φ(y).    (4.10)

Step 2. Update. Set
    y := y + d
and
    B := B + ΔB.
Step 3. Go back to Step 1. □

In Step 2, B can be updated using (4.9) or other rules. The following theorem is proved in [10].

THEOREM 4.3. Let Φ(y) be twice continuously differentiable and y* a solution of Φ(y) = 0. Assume that ∇Φ(y*)^{−1} exists. Then, if started close enough to y* and with the initial B close enough to ∇Φ(y*), Algorithm 4.3 does not fail and it generates a sequence that converges to y*. The convergence is superlinear. □

Although quasi-Newton methods have the advantage of avoiding the computation of ∇Φ(y), the initial B must be a good approximation of ∇Φ(y*) to have local convergence. Looking at Algorithms 4.1 to 4.3, we note that they have a similar structure. All of them define a step by the expression

    S d = −Φ(y),

where S ≡ I in the linear iterations, S is an approximation of ∇Φ(y) in quasi-Newton methods, or S ≡ ∇Φ(y) in Newton's method. The rate of convergence goes from linear to quadratic. We call iterations of this kind Newton-like algorithms.

5. Unconstrained Optimization

Let us consider the unconstrained optimization problem

    minimize f(x), x ∈ R^n.    (5.1)

According to Corollary 3.2, a local minimum x* is a solution of the system of equations

    ∇f(x) = 0.    (5.2)

In this section we show that the best known techniques for unconstrained optimization can be obtained by applying the Newton-like algorithms studied above to solve (5.2). Some favorable characteristics of

optimization problems, conveniently explored, lead to globally convergent algorithms.

This system is generally nonlinear in x, being linear only when f is quadratic. The Jacobian ∇²f(x) is symmetric. As a consequence of the second order optimality conditions, ∇²f(x) is positive definite, or at least positive semidefinite, at a local minimum. Following the techniques discussed in the previous section, (5.2) can be solved using Newton-like iterations. The change of x, called d, is now given by the expression

    S d = −∇f(x),    (5.3)

where S can be taken equal to the identity, to ∇²f(x) or to a quasi-Newton approximation of ∇²f(x). In general this procedure is not globally convergent. To get global convergence, each iterate must be nearer the solution than the previous one. In unconstrained optimization this happens if the function is reduced at each new point.

If S is positive definite, d is a descent direction of f at x. In effect, it follows from (5.3) that d^t ∇f(x) = −d^t S d. Then, d^t ∇f(x) < 0. This result means that d points towards the lower values of the function, but it does not imply that f(x + d) < f(x). Since d is a descent direction, we can find a new point x + td in a way to have a satisfactory reduction of f. In this case, d is called the Search Direction and the positive number t the Step Length. The procedure to find t is called the Line Search. Different Line Search Criteria can be adopted to decide whether the step length is adequate or not. As the search direction is zero only at a solution of (5.2), the line search cannot allow step lengths that are null or that go to zero. Otherwise, premature convergence to points that are not a solution would be obtained. The algorithm that follows is based on these ideas. It is globally convergent to points satisfying first order optimality conditions.

ALGORITHM 5.1. A Basic Algorithm for Unconstrained Optimization

Data. Initial x ∈ R^n and S ∈ R^{n×n} symmetric and positive definite.
Step 1. Computation of the search direction d. Solve the linear system for d:
    S d = −∇f(x).    (5.4)

Step 2. Line search. Find a step length t satisfying a given line search criterion.
Step 3. Update. Set
    x := x + td
and define a new S ∈ R^{n×n} symmetric and positive definite.
Step 4. Go back to Step 1. □

Particular versions of this algorithm can be obtained by choosing S and a line search procedure. The best alternative depends on the problem to be solved, the available information about f and the desired speed of convergence. Even though far from a local minimum ∇²f(x) is not necessarily positive definite, to get descent search directions S must be taken positive definite. Moreover, we make the following assumption about S.

ASSUMPTION 5.1. There exist positive numbers σ_1 and σ_2 such that, for any d ∈ R^n,

    σ_1 ‖d‖² ≤ d^t S d ≤ σ_2 ‖d‖².

5.1. ABOUT THE LINE SEARCH

At a given point x, once the search direction d is defined, f(x + td) becomes a function of the single variable t. The first idea is to walk along d until the minimum on that direction is reached; that is, to find t that minimizes f(x + td). This procedure is called Line Search by Exact Minimization, even though in practice the exact minimum can rarely be obtained. Exact minimization is done in an iterative way and is very costly. Modern algorithms include Inaccurate Line Search Criteria that also ensure global convergence [10], [32], [36]. The line search can then be completed in a small number of iterations. A very simple procedure is due to Armijo.

Armijo's Line Search. Define the step length t as the first number of the sequence {1, ν, ν², ν³, ...} satisfying

    f(x + td) ≤ f(x) + t η_1 ∇f^t(x) d,    (5.5)

where η_1 ∈ (0, 1) and ν ∈ (0, 1). □

Figure 6. (a) Armijo's search (left); (b) Wolfe's criterion (right)

Condition (5.5) on t ensures that the function is reduced at least η_1 times the reduction given by a linear function tangent to f at t = 0. In Figure 6.a we have p_1(t) = f(x) + t η_1 ∇f^t(x) d. The acceptable step is in [t_l, t_u]. It is easy to deduce that t ≥ inf(1, ν t_u).

Wolfe's inaccurate line search criterion also establishes bounds on the step length, by requiring a reduction of the function and, at the same time, a reduction of its directional derivative.

Wolfe's Criterion. Accept a step length t if (5.5) is true and

    ∇f^t(x + td) d ≥ η_2 ∇f^t(x) d,    (5.6)

where η_1 ∈ (0, 1/2) and η_2 ∈ (η_1, 1). □

This criterion is illustrated in Figure 6.b, where the slope is η_2 ∇f^t(x) d at t_l. Condition (5.5) defines an upper bound on the step length and (5.6) a lower bound. Then, a step t is too long if (5.5) is false and too short if (5.6) is false. A step length satisfying Wolfe's criterion can be obtained iteratively [32]. Given an initial t, if it is too short, extrapolations are done until a good or a too long step is obtained. If a too long step was already obtained, interpolations based on the longest short step and the shortest long step are done until the criterion is satisfied. Since the function and the directional derivative are evaluated at each new t, cubic interpolations of f can be done. As the criterion of acceptance is quite wide, the process generally requires very few iterations.
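Armijo's rule (5.5) amounts to a simple backtracking loop. The sketch below uses illustrative choices of ours, η_1 = 0.1 and ν = 0.5, on a quadratic with the steepest descent direction; the function names and test problem are not from the paper.

```python
# Armijo's line search (5.5): take the first t in {1, nu, nu^2, ...} with
#   f(x + t*d) <= f(x) + t * eta1 * grad_f(x)^t d.
# Illustrated on f(x) = x1^2 + 10*x2^2 with the steepest descent direction.
def f(x):       return x[0] ** 2 + 10.0 * x[1] ** 2
def grad_f(x):  return [2.0 * x[0], 20.0 * x[1]]

def armijo(f, x, d, slope, eta1=0.1, nu=0.5):
    # 'slope' is the directional derivative grad_f(x)^t d, negative for descent.
    t = 1.0
    while f([xi + t * di for xi, di in zip(x, d)]) > f(x) + t * eta1 * slope:
        t *= nu
    return t

x = [1.0, 1.0]
g = grad_f(x)
d = [-gi for gi in g]                       # steepest descent direction
slope = sum(gi * di for gi, di in zip(g, d))
t = armijo(f, x, d, slope)
x_new = [xi + t * di for xi, di in zip(x, d)]
print(t, f(x_new) < f(x))
```

Note that the full step t = 1 is rejected here several times before a sufficient decrease is found, which is exactly the safeguard (5.5) provides against overly long steps.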

The number of iterations required by Armijo's line search is usually greater than with Wolfe's criterion, but it is simple to code and it does not require the calculation of ∇f. Goldstein's Criterion, described in [36], also does not require ∇f. Very efficient line search algorithms can be obtained by combining polynomial interpolations with Goldstein's criterion.

5.2. CONSIDERATIONS ABOUT GLOBAL CONVERGENCE

Global convergence of Algorithm 5.1 can be proved if Assumption 5.1 is true and any of the previously discussed line search criteria is adopted. We are not going to present the proof in this paper. In the case of Armijo's search, this result is a particular case of Theorem 4.7 in [40].

5.3. FIRST ORDER ALGORITHMS

Taking S ≡ I in Algorithm 5.1, we have d = −∇f(x). That is, the search direction is opposite to the gradient. It is easy to prove that the downhill slope of f is maximum in the direction of d. One of the most widely known methods for unconstrained optimization is the Steepest Descent Algorithm, which includes an exact minimization in the line search.

Figure 7. Steepest Descent iterates

In Figure 7 we illustrate the iterative process to solve a two-dimensional problem described by some level lines f(x) = constant. Since the search directions are normal to the level line at the current point and ∇f^t(x + td) ∇f(x) = 0 at the minimizing step of the line search, each direction is orthogonal to the previous one. Modern first order algorithms include inaccurate line search procedures instead of the exact minimization. Although the number of iterations is not smaller, there is generally a reduction in the overall computer time.

Let r be the Condition Number of ∇²f(x*), defined as the ratio of the largest to the smallest eigenvalue. If the steepest descent algorithm

generates a sequence $\{x^k\}$ converging to $x^*$, then the sequence of objective values $f(x^k)$ converges linearly to $f(x^*)$ with a convergence ratio no greater than

$\theta = \left( \dfrac{r-1}{r+1} \right)^2 .$

This result is proved in refs. [19] and [36]. It implies that the convergence becomes slower as the conditioning of $\nabla^2 f(x^*)$ becomes worse.

NEWTON'S METHOD

Newton's algorithm is obtained by taking $S \equiv \nabla^2 f(x)$. To have descent search directions, $S$ must be positive definite. This is not necessarily true at any point, even though, according to Theorem 3.3, $\nabla^2 f(x)$ is positive definite at a strict local minimum. It follows from Theorem 4.2 that, if $f(x)$ is three times continuously differentiable at $x^*$ and there exists $K > 0$ such that $t_k = 1$ for $k > K$, then the convergence of Algorithm 5.1 with $S \equiv \nabla^2 f(x)$ is at least quadratic. When Armijo's line search or Wolfe's criterion is adopted, since $\nabla^2 f(x^*)$ is positive definite, it is easy to prove that a unit step length can be obtained near the solution. This is a requirement for quasi-Newton and Newton algorithms to have superlinear and quadratic convergence.

To make $S$ positive definite, Newton's method is modified by taking $S \equiv \nabla^2 f(x) + \epsilon I$, where $\epsilon > 0$ is large enough to satisfy Assumption 5.1 and $\epsilon \to 0$ [10], [19], [36]. Since the search direction is now a combination of the steepest descent and Newton's directions, the speed of convergence is not as good as Newton's method. This approach is known as the Levenberg - Marquardt method. The major difficulty is to determine an $\epsilon$ that is not too large, in a way that perturbs Newton's iteration as little as possible.

QUASI-NEWTON METHOD

We define $S \equiv B$, a quasi-Newton approximation matrix for $\nabla^2 f(x)$ [9]. Since the Hessian is symmetric, it is reasonable to generate $B$ symmetric. Then, given an initial symmetric $B$, we need symmetric updates. A Rank One Updating Rule adds to $B$ a rank one matrix of the form

$\Delta B = \sigma z z^t,$

where $\sigma$ is a number and $z$ a vector in $R^n$.
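The Levenberg - Marquardt modification discussed above can be sketched as follows. This is a minimal illustration in Python with NumPy; the helper name `lm_matrix`, the margin value and the test matrix are our own assumptions, not from the paper:

```python
import numpy as np

def lm_matrix(H, margin=1e-3):
    """Return (S, eps) with S = H + eps * I positive definite, choosing
    eps only as large as needed so that Newton's matrix H is perturbed
    as little as possible."""
    lam_min = np.linalg.eigvalsh(H).min()   # H symmetric: real eigenvalues
    eps = 0.0 if lam_min > margin else margin - lam_min
    return H + eps * np.eye(H.shape[0]), eps

# an indefinite Hessian (eigenvalues 2 and -1) made positive definite
H = np.array([[2.0, 0.0], [0.0, -1.0]])
S, eps = lm_matrix(H)
```

With $S$ positive definite, $d = -S^{-1}\nabla f(x)$ is a descent direction, and as `eps` goes to zero the direction tends to Newton's.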
By imposing (4.8) the following rule is obtained [36],

$B^{k+1} = B^k + \dfrac{(\gamma - B^k \delta)(\gamma - B^k \delta)^t}{\delta^t(\gamma - B^k \delta)},$   (5.7)

where now it is $\delta = x^{k+1} - x^k$ and $\gamma = \nabla f(x^{k+1}) - \nabla f(x^k)$.

To obtain descent directions in Algorithm 5.1, $B$ is required to be positive definite. If $B^k$ is positive definite, we need $\delta^t(\gamma - B^k \delta) > 0$ to have

$B^{k+1}$ positive definite. Unfortunately this is not always true. Several Rank Two Updating Rules overcome this problem. This is the case of the Broyden - Fletcher - Goldfarb - Shanno (BFGS) formula

$B^{k+1} = B^k + \dfrac{\gamma \gamma^t}{\delta^t \gamma} - \dfrac{B^k \delta \delta^t B^k}{\delta^t B^k \delta}.$   (5.8)

It can be proved that, if $B^k$ is positive definite, then

$\delta^t \gamma > 0$   (5.9)

is a sufficient condition to have $B^{k+1}$ positive definite. It can be easily shown that this condition is automatically satisfied if the line search in Algorithm 5.1 is an exact minimization or if Wolfe's criterion is adopted. As a consequence of Theorem 4.3, the convergence is superlinear. A unit step length near the solution is also required.

6. Equality Constrained Optimization

The techniques for unconstrained optimization studied above can be extended to the problem

minimize $f(x)$ subject to $h(x) = 0.$   (6.1)

The pair $(x^*, \lambda^*)$, satisfying first order optimality conditions, is obtained by solving with Newton-like iterations the nonlinear system

$\nabla f(x) + \nabla h(x)\lambda = 0$   (6.2)
$h(x) = 0,$   (6.3)

where the unknowns $x$ and $\lambda$ are called Primal and Dual Variables respectively. To have a unique $\lambda$ in (6.2), $x$ must be a regular point of the problem. Due to this fact, the algorithms that solve first order optimality conditions require the assumption that all the iterates $x^k$ are regular points of the problem. Denoting $y \equiv (x, \lambda)$, we have

$\Phi(y) \equiv \begin{bmatrix} \nabla f(x) + \nabla h(x)\lambda \\ h(x) \end{bmatrix}$   (6.4)

and

$\nabla \Phi^t(y) = \begin{bmatrix} H(x,\lambda) & \nabla h(x) \\ \nabla h^t(x) & 0 \end{bmatrix},$   (6.5)

where

$H(x,\lambda) = \nabla^2 f(x) + \sum_{i=1}^{p} \lambda_i \nabla^2 h_i(x)$

is the Hessian of the Lagrangian defined in Theorem 3.8. A Newton iteration that starts at $(x^k, \lambda^k)$ and gives a new estimate $(x^{k+1}, \lambda^{k+1})$ is then stated as follows:

$\begin{bmatrix} H(x^k,\lambda^k) & \nabla h(x^k) \\ \nabla h^t(x^k) & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} - x^k \\ \lambda^{k+1} - \lambda^k \end{bmatrix} = - \begin{bmatrix} \nabla f(x^k) + \nabla h(x^k)\lambda^k \\ h(x^k) \end{bmatrix}.$   (6.6)

As the system (6.4) is linear in $\lambda$, when $x^*$ is known Newton's method gives $\lambda^*$ in one iteration. In fact, taking $x^k = x^*$ in (6.6), we have that $(x^{k+1}, \lambda^{k+1}) = (x^*, \lambda^*)$ for any $\lambda^k$. We conclude that $\lambda^{k+1} \to \lambda^*$ when $\{x^k\} \to x^*$. This remark suggests that a line search concerning only the primal variables $x$ is enough to get global convergence in $(x, \lambda)$.

In a similar way as in Section 4, $\nabla \Phi$ can be substituted by a quasi-Newton approximation or by the identity matrix. Since, in the present problem, we need $\nabla h$ anyway to evaluate $\Phi$, it seems more efficient to substitute only $H(x^k, \lambda^k)$ in $\nabla \Phi$ by its quasi-Newton approximation or by the identity. Calling $S^k$ this matrix and $d^k$ the change in $x$, we have the linear system of equations

$S^k d^k + \nabla h(x^k) \lambda^{k+1} = -\nabla f(x^k)$   (6.7)
$\nabla h^t(x^k) d^k = -h(x^k)$   (6.8)

that gives $d^k$, a search direction in $x$, and $\lambda^{k+1}$, a new estimate of $\lambda$. If $x^k$ is a regular point and $S^k$ is positive definite, then it can be proved that the solution of (6.7), (6.8) is unique.

In unconstrained optimization we know that a minimum is approached as the objective function decreases. This is not always true in constrained minimization, since an increase of the function can be necessary to obtain feasibility. Then, an appropriate objective for the line search is needed. With this purpose we define the auxiliary function

$\phi(x, r) = f(x) + \sum_{i=1}^{p} r_i |h_i(x)|,$   (6.9)

where the $r_i$ are positive constants. It can be shown that $\phi(x, r)$ is an Exact Penalty Function of the equality constrained problem if the $r_i$ are large enough [36].
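The linear system (6.7), (6.8) can be assembled and solved as one symmetric indefinite system in $(d, \lambda)$. A minimal sketch in Python with NumPy; the function name and the illustrative problem data are our own assumptions:

```python
import numpy as np

def direction_and_multipliers(S, grad_f, jac_h, h):
    """Solve  S d + jac_h @ lam = -grad_f  and  jac_h.T @ d = -h,
    i.e. system (6.7)-(6.8), via one (n+p) x (n+p) linear solve.
    Columns of jac_h are the constraint gradients grad h_i(x)."""
    n, p = S.shape[0], h.size
    K = np.block([[S, jac_h], [jac_h.T, np.zeros((p, p))]])
    rhs = -np.concatenate([grad_f, h])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]

# illustrative data: min 0.5||x||^2  s.t.  x0 + x1 - 1 = 0, at x = (0, 0)
S = np.eye(2)                      # identity in place of a quasi-Newton matrix
grad_f = np.zeros(2)
jac_h = np.array([[1.0], [1.0]])
h = np.array([-1.0])
d, lam = direction_and_multipliers(S, grad_f, jac_h, h)
```

Here the step $d = (0.5, 0.5)$ restores feasibility exactly, as (6.8) demands for linear constraints.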
In other words, there exists a finite $\bar{r}$ such that, for $r_i \ge \bar{r}_i$, the unconstrained minimum of $\phi(x, r)$ occurs at the solution of the problem (6.1). When compared with other penalty functions, $\phi$ is numerically very

advantageous since it does not require penalty parameters going to infinity [35]. However, $\phi$ is not differentiable at points on the constraints, requiring nonsmooth optimization techniques [32]. Denoting by $SG(h)$ a diagonal matrix such that $SG_{ii}(h) \equiv sg(h_i)$, where $sg(\xi) = \xi/|\xi|$ and $sg(0) = 0$, we can write

$\phi(x, r) = f(x) + r^t SG[h(x)]\, h(x).$   (6.10)

Suppose now that $x^k$ is a regular point, $S$ is positive definite and

$r_i \ge |\lambda_i^{k+1}|, \quad i = 1, 2, ..., p.$   (6.11)

Then $d^k$, given by (6.7) and (6.8), is a descent direction of $\phi(x, r)$ at $x^k$. In effect, there exists $\bar{\tau} > 0$ such that $sg[h_i(x^k + t d^k)]$ doesn't change for any $t \in (0, \bar{\tau}]$. Calling $\Delta\phi(x^k, t d^k) \equiv \phi(x^k + t d^k, r) - \phi(x^k, r)$, we have

$\Delta\phi(x^k, t d^k) = t\, d^{kt} \nabla f(x^k) + t\, d^{kt} \nabla h(x^k) SG[h(x^k + t d^k)]\, r + o(t)$

for any $t \in (0, \bar{\tau}]$, where $o(t) \to 0$ faster than $t$. As a consequence of (6.7) and (6.8),

$d^{kt} \nabla f(x^k) = -d^{kt} S^k d^k + h^t(x^k) \lambda^{k+1}.$   (6.12)

Then, considering (6.8) and (6.12), we get

$\Delta\phi(x^k, t d^k) = -t\, d^{kt} S^k d^k + t\, h^t(x^k) \{\lambda^{k+1} - SG[h(x^k + t d^k)]\, r\} + o(t)$

and there exists $\tau \in (0, \bar{\tau}]$ such that $\phi(x^k + t d^k, r) < \phi(x^k, r)$ for any $t \in (0, \tau)$, which proves the assertion above.

The following globally convergent algorithm takes $\phi(x, r)$ as the objective of a line search along $d^k$.

ALGORITHM 6.1. A Basic Algorithm for Equality Constrained Optimization

Data. Initial $x \in R^n$, $S \in R^{n \times n}$ symmetric and positive definite, and $r \in R^p$, $r > 0$.

Step 1. Computation of the search direction $d$ and an estimate $\lambda$ of the Lagrange multipliers. Solve the linear system in $(d, \lambda)$

$S d + \nabla h(x) \lambda = -\nabla f(x)$   (6.13)
$\nabla h^t(x) d = -h(x)$   (6.14)

Step 2. Line search.
i) If $r_i \le |\lambda_i|$, then set $r_i > |\lambda_i|$, for $i = 1, 2, ..., p$.
ii) Find a step length $t$ satisfying a given line search criterion on the auxiliary function

$\phi(x, r) = f(x) + r^t SG[h(x)]\, h(x).$

Step 3. Update. Set

$x := x + t d$

and define a new $S \in R^{n \times n}$ symmetric and positive definite.

Step 4. Go back to Step 1. $\Box$

Assumption 5.1 is also adopted in this algorithm. The same line search procedures as in Algorithm 5.1 for unconstrained optimization can be employed. However, some precautions must be taken here since $\phi$ is nonsmooth [32].

FIRST ORDER METHOD

First order algorithms are obtained by taking $S \equiv I$ in Algorithm 6.1. In the case when the constraints are linear, a natural extension of gradient methods to equality constrained optimization consists in taking an initial point on the constraints and a search direction obtained by projecting $-\nabla f(x^k)$ on the constraints. This direction is known as the Projected Gradient Direction [36], [43] and denoted here by $d$. A new point on the constraints is then obtained. We have that $-\nabla f(x)$ can be written as the sum of its projection on the tangent space and a vector orthogonal to all the constraints. Then, if the constraints are regular, there exists $\lambda \in R^p$ such that

$-\nabla f(x) = d + \nabla h(x) \lambda$   (6.15)

and, since $d$ is orthogonal to the set $\{\nabla h_1, \nabla h_2, ..., \nabla h_p\}$, we have

$\nabla h^t(x) d = 0.$   (6.16)

It can be concluded that, taking $S \equiv I$ in Algorithm 6.1, the search direction at points where $h(x) = 0$ becomes the Projected Gradient.
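For linear constraints, (6.15) and (6.16) determine the projected gradient in closed form: $\lambda$ solves the normal equations and $d$ is the residual. A minimal sketch in Python with NumPy; the function name and the data are illustrative assumptions:

```python
import numpy as np

def projected_gradient(grad_f, jac_h):
    """Split -grad_f = d + jac_h @ lam with jac_h.T @ d = 0, as in
    (6.15)-(6.16): lam solves the normal equations and d is the
    projection of -grad_f onto the tangent space of the constraints."""
    lam = np.linalg.solve(jac_h.T @ jac_h, -jac_h.T @ grad_f)
    d = -grad_f - jac_h @ lam
    return d, lam

grad_f = np.array([1.0, 0.0])      # illustrative gradient
jac_h = np.array([[1.0], [1.0]])   # one linear constraint: x0 + x1 = const
d, lam = projected_gradient(grad_f, jac_h)
```

The resulting direction satisfies $\nabla h^t(x) d = 0$, so a step along it stays on the (linear) constraint surface.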

The results about the speed of convergence of gradient algorithms for unconstrained optimization are easily extended to the projected gradient method. The convergence is linear, with a convergence ratio no greater than $\theta = \left( \frac{r-1}{r+1} \right)^2$, where now $r$ is the ratio of the largest to the smallest eigenvalue of $\nabla^2 f(x^*)$ as measured on the constraints. This is not surprising, since the iterates "walk" on the constraints.

The Projected Gradient Method was extended by Rosen [20], [44] to problems with nonlinear constraints. Given a point $x^k$ on the constraints, a better point $\bar{x}^k$ on the projected gradient is obtained. Since the projected gradient no longer follows the constraint surface, $\bar{x}^k$ is generally infeasible. The new iterate $x^{k+1}$ is the projection of $\bar{x}^k$ on the constraints. This projection is done in an iterative way, which is very costly. When the constraints are nonlinear, it is the ratio of the largest to the smallest eigenvalue of the Hessian of the Lagrangian $H(x^*, \lambda^*)$, as measured on the tangent space, that determines the speed of convergence [36]. The effect on the rate of convergence due to the curvature of the constraints is included in $H(x^*, \lambda^*)$.

Algorithm 6.1, when applied to nonlinearly constrained problems, is much more efficient than the projected gradient method. Since the iterates are not required to be feasible, the projection stage is avoided.

NEWTON'S METHOD

Taking $S \equiv H(x^k, \lambda^k)$ we have a Newton equality constrained optimization algorithm. Even though in the majority of applications the computation of the second derivatives of $f$ and $h$ is very expensive, in several problems they are available or easy to obtain. The Hessian of the Lagrangian function, $H(x, \lambda)$, is not necessarily positive definite. Theorem 3.9 only ensures that $H(x^*, \lambda^*)$ is positive definite on the tangent space.
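The condition stated by Theorem 3.9 can be checked numerically through the reduced Hessian $Z^t H Z$, where the columns of $Z$ span the tangent space. A sketch in Python with NumPy; the function name and the data are illustrative assumptions, not the paper's:

```python
import numpy as np

def reduced_hessian_pd(H, jac_h):
    """Check positive definiteness of H on the tangent space
    {d : jac_h.T @ d = 0} via the reduced Hessian Z' H Z, with Z an
    orthonormal null-space basis of jac_h.T taken from a full QR
    factorization of jac_h."""
    n, p = jac_h.shape
    Q, _ = np.linalg.qr(jac_h, mode='complete')
    Z = Q[:, p:]                 # columns orthogonal to the constraint gradients
    return np.linalg.eigvalsh(Z.T @ H @ Z).min() > 0

# an indefinite H that is positive definite on the tangent space of x0 = const
H = np.array([[-1.0, 0.0], [0.0, 2.0]])
jac_h = np.array([[1.0], [0.0]])
ok = reduced_hessian_pd(H, jac_h)
```

In this example $H$ itself is indefinite, yet the reduced Hessian on the tangent space is positive, which is all that Theorem 3.9 guarantees at a solution.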
To have a positive definite $S$, a procedure similar to the Levenberg - Marquardt method for unconstrained optimization can be employed.

QUASI-NEWTON METHOD

In this class of algorithms, $S$ is defined as a quasi-Newton approximation matrix of the Hessian of the Lagrangian $H(x^*, \lambda^*)$, which we call $B$. In principle, $B$ can be obtained using the same updating rules as in unconstrained optimization, but taking $\nabla_x l(x, \lambda)$ instead of $\nabla f(x)$. As $H(x, \lambda)$ is not necessarily positive definite at $(x^*, \lambda^*)$, it is not always possible to get $B$ positive definite. To overcome this difficulty, Powell [42]

proposed a modification of the BFGS updating rule that takes

$\delta = x^{k+1} - x^k$

and

$\gamma = \nabla_x l(x^{k+1}, \lambda^k) - \nabla_x l(x^k, \lambda^k).$

If

$\delta^t \gamma < 0.2\, \delta^t B \delta,$

then

$\varphi = \dfrac{0.8\, \delta^t B \delta}{\delta^t B \delta - \delta^t \gamma}$

is computed and

$\gamma := \varphi \gamma + (1 - \varphi) B \delta$

is taken. Finally, the BFGS updating rule (5.8) is employed.

ABOUT THE SPEED OF CONVERGENCE

Since the techniques studied in this section are based on iterative algorithms for nonlinear systems, $(x^k, \lambda^k)$ converges to $(x^*, \lambda^*)$ at the same speed as the original algorithm. Then, the convergence of $(x^k, \lambda^k)$ is quadratic, superlinear or linear, depending on whether a Newton, a quasi-Newton or a first order algorithm for equality constrained optimization is employed.

In practice, we are more interested in the speed of convergence of $x^k$ than of $(x^k, \lambda^k)$. Several authors studied this point, in particular Powell [41], Gabay [16], Hoyer [33] and Gilbert [18]. Basically the same results are obtained, but in two steps; i.e., Powell proved that quasi-Newton algorithms are two-step superlinearly convergent. That is,

$\lim_{k \to \infty} \dfrac{\| x^{k+2} - x^* \|}{\| x^k - x^* \|} = 0.$

As in unconstrained optimization, when applying Newton or quasi-Newton algorithms, the step length must be the unity near the solution. There are examples showing that, taking $t = 1$ in Step 3 of Algorithm 6.1, it is not always possible to get a reduction of $\phi$ near the solution. This is known as Maratos' effect [37]. Several researchers have been looking for methods to avoid Maratos' effect [18], [24], [27].

7. Sequential Quadratic Programming Method

To extend the ideas presented above in a way to solve the general nonlinear programming problem (2.1), one major difficulty has to be overcome. While

in unconstrained or in equality constrained optimization a point satisfying the first order necessary optimality condition can be obtained by solving a system of equations, problems including inequality constraints require the solution of the system of equations and inequations (3.5) - (3.9). That is, a solution of the system of equations (3.5) - (3.7) that satisfies the inequalities (3.8) and (3.9) has to be found.

Sequential Quadratic Programming (SQP), at the moment the most widely employed method for nonlinear constrained optimization, is a quasi-Newton technique based on an idea proposed by Wilson [50] in 1963 and later interpreted by Beale [8]. A Quadratic Program is a class of constrained optimization problems such that the objective is a convex quadratic function and the constraints are linear. Efficient techniques to solve this problem are available, even when inequality constraints are included. The exact solution is obtained after a finite number of iterations [36].

To explain Wilson's idea, we take $x$ constant and consider the following quadratic programming problem that has $d \in R^n$ as unknown,

minimize $\frac{1}{2} d^t S d + \nabla f^t(x) d$
subject to $\nabla h^t(x) d + h(x) = 0.$   (7.1)

Since (7.1) is a convex problem, the global minimum satisfies the Karush - Kuhn - Tucker optimality conditions

$S d + \nabla f(x) + \nabla h(x) \lambda = 0$   (7.2)
$\nabla h^t(x) d + h(x) = 0,$   (7.3)

where $\lambda$ are the Lagrange multipliers. Then, $(d, \lambda)$ in Algorithm 6.1 can be obtained by solving the quadratic program (7.1) instead of the linear system (6.13), (6.14). Based on this fact, to solve the problem

minimize $f(x)$
subject to $g(x) \le 0$ and $h(x) = 0,$   (7.4)

Wilson proposed to define the search direction $d$ and new estimates $\mu$ and $\lambda$ of the Lagrange multipliers by solving at each iteration

minimize $\frac{1}{2} d^t S d + \nabla f^t(x) d$
subject to $\nabla g^t(x) d + g(x) \le 0$ and $\nabla h^t(x) d + h(x) = 0.$   (7.5)

Wilson's is a Newton algorithm.
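To make the quadratic subproblem concrete, the following sketch solves a tiny instance of (7.5) with inequality constraints only, by brute-force enumeration of active sets: each candidate active set yields a KKT equality system, and the winner is the feasible candidate with nonnegative multipliers. This is purely illustrative (Python with NumPy, names and data are our own); production QP codes use active-set or interior point techniques instead:

```python
import numpy as np
from itertools import combinations

def solve_qp(S, c, A, b):
    """min 0.5 d'Sd + c'd  s.t.  A d + b <= 0 (rows of A are grad g_i').
    Brute force over active sets -- affordable only for a handful of
    constraints. Returns the minimizer d and multipliers mu."""
    m, n = A.shape
    best = None
    for k in range(m + 1):
        for W in map(list, combinations(range(m), k)):
            if k == 0:
                d, mu_w = np.linalg.solve(S, -c), np.zeros(0)
            else:
                Aw = A[W]
                K = np.block([[S, Aw.T], [Aw, np.zeros((k, k))]])
                try:
                    sol = np.linalg.solve(K, -np.concatenate([c, b[W]]))
                except np.linalg.LinAlgError:
                    continue
                d, mu_w = sol[:n], sol[n:]
            # keep candidates with mu >= 0 that satisfy all constraints
            if (mu_w >= -1e-12).all() and (A @ d + b <= 1e-12).all():
                obj = 0.5 * d @ S @ d + c @ d
                if best is None or obj < best[0]:
                    mu = np.zeros(m)
                    mu[W] = mu_w
                    best = (obj, d, mu)
    return best[1], best[2]

# illustrative subproblem data at some x: S = I, c = grad f(x) = (-2, 0),
# one linearized constraint  d0 + g(x) <= 0  with g(x) = -1
S, c = np.eye(2), np.array([-2.0, 0.0])
A, b = np.array([[1.0, 0.0]]), np.array([-1.0])
d, mu = solve_qp(S, c, A, b)
```

Here the unconstrained minimizer $(2, 0)$ violates the linearized constraint, so the constraint becomes active and the solver returns $d = (1, 0)$ with multiplier $\mu = 1$, satisfying conditions of the type (7.2), (7.3).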
Garcia Palomares and Mangasarian later proposed a quasi-Newton technique [17], Han obtained a globally convergent algorithm [21] and Powell proved superlinear convergence [41].

The exact penalty function

$\phi(x, s, r) = f(x) + \sum_{i=1}^{m} s_i \sup[0, g_i(x)] + \sum_{i=1}^{p} r_i |h_i(x)|$   (7.6)

is taken as the objective of the line search. If $r$ satisfies (6.11) and

$s_i \ge \mu_i, \quad i = 1, 2, ..., m,$   (7.7)

then the $d$ that solves (7.5) is a descent direction of $\phi(x, s, r)$. This result is proved in [36].

In Sequential Quadratic Programming algorithms, the matrix $S$ is defined as a quasi-Newton approximation of the Hessian of the Lagrangian. Most of the optimizers employ the BFGS rule modified by Powell, as explained in the last section. The SQP algorithm can be stated as follows.

ALGORITHM 7.1. Sequential Quadratic Programming

Parameters. $r \in R^p$ and $s \in R^m$ positive.

Data. Initialize $x \in R^n$ and $B \in R^{n \times n}$ symmetric and positive definite.

Step 1. Computation of the search direction $d$ and estimates $\mu$ and $\lambda$ of the Lagrange multipliers. Solve the quadratic program for $d$,

minimize $\frac{1}{2} d^t B d + \nabla f^t(x) d$
subject to $\nabla g^t(x) d + g(x) \le 0$ and $\nabla h^t(x) d + h(x) = 0.$   (7.8)

Step 2. Line search.
i) If $r_i \le |\lambda_i|$, then set $r_i > |\lambda_i|$, for $i = 1, 2, ..., p$.
ii) If $s_i \le \mu_i$, then set $s_i > \mu_i$, for $i = 1, 2, ..., m$.
iii) Find a step length $t$ satisfying a given line search criterion on the auxiliary function

$\phi(x, s, r) = f(x) + \sum_{i=1}^{m} s_i \sup[0, g_i(x)] + \sum_{i=1}^{p} r_i |h_i(x)|.$

Step 3. Updates. Let

$\delta = t d$ and $\gamma = \nabla_x l(x + t d, \lambda, \mu) - \nabla_x l(x, \lambda, \mu).$

i) If $\delta^t \gamma < 0.2\, \delta^t B \delta$, then compute

$\varphi = \dfrac{0.8\, \delta^t B \delta}{\delta^t B \delta - \delta^t \gamma}$

and set $\gamma := \varphi \gamma + (1 - \varphi) B \delta$.

ii) Set

$B := B + \dfrac{\gamma \gamma^t}{\delta^t \gamma} - \dfrac{B \delta \delta^t B}{\delta^t B \delta}$

and

$x := x + t d.$

Step 4. Go back to Step 1. $\Box$

This algorithm generates sequences that are globally convergent to Karush - Kuhn - Tucker points of the problem. However, it fails at points where the quadratic program has no solution. In effect, since the constraints of the quadratic program solved in Step 1 are linear approximations of the constraints of the original problem, the feasible region may be empty. The asymptotic speed of convergence has properties similar to those of quasi-Newton algorithms for equality constrained optimization, and Maratos' effect can also occur.

A FEASIBLE DIRECTIONS ALGORITHM

Feasible directions algorithms are an important class of methods for solving constrained optimization problems. At each iteration, the search direction is a feasible direction of the inequality constraints and, at the same time, a descent direction of the objective or of another appropriate function. A constrained line search is then performed to obtain a satisfactory reduction of the function without losing feasibility.

The fact of giving feasible points makes feasible directions algorithms very efficient in engineering design, where function evaluations are in general very expensive. Since any intermediate design can be employed, the iterations can be stopped when the cost reduction per iteration becomes small enough. There are also several examples that deal with an objective function, or constraints, that are not defined at infeasible points. This is the case of size

and shape constraints in structural optimization. When applying feasible directions algorithms to real time problems, as feasibility is maintained and the cost reduced, the controls can be activated at each iteration.

In what follows we describe an SQP feasible directions algorithm based on a technique presented by Herskovits in [26] and by Herskovits and Carvalho in [27]. By solving the quadratic program (7.5), this algorithm first defines $(d_0, \mu_0)$, where $d_0$ is a descent direction in the primal space of $l(x, \mu_0)$. However, $d_0$ is not necessarily a feasible direction since, if an inequality constraint of problem (7.4) is active and the corresponding constraint in (7.5) is also active, then $d_0$ is tangent to the feasible set. In a second stage, the algorithm obtains a feasible and descent search direction $d$ by solving a modified quadratic program with equality constraints only.

To present the method, we consider the inequality constrained optimization problem

minimize $f(x)$ subject to $g(x) \le 0,$   (7.9)

whose feasible set is $\Omega = \{x \in R^n \,/\, g(x) \le 0\}$, and introduce the following definition:

DEFINITION 7.1. A vector field $d(x)$ defined on $\Omega$ is said to be a uniformly feasible directions field of the problem (7.9) if there exists a step length $\tau > 0$ such that $x + t d(x) \in \Omega$ for all $t \in [0, \tau]$. $\Box$

This condition is much stronger than the simple feasibility of $d(x)$ for any $x \in \Omega$. When $d(x)$ constitutes a uniformly feasible directions field, it supports a feasible segment $[x, x + \theta(x) d(x)]$ such that $\theta(x)$ is bounded below in $\Omega$ by $\tau > 0$. As a consequence of the feasibility requirement, the search directions of feasible directions algorithms must constitute a uniformly feasible directions field. Otherwise, the step length may go to zero, forcing convergence to points which are not KKT points.

We state now the algorithm.

ALGORITHM 7.2. SQP Feasible Directions Algorithm

Parameters. $\alpha \in (0, 1)$ and $\varphi > 0$.

Data. Initialize $x \in R^n$ feasible and $B \in R^{n \times n}$ symmetric and positive definite.


More information

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems Outline Scientific Computing: An Introductory Survey Chapter 6 Optimization 1 Prof. Michael. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen

More information

Optimization and Root Finding. Kurt Hornik

Optimization and Root Finding. Kurt Hornik Optimization and Root Finding Kurt Hornik Basics Root finding and unconstrained smooth optimization are closely related: Solving ƒ () = 0 can be accomplished via minimizing ƒ () 2 Slide 2 Basics Root finding

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Quasi Newton Methods Barnabás Póczos & Ryan Tibshirani Quasi Newton Methods 2 Outline Modified Newton Method Rank one correction of the inverse Rank two correction of the

More information

IE 5531: Engineering Optimization I

IE 5531: Engineering Optimization I IE 5531: Engineering Optimization I Lecture 15: Nonlinear optimization Prof. John Gunnar Carlsson November 1, 2010 Prof. John Gunnar Carlsson IE 5531: Engineering Optimization I November 1, 2010 1 / 24

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

IE 5531: Engineering Optimization I

IE 5531: Engineering Optimization I IE 5531: Engineering Optimization I Lecture 19: Midterm 2 Review Prof. John Gunnar Carlsson November 22, 2010 Prof. John Gunnar Carlsson IE 5531: Engineering Optimization I November 22, 2010 1 / 34 Administrivia

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

GRADIENT = STEEPEST DESCENT

GRADIENT = STEEPEST DESCENT GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT

More information

Lecture V. Numerical Optimization

Lecture V. Numerical Optimization Lecture V Numerical Optimization Gianluca Violante New York University Quantitative Macroeconomics G. Violante, Numerical Optimization p. 1 /19 Isomorphism I We describe minimization problems: to maximize

More information

Constrained Optimization

Constrained Optimization Constrained Optimization Joshua Wilde, revised by Isabel Tecu, Takeshi Suzuki and María José Boccardi August 13, 2013 1 General Problem Consider the following general constrained optimization problem:

More information

8 Barrier Methods for Constrained Optimization

8 Barrier Methods for Constrained Optimization IOE 519: NL, Winter 2012 c Marina A. Epelman 55 8 Barrier Methods for Constrained Optimization In this subsection, we will restrict our attention to instances of constrained problem () that have inequality

More information

Some new facts about sequential quadratic programming methods employing second derivatives

Some new facts about sequential quadratic programming methods employing second derivatives To appear in Optimization Methods and Software Vol. 00, No. 00, Month 20XX, 1 24 Some new facts about sequential quadratic programming methods employing second derivatives A.F. Izmailov a and M.V. Solodov

More information

Penalty and Barrier Methods General classical constrained minimization problem minimize f(x) subject to g(x) 0 h(x) =0 Penalty methods are motivated by the desire to use unconstrained optimization techniques

More information

Constrained optimization

Constrained optimization Constrained optimization In general, the formulation of constrained optimization is as follows minj(w), subject to H i (w) = 0, i = 1,..., k. where J is the cost function and H i are the constraints. Lagrange

More information

Chapter 4. Unconstrained optimization

Chapter 4. Unconstrained optimization Chapter 4. Unconstrained optimization Version: 28-10-2012 Material: (for details see) Chapter 11 in [FKS] (pp.251-276) A reference e.g. L.11.2 refers to the corresponding Lemma in the book [FKS] PDF-file

More information

Nonlinear Programming (Hillier, Lieberman Chapter 13) CHEM-E7155 Production Planning and Control

Nonlinear Programming (Hillier, Lieberman Chapter 13) CHEM-E7155 Production Planning and Control Nonlinear Programming (Hillier, Lieberman Chapter 13) CHEM-E7155 Production Planning and Control 19/4/2012 Lecture content Problem formulation and sample examples (ch 13.1) Theoretical background Graphical

More information

Optimization 2. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng

Optimization 2. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng Optimization 2 CS5240 Theoretical Foundations in Multimedia Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (NUS) Optimization 2 1 / 38

More information

390 Chapter 10. Survey of Descent Based Methods special attention must be given to the surfaces of non-dierentiability, it becomes very important toco

390 Chapter 10. Survey of Descent Based Methods special attention must be given to the surfaces of non-dierentiability, it becomes very important toco Chapter 10 SURVEY OF DESCENT BASED METHODS FOR UNCONSTRAINED AND LINEARLY CONSTRAINED MINIMIZATION Nonlinear Programming Problems Eventhough the title \Nonlinear Programming" may convey the impression

More information

Lectures 9 and 10: Constrained optimization problems and their optimality conditions

Lectures 9 and 10: Constrained optimization problems and their optimality conditions Lectures 9 and 10: Constrained optimization problems and their optimality conditions Coralia Cartis, Mathematical Institute, University of Oxford C6.2/B2: Continuous Optimization Lectures 9 and 10: Constrained

More information

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study International Journal of Mathematics And Its Applications Vol.2 No.4 (2014), pp.47-56. ISSN: 2347-1557(online) Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms:

More information

CONVERGENCE ANALYSIS OF AN INTERIOR-POINT METHOD FOR NONCONVEX NONLINEAR PROGRAMMING

CONVERGENCE ANALYSIS OF AN INTERIOR-POINT METHOD FOR NONCONVEX NONLINEAR PROGRAMMING CONVERGENCE ANALYSIS OF AN INTERIOR-POINT METHOD FOR NONCONVEX NONLINEAR PROGRAMMING HANDE Y. BENSON, ARUN SEN, AND DAVID F. SHANNO Abstract. In this paper, we present global and local convergence results

More information

2.098/6.255/ Optimization Methods Practice True/False Questions

2.098/6.255/ Optimization Methods Practice True/False Questions 2.098/6.255/15.093 Optimization Methods Practice True/False Questions December 11, 2009 Part I For each one of the statements below, state whether it is true or false. Include a 1-3 line supporting sentence

More information

1 Introduction Sequential Quadratic Programming (SQP) methods have proved to be very ecient for solving medium-size nonlinear programming problems [12

1 Introduction Sequential Quadratic Programming (SQP) methods have proved to be very ecient for solving medium-size nonlinear programming problems [12 A Trust Region Method Based on Interior Point Techniques for Nonlinear Programming Richard H. Byrd Jean Charles Gilbert y Jorge Nocedal z August 10, 1998 Abstract An algorithm for minimizing a nonlinear

More information

Statistics 580 Optimization Methods

Statistics 580 Optimization Methods Statistics 580 Optimization Methods Introduction Let fx be a given real-valued function on R p. The general optimization problem is to find an x ɛ R p at which fx attain a maximum or a minimum. It is of

More information

Inequality Constraints

Inequality Constraints Chapter 2 Inequality Constraints 2.1 Optimality Conditions Early in multivariate calculus we learn the significance of differentiability in finding minimizers. In this section we begin our study of the

More information

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell DAMTP 2014/NA02 On fast trust region methods for quadratic models with linear constraints M.J.D. Powell Abstract: Quadratic models Q k (x), x R n, of the objective function F (x), x R n, are used by many

More information

1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

Numerical Optimization of Partial Differential Equations

Numerical Optimization of Partial Differential Equations Numerical Optimization of Partial Differential Equations Part I: basic optimization concepts in R n Bartosz Protas Department of Mathematics & Statistics McMaster University, Hamilton, Ontario, Canada

More information

Lagrange Multipliers

Lagrange Multipliers Lagrange Multipliers (Com S 477/577 Notes) Yan-Bin Jia Nov 9, 2017 1 Introduction We turn now to the study of minimization with constraints. More specifically, we will tackle the following problem: minimize

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

ISM206 Lecture Optimization of Nonlinear Objective with Linear Constraints

ISM206 Lecture Optimization of Nonlinear Objective with Linear Constraints ISM206 Lecture Optimization of Nonlinear Objective with Linear Constraints Instructor: Prof. Kevin Ross Scribe: Nitish John October 18, 2011 1 The Basic Goal The main idea is to transform a given constrained

More information

1. Introduction Let the least value of an objective function F (x), x2r n, be required, where F (x) can be calculated for any vector of variables x2r

1. Introduction Let the least value of an objective function F (x), x2r n, be required, where F (x) can be calculated for any vector of variables x2r DAMTP 2002/NA08 Least Frobenius norm updating of quadratic models that satisfy interpolation conditions 1 M.J.D. Powell Abstract: Quadratic models of objective functions are highly useful in many optimization

More information

NONLINEAR. (Hillier & Lieberman Introduction to Operations Research, 8 th edition)

NONLINEAR. (Hillier & Lieberman Introduction to Operations Research, 8 th edition) NONLINEAR PROGRAMMING (Hillier & Lieberman Introduction to Operations Research, 8 th edition) Nonlinear Programming g Linear programming has a fundamental role in OR. In linear programming all its functions

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

In view of (31), the second of these is equal to the identity I on E m, while this, in view of (30), implies that the first can be written

In view of (31), the second of these is equal to the identity I on E m, while this, in view of (30), implies that the first can be written 11.8 Inequality Constraints 341 Because by assumption x is a regular point and L x is positive definite on M, it follows that this matrix is nonsingular (see Exercise 11). Thus, by the Implicit Function

More information

INTERIOR-POINT METHODS FOR NONCONVEX NONLINEAR PROGRAMMING: CONVERGENCE ANALYSIS AND COMPUTATIONAL PERFORMANCE

INTERIOR-POINT METHODS FOR NONCONVEX NONLINEAR PROGRAMMING: CONVERGENCE ANALYSIS AND COMPUTATIONAL PERFORMANCE INTERIOR-POINT METHODS FOR NONCONVEX NONLINEAR PROGRAMMING: CONVERGENCE ANALYSIS AND COMPUTATIONAL PERFORMANCE HANDE Y. BENSON, ARUN SEN, AND DAVID F. SHANNO Abstract. In this paper, we present global

More information

TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM

TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM TMA 4180 Optimeringsteori KARUSH-KUHN-TUCKER THEOREM H. E. Krogstad, IMF, Spring 2012 Karush-Kuhn-Tucker (KKT) Theorem is the most central theorem in constrained optimization, and since the proof is scattered

More information

Scientific Computing: Optimization

Scientific Computing: Optimization Scientific Computing: Optimization Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 March 8th, 2011 A. Donev (Courant Institute) Lecture

More information

Computational Optimization. Constrained Optimization Part 2

Computational Optimization. Constrained Optimization Part 2 Computational Optimization Constrained Optimization Part Optimality Conditions Unconstrained Case X* is global min Conve f X* is local min SOSC f ( *) = SONC Easiest Problem Linear equality constraints

More information

Numerical Comparisons of. Path-Following Strategies for a. Basic Interior-Point Method for. Revised August Rice University

Numerical Comparisons of. Path-Following Strategies for a. Basic Interior-Point Method for. Revised August Rice University Numerical Comparisons of Path-Following Strategies for a Basic Interior-Point Method for Nonlinear Programming M. A rg a e z, R.A. T a p ia, a n d L. V e l a z q u e z CRPC-TR97777-S Revised August 1998

More information

Multidisciplinary System Design Optimization (MSDO)

Multidisciplinary System Design Optimization (MSDO) Multidisciplinary System Design Optimization (MSDO) Numerical Optimization II Lecture 8 Karen Willcox 1 Massachusetts Institute of Technology - Prof. de Weck and Prof. Willcox Today s Topics Sequential

More information

Introduction to Nonlinear Optimization Paul J. Atzberger

Introduction to Nonlinear Optimization Paul J. Atzberger Introduction to Nonlinear Optimization Paul J. Atzberger Comments should be sent to: atzberg@math.ucsb.edu Introduction We shall discuss in these notes a brief introduction to nonlinear optimization concepts,

More information

Gradient Descent. Dr. Xiaowei Huang

Gradient Descent. Dr. Xiaowei Huang Gradient Descent Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Three machine learning algorithms: decision tree learning k-nn linear regression only optimization objectives are discussed,

More information

Computational Finance

Computational Finance Department of Mathematics at University of California, San Diego Computational Finance Optimization Techniques [Lecture 2] Michael Holst January 9, 2017 Contents 1 Optimization Techniques 3 1.1 Examples

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Examination paper for TMA4180 Optimization I

Examination paper for TMA4180 Optimization I Department of Mathematical Sciences Examination paper for TMA4180 Optimization I Academic contact during examination: Phone: Examination date: 26th May 2016 Examination time (from to): 09:00 13:00 Permitted

More information

MVE165/MMG631 Linear and integer optimization with applications Lecture 13 Overview of nonlinear programming. Ann-Brith Strömberg

MVE165/MMG631 Linear and integer optimization with applications Lecture 13 Overview of nonlinear programming. Ann-Brith Strömberg MVE165/MMG631 Overview of nonlinear programming Ann-Brith Strömberg 2015 05 21 Areas of applications, examples (Ch. 9.1) Structural optimization Design of aircraft, ships, bridges, etc Decide on the material

More information

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares Robert Bridson October 29, 2008 1 Hessian Problems in Newton Last time we fixed one of plain Newton s problems by introducing line search

More information

Lecture 11 and 12: Penalty methods and augmented Lagrangian methods for nonlinear programming

Lecture 11 and 12: Penalty methods and augmented Lagrangian methods for nonlinear programming Lecture 11 and 12: Penalty methods and augmented Lagrangian methods for nonlinear programming Coralia Cartis, Mathematical Institute, University of Oxford C6.2/B2: Continuous Optimization Lecture 11 and

More information

MATH2070 Optimisation

MATH2070 Optimisation MATH2070 Optimisation Nonlinear optimisation with constraints Semester 2, 2012 Lecturer: I.W. Guo Lecture slides courtesy of J.R. Wishart Review The full nonlinear optimisation problem with equality constraints

More information

Numerical Optimization: Basic Concepts and Algorithms

Numerical Optimization: Basic Concepts and Algorithms May 27th 2015 Numerical Optimization: Basic Concepts and Algorithms R. Duvigneau R. Duvigneau - Numerical Optimization: Basic Concepts and Algorithms 1 Outline Some basic concepts in optimization Some

More information

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23 Optimization: Nonlinear Optimization without Constraints Nonlinear Optimization without Constraints 1 / 23 Nonlinear optimization without constraints Unconstrained minimization min x f(x) where f(x) is

More information

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization

Nonmonotonic back-tracking trust region interior point algorithm for linear constrained optimization Journal of Computational and Applied Mathematics 155 (2003) 285 305 www.elsevier.com/locate/cam Nonmonotonic bac-tracing trust region interior point algorithm for linear constrained optimization Detong

More information

Sequential Quadratic Programming Method for Nonlinear Second-Order Cone Programming Problems. Hirokazu KATO

Sequential Quadratic Programming Method for Nonlinear Second-Order Cone Programming Problems. Hirokazu KATO Sequential Quadratic Programming Method for Nonlinear Second-Order Cone Programming Problems Guidance Professor Masao FUKUSHIMA Hirokazu KATO 2004 Graduate Course in Department of Applied Mathematics and

More information

Optimization Methods

Optimization Methods Optimization Methods Categorization of Optimization Problems Continuous Optimization Discrete Optimization Combinatorial Optimization Variational Optimization Common Optimization Concepts in Computer Vision

More information

Geometry optimization

Geometry optimization Geometry optimization Trygve Helgaker Centre for Theoretical and Computational Chemistry Department of Chemistry, University of Oslo, Norway European Summer School in Quantum Chemistry (ESQC) 211 Torre

More information

WHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ????

WHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ???? DUALITY WHY DUALITY? No constraints f(x) Non-differentiable f(x) Gradient descent Newton s method Quasi-newton Conjugate gradients etc???? Constrained problems? f(x) subject to g(x) apple 0???? h(x) =0

More information