Static unconstrained optimization

In unconstrained optimization an objective function is minimized without any additional restriction on the decision variables, i.e.

    $\min_{x \in X_{ad}} f(x)$    (2.1)

with $X_{ad} \subseteq \mathbb{R}^n$ the set of admissible decisions.

2.1 Optimality conditions

In the following, conditions are derived to determine (local) minimizers $x^*$ of $f(x)$ in $X_{ad}$ so that

    $f(x) \geq f(x^*)$    (2.2)

for $x \in X_{ad}$ or $x \in U_\epsilon \cap X_{ad}$ with $U_\epsilon$ a sufficiently small $\epsilon$-neighborhood of $x^*$ (cf. Definition 1.2). For this, recall the mean value theorem (Theorem 1.3) and assume that $f(x) \in C^1(X_{ad})$ and that the line segment $[x^*, x^* + \delta x]$ lies in $X_{ad}$ for sufficiently small $\delta x$. Then, taking into account (1.24), we have

    $f(x^* + \delta x) = f(x^*) + (\delta x)^T (\nabla f)(x^* + (1-r)\delta x)$

for some $r \in [0, 1]$. In case of a minimizer $x^*$ the inequality (2.2) with $x = x^* + \delta x$ has to be fulfilled for all $\delta x$ sufficiently small. This yields

    $f(x^*) + (\delta x)^T (\nabla f)(x^* + (1-r)\delta x) \geq f(x^*)$

or equivalently $(\delta x)^T (\nabla f)(x^* + (1-r)\delta x) \geq 0$. It will be shown by contradiction that the latter inequality implies $(\nabla f)(x^*) = 0$. Hence, let first $(\nabla f)(x^*) \neq 0$ and define $\delta x = -(\nabla f)(x^*)$. Then

    $(\delta x)^T (\nabla f)(x^*) = -\|(\nabla f)(x^*)\|^2 < 0.$

Since $(\nabla f)$ is by assumption continuous in a neighborhood of $x^*$, there exists a scalar $\tau > 0$ such that

    $(\delta x)^T (\nabla f)(x^* + t\,\delta x) < 0$    (2.3)

for all $t \in [0, \tau]$. Hence, for any $\bar{t} \in (0, \tau]$ the mean value theorem implies

    $f(x^* + \bar{t}\,\delta x) = f(x^*) + \bar{t}\,(\delta x)^T (\nabla f)(x^* + (1-r)\bar{t}\,\delta x)$    (2.4)

for some $r \in [0, 1]$. Noting that $r \in [0, 1]$ and hence $t := (1-r)\bar{t} \in [0, \tau]$, substitution of (2.3) into (2.4) yields

    $f(x^* + \bar{t}\,\delta x) < f(x^*)$

for all $\bar{t} \in (0, \tau]$. With this, a direction is obtained pointing away from $x^*$ along which $f(x)$ decreases. Thus $x^*$ is not a local minimizer and a contradiction is obtained, which implies the following result.

Theorem 2.1: First order necessary optimality condition
Let $X_{ad} \subseteq \mathbb{R}^n$ be the set of admissible decisions and assume $f(x) \in C^1(X_{ad})$. If $x^* \in X_{ad}$ is a local minimizer, then

    $(\nabla f)(x^*) = 0.$    (2.5)

Example 2.1. Consider the minimization problem

    $\min_{x \in X_{ad}} f(x) = \sin(x_1 x_2) - \tan(x_1)$    (2.6)

with $X_{ad} = [-1, 1] \times [-1, 1]$. Evaluation of (2.5) provides

    $(\nabla f)(x^*) = \begin{bmatrix} x_2 \cos(x_1 x_2) - \cos^{-2}(x_1) \\ x_1 \cos(x_1 x_2) \end{bmatrix} = 0,$

which results in $x^* = [0, 1]^T$. In order to ensure that $x^*$ is a local minimizer it is required to show that $0 = f(x^*) \leq f(x)$ for all $x$ in a neighborhood of $x^*$. Therefore let $x = x^* + [\epsilon, 0]^T$ for small $\epsilon$ and evaluate

    $f(x) = f(x^* + [\epsilon, 0]^T) = \sin(\epsilon) - \tan(\epsilon).$

Thus, $f(x) < f(x^*)$ for $x_1 \in (0, \epsilon]$ and $f(x) > f(x^*)$ for $x_1 \in [-\epsilon, 0)$ given $x_2 = 1$, which shows that $x^*$ satisfying (2.5) is not a local minimizer. This example illustrates that the conditions of Theorem 2.1 are only necessary but not sufficient. In addition, (2.5) only implies that $x^*$ is an extremum, subsequently often called a stationary point, and is fulfilled for a minimum but also for a maximum and a saddle point (cf. Figure 2.1).

Figure 2.1: Examples of minimum, maximum and saddle point ((a) minimum, (b) maximum, (c) saddle point).

By taking into account higher order terms, assuming that $f(x)$ is at least twice continuously differentiable in $X_{ad}$, the higher order mean value theorem (1.25) can be considered to improve Theorem 2.1.

Theorem 2.2: Second order necessary optimality conditions
Let $X_{ad} \subseteq \mathbb{R}^n$ be the set of admissible decisions and assume $f(x) \in C^2(X_{ad})$. If $x^* \in X_{ad}$ is a local minimizer, then

    $(\nabla f)(x^*) = 0$ and $(\nabla^2 f)(x^*) \succeq 0.$    (2.7)

The second condition in (2.7) requires the Hessian of $f(x)$ to be positive semidefinite at the point $x^*$. The proof of Theorem 2.2 is left as an exercise to the reader.

Example 2.2. Consider the minimization problem

    $\min_{x \in X_{ad}} f(x) = x_1^2 - 4x_1 + 3x_2^2 - 6x_2$    (2.8)

with $X_{ad} = \{x \in \mathbb{R}^2 : x_1 \geq 0, x_2 \geq 0\}$. Evaluation of (2.7) provides

    $(\nabla f)(x^*) = \begin{bmatrix} 2x_1 - 4 \\ 6x_2 - 6 \end{bmatrix} = 0,$

which is satisfied for $x^* = [2, 1]^T$, and

    $(\nabla^2 f)(x^*) = \begin{bmatrix} 2 & 0 \\ 0 & 6 \end{bmatrix}.$

Since the Hessian matrix is positive definite (eigenvalues at $\lambda_1 = 2$ and $\lambda_2 = 6$), the objective function $f(x)$ fulfills the necessary optimality conditions of Theorem 2.2.

Example 2.3. Consider the minimization problem

    $\min_{x \in X_{ad}} f(x) = x_1^2 + x_1 x_2 - 2x_2^2$    (2.9)

with $X_{ad} = \mathbb{R}^2$. Evaluation of (2.7) results in

    $(\nabla f)(x^*) = \begin{bmatrix} 2x_1 + x_2 \\ x_1 - 4x_2 \end{bmatrix} = 0,$

which yields $x^* = 0$ and hence

    $(\nabla^2 f)(x^*) = \begin{bmatrix} 2 & 1 \\ 1 & -4 \end{bmatrix}.$

The Hessian matrix is indefinite (one positive and one negative real eigenvalue) so that neither the necessary optimality conditions for a minimum nor for a maximum are fulfilled. It is left to the reader to show that $x^* = 0$ is a saddle point as depicted in Figure 2.1(c).

The following result provides sufficient conditions that guarantee that a point $x^*$ interior to $X_{ad}$ is a strict local minimizer [9].

Theorem 2.3: Second order sufficient optimality condition
Let $X_{ad} \subseteq \mathbb{R}^n$ be the set of admissible decisions and let $f(x) \in C^2(X_{ad})$. If $x^* \in X_{ad}$ and the conditions

    $(\nabla f)(x^*) = 0$ and $(\nabla^2 f)(x^*) \succ 0$    (2.10)

are fulfilled, then $x^*$ is a strict local minimizer of $f(x)$.

If the objective function $f(x)$ in (2.1) is a convex function, local and global minimizers can be easily characterized as is summarized below [9].

Theorem 2.4
If $f(x)$ is a convex function on the convex set $X_{ad}$, then any local minimizer $x^*$ is a global minimizer and the set of minima $G = \arg\min\{f(x) : x \in X_{ad}\}$ is convex. If in addition $f(x) \in C^1(X_{ad})$, then any stationary point $x^* \in X_{ad}$ is a global minimizer.

The proof of Theorem 2.4 is left to the reader and can, e.g., be found in [2].

Example 2.4. Consider a minimization problem involving a quadratic form, i.e.

    $\min_{x \in X_{ad}} f(x) = \frac{1}{2} x^T P x + q^T x + r$

for $X_{ad} = \mathbb{R}^n$ with $P = P^T \in \mathbb{R}^{n \times n}$, $q \in \mathbb{R}^n$ and $r \in \mathbb{R}$. From Example 1.6 the gradient and Hessian matrix follow as

    $(\nabla f)(x) = P x + q, \qquad (\nabla^2 f)(x) = P.$

We note that $f(x)$ is strictly convex if $P$ is positive definite. There is then a unique stationary point $x^* = -P^{-1} q$, which, due to the differentiability of $f(x)$, is a global minimizer by Theorem 2.4.
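
To make Example 2.4 concrete, the following short Python sketch (an illustration added here, not part of the original notes; it assumes NumPy is available) builds a positive definite $P$, computes the stationary point $x^* = -P^{-1}q$ by solving the linear system $Px = -q$, and checks that the gradient vanishes there.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 4
    A = rng.standard_normal((n, n))
    P = A @ A.T + n * np.eye(n)   # symmetric positive definite by construction
    q = rng.standard_normal(n)
    r = 1.0

    def f(x):
        return 0.5 * x @ P @ x + q @ x + r

    def grad_f(x):
        return P @ x + q

    # Unique stationary point x* = -P^{-1} q, computed without forming the inverse
    x_star = np.linalg.solve(P, -q)

    print(np.linalg.norm(grad_f(x_star)))   # ~ 1e-15, i.e. (2.5) holds
    # Any perturbation increases f, consistent with x* being the global minimizer
    print(f(x_star + 0.1 * rng.standard_normal(n)) > f(x_star))   # True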

2.2 Numerical minimization algorithms

The necessary optimality conditions require the determination of stationary points $x^*$ as solutions to an in general nonlinear system of $n$ coupled equations given by $(\nabla f)(x^*) = 0$. As a result, an analytical solution can be expected only in special cases, so that numerical techniques are needed to accurately approximate stationary points $x^*$. For this, various algorithms are available, which in principle are based on the computation of a sequence of values $(x_k)_{k \in \mathbb{N}}$ starting at an initial point $x_0$ such that $f(x)$ is decreased in each iteration step, i.e.

    $f(x_{k+1}) < f(x_k), \quad k = 0, 1, \ldots$    (2.11)

with the desire to achieve convergence of the sequence to the (local) minimizer

    $\lim_{k \to \infty} x_k = x^*.$    (2.12)

The algorithms are often referred to as iterative descent algorithms.

Remark 2.1
It should be mentioned that also nonmonotone algorithms exist that do not require a decrease of $f(x)$ in every iteration but only after a certain prescribed number of iterations. Also information from earlier iterates $x_0, x_1, \ldots, x_k$ can be used to determine $x_{k+1}$.

In the following, some preliminaries from numerical analysis are summarized, which are required to properly analyze so-called line search and trust region methods. Finally, so-called direct search strategies are briefly introduced.

2.2.1 Preliminaries

Convergence is the essential question and preliminary in any iterative technique. For a proper definition, the contraction property of a mapping in a suitable complete space has to be taken into account by defining a suitable metric, i.e. a measure of distance from the iterate to the fixed point of the mapping. The reader is referred to [3] for further details. Subsequently, only the notion of convergence order is introduced as a measure of convergence speed.

Definition 2.1: Order of convergence
Let $(x_k)_{k \in \mathbb{N}}$ be a sequence converging towards the limit $x^*$. The order of convergence of the sequence $(x_k)_{k \in \mathbb{N}}$ is the supremum of all nonnegative numbers $p$ for which

    $\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^p} = \mu < \infty.$    (2.13)

The constant $\mu$ is called the asymptotic error constant.

It is obvious from (2.13) that larger values of $p$ correspond to a higher speed of convergence, since the distance of the iterate $x_{k+1}$ to $x^*$ is for large $k$ reduced according to the $p$-th power of the previous distance.

Example 2.5. The sequence $(\sqrt{k+1} - \sqrt{k})_{k \in \mathbb{N}}$ converges to $0$ with the order of convergence $p = 1$ since

    $\lim_{k \to \infty} \frac{\sqrt{k+2} - \sqrt{k+1}}{\sqrt{k+1} - \sqrt{k}} = \lim_{k \to \infty} \frac{\sqrt{k+1} + \sqrt{k}}{\sqrt{k+2} + \sqrt{k+1}} = 1.$

One typically distinguishes between the two major cases

    $p = 1$, $\mu \in (0, 1)$: linear convergence,
    $p = 2$, $\mu < \infty$: quadratic convergence,

and

    $p = 1$, $\mu = 0$: superlinear convergence,
    $p = 1$, $\mu = 1$: sublinear convergence.

This in particular illustrates that any algorithm with convergence order $p > 1$ is superlinear.
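
As a quick numerical illustration of Definition 2.1 (added here for illustration; not part of the original notes), the order $p$ can be estimated from three consecutive error norms via $p \approx \log(e_{k+1}/e_k)/\log(e_k/e_{k-1})$. The sketch below applies this to Heron's iteration $x_{k+1} = (x_k + 2/x_k)/2$ for $\sqrt{2}$, which is known to converge quadratically.

    import numpy as np

    def estimate_order(errors):
        """Estimate the convergence order p from consecutive error norms
        using p ~ log(e_{k+1}/e_k) / log(e_k/e_{k-1})."""
        e = np.asarray(errors, dtype=float)
        return np.log(e[2:] / e[1:-1]) / np.log(e[1:-1] / e[:-2])

    # Heron's iteration for sqrt(2): quadratic convergence (p = 2)
    x, x_star = 3.0, np.sqrt(2.0)
    errors = []
    for _ in range(6):
        errors.append(abs(x - x_star))
        x = 0.5 * (x + 2.0 / x)

    print(estimate_order(errors))   # entries approach 2 before round-off sets in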

Exercise 2.2. Determine the convergence order $p$ and asymptotic error constant $\mu$ of the sequence $\{k^{-k}\}_{k \in \mathbb{N}}$.

Solution 2.2. The sequence converges superlinearly to zero, since $x_{k+1}/x_k = k^k/(k+1)^{k+1} \leq 1/(k+1) \to 0$, i.e. $p = 1$ with $\mu = 0$.

When analyzing a sequence of vectors $(x_k)_{k \in \mathbb{N}}$ converging to a limit $x^*$, as is the case in the considered minimization algorithms, the determination of the rate of convergence requires a proper mapping of this sequence into a sequence of scalars. If $f(x)$ is the objective function according to (2.1), then typically the convergence of the sequence $(f(x_k))_{k \in \mathbb{N}}$ to $f(x^*)$ is analyzed. In this context $f(x)$ is also referred to as error function. Alternatively, the norm $\|x_k - x^*\|$ or another suitable map from $\mathbb{R}^n$ to $\mathbb{R}$ can be considered. However, the rate of convergence of a vector valued sequence is in general independent of the choice of the error function.

2.2.2 Line search methods

The principal operation of line search methods is illustrated in Figure 2.2. In each iteration of a line search method a search direction $s_k$ is computed and the algorithm decides how far to move along this direction by determining a suitable step length $\alpha_k > 0$, i.e.

    $x_{k+1} = x_k + \alpha_k s_k.$    (2.14)

Most line search algorithms require $s_k$ to be a descent direction, i.e. one for which $s_k^T (\nabla f)(x_k) < 0$, since this property guarantees that $f(x)$ can be reduced along this direction such that

    $f(x_{k+1}) = f(x_k + \alpha_k s_k) < f(x_k).$    (2.15)

To illustrate this, the following proposition is proved.

Proposition 2.1 (Direction of steepest descent). The search direction $s_k = -(\nabla f)(x_k)$ is the direction of steepest descent, i.e. among all directions at $x_k$ it is the one along which $f(x)$ decreases most rapidly.

Proof. Let $f \in C^2(X_{ad})$. Then the mean value theorem (1.25) implies that there exists an $r \in [0, 1]$ such that

    $f(x_{k+1}) = f(x_k + \alpha_k s_k) = f(x_k) + \alpha_k s_k^T (\nabla f)(x_k) + \frac{1}{2} \alpha_k^2 s_k^T (\nabla^2 f)(x_k + \underbrace{(1-r)\alpha_k}_{=t_k} s_k) s_k.$

Herein, $t_k \in [0, \alpha_k]$ since $r \in [0, 1]$. The rate of change in $f$ is the coefficient of the term linear in $\alpha_k$, i.e. $s_k^T (\nabla f)(x_k)$. Hence, the unit direction $s_k$ of most rapid decrease is the solution of the minimization problem

    $\min_{s_k \in \mathbb{R}^n} s_k^T (\nabla f)(x_k)$ subject to $\|s_k\| = 1.$

Evaluation of the scalar product yields

    $s_k^T (\nabla f)(x_k) = \underbrace{\|s_k\|}_{=1} \, \|(\nabla f)(x_k)\| \cos\theta$

with $\theta$ the angle between $s_k$ and $(\nabla f)(x_k)$. The desired minimum is obviously attained for $\cos\theta = -1$ so that

    $s_k = -\frac{(\nabla f)(x_k)}{\|(\nabla f)(x_k)\|}$

is the (unit) direction of steepest descent starting at $x_k$.

Figure 2.2: Illustration of a line search method.

This also illustrates that $f(x)$ can be reduced along any direction $s_k$ fulfilling the property that $s_k^T (\nabla f)(x_k) < 0$. Depending on the selection of the search direction $s_k$, different algorithms can be distinguished, which are summarized below. These, moreover, depend on the suitable determination of the second degree of freedom, namely the step length $\alpha_k > 0$. For this, it would be ideal to find the global minimizer of the scalar minimization problem

    $\min_{\alpha_k > 0} g(\alpha_k) = f(x_k + \alpha_k s_k)$    (2.16)

for fixed $x_k$ and $s_k$. However, this is in general computationally too expensive, so that other techniques have to be taken into account to locally address (2.16). The schematic realization of line search methods is summarized in Algorithm 1 below.

Algorithm 1: Schematic line search method.
    input:      $x_0$ (starting value), $\epsilon$ (stopping criterion)
    initialize: $k = 0$
    repeat
        Compute search direction $s_k$
        Find an appropriate step length $\alpha_k$
        Compute $x_{k+1} = x_k + \alpha_k s_k$
        Update $k = k + 1$
    until $\|(\nabla f)(x_{k+1})\| \leq \epsilon$ or $\|x_{k+1} - x_k\| \leq \epsilon$
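
The following Python sketch (illustrative only; function and parameter names are my own, not from the notes) mirrors Algorithm 1: it takes the objective, its gradient, and two callbacks for the search direction and the step length, and iterates until one of the stopping criteria is met. Concrete choices for the two callbacks are developed in the following subsections.

    import numpy as np

    def line_search_method(f, grad, x0, direction, step_length,
                           eps=1e-8, max_iter=500):
        """Schematic line search (Algorithm 1): x_{k+1} = x_k + alpha_k * s_k."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:          # ||grad f(x_k)|| small
                break
            s = direction(x, g)                   # must be a descent direction
            assert s @ g < 0, "s_k is not a descent direction"
            alpha = step_length(f, grad, x, s)    # e.g. backtracking, Wolfe, ...
            x_new = x + alpha * s
            if np.linalg.norm(x_new - x) <= eps:  # ||x_{k+1} - x_k|| small
                x = x_new
                break
            x = x_new
        return x

With direction = lambda x, g: -g and a backtracking step_length (see the sketch after Algorithm 2 below), this skeleton becomes the steepest descent method.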

2.2.2.1 Determination of the step length

It should be mentioned that simply asking for (2.15), i.e. $f(x_k + \alpha_k s_k) < f(x_k)$, is not enough to achieve convergence to the minimizer $x^*$. As the following example illustrates, sufficient decrease conditions are required when addressing (2.16).

Example 2.6. Let $f(x) = (x - 1)^2$ and consider the sequence $(x_k)_{k \in \mathbb{N}}$ with $x_k = 1 + (-1)^k \sqrt{2/k + 1}$. Then $f(x_k) = 2/k + 1$ so that $f(x_{k+1}) < f(x_k)$, but as $k \to \infty$ the sequence $f(x_k)$ approaches $1$ since $x_k$ will start alternating between (values close to) $0$ and $2$. However, the minimum $f(x^*) = 0$ for $x^* = 1$ is not reached.

Subsequently, different conditions and related algorithms are provided, which enable the determination of an appropriate step length $\alpha_k$ in the line search method, assuming that the starting point $x_k$ of the line search and a search (descent) direction $s_k$ are given.

Armijo conditions    The Taylor series of $g(\alpha_k) = f(x_k + \alpha_k s_k)$ around $\alpha_k = 0$ results in

    $g(\alpha_k) = g(0) + g'(0)\alpha_k + O(\alpha_k^2) = f(x_k) + \alpha_k s_k^T (\nabla f)(x_k) + O(\alpha_k^2).$

In the Armijo condition the step length $\alpha_k$, the directional derivative $s_k^T (\nabla f)(x_k)$ and the reduction in $f(\cdot)$ are connected by the inequality

    $f(x_k + \alpha_k s_k) \leq f(x_k) + \epsilon_1 \alpha_k s_k^T (\nabla f)(x_k)$    (2.17)

for some constant $\epsilon_1 \in (0, 1)$, typically chosen small, e.g. $\epsilon_1 = 0.1$. With this, an upper bound on the step length is imposed. To ensure that $\alpha_k$ does not become too small, an additional inequality is introduced,

    $f(x_k + \alpha_k s_k) \geq f(x_k) + \epsilon_1 \bar{\epsilon} \alpha_k s_k^T (\nabla f)(x_k)$    (2.18)

with the parameter $\bar{\epsilon} > 1$. Figure 2.3(a) shows a graphical illustration of (2.17) and (2.18). Herein recall that $s_k$ is by assumption a descent direction with $s_k^T (\nabla f)(x_k) < 0$. In practice one starts for fixed $x_k$ and $s_k$ with an initial choice $\alpha_k = \alpha_k^{(0)}$:

(i) If the initial value satisfies (2.17), then $\alpha_k$ is successively increased by the factor $\bar{\epsilon} > 1$ until, at say $\alpha_k^{(j+1)}$, condition (2.17) is violated.

(ii) If the initial value does not satisfy (2.18), then $\alpha_k$ is successively decreased by the factor $\bar{\epsilon} > 1$ until $\alpha_k^{(j)} = \alpha_k^{(j-1)}/\bar{\epsilon}$ fulfills (2.18).

(iii) Finally, assign the determined $\alpha_k = \alpha_k^{(j)}$ as step length for the line search algorithm.

Wolfe conditions    A slight modification of the Armijo conditions leads to the so-called Wolfe conditions. Besides (2.17), a curvature condition different from (2.18) is introduced to exclude unacceptably small values of $\alpha_k$, i.e.

    $g'(\alpha_k) \geq \epsilon_2 g'(0)$, or equivalently $s_k^T (\nabla f)(x_k + \alpha_k s_k) \geq \epsilon_2 s_k^T (\nabla f)(x_k)$

Figure 2.3: Illustration of the Armijo conditions (2.17), (2.18) (panel (a)) and the Wolfe conditions (2.19) (panel (b)). Admissible areas are marked by the double arrows.

for some constant $\epsilon_2 \in (\epsilon_1, 1)$. This condition ensures that the slope of $g(\cdot)$ at $\alpha_k$ is larger than $\epsilon_2$ times the initial slope at $\alpha_k = 0$. Figure 2.3(b) provides a graphical illustration and confirms that this selection is useful: if the slope $g'(\alpha_k) = s_k^T (\nabla f)(x_k + \alpha_k s_k)$ is strongly negative, then $f$ can be further reduced by moving along the search direction $s_k$ with a larger $\alpha_k$. On the other hand, if $g'(\alpha_k)$ is only slightly negative or even positive, then one can in general no longer assume that $f$ can be further reduced in this search direction, so that the line search can be terminated with this $s_k$. In summary, the two introduced sufficient decrease conditions are known as the Wolfe conditions and read

    $f(x_k + \alpha_k s_k) \leq f(x_k) + \epsilon_1 \alpha_k s_k^T (\nabla f)(x_k)$    (2.19a)
    $s_k^T (\nabla f)(x_k + \alpha_k s_k) \geq \epsilon_2 s_k^T (\nabla f)(x_k)$    (2.19b)

for constants $\epsilon_1 \in (0, 1)$ and $\epsilon_2 \in (\epsilon_1, 1)$. Typical values of $\epsilon_2$ are $0.9$ when the search direction $s_k$ is determined by a Newton or quasi-Newton method, and $0.1$ if a nonlinear conjugate gradient method is chosen to obtain $s_k$. The so-called strong Wolfe conditions are obtained by modifying the curvature condition, i.e.

    $f(x_k + \alpha_k s_k) \leq f(x_k) + \epsilon_1 \alpha_k s_k^T (\nabla f)(x_k)$    (2.20a)
    $|s_k^T (\nabla f)(x_k + \alpha_k s_k)| \leq \epsilon_2 |s_k^T (\nabla f)(x_k)|$    (2.20b)

for constants $\epsilon_1 \in (0, 1)$ and $\epsilon_2 \in (\epsilon_1, 1)$. This more restrictive formulation enforces that $\alpha_k$ attains a value such that $x_{k+1} = x_k + \alpha_k s_k$ lies in (at least) a large neighborhood of a local minimizer or stationary point of $g(\alpha_k)$.

Remark 2.2
It can be shown under the assumption of continuous differentiability of $f(x)$ that there always exist step lengths $\alpha_k$ satisfying the Wolfe and the strong Wolfe conditions. For further details the reader is referred to, e.g., [9].

Goldstein conditions    The so-called Goldstein conditions are rather similar to the Wolfe conditions and read as

    $f(x_k) + (1 - \epsilon)\alpha_k s_k^T (\nabla f)(x_k) \leq f(x_k + \alpha_k s_k) \leq f(x_k) + \epsilon \alpha_k s_k^T (\nabla f)(x_k)$    (2.21)

for a constant $\epsilon \in (0, 1/2)$. The Goldstein conditions are often used in Newton-type methods but have the disadvantage compared to the Wolfe conditions that the first inequality may exclude all minimizers of $g(\alpha_k)$.

Backtracking    As argued above, the decrease condition (2.19a) alone is not sufficient to guarantee that the algorithm makes reasonable progress in the considered search direction. Nevertheless, if the candidate step lengths are chosen appropriately by using a so-called backtracking approach, then the curvature condition (2.19b) can be neglected and only (2.19a) may be used to terminate the line search procedure. The most basic form of this technique is summarized in Algorithm 2.

Algorithm 2: Backtracking algorithm.
    input:      $\bar{\alpha}_k > 0$ (starting value)
                $\rho \in (0, 1)$ (backtracking parameter)
                $\epsilon_1 \in (0, 1)$ (descent parameter)
    initialize: $\alpha_k = \bar{\alpha}_k$
    repeat
        $\alpha_k \leftarrow \rho \alpha_k$
    until $f(x_k + \alpha_k s_k) \leq f(x_k) + \epsilon_1 \alpha_k s_k^T (\nabla f)(x_k)$

The initial step $\bar{\alpha}_k$ is chosen to be $1$ in Newton and quasi-Newton methods but can have different values in other algorithms such as steepest descent or conjugate gradient. On the one hand, the backtracking algorithm ensures that $\alpha_k$ becomes sufficiently small within a finite number of trials so that the decrease condition (2.19a) is fulfilled. On the other hand, $\alpha_k$ will not become too small, which would prevent progress of the algorithm, due to the successive reduction by $\rho \in (0, 1)$. Experience illustrates that backtracking is well suited for Newton's method but less appropriate for quasi-Newton and conjugate gradient methods.
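
A direct transcription of Algorithm 2 into Python (an illustrative sketch; the loop guard max_iter is a safety addition of mine and not part of the algorithm):

    def backtracking(f, grad, x, s, alpha0=1.0, rho=0.5, eps1=0.1, max_iter=60):
        """Backtracking line search (Algorithm 2): shrink alpha by rho until
        the sufficient decrease condition (2.19a) holds."""
        fx = f(x)
        slope = s @ grad(x)          # directional derivative, negative for descent
        alpha = alpha0
        for _ in range(max_iter):
            alpha *= rho
            if f(x + alpha * s) <= fx + eps1 * alpha * slope:
                break
        return alpha

Plugged into line_search_method above as step_length, together with direction = lambda x, g: -g, this yields a complete steepest descent implementation.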

Nested intervals    A less heuristic technique for the determination of a step length $\alpha_k$ minimizing (2.16) is provided by nested intervals. The underlying idea is illustrated in Figure 2.4. It is assumed that $g(\alpha_k)$ is unimodal¹ on an interval $\alpha_k \in [l_0, r_0]$, so that $g(\alpha_k)$ has a unique local minimum in the open interval $(l_0, r_0)$. To determine the interval $[l_0, r_0]$, start from a sufficiently small $l_0$ and increase the value of the right interval boundary $r$ until $g(r)$ starts increasing for some $r = r_0$. Interval nesting is an iterative procedure to successively decrease the interval $[l_j, r_j]$ including the local minimum of $g(\alpha_k)$ as $j$ increases. Consider the $j$-th iteration step. Based on $l_j$ and $r_j$, new interval boundaries $l_j^+, r_j^+$ with $l_j < l_j^+ < r_j^+ < r_j$ are computed using

    $l_j^+ = l_j + (1 - \epsilon)(r_j - l_j)$    (2.22a)
    $r_j^+ = l_j + \epsilon(r_j - l_j)$    (2.22b)

with the parameter $\epsilon \in (1/2, 1)$.

¹ The function $f(x)$ is called unimodal for $x \in X$ if it has a unique local minimum in $X$.

Figure 2.4: Example of nested intervals: (a) step $j$, (b) step $j+1$.

The remaining procedure is based on the following lemma.

Lemma 2.1
Let $l_j < l_j^+ < r_j^+ < r_j$ and let $g(\alpha_k)$ be a unimodal function on the interval $[l_j, r_j]$. Let $\alpha_k^*$ denote the local minimizer of $g(\alpha_k)$ in $(l_j, r_j)$. Then $\alpha_k^* \in [l_j, r_j^+]$ if $g(l_j^+) \leq g(r_j^+)$, or $\alpha_k^* \in [l_j^+, r_j]$ if $g(l_j^+) \geq g(r_j^+)$.

Proof. Consider the case $g(l_j^+) \leq g(r_j^+)$. We follow a contradiction argument assuming that the local minimizer satisfies $\alpha_k^* > r_j^+$, which implies $l_j^+ < \alpha_k^*$. Since $g(l_j^+) \leq g(r_j^+)$ there exists a point $\bar{\alpha}_k \in (l_j^+, \alpha_k^*)$ such that $g(\bar{\alpha}_k) = \max_{\alpha_k \in [l_j^+, \alpha_k^*]} g(\alpha_k)$. Hence $\bar{\alpha}_k$ is a local maximizer in the interval $[l_j, r_j]$, which contradicts the assumption that $g(\alpha_k)$ is unimodal on $[l_j, r_j]$. The case $g(l_j^+) \geq g(r_j^+)$ follows analogously.

Lemma 2.1 implies that $r_j^+$ is dropped for the iteration step $j+1$ if $g(l_j^+) \leq g(r_j^+)$, so that the new interval $[l_{j+1}, r_{j+1}]$ is given by $l_{j+1} = l_j$ and $r_{j+1} = r_j^+$. This case is shown in Figure 2.4. If $g(l_j^+) \geq g(r_j^+)$, then the new interval $[l_{j+1}, r_{j+1}]$ is obtained as $l_{j+1} = l_j^+$ and $r_{j+1} = r_j$. For the scenario of Figure 2.4, evaluate (2.22) for $j+1$, which yields

    $l_{j+1}^+ = l_j + \epsilon(1 - \epsilon)(r_j - l_j), \qquad r_{j+1}^+ = l_j + \epsilon^2 (r_j - l_j).$    (2.23)

By imposing the constraint $\epsilon^2 = 1 - \epsilon$, i.e.

    $\epsilon = \frac{\sqrt{5} - 1}{2} \approx 0.618,$    (2.24)

the equality $r_{j+1}^+ = l_j^+$ is obtained, so that in each iteration only one new boundary has to be computed. Note that the fraction $1/\epsilon \approx 1.618$ is also known as the golden ratio. If $g(l_j^+) \geq g(r_j^+)$, then (2.24) similarly ensures $l_{j+1}^+ = r_j^+$ to reduce the number of computational steps. The local minimizer $\alpha_k^*$ is finally approximated by averaging the final iteration results, i.e. $\alpha_k^* \approx (l_k + r_k)/2$, or by quadratic interpolation (see below) using the three smallest of the four values of $g$ at $l_j$, $r_j$, $l_j^+$ and $r_j^+$. The method of nested intervals is an easily implementable and numerically robust procedure to compute $\alpha_k^*$ at the cost of a typically larger number of iteration steps.
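
The nested-interval scheme with the golden-section choice (2.24) translates into the following Python sketch (an illustration of mine; the example function and tolerance are arbitrary). Per iteration only one new function value is needed, exactly because $r_{j+1}^+ = l_j^+$ (or $l_{j+1}^+ = r_j^+$).

    import math

    def golden_section(g, l, r, tol=1e-6):
        """Nested intervals with the golden-section parameter eps = (sqrt(5)-1)/2.
        Assumes g is unimodal on [l, r]; returns the midpoint of the final interval."""
        eps = (math.sqrt(5.0) - 1.0) / 2.0           # (2.24), ~0.618
        lp = l + (1.0 - eps) * (r - l)               # l_j^+, cf. (2.22a)
        rp = l + eps * (r - l)                       # r_j^+, cf. (2.22b)
        g_lp, g_rp = g(lp), g(rp)
        while r - l > tol:
            if g_lp <= g_rp:                         # minimizer in [l_j, r_j^+]
                r, rp, g_rp = rp, lp, g_lp           # reuse old l_j^+ as new r^+
                lp = l + (1.0 - eps) * (r - l)       # only one new evaluation
                g_lp = g(lp)
            else:                                    # minimizer in [l_j^+, r_j]
                l, lp, g_lp = lp, rp, g_rp           # reuse old r_j^+ as new l^+
                rp = l + eps * (r - l)
                g_rp = g(rp)
        return 0.5 * (l + r)

    # Example: minimize g(a) = (a - 0.3)^2 on [0, 1]
    print(golden_section(lambda a: (a - 0.3) ** 2, 0.0, 1.0))   # ~0.3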

Quadratic interpolation    One very efficient method to approximately solve the minimization problem (2.16) is given by quadratic interpolation. For this, choose three pairwise distinct values $\alpha_k^1$, $\alpha_k^2$ and $\alpha_k^3$ and evaluate $g_j = g(\alpha_k^j)$. The quadratic interpolation polynomial passing through these three points is given in Lagrange form by

    $q(\alpha_k) = \sum_{j=1}^{3} g_j \prod_{i \neq j} \frac{\alpha_k - \alpha_k^i}{\alpha_k^j - \alpha_k^i}.$    (2.25)

The minimizer $\alpha_k^*$ of $q(\alpha_k)$ follows as

    $\alpha_k^* = \frac{1}{2}\,\frac{g_1\big((\alpha_k^2)^2 - (\alpha_k^3)^2\big) + g_2\big((\alpha_k^3)^2 - (\alpha_k^1)^2\big) + g_3\big((\alpha_k^1)^2 - (\alpha_k^2)^2\big)}{g_1\big(\alpha_k^2 - \alpha_k^3\big) + g_2\big(\alpha_k^3 - \alpha_k^1\big) + g_3\big(\alpha_k^1 - \alpha_k^2\big)}.$    (2.26)
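
Formula (2.26) in Python (illustrative; note that the denominator vanishes when the three points are collinear, which the sketch guards against, and that the vertex is a minimizer only if the parabola opens upwards):

    def quad_interp_step(a1, a2, a3, g1, g2, g3):
        """Vertex of the parabola through (a1,g1), (a2,g2), (a3,g3), cf. (2.26)."""
        num = g1 * (a2**2 - a3**2) + g2 * (a3**2 - a1**2) + g3 * (a1**2 - a2**2)
        den = g1 * (a2 - a3) + g2 * (a3 - a1) + g3 * (a1 - a2)
        if abs(den) < 1e-14:
            raise ValueError("points are (nearly) collinear; no parabola vertex")
        return 0.5 * num / den

    # The vertex of g(a) = (a - 0.3)^2 is recovered exactly from any three points:
    g = lambda a: (a - 0.3) ** 2
    print(quad_interp_step(0.0, 0.5, 1.0, g(0.0), g(0.5), g(1.0)))   # ~0.3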

2.2.2.2 Determination of the search direction

The convergence of line search methods not only depends on the selection of the step length $\alpha_k$ but also on the chosen search direction $s_k$, which has to be a descent direction such that $s_k^T (\nabla f)(x_k) < 0$. In the following, different approaches for the proper choice of $s_k$ are presented together with the resulting convergence rates.

Steepest descent or gradient method    Proposition 2.1 shows that the search direction

    $s_k = -(\nabla f)(x_k)$    (2.27)

is the direction of steepest descent, i.e. among all directions at $x_k$ it is the direction along which $f(x)$ decreases most rapidly. For the analysis of convergence of the steepest descent method

    $x_{k+1} = x_k - \alpha_k (\nabla f)(x_k)$    (2.28)

with (2.27), consider first the quadratic minimization problem

    $\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^T P x - b^T x$    (2.29)

for $P$ symmetric and positive definite. It was shown in Example 1.6 that $f(x)$ is strictly convex since $(\nabla^2 f)(x) = P$ is positive definite, so that Property (iv) of convex functions applies. Taking into account Theorems 2.2 and 2.4, it follows from $(\nabla f)(x) = P x - b = 0$ that $x^* = P^{-1} b$ is the global minimizer of (2.29). Given (2.29), the method of steepest descent (2.28) evaluates to

    $x_{k+1} = x_k - \alpha_k (P x_k - b).$    (2.30)

The minimizer of (2.16), i.e. $\min_{\alpha_k > 0} g(\alpha_k) = f(x_k + \alpha_k s_k)$, and hence the optimal step length $\alpha_k$ can be computed explicitly from

    $\min_{\alpha_k > 0} f(x_k + \alpha_k s_k) = \frac{1}{2}\big(x_k - \alpha_k \underbrace{(P x_k - b)}_{=(\nabla f)(x_k)}\big)^T P \big(x_k - \alpha_k (P x_k - b)\big) - b^T \big(x_k - \alpha_k (P x_k - b)\big)$

by taking the derivative of $f(x_k + \alpha_k s_k)$ with respect to $\alpha_k$. This yields

    $\alpha_k = \frac{(\nabla f)^T(x_k)(\nabla f)(x_k)}{(\nabla f)^T(x_k)\,P\,(\nabla f)(x_k)}.$    (2.31)

Exercise 2.3. Verify (2.31).

With $\alpha_k$ as above, the steepest descent method for the quadratic minimization problem reads

    $x_{k+1} = x_k - \frac{(\nabla f)^T(x_k)(\nabla f)(x_k)}{(\nabla f)^T(x_k)\,P\,(\nabla f)(x_k)}\,(\nabla f)(x_k).$    (2.32)

For the convergence analysis, introduce a suitably weighted norm by defining $\|x\|_P^2 = x^T P x$. This in particular implies with $x^* = P^{-1} b$ that

    $\frac{1}{2}\|x - x^*\|_P^2 = f(x) - f(x^*).$    (2.33)

The introduced norm is thus a measure of the difference between the current objective function value and the minimal value.

Exercise 2.4. Verify (2.33).

Consider the weighted distance between $x_{k+1}$ defined in (2.32) and the minimizer, i.e. $\|x_{k+1} - x^*\|_P$, which evaluates to

    $\|x_{k+1} - x^*\|_P^2 = \left\| x_k - \frac{(\nabla f)^T(x_k)(\nabla f)(x_k)}{(\nabla f)^T(x_k) P (\nabla f)(x_k)}\,(\nabla f)(x_k) - x^* \right\|_P^2 = \underbrace{\left( 1 - \frac{\big[(\nabla f)^T(x_k)(\nabla f)(x_k)\big]^2}{\big[(\nabla f)^T(x_k) P (\nabla f)(x_k)\big]\big[(\nabla f)^T(x_k) P^{-1} (\nabla f)(x_k)\big]} \right)}_{=(\ast)} \|x_k - x^*\|_P^2.$    (2.34)

Herein, $(\nabla f)(x_k) = P(x_k - x^*)$ is used, which follows from $x^* = P^{-1} b$ and hence $b = P x^*$. The term $(\ast)$ describes the decrease in each iteration step, so that the convergence properties of the steepest descent method can be deduced from this expression. For its interpretation, Kantorovich's inequality is used.

Lemma 2.2: Kantorovich's inequality
Let $P \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix. For every $x \in \mathbb{R}^n$ the inequality

    $\frac{(x^T x)^2}{(x^T P x)(x^T P^{-1} x)} \geq \frac{4 \lambda_{min} \lambda_{max}}{(\lambda_{min} + \lambda_{max})^2}$    (2.35)

holds, with $\lambda_{min}$ and $\lambda_{max}$ referring to the smallest and largest eigenvalue of $P$. Note that the eigenvalues of a symmetric and positive definite matrix are real and positive.

Exercise 2.5. Prove Lemma 2.2.

These preliminaries allow to conclude the following theorem [7].

Theorem 2.5: Convergence of steepest descent for a quadratic objective function
For any initial value $x_0 \in \mathbb{R}^n$ the steepest descent method (2.32) converges linearly to the global minimizer of the strictly convex objective function (2.29) with the error norm satisfying

    $\|x_{k+1} - x^*\|_P^2 \leq \left( \frac{\kappa - 1}{\kappa + 1} \right)^2 \|x_k - x^*\|_P^2$    (2.36)

with $\kappa = \lambda_{max}/\lambda_{min}$ the spectral condition number of $P$.

Proof. The result is a direct consequence of (2.35) applied to (2.34), i.e.

    $\frac{\|x_{k+1} - x^*\|_P^2}{\|x_k - x^*\|_P^2} = 1 - \frac{\big[(\nabla f)^T(x_k)(\nabla f)(x_k)\big]^2}{\big[(\nabla f)^T(x_k) P (\nabla f)(x_k)\big]\big[(\nabla f)^T(x_k) P^{-1} (\nabla f)(x_k)\big]} \leq 1 - \frac{4 \lambda_{min} \lambda_{max}}{(\lambda_{min} + \lambda_{max})^2}$

with $\lambda_{min}$ and $\lambda_{max}$ referring to the smallest and largest eigenvalue of $P$. Hence,

    $\frac{\|x_{k+1} - x^*\|_P^2}{\|x_k - x^*\|_P^2} \leq \frac{(\lambda_{max} - \lambda_{min})^2}{(\lambda_{min} + \lambda_{max})^2} = \left( \frac{\kappa - 1}{\kappa + 1} \right)^2,$

which equals (2.36).

This result admits a geometric interpretation. At first, it is obvious that convergence is achieved in a single step if $\kappa = 1$, i.e. if all eigenvalues $\lambda_j = \lambda$ of $P$ coincide so that $P = \lambda E$. In this case the contours of the objective function $f(x) = \frac{1}{2} x^T P x - b^T x$ are circles and the steepest descent direction always points at the global minimizer. This case is visualized in Figure 2.5(a). If $\kappa$ increases, then the contours approach elongated ellipsoids and convergence degrades due to a zigzagging behavior of the line search algorithm with steepest descent, as is shown in Figure 2.5(b). Note that the zigzagging will increase with the spectral condition number $\kappa$. The rate of convergence remains in principle unchanged if the minimization problem (2.1) is considered with a general objective function $f(x)$ [9].

Theorem 2.6: Convergence of steepest descent for a general objective function
Let $f(x) \in C^2(\mathbb{R}^n)$ and let $x^*$ denote a local minimizer of (2.1). Moreover, assume that $(\nabla^2 f)(x^*)$ is positive definite and let $\lambda_{min}$ and $\lambda_{max}$ denote its smallest and largest (positive real) eigenvalue. Assume that the sequence of iterates $(x_k)_{k \in \mathbb{N}}$ generated by the steepest descent method

    $x_{k+1} = x_k - \alpha_k (\nabla f)(x_k)$

converges to the local minimizer $x^*$ for suitable step lengths $\alpha_k$. Then the sequence $(f(x_k))_{k \in \mathbb{N}}$ converges linearly to $f(x^*)$ with any rate constant larger than $(\kappa - 1)^2/(\kappa + 1)^2$, where $\kappa = \lambda_{max}/\lambda_{min}$ is the spectral condition number of the Hessian matrix.

Figure 2.5: Line search with steepest descent for a quadratic strictly convex objective function: (a) ideal conditioning with $\kappa = 1$, (b) conditioning with $\kappa \gg 1$.

For poorly conditioned problems with large $\kappa$ an appropriate scaling might be used to improve the iterations. This approach exploits the fact that the determination of the minimum of the objective function $f(x)$ is equivalent to the determination of the minimum of the objective function $g(z) = f(Vz)$ with $x = Vz$ and $V$ regular. With this, the minimizer $x^*$ is mapped according to $z^* = V^{-1} x^*$. Hence, in the new state $z$ the gradient and Hessian of $g(z)$ are related to those of $f(x)$ by

    $(\nabla g)(z) = V^T (\nabla f)(Vz), \qquad (\nabla^2 g)(z) = V^T (\nabla^2 f)(Vz)\,V,$    (2.37)

which in particular implies $(\nabla g)(z^*) = V^T (\nabla f)(x^*)$ and $(\nabla^2 g)(z^*) = V^T (\nabla^2 f)(x^*) V$. The proper selection of the transformation matrix $V$ may lead to an improvement of the spectral condition number of the Hessian matrix $(\nabla^2 g)(z)$ compared to $(\nabla^2 f)(x)$. Nevertheless, these so-called pre-conditioning techniques should only be applied with caution.

Pros and cons of line search with the steepest descent or gradient method can be summarized as follows:

(+) Simple with low computational burden since the explicit evaluation of the Hessian matrix $(\nabla^2 f)(x_k)$ is not needed;
(+) Convergence can be achieved also for starting values $x_0$ not close to the local minimizer $x^*$;
(−) Slow convergence depending on the conditioning (and scaling);
(−) Linear convergence only.
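
The linear rate (2.36) and the zigzagging of Figure 2.5 are easy to reproduce numerically. The sketch below (my illustration; the matrix is an arbitrary example with $\kappa = 50$ and a worst-case starting point) runs the exact-step iteration (2.32) and prints the ratio of successive $P$-norm errors, which matches $((\kappa-1)/(\kappa+1))^2$.

    import numpy as np

    P = np.diag([1.0, 50.0])             # kappa = 50
    b = np.zeros(2)                      # minimizer x* = P^{-1} b = 0
    x = np.array([50.0, 1.0])            # worst-case starting direction

    bound = ((50.0 - 1.0) / (50.0 + 1.0)) ** 2   # (2.36): ~0.9231

    def err_P(x):                        # ||x - x*||_P^2 with x* = 0
        return x @ P @ x

    for k in range(8):
        g = P @ x - b                    # gradient of 0.5 x'Px - b'x
        alpha = (g @ g) / (g @ P @ g)    # exact step length (2.31)
        x_new = x - alpha * g            # steepest descent step (2.30)
        print(f"k={k}: ratio = {err_P(x_new)/err_P(x):.4f}  (bound {bound:.4f})")
        x = x_new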

Conjugate gradient method    The conjugate gradient (CG) method aims at combining fast convergence (as in Newton's method below) with the low computational burden of the steepest descent method. Herein, information from the present and the previous iteration is used to appropriately determine the search direction, i.e.

    $s_k = -(\nabla f)(x_k) + \beta_k s_{k-1}, \quad k \geq 1, \qquad s_0 = -(\nabla f)(x_0).$    (2.38)

Different formulas exist for the determination of the parameter $\beta_k$. One version is given by the Fletcher-Reeves formula, where

    $\beta_k^{FR} = \frac{(\nabla f)^T(x_k)(\nabla f)(x_k)}{(\nabla f)^T(x_{k-1})(\nabla f)(x_{k-1})}.$    (2.39)

Moreover, the Polak-Ribière formula should be mentioned in this context, where

    $\beta_k^{PR} = \frac{(\nabla f)^T(x_k)\big[(\nabla f)(x_k) - (\nabla f)(x_{k-1})\big]}{(\nabla f)^T(x_{k-1})(\nabla f)(x_{k-1})}.$    (2.40)

While the convergence properties of CG methods are well understood for linear and quadratic problems, in the general nonlinear setting surprising convergence properties can be observed, as is, e.g., pointed out in [9]. The reader is referred to this reference for further details and analysis.
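
A compact nonlinear CG sketch with the Fletcher-Reeves coefficient (2.39), reusing the backtracking helper from the sketch after Algorithm 2 (my illustration only: a production implementation would use a Wolfe line search with $\epsilon_2 = 0.1$, cf. above, and periodic restarts; the restart rule here is a common safeguard, not the notes' prescription):

    import numpy as np

    def cg_fletcher_reeves(f, grad, x0, eps=1e-6, max_iter=200):
        """Nonlinear conjugate gradient method (2.38) with beta from (2.39)."""
        x = np.asarray(x0, dtype=float)
        g = grad(x)
        s = -g                                   # s_0 = -grad f(x_0)
        for _ in range(max_iter):
            if np.linalg.norm(g) <= eps:
                break
            alpha = backtracking(f, grad, x, s, alpha0=2.0)   # cf. Algorithm 2
            x = x + alpha * s
            g_new = grad(x)
            beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves (2.39)
            s = -g_new + beta * s                # new search direction (2.38)
            if s @ g_new >= 0:                   # restart if not a descent direction
                s = -g_new
            g = g_new
        return x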

Newton's method    Newton's iterative method is based on the analysis of $f(x_{k+1})$ for $x_{k+1} = x_k + s_k$, i.e. unit step length $\alpha_k = 1$. Evaluation of the Taylor series at $x_k$, neglecting terms of order $3$ and higher, yields

    $f(x_{k+1}) \approx f(x_k) + s_k^T (\nabla f)(x_k) + \frac{1}{2} s_k^T (\nabla^2 f)(x_k) s_k.$    (2.41)

The search direction, also called Newton direction, is obtained by minimizing the right-hand side of (2.41) with respect to $s_k$. Taking into account Theorem 2.1 and noting that the right-hand side is a quadratic form in $s_k$ implies

    $(\nabla_{s_k} f)(x_{k+1}) = (\nabla f)(x_k) + (\nabla^2 f)(x_k) s_k = 0$

so that

    $s_k = -(\nabla^2 f)^{-1}(x_k)(\nabla f)(x_k).$    (2.42)

Hence, Newton's method can be interpreted as minimizing the quadratic approximation of the objective function $f(x)$. For $x_k$ in a sufficiently small neighborhood of a strict local minimizer $x^*$ it follows from Theorem 2.3 that the Hessian matrix $(\nabla^2 f)(x_k)$ is positive definite and hence invertible. In this case, Newton's method is well defined and (2.42) defines a descent direction.

Theorem 2.7: Convergence of Newton's method
Let $f \in C^2(\mathbb{R}^n)$ and let $(\nabla^2 f)(x)$ be locally Lipschitz continuous in a neighborhood of $x^*$ for which the second order sufficient optimality conditions (2.10) are satisfied. If the starting point $x_0$ is sufficiently close to the minimizer $x^*$, then the Newton iteration

    $x_{k+1} = x_k - (\nabla^2 f)^{-1}(x_k)(\nabla f)(x_k)$    (2.43)

converges to $x^*$ with an order of convergence $p$ of at least $2$. In addition, the sequence of gradient norms $(\|(\nabla f)(x_k)\|)_{k \in \mathbb{N}}$ converges quadratically to zero.

The proof of this theorem is omitted but can, e.g., be found in [9, Chap. 3.3].

Remark 2.3
Let $f : \mathbb{R}^n \to \mathbb{R}^m$ satisfy the inequality

    $\|f(x_1) - f(x_2)\| \leq L \|x_1 - x_2\|, \quad L \in (0, \infty)$    (2.44)

for all $x_1, x_2 \in B_r(y) = \{x \in \mathbb{R}^n : \|x - y\| \leq r\}$. Then $f(x)$ is called locally Lipschitz continuous on $B_r(y) \subseteq \mathbb{R}^n$. If the inequality holds for all $x_1, x_2 \in \mathbb{R}^n$, then $f(x)$ is called globally Lipschitz continuous. Note that if $f(x)$ and $(\nabla f)(x)$ are continuous on $B_r(y) \subseteq \mathbb{R}^n$, then $f(x)$ is locally Lipschitz continuous. In view of Theorem 2.7, the local Lipschitz continuity of $(\nabla^2 f)(x)$ in a neighborhood of $x^*$ is in particular given provided that $f(x) \in C^3(\mathbb{R}^n)$.

For the practical implementation typically a certain step length $\alpha_k$ is introduced, so that (2.43) is replaced by

    $x_{k+1} = x_k - \alpha_k (\nabla^2 f)^{-1}(x_k)(\nabla f)(x_k).$    (2.45)

Herein $\alpha_k$ is also referred to as damping coefficient, and the damped Newton method is often called Newton-Raphson method. Strategies for the suitable determination of $\alpha_k$ are discussed in Section 2.2.2.1 above. It is crucial to observe that the positive definiteness of the Hessian matrix $(\nabla^2 f)(x_k)$ might be lost if $x_k$ is not sufficiently close to $x^*$. In this case, $s_k$ defined in (2.42) is no longer guaranteed to be a descent direction and $(\nabla^2 f)(x_k)$ is not necessarily invertible. To address this issue, the search direction is modified so that the iteration rule reads

    $x_{k+1} = x_k - \alpha_k N_k^{-1} (\nabla f)(x_k), \qquad N_k = (\nabla^2 f)(x_k) + \epsilon_k E$    (2.46)

with the unit matrix $E \in \mathbb{R}^{n \times n}$ and a suitable $\epsilon_k \geq 0$. For $\epsilon_k = 0$ Newton's method is recovered, while for large $\epsilon_k$ the iteration (2.46) approaches the method of steepest descent. The proper selection of $\epsilon_k$ is not trivial. One typically begins with a starting value and successively increases $\epsilon_k$ until $N_k$ is positive definite. According to Theorem 1.2, definiteness can be checked, e.g., by computing the eigenvalues of $N_k$. Numerically more efficient techniques such as the Cholesky factorization can be used, which imply positive definiteness if and only if the matrix can be factorized into $N_k = D_k D_k^T$ with $D_k$ a lower triangular matrix with strictly positive entries on its diagonal [3].
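
A sketch of the damped and regularized Newton iteration (2.45)/(2.46) in Python (my illustration; the Cholesky-based definiteness test mirrors the remark above, the $\epsilon_k$ schedule is an arbitrary choice of mine, and backtracking is the helper from Algorithm 2, started so that the unit Newton step is tried first):

    import numpy as np

    def newton_modified(f, grad, hess, x0, eps=1e-10, max_iter=100):
        """Damped Newton method (2.45) with regularization N_k = H + eps_k*E (2.46)."""
        x = np.asarray(x0, dtype=float)
        n = x.size
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:
                break
            H = hess(x)
            eps_k = 0.0
            while True:
                try:                       # Cholesky succeeds iff H + eps_k*E > 0
                    np.linalg.cholesky(H + eps_k * np.eye(n))
                    break
                except np.linalg.LinAlgError:
                    eps_k = max(2.0 * eps_k, 1e-3)
            s = np.linalg.solve(H + eps_k * np.eye(n), -g)   # solve, don't invert
            alpha = backtracking(f, grad, x, s, alpha0=2.0)  # first trial: alpha = 1
            x = x + alpha * s
        return x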

Exercise 2.6. Verify that line search with Newton's method converges in a single step, independently of the starting point $x_0$, for the quadratic minimization problem

    $\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^T P x - b^T x$

with $P$ positive definite.

Pros and cons of line search with Newton's method can be summarized as follows:

(+) Quadratic convergence if the Hessian matrix $(\nabla^2 f)(x_k)$ is positive definite;
(−) Loss of positive definiteness of the Hessian matrix $(\nabla^2 f)(x_k)$ if $x_k$ is not in a sufficiently small neighborhood of the minimizer $x^*$;
(−) Requires the evaluation of the Hessian matrix $(\nabla^2 f)(x_k)$ and the computation of its inverse (not explicitly, but by solving a linear system of equations at each $x_k$).

Quasi-Newton methods    In quasi-Newton methods the evaluation and in particular the inversion of the Hessian matrix $(\nabla^2 f)(x_k)$ is replaced by an iterative procedure, which makes the approach suitable also for medium- and large-scale systems with $n \gg 1$. The underlying idea makes use of (2.41), i.e. a quadratic model of the objective function given by

    $f(x_{k+1}) \approx f(x_k) + s_k^T (\nabla f)(x_k) + \frac{1}{2} s_k^T B_k s_k,$    (2.47)

with the difference that $(\nabla^2 f)(x_k)$ is replaced by the $(n \times n)$ matrix $B_k$, which is assumed symmetric and positive definite. Proceeding as in Newton's method, the search direction is chosen as

    $s_k = -B_k^{-1} (\nabla f)(x_k)$    (2.48)

and minimizes the quadratic (convex) approximation (2.47). With this, the next iterate is

    $x_{k+1} = x_k - \alpha_k B_k^{-1} (\nabla f)(x_k)$    (2.49)

with the step length chosen to satisfy the Wolfe conditions (2.19). The crucial point is now to determine $B_{k+1}$ from the knowledge of $B_k$, $(\nabla f)(x_k)$ and $(\nabla f)(x_{k+1})$. For this, let $f(x) \in C^2(\mathbb{R}^n)$ and recall the integral mean value theorem (1.26), which implies

    $(\nabla f)(x_{k+1}) - (\nabla f)(x_k) = \int_0^1 (\nabla^2 f)(x_k + r(x_{k+1} - x_k))(x_{k+1} - x_k)\,dr \approx (\nabla^2 f)(x_{k+1})(x_{k+1} - x_k).$

In view of the approximation of the Hessian matrix $(\nabla^2 f)(x_{k+1})$ by $B_{k+1}$, this motivates

    $(\nabla f)(x_{k+1}) - (\nabla f)(x_k) = B_{k+1}(x_{k+1} - x_k).$

From a numerical point of view it is advantageous to select the approximation of the Hessian matrix so that $\mathrm{rank}(B_{k+1} - B_k)$ is small [3]. Quasi-Newton methods can hence be characterized by the following three properties:

    $B_k (x_{k+1} - x_k) = -(\nabla f)(x_k)$    (2.50a)

    $B_{k+1} (x_{k+1} - x_k) = (\nabla f)(x_{k+1}) - (\nabla f)(x_k)$    (2.50b)
    $B_{k+1} = B_k + \Delta B_k, \quad \mathrm{rank}\,\Delta B_k = m$    (2.50c)

for $k \in \mathbb{N} \cup \{0\}$. Eqn. (2.50b) is also known as the secant condition. The idea behind (2.50c) is to minimize the distance between $B_{k+1}$ and $B_k$ in some suitable norm. Typically $m = 1$ or $m = 2$ is chosen, leading to so-called rank-1 and rank-2 corrections $\Delta B_k$. Introducing $p_k = x_{k+1} - x_k$ and $q_k = (\nabla f)(x_{k+1}) - (\nabla f)(x_k)$, the properties (2.50) imply the frequently used relations

    $B_{k+1} p_k = q_k$    (2.51a)
    $(B_{k+1} - B_k) p_k = (\nabla f)(x_{k+1})$    (2.51b)
    $(\nabla f)(x_{k+1}) = q_k - B_k p_k.$    (2.51c)

Since $B_k$ is assumed positive definite and as such is invertible for any $k = 0, 1, \ldots$, it is reasonable to impose that the rank perturbation $\Delta B_k$ does not interfere with this assumption (for a detailed discussion on this topic the reader is referred to the analysis of matrix perturbations). Hence, instead of determining $B_{k+1} = B_k + \Delta B_k$ we will seek $H_{k+1} = H_k + \Delta H_k$ inverting $B_{k+1}$. A straightforward rank-1 correction is obtained using $\Delta H_k = \gamma_k z_k z_k^T$, since the dyadic product $z_k z_k^T$ is at most of rank 1. Substitution into (2.51a) results in

    $p_k = H_{k+1} q_k = H_k q_k + \gamma_k z_k z_k^T q_k.$    (2.52)

From this one obtains

    $(p_k - H_k q_k)(p_k - H_k q_k)^T = \gamma_k^2 z_k z_k^T q_k q_k^T z_k z_k^T = \gamma_k^2 (z_k^T q_k)^2 z_k z_k^T = \gamma_k (z_k^T q_k)^2 \Delta H_k.$

Solving for $\Delta H_k$ hence yields

    $H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{\gamma_k (z_k^T q_k)^2}.$    (2.53)

This expression can be further simplified by taking the scalar product of (2.52) with $q_k^T$, i.e.

    $q_k^T p_k = q_k^T H_k q_k + \gamma_k q_k^T z_k z_k^T q_k = q_k^T H_k q_k + \gamma_k (z_k^T q_k)^2.$

Solving for the latter term and substituting into (2.53) results in the so-called good Broyden method

    $H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{q_k^T (p_k - H_k q_k)}.$    (2.54)

Various convergence results are available for Broyden's method, proving superlinear convergence under certain conditions. For details, the reader is referred to, e.g., [3, 5, 6]. The main problem with (2.54) is that positive definiteness of $H_{k+1}$ is only preserved if $q_k^T (p_k - H_k q_k) > 0$. One of the most elegant techniques to ensure this property is provided by the Davidon-Fletcher-Powell (DFP) method. The technique is summarized in Algorithm 3 below and essentially relies on the initialization of the algorithm with a positive definite matrix $H_0$.

It can be shown that $H_k$ remains positive definite as long as $H_0$ is positive definite and the condition $q_k^T p_k > 0$ is satisfied. Since the approximation of the inverse Hessian matrix is in each step corrected by two rank-1 matrices, one refers in case of the DFP update also to a rank-2 correction.

Algorithm 3: Quasi-Newton method with DFP update.
    input:      $H_0$ (symmetric, positive definite matrix)
                $x_0$ (starting value)
                $\epsilon_x$, $\epsilon_f$ (stopping criteria)
    initialize: $k = 0$
    repeat
        Compute search direction $s_k = -H_k (\nabla f)(x_k)$
        Apply line search to solve $\min_{\alpha_k} f(x_k + \alpha_k s_k)$ (taking into account the Wolfe conditions (2.19))
        Compute $x_{k+1} = x_k + \alpha_k s_k$, $p_k = x_{k+1} - x_k$ and $q_k = (\nabla f)(x_{k+1}) - (\nabla f)(x_k)$
        Update using

            $H_{k+1} = H_k + \frac{p_k p_k^T}{p_k^T q_k} - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k}$    (2.55)

        $k = k + 1$
    until $\|x_{k+1} - x_k\| \leq \epsilon_x$ or $|f(x_{k+1}) - f(x_k)| \leq \epsilon_f$

An alternative to the DFP update is given by the so-called Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Herein, the iterative determination of the inverse Hessian matrix in Algorithm 3 is replaced by

    $H_{k+1} = \left( E - \frac{p_k q_k^T}{q_k^T p_k} \right) H_k \left( E - \frac{q_k p_k^T}{q_k^T p_k} \right) + \frac{p_k p_k^T}{q_k^T p_k}.$    (2.56)

In general, superlinear convergence is achieved by quasi-Newton methods involving the DFP or the BFGS update laws. While convergence of Newton's method is faster, its cost per iteration is higher due to the need for second order derivatives and the solution of a linear system with the Hessian matrix. For further analysis and information regarding the implementation of quasi-Newton methods the reader is referred to, e.g., [9].
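
A minimal BFGS sketch based on the inverse-Hessian update (2.56) (my illustration; for brevity it reuses the simple backtracking step from Algorithm 2 instead of a full Wolfe line search, so the curvature safeguard $q_k^T p_k > 0$ is checked explicitly before updating):

    import numpy as np

    def bfgs(f, grad, x0, eps=1e-8, max_iter=200):
        """Quasi-Newton method with the inverse BFGS update (2.56)."""
        x = np.asarray(x0, dtype=float)
        n = x.size
        H = np.eye(n)                            # H_0 symmetric positive definite
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) <= eps:
                break
            s = -H @ g                           # search direction (2.48)
            alpha = backtracking(f, grad, x, s, alpha0=2.0)
            x_new = x + alpha * s
            g_new = grad(x_new)
            p, q = x_new - x, g_new - g          # p_k, q_k
            if q @ p > 1e-12:                    # curvature condition, keeps H > 0
                rho = 1.0 / (q @ p)
                V = np.eye(n) - rho * np.outer(p, q)
                H = V @ H @ V.T + rho * np.outer(p, p)   # update (2.56)
            x, g = x_new, g_new
        return x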

2.2.3 Trust region methods

Trust region methods are somewhat similar to line search methods in the sense that both generate steps based on a quadratic model of the objective function. They differ, however, in the way the model is exploited. While line search methods rely on the determination of a search (descent) direction and a suitable step length to move along this direction, trust region methods define a region around the current iterate in which the quadratic model is trusted to be an adequate approximation of the objective function, i.e.

    $m(s_k) = f(x_k) + s_k^T (\nabla f)(x_k) + \frac{1}{2} s_k^T B_k s_k \approx f(x_k + s_k)$    (2.57)

with $B_k$ an appropriate symmetric and uniformly bounded matrix. Application of Taylor's formula (1.27) reveals that the error of approximation is of the order $O(\|s_k\|^2)$, or even $O(\|s_k\|^3)$ if $B_k = (\nabla^2 f)(x_k)$. The trust region around the iterate $x_k$, which is subsequently characterized by the parameter $\Delta_k$, can be interpreted as the region where $f(x_k + s_k)$ is supposed to be sufficiently accurately represented by $m(s_k)$. In trust region methods, the minimization problem

    $\min_{s_k \in \mathbb{R}^n} m(s_k) = f(x_k) + s_k^T (\nabla f)(x_k) + \frac{1}{2} s_k^T B_k s_k \quad \text{s.t.} \quad \|s_k\| \leq \Delta_k$    (2.58)

is solved in each iteration $k$ for a suitable trust region radius $\Delta_k$. The solution $s_k$ of (2.58) is hence the minimizer of $m(s_k)$ in the ball of radius $\Delta_k$. Contrary to line search, both search direction and step length are determined simultaneously. The proper choice of the degree of freedom $\Delta_k$ is crucial in a trust region method. For this, the agreement between the model function $m(s_k)$ and the objective function at previous iterations is considered in terms of the ratio

    $\varrho(s_k) = \frac{f(x_k) - f(x_k + s_k)}{m(0) - m(s_k)}.$    (2.59)

Herein, the numerator is called the actual reduction and the denominator is the predicted reduction. Note that the predicted reduction is always nonnegative since $s_k$ minimizes $m(s_k)$ inside the trust region, which includes $s_k = 0$. As a result, if $\varrho(s_k) < 0$, then the new value $f(x_k + s_k)$ of the objective function is greater than the current value $f(x_k)$, so that the step must be rejected and the trust region must be shrunk. For $\varrho(s_k) \approx 1$ the agreement between model and objective function is good, so that the trust region may be expanded for the next iteration. If $0 < \varrho(s_k) \ll 1$, then the trust region is shrunk in the next iteration by reducing $\Delta_k$. The principal process is summarized in Algorithm 4 [9]. Thereby, $\bar{\Delta}$ refers to the overall bound on the trust region radius. The radius is increased only if $s_k$ reaches the boundary of the trust region, i.e. when $\|s_k\| = \Delta_k$.

Algorithm 4: Trust region method.
    input:      $\bar{\Delta} > 0$, $\Delta_0 \in (0, \bar{\Delta})$ (starting trust region radius)
                $\eta \in [0, \frac{1}{4})$
                $\epsilon_x$, $\epsilon_f$ (stopping criteria)
    initialize: $k = 0$
    repeat
        Determine $s_k$ by (approximately) solving (2.58)
        Evaluate $\varrho(s_k)$ from (2.59)
        if $\varrho(s_k) < \frac{1}{4}$ then
            $\Delta_{k+1} = \frac{1}{4} \Delta_k$
        else if $\varrho(s_k) > \frac{3}{4}$ and $\|s_k\| = \Delta_k$ then
            $\Delta_{k+1} = \min\{2\Delta_k, \bar{\Delta}\}$
        else
            $\Delta_{k+1} = \Delta_k$
        end
        if $\varrho(s_k) > \eta$ then
            $x_{k+1} = x_k + s_k$ (next iterate)
            $B_{k+1} = B_k + \ldots$ (update Hessian matrix)
        else
            $x_{k+1} = x_k$ (repeat iteration with $\Delta_{k+1} < \Delta_k$)
        end
        $k = k + 1$
    until $\|x_{k+1} - x_k\| \leq \epsilon_x$ or $|f(x_{k+1}) - f(x_k)| \leq \epsilon_f$

For the implementation of trust region methods and the update of the Hessian matrix $B_{k+1}$ the reader is referred to [9].
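
The subproblem (2.58) can be solved approximately in many ways; the sketch below (my illustration, not the notes' prescription) uses the simple Cauchy point, i.e. the minimizer of the model along the steepest descent direction inside the trust region, combined with the radius update of Algorithm 4 for an example choice $\eta = 0.1$:

    import numpy as np

    def trust_region_cauchy(f, grad, hess, x0, delta0=1.0, delta_max=10.0,
                            eta=0.1, eps=1e-8, max_iter=500):
        """Trust region method (Algorithm 4) with a Cauchy-point model minimizer."""
        x = np.asarray(x0, dtype=float)
        delta = delta0
        for _ in range(max_iter):
            g, B = grad(x), hess(x)
            if np.linalg.norm(g) <= eps:
                break
            # Cauchy point: minimize m along -g subject to ||s|| <= delta
            gBg = g @ B @ g
            tau = 1.0 if gBg <= 0 else min(np.linalg.norm(g)**3 / (delta * gBg), 1.0)
            s = -tau * (delta / np.linalg.norm(g)) * g
            pred = -(g @ s + 0.5 * s @ B @ s)           # m(0) - m(s) >= 0
            rho = (f(x) - f(x + s)) / pred              # ratio (2.59)
            if rho < 0.25:
                delta = 0.25 * delta                    # shrink the trust region
            elif rho > 0.75 and np.isclose(np.linalg.norm(s), delta):
                delta = min(2.0 * delta, delta_max)     # expand at the boundary
            if rho > eta:
                x = x + s                               # accept the step
        return x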

2.2.4 Direct search methods

Direct (derivative-free) methods are characterized by the fact that no explicit knowledge of the gradient or the Hessian matrix of the objective function $f(x)$ is needed to compute the minimum. Herein, a series of function values is computed for a set of sample points to determine the subsequent iteration point. One of the most famous methods in this context is the so-called simplex method of Nelder and Mead [8]. In the case of two decision variables $x \in \mathbb{R}^2$ a simplex is a triangle, and the method makes use of the comparison of function values at the triangle's three vertices. The worst vertex, characterized by the largest value of the objective function $f(x)$, is rejected and replaced with a new vertex. With this, a new triangle is formed to continue the search. In the course of the process a sequence of triangles, in general of different shape, is generated with decreasing function values at the vertices. Since the size of the triangles is reduced in each step, the coordinates of the minimizer can be approximated.

Remark 2.4
The simplex algorithm of Nelder and Mead should not be confused with the conceptually different simplex method introduced by G.B. Dantzig in linear programming [4].

In the $n$-dimensional setting a simplex is the convex hull² (cf. Definition 1.7) spanned by $n + 1$ points $x_{k,j}$, $j = 0, \ldots, n$ in the $k$-th iteration. Denote by $x_{k,min}$ and $x_{k,max}$ those points $x_{k,j}$, $j = 0, \ldots, n$, where the objective function attains its minimum and maximum, respectively, i.e.

    $f(x_{k,min}) = \min_{j=0,\ldots,n} f(x_{k,j}), \qquad f(x_{k,max}) = \max_{j=0,\ldots,n} f(x_{k,j}).$    (2.60)

The centroid $\bar{x}_k$ of the simplex is defined by

    $\bar{x}_k = \frac{1}{n} \left( \sum_{j=0}^{n} x_{k,j} - x_{k,max} \right).$    (2.61)

The algorithm replaces the point $x_{k,max}$ in the simplex by another point with a lower value of the objective function. In particular, $x_{k,max}$ is replaced by a new point on the line

    $x_k^{ref} = \bar{x}_k + \alpha (\bar{x}_k - x_{k,max})$    (2.62)

² It reduces to a straight line if $n = 1$, a triangle for $n = 2$, a tetrahedron for $n = 3$, etc.
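
The quantities (2.60)-(2.62) in Python (an illustrative fragment of mine; a complete Nelder-Mead loop follows the case distinctions of Algorithm 5 below):

    import numpy as np

    def nm_reflect(simplex, fvals, alpha_ref=1.0):
        """Worst/best vertices, centroid (2.61) and reflection point (2.62).
        simplex: (n+1) x n array of vertices x_{k,j}; fvals: their f-values."""
        j_min, j_max = np.argmin(fvals), np.argmax(fvals)
        x_max = simplex[j_max]
        x_bar = (simplex.sum(axis=0) - x_max) / (simplex.shape[0] - 1)  # (2.61)
        x_ref = x_bar + alpha_ref * (x_bar - x_max)                     # (2.62)
        return j_min, j_max, x_bar, x_ref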

Figure 2.6: Operations involved in the simplex algorithm of Nelder and Mead: (a) reflection, (b) expansion, (c) outer contraction, (d) inner contraction, (e) shrinkage.

depending on $\alpha$. For this, various operations on the simplex are defined, which are summarized in Figure 2.6. During the iteration the simplex moves in the direction of the minimizer and is thereby successively contracted. Algorithm 5 summarizes the general procedure.

Algorithm 5: Simplex algorithm of Nelder and Mead.
    input:      $x_{0,j}$, $j = 0, \ldots, n$ (initial simplex)
                $\alpha_{ref} > 0$ (reflection coefficient [$\alpha_{ref} = 1$])
                $\alpha_{exp} > 0$ (expansion coefficient [$\alpha_{exp} = 1$])
                $\alpha_{con} \in (0, 1)$ (contraction coefficient [$\alpha_{con} = 1/2$])
                $\epsilon_x$, $\epsilon_f$ (stopping criteria)
    initialize: $k = 0$
    repeat
        Compute $x_{k,min}$, $x_{k,max}$
        Compute centroid $\bar{x}_k$
        Reflection step $x_{k,ref} = \bar{x}_k + \alpha_{ref} (\bar{x}_k - x_{k,max})$
        if $f(x_{k,ref}) < f(x_{k,min})$ then
            Expansion step $x_{k,exp} = x_{k,ref} + \alpha_{exp} (x_{k,ref} - \bar{x}_k)$
            if $f(x_{k,exp}) < f(x_{k,ref})$ then $x_{k,new} = x_{k,exp}$ else $x_{k,new} = x_{k,ref}$ end
        else if $f(x_{k,ref}) > \max_{j=0,\ldots,n,\; x_{k,j} \neq x_{k,max}} f(x_{k,j})$ then
            if $f(x_{k,max}) \leq f(x_{k,ref})$ then
                Inner contraction $x_{k,new} = \alpha_{con} x_{k,max} + (1 - \alpha_{con}) \bar{x}_k$
            else
                Outer contraction $x_{k,new} = \alpha_{con} x_{k,ref} + (1 - \alpha_{con}) \bar{x}_k$
            end
        else
            Preserve reflection point $x_{k,new} = x_{k,ref}$
        end
        if $f(x_{k,new}) \geq f(x_{k,max})$ then
            Shrinkage step $x_{k+1,j} = \frac{1}{2}(x_{k,j} + x_{k,min})$, $j = 0, \ldots, n$
        else
            $x_{k,max} = x_{k,new}$, $x_{k+1,j} = x_{k,j}$, $j = 0, \ldots, n$
        end
        $k = k + 1$
    until $\|x_{k+1} - x_k\| \leq \epsilon_x$ or $|f(x_{k+1}) - f(x_k)| \leq \epsilon_f$

Implementations of the Nelder-Mead simplex algorithm are available, e.g., in MATLAB and OCTAVE in terms of the function fminsearch. Convergence of the simplex algorithm of Nelder and Mead cannot be guaranteed in general, and the algorithm might even approach a non-minimizer. However, in practical applications the simplex algorithm yields good results at the cost of a rather slow convergence.

2.3 Benchmark example

For the evaluation of the different techniques, Rosenbrock's problem is subsequently considered as a benchmark example. Herein, the minimization problem is considered for the objective function

    $\min_{x \in \mathbb{R}^2} f(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2.$    (2.63)

Figure 2.7 shows the profile of $f(x)$ and the corresponding isoclines.

Figure 2.7: Rosenbrock's banana (or valley) function: profile and isoclines.

Exercise 2.7. Verify that $x^* = [1, 1]^T$ is a local minimizer of (2.63). Analyze whether this minimizer is global and unique. Is $f(x)$ a convex function?

In the following, it is desired to evaluate the properties and convergence behavior of the line search, trust region and direct search methods introduced in the paragraphs above. For this, the Optimization Toolbox of MATLAB provides the two functions

    fminunc, implementing quasi-Newton as line search method as well as a trust region method;
    fminsearch, implementing the simplex method of Nelder and Mead.
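
A rough Python analogue of this experiment (a sketch of mine, using SciPy rather than the toolboxes named here; SciPy's CG, BFGS, trust-ncg and Nelder-Mead methods stand in for minfunc, fminunc and fminsearch, so the iteration counts will not match Table 2.1 below exactly, and the starting point mirrors the one assumed there):

    import numpy as np
    from scipy.optimize import minimize

    def rosenbrock(x):
        return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

    def rosenbrock_grad(x):
        return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                         200.0 * (x[1] - x[0]**2)])

    def rosenbrock_hess(x):
        return np.array([[1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
                         [-400.0 * x[0], 200.0]])

    x0 = np.array([-1.0, 1.0])   # starting point (assumed, cf. the text below)

    for method in ["CG", "BFGS", "trust-ncg", "Nelder-Mead"]:
        kwargs = {}
        if method != "Nelder-Mead":
            kwargs["jac"] = rosenbrock_grad
        if method == "trust-ncg":
            kwargs["hess"] = rosenbrock_hess
        res = minimize(rosenbrock, x0, method=method, **kwargs)
        print(f"{method:12s} nit={res.nit:4d}  f={res.fun:.3e}  nfev={res.nfev}")

The gradient helper also answers the stationarity part of Exercise 2.7: rosenbrock_grad([1.0, 1.0]) evaluates to the zero vector, and the Hessian there is positive definite.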

Similarly, the Optim Package of OCTAVE enables, e.g., the use of the functions

    d2_min, implementing Newton's method;
    minimize, implementing Newton's method as well as the BFGS method as an example of quasi-Newton methods;
    fminsearch and nelder_mead_min, implementing the simplex method of Nelder and Mead.

The reader is also referred to the user-supplied function minfunc, which can be obtained from [2] and provides a large selection of line search methods including those discussed in the previous sections. This function is used subsequently to evaluate the different line search methods for the Rosenbrock problem. Herein, the strong Wolfe conditions (2.20) are used by default for the step length determination, provided that the user does not manually set a different option.

    Item  Method                               Iter.  f(x*)       ||(grad f)(x*)||_2  #eval(f)
    1     Line search: steepest descent        500    0.2433      0.495               506
    2     Line search: conjugate gradient      28     7.3648e-2   7.729e-1            65
    3     Line search: Newton                  21     3.8289e-16  7.3242e-7           32
    4     Line search: quasi-Newton (BFGS)     27     9.6395e-15  2.558e-6            34
    5     Trust region                         25     2.627e-18   2.67e-8             26
    6     Direct method: Nelder-Mead           67     5.393e-10   1.3e-4              125

Table 2.1: Comparison of line search, trust region and direct search methods for the Rosenbrock problem (2.63). Line search methods are evaluated using the function minfunc [2], fminunc is used for the trust region approach and fminsearch for the simplex algorithm of Nelder and Mead.

Table 2.1 summarizes the results of a comparison of the different algorithms using the functions minfunc, fminunc and fminsearch. The initial value is always set to $x_0 = [-1, 1]^T$. The corresponding behavior of the iterates is depicted in Figure 2.8. The weak performance of steepest descent is directly visible. In particular, the local minimizer $x^* = [1, 1]^T$ is not even closely reached after 500 iterations. This behavior is illustrated in Figure 2.9, where the progress of the successive iterations is depicted for 250 iterations. The steepest descent direction is orthogonal to the respective isocline, with the gradient $(\nabla f)(x_k)$ still attaining a reasonable