Review of Classical Optimization


Part II
Review of Classical Optimization

Multidisciplinary Design Optimization of Aircrafts 51

2 Deterministic Methods

2.1 One-Dimensional Unconstrained Minimization

Motivation

Most practical optimization problems involve many variables, so the study of single-variable minimization may seem academic. However, the optimization of multi-variable functions can be broken into two parts: 1. finding a suitable search direction; 2. minimizing along that direction. The second part of this strategy, the so-called line search, is the motivation for studying single-variable minimization.

Consider a scalar function, f, that depends on a single independent variable, x. Suppose we want to find the value of x where f(x) is a minimum:

minimize f(x) by varying x ∈ ℝ (2.1)

Furthermore, we want to do this with low computational cost (few iterations and low cost per iteration), low memory requirements, and a low failure rate. Often the computational effort is dominated by the computation of f and its derivatives, so some of these requirements can be translated into: evaluate f(x) and df/dx as few times as possible.

You end up having a few choices:
- Choose methods that do or do not require the evaluation of function derivatives (if you can compute derivatives cheaply, you may want to use them);
- If the function is pathologically badly behaved, you may want to avoid the use of derivatives;
- When using bracketing methods, choose the approach that provides faster rates of convergence in general;
- In multi-dimensional cases, choose between methods that require order-N and order-N² storage.

2.1.2 Types of Minima

The point x* is a:
- strong local minimizer, if f(x*) < f(x) for all x near x*;
- weak local minimizer, if f(x*) ≤ f(x) for all x near x*;
- strong global minimizer, if f(x*) < f(x), for all x;
- weak global minimizer, if f(x*) ≤ f(x), for all x.

If a minimum does not exist, the function is not bounded below.

2.1.3 Optimality Conditions

Taylor's theorem is useful for identifying local minima.

Theorem [Taylor's theorem]. If f(x) is n times differentiable, then there exists θ (0 ≤ θ ≤ 1) such that

f(x + h) = f(x) + h f′(x) + (h²/2!) f″(x) + ... + (h^(n−1)/(n−1)!) f^(n−1)(x) + (h^n/n!) f^(n)(x + θh),

where the last term is O(h^n).

Assuming f is twice-continuously differentiable and a minimum of f exists at x*, Taylor's theorem with n = 2 and x = x* leads to

f(x* + ε) = f(x*) + ε f′(x*) + (ε²/2) f″(x* + θε). (2.2)

For a local minimum at x*, it is required that f(x* + ε) ≥ f(x*) for a range −δ ≤ ε ≤ δ, where δ is a positive number. Given this definition and the Taylor series expansion (2.2), a local minimum requires

ε f′(x*) + (ε²/2) f″(x* + θε) ≥ 0.

If f′(x*) ≠ 0, then for any finite value of f″, ε can always be chosen small enough that |ε f′(x*)| > (ε²/2) |f″(x* + θε)|, so the first-order term dominates the sign of the left-hand side.

For ε f′(x*) to be non-negative for both signs of ε, we must have f′(x*) = 0, because the sign of ε is arbitrary. This is the first-order optimality condition. A point that satisfies the first-order optimality condition is called a stationary point. Besides minima, other types of stationary points include maxima and inflection points.

Because the first-derivative term is zero, the second-derivative term must be considered. This term must be non-negative for a local minimum at x*. Since ε² is always positive, f″(x*) ≥ 0. This is the second-order optimality condition. Higher-order terms can always be made smaller than the second-order term by choosing a small enough ε.

Discontinuities: many optimizers fail in the presence of discontinuities. This is especially critical for gradient-based optimizers and others that look only at the local region of the design space.

Necessary conditions (for a local minimum):

f′(x*) = 0; f″(x*) ≥ 0 (2.3)

Sufficient conditions (for a strong local minimum):

f′(x*) = 0; f″(x*) > 0 (2.4)

The optimality conditions can be used to:
- verify that a point is a minimum (sufficient conditions);
- realize that a point is not a minimum (necessary conditions);
- define equations that can be solved to find a minimum (in simple cases).

Gradient-based minimization methods find a local minimum by finding points that satisfy the optimality conditions.

2.1.4 Rate of Convergence

The rate of convergence is a measure of how fast an iterative method converges to the numerical solution. An iterative method is said to converge with order r when r > 0 is the largest number such that

0 < lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖^r < ∞, (2.5)

where k is the iteration number. This is to say that the above limit must be a positive, finite constant. This constant is the asymptotic error constant, γ:

lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖^r = γ. (2.6)

If the limit is zero when r = 1, we have a special case called superlinear convergence. When r = 2, the sequence converges quadratically, meaning that the number of correct figures roughly doubles with each iteration. When solving real problems, the exact x* is not known in advance, but it is useful to plot ‖x_{k+1} − x_k‖ and ‖g_k‖ (the norm of the gradient) versus k on a log-axis plot.

Some examples from Gill et al. [29]:

Example 2.1: x_k = c^(2^k), for 0 ≤ c < 1. Each member is the square of the previous one, and the limit is zero. Since

|x_{k+1} − 0| / |x_k − 0|² = c^(2^(k+1)) / c^(2·2^k) = 1,

r = 2 (quadratic convergence) with γ = 1.

Example 2.2: y_k = c^(2^(−k)), for c ≠ 0. Each member is the square root of the previous one, and the limit is 1. Since

|y_{k+1} − 1| / |y_k − 1| = (c^(2^(−(k+1))) − 1) / (c^(2^(−k)) − 1) → 1/2,

r = 1 (linear convergence) and γ = 1/2.
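These definitions can also be checked numerically. Below is a minimal Python sketch (the helper name estimate_order is illustrative, not from the text) that recovers the order r of Example 2.1 from three successive errors:

```python
# Estimate the convergence order r from three successive errors, using
# r ≈ log(e_next/e_curr) / log(e_curr/e_prev) for errors e_k = |x_k - x*|.
import math

def estimate_order(e_prev, e_curr, e_next):
    """Return the empirical convergence order from three successive errors."""
    return math.log(e_next / e_curr) / math.log(e_curr / e_prev)

# Example 2.1: x_k = c**(2**k) with c = 0.5 converges quadratically to 0.
errors = [0.5 ** (2 ** k) for k in range(5)]
r = estimate_order(errors[1], errors[2], errors[3])
```

For this sequence the estimate returns r = 2, matching the quadratic convergence derived above.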

2.1.5 Unimodality and Bracketing the Minimum

Line search methods using bracketing require the function f to be unimodal, that is, it monotonically decreases as we approach x* from the left and then monotonically increases to the right of x* (it has a single local minimum).

Example of a unimodal function

The first step in the process of finding the minimum is to bracket it in an interval.

Input: function f, starting point x_1, step size Δ, expansion parameter γ ≥ 1
Output: three-point pattern x_1, x_2, x_3 such that f_1 ≥ f_2 < f_3

begin
    set x_2 ← x_1 + Δ
    evaluate f_1 and f_2
    if f_2 > f_1 then
        interchange f_1 and f_2, x_1 and x_2, and set Δ ← −Δ
    end
    repeat
        if f_3 not null then
            rename f_2 as f_1, f_3 as f_2, x_2 as x_1, x_3 as x_2
        end
        set Δ ← γΔ, x_3 ← x_2 + Δ, and evaluate f_3
    until f_3 > f_2
end

Pseudo-code 1: Bracketing Algorithm

Common values for γ are 2 (step size doubled at each successive iteration) or the golden section ratio (≈ 1.618). The three-point pattern is needed for all interval reduction methods.
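Pseudo-code 1 can be sketched in Python as follows; the function name, default step size, and iteration cap are illustrative choices, not part of the original algorithm:

```python
def bracket_minimum(f, x1, step=0.1, gamma=2.0, max_iter=100):
    """Expand a step from x1 until a three-point pattern f1 >= f2 < f3 is
    found. A sketch of Pseudo-code 1 (bracketing algorithm)."""
    f1 = f(x1)
    x2 = x1 + step
    f2 = f(x2)
    if f2 > f1:                       # wrong direction: swap and reverse step
        x1, x2, f1, f2 = x2, x1, f2, f1
        step = -step
    for _ in range(max_iter):
        step *= gamma                 # expand the step
        x3 = x2 + step
        f3 = f(x3)
        if f3 > f2:                   # minimum is bracketed by (x1, x2, x3)
            return x1, x2, x3
        x1, x2, f1, f2 = x2, x3, f2, f3
    raise RuntimeError("no bracket found")

a, b, c = bracket_minimum(lambda x: (x - 4.0) ** 2, 0.0)
```

For the quadratic above, the returned triple brackets the minimizer x = 4.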

2.1.6 Interval Reduction Methods

These methods for function minimization start with an interval of uncertainty containing the minimum, which could have been determined using the bracketing algorithm, and successively reduce its size to a desired tolerance. These methods should be robust and efficient, that is, they should converge to the minimum using a reduced number of function evaluations. Among the most common methods are:
- Fibonacci method;
- Golden section method;
- Polynomial-based methods.

Fibonacci Method

The Fibonacci method is the strategy that yields the maximum reduction in the interval of uncertainty for a given number of function evaluations. Leonardo of Pisa (nicknamed Fibonacci) found a sequence of numbers that describes the evolution of a population of rabbits:

Rabbit population and Fibonacci numbers

The first few numbers of this sequence are 1, 1, 2, 3, 5, 8, 13, .... In general, the sequence of Fibonacci numbers can be generated using

F_0 = F_1 = 1 (2.7)
F_k = F_{k−1} + F_{k−2}, k = 2, ..., n (2.8)

Say we have an interval of uncertainty and the function has been evaluated at its boundaries. To reduce the interval of uncertainty, we have to evaluate two new points inside the interval. Then (assuming the function is unimodal) the new interval of uncertainty is the one that contains the interior point with the lower function value. The most efficient way of reducing the size of the interval is to: (1) ensure that the two possible intervals of uncertainty are the same size, and (2) reuse the interior point of the chosen interval, so that only one more function evaluation is required to select the next interval of uncertainty.

Fibonacci: Sequence of intervals

The interval sizes, I_k, are such that

I_1 = I_2 + I_3
I_2 = I_3 + I_4
...
I_k = I_{k+1} + I_{k+2} (2.9)
...
I_{N−4} = I_{N−3} + I_{N−2} = 8 I_N
I_{N−3} = I_{N−2} + I_{N−1} = 5 I_N
I_{N−2} = I_{N−1} + I_N = 3 I_N
I_{N−1} = 2 I_N

Recognizing the Fibonacci numbers, the following relation holds:

I_{n−j} = F_{j+1} I_n, j = 1, 2, ..., n − 1 (2.10)

To find the successive interval sizes, we need to start from the last interval and work in reverse order; only after this can we start the search. When using the Fibonacci search, we have to decide on the number of function evaluations a priori. This is not always convenient, as the termination criterion is often the variation of the function values in the interval of uncertainty. Furthermore, this method requires that the sequence be stored. Fibonacci search is the optimum because, in addition to yielding two intervals that are the same size and reusing one point, the interval between the two interior points converges to zero, and in the final iteration the interval is divided into two almost exactly equal halves, which is the optimum strategy for the last iteration. A detailed description of this method can be found in [10], pp.

Input: function f, starting values x_1 and x_4 bracketing the minimum, tolerance ε or number of evaluations N (condition: x_1 < x_4)
Output: interval of size smaller than ε that contains the minimum of f(x)

begin
    I_1 ← x_4 − x_1
    if ε is given then
        N ← smallest N such that I_N = I_1/F_N ≤ ε
    end
    I_2 ← (F_{N−1}/F_N) I_1
    x_2 ← x_4 − I_2 ; x_3 ← x_1 + I_2
    evaluate f_2 ← f(x_2) and f_3 ← f(x_3)
    for k = 2 to N − 1 do
        I_{k+1} ← I_{k−1} − I_k
        if f_2 < f_3 then                  (minimum in [x_1, x_3])
            x_4 ← x_3 ; x_3 ← x_2 ; f_3 ← f_2
            x_2 ← x_4 − I_{k+1} ; f_2 ← f(x_2)
        else                               (minimum in [x_2, x_4])
            x_1 ← x_2 ; x_2 ← x_3 ; f_2 ← f_3
            x_3 ← x_1 + I_{k+1} ; f_3 ← f(x_3)
        end
    end
    output [x_1, x_4] as the final interval
end

Pseudo-code 2: Fibonacci Algorithm
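A compact Python sketch of the Fibonacci search follows. It is illustrative rather than definitive: the loop stops just before the final degenerate halving step, which textbook implementations handle with a small ε offset, so it performs n − 1 rather than n evaluations:

```python
def fibonacci_search(f, a, b, n):
    """Shrink the bracket [a, b] around the minimum of a unimodal f using
    Fibonacci interval ratios (a sketch of Pseudo-code 2). The final
    degenerate halving step is skipped for simplicity."""
    fib = [1, 1]
    while len(fib) < n + 1:
        fib.append(fib[-1] + fib[-2])
    x1 = a + fib[n - 2] / fib[n] * (b - a)   # left interior point
    x2 = a + fib[n - 1] / fib[n] * (b - a)   # right interior point
    f1, f2 = f(x1), f(x2)
    for k in range(1, n - 2):
        if f1 < f2:                          # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1           # reuse x1 as new right point
            x1 = a + fib[n - k - 2] / fib[n - k] * (b - a)
            f1 = f(x1)
        else:                                # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2           # reuse x2 as new left point
            x2 = a + fib[n - k - 1] / fib[n - k] * (b - a)
            f2 = f(x2)
    return a, b

lo, hi = fibonacci_search(lambda x: (x - 1.0) ** 2, 0.0, 3.0, 10)
```

Each iteration reuses one interior point, so only one new function evaluation is needed per reduction, exactly as argued in the text.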

Golden Section Method

In the golden section search, the interval reduction strategy is uniform and thus independent of the number of iterations. The interval sizes, I_k, are such that

I_1 = I_2 + I_3
I_2 = I_3 + I_4
...

Then, imposing a constant reduction ratio,

I_2/I_1 = I_3/I_2 = I_4/I_3 = ... = τ.

Substituting into I_1 = I_2 + I_3 and dividing by I_1 gives

τ² + τ − 1 = 0. (2.11)

The positive solution of this equation is the golden section ratio, τ = (√5 − 1)/2 ≈ 0.618. Moreover,

τ = lim_{k→∞} F_{k−1}/F_k, (2.12)

therefore the Fibonacci search also reaches this value in the limit.

The basic golden section algorithm is similar to the Fibonacci algorithm, except that the interval ratio F_{N−1}/F_N is replaced by the constant τ. Both the Fibonacci and golden section methods always yield two equal intervals and reuse a previous interior point, but the latter does not use an optimal strategy for the last iteration: there is no last iteration, since the interval is always divided in the same proportions. Assume the initial uncertainty interval is I_1 = [0, 1]. The function is then evaluated at 1 − τ and τ. The two possible intervals are [0, τ] and [1 − τ, 1], and they are of the same size. If, say, [0, τ] is selected, then the next two interior points would be τ(1 − τ) and τ·τ, but τ² = 1 − τ, which has already been evaluated.

Golden section: Sequence of intervals
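The reuse argument above translates directly into code. A minimal sketch (names and default tolerance are illustrative):

```python
def golden_section(f, a, b, tol=1e-6):
    """Golden-section search on a unimodal f over [a, b]. Each iteration
    evaluates f once, reusing the surviving interior point."""
    tau = (5 ** 0.5 - 1) / 2          # golden section ratio, ~0.618
    x1 = b - tau * (b - a)            # left interior point
    x2 = a + tau * (b - a)            # right interior point
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                   # minimum in [a, x2]
            b, x2, f2 = x2, x1, f1    # old x1 becomes the new right point
            x1 = b - tau * (b - a)
            f1 = f(x1)
        else:                         # minimum in [x1, b]
            a, x1, f1 = x1, x2, f2    # old x2 becomes the new left point
            x2 = a + tau * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)

xmin = golden_section(lambda x: (x - 2.0) ** 2 + 1.0, 0.0, 5.0)
```

The identity τ² = 1 − τ is what guarantees that the retained interior point lands exactly where the next iteration needs it.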

Similarly to the Fibonacci method, the golden section method has linear convergence, meaning that successive significant figures are gained linearly with additional function evaluations. This method can also be integrated with the three-point bracketing algorithm by choosing the expansion parameter as 1 + τ. A detailed description of this method can be found in [10], pp.

Polynomial-Based Methods

More efficient procedures use information about f gathered during the iterations. One way of using this information is to produce an estimate of the function that we can easily minimize. The lowest-order function that we can use for this purpose is a quadratic, since a linear function does not have a minimum. Suppose we approximate f by

f̃ = ½ a x² + b x + c. (2.13)

If a > 0, the minimum of this function is at x* = −b/a.

To generate a quadratic approximation, three independent pieces of information are needed. For example, if we have the value of the function, its first derivative, and its second derivative at point x_k, we can write a quadratic approximation of the function value at x as the first three terms of a Taylor series,

f(x) ≈ f(x_k) + f′(x_k)(x − x_k) + ½ f″(x_k)(x − x_k)². (2.14)

If f″(x_k) is not zero, minimizing this quadratic and setting x = x_{k+1} yields

x_{k+1} = x_k − f′(x_k) / f″(x_k). (2.15)

This is Newton's method used to find a zero of the first derivative. Robust algorithms are obtained when polynomial-fit and sectioning ideas are merged, such as in Brent's quadratic fit-sectioning algorithm.
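Iterating Eq. (2.15) gives a minimization loop. A hedged sketch (function names and the test problem are illustrative; recall the method assumes f″ ≠ 0 near the solution and is not globally convergent):

```python
def newton_minimize(fprime, fsecond, x0, tol=1e-10, max_iter=50):
    """Minimize a function by applying iteration (2.15) to its first
    derivative. Illustrative sketch; no safeguards for f'' <= 0."""
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge")

# f(x) = x^4 - 4x has f'(x) = 4x^3 - 4 and f''(x) = 12x^2; minimum at x = 1.
xmin = newton_minimize(lambda x: 4 * x ** 3 - 4,
                       lambda x: 12 * x ** 2, 2.0)
```

Starting from x = 2, the iterates converge quadratically toward the minimizer x = 1.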

Brent's Quadratic Fit-Sectioning Algorithm

Brent [11] devised a method that fits a quadratic polynomial and accepts the quadratic minimum when the function is cooperative, and uses the golden section method otherwise. At any particular stage, Brent's algorithm keeps track of six points (not necessarily all distinct), a, b, u, v, w and x, defined as follows:
- the minimum is bracketed between a and b;
- x is the point with the least function value found so far (or the most recent one in case of a tie);
- w is the point with the second least function value;
- v is the previous value of w;
- u is the point at which the function was evaluated most recently.

The general idea is the following: parabolic interpolation is attempted, fitting through the points x, v, and w. To be acceptable, the parabolic step must (1) fall within the bounding interval (a, b), and (2) imply a movement from the best current value x that is less than half the movement of the step before last. This second criterion ensures that the parabolic steps are converging, rather than, say, bouncing around in some non-convergent limit cycle. The minimum of the quadratic that fits f(x), f(v) and f(w) is

u = x − ½ [(x − w)² (f(x) − f(v)) − (x − v)² (f(x) − f(w))] / [(x − w)(f(x) − f(v)) − (x − v)(f(x) − f(w))]. (2.16)

Brent's method converges superlinearly, meaning that the rate at which successive significant figures are gained increases with each successive function evaluation. A detailed description of this method can be found in [10], pp.
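The parabolic step of Eq. (2.16) is easy to check in isolation. The sketch below (helper name illustrative) computes u for three sample points; for an exactly quadratic function it recovers the true minimizer in one step:

```python
def parabolic_minimum(x, w, v, fx, fw, fv):
    """Minimum of the parabola through (x, fx), (w, fw), (v, fv), per
    Eq. (2.16). Returns None if the points are collinear."""
    num = (x - w) ** 2 * (fx - fv) - (x - v) ** 2 * (fx - fw)
    den = (x - w) * (fx - fv) - (x - v) * (fx - fw)
    if den == 0.0:
        return None                   # degenerate fit: fall back to sectioning
    return x - 0.5 * num / den

f = lambda t: (t - 1.5) ** 2 + 2.0    # parabola with minimum at t = 1.5
u = parabolic_minimum(0.0, 1.0, 3.0, f(0.0), f(1.0), f(3.0))
```

In Brent's algorithm this u is accepted only when it passes the two safeguards described above; otherwise a golden section step is taken instead.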

Input: function f, three-point pattern a, b and x bracketing the minimum, tolerance ε (condition: f(a) ≥ f(x) < f(b))
Output: interval of size smaller than 2ε that contains the minimum of f(x)

begin
    w, v ← x
    repeat
        if x, w and v are distinct then
            try a quadratic fit through x, w and v, and determine its minimum u using (2.16)
            if u is close to a, b or x, adjust it into the larger of [a, x] or [x, b] so it stays ε away from x
        else
            calculate u by golden sectioning of the larger of [a, x] or [x, b]
        end
        evaluate f(u); among a, b, x, w, v, u determine the new a, b, x, w, v
    until the larger of [a, x] and [x, b] is smaller than 2ε
end

Pseudo-code 3: Brent's Algorithm for a Minimum

Akima Splines

Polynomial interpolation often leads to spurious oscillations, especially when the original function exhibits abrupt changes in curvature. Hiroshi Akima, in 1970, published a one-dimensional fitting method that has some very desirable properties [2]. Akima claims that his method is closer to a manually drawn curve than those drawn by other mathematical methods. In 1991, Akima published an update to his algorithm [4, 3] addressing some shortcomings of the original. The approach uses a cubic fit between the data points, so the slope is required at each data point in addition to the value of the point itself. The interpolating polynomial between the i-th and (i+1)-th data points is written as

y = a_0 + a_1 (x − x_i) + a_2 (x − x_i)² + a_3 (x − x_i)³, (2.17)

with coefficients defined by

a_0 = y_i (2.18)
a_1 = y′_i
a_2 = (3 m_i − 2 y′_i − y′_{i+1}) / (x_{i+1} − x_i)
a_3 = (y′_i + y′_{i+1} − 2 m_i) / (x_{i+1} − x_i)²

and

m_i = (y_{i+1} − y_i) / (x_{i+1} − x_i), (2.19)

which is the slope of the line segment passing through the points. The method of determining the derivatives, y′, is what makes the Akima methods unique. In the 1991 method, the derivative is

y′_i = (Σ_k ω_k f_k) / (Σ_k ω_k), (2.20)

where f_k is the computed derivative at P_i of a third-order polynomial passing through P_i and three other nearby points:

f_1 = F(P_{i−3}, P_{i−2}, P_{i−1}, P_i) (2.21)
f_2 = F(P_{i−2}, P_{i−1}, P_i, P_{i+1})
f_3 = F(P_{i−1}, P_i, P_{i+1}, P_{i+2})
f_4 = F(P_i, P_{i+1}, P_{i+2}, P_{i+3})

The weights are inversely proportional to the product of what Akima calls a volatility measure and a distance measure,

ω_k = 1 / (v_k d_k). (2.22)

The distance factor is the sum of squares of the distances from P_i to the other three points:

d_1 = (x_{i−3} − x_i)² + (x_{i−2} − x_i)² + (x_{i−1} − x_i)² (2.23)
d_2 = (x_{i−2} − x_i)² + (x_{i−1} − x_i)² + (x_{i+1} − x_i)²
d_3 = (x_{i−1} − x_i)² + (x_{i+1} − x_i)² + (x_{i+2} − x_i)²
d_4 = (x_{i+1} − x_i)² + (x_{i+2} − x_i)² + (x_{i+3} − x_i)²

The volatility factor, v_k, is the sum of squares of the deviations from a least-squares linear fit of the four points.
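The segment coefficients (2.18)-(2.19) are the standard cubic Hermite coefficients; only the derivative estimate (2.20)-(2.23) is Akima-specific. The sketch below checks the coefficient formulas alone, with arbitrary endpoint values and slopes (the helper name and test data are illustrative, and the Akima weighting itself is not reproduced):

```python
def hermite_segment(xi, xi1, yi, yi1, di, di1):
    """Coefficients of the cubic (2.17) between points i and i+1, given
    endpoint values y and endpoint derivatives d, per Eqs. (2.18)-(2.19)."""
    h = xi1 - xi
    m = (yi1 - yi) / h                # chord slope, Eq. (2.19)
    a0 = yi
    a1 = di
    a2 = (3 * m - 2 * di - di1) / h
    a3 = (di + di1 - 2 * m) / h ** 2
    return a0, a1, a2, a3

# Segment from (0, 1) with slope 0 to (2, 5) with slope 1.
a0, a1, a2, a3 = hermite_segment(0.0, 2.0, 1.0, 5.0, 0.0, 1.0)
y_end = a0 + a1 * 2.0 + a2 * 2.0 ** 2 + a3 * 2.0 ** 3   # value at x_{i+1}
```

Evaluating the cubic at the far endpoint reproduces y_{i+1}, confirming the coefficients interpolate both values and slopes.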

2.1.7 Zero of a Function

Solving the first-order optimality condition, that is, finding x* such that g(x*) = f′(x*) = 0, is equivalent to finding the roots of the first derivative of the function to be minimized. In addition, in constrained optimization, zero-finding problems also occur due to constraints, i.e., h(x) = 0. Therefore, root-finding methods can be used to find stationary points and are useful in function minimization. Zero-finding algorithms can be classified in two basic categories, depending on the starting guess:
- Interval of uncertainty: bisection method;
- Arbitrary point: Newton's method, secant method.

Bisection Method

This method for finding the zero of a function f starts with two guesses forming an initial bracket [a, b] containing the root, for which the function values f(a) and f(b) have opposite signs. A new guess is then chosen at the midpoint, c = ½(a + b). The procedure is repeated with the new guess and whichever previous guess still brackets the root, until the desired accuracy is obtained.

If [a, b] is the initial interval and N is the number of iterations, the final interval size is

δ = |a − b| / 2^N  ⟺  2^N = |a − b| / δ  ⟺  N = log₂(|a − b| / δ), (2.24)

therefore this method is guaranteed to find the zero to a specified tolerance δ in about log₂(|a − b|/δ) function evaluations. Bisection yields the smallest interval of uncertainty for a specified number of function evaluations. It has the advantage that it always converges, provided that the initial interval contains a zero. Because it is a bracketing method, it generates a set of nested intervals. The only drawback is that the rate of convergence is rather slow: since δ_{k+1} = δ_k / 2, from the definition of rate of convergence, for r = 1,

lim_{k→∞} δ_{k+1} / δ_k = 1/2,

therefore the bisection algorithm exhibits a linear rate of convergence (r = 1) with asymptotic error constant 1/2.

To find the minimum of a function using bisection, we would evaluate the derivative of f at each iteration instead of the function value. Using finite machine precision, it is not possible to find the exact zero, so we will be satisfied with finding an x* that belongs to an interval [a, b] such that the function g (≡ f′) satisfies

g(a) g(b) < 0 and |a − b| < δ,

where δ is a small tolerance. This tolerance might be dictated by the machine representation (double precision carries roughly 16 significant digits), the precision of the function evaluation, or a limit on the number of iterations we want to perform with the root-finding algorithm.

Input: function f, endpoint values a and b, tolerance ε or maximum iterations N (conditions: a < b and f(a) f(b) < 0)
Output: value that differs from a root of f(x) = 0 by less than ε

begin
    k ← 1
    while k ≤ N do
        c ← (a + b)/2
        if f(c) = 0 or (b − a)/2 < ε then
            return c and stop
        end
        k ← k + 1
        if sign(f(c)) = sign(f(a)) then a ← c else b ← c end
    end
    output "Method failed"
end

Pseudo-code 4: Bisection Method
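Pseudo-code 4 maps almost line for line to Python; the sketch below is illustrative (names and defaults are choices, not part of the pseudocode):

```python
def bisect(f, a, b, tol=1e-8, max_iter=100):
    """Bisection root finding on [a, b], assuming f(a) f(b) < 0.
    A sketch of Pseudo-code 4."""
    fa = f(a)
    for _ in range(max_iter):
        c = 0.5 * (a + b)
        fc = f(c)
        if fc == 0.0 or 0.5 * (b - a) < tol:
            return c
        if (fc > 0) == (fa > 0):     # root lies in [c, b]
            a, fa = c, fc
        else:                        # root lies in [a, c]
            b = c
    raise RuntimeError("maximum iterations exceeded")

root = bisect(lambda x: x ** 2 - 2.0, 0.0, 2.0)
```

Per Eq. (2.24), reaching tol = 1e-8 from an interval of width 2 takes about log₂(2/1e-8) ≈ 28 evaluations, comfortably within the iteration cap.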

Newton-Raphson Method

Newton's method for finding a zero can be derived from the Taylor series expansion of the function about some initial guess x_k,

f(x_{k+1}) = f(x_k) + (x_{k+1} − x_k) f′(x_k) + O((x_{k+1} − x_k)²),

where x_{k+1} = x_k + Δx. Setting the function to zero and ignoring the terms of second and higher order results in

f(x_k) + (x_{k+1} − x_k) f′(x_k) ≈ 0.

Solving for the new estimate, x_{k+1}, yields

x_{k+1} = x_k − f(x_k) / f′(x_k). (2.25)

This iterative procedure converges quadratically, so

lim_{k→∞} |x_{k+1} − x*| / |x_k − x*|² = const.

While quadratic convergence is a great property, this method is not guaranteed to converge, and it only works under certain conditions. To minimize a function using Newton's method, we simply replace the function by its first derivative and the first derivative by the second derivative,

x_{k+1} = x_k − f′(x_k) / f″(x_k). (2.26)

Input: function f, starting value x_0, tolerance ε, maximum iterations N
Output: value that differs from a root of f(x) = 0 by less than ε

begin
    k ← 1
    while k ≤ N do
        x_{k+1} ← x_k − f(x_k)/f′(x_k)
        if |x_{k+1} − x_k| < ε or |f(x_{k+1})| < ε then
            return x_{k+1} and stop
        end
        x_k ← x_{k+1}
        k ← k + 1
    end
    output "Method failed"
end

Pseudo-code 5: Newton-Raphson Method
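A minimal Python sketch of Pseudo-code 5 for root finding (names and defaults illustrative; recall convergence is only local):

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration (2.25). Converges quadratically near a
    simple root, but is not globally guaranteed to converge."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / fprime(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("did not converge")

# Root of f(x) = x^2 - 2, i.e. sqrt(2), starting from x0 = 1.
root = newton(lambda x: x ** 2 - 2.0, lambda x: 2.0 * x, 1.0)
```

Starting from x = 1, the iterates 1.5, 1.4167, 1.41422, ... show the characteristic doubling of correct digits per step.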

Example 2.3: Function Minimization Using Newton's Method

Solve the single-variable optimization problem

minimize f(x) = (x − 3) x³ (x − 6)⁴ w.r.t. x

using Newton's method with several different initial guesses. The x_k are the Newton iterates and x_N is the converged solution.

Secant Method

Newton's method requires the first derivative at each iteration (and the second derivative when applied to minimization). In some practical applications, it might not be possible to obtain this derivative analytically, or it might just be troublesome. If we use a backward-difference approximation for f′(x_k),

f′(x_k) ≈ (f(x_k) − f(x_{k−1})) / (x_k − x_{k−1}),

and substitute into Newton's method, we obtain

x_{k+1} = x_k − f(x_k) (x_k − x_{k−1}) / (f(x_k) − f(x_{k−1})), (2.27)

which is the secant method ("the poor man's Newton method"). Under favorable conditions, this method has superlinear convergence (1 < r < 2), with r ≈ 1.618 (the golden ratio).

Input: function f, starting values x_0 and x_1 near the root, tolerance ε, maximum iterations N
Output: value that differs from a root of f(x) = 0 by less than ε

begin
    k ← 1
    while k ≤ N do
        if |f(x_{k−1})| < |f(x_k)| then
            swap x_{k−1} and x_k
        end
        x_{k+1} ← x_k − f(x_k) (x_k − x_{k−1}) / (f(x_k) − f(x_{k−1}))
        if |x_{k+1} − x_k| < ε or |f(x_{k+1})| < ε then
            return x_{k+1} and stop
        end
        x_{k−1} ← x_k ; x_k ← x_{k+1}
        k ← k + 1
    end
    output "Method failed"
end

Pseudo-code 6: Secant Method
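A Python sketch of the secant iteration (2.27) follows; names and defaults are illustrative, and the swap step of Pseudo-code 6 is omitted for brevity:

```python
def secant(f, x0, x1, tol=1e-10, max_iter=100):
    """Secant iteration (2.27): Newton's method with the derivative
    replaced by a backward-difference approximation."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0 = x1, f1              # shift the two-point history
        x1, f1 = x2, f(x2)
    raise RuntimeError("did not converge")

# Root of x^3 - x - 2 = 0, near 1.52.
root = secant(lambda x: x ** 3 - x - 2.0, 1.0, 2.0)
```

Only one new function evaluation is needed per iteration, since f(x_{k−1}) is carried over from the previous step.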

Linear Interpolation Method

The bisection method is very simple but generally quite inefficient, in part because it only makes use of the sign of the function f(x) at each evaluation while ignoring its magnitude. It thus ignores significant information that could be used to accelerate the finding of the root. A method based on interpolation makes use of this information by approximating the function on the interval [x_1, x_2] by the chord joining the points (x_1, f(x_1)) and (x_2, f(x_2)), that is, the straight line

(y − y_1) / (x − x_1) = (y_2 − y_1) / (x_2 − x_1). (2.28)

Solving this linear equation for y = 0 yields the new interval endpoint within the interval [x_1, x_2]:

x_3 = x_1 − f(x_1) (x_2 − x_1) / (f(x_2) − f(x_1)). (2.29)

The choice between the two intervals [x_1, x_3] and [x_3, x_2] is decided by evaluating f(x_3) and discarding the interval whose endpoints have the same sign, as was done in the bisection method. This iteration process is repeated, but it converges more quickly than the bisection method, since the information about the magnitude of f(x) pushes x_3 more quickly towards the actual root.

Input: function f, starting values x_0 and x_1 bracketing the root, tolerance ε, maximum iterations N (condition: f(x_0) f(x_1) < 0)
Output: value that differs from a root of f(x) = 0 by less than ε

begin
    k ← 1
    while k ≤ N do
        x_{k+1} ← x_k − f(x_k) (x_k − x_{k−1}) / (f(x_k) − f(x_{k−1}))
        if |x_{k+1} − x_k| < ε or |f(x_{k+1})| < ε then
            return x_{k+1} and stop
        end
        if f(x_{k−1}) f(x_{k+1}) < 0 then
            x_k ← x_{k+1}
        else
            x_{k−1} ← x_{k+1}
        end
        k ← k + 1
    end
    output "Method failed"
end

Pseudo-code 7: Linear Interpolation Method
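The method (also known as false position, or regula falsi) can be sketched in Python as follows; names and tolerances are illustrative choices:

```python
def regula_falsi(f, a, b, tol=1e-10, max_iter=200):
    """Linear-interpolation (false position) root finding via Eq. (2.29):
    the bracket endpoint whose sign matches the new point is replaced."""
    fa, fb = f(a), f(b)
    for _ in range(max_iter):
        c = a - fa * (b - a) / (fb - fa)   # chord crosses zero here
        fc = f(c)
        if abs(fc) < tol:
            return c
        if (fc > 0) == (fa > 0):
            a, fa = c, fc                  # root remains in [c, b]
        else:
            b, fb = c, fc                  # root remains in [a, c]
    raise RuntimeError("did not converge")

root = regula_falsi(lambda x: x ** 2 - 2.0, 0.0, 2.0)
```

Unlike bisection, the bracket need not shrink to zero width (one endpoint can become stuck on convex functions), so the stopping test here is on |f(c)| rather than on the interval size.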

2.1.8 Line Search Techniques

Line search methods are related to single-variable optimization methods, as they address the problem of minimizing a multi-variable function along a line, which is a subproblem in many gradient-based optimization methods. After a gradient-based optimizer has computed a search direction p_k, it must decide how far to move along that direction. The step can be written as

x_{k+1} = x_k + α_k p_k, (2.30)

where the positive scalar α_k is the step length.

Most algorithms require that p_k be a descent direction, i.e., that p_k have a negative projection onto the gradient g_k, so that p_kᵀ g_k < 0. This guarantees that f can be reduced by stepping along this direction. We want to compute a step length α_k that yields a substantial reduction in f, but we do not want to spend too much computational effort in making the choice. Ideally, we would find the global minimum of f(x_k + α_k p_k) with respect to α_k, but in general it is too expensive to compute this value. Even finding a local minimizer usually requires too many evaluations of the objective function f and possibly its gradient g. More practical methods perform an inexact line search that achieves adequate reductions of f at reasonable cost.

Wolfe Conditions

A typical line search involves trying a sequence of step lengths, accepting the first that satisfies certain conditions. A common condition requires that α_k yield a sufficient decrease of f, as given by the inequality

f(x_k + α p_k) ≤ f(x_k) + µ_1 α g_kᵀ p_k (2.31)

for a constant 0 < µ_1 < 1. In practice, this constant is small, say µ_1 = 10⁻⁴. Any sufficiently small step can satisfy the sufficient decrease condition, so in order to prevent steps that are too small we need a second requirement, called the curvature condition, which can be stated as

g(x_k + α p_k)ᵀ p_k ≥ µ_2 g_kᵀ p_k, (2.32)

where µ_1 < µ_2 < 1, and g(x_k + α p_k)ᵀ p_k is the derivative of f(x_k + α p_k) with respect to α_k. This condition requires that the slope of the univariate function at the new point be greater than µ_2 times the initial slope. Since we start with a negative slope, the gradient at the new point must be either less negative or positive. Typical values of µ_2 are 0.9 when using a Newton-type method and 0.1 when a conjugate gradient method is used.

The sufficient decrease (2.31) and curvature (2.32) conditions are known collectively as the Wolfe conditions. We can also modify the curvature condition to force α_k to lie in a broad neighborhood of a local minimizer or stationary point and obtain the strong Wolfe conditions

f(x_k + α p_k) ≤ f(x_k) + µ_1 α g_kᵀ p_k, (2.33)
|g(x_k + α p_k)ᵀ p_k| ≤ µ_2 |g_kᵀ p_k|, (2.34)

where 0 < µ_1 < µ_2 < 1. The only difference when comparing with the Wolfe conditions is that these conditions do not allow points where the derivative has a positive value that is too large, and therefore exclude points that are far from the stationary points. If µ_2 = 0, then we require g(x_k + α p_k)ᵀ p_k = 0, and we have an exact line search.

Figure 2.1: Acceptable steps for the Wolfe conditions

Sufficient Decrease and Backtracking

The curvature condition can be ignored by performing backtracking, i.e., by executing the following algorithm:
1. Choose a starting step length 0 < ᾱ ≤ 1 and a reduction ratio 0 < ρ < 1; set α = ᾱ.
2. If f(x_k + α p_k) ≤ f(x_k) + µ_1 α g_kᵀ p_k, then set α_k = α and stop.
3. Set α = ρα.
4. Return to 2.

When using Newton or quasi-Newton methods, the starting step length ᾱ is usually set to 1. The step-size reduction ratio, ρ, sometimes varies during the optimization process and is such that 0 < ρ < 1; in practice, ρ is not set too close to 0 or 1. Steepest descent and conjugate gradient methods, which do not produce well-scaled search directions, need to use other information to guess a step length. One strategy is to assume that the first-order change in x_k will be the same as the one obtained in the previous step, i.e., that ᾱ g_kᵀ p_k = α_{k−1} g_{k−1}ᵀ p_{k−1}, and therefore

ᾱ = α_{k−1} (g_{k−1}ᵀ p_{k−1}) / (g_kᵀ p_k). (2.35)
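The backtracking loop above can be sketched in a few lines of Python. This is illustrative only (list-based vectors, no safeguards against a non-descent direction), with typical constants:

```python
def backtracking(f, grad_f, x, p, alpha=1.0, rho=0.5, mu1=1e-4):
    """Backtracking line search enforcing the sufficient-decrease
    condition (2.31). Assumes p is a descent direction (g^T p < 0)."""
    fx = f(x)
    slope = sum(gi * pi for gi, pi in zip(grad_f(x), p))   # g^T p
    while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + mu1 * alpha * slope:
        alpha *= rho                  # shrink the step and try again
    return alpha

# Quadratic bowl f(x) = x1^2 + 10 x2^2, steepest-descent step from (1, 1).
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
g = lambda x: [2.0 * x[0], 20.0 * x[1]]
x0 = [1.0, 1.0]
p0 = [-gi for gi in g(x0)]            # descent direction
alpha = backtracking(f, g, x0, p0)
```

Because µ_1 is tiny, almost any step that actually decreases f is accepted; the halving simply discards steps that overshoot the bowl.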

Line Search Algorithm Using the Strong Wolfe Conditions

This procedure is guaranteed to find a step length satisfying the strong Wolfe conditions for any parameters µ_1 and µ_2. It has two stages:
1. It begins with a trial α_1 and keeps increasing it until it finds either an acceptable step length or an interval that brackets the desired step lengths.
2. In the latter case, a second stage (the zoom algorithm) is performed that decreases the size of the interval until an acceptable step length is found.

Define the univariate function φ(α) = f(x_k + α p_k), so that φ(0) = f(x_k). Accordingly, φ′(α_i) is the derivative of f in the line direction, taken with respect to α at α_i.

The first stage is as follows:
1. Set α_0 = 0, choose α_1 > 0 and α_max. Set i = 1.
2. Evaluate φ(α_i).
3. If [φ(α_i) > φ(0) + µ_1 α_i φ′(0)] or [φ(α_i) > φ(α_{i−1}) and i > 1], then set α = zoom(α_{i−1}, α_i) and stop (a local minimum has been bracketed).
4. Evaluate φ′(α_i).
5. If |φ′(α_i)| ≤ µ_2 |φ′(0)|, set α = α_i and stop.
6. If φ′(α_i) ≥ 0, set α = zoom(α_i, α_{i−1}) and stop.
7. Choose α_{i+1} such that α_i < α_{i+1} < α_max.
8. Set i = i + 1.
9. Return to 2.

The second stage, the zoom(α_lo, α_hi) function:
1. Interpolate (using quadratic, cubic, or bisection) to find a trial step length α_j between α_lo and α_hi.
2. Evaluate φ(α_j).
3. If φ(α_j) > φ(0) + µ_1 α_j φ′(0) or φ(α_j) > φ(α_lo), set α_hi = α_j.
4. Else:
   (a) Evaluate φ′(α_j).
   (b) If |φ′(α_j)| ≤ µ_2 |φ′(0)|, set α = α_j and stop.
   (c) If φ′(α_j)(α_hi − α_lo) ≥ 0, set α_hi = α_lo.
   (d) Set α_lo = α_j.
5. Return to 1.

Implementing an algorithm based on the strong Wolfe conditions (as opposed to the plain Wolfe conditions) has the advantage that, by decreasing µ_2, we can force α to lie closer to the local minimum. More details can be found in Nocedal and Wright [47], pp.

Example 2.4: Line Search Algorithm Using Strong Wolfe Conditions

[Figure: the line search algorithm iterations; the first stage is marked with square labels and the zoom stage with circles.]

2.2 Unconstrained Gradient-Based Minimization

Many engineering problems involve the unconstrained minimization of a function of several variables. Unconstrained problems also arise when the constraints are eliminated and accounted for by suitable penalty functions. All these problems are of the form

    minimize f(x) by varying x ∈ R^n.

The point x* is a

- strong local minimum if f(x*) < f(x) for all x near x*;
- weak local minimum if f(x*) ≤ f(x) for all x near x*;
- strong global minimum if f(x*) < f(x) for all x;
- weak global minimum if f(x*) ≤ f(x) for all x.

Note on convention: lowercase bold roman letters are vectors, lowercase Greek letters are scalars, and uppercase roman letters are matrices.

2.2.1 Gradient Vector and Hessian Matrix of a Multivariable Function

Let f(x) be a real function, where x = [x_1, x_2, ..., x_n]^T is a column vector of n real-valued design variables. The gradient vector of f(x) is given by the partial derivatives with respect to each of the independent variables,

    ∇f(x) ≡ g(x) ≡ [∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n]^T.    (2.36)

In the multivariate case, the gradient vector is perpendicular to the hyperplane tangent to the contour surfaces of constant f. Let the tangent direction be t = [∂x_1/∂s, ∂x_2/∂s, ..., ∂x_n/∂s]^T, where s is a coordinate along a contour or isosurface. Then

    f(x) = const  ⇒  df/ds = 0,
    df/ds = (∂f/∂x_1)(∂x_1/∂s) + (∂f/∂x_2)(∂x_2/∂s) + ... + (∂f/∂x_n)(∂x_n/∂s) = ∇f^T t = 0;

therefore, the dot product of the gradient with the tangent to the contour surface is zero.
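A quick numerical check of this orthogonality, for an illustrative function and point of our own choosing: the directional derivative of f along a direction tangent to its contour should vanish.

```python
import numpy as np

# f(x) = x1^2 + 4 x2^2 has elliptical contours (function chosen for illustration)
f = lambda x: x[0]**2 + 4 * x[1]**2
grad = lambda x: np.array([2 * x[0], 8 * x[1]])

x = np.array([1.0, 0.5])
g = grad(x)                         # gradient at x
t = np.array([-g[1], g[0]])         # rotate the gradient 90 degrees
t = t / np.linalg.norm(t)           # unit tangent to the contour through x

# central-difference directional derivative of f along the contour tangent
h = 1e-6
dfds = (f(x + h * t) - f(x - h * t)) / (2 * h)   # should be ~0
```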

Higher derivatives of multivariable functions are defined as in the single-variable case, but note that the number of derivative components increases by a factor of n with each differentiation. While the gradient of a function of n variables is an n-vector, the second derivative of an n-variable function is defined by n² partial derivatives (the derivatives of the n first partial derivatives with respect to the n variables):

    ∂²f/∂x_i∂x_j for i ≠ j, and ∂²f/∂x_i² for i = j.

If the partial derivatives ∂f/∂x_i, ∂f/∂x_j, and ∂²f/∂x_i∂x_j are continuous and f is single valued, then ∂²f/∂x_j∂x_i exists and ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i. Therefore the second-order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix,

    ∇²f(x) ≡ H(x) ≡ [ ∂²f/∂x_1²     ...  ∂²f/∂x_1∂x_n ]
                    [ ...           ...  ...          ]    (2.37)
                    [ ∂²f/∂x_n∂x_1  ...  ∂²f/∂x_n²    ],

which contains n(n + 1)/2 independent elements. If f is quadratic, the Hessian of f is constant, and the function can be expressed as

    f(x) = ½ x^T H x + g^T x + α.    (2.38)
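For a quadratic of the form (2.38) the Hessian is the constant matrix H; a central-difference check (the helper, matrix, and evaluation point below are illustrative choices) confirms both this and the symmetry ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i:

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T H x + g^T x + alpha, as in Eq. (2.38)
H = np.array([[3.0, -2.0], [-2.0, 2.0]])    # illustrative symmetric matrix
g = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ H @ x + g @ x + 7.0

def hessian_fd(f, x, h=1e-4):
    """Second derivatives by central finite differences (illustrative helper)."""
    n = len(x)
    Hfd = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            Hfd[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                         - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return Hfd

# For a quadratic, the result matches H at any point and is symmetric
Hfd = hessian_fd(f, np.array([0.3, -0.8]))
```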

2.2.2 Optimality Conditions

As in the single-variable case, the optimality conditions can be derived from the Taylor-series expansion of f about x*:

    f(x* + εp) = f(x*) + ε p^T g(x*) + ½ ε² p^T H(x* + εθp) p,    (2.39)

where 0 ≤ θ ≤ 1, ε is a scalar, and p is an n-vector.

For x* to be a local minimum, there must be, for any vector p, a finite ε such that f(x* + εp) ≥ f(x*), i.e., there is a neighborhood in which this condition holds. If this condition is satisfied, then f(x* + εp) − f(x*) ≥ 0, and the first- and second-order terms in the Taylor-series expansion must be greater than or equal to zero.

As in the single-variable case, and for the same reason, the first-order terms are considered first. Since p is an arbitrary vector and ε can be positive or negative, every component of the gradient vector g(x*) must be zero.

A point that satisfies ∇f(x*) ≡ g(x*) = 0 is called a stationary point; it can be a minimum, a maximum, or a saddle point.

[Figure: stationary points]

Regarding the second-order term, ½ ε² p^T H(x* + εθp) p: for this term to be non-negative, H(x* + εθp) has to be positive semi-definite, and by continuity the Hessian at the optimum, H(x*), must also be positive semi-definite.

Necessary conditions (for a local minimum):

    g(x*) = 0 and H(x*) is positive semi-definite.    (2.40)

Sufficient conditions (for a strong local minimum):

    g(x*) = 0 and H(x*) is positive definite.    (2.41)

Some definitions from linear algebra that might be helpful:

- The matrix H ∈ R^{n×n} is positive definite if p^T H p > 0 for all nonzero vectors p ∈ R^n (if H = H^T, then all the eigenvalues of H are strictly positive) → convex function.
- The matrix H ∈ R^{n×n} is positive semi-definite if p^T H p ≥ 0 for all vectors p ∈ R^n (if H = H^T, then the eigenvalues of H are positive or zero) → convex and flat function.
- The matrix H ∈ R^{n×n} is indefinite if there exist p, q ∈ R^n such that p^T H p > 0 and q^T H q < 0 (if H = H^T, then H has eigenvalues of mixed sign) → saddle point.
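For symmetric matrices, these definitions can be checked through the eigenvalues, as the parenthetical remarks indicate. A small helper (function name, tolerance, and test matrices are illustrative choices):

```python
import numpy as np

def classify(H, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    w = np.linalg.eigvalsh(H)           # eigenvalues of a symmetric matrix
    if np.all(w > tol):
        return "positive definite"
    if np.all(w >= -tol):
        return "positive semi-definite"
    if np.all(w < -tol):
        return "negative definite"
    return "indefinite"

print(classify(np.array([[2.0, 0.0], [0.0, 3.0]])))    # positive definite
print(classify(np.array([[1.0, 0.0], [0.0, 0.0]])))    # positive semi-definite
print(classify(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```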

Example 2.5: Find all stationary points of

    f(x) = 1.5 x_1² + x_2² − 2 x_1 x_2 + 2 x_1³ + 0.5 x_1⁴.

Solving ∇f(x) = 0 gives three solutions:

    (0, 0),             f = 0         → local minimum
    ½(−3 − √7)(1, 1),   f ≈ −9.2551   → global minimum
    ½(−3 + √7)(1, 1),   f ≈ 0.0051    → saddle point

To establish the type of each point, we determine whether the Hessian there is positive definite and compare the values of the function at the points.
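The three stationary points can be verified numerically; the code below restates the function and its derivatives, and the rounding in the printout is our own choice:

```python
import numpy as np

# f(x) = 1.5 x1^2 + x2^2 - 2 x1 x2 + 2 x1^3 + 0.5 x1^4  (Example 2.5)
f = lambda x: (1.5 * x[0]**2 + x[1]**2 - 2 * x[0] * x[1]
               + 2 * x[0]**3 + 0.5 * x[0]**4)
grad = lambda x: np.array([3 * x[0] - 2 * x[1] + 6 * x[0]**2 + 2 * x[0]**3,
                           2 * x[1] - 2 * x[0]])
hess = lambda x: np.array([[3 + 12 * x[0] + 6 * x[0]**2, -2.0],
                           [-2.0, 2.0]])

r7 = np.sqrt(7.0)
points = [np.zeros(2),
          0.5 * (-3 - r7) * np.ones(2),    # global minimum
          0.5 * (-3 + r7) * np.ones(2)]    # saddle point
for x in points:
    eig = np.linalg.eigvalsh(hess(x))
    kind = "positive definite" if eig.min() > 0 else "indefinite"
    print(np.round(x, 4), "f =", round(f(x), 4), kind)
```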

2.2.3 General Algorithm for Smooth Functions

All algorithms for unconstrained gradient-based optimization can be described as follows:

1. Initial guess. Start with iteration number k = 0 and a starting point x_0.
2. Test for convergence. If the conditions for convergence are satisfied, stop; x_k is the solution.
3. Compute a search direction. Compute the vector p_k that defines the direction in n-space along which to search.
4. Compute the step length. Find a positive scalar α_k such that f(x_k + α_k p_k) < f(x_k).
5. Update the design variables. Set x_{k+1} = x_k + α_k p_k, k = k + 1, and go back to 2.

Each major iteration of this type of algorithm contains two subproblems: computing the search direction p_k and finding the step size (controlled by α_k). The difference between the various gradient-based algorithms lies in the method used to compute the search direction.

Caution: in non-convex problems with multiple local minima, gradient methods only find a local minimum near the starting point, not necessarily the global one.

2.2.4 Steepest Descent Method

The earliest reference to this method is by Cauchy in 1847 [14]. The steepest descent method uses the gradient vector at each point as the search direction for each iteration. The gradient vector at a point, g(x_k), is the direction of maximum rate of change (maximum increase) of the function at that point, and this rate of change is given by the norm ‖g(x_k)‖. As mentioned previously, the gradient vector is orthogonal to the plane tangent to the isosurfaces of the function.

If we use an exact line search, the steepest descent direction at each iteration is orthogonal to the previous one:

    df(x_{k+1})/dα = ∇f(x_{k+1})^T ∂x_{k+1}/∂α = ∇f(x_{k+1})^T p_k = 0
    ⇒ g(x_{k+1})^T g(x_k) = 0.    (2.42)

Therefore the method zigzags in the design space and is rather inefficient. Although a substantial decrease may be observed in the first few iterations, the method is usually very slow after that. In particular, while the algorithm is guaranteed to converge, it may take an infinite number of iterations. The rate of convergence is linear.

Input: function f, starting point x_0, and convergence parameters ε_g, ε_a, and ε_r
Output: local minimum of f

begin
  repeat
    compute g(x_k) ≡ ∇f(x_k)
    if ‖g(x_k)‖ ≤ ε_g then converged
    else compute the normalized search direction p_k = −g(x_k)/‖g(x_k)‖
    perform line search to find step length α_k in the direction of p_k
    update the current point: x_{k+1} = x_k + α_k p_k
    evaluate f(x_{k+1})
    if |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is satisfied for two successive iterations
      then converged
    else set k = k + 1
  until converged
end

Pseudo-code 8: Steepest descent algorithm
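A minimal implementation of Pseudo-code 8, with a backtracking line search and a gradient-norm convergence test only (these simplifications and the quadratic test problem are our own choices):

```python
import numpy as np

def steepest_descent(f, grad, x0, eps_g=1e-6, max_iter=10_000):
    """Steepest descent with a simple backtracking line search (a sketch of
    Pseudo-code 8; the line-search details are illustrative choices)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        p = -g / np.linalg.norm(g)          # normalized descent direction
        alpha, fx, slope = 1.0, f(x), g @ p
        while f(x + alpha * p) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5                    # backtrack until sufficient decrease
        x = x + alpha * p
    return x

# Ill-conditioned quadratic: minimum at (1, -2); expect a zigzagging path
xmin = steepest_descent(lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2,
                        lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)]),
                        [0.0, 0.0])
```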

Here, |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is a check on the successive reductions of f. ε_a is the absolute tolerance on the change in function value (usually a small value such as 10⁻⁶) and ε_r is the relative tolerance (usually set to 0.01). If f is of order 1, then ε_r dominates; if f gets very small, the absolute tolerance takes over.

For steepest descent and other gradient methods that do not produce well-scaled search directions, we need to use other information to guess a step length. One strategy is to assume that the first-order change in x_k will be the same as the one obtained in the previous step, i.e., that ᾱ g_k^T p_k = α_{k-1} g_{k-1}^T p_{k-1}, and therefore

    ᾱ = α_{k-1} (g_{k-1}^T p_{k-1}) / (g_k^T p_k).    (2.43)

Since steepest descent relies only on first-order information, which is useful locally, it takes into account neither previous iterations nor second-order information, which would help in getting the bigger picture.

Example 2.6: Steepest Descent Applied to a Quadratic Function

Figure 2.2: Solution path of the steepest descent method

2.2.5 Conjugate Gradient Method

First presented by Fletcher and Reeves [27], this method is a small modification of the steepest descent method that takes into account the history of the gradients to move more directly towards the optimum. It can find the minimum of a quadratic function of n variables in n iterations.

Consider the problem of minimizing a convex quadratic function

    f(x) = ½ x^T A x − c^T x,    (2.44)

where A is an n × n symmetric positive-definite matrix. Differentiating with respect to x yields

    ∇f(x) = A x − c.    (2.45)

Thus, minimizing this quadratic is equivalent to solving a linear system, and the conjugate gradient method is an iterative method for solving linear systems of equations such as this one.

A set of nonzero vectors {p_0, p_1, ..., p_{n-1}} is conjugate with respect to A if

    p_i^T A p_j = 0 for all i ≠ j.    (2.46)

Conjugate vectors are linearly independent.

Suppose that we start from a point x_0 and a set of conjugate directions {p_0, p_1, ..., p_{n-1}}. In this method, the gradients of f are used to generate the conjugate directions. Let g_k ≡ ∇f(x_k) = A x_k − c, where x_k is the current point at iteration k. The first direction is chosen as the steepest-descent direction,

    p_0 = −g_0.    (2.47)

The sequence {x_k} is generated by minimizing f along p_k, thus

    x_{k+1} = x_k + α_k p_k,    (2.48)

where α_k is obtained from the line search problem

    minimize f(α) = f(x_k + α p_k).    (2.49)

Setting df(α)/dα = 0 yields

    p_k^T A p_k α_k + p_k^T (A x_k − c) = 0  ⇒  α_k = − p_k^T g_k / (p_k^T A p_k).    (2.50)

The exact line search condition df(α)/dα = 0 also yields

    p_k^T g_{k+1} = 0.    (2.51)

Now, the key step: choosing p_{k+1} to be of the form

    p_{k+1} = −g_{k+1} + β_k p_k,    (2.52)

where β_k introduces a deflection in the steepest-descent direction. Requiring p_{k+1} to be conjugate to p_k,

    p_{k+1}^T A p_k = −g_{k+1}^T A p_k + β_k p_k^T A p_k = 0.    (2.53)

Manipulating x_{k+1} = x_k + α_k p_k leads to

    A p_k = (g_{k+1} − g_k)/α_k.    (2.54)

Rearranging these equations yields

    β_k = g_{k+1}^T (g_{k+1} − g_k) / (α_k p_k^T A p_k).    (2.55)

Taking the dot product of (2.52) with g_{k+1} and using (2.51) results in

    p_k^T g_k = −g_k^T g_k.    (2.56)

Substituting (2.56) into (2.50) leads to

    α_k = g_k^T g_k / (p_k^T A p_k).    (2.57)

Finally, replacing (2.57) in (2.55),

    β_k = g_{k+1}^T (g_{k+1} − g_k) / (g_k^T g_k).    (2.58)

For any x_0, the sequence {x_k} generated by the conjugate direction algorithm converges to the solution of the linear system in at most n steps, but only for quadratic functions. This is referred to as the (linear) Polak–Ribière algorithm. Convergence is also only guaranteed with exact line searches and no round-off errors. For general functions, a restart is made every n iterations, wherein a steepest-descent step is taken for computational stability.
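For the quadratic case, equations (2.45), (2.57), and (2.58) translate directly into the linear CG iteration; a sketch on a small symmetric positive-definite system (the matrix and right-hand side are arbitrary examples):

```python
import numpy as np

def linear_cg(A, c, x0, tol=1e-10):
    """Linear conjugate gradient for min 0.5 x^T A x - c^T x, i.e. A x = c."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - c                              # gradient of the quadratic
    p = -g                                     # first direction, Eq. (2.47)
    for _ in range(len(c)):
        alpha = (g @ g) / (p @ A @ p)          # exact step, Eq. (2.57)
        x = x + alpha * p
        g_new = A @ x - c
        beta = (g_new @ (g_new - g)) / (g @ g) # deflection, Eq. (2.58)
        g, p = g_new, -g_new + beta * p        # new conjugate direction
        if np.linalg.norm(g) < tol:
            break
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, 2.0])
x = linear_cg(A, c, np.zeros(2))   # converges in at most n = 2 steps
```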

If we consider, for a quadratic,

    g_{k+1}^T g_k = g_{k+1}^T (−p_k + β_{k-1} p_{k-1}) = β_{k-1} g_{k+1}^T p_{k-1}
                  = β_{k-1} (g_k^T + α_k p_k^T A) p_{k-1} = 0,

then substituting in (2.58), we obtain

    β_k = g_{k+1}^T g_{k+1} / (g_k^T g_k),    (2.59)

which is the nonlinear CG algorithm, also known as the Fletcher–Reeves method.

The only difference of CG relative to steepest descent is that each descent direction is modified by adding a contribution from the previous direction. The rate of convergence is linear, but can be superlinear; the method converges in n to 5n iterations, usually about 2n.

Several variants of the Fletcher–Reeves CG method have been proposed. Most of these variants differ in their definition of β_k. For example, Dai and Yuan [16] proposed

    β_k = ‖g_{k+1}‖² / ((g_{k+1} − g_k)^T p_k).    (2.60)

Input: function f, starting point x_0, and convergence parameters ε_g, ε_a, and ε_r
Output: local minimum of f

begin
  set k = 0
  compute g(x_k) ≡ ∇f(x_k)
  if ‖g(x_k)‖ ≤ ε_g then converged
  repeat
    compute the conjugate gradient direction p_k = −g_k + β_k p_{k-1}, where
      β_k = g_k^T g_k / (g_{k-1}^T g_{k-1})
    perform line search to find step length α_k in the direction of p_k
    update the current point: x_{k+1} = x_k + α_k p_k
    evaluate f(x_{k+1})
    if |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is satisfied for two successive iterations
      then converged
    else set k = k + 1
  until converged
end

Pseudo-code 9: Nonlinear conjugate gradient algorithm
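A sketch of Pseudo-code 9 with a backtracking line search, a steepest-descent restart every n iterations, and a descent-direction safeguard (these implementation details and the test problem are our own choices):

```python
import numpy as np

def fletcher_reeves(f, grad, x0, eps_g=1e-8, max_iter=2000):
    """Nonlinear CG with the Fletcher-Reeves beta of Eq. (2.59)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    g = grad(x)
    p = -g                                     # first direction
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        if g @ p >= 0:                         # safeguard: restart if not descent
            p = -g
        alpha, fx, slope = 1.0, f(x), g @ p
        while f(x + alpha * p) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5                       # backtracking line search
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves
        # restart with steepest descent every n iterations
        p = -g_new if (k + 1) % n == 0 else -g_new + beta * p
        g = g_new
    return x

xmin = fletcher_reeves(lambda x: (x[0] - 1)**2 + 5 * (x[1] + 2)**2,
                       lambda x: np.array([2 * (x[0] - 1), 10 * (x[1] + 2)]),
                       [0.0, 0.0])
```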

Example 2.7: Conjugate Gradient Applied to a Quadratic Function

Figure 2.3: Solution path of the nonlinear conjugate gradient method

2.2.6 Newton Methods

Even though Newton's method lacks robustness for optimization, its concepts lay the basis for other powerful methods discussed subsequently. While the steepest descent and conjugate gradient methods only use first-order information (the function gradient, or first-derivative term in the Taylor series) to obtain a local model of the function, Newton methods use a second-order Taylor-series expansion of the function about the current design point, i.e., a quadratic model

    f(x_k + d_k) ≈ f_k + g_k^T d_k + ½ d_k^T H_k d_k,    (2.61)

where d_k is the step to the minimum. Differentiating this with respect to d_k and setting the result to zero, we obtain the step that minimizes the quadratic:

    H_k d_k = −g_k.    (2.62)

This is a linear system whose solution yields the Newton step d_k. Thus, the Newton method gives both the search direction and the step size, i.e., p_k = d_k and α_k = 1.

When it converges, this method converges at a faster rate than first-order methods. If the function f is quadratic with a positive-definite Hessian matrix H_k, the method converges in one step. For general nonlinear functions, Newton's method converges quadratically if x_0 is sufficiently close to x* and the Hessian is positive definite at x*.

Despite the excellent convergence rate, this method has two main disadvantages:

- As in the single-variable case, difficulties and even failure may occur when the quadratic model is a poor approximation of f. If H_k is not positive definite, the quadratic model might not have a minimum or even a stationary point. For some nonlinear functions, the Newton step might be such that f(x_k + d_k) > f(x_k), and the method is not guaranteed to converge.
- Newton's method requires computing not only the gradient, but also the Hessian, which contains n(n + 1)/2 independent second-order derivatives.

Input: function f, starting point x_0, and convergence parameters ε_g, ε_a, and ε_r
Output: local minimum of f

begin
  set k = 0
  repeat
    compute g_k ≡ ∇f(x_k)
    if ‖g_k‖ ≤ ε_g then converged
    compute H_k ≡ ∇²f(x_k)
    compute the Newton step d_k from H_k d_k = −g_k
    update the current point: x_{k+1} = x_k + d_k
    if |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is satisfied for two successive iterations
      then converged
    else set k = k + 1
  until converged
end

Pseudo-code 10: Newton's method
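A minimal implementation of Pseudo-code 10 (the convex test function is our own choice; recall that the pure Newton iteration is only locally convergent in general):

```python
import numpy as np

def newton(f, grad, hess, x0, eps_g=1e-10, max_iter=50):
    """Pure Newton iteration: solve H_k d_k = -g_k and take alpha_k = 1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        d = np.linalg.solve(hess(x), -g)   # Newton step, Eq. (2.62)
        x = x + d
    return x

# f(x) = exp(x1) - x1 + x2^2 is convex with its minimum at (0, 0)
xmin = newton(lambda x: np.exp(x[0]) - x[0] + x[1]**2,
              lambda x: np.array([np.exp(x[0]) - 1, 2 * x[1]]),
              lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]]),
              [1.0, 1.0])
```

The quadratic x_2 coordinate is solved in a single step, while the x_1 coordinate converges quadratically in a handful of iterations.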

Modified Newton's Method

To address the two main disadvantages of Newton's method mentioned above, two modifications can be made.

First, ensure that the search direction is a descent direction of f at x_k, that is, ensure that ∇f(x_k)^T d_k < 0, which using (2.62) means

    −∇f(x_k)^T [∇²f(x_k)]⁻¹ ∇f(x_k) < 0.    (2.63)

For this to be satisfied, the Hessian of f has to be positive definite. One strategy is to replace the true Hessian with a symmetric positive-definite matrix F_k defined by

    F_k = H_k + γ I,    (2.64)

where γ is chosen such that all the eigenvalues of F_k are greater than a scalar δ > 0. The direction vector d_k is then determined from the solution of

    F_k d_k = −g_k.    (2.65)

Second, a step size parameter α_k can be introduced to improve the approximation for highly nonlinear functions. The step size α_k is obtained from a line search, minimize f(x_k + α_k d_k), and the new point is then

    x_{k+1} = x_k + α_k d_k.

When using Newton or quasi-Newton methods, the starting step length ᾱ is usually set to 1, since Newton's method already provides a good guess for the step size. The step size reduction ratio (ρ in the backtracking line search) sometimes varies during the optimization process, always with 0 < ρ < 1; in practice ρ is not set too close to 0 or 1.

Input: function f, starting point x_0, scalar δ > 0, and convergence parameters ε_g, ε_a, and ε_r
Output: local minimum of f

begin
  set k = 0
  repeat
    compute g_k ≡ ∇f(x_k)
    if ‖g_k‖ ≤ ε_g then converged
    compute H_k ≡ ∇²f(x_k) and F_k = H_k + γ I
    compute the search direction d_k from F_k d_k = −g_k
    compute the step size α_k from: minimize f(x_k + α_k d_k)
    update the current point: x_{k+1} = x_k + α_k d_k
    if |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is satisfied for two successive iterations
      then converged
    else set k = k + 1
  until converged
end

Pseudo-code 11: Modified Newton's method
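One way to pick γ in (2.64) is from the smallest eigenvalue of H_k (an illustrative choice; cheaper strategies based on modified factorizations exist):

```python
import numpy as np

def make_positive_definite(H, delta=1e-3):
    """Return F = H + gamma*I with all eigenvalues > delta, per Eq. (2.64).
    gamma is derived from the smallest eigenvalue (illustrative strategy)."""
    lam_min = np.linalg.eigvalsh(H).min()
    gamma = 0.0 if lam_min > delta else delta - lam_min
    return H + gamma * np.eye(H.shape[0])

H = np.array([[1.0, 2.0], [2.0, 1.0]])   # indefinite: eigenvalues 3 and -1
F = make_positive_definite(H)
print(np.linalg.eigvalsh(F))             # all eigenvalues are now >= delta
```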

Example 2.8: Modified Newton's Method Applied to a Quadratic Function

Figure 2.4: Solution path of the modified Newton's method

2.2.7 Quasi-Newton Methods

This class of methods uses first-order information only, but builds second-order information (an approximate Hessian) from the sequence of function values and gradients of previous iterations. Most of these methods also force the approximation to be symmetric and positive definite, which can greatly improve their convergence properties.

Key to the success of Newton's method is the use of the n-dimensional curvature information given by the Hessian, which allows a local quadratic model of f. Quasi-Newton methods contrast with Newton's method in that the latter computes all gradient and curvature terms at a single point, whereas quasi-Newton methods accumulate curvature information across iterations.

The update formula is now

    x_{k+1} = x_k − α_k V_k ∇f(x_k),    (2.66)

where V_k is the inverse of the Hessian approximation, V_k ≈ F_k⁻¹, and the step size α_k is determined by minimizing f(x_k + α d_k) with respect to α, where d_k = −V_k ∇f(x_k).

When using quasi-Newton methods, the inverse Hessian approximation is initialized to the identity matrix, V_0 = I. The update at each iteration, written V̂_k, is added to the current approximation,

    V_{k+1} = V_k + V̂_k.    (2.67)

Consider the Taylor-series expansion of the gradient about x_k,

    g(x_{k+1}) = g_k + H_k s_k + ...,    (2.68)

where s_k = x_{k+1} − x_k. Neglecting the higher-order terms in this series yields

    H_k s_k = y_k,    (2.69)

where y_k = g(x_{k+1}) − g(x_k). The new approximation to the inverse of the Hessian, V_{k+1}, must then satisfy the quasi-Newton condition,

    V_{k+1} y_k = s_k.    (2.70)

Quasi-Newton methods are the most widely used of the gradient-based optimization methods.

Davidon–Fletcher–Powell (DFP) Method

One of the first quasi-Newton methods was devised by Davidon (1959) [20] and modified by Fletcher and Powell (1963) [26]. Instead of computing V_k from scratch at every iteration, a quasi-Newton method updates it in a way that accounts for the curvature measured during the most recent step. The DFP update for the inverse Hessian approximation can be shown to be

    V_{k+1}^{DFP} = V_k − (V_k y_k y_k^T V_k)/(y_k^T V_k y_k) + (s_k s_k^T)/(s_k^T y_k).    (2.71)

Notice that V_{k+1} remains symmetric, and it can also be shown that it remains positive definite (assuming V_k is positive definite and s_k^T y_k > 0). When applied to quadratic functions, the update formula results in the exact inverse of the Hessian matrix after n iterations, which implies convergence at the end of n iterations (same as the CG method). For large problems, the storage and update of V may be a disadvantage of quasi-Newton methods compared to the conjugate gradient method.

The DFP Algorithm

1. Select a starting point x_0 and convergence parameter ε_g. Set k = 0 and V_0 = I.
2. Compute g(x_k) ≡ ∇f(x_k). If ‖g(x_k)‖ ≤ ε_g, stop. Otherwise, continue.
3. Compute the search direction, p_k = −V_k g_k.
4. Perform a line search to find the step length α_k in the direction of p_k (start with α_k = 1).
5. Update the current point, x_{k+1} = x_k + α_k p_k, set s_k = α_k p_k, and compute the change in the gradient, y_k = g_{k+1} − g_k.
6. Update the inverse Hessian approximation:

    A_k = (V_k y_k y_k^T V_k)/(y_k^T V_k y_k),  B_k = (s_k s_k^T)/(s_k^T y_k),  V_{k+1} = V_k − A_k + B_k.

7. Set k = k + 1 and return to step 2.

Broyden–Fletcher–Goldfarb–Shanno (BFGS) Method

The DFP update was soon superseded by the BFGS formula [12, 13, 25, 31, 54], which is generally considered the most effective quasi-Newton update. The BFGS update for the inverse Hessian approximation can be shown to be

    V_{k+1}^{BFGS} = V_k − (s_k y_k^T V_k + V_k y_k s_k^T)/(s_k^T y_k)
                     + (1 + (y_k^T V_k y_k)/(s_k^T y_k)) (s_k s_k^T)/(s_k^T y_k).    (2.72)

The relative performance of the DFP and BFGS methods is problem dependent, but the BFGS update is better suited than the DFP update when using approximate line searches.
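A sketch of BFGS using the inverse update (2.72), a backtracking line search, and a curvature safeguard s_k^T y_k > 0 before updating (the safeguard, line-search details, and quadratic test problem are our own choices):

```python
import numpy as np

def bfgs(f, grad, x0, eps_g=1e-8, max_iter=200):
    """BFGS with the inverse-Hessian update of Eq. (2.72)."""
    x = np.asarray(x0, dtype=float)
    V = np.eye(len(x))                  # V_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps_g:
            break
        p = -V @ g                      # quasi-Newton direction
        alpha, fx, slope = 1.0, f(x), g @ p
        while f(x + alpha * p) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5                # backtracking line search
        s = alpha * p
        x = x + s
        g_new = grad(x)
        y = g_new - g
        sy = s @ y
        if sy > 1e-12:                  # curvature safeguard
            Vy = V @ y
            V = (V - (np.outer(s, Vy) + np.outer(Vy, s)) / sy
                   + (1 + (y @ Vy) / sy) * np.outer(s, s) / sy)
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, 2.0])
xmin = bfgs(lambda x: 0.5 * x @ A @ x - c @ x, lambda x: A @ x - c, np.zeros(2))
```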

Example 2.9: BFGS Applied to a Quadratic Function

Figure 2.5: Solution path of the BFGS method

2.2.8 Trust Region Methods

Trust region, or restricted-step, methods are a different approach to resolving the weaknesses of the pure form of Newton's method, which arise from a Hessian that is not positive definite or from a highly nonlinear function. One may interpret these problems as arising from minimizing the quadratic approximation of f,

    minimize q(x) = f(x_k) + ∇f(x_k)^T (x − x_k) + ½ (x − x_k)^T ∇²f(x_k) (x − x_k),

in a region that is outside the validity region of the quadratic approximation. These difficulties can be overcome by minimizing the function within a region around x_k in which the second-order Taylor-series approximation is valid, that is to say, where there is trust in the quadratic model. This region is called the trust region and can be denoted by

    Ω_k = {x : ‖x − x_k‖ ≤ h_k},

where h_k is the size of the trust region, which is dynamically adjusted.

The quadratic approximation q is minimized within Ω_k:

    minimize   q(s_k) = f(x_k) + g(x_k)^T s_k + ½ s_k^T H(x_k) s_k
    w.r.t.     s_k                                                    (2.73)
    s.t.       −h_k ≤ (s_k)_i ≤ h_k,  i = 1, ..., n,

which is a constrained minimization problem involving a quadratic objective function and linear constraints. This class of problems is called a quadratic programming (QP) problem, and its solution is discussed in future sections.

After obtaining s_k, the actual and predicted changes in the objective function can be computed as

    Δf = f(x_k) − f(x_k + s_k),    (2.74)
    Δq = f(x_k) − q(s_k).    (2.75)

The accuracy with which q(s_k) approximates f(x_k + s_k) can then be measured by the ratio

    r_k = Δf / Δq.    (2.76)

The closer r_k is to unity, the better the agreement.

The size of the trust region is updated based on this ratio as follows:

    h_{k+1} = ‖s_k‖/4   if r_k < 0.25,
    h_{k+1} = 2 h_k     if r_k > 0.75 and h_k = ‖s_k‖,    (2.77)
    h_{k+1} = h_k       otherwise.

The initial value of h is usually taken as h_0 = 1. The quadratic model is considered reasonable when q(s_k) is close to the true function value f(x_k + s_k).

A particular advantage of trust region methods is that they are not very sensitive to scaling, because of the dynamic adjustment of the size of the trust region. They are also very robust.

Input: function f, starting point x_0, convergence parameters ε_g, ε_a, and ε_r, and initial trust region size h_0
Output: local minimum of f

begin
  set k = 0
  repeat
    compute g_k ≡ ∇f(x_k)
    if ‖g_k‖ ≤ ε_g then converged
    compute H_k ≡ ∇²f(x_k) and solve the quadratic subproblem (2.73) for s_k
    evaluate f(x_k + s_k) and compute the quadratic model accuracy ratio r_k from (2.76)
    compute the size of the new trust region using (2.77)
    determine the new point: x_{k+1} = x_k if r_k ≤ 0; x_{k+1} = x_k + s_k otherwise
    if |f(x_{k+1}) − f(x_k)| ≤ ε_a + ε_r |f(x_k)| is satisfied for two successive iterations
      then converged
    else set k = k + 1
  until converged
end

Pseudo-code 12: Trust region algorithm
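A simplified sketch of Pseudo-code 12: instead of solving the QP subproblem (2.73) exactly, the Newton step is clipped to the box (a crude approximation of our own), combined with the radius update (2.77) and the acceptance rule r_k > 0:

```python
import numpy as np

def trust_region(f, grad, hess, x0, h0=1.0, eps_g=1e-8, max_iter=500):
    """Trust-region sketch; the subproblem is only solved approximately."""
    x = np.asarray(x0, dtype=float)
    h = h0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        H = hess(x)
        try:
            s = np.linalg.solve(H, -g)        # Newton step
        except np.linalg.LinAlgError:
            s = -g                            # fall back to steepest descent
        s = np.clip(s, -h, h)                 # enforce the box |s_i| <= h
        df = f(x) - f(x + s)                  # actual reduction, Eq. (2.74)
        dq = -(g @ s + 0.5 * s @ H @ s)       # predicted reduction, Eq. (2.75)
        r = df / dq if dq != 0 else 0.0       # accuracy ratio, Eq. (2.76)
        if r < 0.25:
            h = np.max(np.abs(s)) / 4         # shrink the trust region
        elif r > 0.75 and np.max(np.abs(s)) == h:
            h = 2 * h                         # expand the trust region
        if r > 0:
            x = x + s                         # accept the step
    return x

xmin = trust_region(lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2,
                    lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)]),
                    lambda x: np.array([[2.0, 0.0], [0.0, 20.0]]),
                    [5.0, 5.0])
```

On this quadratic the model is exact, so r_k = 1 at every iteration and the region expands until the full Newton step fits inside the box.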

Example 2.10: Minimization of the Rosenbrock Function

Minimize Rosenbrock's function,

    f(x) = 100 (x_2 − x_1²)² + (1 − x_1)²,

starting from x_0 = [ ].
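The starting point in the example statement is not legible in our copy; the sketch below applies the pure Newton iteration to Rosenbrock's function from the classic benchmark start x_0 = (−1.2, 1), which is our assumption:

```python
import numpy as np

# Rosenbrock's function with analytic gradient and Hessian
f = lambda x: 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
grad = lambda x: np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                           200 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[1200 * x[0]**2 - 400 * x[1] + 2, -400 * x[0]],
                           [-400 * x[0], 200.0]])

x = np.array([-1.2, 1.0])                 # assumed start, not from the text
for k in range(50):
    g = grad(x)
    if np.linalg.norm(g) <= 1e-10:
        break
    x = x + np.linalg.solve(hess(x), -g)  # full Newton step, alpha = 1

print(x)   # converges to the minimum at (1, 1)
```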

Figure 2.6: Solution path of the steepest descent and conjugate gradient methods

Figure 2.7: Solution path of the modified Newton and BFGS methods


More information

Lecture Notes: Geometric Considerations in Unconstrained Optimization

Lecture Notes: Geometric Considerations in Unconstrained Optimization Lecture Notes: Geometric Considerations in Unconstrained Optimization James T. Allison February 15, 2006 The primary objectives of this lecture on unconstrained optimization are to: Establish connections

More information

GENG2140, S2, 2012 Week 7: Curve fitting

GENG2140, S2, 2012 Week 7: Curve fitting GENG2140, S2, 2012 Week 7: Curve fitting Curve fitting is the process of constructing a curve, or mathematical function, f(x) that has the best fit to a series of data points Involves fitting lines and

More information

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems Outline Scientific Computing: An Introductory Survey Chapter 6 Optimization 1 Prof. Michael. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Chapter III. Unconstrained Univariate Optimization

Chapter III. Unconstrained Univariate Optimization 1 Chapter III Unconstrained Univariate Optimization Introduction Interval Elimination Methods Polynomial Approximation Methods Newton s Method Quasi-Newton Methods 1 INTRODUCTION 2 1 Introduction Univariate

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

Programming, numerics and optimization

Programming, numerics and optimization Programming, numerics and optimization Lecture C-3: Unconstrained optimization II Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428

More information

17 Solution of Nonlinear Systems

17 Solution of Nonlinear Systems 17 Solution of Nonlinear Systems We now discuss the solution of systems of nonlinear equations. An important ingredient will be the multivariate Taylor theorem. Theorem 17.1 Let D = {x 1, x 2,..., x m

More information

OPER 627: Nonlinear Optimization Lecture 14: Mid-term Review

OPER 627: Nonlinear Optimization Lecture 14: Mid-term Review OPER 627: Nonlinear Optimization Lecture 14: Mid-term Review Department of Statistical Sciences and Operations Research Virginia Commonwealth University Oct 16, 2013 (Lecture 14) Nonlinear Optimization

More information

Chapter 3: Root Finding. September 26, 2005

Chapter 3: Root Finding. September 26, 2005 Chapter 3: Root Finding September 26, 2005 Outline 1 Root Finding 2 3.1 The Bisection Method 3 3.2 Newton s Method: Derivation and Examples 4 3.3 How To Stop Newton s Method 5 3.4 Application: Division

More information

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods Optimality Conditions: Equality Constrained Case As another example of equality

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science EAD 115 Numerical Solution of Engineering and Scientific Problems David M. Rocke Department of Applied Science Multidimensional Unconstrained Optimization Suppose we have a function f() of more than one

More information

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005

University of Houston, Department of Mathematics Numerical Analysis, Fall 2005 3 Numerical Solution of Nonlinear Equations and Systems 3.1 Fixed point iteration Reamrk 3.1 Problem Given a function F : lr n lr n, compute x lr n such that ( ) F(x ) = 0. In this chapter, we consider

More information

, b = 0. (2) 1 2 The eigenvectors of A corresponding to the eigenvalues λ 1 = 1, λ 2 = 3 are

, b = 0. (2) 1 2 The eigenvectors of A corresponding to the eigenvalues λ 1 = 1, λ 2 = 3 are Quadratic forms We consider the quadratic function f : R 2 R defined by f(x) = 2 xt Ax b T x with x = (x, x 2 ) T, () where A R 2 2 is symmetric and b R 2. We will see that, depending on the eigenvalues

More information

UNCONSTRAINED OPTIMIZATION

UNCONSTRAINED OPTIMIZATION UNCONSTRAINED OPTIMIZATION 6. MATHEMATICAL BASIS Given a function f : R n R, and x R n such that f(x ) < f(x) for all x R n then x is called a minimizer of f and f(x ) is the minimum(value) of f. We wish

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

MATH 4211/6211 Optimization Basics of Optimization Problems

MATH 4211/6211 Optimization Basics of Optimization Problems MATH 4211/6211 Optimization Basics of Optimization Problems Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 A standard minimization

More information

x 2 x n r n J(x + t(x x ))(x x )dt. For warming-up we start with methods for solving a single equation of one variable.

x 2 x n r n J(x + t(x x ))(x x )dt. For warming-up we start with methods for solving a single equation of one variable. Maria Cameron 1. Fixed point methods for solving nonlinear equations We address the problem of solving an equation of the form (1) r(x) = 0, where F (x) : R n R n is a vector-function. Eq. (1) can be written

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

Line Search Methods. Shefali Kulkarni-Thaker

Line Search Methods. Shefali Kulkarni-Thaker 1 BISECTION METHOD Line Search Methods Shefali Kulkarni-Thaker Consider the following unconstrained optimization problem min f(x) x R Any optimization algorithm starts by an initial point x 0 and performs

More information

Computational Finance

Computational Finance Department of Mathematics at University of California, San Diego Computational Finance Optimization Techniques [Lecture 2] Michael Holst January 9, 2017 Contents 1 Optimization Techniques 3 1.1 Examples

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

Unconstrained Optimization

Unconstrained Optimization 1 / 36 Unconstrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University February 2, 2015 2 / 36 3 / 36 4 / 36 5 / 36 1. preliminaries 1.1 local approximation

More information

Unit 2: Solving Scalar Equations. Notes prepared by: Amos Ron, Yunpeng Li, Mark Cowlishaw, Steve Wright Instructor: Steve Wright

Unit 2: Solving Scalar Equations. Notes prepared by: Amos Ron, Yunpeng Li, Mark Cowlishaw, Steve Wright Instructor: Steve Wright cs416: introduction to scientific computing 01/9/07 Unit : Solving Scalar Equations Notes prepared by: Amos Ron, Yunpeng Li, Mark Cowlishaw, Steve Wright Instructor: Steve Wright 1 Introduction We now

More information

Introduction to gradient descent

Introduction to gradient descent 6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our

More information

Nonlinear Optimization: What s important?

Nonlinear Optimization: What s important? Nonlinear Optimization: What s important? Julian Hall 10th May 2012 Convexity: convex problems A local minimizer is a global minimizer A solution of f (x) = 0 (stationary point) is a minimizer A global

More information

Static unconstrained optimization

Static unconstrained optimization Static unconstrained optimization 2 In unconstrained optimization an objective function is minimized without any additional restriction on the decision variables, i.e. min f(x) x X ad (2.) with X ad R

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 5 Nonlinear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality AM 205: lecture 18 Last time: optimization methods Today: conditions for optimality Existence of Global Minimum For example: f (x, y) = x 2 + y 2 is coercive on R 2 (global min. at (0, 0)) f (x) = x 3

More information

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 5. Nonlinear Equations

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 5. Nonlinear Equations Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T Heath Chapter 5 Nonlinear Equations Copyright c 2001 Reproduction permitted only for noncommercial, educational

More information

Optimality Conditions

Optimality Conditions Chapter 2 Optimality Conditions 2.1 Global and Local Minima for Unconstrained Problems When a minimization problem does not have any constraints, the problem is to find the minimum of the objective function.

More information

Numerical Optimization

Numerical Optimization Numerical Optimization Unit 2: Multivariable optimization problems Che-Rung Lee Scribe: February 28, 2011 (UNIT 2) Numerical Optimization February 28, 2011 1 / 17 Partial derivative of a two variable function

More information

Lecture 7: Minimization or maximization of functions (Recipes Chapter 10)

Lecture 7: Minimization or maximization of functions (Recipes Chapter 10) Lecture 7: Minimization or maximization of functions (Recipes Chapter 10) Actively studied subject for several reasons: Commonly encountered problem: e.g. Hamilton s and Lagrange s principles, economics

More information

Quasi-Newton Methods

Quasi-Newton Methods Newton s Method Pros and Cons Quasi-Newton Methods MA 348 Kurt Bryan Newton s method has some very nice properties: It s extremely fast, at least once it gets near the minimum, and with the simple modifications

More information

Chapter 4. Unconstrained optimization

Chapter 4. Unconstrained optimization Chapter 4. Unconstrained optimization Version: 28-10-2012 Material: (for details see) Chapter 11 in [FKS] (pp.251-276) A reference e.g. L.11.2 refers to the corresponding Lemma in the book [FKS] PDF-file

More information

Multivariate Newton Minimanization

Multivariate Newton Minimanization Multivariate Newton Minimanization Optymalizacja syntezy biosurfaktantu Rhamnolipid Rhamnolipids are naturally occuring glycolipid produced commercially by the Pseudomonas aeruginosa species of bacteria.

More information

Gradient Descent. Sargur Srihari

Gradient Descent. Sargur Srihari Gradient Descent Sargur srihari@cedar.buffalo.edu 1 Topics Simple Gradient Descent/Ascent Difficulties with Simple Gradient Descent Line Search Brent s Method Conjugate Gradient Descent Weight vectors

More information

Constrained optimization. Unconstrained optimization. One-dimensional. Multi-dimensional. Newton with equality constraints. Active-set method.

Constrained optimization. Unconstrained optimization. One-dimensional. Multi-dimensional. Newton with equality constraints. Active-set method. Optimization Unconstrained optimization One-dimensional Multi-dimensional Newton s method Basic Newton Gauss- Newton Quasi- Newton Descent methods Gradient descent Conjugate gradient Constrained optimization

More information

Unconstrained optimization I Gradient-type methods

Unconstrained optimization I Gradient-type methods Unconstrained optimization I Gradient-type methods Antonio Frangioni Department of Computer Science University of Pisa www.di.unipi.it/~frangio frangio@di.unipi.it Computational Mathematics for Learning

More information

Introduction to unconstrained optimization - direct search methods

Introduction to unconstrained optimization - direct search methods Introduction to unconstrained optimization - direct search methods Jussi Hakanen Post-doctoral researcher jussi.hakanen@jyu.fi Structure of optimization methods Typically Constraint handling converts the

More information

Nonlinear Equations. Chapter The Bisection Method

Nonlinear Equations. Chapter The Bisection Method Chapter 6 Nonlinear Equations Given a nonlinear function f(), a value r such that f(r) = 0, is called a root or a zero of f() For eample, for f() = e 016064, Fig?? gives the set of points satisfying y

More information

Numerical Methods. Root Finding

Numerical Methods. Root Finding Numerical Methods Solving Non Linear 1-Dimensional Equations Root Finding Given a real valued function f of one variable (say ), the idea is to find an such that: f() 0 1 Root Finding Eamples Find real

More information

CHAPTER 4 ROOTS OF EQUATIONS

CHAPTER 4 ROOTS OF EQUATIONS CHAPTER 4 ROOTS OF EQUATIONS Chapter 3 : TOPIC COVERS (ROOTS OF EQUATIONS) Definition of Root of Equations Bracketing Method Graphical Method Bisection Method False Position Method Open Method One-Point

More information

Mathematical optimization

Mathematical optimization Optimization Mathematical optimization Determine the best solutions to certain mathematically defined problems that are under constrained determine optimality criteria determine the convergence of the

More information

Scientific Computing: Optimization

Scientific Computing: Optimization Scientific Computing: Optimization Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 March 8th, 2011 A. Donev (Courant Institute) Lecture

More information

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23 Optimization: Nonlinear Optimization without Constraints Nonlinear Optimization without Constraints 1 / 23 Nonlinear optimization without constraints Unconstrained minimization min x f(x) where f(x) is

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x)

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x) Solving Nonlinear Equations & Optimization One Dimension Problem: or a unction, ind 0 such that 0 = 0. 0 One Root: The Bisection Method This one s guaranteed to converge at least to a singularity, i not

More information

Math 409/509 (Spring 2011)

Math 409/509 (Spring 2011) Math 409/509 (Spring 2011) Instructor: Emre Mengi Study Guide for Homework 2 This homework concerns the root-finding problem and line-search algorithms for unconstrained optimization. Please don t hesitate

More information

Outline. Scientific Computing: An Introductory Survey. Nonlinear Equations. Nonlinear Equations. Examples: Nonlinear Equations

Outline. Scientific Computing: An Introductory Survey. Nonlinear Equations. Nonlinear Equations. Examples: Nonlinear Equations Methods for Systems of Methods for Systems of Outline Scientific Computing: An Introductory Survey Chapter 5 1 Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

Chapter 6: Derivative-Based. optimization 1

Chapter 6: Derivative-Based. optimization 1 Chapter 6: Derivative-Based Optimization Introduction (6. Descent Methods (6. he Method of Steepest Descent (6.3 Newton s Methods (NM (6.4 Step Size Determination (6.5 Nonlinear Least-Squares Problems

More information

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Coralia Cartis, University of Oxford INFOMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems Methods

More information

Math 273a: Optimization Netwon s methods

Math 273a: Optimization Netwon s methods Math 273a: Optimization Netwon s methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 some material taken from Chong-Zak, 4th Ed. Main features of Newton s method Uses both first derivatives

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

15 Nonlinear Equations and Zero-Finders

15 Nonlinear Equations and Zero-Finders 15 Nonlinear Equations and Zero-Finders This lecture describes several methods for the solution of nonlinear equations. In particular, we will discuss the computation of zeros of nonlinear functions f(x).

More information

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares Robert Bridson October 29, 2008 1 Hessian Problems in Newton Last time we fixed one of plain Newton s problems by introducing line search

More information

1. Method 1: bisection. The bisection methods starts from two points a 0 and b 0 such that

1. Method 1: bisection. The bisection methods starts from two points a 0 and b 0 such that Chapter 4 Nonlinear equations 4.1 Root finding Consider the problem of solving any nonlinear relation g(x) = h(x) in the real variable x. We rephrase this problem as one of finding the zero (root) of a

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods AM 205: lecture 19 Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods Quasi-Newton Methods General form of quasi-newton methods: x k+1 = x k α

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Lecture 5, Continuous Optimisation Oxford University Computing Laboratory, HT 2006 Notes by Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The notion of complexity (per iteration)

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

Solution of Nonlinear Equations

Solution of Nonlinear Equations Solution of Nonlinear Equations (Com S 477/577 Notes) Yan-Bin Jia Sep 14, 017 One of the most frequently occurring problems in scientific work is to find the roots of equations of the form f(x) = 0. (1)

More information

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent Nonlinear Optimization Steepest Descent and Niclas Börlin Department of Computing Science Umeå University niclas.borlin@cs.umu.se A disadvantage with the Newton method is that the Hessian has to be derived

More information

LECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION

LECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION 15-382 COLLECTIVE INTELLIGENCE - S19 LECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION TEACHER: GIANNI A. DI CARO WHAT IF WE HAVE ONE SINGLE AGENT PSO leverages the presence of a swarm: the outcome

More information

SOLUTION OF ALGEBRAIC AND TRANSCENDENTAL EQUATIONS BISECTION METHOD

SOLUTION OF ALGEBRAIC AND TRANSCENDENTAL EQUATIONS BISECTION METHOD BISECTION METHOD If a function f(x) is continuous between a and b, and f(a) and f(b) are of opposite signs, then there exists at least one root between a and b. It is shown graphically as, Let f a be negative

More information

PART I Lecture Notes on Numerical Solution of Root Finding Problems MATH 435

PART I Lecture Notes on Numerical Solution of Root Finding Problems MATH 435 PART I Lecture Notes on Numerical Solution of Root Finding Problems MATH 435 Professor Biswa Nath Datta Department of Mathematical Sciences Northern Illinois University DeKalb, IL. 60115 USA E mail: dattab@math.niu.edu

More information

Numerical Optimization: Basic Concepts and Algorithms

Numerical Optimization: Basic Concepts and Algorithms May 27th 2015 Numerical Optimization: Basic Concepts and Algorithms R. Duvigneau R. Duvigneau - Numerical Optimization: Basic Concepts and Algorithms 1 Outline Some basic concepts in optimization Some

More information

Numerical Optimization Techniques

Numerical Optimization Techniques Numerical Optimization Techniques Léon Bottou NEC Labs America COS 424 3/2/2010 Today s Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification,

More information

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0 Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =

More information

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by:

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by: Newton s Method Suppose we want to solve: (P:) min f (x) At x = x, f (x) can be approximated by: n x R. f (x) h(x) := f ( x)+ f ( x) T (x x)+ (x x) t H ( x)(x x), 2 which is the quadratic Taylor expansion

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Chapter 1. Root Finding Methods. 1.1 Bisection method

Chapter 1. Root Finding Methods. 1.1 Bisection method Chapter 1 Root Finding Methods We begin by considering numerical solutions to the problem f(x) = 0 (1.1) Although the problem above is simple to state it is not always easy to solve analytically. This

More information

Solution of Algebric & Transcendental Equations

Solution of Algebric & Transcendental Equations Page15 Solution of Algebric & Transcendental Equations Contents: o Introduction o Evaluation of Polynomials by Horner s Method o Methods of solving non linear equations o Bracketing Methods o Bisection

More information