Nonlinear Programming

Kees Roos
e-mail: C.Roos@ewi.tudelft.nl
URL: http://www.isa.ewi.tudelft.nl/ roos

LNMB Course, De Uithof, Utrecht, February 6 - May 8, A.D. 2006

Optimization Group
Outline for week 7: Algorithms for unconstrained minimization

- A generic algorithm
- Rate of convergence
- Line search methods
  - Dichotomous and golden section search
  - Bisection
  - Newton's method
- Search directions
  - Gradient method
  - Newton's method
  - Methods of conjugate directions
    - Powell's method
    - Fletcher-Reeves method
  - Quasi-Newton methods
    - DFP update
    - BFGS update
- Stopping criteria
Generic algorithm for $\min_{x \in C} f(x)$

Input: $\epsilon > 0$ is the accuracy parameter; $x^0$ is a given (relative interior) feasible point.

Step 0: $x := x^0$, $k = 0$;
Step 1: Find a search direction $s^k$ such that $\delta f(x^k, s^k) < 0$ (this should be a descending feasible direction in the constrained case);
Step 1a: If no such direction exists, STOP: optimum found.
Step 2: Line search: find $\lambda_k = \mathrm{argmin}_{\lambda} f(x^k + \lambda s^k)$;
Step 3: $x^{k+1} = x^k + \lambda_k s^k$, $k = k + 1$;
Step 4: If the stopping criteria are satisfied, STOP; else GOTO Step 1.
Algorithms: rate of convergence

Definition: Let $\alpha_1, \alpha_2, \ldots, \alpha_k, \ldots \to \alpha$ be a convergent sequence. The rate (order) of convergence is

    p^* = \sup \left\{ p : \limsup_{k \to \infty} \frac{|\alpha_{k+1} - \alpha|}{|\alpha_k - \alpha|^p} < \infty \right\}.

The larger $p^*$ is, the faster the convergence. Let

    \beta = \limsup_{k \to \infty} \frac{|\alpha_{k+1} - \alpha|}{|\alpha_k - \alpha|^{p^*}}.

The convergence is called
- linear if $p^* = 1$ and $0 < \beta < 1$;
- super-linear if $p^* = 1$ and $\beta = 0$;
- quadratic if $p^* = 2$;
- sub-linear if $p^* = 1$ and $\beta = 1$.
Examples: order of convergence

Example 1: The sequence $\alpha_k = a^k$, where $0 < a < 1$, converges linearly to zero with $\beta = a$.

Example 2: The sequence $\alpha_k = a^{2^k}$, where $0 < a < 1$, converges quadratically to zero.

Example 3: The sequence $\alpha_k = \frac{1}{k}$ converges sub-linearly to zero.

Example 4: The sequence $\alpha_k = \left(\frac{1}{k}\right)^k$ converges super-linearly to zero.
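These rates are easy to inspect numerically. The following is a small illustrative sketch (not part of the course notes; the value $a = 0.5$ and all variable names are mine) that evaluates the ratio $|\alpha_{k+1}|/|\alpha_k|^p$ for the example sequences:

```python
# Numerically inspect the convergence behaviour of the example sequences.
# All limits are zero, so the ratio |alpha_{k+1}| / |alpha_k|^p is easy to form.

a = 0.5

# Example 1: alpha_k = a^k converges linearly; the ratio is the constant a = beta.
ratios = [a ** (k + 1) / a ** k for k in range(1, 20)]
beta_linear = ratios[-1]

# Example 2: alpha_k = a^(2^k); the quadratic-rate ratio |alpha_{k+1}|/|alpha_k|^2
# is identically 1, confirming p* = 2.
quad = [a ** (2 ** (k + 1)) / (a ** (2 ** k)) ** 2 for k in range(1, 6)]

# Example 3: alpha_k = 1/k; the linear-rate ratio tends to 1 (sub-linear).
sub = [(1.0 / (k + 1)) / (1.0 / k) for k in range(1, 2000)]
```

For Example 1 the ratio stays at $a$, for Example 2 the quadratic ratio is constant, and for Example 3 the linear ratio creeps up to 1, matching the definitions above.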
Line search methods

We assume throughout that $f$ is a convex function. We are given a (feasible) search direction $s$ at a feasible point $x$, and we want to find

    \bar{\lambda} = \mathrm{argmin}_{\lambda \geq 0} f(x + \lambda s).

So we are minimizing

    \varphi(\lambda) := f(x + \lambda s)

for $\lambda \geq 0$. This is a one-dimensional problem. We deal with four different line search methods, which require different levels of information about $\varphi(\lambda)$:

- Dichotomous search and golden section search, which use only function evaluations of $\varphi$;
- Bisection, which evaluates $\varphi'(\lambda)$ ($\varphi$ has to be continuously differentiable);
- Newton's method, which evaluates both $\varphi'(\lambda)$ and $\varphi''(\lambda)$.
Line search: Dichotomous search

We assume that $\varphi$ is convex and has a minimizer on the interval $[a, b]$. Our aim is to reduce the size of this interval of uncertainty by evaluating $\varphi$ at points in $[a, b]$.

Lemma 1 (Exercise 4.7) Let $a \leq \bar{a} < \bar{b} \leq b$. If $\varphi(\bar{a}) < \varphi(\bar{b})$ then the minimum of $\varphi$ occurs in the interval $[a, \bar{b}]$; if $\varphi(\bar{a}) \geq \varphi(\bar{b})$ then the minimum of $\varphi$ occurs in the interval $[\bar{a}, b]$.

The lemma suggests a simple algorithm to reduce the interval of uncertainty.
Line search: Dichotomous search

Input: $\epsilon > 0$ is the accuracy parameter; $a_0, b_0$ are given such that $[a_0, b_0]$ contains the minimizer of $\varphi(\lambda)$; $k = 0$.

Step 1: If $b_k - a_k < \epsilon$, STOP.
Step 2: Choose $\bar{a}_k \in (a_k, b_k)$ and $\bar{b}_k \in (a_k, b_k)$ such that $\bar{a}_k < \bar{b}_k$;
Step 3a: If $\varphi(\bar{a}_k) < \varphi(\bar{b}_k)$, set $a_{k+1} = a_k$, $b_{k+1} = \bar{b}_k$;
Step 3b: If $\varphi(\bar{a}_k) \geq \varphi(\bar{b}_k)$, set $a_{k+1} = \bar{a}_k$, $b_{k+1} = b_k$;
Step 4: Set $k = k + 1$. GOTO Step 1.

We have not yet specified how to choose the values $\bar{a}_k$ and $\bar{b}_k$ in iteration $k$ (Step 2 of the algorithm). There are many ways to do this. One is to choose

    \bar{a}_k = \tfrac{1}{2}(a_k + b_k) - \delta,  \quad  \bar{b}_k = \tfrac{1}{2}(a_k + b_k) + \delta,

where $\delta > 0$ is a (very) small fixed constant. Then the interval of uncertainty is reduced by a factor $(\tfrac{1}{2} + \delta)^{t/2}$ after $t$ function evaluations (Exercise 4.8).
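As a minimal Python sketch of the algorithm above (the test function $\varphi(t) = (t-2)^2$ and the tolerance values are my own illustration, not from the notes):

```python
def dichotomous_search(phi, a, b, eps=1e-6, delta=1e-8):
    """Shrink the interval of uncertainty [a, b] of a convex phi,
    using only function evaluations (two per iteration)."""
    while b - a >= eps:
        mid = 0.5 * (a + b)
        abar, bbar = mid - delta, mid + delta
        if phi(abar) < phi(bbar):
            b = bbar          # by Lemma 1 the minimizer lies in [a, bbar]
        else:
            a = abar          # otherwise it lies in [abar, b]
    return 0.5 * (a + b)

# phi(t) = (t - 2)^2 is convex with minimizer t = 2 in [0, 5].
lam = dichotomous_search(lambda t: (t - 2.0) ** 2, 0.0, 5.0)
```

Each pass through the loop costs two evaluations of `phi` and roughly halves the interval, matching the reduction factor stated above.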
Line search: Golden section search

This is a variant of the dichotomous search method where $\delta$ is not constant but depends on $k$. In the $k$-th iteration we take $\delta = \delta_k$, where

    \delta_k = \left(\alpha - \tfrac{1}{2}\right)(b_k - a_k),  \quad  \alpha = \tfrac{1}{2}\left(\sqrt{5} - 1\right) \approx 0.618.

Here $\alpha$ is the golden ratio, i.e., the root of $\alpha^2 + \alpha - 1 = 0$ with $\alpha \in [0, 1]$. We now have

    \bar{a}_k = \tfrac{1}{2}(a_k + b_k) - \delta_k = \tfrac{1}{2}(a_k + b_k) - \left(\alpha - \tfrac{1}{2}\right)(b_k - a_k) = b_k - \alpha(b_k - a_k),
    \bar{b}_k = \tfrac{1}{2}(a_k + b_k) + \delta_k = \tfrac{1}{2}(a_k + b_k) + \left(\alpha - \tfrac{1}{2}\right)(b_k - a_k) = a_k + \alpha(b_k - a_k).

If $\varphi(\bar{a}_k) < \varphi(\bar{b}_k)$, then we set $a_{k+1} = a_k$ and $b_{k+1} = \bar{b}_k$. In that case

    \bar{b}_{k+1} = a_{k+1} + \alpha(b_{k+1} - a_{k+1}) = a_k + \alpha(\bar{b}_k - a_k) = a_k + \alpha^2(b_k - a_k)
                = a_k + (1 - \alpha)(b_k - a_k) = b_k - \alpha(b_k - a_k) = \bar{a}_k.

So in the next iteration we only need to compute $\varphi(\bar{a}_{k+1})$. Similarly, if $\varphi(\bar{a}_k) \geq \varphi(\bar{b}_k)$, then we set $a_{k+1} = \bar{a}_k$ and $b_{k+1} = b_k$, and it follows in a similar way that $\bar{a}_{k+1} = \bar{b}_k$. So in the next iteration we only need to compute $\varphi(\bar{b}_{k+1})$. In both cases one needs to evaluate $\varphi$ only once. See the course notes for graphical illustrations.

When using golden section search, each iteration reduces the interval of uncertainty by a factor $\alpha \approx 0.618$ (Exercise 4.9).
Line search: Golden section search

Suppose $a_k = 0$ and $b_k = 1$. We choose a fixed $\alpha \in (\tfrac{1}{2}, 1)$ and define

    \bar{a}_k = a_k + (1 - \alpha)(b_k - a_k) = 1 - \alpha,  \quad  \bar{b}_k = b_k - (1 - \alpha)(b_k - a_k) = \alpha.

Suppose $\varphi(1 - \alpha) < \varphi(\alpha)$. Then we set $a_{k+1} = 0$ and $b_{k+1} = \alpha$, and

    \bar{a}_{k+1} = a_{k+1} + (1 - \alpha)(b_{k+1} - a_{k+1}) = (1 - \alpha)\alpha,
    \bar{b}_{k+1} = b_{k+1} - (1 - \alpha)(b_{k+1} - a_{k+1}) = \alpha - (1 - \alpha)\alpha.

We want one of these two points to be $\bar{a}_k = 1 - \alpha$, because we already know $\varphi(1 - \alpha)$. This gives either

    (1 - \alpha)\alpha = 1 - \alpha  \quad \text{or} \quad  \alpha - \alpha(1 - \alpha) = 1 - \alpha,

or, equivalently,

    \alpha = 1  \quad \text{or} \quad  \alpha^2 + \alpha - 1 = 0.

Since $\alpha \in (\tfrac{1}{2}, 1)$, the only possible value for $\alpha$ is

    \alpha = \tfrac{1}{2}\left(\sqrt{5} - 1\right) \approx 0.618,

which is the golden ratio!
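A sketch of golden section search that exploits the reuse of one function value per iteration, as derived above (the test function and tolerance are my own illustration):

```python
import math

def golden_section(phi, a, b, eps=1e-8):
    """Golden section search for the minimizer of a convex phi on [a, b].
    Only one new evaluation of phi is needed per iteration."""
    alpha = (math.sqrt(5.0) - 1.0) / 2.0      # golden ratio, ~0.618
    abar = b - alpha * (b - a)
    bbar = a + alpha * (b - a)
    f1, f2 = phi(abar), phi(bbar)
    while b - a >= eps:
        if f1 < f2:
            # keep [a, bbar]; the old abar becomes the new bbar
            b, bbar, f2 = bbar, abar, f1
            abar = b - alpha * (b - a)
            f1 = phi(abar)
        else:
            # keep [abar, b]; the old bbar becomes the new abar
            a, abar, f1 = abar, bbar, f2
            bbar = a + alpha * (b - a)
            f2 = phi(bbar)
    return 0.5 * (a + b)

lam = golden_section(lambda t: (t - 2.0) ** 2 + 1.0, 0.0, 5.0)
```

Each iteration shrinks the interval by the factor $\alpha \approx 0.618$ while recycling one of the two interior function values, exactly as in the derivation above.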
Line search: Bisection (or Bolzano's method)

We assume that $\varphi(\lambda)$ is differentiable (and convex). We wish to find $\bar{\lambda}$ such that $\varphi'(\bar{\lambda}) = 0$.

Input: $\epsilon > 0$ is the accuracy parameter; $a_0, b_0$ are given such that $\varphi'(a_0) < 0$ and $\varphi'(b_0) > 0$; $k = 0$.

Step 1: If $b_k - a_k < \epsilon$, STOP.
Step 2: Let $\lambda = \tfrac{1}{2}(a_k + b_k)$;
Step 3a: If $\varphi'(\lambda) < 0$, set $a_{k+1} = \lambda$, $b_{k+1} = b_k$;
Step 3b: If $\varphi'(\lambda) > 0$, set $a_{k+1} = a_k$, $b_{k+1} = \lambda$;
Step 4: Set $k = k + 1$. GOTO Step 1.

The algorithm needs $\left\lceil \log_2 \frac{b_0 - a_0}{\epsilon} \right\rceil$ evaluations of $\varphi'$ (Exercise 4.11).
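A minimal Python sketch of bisection on the derivative (the example $\varphi(t) = (t-2)^2$, whose derivative $2(t-2)$ vanishes at $t = 2$, is my own illustration):

```python
def bisection(dphi, a, b, eps=1e-8):
    """Find a zero of dphi = phi' on [a, b], assuming dphi(a) < 0 < dphi(b)
    (phi convex and continuously differentiable)."""
    assert dphi(a) < 0 < dphi(b)
    while b - a >= eps:
        lam = 0.5 * (a + b)
        if dphi(lam) < 0:
            a = lam
        elif dphi(lam) > 0:
            b = lam
        else:
            return lam        # hit an exact stationary point
    return 0.5 * (a + b)

lam = bisection(lambda t: 2.0 * (t - 2.0), 0.0, 5.0)
```

Each iteration costs one evaluation of $\varphi'$ and halves the interval, which gives the $\lceil \log_2((b_0 - a_0)/\epsilon) \rceil$ bound stated above.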
Line search using the Newton-Raphson method

The quadratic approximation of $\varphi$ at $\lambda_k$ is

    q(\lambda) = \varphi(\lambda_k) + \varphi'(\lambda_k)(\lambda - \lambda_k) + \tfrac{1}{2}\varphi''(\lambda_k)(\lambda - \lambda_k)^2.

The minimum of $q$ is attained where $q'(\lambda) = 0$, which gives

    \lambda_{k+1} = \lambda_k - \frac{\varphi'(\lambda_k)}{\varphi''(\lambda_k)}.

Input: $\epsilon > 0$ is the accuracy parameter; $\lambda_0$ is the given initial point; $k = 0$.

Step 1: Let $\lambda_{k+1} = \lambda_k - \frac{\varphi'(\lambda_k)}{\varphi''(\lambda_k)}$;
Step 2: If $|\lambda_{k+1} - \lambda_k| < \epsilon$, STOP.
Step 3: $k := k + 1$, GOTO Step 1.
The Newton-Raphson method: Example 4.3

Let $\varphi(\lambda) = \lambda - \log(1 + \lambda)$. The domain of $\varphi$ is $(-1, \infty)$. The first and second derivatives of $\varphi$ are

    \varphi'(\lambda) = \frac{\lambda}{1 + \lambda},  \quad  \varphi''(\lambda) = \frac{1}{(1 + \lambda)^2}.

This makes clear that $\varphi$ is strictly convex on its domain, and minimal at $\lambda = 0$. The iterates satisfy the recursive relation

    \lambda_{k+1} = \lambda_k - \frac{\varphi'(\lambda_k)}{\varphi''(\lambda_k)} = \lambda_k - \lambda_k(1 + \lambda_k) = -\lambda_k^2.

This implies quadratic convergence if $|\lambda_0| < 1$ (see Exercise 4.12). On the other hand, Newton's method fails if $|\lambda_0| \geq 1$. For example, if $\lambda_0 = 1$ then $\lambda_1 = -1$, which is not in the domain of $\varphi$!

In general the method converges quadratically if the following conditions are met:
1. the starting point is sufficiently close to the minimizer;
2. in addition to being convex, the function $\varphi$ has a property called self-concordance, which is introduced later.
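Example 4.3 can be reproduced with a short Newton-Raphson sketch (the starting point $\lambda_0 = 0.5$ and stopping tolerance are my own choices):

```python
def newton_linesearch(dphi, ddphi, lam0, eps=1e-12, max_iter=50):
    """Newton-Raphson iteration lam_{k+1} = lam_k - phi'(lam_k)/phi''(lam_k)."""
    lam = lam0
    for _ in range(max_iter):
        nxt = lam - dphi(lam) / ddphi(lam)
        if abs(nxt - lam) < eps:
            return nxt
        lam = nxt
    return lam

# Example 4.3: phi(t) = t - log(1 + t); the iterates satisfy
# lam_{k+1} = -lam_k^2 and converge quadratically to 0 when |lam_0| < 1.
dphi = lambda t: t / (1.0 + t)
ddphi = lambda t: 1.0 / (1.0 + t) ** 2
lam = newton_linesearch(dphi, ddphi, 0.5)
```

Starting from $\lambda_0 = 0.5$ the iterates are $-0.25, -0.0625, -0.0039, \ldots$, doubling the number of correct digits each step; starting from $\lambda_0 = 1$ the code would divide by zero at $\lambda_1 = -1$, exactly the failure noted above.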
Search directions: The gradient method

Search direction: $s = -\nabla f(x^k)$, the steepest descent direction:

    \delta f(x, -\nabla f(x)) = -\nabla f(x)^T \nabla f(x) = \min_{\|s\| = \|\nabla f(x)\|} \{\nabla f(x)^T s\}.

- The (negative) gradient is orthogonal to the level curves (Exercise 4.14).
- The gradient method is not a finite algorithm, not even for linear or quadratic functions.
- Slow convergence ("zigzagging", Figure 4.4); the order of convergence is only linear.
Convergence of the gradient method

Let $f$ be continuously differentiable. Starting from the initial point $x^0$ and using exact line search, the gradient method produces a sequence $\{x^k\}$ such that $f(x^k) > f(x^{k+1})$ for $k = 0, 1, 2, \ldots$.

Assume that the level set $D = \{x : f(x) \leq f(x^0)\}$ is compact. Then any accumulation point $\bar{x}$ of the sequence $\{x^k\}$ is a stationary point of $f$ (i.e. $\nabla f(\bar{x}) = 0$). If the function $f$ is convex, then $\bar{x}$ is a global minimizer of $f$. If $f$ is not convex, then $\bar{x}$ is in general only a stationary point: it may be a local minimizer, but it can also be a saddle point.
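The linear rate and the zigzagging are easy to see on a quadratic, where exact line search has the closed form $\lambda = g^T g / (g^T A g)$. The following sketch (my own illustration; the ill-conditioned matrix is chosen to make the slow convergence visible) runs the gradient method on $f(x) = \tfrac{1}{2} x^T A x$:

```python
# Gradient method with exact line search on f(x) = 1/2 x^T A x - b^T x.
# For this quadratic the exact step length is lambda = g^T g / (g^T A g).

def grad_method(A, b, x, tol=1e-10, max_iter=10000):
    n = len(b)
    for k in range(max_iter):
        g = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        gg = sum(gi * gi for gi in g)
        if gg < tol ** 2:
            return x, k
        Ag = [sum(A[i][j] * g[j] for j in range(n)) for i in range(n)]
        lam = gg / sum(g[i] * Ag[i] for i in range(n))
        x = [x[i] - lam * g[i] for i in range(n)]
    return x, max_iter

# An ill-conditioned diagonal A (condition number 25) causes zigzagging:
# hundreds of iterations for a problem Newton's method solves in one step.
A = [[1.0, 0.0], [0.0, 25.0]]
b = [0.0, 0.0]
x, iters = grad_method(A, b, [25.0, 1.0])
```

The iterate components shrink by the factor $(\kappa - 1)/(\kappa + 1)$ per step ($\kappa = 25$ here), a purely linear rate, in line with the statements above.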
Newton's method

Newton's method is based on minimizing the second order approximation of $f$ at $x^k$:

    q(x) := f(x^k) + \nabla f(x^k)^T (x - x^k) + \tfrac{1}{2}(x - x^k)^T \nabla^2 f(x^k)(x - x^k).

We assume that $q(x)$ is strictly convex, so the Hessian $\nabla^2 f(x^k)$ is positive definite. Hence the minimum is attained where

    \nabla q(x) = \nabla f(x^k) + \nabla^2 f(x^k)(x - x^k) = 0.

We can solve $x$ from $\nabla^2 f(x^k)(x - x^k) = -\nabla f(x^k)$, which gives the next iterate:

    x^{k+1} = x^k - \left(\nabla^2 f(x^k)\right)^{-1} \nabla f(x^k).

So the Newton direction is $s^k = -\left(\nabla^2 f(x^k)\right)^{-1} \nabla f(x^k)$.

- Exact when $f$ is quadratic.
- Local quadratic convergence with full Newton steps ($\lambda = 1$, so without any line search!).
- A good starting point is essential.
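For a strictly convex quadratic the very first full Newton step lands on the minimizer. A minimal 2-dimensional sketch (my own example function; the $2 \times 2$ Hessian inverse is written out by hand to keep the code self-contained):

```python
# One full Newton step (lambda = 1): x^{k+1} = x^k - H(x^k)^{-1} grad(x^k).

def newton_step(grad, hess, x):
    g, H = grad(x), hess(x)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    # s = -H^{-1} g, with the 2x2 inverse written out explicitly
    s0 = -( H[1][1] * g[0] - H[0][1] * g[1]) / det
    s1 = -(-H[1][0] * g[0] + H[0][0] * g[1]) / det
    return [x[0] + s0, x[1] + s1]

# f(x) = 5 x1^2 + 2 x1 x2 + x2^2 is strictly convex with minimizer 0.
grad = lambda x: [10.0 * x[0] + 2.0 * x[1], 2.0 * x[0] + 2.0 * x[1]]
hess = lambda x: [[10.0, 2.0], [2.0, 2.0]]
x1 = newton_step(grad, hess, [1.0, 2.0])
```

Starting from $[1, 2]$, a single step returns the exact minimizer $[0, 0]$, illustrating that Newton's method is exact for quadratic $f$.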
Trust region method

If the function $f(x)$ is not strictly convex, or if the Hessian is ill-conditioned, then the Hessian is not (or hardly) invertible. Remedy: the trust region method.

- $\nabla^2 f(x)$ is replaced by $\nabla^2 f(x) + \alpha I$;
- $s^k = -\left(\nabla^2 f(x^k) + \alpha I\right)^{-1} \nabla f(x^k)$;
- $\alpha$ is dynamically increased and decreased in order to avoid exact line search.

If $\alpha = 0$ we have the Newton step; as $\alpha \to \infty$ we approach a (small) multiple of the negative gradient.
Newton's method for solving nonlinear equations

Find a solution of $F(x) = 0$, where $F : \mathbf{R}^n \to \mathbf{R}^n$. Linearize at $x^k$:

    F(x) \approx F(x^k) + JF(x^k)(x - x^k),

where $JF$ is the Jacobian of $F$: $JF(x)_{ij} = \frac{\partial F_i(x)}{\partial x_j}$. Solve $x^{k+1}$ from

    JF(x^k)(x^{k+1} - x^k) = -F(x^k).

Minimizing $f(x)$ is equivalent to solving $\nabla f(x) = 0$:

    \nabla^2 f(x^k)(x^{k+1} - x^k) = -\nabla f(x^k).

The Jacobian of the gradient is exactly the Hessian of the function $f(x)$, which is positive definite when $f$ is strictly convex, and we have

    x^{k+1} = x^k - \left(\nabla^2 f(x^k)\right)^{-1} \nabla f(x^k),

as we have seen above. Conclusion: Newton's optimization method is Newton's method for nonlinear equations applied to the system $\nabla f(x) = 0$.
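A sketch of the system version on a small $2 \times 2$ example (the system, the starting point and the use of Cramer's rule for the linear solve are my own illustration):

```python
# Newton's method for F(x) = 0 with the 2x2 system
#   F1 = x1^2 + x2^2 - 2,  F2 = x1 - x2,   whose solution is x1 = x2 = 1.

def newton_system(F, JF, x, eps=1e-12, max_iter=50):
    for _ in range(max_iter):
        f, J = F(x), JF(x)
        if abs(f[0]) + abs(f[1]) < eps:
            return x
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        # solve JF(x^k) d = -F(x^k) by Cramer's rule
        d0 = (-f[0] * J[1][1] + f[1] * J[0][1]) / det
        d1 = (-f[1] * J[0][0] + f[0] * J[1][0]) / det
        x = [x[0] + d0, x[1] + d1]
    return x

F = lambda x: [x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1]]
JF = lambda x: [[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]]
sol = newton_system(F, JF, [2.0, 0.5])
```

After one step the iterates satisfy $x_1 = x_2$, and the method then reduces to the scalar Newton iteration for $2t^2 = 2$, converging quadratically to $(1, 1)$.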
Methods using conjugate directions (1)

Let $A$ be an $n \times n$ symmetric positive definite matrix and $b \in \mathbf{R}^n$. We consider

    \min \left\{ q(x) = \tfrac{1}{2} x^T A x - b^T x : x \in \mathbf{R}^n \right\}.

The minimizer is uniquely determined by $\nabla q(x) = Ax - b = 0$. But to find the minimizer this way we need to invert the matrix $A$. If $n$ is large this is computationally expensive, and we want to avoid it. This can be done by using so-called conjugate search directions.

If the subsequent search directions are $s^0, \ldots, s^k$, then the iterates have the form

    x^{k+1} = x^k + \lambda_k s^k,  \quad  k = 0, 1, 2, \ldots.

If we use exact line search, then we automatically have

    \nabla q(x^{k+1})^T s^k = 0,  \quad  k = 0, 1, 2, \ldots.

By requiring a little more, namely that the search vectors $s^i$ are linearly independent and

    \nabla q(x^{k+1})^T s^i = 0,  \quad  0 \leq i \leq k,

we can guarantee termination of the algorithm in a finite number of steps. Because then $\nabla q(x^n)^T s^i = 0$ for $i < n$, whence, since the vectors $s^i$ are linearly independent, $\nabla q(x^n) = 0$. So no more than $n$ steps are required.
Methods using conjugate directions (2)

We denote $\nabla q(x^k)$ as $g^k$. Note that when using exact line search, we automatically have $(g^{j+1})^T s^j = 0$, $j = 0, 1, 2, \ldots$.

Lemma 2 Let $k \in \{1, \ldots, n\}$. The following two statements are equivalent:
(i) $(g^{j+1})^T s^i = 0$, $0 \leq i < j \leq k$;
(ii) $(s^i)^T A s^j = 0$, $0 \leq i < j \leq k$.

Proof: Since $\nabla q(x) = Ax - b$, we have

    g^{j+1} = \nabla q(x^{j+1}) = A\left(x^j + \lambda_j s^j\right) - b = \nabla q(x^j) + \lambda_j A s^j = g^j + \lambda_j A s^j,  \quad  j = 0, 1, \ldots.

Therefore, for each $i \geq 0$,

    (g^{j+1})^T s^i = (g^j)^T s^i + \lambda_j (s^i)^T A s^j,  \quad  j = 0, 1, \ldots.

The proof can now easily be completed by induction on $k$, since $\lambda_j > 0$ for each $j$.

If (ii) holds then the vectors $s^0, \ldots, s^k \in \mathbf{R}^n$ are called conjugate (or $A$-conjugate). Note that if $A = I$ then conjugate means orthogonal, and then $s^0, \ldots, s^k$ are linearly independent. This also holds for $A$-conjugate vectors, since $A$ is positive definite (Ex. 4.20). As we established before, if one uses $A$-conjugate directions to minimize the quadratic form $q$, then the minimizer of $q$ is found in at most $n$ iterations.
An easy method to generate conjugate directions

Let $s^0 = -\nabla q(x^0) = -g^0$. Then we can get subsequent conjugate directions by taking

    s^k = -g^k + \alpha_k s^{k-1},  \quad  k = 1, 2, \ldots

for suitable values of $\alpha_k$. In order to make $s^k$ and $s^{k-1}$ $A$-conjugate, we must have $(s^k)^T A s^{k-1} = 0$ for $k \geq 1$. This determines the coefficients $\alpha_k$ uniquely:

    \alpha_k = \frac{(g^k)^T A s^{k-1}}{(s^{k-1})^T A s^{k-1}},  \quad  k \geq 1.

We proceed by induction on $k$. So we assume that $s^0, \ldots, s^{k-1}$ are conjugate. Using $(g^k)^T s^{k-1} = 0$ we find

    (g^k)^T s^k = (g^k)^T \left(-g^k + \alpha_k s^{k-1}\right) = -\|g^k\|^2 < 0,

proving that $s^k$ is a descent direction, provided $g^k \neq 0$. Our choice of $\alpha_k$ implies $(s^k)^T A s^{k-1} = 0$. So it remains to show that $(s^k)^T A s^i = 0$ for $i < k - 1$. The induction hypothesis implies

    (s^k)^T A s^i = \left(-g^k + \alpha_k s^{k-1}\right)^T A s^i = -(g^k)^T A s^i.

Since $g^i = \nabla q(x^i) = A x^i - b$ and $x^{i+1} = x^i + \lambda_i s^i$, we have

    \lambda_i A s^i = g^{i+1} - g^i = \left(\alpha_{i+1} s^i - s^{i+1}\right) - \left(\alpha_i s^{i-1} - s^i\right).

Hence, due to Lemma 2(i),

    \lambda_i (g^k)^T A s^i = (g^k)^T \left(\alpha_{i+1} s^i - s^{i+1} - \alpha_i s^{i-1} + s^i\right) = 0.

This proves that $s^0, \ldots, s^k$ are conjugate, provided $g^k \neq 0$ (otherwise $x^k$ is optimal!).
The case of nonquadratic functions

In the case where $f$ is (convex) quadratic, finite termination is guaranteed if

    \alpha_k = \frac{(g^k)^T A s^{k-1}}{(s^{k-1})^T A s^{k-1}} = \frac{(g^k)^T \left(g^k - g^{k-1}\right)}{(s^{k-1})^T \left(g^k - g^{k-1}\right)} = \frac{(g^k)^T \left(g^k - g^{k-1}\right)}{\|g^{k-1}\|^2} = \frac{\|g^k\|^2}{\|g^{k-1}\|^2},  \quad  k \geq 1.

Here we used that $g^k \perp g^{k-1}$, as follows from $g^{k-1} = -s^{k-1} + \alpha_{k-1} s^{k-2}$ and $(g^k)^T s^{k-2} = 0$, by Lemma 2(i), and $(g^k)^T s^{k-1} = 0$ by the choice of $\lambda_{k-1}$. The algorithm is:

Step 0. Let $s^0 = -\nabla f(x^0)$ and $x^1 := \mathrm{argmin}_\lambda f(x^0 + \lambda s^0)$.
Step k. Set $s^k = -\nabla f(x^k) + \alpha_k s^{k-1}$ and $x^{k+1} := \mathrm{argmin}_\lambda f(x^k + \lambda s^k)$.

If $f$ is not quadratic there is no guarantee that the method stops after a finite number of steps. Several choices for $\alpha_k$ have been proposed (which are equivalent in the quadratic case):

- Hestenes-Stiefel (1952): $\alpha_k = \frac{(g^k)^T (g^k - g^{k-1})}{(s^{k-1})^T (g^k - g^{k-1})}$.
- Fletcher-Reeves (1964): $\alpha_k = \frac{\|g^k\|^2}{\|g^{k-1}\|^2}$.
- Polak-Ribière (1969): $\alpha_k = \frac{(g^k)^T (g^k - g^{k-1})}{\|g^{k-1}\|^2}$.
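The Fletcher-Reeves variant can be sketched compactly for the quadratic case, where exact line search has the closed form $\lambda = -g^T s / (s^T A s)$ (the $3 \times 3$ test matrix is my own illustration):

```python
# Fletcher-Reeves conjugate gradient on q(x) = 1/2 x^T A x - b^T x,
# with the exact step length lambda = -g^T s / (s^T A s).
# Finite termination in at most n steps for a quadratic.

def fletcher_reeves(A, b, x, tol=1e-12):
    n = len(b)
    mat = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    g = [gi - bi for gi, bi in zip(mat(x), b)]
    s = [-gi for gi in g]                            # s^0 = -g^0
    for k in range(n):
        if dot(g, g) < tol:
            return x, k
        As = mat(s)
        lam = -dot(g, s) / dot(s, As)                # exact line search
        x = [x[i] + lam * s[i] for i in range(n)]
        g_new = [g[i] + lam * As[i] for i in range(n)]
        alpha = dot(g_new, g_new) / dot(g, g)        # Fletcher-Reeves alpha_k
        s = [-g_new[i] + alpha * s[i] for i in range(n)]
        g = g_new
    return x, n

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, steps = fletcher_reeves(A, b, [0.0, 0.0, 0.0])
```

For this $3 \times 3$ positive definite system the residual $Ax - b$ is reduced to machine precision within $n = 3$ steps, in line with the finite-termination claim.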
Solving a linear system with the conjugate gradient method

Assume we want to solve $Ax = b$ with $A$ positive definite. The solution is precisely the minimizer of $q(x) = \tfrac{1}{2} x^T A x - b^T x$, and hence the system can be solved by the conjugate gradient method.

If $A$ is not positive definite, but nonsingular, then $A^T A$ is positive definite. Hence we can solve $Ax = b$ by minimizing

    q(x) = \|Ax - b\|^2 = x^T A^T A x - 2 b^T A x + b^T b.
Powell's method

We now deal with a conjugate direction method using only function values (no gradients!).

Input: a starting point $x^0$ and a set of linearly independent vectors $t^1, \ldots, t^n$.
Initialization: set $t^{(1,i)} = t^i$, $i = 1, \ldots, n$.
For $k = 1, \ldots, n$ do (cycle $k$):
  Let $z^{(k,1)} = x^{k-1}$ and $z^{(k,i+1)} := \mathrm{argmin}_\lambda q\left(z^{(k,i)} + \lambda t^{(k,i)}\right)$, $i = 1, \ldots, n$.
  Let $x^k := \mathrm{argmin}_\lambda q\left(z^{(k,n+1)} + \lambda s^k\right)$, where $s^k := z^{(k,n+1)} - x^{k-1}$.
  Let $t^{(k+1,i)} = t^{(k,i+1)}$, $i = 1, \ldots, n-1$, and $t^{(k+1,n)} := s^k$.

The algorithm consists of $n$ cycles and terminates at the minimizer of $q(x)$. Each cycle consists of $n + 1$ line searches and yields a search direction $s^k$; the $k$-th direction $s^k$ is constructed at the end of cycle $k$. The search directions $s^1, \ldots, s^n$ are conjugate (for a proof, see the course notes). Note that only function values are evaluated (no derivatives are used, unless the line searches use them). The number of line searches is $n(n + 1)$. Therefore, Powell's method is attractive for minimizing black box functions where gradient and Hessian information is not available (or too expensive to compute).
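A derivative-free sketch of the scheme above, using a golden-section line search as the inner one-dimensional minimizer (the search bracket $[-10, 10]$, the choice of coordinate vectors as the initial $t^i$, and the test function are my own assumptions, not prescribed by the slides):

```python
import math

def line_min(f, x, d, a=-10.0, b=10.0, eps=1e-10):
    """Derivative-free golden-section search for min_t f(x + t d)."""
    alpha = (math.sqrt(5.0) - 1.0) / 2.0
    phi = lambda t: f([xi + t * di for xi, di in zip(x, d)])
    while b - a >= eps:
        t1, t2 = b - alpha * (b - a), a + alpha * (b - a)
        if phi(t1) < phi(t2):
            b = t2
        else:
            a = t1
    t = 0.5 * (a + b)
    return [xi + t * di for xi, di in zip(x, d)]

def powell(f, x0):
    n = len(x0)
    dirs = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    x = x0
    for _ in range(n):                        # n cycles
        z = x
        for d in dirs:                        # n line searches in the cycle
            z = line_min(f, z, d)
        s = [zi - xi for zi, xi in zip(z, x)] # new direction s^k
        x = line_min(f, z, s)                 # (n+1)-th line search, along s^k
        dirs = dirs[1:] + [s]                 # drop the oldest direction
    return x

f = lambda x: 5.0 * x[0] ** 2 + 2.0 * x[0] * x[1] + x[1] ** 2 + 7.0
x = powell(f, [1.0, 2.0])
```

On this quadratic (the one from the illustration slide, with minimizer at the origin and optimal value 7), the $n = 2$ cycles of $n + 1$ line searches reach the minimizer without evaluating any derivative.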
Illustration of Powell's method

[Figure: iterates of Powell's algorithm for $f(x) = 5x_1^2 + 2x_1x_2 + x_2^2 + 7$, starting at $x^0 = [1; 2]$ with initial directions $t^1, t^2$; after two cycles the iterate $x^2$ is optimal.]
Quasi-Newton methods

Recall that the Newton direction at iteration $k$ is given by

    s^k = -\left[\nabla^2 f(x^k)\right]^{-1} \nabla f(x^k) = -\left[\nabla^2 f(x^k)\right]^{-1} g^k.

Quasi-Newton methods use a positive definite approximation $H_k$ to $\left[\nabla^2 f(x^k)\right]^{-1}$. The approximation $H_k$ is updated at each iteration, say $H_{k+1} = H_k + D_k$, where $D_k$ denotes the update. The algorithm has the following generic form.

Step 0. Let $x^0$ be given and set $H_0 = I$.
Step k. $s^k = -H_k g^k$ and $x^{k+1} = \mathrm{argmin}_\lambda f(x^k + \lambda s^k) = x^k + \lambda_k s^k$;
        $H_{k+1} = H_k + D_k$ and $k = k + 1$.

Defining

    y^k := g^{k+1} - g^k,  \quad  \sigma^k := x^{k+1} - x^k = \lambda_k s^k,

we require for each $k$ that
I.   $H_{k+1}$ is symmetric positive definite;
II.  $\sigma^k = H_{k+1} y^k$ (quasi-Newton property);
III. $\sigma^i = H_{k+1} y^i$, $i = 0, \ldots, k-1$ (hereditary property).
The quasi-newton property and hereditary property Let A be an n n symmetric PD matrix, and let f be the strictly convex quadratic function Then g k = q(x k ) = Ax k b, and hence q(x) = 1 2 xt Ax b T x. y k = g k+1 g k = q(x k+1 ) q(x k ) = A ( x k+1 x k) = Aσ k, whence σ k = A 1 y k. Recall that each H k should be a good approximation of the inverse of 2 q(x k ), which is A 1. Therefore we require that σ k = H k+1 y k, which is the quasi-newton property, and even more, that our approximation H k satisfies σ i = H k+1 y i, i = 0,..., k. which is the hereditary property. Note that the hereditary property implies σ i = H n y i, i = 0,..., n 1. If the σ i (i = 0,..., n 1) are linearly independent, this implies H n = A 1. But then the n + 1-th iteration is simply the Newton step at x n. Since q is quadratic, this yields the minimizer of q, and hence we find the minimum of q no more than n + 1 iterations. Optimization Group 27
A generic update D_k (1)

First consider the case where $D_k$ is a (possibly indefinite) matrix of rank 2, whence $D_k = \alpha u u^T + \beta v v^T$ for suitable vectors $u$ and $v$ and scalars $\alpha, \beta$. Then the quasi-Newton property implies

    H_{k+1} y^k = H_k y^k + \alpha u u^T y^k + \beta v v^T y^k = \sigma^k.

Davidon, Fletcher and Powell (1963) recognized that this condition is satisfied if

    u = \sigma^k = \lambda_k s^k,  \quad  \alpha = \frac{1}{u^T y^k},  \quad  v = H_k y^k,  \quad  \beta = -\frac{1}{v^T y^k} = -\frac{1}{(y^k)^T H_k y^k},

which yields the so-called DFP update:

    D_k = \frac{\lambda_k s^k (s^k)^T}{(s^k)^T y^k} - \frac{H_k y^k (y^k)^T H_k}{(y^k)^T H_k y^k}.

In the following we consider a slightly more general update, namely

    D_k = \alpha u u^T + \beta v v^T + \mu \left(u v^T + v u^T\right) = \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} \alpha & \mu \\ \mu & \beta \end{bmatrix} \begin{bmatrix} u^T \\ v^T \end{bmatrix}.

Exercise A: Show that $D_k$ has rank at most 2.
A generic update D_k (2)

    D_k = \alpha u u^T + \beta v v^T + \mu \left(u v^T + v u^T\right) = \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} \alpha & \mu \\ \mu & \beta \end{bmatrix} \begin{bmatrix} u^T \\ v^T \end{bmatrix},  \quad  u = \sigma^k, \; v = H_k y^k.

Lemma 3 If the above update $D_k$ satisfies the quasi-Newton property, then the subsequent directions are conjugate. (So a quasi-Newton method is a conjugate gradient method!)

Proof: We show by induction on $k$ that

    H_k y^i = \sigma^i = \lambda_i s^i,  \quad  (s^k)^T A s^i = (g^k)^T s^i = 0,  \quad  0 \leq i < k.    (1)

This trivially holds if $k = 0$ (the condition is void). Assuming the quasi-Newton property and (1) for $k \geq 0$, and using $y^i = A \sigma^i$ and $\sigma^i = \lambda_i s^i$ for all $i$, we write for $i < k$:

    (y^k)^T H_k y^i = (y^k)^T \sigma^i = (\sigma^k)^T A \sigma^i = \lambda_k \lambda_i (s^k)^T A s^i = 0.

Also $(\sigma^k)^T y^i = (\sigma^k)^T A \sigma^i = 0$. Hence we obtain, for all $i < k$,

    D_k y^i = \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} \alpha & \mu \\ \mu & \beta \end{bmatrix} \begin{bmatrix} u^T y^i \\ v^T y^i \end{bmatrix} = \begin{bmatrix} \sigma^k & H_k y^k \end{bmatrix} \begin{bmatrix} \alpha & \mu \\ \mu & \beta \end{bmatrix} \begin{bmatrix} (\sigma^k)^T y^i \\ (y^k)^T H_k y^i \end{bmatrix} = 0,

whence $H_{k+1} y^i = H_k y^i + D_k y^i = H_k y^i = \sigma^i$. Together with the quasi-Newton property this gives $H_{k+1} y^i = \sigma^i$ for $0 \leq i < k + 1$. Because $\lambda_i \neq 0$, $s^{k+1} = -H_{k+1} g^{k+1}$ and $H_{k+1} y^i = \sigma^i$, we observe next that

    \lambda_i (s^{k+1})^T A s^i = (s^{k+1})^T A \sigma^i = (s^{k+1})^T y^i = -(g^{k+1})^T H_{k+1} y^i = -(g^{k+1})^T \sigma^i.

Hence it suffices for the rest of the proof that $(g^{k+1})^T s^i = 0$ for $0 \leq i < k + 1$. This certainly holds if $i = k$, because we use exact line search. For $i < k$ we use the induction hypothesis again, and $g^{k+1} = A(x^k + \lambda_k s^k) - b = g^k + \lambda_k A s^k$, which gives

    (g^{k+1})^T s^i = (g^k)^T s^i + \lambda_k (s^k)^T A s^i = 0.

This completes the proof.
The Broyden family of updates (1)

    D_k = \alpha u u^T + \beta v v^T + \mu \left(u v^T + v u^T\right),  \quad  u = \sigma^k, \; v = H_k y^k.

We now determine conditions on the parameters $\alpha$, $\beta$ and $\mu$ that guarantee the quasi-Newton property. This property ($\sigma^k = H_{k+1} y^k = H_k y^k + D_k y^k$) amounts to

    u = v + \alpha u u^T y^k + \beta v v^T y^k + \mu \left(u v^T + v u^T\right) y^k = v \left(1 + \beta v^T y^k + \mu u^T y^k\right) + u \left(\alpha u^T y^k + \mu v^T y^k\right).

To satisfy this condition it suffices if

    \alpha u^T y^k + \mu v^T y^k = 1,  \quad  1 + \beta v^T y^k + \mu u^T y^k = 0.

This linear system has multiple solutions. Introducing the parameter $\rho = -\mu u^T y^k \in \mathbf{R}$, the solutions are

    \alpha = \frac{1}{u^T y^k} \left( 1 + \rho \frac{v^T y^k}{u^T y^k} \right),  \quad  \beta = \frac{\rho - 1}{v^T y^k},  \quad  \rho \in \mathbf{R}.

Since $v^T y^k = (y^k)^T H_k y^k > 0$ and

    u^T y^k = (\sigma^k)^T \left(g^{k+1} - g^k\right) = \lambda_k (s^k)^T \left(g^{k+1} - g^k\right) = -\lambda_k (s^k)^T g^k = \lambda_k (g^k)^T H_k g^k > 0,

the above expressions are well defined.
The Broyden family of updates (2)

Substituting the values of $u$, $v$, $\alpha$ and $\beta$ we find

    D_k = \frac{\lambda_k s^k (s^k)^T}{(s^k)^T y^k} - \frac{H_k y^k (y^k)^T H_k}{(y^k)^T H_k y^k} + \rho w w^T,  \quad  w = \sqrt{(y^k)^T H_k y^k} \left( \frac{s^k}{(s^k)^T y^k} - \frac{H_k y^k}{(y^k)^T H_k y^k} \right).

This class of updates is known as the Broyden family. Note that if $\rho = 0$, we get the DFP update that we have seen before.

Lemma 4 If $\rho \geq 0$ then $H_k$ is positive definite for each $k \geq 0$.

Proof: It suffices if

    H_k - \frac{H_k y^k (y^k)^T H_k}{(y^k)^T H_k y^k}

is positive semidefinite, since the other two terms forming $H_{k+1}$ are positive semidefinite. This, however, is an (almost) immediate consequence of the Cauchy-Schwarz inequality.

The choice $\rho = 1$ was proposed by Broyden, Fletcher, Goldfarb and Shanno (1970); the resulting BFGS update is the most popular one in practice.
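The whole Broyden family can be sketched in a few lines. The snippet below (my own illustration) runs the generic quasi-Newton algorithm on a strictly convex quadratic; since the objective is quadratic, the closed-form step $\lambda = -g^T s / (s^T A s)$ stands in for the exact line search. Setting `rho=0` gives DFP and `rho=1` gives BFGS:

```python
# Broyden-family quasi-Newton method on q(x) = 1/2 x^T A x - b^T x.
# rho = 0 is the DFP update, rho = 1 the BFGS update.

def broyden_family(A, b, x, rho=1.0, max_iter=20, tol=1e-12):
    n = len(b)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    mat = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # H_0 = I
    g = [gi - bi for gi, bi in zip(mat(A, x), b)]
    for k in range(max_iter):
        if dot(g, g) < tol:
            return x, k
        s = [-si for si in mat(H, g)]                 # s^k = -H_k g^k
        As = mat(A, s)
        lam = -dot(g, s) / dot(s, As)                 # exact line search (quadratic)
        x = [x[i] + lam * s[i] for i in range(n)]
        g_new = [g[i] + lam * As[i] for i in range(n)]
        sigma = [lam * si for si in s]                # sigma^k
        y = [g_new[i] - g[i] for i in range(n)]       # y^k
        Hy = mat(H, y)
        sy, yHy = dot(sigma, y), dot(y, Hy)
        w = [sigma[i] / sy - Hy[i] / yHy for i in range(n)]
        for i in range(n):                            # H_{k+1} = H_k + D_k
            for j in range(n):
                H[i][j] += (sigma[i] * sigma[j] / sy
                            - Hy[i] * Hy[j] / yHy
                            + rho * yHy * w[i] * w[j])
        g = g_new
    return x, max_iter

A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x, iters = broyden_family(A, b, [5.0, -3.0], rho=1.0)
```

In agreement with the hereditary property, on this $2 \times 2$ quadratic the method terminates at the minimizer after $n = 2$ steps.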
Stopping criteria

The stopping criterion is a relatively simple but essential part of all algorithms. If both primal and dual feasible solutions are generated, then we use the duality gap

    primal obj. value - dual obj. value

as a criterion: we stop the algorithm if the duality gap is smaller than some prescribed accuracy parameter $\epsilon$.

In unconstrained optimization one often uses a primal algorithm, and then there is no such obvious measure for the distance to the optimum. We then stop if there is no sufficient improvement in the objective value, if subsequent iterates stay too close to each other, or if the length of the gradient or the length of the Newton step (in an appropriate norm) is small. All these criteria can be scaled relative to some characteristic number describing the dimensions of the problem. For example, the relative improvement in the objective value at two subsequent iterates $x^k, x^{k+1}$ is usually measured by

    \frac{f(x^k) - f(x^{k+1})}{1 + |f(x^k)|},

and we may stop if it is smaller than a prescribed accuracy parameter $\epsilon$.