Chapter 4. Unconstrained optimization
Version: 28-10-2012
Material: for details see Chapter 11 in [FKS] (pp. 251-276). A reference, e.g. L.11.2, refers to the corresponding Lemma in the book [FKS].
PDF file of the book [FKS]: Faigle/Kern/Still, Algorithmic Principles of Mathematical Programming,
at: http://wwwhome.math.utwente.nl/~stillgj/priv/
4.1 Introduction
We consider the nonlinear minimization problem (with $f \in C^1$ or $f \in C^2$):
(P) $\min f(x)$, $x \in \mathbb{R}^n$
Recall: Usually in (nonconvex) unconstrained optimization we try to find a local minimizer. Global minimization is much more difficult.
Theoretical method (based on optimality conditions):
Find a point $\overline{x}$ satisfying $\nabla f(\overline{x}) = 0$ (critical point).
Check whether $\nabla^2 f(\overline{x}) \succeq 0$.
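As a small numerical illustration of this check (a minimal sketch; the function and the candidate point are made-up examples, not from [FKS]):

import numpy as np

# Made-up example: f(x) = (x1 - 1)^2 + 2*x2^2, with critical point (1, 0)
def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])

def hess(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

x_bar = np.array([1.0, 0.0])                          # candidate critical point
print(np.allclose(grad(x_bar), 0.0))                  # True: gradient vanishes
print(np.all(np.linalg.eigvalsh(hess(x_bar)) >= 0))   # True: Hessian is PSD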
CONCEPTUAL ALGORITHM:
Choose $x_0 \in \mathbb{R}^n$. Iterate:
step k: Given $x_k \in \mathbb{R}^n$, find a new point $x_{k+1}$ with $f(x_{k+1}) < f(x_k)$.
We hope that: $x_k \to \overline{x}$ with $\overline{x}$ a local minimizer.
Def. Let $x_k \to \overline{x}$ for $k \to \infty$. The sequence $(x_k)$ has:
linear convergence, if with a constant $0 \le C < 1$ and some $K \in \mathbb{N}$:
$\|x_{k+1} - \overline{x}\| \le C\, \|x_k - \overline{x}\|$, $k \ge K$.  ($C$ is the convergence factor.)
quadratic convergence, if with a constant $c \ge 0$:
$\|x_{k+1} - \overline{x}\| \le c\, \|x_k - \overline{x}\|^2$, $k \in \mathbb{N}$.
superlinear convergence, if $\lim_{k \to \infty} \dfrac{\|x_{k+1} - \overline{x}\|}{\|x_k - \overline{x}\|} = 0$.
4.2 General descent method ($f \in C^1$)
Def. A vector $d_k \in \mathbb{R}^n$ is called a descent direction for $f$ at $x_k$ if
$\nabla f(x_k)^T d_k < 0$   (*)
Rem. If (*) holds then: $f(x_k + t d_k) < f(x_k)$ for $t > 0$ small.
Abbreviation: $g(x) := \nabla f(x)$, $g_k := g(x_k)$
Conceptual DESCENT METHOD:
Choose a starting point $x_0 \in \mathbb{R}^n$ and $\varepsilon > 0$. Iterate:
step k: Given $x_k \in \mathbb{R}^n$, proceed as follows:
If $\|g(x_k)\| < \varepsilon$, stop with $\overline{x} \approx x_k$.
Choose a descent direction $d_k$ at $x_k$: $g_k^T d_k < 0$.
Find a solution $t_k$ of the (one-dimensional) minimization problem
$\min_{t \ge 0} f(x_k + t d_k)$   (**)
and put $x_{k+1} = x_k + t_k d_k$.
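A minimal Python sketch of this conceptual method (not from [FKS]; the name descent_method, the step bound t_max, and the use of scipy's bounded scalar minimizer for the line-minimization step (**) are illustrative choices):

import numpy as np
from scipy.optimize import minimize_scalar

def descent_method(f, grad, x0, direction, eps=1e-6, max_iter=500, t_max=10.0):
    """Conceptual descent method: stop when ||g(x_k)|| < eps.
    `direction(x, g)` must return a descent direction d with g^T d < 0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        d = direction(x, g)
        # one-dimensional line minimization  min_{t >= 0} f(x + t d)   (**)
        t = minimize_scalar(lambda t: f(x + t * d),
                            bounds=(0.0, t_max), method='bounded').x
        x = x + t * d
    return x

# Steepest descent (next slide) uses d_k = -g_k:
# x_bar = descent_method(f, grad, x0, direction=lambda x, g: -g)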
Remark. By this descent method, minimization in $\mathbb{R}^n$ is reduced to (line) minimization in $\mathbb{R}$ (in each step $k$).
Steepest descent method: use in the descent method as descent direction (see Ex.11.7):
$d_k = -\nabla f(x_k)$
Ex.11.7 Assuming $\nabla f(x_k) \ne 0$, show that $d_k = -[\nabla f(x_k)] / \|\nabla f(x_k)\|$ solves the problem:
$\min_{d \in \mathbb{R}^n} \nabla f(x_k)^T d$  s.t. $\|d\| = 1$
Convergence behavior:
L.11.3 In the line-minimization step (**) we have $\nabla f(x_{k+1})^T d_k = 0$.
For the steepest descent method this means: $d_{k+1}^T d_k = 0$ (zigzagging).
Th.11.1 Let $f \in C^1$. Apply the steepest descent method. If the iterates $x_k$ converge, i.e., $x_k \to \overline{x}$, then $\nabla f(\overline{x}) = 0$.
Ex.11.8 Given the quadratic function on $\mathbb{R}^n$, $q(x) = x^T A x + b^T x$, $A \succ 0$. Show that the minimizer $t_k$ of $\min_{t \ge 0} q(x_k + t d_k)$ is given by
$t_k = -\dfrac{g_k^T d_k}{2\, d_k^T A d_k}$.
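The computation behind Ex.11.8, sketched (assuming, as usual, that $A$ is symmetric, so that $g(x) = \nabla q(x) = 2Ax + b$):
\[
h(t) := q(x_k + t d_k) = q(x_k) + t\, g_k^T d_k + t^2\, d_k^T A d_k,
\]
\[
h'(t_k) = g_k^T d_k + 2 t_k\, d_k^T A d_k = 0
\;\Longrightarrow\;
t_k = -\frac{g_k^T d_k}{2\, d_k^T A d_k} > 0,
\]
the positivity holding since $g_k^T d_k < 0$ (descent direction) and $A \succ 0$.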
Speed of convergence: The next example shows that in general (even for the minimization of quadratic functions), the steepest descent method cannot be expected to converge better than linearly.
Ex.11.9 Apply the steepest descent method to
$q(x) = x^T \begin{pmatrix} 1 & 0 \\ 0 & r \end{pmatrix} x$, $r \ge 1$.
Then with $x_0 = (r, 1)$ it follows
$x_k = \left(\dfrac{r-1}{r+1}\right)^k (r, (-1)^k)$.
(Linear convergence to $\overline{x} = 0$ with factor $C = (r-1)/(r+1)$.)
HINT: Make use of [FKS, L.11.8] and apply induction w.r.t. $k$.
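A quick numerical check of Ex.11.9 (a sketch; the value r = 5 is an arbitrary choice):

import numpy as np

r = 5.0
A = np.diag([1.0, r])                    # q(x) = x^T A x, so g(x) = 2 A x
x = np.array([r, 1.0])
C = (r - 1.0) / (r + 1.0)

for k in range(1, 6):
    g = 2.0 * A @ x
    d = -g                               # steepest descent direction
    t = -(g @ d) / (2.0 * d @ A @ d)     # exact step from Ex.11.8
    x = x + t * d
    predicted = C**k * np.array([r, (-1.0)**k])
    print(k, np.allclose(x, predicted))  # True: matches the closed form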
4.3 Method of conjugate directions
Aim: Find an algorithm which (at least for quadratic functions) has better convergence than steepest descent.
4.3.1 Case: $f(x) = q(x) := \frac{1}{2} x^T A x + b^T x$, $A \succ 0$ (p.d.)
Idea. Try to generate $d_k$'s such that (not only $\nabla q(x_{k+1})^T d_k = 0$ but)
$\nabla q(x_{k+1})^T d_j = 0$, $0 \le j \le k$.
Then, after $n$ steps we have $\nabla q(x_n)^T d_j = 0$, $0 \le j \le n-1$, and (if the $d_j$'s are lin. indep.) $\nabla q(x_n) = 0$. So $x_n = -A^{-1} b$ is the minimizer of $q$.
L.11.4 Apply the descent method to $q(x)$. The following are equivalent:
(i) $g_{j+1}^T d_i = 0$ for all $0 \le i \le j \le k$;
(ii) $d_j^T A d_i = 0$ for all $0 \le i < j \le k$.
Definition. Vectors $d_0, \ldots, d_{n-1} \ne 0$ are called A-conjugate (or A-orthogonal) if:
$d_j^T A d_i = 0$ for all $i \ne j$.
Ex. A collection of A-conjugate vectors $d_0, \ldots, d_{n-1} \ne 0$ in $\mathbb{R}^n$ is linearly independent.
Construction of A-conjugate $d_k$'s. To construct vectors satisfying the conditions in L.11.4, simply try:
$d_k = -g_k + \alpha_k d_{k-1}$
Then $d_k^T A d_{k-1} = 0$ implies $\alpha_k = \dfrac{g_k^T A d_{k-1}}{d_{k-1}^T A d_{k-1}}$.
Th.11.3 Apply the descent method to $q(x)$ with
$d_k = -g_k + \alpha_k d_{k-1}$, $\alpha_k = \dfrac{g_k^T A d_{k-1}}{d_{k-1}^T A d_{k-1}}$.
Then the $d_k$'s are A-conjugate. In particular, the algorithm stops after (at most) $n$ steps with the unique minimizer $\overline{x} = -A^{-1} b$ of $q$.
Conjugate Gradient Method (CG)
INIT: Choose $x_0 \in \mathbb{R}^n$, $\varepsilon > 0$; set $d_0 := -g_0$.
ITER: WHILE $\|g_k\| \ge \varepsilon$ DO
BEGIN
Determine a solution $t_k$ of the problem $\min_{t \ge 0} f(x_k + t d_k)$   (*)
Set $x_{k+1} = x_k + t_k d_k$.
Set $d_{k+1} = -g_{k+1} + \alpha_{k+1} d_k$.
END
Ex.11.10 Under the assumptions of Th.11.3, show that the iteration point $x_{k+1}$ is the (global) minimizer of the quadratic function $q$ on the affine subspace
$S_k = \{x_0 + \gamma_0 d_0 + \cdots + \gamma_k d_k \mid \gamma_0, \ldots, \gamma_k \in \mathbb{R}\}$.
4.3.2 Case: non-quadratic functions $f(x)$
Note that for a quadratic function $f = q$ we have:
$\alpha_{k+1} = \dfrac{g_{k+1}^T A d_k}{d_k^T A d_k} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{d_k^T (g_{k+1} - g_k)} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{\|g_k\|^2} = \dfrac{\|g_{k+1}\|^2}{\|g_k\|^2}$
So, for non-quadratic $f(x)$ in the CG-algorithm we can use $d_{k+1} = -g_{k+1} + \alpha_{k+1} d_k$ with:
Hestenes-Stiefel (1952): $\alpha_{k+1} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{d_k^T (g_{k+1} - g_k)}$
Fletcher-Reeves (1964): $\alpha_{k+1} = \dfrac{\|g_{k+1}\|^2}{\|g_k\|^2}$
Polak-Ribière (1969): $\alpha_{k+1} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{\|g_k\|^2}$
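A sketch of the resulting nonlinear CG method (illustrative only; shown with the Fletcher-Reeves and Polak-Ribière rules, and with a bounded scalar minimizer standing in for the exact line search):

import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad, x0, variant='FR', eps=1e-6, max_iter=1000, t_max=10.0):
    """Nonlinear CG: d_{k+1} = -g_{k+1} + alpha_{k+1} d_k."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        t = minimize_scalar(lambda t: f(x + t * d),
                            bounds=(0.0, t_max), method='bounded').x
        x = x + t * d
        g_new = grad(x)
        if variant == 'FR':                              # Fletcher-Reeves
            alpha = (g_new @ g_new) / (g @ g)
        else:                                            # Polak-Ribiere
            alpha = (g_new @ (g_new - g)) / (g @ g)
        d = -g_new + alpha * d
        g = g_new
    return x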
Application to sparse systems $Ax = b$, $A \succ 0$
Def. $A = (a_{ij})$ is sparse if less than $\alpha\%$ of the $a_{ij}$'s are $\ne 0$, with (say) $\alpha \approx 5$.
CG-method: apply the CG-method to $\min \frac{1}{2} x^T A x - b^T x$ with solution $\overline{x} = A^{-1} b$.
CG Method for sparse linear systems $Ax = b$, $A \succ 0$
INIT: Choose $x_0 \in \mathbb{R}^n$ and $\varepsilon > 0$, and set $d_0 = -g_0$.
ITER: WHILE $\|g_k\| \ge \varepsilon$ DO
BEGIN
Set $x_{k+1} = x_k + t_k d_k$, $g_{k+1} = g_k + t_k A d_k$ with $t_k = -\dfrac{g_k^T d_k}{d_k^T A d_k}$
Set $d_{k+1} = -g_{k+1} + \alpha_{k+1} d_k$ with $\alpha_{k+1} = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$
END
Rem. Complexity: $\approx \frac{\alpha}{100} n^2$ flops (floating point operations) per ITER.
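A minimal dense-matrix sketch of this loop (for genuinely sparse $A$ one would store $A$ in a sparse format, e.g. scipy.sparse, so that the product $A d_k$ costs only $\approx \frac{\alpha}{100} n^2$ flops; the test system below is made up):

import numpy as np

def cg(A, b, x0=None, eps=1e-10, max_iter=None):
    """CG for A x = b, A symmetric pos. def. (minimizes 1/2 x^T A x - b^T x)."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    g = A @ x - b                             # gradient g(x) = A x - b
    d = -g
    for _ in range(max_iter or n):
        if np.linalg.norm(g) < eps:
            break
        Ad = A @ d
        t = -(g @ d) / (d @ Ad)
        x = x + t * d
        g_new = g + t * Ad
        alpha = (g_new @ g_new) / (g @ g)
        d = -g_new + alpha * d
        g = g_new
    return x

# Made-up example: random SPD system
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50.0 * np.eye(50)               # SPD
b = rng.standard_normal(50)
print(np.allclose(cg(A, b), np.linalg.solve(A, b)))  # True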
4.4 Line minimization
In the general descent method (see Ch.4.2) we repeatedly have to solve:
$\min_{t \ge 0} h(t)$ with $h(t) = f(x_k + t d_k)$, where $h'(0) < 0$.
This can be done by:
exact line minimization methods from numerical analysis, e.g., bisection, golden section, Newton-, secant method (see Ch.4.3, Ch.11.4.1)
or more efficiently by inexact line search: Goldstein-, Goldstein-Wolfe test (see Ch.11.4.2)
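As an example of one of the exact methods named above, a golden-section search sketch (the bracket [a, b] and the tolerance are arbitrary choices; it assumes h is unimodal on the bracket):

import math

def golden_section(h, a=0.0, b=1.0, tol=1e-8):
    """Minimize a unimodal function h on [a, b] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # 1/phi = 0.618...
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    hc, hd = h(c), h(d)
    while b - a > tol:
        if hc < hd:                          # minimum lies in [a, d]
            b, d, hd = d, c, hc
            c = b - inv_phi * (b - a)
            hc = h(c)
        else:                                # minimum lies in [c, b]
            a, c, hc = c, d, hd
            d = a + inv_phi * (b - a)
            hd = h(d)
    return 0.5 * (a + b)

# e.g. for the line search:  t_k = golden_section(lambda t: f(x_k + t * d_k), 0.0, t_max)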
4.5 Newton's method
General remark: Newton's method for solving systems of nonlinear equations is one of the most important tools of applied mathematics.
Newton's iteration for solving $F(x) = 0$, $F : \mathbb{R}^n \to \mathbb{R}^n$, $F \in C^1$, a system of $n$ equations in $n$ unknowns $x = (x_1, \ldots, x_n)$: start with some $x_0$ and iterate
$x_{k+1} = x_k - [\nabla F(x_k)]^{-1} F(x_k)$, $k = 0, 1, \ldots$
Th.11.4 (local convergence of Newton's method) Given $F : \mathbb{R}^n \to \mathbb{R}^n$, $F \in C^2$, such that $F(\overline{x}) = 0$ and $\nabla F(\overline{x})$ is non-singular. Then the Newton iterates $x_k$ converge quadratically to $\overline{x}$ for any $x_0$ sufficiently close to $\overline{x}$.
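A minimal sketch of the Newton iteration (jac stands for the Jacobian $\nabla F$; the 2x2 test system is a made-up example):

import numpy as np

def newton(F, jac, x0, eps=1e-12, max_iter=50):
    """Newton's method for F(x) = 0: x_{k+1} = x_k - [grad F(x_k)]^{-1} F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < eps:
            break
        x = x - np.linalg.solve(jac(x), Fx)   # solve the linear system, never invert
    return x

# Made-up example: F(x) = (x1^2 + x2^2 - 1, x1 - x2), root (1/sqrt(2), 1/sqrt(2))
F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
jac = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
print(newton(F, jac, [1.0, 0.5]))             # approx. [0.7071, 0.7071]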
Newton for solving $\min f(x)$, i.e., $F(x) := \nabla f(x) = 0$:
$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
(local) quadratic convergence to $\overline{x}$ if: $f \in C^3$, $\nabla f(\overline{x}) = 0$ with $\nabla^2 f(\overline{x})$ non-singular.
Problems with this Newton method:
$x_k \to \overline{x}$ with $\overline{x}$ possibly a local maximizer.
the step $x_k \to x_{k+1}$ may increase $f$.
Newton descent method for $\min f(x)$:
The Newton direction $d_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$ is a descent direction at $x_k$ ($g_k^T d_k < 0$) if (assuming $\nabla f(x_k) \ne 0$): $[\nabla^2 f(x_k)]^{-1}$, or equivalently $\nabla^2 f(x_k)$, is positive definite.
Algorithm (Levenberg-Marquardt variant):
step k: Given $x_k \in \mathbb{R}^n$ with $g_k \ne 0$.
1. Determine $\sigma_k > 0$ such that $(\nabla^2 f(x_k) + \sigma_k I) \succ 0$, and compute
$d_k = -(\nabla^2 f(x_k) + \sigma_k I)^{-1} g_k$   (*)
2. Find a minimizer $t_k$ of $\min_{t \ge 0} f(x_k + t d_k)$ and put $x_{k+1} = x_k + t_k d_k$.
Ex.11.n1 [connection with the trust region method] Consider the quadratic Taylor approximation of $f$ near $x_k$:
$q(x) := f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k)$
Compute the descent step $d_k$ according to (*) (Levenberg-Marquardt) and put $x_{k+1} = x_k + d_k$, $\tau := \|d_k\|$. Show that $x_{k+1}$ is a local minimizer of the trust region problem:
$\min q(x)$ s.t. $\|x - x_k\| \le \tau$
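A sketch of step 1 (choosing $\sigma_k$ by shifting with the smallest Hessian eigenvalue is one simple possibility, not prescribed by the slides; the margin 1e-3 is an arbitrary assumption):

import numpy as np

def lm_direction(hess_k, g_k, margin=1e-3):
    """Levenberg-Marquardt direction: d_k = -(H + sigma I)^{-1} g_k, H + sigma I pos. def."""
    lam_min = np.linalg.eigvalsh(hess_k).min()
    sigma = max(0.0, -lam_min) + margin        # smallest shift making H + sigma*I > 0
    n = hess_k.shape[0]
    return np.linalg.solve(hess_k + sigma * np.eye(n), -g_k)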
Disadvantages of the Newton methods:
$\nabla^2 f(x_k)$ is needed
work per step: a linear system $\nabla F_k\, x = b_k$, i.e., $\approx n^3$ flops
4.6 Quasi-Newton method
Aim: Find a method which only makes use of first derivatives and only needs $O(n^2)$ flops per iteration.
Consider the descent method with $d_k = -H_k g_k$.
Desired properties for $H_k$:
i  $H_k \succ 0$
ii  $H_{k+1} = H_k + E_k$, a simple update rule
iii  for quadratic $f$: conjugate directions $d_j$
iv  the Quasi-Newton condition: $(x_{k+1} - x_k) = H_{k+1} (g_{k+1} - g_k)$
Notation: $\delta_k := x_{k+1} - x_k$, $\gamma_k := g_{k+1} - g_k$
Quasi-Newton Method
INIT: Choose some $x_0 \in \mathbb{R}^n$, $H_0 \succ 0$, $\varepsilon > 0$.
ITER: WHILE $\|g_k\| \ge \varepsilon$ DO
BEGIN
Set $d_k = -H_k g_k$.
Determine a solution $t_k$ of the problem $\min_{t \ge 0} f(x_k + t d_k)$.
Set $x_{k+1} = x_k + t_k d_k$ and update $H_{k+1} = H_k + E_k$.
END
For the update $H_k + E_k$ we try, with $\alpha, \beta, \mu \in \mathbb{R}$:
$E_k = \alpha u u^T + \beta v v^T + \mu (u v^T + v u^T)$   (*)
where $u := \delta_k$, $v := H_k \gamma_k$.
Note that $E_k$ is symmetric with rank (at most) 2.
L.11.5 Apply the Quasi-Newton method to $q(x) = \frac{1}{2} x^T A x + b^T x$, $A \succ 0$, with $E_k$ of the form (*) and $H_{k+1}$ satisfying iv: $\delta_k = H_{k+1} \gamma_k$. Then the directions $d_j$ are A-conjugate:
$d_j^T A d_i = 0$, $0 \le i < j \le k$.
Last step in the construction of $E_k$: Find $\alpha, \beta, \mu$ in (*) such that (iv) holds. This leads to the following update formula.
Broyden family: with $\Phi \in \mathbb{R}$,
$H_{k+1} = H_k + \dfrac{\delta_k \delta_k^T}{\delta_k^T \gamma_k} - \dfrac{H_k \gamma_k \gamma_k^T H_k}{\gamma_k^T H_k \gamma_k} + \Phi\, w w^T$   (**)
where $w := (\gamma_k^T H_k \gamma_k)^{\frac{1}{2}} \left( \dfrac{\delta_k}{\delta_k^T \gamma_k} - \dfrac{H_k \gamma_k}{\gamma_k^T H_k \gamma_k} \right)$.
As special cases we obtain:
$\Phi = 0$: the DFP-method (1963) (Davidon, Fletcher, Powell)
$\Phi = 1$: the BFGS-method (1970) (Broyden, Fletcher, Goldfarb, Shanno)
Finally we show that property i), $H_k \succ 0$, is preserved.
L.11.6 In the Quasi-Newton method, if we use (**) with $\Phi \ge 0$, then: $H_k \succ 0 \Rightarrow H_{k+1} \succ 0$.
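A sketch of the Broyden-family update, with a numerical check of the Quasi-Newton condition iv and of L.11.6 (the test data is made up):

import numpy as np

def broyden_update(H, delta, gamma, phi=1.0):
    """Broyden family update (**): phi=0 gives DFP, phi=1 gives BFGS."""
    Hg = H @ gamma
    dg = delta @ gamma                   # delta^T gamma
    gHg = gamma @ Hg                     # gamma^T H gamma
    w = np.sqrt(gHg) * (delta / dg - Hg / gHg)
    return (H + np.outer(delta, delta) / dg
              - np.outer(Hg, Hg) / gHg
              + phi * np.outer(w, w))

rng = np.random.default_rng(1)
delta, gamma = rng.standard_normal(4), rng.standard_normal(4)
if delta @ gamma < 0:
    gamma = -gamma                       # ensure delta^T gamma > 0 (holds with exact line search)
H_new = broyden_update(np.eye(4), delta, gamma, phi=1.0)
print(np.allclose(H_new @ gamma, delta))          # True: condition iv holds
print(np.all(np.linalg.eigvalsh(H_new) > 0))      # True: H stays pos. def. (L.11.6)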