Part 2: Linesearch methods for unconstrained optimization
Nick Gould (RAL)
MSc course on nonlinear optimization

UNCONSTRAINED MINIMIZATION

    minimize_{x ∈ IR^n} f(x)

where the objective function f : IR^n → IR
- assume that f ∈ C^1 (sometimes C^2) and Lipschitz continuous
- in practice this assumption is often violated, but it is not essential
ITERATIVE METHODS

- in practice it is very rare to be able to provide an explicit minimizer
- iterative method: given a starting guess x_0, generate a sequence {x_k}, k = 1, 2, ...
- AIM: ensure that (a subsequence) has some favourable limiting properties:
  - satisfies first-order necessary conditions
  - satisfies second-order necessary conditions

Notation: f_k = f(x_k), g_k = g(x_k), H_k = H(x_k).

LINESEARCH METHODS

- calculate a search direction p_k from x_k
- ensure that this direction is a descent direction, i.e.,

    g_k^T p_k < 0  if  g_k ≠ 0,

  so that, for small steps along p_k, the objective function will be reduced
- calculate a suitable steplength α_k > 0 so that

    f(x_k + α_k p_k) < f_k

- the computation of α_k is the linesearch, and may itself be an iteration
- generic linesearch method:

    x_{k+1} = x_k + α_k p_k
STEPS MIGHT BE TOO LONG

[Figure: the objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = (−1)^{k+1} and steps α_k = 2 + 3/2^{k+1} from x_0 = 2]

STEPS MIGHT BE TOO SHORT

[Figure: the objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = −1 and steps α_k = 1/2^{k+1} from x_0 = 2]
PRACTICAL LINESEARCH METHODS

- in the early days, α_k was picked to minimize f(x_k + α p_k): an exact linesearch, i.e., univariate minimization
- rather expensive and certainly not cost effective
- modern methods use an inexact linesearch:
  - ensure steps are neither too long nor too short
  - try to pick a useful initial stepsize for fast convergence
- the best methods are either backtracking-Armijo or Armijo-Goldstein based

BACKTRACKING LINESEARCH

Procedure to find the stepsize α_k:

  Given α_init > 0 (e.g., α_init = 1),
    let α^(0) = α_init and l = 0.
  Until f(x_k + α^(l) p_k) < f_k,
    set α^(l+1) = τ α^(l), where τ ∈ (0, 1) (e.g., τ = 1/2),
    and increase l by 1.
  Set α_k = α^(l).

- this prevents the step from getting too small ...
- ... but does not prevent steps that are too long relative to the decrease in f
- need to tighten the requirement f(x_k + α^(l) p_k) < f_k
ARMIJO CONDITION

In order to prevent steps that are too long relative to the decrease in f, instead require that

    f(x_k + α_k p_k) ≤ f(x_k) + α_k β g_k^T p_k

for some β ∈ (0, 1) (e.g., β = 0.1 or even β = 0.0001).

[Figure: the curves f(x_k + α p_k), f(x_k) + α β g_k^T p_k and f(x_k) + α g_k^T p_k as functions of α, illustrating the interval of stepsizes accepted by the Armijo condition]

BACKTRACKING-ARMIJO LINESEARCH

Procedure to find the stepsize α_k:

  Given α_init > 0 (e.g., α_init = 1),
    let α^(0) = α_init and l = 0.
  Until f(x_k + α^(l) p_k) ≤ f(x_k) + α^(l) β g_k^T p_k,
    set α^(l+1) = τ α^(l), where τ ∈ (0, 1) (e.g., τ = 1/2),
    and increase l by 1.
  Set α_k = α^(l).
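The backtracking-Armijo procedure above translates directly into code. The following is a minimal sketch (function name, defaults and the iteration cap are illustrative, not from the notes):

```python
import numpy as np

def backtracking_armijo(f, g, x, p, alpha_init=1.0, beta=0.1, tau=0.5, max_iter=100):
    """Backtracking-Armijo linesearch.

    f, g : callables returning the objective value and gradient at a point.
    p    : descent direction at x (requires g(x)^T p < 0).
    Returns the accepted stepsize alpha_k.
    """
    slope = g(x) @ p          # g_k^T p_k, negative for a descent direction
    assert slope < 0, "p must be a descent direction"
    alpha = alpha_init
    for _ in range(max_iter):
        # Armijo condition: sufficient decrease relative to the linear model
        if f(x + alpha * p) <= f(x) + alpha * beta * slope:
            return alpha
        alpha *= tau          # shrink the step and try again
    return alpha
```

On f(x) = x^2 from x = 2 with the steepest-descent direction p = −g(x) = −4, the full step α = 1 overshoots to x = −2 and is rejected, while α = 1/2 lands exactly at the minimizer.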
SATISFYING THE ARMIJO CONDITION

Theorem 2.1. Suppose that f ∈ C^1, that g(x) is Lipschitz continuous with Lipschitz constant γ(x), that β ∈ (0, 1) and that p is a descent direction at x. Then the Armijo condition

    f(x + αp) ≤ f(x) + α β g(x)^T p

is satisfied for all α ∈ [0, α_max(x)], where

    α_max(x) = 2(β − 1) g(x)^T p / (γ(x) ||p||_2^2).

PROOF OF THEOREM 2.1

Taylor's theorem (Theorem 1.1) and

    α ≤ 2(β − 1) g(x)^T p / (γ(x) ||p||_2^2)

give

    f(x + αp) ≤ f(x) + α g(x)^T p + (1/2) γ(x) α^2 ||p||_2^2
              ≤ f(x) + α g(x)^T p + α(β − 1) g(x)^T p
              = f(x) + α β g(x)^T p.
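As a quick numerical sanity check of Theorem 2.1 (illustrative, not part of the notes): for f(x) = x^2 the gradient g(x) = 2x is Lipschitz with constant γ = 2, and the bound α_max should mark the largest step for which the Armijo condition holds.

```python
# Verify the Theorem 2.1 bound on f(x) = x^2, where gamma = 2 exactly.
f = lambda x: x**2
g = lambda x: 2.0 * x

x, beta, gamma = 2.0, 0.1, 2.0
p = -g(x)                                         # steepest-descent direction
slope = g(x) * p                                  # g(x)^T p = -16
alpha_max = 2 * (beta - 1) * slope / (gamma * p**2)   # = 1 - beta = 0.9 here

armijo = lambda a: f(x + a * p) <= f(x) + a * beta * slope
# Armijo holds for steps up to alpha_max and fails just beyond it
holds_below = armijo(0.99 * alpha_max)
holds_above = armijo(1.05 * alpha_max)
```

Since a quadratic makes the Taylor bound in the proof tight, α_max is here not just sufficient but essentially the exact threshold.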
THE ARMIJO LINESEARCH TERMINATES

Corollary 2.2. Suppose that f ∈ C^1, that g(x) is Lipschitz continuous with Lipschitz constant γ_k at x_k, that β ∈ (0, 1) and that p_k is a descent direction at x_k. Then the stepsize generated by the backtracking-Armijo linesearch terminates with

    α_k ≥ min ( α_init, 2τ(β − 1) g_k^T p_k / (γ_k ||p_k||_2^2) ).

PROOF OF COROLLARY 2.2

Theorem 2.1 ⟹ the linesearch will terminate as soon as α^(l) ≤ α_max. There are 2 cases to consider:

1. It may be that α_init satisfies the Armijo condition ⟹ α_k = α_init.
2. Otherwise, there must be a last linesearch iteration (the l-th) for which α^(l) fails the Armijo test, and hence for which α^(l) > α_max ⟹

    α_k = α^(l+1) = τ α^(l) > τ α_max = 2τ(β − 1) g_k^T p_k / (γ_k ||p_k||_2^2).

Combining these 2 cases gives the required result.
GENERIC LINESEARCH METHOD

Given an initial guess x_0, let k = 0.
Until convergence:
  Find a descent direction p_k at x_k.
  Compute a stepsize α_k using a backtracking-Armijo linesearch along p_k.
  Set x_{k+1} = x_k + α_k p_k, and increase k by 1.

GLOBAL CONVERGENCE THEOREM

Theorem 2.3. Suppose that f ∈ C^1 and that g is Lipschitz continuous on IR^n. Then, for the iterates generated by the Generic Linesearch Method, either

    g_l = 0 for some l ≥ 0

or

    lim_{k→∞} f_k = −∞

or

    lim_{k→∞} min ( |p_k^T g_k|, |p_k^T g_k| / ||p_k||_2 ) = 0.
PROOF OF THEOREM 2.3

Suppose that g_k ≠ 0 for all k and that lim_{k→∞} f_k > −∞. The Armijo condition gives

    f_{k+1} − f_k ≤ α_k β p_k^T g_k  for all k,

⟹ summing over the first j iterations,

    f_{j+1} − f_0 ≤ Σ_{k=0}^{j} α_k β p_k^T g_k.

The LHS is bounded below by assumption ⟹ the RHS is bounded below. The sum is composed of negative terms ⟹

    lim_{k→∞} α_k |p_k^T g_k| = 0.

Let

    K_1 def= { k : α_init > 2τ(β − 1) g_k^T p_k / (γ ||p_k||_2^2) },

where γ is the assumed uniform Lipschitz constant, and K_2 def= {1, 2, ...} \ K_1.

For k ∈ K_1, Corollary 2.2 gives

    α_k ≥ 2τ(β − 1) g_k^T p_k / (γ ||p_k||_2^2)
    ⟹ α_k p_k^T g_k ≤ 2τ(β − 1) (g_k^T p_k)^2 / (γ ||p_k||_2^2) < 0
    ⟹ lim_{k ∈ K_1, k→∞} |p_k^T g_k| / ||p_k||_2 = 0.   (1)

For k ∈ K_2, α_k ≥ α_init ⟹

    lim_{k ∈ K_2, k→∞} p_k^T g_k = 0.   (2)

Combining (1) and (2) gives the required result.
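The Generic Linesearch Method above can be sketched as follows. The `direction` callback supplying p_k is an assumption of this sketch (the method leaves the choice of p_k open), and the tolerance-based stopping test stands in for "until convergence":

```python
import numpy as np

def generic_linesearch(f, g, x0, direction, beta=1e-4, tau=0.5,
                       tol=1e-6, max_iter=10000):
    """Generic linesearch method with a backtracking-Armijo stepsize.

    direction(x, grad) must return a descent direction p_k at x.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        grad = g(x)
        if np.linalg.norm(grad) <= tol:       # approximate first-order point
            break
        p = direction(x, grad)
        slope = grad @ p
        assert slope < 0                      # p_k must be a descent direction
        alpha, fx = 1.0, f(x)
        while f(x + alpha * p) > fx + alpha * beta * slope:
            alpha *= tau                      # backtracking-Armijo linesearch
        x = x + alpha * p                     # x_{k+1} = x_k + alpha_k p_k
    return x
```

For instance, passing `direction=lambda x, grad: -grad` gives the steepest-descent instance of the method discussed next.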
EXAMPLES

Steepest-descent direction: p_k = −g_k. Then

    lim_{k→∞} min ( |p_k^T g_k|, |p_k^T g_k| / ||p_k||_2 ) = 0  ⟹  lim_{k→∞} g_k = 0.

Newton-like direction: p_k = −B_k^{−1} g_k. Then

    lim_{k→∞} min ( |p_k^T g_k|, |p_k^T g_k| / ||p_k||_2 ) = 0  ⟹  lim_{k→∞} g_k = 0,

provided B_k is uniformly positive definite.

Conjugate-gradient direction: p_k = any conjugate-gradient approximation to the minimizer of

    f_k + p^T g_k + (1/2) p^T B_k p ≈ f(x_k + p).

Then

    lim_{k→∞} min ( |p_k^T g_k|, |p_k^T g_k| / ||p_k||_2 ) = 0  ⟹  lim_{k→∞} g_k = 0,

provided B_k is uniformly positive definite.

STEEPEST DESCENT EXAMPLE

[Figure: contours of the objective function f(x, y) = 10(y − x^2)^2 + (x − 1)^2, and the iterates generated by the Generic Linesearch steepest-descent method]
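A hypothetical reconstruction of a steepest-descent run on the objective from the figure (the starting point and iteration count are illustrative guesses, not taken from the notes); it exhibits the slow, zig-zag progress typical of steepest descent in a curved valley:

```python
import numpy as np

# Steepest descent with backtracking-Armijo on
# f(x, y) = 10(y - x^2)^2 + (x - 1)^2, minimized at (1, 1).
def f(v):
    x, y = v
    return 10 * (y - x**2)**2 + (x - 1)**2

def g(v):
    x, y = v
    return np.array([-40 * x * (y - x**2) + 2 * (x - 1),
                     20 * (y - x**2)])

v = np.array([-1.0, 0.5])        # illustrative starting point
values = [f(v)]
for _ in range(2000):
    p = -g(v)                    # steepest-descent direction
    slope = g(v) @ p
    alpha = 1.0
    while f(v + alpha * p) > f(v) + 1e-4 * alpha * slope:
        alpha *= 0.5             # backtracking-Armijo
    v = v + alpha * p
    values.append(f(v))
```

The Armijo condition guarantees that `values` decreases monotonically, but even after thousands of iterations the progress per step is small, consistent with the linear rate noted below.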
METHOD OF STEEPEST DESCENT (cont.)

- the archetypal globally convergent method
- many other methods resort to steepest descent in bad cases
- not scale invariant
- convergence is usually very (very!) slow (linear)
- numerically, often not convergent at all

NEWTON METHOD EXAMPLE

[Figure: contours of the objective function f(x, y) = 10(y − x^2)^2 + (x − 1)^2, and the iterates generated by the Generic Linesearch Newton method]
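For comparison, a sketch of the Newton variant on the same objective (starting point illustrative): the direction p_k = −H_k^{−1} g_k, with a steepest-descent fallback added here for the case where the Newton direction is not downhill, typically needs far fewer iterations than steepest descent.

```python
import numpy as np

# Newton's method with backtracking-Armijo on
# f(x, y) = 10(y - x^2)^2 + (x - 1)^2, minimized at (1, 1).
def f(v):
    x, y = v
    return 10 * (y - x**2)**2 + (x - 1)**2

def g(v):
    x, y = v
    return np.array([-40 * x * (y - x**2) + 2 * (x - 1),
                     20 * (y - x**2)])

def H(v):
    x, y = v
    return np.array([[120 * x**2 - 40 * y + 2, -40 * x],
                     [-40 * x, 20.0]])

v = np.array([-1.0, 0.5])
for k in range(100):
    grad = g(v)
    if np.linalg.norm(grad) < 1e-10:
        break
    p = np.linalg.solve(H(v), -grad)     # Newton direction p_k = -H_k^{-1} g_k
    if grad @ p >= 0:                    # fallback: H_k not positive definite,
        p = -grad                        # so revert to steepest descent
    alpha = 1.0
    while f(v + alpha * p) > f(v) + 1e-4 * alpha * (grad @ p):
        alpha *= 0.5                     # backtracking-Armijo
    v = v + alpha * p
```

Near the solution the unit step α = 1 satisfies the Armijo condition, so the full Newton step is accepted and the method converges rapidly.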
MORE GENERAL DESCENT METHODS (cont.)

- may be viewed as scaled steepest descent
- convergence is often faster than steepest descent
- can be made scale invariant for suitable B_k