Global and derivative-free optimization Lectures 1-4


1 Global and derivative-free optimization, Lectures 1-4
Coralia Cartis, University of Oxford
INFOMM CDT: Contemporary Numerical Techniques

2 Lectures 1-4: outline
- Brief overview of derivative-based methods for local NLO.
- Global optimization: definition and overview.
- Derivative-Free Optimization (DFO): motivation and applications.
- Overview of DFO algorithms:
  - model-based (+ with probabilistic models, later);
  - direct search, pattern search, Nelder-Mead;
  - implicit filtering (in the context of an application).
- Overview of GO algorithms (briefly):
  - stochastic methods;
  - deterministic methods: branch-and-bound; interval methods; response surface methods; modern branch-and-bound.

3 Nonlinear optimization: derivative-based algorithms
minimize $f(x)$ subject to $x \in \Omega \subseteq \mathbb{R}^n$.   (P)
$f:\Omega\to\mathbb{R}$ is (generally) smooth and nonconvex; $\Omega$ is the feasible set (determined by finitely many constraints).
- Guaranteed to find local minimizers of (P).
- Rely heavily on accurate/exact derivative information of $f$ and the constraints.
- Optimality conditions: for example, when $\Omega=\mathbb{R}^n$,
  $x^*$ local minimizer of $f$ $\Longrightarrow$ $\nabla f(x^*)=0$ and $\nabla^2 f(x^*)\succeq 0$;
  $\nabla f(x^*)=0$ and $\nabla^2 f(x^*)\succ 0$ $\Longrightarrow$ $x^*$ local minimizer of $f$.
  Used as termination criteria for algorithms.
- Taylor expansions of $f$: at the $k$th iterate $x^k\approx x^*$,
  $f(x^k+s)\approx m_k(s)=f(x^k)+s^T\nabla f(x^k)\;\big[+\tfrac{1}{2}s^T\nabla^2 f(x^k)s\big]$,
  used in algorithm construction.

4 Nonlinear optimization: derivative-based algorithms...
Methods for local unconstrained optimization [i.e., $\Omega=\mathbb{R}^n$ in (P)]
A Generic Method (GM)
Choose $\epsilon>0$ and $x^0\in\mathbb{R}^n$. While (TERMINATION CRITERIA not achieved), REPEAT:
- compute the change $x^{k+1}-x^k=\alpha_k s^k$ [linesearch, trust region] to ensure $f(x^{k+1})<f(x^k)$, where $\alpha_k\in[0,1]$ and $s^k=\arg\min_{s\in\mathbb{R}^n} m_k(s)\approx f(x^k+s)$;
- set $x^{k+1}:=x^k+\alpha_k s^k$, $k:=k+1$.
TC: $\|\nabla f(x^k)\|\le\epsilon$; maybe also $\lambda_{\min}(\nabla^2 f(x^k))\ge-\epsilon$.

5 Nonlinear optimization: derivative-based algorithms...
Linesearch methods for local unconstrained optimization
- compute a descent direction $s^k$ from $x^k$ [i.e., $(s^k)^T\nabla f(x^k)<0$];
- set $x^{k+1}=x^k+\alpha_k s^k$ to decrease $f$ [$\alpha_k$ from an (in)exact linesearch].
Examples:
- $s^k=-\nabla f(x^k)$, i.e., $s^k$ minimizes the linear model $m_k(s)$ (over steps of fixed length): steepest descent.
- $s^k=-\nabla^2 f(x^k)^{-1}\nabla f(x^k)$, i.e., $s^k=\arg\min_s m_k(s)$ with $m_k$ quadratic: Newton.
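For concreteness, here is a minimal Python sketch of the generic linesearch framework with the steepest-descent direction and a backtracking Armijo linesearch; the quadratic test problem and all constants are illustrative choices rather than part of the lecture material.

```python
import numpy as np

def backtracking_armijo(f, x, g, s, alpha0=1.0, beta=1e-4, tau=0.5, max_backtracks=50):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    alpha = alpha0
    while f(x + alpha * s) > f(x) + beta * alpha * (g @ s) and max_backtracks > 0:
        alpha *= tau
        max_backtracks -= 1
    return alpha

def steepest_descent(f, grad, x0, eps=1e-6, max_iter=1000):
    """Generic linesearch method with the steepest-descent direction s^k = -grad f(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:      # termination: ||grad f(x^k)|| <= eps
            break
        s = -g                            # descent direction: (s^k)^T grad f(x^k) < 0
        x = x + backtracking_armijo(f, x, g, s) * s
    return x

# Example on a strictly convex quadratic f(x) = 0.5 x^T A x - b^T x (minimizer A^{-1} b)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(steepest_descent(f, grad, [5.0, -3.0]), np.linalg.solve(A, b))
```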

6 Nonlinear optimization: derivative-based algorithms...
Trust-region methods for local unconstrained optimization
- compute a step $s^k$ from $x^k$ to improve the model $m_k(s)$ of $f$ within the trust region $\|s\|\le\Delta_k$:
  $s^k\approx\arg\min_s m_k(s)$ subject to $\|s\|\le\Delta_k$;
- set $x^{k+1}=x^k+s^k$ if $m_k$ and $f$ agree at $x^k+s^k$;
- otherwise set $x^{k+1}=x^k$ and reduce the radius $\Delta_k$.
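A matching Python sketch of the trust-region mechanism, using the quadratic Taylor model, a Cauchy-point step and the ratio test; the radius-update rule and constants are simplified, illustrative choices.

```python
import numpy as np

def trust_region(f, grad, hess, x0, Delta0=1.0, eta=0.1, eps=1e-6, max_iter=200):
    """Minimal trust-region sketch: quadratic Taylor model, Cauchy-point step,
    and the standard ratio test for accepting steps and updating the radius."""
    x, Delta = np.asarray(x0, dtype=float), Delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= eps:
            break
        m = lambda s: f(x) + g @ s + 0.5 * s @ H @ s    # quadratic model m_k(s)
        # Cauchy point: minimize the model along -g within the trust region
        gHg = g @ H @ g
        t = Delta / np.linalg.norm(g)
        if gHg > 0:
            t = min(t, (g @ g) / gHg)
        s = -t * g
        rho = (f(x) - f(x + s)) / (f(x) - m(s))          # actual vs predicted decrease
        if rho >= eta:                                   # successful: accept, possibly grow radius
            x, Delta = x + s, max(Delta, 2.0 * np.linalg.norm(s))
        else:                                            # unsuccessful: reject, shrink radius
            Delta *= 0.5
    return x

f = lambda x: (x[0] - 1) ** 2 + 5 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 10 * (x[1] + 2)])
hess = lambda x: np.diag([2.0, 10.0])
print(trust_region(f, grad, hess, [0.0, 0.0]))           # should approach (1, -2)
```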

7 Nonlinear optimization: derivative-based algorithms...
How to compute/provide derivatives to a solver?
- Calculate derivatives by hand when the objective and constraints are easy/simple; the user provides code that computes them.
- Calculate or approximate derivatives automatically:
  - Automatic differentiation: breaks the computer code for evaluating $f$ down into elementary arithmetic operations and differentiates by the chain rule. Software: ADIFOR, ADOL-C.
  - Symbolic differentiation: manipulates the algebraic expression of $f$ (if available). Software: symbolic packages of MAPLE, MATHEMATICA, MATLAB.
  - Finite differencing: approximate derivatives.
See Nocedal & Wright, Numerical Optimization (2nd edition, 2006) for more details.

8 Nonlinear optimization: derivative-based algorithms...
Advantages and successes
- global convergence to stationary points of (P) under mild assumptions on the problem class; fast local convergence for Newton-like variants.
- can solve large-scale problems ($n$ large, at least of order $10^3$) efficiently, even when (P) has nonlinear constraints.
Limitations
- only guaranteed to provide local solutions of (P) when (P) is nonconvex.
- require accurate or exact first-, and sometimes even second-, derivatives of the objective $f$ and the constraints to be available.

9 Global and derivative-free optimization algorithms
Attempt to overcome the limitations of derivative-based NLO algorithms for local minimization:
- Global Optimization (GO): the global minimizer of (P) is required; derivatives are allowed.
- Derivative-Free Optimization (DFO): derivatives are unavailable, even if (P) may be smooth; use only function values to construct iterates that approach a (local) minimizer.
For the remainder, $\Omega=\mathbb{R}^n$ in (P), i.e., we solve
minimize $f(x)$ subject to $x\in\mathbb{R}^n$.   (UP)
[GO and DFO may not deal with nonlinear constraints; at best, bounds.]
Comparison to local optimization of (UP): GO is more difficult (generally, NP-hard); in DFO, we lose problem information.
$\Longrightarrow$ For both GO and DFO, we are often content with improvement rather than optimization.

10 Global optimization
Consider (UP). When $f$ is convex, $x^*$ local minimizer of $f$ $\Longrightarrow$ $x^*$ global minimizer of $f$. Hence, for such instances, local optimization = GO.
$f$ nonconvex and bounded below: how to compute the global minimizer of $f$ in the presence of local minimizers/high oscillations and sometimes noise?
A local optimization algorithm gets trapped at local minimizers and cannot advance further towards the global solution.
- How do GO methods avoid this?
- When to terminate a GO algorithm?

11 Global optimization...
Applications: many of the grand-challenge problems of scientific computing, such as weather forecasting, electronic-structure design, protein folding, molecular dynamics, etc.
Methods (to be addressed): branch-and-bound, multistart local search, randomized methods, etc.
Limitations
- on average, can solve efficiently only problems of (very) small scale (of the order of 10 variables); better if parallelism is employed.
- difficulties with incorporating nonlinear constraints; only bound constraints are (more) straightforward.

12 Derivative-Free Optimization (DFO)
Consider (UP). Even when $f$ is smooth, for many applications:
- Exact first derivatives of $f$ are unavailable: $f(x)$ is given by a black-box code, proprietary code or a simulation package.
- Computing $f(x)$ for any given $x$ is expensive: $f(x)$ is given by a time-consuming numerical simulation or lab experiments.
- Numerical approximation of $\nabla f(x)$ is impractically expensive or slow: e.g., using finite differencing to approximate $\nabla f(x)$ when $f(x)$ is expensive.
- The values of $f(x)$ are noisy, i.e., the evaluation of $f(x)$ is inaccurate; for example, when $f(x)$ depends on discretization, sampling, inaccurate data, etc. Then gradient information is meaningless.

13 DFO: effect of noise on finite differencing
Effect of noisy function values on finite differencing: let $F$ be smooth and $\Psi$ the noise, so that $f(x)=F(x)+\Psi(x)$.
Central-difference (CD) formula for $\nabla f(x)$ with stepsize $h$:
$\dfrac{\partial f}{\partial x_i}(x)\approx\dfrac{f(x+h e_i)-f(x-h e_i)}{2h}$, $i=1,\dots,n$,
and let $\eta(x,h):=\sup_{\|z-x\|\le h}|\Psi(z)|$. Then
$\|\nabla_h f(x)-\nabla F(x)\|\le L_F h^2+\dfrac{\eta(x,h)}{h}$,
where $L_F$ is the Lipschitz constant of $\nabla^2 F$.
If $\eta(x,h)$ dominates the right-hand side, then we are lucky if $-\nabla_h f(x)$ is a descent direction. May use DFO methods in that case.
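The trade-off between the $L_F h^2$ and $\eta(x,h)/h$ terms is easy to observe numerically; the following small Python experiment (illustrative test function and noise level, not from the lecture) estimates the gradient of a noisy quadratic by central differences for several stepsizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_noisy(x, sigma=1e-3):
    """Smooth F(x) = ||x||^2 plus bounded uniform noise Psi of size sigma."""
    return float(x @ x) + sigma * rng.uniform(-1.0, 1.0)

def central_diff_grad(f, x, h):
    """Central-difference gradient estimate with stepsize h."""
    n, g = len(x), np.zeros(len(x))
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        g[i] = (f(x + h * e) - f(x - h * e)) / (2.0 * h)
    return g

x = np.array([1.0, -2.0])
true_grad = 2.0 * x                       # gradient of F(x) = ||x||^2
for h in [1e-1, 1e-2, 1e-4, 1e-8]:
    err = np.linalg.norm(central_diff_grad(f_noisy, x, h) - true_grad)
    print(f"h = {h:.0e}   gradient error = {err:.2e}")
# The error first shrinks with h (the L_F h^2 term) and then blows up
# as the noise term eta/h dominates for very small h.
```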

14 DFO: illustrative application
Tuning of algorithmic parameters. Consider some nonlinear optimization solver (say trust region). Its performance depends on parameter choices: starting point, initial trust-region radius, successful-step parameter, etc. For their automatic (optimal) adjustment, solve for instance
minimize $f(p)=\mathrm{CPU}(\text{solver};p)$ over $p\in\mathbb{R}^{n_p}$ subject to $p\in P$,
where $p$ is the vector of all parameters to be tuned and $P=\{p: l\le p\le u\}$.
Derivative calculation is hard; $f$ is possibly nondifferentiable.
Other applications: automatic error analysis, engineering design, molecular geometry, etc.
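The shape of such a tuning problem can be sketched as a black-box objective; the solver call below is a hypothetical stand-in (not a real solver interface), and returning infinity outside the bounds is a crude illustrative way of handling $P$.

```python
import time
import numpy as np

def run_solver(params):
    """Hypothetical stand-in for an expensive solver run; in practice this would
    call the actual optimization code with the given parameter vector."""
    radius0, shrink = params
    # dummy workload whose cost depends on the parameters
    time.sleep(0.001 * (1.0 + (radius0 - 1.0) ** 2 + (shrink - 0.5) ** 2))

def tuning_objective(params, lower, upper):
    """f(p) = CPU time of the solver at parameters p, with the bound constraints
    P = {l <= p <= u} handled crudely by returning +inf outside the box."""
    p = np.asarray(params, dtype=float)
    if np.any(p < lower) or np.any(p > upper):
        return np.inf
    t0 = time.perf_counter()
    run_solver(p)
    return time.perf_counter() - t0      # noisy, derivative-free objective value

lower, upper = np.array([0.1, 0.1]), np.array([10.0, 0.9])
print(tuning_objective([1.0, 0.5], lower, upper))
```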

15 DFO methods
- use only objective function values to construct iterates.
- do not, essentially, compute an approximate gradient; instead, they form a sample of points (less tightly clustered than for finite differences) and use the associated function values to generate $x^{k+1}$ so as to ensure descent; they must also control the geometry of the sample sets.
- Algorithms (to be addressed): model-based trust-region, direct-search algorithms, etc.
- compute an approximate (local) solution with few function evaluations; asymptotic speed is irrelevant as there are no optimality conditions for termination.
- also suitable (but not guaranteed to be successful) for nonsmooth and for global optimization.

16 DFO methods...
Limitations. With current state-of-the-art DFO methods, expect to successfully solve problems provided:
- the problem is small-scale (of the order of $10^2$ variables);
- $f$ is quite smooth;
- accurate finite differencing cannot be achieved ($f$ noisy or expensive, etc.);
- high accuracy is not required (as the methods are slow asymptotically).

17 Derivative-free optimization
Model-based derivative-free methods
- Model-based derivative-free algorithm
- Interpolation models
- Polynomial interpolation
- Geometry of the sample set
- Comments

18 Models in optimization methods
minimize $f(x)$ subject to $x\in\mathbb{R}^n$.   (UP)
Derivative-based methods rely on (linear or quadratic) Taylor models of $f$:
$f(x^k+s)\approx f(x^k)+s^T\nabla f(x^k)\;\big(+\tfrac{1}{2}s^T\nabla^2 f(x^k)s\big)=: m_k(s)$.
These need accurate gradient values $\nabla f(x^k)$ [and maybe Hessians $\nabla^2 f(x^k)$].
How to construct models of $f$ when derivatives are unavailable / don't exist / cannot be approximated?
By interpolation of $f$ on a set of appropriately chosen sample points.

19 Models in derivative-free optimization methods
Sample set: $Y=\{y^1,\dots,y^q\}$ for some $q$; $\{f(y^1),\dots,f(y^q)\}$ assumed to be known/computed.
$x^k$ is the current iterate/estimate of the minimizer $x^*$; $x^k\in Y$ and $f(x^k)\le f(y^i)$, $i=1,\dots,q$.
Model: $m_k(s)=c+s^T g\;\big(+\tfrac{1}{2}s^T H s\big)$, where $c\in\mathbb{R}$, $g\in\mathbb{R}^n$ (and $H\in\mathbb{R}^{n\times n}$ symmetric) are unknown.
Compute $c\in\mathbb{R}$, $g\in\mathbb{R}^n$ (and $H\in\mathbb{R}^{n\times n}$) to satisfy the interpolation conditions:
$m_k(y^i-x^k)=f(y^i)$, $i\in\{1,\dots,q\}$.   (IC)
Need $q=n+1$ for $m_k$ linear (i.e., $H=0$); $m_k$ quadratic needs $q=(n+1)(n+2)/2$; connect to finite differences.

20 Model-based DFO algorithm
Issues to address:
- model interpolation: the matrix of the linear system (IC) must be nonsingular and well-conditioned.
- model minimization: since $m_k$ is nonconvex, add a trust-region constraint:
  $s^k=\arg\min_{s\in\mathbb{R}^n} m_k(s)$ subject to $\|s\|\le\Delta_k$.   (TR)
- update $m_k$ rather than recompute it (only one point leaves $Y$ and a new one enters);
- improve the geometry of $Y$ to help with the (conditioning of the) model interpolation step.
A complete algorithm is very involved; here we give a generic framework.

21 Model-based DFO algorithm...
Let $s^k$ be a(n approximate) solution of (TR). Then
- predicted model decrease: $m_k(0)-m_k(s^k)=f(x^k)-m_k(s^k)$;
- actual function decrease: $f(x^k)-f(x^k+s^k)$.
The trust-region radius $\Delta_k$ is chosen based on the value of
$\rho_k:=\dfrac{f(x^k)-f(x^k+s^k)}{f(x^k)-m_k(s^k)}$.
- If $\rho_k\ge\eta$, where $\eta\in(0,1)$: $x^{k+1}:=x^k+s^k$, $\Delta_{k+1}\ge\Delta_k$.
- If $\rho_k<\eta$: $x^{k+1}=x^k$, and $\Delta_k$ is reduced or $Y$ is improved.

22 Generic model-based DFO algorithm
Given $Y=\{y^1,\dots,y^q\}$ such that (IC) is nonsingular, $x^0\in Y$ such that $f(x^0)\le f(y^i)$ for $i=1,\dots,q$, $\eta\in(0,1)$, $\Delta_0>0$ and $k=0$.
While (TC not satisfied), do:
1. Form the linear/quadratic model $m_k(s)$ to satisfy (IC).
2. Solve (approximately) the (TR) subproblem for $s^k$ with $m_k(s^k)<f(x^k)$ ("sufficiently"). Compute $\rho_k:=[f(x^k)-f(x^k+s^k)]/[f(x^k)-m_k(s^k)]$.
3. If $\rho_k\ge\eta$, then [successful step] set $x^{k+1}:=x^k+s^k$, $\Delta_{k+1}\ge\Delta_k$, and replace some $y^i\in Y$ by $x^{k+1}$.
   Else if ($\rho_k<\eta$) and ($Y$ need not be improved), then set $x^{k+1}=x^k$ and $\Delta_{k+1}<\Delta_k$. [unsuccessful step]
   End (if)
4. Geometry-improving step...
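A heavily simplified Python rendering of Steps 1-3 (linear model only, crude point replacement and no geometry-improving step, so this is an illustration of the framework rather than a robust implementation):

```python
import numpy as np

def dfo_linear_tr(f, x0, Delta0=1.0, eta=0.1, Delta_min=1e-6, max_iter=500):
    """Simplified model-based DFO sketch: linear interpolation model, trust-region
    step, ratio test.  Geometry management of Y is omitted, so the interpolation
    system may become ill-conditioned; this is only an illustration."""
    n = len(x0)
    # initial sample set: x0 plus the coordinate perturbations x0 + Delta0*e_i
    Y = np.vstack([x0, x0 + Delta0 * np.eye(n)])
    fY = np.array([f(y) for y in Y])
    Delta = Delta0
    for _ in range(max_iter):
        best = int(np.argmin(fY))
        xk, fk = Y[best], fY[best]
        others = [i for i in range(len(Y)) if i != best]
        S = Y[others] - xk                        # interpolation matrix of (IC)
        g = np.linalg.solve(S, fY[others] - fk)   # model gradient from (IC)
        if np.linalg.norm(g) < 1e-12 or Delta < Delta_min:
            return xk
        s = -Delta * g / np.linalg.norm(g)        # minimizer of the linear model in the TR
        f_trial = f(xk + s)
        pred = fk - (fk + s @ g)                  # predicted decrease f(x^k) - m_k(s^k)
        rho = (fk - f_trial) / pred
        if rho >= eta:                            # successful: accept and swap the worst point
            worst = others[int(np.argmax(fY[others]))]
            Y[worst], fY[worst] = xk + s, f_trial
        else:                                     # unsuccessful: shrink the radius
            Delta *= 0.5
    return Y[int(np.argmin(fY))]

print(dfo_linear_tr(lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2, np.array([0.0, 0.0])))
```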

23 Generic model-based DFO algorithm... (continued)
4. Invoke a geometry-improving procedure to update $Y$ [one point leaves $Y$, a new one enters so as to improve the conditioning of (IC)].
   Choose $\hat{x}\in Y$ such that $f(\hat{x})\le f(y^i)$ for all $y^i\in Y$; set $\Delta_{k+1}:=\Delta_k$; recompute $\rho_k$ for $x^k+s^k:=\hat{x}$.
   If $\rho_k\ge\eta$, then set $x^{k+1}=\hat{x}$; else set $x^{k+1}=x^k$. End (if)
5. Let $k:=k+1$.

24 Model-based DFO algorithm: comments
- $\rho_k<\eta$ $\Longrightarrow$ the trust region is too large OR the sample set $Y$ is inadequate (degenerate): the iterates become confined to a low-dimensional surface of $\mathbb{R}^n$ that does not contain the solution. Replace a point in $Y$ if the condition number of (IC) is too high, so as to improve this condition number.
- initial $Y$: vertices and edge midpoints of a simplex in $\mathbb{R}^n$.
- quadratic models are expensive: $O(n^2)$ function evaluations for the initial model set-up; $O(n^4)$ arithmetic operations per iteration for the model update and minimization. (Cheaper quadratic models: see Frobenius-norm updates.)
- use linear models (at least at the start of the algorithm, until enough function evaluations have been accumulated): $O(n)$ function evaluations for the initial model set-up; $O(n^3)$ operations per iteration.

25 Model construction by interpolation
Linear model (linear polynomial in $n$ variables): $m_k(s)=f(x^k)+s^T g$.
Need $q=n+1$ in (IC) to determine $c\in\mathbb{R}$, $g\in\mathbb{R}^n$; but $c=f(x^k)$, and so (IC) provides
$f(x^k)+(y^i-x^k)^T g=f(y^i)$, $i=1,\dots,n$, or equivalently, $(y^i-x^k)^T g=f(y^i)-f(x^k)$, $i=1,\dots,n$.
Thus $g$, and hence $m_k(s)$, is uniquely defined $\iff$ $\{y^1-x^k,\,y^2-x^k,\dots,\,y^n-x^k\}$ is linearly independent $\iff$ $\{x^k,y^1,y^2,\dots,y^n\}$ is a nondegenerate simplex.
$\mathcal{P}_n^1$: polynomials of degree at most 1 in $\mathbb{R}^n$; $\dim\mathcal{P}_n^1=n+1$; monomial basis = natural basis $\phi=\{1,x_1,\dots,x_n\}$, $\phi_j(x)=x_j$;
$m_k(y^i-x^k)=\sum_{j=1}^{q}\alpha_j\phi_j(y^i)=f(y^i)$, $i=1,\dots,q$.

26 Model construction by interpolation...
Quadratic model (quadratic polynomial in $n$ variables): $m_k(s)=f(x^k)+s^T g+\tfrac{1}{2}s^T H s$,
or equivalently, by symmetry of $H$,
$m_k(s)=f(x^k)+s^T g+\sum_{i<j}H_{ij}s_i s_j+\tfrac{1}{2}\sum_i H_{ii}s_i^2=f(x^k)+\hat{s}^T\hat{g}$,
where $\hat{s}=\big(s,\,\{s_i s_j\}_{i<j},\,\{\tfrac{1}{2}s_i^2\}\big)$ and $\hat{g}=\big(g,\,\{H_{ij}\}_{i<j},\,\{H_{ii}\}\big)$.
$\mathcal{P}_n^2$: polynomials of degree at most 2 in $\mathbb{R}^n$; $\dim\mathcal{P}_n^2=(n+1)(n+2)/2=q$; monomial basis = natural basis $\phi=\{\phi_j: j=1,\dots,q\}$;
$m_k(y^i-x^k)=\sum_{j=1}^{q}\alpha_j\phi_j(y^i)=f(y^i)$, $i=1,\dots,q$.
Thus $m_k$ is uniquely defined $\iff$ $\delta(\phi,Y):=\det(\{\phi_j(y^i)\}_{ij})\ne 0$, for some polynomial basis $\phi$.

27 Model construction by interpolation...
(IC) $\iff$ $M(\phi,Y)\,\alpha_\phi=f(Y)$, where $M(\phi,Y)_{ij}=\phi_j(y^i)$ and $f(Y)_i=f(y^i)$, $i,j=1,\dots,q$.
$Y$ poised for interpolation $\iff$ $\delta(\phi,Y)=\det M(\phi,Y)\ne 0$ for some basis $\phi$ $\iff$ $\delta(\phi,Y)\ne 0$ for any basis $\phi$ $\iff$ the interpolating polynomial $m_k(s)$ exists and is unique.
Other (useful) polynomial bases: Lagrange polynomials; Newton fundamental polynomials.
Lagrange polynomials: given $Y=\{y^1,\dots,y^q\}$, the Lagrange polynomials $\chi_j(x)$, $j=1,\dots,q$, satisfy $\chi_j(y^i)=1$ if $i=j$ and $\chi_j(y^i)=0$ if $i\ne j$.
$Y$ poised $\Longrightarrow$ the basis $\{\chi_j(x)\}_j$ exists uniquely $\Longrightarrow$ the interpolating polynomial of $f$ on $Y$ is $m_k(s)=\sum_{j=1}^{q}f(y^j)\chi_j(x^k+s)$.
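A quick numerical check of poisedness, using my own encoding of the natural quadratic basis $\{1,\ x_i,\ x_i x_j\ (i<j),\ \tfrac{1}{2}x_i^2\}$: build $M(\phi,Y)$ for a generic sample set and for a collinear (degenerate) one and compare determinants and condition numbers.

```python
import numpy as np
from itertools import combinations

def quad_monomial_basis(n):
    """Natural basis of P_n^2: 1, x_i, x_i*x_j (i<j), 0.5*x_i^2."""
    funcs = [lambda x: 1.0]
    funcs += [lambda x, i=i: x[i] for i in range(n)]
    funcs += [lambda x, i=i, j=j: x[i] * x[j] for i, j in combinations(range(n), 2)]
    funcs += [lambda x, i=i: 0.5 * x[i] ** 2 for i in range(n)]
    return funcs

def interpolation_matrix(Y):
    """M(phi, Y)_{ij} = phi_j(y^i) for the quadratic monomial basis."""
    phi = quad_monomial_basis(Y.shape[1])
    return np.array([[p(y) for p in phi] for y in Y])

n = 2
q = (n + 1) * (n + 2) // 2                       # = 6 sample points needed
rng = np.random.default_rng(1)
Y_good = rng.standard_normal((q, n))             # generic points: poised with high probability
Y_bad = np.column_stack([np.linspace(0, 1, q),   # all points on a line: not poised
                         np.linspace(0, 1, q)])
for name, Y in [("generic", Y_good), ("collinear", Y_bad)]:
    M = interpolation_matrix(Y)
    print(name, "det =", np.linalg.det(M), "cond =", np.linalg.cond(M))
```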

28 Updating Y to improve its geometry
Remove $y^-\in Y$ and add $y^+$ to give $Y^+$ so that $\delta(\phi,Y^+)$ increases in magnitude.
Property: $|\delta(\phi,Y^+)|=|\chi_j(y^+)|\,|\delta(\phi,Y)|$, with $j$ = index of $y^-$.
When $\rho_k\ge\eta$: $y^+=x^k+s^k$ and $y^-=\arg\max_{y^j\in Y}|\chi_j(x^k+s^k)|$.
When $\rho_k<\eta$: check whether $Y$ needs improvement.
[$Y$ is adequate at $x^k$ if, for all $y^j\in Y$ with $\|y^j-x^k\|\le\Delta_k$, $|\delta(\phi,Y)|$ cannot be doubled when $y^j$ is replaced by a point $y$ inside the TR constraint.]
If $Y$ is adequate, choose $\Delta_{k+1}<\Delta_k$, $x^{k+1}=x^k$, and leave $Y$ unchanged.
Else, for every $y^j\in Y$, define the potential replacement $y^j_r=\arg\max_{\|y-x^k\|\le\Delta_k}|\chi_j(y)|$. Let $y^-=y^j$, where $j=\arg\max_{y^i\in Y}|\chi_i(y^i_r)|$.

29 Constructing cheaper quadratic models
- Extension to DFO of quasi-Newton techniques; only $O(n)$ function evaluations are required to construct the quadratic model, with an $O(n^3)$ arithmetic cost per iteration.
- Compute $c$, $g$ and $H$ by solving
  minimize $\|H-H_k\|_F$ over $c,g,H$ subject to $H=H^T$, $m_k(y^i-x^k)=f(y^i)$, $i=1,\dots,\hat{q}$,
  where $\|\cdot\|_F$ is the Frobenius norm and $H_k$ is the previous model Hessian.
- $\hat{q}=O(n)$: $\hat{q}\ge n+2$, so that we can compute $c$, $g$ and some $H_{k+1}\ne H_k$; practical value: $\hat{q}=2n+1$.
- As before, we need to consider the geometry of $Y$...
Software (model-based implementations): COBYLA (linear), DFO (quadratic), UOBYQA (quadratic), WEDGE (quadratic), NEWUOA (cheap quadratic based on quasi-Newton updating).

30 Derivative-free optimization
Direct-search derivative-free methods
- Linesearch methods
- Coordinate search method
- Pattern search methods
- Simplex methods
- Nelder-Mead algorithm

31 Linesearch derivative-free methods
minimize $f(x)$ subject to $x\in\mathbb{R}^n$.   (UP)
A Generic Linesearch DF Method (GLM-DF)
Choose $x^0\in\mathbb{R}^n$; $k=0$. While (TC not satisfied), REPEAT:
- choose a search direction $s^k\in\mathbb{R}^n$ from $x^k$;
- if possible, compute a stepsize $\alpha_k\in\mathbb{R}$ along $s^k$ such that $f(x^k+\alpha_k s^k)<f(x^k)$;
- set $x^{k+1}:=x^k+\alpha_k s^k$ (if such a step $\alpha_k$ exists) and $x^{k+1}:=x^k$ (otherwise), and $k:=k+1$.
Recall derivative-based linesearch methods: if $s^k$ is descent, i.e., $(s^k)^T\nabla f(x^k)<0$, then $f(x^k+\alpha s^k)<f(x^k)$ for $\alpha>0$ sufficiently small; $s^k=-\nabla f(x^k)$ is descent.

32 Linesearch derivative-free methods: linesearch
[Figure, Kolda, Lewis & Torczon (SIREV): (a) steps are too long; (b) steps are too short; (c) bad search direction.]
Exact linesearch: $\alpha_k=\arg\min_{\alpha\in\mathbb{R}}\phi_k(\alpha)=f(x^k+\alpha s^k)$.
Inexact linesearch: sufficient decrease (Armijo-like condition): $f(x^k+\alpha_k s^k)<f(x^k)-\rho(\alpha_k)$, where $\rho(t)\ge 0$ is an increasing function of $t$ with $\rho(t)/t\to 0$ as $t\to 0$.
Use backtracking to satisfy the Armijo-like condition.
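A minimal derivative-free backtracking routine implementing the Armijo-like condition with $\rho(t)=\gamma t^2$ (one of the choices mentioned on the pattern-search slides below); the constants and the test problem are illustrative.

```python
import numpy as np

def df_backtracking(f, x, s, alpha0=1.0, gamma=1e-3, tau=0.5, max_backtracks=40):
    """Derivative-free backtracking: return the first alpha satisfying the
    sufficient-decrease condition f(x + alpha*s) < f(x) - rho(alpha),
    with rho(t) = gamma*t^2; return None if no acceptable step is found along s."""
    x, s = np.asarray(x, float), np.asarray(s, float)
    fx, alpha = f(x), alpha0
    for _ in range(max_backtracks):
        if f(x + alpha * s) < fx - gamma * alpha ** 2:
            return alpha
        alpha *= tau
    return None

# One derivative-free step on f(x) = x1^2 + 4*x2^2 along the direction s = (-1, 0)
f = lambda x: x[0] ** 2 + 4 * x[1] ** 2
print(df_backtracking(f, x=[2.0, 1.0], s=[-1.0, 0.0]))
```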

33 Linesearch DF methods: choice of search directions
Example: coordinate search method
$x^1=x^0+\alpha_0 e_1$, $x^2=x^1+\alpha_1 e_2$, ..., $x^n=x^{n-1}+\alpha_{n-1}e_n$, $x^{n+1}=x^n+\alpha_n e_1$, ...
$\alpha_k$ is computed by an exact or inexact linesearch.
[Figure: iterates $x^0, x^1, \dots$ zig-zagging along the coordinate directions towards $x^*$.]
Inefficient behaviour: a coordinate direction may be (almost) orthogonal to $-\nabla f(x^k)$; see Figs. 1(c) & 2.
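A compact coordinate-search sketch; it uses the inexact sufficient-decrease condition rather than an exact linesearch, and all constants are illustrative choices.

```python
import numpy as np

def coordinate_search(f, x0, alpha0=1.0, gamma=1e-3, tau=0.5, alpha_min=1e-8, max_cycles=200):
    """Cycle through the coordinate directions +/- e_i, taking a backtracked step
    along each whenever it gives sufficient decrease rho(t) = gamma*t^2."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    alpha = alpha0
    for _ in range(max_cycles):
        improved = False
        for i in range(n):
            for sign in (+1.0, -1.0):
                s = np.zeros(n); s[i] = sign
                a = alpha
                while a > alpha_min:
                    if f(x + a * s) < f(x) - gamma * a ** 2:
                        x = x + a * s
                        improved = True
                        break
                    a *= tau
        if not improved:
            alpha *= tau                      # shrink the base stepsize
            if alpha < alpha_min:
                break
    return x

f = lambda x: (x[0] - 1) ** 2 + 5 * (x[1] + 3) ** 2
print(coordinate_search(f, [0.0, 0.0]))       # should approach (1, -3)
```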

34 Linesearch DF methods: choice of search directions
Coordinate search method (continued)...
- CS with exact linesearch: there is an example of failure to converge to a stationary point of $f$ (M.J.D. Powell, 1973).
- Efficient when the variables are essentially uncoupled (equivalent to a nearly diagonal Hessian).
Problems with coordinate search:
- when convergent, the local rate of convergence is often slower than steepest descent (Luenberger, 2003): one step of SD $\approx$ $n$ steps of CS.
- globally convergent variants exist (under strong assumptions or with a sophisticated linesearch); for example, assume that along each coordinate direction, $f$ has a unique minimizer.

35 Linesearch DF methods: choice of search directions
Other variants of coordinate search and linesearch DF:
- "Back and forth"/double-sweep method: search along $e_1,e_2,\dots,e_n,e_{n-1},\dots,e_2,e_1,e_2,\dots$
- Hooke & Jeeves: search along the $n$ coordinates, then along the line from the first to the last point in the cycle.
- Conjugate-directions algorithm (connection to derivative-based conjugate gradients).
Global convergence for GLM-DF: prevent inefficient behaviour by requiring that
$\cos\theta_k=\dfrac{-\nabla f(x^k)^T s^k}{\|\nabla f(x^k)\|\,\|s^k\|}\ge\delta>0$ for all $k$.
When the gradient of $f$ is unavailable, require instead
$\min_{v\ne 0}\max_{j\in\{0,\dots,n-1\}}\dfrac{v^T s^{k+j}}{\|v\|\,\|s^{k+j}\|}\ge\delta>0$ for all $k$
$\Longrightarrow$ $\mathrm{span}\{s^0,s^1,\dots,s^{n-1}\}=\mathbb{R}^n$.
Still not enough for global convergence: a sophisticated linesearch is needed (Lucidi et al., 2002).

36 Pattern-search methods
Motivated by the need to make use of parallelization of function evaluations in linesearch methods.
Pattern-search algorithm
Given $\epsilon>0$, $\theta_1\in(0,1)$, $\theta_2\ge 1$, choose $x^0\in\mathbb{R}^n$, a stepsize $\alpha_0>\epsilon$ and an initial direction set $D_0$; $k=0$.
While ($\alpha_k>\epsilon$), do:
1. If the sufficient decrease condition holds at $\alpha_k$ for some $s^i\in D_k$, then set $x^{k+1}=x^k+\alpha_k s^i$ and $\alpha_{k+1}=\theta_2\alpha_k$.
2. Else set $x^{k+1}=x^k$ and $\alpha_{k+1}=\theta_1\alpha_k$.
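A sketch of this loop with $D_k=\{\pm e_i\}$ and sufficient decrease $\rho(t)=\gamma t^2$. It polls all directions and takes the best one (a variant of the rule above), and the polling is written serially here, although the point of pattern search is that these evaluations can be done in parallel; constants are illustrative.

```python
import numpy as np

def pattern_search(f, x0, alpha0=1.0, eps=1e-6, theta1=0.5, theta2=1.0, gamma=1e-3, max_iter=10000):
    """Pattern-search sketch: poll D_k = {+/- e_i}, accept a best direction that
    gives sufficient decrease rho(t) = gamma*t^2, otherwise contract the stepsize."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    D = np.vstack([np.eye(n), -np.eye(n)])        # coordinate directions +/- e_i
    alpha = alpha0
    for _ in range(max_iter):
        if alpha <= eps:
            break
        fx = f(x)
        # poll all directions (these evaluations could be done in parallel)
        trials = [(f(x + alpha * s), s) for s in D]
        fbest, sbest = min(trials, key=lambda t: t[0])
        if fbest < fx - gamma * alpha ** 2:       # sufficient decrease: accept the step
            x = x + alpha * sbest
            alpha *= theta2
        else:                                     # no acceptable direction: contract
            alpha *= theta1
    return x

f = lambda x: (x[0] + 2) ** 2 + 3 * x[1] ** 2
print(pattern_search(f, [5.0, 5.0]))              # should approach (-2, 0)
```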

37 Pattern-search methods...
Instead of one search direction $s^k$, at each PS iteration we have a set of directions $D_k$. Conditions for a good set of directions $D_k$:
- at least one direction in $D_k$ should give descent, in the sense that
  $\min_{v\ne 0}\max_{s\in D_k}\dfrac{v^T s}{\|v\|\,\|s\|}\ge\delta>0$ for all $k$   (*)
  (very similar to the linesearch methods' condition earlier);
- also require that $0<s_{\min}\le\|s\|\le s_{\max}$ for all $s\in D_k$.   (**)
Note that $D_k=\{e_i,-e_i\}$ (for a single coordinate $i$) does not satisfy (*).

38 Pattern-search methods...
Suitable choices for $D_k$ that satisfy (*) and (**):
- coordinate directions: $\{e_1,e_2,\dots,e_n,-e_1,-e_2,\dots,-e_n\}$;
- $\{s^i=\tfrac{1}{2n}e-e_i\}$, $i=1,\dots,n$, and $s^{n+1}=\tfrac{1}{2n}e$, where $e=(1,1,\dots,1)^T$.
The stepsize $\alpha_k$ is fixed during the $k$th iteration; the sufficient decrease condition (see "inexact linesearch" earlier) is checked at $\alpha_k$ along each direction in $D_k$.
Suitable values for $\rho(t)$: $\gamma t^2\|s^k\|$ or $\gamma t^{3/2}$, etc.
Pattern-search software packages: APPS (Hough, Kolda & Torczon), DIRECT (Jones, Perttunen & Stuckman), etc.

39 Pattern-search methods...
[Figure, Kolda, Lewis & Torczon (SIREV): pattern search with $D_k=\{e_1,e_2,-e_1,-e_2\}$, $n=2$: (a) initial pattern, (b) move North, (c) move West, (d) move North, (e) contract, (f) move West.]

40 Simplex methods: Nelder-Mead
Nothing to do with simplex methods for linear programming.
Nelder-Mead (1965): the most popular algorithm with users of optimization: easy to understand and implement, not sophisticated. But heuristic, not rigorous or reliable, and hence not popular with optimizers.
Connection to (linear) model-based DFO: NM and simplex methods keep a simplex of points at each iteration, but do not construct a linear approximation of $f$ over this simplex; they only use function values at the vertices of the simplex and certain operations on the simplex.
Vertices of the simplex: $Y=\{x^k,y^1,\dots,y^n\}$; edges from $x^k$: matrix $M=\big(y^1-x^k\;\;y^2-x^k\;\;\dots\;\;y^n-x^k\big)$.
$M$ nonsingular $\iff$ the simplex (i.e., the convex hull of $Y$) is nondegenerate.
Connection to pattern search: set of search directions $D_k=\{s^i=y^i-x^k: i=1,\dots,n\}$.

41 The Nelder-Mead algorithm
Change of notation: $Y=\{x^1,\dots,x^{n+1}\}$ at iteration $k$, with $f(x^1)\le f(x^2)\le\dots\le f(x^{n+1})$.
Attempt to improve the worst function value $f(x^{n+1})$.
Centroid of the best $n$ points: $\bar{x}=\tfrac{1}{n}\sum_{i=1}^{n}x^i$.
Search direction: $x^{n+1}-\bar{x}$; $x(\alpha)=\bar{x}+\alpha(x^{n+1}-\bar{x})$.
Simplex operations, illustrated for $n=2$ (S. Richards, 2010): (a) reflection, (b) expansion, (c) outside contraction, (d) inside contraction, (e) shrink.

42 The Nelder-Mead algorithm
Given $\rho\ge 1$ (reflection), $\chi>\rho$ (expansion), $\gamma\in(0,1)$ (contraction) and $\sigma\in(0,1)$ (shrinkage); initial simplex $Y=\{x^1,\dots,x^{n+1}\}$ in $\mathbb{R}^n$; $k=0$.
While (TC not satisfied), do:
1. Order the vertices: $f(x^1)\le f(x^2)\le\dots\le f(x^{n+1})$.
2. (Reflect) Compute $x_r=x(-\rho)$ and $f(x_r)$.
   If $f(x^1)\le f(x_r)<f(x^n)$, replace $x^{n+1}\in Y$ by $x_r$; $k=k+1$.
   Else if $f(x_r)<f(x^1)$, then
3. (Expand) Compute $x_e=x(-\chi)$ and $f(x_e)$.
   If $f(x_e)<f(x_r)$, replace $x^{n+1}\in Y$ by $x_e$; $k=k+1$. Else replace $x^{n+1}\in Y$ by $x_r$; $k=k+1$. (End if)
   Else (i.e., $f(x_r)\ge f(x^n)$)...

43 The Nelder-Mead algorithm (continued)...
   Else (i.e., $f(x_r)\ge f(x^n)$)
4. (Contract) If $f(x^n)\le f(x_r)<f(x^{n+1})$, then (outside contraction)
     compute $x_{oc}=x(-\gamma)$ and $f(x_{oc})$.
     If $f(x_{oc})\le f(x_r)$, replace $x^{n+1}\in Y$ by $x_{oc}$; $k=k+1$. Else go to Step 5. (End if)
   Else (i.e., $f(x_r)\ge f(x^{n+1})$), then (inside contraction)
     compute $x_{ic}=x(\gamma)$ and $f(x_{ic})$.
     If $f(x_{ic})<f(x^{n+1})$, replace $x^{n+1}\in Y$ by $x_{ic}$; $k=k+1$. Else go to Step 5. (End if)
   (End if) (End if)
5. (Shrink) Define $n$ new vertices $y^i=x^1+\sigma(x^{i+1}-x^1)$, $i=1,\dots,n$, and the new simplex $Y^+=\{x^1,y^1,\dots,y^n\}$; $k=k+1$.
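A compact Python rendering of Steps 1-5 (a simplified sketch; the initial simplex, the stopping rule and the standard parameter values $\rho=1$, $\chi=2$, $\gamma=1/2$, $\sigma=1/2$ are illustrative choices):

```python
import numpy as np

def nelder_mead(f, x0, rho=1.0, chi=2.0, gamma=0.5, sigma=0.5, tol=1e-8, max_iter=500):
    """Compact Nelder-Mead sketch following the steps on the previous slides."""
    n = len(x0)
    Y = np.vstack([x0, x0 + np.eye(n)])          # initial nondegenerate simplex
    fY = np.array([f(y) for y in Y])
    for _ in range(max_iter):
        order = np.argsort(fY)                   # Step 1: sort vertices by f
        Y, fY = Y[order], fY[order]
        if np.max(np.linalg.norm(Y[1:] - Y[0], axis=1)) <= tol * max(1.0, np.linalg.norm(Y[0])):
            break                                # simplex has become too small
        xbar = Y[:-1].mean(axis=0)               # centroid of the best n points
        x = lambda a: xbar + a * (Y[-1] - xbar)  # points along the worst-vertex direction
        xr = x(-rho); fr = f(xr)                 # Step 2: reflection
        if fY[0] <= fr < fY[-2]:
            Y[-1], fY[-1] = xr, fr
        elif fr < fY[0]:                         # Step 3: expansion
            xe = x(-chi); fe = f(xe)
            Y[-1], fY[-1] = (xe, fe) if fe < fr else (xr, fr)
        else:                                    # Step 4: contraction
            if fr < fY[-1]:
                xc = x(-gamma); fc = f(xc)       # outside contraction
                accept = fc <= fr
            else:
                xc = x(gamma); fc = f(xc)        # inside contraction
                accept = fc < fY[-1]
            if accept:
                Y[-1], fY[-1] = xc, fc
            else:                                # Step 5: shrink towards the best vertex
                Y[1:] = Y[0] + sigma * (Y[1:] - Y[0])
                fY[1:] = [f(y) for y in Y[1:]]
    return Y[np.argmin(fY)]

f = lambda x: 100 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2   # 2-d Rosenbrock
print(nelder_mead(f, np.array([-1.2, 1.0])))                     # should approach (1, 1)
```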

44 The Nelder-Mead algorithm: some properties
- Termination conditions: function values at the simplex vertices close to each other, or the simplex has become too small ($\max_i\|x^i-x^1\|\le\epsilon\max(1,\|x^1\|)$).
- Function-evaluation cost: $k=0$ and any shrinkage step are expensive ($n+1$ function values); otherwise, one or two function evaluations per operation.
- Limited convergence results: only for $n=1$ and $n=2$. Other simplex methods have better convergence theory; see Torczon (1991).
- Examples of failure (many), documented (McKinnon, 1998).

45 The Nelder-Mead algorithm: convergence (Lagarias et al., 1998)
Theorem 1. ($n=1$) Let $f:\mathbb{R}\to\mathbb{R}$ be a strictly convex objective with bounded level sets. Assume the initial simplex is nondegenerate. Apply the Nelder-Mead algorithm to minimizing $f$. Then both end points of the Nelder-Mead interval (i.e., the simplex in one dimension) converge to the minimizer $x^*$ of $f$.
Theorem 2. ($n=2$) Let $f:\mathbb{R}^2\to\mathbb{R}$ be a strictly convex objective with bounded level sets. Assume the initial simplex is nondegenerate and that $\rho=1$, $\chi=2$ and $\gamma=1/2$. Apply the Nelder-Mead algorithm to minimizing $f$. Then
$\lim_{k\to\infty}f(x^{1,k})=\lim_{k\to\infty}f(x^{2,k})=\lim_{k\to\infty}f(x^{3,k})$
and $\lim_{k\to\infty}\mathrm{diam}(\mathrm{conv}(Y_k))=0$.

46 Illustrations of the Nelder-Mead algorithm in action
Margaret Wright, 2013

47 A 2-d NM picture (note the ease of understanding what's happening!):

48 Nelder-Mead on the McKinnon counterexample:

49 Similar things happen on the more complicated (in)famous Rosenbrock function, $f=100(x_1^2-x_2)^2+(1-x_1)^2$, with its curving steep-sided valley. Coordinate search: 81 function evaluations, step =

50 Nelder-Mead, 76 function evaluations
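For comparison, SciPy's Nelder-Mead implementation can be run on the same function; the starting point and tolerances below are illustrative choices, so the evaluation count will differ from the 76 quoted above.

```python
import numpy as np
from scipy.optimize import minimize, rosen

# rosen is SciPy's built-in Rosenbrock function, here used in 2 dimensions.
res = minimize(rosen, x0=np.array([-1.2, 1.0]), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6})
print(res.x, res.fun, res.nfev)   # minimizer estimate, value, function evaluations
```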
