Bio-inspired Continuous Optimization: The Coming of Age

Bio-inspired Continuous Optimization: The Coming of Age Anne Auger Nikolaus Hansen Nikolas Mauny Raymond Ros Marc Schoenauer TAO Team, INRIA Futurs, FRANCE http://tao.lri.fr First.Last@inria.fr CEC 27, Singapore, September 27., 27

Problem Statement Continuous Domain Search/Optimization The problem Minimize a fitness function (objective function, loss function) in continuous domain f : S R n R, in the Black Box scenario (direct search) x f(x) Hypotheses domain specific knowledge only used within the black box gradients are not available

Problem Statement Continuous Domain Search/Optimization The problem Minimize a fitness function (objective function, loss function) in continuous domain f : S R n R, in the Black Box scenario (direct search) x f(x) Typical Examples shape optimization (e.g. using CFD) curve fitting, airfoils model calibration biological, physical parameter identification controller, plants, images

Optimization techniques Numerical methods Applied Mathematicians but Heavily rely on theoretical convergence proofs Requires regularity Numerical pitfalls Long history numerical gradient? Bio-inspired algorithms Computer Scientists Recent trendy methods but Computationally heavy No convergence proof mostly from AI field but divided! well, almost see later

This talk Goal Empirical comparison on some artificial testbed illustrating typical difficulties of continuous optimization between some bio-inspired algorithms and some (one!) deterministic optimization method(s) in the back-box scenario without specific intensive parameter tuning

1 Continuous optimization and stochastic search The algorithms 2 Problem difficulties Ruggedness Ill-Conditionning Non-separability 3 Implementations and parameter settings Algorithm implementations Tuning DE Experiments and results Experimental conditions Outputs and Performance measures 5 Conclusion Further work Conclusions

The algorithms Bio-inspired Optimization Algorithms Darwinian Artificial Evolution Repeat (Parent selection Variation Survival selection) Preselection: of CEC 5 Challenge Particle Swarm Optimization Eberhart & Kennedy, 95 Perturb particle velocity best and local best Update best and local best Differential Evolution Rainer and Storn, 95 Add difference vector(s) Uniform crossover Keep best of parent and offspring with proba. 1 CR Covariance Matrix Adaptation-ES Hansen & Ostermeier, 9 Gaussian mutation Keep λ 2 best of λ offspring + Update mutation parameters

The algorithms BFGS Gradient-based methods { xt+1 = x t ρ t d t ρ t = Argmin ρ {F(x t ρd t )} Line search Choice of d t, the descent direction? BFGS: a Quasi-Newton method Maintain an approximation Ĥ t of the Hessian of f Solve for d t Ĥ t d t = f (x t ) Compute x t+1 and update Ĥ t Ĥ t+1 Converges if quadratic approximation of F holds around the optimum Reliable and robust on quadratic functions!

1 Continuous optimization and stochastic search 2 Problem difficulties Ruggedness Ill-Conditionning Non-separability 3 Implementations and parameter settings Experiments and results 5 Conclusion

9 8 7 5 3 2 3 2 1 1 2 3 What makes a problem hard? Non-convexity invalidates most of deterministic theory Ruggedness non-smooth, discontinuous, noisy Multimodality presence of local optima Dimensionality line search is trivial The magnifiscence of high dimensionality Ill-conditioning Non-separability

Ruggedness Monotonous transformation invariance Monotonous transformations x 2 i x 2 i ( x 2 i ) 2 Invariance Comparison-based algorithms PSO, DE, CMA-ES,... are invariant w.r.t. monotonous transformations Gradient-based methods BFGS,... are not

Ruggedness Multimodality Bio-inspired algorithms BFGS are global search algorithms but performance on multi-modal problems depends on population size has no population :-) but starting point is crucial on multimodal functions Replace population by multiple restarts from uniformly distributed points or from the perturbed final point of previous trial the Hessian is anyway reset to I n

Ill-Conditionning Ill-Conditionning The Condition Number (CN) of a positive-definite matrix H is the ratio of its largest and smallest eigenvalues If f is quadratic, f (x) = x T Hx), the CN of f is that of its Hessian H More generally, the CN of f is that of its Hessian wherever it is defined. Graphically, ill-conditioned means squeezed lines of equal function value 1.8...2.2...8 Increased condition number 1.8...2.2...8 1 1.5.5 1 1 1.5.5 1 Issue: The gradient does not point toward the minimum...

Ill-Conditionning A priori discussion Bio-inspired algorithms PSO and DE: population can point toward the minimum CMA-ES: covariance matrix can take longer to learn BFGS Numerical gradient can raise numerical problems Hessian matrix can take longer to learn

Non-separability Separability Definition (Separable Problem) A function f is separable if ( arg min (x 1,...,x n) f (x 1,..., x n ) = arg min f (x 1,...),..., arg min f (..., x n ) x 1 x n solve n independent 1D optimization problems ) Example: Additively decomposable functions n f (x 1,..., x n ) = f i (x i ) i=1 e.g. Rastrigin function 3 2 1 1 2 3 3 2 1 1 2 3

Non-separability Designing Non-Separable Problems Rotating the coordinate system f : x f (x) separable f : x f (Rx) non-separable R rotation matrix Hansen, Ostermeier, & Gawelczyk, 95; Salomon, 9 3 3 2 1 R 2 1 1 1 2 2 3 3 2 1 1 2 3 3 3 2 1 1 2 3

Non-separability Rotational invariance Bio-inspired algorithms PSO: is not rotational invariant see next slide DE: Crossover is not rotational invariant Rotational invariance iff CR = 1 CMA-ES: is rotational invariant BFGS Numerical gradient can raise numerical problems Added to ill-conditionning effects

Non-separability PSO and rotational invariance. A sample swarm Same swarm, rotated V j i (t+1) = Vj i (t)+ c 1 U j i (, 1)(pj i xj i }{{ (t)) + c 2 Ũ j i } (, 1)(gj i xj i }{{ (t)) } approach the "previous" best approach the "global" best

1 Continuous optimization and stochastic search 2 Problem difficulties 3 Implementations and parameter settings Algorithm implementations Tuning DE Experiments and results 5 Conclusion

Algorithm implementations The algorithms Default implementations DE: Matlab code from http://www.icsi.berkeley.edu/ storn/code.html Not really giving default parameters PSO: Std PSO 2, C code from http://www.particleswarm.info/standard_pso_2.c Remark: C code Matlab code here CMA-ES: Matlab code from http://www.bionik.tu-berlin.de/user/niko/ BFGS: Matlab built-in implementation widely blindely used using numerical gradient + multiple restarts local or global

Tuning DE The problem with DE Control parameters NP, F, CR, Stopping Criterion... and strategy to generate difference vector Perturb random or best Number of difference vectors 1 or 2 Slightly mutate perturbation All of the above :-) from public Matlab code 1 DE/rand/1 the basic algorithm 2 DE/local-to-best/1 3 DE/best/1 with jitter DE/rand/1 with per-vector-dither F = rand(f, 1) 5 DE/rand/1 with global-dither either-or-algorithm

Tuning DE DE tuning Experimental conditions Rotated ellipsoid, cond. number = Dimension = Stop when f elli < Strat=1 F=.3 F=.5 F=.7 F=.9.2...8 1 CR Strat=3 log(#evals) 7.5 7.5 5.5 5.5 7.5 F=.3 F=.5 F=.7 F=.9 Strat=2.2.. CR 7 Strat= Design of Experiments variants F = {.3,.5,.7,.9} CR = {.3,.5,.7,.9} NP = {3, 5, } dim 3 runs per setting :-( 8 runs F=.3 F=.5 F=.7 F=.9.2...8 1 CR F=.3 F=.5 F=.7 F=.9 Strat=5.2...8 1 CR Poor DOE, yet totally unrealistic for RWA log(#evals) log(#evals).5 5.5 5.5 7.5.5 5.5.5 F=.3 F=.5 F=.7 F=.9.2.. CR 7 5 F=.3 F=.5 F=.7 F=.9 Strat=.2.. CR

Tuning DE DE Experimental Setting and Discussion Couldn t decide among 2 variants: DE2 Strategy 2 F =.8 CR = 1. NP = dim DE5 Strategy 5 F =. CR = 1. NP = dim Discussion DE2 very close to recommended parameters Except CR =.9, but Rotational invariance iff CR = 1 no crossover

1 Continuous optimization and stochastic search 2 Problem difficulties 3 Implementations and parameter settings Experiments and results Experimental conditions Outputs and Performance measures 5 Conclusion

Experimental conditions The parameters Common Parameters Default parameters except for DE MaxEval = 7 Fitness tolerance = 9 Domain: [ 2, 8] d Optimum not at the center 21 runs in each case except BFGS when little success Population size Standard values: for n =, 2, PSO: + floor(2 n) 1, 18, 22 CMA-ES: + floor(3 ln n), 12, 15 DE: n, 2, To be increased for multi-modal functions e.g. Rastrigin

Experimental conditions The testbed Test functions Ellipsoid function Rosenbrock function Rastrigin function DiffPow Ellipsoid and DiffPow quadratic both separable and rotated for different condition number almost unimodal both original (non-separable) and rotated for different condition number highly multi-modal both separable and rotated convex, but flat unimodal, but non-convex

Outputs and Performance measures Comparisons from CEC 25 challenge on continuous optimization Goals Find the best possible fitness value At minimal cost Number of function evaluations Statistical Measures Must take into account both precision and cost Usual averages and standard deviations of fitness values irrelevant to assess precision Issue: need to impose for obvious practical reasons precision threshold on fitness maximum number of iterations

Outputs and Performance measures SP1 measure SP1 Success Performance one Average required effort for success SP1 = avg # evaluations proportion successful runs Effort to reach a given precision on the fitness (success) Same number of total fitness evaluations to allow comparisons Estimated after a fixed number of runs High (unknown) variance in the case of few successes Could also be estimated after a fixed number of successes Would allow to control the variance A single value, insufficient alone anyway

Outputs and Performance measures Cumulative distributions Cope with both fitness threshold and bound on # evaluations Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 %run 5 %run 5 3 3 2 2 5 7 3 evaluation number fitness value % runs vs # eval to reach success threshold (= 9 ) % runs vs fitness value reached before max eval (= 7 ) Rosenbrock function, dim, cond.

Outputs and Performance measures Dynamic Behaviors Median, best, worse, 25-, 75-percentiles Rosenbrock : 21 trials, dimension 2, tol 1.E 9, alpha, default size, eval max 12 rosenbrock_de2 rosenbrock_de5 function value 1.e+ 5.e+5 1.e+ 1.5e+ 2.e+ 2.5e+ 3.e+ 3.5e+.e+.5e+ function evaluations All runs of 2 settings Comparing DE2 and DE5 on Rosenbrock(), dim=2

Ellipsoid f elli (x) = i 1 n i=1 α n 1 x 2 i = x T H elli x 1... H elli = B @ 1 C A α convex, separable 1.8...2.2...8 1 1.5.5 1 1.8 felli rot(x) = f elli(rx) = x T Helli rotx R random rotation Helli rot = R T H ellir convex, non-separable...2.2...8 1 1.5.5 1 cond(h elli ) = cond(h rot elli ) = α α = 1,..., α = axis ratio of 3, typical for real-world problem

Separable Ellipsoid function PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 7 Ellipsoid dimension, 21 trials, tolerance 1e 9, eval max 1e+7 5 SP1 3 2 Ellipsoid BFGS Ellipsoid DE5 Ellipsoid DE2 Ellipsoid PSO Ellipsoid CMAES Sqrt of sqrt of ellipsoid BFGS 1 2 Condition number 8

Separable Ellipsoid function PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 2 7 Ellipsoid dimension 2, 21 trials, tolerance 1e 9, eval max 1e+7 5 SP1 3 2 Ellipsoid BFGS Ellipsoid DE5 Ellipsoid DE2 Ellipsoid PSO Ellipsoid CMAES Sqrt of sqrt of ellipsoid BFGS 1 2 Condition number 8

Rotated Ellipsoid function PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 7 Ellipsoid dimension, 21 trials, tolerance 1e 9, eval max 1e+7 5 SP1 3 2 1 2 Condition number Rotated Ellipsoid BFGS Rotated Ellipsoid DE5 Rotated Ellipsoid DE2 Rotated Ellipsoid PSO Rotated Ellipsoid CMAES sqrt of sqrt of rotated ellipsoid BFGS 8

Rotated Ellipsoid function PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 2 7 Ellipsoid dimension 2, 21 trials, tolerance 1e 9, eval max 1e+7 5 SP1 3 2 1 2 Condition number Rotated Ellipsoid BFGS Rotated Ellipsoid DE5 Rotated Ellipsoid DE2 Rotated Ellipsoid PSO Rotated Ellipsoid CMAES sqrt of sqrt of rotated ellipsoid BFGS 8

Ellipsoid: Discussion Bio-inspired algorithms Separable case: PSO and DE insensitive to conditionning... but PSO rapidly fails to solve the rotated version... while CMA-ES and DE (CR = 1) are rotation invariant DE scales poorly with dimension d 2.5 compared to d 1.5 for PSO and CMA-ES and BFGS... vs BFGS BFGS fails to solve ill-conditionned cases Matlab message: Roundoff error is stalling convergence CMA-ES only 7 times slower Line search couldn t find an acceptable point in the current search direction on quadratic functions!

Rosenbrock function (Banana) f rosen (x) = n 1 [ i=1 (1 xi ) 2 + α (x i+1 xi 2)2] Non-separable, but... also ran rotated version α = 2, classical Rosenbrock function α = 1,..., 8 Multi-modal for dimension > 3 3 5 5 2 1 1 2 3

Rosenbrock functions PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 8 dimension, 21 trials, tolerance 1e 9, eval max 1e+7 7 SP1 5 3 2 beta 8

Rosenbrock functions PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 2 8 dimension 2, 21 trials, tolerance 1e 9, eval max 1e+7 7 SP1 5 3 2 beta 8

Rosenbrock functions PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 8 dimension, 21 trials, tolerance 1e 9, eval max 1e+7 7 SP1 5 3 2 beta 8

Rosenbrock function Dim Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS α 1 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha 1, default size, eval max Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha 1, default size, eval max 9 9 8 8 7 7 %run 5 %run 5 3 3 2 2 3 5 7 9 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rosenbrock function Dim Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS α Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 %run 5 %run 5 3 3 2 2 5 7 8 2 2 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rosenbrock function Dim Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS α Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 %run 5 %run 5 3 3 2 2 5 7 3 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rosenbrock function Dim Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS α 3 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha 3, default size, eval max 9 8 7 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha 3, default size, eval max 9 8 7 %run 5 %run 5 3 3 2 2 5 7 3 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rosenbrock function Dim Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS α Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 %run 5 %run 5 3 3 2 2 5 7 3 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rosenbrock function Dim Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS α Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 Rosenbrock : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 %run 5 %run 5 3 3 2 2 5 7 7 1 2 5 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rosenbrock: Discussion Bio-inspired algorithms PSO sensitive to non-separability DE still scales badly with dimension... vs BFGS Numerical premature convergence on ill-condition problems Both local and global restarts improve the results

Rastrigin function f rast (x) = n + n i=1 x2 i cos(2πx i ) 3 2 separable multi-modal 1 1 2 3 3 2 1 1 2 3 f rot rast(x) = f rast (Rx) 3 R random rotation non-separable multimodal 2 1 1 2 3 3 2 1 1 2 3

Rastrigin function - SP1 vs fitness value PSO, DE2, DE5, CMA-ES, and BFGS - PopSize Rastrigin: dimension, NP = 9 7 CMAES Rastrigin PSO Rastrigin DE2 Rastrigin DE5 Rastrigin BFGS Rastrigin CMAES Rotated Rastrigin PSO Rotated Rastrigin DE2 Rotated Rastrigin DE5 Rotated Rastrigin BFGS Rotated Rastrigin SP1 5 3 1 Function Value To Reach

Rastrigin function - SP1 vs fitness value PSO, DE2, DE5, CMA-ES, and BFGS - PopSize 1 Rastrigin: dimension, NP = 1 9 7 CMAES Rastrigin PSO Rastrigin DE2 Rastrigin DE5 Rastrigin BFGS Rastrigin CMAES Rotated Rastrigin PSO Rotated Rastrigin DE2 Rotated Rastrigin DE5 Rotated Rastrigin BFGS Rotated Rastrigin SP1 5 3 1 Function Value To Reach

Rastrigin function - SP1 vs fitness value PSO, DE2, DE5, CMA-ES, and BFGS - PopSize 3 Rastrigin: dimension, NP = 3 9 7 CMAES Rastrigin PSO Rastrigin DE2 Rastrigin DE5 Rastrigin BFGS Rastrigin CMAES Rotated Rastrigin PSO Rotated Rastrigin DE2 Rotated Rastrigin DE5 Rotated Rastrigin BFGS Rotated Rastrigin SP1 5 3 1 Function Value To Reach

Rastrigin function - Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS - PopSize Rastrigin : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max Rastrigin : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 8 7 %run %run 5 3 2 evaluation number 1 1 2 fitness value 3 % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rastrigin function - Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS - PopSize 1 Rastrigin : 21 trials, dimension, tol 1.E 9, alpha 1, default size, eval max Rastrigin : 21 trials, dimension, tol 1.E 9, alpha 1, default size, eval max 9 8 7 %run %run 5 3 2 evaluation number 1 1 2 fitness value 3 % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rastrigin function - Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS - PopSize 3 Rastrigin : 21 trials, dimension, tol 1.E 9, alpha 3, default size, eval max Rastrigin : 21 trials, dimension, tol 1.E 9, alpha 3, default size, eval max 9 9 8 8 7 7 %run 5 %run 5 3 3 2 2 3 5 7 1 1 2 3 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rastrigin function - Cumulative distributions PSO, DE2, DE5, CMA-ES, and BFGS - PopSize Rastrigin : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max Rastrigin : 21 trials, dimension, tol 1.E 9, alpha, default size, eval max 9 9 8 8 7 7 %run 5 %run 5 3 3 2 2 3 5 7 1 1 2 3 evaluation number fitness value % success vs # eval to reach success threshold (= 9 ) % success vs fitness value reached before max eval (= 7 )

Rastrigin: Discussion Bio-inspired algorithms Increasing population size improves the results Optimal size is algorithm-dependent CMA-ES and PSO solve separable case PSO about times slower Only CMA-ES solves the rotated Rastrigin reliably requires popsize 3... vs BFGS Gets stuck in local optima Whatever the restart strategies No numerical premature convergence

Away from quadraticity Non-linear scaling invariance Comparison-based algorithms are insensitive to monotonous transformations True for DE, PSO and all ESs BFGS is not and convergence results do depend on convexity Other test functions Simple transformation of ellispoid f SSE (x) = felli (x) The DiffPow function and DiffPow f DiffPow (x) = ( x i 2+( i) )

DiffPow SP1 vs Fitness Values PSO, DE2, DE5, CMA-ES, and BFGS - Dimension 5 Sum of diff. powers: dimension SP1 3 BFGS diffpow DE5 NP= diffpow DE2 NP= diffpow PSO NP= diffpow CMAES diffpow BFGS sqrt(diffpow) DE5 NP= sqrt(diffpow) DE2 NP= sqrt(diffpow) PSO NP= sqrt(diffpow) CMAES sqrt(diffpow) 2 9 3 Diff Pow Value 3

Sqrt and DiffPow: Discussion Bio-inspired algorithms Invariant PSO performs best as expected! DiffPow is separable... vs BFGS Worse on Ellipsoid than on Ellispoid Better on DiffPow than on DiffPow closer to quadraticity? Premature numerical convergence for high CN... fixed by the local restart strategy

Further work What is missing? The algorithms Other deterministic methods e.g. Scilab procedures PCX crossover operator Specific Evolution Engine The testbed Noisy functions Constrained functions The statistics Confidence bounds for SP1 and other precision/cost measures Real-world functions Which ones??? Do complete experiments

Conclusions Summary Bio-inspired algorithms BFGS All are monotonous-transformation invariant PSO very good... only on separable (easy) functions DE poorly scales up with the dimension CMA-ES is a clear best choice Sensitive to non-separability when CR < 1 Redo experiments with parameter-less restart version Optimal choice for quasi-quadratic functions but can suffer from numerical premature convergence for high condition number Even with restart procedures, fails on multimodal problem dim only here

Conclusions The coming of age The message to our Applied Maths colleagues Bio-inspired vs BFGS but CMA-ES only 7 times slower than BFGS on (quasi-)quadratic functions. is less hindered by high conditionning, is monotonous-transformation invariant, is a global search method! Moreover, Theoretical results are catching up Linear convergence for SA-ES Auger, 5 with bound on the CV speed On-going work for CMA-ES