Stochastic Methods for Continuous Optimization


Stochastic Methods for Continuous Optimization. Anne Auger and Dimo Brockhoff. Paris-Saclay Master, Master 2 Informatique, Parcours Apprentissage, Information et Contenu (AIC), 2016. Slides taken from the Auger & Hansen GECCO tutorial.

Problem Statement: Black Box Optimization and Its Difficulties
Continuous Domain Search/Optimization

Task: minimize an objective function (fitness function, loss function) in continuous domain,
  f : X ⊆ R^n → R,  x ↦ f(x).
Black-box scenario (direct search scenario):
- gradients are not available or not useful
- problem-domain-specific knowledge is used only within the black box, e.g. within an appropriate encoding
Search costs: number of function evaluations.
(Anne Auger & Nikolaus Hansen, CMA-ES tutorial, July 2014)

What Makes a Function Difficult to Solve? Why stochastic search?
- non-linear, non-quadratic, non-convex: on linear and quadratic functions much better search policies are available
- ruggedness: non-smooth, discontinuous, multimodal, and/or noisy functions
- dimensionality (size of the search space): (considerably) larger than three
- non-separability: dependencies between the objective variables
- ill-conditioning
(Figures: a rugged fitness landscape; level sets of an ill-conditioned function with gradient and Newton directions.)

Ruggedness: non-smooth, discontinuous, multimodal, and/or noisy.
(Figure: a rugged fitness cut from a 5-D example, (easily) solvable with evolution strategies.)

Curse of Dimensionality
The term "curse of dimensionality" (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space.
Example: consider placing 20 points equally spaced onto the interval [0, 1]. Now consider the 10-dimensional space [0, 1]^10. To get similar coverage in terms of distance between adjacent points requires 20^10 ≈ 10^13 points. The 20 points now appear as isolated points in a vast empty space.
Remark: distance measures break down in higher dimensionalities (the central limit theorem kicks in).
Consequence: a search policy that is valuable in small dimension might be useless in moderate or large dimensional search spaces. Example: exhaustive search.
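The exponential growth described above is easy to make concrete. A minimal sketch (the function name is illustrative, not from the slides):

```python
# Sketch: the grid size needed for constant per-axis resolution grows
# exponentially with the dimension of the search space.

def grid_points_needed(points_per_axis: int, dimension: int) -> int:
    """Number of grid points with `points_per_axis` points along each axis."""
    return points_per_axis ** dimension

# 20 points cover [0, 1] reasonably well in 1-D ...
print(grid_points_needed(20, 1))    # 20
# ... but the same per-axis resolution in 10-D needs 20^10 ~ 1e13 points.
print(grid_points_needed(20, 10))   # 10240000000000
```

This is exactly why exhaustive (grid) search stops being a viable policy beyond a handful of dimensions.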


Separable Problems
Definition (Separable Problem): a function f is separable if
  argmin_{(x_1,...,x_n)} f(x_1,...,x_n) = ( argmin_{x_1} f(x_1,...), ..., argmin_{x_n} f(...,x_n) )
⇒ it follows that f can be optimized in a sequence of n independent 1-D optimization processes.
Example: additively decomposable functions,
  f(x_1,...,x_n) = Σ_{i=1}^n f_i(x_i),
e.g. the Rastrigin function.
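The n-independent-1-D-searches claim can be sketched directly for an additively decomposable function with identical 1-D terms (a Rastrigin-style example; the grid search and helper names are illustrative):

```python
import numpy as np

# Sketch: an additively decomposable (hence separable) function can be
# minimized by n independent 1-D searches, here a simple 1-D grid search.

def f_sep(x):
    # Rastrigin-style separable function: a sum of identical 1-D terms
    return np.sum(x**2 + 10.0 * (1.0 - np.cos(2.0 * np.pi * x)))

def minimize_coordinatewise(f1d, n, grid):
    # every coordinate has the same 1-D term, so one 1-D search suffices
    best_t = min(grid, key=f1d)
    return np.full(n, best_t)

grid = np.linspace(-3.0, 3.0, 601)
x_star = minimize_coordinatewise(lambda t: t**2 + 10.0 * (1.0 - np.cos(2.0 * np.pi * t)),
                                 n=5, grid=grid)
```

The 5-D minimizer is found with 601 evaluations of a 1-D function instead of 601^5 evaluations of the full function.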

Non-Separable Problems
Building a non-separable problem from a separable one (1,2): rotating the coordinate system,
  f : x ↦ f(x)  separable
  f : x ↦ f(Rx)  non-separable, with R a rotation matrix.
1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms. BioSystems, 39(3):263-278
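The construction f(Rx) is a one-liner; a sketch, assuming a random orthogonal matrix stands in for the rotation R:

```python
import numpy as np

# Sketch: turning a separable function into a non-separable one by rotating
# the coordinate system, f(x) -> f(Rx), with R an orthogonal (rotation) matrix.

rng = np.random.default_rng(0)

def random_rotation(n, rng):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix;
    # the sign fix makes the distribution uniform (Haar)
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

f_sep = lambda x: np.sum(np.arange(1, x.size + 1) * x**2)  # separable ellipsoid
R = random_rotation(5, rng)
f_rot = lambda x: f_sep(R @ x)  # same level-set shapes, rotated: non-separable
```

f_rot has the same function values along rotated level sets and the same optimum at the origin, but its variables are now coupled, so coordinate-wise optimization no longer works.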

Ill-Conditioned Problems: Curvature of Level Sets
Consider the convex-quadratic function
  f(x) = (1/2)(x − x*)^T H (x − x*) = (1/2) Σ_i h_{i,i} (x_i − x*_i)² + (1/2) Σ_{i≠j} h_{i,j} (x_i − x*_i)(x_j − x*_j),
where H is the Hessian matrix of f, symmetric and positive definite. The gradient direction is −f′(x)^T, the Newton direction −H^{−1} f′(x)^T.
Ill-conditioning means squeezed level sets (high curvature). The condition number equals nine in the figure; condition numbers up to 10^10 are not unusual in real-world problems.
If H ≈ I (small condition number of H), first-order information (e.g. the gradient) is sufficient. Otherwise second-order information (an estimation of H^{−1}) is necessary.
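A minimal numeric sketch of these quantities, using the figure's condition number of nine (the 2-D matrix below is an illustrative choice):

```python
import numpy as np

# Sketch: condition number of H and gradient vs. Newton direction for
# f(x) = 1/2 x^T H x (optimum x* = 0).

H = np.diag([1.0, 9.0])            # convex-quadratic, axis-aligned level sets
eigvals = np.linalg.eigvalsh(H)
cond = eigvals[-1] / eigvals[0]    # ratio largest/smallest eigenvalue = 9

x = np.array([1.0, 1.0])
grad = H @ x                       # gradient: distorted by the curvature
newton = np.linalg.solve(H, grad)  # Newton direction: points straight at x*
```

Here the gradient [1, 9] is pulled toward the high-curvature axis, while the Newton step H^{-1} grad recovers x itself, i.e. the straight line to the optimum; this is the picture the slide's two arrows show.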

What Makes a Function Difficult to Solve? ... and what can be done
- Dimensionality → exploiting the problem structure: separability, locality/neighborhood, encoding
- Ill-conditioning → second-order approach: changes the neighborhood metric
- Ruggedness → non-local policy, large sampling width (step-size), as large as possible while preserving a reasonable convergence speed; population-based method, stochastic, non-elitist; recombination operator serves as repair mechanism; restarts; ... metaphors

Stochastic Search: A Black-Box Search Template to Minimize f : R^n → R
Initialize distribution parameters θ, set population size λ ∈ N.
While not terminate:
  1. Sample distribution P(x | θ) → x_1, ..., x_λ ∈ R^n
  2. Evaluate x_1, ..., x_λ on f
  3. Update parameters θ ← F_θ(θ, x_1, ..., x_λ, f(x_1), ..., f(x_λ))
Everything depends on the definition of P and F_θ; deterministic algorithms are covered as well. In many evolutionary algorithms the distribution P is implicitly defined via operators on a population, in particular selection, recombination and mutation. This is also the natural template for (incremental) Estimation of Distribution Algorithms.
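The template can be instantiated in a few lines. A toy sketch where P is an isotropic Gaussian with θ = (mean, σ), and F moves the mean to the best sample while geometrically shrinking σ (this update rule is illustrative only, not one from the slides):

```python
import numpy as np

# Sketch of the black-box search template with a deliberately naive
# parameter update F: mean <- best sample, sigma <- 0.95 * sigma.

def stochastic_search(f, mean, sigma, lam=10, iterations=200, seed=1):
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        X = mean + sigma * rng.standard_normal((lam, mean.size))  # sample P(x|theta)
        fX = np.array([f(x) for x in X])                          # evaluate on f
        mean = X[np.argmin(fX)]                                   # update theta:
        sigma *= 0.95                                             # a toy rule
    return mean

sphere = lambda x: float(np.dot(x, x))
m = stochastic_search(sphere, np.ones(3), 1.0)
```

Even this crude F converges on the sphere; the rest of the tutorial is about principled choices of P (Gaussian with mean m, step-size σ, covariance C) and F (CSA, CMA).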

Evolution Strategies
New search points are sampled normally distributed,
  x_i = m + σ N_i(0, C)  for i = 1, ..., λ,
as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}, and
- the mean vector m ∈ R^n represents the favorite solution,
- the so-called step-size σ ∈ R_+ controls the step length,
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid.
Here, all new points are sampled with the same parameters. The question remains how to update m, C, and σ.
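The sampling step itself is a single call; a sketch with illustrative values for n, λ, m, σ and C:

```python
import numpy as np

# Sketch of the ES sampling step: x_i = m + sigma * N_i(0, C), all lambda
# offspring drawn with the same mean, step-size and covariance matrix.

rng = np.random.default_rng(2)
n, lam = 4, 6
m = np.zeros(n)          # mean vector: the favorite solution
sigma = 0.5              # step-size: controls the step length
C = np.eye(n)            # covariance matrix: shape of the distribution ellipsoid

X = rng.multivariate_normal(mean=m, cov=sigma**2 * C, size=lam)  # lam offspring
```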

The Normal Distribution
(Figures: probability density of the 1-D standard normal distribution; probability density of a 2-D normal distribution.)


... any covariance matrix can be uniquely identified with the iso-density ellipsoid
  {x ∈ R^n : (x − m)^T C^{−1} (x − m) = 1}.
Lines of equal density:
- N(m, σ²I) ~ m + σ N(0, I): one degree of freedom (σ); components are independent standard normally distributed
- N(m, σ²D²) ~ m + σ D N(0, I): n degrees of freedom; components are independent, scaled
- N(m, σ²C) ~ m + σ C^{1/2} N(0, I): (n² + n)/2 degrees of freedom; components are correlated
where I is the identity matrix (isotropic case) and D is a diagonal matrix (reasonable for separable problems), and
  A N(0, I) ~ N(0, A A^T)  holds for all A.
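The identity A N(0, I) ~ N(0, A A^T) is exactly how correlated samples are produced in practice. A sketch using a Cholesky factor as the matrix A (a symmetric square root C^{1/2} works the same way; the concrete matrix is illustrative):

```python
import numpy as np

# Sketch: sampling N(m, C) via a matrix A with A @ A.T == C,
# demonstrating A*N(0, I) ~ N(0, A A^T).

rng = np.random.default_rng(3)
m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])         # symmetric positive definite

A = np.linalg.cholesky(C)          # lower-triangular factor, A @ A.T == C
Z = rng.standard_normal((100_000, 2))
X = m + Z @ A.T                    # samples from N(m, C)

emp_C = np.cov(X, rowvar=False)    # empirical covariance, close to C
```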

The (μ/μ_w, λ)-ES: Non-Elitist Selection and Intermediate (Weighted) Recombination
Given the i-th solution point x_i = m + σ y_i with y_i = N_i(0, C), let x_{i:λ} denote the i-th ranked solution point, such that f(x_{1:λ}) ≤ ... ≤ f(x_{λ:λ}). The new mean reads
  m ← Σ_{i=1}^μ w_i x_{i:λ} = m + σ y_w,  where y_w = Σ_{i=1}^μ w_i y_{i:λ},
with w_1 ≥ ... ≥ w_μ > 0, Σ_{i=1}^μ w_i = 1, and μ_w := 1 / Σ_{i=1}^μ w_i² ≈ λ/4.
The best μ points are selected from the λ new solutions (non-elitist) and weighted intermediate recombination is applied.
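A sketch of this selection-and-recombination step; the log-decreasing weights below are one common choice satisfying the stated conditions, not prescribed by the slide:

```python
import numpy as np

# Sketch of non-elitist weighted intermediate recombination: rank the lambda
# offspring by f-value, then set the new mean to a weighted average of the
# best mu of them.

def recombine(X, fX, mu):
    order = np.argsort(fX)                 # f(x_{1:lam}) <= ... <= f(x_{lam:lam})
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w = w / w.sum()                        # w_1 >= ... >= w_mu > 0, sum w_i = 1
    return w @ X[order[:mu]]               # m = sum_i w_i x_{i:lam}

X = np.array([[3.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
fX = np.array([9.0, 1.0, 4.0, 100.0])      # best is [1, 0], second best [2, 0]
m_new = recombine(X, fX, mu=2)             # lies between the two best points
```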

Invariance Under Monotonically Increasing Functions
Rank-based algorithms: the update of all parameters uses only the ranks,
  f(x_{1:λ}) ≤ f(x_{2:λ}) ≤ ... ≤ f(x_{λ:λ})
  ⟺ g(f(x_{1:λ})) ≤ g(f(x_{2:λ})) ≤ ... ≤ g(f(x_{λ:λ}))  for all g,
where g is strictly monotonically increasing: g preserves ranks (3).
3 Whitley 1989. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, ICGA
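A quick numeric check of the rank-preservation property (the particular g below is an illustrative strictly increasing function):

```python
import numpy as np

# Sketch: a strictly increasing g preserves the ranking of f-values, so a
# rank-based algorithm behaves identically on f and on g o f.

rng = np.random.default_rng(4)
f_values = rng.standard_normal(8)          # f-values of a population

g = lambda v: np.exp(3.0 * v) + 5.0        # strictly monotonically increasing

ranks_f = np.argsort(np.argsort(f_values))
ranks_gf = np.argsort(np.argsort(g(f_values)))
```

Since only ranks_f enters the parameter update, the transformed objective produces the identical sequence of populations.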

Basic Invariance in Search Space
Translation invariance holds for most optimization algorithms: f(x) ↔ f(x − a). Identical behavior on f and f_a:
  f : x ↦ f(x),  x^{(t=0)} = x_0
  f_a : x ↦ f(x − a),  x^{(t=0)} = x_0 + a
No difference can be observed with respect to the argument of f.

Invariance Under Rigid Search Space Transformations
For rank-based algorithms, invariance under monotonically increasing transformations of f combines with invariance under rigid search space transformations: for example, invariance under search space rotation (separable ↔ non-separable), illustrated by f-level sets in dimension 2 for f = h ∘ R, with h a (e.g. Rastrigin-type) test function and R a rotation.
3 Whitley 1989. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, ICGA

Invariance
"The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms." (Albert Einstein)
Empirical performance results, from benchmark functions or from solved real-world problems, are only useful if they generalize to other problems. Invariance is a strong non-empirical statement about generalization: it generalizes (identical) performance from a single function to a whole class of functions. Consequently, invariance is important for the evaluation of search algorithms.

Comparing Experiments: SP1 Comparison to BFGS, NEWUOA, PSO and DE
f convex quadratic, separable (Ellipsoid), with varying condition number; dimension 20, 21 trials, tolerance 1e−09, at most 1e+07 evaluations.
Algorithms: BFGS (Broyden et al. 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001).
f(x) = g(x^T H x) with H diagonal; g the identity (for BFGS and NEWUOA); g any order-preserving, i.e. strictly increasing, function (for all others).
SP1 = average number of objective function evaluations (14) to reach a target function value corresponding to x^T H x = 10^{−9}.
14 Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

Comparing Experiments: SP1 Comparison to BFGS, NEWUOA, PSO and DE
f convex quadratic, non-separable (Rotated Ellipsoid), with varying condition number; dimension 20, 21 trials, tolerance 1e−09, at most 1e+07 evaluations.
Algorithms: BFGS (Broyden et al. 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001).
f(x) = g(x^T H x) with H full; g the identity (for BFGS and NEWUOA); g any order-preserving, i.e. strictly increasing, function (for all others).
SP1 = average number of objective function evaluations (15) to reach a target function value corresponding to x^T H x = 10^{−9}.
15 Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

Comparing Experiments: SP1 Comparison to BFGS, NEWUOA, PSO and DE
f non-convex, non-separable (square root of the square root of the rotated ellipsoid), with varying condition number; dimension 20, 21 trials, tolerance 1e−09, at most 1e+07 evaluations.
Algorithms: BFGS (Broyden et al. 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995), CMA-ES (Hansen & Ostermeier 2001).
f(x) = g(x^T H x) with H full; g : x ↦ x^{1/4} (for BFGS and NEWUOA); g any order-preserving, i.e. strictly increasing, function (for all others).
SP1 = average number of objective function evaluations (16) to reach a target function value corresponding to x^T H x = 10^{−9}.
16 Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

Zoom on Evolution Strategies: Objectives
Illustrate why and how the sampling distribution is controlled:
- step-size control (of the overall standard deviation) makes linear convergence possible
- covariance matrix control makes it possible to solve ill-conditioned problems

Why Step-Size Control?
(Figure: function value versus function evaluations for the (1+1)-ES on f(x) = Σ_{i=1}^n x_i² in [−2.2, 0.8]^n. A too small step size behaves like random search; a constant step size and a too large step size stall; the optimal, scale-invariant step size yields linear convergence on the log scale.)

Methods for Step-Size Control
- 1/5-th success rule (a, b), often applied with "+"-selection: increase the step-size if more than 20% of the new solutions are successful, decrease otherwise.
- σ-self-adaptation (c), applied with ","-selection: mutation is applied to the step-size and the better step-size, according to the objective function value, is selected; simplified global self-adaptation.
- path length control (d) (Cumulative Step-size Adaptation, CSA) (e): self-adaptation derandomized and non-localized.
a Rechenberg 1973, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog
b Schumer and Steiglitz 1968. Adaptive step size random search. IEEE TAC
c Schwefel 1981, Numerical Optimization of Computer Models, Wiley
d Hansen & Ostermeier 2001, Completely Derandomized Self-Adaptation in Evolution Strategies, Evol. Comput.
e Ostermeier et al. 1994, Step-size adaptation based on non-local use of selection information, PPSN
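The 1/5-th success rule is simple enough to sketch in full. The update factors below are common textbook choices, not taken from the slides; they are calibrated so that the step-size is stationary at exactly a 1/5 success rate (p/3 − (1 − p)/12 = 0 gives p = 1/5):

```python
import numpy as np

# Sketch of a (1+1)-ES with the 1/5-th success rule: grow sigma after a
# successful offspring, shrink it otherwise.

def one_plus_one_es(f, x, sigma, iterations=1000, seed=6):
    rng = np.random.default_rng(seed)
    fx = f(x)
    for _ in range(iterations):
        y = x + sigma * rng.standard_normal(x.size)
        fy = f(y)
        if fy <= fx:                      # success: keep offspring ("+"-selection)
            x, fx = y, fy
            sigma *= np.exp(1.0 / 3.0)    # increase step-size
        else:
            sigma *= np.exp(-1.0 / 12.0)  # decrease (1/4 of the log-increase)
    return x, sigma

x, sigma = one_plus_one_es(lambda z: float(np.dot(z, z)), np.ones(5), 1.0)
```

On the sphere this yields the linear convergence that the "Why Step-Size Control?" figure shows for the adapted step size.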

Path Length Control (CSA): The Concept of Cumulative Step-Size Adaptation
  x_i = m + σ y_i,  m ← m + σ y_w
Measure the length of the evolution path, the pathway of the mean vector m in the generation sequence: a short path → decrease σ; a long path → increase σ. Loosely speaking, consecutive steps are perpendicular under random selection (in expectation), and perpendicular in the desired situation (to be most efficient).

Path Length Control (CSA): The Equations
Initialize m ∈ R^n, σ ∈ R_+, evolution path p_σ = 0; set c_σ ≈ 4/n, d_σ ≈ 1.
  m ← m + σ y_w  where y_w = Σ_{i=1}^μ w_i y_{i:λ}   (update mean)
  p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √μ_w y_w
    (the factor √(1 − (1 − c_σ)²) accounts for the discount 1 − c_σ; √μ_w accounts for the weights w_i)
  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ − 1 ) )   (update step-size)
The exponential factor is greater than 1 if and only if ‖p_σ‖ is greater than its expectation.
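One CSA update can be sketched directly from these equations (c_σ = 4/n, d_σ = 1 as above; the closed-form approximation of E‖N(0, I)‖ is the standard one):

```python
import numpy as np

# Sketch of one cumulative step-size adaptation (CSA) update.

def csa_update(p_sigma, sigma, y_w, mu_w, n):
    c_sigma, d_sigma = 4.0 / n, 1.0
    p_sigma = (1 - c_sigma) * p_sigma \
        + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * y_w
    # E||N(0, I)|| ~ sqrt(n) * (1 - 1/(4n) + 1/(21 n^2))
    expected_norm = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    sigma = sigma * np.exp((c_sigma / d_sigma)
                           * (np.linalg.norm(p_sigma) / expected_norm - 1))
    return p_sigma, sigma

n = 10
p, s = csa_update(np.zeros(n), 1.0, np.zeros(n), mu_w=3.0, n=n)  # short path
```

With a zero (short) mean step the path stays short and σ shrinks; feeding a long y_w makes ‖p_σ‖ exceed its expectation and σ grows, exactly the decrease/increase behavior described on the concept slide.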


(Figure: a run of the (5/5, λ)-CSA-ES with default parameters on f(x) = Σ_{i=1}^n x_i² in [−0.2, 0.8]^n for n = 30, showing ‖m − x*‖ versus function evaluations.)

Covariance Matrix Adaptation (CMA): Recalling Evolution Strategies
New search points are sampled normally distributed, x_i = m + σ N_i(0, C) for i = 1, ..., λ, as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}. The mean vector m represents the favorite solution, the step-size σ controls the step length, and the covariance matrix C determines the shape of the distribution ellipsoid. The remaining question is how to update C.

Covariance Matrix Adaptation: Rank-One Update (illustration)
  m ← m + σ y_w,  y_w = Σ_{i=1}^μ w_i y_{i:λ},  y_i ~ N_i(0, C)
The illustration proceeds from the initial distribution with C = I: sample the population, compute y_w, the movement of the population mean m (disregarding σ), mix the distribution C with the step y_w, e.g. C ← 0.8 C + 0.2 y_w y_w^T, and sample the next generation from the new distribution.
The ruling principle: the adaptation increases the likelihood of successful steps, y_w, to appear again. Another viewpoint: the adaptation follows a natural-gradient approximation of the expected fitness.

Covariance Matrix Adaptation: Rank-One Update (equations)
Initialize m ∈ R^n and C = I; set σ = 1, learning rate c_cov ≈ 2/n².
While not terminate:
  x_i = m + σ y_i,  y_i ~ N_i(0, C)
  m ← m + σ y_w  where y_w = Σ_{i=1}^μ w_i y_{i:λ}
  C ← (1 − c_cov) C + c_cov μ_w y_w y_w^T   (rank-one update)
where μ_w = 1 / Σ_{i=1}^μ w_i² ≥ 1.
The rank-one update has been found independently in several domains (6, 7, 8, 9).
6 Kjellström & Taxén 1981. Stochastic Optimization in System Design, IEEE TCS
7 Hansen & Ostermeier 1996. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC
8 Ljung 1999. System Identification: Theory for the User
9 Haario et al. 2001. An adaptive Metropolis algorithm, JSTOR
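The stretching effect of the rank-one update is easy to see numerically. A sketch that repeatedly feeds the same (normalized) selected step y_w into the update, with c_cov = 2/n² as above (the fixed y_w is an illustrative device, not how selection behaves):

```python
import numpy as np

# Sketch of the rank-one update C <- (1 - c_cov) C + c_cov * mu_w * y_w y_w^T:
# the covariance matrix is stretched along directions in which selected
# steps keep occurring.

n = 5
C = np.eye(n)
c_cov, mu_w = 2.0 / n ** 2, 3.0
y_w = np.ones(n) / np.sqrt(n)            # a repeatedly "successful" direction

for _ in range(50):
    C = (1 - c_cov) * C + c_cov * mu_w * np.outer(y_w, y_w)

eigvals, eigvecs = np.linalg.eigh(C)
# the principal axis of C aligns with y_w; orthogonal directions shrink
```

After 50 updates the largest eigenvector is (numerically) parallel to y_w, i.e. the distribution ellipsoid has elongated along the successful direction.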

The CMA-ES
Input: m ∈ R^n, σ ∈ R_+, λ.
Initialize: C = I, p_c = 0, p_σ = 0.
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_μ ≈ μ_w/n², c_1 + c_μ ≤ 1, d_σ ≈ 1 + √(μ_w/n), and w_{i=1,...,λ} such that μ_w = 1 / Σ_{i=1}^μ w_i² ≈ 0.3 λ.
While not terminate:
  x_i = m + σ y_i,  y_i ~ N_i(0, C),  for i = 1, ..., λ   (sampling)
  m ← Σ_{i=1}^μ w_i x_{i:λ} = m + σ y_w  where y_w = Σ_{i=1}^μ w_i y_{i:λ}   (update mean)
  p_c ← (1 − c_c) p_c + 1_{‖p_σ‖ < 1.5√n} √(1 − (1 − c_c)²) √μ_w y_w   (cumulation for C)
  p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √μ_w C^{−1/2} y_w   (cumulation for σ)
  C ← (1 − c_1 − c_μ) C + c_1 p_c p_c^T + c_μ Σ_{i=1}^μ w_i y_{i:λ} y_{i:λ}^T   (update C)
  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ − 1 ) )   (update σ)
Not covered on this slide: termination, restarts, useful output, boundaries and encoding.
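These equations fit into a compact loop. The following is a simplified sketch, not the reference implementation: it uses default-style parameters, omits the Heaviside indicator in the p_c update, and does no termination handling, restarts or boundary treatment:

```python
import numpy as np

# Compact sketch of the CMA-ES loop above (simplified; illustrative only).

def cma_es(f, m, sigma, iterations=500, seed=5):
    n = len(m)
    lam = 4 + int(3 * np.log(n))                  # population size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w = w / w.sum()                               # recombination weights
    mu_w = 1.0 / np.sum(w ** 2)
    c_c = c_sigma = 4.0 / n
    c_1 = 2.0 / n ** 2
    c_mu = min(mu_w / n ** 2, 1 - c_1)
    d_sigma = 1 + np.sqrt(mu_w / n)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # E||N(0,I)||

    C, p_c, p_sigma = np.eye(n), np.zeros(n), np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        D, B = np.linalg.eigh(C)                  # C = B diag(D) B^T
        A = B @ np.diag(np.sqrt(D))               # A @ A.T == C
        Y = rng.standard_normal((lam, n)) @ A.T   # y_i ~ N(0, C)
        X = m + sigma * Y                         # sampling
        order = np.argsort([f(x) for x in X])
        Y_sel = Y[order[:mu]]
        y_w = w @ Y_sel                           # weighted mean step
        m = m + sigma * y_w                       # update mean
        inv_sqrt_C = B @ np.diag(1 / np.sqrt(D)) @ B.T
        p_sigma = (1 - c_sigma) * p_sigma \
            + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * inv_sqrt_C @ y_w
        p_c = (1 - c_c) * p_c \
            + np.sqrt(1 - (1 - c_c) ** 2) * np.sqrt(mu_w) * y_w
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) \
            + c_mu * (Y_sel.T * w) @ Y_sel        # rank-one + rank-mu update
        sigma *= np.exp((c_sigma / d_sigma)
                        * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m

ellipsoid = lambda x: sum(10 ** (6 * i / (len(x) - 1)) * x[i] ** 2
                          for i in range(len(x)))
x_best = cma_es(ellipsoid, m=np.ones(4), sigma=0.5)
```

For serious use, the maintained `pycma` package implements the full algorithm with all the parts this sketch leaves out.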

CMA-ES Summary: The Experimentum Crucis (0)
What did we want to achieve? Reduce any convex-quadratic function, f(x) = x^T H x, to the sphere model, f(x) = x^T x, e.g. f(x) = Σ_{i=1}^n 10^{6 (i−1)/(n−1)} x_i², without the use of derivatives: lines of equal density align with lines of equal fitness, and C ∝ H^{−1} in a stochastic sense.

Experimentum Crucis (1): f convex quadratic, separable.
(Figure: a run on f(x) = Σ_{i=1}^n α^{(i−1)/(n−1)} x_i² with α = 10^6 in 9-D, showing function value, object variables, principal axis lengths, and coordinate-wise standard deviations of the search distribution versus function evaluations.)

Experimentum Crucis (2): f convex quadratic, as before but non-separable (rotated).
(Figure: the same quantities as before for f(x) = g(x^T H x), with g : R → R strictly increasing. The adaptation yields C ∝ H^{−1} for all g and H.)