Evolution Strategies and Covariance Matrix Adaptation


1 Evolution Strategies and Covariance Matrix Adaptation
Cours Contrôle Avancé - Ecole Centrale Paris, Anne Auger, January 2014
INRIA Research Centre Saclay Île-de-France, University Paris-Sud, LRI (UMR 8623), Bat. 490, 91405 ORSAY Cedex, France
Slides from A. Auger, N. Hansen, GECCO 2013 Tutorial on ES and CMA-ES

2 Content
1 Problem Statement: Black Box Optimization and Its Difficulties; Non-Separable Problems; Ill-Conditioned Problems
2 Evolution Strategies: A Search Template; The Normal Distribution; Invariance
3 Step-Size Control: Why Step-Size Control; One-Fifth Success Rule; Path Length Control (CSA)
4 Covariance Matrix Adaptation: Covariance Matrix Rank-One Update; Cumulation, the Evolution Path; Covariance Matrix Rank-µ Update
5 CMA-ES Summary
6 Theoretical Foundations
7 Comparing Experiments
8 Summary and Final Remarks

3 Problem Statement: Continuous Domain Search/Optimization (Black Box Optimization and Its Difficulties)
Task: minimize an objective function (fitness function, loss function) in continuous domain, f : X ⊆ R^n → R
Black Box scenario (direct search scenario): x → [black box] → f(x)
- gradients are not available or not useful
- problem domain specific knowledge is used only within the black box, e.g. within an appropriate encoding
Search costs: number of function evaluations

4 Problem Statement: Continuous Domain Search/Optimization
Goal: fast convergence to the global optimum ... or to a robust solution x; a solution x with small function value f(x) at least search cost; these are two conflicting objectives.
Typical examples: shape optimization (e.g. using CFD), curve fitting, airfoils; model calibration (biological, physical); parameter calibration (controller, plants, images)
Problems: exhaustive search is infeasible; naive random search takes too long; deterministic search is not successful / takes too long
Approach: stochastic search, Evolutionary Algorithms

7 Problem Statement: Objective Function Properties
We assume f : X ⊆ R^n → R to be non-linear, non-separable and to have at least moderate dimensionality, say n ≥ 10. Additionally, f can be
- non-convex
- multimodal (there are possibly many local optima)
- non-smooth (derivatives do not exist)
- discontinuous, with plateaus
- ill-conditioned
- noisy
- ...
Goal: cope with any of these function properties; they are related to real-world problems.

9 Problem Statement: What Makes a Function Difficult to Solve? Why stochastic search?
- non-linear, non-quadratic, non-convex: on linear and quadratic functions much better search policies are available
- ruggedness: non-smooth, discontinuous, multimodal, and/or noisy function
- dimensionality (size of search space): (considerably) larger than three
- non-separability: dependencies between the objective variables
- ill-conditioning: gradient direction ≠ Newton direction

10 Problem Statement: Ruggedness
non-smooth, discontinuous, multimodal, and/or noisy
[figure: fitness along a cut from a 5-D example, (easily) solvable with evolution strategies]

11 Problem Statement: Curse of Dimensionality
The term Curse of dimensionality (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space.
Example: Consider placing 100 points onto a real interval, say [0, 1]. To get similar coverage, in terms of distance between adjacent points, of the 10-dimensional space [0, 1]^10 would require 100^10 = 10^20 points. The 100 points now appear as isolated points in a vast empty space.
Remark: distance measures break down in higher dimensionalities (the central limit theorem kicks in).
Consequence: a search policy (e.g. exhaustive search) that is valuable in small dimensions might be useless in moderate or large dimensional search spaces.

15 Problem Statement: Separable Problems
Definition (Separable Problem): A function f is separable if
arg min_{(x_1,...,x_n)} f(x_1,...,x_n) = ( arg min_{x_1} f(x_1,...), ..., arg min_{x_n} f(...,x_n) )
It follows that f can be optimized in a sequence of n independent 1-D optimization processes.
Example: additively decomposable functions
f(x_1,...,x_n) = Σ_{i=1}^{n} f_i(x_i)
[figure: Rastrigin function]

16 Problem Statement: Non-Separable Problems
Building a non-separable problem from a separable one (1,2): rotating the coordinate system
f : x ↦ f(x) separable
f : x ↦ f(Rx) non-separable, with R a rotation matrix
1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, Morgan Kaufmann
2 Salomon (1996). Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms. BioSystems, 39(3)
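
A minimal Python sketch of this construction; the separable test function below (a Rastrigin-like sum) and the random orthogonal matrix are illustrative choices, not taken from the slides:

# Turn a separable function into a non-separable one by rotating the coordinates.
import numpy as np

rng = np.random.default_rng(0)
n = 10
R, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix

def f_separable(x):
    # additively decomposable: a sum of one-dimensional terms f_i(x_i)
    return float(np.sum(x**2 + 10.0 * (1.0 - np.cos(2.0 * np.pi * x))))

def f_rotated(x):
    # same level sets up to rotation, but the coordinates are now coupled
    return f_separable(R @ x)

A coordinate-wise optimizer that exploits the separability of f_separable will in general no longer be able to exploit it on f_rotated.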

17 Problem Statement: Ill-Conditioned Problems (curvature of level sets)
Consider the convex-quadratic function
f(x) = 1/2 (x - x*)^T H (x - x*) = 1/2 Σ_i h_{i,i} (x_i - x_i*)^2 + 1/2 Σ_{i≠j} h_{i,j} (x_i - x_i*)(x_j - x_j*)
where H is the Hessian matrix of f and is symmetric positive definite.
gradient direction: -∇f(x)^T; Newton direction: -H^{-1} ∇f(x)^T
Ill-conditioning means squeezed level sets (high curvature). The condition number equals nine in the figure. Condition numbers up to 10^10 are not unusual in real-world problems.
If H ≈ I (small condition number of H), first order information (e.g. the gradient) is sufficient. Otherwise second order information (estimation of H^{-1}) is necessary.
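
A small numerical illustration of why the gradient direction is misleading under ill-conditioning while the Newton direction is not; the concrete Hessian and point are arbitrary example values:

# Gradient versus Newton direction on an ill-conditioned convex-quadratic function.
import numpy as np

H = np.diag([1.0, 100.0])               # Hessian with condition number 100
x = np.array([1.0, 1.0])                # current point; the optimum x* is at 0
grad = H @ x                            # gradient of f(x) = 1/2 x^T H x
grad_dir = -grad                        # steepest descent: dominated by the high-curvature axis
newton_dir = -np.linalg.solve(H, grad)  # Newton direction -H^{-1} grad: points straight at the optimum
print(grad_dir, newton_dir)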

18 Problem Statement: What Makes a Function Difficult to Solve? ... and what can be done
The Problem → Possible Approaches
- Dimensionality → exploiting the problem structure: separability, locality/neighborhood, encoding
- Ill-conditioning → second order approach: changes the neighborhood metric
- Ruggedness → non-local policy, large sampling width (step-size): as large as possible while preserving a reasonable convergence speed; population-based method, stochastic, non-elitistic; recombination operator serves as repair mechanism; restarts
... metaphors

21 Metaphors
Evolutionary Computation ↔ Optimization/Nonlinear Programming
- individual, offspring, parent ↔ candidate solution, decision variables, design variables, object variables
- population ↔ set of candidate solutions
- fitness function ↔ objective function, loss function, cost function, error function
- generation ↔ iteration
... methods: ESs

22 Outline, part 2: Evolution Strategies (A Search Template; The Normal Distribution; Invariance)

23 Evolution Strategies, A Search Template: Stochastic Search
A black box search template to minimize f : R^n → R
Initialize distribution parameters θ, set population size λ ∈ N
While not terminate
1 Sample distribution P(x | θ) → x_1,...,x_λ ∈ R^n
2 Evaluate x_1,...,x_λ on f
3 Update parameters θ ← F_θ(θ, x_1,...,x_λ, f(x_1),...,f(x_λ))
Everything depends on the definition of P and F_θ; deterministic algorithms are covered as well.
In many Evolutionary Algorithms the distribution P is implicitly defined via operators on a population, in particular selection, recombination and mutation.
Natural template for (incremental) Estimation of Distribution Algorithms.
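
A minimal Python sketch of this template; the names stochastic_search, sample and update are illustrative placeholders, not part of any library:

# Generic black-box search loop: sample from P(x | theta), evaluate, update theta.
import numpy as np

def stochastic_search(f, theta, sample, update, budget=1000):
    evals = 0
    while evals < budget:
        X = sample(theta)                  # step 1: sample lambda candidates from P(x | theta)
        F = [f(x) for x in X]              # step 2: evaluate them on f
        theta = update(theta, X, F)        # step 3: update the distribution parameters
        evals += len(X)
    return theta

# Toy instantiation: isotropic Gaussian whose mean jumps to the best sample.
rng = np.random.default_rng(0)
sample = lambda th: th["m"] + th["sigma"] * rng.standard_normal((10, 5))
update = lambda th, X, F: {"m": X[int(np.argmin(F))], "sigma": th["sigma"]}
theta = stochastic_search(lambda x: float(np.sum(x**2)),
                          {"m": np.ones(5), "sigma": 0.3}, sample, update)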

31 The CMA-ES
Input: m ∈ R^n, σ ∈ R_+, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n^2, c_µ ≈ µ_w/n^2, c_1 + c_µ ≤ 1, d_σ ≈ 1 + √(µ_w/n), and w_{i=1...λ} such that µ_w = 1 / Σ_{i=1}^{µ} w_i^2 ≈ 0.3 λ
While not terminate
  x_i = m + σ y_i, y_i ∼ N_i(0, C), for i = 1,...,λ   (sampling)
  m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ y_w, where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}   (update mean)
  p_c ← (1 - c_c) p_c + 1I_{‖p_σ‖ < 1.5√n} √(1 - (1 - c_c)^2) √µ_w y_w   (cumulation for C)
  p_σ ← (1 - c_σ) p_σ + √(1 - (1 - c_σ)^2) √µ_w C^{-1/2} y_w   (cumulation for σ)
  C ← (1 - c_1 - c_µ) C + c_1 p_c p_c^T + c_µ Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}^T   (update C)
  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ - 1 ) )   (update of σ)
Not covered on this slide: termination, restarts, useful output, boundaries and encoding

32 Evolution Strategies: A Search Template
New search points are sampled normally distributed,
x_i ∼ m + σ N_i(0, C) for i = 1,...,λ
as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}
- the mean vector m ∈ R^n represents the favorite solution
- the so-called step-size σ ∈ R_+ controls the step length
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid
Here, all new points are sampled with the same parameters.
The question remains how to update m, C, and σ.

34 Evolution Strategies, The Normal Distribution: Why Normal Distributions?
1 widely observed in nature, for example as phenotypic traits
2 only stable distribution with finite variance; stable means that the sum of normal variates is again normal, N(x, A) + N(y, B) ∼ N(x + y, A + B); helpful in the design and analysis of algorithms; related to the central limit theorem
3 most convenient way to generate isotropic search points; the isotropic distribution does not favor any direction, it is rotationally invariant
4 maximum entropy distribution with finite variance; makes the least possible assumptions on f in the distribution shape

35 Normal Distribution
[figures: probability density of the 1-D standard normal distribution; probability density of a 2-D normal distribution]

37 Evolution Strategies: The Multi-Variate (n-dimensional) Normal Distribution
Any multi-variate normal distribution N(m, C) is uniquely determined by its mean value m ∈ R^n and its symmetric positive definite n × n covariance matrix C.
The mean value m determines the displacement (translation); it is the value with the largest density (modal value); the distribution is symmetric about the distribution mean.
The covariance matrix C determines the shape. Geometrical interpretation: any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ R^n | (x - m)^T C^{-1} (x - m) = 1}.

38 Evolution Strategies: The Normal Distribution
... any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ R^n | (x - m)^T C^{-1} (x - m) = 1}
Lines of Equal Density
- N(m, σ^2 I) ∼ m + σ N(0, I): one degree of freedom (σ); components are independent, standard normally distributed
- N(m, D^2) ∼ m + D N(0, I): n degrees of freedom; components are independent, scaled
- N(m, C) ∼ m + C^{1/2} N(0, I): (n^2 + n)/2 degrees of freedom; components are correlated
where I is the identity matrix (isotropic case) and D is a diagonal matrix (reasonable for separable problems), and A N(0, I) ∼ N(0, A A^T) holds for all A.
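
The last identity is how correlated samples are generated in practice: factor C = A A^T and transform a standard normal vector. A minimal Python sketch; the concrete m and C are arbitrary illustration values:

# Sampling from N(m, C) via a factorization C = A A^T (here a Cholesky factor).
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, -1.0])
C = np.array([[4.0, 1.5],
              [1.5, 1.0]])              # symmetric positive definite covariance
A = np.linalg.cholesky(C)               # C = A A^T
x = m + A @ rng.standard_normal(2)      # x ~ N(m, C), since A N(0, I) ~ N(0, A A^T)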

41 Evolution Strategies: Effect of Dimensionality
[figure: 2-D normal distribution N(0, I); with increasing dimension the mass of N(0, I) concentrates away from the mean, the norm ‖N(0, I)‖ having modal value √(n - 1)]
yet: maximum entropy distribution

42 Evolution Strategies: Terminology
Let µ: # of parents, λ: # of offspring.
Plus (elitist) and comma (non-elitist) selection:
(µ + λ)-ES: selection in {parents} ∪ {offspring}
(µ, λ)-ES: selection in {offspring}
(1 + 1)-ES: sample one offspring x = m + σ N(0, C) from parent m; if x is better than m, select m ← x
... why?

45 Evolution Strategies: The (µ/µ, λ)-ES
Non-elitist selection and intermediate (weighted) recombination. Given the i-th solution point
x_i = m + σ y_i, y_i ∼ N_i(0, C),
let x_{i:λ} be the i-th ranked solution point, such that f(x_{1:λ}) ≤ ... ≤ f(x_{λ:λ}). The new mean reads
m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ Σ_{i=1}^{µ} w_i y_{i:λ} =: m + σ y_w
where w_1 ≥ ... ≥ w_µ > 0, Σ_{i=1}^{µ} w_i = 1, and 1 / Σ_{i=1}^{µ} w_i^2 =: µ_w ≈ λ/4.
The best µ points are selected from the new solutions (non-elitistic) and weighted intermediate recombination is applied.
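
One generation of this selection and recombination step in Python; the logarithmically decreasing weights below are one common choice, used here only for illustration:

# (mu/mu_w, lambda) selection and weighted intermediate recombination, one generation.
import numpy as np

rng = np.random.default_rng(0)
n, lam, mu = 5, 10, 5
sigma, m, C = 0.3, np.zeros(n), np.eye(n)

y = rng.multivariate_normal(np.zeros(n), C, size=lam)   # y_i ~ N(0, C)
x = m + sigma * y                                       # offspring x_i = m + sigma y_i
fvals = np.sum(x**2, axis=1)                            # sphere function as example fitness

w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))     # decreasing positive weights
w = w / w.sum()                                         # sum to one
mu_w = 1.0 / np.sum(w**2)                               # variance effective selection mass

idx = np.argsort(fvals)[:mu]                            # indices of x_{1:lambda}, ..., x_{mu:lambda}
y_w = w @ y[idx]                                        # weighted recombination of the best steps
m = m + sigma * y_w                                     # new mean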

46 Evolution Strategies, Invariance: Invariance Under Monotonically Increasing Functions
Rank-based algorithms: the update of all parameters uses only the ranks
f(x_{1:λ}) ≤ f(x_{2:λ}) ≤ ... ≤ f(x_{λ:λ})
g(f(x_{1:λ})) ≤ g(f(x_{2:λ})) ≤ ... ≤ g(f(x_{λ:λ})) for all g
where g is strictly monotonically increasing; g preserves ranks (3).
3 Whitley. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, ICGA

47 Evolution Strategies, Invariance: Basic Invariance in Search Space
Translation invariance is true for most optimization algorithms: f(x) ↔ f(x - a).
Identical behavior on f and f_a:
f : x ↦ f(x), x^{(t=0)} = x_0
f_a : x ↦ f(x - a), x^{(t=0)} = x_0 + a
No difference can be observed w.r.t. the argument of f.

48 Evolution Strategies, Invariance: Rotational Invariance in Search Space
Invariance to orthogonal (rigid) transformations R, where R R^T = I; e.g. true for simple evolution strategies; recombination operators might jeopardize rotational invariance (4,5): f(x) ↔ f(Rx).
Identical behavior on f and f_R:
f : x ↦ f(x), x^{(t=0)} = x_0
f_R : x ↦ f(Rx), x^{(t=0)} = R^{-1}(x_0)
No difference can be observed w.r.t. the argument of f.
4 Salomon (1996). Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms. BioSystems, 39(3)
5 Hansen. Invariance, Self-Adaptation and Correlated Mutations in Evolution Strategies. Parallel Problem Solving from Nature PPSN VI

49 Evolution Strategies: Invariance
"The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms." Albert Einstein
Empirical performance results, for example from benchmark functions or from solved real world problems, are only useful if they generalize to other problems.
Invariance is a strong non-empirical statement about generalization: it generalizes (identical) performance from a single function to a whole class of functions.
Consequently, invariance is important for the evaluation of search algorithms.

50 Outline, part 3: Step-Size Control (Why Step-Size Control; One-Fifth Success Rule; Path Length Control (CSA))

51 Step-Size Control: Recalling
New search points are sampled normally distributed, x_i ∼ m + σ N_i(0, C) for i = 1,...,λ, as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}.
- the mean vector m ∈ R^n represents the favorite solution and m ← Σ_{i=1}^{µ} w_i x_{i:λ}
- the so-called step-size σ ∈ R_+ controls the step length
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid
The remaining question is how to update σ and C.

52 Step-Size Control: Why Step-Size Control?
[figure: (1+1)-ES on f(x) = Σ_{i=1}^{n} x_i^2 in [-2.2, 0.8]^n for n = 10; function value versus function evaluations for random search, constant step-size (too small and too large), and the optimal, scale-invariant step-size]

53-56 Step-Size Control: Why Step-Size Control? (5/5_w, 10)-ES on f(x) = Σ_{i=1}^{n} x_i^2 for n = 10 and x_0 ∈ [-0.2, 0.8]^n
[figures: distance to the optimum ‖m - x*‖ = √f(m) and the respective step-size versus function evaluations]
- 11 runs with the optimal step-size σ
- 2 × 11 runs comparing the optimal versus the adaptive step-size σ with too small initial σ
- comparing the number of f-evaluations needed to reach ‖m - x*‖ = 10^{-5}
- comparing the optimal versus the default damping parameter d_σ
57 Step-Size Control: Why Step-Size Control?
[figure: normalized progress versus normalized step-size, for random search, constant σ, the optimal (scale-invariant) step-size, and the adaptive step-size σ]
The evolution window refers to the step-size interval where reasonable performance is observed.

58 Step-Size Control: Methods for Step-Size Control
- 1/5-th success rule (a,b), often applied with '+'-selection: increase the step-size if more than 20% of the new solutions are successful, decrease otherwise
- σ-self-adaptation (c), applied with ','-selection: mutation is applied to the step-size and the better one, according to the objective function value, is selected; simplified global self-adaptation
- path length control (d) (Cumulative Step-size Adaptation, CSA) (e): self-adaptation derandomized and non-localized
a Rechenberg 1973, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog
b Schumer and Steiglitz. Adaptive step size random search. IEEE TAC
c Schwefel 1981, Numerical Optimization of Computer Models, Wiley
d Hansen & Ostermeier 2001, Completely Derandomized Self-Adaptation in Evolution Strategies, Evol. Comput. 9(2)

59-60 Step-Size Control: One-fifth success rule
[figures: two sampling situations indicating when to increase and when to decrease σ; probability of success p_s ≈ 1/2 versus p_s ≈ 1/5 (too small)]
61 Step-Size Control: One-fifth success rule
p_s: # of successful offspring / # offspring (per generation)
σ ← σ × exp( (1/3) × (p_s - p_target) / (1 - p_target) )
Increase σ if p_s > p_target, decrease σ if p_s < p_target.
For the (1+1)-ES: p_target = 1/5
IF the offspring is better than the parent: p_s = 1, σ ← σ × exp(1/3)
ELSE: p_s = 0, σ ← σ / exp(1/3)^{1/4}
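
A compact Python sketch of a (1+1)-ES with this rule; the step-size factors follow the slide, everything else (budget, test function, seeding) is an illustrative choice:

# (1+1)-ES with the one-fifth success rule.
import numpy as np

def one_plus_one_es(f, x0, sigma0, budget=2000, seed=0):
    rng = np.random.default_rng(seed)
    m = np.array(x0, dtype=float)
    fm, sigma = f(m), float(sigma0)
    for _ in range(budget):
        x = m + sigma * rng.standard_normal(len(m))   # sample one offspring
        fx = f(x)
        if fx <= fm:                                   # success: p_s = 1
            m, fm = x, fx
            sigma *= np.exp(1.0 / 3.0)
        else:                                          # failure: p_s = 0
            sigma /= np.exp(1.0 / 3.0) ** 0.25
    return m, fm, sigma

m, fm, s = one_plus_one_es(lambda x: float(np.sum(x**2)), [1.0] * 10, 0.5)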

62 Step-Size Control, Path Length Control (CSA): The Concept of Cumulative Step-Size Adaptation
Measure the length of the evolution path, the pathway of the mean vector m in the generation sequence:
x_i = m + σ y_i, m ← m + σ y_w
[figure: two path scenarios] if consecutive steps tend to cancel each other out, decrease σ; if they tend to point in the same direction, increase σ.
Loosely speaking, steps are
- perpendicular under random selection (in expectation)
- perpendicular in the desired situation (to be most efficient)

63 Step-Size Control, Path Length Control (CSA): The Equations
Initialize m ∈ R^n, σ ∈ R_+, evolution path p_σ = 0; set c_σ ≈ 4/n, d_σ ≈ 1.
m ← m + σ y_w where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}   (update mean)
p_σ ← (1 - c_σ) p_σ + √(1 - (1 - c_σ)^2) √µ_w y_w
  (the factor √(1 - (1 - c_σ)^2) accounts for the decay 1 - c_σ; √µ_w accounts for the weights w_i)
σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ - 1 ) )   (update step-size)
  (the argument of exp is > 0, i.e. σ increases, iff ‖p_σ‖ is greater than its expectation)
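
The same update as a small Python helper; the default c_σ = 4/n and the chi-mean approximation for E‖N(0, I)‖ are standard choices used here for illustration only:

# One cumulative step-size adaptation (CSA) update.
import numpy as np

def csa_update(p_sigma, y_w, sigma, mu_w, n, d_sigma=1.0):
    c_sigma = 4.0 / n                                             # cumulation parameter
    chi_n = np.sqrt(n) * (1 - 1.0 / (4 * n) + 1.0 / (21 * n**2))  # approx. E||N(0, I)||
    p_sigma = (1 - c_sigma) * p_sigma \
              + np.sqrt(1 - (1 - c_sigma)**2) * np.sqrt(mu_w) * y_w
    sigma = sigma * np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
    return p_sigma, sigma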

65 Step-Size Control: (5/5, 10)-CSA-ES with default parameters
[figure: ‖m - x*‖ and the respective step-size versus function evaluations, with optimal step-size and with step-size control, on f(x) = Σ_{i=1}^{n} x_i^2 in [-0.2, 0.8]^n]

66 Outline, part 4: Covariance Matrix Adaptation (Covariance Matrix Rank-One Update; Cumulation, the Evolution Path; Covariance Matrix Rank-µ Update)

67 Covariance Matrix Adaptation: Recalling
New search points are sampled normally distributed, x_i ∼ m + σ N_i(0, C) for i = 1,...,λ, as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}.
- the mean vector m ∈ R^n represents the favorite solution
- the so-called step-size σ ∈ R_+ controls the step length
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid
The remaining question is how to update C.

68-76 Covariance Matrix Adaptation: Rank-One Update (illustration over consecutive generations)
m ← m + σ y_w, y_w = Σ_{i=1}^{µ} w_i y_{i:λ}, y_i ∼ N_i(0, C)
[figure sequence, one frame per step:]
- initial distribution, C = I
- y_w, movement of the population mean m (disregarding σ)
- mixture of distribution C and step y_w: C ← 0.8 × C + 0.2 × y_w y_w^T
- new distribution (disregarding σ)
- the same steps repeated in the next generation
- new distribution, C ← 0.8 × C + 0.2 × y_w y_w^T
The ruling principle: the adaptation increases the likelihood of successful steps, y_w, to appear again.
Another viewpoint: the adaptation follows a natural gradient approximation of the expected fitness.
... equations

77 Covariance Matrix Adaptation: Rank-One Update
Initialize m ∈ R^n and C = I, set σ = 1, learning rate c_cov ≈ 2/n^2
While not terminate
  x_i = m + σ y_i, y_i ∼ N_i(0, C)
  m ← m + σ y_w where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}
  C ← (1 - c_cov) C + c_cov µ_w y_w y_w^T   (rank-one)
where µ_w = 1 / Σ_{i=1}^{µ} w_i^2 ≥ 1.
The rank-one update has been found independently in several domains:
6 Kjellström & Taxén. Stochastic Optimization in System Design, IEEE TCS
7 Hansen & Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC
8 Ljung. System Identification: Theory for the User
9 Haario et al. An adaptive Metropolis algorithm
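
The covariance update on its own is a one-liner; a Python sketch with the slide's learning rate, where the arguments are the quantities defined above:

# Rank-one covariance matrix update with learning rate c_cov ~ 2/n^2.
import numpy as np

def rank_one_update(C, y_w, mu_w, n):
    c_cov = 2.0 / n**2
    return (1 - c_cov) * C + c_cov * mu_w * np.outer(y_w, y_w)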

78 Covariance Matrix Adaptation: Rank-One Update
The covariance matrix adaptation C ← (1 - c_cov) C + c_cov µ_w y_w y_w^T
- learns all pairwise dependencies between variables: off-diagonal entries in the covariance matrix reflect the dependencies
- conducts a principal component analysis (PCA) of steps y_w, sequentially in time and space: the eigenvectors of the covariance matrix C are the principal components / the principal axes of the mutation ellipsoid
- learns a new rotated problem representation: components are independent (only) in the new representation
- learns a new (Mahalanobis) metric: variable metric method
- approximates the inverse Hessian on quadratic functions: transformation into the sphere function
- for µ = 1: conducts a natural gradient ascent on the distribution N
- is entirely independent of the given coordinate system
... cumulation, rank-µ

79 Outline reminder, part 4: Covariance Matrix Adaptation; next subsection: Cumulation, the Evolution Path

80 Covariance Matrix Adaptation: Cumulation, the Evolution Path
Evolution path: conceptually, the evolution path is the search path the strategy takes over a number of generation steps. It can be expressed as a sum of consecutive steps of the mean m.
An exponentially weighted sum of steps y_w is used,
p_c ∝ Σ_{i=0}^{g} (1 - c_c)^{g-i} y_w^{(i)}   (exponentially fading weights)
The recursive construction of the evolution path (cumulation):
p_c ← (1 - c_c) p_c + √(1 - (1 - c_c)^2) √µ_w y_w
  (decay factor; normalization factor; input y_w = (m - m_old)/σ)
where µ_w = 1 / Σ w_i^2 and c_c ≪ 1. History information is accumulated in the evolution path.

82 Covariance Matrix Adaptation: Cumulation, the Evolution Path
Cumulation is a widely used technique and is also known as
- exponential smoothing in time series and forecasting
- exponentially weighted moving average
- iterate averaging in stochastic approximation
- momentum in the back-propagation algorithm for ANNs
- ...
Cumulation conducts a low-pass filtering, but there is more to it ... why?

83 Covariance Matrix Adaptation: Cumulation, Utilizing the Evolution Path
We used y_w y_w^T for updating C. Because y_w y_w^T = (-y_w)(-y_w)^T, the sign of y_w is lost. The sign information (signifying correlation between steps) is (re-)introduced by using the evolution path:
p_c ← (1 - c_c) p_c + √(1 - (1 - c_c)^2) √µ_w y_w   (decay factor; normalization factor)
C ← (1 - c_cov) C + c_cov p_c p_c^T   (rank-one)
where µ_w = 1 / Σ w_i^2 and c_cov ≪ c_c ≪ 1, such that 1/c_c is the backward time horizon.
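
Both equations combined in a small Python helper; the parameter values c_c = 4/n and c_cov = 2/n^2 mirror the defaults quoted earlier and are meant only as an illustration:

# Rank-one update of C driven by the cumulated evolution path p_c.
import numpy as np

def path_rank_one_update(C, p_c, y_w, mu_w, n):
    c_c = 4.0 / n                                # cumulation parameter, c_c << 1
    c_cov = 2.0 / n**2                           # learning rate, c_cov << c_c
    p_c = (1 - c_c) * p_c + np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * y_w
    C = (1 - c_cov) * C + c_cov * np.outer(p_c, p_c)
    return C, p_c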

86 Covariance Matrix Adaptation: Cumulation, the Evolution Path
Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations to adapt to a straight ridge from about O(n^2) to O(n). (a)
a Hansen, Müller and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1)
[figure: number of f-evaluations divided by dimension, versus dimension, on the cigar function f(x) = x_1^2 + 10^6 Σ_{i=2}^{n} x_i^2, for c_c = 1 (no cumulation), c_c = 1/√n, and c_c = 1/n]
The overall model complexity is n^2, but important parts of the model can be learned in time of order n.

87 Covariance Matrix Adaptation: Rank-µ Update
x_i = m + σ y_i, y_i ∼ N_i(0, C), m ← m + σ y_w, y_w = Σ_{i=1}^{µ} w_i y_{i:λ}
The rank-µ update extends the update rule for large population sizes λ, using µ > 1 vectors to update C at each generation step. The weighted empirical covariance matrix
C_µ = Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}^T
computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one. (With µ = λ, weights can be negative (10).)
The rank-µ update then reads
C ← (1 - c_cov) C + c_cov C_µ
where c_cov ≈ µ_w/n^2 and c_cov ≤ 1.
10 Jastrebski and Arnold (2006). Improving evolution strategies through active covariance matrix adaptation. CEC.
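
A direct Python transcription of these two formulas; y_sel is assumed to hold the µ best steps in ranked order:

# Rank-mu covariance matrix update from the mu best steps.
import numpy as np

def rank_mu_update(C, y_sel, w, mu_w, n):
    # y_sel: array of shape (mu, n) with the ranked steps y_{1:lambda}, ..., y_{mu:lambda}
    c_cov = min(1.0, mu_w / n**2)                # learning rate c_cov ~ mu_w / n^2, capped at 1
    C_mu = (w[:, None, None] * np.einsum('ij,ik->ijk', y_sel, y_sel)).sum(axis=0)
    return (1 - c_cov) * C + c_cov * C_mu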

90 Covariance Matrix Adaptation: Rank-µ Update (illustration)
x_i = m + σ y_i, y_i ∼ N(0, C)
C_µ = (1/µ) Σ y_{i:λ} y_{i:λ}^T, C ← (1 - 1) × C + 1 × C_µ, m_new ← m + (1/µ) Σ y_{i:λ}
[figure sequence: sampling of λ = 150 solutions where C = I and σ = 1; calculating C where µ = 50, w_1 = ... = w_µ = 1/µ, and c_cov = 1; new distribution]

91 Covariance Matrix Adaptation: Rank-µ CMA versus Estimation of Multivariate Normal Algorithm EMNA_global (11)
rank-µ CMA:
x_i = m_old + y_i, y_i ∼ N(0, C)
C ← (1/µ) Σ (x_{i:λ} - m_old)(x_{i:λ} - m_old)^T
m_new = m_old + (1/µ) Σ y_{i:λ}
rank-µ CMA conducts a PCA of steps.
EMNA_global:
x_i = m_old + y_i, y_i ∼ N(0, C)
C ← (1/µ) Σ (x_{i:λ} - m_new)(x_{i:λ} - m_new)^T
m_new = m_old + (1/µ) Σ y_{i:λ}
EMNA_global conducts a PCA of points; m_new is the minimizer for the variances when calculating C.
[figures: sampling of λ = 150 solutions; calculating C from µ = 50 solutions (dots); new distribution]
11 Hansen, N. (2006). The CMA Evolution Strategy: A Comparing Review. In J.A. Lozano, P. Larranaga, I. Inza and E. Bengoetxea (Eds.). Towards a new evolutionary computation. Advances in estimation of distribution algorithms.
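
The difference is only the reference point used when estimating C. A minimal Python sketch contrasting the two estimators; the function names are illustrative:

# Estimate C from the mu selected points: steps around the OLD mean (rank-mu CMA flavour)
# versus deviations from the NEW mean (EMNA_global flavour).
import numpy as np

def c_from_steps(x_sel, m_old):          # rank-mu CMA: PCA of steps
    d = x_sel - m_old
    return d.T @ d / len(x_sel)

def c_from_points(x_sel):                # EMNA_global: PCA of points
    m_new = x_sel.mean(axis=0)
    d = x_sel - m_new
    return d.T @ d / len(x_sel)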

92 Covariance Matrix Adaptation: Rank-µ Update
The rank-µ update
- increases the possible learning rate in large populations, roughly from 2/n^2 to µ_w/n^2
- can reduce the number of necessary generations roughly from O(n^2) to O(n) (12), given µ_w ∝ λ ∝ n
Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say λ ≥ 3n + 10.
The rank-one update
- uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from O(n^2) to O(n).
Rank-one update and rank-µ update can be combined.
12 Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1)
... all equations

95 CMA-ES Summary of Equations: The Covariance Matrix Adaptation Evolution Strategy
Input: m ∈ R^n, σ ∈ R_+, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n^2, c_µ ≈ µ_w/n^2, c_1 + c_µ ≤ 1, d_σ ≈ 1 + √(µ_w/n), and w_{i=1...λ} such that µ_w = 1 / Σ_{i=1}^{µ} w_i^2 ≈ 0.3 λ
While not terminate
  x_i = m + σ y_i, y_i ∼ N_i(0, C), for i = 1,...,λ   (sampling)
  m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ y_w, where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}   (update mean)
  p_c ← (1 - c_c) p_c + 1I_{‖p_σ‖ < 1.5√n} √(1 - (1 - c_c)^2) √µ_w y_w   (cumulation for C)
  p_σ ← (1 - c_σ) p_σ + √(1 - (1 - c_σ)^2) √µ_w C^{-1/2} y_w   (cumulation for σ)
  C ← (1 - c_1 - c_µ) C + c_1 p_c p_c^T + c_µ Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}^T   (update C)
  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ - 1 ) )   (update of σ)
Not covered on this slide: termination, restarts, useful output, boundaries and encoding

96 CMA-ES Summary: Source Code Snippet
[the slide shows a short source-code listing of the algorithm; it is not recoverable from this transcription, see the illustrative sketch below]
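
For reference, a minimal, illustrative Python/NumPy implementation of the equations summarized above. This is a sketch, not the authors' original snippet: the parameter settings are simplified defaults, and the function name cma_es, the seeded RNG and the stopping rule are assumptions made for this example.

# Minimal CMA-ES sketch following the summary of equations (illustrative only).
import numpy as np

def cma_es(f, x0, sigma0, max_evals=10000, seed=1):
    rng = np.random.default_rng(seed)
    n = len(x0)
    m = np.array(x0, dtype=float)
    sigma = float(sigma0)

    # selection and recombination
    lam = 4 + int(3 * np.log(n))                   # population size lambda
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w = w / w.sum()                                # positive weights summing to one
    mu_w = 1.0 / np.sum(w ** 2)                    # variance effective selection mass

    # adaptation parameters (simplified defaults)
    c_sigma = (mu_w + 2) / (n + mu_w + 5)
    d_sigma = 1 + np.sqrt(mu_w / n)
    c_c = 4.0 / (n + 4)
    c_1 = 2.0 / (n ** 2 + 2 * n)
    c_mu = min(1 - c_1, mu_w / (n ** 2 + 2 * n))
    chi_n = np.sqrt(n) * (1 - 1.0 / (4 * n) + 1.0 / (21 * n ** 2))  # approx. E||N(0, I)||

    p_sigma, p_c, C, evals = np.zeros(n), np.zeros(n), np.eye(n), 0

    while evals < max_evals:
        # factorize C for sampling (y ~ N(0, C)) and for C^{-1/2}
        eigvals, B = np.linalg.eigh(C)
        D = np.sqrt(np.maximum(eigvals, 1e-20))

        z = rng.standard_normal((lam, n))
        y = (z * D) @ B.T                           # y_i = B D z_i ~ N(0, C)
        x = m + sigma * y                           # sampling
        fx = np.array([f(xi) for xi in x])
        evals += lam

        idx = np.argsort(fx)[:mu]                   # ranking f(x_{1:lam}) <= ... <= f(x_{mu:lam})
        y_w = w @ y[idx]
        m = m + sigma * y_w                         # update mean

        # cumulation: evolution paths for sigma and C
        c_inv_sqrt_yw = B @ ((B.T @ y_w) / D)       # C^{-1/2} y_w
        p_sigma = (1 - c_sigma) * p_sigma \
                  + np.sqrt(1 - (1 - c_sigma) ** 2) * np.sqrt(mu_w) * c_inv_sqrt_yw
        h_sigma = np.linalg.norm(p_sigma) < 1.5 * np.sqrt(n)
        p_c = (1 - c_c) * p_c \
              + h_sigma * np.sqrt(1 - (1 - c_c) ** 2) * np.sqrt(mu_w) * y_w

        # covariance matrix update: rank-one plus rank-mu
        rank_mu = (w[:, None, None] * np.einsum('ij,ik->ijk', y[idx], y[idx])).sum(axis=0)
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) + c_mu * rank_mu

        # step-size update (CSA)
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))

        if fx[idx[0]] < 1e-10:                      # simple stopping rule for this sketch
            break
    return m, evals

Running it on the sphere function, for example cma_es(lambda x: float(np.sum(x**2)), [0.5] * 10, 0.3), should reach the optimum within a few thousand function evaluations.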


More information

Variable Metric Reinforcement Learning Methods Applied to the Noisy Mountain Car Problem

Variable Metric Reinforcement Learning Methods Applied to the Noisy Mountain Car Problem Variable Metric Reinforcement Learning Methods Applied to the Noisy Mountain Car Problem Verena Heidrich-Meisner and Christian Igel Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany {Verena.Heidrich-Meisner,Christian.Igel}@neuroinformatik.rub.de

More information

A comparative introduction to two optimization classics: the Nelder-Mead and the CMA-ES algorithms

A comparative introduction to two optimization classics: the Nelder-Mead and the CMA-ES algorithms A comparative introduction to two optimization classics: the Nelder-Mead and the CMA-ES algorithms Rodolphe Le Riche 1,2 1 Ecole des Mines de Saint Etienne, France 2 CNRS LIMOS France March 2018 MEXICO

More information

Cumulative Step-size Adaptation on Linear Functions

Cumulative Step-size Adaptation on Linear Functions Cumulative Step-size Adaptation on Linear Functions Alexandre Chotard, Anne Auger, Nikolaus Hansen To cite this version: Alexandre Chotard, Anne Auger, Nikolaus Hansen. Cumulative Step-size Adaptation

More information

Benchmarking a Weighted Negative Covariance Matrix Update on the BBOB-2010 Noiseless Testbed

Benchmarking a Weighted Negative Covariance Matrix Update on the BBOB-2010 Noiseless Testbed Benchmarking a Weighted Negative Covariance Matrix Update on the BBOB- Noiseless Testbed Nikolaus Hansen, Raymond Ros To cite this version: Nikolaus Hansen, Raymond Ros. Benchmarking a Weighted Negative

More information

Covariance Matrix Adaptation in Multiobjective Optimization

Covariance Matrix Adaptation in Multiobjective Optimization Covariance Matrix Adaptation in Multiobjective Optimization Dimo Brockhoff INRIA Lille Nord Europe October 30, 2014 PGMO-COPI 2014, Ecole Polytechnique, France Mastertitelformat Scenario: Multiobjective

More information

BI-population CMA-ES Algorithms with Surrogate Models and Line Searches

BI-population CMA-ES Algorithms with Surrogate Models and Line Searches BI-population CMA-ES Algorithms with Surrogate Models and Line Searches Ilya Loshchilov 1, Marc Schoenauer 2 and Michèle Sebag 2 1 LIS, École Polytechnique Fédérale de Lausanne 2 TAO, INRIA CNRS Université

More information

Bounding the Population Size of IPOP-CMA-ES on the Noiseless BBOB Testbed

Bounding the Population Size of IPOP-CMA-ES on the Noiseless BBOB Testbed Bounding the Population Size of OP-CMA-ES on the Noiseless BBOB Testbed Tianjun Liao IRIDIA, CoDE, Université Libre de Bruxelles (ULB), Brussels, Belgium tliao@ulb.ac.be Thomas Stützle IRIDIA, CoDE, Université

More information

Natural Evolution Strategies for Direct Search

Natural Evolution Strategies for Direct Search Tobias Glasmachers Natural Evolution Strategies for Direct Search 1 Natural Evolution Strategies for Direct Search PGMO-COPI 2014 Recent Advances on Continuous Randomized black-box optimization Thursday

More information

Three Steps toward Tuning the Coordinate Systems in Nature-Inspired Optimization Algorithms

Three Steps toward Tuning the Coordinate Systems in Nature-Inspired Optimization Algorithms Three Steps toward Tuning the Coordinate Systems in Nature-Inspired Optimization Algorithms Yong Wang and Zhi-Zhong Liu School of Information Science and Engineering Central South University ywang@csu.edu.cn

More information

A CMA-ES for Mixed-Integer Nonlinear Optimization

A CMA-ES for Mixed-Integer Nonlinear Optimization A CMA-ES for Mixed-Integer Nonlinear Optimization Nikolaus Hansen To cite this version: Nikolaus Hansen. A CMA-ES for Mixed-Integer Nonlinear Optimization. [Research Report] RR-, INRIA.. HAL Id: inria-

More information

Three Steps toward Tuning the Coordinate Systems in Nature-Inspired Optimization Algorithms

Three Steps toward Tuning the Coordinate Systems in Nature-Inspired Optimization Algorithms Three Steps toward Tuning the Coordinate Systems in Nature-Inspired Optimization Algorithms Yong Wang and Zhi-Zhong Liu School of Information Science and Engineering Central South University ywang@csu.edu.cn

More information

Completely Derandomized Self-Adaptation in Evolution Strategies

Completely Derandomized Self-Adaptation in Evolution Strategies Completely Derandomized Self-Adaptation in Evolution Strategies Nikolaus Hansen and Andreas Ostermeier In Evolutionary Computation 9(2) pp. 159-195 (2001) Errata Section 3 footnote 9: We use the expectation

More information

Real-Parameter Black-Box Optimization Benchmarking 2010: Presentation of the Noiseless Functions

Real-Parameter Black-Box Optimization Benchmarking 2010: Presentation of the Noiseless Functions Real-Parameter Black-Box Optimization Benchmarking 2010: Presentation of the Noiseless Functions Steffen Finck, Nikolaus Hansen, Raymond Ros and Anne Auger Report 2009/20, Research Center PPE, re-compiled

More information

The Dispersion Metric and the CMA Evolution Strategy

The Dispersion Metric and the CMA Evolution Strategy The Dispersion Metric and the CMA Evolution Strategy Monte Lunacek Department of Computer Science Colorado State University Fort Collins, CO 80523 lunacek@cs.colostate.edu Darrell Whitley Department of

More information

A (1+1)-CMA-ES for Constrained Optimisation

A (1+1)-CMA-ES for Constrained Optimisation A (1+1)-CMA-ES for Constrained Optimisation Dirk Arnold, Nikolaus Hansen To cite this version: Dirk Arnold, Nikolaus Hansen. A (1+1)-CMA-ES for Constrained Optimisation. Terence Soule and Jason H. Moore.

More information

Comparison-Based Optimizers Need Comparison-Based Surrogates

Comparison-Based Optimizers Need Comparison-Based Surrogates Comparison-Based Optimizers Need Comparison-Based Surrogates Ilya Loshchilov 1,2, Marc Schoenauer 1,2, and Michèle Sebag 2,1 1 TAO Project-team, INRIA Saclay - Île-de-France 2 Laboratoire de Recherche

More information

A CMA-ES with Multiplicative Covariance Matrix Updates

A CMA-ES with Multiplicative Covariance Matrix Updates A with Multiplicative Covariance Matrix Updates Oswin Krause Department of Computer Science University of Copenhagen Copenhagen,Denmark oswin.krause@di.ku.dk Tobias Glasmachers Institut für Neuroinformatik

More information

Efficient Covariance Matrix Update for Variable Metric Evolution Strategies

Efficient Covariance Matrix Update for Variable Metric Evolution Strategies Efficient Covariance Matrix Update for Variable Metric Evolution Strategies Thorsten Suttorp, Nikolaus Hansen, Christian Igel To cite this version: Thorsten Suttorp, Nikolaus Hansen, Christian Igel. Efficient

More information

Parameter Estimation of Complex Chemical Kinetics with Covariance Matrix Adaptation Evolution Strategy

Parameter Estimation of Complex Chemical Kinetics with Covariance Matrix Adaptation Evolution Strategy MATCH Communications in Mathematical and in Computer Chemistry MATCH Commun. Math. Comput. Chem. 68 (2012) 469-476 ISSN 0340-6253 Parameter Estimation of Complex Chemical Kinetics with Covariance Matrix

More information

Investigating the Local-Meta-Model CMA-ES for Large Population Sizes

Investigating the Local-Meta-Model CMA-ES for Large Population Sizes Investigating the Local-Meta-Model CMA-ES for Large Population Sizes Zyed Bouzarkouna, Anne Auger, Didier Yu Ding To cite this version: Zyed Bouzarkouna, Anne Auger, Didier Yu Ding. Investigating the Local-Meta-Model

More information

CLASSICAL gradient methods and evolutionary algorithms

CLASSICAL gradient methods and evolutionary algorithms IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 2, NO. 2, JULY 1998 45 Evolutionary Algorithms and Gradient Search: Similarities and Differences Ralf Salomon Abstract Classical gradient methods and

More information

Computational Intelligence Winter Term 2017/18

Computational Intelligence Winter Term 2017/18 Computational Intelligence Winter Term 2017/18 Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS 11) Fakultät für Informatik TU Dortmund mutation: Y = X + Z Z ~ N(0, C) multinormal distribution

More information

A0M33EOA: EAs for Real-Parameter Optimization. Differential Evolution. CMA-ES.

A0M33EOA: EAs for Real-Parameter Optimization. Differential Evolution. CMA-ES. A0M33EOA: EAs for Real-Parameter Optimization. Differential Evolution. CMA-ES. Petr Pošík Czech Technical University in Prague Faculty of Electrical Engineering Department of Cybernetics Many parts adapted

More information

Noisy Optimization: A Theoretical Strategy Comparison of ES, EGS, SPSA & IF on the Noisy Sphere

Noisy Optimization: A Theoretical Strategy Comparison of ES, EGS, SPSA & IF on the Noisy Sphere Noisy Optimization: A Theoretical Strategy Comparison of ES, EGS, SPSA & IF on the Noisy Sphere S. Finck Vorarlberg University of Applied Sciences Hochschulstrasse 1 Dornbirn, Austria steffen.finck@fhv.at

More information

Benchmarking Gaussian Processes and Random Forests on the BBOB Noiseless Testbed

Benchmarking Gaussian Processes and Random Forests on the BBOB Noiseless Testbed Benchmarking Gaussian Processes and Random Forests on the BBOB Noiseless Testbed Lukáš Bajer,, Zbyněk Pitra,, Martin Holeňa Faculty of Mathematics and Physics, Charles University, Institute of Computer

More information

Viability Principles for Constrained Optimization Using a (1+1)-CMA-ES

Viability Principles for Constrained Optimization Using a (1+1)-CMA-ES Viability Principles for Constrained Optimization Using a (1+1)-CMA-ES Andrea Maesani and Dario Floreano Laboratory of Intelligent Systems, Institute of Microengineering, Ecole Polytechnique Fédérale de

More information

Comparison of NEWUOA with Different Numbers of Interpolation Points on the BBOB Noiseless Testbed

Comparison of NEWUOA with Different Numbers of Interpolation Points on the BBOB Noiseless Testbed Comparison of NEWUOA with Different Numbers of Interpolation Points on the BBOB Noiseless Testbed Raymond Ros To cite this version: Raymond Ros. Comparison of NEWUOA with Different Numbers of Interpolation

More information

Tuning Parameters across Mixed Dimensional Instances: A Performance Scalability Study of Sep-G-CMA-ES

Tuning Parameters across Mixed Dimensional Instances: A Performance Scalability Study of Sep-G-CMA-ES Université Libre de Bruxelles Institut de Recherches Interdisciplinaires et de Développements en Intelligence Artificielle Tuning Parameters across Mixed Dimensional Instances: A Performance Scalability

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Blackbox Optimization Marc Toussaint U Stuttgart Blackbox Optimization The term is not really well defined I use it to express that only f(x) can be evaluated f(x) or 2 f(x)

More information

Benchmarking Natural Evolution Strategies with Adaptation Sampling on the Noiseless and Noisy Black-box Optimization Testbeds

Benchmarking Natural Evolution Strategies with Adaptation Sampling on the Noiseless and Noisy Black-box Optimization Testbeds Benchmarking Natural Evolution Strategies with Adaptation Sampling on the Noiseless and Noisy Black-box Optimization Testbeds Tom Schaul Courant Institute of Mathematical Sciences, New York University

More information

Challenges in High-dimensional Reinforcement Learning with Evolution Strategies

Challenges in High-dimensional Reinforcement Learning with Evolution Strategies Challenges in High-dimensional Reinforcement Learning with Evolution Strategies Nils Müller and Tobias Glasmachers Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany {nils.mueller, tobias.glasmachers}@ini.rub.de

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

TECHNISCHE UNIVERSITÄT DORTMUND REIHE COMPUTATIONAL INTELLIGENCE COLLABORATIVE RESEARCH CENTER 531

TECHNISCHE UNIVERSITÄT DORTMUND REIHE COMPUTATIONAL INTELLIGENCE COLLABORATIVE RESEARCH CENTER 531 TECHNISCHE UNIVERSITÄT DORTMUND REIHE COMPUTATIONAL INTELLIGENCE COLLABORATIVE RESEARCH CENTER 5 Design and Management of Complex Technical Processes and Systems by means of Computational Intelligence

More information

Evolutionary Gradient Search Revisited

Evolutionary Gradient Search Revisited !"#%$'&(! )*+,.-0/214365879/;:!=0? @B*CEDFGHC*I>JK9ML NPORQTS9UWVRX%YZYW[*V\YP] ^`_acb.d*e%f`gegih jkelm_acnbpò q"r otsvu_nxwmybz{lm }wm~lmw gih;g ~ c ƒwmÿ x }n+bp twe eˆbkea} }q k2špe oœ Ek z{lmoenx

More information

Local-Meta-Model CMA-ES for Partially Separable Functions

Local-Meta-Model CMA-ES for Partially Separable Functions Local-Meta-Model CMA-ES for Partially Separable Functions Zyed Bouzarkouna, Anne Auger, Didier Yu Ding To cite this version: Zyed Bouzarkouna, Anne Auger, Didier Yu Ding. Local-Meta-Model CMA-ES for Partially

More information

Stochastic Search using the Natural Gradient

Stochastic Search using the Natural Gradient Keywords: stochastic search, natural gradient, evolution strategies Sun Yi Daan Wierstra Tom Schaul Jürgen Schmidhuber IDSIA, Galleria 2, Manno 6928, Switzerland yi@idsiach daan@idsiach tom@idsiach juergen@idsiach

More information

Gradient-based Adaptive Stochastic Search

Gradient-based Adaptive Stochastic Search 1 / 41 Gradient-based Adaptive Stochastic Search Enlu Zhou H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology November 5, 2014 Outline 2 / 41 1 Introduction

More information

Genetic Algorithm: introduction

Genetic Algorithm: introduction 1 Genetic Algorithm: introduction 2 The Metaphor EVOLUTION Individual Fitness Environment PROBLEM SOLVING Candidate Solution Quality Problem 3 The Ingredients t reproduction t + 1 selection mutation recombination

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Chapter 10. Theory of Evolution Strategies: A New Perspective

Chapter 10. Theory of Evolution Strategies: A New Perspective A. Auger and N. Hansen Theory of Evolution Strategies: a New Perspective In: A. Auger and B. Doerr, eds. (2010). Theory of Randomized Search Heuristics: Foundations and Recent Developments. World Scientific

More information

Evolutionary Ensemble Strategies for Heuristic Scheduling

Evolutionary Ensemble Strategies for Heuristic Scheduling 0 International Conference on Computational Science and Computational Intelligence Evolutionary Ensemble Strategies for Heuristic Scheduling Thomas Philip Runarsson School of Engineering and Natural Science

More information

arxiv: v4 [math.oc] 28 Apr 2017

arxiv: v4 [math.oc] 28 Apr 2017 Journal of Machine Learning Research 18 (2017) 1-65 Submitted 11/14; Revised 10/16; Published 4/17 Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles arxiv:1106.3708v4

More information

Numerical Optimization: Basic Concepts and Algorithms

Numerical Optimization: Basic Concepts and Algorithms May 27th 2015 Numerical Optimization: Basic Concepts and Algorithms R. Duvigneau R. Duvigneau - Numerical Optimization: Basic Concepts and Algorithms 1 Outline Some basic concepts in optimization Some

More information

Benchmarking Projection-Based Real Coded Genetic Algorithm on BBOB-2013 Noiseless Function Testbed

Benchmarking Projection-Based Real Coded Genetic Algorithm on BBOB-2013 Noiseless Function Testbed Benchmarking Projection-Based Real Coded Genetic Algorithm on BBOB-2013 Noiseless Function Testbed Babatunde Sawyerr 1 Aderemi Adewumi 2 Montaz Ali 3 1 University of Lagos, Lagos, Nigeria 2 University

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Computational Methods. Least Squares Approximation/Optimization

Computational Methods. Least Squares Approximation/Optimization Computational Methods Least Squares Approximation/Optimization Manfred Huber 2011 1 Least Squares Least squares methods are aimed at finding approximate solutions when no precise solution exists Find the

More information

arxiv: v1 [cs.ne] 9 May 2016

arxiv: v1 [cs.ne] 9 May 2016 Anytime Bi-Objective Optimization with a Hybrid Multi-Objective CMA-ES (HMO-CMA-ES) arxiv:1605.02720v1 [cs.ne] 9 May 2016 ABSTRACT Ilya Loshchilov University of Freiburg Freiburg, Germany ilya.loshchilov@gmail.com

More information

arxiv: v1 [cs.ne] 29 Jul 2014

arxiv: v1 [cs.ne] 29 Jul 2014 A CUDA-Based Real Parameter Optimization Benchmark Ke Ding and Ying Tan School of Electronics Engineering and Computer Science, Peking University arxiv:1407.7737v1 [cs.ne] 29 Jul 2014 Abstract. Benchmarking

More information

Quality Gain Analysis of the Weighted Recombination Evolution Strategy on General Convex Quadratic Functions

Quality Gain Analysis of the Weighted Recombination Evolution Strategy on General Convex Quadratic Functions Quality Gain Analysis of the Weighted Recombination Evolution Strategy on General Convex Quadratic Functions Youhei Akimoto, Anne Auger, Nikolaus Hansen To cite this version: Youhei Akimoto, Anne Auger,

More information

Optimization using derivative-free and metaheuristic methods

Optimization using derivative-free and metaheuristic methods Charles University in Prague Faculty of Mathematics and Physics MASTER THESIS Kateřina Márová Optimization using derivative-free and metaheuristic methods Department of Numerical Mathematics Supervisor

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Decomposition and Metaoptimization of Mutation Operator in Differential Evolution

Decomposition and Metaoptimization of Mutation Operator in Differential Evolution Decomposition and Metaoptimization of Mutation Operator in Differential Evolution Karol Opara 1 and Jaros law Arabas 2 1 Systems Research Institute, Polish Academy of Sciences 2 Institute of Electronic

More information

Particle Swarm Optimization with Velocity Adaptation

Particle Swarm Optimization with Velocity Adaptation In Proceedings of the International Conference on Adaptive and Intelligent Systems (ICAIS 2009), pp. 146 151, 2009. c 2009 IEEE Particle Swarm Optimization with Velocity Adaptation Sabine Helwig, Frank

More information

The particle swarm optimization algorithm: convergence analysis and parameter selection

The particle swarm optimization algorithm: convergence analysis and parameter selection Information Processing Letters 85 (2003) 317 325 www.elsevier.com/locate/ipl The particle swarm optimization algorithm: convergence analysis and parameter selection Ioan Cristian Trelea INA P-G, UMR Génie

More information

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN , ESANN'200 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), 25-27 April 200, D-Facto public., ISBN 2-930307-0-3, pp. 79-84 Investigating the Influence of the Neighborhood

More information

THIS paper considers the general nonlinear programming

THIS paper considers the general nonlinear programming IEEE TRANSACTIONS ON SYSTEM, MAN, AND CYBERNETICS: PART C, VOL. X, NO. XX, MONTH 2004 (SMCC KE-09) 1 Search Biases in Constrained Evolutionary Optimization Thomas Philip Runarsson, Member, IEEE, and Xin

More information

Power Prediction in Smart Grids with Evolutionary Local Kernel Regression

Power Prediction in Smart Grids with Evolutionary Local Kernel Regression Power Prediction in Smart Grids with Evolutionary Local Kernel Regression Oliver Kramer, Benjamin Satzger, and Jörg Lässig International Computer Science Institute, Berkeley CA 94704, USA, {okramer, satzger,

More information

A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models. Isabelle Rivals and Léon Personnaz

A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models. Isabelle Rivals and Léon Personnaz In Neurocomputing 2(-3): 279-294 (998). A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models Isabelle Rivals and Léon Personnaz Laboratoire d'électronique,

More information

arxiv: v6 [math.oc] 11 May 2018

arxiv: v6 [math.oc] 11 May 2018 Quality Gain Analysis of the Weighted Recombination Evolution Strategy on General Convex Quadratic Functions Youhei Akimoto a,, Anne Auger b, Nikolaus Hansen b a Faculty of Engineering, Information and

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Local Search & Optimization

Local Search & Optimization Local Search & Optimization CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 4 Some

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Investigation of Mutation Strategies in Differential Evolution for Solving Global Optimization Problems

Investigation of Mutation Strategies in Differential Evolution for Solving Global Optimization Problems Investigation of Mutation Strategies in Differential Evolution for Solving Global Optimization Problems Miguel Leon Ortiz and Ning Xiong Mälardalen University, Västerås, SWEDEN Abstract. Differential evolution

More information

Efficient Natural Evolution Strategies

Efficient Natural Evolution Strategies Efficient Natural Evolution Strategies Evolution Strategies and Evolutionary Programming Track ABSTRACT Yi Sun Manno 698, Switzerland yi@idsiach Tom Schaul Manno 698, Switzerland tom@idsiach Efficient

More information

BBOB-Benchmarking Two Variants of the Line-Search Algorithm

BBOB-Benchmarking Two Variants of the Line-Search Algorithm BBOB-Benchmarking Two Variants of the Line-Search Algorithm Petr Pošík Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics Technická, Prague posik@labe.felk.cvut.cz

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Benchmarking Gaussian Processes and Random Forests Surrogate Models on the BBOB Noiseless Testbed

Benchmarking Gaussian Processes and Random Forests Surrogate Models on the BBOB Noiseless Testbed Benchmarking Gaussian Processes and Random Forests Surrogate Models on the BBOB Noiseless Testbed Lukáš Bajer Institute of Computer Science Academy of Sciences of the Czech Republic and Faculty of Mathematics

More information

Evolutionary Algorithms

Evolutionary Algorithms Evolutionary Algorithms a short introduction Giuseppe Narzisi Courant Institute of Mathematical Sciences New York University 31 January 2008 Outline 1 Evolution 2 Evolutionary Computation 3 Evolutionary

More information

Natural Evolution Strategies

Natural Evolution Strategies Natural Evolution Strategies Daan Wierstra, Tom Schaul, Jan Peters and Juergen Schmidhuber Abstract This paper presents Natural Evolution Strategies (NES), a novel algorithm for performing real-valued

More information

A Method for Handling Uncertainty in Evolutionary Optimization with an Application to Feedback Control of Combustion

A Method for Handling Uncertainty in Evolutionary Optimization with an Application to Feedback Control of Combustion A Method for Handling Uncertainty in Evolutionary Optimization with an Application to Feedback Control of Combustion Nikolaus Hansen, Andre Niederberger, Lino Guzzella, Petros Koumoutsakos To cite this

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information