Natural Evolution Strategies for Direct Search


PGMO-COPI 2014: Recent Advances on Continuous Randomized Black-box Optimization
Thursday, October 30, 2014
Tobias Glasmachers, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany

Introduction

Function value free black-box optimization:
1. Minimize f : R^d → R.
2. 0-th order (direct search) setting.
3. Black-box model: x → [f] → f(x).
4. Don't ever rely on function values, only on comparisons f(x) < f(x′), e.g., for ranking.

Introduction

Algorithm operations:
1. Generate a solution x ∈ R^d (most of the logic).
2. Evaluate f(x) (black box, most of the computation time).
3. Compare (rank) against other solutions: f(x) < f(x′)?

Introduction: Challenges

Facing a black-box objective, anything can happen. We'd better prepare for the following challenges:
- non-smooth or even discontinuous objective
- multi-modality
- observations of objective values perturbed by noise
- high dimensionality, e.g., d ≈ 1000
- black-box constraints (possibly non-smooth, ...)

Introduction: Application Examples

1. Tuning of parameters of expensive simulations, e.g., in engineering.
2. Tuning of parameters of machine learning algorithms.

Evolution Strategies

    input m ∈ R^d, σ > 0, C (= I) ∈ R^{d×d}
    loop
        sample offspring x_1, ..., x_λ ~ N(m, σ² C)
        evaluate f(x_1), ..., f(x_λ)
        select new population of size µ (e.g., the best offspring)
        update mean vector m
        update global step size σ
        update covariance matrix C
    until stopping criterion met

- randomized
- population-based
- rank-based (function-value free)
- step size control
- metric learning (covariance matrix adaptation, CMA)
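
To make the loop concrete, here is a minimal Python sketch of a (µ,λ) evolution strategy on the sphere function. It only illustrates the structure above: the step size rule is a crude placeholder and the covariance matrix is kept fixed, so this is not CMA-ES.

    # Minimal (mu, lambda) evolution strategy sketch (illustrative only).
    # The step size rule is a crude placeholder; C is not adapted here.
    import numpy as np

    def sphere(x):
        return float(np.dot(x, x))

    def simple_es(f, d=10, lam=20, mu=5, sigma=1.0, iters=200, seed=0):
        rng = np.random.default_rng(seed)
        m = rng.normal(size=d)            # mean of the search distribution
        C = np.eye(d)                     # covariance matrix (fixed in this sketch)
        for _ in range(iters):
            # sample offspring from N(m, sigma^2 C)
            X = rng.multivariate_normal(m, sigma**2 * C, size=lam)
            fx = np.array([f(x) for x in X])
            # rank-based selection: keep the mu best offspring
            best = X[np.argsort(fx)[:mu]]
            m_new = best.mean(axis=0)
            # naive success-based step size heuristic (placeholder, not real step size control)
            sigma *= 1.1 if f(m_new) < f(m) else 0.9
            m = m_new
        return m, sigma

    m, sigma = simple_es(sphere)
    print('f(m) =', sphere(m))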

Evolution Strategies: Design Principles

Why rely on a population?
Not necessary; the (1+1)-ES hillclimber also works, but populations are more robust for noisy and multi-modal problems.

Why online adaptation of the step size σ?
Scale-invariant algorithm ⇒ linear convergence on scale-invariant problems ⇒ locally linear convergence on C² functions.

Why discard function values and keep only ranks?
Invariance under strictly monotonic (rank-preserving) transformations of objective values.

Why CMA?
Efficient optimization of ill-conditioned problems, similar to (quasi-)Newton methods.

This Talk

Given samples and their ranks, how to update the distribution parameters?

Information Geometric Perspective

Change of perspective: the algorithm state is no longer the population x_1, ..., x_µ but the search distribution N(m, C).
Parameters θ = (m, C) ∈ Θ; distribution P_θ with density p_θ.
For multivariate Gaussians: Θ = R^d × P_d, where
    P_d = { M ∈ R^{d×d} : M = Mᵀ, M positive definite }.
This gives a statistical manifold of distributions { P_θ : θ ∈ Θ }, identified with Θ.
It is equipped with an intrinsic (Riemannian) geometry; the metric tensor is given by the Fisher information matrix I(θ).

Information Geometric Perspective

Goal: optimization of θ ∈ Θ instead of x ∈ R^d.
Lift the objective function f : R^d → R to an objective W_f : Θ → R.
Simplest choice:
    W_f(θ) = E_{x~P_θ}[ f(x) ]
More flexible choice:
    W_f(θ) = E_{x~P_θ}[ w(f(x)) ]
with a monotonic weight function w : R → R.

Information Geometric Perspective

The expectation operator adds one degree of smoothness. Hence, under weak assumptions, W_f(θ) can be optimized with gradient descent (GD):
    θ ← θ − η ∇_θ W_f(θ)
Log-likelihood trick:
    ∇_θ W_f(θ) = ∇_θ E_{x~P_θ}[ w(f(x)) ] = E_{x~P_θ}[ ∇_θ log(p_θ(x)) · w(f(x)) ]
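
The trick is a standard score-function identity; assuming p_θ is differentiable in θ and that differentiation and integration may be exchanged, the derivation is one line (in LaTeX):

    \nabla_\theta W_f(\theta)
      = \nabla_\theta \int w(f(x))\, p_\theta(x)\, dx
      = \int w(f(x))\, \nabla_\theta p_\theta(x)\, dx
      = \int w(f(x))\, p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx
      = \mathbb{E}_{x \sim P_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\, w(f(x)) \right],

using \nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x).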

Information Geometric Perspective

Problem: this does not work well. Why?
We have to replace the plain gradient on the (Euclidean) parameter space Θ with the natural gradient, which respects the intrinsic geometry of the statistical manifold:
    ∇̃_θ W_f(θ) = I(θ)⁻¹ ∇_θ W_f(θ)
The natural gradient is invariant under changes of the parameterization θ ↦ P_θ.
The (natural) gradient vector field Θ → TΘ defines the (natural) gradient flow φ^t(θ), tangential to ∇̃_θ W_f(θ).
Optimization: follow the inverse flow curves t ↦ φ^(−t)(θ) from θ into a (local) minimum of W_f.
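
As a concrete special case (a standard textbook computation, added here for illustration): for a one-dimensional Gaussian N(m, σ²) with parameters θ = (m, σ), the Fisher information matrix is diagonal,

    I(\theta) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix},
    \qquad
    \tilde\nabla_\theta W_f(\theta) = I(\theta)^{-1} \nabla_\theta W_f(\theta)
      = \begin{pmatrix} \sigma^2 \, \partial W_f / \partial m \\ (\sigma^2/2) \, \partial W_f / \partial \sigma \end{pmatrix},

so the natural gradient rescales both coordinates by the current step size, which already hints at the scale-invariance discussed earlier.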

Information Geometric Perspective

Black-box setting: the expectation
    E_{x~P_θ}[ ∇_θ log(p_θ(x)) · w(f(x)) ]
is intractable. Monte Carlo (MC) approximation:
    ∇_θ W_f(θ) ≈ G(θ) = (1/λ) Σ_{i=1}^{λ} ∇_θ log(p_θ(x_i)) · w(f(x_i))
for x_1, ..., x_λ ~ P_θ.
The estimator satisfies E[G(θ)] = ∇_θ W_f(θ) (it is unbiased and consistent), hence following G(θ) amounts to stochastic gradient descent.
This yields the MC approximation of the natural gradient, G̃(θ) = I(θ)⁻¹ G(θ).
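
For a Gaussian N(m, C) the score with respect to the mean is ∇_m log p_θ(x) = C⁻¹(x − m), so the mean part of G(θ) can be written down explicitly. A minimal Python sketch (function and variable names are mine, chosen for illustration):

    # MC estimate of the mean part of G(theta) for a Gaussian N(m, C),
    # using the score function grad_m log p(x) = C^{-1} (x - m).
    import numpy as np

    def mc_mean_gradient(f, w, m, C, lam=100, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.multivariate_normal(m, C, size=lam)
        scores = (X - m) @ np.linalg.inv(C)        # row i: grad_m log p(x_i)
        weights = np.array([w(f(x)) for x in X])   # w(f(x_i))
        return (scores * weights[:, None]).mean(axis=0)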

Information Geometric Perspective

Stochastic Natural Gradient Descent (SNGD) update rule:
    θ ← θ − η G̃(θ)
SNGD is a two-fold approximation of the natural gradient flow:
- discretized time: Euler steps,
- randomized gradient: MC sampling.
The natural gradient flow is invariant under the choice of distribution parameters θ ↦ P_θ.
The SNGD algorithm retains this invariance only to first order, due to the Euler steps.

Ollivier et al. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. arXiv:1106.3708, 2011.

Natural (Gradient) Evolution Strategies

Natural (gradient) Evolution Strategies (NES) closely follow this scheme. They were the first ES derived this way.
NES approach: optimization of the expected objective value with Gaussian distributions.
The offspring x_1, ..., x_λ act as the MC sample for the estimation of G(θ).
Closed-form Fisher tensor for N(m, C):
    I_{i,j}(θ) = (∂m/∂θ_i)ᵀ C⁻¹ (∂m/∂θ_j) + ½ tr( C⁻¹ (∂C/∂θ_i) C⁻¹ (∂C/∂θ_j) )
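
For the straightforward parameterization θ = (m, vec(C)), this formula gives a block-diagonal Fisher matrix (a standard identity, stated here in LaTeX for concreteness):

    I(\theta) = \begin{pmatrix} C^{-1} & 0 \\ 0 & \tfrac{1}{2}\, C^{-1} \otimes C^{-1} \end{pmatrix},

with the first block acting on the mean m and the Kronecker-product block acting on vec(C).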

Natural (Gradient) Evolution Strategies

NES algorithm:
    input θ ∈ Θ, λ ∈ N, η > 0
    loop
        sample x_1, ..., x_λ ~ P_θ
        evaluate f(x_1), ..., f(x_λ)
        G(θ) ← (1/λ) Σ_{i=1}^{λ} ∇_θ log(p_θ(x_i)) · f(x_i)
        G̃(θ) ← I(θ)⁻¹ G(θ)
        θ ← θ − η G̃(θ)
    until stopping criterion met

Wierstra et al. Natural Evolution Strategies. CEC, 2008, and JMLR, 2014.
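
A minimal Python sketch of one such iteration for a separable Gaussian N(m, diag(σ)²), where the inverse Fisher matrix reduces to the coordinate-wise factors σ² (for m) and σ²/2 (for σ). Raw f-values are used as in the pseudocode above; the next slide explains why rank-based utilities work better in practice. All names and constants are illustrative.

    # One NES iteration for a separable Gaussian N(m, diag(sigma^2)).
    # Natural gradient = inverse Fisher (sigma^2 for m, sigma^2/2 for sigma)
    # applied to the MC gradient estimate.
    import numpy as np

    def nes_step(f, m, sigma, lam=50, eta=0.01, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        Z = rng.normal(size=(lam, m.size))
        X = m + sigma * Z                      # samples from N(m, diag(sigma^2))
        w = -np.array([f(x) for x in X])       # minimize f = ascend on -f
        # plain gradient via the log-likelihood trick
        g_m = (w[:, None] * Z / sigma).mean(axis=0)            # d log p / dm     = z / sigma
        g_s = (w[:, None] * (Z**2 - 1) / sigma).mean(axis=0)   # d log p / dsigma = (z^2 - 1) / sigma
        # natural gradient update
        m_new = m + eta * sigma**2 * g_m
        sigma_new = np.maximum(sigma + eta * (sigma**2 / 2) * g_s, 1e-12)
        return m_new, sigma_new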

Natural (Gradient) Evolution Strategies

NES works better when replacing f(x_i) in
    G(θ) = (1/λ) Σ_{i=1}^{λ} ∇_θ log(p_θ(x_i)) · f(x_i)
with rank-based utility values w_1, ..., w_λ.
Sample ranks are f-quantile estimators, hence the utility values can be represented as w(f(x_i)) with a special weight function w = w_θ based on the f-quantiles under the current distribution P_θ.
This turns NES into a function value free algorithm.
Benefits: invariance under monotonic transformations of objective values, and linear convergence on scale-invariant problems.
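
One common choice (used in the NES literature; shown here as an illustrative Python helper) assigns each sample a fixed weight depending only on its rank, so the raw f-values never enter the update:

    import numpy as np

    def rank_utilities(lam):
        # utility values w_1 >= ... >= w_lam for ranks 1 (best) to lam (worst),
        # shifted so that they sum to zero
        ranks = np.arange(1, lam + 1)
        u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
        return u / u.sum() - 1.0 / lam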

Exponential Natural Evolution Strategies

xNES (exponential NES) algorithm:
    input (m, A) ∈ Θ, λ ∈ N, η_m, η_A > 0
    loop
        sample z_1, ..., z_λ ~ N(0, I)
        transform x_i ← A z_i + m, i = 1, ..., λ
        evaluate f(x_1), ..., f(x_λ)
        G_m(θ) ← (1/λ) Σ_{i=1}^{λ} w(f(x_i)) · z_i
        G_C(θ) ← (1/λ) Σ_{i=1}^{λ} w(f(x_i)) · ½ (z_i z_iᵀ − I)
        m ← m + η_m · A G_m(θ)
        A ← A · exp( ½ η_A G_C(θ) )
    until stopping criterion met

Glasmachers et al. Exponential Natural Evolution Strategies. GECCO, 2010.
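
A minimal Python sketch of this loop on the sphere function. The utility weights, learning rates, and the symmetric matrix exponential helper are illustrative choices, not the tuned defaults from the xNES paper; the best offspring receives the largest utility, so the mean moves toward better samples.

    # xNES-style sketch following the pseudocode above (illustrative only).
    import numpy as np

    def expm_sym(S):
        # matrix exponential of a symmetric matrix via eigendecomposition
        vals, vecs = np.linalg.eigh(S)
        return (vecs * np.exp(vals)) @ vecs.T

    def xnes(f, d=5, lam=20, eta_m=1.0, eta_A=0.1, iters=500, seed=0):
        rng = np.random.default_rng(seed)
        m, A = rng.normal(size=d), np.eye(d)
        u = np.linspace(1.0, -1.0, lam) / lam      # simple zero-sum rank utilities
        for _ in range(iters):
            Z = rng.normal(size=(lam, d))
            X = m + Z @ A.T                        # x_i = A z_i + m
            order = np.argsort([f(x) for x in X])  # best (smallest f) first
            w = np.empty(lam)
            w[order] = u                           # utility by rank
            G_m = (w[:, None] * Z).mean(axis=0)
            G_C = 0.5 * (np.einsum('i,ij,ik->jk', w, Z, Z) / lam - w.mean() * np.eye(d))
            m = m + eta_m * A @ G_m
            A = A @ expm_sym(0.5 * eta_A * G_C)
        return m

    m = xnes(lambda x: float(np.dot(x, x)))
    print('f(m) =', float(np.dot(m, m)))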

Natural (Gradient) Evolution Strategies

The resulting algorithm is indeed an ES (the R^d perspective) and at the same time an SNGD algorithm (the Θ perspective).
ES traditionally have three distinct and often very different mechanisms:
- optimization: adaptation of m,
- step size control: adaptation of σ,
- metric learning: adaptation of C.
SNGD has only a single mechanism.
Astonishing insight: all (or most) parameters can be updated with a single mechanism.

Exponential Natural Evolution Strategies

Although derived in a completely different way, this algorithm turns out to be closely related to CMA-ES.
Akimoto et al. Bidirectional relation between CMA evolution strategies and natural evolution strategies. PPSN, 2010.
In particular, the update of m is identical, and the rank-µ update of CMA-ES coincides with an NES update.
This is astonishing; it is surely not a coincidence.
And it is insightful: it means that CMA-ES is (essentially) an SNGD algorithm.

Summary

Evolution Strategies (ES) are randomized direct search methods suitable for continuous black-box optimization.
Here we have focused on a core question: how to update the distribution parameters in a principled way?
The parameter space is equipped with the non-Euclidean information geometry of the corresponding statistical manifold of search distributions.
SNGD on a stochastically relaxed problem results in a direct search algorithm: the Natural (gradient) Evolution Strategy (NES).
The SNGD parameter update is by no means restricted to Gaussian distributions. It is a general construction template for update equations of continuous distribution parameters.

Thank you! Questions?