How information theory sheds new light on black-box optimization


Slide 1 — How information theory sheds new light on black-box optimization. Anne Auger. Colloque « Théorie de l'information : nouvelles frontières » (Information Theory: New Frontiers colloquium, held for the centenary of Claude Shannon), IHP, 26–28 October 2016.

Slide 2 — Based on: Y. Ollivier, L. Arnold, A. Auger, N. Hansen, Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles, JMLR (accepted).

Slide 3 — Overview: ➊ Black-Box Optimization: typical difficulties ➋ Information Geometric Optimization ➌ Invariance ➍ Recovering well-known algorithms: CMA-ES, PBIL, cGA.

Slides 4–5 — Black-Box Optimization. Optimize $f : \Omega \to \mathbb{R}$: discrete optimization $\Omega = \{0,1\}^n$, continuous optimization $\Omega = \mathbb{R}^n$. The black box maps $x \in \Omega$ to $f(x) \in \mathbb{R}$; gradients are not available or not useful.

Slide 6 — Which Typical Difficulties Need to Be Addressed? The black box maps $x \in \Omega$ to $f(x) \in \mathbb{R}$. The algorithm does not know in advance the features or difficulties of the optimization problem to be solved (difficulties related to real-world problems) and should be ready to handle any of them.

Slides 7–10 — What Can Be in the Black Box? What Makes a Problem Difficult to Solve? Non-linear, non-convex, non-quadratic functions; non-separable problems (dependencies between the variables); ill-conditioning; ruggedness (discontinuous, noisy, multi-modal landscapes).

Slides 11–12 — Algorithmic Key Components Within State-of-the-Art Black-Box Optimization Algorithms: randomized/stochastic (robustness with respect to noise and local optima); adaptive (certain parameters are learned on the fly); invariant (generalization of results: translation, scale, affine invariance, ...).

Slide 13 — Adaptive Stochastic Black-Box Algorithm. Let $\theta_t$ denote the state of the algorithm at iteration $t$.
Sample candidate solutions: $X_{t+1}^i = \mathrm{Sol}(\theta_t, U_{t+1}^i)$, $i = 1, \dots, \lambda$, with $\{U_{t+1}, t \in \mathbb{N}\}$ i.i.d.
Evaluate the solutions: $X_{t+1}^i \mapsto f(X_{t+1}^i)$.
Update the state: $\theta_{t+1} = F\big(\theta_t, (X_{t+1}^1, f(X_{t+1}^1)), \dots, (X_{t+1}^\lambda, f(X_{t+1}^\lambda))\big)$.
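To make the sample/evaluate/update template concrete, here is a minimal Python sketch; the state $(m, \sigma)$, the functions `sol` and `update_state`, and the toy objective are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(42)

def sol(theta, u):
    """Map the state theta = (mean m, step-size sigma) and noise u to a candidate."""
    m, sigma = theta
    return m + sigma * u

def update_state(theta, evaluated):
    """Toy update rule: pull the mean toward the best candidate (illustrative only)."""
    m, sigma = theta
    best_x, _ = min(evaluated, key=lambda pair: pair[1])
    return (m + 0.5 * (best_x - m), sigma)

def f(x):
    """The black box: only f(x) is observable, no gradients."""
    return float(np.sum(x ** 2))

theta = (np.ones(5), 1.0)  # initial state (m, sigma)
lam = 10                   # population size

for t in range(100):
    U = rng.standard_normal((lam, 5))                    # i.i.d. noise U_{t+1}^i
    X = [sol(theta, u) for u in U]                       # sample candidates
    theta = update_state(theta, [(x, f(x)) for x in X])  # update the state
```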

Slide 14 — Algorithmic Key Components (continued): invariance illustrated. A translation-invariant algorithm treats $x \mapsto f(x)$ and $x \mapsto f_{x^\star}(x) = f(x - x^\star)$ in exactly the same way.

Slide 15 — Why Invariance? "The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms." (Albert Einstein) With invariance, results generalize from a single function to its associated invariance class.

Slide 16 — Comparison-Based Stochastic Algorithms: invariance to strictly increasing transformations.
Sample candidate solutions: $X_{t+1}^i = \mathrm{Sol}(\theta_t, U_{t+1}^i)$, $i = 1, \dots, \lambda$.
Evaluate and rank the solutions: $f(X_{t+1}^{S(1)}) \le \dots \le f(X_{t+1}^{S(\lambda)})$, where $S$ is the permutation giving the indices of the ordered solutions.
Update the state from the ranked noise only: $\theta_{t+1} = F\big(\theta_t, U_{t+1}^{S(1)}, \dots, U_{t+1}^{S(\lambda)}\big)$.

Slides 17–18 — Comparison-Based Property: Consequence. The ranking of candidate solutions is the same on $f$ and on $g \circ f$ for any strictly increasing $g : \mathrm{Im}(f) \to \mathbb{R}$:
$$f(X_{t+1}^{S(1)}) \le \dots \le f(X_{t+1}^{S(\lambda)}) \iff g \circ f(X_{t+1}^{S(1)}) \le \dots \le g \circ f(X_{t+1}^{S(\lambda)}),$$
hence invariance to strictly increasing transformations of $f$. The three functions shown on the slide are optimized in the same way by a comparison-based algorithm.
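A small Python check of this property (assuming, for illustration, $g(u) = e^u$ as the strictly increasing transformation):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sum(x**2, axis=-1)   # original objective
g = np.exp                            # any strictly increasing g

X = rng.standard_normal((10, 3))      # 10 candidate solutions in R^3
ranking_f = np.argsort(f(X))          # permutation S under f
ranking_gf = np.argsort(g(f(X)))      # permutation S under g o f

assert np.array_equal(ranking_f, ranking_gf)  # identical rankings
```

Since a comparison-based algorithm sees only the ranking, its entire trajectory is identical on $f$ and $g \circ f$.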

Slide 20 — Overview (recap): ➊ Black-Box Optimization: typical difficulties ➋ Information Geometric Optimization ➌ Invariance ➍ Recovering well-known algorithms: CMA-ES, PBIL, cGA.

Slides 21–22 — Information Geometric Optimization: Setting. A family of probability distributions $(P_\theta)_{\theta \in \Theta}$ on $\Omega$, with $\theta \in \Theta$ a continuous multicomponent parameter; $\Theta$ is a statistical manifold. Example: $\Omega = \mathbb{R}^n$ and $P_\theta$ a multivariate normal distribution with $\theta = (m, C)$.

Slides 23–24 — Changing Viewpoint I. Transform the original optimization problem on $\Omega$, $\min_{x \in \Omega} f(x)$, into an optimization problem on $\Theta$: minimize $F(\theta) = \int f(x)\, P_\theta(dx)$. Minimizing $F$ amounts to finding a Dirac-delta distribution concentrated on $\operatorname{argmin}_x f(x)$ [Wierstra et al. 2014]. But $F$ is not invariant to strictly increasing transformations of $f$.

Slide 25 — Changing Viewpoint II: invariant under strictly increasing transformations of $f$. Transform the original problem $\min_{x \in \Omega} f(x)$ into an optimization problem on $\Theta$: maximize
$$J_{\theta_t}(\theta) = \int W_{\theta_t}^f(x)\, P_\theta(dx), \qquad W_{\theta_t}^f(x) = w\big(P_{\theta_t}[y : f(y) \le f(x)]\big),$$
with $w : [0,1] \to \mathbb{R}$ a decreasing weight function. Rationale: $f$ small $\leftrightarrow$ $W_{\theta_t}^f(x)$ large. [Ollivier et al.]
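One standard illustrative choice (an assumption here, not the slide's prescription) is the truncation weight $w(q) = \frac{1}{q_0}\,\mathbf{1}_{\{q \le q_0\}}$ for some $q_0 \in (0,1)$, which gives
$$W_{\theta_t}^f(x) = \begin{cases} 1/q_0 & \text{if } x \text{ lies in the best } q_0\text{-fraction of } P_{\theta_t} \text{ as measured by } f,\\ 0 & \text{otherwise,} \end{cases}$$
so that maximizing $J_{\theta_t}$ pushes probability mass toward the current best $q_0$-quantile of the search distribution.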

Slides 26–27 — Maximizing $J_{\theta_t}(\theta)$: Information Geometric Optimization. Perform a natural gradient step on $\Theta$:
$$\theta_{t+\delta t} = \theta_t + \delta t\, \widetilde{\nabla}_\theta \int W_{\theta_t}^f(x)\, P_\theta(dx).$$

Slide 28 — Natural Gradient, Fisher Information Metric. $\widetilde{\nabla}$ is the gradient with respect to the Fisher metric, defined via the Fisher information matrix
$$I_{ij}(\theta) = \int \frac{\partial \ln P_\theta(x)}{\partial \theta_i} \frac{\partial \ln P_\theta(x)}{\partial \theta_j}\, P_\theta(dx),$$
which appears in the expansion $\mathrm{KL}(P_{\theta + \delta\theta} \,\|\, P_\theta) = \frac{1}{2} \sum_{i,j} I_{ij}(\theta)\, \delta\theta_i\, \delta\theta_j + O(\delta\theta^3)$.
It is intrinsic: independent of the chosen parametrization of $P_\theta$, and the Fisher metric is essentially the only way to obtain this property [Amari, Nagaoka 1993].
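As a concrete instance (a standard computation, included for illustration), take the univariate normal family $P_\theta = \mathcal{N}(m, \sigma^2)$ with $\theta = (m, \sigma)$. From $\ln P_\theta(x) = -\frac{(x-m)^2}{2\sigma^2} - \ln \sigma - \frac{1}{2}\ln(2\pi)$ one obtains the diagonal Fisher matrix
$$I(\theta) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix},$$
and the natural gradient rescales the vanilla gradient by $I(\theta)^{-1}$, making the update independent of how the Gaussian is parametrized.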

Slides 29–32 — Maximizing $J_{\theta_t}(\theta)$: Information Geometric Optimization. The natural gradient step
$$\theta_{t+\delta t} = \theta_t + \delta t\, \widetilde{\nabla}_\theta \int W_{\theta_t}^f(x)\, P_\theta(dx)$$
rewrites as
$$\theta_{t+\delta t} = \theta_t + \delta t \int W_{\theta_t}^f(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\Big|_{\theta=\theta_t} P_{\theta_t}(dx) = \theta_t + \delta t \int w\big(P_{\theta_t}[y : f(y) \le f(x)]\big)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\Big|_{\theta=\theta_t} P_{\theta_t}(dx),$$
which does not depend on $\nabla f$. Two objects follow: ➊ the IGO flow, obtained in the limit $\delta t \to 0$; ➋ IGO algorithms, obtained by discretization of the integrals.
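The exchange of gradient and integral above uses the log-derivative identity $\nabla_\theta P_\theta = P_\theta\, \nabla_\theta \ln P_\theta$ (spelled out here for completeness):
$$\nabla_\theta \int W_{\theta_t}^f(x)\, P_\theta(dx)\Big|_{\theta=\theta_t} = \int W_{\theta_t}^f(x)\, \nabla_\theta \ln P_\theta(x)\Big|_{\theta=\theta_t}\, P_{\theta_t}(dx),$$
and the same holds for the natural gradient $\widetilde{\nabla}$ after multiplying by the inverse Fisher matrix. This is why, with the weight $W_{\theta_t}^f$ held fixed, the gradient of $J_{\theta_t}$ becomes an expectation under $P_{\theta_t}$ that Monte Carlo sampling can estimate.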

Slide 33 — IGO Gradient Flow. The IGO flow is the set of continuous-time trajectories in $\Theta$-space defined by the ODE
$$\frac{d\theta_t}{dt} = \int W_{\theta_t}^f(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)\Big|_{\theta=\theta_t} P_{\theta_t}(dx). \quad \text{[Ollivier et al.]}$$

Slides 34–35 — IGO Algorithm [Ollivier et al.]: Monte Carlo Approximation of the Integrals. Sample $X_i \sim P_{\theta_t}$, $i = 1, \dots, N$, and approximate the quantile weight by rank-based weights:
$$w\big(P_{\theta_t}[y : f(y) \le f(x)]\big) \approx w\Big(\frac{\mathrm{rk}(X_i) + 1/2}{N}\Big).$$
The IGO algorithm then reads
$$\theta_{t+\delta t} = \theta_t + \delta t\, \frac{1}{N} \sum_{i=1}^N w\Big(\frac{\mathrm{rk}(X_i) + 1/2}{N}\Big)\, \widetilde{\nabla}_\theta \ln P_\theta(X_i)\Big|_{\theta=\theta_t} = \theta_t + \delta t \sum_{i=1}^N \hat{w}_i\, \widetilde{\nabla}_\theta \ln P_\theta(X_i)\Big|_{\theta=\theta_t},$$
a consistent estimator of the integral.
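A minimal Python sketch of one IGO step for a generic family; the helpers `sample(theta, n)` and `nat_grad_log_density(theta, x)` (returning $\widetilde{\nabla}_\theta \ln P_\theta(x)$) are assumed to be supplied by the caller, and the truncation weight `w` is an illustrative choice:

```python
import numpy as np

def w(q, q0=0.5):
    """Illustrative decreasing weight function on [0, 1] (truncation weights)."""
    return (q <= q0) / q0

def igo_step(theta, f, sample, nat_grad_log_density, n=20, dt=0.1):
    """One IGO step: sample, rank, weight, natural-gradient update."""
    X = sample(theta, n)                                # X_i ~ P_theta
    ranks = np.argsort(np.argsort([f(x) for x in X]))   # rk(X_i): 0 = best
    w_hat = w((ranks + 0.5) / n) / n                    # rank-based weights w_hat_i
    step = sum(wi * nat_grad_log_density(theta, x) for wi, x in zip(w_hat, X))
    return theta + dt * step
```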

Slide 36 — Invariances [Ollivier et al.]. The IGO flow and the IGO algorithm are invariant to strictly increasing transformations of $f$. The IGO flow is invariant under $\theta$-reparametrization (the main idea behind information geometry [Amari, Nagaoka 1993]) and under changes of coordinates in $\Omega$, provided the change of coordinates preserves the family of distributions. The IGO algorithm is affine-invariant for families of distributions invariant under affine transformations.

Slide 37 — Instantiation of IGO: Multivariate Normal Distributions [Akimoto et al. 2010]. With $P_\theta$ a multivariate normal distribution and $\theta = (m, C)$, the IGO algorithm reads
$$m_{t+\delta t} = m_t + \delta t \sum_{i=1}^N \hat{w}_i (X_i - m_t), \qquad C_{t+\delta t} = C_t + \delta t \sum_{i=1}^N \hat{w}_i \big((X_i - m_t)(X_i - m_t)^T - C_t\big).$$
This recovers the CMA-ES with rank-$\mu$ update [Hansen et al. 2003], and it is affine-invariant.
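A compact Python sketch of these two updates (the weights follow the IGO algorithm above; `f`, the step size, and the initialization are illustrative assumptions):

```python
import numpy as np

def gaussian_igo_step(m, C, f, n=20, dt=0.1, q0=0.5, rng=np.random.default_rng()):
    """One IGO step for the Gaussian family N(m, C): rank-mu-style update."""
    X = rng.multivariate_normal(m, C, size=n)           # X_i ~ N(m, C)
    ranks = np.argsort(np.argsort([f(x) for x in X]))   # 0 = best
    w_hat = ((ranks + 0.5) / n <= q0) / (q0 * n)        # truncation weights w_hat_i
    Y = X - m
    m_new = m + dt * (w_hat @ Y)                        # mean update
    C_new = C + dt * sum(wi * (np.outer(y, y) - C) for wi, y in zip(w_hat, Y))
    return m_new, C_new
```

Iterating `gaussian_igo_step` with a small `dt` follows the IGO flow for the Gaussian family.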

Slide 38 — Instantiation of IGO: Bernoulli Measures. With $\Omega = \{0,1\}^d$ and $P_\theta(x) = p_{\theta_1}(x_1) \cdots p_{\theta_d}(x_d)$ the family of Bernoulli measures, the IGO algorithm recovers PBIL (Population-Based Incremental Learning) [Baluja, Caruana 1995] and the cGA (compact Genetic Algorithm) [Harik et al. 1999].
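For the Bernoulli family, the natural gradient of $\ln P_\theta(x)$ with respect to $p_j$ reduces to $x_j - p_j$ (the Fisher information $1/(p_j(1-p_j))$ cancels the denominator of the vanilla gradient $(x_j - p_j)/(p_j(1-p_j))$), giving a PBIL-like probability-vector update. The sketch below makes illustrative choices for the weights and step size:

```python
import numpy as np

def bernoulli_igo_step(p, f, n=20, dt=0.1, q0=0.5, rng=np.random.default_rng()):
    """One IGO step on the Bernoulli family over {0,1}^d with parameter vector p."""
    X = (rng.random((n, p.size)) < p).astype(float)     # X_i ~ P_theta
    ranks = np.argsort(np.argsort([f(x) for x in X]))   # 0 = best
    w_hat = ((ranks + 0.5) / n <= q0) / (q0 * n)        # truncation weights w_hat_i
    return p + dt * (w_hat @ (X - p))                   # natural gradient: x - p
```

For example, starting from `p = np.full(10, 0.5)` and minimizing `lambda x: -x.sum()` (OneMax), repeated steps drive the probabilities toward 1, the PBIL-style behaviour the slide refers to.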

Slide 39 — Conclusions. The Information Geometric Optimization framework provides a unified picture of discrete and continuous optimization and theoretical foundations for existing algorithms; CMA-ES is state-of-the-art in continuous black-box optimization. Some parts of the CMA-ES algorithm are not explained by the IGO framework (step-size adaptation, cumulation). New algorithms: large-scale variants of CMA-ES based on IGO, ...

Slide 40 — References
[Ollivier et al.] Y. Ollivier, L. Arnold, A. Auger, N. Hansen. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. JMLR (conditionally accepted).
[Akimoto et al. 2010] Y. Akimoto, Y. Nagata, I. Ono, S. Kobayashi. Bidirectional relation between CMA evolution strategies and natural evolution strategies. PPSN 2010.
[Hansen et al. 2003] N. Hansen, S.D. Müller, P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 2003.
[Amari, Nagaoka 1993] S. Amari, H. Nagaoka. Methods of Information Geometry. 1993.
[Baluja, Caruana 1995] S. Baluja, R. Caruana. Removing the genetics from the standard genetic algorithm. ICML 1995.
[Harik et al. 1999] G.R. Harik, F.G. Lobo, D.E. Goldberg. The compact genetic algorithm. IEEE Transactions on Evolutionary Computation, 1999.
[Wierstra et al. 2014] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, J. Schmidhuber. Natural evolution strategies. JMLR 2014.
