Surrogate models for Single and Multi-Objective Stochastic Optimization: Integrating Support Vector Machines and Covariance-Matrix Adaptation-ES


Surrogate models for Single and Multi-Objective Stochastic Optimization: Integrating Support Vector Machines and Covariance-Matrix Adaptation-ES
Ilya Loshchilov, Marc Schoenauer, Michèle Sebag
TAO, CNRS - INRIA - Univ. Paris-Sud
May 23rd, 2011

Motivations
Find Argmin {F : X -> IR}
Context: ill-posed optimization problems
- fitness function F on X, a continuous domain of IR^d
- gradient not available or not useful
- F available only as an oracle (black box)
Goal: build a sequence {x_1, x_2, ...} converging to Argmin(F)
Black-box approaches
+ applicable and robust; comparison-based approaches are invariant
- high computational cost: number of function evaluations

Surrogate optimization
Principle
- Gather E = {(x_i, F(x_i))}   (training set)
- Build a surrogate model F^ from E   (learning)
- Use the surrogate model F^ for some time:
    Optimization: use F^ instead of the true F in the standard algorithm
    Filtering: select promising x_i based on F^ in a population-based algorithm
- Compute F(x_i) for some x_i
- Update F^
- Iterate
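
A minimal sketch of this loop, assuming a toy black-box objective and a scikit-learn Gaussian process as a stand-in surrogate (neither is prescribed by the talk): it alternates learning F^, filtering candidates with F^, and paying true evaluations only for the retained points.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def sphere(x):                                   # toy black-box objective (illustrative)
    return float(np.sum(x**2))

rng = np.random.default_rng(0)
d, n_init, n_iter, batch = 5, 20, 30, 10

X = rng.uniform(-5, 5, (n_init, d))              # gather E = {(x_i, F(x_i))}
y = np.array([sphere(x) for x in X])

for _ in range(n_iter):
    surrogate = GaussianProcessRegressor().fit(X, y)   # build F^ from E
    cand = rng.uniform(-5, 5, (200, d))                # candidate points
    pred = surrogate.predict(cand)
    chosen = cand[np.argsort(pred)[:batch]]            # filtering: keep promising candidates
    f_true = np.array([sphere(x) for x in chosen])     # true evaluations only for them
    X, y = np.vstack([X, chosen]), np.concatenate([y, f_true])   # update E, iterate

print("best value found:", y.min())
```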

Surrogate optimization, continued
Issues
Learning
- Hypothesis space (polynomials, neural nets, Gaussian processes, ...)
- Selection of the training set (prune, update, ...)
- What is the learning target?
Interaction of the Learning and Optimization modules
- Schedule (when to relearn)
- How to use F^ to support the optimization search
- How to use search results to support learning F^
This talk
- Using covariance-matrix estimation within Support Vector Machines
- Using SVM for multi-objective optimization

Content
1 Covariance Matrix Adaptation-Evolution Strategy
  Evolution Strategies
  CMA-ES
  The state-of-the-art of (Stochastic) Optimization
2 Statistical Machine Learning
  Linear classifiers
  The kernel trick
3 Previous Work
  Mixing Rank-SVM and Local Information
  Experiments
4 Background
  Dominance-based Surrogate
  Experimental Validation

Stochastic Search
A black-box search template to minimize f : IR^n -> IR
Initialize distribution parameters θ, set sample size λ
While not terminate:
  1. Sample distribution P(x|θ) -> x_1, ..., x_λ in IR^n
  2. Evaluate x_1, ..., x_λ on f
  3. Update parameters θ <- F_θ(θ, x_1, ..., x_λ, f(x_1), ..., f(x_λ))
This template covers:
- deterministic algorithms
- Evolutionary Algorithms, PSO, DE (P implicitly defined by the variation operators)
- Estimation of Distribution Algorithms
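
A minimal sketch of this template, with one illustrative instantiation: an isotropic Gaussian whose mean moves to the average of the µ best samples. This is not CMA-ES, only the loop structure; all names are illustrative.

```python
import numpy as np

def black_box_search(f, theta, sample, update, budget=2000):
    """Generic template: sample from P(.|theta), evaluate, update theta."""
    evals = 0
    while evals < budget:
        X = sample(theta)                    # 1. sample x_1..x_lambda from P(x|theta)
        F = np.array([f(x) for x in X])      # 2. evaluate on f
        theta = update(theta, X, F)          # 3. update the distribution parameters
        evals += len(X)
    return theta

lam, mu, d = 12, 6, 5
sample = lambda th: th["m"] + th["sigma"] * np.random.randn(lam, d)

def update(th, X, F):
    best = X[np.argsort(F)[:mu]]             # mu best samples
    return {"m": best.mean(axis=0), "sigma": th["sigma"]}

theta = black_box_search(lambda x: np.sum(x**2),
                         {"m": np.full(d, 3.0), "sigma": 0.5},
                         sample, update)
print(theta["m"])
```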

The (µ, λ) Evolution Strategy
Gaussian mutations as perturbations of m:
  x_i = m + σ N_i(0, C)   for i = 1, ..., λ
where x_i, m in IR^n, σ in IR_+, and C in IR^{n×n}, and where
- the mean vector m in IR^n represents the favorite solution
- the so-called step-size σ in IR_+ controls the step length
- the covariance matrix C in IR^{n×n} determines the shape of the distribution ellipsoid
How to update m, σ, and C?
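
A sketch of the sampling step x_i = m + σ N_i(0, C), assuming a Cholesky factorisation of C (one common way to draw from N(0, C); function and variable names are illustrative, not from the talk).

```python
import numpy as np

def sample_offspring(m, sigma, C, lam, rng=None):
    """Draw lambda offspring x_i = m + sigma * N_i(0, C).

    C is factored once (Cholesky), so that y = A z with z ~ N(0, I)
    has covariance A A^T = C."""
    rng = rng or np.random.default_rng()
    A = np.linalg.cholesky(C)                 # C = A A^T
    Z = rng.standard_normal((lam, len(m)))
    return m + sigma * Z @ A.T                # one offspring per row

m = np.zeros(3)
C = np.array([[4.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 0.25]])
X = sample_offspring(m, sigma=0.3, C=C, lam=10)
print(X.shape)   # (10, 3)
```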

History
The one-fifth rule (Rechenberg, 1973)
- One single step-size σ for the whole population
- Measure the empirical success rate
- Increase σ if the success rate is too large, decrease σ if it is too small
- Often wrong in non-smooth landscapes
Self-adaptive mutations (Schwefel, 1981)
- Each individual carries its own mutation parameters
- Log-normal mutation of the mutation parameters, then (normal) mutation of the individual
- Adaptation is slow in the full-covariance case: from 1 up to (n² + n)/2 mutation parameters
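
A sketch of a (1+1)-ES with the one-fifth success rule. The multiplicative constants are illustrative (chosen so that σ is stationary at a success rate of exactly 1/5), not taken from Rechenberg's original setting.

```python
import numpy as np

def one_plus_one_es(f, x, sigma=1.0, budget=5000, rng=np.random.default_rng(1)):
    """(1+1)-ES with the 1/5th success rule (sketch; constants are illustrative)."""
    fx = f(x)
    for _ in range(budget):
        y = x + sigma * rng.standard_normal(len(x))     # mutate
        fy = f(y)
        success = fy <= fx
        if success:
            x, fx = y, fy                               # keep offspring if not worse
        # increase sigma after a success, decrease after a failure;
        # the rates balance out exactly at a 1/5 success probability
        sigma *= np.exp(0.2) if success else np.exp(-0.05)
    return x, fx

x, fx = one_plus_one_es(lambda z: np.sum(z**2), np.full(5, 3.0))
print(fx)
```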

Cumulative Step-Size Adaptation (CSA)
  x_i = m + σ y_i,   m <- m + σ y_w
Measure the length of the evolution path, i.e. the pathway of the mean vector m over the generation sequence:
- path shorter than expected (steps cancel each other out) -> decrease σ
- path longer than expected (steps point in similar directions) -> increase σ
Loosely speaking, consecutive steps are
- perpendicular under random selection (in expectation)
- perpendicular in the desired situation (to be most efficient)
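
A sketch of the standard CSA update equations from the CMA-ES literature; the constants c_σ, d_σ and the χ_n approximation are the usual defaults and an assumption here, not values quoted in the talk.

```python
import numpy as np

def csa_update(p_sigma, sigma, m_new, m_old, C_invsqrt, mu_eff, n,
               c_sigma=None, d_sigma=None):
    """One CSA step (sketch of the standard update; constants are defaults)."""
    c_sigma = c_sigma or (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = d_sigma or 1 + 2 * max(0, np.sqrt((mu_eff - 1) / (n + 1)) - 1) + c_sigma
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))   # E||N(0, I)||
    # accumulate the (whitened) movement of the mean into the evolution path
    p_sigma = (1 - c_sigma) * p_sigma + \
        np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * C_invsqrt @ (m_new - m_old) / sigma
    # long path -> increase sigma, short path -> decrease sigma
    sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
    return p_sigma, sigma
```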

Covariance Matrix Adaptation: Rank-One Update
  m <- m + σ y_w,   y_w = Σ_{i=1..µ} w_i y_{i:λ},   y_i ~ N_i(0, C)
Starting from the initial distribution with C = I, each generation mixes the current distribution C with the step y_w, the movement of the population mean m (disregarding σ):
  new distribution: C <- 0.8 C + 0.2 y_w y_w^T
Ruling principle: the adaptation increases the probability of the successful step y_w appearing again.

Rank-µ Update
  x_i = m + σ y_i,   y_i ~ N_i(0, C),   m <- m + σ y_w,   y_w = Σ_{i=1..µ} w_i y_{i:λ}
Estimate the covariance from the µ best steps:
  C_µ = (1/µ) Σ_{i=1..µ} y_{i:λ} y_{i:λ}^T
  C <- (1 − c_µ) C + c_µ C_µ,    m_new <- m + σ (1/µ) Σ_{i=1..µ} y_{i:λ}
Illustration: sampling of λ = 150 solutions with C = I and σ = 1; calculating C_µ from the µ = 50 best points with w_1 = ... = w_µ = 1/µ; new distribution.
Remark: the shape of the old (sample) distribution has a great influence on the new distribution -> iterations needed
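
A minimal numpy sketch covering both the rank-one update from the previous slide and the rank-µ update above; the learning rates c1 and cmu are illustrative placeholders, not the values used by CMA-ES or quoted in the talk.

```python
import numpy as np

def update_covariance(C, Y_best, weights, c1=0.2, cmu=0.3):
    """Sketch of the CMA covariance updates (learning rates illustrative).

    Y_best: the mu best steps y_{i:lambda}, one per row.
    weights: recombination weights w_i (summing to 1)."""
    y_w = weights @ Y_best                                   # weighted mean step
    rank_one = np.outer(y_w, y_w)                            # y_w y_w^T
    rank_mu = sum(w * np.outer(y, y) for w, y in zip(weights, Y_best))
    # rank-one update alone (previous slide): C <- 0.8 C + 0.2 y_w y_w^T
    # combined rank-one + rank-mu update:
    return (1 - c1 - cmu) * C + c1 * rank_one + cmu * rank_mu

d, mu = 4, 5
C = np.eye(d)
Y_best = np.random.randn(mu, d)
w = np.full(mu, 1 / mu)
C = update_covariance(C, Y_best, w)
print(np.allclose(C, C.T))   # the covariance stays symmetric
```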

Invariance: Guarantee for Generalization
Invariance properties of CMA-ES
- Invariance to order-preserving transformations in function space, like all comparison-based algorithms
- Translation and rotation invariance: invariance to rigid transformations of the search space
CMA-ES is almost parameterless
- Tuning on a small set of functions (Hansen & Ostermeier 2001)
- Default values generalize to whole classes of problems
- Exception: population size for multi-modal functions; but try the Restart-CMA-ES (Auger & Hansen, 2005)

State-of-the-art Results
BBOB: Black-Box Optimization Benchmarking (ACM GECCO workshop, in 2009 and 2010)
- Set of 25 benchmark functions, dimensions 2 to 40
- With known difficulties (ill-conditioning, non-separability, ...)
- Noisy and noiseless versions
Competitors include:
- BFGS (Matlab version), Fletcher-Powell
- DFO (Derivative-Free Optimization, Powell 04)
- Differential Evolution
- Particle Swarm Optimization
- and many more

Contents
1 Covariance Matrix Adaptation-Evolution Strategy
  Evolution Strategies
  CMA-ES
  The state-of-the-art of (Stochastic) Optimization
2 Statistical Machine Learning
  Linear classifiers
  The kernel trick
3 Previous Work
  Mixing Rank-SVM and Local Information
  Experiments
4 Background
  Dominance-based Surrogate
  Experimental Validation

Supervised Machine Learning
Context: the universe provides instances x_i, an oracle provides labels y_i
Input: training set E = {(x_i, y_i), i = 1...n, x_i in X, y_i in Y}
Output: hypothesis h : X -> Y
Criterion: quality of h

Supervised Machine Learning, 2
Definitions: E = {(x_i, y_i), x_i in X, y_i in Y, i = 1...n}
- Classification: Y finite   (e.g. failure / ok)
- Regression: Y = IR   (e.g. time to failure)
- Hypothesis space H : X -> Y
Tasks
- Select H   (model selection)
- Assess h in H   (expected accuracy IE[h(x) ≠ y])
- Find h in H minimizing the error cost in expectation:
  h* = Arg min { IE[l(h(x), y)], h in H }

Dilemma
Fitting the data: the bias/variance tradeoff

Statistical Machine Learning
Minimize the expected loss: minimize IE[l(h(x), y)]
Principle: if h is well-behaved on the training set, if the training set is representative, and if h is regular, then h is well-behaved in expectation:
  IE[F] ≤ (1/n) Σ_{i=1..n} F(x_i) + c(F, n)

Linear classification; the noiseless case
H : X ⊂ IR^d -> IR, prediction = sgn(h(x))
  h(x) = <w, x> + b
(Figure: several candidate separating hyperplanes L1, L2, L3, with normal vector w and offset b.)

Linear classification; the noiseless case, 2
Example constraint: y_i (<w, x_i> + b) ≥ margin ≥ 0
Maximize the minimum margin 2/||w||, attained at the support vectors
Formalisation:
  Minimize (1/2) ||w||²
  subject to  y_i (<w, x_i> + b) ≥ 1  for all i

Linear classification; the noiseless case, 3
Primal form:
  Minimize (1/2) ||w||²
  subject to  y_i (<w, x_i> + b) ≥ 1  for all i
Using Lagrange multipliers, minimize over (w, b) and maximize over α:
  (1/2) ||w||² − Σ_{i=1..n} α_i [ y_i (<w, x_i> + b) − 1 ]
Dual form:
  Maximize over α:  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>
  subject to α_i ≥ 0, i = 1...n
  with w = Σ_i α_i y_i x_i
Optimization: quadratic programming
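
A small check of the primal/dual relationship, using scikit-learn's SVC as one possible solver (the talk does not prescribe any library): the fitted weight vector equals Σ_i α_i y_i x_i recovered from the dual coefficients and the support vectors. The data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)    # large C ~ (almost) hard margin

w, b = clf.coef_[0], clf.intercept_[0]         # h(x) = <w, x> + b
alpha_y = clf.dual_coef_[0]                    # alpha_i * y_i for the support vectors
sv = clf.support_vectors_

# w recovered from the dual solution: w = sum_i alpha_i y_i x_i
print(np.allclose(w, alpha_y @ sv, atol=1e-6))
print("number of support vectors:", len(sv))
```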

Linear classification; the noisy case
Allow constraint violation; introduce slack variables ξ_i
Primal form:
  Minimize (1/2) ||w||² + C Σ_{i=1..n} ξ_i
  subject to  y_i (<w, x_i> + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
Lagrange multipliers: minimize over (w, b, ξ), maximize over (α, β):
  (1/2) ||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i (<w, x_i> + b) − 1 + ξ_i ] − Σ_i β_i ξ_i
Dual form:
  Maximize over α:  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>
  subject to 0 ≤ α_i ≤ C, i = 1...n, and Σ_i α_i y_i = 0
Solution (support vectors):  h(x) = <w, x> = Σ_i α_i y_i <x_i, x>

The kernel trick
Intuition: map X into a feature space Ω
  Φ : x = (x_1, x_2) -> (x_1², √2 x_1 x_2, x_2²)
Principle: choose Φ and K such that <Φ(x), Φ(x')> = K(x, x')
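
A two-line check of this principle, using the explicit map Φ above and the quadratic kernel K(x, z) = <x, z>² (a minimal sketch; the kernel choice is for illustration).

```python
import numpy as np

def phi(x):
    """Explicit feature map from the slide: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, z):
    """Quadratic kernel K(x, z) = <x, z>^2, computed without any mapping."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.dot(phi(x), phi(z)), k(x, z))   # the two values coincide
```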

The kernel trick, 2
The SVM only involves scalar products:
  h(x) = Σ_i α_i y_i <x_i, x>    (linear case)
  h(x) = Σ_i α_i y_i K(x_i, x)   (kernel trick)
Pros
- A rich hypothesis space
- No computational overhead: no explicit mapping onto the feature space
Open problem: kernel design

The kernel trick, 3
Kernels
- Polynomial: k(x_i, x_j) = (<x_i, x_j> + 1)^d
- Gaussian or Radial Basis Function: k(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
- Hyperbolic tangent: k(x_i, x_j) = tanh(κ <x_i, x_j> + c)
(Figure: example decision boundaries for polynomial (left) and Gaussian (right) kernels.)

Rank-based SVM
Learning to order things
On a training set E = {x_i, i = 1...n}, the expert gives preferences:
  (x_{i_k} ≻ x_{j_k}),  k = 1...K    (underconstrained regression)
Order constraints, primal form:
  Minimize (1/2) ||w||² + C Σ_{k=1..K} ξ_k
  subject to  <w, x_{i_k}> − <w, x_{j_k}> ≥ 1 − ξ_k  for all k
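
A sketch of one common way to train such a ranker, assuming the standard pairwise reduction to a linear SVM on difference vectors; scikit-learn's LinearSVC is used here purely as an illustration, it is not the solver from the talk.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_rank_svm(X, preferences, C=1.0):
    """Linear Rank-SVM via the pairwise reduction (sketch).

    Each preference (i, j), meaning x_i preferred to x_j, becomes the constraint
    <w, x_i - x_j> >= 1 - xi_k, i.e. a classification problem on difference
    vectors (both orientations are added so the two classes are balanced)."""
    diffs, labels = [], []
    for i, j in preferences:
        diffs.append(X[i] - X[j]); labels.append(+1)
        diffs.append(X[j] - X[i]); labels.append(-1)
    clf = LinearSVC(C=C, fit_intercept=False).fit(np.array(diffs), labels)
    return clf.coef_[0]                     # ranking function: score(x) = <w, x>

# Toy example: points should be ranked by their first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
prefs = [(i, j) for i in range(30) for j in range(30) if X[i, 0] > X[j, 0] + 0.5]
w = fit_rank_svm(X, prefs)
print(np.corrcoef(X @ w, X[:, 0])[0, 1])    # close to 1: ordering recovered
```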

Contents
1 Covariance Matrix Adaptation-Evolution Strategy
  Evolution Strategies
  CMA-ES
  The state-of-the-art of (Stochastic) Optimization
2 Statistical Machine Learning
  Linear classifiers
  The kernel trick
3 Previous Work
  Mixing Rank-SVM and Local Information
  Experiments
4 Background
  Dominance-based Surrogate
  Experimental Validation

Surrogate Models for CMA-ES
lmm-CMA-ES
- Build a full quadratic meta-model around the current point
- Weighted by the Mahalanobis distance derived from the covariance matrix
- Speed-up: a factor of 2-3 for n ≥ 4
- Complexity: from O(n^4) to O(n^6) (intractable for n > 16)
- Rank-invariance is lost
S. Kern et al. (2006). "Local Meta-Models for Optimization Using Evolution Strategies"
Z. Bouzarkouna et al. (2010). "Investigating the lmm-CMA-ES for Large Population Sizes"

Surrogate Models for CMA-ES, continued
Using Rank-SVM
- Builds a global model using Rank-SVM: x_i ≻ x_j iff F(x_i) < F(x_j)
- Kernel and parameters are highly problem-dependent
- Note: no use of information from the current state of CMA-ES
T. Runarsson (2006). "Ordinal Regression in Evolutionary Computation"
ACM Algorithm
- Use the covariance matrix C from CMA-ES inside the Gaussian kernel
I. Loshchilov et al. (2010). "Comparison-based optimizers need comparison-based surrogates"

Model Learning: Non-separable Ellipsoid problem
(Figure: Rank-SVM regression in the original coordinate system (left) and in the transformed coordinate system (right).)
The transformed coordinate system is given by the current covariance matrix C and mean m:
  x' = C^{-1/2} (x − m)
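
A numpy sketch of this change of coordinates, computing C^{-1/2} from an eigendecomposition (illustrative; function and variable names are not from the talk).

```python
import numpy as np

def whiten(X, C, m):
    """Change of coordinates x' = C^{-1/2} (x - m) (sketch).

    C^{-1/2} is computed from the eigendecomposition C = B diag(d) B^T, so that
    distances in the new coordinates follow the Mahalanobis metric defined by
    the current CMA-ES covariance matrix."""
    d, B = np.linalg.eigh(C)                           # C symmetric positive definite
    C_invsqrt = B @ np.diag(1.0 / np.sqrt(d)) @ B.T
    return (X - m) @ C_invsqrt.T

C = np.array([[4.0, 1.5], [1.5, 1.0]])
m = np.array([1.0, -2.0])
X = np.random.default_rng(0).normal(size=(5, 2))
print(whiten(X, C, m))
```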

Using the Surrogate Model
Optimization: significant speed-up ... if the model is global and accurate
Filtering: "guaranteed" speed-up
- Prescreen the pre-children with the surrogate, then evaluate only the most promising λ' of them with the true F
- Candidates are retained with a rank-dependent probability
(Figure: probability density of retention as a function of surrogate rank.)

ACM-ES Optimization Loop
A. Select training points
  1. Select the best k training points.
  2. Apply the change of coordinates defined by the current covariance matrix and the current mean value [4].
B. Build a surrogate model
  3. Build a rank-based surrogate model using Rank-SVM.
C. Generate pre-children
  4. Generate pre-children and rank them according to the surrogate fitness function.
D. Select the most promising children
  5.-6. Prescreen the pre-children; evaluate the most promising λ' of them, retained with a rank-dependent probability.
  7. Add the newly evaluated points to the training set and update the parameters of CMA-ES.

Parameters
SVM learning
- Number of training points: N_training = 30 d for all problems, except Rosenbrock and Rastrigin, where N_training = 70 d
- Number of iterations: N_iter = 50000 d
- Kernel function: RBF with σ equal to the average distance between the training points
- Cost of constraint violation: C_i = 10^6 (N_training − i)^2.0
Offspring selection
- Number of test points: N_test = 500
- Number of evaluated offspring: λ' = λ/3
- Offspring selection pressure parameters: σ²_sel0 = 2 σ²_sel1 = 0.8

Results: Speed-up
(Figure: speed-up of ACM-ES over CMA-ES vs. problem dimension for Schwefel, Schwefel^{1/4}, Rosenbrock, NoisySphere, Ackley and Ellipsoid.)

Function            n   λ    λ'   ACM-ES               spu   CMA-ES
Schwefel            10  10   3    801 ± 36             3.3   2667 ± 87
Schwefel            20  12   4    3531 ± 179           2.0   7042 ± 172
Schwefel            40  15   5    13440 ± 281          1.7   22400 ± 289
Schwefel^{1/4}      10  10   3    1774 ± 37            4.1   7220 ± 206
Schwefel^{1/4}      20  12   4    6138 ± 82            2.5   15600 ± 294
Schwefel^{1/4}      40  15   5    22658 ± 390          1.8   41534 ± 466
Rosenbrock          10  10   3    2059 ± 143 (0.95)    3.7   7669 ± 691 (0.90)
Rosenbrock          20  12   4    11793 ± 574 (0.75)   1.8   21794 ± 1529
Rosenbrock          40  15   5    49750 ± 2412 (0.9)   1.6   82043 ± 3991
NoisySphere (0.15)  10  10   3    766 ± 90 (0.95)      2.7   2058 ± 148
NoisySphere (0.11)  20  12   4    1361 ± 212           2.8   3777 ± 127
NoisySphere (0.08)  40  15   5    2409 ± 120           2.9   7023 ± 173
Ackley              10  10   3    892 ± 28             4.1   3641 ± 154
Ackley              20  12   4    1884 ± 50            3.5   6641 ± 108
Ackley              40  15   5    3690 ± 80            3.3   12084 ± 247
Ellipsoid           10  10   3    1628 ± 95            3.8   6211 ± 264
Ellipsoid           20  12   4    8250 ± 393           2.3   19060 ± 501
Ellipsoid           40  15   5    33602 ± 548          2.1   69642 ± 644
Rastrigin           5   140  70   23293 ± 1374 (0.3)   0.5   12310 ± 1098 (0.75)

(ACM-ES and CMA-ES columns: number of function evaluations; spu = speed-up; values in parentheses: fraction of successful runs.)

Results: Learning Time
The cost of model learning/testing increases quasi-linearly with the dimension d on the Sphere function.
(Figure: learning time in ms vs. problem dimension, log-log scale; slope ≈ 1.13.)

ACM-ES: conclusion
ACM-ES
- From 2 to 4 times faster on uni-modal problems
- Invariance to rank-preserving transformations is preserved
- Computational complexity is O(n)
- Available online at http://www.lri.fr/~ilya/publications/acmesppsn2010.zip
Open issues
- Extension to multi-modal optimization
- On-line adaptation of the selection pressure and of the surrogate model complexity

Contents
1 Covariance Matrix Adaptation-Evolution Strategy
  Evolution Strategies
  CMA-ES
  The state-of-the-art of (Stochastic) Optimization
2 Statistical Machine Learning
  Linear classifiers
  The kernel trick
3 Previous Work
  Mixing Rank-SVM and Local Information
  Experiments
4 Background
  Dominance-based Surrogate
  Experimental Validation

Multi-objective CMA-ES (MO-CMA-ES)
- MO-CMA-ES = µ_mo independent (1+1)-CMA-ES
- Each (1+1)-CMA-ES samples a new offspring; the size of the temporary population is 2 µ_mo
- Only the µ_mo best solutions are kept for the new population, after hypervolume-based non-dominated sorting
- The CMA individuals are then updated
(Figure: dominated points vs. the current Pareto front in objective space.)

A Multi-Objective Surrogate Model: Rationale
Rationale: find a single function F(x) that defines the aggregated quality of a solution x in the multi-objective case.
The idea was originally proposed using a mixture of One-Class SVM and regression SVM:
I. Loshchilov, M. Schoenauer, M. Sebag (GECCO 2010). "A Mono Surrogate for Multiobjective Optimization"
(Figure: dominated points, current Pareto front and new Pareto front in objective space; the surrogate F_SVM takes values p + ε and p − ε, separated by 2ε, on either side of the front.)

Using the Surrogate Model
Filtering
- Generate N_inform pre-children
- For each pre-child A and its nearest parent B, compute Gain(A, B) = F_svm(A) − F_svm(B)
- The new child is the pre-child with the maximum value of Gain
(Figure: true Pareto front vs. the Pareto front of the SVM surrogate in decision space.)
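
A minimal sketch of this filtering step, assuming an already-trained aggregated surrogate f_svm and using Euclidean distance in decision space to find the nearest parent (both assumptions; all names are illustrative).

```python
import numpy as np

def filter_pre_children(pre_children, parents, f_svm):
    """Gain-based filtering (sketch; f_svm is the aggregated surrogate).

    For each pre-child A, find its nearest parent B and compute
    Gain(A, B) = F_svm(A) - F_svm(B); keep the pre-child with maximal gain."""
    gains = []
    for a in pre_children:
        b = parents[np.argmin(np.linalg.norm(parents - a, axis=1))]  # nearest parent
        gains.append(f_svm(a) - f_svm(b))
    return pre_children[int(np.argmax(gains))]

rng = np.random.default_rng(0)
parents = rng.normal(size=(5, 2))
pre_children = parents[rng.integers(0, 5, 10)] + 0.1 * rng.normal(size=(10, 2))
child = filter_pre_children(pre_children, parents,
                            f_svm=lambda x: -np.sum(x**2))   # toy surrogate
print(child)
```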

Dominance-Based Surrogate Using Rank-SVM
Which ordered pairs?
Considering all possible relations may be too expensive.
- Primary constraints: each point x and its nearest dominated point
- Secondary constraints: any two points not belonging to the same front (according to non-dominated sorting)
Use all primary constraints, and a limited number of secondary constraints.
(Figure: points a-f in objective space with primary and secondary ≻-constraints, and the level sets of F_SVM.)
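
A sketch of how the primary constraints could be assembled, assuming "nearest" means nearest in decision space under the Euclidean metric (the talk does not specify this; all names and the toy objectives are illustrative).

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for minimization: a dominates b."""
    return np.all(a <= b) and np.any(a < b)

def primary_constraints(F, X):
    """Primary ordering constraints (sketch): each point is paired with the
    nearest point it dominates (nearest in decision space, an assumption).

    F: objective values (n_points x n_objectives); X: decision vectors."""
    pairs = []
    for i in range(len(F)):
        dominated = [j for j in range(len(F)) if dominates(F[i], F[j])]
        if dominated:
            j = min(dominated, key=lambda j: np.linalg.norm(X[i] - X[j]))
            pairs.append((i, j))          # constraint: x_i preferred to x_j
    return pairs

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (8, 3))
F = np.stack([X[:, 0], 1 - X[:, 0] + X[:, 1]], axis=1)   # toy bi-objective values
print(primary_constraints(F, X))
```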

Dominance-Based Surrogate (2)
Construction of the surrogate model
- Initialize the archive Ω_active as the set of primary constraints, and Ω_passive as the set of secondary constraints.
- Learn the model for 1000 |Ω_active| iterations.
- Add the most violated passive constraint from Ω_passive to Ω_active and optimize the model for 10 |Ω_active| iterations.
- Repeat the last step 0.1 |Ω_active| times.

Experimental Validation: Parameters
Surrogate models
- ASM: aggregated surrogate model based on One-Class SVM and regression SVM
- RASM: the proposed Rank-based SVM
SVM learning
- Number of training points: at most N_training = 1000
- Number of iterations: 1000 |Ω_active| + |Ω_active|² ≤ 2 N_training²
- Kernel function: RBF with σ equal to the average distance between the training points
- Cost of constraint violation: C = 1000
Offspring selection
- Number of pre-children: p = 2 and p = 10

Experimental Validation: Comparative Results
ASM and Rank-based ASM applied on top of NSGA-II (with the hypervolume as secondary criterion) and MO-CMA-ES, on the ZDT and IHR functions.
(Table: N indicates how many more true evaluations a variant needs than the best performer.)

Discussion
Results on the ZDT and IHR problems
- Rank-SVM versions are 1.5 times faster for p = 2, and 2-5 times faster for p = 10, before the algorithm reaches nearly optimal Pareto points
- Premature convergence of the approximation of the optimal µ-distribution: the surrogate model only enforces convergence toward the Pareto front, but does not care about diversity

Dominance-based Surrogate: Conclusion
- The proposed aggregated surrogate model is invariant to rank-preserving transformations of the objective functions.
- The speed-up is significant, but limited to the convergence phase toward the optimal Pareto front.
- Thanks to the flexibility of SVM, "any" kind of preference could be taken into account: extreme points, "=" relations, hypervolume contributions, Decision Maker-defined relations.
(Figure: points a-g in objective space with primary and secondary ≻- and =-constraints, and the level sets of F_SVM.)

Machine Learning for Optimization: Discussion
Learning about the landscape: easy
- using the available samples
- using prior knowledge / constraints
... using it to speed up the search: the exploration vs. exploitation dilemma
Multi-modal optimization: very doable; more difficult