Introduction to Optimization


Blackbox Optimization
Marc Toussaint, U Stuttgart

Blackbox Optimization

The term is not really well defined; I use it to express that only $f(x)$ can be evaluated: $\nabla f(x)$ and $\nabla^2 f(x)$ are not (directly) accessible.

More common terms:

Global optimization
- This usually emphasizes that methods should not get stuck in local optima
- A very interesting domain, with close analogies to (active) Machine Learning, bandits, POMDPs, optimal decision making/planning, and optimal experimental design
- Usually mathematically well-founded methods

Stochastic search, Evolutionary Algorithms, or Local Search
- Usually these are local methods (with extensions trying to be more global)
- Various interesting heuristics
- Some of them (implicitly or explicitly) locally approximate gradients or 2nd-order models

Blackbox Optimization

Problem: Let $x \in \mathbb{R}^n$, $f : \mathbb{R}^n \to \mathbb{R}$; find

$$\min_x f(x)$$

where we can only evaluate $f(x)$ for any $x \in \mathbb{R}^n$.

A constrained version: Let $x \in \mathbb{R}^n$, $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \{0, 1\}$; find

$$\min_x f(x) \quad \text{s.t.} \quad g(x) = 1$$

where we can only evaluate $f(x)$ and $g(x)$ for any $x \in \mathbb{R}^n$.

I haven't seen much work on this constrained version. It would be interesting to consider it more rigorously.
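To ground the sketches that follow, here is a hypothetical blackbox objective in Python: the Rastrigin function, a standard multimodal test function (this specific choice of f is my assumption, not part of the lecture). The later examples only ever call f(x), never its gradient.

```python
import numpy as np

def f(x):
    """Rastrigin function: highly multimodal, global minimum f(0) = 0.
    Used as the assumed blackbox objective in the sketches below."""
    x = np.asarray(x, dtype=float)
    return 10.0 * len(x) + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))
```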

A zoo of approaches

People with many different backgrounds have been drawn into this, ranging from heuristics and Evolutionary Algorithms to heavy mathematics:
- Evolutionary Algorithms, esp. Evolution Strategies, Covariance Matrix Adaptation, Estimation of Distribution Algorithms
- Simulated Annealing, Hill Climbing, Downhill Simplex
- Local modelling (gradient/Hessian), global modelling

Optimizing and Learning

Blackbox optimization is often related to learning:
- When we have a local gradient or Hessian, we can take that local information and run; there is no need to keep track of the history or to learn (exception: BFGS).
- In the blackbox case we have no local information directly accessible; one needs to account for the history in some way or another to have an idea of where to continue the search.
- Accounting for the history very often means learning: learning a local or global model of f itself, learning which steps have been successful recently (gradient estimation), or which step directions, or other heuristics.

Outline

Stochastic Search
- A simple framework that many heuristics and local modelling approaches fit in
- Evolutionary Algorithms, Covariance Matrix Adaptation, EDAs as special cases

Heuristics
- Simulated Annealing
- Hill Climbing
- Downhill Simplex

Global Optimization
- Framing the big problem: the optimal solution to optimization
- Mentioned very briefly: No Free Lunch Theorems, greedy approximations, Kriging-type methods

Stochastic Search

Stochastic Search

The general recipe:
- The algorithm maintains a probability distribution $p_\theta(x)$
- In each iteration it takes $n$ samples $\{x_i\}_{i=1}^n \sim p_\theta(x)$
- Each $x_i$ is evaluated, yielding data $D = \{(x_i, f(x_i))\}_{i=1}^n$
- That data is used to update $\theta$

Stochastic Search:
  Input: initial parameter $\theta$, function $f(x)$, distribution model $p_\theta(x)$, update heuristic $h(\theta, D)$
  Output: final $\theta$ and best point $x$
  1: repeat
  2:   Sample $\{x_i\}_{i=1}^n \sim p_\theta(x)$
  3:   Evaluate samples, $D = \{(x_i, f(x_i))\}_{i=1}^n$
  4:   Update $\theta \leftarrow h(\theta, D)$
  5: until $\theta$ converges
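The recipe translates almost line-for-line into code. A minimal sketch, assuming the caller supplies the distribution model as a pair of functions `sample` and `update` (these names, and stopping after a fixed iteration count rather than on convergence of theta, are my assumptions):

```python
import numpy as np

def stochastic_search(f, theta, sample, update, n=20, iters=100):
    """Generic stochastic search: sample(theta, n) draws n points from
    p_theta(x); update(theta, D) implements the heuristic h(theta, D)."""
    best_x, best_f = None, np.inf
    for _ in range(iters):                 # stand-in for "until theta converges"
        X = sample(theta, n)               # {x_i} ~ p_theta(x)
        D = [(x, f(x)) for x in X]         # evaluate samples
        theta = update(theta, D)           # theta <- h(theta, D)
        x, fx = min(D, key=lambda d: d[1])
        if fx < best_f:
            best_x, best_f = x, fx
    return theta, best_x, best_f
```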

Stochastic Search

The parameter $\theta$ is the only knowledge/information that is propagated between iterations:
- $\theta$ encodes what has been learned from the history
- $\theta$ defines where to search in the future

Examples:
- Evolutionary Algorithms: $\theta$ is a parent population
- Evolution Strategies: $\theta$ defines a Gaussian with mean & variance
- Estimation of Distribution Algorithms: $\theta$ are parameters of some distribution model, e.g. a Bayesian Network
- Simulated Annealing: $\theta$ is the current point and a temperature

Example: Gaussian search distribution, the $(\mu, \lambda)$-ES

From the 1960s/70s (Rechenberg/Schwefel). Perhaps the simplest type of distribution model:
$$\theta = (\hat{x}), \qquad p_\theta(x) = \mathcal{N}(x \mid \hat{x}, \sigma^2)$$
an $n$-dimensional isotropic Gaussian with fixed standard deviation $\sigma$.

Update heuristic:
- Given $D = \{(x_i, f(x_i))\}_{i=1}^\lambda$, select the $\mu$ best: $D' = \mathrm{bestof}_\mu(D)$
- Compute the new mean $\hat{x}$ from $D'$

This algorithm is called Evolution Strategy $(\mu, \lambda)$-ES:
- The Gaussian is meant to represent a "species"
- $\lambda$ offspring are generated; the best $\mu$ are selected
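A minimal sketch of the $(\mu, \lambda)$-ES exactly as described, with fixed isotropic $\sigma$ (the parameter defaults are my assumptions):

```python
import numpy as np

def mu_lambda_es(f, x0, sigma=0.5, mu=5, lam=20, iters=200):
    """(mu, lambda)-ES: theta is just the mean x_hat of an isotropic
    Gaussian with fixed standard deviation sigma."""
    x_hat = np.asarray(x0, dtype=float)
    for _ in range(iters):
        X = x_hat + sigma * np.random.randn(lam, len(x_hat))  # lambda offspring
        fs = np.array([f(x) for x in X])
        elite = X[np.argsort(fs)[:mu]]     # D' = bestof_mu(D), minimization
        x_hat = elite.mean(axis=0)         # new mean from D'
    return x_hat
```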

Example: elitist selection, the $(\mu + \lambda)$-ES

$\theta$ also stores the $\mu$ best previous points:
$$\theta = (\hat{x}, D'), \qquad p_\theta(x) = \mathcal{N}(x \mid \hat{x}, \sigma^2)$$

The $\theta$-update:
- Select the $\mu$ best from $D' \cup D$: $D' \leftarrow \mathrm{bestof}_\mu(D' \cup D)$
- Compute the new mean $\hat{x}$ from $D'$

It is called elitist because good parents can survive.

Consider the $(1+1)$-ES: a Hill Climber.

There is considerable theory on the convergence of, e.g., the $(1 + \lambda)$-ES.
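The elitist variant differs only in the selection pool: parents compete with the offspring. A sketch under the same assumptions as above:

```python
import numpy as np

def mu_plus_lambda_es(f, x0, sigma=0.5, mu=5, lam=20, iters=200):
    """(mu + lambda)-ES: theta = (x_hat, D'); selection is over D' and D
    jointly, so good parents can survive."""
    x_hat = np.asarray(x0, dtype=float)
    parents = x_hat[None, :]               # initial D'
    for _ in range(iters):
        X = x_hat + sigma * np.random.randn(lam, len(x_hat))
        pool = np.vstack([parents, X])     # D' union D
        fs = np.array([f(x) for x in pool])
        parents = pool[np.argsort(fs)[:mu]]  # the mu best survive
        x_hat = parents.mean(axis=0)
    return x_hat
```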

Evolutionary Algorithms (EAs)

These were two simple examples of EAs. Generally, I think EAs are well described/understood as very special ways of parameterizing $p_\theta(x)$ and updating $\theta$:
- $\theta$ typically is a set of good points found so far (the parents)
- Mutation & crossover define $p_\theta(x)$
- The samples $D$ are called offspring
- The $\theta$-update is often a selection of the best, or fitness-proportional, or rank-based

Categories of EAs:
- Evolution Strategies: $x \in \mathbb{R}^n$, often Gaussian $p_\theta(x)$
- Genetic Algorithms: $x \in \{0, 1\}^n$, crossover & mutation define $p_\theta(x)$
- Genetic Programming: $x$ are programs/trees, crossover & mutation
- Estimation of Distribution Algorithms: $\theta$ directly defines $p_\theta(x)$
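To illustrate how mutation and crossover implicitly define $p_\theta(x)$, a minimal Genetic Algorithm sketch on bit strings, with uniform crossover, bit-flip mutation, and truncation selection (all parameter choices are my assumptions, and f here must be a fitness defined on bit vectors, to be minimized, not the continuous Rastrigin above):

```python
import numpy as np

def ga_bits(f, n_bits, pop=100, mu=20, p_mut=0.01, iters=100):
    """Minimal GA on {0,1}^n: theta is the parent set; sampling offspring
    via crossover + mutation plays the role of sampling from p_theta(x)."""
    parents = (np.random.rand(mu, n_bits) < 0.5).astype(int)
    for _ in range(iters):
        a = parents[np.random.randint(mu, size=pop)]   # pick random parents
        b = parents[np.random.randint(mu, size=pop)]
        mask = np.random.rand(pop, n_bits) < 0.5
        X = np.where(mask, a, b)                       # uniform crossover
        X ^= (np.random.rand(pop, n_bits) < p_mut).astype(int)  # mutation
        fs = np.array([f(x) for x in X])
        parents = X[np.argsort(fs)[:mu]]               # truncation selection
    return parents[0]
```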

Covariance Matrix Adaptation (CMA-ES)

An obvious critique of the simple Evolution Strategies:
- The search distribution $\mathcal{N}(x \mid \hat{x}, \sigma^2)$ is isotropic (no "going forward", no preferred direction)
- The variance $\sigma$ is fixed!

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) addresses both points.

Covariance Matrix Adaptation (CMA-ES)

In Covariance Matrix Adaptation,
$$\theta = (\hat{x}, \sigma, C, p_\sigma, p_C), \qquad p_\theta(x) = \mathcal{N}(x \mid \hat{x}, \sigma^2 C)$$
where $C$ is the covariance matrix of the search distribution.

The $\theta$ maintains two more pieces of information: $p_\sigma$ and $p_C$ capture the "path" (motion) of the mean $\hat{x}$ over recent iterations.

Rough outline of the $\theta$-update:
- Let $D' = \mathrm{bestof}_\mu(D)$ be the set of selected points
- Compute the new mean $\hat{x}$ from $D'$
- Update $p_\sigma$ and $p_C$ proportionally to $\hat{x}_{k+1} - \hat{x}_k$
- Update $\sigma$ depending on $p_\sigma$
- Update $C$ depending on $p_C p_C^\top$ (rank-1 update) and $\mathrm{Var}(D')$
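Rather than re-deriving the full update equations, in practice one would typically use Hansen's reference implementation. A usage sketch assuming the `cma` Python package (pip install cma) and the assumed objective f from earlier:

```python
import cma  # Nikolaus Hansen's pycma package, assumed installed

es = cma.CMAEvolutionStrategy([5.0] * 10, 0.5)  # initial mean x_hat, initial sigma
while not es.stop():
    X = es.ask()                    # sample lambda points from N(x_hat, sigma^2 C)
    es.tell(X, [f(x) for x in X])   # selection + update of x_hat, sigma, C, paths
es.result_pretty()                  # prints the best point and its f-value
```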

CMA references

- Hansen, N. (2006): "The CMA evolution strategy: a comparing review".
- Hansen et al.: "Evaluating the CMA Evolution Strategy on Multimodal Test Functions", PPSN 2004. For large enough populations, local minima are avoided.
- A variant: Igel et al.: "A Computational Efficient Covariance Matrix Update and a (1+1)-CMA for Evolution Strategies", GECCO 2006.

CMA conclusions

- It is a good starting point for an off-the-shelf blackbox algorithm
- It includes components like estimating the local gradient ($p_\sigma$, $p_C$), estimating the local Hessian ($\mathrm{Var}(D')$), and smoothing out local minima (large populations)

Estimation of Distribution Algorithms (EDAs)

Generally, EDAs fit the distribution $p_\theta(x)$ to model the distribution of previously good search points.

For instance: if in all previously good points the 3rd bit equals the 7th bit, then the search distribution $p_\theta(x)$ should put higher probability on such candidates. $p_\theta(x)$ is meant to capture the structure in previously good points, i.e. the dependencies/correlations between variables.

A rather successful class of EDAs on discrete spaces uses graphical models to learn the dependencies between variables, e.g. the Bayesian Optimization Algorithm (BOA).

In continuous domains, CMA is an example of an EDA.
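For contrast with BOA's graphical models, the simplest EDA on bit strings fits only independent per-bit marginals (UMDA-style), so it captures no dependencies at all and would miss the "3rd bit equals 7th bit" structure above. A minimal sketch (the parameters and the probability clipping to keep exploration alive are my assumptions):

```python
import numpy as np

def umda_bits(f, n_bits, pop=100, mu=20, iters=50):
    """Univariate marginal EDA: theta is a vector of per-bit probabilities;
    p_theta(x) fully factorizes over the bits."""
    p = np.full(n_bits, 0.5)                        # theta
    for _ in range(iters):
        X = (np.random.rand(pop, n_bits) < p).astype(int)  # sample p_theta
        fs = np.array([f(x) for x in X])
        elite = X[np.argsort(fs)[:mu]]              # mu best (minimization)
        p = elite.mean(axis=0).clip(0.05, 0.95)     # refit marginals
    return p
```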

Further Ideas

- We could learn a distribution over steps: which steps have decreased $f$ recently? Model this. (Related to "differential evolution".)
- We could learn a distribution over directions: only sample one, then do a line search.

Stochastic search conclusions

Stochastic Search:
  Input: initial parameter $\theta$, function $f(x)$, distribution model $p_\theta(x)$, update heuristic $h(\theta, D)$
  Output: final $\theta$ and best point $x$
  1: repeat
  2:   Sample $\{x_i\}_{i=1}^n \sim p_\theta(x)$
  3:   Evaluate samples, $D = \{(x_i, f(x_i))\}_{i=1}^n$
  4:   Update $\theta \leftarrow h(\theta, D)$
  5: until $\theta$ converges

The framework is very general; the crucial difference between algorithms is their choice of $p_\theta(x)$.

Heuristics

- Simulated Annealing
- Hill Climbing
- Downhill Simplex

Simulated Annealing

Must read!: "An Introduction to MCMC for Machine Learning"

Simulated Annealing:
  Input: initial $x$, function $f(x)$, proposal distribution $q(x' \mid x)$
  Output: final $x$
  1: initialize $T = 1$
  2: repeat
  3:   generate a new sample $x' \sim q(x' \mid x)$
  4:   acceptance probability $A = \min\left\{ 1, \frac{e^{-f(x')/T}\, q(x \mid x')}{e^{-f(x)/T}\, q(x' \mid x)} \right\}$
  5:   With probability $A$: $x \leftarrow x'$  // ACCEPT
  6:   Decrease $T$
  7: until $x$ converges

Typically: $q(x' \mid x) = \mathcal{N}(x' \mid x, \sigma^2)$, i.e. Gaussian transition probabilities.
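A minimal sketch of this loop with a symmetric Gaussian proposal, so the $q$-ratio in the acceptance probability cancels (the geometric cooling schedule is my assumption; the slide only says "decrease T"):

```python
import numpy as np

def simulated_annealing(f, x0, sigma=0.5, T0=1.0, cooling=0.999, iters=10000):
    """SA for minimizing f: Metropolis acceptance at temperature T,
    with T slowly decreased each iteration."""
    x = np.asarray(x0, dtype=float)
    fx, T = f(x), T0
    best_x, best_f = x.copy(), fx
    for _ in range(iters):
        x_new = x + sigma * np.random.randn(len(x))    # x' ~ q(x'|x)
        f_new = f(x_new)
        # accept with probability A = min{1, exp(-(f(x') - f(x)) / T)}
        if f_new <= fx or np.random.rand() < np.exp(-(f_new - fx) / T):
            x, fx = x_new, f_new                       # ACCEPT
            if fx < best_f:
                best_x, best_f = x.copy(), fx
        T *= cooling                                   # decrease T
    return best_x, best_f
```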

Simulated Annealing

Simulated Annealing is a Markov chain Monte Carlo (MCMC) method:
- These are iterative methods to sample from a distribution, in our case $p(x) \propto e^{-f(x)/T}$
- For a fixed temperature $T$, one can show that the set of accepted points is distributed as $p(x)$ (but non-i.i.d.!)
- The acceptance probability compares $f(x')$ and $f(x)$, but also the "reversibility" of $q(x' \mid x)$
- When cooling the temperature, samples focus at the extrema
- Guaranteed to sample all extrema eventually

Simulated Annealing

[Wikipedia GIF animation]

Hill Climbing

- Same as Simulated Annealing with $T = 0$
- Same as the $(1+1)$-ES
- There also exists a CMA version of the $(1+1)$-ES; see the Igel reference above.

The role of hill climbing should not be underestimated: very often it is efficient to repeat hill climbing from many random start points. However, there is no learning at all (of stepsize or direction).
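A sketch of exactly this strategy: a (1+1)-style hill climber (SA with T = 0), restarted from random initial points (the sampling range for the restarts is my assumption):

```python
import numpy as np

def hill_climb(f, x0, sigma=0.1, iters=1000):
    """(1+1)-style hill climber: accept only improvements."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(iters):
        x_new = x + sigma * np.random.randn(len(x))
        f_new = f(x_new)
        if f_new < fx:
            x, fx = x_new, f_new
    return x, fx

def random_restarts(f, dim, n_starts=20, scale=5.0):
    """Repeat hill climbing from random start points; keep the best run."""
    runs = [hill_climb(f, scale * np.random.randn(dim)) for _ in range(n_starts)]
    return min(runs, key=lambda r: r[1])
```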

Nelder-Mead method (Downhill Simplex Method)
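The transcript ends with the method's name; as a usage note, SciPy ships an implementation of the downhill simplex method, so a minimal call on the assumed objective f from earlier looks like this:

```python
import numpy as np
from scipy.optimize import minimize

res = minimize(f, x0=np.full(10, 5.0), method='Nelder-Mead')
print(res.x, res.fun)   # best simplex vertex and its f-value
```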