Computational Intelligence Winter Term 2018/19


Computational Intelligence Winter Term 2018/19 Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS 11) Fakultät für Informatik TU Dortmund

Three tasks:
1. Choice of an appropriate problem representation.
2. Choice / design of variation operators acting in the problem representation.
3. Choice of strategy parameters (includes initialization).
ad 1) different schools:
(a) operate on binary representation and define a genotype/phenotype mapping
    + can use standard algorithm
    − mapping may induce unintentional bias in search
(b) no doctrine: use most natural representation
    − must design variation operators for the specific representation
    + if design is done properly then no bias in search 2

ad 1a) genotype-phenotype mapping
original problem f: X → R^d
scenario: no standard algorithm for search space X available
B^n →(g) X →(f) R^d
standard EA performs variation on binary strings b ∈ B^n
fitness evaluation of individual b via (f ∘ g)(b) = f(g(b)), where g: B^n → X is the genotype-phenotype mapping
selection operation independent from representation 3

Genotype-Phenotype-Mapping B^n → [L, R] ⊂ R
standard encoding for b ∈ B^n; problem: Hamming cliffs (example: L = 0, R = 7, n = 3)
genotype:  000 001 010 011 100 101 110 111
phenotype:   0   1   2   3   4   5   6   7
bits flipped between neighboring phenotypes: 1, 2, 1, 3, 1, 2, 1
the 3-bit change between 011 and 100 is a Hamming cliff 4

Genotype-Phenotype-Mapping B^n → [L, R] ⊂ R
Gray encoding for b ∈ B^n: let a ∈ B^n be standard encoded; then b_i = a_i if i = 1, and b_i = a_{i-1} XOR a_i if i > 1
genotype:  000 001 011 010 110 111 101 100
phenotype:   0   1   2   3   4   5   6   7
OK, no Hamming cliffs any longer: small changes in phenotype lead to small changes in genotype.
But since we consider evolution in terms of Darwin (not Lamarck), we need the converse: small changes in genotype should lead to small changes in phenotype!
but: the 1-bit change 000 → 100 in the genotype jumps from phenotype 0 to phenotype 7. 5
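
The following runnable sketch (function names are my own) implements exactly the rule b_1 = a_1, b_i = a_{i-1} XOR a_i together with its inverse; it also makes the last observation concrete: the Gray genotype 100 decodes to phenotype 7.

def standard_to_gray(a):
    # a: list of bits, most significant first, in standard (base-2) encoding
    return [a[0]] + [a[i - 1] ^ a[i] for i in range(1, len(a))]

def gray_to_standard(b):
    # inverse mapping: a_1 = b_1, a_i = a_{i-1} XOR b_i for i > 1
    a = [b[0]]
    for i in range(1, len(b)):
        a.append(a[-1] ^ b[i])
    return a

for value in range(8):                              # phenotypes 0..7 for n = 3
    a = [(value >> k) & 1 for k in (2, 1, 0)]        # standard encoding, MSB first
    print(value, a, standard_to_gray(a))             # reproduces the table above

print(gray_to_standard([1, 0, 0]))                   # [1, 1, 1], i.e. phenotype 7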

Genotype-Phenotype-Mapping B^(n·log(n)) → P_n (example only)
e.g. standard encoding for b ∈ B^n
individual:  010 101 111 000 110 001 101 100
index:         0   1   2   3   4   5   6   7
consider the index and the associated genotype entry as a unit / record / struct; sort the units with respect to genotype value, the old indices yield a permutation:
genotype (sorted):  000 001 010 100 101 101 110 111
old index:            3   5   0   7   1   6   4   2  = permutation 6
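
A minimal runnable rendering of this record-sorting idea (variable names are my own), using the example individual from the slide; each 3-bit block is interpreted as an integer key and ties are broken by the old index:

genotype = ["010", "101", "111", "000", "110", "001", "101", "100"]   # example from the slide
records = sorted((int(g, 2), idx) for idx, g in enumerate(genotype))  # sort by genotype value
permutation = [idx for _, idx in records]                             # old indices in sorted order
print(permutation)                                                    # [3, 5, 0, 7, 1, 6, 4, 2]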

ad 1a) genotype-phenotype mapping
typically required: strong causality
- small changes in an individual lead to small changes in its fitness
- small changes in genotype should lead to small changes in phenotype
but: how to find a genotype-phenotype mapping with that property?
necessary conditions:
1) g: B^n → X can be computed efficiently (otherwise it is senseless)
2) g: B^n → X is surjective (otherwise we might miss the optimal solution)
3) g: B^n → X preserves closeness (otherwise strong causality is endangered)
Let d(·, ·) be a metric on B^n and d_X(·, ·) be a metric on X.
∀ x, y, z ∈ B^n : d(x, y) ≤ d(x, z) ⇒ d_X(g(x), g(y)) ≤ d_X(g(x), g(z)) 7

ad 1b) use most natural representation
typically required: strong causality
- small changes in an individual lead to small changes in its fitness
- need variation operators that obey that requirement
but: how to find variation operators with that property? → need design guidelines... 8

ad 2) design guidelines for variation operators
a) reachability: every x ∈ X should be reachable from an arbitrary x_0 ∈ X after a finite number of repeated variations with positive probability bounded away from 0
b) unbiasedness: unless having gathered knowledge about the problem, the variation operator should not favor particular subsets of solutions; formally: maximum entropy principle
c) control: the variation operator should have parameters affecting the shape of its distribution; known from theory: weaken the variation strength when approaching the optimum 9

ad 2) design guidelines for variation operators in practice
binary search space X = B^n; variation by k-point or uniform crossover and subsequent mutation
a) reachability: regardless of the output of crossover, we can move from x ∈ B^n to y ∈ B^n in one step with positive probability p(x, y), where H(x, y) is the Hamming distance between x and y. Since min{ p(x, y) : x, y ∈ B^n } = δ > 0, we are done. 10
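
The transition probability p(x, y) itself is not reproduced in this transcription. Assuming the standard bit-flip mutation that inverts each of the n bits independently with probability p_m ∈ (0, 1) (an assumption, since the slide does not name the operator explicitly), it reads

\[
p(x, y) \;=\; p_m^{\,H(x,y)}\,(1 - p_m)^{\,n - H(x,y)} \;>\; 0 .
\]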

b) unbiasedness
don't prefer any direction or subset of points without reason → use a maximum entropy distribution for sampling!
properties:
- distributes the probability mass as uniformly as possible
- additional knowledge can be included as constraints: under the given constraints, sample as uniformly as possible 11

Formally: Definition: Let X be a discrete random variable (r.v.) with p_k = P{ X = x_k } for some index set K. The quantity H(X) = − Σ_{k ∈ K} p_k · log p_k is called the entropy of the distribution of X. If X is a continuous r.v. with p.d.f. f_X(·), then the entropy is given by H(X) = − ∫ f_X(x) · log f_X(x) dx. The distribution of a random variable X for which H(X) is maximal is termed a maximum entropy distribution. 12

Excursion: Maximum Entropy Distributions
Knowledge available: discrete distribution with finite support { x_1, x_2, ..., x_n }, x_1 < x_2 < ... < x_n
leads to the nonlinear constrained optimization problem:
H(p) = − Σ_{k=1}^{n} p_k · log p_k → max!   s.t.   Σ_{k=1}^{n} p_k = 1
solution: via Lagrange (find stationary point of Lagrangian function) 13

Excursion: Maximum Entropy Distributions
partial derivatives of the Lagrangian set to zero yield p_1 = p_2 = ... = p_n = 1/n → uniform distribution 14
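
The Lagrangian calculation omitted from the transcription is standard; a reconstruction for the problem above:

\begin{align*}
L(p;\lambda) &= -\sum_{k=1}^{n} p_k \log p_k + \lambda\Bigl(\sum_{k=1}^{n} p_k - 1\Bigr), \\
\frac{\partial L}{\partial p_k} &= -\log p_k - 1 + \lambda \overset{!}{=} 0
\;\Longrightarrow\; p_k = e^{\lambda-1} \quad\text{(the same value for every } k\text{)}, \\
\sum_{k=1}^{n} p_k = 1 &\;\Longrightarrow\; p_k = \frac{1}{n}.
\end{align*}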

Excursion: Maximum Entropy Distributions
Knowledge available: discrete distribution with support { 1, 2, ..., n }, p_k = P{ X = k } and E[ X ] = ν
leads to the nonlinear constrained optimization problem:
H(p) = − Σ_{k=1}^{n} p_k · log p_k → max!   s.t.   Σ_{k=1}^{n} p_k = 1   and   Σ_{k=1}^{n} k · p_k = ν
solution: via Lagrange (find stationary point of Lagrangian function) 15

Excursion: Maximum Entropy Distributions
partial derivatives of the Lagrangian yield p_k ∝ q^k for some q > 0 (*) (continued on next slide) 16

Excursion: Maximum Entropy Distributions
after normalization, p_k = q^k / (q + q² + ... + q^n): the discrete Boltzmann distribution; the value of q depends on ν via the third condition Σ_k k · p_k = ν (*) 17
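
A small numerical sketch (all names are mine) of how q can be determined from ν: the mean of the Boltzmann distribution p_k ∝ q^k is increasing in q, so simple bisection suffices.

import numpy as np

def boltzmann_pmf(q, n):
    # p_k proportional to q^k on the support {1, ..., n}
    w = q ** np.arange(1, n + 1)
    return w / w.sum()

def solve_q(nu, n, lo=1e-9, hi=1e9, iters=100):
    # bisect on q: the mean of the Boltzmann distribution is increasing in q
    k = np.arange(1, n + 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if k @ boltzmann_pmf(mid, n) < nu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 9
for nu in (2, 5, 8):
    q = solve_q(nu, n)
    print(nu, round(q, 4), np.round(boltzmann_pmf(q, n), 3))
# nu = 5 yields q = 1, i.e. the uniform distribution on {1, ..., 9}, as noted on the next slide.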

Excursion: Maximum Entropy Distributions
Plot: Boltzmann distributions (n = 9) for ν = 2, 3, ..., 8; the distribution specializes to the uniform distribution if ν = 5 (as expected). 18

Excursion: Maximum Entropy Distributions
Knowledge available: discrete distribution with support { 1, 2, ..., n }, E[ X ] = ν and V[ X ] = η²
leads to the nonlinear constrained optimization problem:
H(p) = − Σ_{k=1}^{n} p_k · log p_k → max!   s.t.   Σ_k p_k = 1,   Σ_k k · p_k = ν   and   Σ_k (k − ν)² · p_k = η²
solution: in principle via Lagrange (find stationary point of Lagrangian function), but very complicated analytically, if possible at all → consider special cases only
note: the constraints are linear equations in the p_k 19

Excursion: Maximum Entropy Distributions
Special case: n = 3 with E[ X ] = 2 and V[ X ] = η²
The three linear constraints I–III uniquely determine the distribution (a reconstruction of the system is given below); depending on η² the result is unimodal, uniform or bimodal. 20
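
A reconstruction (in my own notation) of the linear system and its solution; the algebra is easy to verify:

\begin{align*}
\text{I.}\quad & p_1 + p_2 + p_3 = 1 \\
\text{II.}\quad & p_1 + 2p_2 + 3p_3 = 2 \\
\text{III.}\quad & (1-2)^2 p_1 + (2-2)^2 p_2 + (3-2)^2 p_3 = p_1 + p_3 = \eta^2 \\[2pt]
\text{II} - \text{I:}\quad & p_2 + 2p_3 = 1 \\
\text{I} - \text{III:}\quad & p_2 = 1 - \eta^2 \\
\text{insertion:}\quad & p_3 = \tfrac{\eta^2}{2} = p_1 .
\end{align*}

Hence the distribution is unimodal for η² < 2/3, uniform for η² = 2/3 (p_1 = p_2 = p_3 = 1/3), and bimodal for η² > 2/3.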

Excursion: Maximum Entropy Distributions
Knowledge available: discrete distribution with unbounded support { 0, 1, 2, ... } and E[ X ] = ν
leads to an infinite-dimensional nonlinear constrained optimization problem:
H(p) = − Σ_{k≥0} p_k · log p_k → max!   s.t.   Σ_{k≥0} p_k = 1   and   Σ_{k≥0} k · p_k = ν
solution: via Lagrange (find stationary point of Lagrangian function) 21

Excursion: Maximum Entropy Distributions
partial derivatives of the Lagrangian again yield p_k ∝ q^k for some q ∈ (0, 1) (*) (continued on next slide) 22

Excursion: Maximum Entropy Distributions
set q := exp(λ_2) and insist that q ∈ (0, 1); normalization then forces p_k = (1 − q) · q^k for k = 0, 1, 2, ... → geometrical distribution
it remains to specify q; to proceed recall that Σ_{k≥0} k · q^k = q / (1 − q)² 23

Excursion: Maximum Entropy Distributions
the value of q depends on ν via the third condition E[ X ] = ν (*) 24
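
The closed form behind (*) is standard and easy to reconstruct from p_k = (1 − q)·q^k:

\[
\mathrm{E}[X] \;=\; \sum_{k \ge 0} k\,(1-q)\,q^{k} \;=\; \frac{q}{1-q} \;\overset{!}{=}\; \nu
\quad\Longrightarrow\quad
q \;=\; \frac{\nu}{1+\nu}.
\]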

Excursion: Maximum Entropy Distributions
Plot: geometrical distributions with E[ X ] = ν for ν = 1, 2, ..., 7; p_k shown only for k = 0, 1, ..., 8. 25

Excursion: Maximum Entropy Distributions
Overview:
support { 1, 2, ..., n }: discrete uniform distribution; additionally require E[X] = θ: Boltzmann distribution; additionally require V[X] = η²: N.N. (not the Binomial distribution)
support N: not defined!; require E[X] = θ: geometrical distribution; additionally require V[X] = η²: ?
support Z: not defined!; require E[|X|] = θ: bi-geometrical distribution (discrete Laplace distr.); additionally require E[X²] = η²: N.N. (discrete Gaussian distr.) 26

Excursion: Maximum Entropy Distributions
support [a, b] ⊂ R: uniform distribution
support R_+ with E[X] = θ: exponential distribution
support R with E[X] = θ and V[X] = η²: normal / Gaussian distribution N(θ, η²)
support R^n with E[X] = θ and Cov[X] = C: multinormal distribution N(θ, C) with expectation vector θ ∈ R^n and covariance matrix C ∈ R^{n,n}, positive definite: ∀ x ≠ 0 : x'Cx > 0 27

Excursion: Maximum Entropy Distributions
for permutation distributions? → uniform distribution on all possible permutations
set v[j] = j for j = 1, 2, ..., n
for i = n to 1 step -1
    draw k uniformly at random from { 1, 2, ..., i }
    swap v[i] and v[k]
endfor
generates a permutation uniformly at random in Θ(n) time
Guideline: only if you know something about the problem a priori, or if you have learnt something about the problem during the search, include that knowledge in the search / mutation distribution (via constraints!) 28
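
A direct, runnable rendering of the pseudocode above (0-based indices, as usual in Python; random.randint includes both endpoints):

import random

def random_permutation(n):
    v = list(range(n))
    for i in range(n - 1, 0, -1):
        k = random.randint(0, i)        # uniform on { 0, 1, ..., i }
        v[i], v[k] = v[k], v[i]         # swap v[i] and v[k]
    return v                            # uniform over all n! permutations, Θ(n) time

print(random_permutation(8))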

ad 2) design guidelines for variation operators in practice
integer search space X = Z^n; requirements: a) reachability, b) unbiasedness, c) control
- every recombination results in some z ∈ Z^n
- mutation of z may then lead to any z* ∈ Z^n with positive probability in one step
ad a) support of mutation should be Z^n
ad b) need maximum entropy distribution over support Z^n
ad c) control variability by a parameter; formulate as constraint of the maximum entropy distribution 29

ad 2) design guidelines for variation operators in practice, X = Z^n
task: find a (symmetric) maximum entropy distribution over Z with E[ |Z| ] = θ > 0
need analytic solution of a 1-dimensional, nonlinear optimization problem with constraints:
H(p) = − Σ_{k ∈ Z} p_k · log p_k → max!
s.t.  p_k = p_{−k} for all k ∈ Z  (symmetry w.r.t. 0)
      Σ_{k ∈ Z} p_k = 1  (normalization)
      Σ_{k ∈ Z} |k| · p_k = θ  (control spread)
      p_k ≥ 0 for all k ∈ Z  (nonnegativity) 30

result: a random variable Z with support Z and probability distribution P{ Z = k } = (1 − q)/(1 + q) · q^{|k|} for k ∈ Z; it is symmetric w.r.t. 0, unimodal, its spread is manageable by q, and it has maximum entropy.
generation of pseudo random numbers: Z = G_1 − G_2, where G_1 and G_2 are stochastically independent, identically geometrically distributed random variables. 31
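
A small sketch of the pseudo random number generation just described. It assumes, as stated above, two independent, identically distributed geometric variables on { 0, 1, 2, ... } with P{ G = k } = (1 − q)·q^k; the inverse-transform helper and all names are my own.

import math, random

def geometric(q):
    # number of failures before the first success (success probability 1 - q)
    u = random.random()                               # u in [0, 1)
    return int(math.floor(math.log(1.0 - u) / math.log(q)))

def integer_mutation_step(q):
    return geometric(q) - geometric(q)                # Z = G1 - G2, symmetric about 0

q = 0.7                                               # larger q -> larger mean step size
print([integer_mutation_step(q) for _ in range(10)])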

Plot: probability distributions of Z for different mean step sizes E[ |Z| ] = θ. 32

Plot: probability distributions of Z for different mean step sizes E[ |Z| ] = θ (continued). 33

How to control the spread? We must be able to adapt q ∈ (0, 1) for generating Z with variable E[ |Z| ] = θ!
self-adaptation of q in the open interval (0, 1)? rather: make the mean step size adjustable!
θ ∈ R_+ adjustable by mutative self-adaptation, then get q ∈ (0, 1) from θ, like the mutative step size control of σ in an EA with search space R^n! 34
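
One way to make the mapping θ ↦ q explicit, assuming the distribution P{ Z = k } = (1 − q)/(1 + q)·q^{|k|} stated above (a reconstruction, but easily verified):

\[
\mathrm{E}|Z| \;=\; 2\sum_{k \ge 1} k\,\frac{1-q}{1+q}\,q^{k} \;=\; \frac{2q}{1-q^{2}} \;\overset{!}{=}\; \theta
\quad\Longrightarrow\quad
q \;=\; \frac{\sqrt{1+\theta^{2}}-1}{\theta} \;=\; \frac{\theta}{1+\sqrt{1+\theta^{2}}} \;\in\; (0, 1).
\]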

n-dimensional generalization: random vector Z = (Z_1, Z_2, ..., Z_n) with Z_i = G_{1,i} − G_{2,i} (stochastically independent); the parameter q is equal for all G_{1,i}, G_{2,i}. (Plot: example for n = 2.) 35

n-dimensional generalization: the n-dimensional distribution is symmetric w.r.t. the 1-norm! All random vectors with the same step length ||z||_1 have the same probability! 36

How to control E[ ||Z||_1 ]?
by definition ||Z||_1 = |Z_1| + ... + |Z_n|; by linearity of E[ · ] and the identical distributions of the Z_i: E[ ||Z||_1 ] = n · E[ |Z_1| ] = θ
→ self-adaptation of θ, then calculate q from the per-coordinate mean θ / n as above 37

Algorithm: (Rudolph, PPSN 1994) 38
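
The algorithm listing itself is not reproduced in this transcription. The sketch below shows, with my own naming and without claiming to match the cited paper exactly, how the pieces discussed above can be combined into one mutation step: log-normal self-adaptation of the mean step size θ, derivation of q from θ, and coordinate-wise mutation by differences of geometric variates.

import math, random

def mutate(x, theta, tau=None):
    # hypothetical sketch: mutate integer vector x; theta = desired mean total step E[||Z||_1]
    n = len(x)
    tau = tau if tau is not None else 1.0 / math.sqrt(n)
    theta = theta * math.exp(tau * random.gauss(0.0, 1.0))   # mutative self-adaptation of theta
    m = max(theta / n, 1e-12)                                 # desired E|Z_i| per coordinate
    q = m / (1.0 + math.sqrt(1.0 + m * m))                    # q from theta (see derivation above)
    def geom():                                               # geometric variate on {0, 1, 2, ...}
        return int(math.floor(math.log(1.0 - random.random()) / math.log(q)))
    z = [geom() - geom() for _ in range(n)]                   # Z_i = G_{1,i} - G_{2,i}
    return [xi + zi for xi, zi in zip(x, z)], theta

offspring, new_theta = mutate([0, 0, 0, 0, 0], theta=5.0)
print(offspring, round(new_theta, 3))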

Excursion: Maximum Entropy Distributions
ad 2) design guidelines for variation operators in practice: continuous search space X = R^n
a) reachability, b) unbiasedness, c) control → leads to the CMA-ES! 39