Estimation-of-Distribution Algorithms. Discrete Domain.
Petr Pošík

Contents:
- Introduction to EDAs: Genetic Algorithms and Epistasis; Genetic Algorithms; GA vs EDA; Content of the lectures
- Motivation Example: Example; Selection, Modeling, Sampling; UMDA behaviour for the OneMax problem; What about a different fitness?; UMDA behaviour on concatenated traps; What can be done about traps?; Good news!
- Discrete EDAs: Overview; EDAs without interactions
- Pairwise Interactions: From single bits to pairwise models; Example with pairwise dependencies: dependency tree; Example of dependency tree learning; Dependency tree: probabilities; EDAs with pairwise interactions; Summary
- Multivariate Interactions: ECGA; ECGA: Evaluation metric; BOA; BOA: Learning the structure
- Scalability Analysis: Test functions; Scalability analysis; OneMax; Non-decomposable Equal Pairs; Decomposable Equal Pairs; Non-decomposable Sliding XOR; Decomposable Sliding XOR; Decomposable Trap; Model structure during evolution
- Conclusions: Summary; Suggestions for discrete EDAs
Introduction to EDAs

Genetic Algorithms and Epistasis

From the lecture on epistasis: x_best = 1...1, f(x_best) = 40.

- GA works: no dependencies exist (f_40x1bitOneMax).
- GA fails: dependencies exist, and the GA is not able to work with them (f_8x5bitTrap).
- GA works again: dependencies exist, but the GA knows about them (f_8x5bitTrap).

[Figures: best and average fitness, and the mean of x, over generations for several population sizes, on f_40x1bitOneMax and f_8x5bitTrap.]

P. Pošík © 2014, A0M33EOA: Evolutionary Optimization Algorithms

Genetic Algorithms

Algorithm 1: Genetic Algorithm
1 begin
2     Initialize the population.
3     while termination criteria are not met do
4         Select parents from the population.
5         Cross over the parents, create offspring.
6         Mutate offspring.
7         Incorporate offspring into the population.

Conventional GA operators are not adaptive, and cannot (or usually do not) discover and use the interactions among solution components. This is the select-crossover-mutate approach.

What does an interaction mean?
- We would like to create a new offspring by mutation.
- We would like the offspring to have better, or at least the same, quality as the parent.
- If we must modify x_i together with x_j to reach the desired goal (if it is not possible to improve the solution by modifying either x_i or x_j alone), then x_i interacts with x_j.

The goal of recombination operators: intensify the search in areas which contained good individuals in previous iterations. To do so, they must be able to take the interactions into account. Why not directly describe the distribution of good individuals?
GA vs EDA

Algorithm 1: Genetic Algorithm
1 begin
2     Initialize the population.
3     while termination criteria are not met do
4         Select parents from the population.
5         Cross over the parents, create offspring.
6         Mutate offspring.
7         Incorporate offspring into the population.

The select-crossover-mutate approach.

Algorithm 2: Estimation-of-Distribution Algorithm
1 begin
2     Initialize the population.
3     while termination criteria are not met do
4         Select parents from the population.
5         Learn a model of their distribution.
6         Sample new individuals.
7         Incorporate offspring into the population.

The select-model-sample approach.

Explicit probabilistic model:
- a principled way of working with dependencies
- adaptation ability (different behavior in different stages of evolution)

Names:
- EDA: Estimation-of-Distribution Algorithm
- PMBGA: Probabilistic Model-Building Genetic Algorithm
- IDEA: Iterated Density Estimation Algorithm

Content of the lectures

1. EDA for discrete domains (e.g. binary): motivation example; without interactions; pairwise interactions; higher-order interactions.
2. EDA for the real domain (vectors of real numbers): evolution strategies; histograms; Gaussian distribution and its mixtures.
Motivation Example

Example

5-bit OneMax (CountOnes) problem:

f_Dx1bitOneMax(x) = Σ_{d=1..D} x_d

Optimum: 11111, fitness: 5.

Algorithm: Univariate Marginal Distribution Algorithm (UMDA)
- Population size: 6
- Tournament selection: t = 2
- Model: vector of probabilities p = (p_1, ..., p_D); each p_d is the probability of observing a 1 at the d-th element.
- Model learning: compute p from the selected individuals.
- Model sampling: generate a 1 at the d-th position with probability p_d (independently of the other positions).

Selection, Modeling, Sampling

- Old population: six 5-bit individuals with fitnesses (number of 1s) 3, 1, 4, 4, 1, 2.
- Tournaments (the fitter individual wins): (4) vs. (4), (4) vs. (4), (4) vs. (1), (2) vs. (1), (1) vs. (1), (1) vs. (3).
- Selected parents: fitnesses 4, 4, 4, 2, 1, 3.
- Offspring sampled from the model: fitnesses 2, 2, 4, 3, 3, 4.
- Probability vector p: computed position-wise from the selected parents.
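The selection → modeling → sampling cycle above can be sketched as a complete UMDA in Python. This is a minimal illustration, not the lecture's code; the population size and generation count are my own choices so that a run converges quickly.

```python
import random

def onemax(x):
    # Fitness = the number of 1s in the bit string.
    return sum(x)

def umda(fitness, dim=5, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Tournament selection with t = 2.
        parents = []
        for _ in range(pop_size):
            a, b = rng.choice(pop), rng.choice(pop)
            parents.append(a if fitness(a) >= fitness(b) else b)
        # Model learning: marginal probability of a 1 at each position.
        p = [sum(x[d] for x in parents) / pop_size for d in range(dim)]
        # Model sampling: each position is generated independently.
        pop = [[1 if rng.random() < p[d] else 0 for d in range(dim)]
               for _ in range(pop_size)]
        best = max(pop + [best], key=fitness)
    return best
```

On OneMax the marginals quickly drift towards 1, and the best-so-far individual reaches the optimum 11111.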
UMDA behaviour for the OneMax problem

[Figure: probability vector entries over generations.]

- 1s are better than 0s on average, so selection increases the proportion of 1s.
- Recombination preserves and combines 1s; the ratio of 1s increases over time.
- If we have many 1s in the population, we cannot miss the optimum.

The number of evaluations needed for reliable convergence:

- UMDA: O(D ln D)
- Hill-Climber: O(D ln D)
- GA with uniform xover: approx. O(D ln D)
- GA with 1-point xover: a bit slower

UMDA behaves similarly to a GA with uniform crossover!

What about a different fitness?

For the OneMax function, UMDA works well; all the bits eventually converge to the right value. Will UMDA be similarly successful for other fitness functions? Well, no. :-(

Problem: concatenated 5-bit traps

f = f_trap(x_1, x_2, x_3, x_4, x_5) + f_trap(x_6, x_7, x_8, x_9, x_10)

The trap function is defined as

f_trap(x) = 5 if u(x) = 5,
            4 − u(x) otherwise,

where u(x) is the so-called unitation function and returns the number of 1s in x (it is actually the OneMax function).

[Figure: trap(x) as a function of the number of 1s in the chromosome, u(x).]
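The concatenated 5-bit trap can be written out directly (a small sketch; the function names are mine):

```python
def u(x):
    # Unitation: the number of 1s in the block.
    return sum(x)

def trap5(x):
    # f_trap: 5 at the optimum 11111, otherwise 4 - u(x),
    # so adding 1s *decreases* fitness everywhere except at the optimum.
    return 5 if u(x) == 5 else 4 - u(x)

def concatenated_traps(x):
    # f = f_trap(x1..x5) + f_trap(x6..x10) for a 10-bit chromosome.
    return trap5(x[0:5]) + trap5(x[5:10])
```

Note that the all-zeros string scores 4 per block while the optimum scores 5, which is what makes the function deceptive for bit-wise statistics.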
UMDA behaviour on concatenated traps

- Traps: optimum in 11111...
- But the schema averages are f̄_trap(0****) = 2 while f̄_trap(1****) = 1.375.
- 1-dimensional probabilities lead the GA the wrong way!
- An exponentially increasing population size is needed, otherwise the GA will not find the optimum reliably.

[Figure: probability vector entries over generations.]

What can be done about traps?

The f_trap function is deceptive:
- Statistics over 1**** and 0**** do not lead us to the right solution.
- The same holds for statistics over 11*** and 00***, 111** and 000**, 1111* and 0000*.
- This is harder than the needle-in-the-haystack problem: a regular haystack simply does not provide any information about where to search for the needle; the f_trap haystack actively lies to you, pointing you to the wrong part of the haystack.

But: f_trap(00000) < f_trap(11111), so 11111 will be better than 00000 on average. 5-bit statistics should work for 5-bit traps in the same way as 1-bit statistics work for the OneMax problem!

Model learning: build a model for each 5-tuple of bits; compute p(00000), p(00001), ..., p(11111).
Model sampling: each 5-tuple of bits is generated independently; generate 00000 with probability p(00000), 00001 with probability p(00001), ...
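The schema averages quoted above can be checked by brute force over the 16 strings matching each schema (a quick sketch; `schema_average` is an illustrative helper name):

```python
from itertools import product

def trap5(x):
    # 5-bit trap: 5 at 11111, otherwise 4 - (number of 1s).
    ones = sum(x)
    return 5 if ones == 5 else 4 - ones

def schema_average(first_bit):
    # Average trap fitness over all 5-bit strings whose first bit is fixed,
    # i.e. over the schema 0**** or 1****.
    vals = [trap5((first_bit,) + rest) for rest in product((0, 1), repeat=4)]
    return sum(vals) / len(vals)
```

This reproduces f̄_trap(0****) = 2.0 and f̄_trap(1****) = 1.375: a model built from 1-bit marginals prefers the wrong bit value.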
Good news!

Good statistics work great!

[Figure: relative frequency of 11111 over generations.]

What shall we do next? If we were able to find good statistics with a small overhead, and use them in the UMDA framework, we would be able to solve order-k separable problems using O(D^2) evaluations... and there are many problems of this type. The solution of this problem is closely related to so-called linkage learning, i.e. discovering and using statistical dependencies among variables.

- UMDA with 5-bit BBs: O(D ln D) (WOW!)
- Hill-Climber: O(D^k ln D), k = 5
- GA with uniform xover: approx. O(2^D)
- GA with 1-point xover: similar to uniform xover

Discrete EDAs

Discrete EDAs: Overview

1. Overview:
   (a) Univariate models (without interactions)
   (b) Bivariate models (pairwise dependencies)
   (c) Multivariate models (higher-order interactions)
2. Conclusions
EDAs without interactions

1. Population-based incremental learning (PBIL), Baluja, 1994
2. Univariate marginal distribution algorithm (UMDA), Mühlenbein and Paaß, 1996
3. Compact genetic algorithm (cGA), Harik, Lobo, Goldberg, 1998

Similarities: all of them use a vector of probabilities.

Differences:
- PBIL and cGA do not use a population (only the vector p); UMDA does.
- PBIL and cGA use different rules for the adaptation of p.

Advantages: simplicity; speed; simple simulation of large populations.
Limitations: reliably solves only order-1 decomposable problems.

EDAs with Pairwise Interactions

From single bits to pairwise models

How to describe two positions together?

Using the joint probability distribution (number of free parameters: 3):

p(A, B):   B = 0     B = 1
A = 0      p(0, 0)   p(0, 1)
A = 1      p(1, 0)   p(1, 1)

Using statistical dependence (number of free parameters: 3):

p(A, B) = p(B|A) p(A), parameterized by p(B = 1 | A = 0), p(B = 1 | A = 1), p(A = 1).

Question: what is the number of parameters in the case of the following models over A, B, C? [Three graph structures over A, B, C are shown in the slides.]
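The two parameterizations above carry the same information with three free numbers each, which can be verified in a few lines. The probability values here are made up purely for the illustration:

```python
# Joint table: four entries that sum to 1, hence 3 free parameters.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Equivalent factored form p(A, B) = p(B|A) * p(A): also 3 numbers.
p_a1 = joint[(1, 0)] + joint[(1, 1)]                             # p(A=1)
p_b1_given_a0 = joint[(0, 1)] / (joint[(0, 0)] + joint[(0, 1)])  # p(B=1|A=0)
p_b1_given_a1 = joint[(1, 1)] / p_a1                             # p(B=1|A=1)

def factored(a, b):
    # Rebuild the joint probability from the three conditional parameters.
    pa = p_a1 if a == 1 else 1 - p_a1
    pb1 = p_b1_given_a1 if a == 1 else p_b1_given_a0
    return pa * (pb1 if b == 1 else 1 - pb1)
```

The factored form reproduces every entry of the joint table exactly.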
Example with pairwise dependencies: dependency tree

- Nodes: binary variables (loci of the chromosome)
- Edges: dependencies among variables

Features:
- Each node depends on at most one other node.
- The graph does not contain cycles.
- The graph is connected.

Learning the structure of a dependency tree:
1. Score the edges using mutual information:

   I(X, Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

2. Use any algorithm to determine the maximum spanning tree of the graph, e.g. Prim (1957):
   (a) Start building the tree from any node.
   (b) Repeatedly add the node that is connected to the tree by the edge with the maximum score.

Example of dependency tree learning

[Figure: successive steps of building the maximum spanning tree over the variables x_1, ..., x_7.]
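The two learning steps above can be sketched as follows. The probabilities in the mutual-information formula are estimated from a sample of bit strings, and the Prim implementation is a minimal O(n^2) sketch with my own function names:

```python
import math

def mutual_information(xs, ys):
    # I(X, Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ),
    # with all probabilities estimated as relative frequencies.
    n = len(xs)
    mi = 0.0
    for x in (0, 1):
        px = xs.count(x) / n
        for y in (0, 1):
            py = ys.count(y) / n
            pxy = sum(1 for a, b in zip(xs, ys) if (a, b) == (x, y)) / n
            if pxy > 0:
                mi += pxy * math.log(pxy / (px * py))
    return mi

def maximum_spanning_tree(nodes, score):
    # Prim's algorithm: grow the tree from an arbitrary node, always adding
    # the outside node attached by the highest-scoring edge.
    in_tree = [nodes[0]]
    edges = []
    while len(in_tree) < len(nodes):
        s, t = max(((s, t) for s in in_tree for t in nodes if t not in in_tree),
                   key=lambda e: score(*e))
        edges.append((s, t))
        in_tree.append(t)
    return edges
```

For example, if variable 1 is a copy of variable 0 and variable 2 is independent of both, the mutual information I(X_0, X_1) is high, I(X_0, X_2) is zero, and the edge (0, 1) ends up in the tree.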
Dependency tree: probabilities

[Figure: the dependency tree learned in the previous example.]

The root variable needs 1 parameter, p(X_root = 1). Every other node needs 2 parameters, p(X = 1 | X_parent = 0) and p(X = 1 | X_parent = 1). For the 5-variable tree of the example (with x_4 conditioned on its parent, x_2 conditioned on X_4, and x_3 conditioned on X_2), the whole model has 1 + 4 × 2 = 9 parameters.

EDAs with pairwise interactions

1. MIMIC (sequences): Mutual Information Maximization for Input Clustering, de Bonet et al., 1997
2. COMIT (trees): Combining Optimizers with Mutual Information Trees, Baluja and Davies, 1997
3. BMDA (forests): Bivariate Marginal Distribution Algorithm, Pelikan and Mühlenbein, 1998
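Sampling from a dependency tree is ancestral sampling: draw the root, then each child conditioned on its already-drawn parent. A sketch, assuming an illustrative tree structure and made-up probability values; only the 1 + 4 × 2 = 9 parameter count comes from the slide:

```python
import random

# Illustrative tree over five binary variables:
# x5 is the root; x4 depends on x5; x1 and x2 depend on x4; x3 depends on x2.
parent = {4: 5, 1: 4, 2: 4, 3: 2}
p_root = 0.6                 # p(x5 = 1): 1 parameter
p_cond = {                   # (p(x=1 | parent=0), p(x=1 | parent=1)): 2 each
    4: (0.2, 0.9),
    1: (0.3, 0.7),
    2: (0.1, 0.8),
    3: (0.5, 0.5),
}
# Total: 1 + 4 * 2 = 9 parameters, matching the slide.

def sample(rng):
    # Ancestral sampling: the root first, then children after their parents.
    x = {5: int(rng.random() < p_root)}
    for child in (4, 1, 2, 3):
        p_one = p_cond[child][x[parent[child]]]
        x[child] = int(rng.random() < p_one)
    return x
```

Over many samples the relative frequency of x5 = 1 approaches p_root, and each child matches its conditional table.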
Summary

Advantages: still simple; still fast; can learn something about the structure.
Limitations: reliably solves only order-2 decomposable problems.

EDAs with Multivariate Interactions

ECGA

Extended Compact GA, Harik, 1999

Marginal Product Model (MPM):
- Variables are treated in groups.
- Variables in different groups are considered statistically independent.
- Each group is modeled by its joint probability distribution.
- The algorithm adaptively searches for the groups during evolution.

Ideal group configurations:
- OneMax: [1][2][3][4][5][6][7][8][9][10]
- 5-bit traps: [1 2 3 4 5][6 7 8 9 10]

Learning the structure:
1. Evaluation metric: Minimum Description Length (MDL)
2. Search procedure: greedy
   (a) Start with each variable belonging to its own group.
   (b) Perform the join of two groups which improves the score most.
   (c) Finish if no join improves the score.
ECGA: Evaluation metric

Minimum description length: minimize the number of bits needed to store the model and the data encoded using the model:

DL(Model, Data) = DL_Model + DL_Data

Model description length: each group g has |g| dimensions, i.e. 2^|g| − 1 free frequencies, each of which can take on values up to N:

DL_Model = log N · Σ_{g∈G} (2^|g| − 1)

Data description length using the model, defined using the entropy of the marginal distributions (X_g is a |g|-dimensional random vector, x_g is its realization):

DL_Data = N Σ_{g∈G} h(X_g) = −N Σ_{g∈G} Σ_{x_g} p(X_g = x_g) log p(X_g = x_g)

BOA

Bayesian Optimization Algorithm: Pelikan, Goldberg, Cantú-Paz, 1999

- Bayesian network (BN)
- Conditional dependencies (instead of groups)
- Sequences, trees, and forests are special cases of BNs

[Figure: the BN structure learned for the trap function.]

The same model was used independently in:
- Estimation of Bayesian Network Algorithm (EBNA), Etxeberria et al., 1999
- Learning Factorized Distribution Algorithm (LFDA), Mühlenbein et al., 1999
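The MDL score above can be computed from a population in a few lines. This is a sketch: the exact constant (log N vs. log(N + 1)) and the logarithm base vary between ECGA descriptions, so treat the absolute values as illustrative:

```python
import math
from collections import Counter

def mdl(population, groups):
    # DL(Model, Data) = DL_Model + DL_Data for a marginal product model.
    n = len(population)
    # Model part: each group g stores 2^|g| - 1 frequencies,
    # each representable in about log2(N) bits.
    dl_model = math.log2(n) * sum(2 ** len(g) - 1 for g in groups)
    # Data part: N times the entropy of each group's marginal distribution.
    dl_data = 0.0
    for g in groups:
        counts = Counter(tuple(x[i] for i in g) for x in population)
        dl_data -= n * sum((c / n) * math.log2(c / n) for c in counts.values())
    return dl_model + dl_data
```

On a population in which bits 0 and 1 are always equal and bit 2 is independent, the linkage-respecting grouping [0 1][2] gets a lower (better) score than both the fully split grouping [0][1][2] and the over-merged grouping [0 1 2], which is exactly what the greedy join search exploits.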
BOA: Learning the structure

1. Evaluation metric: the Bayesian-Dirichlet metric, or the Bayesian information criterion (BIC).
2. Search procedure: greedy
   (a) Start with a graph with no edges (the univariate marginal product model).
   (b) Perform the operation which improves the score most: add an edge, delete an edge, or reverse an edge.
   (c) Finish if no operation improves the score.

BOA solves order-k decomposable problems in less than O(D^2) evaluations! n_evals = O(D^1.55) to O(D^2).

Scalability Analysis

Test functions

OneMax:
f_Dx1bitOneMax(x) = Σ_{d=1..D} x_d

Trap:
f_DbitTrap(x) = D if u(x) = D, and D − 1 − u(x) otherwise.

Equal Pairs:
f_DbitEqualPairs(x) = 1 + Σ_{d=2..D} f_EqualPair(x_{d−1}, x_d), where
f_EqualPair(x_1, x_2) = 1 if x_1 = x_2, and 0 if x_1 ≠ x_2.

Sliding XOR:
f_DbitSlidingXOR(x) = 1 + f_AllEqual(x) + Σ_{d=3..D} f_XOR(x_{d−2}, x_{d−1}, x_d), where
f_AllEqual(x) = 1 if x = (1, ..., 1) or x = (0, ..., 0), and 0 otherwise,
f_XOR(x_1, x_2, x_3) = 1 if x_1 XOR x_2 = x_3, and 0 otherwise.

Concatenated short basis functions:
f_NxKbitBasisFunction(x) = Σ_{k=1..N} f_BasisFunction(x_{K(k−1)+1}, ..., x_{Kk})
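The test functions above, written as straightforward Python (the names are mine; each function takes a list of 0/1 values):

```python
def onemax(x):
    return sum(x)

def trap(x):
    # D at the all-ones optimum, otherwise D - 1 - u(x).
    d, ones = len(x), sum(x)
    return d if ones == d else d - 1 - ones

def equal_pairs(x):
    # 1 plus one point for every pair of equal neighbouring bits.
    return 1 + sum(1 for a, b in zip(x, x[1:]) if a == b)

def sliding_xor(x):
    # 1, plus 1 if all bits are equal, plus one point for every
    # window of three bits satisfying x1 XOR x2 = x3.
    all_equal = 1 if len(set(x)) == 1 else 0
    xors = sum(1 for a, b, c in zip(x, x[1:], x[2:]) if a ^ b == c)
    return 1 + all_equal + xors

def concatenated(basis, k, x):
    # Sum of a K-bit basis function applied to consecutive blocks.
    return sum(basis(x[i:i + k]) for i in range(0, len(x), k))
```

For instance, `concatenated(trap, 5, x)` gives the f_8x5bitTrap function for a 40-bit x.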
Test functions (cont.)

1. f_40x1bitOneMax: order-1 decomposable function, no interactions.
2. f_1x40bitEqualPairs: non-decomposable function; weak interactions: the optimal setting of each bit depends on the value of the preceding bit.
3. f_8x5bitEqualPairs: order-5 decomposable function.
4. f_1x40bitSlidingXOR: non-decomposable function; stronger interactions: the optimal setting of each bit depends on the values of the 2 preceding bits.
5. f_8x5bitSlidingXOR: order-5 decomposable function.
6. f_8x5bitTrap: order-5 decomposable function; the interactions in each 5-bit block are very strong, and the basis function is deceptive.

Scalability analysis

Facts:
- Using a small population size, population-based optimizers can solve only easy problems.
- Increasing the population size, the optimizers can solve increasingly harder problems...
- ...but using a too big population is a waste of resources.

Scalability analysis:
- determines the optimal (smallest) population size with which the algorithm solves the given problem reliably (reliably: the algorithm finds the optimum in 24 out of 25 runs);
- for each problem complexity, the optimal population size is determined e.g. using the bisection method;
- studies the influence of the problem complexity (dimensionality) on the optimal population size and on the number of needed evaluations.
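The bisection step can be sketched as follows. Here `reliable(n)` stands for the whole experiment of running the optimizer with population size n and checking the 24-out-of-25 success criterion; in practice that check is stochastic and only approximately monotone in n, so this is a sketch rather than a robust implementation:

```python
def minimal_population_size(reliable, low=2):
    # Doubling phase: grow until some population size succeeds reliably.
    high = low
    while not reliable(high):
        low, high = high, high * 2
    # Bisection phase: shrink the [failing, succeeding] bracket to width 1.
    while high - low > 1:
        mid = (low + high) // 2
        if reliable(mid):
            high = mid
        else:
            low = mid
    return high
```

With a deterministic stand-in such as `lambda n: n >= 37`, the function returns exactly 37 after a handful of probes instead of testing every size.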
Scalability on the One Max function

[Figure: number of evaluations vs. dimension for UMDA, COMIT, ECGA, BOA.]

Scalability on the non-decomposable Equal Pairs function

[Figure: number of evaluations vs. dimension for UMDA, COMIT, ECGA, BOA.]

Scalability on the decomposable Equal Pairs function

[Figure: number of evaluations vs. dimension for UMDA, COMIT, ECGA, BOA.]

Scalability on the non-decomposable Sliding XOR function

[Figure: number of evaluations vs. dimension for UMDA, COMIT, ECGA, BOA.]

Scalability on the decomposable Sliding XOR function

[Figure: number of evaluations vs. dimension for UMDA, COMIT, ECGA, BOA.]

Scalability on the decomposable Trap function

[Figure: number of evaluations vs. dimension for UMDA, COMIT, ECGA, BOA.]
Model structure during evolution

"During the evolution, the model structure becomes increasingly precise, and at the end of the evolution the model structure describes the problem structure exactly." NO! That's not true! Why?

- In the beginning, the distribution patterns are not very discernible; models similar to uniform distributions are used.
- In the end, the population converges and contains many copies of the same individual (or of a few individuals). No interactions among variables can be learned. The model structure is wrong (all bits independent), but the model describes the position of the optimum very precisely.
- The model with the best-matching structure is found somewhere in the middle of the evolution.
- Even though the right structure is never found during the evolution, the problem can be solved successfully.

Conclusions

Summary

Models:
- Bayesian networks are general models of joint probability.
- High-dimensional models are hard to train.
- High-dimensional models are very flexible.

Advantages: reliably solves problems decomposable into subproblems of bounded order.
Limitations: does not solve problems decomposable into logarithmic subproblems (hierarchical problems).
Suggestions for discrete EDAs

- For simple problems: PBIL, UMDA, cGA; they behave similarly to simple GAs.
- For harder problems: MIMIC, COMIT, BMDA; they are able to account for bivariate dependencies.
- For hard problems: BOA, ECGA, EBNA, LFDA; they can take more general dependencies into account.
- For even harder problems, e.g. problems with hierarchical structures: hBOA (hierarchical BOA).
More informationFeature selection. c Victor Kitov August Summer school on Machine Learning in High Energy Physics in partnership with
Feature selection c Victor Kitov v.v.kitov@yandex.ru Summer school on Machine Learning in High Energy Physics in partnership with August 2015 1/38 Feature selection Feature selection is a process of selecting
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationMatt Heavner CSE710 Fall 2009
Matt Heavner mheavner@buffalo.edu CSE710 Fall 2009 Problem Statement: Given a set of cities and corresponding locations, what is the shortest closed circuit that visits all cities without loops?? Fitness
More informationBayesian Networks to design optimal experiments. Davide De March
Bayesian Networks to design optimal experiments Davide De March davidedemarch@gmail.com 1 Outline evolutionary experimental design in high-dimensional space and costly experimentation the microwell mixture
More informationUpper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions
Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Matthew J. Streeter Computer Science Department and Center for the Neural Basis of Cognition Carnegie Mellon University
More informationBayesian Networks. Motivation
Bayesian Networks Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Motivation Assume we have five Boolean variables,,,, The joint probability is,,,, How many state configurations
More informationoptimization problem x opt = argmax f(x). Definition: Let p(x;t) denote the probability of x in the P population at generation t. Then p i (x i ;t) =
Wright's Equation and Evolutionary Computation H. Mühlenbein Th. Mahnig RWCP 1 Theoretical Foundation GMD 2 Laboratory D-53754 Sankt Augustin Germany muehlenbein@gmd.de http://ais.gmd.de/οmuehlen/ Abstract:
More informationhow should the GA proceed?
how should the GA proceed? string fitness 10111 10 01000 5 11010 3 00011 20 which new string would be better than any of the above? (the GA does not know the mapping between strings and fitness values!)
More informationComputational statistics
Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f
More informationEvolutionary Functional Link Interval Type-2 Fuzzy Neural System for Exchange Rate Prediction
Evolutionary Functional Link Interval Type-2 Fuzzy Neural System for Exchange Rate Prediction 3. Introduction Currency exchange rate is an important element in international finance. It is one of the chaotic,
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 5 Bayesian Learning of Bayesian Networks CS/CNS/EE 155 Andreas Krause Announcements Recitations: Every Tuesday 4-5:30 in 243 Annenberg Homework 1 out. Due in class
More information2-bit Flip Mutation Elementary Fitness Landscapes
RN/10/04 Research 15 September 2010 Note 2-bit Flip Mutation Elementary Fitness Landscapes Presented at Dagstuhl Seminar 10361, Theory of Evolutionary Algorithms, 8 September 2010 Fax: +44 (0)171 387 1397
More informationProbabilistic Graphical Models (I)
Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationPATTERN CLASSIFICATION
PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS
More informationCSC 411 Lecture 3: Decision Trees
CSC 411 Lecture 3: Decision Trees Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 03-Decision Trees 1 / 33 Today Decision Trees Simple but powerful learning
More informationClustering using Mixture Models
Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior
More informationIntroduction to Optimization
Introduction to Optimization Blackbox Optimization Marc Toussaint U Stuttgart Blackbox Optimization The term is not really well defined I use it to express that only f(x) can be evaluated f(x) or 2 f(x)
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationLearning Bayes Net Structures
Learning Bayes Net Structures KF, Chapter 15 15.5 (RN, Chapter 20) Some material taken from C Guesterin (CMU), K Murphy (UBC) 1 2 Learning Bayes Nets Known Structure Unknown Data Complete Missing Easy
More informationFinding Ground States of SK Spin Glasses with hboa and GAs
Finding Ground States of Sherrington-Kirkpatrick Spin Glasses with hboa and GAs Martin Pelikan, Helmut G. Katzgraber, & Sigismund Kobe Missouri Estimation of Distribution Algorithms Laboratory (MEDAL)
More informationLearning P-maps Param. Learning
Readings: K&F: 3.3, 3.4, 16.1, 16.2, 16.3, 16.4 Learning P-maps Param. Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 24 th, 2008 10-708 Carlos Guestrin 2006-2008
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationAn Evolutionary Programming Based Algorithm for HMM training
An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,
More informationBayesian Networks Structure Learning (cont.)
Koller & Friedman Chapters (handed out): Chapter 11 (short) Chapter 1: 1.1, 1., 1.3 (covered in the beginning of semester) 1.4 (Learning parameters for BNs) Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic
More informationAn instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1
Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,
More informationDiscrete stochastic optimization with continuous auxiliary variables. Rodolphe Le Riche*, Alexis Lasseigne**, François-Xavier Irisarri**
Discrete stochastic optimization with continuous auxiliary variables Rodolphe Le Riche*, Alexis Lasseigne**, François-Xavier Irisarri** *CNRS and Ecole des Mines de St-Etienne, Fr. ** ONERA, Composite
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationLearning Bayesian Networks Does Not Have to Be NP-Hard
Learning Bayesian Networks Does Not Have to Be NP-Hard Norbert Dojer Institute of Informatics, Warsaw University, Banacha, 0-097 Warszawa, Poland dojer@mimuw.edu.pl Abstract. We propose an algorithm for
More informationComputer Vision Group Prof. Daniel Cremers. 14. Clustering
Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it
More informationBayesian Optimization Algorithm, Population Sizing, and Time to Convergence
Preprint UCRL-JC-137172 Bayesian Optimization Algorithm, Population Sizing, and Time to Convergence M. Pelikan, D. E. Goldberg and E. Cantu-Paz This article was submitted to Genetic and Evolutionary Computation
More informationSeries 7, May 22, 2018 (EM Convergence)
Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18
More informationLearning Objectives. c D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 1
Learning Objectives At the end of the class you should be able to: identify a supervised learning problem characterize how the prediction is a function of the error measure avoid mixing the training and
More informationStructure Design of Neural Networks Using Genetic Algorithms
Structure Design of Neural Networks Using Genetic Algorithms Satoshi Mizuta Takashi Sato Demelo Lao Masami Ikeda Toshio Shimizu Department of Electronic and Information System Engineering, Faculty of Science
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 4 Learning Bayesian Networks CS/CNS/EE 155 Andreas Krause Announcements Another TA: Hongchao Zhou Please fill out the questionnaire about recitations Homework 1 out.
More informationCS540 ANSWER SHEET
CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationExploration of population fixed-points versus mutation rates for functions of unitation
Exploration of population fixed-points versus mutation rates for functions of unitation J Neal Richter 1, Alden Wright 2, John Paxton 1 1 Computer Science Department, Montana State University, 357 EPS,
More informationLecture 4 October 18th
Directed and undirected graphical models Fall 2017 Lecture 4 October 18th Lecturer: Guillaume Obozinski Scribe: In this lecture, we will assume that all random variables are discrete, to keep notations
More informationLearning Bayesian Networks (part 1) Goals for the lecture
Learning Bayesian Networks (part 1) Mark Craven and David Page Computer Scices 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some ohe slides in these lectures have been adapted/borrowed from materials
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationBN Semantics 3 Now it s personal! Parameter Learning 1
Readings: K&F: 3.4, 14.1, 14.2 BN Semantics 3 Now it s personal! Parameter Learning 1 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 22 nd, 2006 1 Building BNs from independence
More informationScalability of Selectorecombinative Genetic Algorithms for Problems with Tight Linkage
Scalability of Selectorecombinative Genetic Algorithms for Problems with Tight Linkage Kumara Sastry 1,2 and David E. Goldberg 1,3 1 Illinois Genetic Algorithms Laboratory (IlliGAL) 2 Department of Material
More informationRepresentation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2
Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 32 Learning a generative model We are given a training
More informationKoza s Algorithm. Choose a set of possible functions and terminals for the program.
Step 1 Koza s Algorithm Choose a set of possible functions and terminals for the program. You don t know ahead of time which functions and terminals will be needed. User needs to make intelligent choices
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More information