
Convergence Time for Linkage Model Building in Estimation of Distribution Algorithms

Hau-Jiun Yang
Tian-Li Yu

TEIL Technical Report No. 2009003
January, 2009

Taiwan Evolutionary Intelligence Laboratory (TEIL)
Department of Electrical Engineering
National Taiwan University
No. 1, Sec. 4, Roosevelt Rd., Taipei, Taiwan
http://teil.ee.ntu.edu.tw

Convergence Time for Linkage Model Building in Estimation of Distribution Algorithms

Hau-Jiun Yang, Tian-Li Yu
National Taiwan University
r9692057@ntu.edu.tw, tianliyu@cc.ee.ntu.edu.tw

Abstract

This paper proposes a convergence time model for linkage model building in estimation of distribution algorithms (EDAs). The model utilizes the population-sizing result for EDAs from previous work. By investigating the building-block identification rate of the linkage model, we give a recurrence relation for the proportion of identified building blocks in each generation. The recurrence fails to yield a closed-form solution because the identification rate varies over time. We therefore derive upper and lower bounds instead, by assuming rapid allelic convergence and a fixed identification rate, respectively. Specifically, the linkage model convergence time is bounded by Ω(ln m) and O(m), where m is the number of building blocks. Empirically, experimental results agree with the proposed bounds.

1 INTRODUCTION

Estimation of distribution algorithms (EDAs) (Larrañaga & Lozano, 2002) have attracted much attention in recent decades. They have been proposed as a strong extension of genetic algorithms. EDAs solve deceptive and nonseparable functions, which genetic algorithms have difficulty optimizing, by learning linkage (Mühlenbein & Höns, 2006). Linkage learning is essential to the success of EDAs because it conserves building blocks under recombination. Many methods (Goldberg, 2002) have been developed to solve the so-called linkage-learning problem, resulting in many different EDAs, including ecga (Harik, 1999), BOA (Pelikan & Goldberg, 1999), COMIT (Baluja & Davies, 1997), UMDA (Mühlenbein & Paass, 1996), MIMIC (Bonet, Isbell, & Viola, 1997), D5 (Tsuji, Munetomo, & Akama, 2004), EBNA (Etxeberria & Larrañaga, 1999), and DSMGA (Yu, Goldberg, Yassine, & Chen, 2003).

To solve problems with EDAs, we need to understand the race between alleles and linkages (Goldberg, 1998; Goldberg, 2002). Allelic convergence means that a superior individual takes over the population. Linkage convergence means that all of the building blocks are identified, which ensures that the superior individual is never disrupted by recombination and makes the answer convincing. This raises a question: should alleles converge faster than linkages, or not? The condition where the linkage convergence time is less than or equal to the takeover time is the most advantageous one: before the best alleles dominate the population, the linkage should already be learned so that recombination can innovate and assemble a better individual, and thereafter this better individual starts to dominate the population. The convergence time of the simple genetic algorithm (sGA) has been modeled under different selection schemes (Thierens & Goldberg, 1994). The issue of convergence time is equally critical in EDAs. Once the linkage model converges to the best combination of genes, the alleles can hardly converge to a wrong solution, so linkage convergence is more important than allelic convergence in EDAs.

The purpose of this paper is to develop a convergence time model for linkage model building in EDAs. The model utilizes the population-sizing model of EDAs from previous work (Yu, Sastry, Goldberg, & Pelikan, 2007). First, this paper derives the linkage model identification rate of EDAs at the first generation from prior research.
Then, by observing why the model identification rate changes, we use some approximations to capture this phenomenon and give a recurrence relation for the proportion of identified building blocks. The paper then derives bounds from the proposed recurrence relation, which

is hard to reduce to a closed form. Finally, experiments are conducted to verify the linkage model convergence time.

2 Convergence Time

Time-to-convergence models for the sGA have been proposed. Since our methodology for EDAs builds on them, we briefly outline the sGA models in terms of takeover time and convergence time.

2.1 Takeover Time

In early GA research, convergence time was characterized by the takeover time (Goldberg & Deb, 1991), which has been compared across a number of selection schemes commonly used in the sGA:

    selection scheme           takeover time
    proportionate selection    c1 n ln n
    linear ranking (c0 = 2)    log n + log(ln n)
    tournament (size s)        (1/ln s)[ln n + ln(ln n)]
    Genitor                    c0 ln(n - 1)

where n is the population size, s is the selection pressure, and c0 and c1 are scheme-dependent constants.

2.2 Time-to-Convergence Model

Thierens and Goldberg (1994) proposed a model that assumes a normally distributed fitness function and predicts the proportion of optimal alleles as a function of the number of generations. Convergence times were derived for different selection schemes, which allows the GA selection schemes to be compared in terms of their convergence characteristics:

    selection scheme           convergence time (g_conv)
    proportionate              -l ln(2ε)
    tournament (s = 2)         (π/2)√(πl)
    truncation                 (π/2)(√l / I)
    elitist recombination      √(πl)/2

where l is the problem size, I is the selection intensity (Blickle & Thiele, 1995), and ε is a failure probability.

3 Linkage Model Identification Rate for EDAs

The linkage model identification rate is the proportion of the still-unidentified building blocks (BBs) that will be identified in the next generation. In this section, we investigate the linkage model identification rate of EDAs at the first generation. To identify all BBs by the end of the EDA run, a sufficient population size is needed. If the population is too small, the alleles converge faster than the linkages, and we cannot be confident that the solution is correct.

The following derivation adopts a decomposable problem composed of bipolar Royal road functions. The Royal road function serves as a worst-case scenario for model building because, given the minimal fitness difference d_min, the fitness differences between the best schema and all other (2^k - 1) schemata are all d_min. A bipolar Royal road function of order k is defined as follows, so as to equalize the growth rates of 0 and 1 for every gene; the total number of BBs is proportional to the problem size.

$$R_k(\vec{x}) = \begin{cases} 1, & \text{if } \vec{x} = \underbrace{11\cdots1}_{k} \\ 1, & \text{if } \vec{x} = \underbrace{00\cdots0}_{k} \\ 1-d, & \text{otherwise} \end{cases} \qquad (1)$$
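As a concrete illustration, the following Python sketch implements the bipolar Royal road function of equation (1). The function name and the choice d = 0.1 are ours, for illustration only.

```python
# A minimal sketch of the bipolar Royal road function R_k of equation (1).
# The function name and the value d = 0.1 are illustrative choices.

def bipolar_royal_road(block, d=0.1):
    """Fitness of one order-k building block (a sequence of 0s and 1s)."""
    if all(b == 1 for b in block) or all(b == 0 for b in block):
        return 1.0      # the all-ones and all-zeros blocks are equally fit
    return 1.0 - d      # every other schema lies d below the optimum

print(bipolar_royal_road([1, 1, 1]))  # 1.0
print(bipolar_royal_road([0, 0, 0]))  # 1.0
print(bipolar_royal_road([1, 0, 1]))  # 0.9
```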

Figure 1: The difference between the error function erf(x) and its approximation erf1(x), plotted as (erf1(x) - erf(x))/erf(x). The maximum error occurs when x is close to 1.3.

The fitness of an additively decomposable problem with m BBs is defined as

$$f(\vec{x}) = \sum_{i=0}^{m-1} R_k(x_{ik+1}\,x_{ik+2}\cdots x_{ik+k}). \qquad (2)$$

The means and the variances of the sampled mutual information for a finite population were derived in (Yu, Sastry, Goldberg, & Pelikan, 2007). Using these models, we can derive the linkage model identification rate of EDAs at the first generation. The distribution of the sampled mutual information can be approximated by a Gaussian distribution (Hutter & Zaffalon, 2005). The decision-making error can then be calculated as follows. Define a variable τ as

$$\tau = \frac{E[Z]}{\sqrt{Var[Z]}} \approx \frac{\ln s_{to}}{s_{to}^{1.6}} \cdot \frac{d\,\sqrt{c\,n}}{2^k \sqrt{\pi m}\,\sigma_{BB}} \qquad (3)$$

where c is a constant, s_to is the selection pressure of tournament selection, n is the population size, m is the number of BBs, k is the order of each BB, and σ²_BB is the fitness variance of a BB. The decision error ε is Φ(-τ). One approximation of the error function is

$$\operatorname{erf}_1^2(x) = 1 - \exp\left(-x^2\,\frac{4/\pi + a x^2}{1 + a x^2}\right), \quad \text{where } a = \frac{8(\pi - 3)}{3\pi(4 - \pi)} \qquad (4)$$

The difference between the error function and its approximation is shown in Figure 1. The approximation has a maximum error of 3.5 × 10⁻⁴ when x is close to 1.3, and this error has little influence on our work. The factor (4/π + ax²)/(1 + ax²) only influences the result when x is close to zero, and the error it introduces is small, so we approximate this factor by 1 and calculate the error function as

$$\operatorname{erf}(x) \approx \sqrt{1 - e^{-x^2}} \qquad (5)$$

With this approximation of the error function, we can calculate ε as

$$\epsilon = \Phi(-\tau) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{-\tau}{\sqrt{2}}\right)\right) \qquad (6)$$

$$\approx \frac{1}{2}\left(1 - \sqrt{1 - e^{-\tau^2/2}}\right) \qquad (7)$$
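The decision error of equation (7) is easy to evaluate numerically. The following Python sketch is our own illustration; it implements the erf approximation of equation (5) and compares the resulting error with the exact value computed from math.erf.

```python
import math

def erf_approx(x):
    """Approximation of erf(x) from equation (5): sqrt(1 - exp(-x^2))."""
    return math.sqrt(1.0 - math.exp(-x * x))

def decision_error(tau):
    """Decision error of equation (7): (1 - sqrt(1 - exp(-tau^2/2))) / 2."""
    return 0.5 * (1.0 - erf_approx(tau / math.sqrt(2.0)))

for tau in (1.0, 2.0, 3.0):
    exact = 0.5 * (1.0 - math.erf(tau / math.sqrt(2.0)))  # Phi(-tau)
    print(f"tau={tau}: approx={decision_error(tau):.5f}, exact={exact:.5f}")
```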

For large m, τ is small, and the decision error can be simplified as

$$\epsilon \approx \frac{1}{4}\,e^{-\frac{\tau^2}{2}} \qquad (8)$$

For a problem with m BBs, each BB is of order k. There are k bits in each BB to be identified, and this must happen for each of the m BBs. So the identification rate at the initial state of the EDA, Q0, should be

$$Q_0 \approx (1 - \epsilon)^{km} \qquad (9)$$

Usually, Q0 is not close to zero when we provide a population size sufficient for all the BBs to be identified. With a not-too-small Q0 and large m, we can easily infer from equation (9) that ε is small. So equation (9) can be approximated as

$$Q_0 \approx 1 - km\epsilon = 1 - km\,e^{-\frac{c}{2}\left(\frac{\ln s_{to}}{s_{to}^{1.6}}\right)^2 \frac{d^2}{2^{2k}\pi\sigma_{BB}^2}\,\frac{n}{m}} = 1 - km\,e^{-\frac{c_1 n}{m}} \qquad (10)$$

where $c_1 = \frac{c}{2}\left(\frac{\ln s_{to}}{s_{to}^{1.6}}\right)^2 \frac{d^2}{2^{2k}\pi\sigma_{BB}^2}$.

Figure 2: Q0 versus population size n with m fixed. Given a small population size the quality drops to zero, but given a sufficient population size the curve grows rapidly.

Figure 2 shows the experimental result for ecga on an (m, k)-trap problem (Goldberg, 1989), where m = 40 and k = 3. When we supply a sufficient population size, Q0 grows rapidly. In this paper, we only consider the case where a sufficient population size is supplied; the experimental result is then consistent with equation (10).
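The threshold behaviour that Figure 2 depicts can be reproduced directly from equation (10). The sketch below is our own illustration; the constant c1 is an arbitrary value, not one derived from the paper's experiments.

```python
import math

def q0(n, m, k, c1=0.05):
    """Initial identification rate from equation (10), clipped to [0, 1].
    c1 is the problem- and selection-dependent constant (arbitrary here)."""
    return max(0.0, 1.0 - k * m * math.exp(-c1 * n / m))

m, k = 40, 3
for n in (1000, 3000, 5000, 7000, 9000):
    print(n, round(q0(n, m, k), 3))
# Small n gives Q0 = 0; past a threshold, Q0 rises quickly toward 1.
```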

4 TIME TO MODEL CONVERGENCE

The previous section derived the linkage model identification rate of EDAs at the first generation. This section discusses the identification rate while the EDA is running. Define Q_t as the ratio of identified BBs at generation t, and let β_t and α_t be the linkage model identification rate and the linkage model destruction rate at generation t, respectively. The destruction rate is the proportion of the identified model that will be destroyed in the next generation.

4.1 The Changing Linkage Model Identification Rate

Empirical results show that the model identification rate β increases over the course of the run (Figure 3). We now analyze the reason for this increase.

Figure 3: Q_t, β_t, and α_t over the generations for (a) the bipolar Royal road function and (b) the trap function. The effect of α is small enough to neglect, and β increases when a sufficient population size is supplied. The trap function needs a larger population-size parameter to ensure the entire model is identified.

All EDA methods share a similar high-level algorithm (Shan, McKay, Essam, & Abbass, 2006), presented below; a minimal code sketch follows the list.

1. Generate a population P randomly.
2. Select a set of fitter individuals G from P.
3. Estimate a probabilistic model M over G.
4. Sample from the probabilistic model M to obtain a set of new individuals G'.
5. Incorporate G' into the population P.
6. If the termination condition is not satisfied, go to step 2.

EDAs start with a random initialization and execute steps 2 to 6 repeatedly. A larger fitness variance between superior and inferior individuals helps build a better linkage model; in other words, the more superior individuals are generated, the better the linkage model becomes. Removing step 3 and modifying step 4 to randomly shuffle each gene makes an EDA work without linkage learning.
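A minimal sketch of steps 1 to 6 in Python is given below. It uses a univariate (UMDA-like) model, so step 3 learns no linkage; real linkage-learning EDAs such as ecga or BOA build multivariate models in step 3. All names and parameter values are our own illustrative choices.

```python
import random

def simple_eda(fitness, length, pop_size=200, truncation=0.5, generations=50):
    """Minimal EDA loop following steps 1-6 with a univariate model
    (UMDA-like sketch; linkage-learning EDAs use richer models in step 3)."""
    pop = [[random.randint(0, 1) for _ in range(length)]           # step 1
           for _ in range(pop_size)]
    for _ in range(generations):                                   # step 6 (fixed budget)
        pop.sort(key=fitness, reverse=True)
        selected = pop[:int(truncation * pop_size)]                # step 2
        probs = [sum(ind[i] for ind in selected) / len(selected)   # step 3
                 for i in range(length)]
        pop = [[1 if random.random() < p else 0 for p in probs]    # steps 4-5
               for _ in range(pop_size)]
    return max(pop, key=fitness)

best = simple_eda(fitness=sum, length=20)   # example: maximize the number of ones
print(sum(best), best)
```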

When an EDA is executed on the bipolar Royal road function without linkage learning, the ratio of 0s and 1s in each gene stays fixed at 0.5, and we can observe that the number of copies of superior individuals increases only slowly. So the key cause of the climbing number of superior individuals lies in the process of linkage learning. When an EDA learns linkages, it checks the possible combinations of genes and adopts the combination that has the highest entropy. Once a correct BB is identified, the genes in the BB converge to the best solution of the model. When the best alleles dominate the population within the BB, the BB attains maximum fitness, which supplies the maximum entropy to the entire linkage model. To the other BBs, the identified BBs provide no external noise, so the noise coming from the identified BBs can be considered zero.

We now follow the steps in (Yu, Sastry, Goldberg, & Pelikan, 2007). Define the following notation for schemata:

$$H_{xy} = xy\underbrace{*\,*\cdots*}_{l-2} \qquad (11)$$

where x and y are 0 or 1, and l is the problem size. Define two sets H⁺ = {H00, H11} and H⁻ = {H01, H10}, and let F⁺ and F⁻ be their corresponding fitness values:

$$F^+ = f(H^+) = f(H_{00}) + f(H_{11}) \qquad (12)$$

$$F^- = f(H^-) = f(H_{01}) + f(H_{10}) \qquad (13)$$

According to the central limit theorem, the distributions of F⁺ and F⁻ can be approximated as Gaussian distributions when the population size is large. The variances of F⁺ and F⁻, defined as σ²_{F⁺} and σ²_{F⁻} respectively, are different but very close. By treating the other (m - 1) BBs as external noise, these variances can be bounded and approximated as:

$$(m-1)\sigma_{BB}^2 \le \sigma_{F^+}^2 \le m\sigma_{BB}^2, \qquad (m-1)\sigma_{BB}^2 \le \sigma_{F^-}^2 \le m\sigma_{BB}^2$$

$$\sigma_{F^+}^2 \approx \sigma_{F^-}^2 = m\sigma_{BB}^2\left(1 - O\!\left(\frac{1}{m}\right)\right) \qquad (14)$$

where σ²_BB is the fitness variance of a BB. The difference between those two variances is small and can be neglected when m is large. Define Z = F⁺ - F⁻. Z is a normally distributed random variable with the following mean and variance:

$$E[Z] = \frac{d}{2^{k-2}} \qquad (15)$$

$$Var[Z] = \sigma_{F^+}^2 + \sigma_{F^-}^2 = 2m\sigma_{BB}^2\left(1 - O\!\left(\frac{1}{m}\right)\right) \qquad (16)$$

Once a BB is identified, the best alleles of the BB will never be destroyed. The expected number of copies of the best alleles is n/2^k, because the ratio of 0s and 1s for every gene is equal in each generation of the bipolar Royal road function, where n is the population size. The takeover time t* of the best alleles can be calculated as:

$$\frac{n}{2^k}\cdot 2^{t^*} = n \;\Rightarrow\; t^* = k \qquad (17)$$

The takeover time t* equals k, and k is usually not too large, so we approximate t* by 1. That means that once a BB is identified, the alleles in the BB take over the population immediately; in other words, the BB immediately stops contributing noise to linkage identification. The mean and variance then change as shown below:

$$E[Z] = \frac{d}{2^{k-2}} \qquad (18)$$

$$Var[Z] = 2(m-1)\sigma_{BB}^2\left(1 - O\!\left(\frac{1}{m}\right)\right) \qquad (19)$$
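The Gaussian statistics of Z can be checked by Monte Carlo sampling. The sketch below is our own illustration: it reuses bipolar_royal_road from the earlier sketch, fixes the first two bits of the first BB to form the schemata of equation (11), and estimates the mean and variance of Z for comparison with equations (15) and (16).

```python
import random
import statistics

def total_fitness(bits, k, d=0.1):
    """Additive fitness of equation (2), summing bipolar_royal_road
    (defined in the earlier sketch) over consecutive order-k blocks."""
    return sum(bipolar_royal_road(bits[i:i + k], d)
               for i in range(0, len(bits), k))

def sample_Z(m=20, k=3, d=0.1, trials=20000):
    """Estimate E[Z] and Var[Z] for Z = F+ - F-, fixing the first two bits
    (00/11 versus 01/10) and drawing all other bits uniformly at random."""
    zs = []
    l = m * k
    for _ in range(trials):
        f = {}
        for xy in ((0, 0), (1, 1), (0, 1), (1, 0)):
            bits = list(xy) + [random.randint(0, 1) for _ in range(l - 2)]
            f[xy] = total_fitness(bits, k, d)
        zs.append(f[(0, 0)] + f[(1, 1)] - f[(0, 1)] - f[(1, 0)])
    return statistics.mean(zs), statistics.variance(zs)

mean_z, var_z = sample_Z()
print(mean_z, var_z)  # compare with equations (15) and (16)
```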

Extending this to the case where mQ_t BBs have been identified at generation t, the mean and variance of Z at generation t can be calculated as:

$$E[Z] = \frac{d}{2^{k-2}} \qquad (20)$$

$$Var[Z] = 2m(1-Q_t)\,\sigma_{BB}^2\left(1 - O\!\left(\frac{1}{m(1-Q_t)}\right)\right) \qquad (21)$$

Following the steps in (Yu, Sastry, Goldberg, & Pelikan, 2007), we finally obtain

$$\beta_t = 1 - km(1-Q_t)\,e^{-\frac{c_1 n}{m(1-Q_t)}} \qquad (22)$$

4.2 The Recurrence Relation for the Identified Building-Block Ratio

Q_{t+1} depends on α_t, β_t, and Q_t. The recurrence formula is derived directly by combining the surviving ratio (1 - α_t)Q_t with the identification ratio β_t(1 - Q_t); the initial state Q0 was derived in the previous section (equation (10)):

$$Q_{t+1} = (1-\alpha_t)\,Q_t + \beta_t\,(1-Q_t) \qquad (23)$$

This recurrence relation yields the exact Q_{t+1} given exact α_t, β_t, and Q_t. According to the experimental results (Figure 3), α_t is always zero for the bipolar Royal road function. For the (m, k)-trap function (Deb & Goldberg, 1991), α is small, too. At the beginning of an EDA run, Q_t is often small, so the destruction term is negligible. We therefore set α_t = 0, and the recurrence simplifies to

$$Q_{t+1} = Q_t + \beta_t\,(1-Q_t) \qquad (24)$$
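The simplified recurrence (24), with β_t taken from equation (22), is straightforward to iterate numerically. The sketch below is our own illustration: it chooses the population size through equation (35) so that the initial rate equals a given Q0, and stops once Q_t reaches 1 - 1/m, the convergence criterion used in the next subsection.

```python
import math

def model_convergence_time(m, k, q0):
    """Iterate Q_{t+1} = Q_t + beta_t (1 - Q_t) with beta_t from equation (22),
    where the population size is set via equation (35) so that Q_0 = q0."""
    c1n = m * math.log(k * m / (1.0 - q0))   # c1 * n, from equation (35)
    def beta(q):
        return max(0.0, 1.0 - k * m * (1.0 - q)
                   * math.exp(-c1n / (m * (1.0 - q))))
    q, t = q0, 0
    while q < 1.0 - 1.0 / m:                 # convergence criterion
        q, t = q + beta(q) * (1.0 - q), t + 1
    return t

for m in (20, 40, 80, 160):
    print(m, model_convergence_time(m, k=3, q0=0.1))
```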

4.3 The Upper Bound of the Linkage Model Convergence Time

With some simple transposition and substitution in equation (24), we obtain

$$Q_{t+1} - Q_t = \left(1 - km(1-Q_t)\,e^{-\frac{c_1 n}{m(1-Q_t)}}\right)(1-Q_t) \qquad (25)$$

Approximating this difference equation by the corresponding differential equation,

$$\frac{dQ_t}{dt} = \left(1 - km(1-Q_t)\,e^{-\frac{c_1 n}{m(1-Q_t)}}\right)(1-Q_t) \qquad (26)$$

$$dt = \frac{dQ_t}{\left(1 - km(1-Q_t)\,e^{-\frac{c_1 n}{m(1-Q_t)}}\right)(1-Q_t)} \qquad (27)$$

Integrating both sides, the right-hand side is hard to integrate to a closed form directly. So we make an approximation by adding Q_t² - Q_t to the denominator; Q_t² - Q_t is no greater than zero because Q_t lies between 0 and 1, so the denominator only shrinks and the integral becomes an upper bound on t. The modified denominator factors as (1 - km e^{-c_1 n/(m(1-Q_t))})(1-Q_t)², and we can integrate the equation as follows:

$$t \le \int_{Q_0}^{Q_t} \frac{dQ_t}{\left(1 - km\,e^{-\frac{c_1 n}{m(1-Q_t)}}\right)(1-Q_t)^2} \qquad (28)$$

Let $P_t = \frac{c_1 n}{m(1-Q_t)}$; then $dQ_t = \frac{m(1-Q_t)^2}{c_1 n}\,dP_t$, and

$$t \le \frac{m}{c_1 n}\int \frac{dP_t}{1 - km\,e^{-P_t}} \qquad (29)$$

$$= \frac{m}{c_1 n}\int \frac{e^{P_t}}{e^{P_t} - km}\,dP_t \qquad (30)$$

$$= \frac{m}{c_1 n}\left[\ln\left(e^{\frac{c_1 n}{m(1-Q_t)}} - km\right)\right]_{Q_0}^{Q_t} \qquad (31)$$

$$= \frac{m}{c_1 n}\,\ln\frac{e^{\frac{c_1 n}{m(1-Q_t)}} - km}{e^{\frac{c_1 n}{m(1-Q_0)}} - km} \qquad (32)$$

We thus obtain a time-convergence model expressing the proportion Q_t as a function of the number of generations t:

$$Q_t \ge 1 - \frac{c_1 n}{m\,\ln\!\left(e^{\frac{c_1 n t}{m}}\left(e^{\frac{c_1 n}{m(1-Q_0)}} - km\right) + km\right)} \qquad (33)$$

To calculate the linkage convergence time, we let the proportion Q_t come close to 1 - 1/m:

$$Q_t \ge 1 - \frac{1}{m} \;\Rightarrow\; t \le \frac{m}{c_1 n}\,\ln\frac{e^{c_1 n} - km}{e^{\frac{c_1 n}{m(1-Q_0)}} - km} \qquad (34)$$

From the initial state Q0 in equation (10), we can replace n by

$$n = \frac{m}{c_1}\,\ln\frac{km}{1-Q_0} \qquad (35)$$

$$t \le \frac{1}{\ln\frac{km}{1-Q_0}}\,\ln\frac{\left(\frac{km}{1-Q_0}\right)^{m} - km}{\left(\frac{km}{1-Q_0}\right)^{\frac{1}{1-Q_0}} - km} \qquad (36)$$

For large m, the km terms can be neglected, and inequality (36) can be simplified as

$$t \le m - \frac{1}{1-Q_0} \qquad (37)$$

4.4 The Lower Bound of the Linkage Model Convergence Time

In our approximation, once a BB is identified, the noise coming from that BB becomes zero immediately. To avoid the overestimation introduced by this approximation, we give a rough lower bound as well. We consider the case where β_t never increases while the EDA is working; in other words, β_t stays fixed at Q0. The recurrence then becomes

$$Q_{t+1} = Q_t + Q_0\,(1-Q_t) \qquad (38)$$

$$Q_t = 1 - (1-Q_0)^{t+1} \qquad (39)$$

Letting the proportion Q_t come close to 1 - 1/m, we obtain the convergence time t:

$$Q_t = 1 - (1-Q_0)^{t+1} \ge 1 - \frac{1}{m} \qquad (40)$$

$$t + 1 \ge \frac{\ln m}{-\ln(1-Q_0)} \qquad (41)$$

-ln(1 - Q0) can be approximated by Q0 when terms of order (Q0)² are neglected. Finally, the lower bound of the m-t relation is

$$t + 1 \ge \frac{\ln m}{Q_0} \qquad (42)$$

5 Experimental Results and Discussion

The relation between the number of building blocks and the linkage model convergence time is presented in Figure 4. By equations (37) and (42), the linkage model convergence time t is bounded by O(m) and Ω(ln m). Figure 4(a) shows the experiment with the bipolar Royal road function, where Q0 is 0.1 and the selection pressure is 4; Figure 4(b) shows the experiment with the trap function, where Q0 is 0.25 and the selection pressure is 6. When m is small, the trend scales almost linearly.
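To see how the two bounds grow with the number of BBs, the formulas of equations (37) and (42) can be tabulated directly; the sketch below is our own illustration, and the bounds are asymptotic orders rather than tight numerical predictions.

```python
import math

# Growth of the bounds from equations (37) and (42) for Q0 = 0.1.
# These are asymptotic orders; their constants are not calibrated.
q0 = 0.1
for m in (40, 80, 160, 320):
    upper = m - 1.0 / (1.0 - q0)       # equation (37), O(m)
    lower = math.log(m) / q0 - 1.0     # equation (42), Omega(ln m)
    print(f"m={m:4d}: lower={lower:6.1f}, upper={upper:6.1f}")
```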

Figure 4: The relation between the number of BBs and the convergence time for (a) the bipolar Royal road function (ecga, k = 3, Q0 = 0.1) and (b) the trap function (ecga, k = 3, Q0 = 0.25). The convergence time t is bounded by O(m) and Ω(ln m).

In fact, the external noise mentioned in the previous section is not exactly zero, and the alleles do not take over the population immediately once a BB is identified. These are the reasons why the upper bound does not predict the convergence time accurately. In this paper, we do not discuss the condition in which BBs are not fully identified. As an example of this condition, a building block may be defined over genes 1-2-3, while the EDA identifies only 1-2 as a building block. In our proposed model, the superior alleles start to grow only when the full BB is identified. When the order of a BB is very large, an EDA can hardly identify the full BB directly, and the superior alleles would never increase without linkage learning. Therefore, the smaller the order of the BBs, the more accurate the upper bound of our proposed model.

Figure 5: The comparison between allelic and linkage-model convergence (proportion versus generation).

Figure 5 shows the comparison between alleles and linkages on the trap function, where m = 50.

6 CONCLUSION

In this paper, we present a model that bounds the linkage model convergence time for EDAs. By observing the EDA flow, we find that the decreasing external noise makes the BB identification rate increase. Furthermore, we give a recurrence relation for the proportion of identified building blocks. The upper bound is based on the approximation that the noise coming from a BB identified in a previous generation becomes zero. Because it is difficult to give a closed form for the recurrence relation, we make several approximations in the derivation. We also give a lower bound on the convergence time by assuming that the BB identification rate stays fixed at Q0. Our model shows that the required linkage model convergence time t is bounded by Ω(ln m) and O(m). The measured convergence time scales as Θ(m) at first; as the problem size grows, the error grows too, and the trend becomes more similar to Θ(ln m). Experimental results show good agreement with the bounds.

References

Baluja, S., & Davies, S. (1997). Combining multiple optimization runs with optimal dependency trees (Technical Report).

Blickle, T., & Thiele, L. (1995). A comparison of selection schemes used in genetic algorithms.

Bonet, J. S. D., Isbell, C. L., & Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. In Advances in Neural Information Processing Systems (pp. 424). The MIT Press.

Deb, K., & Goldberg, D. E. (1991). Analyzing deception in trap functions. IlliGAL Report No. 91009, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana, IL.

Etxeberria, R., & Larrañaga, P. (1999). Global optimization using Bayesian networks. Second Symposium on Artificial Intelligence (CIMAF-99), 332-339.

Goldberg, D. E. (1989). Simple genetic algorithms and the minimal, deceptive problem. Genetic Algorithms and Simulated Annealing, 70-79.

Goldberg, D. E. (1998). The race, the hurdle, and the sweet spot: Lessons from genetic algorithms for the automation of design innovation and creativity. IlliGAL Report No. 98007, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana, IL.

Goldberg, D. E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Norwell, MA, USA: Kluwer Academic Publishers.

Goldberg, D. E., & Deb, K. (1991). A comparative analysis of selection schemes used in genetic algorithms.

Harik, G. R. (1999). Linkage learning via probabilistic modeling in the ECGA (Technical Report).

Hutter, M., & Zaffalon, M. (2005). Distribution of mutual information from complete and incomplete data. Computational Statistics & Data Analysis, 48(3), 633-657.

Larrañaga, P., & Lozano, J. A. (Eds.) (2002). Estimation of distribution algorithms: A new tool for evolutionary computation. Boston, MA: Kluwer Academic Publishers.

Mühlenbein, H., & Höns, R. (2006). The factorized distribution algorithm and the minimum relative entropy principle. In Scalable Optimization via Probabilistic Modeling. Springer Berlin/Heidelberg.

Mühlenbein, H., & Paass, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In PPSN IV: Proceedings of the 4th International Conference on Parallel Problem Solving from Nature (pp. 178-187). London, UK: Springer-Verlag.

Pelikan, M., & Goldberg, D. E. (1999). BOA: The Bayesian optimization algorithm (pp. 525-532). Morgan Kaufmann.

Shan, Y., McKay, R. I., Essam, D., & Abbass, H. A. (2006). A survey of probabilistic model building genetic programming. In Scalable Optimization via Probabilistic Modeling. Springer Berlin/Heidelberg.

Thierens, D., & Goldberg, D. E. (1994). Convergence models of genetic algorithm selection schemes. In PPSN III: Proceedings of the International Conference on Evolutionary Computation. The Third Conference on Parallel Problem Solving from Nature (pp. 119-129). London, UK: Springer-Verlag.

Tsuji, M., Munetomo, M., & Akama, K. (2004). Modeling dependencies of loci with string classification according to fitness differences. Genetic and Evolutionary Computation (GECCO 2004), 246-257.

Yu, T.-L., Goldberg, D. E., Yassine, A., & Chen, Y.-P. (2003). Genetic algorithm design inspired by organizational theory: Pilot study of a dependency structure matrix driven genetic algorithm. Proceedings of Artificial Neural Networks in Engineering 2003.

Yu, T.-L., Sastry, K., Goldberg, D. E., & Pelikan, M. (2007). Population sizing for entropy-based model building in discrete estimation of distribution algorithms. In GECCO '07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (pp. 601-608). New York, NY, USA: ACM.

Goldberg, D. E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Norwell, MA, USA. Goldberg, D. E., & Deb, K. (99). A comparative analysis of selection schemes used in genetic algorithms. Harik, G. R. (999). Linkage learning via probabilistic modeling in the ecga (Technical Report). Hutter, M., & Zaffalon, M. (2005). Distribution of mutual information from complete and incomplete data. Computational Statistics & Data Analysis, 48 (3), 633 657. to appear. Larrañaga, P., Lozano, J. A., & eds (2002). Estimation of distribution algorithms: A new tool for evolutionary computation. Kluwer Academic Publishers, Boston, MA. Mühlenbein, H., & Paass, G. (996). From recombination of genes to the estimation of distributions i. binary parameters. In PPSN IV: Proceedings of the 4th International Conference on Parallel Problem Solving from Nature (pp. 78 87). London, UK: Springer-Verlag. Mhlenbein, H., & Hns, R. (2006). Scalable optimization via probabilistic modeling: The factorized distribution algorithm and the minimum relative entropy principle. Springer Berlin / Heidelberg. Pelikan, M., & Goldberg, D. E. (999). Boa: The bayesian optimization algorithm. (pp. 525 532). Morgan Kaufmann. Shan, Y., McKay, R. I., Essam, D., & Abbass, H. A. (2006). Scalable optimization via probabilistic modeling: A survey of probabilistic model building genetic programming. Springer Berlin / Heidelberg. Thierens, D., & Goldberg, D. E. (994). Convergence models of genetic algorithm selection schemes. In PPSN III: Proceedings of the International Conference on Evolutionary Computation. The Third Conference on Parallel Problem Solving from Nature (pp. 9 29). London, UK: Springer-Verlag. Tsuji, M., Munetomo, M., & Akama, K. (2004). Modeling dependencies of loci with string classification according to fitness differences. Genetic and Evolutionary Computation GECCO 2004, 246 257. Yu, T.-L., Goldberg, D. E., Yassine, A., & Chen, Y.-P. (2003). Genetic algorithm design inspired by organizational theory: Pilot study of a dependency structure matrix driven genetic algorithm. Proceedings of Artificial Neural Networks in Engineering 2003. Yu, T.-L., Sastry, K., Goldberg, D. E., & Pelikan, M. (2007). Population sizing for entropy-based model building in discrete estimation of distribution algorithms. In GECCO 07: Proceedings of the 9th annual conference on Genetic and evolutionary computation (pp. 60 608). New York, NY, USA: ACM.