Linkage Identification Based on Epistasis Measures to Realize Efficient Genetic Algorithms


Masaharu Munetomo
Center for Information and Multimedia Studies, Hokkaido University, North 11, West 5, Sapporo, Japan.

Abstract

Genetic algorithms (GAs) process building blocks (BBs) that are mixed and tested through genetic recombination operators. To realize effective BB processing, linkage identification, which detects sets of tightly linked loci, is essential. This paper proposes linkage identification with epistasis measures (LIEM), which detects linkage groups based on a pairwise epistasis measure.

I. Introduction

Genetic Algorithms (GAs) are considered a robust search technique for solving difficult problems such as combinatorial optimization problems. GAs repeatedly apply genetic operators such as crossover, mutation, and selection to a population of strings, each of which represents a point in the problem domain. Recombination operators such as crossover exchange substrings between a pair of strings, and perturbation operators such as mutation flip a bit of a string. Generally speaking, the essence of genetic search lies in its processing of building blocks (BBs), which are short, high-performance sub-solutions, through recombination operators to speed up genetic optimization.

GAs have been applied to a wide spectrum of application problems, and a number of researchers have reported that GAs are quite efficient in solving their problems; on the other hand, some have reported quite poor performance in their applications of GAs. Various reasons are considered for the poor performance, such as the essential GA-difficulty of the problem itself, an insufficiently large initial population, and so on. However, in most cases where GAs perform poorly, the genetic recombination mechanism, which is considered essential in genetic optimization, does not work effectively because of improper encoding or poorly designed recombination operators, mainly because the importance of tight linkage in encoding and recombination is ignored.
Ensuring tight linkage is essential for recombination operators to work effectively. A set of loci that belong to a BB should be tightly linked in order to survive the genetic optimization process. In this paper, the word linkage means a set of loci that belong to the same BB, and a linkage group or linkage set is defined as a set of tightly linked loci that may form a BB. This paper proposes a linkage identification procedure that detects linkage groups to realize efficient genetic recombination.

This paper continues as follows. Section II gives a history of linkage identification, from messy GAs to the current work on linkage identification algorithms. Section III gives a detailed description of the proposed algorithm, called linkage identification with epistasis measures (LIEM). Sections IV and V give empirical investigations and their discussion on some nonlinear functions of the sum of GA-difficult trap functions, for which accurate linkage groups were considered difficult to detect.

II. Linkage identification

Decomposability of a problem into subproblems is essential to realizing effective search with optimization algorithms based on a divide-and-conquer strategy. For example, a linear combination of sub-functions is clearly decomposable: a function $h(x_1, x_2) = f(x_1) + g(x_2)$ can be optimized by separately solving the optimization sub-problems for $f(x_1)$ and $g(x_2)$. For genetic optimization, problem decomposition based on BBs and their mixing by recombination operators are essential to realize effective search. However, this does not mean that strict decomposability is necessary for effective GA search; rather, quasi-decomposability is required. Problems without any decomposability cannot be solved effectively with any divide-and-conquer strategy. Classical GAs did not consider problem decomposition explicitly; they process BBs indirectly with general or problem-specific crossover operators.
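The separable example $h(x_1, x_2) = f(x_1) + g(x_2)$ can be checked numerically. The sketch below is only an illustration: the particular sub-functions `f` and `g` and the integer domain are assumptions chosen for this example, not functions from the paper. It compares a joint exhaustive search with two independent searches.

```python
from itertools import product

# Hypothetical sub-functions, chosen only so that each has a unique maximum.
def f(x1):
    return -(x1 - 3) ** 2

def g(x2):
    return -(x2 - 1) ** 2 + 2

def h(x1, x2):
    # Linear combination of sub-functions: strictly decomposable.
    return f(x1) + g(x2)

domain = range(8)

# Joint search evaluates len(domain) ** 2 = 64 candidate points.
joint_best = max(product(domain, domain), key=lambda p: h(*p))

# Separate searches evaluate only 2 * len(domain) = 16 candidate points.
separate_best = (max(domain, key=f), max(domain, key=g))

print(joint_best, separate_best)
```

Both searches return the same optimum, but the separate search grows linearly with the number of sub-problems while the joint search grows exponentially, which is the benefit that decomposition is meant to capture.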
To search for BBs directly, the messy GA (mGA) [2] processes under- or over-specified strings that represent schemata. The mGA does not have an explicit linkage representation; it tries to learn linkage indirectly by searching for BBs through cut and splice operators applied to a number of BB candidates (schemata) generated in its primordial phase. On the other hand, the gene expression messy GA (GEMGA) [4, 5, 6] assigns a weight to each locus in order to check local optimality concerning the locus and to detect candidate loci that belong to the same linkage group. Another approach employed in the linkage learning GA

(LLGA) [3] is based on a two-point-like crossover operator applied to circularly encoded strings. The crossover of the LLGA tends to preserve tight linkages of the problem if the problem is loosely separable into sub-functions that are not uniformly scaled (that is, the contribution of each sub-function to the overall function value is not uniform). These GAs try to identify linkage groups during their optimization process.

On the other hand, linkage identification procedures generate linkage groups directly from an initial population before starting the optimization process. Linkage identification by nonlinearity check (LINC) [8, 9] pioneered direct linkage identification procedures. The LINC detects arbitrary nonlinearity for each pair of loci based on a population of strings to obtain sets of tightly linked loci. The algorithm of the LINC is shown in figure 1. It calculates the following fitness changes caused by bitwise perturbations for each pair of loci (i, j):

$$\Delta f_i(s) = f(\ldots \bar{s}_i \ldots) - f(\ldots s_i \ldots)$$
$$\Delta f_j(s) = f(\ldots \bar{s}_j \ldots) - f(\ldots s_j \ldots)$$
$$\Delta f_{ij}(s) = f(\ldots \bar{s}_i \ldots \bar{s}_j \ldots) - f(\ldots s_i \ldots s_j \ldots) \tag{1}$$

where $f(s)$ is the fitness function of $s$ and $\bar{s} = 1 - s$ ($0 \to 1$ or $1 \to 0$) stands for a perturbation.

algorithm LINC
  P = initialize N strings
  for each s in P
    for i = 0 to length-1
      s' = Perturb(s, i); df1 = f(s') - f(s);
      for j = i to length-1
        if i != j then
          s'' = Perturb(s, j); df2 = f(s'') - f(s);
          s''' = Perturb(s'', i); df12 = f(s''') - f(s);
          if |df12 - (df1 + df2)| > epsilon then
            /* nonlinearity detected between i and j */
            adding j to the linkage_set[i];
            adding i to the linkage_set[j];
          endif
        endif

Figure 1: The Linkage Identification by Nonlinearity Check (LINC)

In the LINC algorithm, if $\Delta f_{ij}(s) \neq \Delta f_i(s) + \Delta f_j(s)$ is satisfied for at least one string $s$ in a population $P$, we consider loci i and j to be tightly linked and include them in the same linkage group. Practically, we should allow some small amount of error $\epsilon$ in the condition; therefore, we employ $|\Delta f_{ij}(s) - (\Delta f_i(s) + \Delta f_j(s))| > \epsilon$.
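The LINC check described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: strings are represented as tuples of 0/1, and the names `flip`, `linc`, and `toy_fitness` as well as the toy fitness function itself are assumptions made for this example.

```python
from itertools import product

def flip(s, *positions):
    """Return a copy of bit string s with the given positions perturbed."""
    t = list(s)
    for p in positions:
        t[p] = 1 - t[p]
    return tuple(t)

def linc(fitness, population, epsilon=1e-9):
    """Pairwise nonlinearity check over every string in the population."""
    length = len(population[0])
    linkage_sets = [set() for _ in range(length)]
    for s in population:
        base = fitness(s)
        for i in range(length):
            df_i = fitness(flip(s, i)) - base
            for j in range(i + 1, length):
                df_j = fitness(flip(s, j)) - base
                df_ij = fitness(flip(s, i, j)) - base
                # Nonlinearity detected between loci i and j.
                if abs(df_ij - (df_i + df_j)) > epsilon:
                    linkage_sets[i].add(j)
                    linkage_sets[j].add(i)
    return linkage_sets

# Hypothetical fitness: loci 0 and 1 interact, locus 2 is independent.
def toy_fitness(s):
    return s[0] * s[1] + s[2]

# All 8 strings of length 3, so no interaction can be missed.
population = list(product((0, 1), repeat=3))
linkage_sets = linc(toy_fitness, population)
print(linkage_sets)
```

Using the full enumeration as the population keeps the example deterministic; with a random population the result depends on whether a string exposing the nonlinearity happens to be sampled.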
It is essential to check the above condition for all strings in a sufficiently large population. Even if GA-easy linearity is detected in one string, this does not mean that the problem is GA-easy, because strong nonlinearity may be detected in another string (for example, a GA-difficult trap function is linear along its deceptive attractor). Therefore, in principle it is necessary to check the condition for all possible strings in order to detect the nonlinearity of the problem. This is practically impossible because the number of possible strings is $O(2^l)$, where $l$ is the string length. Instead, when we assume that the maximum order of BBs is fixed to $k$ satisfying $k \ll l$ (this is the condition under which GAs work effectively), the necessary number of strings $n$ has been calculated as $n = -2^k \log(1 - r)$, where $r$ is the probability of obtaining correct linkage groups; this is much smaller than the original exponential estimate [9].

III. Linkage identification with epistasis measures

In this paper, we propose a linkage identification algorithm with epistasis measures (LIEM). The LIEM replaces the strict condition of the LINC with a condition based on a linkage measure that represents the strength of epistasis among loci. Figure 2 shows this idea of linkage identification.

Figure 2: Overview of the LIEM. Epistasis measures $e_{ij}$ are calculated; strong epistasis ($e_{ij}$ large) indicates tightly linked loci (a linkage group), and weak epistasis ($e_{ij}$ small) indicates loosely linked (separable) loci.

We believe that linkage should be identified by detecting the difference between strong epistasis and weak epistasis (including linearity). Weak epistasis among a set of loci means that the problem can be decomposed into subproblems concerning the loci, which can then be optimized separately with ease. On the other hand, a set of loci with strong epistasis is difficult to separate and optimize; therefore, such loci should be treated together by recombination operators throughout the optimization process.
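To give a feel for the population-sizing estimate quoted from [9], the following sketch evaluates it for a few settings. Two points are assumptions of this example and not statements from the paper: the leading minus sign (needed for $n$ to be positive, since $\log(1 - r) < 0$) and the use of the natural logarithm. The function name `required_strings` is hypothetical.

```python
import math

def required_strings(k, r):
    """Evaluate n = -2^k * log(1 - r), rounded up to a whole string count.

    Assumes the natural logarithm; k is the maximum BB order and r the
    target probability of obtaining correct linkage groups.
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    return math.ceil(-(2 ** k) * math.log(1.0 - r))

n_order5 = required_strings(5, 0.99)
print(n_order5)   # strings needed for order-5 BBs at r = 0.99
print(2 ** 50)    # size of the full search space for l = 50
```

Under these assumptions the estimate depends on the BB order $k$ only, not on the string length $l$, which is why it stays far below the $O(2^l)$ cost of checking every possible string.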
Basically, genetic search with recombination operators processes relatively weak epistasis (or GA-easy epistasis), whereas strong (GA-difficult) epistasis such as deception can only be processed with enumerative search realized in a sufficiently large population of strings (therefore, the population needs to have $O(2^k)$ strings, where $k$ is the maximum order of BBs). The LIEM we propose realizes efficient linkage identification based on a clear definition of the strength of epistasis. This approach is based on a pairwise epistasis measure $e_{ij} \geq 0$, defined for each pair of loci (i, j), that measures the strength of epistasis between the loci.

Several papers have been published on epistasis measures, such as the epistasis variance proposed by Davidor [1] and the normalized epistasis value and bit decidability proposed by Naudts et al. [7]. However, they concentrate their discussion on the relation between epistasis measures of whole fitness functions and their GA-difficulty. Therefore, such measures only give approximate estimates of the GA-difficulty of the problem and are not useful for identifying linkage groups to realize efficient genetic optimization. On the other hand, the proposed pairwise epistasis measure is designed for linkage identification by detecting the tightness of linkage for each pair of loci.

In this paper, we employ a simple pairwise epistasis measure based on the LINC criterion, defined as follows (other definitions of the linkage measure are possible):

$$e_{ij} = \max_{s \in P} \left| \Delta f_{ij}(s) - (\Delta f_i(s) + \Delta f_j(s)) \right| \tag{2}$$

where $\Delta f_i(s)$, $\Delta f_j(s)$, and $\Delta f_{ij}(s)$ are the same as those in equations (1). The above equation gives the maximum distance from the boundary that separates satisfaction and violation of the LINC condition. Employing this definition, the LINC condition can be rewritten as follows.

If $e_{ij} > 0$, then the pair (i, j) is tightly linked because a nonlinearity is detected, and the two loci should be included in the same linkage group.

If $e_{ij} = 0$, then the pair (i, j) is separated as a linear combination because $\Delta f_{ij}(s) = \Delta f_i(s) + \Delta f_j(s)$ is satisfied for all $s$, and therefore the two loci should not be included in the same linkage group.
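A small sketch of the pairwise measure in equation (2) is given below. It is an illustration under stated assumptions: the 3-bit trap `trap3`, the two-block fitness `two_traps`, and the helper names are hypothetical and chosen only to keep the full enumeration small; they are not the test functions used in the paper.

```python
from itertools import product

def flip(s, *positions):
    """Return a copy of bit string s with the given positions perturbed."""
    t = list(s)
    for p in positions:
        t[p] = 1 - t[p]
    return tuple(t)

def pairwise_epistasis(fitness, population):
    """e[i][j] = max over s of |df_ij(s) - (df_i(s) + df_j(s))|."""
    length = len(population[0])
    e = [[0.0] * length for _ in range(length)]
    for s in population:
        base = fitness(s)
        single = [fitness(flip(s, i)) - base for i in range(length)]
        for i in range(length):
            for j in range(i + 1, length):
                df_ij = fitness(flip(s, i, j)) - base
                deviation = abs(df_ij - (single[i] + single[j]))
                if deviation > e[i][j]:
                    e[i][j] = e[j][i] = deviation
    return e

# Hypothetical 3-bit trap: deceptive slope toward u = 0, optimum at u = 3.
def trap3(u):
    return 3 if u == 3 else 2 - u

def two_traps(s):
    return trap3(sum(s[:3])) + trap3(sum(s[3:]))

population = list(product((0, 1), repeat=6))
e = pairwise_epistasis(two_traps, population)
print(e[0][1], e[0][3])
```

In this toy case the measure is positive for loci inside the same trap and zero across traps. The positive value only appears in strings where the third bit of the block is 1; when it is 0 the trap looks linear, which is why the maximum over the whole population is taken.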
Therefore, a smaller value of $e_{ij}$ indicates relatively weak nonlinearity, and a pair of loci with smaller $e_{ij}$ should be treated separately; a larger value of $e_{ij}$ indicates relatively strong nonlinearity, and a pair of loci with larger $e_{ij}$ should be included in the same linkage group because they are considered tightly linked.

Based on the above idea, in the LIEM, the linkage group of a locus is identified by sorting the epistasis measures concerning the locus and picking a fixed number of loci $k$ from those with the largest values of the measure. For example, suppose we have the following values of the pairwise epistasis measures for locus 1: $e_{12} = 0.5$, $e_{13} = 1.1$, $e_{14} = 0.3$, $e_{15} = 0.0$, $e_{16} = 0.1$, and we pick three loci as a linkage group. First, the measures are sorted as $e_{13} > e_{12} > e_{14} > e_{16} > e_{15}$. Second, we pick three loci according to the sorted measures and obtain {3, 2, 4} as tightly linked with locus 1; consequently, the obtained linkage group is {1, 2, 3, 4}. Note that loci with a zero epistasis measure ($e_{ij} = 0$) should not be included in the linkage group, because under the above definition a pair of loci (i, j) is not considered tightly linked when $e_{ij} = 0$.

To apply the LIEM, we need to assume the maximum order of BBs as the fixed number of loci $k$ defined above. In this paper, we call this order the difficulty number $d$ because it represents the problem difficulty for genetic recombination. According to the population sizing discussed in the previous section [9], an initial population of $O(2^k)$ strings is necessary to obtain correct linkage groups. Conversely, it is more natural to say that once the initial population size is fixed, the maximum order of BBs that can be detected is also fixed.

In figure 3, we show a detailed description of the LIEM algorithm. The algorithm starts by setting the initial population size $N = c \cdot 2^d$ and randomly generating a population $P$ consisting of $N$ strings.
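The sort-and-pick step of the worked example for locus 1 can be reproduced with a few lines of Python. This is a sketch only: the function name `linkage_group` and the dictionary representation of the measures are assumptions of this example.

```python
def linkage_group(locus, measures, count, epsilon=0.0):
    """Pick up to `count` partner loci with the largest epistasis measures.

    `measures` maps a partner locus j to e[locus][j]; partners whose
    measure does not exceed epsilon are never included.
    """
    ranked = sorted(measures.items(), key=lambda item: item[1], reverse=True)
    partners = [j for j, value in ranked[:count] if value > epsilon]
    return [locus] + partners

# Values from the worked example: e_12, e_13, e_14, e_15, e_16.
measures = {2: 0.5, 3: 1.1, 4: 0.3, 5: 0.0, 6: 0.1}

group = linkage_group(1, measures, count=3)
print(group)

# Asking for more partners than have nonzero measure stops early.
larger_group = linkage_group(1, measures, count=5)
print(larger_group)
```

The second call shows the effect of the zero-measure rule: locus 5 is left out even though five partners were requested.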
After the initialization, the epistasis measures e[i][j] are calculated for each pair of loci (i, j) based on fitness changes for all strings $s$ in the population by applying perturbations (perturb(s, i) means that the i-th position of string $s$ is perturbed by $s'_i = \bar{s}_i = 1 - s_i$). The calculated epistasis measures are then sorted. The linkage group of locus i (l[i][*]) is obtained by selecting a set of loci according to the sorted measures (except when the measure is equal to zero or, practically, smaller than a small value epsilon). The time complexity of the LIEM algorithm is $O(2^k l^2)$ because perturbations are applied for all pairs of loci, of order $O(l^2)$, to each string in a properly sized population of $O(2^k)$ strings.

IV. Empirical investigations

The LIEM algorithm is expected to identify correct linkage groups not only for quasi-linear combinations of GA-difficult sub-functions but also for some (weakly) nonlinear combinations of the sub-functions. It is difficult to validate the effectiveness of the LIEM for a wide spectrum of practical problems. Therefore, in this empirical study, we perform experiments on some linear and nonlinear functions of the sum of GA-difficult trap functions. In the following simulation experiments, we employ the sum of trap functions defined as follows:

$$h(x) = \sum_{i=1}^{10} f_i(u_i) \tag{3}$$

$$f_i(u_i) = \begin{cases} 4 - u_i & \text{if } 0 \leq u_i \leq 4 \\ 5 & \text{if } u_i = 5 \end{cases} \tag{4}$$

where $u_i$ is the number of ones (unitation) in each 5-bit substring of $x$. In order to control the difficulty of linkage identification by changing the nonlinearity of the whole fitness function, we employ $f(x) = h(x)^n$ $(n = 1, 2, \ldots)$ as test functions.

Table 1 shows the percentage of linkage groups correctly identified by the LIEM and the LINC. (Even if the strings are randomly encoded, the same result should be obtained because of the nature of the algorithms.)

Table 1: % of linkage correctly identified for $h(x)^n$ (columns: n, LIEM, LINC)

The LINC is vulnerable to nonlinearity of the overall fitness function because the algorithm detects strict nonlinearity to obtain linkage groups. We can introduce a threshold in the condition; however, such a modification only removes minor effects caused by a small amount of noise and cannot solve the above problem caused by relatively weak nonlinearity. On the other hand, the LIEM achieves robust linkage identification for these nonlinear test functions.

The linkage groups obtained for the functions $h(x)$, $h(x)^2$, and $h(x)^5$ are shown in figure 4. The numbers after ":" represent the set of loci tightly linked to the locus specified before ":". For example, the entry for locus 0 ("0 : ...") represents that the set of loci {0, 1, 2, 3, 4} belongs to the linkage group of locus 0. For $h(x)$, a linear combination of trap functions, and $h(x)^2$, a weakly nonlinear one, the LIEM obtains correct results in which each 5-bit sub-function is detected as a linkage group. For $h(x)^5$, a relatively strongly nonlinear function of the sum of trap functions, the LIEM sometimes fails to identify linkage groups. For example, the linkage group obtained for locus 48 is {48, 45, 47, 49, 0}, which should be {48, 45, 47, 49, 46}.

In order to understand why linkage identification fails for nonlinear functions, we plot the values of the epistasis measures for the test functions in figure 5 ($h(x)$), figure 6 ($h(x)^2$), and figure 7 ($h(x)^5$). In the figures, the X-axis and Y-axis represent a pair of loci (bit positions), and the Z-axis shows the value of the epistasis measure of the pair. As shown in figure 5, the landscape of the epistasis measures for the linear function $h(x)$ has a clear distinction between tightly linked and loosely linked pairs of loci.
For $h(x)^2$ (figure 6), the landscape becomes more complex because of the nonlinearity; however, it is still not difficult to detect linkage groups. On the other hand, for $h(x)^5$, a function with strong nonlinearity, the landscape of the epistasis measures has a complex structure that makes it difficult for the LIEM to identify correct linkage groups. This difficulty is considered natural because the nonlinearity caused by raising $h(x)$ to the fifth power is large compared with the nonlinearity of the trap functions, and it becomes difficult to detect the difference in epistasis between the two.

V. Discussions

Through the empirical studies above, we have shown that the LIEM can obtain correct linkage groups for weakly nonlinear functions of the sum of GA-difficult trap functions. This result implies that correct linkage can be detected when the overall nonlinearity of the fitness function is small enough compared with the nonlinearity inside a linkage group. This is because the LIEM observes the difference between strong epistasis inside a set of loci in a linkage group and relatively weak epistasis among the other pairs of loci.

In this paper, we adopt a simple pairwise epistasis measure based on the condition of the LINC. Although the proposed definition seems reasonable, it is still important to seek a more general definition of epistasis measures. In our definition in equation (2), we calculate the maximum difference between simultaneous and individual fitness changes caused by perturbations. This definition considers only the single maximum instance in the population and does not deal with the overall fitness landscape. By replacing the maximum with some nonlinear function, we may design another measure that takes the fitness landscape of the population into consideration. Other than designing epistasis measures based on the LINC, we may consider another definition of epistasis measures based on the linkage identification by nonmonotonicity detection (LIMD) proposed elsewhere [10, 11].
The LIMD generates linkage groups by detecting violations of its monotonicity condition. The monotonicity condition is based on the idea that a monotonic function is easy for all search algorithms, including GAs, and that violations of monotonicity should be detected to find GA-difficulty. Although the basic idea of the LIMD is different from that of the LINC, which is based on epistasis, we might define an epistasis-like measure from the LIMD condition that may give more accurate identification of linkage groups.

VI. Conclusion

Linkage identification is essential for genetic recombination to work effectively and reliably. This paper proposes a linkage identification procedure based on pairwise epistasis measures calculated for each pair of loci. The key idea of the procedure is to detect the difference between strong and weak epistasis. The LIEM we propose is a simple yet powerful procedure that generates correct linkage groups even for some nonlinear functions of the sum of trap functions for which the LINC cannot obtain correct results. Through empirical studies that illustrate the landscape of the epistasis measures for some test functions, we show the effectiveness of linkage identification based on the measures.

References

[1] Y. Davidor. Epistasis variance: A viewpoint on GA-hardness. Foundations of Genetic Algorithms 1, pages 23-35,
[2] D. E. Goldberg, B. Korb, and K. Deb. Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems, 3(5): ,
[3] G. R. Harik and D. E. Goldberg. Learning linkage. Foundations of Genetic Algorithms 4, pages ,
[4] H. Kargupta. The gene expression messy genetic algorithm. Proceedings of the 1996 IEEE Conference on Evolutionary Computation, pages , Piscataway, NJ, IEEE Service Center.
[5] H. Kargupta. SEARCH, evolution, and the gene expression messy genetic algorithm. Unclassified Report LA-UR96-60, Los Alamos National Laboratory, Los Alamos, NM,
[6] H. Kargupta and S. Bandyopadhyay. Further experiments on the scalability of the GEMGA. Proceedings of the Parallel Problem Solving From Nature V, pages ,
[7] B. Naudts, D. Suys, and A. Verschoren. Epistasis as a basic concept in formal landscape analysis. In Proceedings of the Seventh International Conference on Genetic Algorithms,
[8] Masaharu Munetomo and David E. Goldberg. Designing a genetic algorithm using the linkage identification by nonlinearity check. Technical Report IlliGAL Report No.98014, University of Illinois at Urbana-Champaign,
[9] Masaharu Munetomo and David E. Goldberg. Identifying linkage by nonlinearity check. Technical Report IlliGAL Report No.98012, University of Illinois at Urbana-Champaign,
[10] Masaharu Munetomo and David E. Goldberg. Identifying linkage groups by nonlinearity/nonmonotonicity detection. In Proceedings of the 1999 Genetic and Evolutionary Computation Conference,
[11] Masaharu Munetomo and David E. Goldberg. Linkage identification by non-monotonicity detection for overlapping functions. Evolutionary Computation, 7(4),

algorithm LIEM
  N = c * 2^difficulty;
  P = initialize N strings;
  /* Calculate epistasis measure e[i][j] */
  for i = 0 to l-1
    for j = 0 to l-1
      e[i][j] = 0;
      if i != j then
        for each s in P
          s' = perturb(s, i); f1 = fitness(s') - fitness(s);
          s'' = perturb(s, j); f2 = fitness(s'') - fitness(s);
          s''' = perturb(s', j); f12 = fitness(s''') - fitness(s);
          ep[s] = |f12 - (f1 + f2)|;
          if (ep[s] > e[i][j]) then
            e[i][j] = ep[s];
          endif
  /* Generate linkage group l[i][k] where k = 0, 1, ..., difficulty-1 */
  for i = 0 to l-1
    for j = 0 to l-1
      id[i][j] = j;
    sort e[i][j] together with id[i][j] over j in descending order;
    /* select linkages */
    for k = 0 to difficulty-1
      if (e[i][k] > epsilon)
        l[i][k] = id[i][k];
      else
        break;

Figure 3: The Linkage Identification with Epistasis Measure (LIEM)

Figure 4: Results of the LIEM for test functions (panels: $h(x)$, $h(x)^2$, $h(x)^5$)

Figure 5: Epistasis measures for $h(x)$

Figure 6: Epistasis measures for $h(x)^2$

Figure 7: Epistasis measures for $h(x)^5$
