An Improved Selection of GEP Based on CPCSC-DSC Approach

Li Wu 1, a, Yonghong Yu 2, b and Zhou Zhou 2, c
1 School of Finance and Public Management, Anhui University of Finance & Economics, Bengbu 233030, China;
2 School of Management Science and Engineering, Anhui University of Finance & Economics, Bengbu 233030, China.
a wuli@163.com, b c120107@163.com, c 1057578619@qq.com

Abstract
This paper mainly discusses the effects of duplicated chromosomes on genetic diversity, population fitness level, evolutionary efficiency, and evolutionary search retention during the selection operation of gene expression programming (GEP). It introduces the concepts of strict and non-strict implicit duplicated chromosomes and proposes a novel method named creating population copy for the second choice-deleting same chromosome (CPCSC_DSC), which removes duplicated and implicit duplicated chromosomes according to the population fitness. The method can ensure population diversity and evolutionary efficiency and can avoid premature convergence. The experimental results show that CPCSC_DSC accelerates evolution and increases the fitness of the optimal chromosome compared to the result obtained without CPCSC_DSC.

Keywords
Gene expression programming, Selection operation, Creating population copy for the second choice, Deleting same chromosome.

1. Introduction
After GEP [1,2,3,4] completes a generation of evolutionary operations, new individuals are added to the population and the number of individuals expands. These individuals have different fitness, and only some of them are added to the next generation. GEP obtains the fitness of each individual through the established fitness function and takes the fitness as the selection criterion: the higher the fitness of an individual, the greater its probability of being chosen for the next generation. Each individual in the population has a selection probability, which is generally allocated either in proportion to fitness or according to fitness rank.
Allocation in proportion to fitness is known as the Monte Carlo method: the probability of an individual being selected depends on its share of the total fitness of the population. Allocation by fitness ranking sorts the objective values of the population and determines the fitness values according to ordinal position. The commonly used selection methods are roulette selection, random traversal sampling (stochastic universal sampling), elitist selection and tournament selection. Introducing an elitism strategy into roulette selection can guarantee the global convergence of the algorithm, but evolution is slow. The tournament strategy and the elite selection strategy converge quickly but easily fall into local optima. The population initialization of the basic GEP algorithm [2,4] is completely random and the selection scale of the roulette wheel is constant, so the diversity of the GEP population is relatively low and unevenly distributed. Reference [5] proposes the elite individual production strategy (EPS) and the GSBS initial-population strategy based on genetic spatial distribution; by producing individuals with higher fitness and improving the genetic diversity of the initial population, these strategies improve evolutionary efficiency and avoid convergence to a local optimum. Reference [6] proposed the OBS algorithm, which uses the evolutionary process to eliminate close relatives and propagate distant species in order to enhance population diversity. Reference [7] proposed the DAIP algorithm, which uses a weighted diversity measure to quantify and maximize the population diversity. Reference [8] proposed the GDM-GEP algorithm, which measures the diversity of the population by the
individual fitness variance; by designing an adaptive mutation operator, the algorithm changes the mutation rate with the population diversity, maintaining the stability of the population and preserving excellent individuals.
Offspring chromosomes that are identical to a parent chromosome are also added to the population and participate in selection after fitness evaluation. If a duplicated chromosome has high fitness, its probability of being selected as a parent chromosome of the next generation increases under the roulette selection method and the random traversal sampling method. Under tournament selection and the elitist strategy, duplicated chromosomes are selected for the next generation as long as their fitness values reach the top of the ranking. Adding duplicated chromosomes to the next population as parent chromosomes may cause three problems:
(1) The diversity of excellent chromosomes in descendant populations is reduced.
(2) The population fitness level and evolution efficiency of the next generation before genetic operation are reduced.
(3) Duplicated chromosomes malignantly assimilate other individuals and cause evolutionary search retention.
To sum up, duplicated chromosomes have a certain stability. Once a certain number of them have formed in the population, they are difficult to destroy by genetic operators and they expand rapidly. In order to ensure population diversity and evolutionary efficiency and to avoid premature convergence, it is necessary to eliminate duplicated chromosomes during the genetic process. This paper introduces the concepts of strict implicit duplicated chromosome and non-strict implicit duplicated chromosome, and proposes a novel approach to delete duplicated chromosomes, named creating population copy for the second choice-deleting same chromosome (CPCSC-DSC). It adopts the fitness value as the judgment rule to determine whether an individual is a duplicated chromosome or a strict implicit duplicated chromosome, and uses the CPCSC-DSC method to replenish the number of individuals, thereby improving the GEP selection process.
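The effect of duplicates on fitness-proportional selection described above can be illustrated with a small sketch. The population, the genotype labels and the fitness values are hypothetical, and the helper names `roulette_probabilities` and `genotype_share` are introduced only for this illustration; they are not part of the original method.

```python
from collections import Counter

def roulette_probabilities(fitness):
    """Fitness-proportional (roulette) selection probability of each individual."""
    total = sum(fitness)
    return [f / total for f in fitness]

def genotype_share(population, fitness):
    """Total selection probability per genotype; copies of a genotype add up."""
    share = Counter()
    for chrom, p in zip(population, roulette_probabilities(fitness)):
        share[chrom] += p
    return dict(share)

# Hypothetical population in which genotype "A" occurs three times.
population = ["A", "A", "A", "B", "C"]
fitness = [0.8, 0.8, 0.8, 0.7, 0.5]

share = genotype_share(population, fitness)
for genotype, p in share.items():
    print(genotype, round(p, 3))
```

In this toy population, genotype "A" holds two thirds of the roulette wheel although its fitness is only slightly higher than that of "B", which is the kind of assimilation that problem (3) refers to.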
2. Preliminary
Terminals and functions constitute the primitive elements of GEP. The terminals represent the constants, inputs or parameterless functions of a mathematical expression and correspond to the leaf nodes of the expression tree. The functions represent application-specific operators, program components and intermediate symbols of the system; they correspond to the non-leaf nodes of the expression tree. The expression tree (ET) is used as the phenotype of GEP individuals to represent an expression visually. In order to facilitate genetic manipulation by computer, the nonlinear structure of the expression tree is translated into a linear structure, which Ferreira called a K-expression; it serves as the chromosome of GEP and corresponds to the genotype of GEP individuals.
GEP genes are composed of a head and a tail [2,4]. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. Therefore two different alphabets occur at different regions within a gene. For each problem, the length of the head h is chosen, whereas the length of the tail t is a function of h and of n, the number of arguments of the function with the most arguments, and is evaluated by the equation:
t = h * (n - 1) + 1    (1)
Consider a gene composed of {+, *, ln, sin, a, b}. In this case n = 2. For instance, for h = 7 and t = 8, the length of the gene is 7 + 8 = 15. One such gene is shown below, with positions numbered 0 to 14:
[Eq. (2): gene string; only the symbols *, sin and ln are legible]    (2)
and it codes for the following ET:
[Expression tree of the gene in Eq. (2); legible function nodes: +, *, sin, ln, +]
Each chromosome in the initial population created by the GEP algorithm has a different meaning. The genetic operators act on the population, and the higher the fitness of the resulting offspring individuals, the greater their probability of being selected. GEP adopts linear equal-length coding: as long as the length of the gene is constant and the tail consists only of terminals, a legitimate offspring chromosome is obtained. Therefore, the GEP genetic operators can be designed more simply and flexibly. The basic genetic operators of GEP mainly include selection, mutation, transposition of insertion sequence elements, root transposition, gene transposition, one-point recombination, two-point recombination, and gene recombination.
Fitness is a biological indicator that measures the ability of a species to adapt to its environment; it also applies to the GEP algorithm. The fitness function is used to calculate the fitness of each individual and thus to evaluate its quality. Evaluating a chromosome yields its expression tree; the objective function value is then calculated from this phenotype, and finally the fitness value of the individual is obtained from the objective function value according to the conversion rules. Ferreira proposed two fitness evaluation models: in one, the fitness f_j of an individual chromosome j is calculated from the absolute error; in the other, it is calculated from the relative error. Let T be a dataset containing n samples, T_i the i-th input sample, v_i the observed value of T_i, and \hat{v}_i the estimated value, and let K be a constant. The fitness equations are:
$$f_j = \sum_{i=1}^{n} \left( K - \left| v_i - \hat{v}_i \right| \right) \qquad (3)$$
$$f_j = \sum_{i=1}^{n} \left( K - \left| \frac{\hat{v}_i - v_i}{v_i} \right| \right) \qquad (4)$$
Through the constraint of the fitness function, the population retains the highly adapted chromosomes during evolution. When an individual chromosome has no relative or absolute error on any sample estimate, both its accuracy and its fitness are maximized.

3.
Creating Population Copy for the Second Choice-Deleting Same Chromosome Approach
During evolution, the positions of mutation and recombination are random, and so are the lengths of the segments involved. A chromosome may therefore remain unchanged after genetic manipulation; that is, a duplicated chromosome is produced. This occurs in the following cases:
(1) The elements are identical after mutation.
(2) The elements of the substring involved in inversion are symmetric about the middle element.
(3) During the insertion operation, the inserted substring has the same type and order of elements as the elements it replaces, and the same type and order as the original elements at the positions that are subsequently shifted.
(4) The two strings involved in recombination have the same type and order of elements.
When the randomly generated substrings are relatively short, the probability of these situations increases greatly.
Implicit duplicated chromosomes are two or more chromosomes in the population whose genotypes are different but whose phenotype is the same, so that they are eventually converted into the same expression tree.
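The idea that different genotypes can decode to the same phenotype can be sketched as follows. The symbol arities, the two example genes and the helper names `tail_length`, `orf_length` and `orf` are assumptions made for this illustration and are not taken from the paper; the ORF is computed by the usual breadth-first reading of a K-expression, which stops once every argument slot is filled.

```python
# Arity of each function symbol; any symbol not listed is treated as a terminal.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "sin": 1, "ln": 1, "exp": 1}

def tail_length(h, n):
    """Tail length from Eq. (1): t = h * (n - 1) + 1."""
    return h * (n - 1) + 1

def orf_length(gene):
    """Number of leading symbols that the expression tree actually uses."""
    needed = 1  # open argument slots; the root fills the first one
    for i, symbol in enumerate(gene):
        needed += ARITY.get(symbol, 0) - 1
        if needed == 0:
            return i + 1
    raise ValueError("gene does not encode a complete expression")

def orf(gene):
    """Coding region (ORF) of a gene as a hashable tuple."""
    return tuple(gene[:orf_length(gene)])

# Two hypothetical genes with h = 3 and n = 2, hence t = 4 and length 7.
h, n = 3, 2
gene_1 = ["exp", "*", "a", "b", "a", "a", "b"]
gene_2 = ["exp", "*", "a", "b", "b", "b", "a"]
assert len(gene_1) == len(gene_2) == h + tail_length(h, n)

same_genotype = gene_1 == gene_2
same_orf = orf(gene_1) == orf(gene_2)
print(same_genotype, same_orf)  # False True
```

Both genes decode to exp(a * b) because only their first four symbols are read; the remaining symbols form the noncoding region, where the two genotypes differ.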
Definition 1. Strict Implicit Duplicated Chromosome: Two or more gene expressions have the same ORF and different noncoding regions, and can be traversed directly to the same expression tree.
Equation 5 shows two gene expressions that are strict implicit duplicated chromosomes: their ORF is the same (the underlined part), and the corresponding expression tree is:
[Eq. (5): two gene expressions with the same underlined ORF, and their common expression tree; legible tree nodes: exp, *]    (5)
Strict implicit duplicated chromosomes that enter the parent chromosomes of the descendant population may cause the following three problems: they reduce the genetic diversity of effective coding segments; they reduce the population fitness level and evolution efficiency of the next generation before genetic operation; and they generate more strict implicit duplicated chromosomes and even duplicated chromosomes. Strict implicit duplicated chromosomes also have a certain stability: once there is a certain number of them in the population, they are difficult to destroy by genetic operators, and they expand and even produce duplicated individuals under the action of the recombination operator. In order to ensure the diversity of gene fragments and evolutionary efficiency, and to prevent the generation of duplicated chromosomes in the effective coding segment, strict implicit duplicated chromosomes need to be eliminated during the genetic process.
Definition 2. Non-Strict Implicit Duplicated Chromosome: Two or more gene expressions have different ORFs and different noncoding regions, and the expression trees obtained by traversing them directly are also different, but these expression trees can be simplified to the same tree structure.
Equation 6 shows two gene expressions that are non-strict implicit duplicated chromosomes: their ORFs are different (underlined parts) and their corresponding expression trees are different, as shown below, but both can be simplified to the tree structure shown above.
[Eq. (6): two gene expressions with different underlined ORFs, and their two expression trees; legible function nodes: exp, *, -, +]    (6)
The coding regions and noncoding regions of non-strict implicit duplicated chromosomes are different. Therefore, as parent chromosomes, they do not destroy the diversity of gene fragments in the population.
Non-strict implicit duplicated chromosomes have different direct phenotypes and the same indirect phenotype, and their number in the population is very small compared to that of duplicated chromosomes and strict implicit duplicated chromosomes. In the process of evolution, non-strict implicit duplicated chromosomes are easily destroyed by the genetic operators, their probability of producing duplicated chromosomes is very low, and they often turn into totally different chromosomes. Therefore, the indirect phenotype of non-strict implicit duplicated chromosomes is unstable. To sum up, the non-strict implicit duplicated
chromosomes can be selected as parent chromosomes of the next generation. Because they are so few, the cost is very small even if excellent non-strict implicit duplicated chromosomes are not selected as parent chromosomes.
Duplicated chromosomes and strict implicit duplicated chromosomes are highly stable during evolution. To prevent a large number of cloned individuals from damaging the ecology of the population, it is necessary to remove the duplicated chromosomes and strict implicit duplicated chromosomes when the selection operator acts on the population. If the removal occurred before the selection operator is applied, the number of copies of each duplicated or strict implicit duplicated chromosome in the candidate population would be reduced to 1, and the probability of that genotype being selected would decrease greatly. In actual population genetics, the repeated generation of an individual also reflects that the genotype has strong survivability and a convergence trend, so it is still necessary to maintain the probability of the genotype being selected. The removal should only prevent the malignant generation of implicit duplicated individuals in the next generation, without decreasing the probability of the genotype being selected in the current generation. Therefore, the removal should occur after the selection operator has been applied, and the objects being removed should be the parent chromosomes selected in the current generation rather than the population as it was before selection.
During the removal, the basis for judging duplicated chromosomes and strict implicit duplicated chromosomes is as follows:
(1) Duplicated chromosome: the gene expression is the same as the gene expression of another individual in the population.
(2) Strict implicit duplicated chromosome: the ORF segment is the same as the ORF segment of another individual in the population.
Since the ORF segment is part of the gene expression, duplicated chromosomes can also be removed according to the second rule.
That is, when individuals with the same ORF segment are removed, all duplicated chromosomes and strict implicit duplicated chromosomes are removed as well. To determine whether two ORF segments are the same, it is necessary first to scan the gene expression to determine the length of the ORF, then to scan the elements of the ORF segment from head to tail and compare them with the elements of the other existing ORF segments. If the elements are the same, the individual is a duplicated chromosome or a strict implicit duplicated chromosome.
This judgment by comparing ORFs is inefficient. This paper instead adopts the fitness value as the judgment rule to determine whether an individual is a duplicated chromosome or a strict implicit duplicated chromosome, for the following reasons:
(1) Fitness, as the criterion for evaluating the quality of an individual, has already been calculated when the chromosome was produced by the genetic operators, so no additional calculation is needed.
(2) The fitness value is generally stored as a double, which has high precision, so it can serve as the rule for judging whether an individual is a duplicated chromosome or a strict implicit duplicated chromosome.
(3) The number of non-strict implicit duplicated chromosomes in the population is extremely low. Deleting a parent chromosome that merely shares its fitness value with another individual, without being a duplicate of it, is therefore a low-probability event, and its cost is acceptable given the gain in algorithmic efficiency.
The steps of the DSC algorithm by comparing fitness values are as follows:

Algorithm 1. Deleting same chromosomes by comparing fitness values
Input: list of chromosome populations (chromosomeslist)
Output: list of chromosome populations (chromosomeslist)
    chromosomefitness = -1.0000;    // initial value, assumed to differ from every real fitness
    chromosomeslist.sort();         // sort by fitness so that equal values are adjacent
    for (t = 0; t < length(chromosomeslist); t++)
        if (chromosomeslist[t].fitness == chromosomefitness) then
            remove chromosomeslist[t] from chromosomeslist;
            t--;                    // stay at the same index after the removal
        else
            chromosomefitness = chromosomeslist[t].fitness;
        end if
    end for
    return chromosomeslist;

After duplicated individuals are deleted, the number of parent chromosomes selected for the next generation may be reduced and needs to be replenished to the number of individuals before deletion. We propose a novel method, CPCSC, to improve the GEP selection process. The improved selection process is shown in Fig. 1:
[Fig. 1: flowchart of the selection process, from the genetic operation of this generation to the genetic operation of the next generation; its steps are described below]
Fig.1 Selection flow of introducing CPCSC_DSC
The specific steps are as follows. First, create a copy of the population. Second, empty the population. Third, the selection operator chooses the parent chromosomes of the next generation from the population copy and adds them to the population. Fourth, delete the duplicated and hidden duplicated individuals from the population. Fifth, if the size of the population has decreased after the deletion, remove the duplicated individuals and the already selected individuals from the population copy, sort the remaining individuals, and add the best of them to the population; the number of individuals added is the same as the number of duplicated individuals deleted. Finally, generate chromosomes randomly to expand the parent population to the required size.
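The selection flow described above can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the authors' implementation: the `Chromosome` class, the function names, and the `select_fn` and `random_chromosome` callbacks are hypothetical stand-ins for the selection operator and the random generator, and equality of fitness values is tested exactly, as the fitness-based judgment rule implies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chromosome:
    genes: tuple
    fitness: float

def delete_same_chromosomes(chromosomes):
    """DSC: sort by fitness and keep one individual per distinct fitness value."""
    kept = []
    last_fitness = None
    for c in sorted(chromosomes, key=lambda c: c.fitness, reverse=True):
        if last_fitness is None or c.fitness != last_fitness:
            kept.append(c)
            last_fitness = c.fitness
    return kept

def cpcsc_dsc_select(population, select_amount, select_fn, random_chromosome, size):
    """CPCSC_DSC selection: select, delete duplicates, refill from the copy."""
    copy = list(population)                    # create a copy of the population
    parents = select_fn(copy, select_amount)   # select parents from the copy
    unique = delete_same_chromosomes(parents)  # delete (hidden) duplicates
    k = len(parents) - len(unique)             # number of deleted individuals
    if k > 0:
        taken = {c.fitness for c in parents}
        # Drop selected individuals and their duplicates from the copy,
        # deduplicate and sort what remains, then take the top k.
        rest = delete_same_chromosomes([c for c in copy if c.fitness not in taken])
        unique.extend(rest[:k])
    while len(unique) < size:                  # refill to the required size
        unique.append(random_chromosome())
    return unique

# Hypothetical population; the second individual duplicates the first.
population = [
    Chromosome(("a", "b"), 0.9),
    Chromosome(("a", "b"), 0.9),
    Chromosome(("b", "a"), 0.8),
    Chromosome(("a", "a"), 0.7),
    Chromosome(("b", "b"), 0.6),
]
take_first = lambda pop, m: pop[:m]          # stand-in for a real selection operator
filler = lambda: Chromosome(("new",), 0.0)   # stand-in for random generation

parents = cpcsc_dsc_select(population, 3, take_first, filler, size=5)
print([c.fitness for c in parents])  # [0.9, 0.8, 0.7, 0.0, 0.0]
```

The duplicate with fitness 0.9 is deleted after selection, the best unselected individual (fitness 0.7) takes its place, and two filler individuals bring the population back to five. Because removal happens after selection, the duplicated genotype still took part in the draw with both of its copies.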
4. Experiment and Analysis
A part of the Iris data set from the UCI repository is used to verify the validity of the improved GEP algorithm compared to the standard GEP. The data contain samples of three flowers: setosa, versicolor and virginica. Each sample contains 4 attributes: sepal length, sepal width, petal length and petal width. The sepal length of the setosa samples was selected as the dependent variable y, and the other attributes were used as independent variables, denoted a, b and c respectively. The setosa sample data are shown in Table 1:

Table 1. Iris-Setosa Sample Dataset.
Num   Sepal length y   Sepal width a   Petal length b   Petal width c
1     5.1              3.5             1.4              0.2
2     4.9              3.0             1.4              0.2
3     4.7              3.2             1.3              0.2
4     4.6              3.1             1.5              0.2
5     5.0              3.6             1.4              0.2
6     5.2              3.5             1.5              0.2
7     4.6              3.4             1.4              0.3
8     5.0              3.4             1.5              0.2
9     4.4              2.9             1.4              0.2
10    4.9              3.1             1.5              0.1

In order to demonstrate the effect of the CPCSC_DSC algorithm on evolution, the experiment was divided into two groups, the experimental group and the control group. Both groups use the data set in Table 1; both evaluate gene expressions with the reverse Polish expression stack decoding method and initialize constants with the rough multiple linear regression initialization and adaptive constant correction method. In addition, the experimental group adopts the CPCSC_DSC algorithm to delete the duplicated or hidden duplicated individuals during the genetic evolution process. The population size is set to 60, the head length of the gene expression to 15, the selection probability to 0.3, the recombination probability to 0.75, and the mutation probability to 0.1. After 2000 generations, the optimal fitness of the individual obtained with CPCSC_DSC is compared with that of the individual obtained without CPCSC_DSC; the individual with the higher fitness wins.
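As an illustration of how Eqs. (3) and (4) apply to Table 1, the sketch below scores one candidate expression. The candidate y = a + b and the constant K = 100 are assumptions chosen only for the example, and the function names are hypothetical; the raw sums it produces are not on the scale of the fitness values reported for the experiment.

```python
def fitness_absolute(observed, estimated, K):
    """Eq. (3): sum over samples of K minus the absolute error."""
    return sum(K - abs(v - v_hat) for v, v_hat in zip(observed, estimated))

def fitness_relative(observed, estimated, K):
    """Eq. (4): sum over samples of K minus the relative error."""
    return sum(K - abs((v_hat - v) / v) for v, v_hat in zip(observed, estimated))

# Table 1: sepal length y, sepal width a, petal length b.
y = [5.1, 4.9, 4.7, 4.6, 5.0, 5.2, 4.6, 5.0, 4.4, 4.9]
a = [3.5, 3.0, 3.2, 3.1, 3.6, 3.5, 3.4, 3.4, 2.9, 3.1]
b = [1.4, 1.4, 1.3, 1.5, 1.4, 1.5, 1.4, 1.5, 1.4, 1.5]

K = 100.0                                        # assumed constant
estimated = [ai + bi for ai, bi in zip(a, b)]    # hypothetical candidate: y = a + b

f_abs = fitness_absolute(y, estimated, K)
f_rel = fitness_relative(y, estimated, K)
print(round(f_abs, 4), round(f_rel, 4))
```

A candidate that reproduced every observed value exactly would reach the maximum n * K = 1000 under both equations, which matches the statement that fitness is maximized when there is no absolute or relative error.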
Fig. 2 shows the results:
[Fig. 2: fitness value (vertical axis, 0.70 to 0.85) against the number of experiments (horizontal axis, 0 to 100) for CPCSC_DSC and NOCPCSC_DSC]
Fig.2 Fitness value of CPCSC_DSC and non CPCSC_DSC
Fig. 2 shows that, across the 100 experiments, the optimal fitness of the chromosome obtained with the CPCSC_DSC algorithm was greater than that of the GEP without CPCSC_DSC in 88 experiments (both groups using the reverse Polish expression stack decoding method to
evaluate gene expressions and the rough multiple linear regression initialization and adaptive constant correction method to initialize constants) after 2000 generations. The result shows that the CPCSC_DSC algorithm can accelerate evolution. With one group using the CPCSC_DSC algorithm and the other not, and each group executed 100 times, the average fitness of the optimal chromosome after 2000 generations of evolution is 0.80267 with CPCSC_DSC, compared to 0.76258 without it, an increase of 5.25721%.

5. Conclusion
Deleting duplicated chromosomes during the selection operation of GEP plays an important role in the genetic process. This paper presents an improved selection algorithm based on CPCSC_DSC that deletes duplicated chromosomes during selection. CPCSC_DSC introduces the concepts of strict implicit and non-strict implicit duplicated chromosomes; it can maintain the number and variety of individuals, and can ensure population diversity and evolutionary efficiency while avoiding premature convergence. The experimental results show that CPCSC_DSC accelerates evolution and increases the fitness of the optimal chromosome compared to the result obtained without CPCSC_DSC.

References
[1] Information on https://www.gene-expression-programming.com.
[2] Ferreira Candida. Gene Expression Programming: A New Adaptive Algorithm for Solving Problems. Complex Systems, vol. 13(2001), p.87-129.
[3] Ferreira Candida. Genetic Representation and Genetic Neutrality in Gene Expression Programming. Advances in Complex Systems, vol. 4(2002), p.389-408.
[4] Ferreira Candida. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence (Springer-Verlag, Germany 2006).
[5] Jinjun Hu: The Researches of Key-techniques in Knowledge discovery System for TCM Pharmacology (Ph.D., Sichuan University, China 2006).
[6] Yue Jing, Changjie Tang and Mingxiu Zheng. Outbreeding Strategy with Dynamic Fitness in Gene Expression Programming. Journal of Sichuan University, vol. 139(2007), p.121-126.
(In Chinese)
[7] Taiyong Li, Changjie Tang and Lei Duan. Adaptive Population Diversity Tuning Algorithm for Gene Expression Programming. Journal of University of Electronic Science and Technology of China, vol. 39(2010), p.279-283. (In Chinese)
[8] Bing Shan, Shihong Ni and Xing Cha. GEP based on Population Diversity Measure by Variance of Individual's Fitness. Computer Engineering and Design, vol. 34(2013), p.3094-3098. (In Chinese)