Operon Expression and Regulation with Spiders

Paul J. Kennedy and Thomas R. Osborn
School of Computing Sciences
University of Technology, Sydney
PO Box 123 Broadway NSW 2007 Australia

Abstract

Gene expression and regulation may be viewed as a parallel parsing algorithm: translation from a genomic language to a phenotype. As a first step towards determining the applicability of gene expression to evolutionary computation, this paper describes a simplified model of gene expression and regulation based on the operon model of Jacob and Monod. An artificial cellular metabolism expresses operons encoded on a genome in a parallel genomic language. This is accomplished using an abstract entity called a "spider". Some hypotheses that may be tested with the model are discussed.

1 Introduction

Gene expression may be viewed as a parsing algorithm: translation from a 1-dimensional genomic language of bases to a 3-dimensional structural language of proteins. Expression of genes occurs throughout the lifetime of a cell and is an important component of the (ongoing) development of the phenotype (or cell body) from the genotype (cell DNA).

We wish to determine whether the process of gene expression is a useful notion to apply to evolutionary computation (EC). To answer this question we must build a model that is able to test hypotheses associated with gene expression and regulation in EC. It is important that the gene expression model sits at an appropriate level. It should be an abstraction of biology, so as to embody some of the flexibility and complexity of gene expression in real cells, and it should also bear a close relationship to EC, so that it can tell us something about gene expression in the context of EC.

The model we present in this paper, therefore, sits between biology and EC. It is a model of a single celled organism that inhabits a simple environment. Populations of cells are bred with a genetic algorithm (GA) [3] to adapt to the environment.
As in a real cell, the phenotype used in our model varies with time: the genome controls the artificial cellular metabolism at all times. Operons are encoded on a bit string genome.

In 1961, Jacob and Monod developed the operon model of gene expression and regulation [1] to explain expression in prokaryote cells. An operon is the smallest section of the genome able to be expressed or regulated. It is a sequence of genes preceded by a regulatory region (called the promoter region). In general, each gene codes for one protein or part thereof. Operons, then, group together genes that are regulated in the same way. Figure 1 shows a schematic representation of an operon.

Our cell model is similar in spirit to work carried out by Rosenberg [6]. His simulation was, by necessity, simpler than ours because of the limits of the computing machinery of the time. Other, more recent, cell models (e.g. [4]) also share similarities with our model of expression. See [5] for a survey of related cell models.

After describing our model, we present preliminary results from an experiment showing that cells using the gene expression algorithm may be evolved to adapt to a simple environment. Finally we discuss some of the hypotheses we plan to test with our model.

In Proceedings of the GECCO-2000 Workshop Program: Gene Expression. July Las Vegas

2 Overview of the Cell Model

When building a model of a cell with particular focus on gene expression and regulation, we feel that it is important to consider the genotype as a string in some well defined genomic language or some representation of a (pseudo biological) computer program on a (mostly read only) storage medium. In our model (and in biology) the genome is a highly parallel program.

The phenotype may be regarded as a kind of memory and computer. In our model, chemical species (and their relative concentrations) act as a memory, and the way that the chemical species interact (via reactions in the metabolism) represents the computational machinery. The algorithm that generates a phenotype can be viewed as a sort of parsing algorithm and is part of the metabolism or computational machinery. It takes a program (genotype) in a parallel genomic language, parses the string from this language and instantiates computational elements (enzymes) that manipulate the memory locations of the computer (i.e. the concentrations and kinds of chemical species).

Our genome encodes bases, which are tokens of a genomic language. When read in sequence, the bases encode sentences that specify a set of simulated protein molecules (enzymes) able to be generated in the cell. Enzymes permit only a small subset of all the possible chemical reactions to occur in an artificial metabolism. Indirectly, through the enzymes, the genome controls the metabolism and the cell. The metabolism, however, regulates the genome (by turning genes on or off) and actually produces the simulated enzyme molecules.

Five main processes are embodied in the simulated metabolism: modification of chemicals with enzyme catalysed reactions; production of proteins (by gene expression and regulation); protein degradation; growth of the cell; and diffusion of chemicals through the cell membrane. Our model requires this wide variety of metabolic processes to permit enough flexibility for its use as a testbed.

Chemical polymers are modelled as a string of digits according to the model proposed by Farmer, Kauffman, and Packard [2].
Each digit represents one of ten possible artificial monomers. Ten monomers were chosen to permit combination into chemical species of considerable complexity. The chemical "shape" (i.e. the list of digits) specifies how the chemical may interact with other chemicals.

Chemical reactions are an abstraction of polymer condensation and cleavage reactions. Each reaction models an equilibrium joining two polymers under catalysis from an enzyme. When catalysing a reaction, the closer the "shape" of the enzyme is to the "shape" of the reactants (i.e. the more specific the enzyme), the faster the reaction can occur. The chemical species and five metabolic processes are encoded into a large system of coupled nonlinear ordinary differential equations.

Figure 1: Schematic of operon structure. An operon consists of a promoter (regulatory) region followed by one or more genes, each coding for a protein. (Figure labels: operon; promoter; gene 1 ... gene i ... gene n; an example with bases: <start operon>, switch, data, <start enzyme>, <start carrier>, non-coding region, <end operon> or <start operon>.)

3 Model of the Genome

A genome comprising a string of bits encodes tokens from the genomic language. At the highest level, the genome consists of a list of operons separated by non-coding regions (see figure 1). Our language dissociates genes from particular loci.

Four-bit blocks (i.e. nibbles) are read from the genome bit string. The sixteen possible code words are divided into ten digits ($0000_2$ to $1001_2$) and six control codes ($1010_2$ to $1111_2$). The ten digits are used to specify monomers to build into proteins or as parts of keys for gene regulation, and the remaining six bases encode four special codes: <start operon> ($1010_2$, $1011_2$), <start enzyme> ($1100_2$, $1101_2$), <start carrier> ($1110_2$) and <end operon> ($1111_2$). The two kinds of proteins (enzymes and carriers) produced from the genome may be viewed as two distinct types of parameters that are found and optimised by the GA.
The experiment in this paper, however, does not use the <start carrier> code.

Promoter regions may contain a sequence of bases that specify a key used to switch the operon on or off. Operons may be constitutive (always active), repressible (active unless a matching key is available), or inducible (inactive unless an appropriate key is present). The closer the shape of a chemical is to the key specification of an operon, the more readily that chemical will be able to change the activation of the operon. Gene specifications follow the promoter region.

The context-free grammar for generating valid operon strings is given in table 1. Although this language is not regular, it is possible to parse a genome with a modified deterministic finite state automaton. The finite state automaton manipulates a few additional variables that are used to accumulate information across states [5]. It is important to stress that we are not implying that the parsing operation is accomplished with one molecule: the finite state automaton is an abstraction of a group of molecules working together.
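The nibble decoding and operon parsing described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the automaton of [5]: the names `decode_nibbles`, `parse_operons` and `Operon` are hypothetical, the reading frame is assumed to start at bit 0, a <start operon> that closes one operon is assumed to open the next, an operon still open at the end of the genome is discarded, and minimum gene lengths are not checked.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Control codes 1010..1111 as listed in the text; token names are hypothetical.
CONTROL = {0xA: "start_operon", 0xB: "start_operon",
           0xC: "start_enzyme", 0xD: "start_enzyme",
           0xE: "start_carrier", 0xF: "end_operon"}

Token = Union[int, str]


def decode_nibbles(bits: str) -> List[Token]:
    """Read four-bit blocks: values 0-9 become digits, 10-15 control tokens.

    ASSUMPTION: the frame starts at bit 0 and trailing bits are ignored.
    """
    tokens: List[Token] = []
    for i in range(0, len(bits) - len(bits) % 4, 4):
        value = int(bits[i:i + 4], 2)
        tokens.append(value if value <= 9 else CONTROL[value])
    return tokens


@dataclass
class Operon:
    switch: str                         # "constitutive", "repressible" or "inducible"
    key: List[int]                      # promoter key digits (may be empty)
    genes: List[Tuple[str, List[int]]]  # (kind, monomer digits) per gene


def parse_operons(tokens: List[Token]) -> List[Operon]:
    """Split a token stream into operons with a small state machine."""
    operons: List[Operon] = []
    current = None  # operon being built; None while in a non-coding region
    for tok in tokens:
        if tok == "start_operon":
            if current is not None:
                operons.append(current)  # ASSUMPTION: a new start also ends the open operon
            current = Operon("constitutive", [], [])
        elif current is None:
            continue                     # non-coding region: skip
        elif tok == "end_operon":
            operons.append(current)
            current = None
        elif tok in ("start_enzyme", "start_carrier"):
            current.genes.append((tok[len("start_"):], []))
        elif not current.genes:          # digit inside the promoter region
            if current.switch == "constitutive":
                # First promoter digit is the switch type: even = repressible, odd = inducible.
                current.switch = "repressible" if tok % 2 == 0 else "inducible"
            else:
                current.key.append(tok)
        else:                            # digit inside a gene
            current.genes[-1][1].append(tok)
    return operons                       # ASSUMPTION: an unterminated operon is dropped
```

An operon with no promoter digits stays constitutive, which mirrors the grammar rule that lets an operon body consist of a gene list alone.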

Table 1: Parallel genomic language used to encode genes in an operon. The genome contains strings in this language.

    <operon>       → <startoperon> <operonbody>
    <operonbody>   → <promoter> <genelist>
    <operonbody>   → <genelist>
    <promoter>     → <switchtype>
    <promoter>     → <switchtype> <baselist>
    <switchtype>   → <repressible> | <inducible>
    <repressible>  → 0 | 2 | 4 | 6 | 8
    <inducible>    → 1 | 3 | 5 | 7 | 9
    <genelist>     → <endofoperon>
    <genelist>     → <enzyme> <genelist>
    <genelist>     → <carrier> <genelist>
    <enzyme>       → <startenzyme> <base> <baselist>
    <carrier>      → <startcarrier> <baselist>
    <baselist>     → <base> <baselist>
    <baselist>     → <base>
    <base>         → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    <endofoperon>  → <endoperon> | <startoperon>
    <startoperon>  → a | b
    <startenzyme>  → c | d
    <startcarrier> → e
    <endoperon>    → f

4 Gene Expression and Regulation

Protein production in biological cells is a complex multistage process with stages occurring in different places in a cell. This allows the cell fine control over production of proteins [1]. However, in our model we are interested in the basic idea that information encoded in a genome guides the creation of proteins and that this production can be regulated. We have, therefore, combined the processes of transcription and translation into one operation that is embodied in an entity called a "spider". Spiders should be viewed as an abstraction of complex protein machinery that is able to (i) locate the start of operons; (ii) read and identify bases on the genome; and (iii) produce protein strings by traversing the genome. The analogy is of a spider moving along a surface (the genome) spinning a strand of its web (a protein chain).

4.1 A Discrete Spider Model

Spiders are modelled as a family of similarly shaped chemicals. Speed of initialisation of transcription for a spider is proportional to the similarity of the shape of the spider chemical to an arbitrary but fixed "ideal" spider.
The motivation behind this "blurriness" of spider shape and transcription ability is to encourage evolution to find spiders. A hard distinction between spider and non-spider makes discovery of spiders by evolution more difficult. The problem of spider discovery in such a scheme becomes akin to the "needle in the haystack" problem, which is difficult for evolution to solve.

Figure 2: Overview of the discrete model of gene expression. (Figure labels: spider pool; attach to genome at transcription initiation site; producing a protein; ending a gene; rejoin pool; promoter, gene 1, gene 2, gene 3.)

We imagine a population of spider molecules crawling along the genome reading tokens (figure 2). With each step, a spider adds a monomer to a growing protein chain that it creates. A spider may only attach to the genome at the promoter region and leave it at the end of the operon. As it moves from one gene to the next, the spider finishes the protein and releases it into the metabolism. When a spider falls off the end of an operon, it randomly joins another operon.

Furthermore, a spider may only attach to an operon if the operon is ready for transcription (switched on). Constitutive operons (see section 3) are always ready for expression. Spiders may attach to them at any time. Repressible operons are active unless a "blocker" molecule is bound to the promoter region, in which case they are turned off. Spiders may only attach to repressible operons if a blocker molecule isn't bound to the promoter region at that time. Inducible operons function in the other way. They are inactive unless an activator molecule is bound to the promoter. Molecules bind to the promoter for only short periods.

The more bases matching between the molecule and the promoter region, the stronger the bond, and the longer the molecule stays attached.

4.2 A Continuous Spider Model

The gene expression scheme of section 4.1, however, is unsuitable for the cell model because it is discrete, whilst the rest of our model is encoded continuously using differential equations. We therefore recast the discrete model of molecular interactions into a continuous reaction graph. Each step along the genome is rewritten as an irreversible reaction with a spider molecule ($S$) and monomer ($M$, i.e. matching the base being read) as reactants and a modified spider molecule ($S'$) and perhaps a protein ($P$) as products. The modified spider molecule becomes the reactant ($S$) for the next base along the genome, thus making a biological pathway.

$$S + M \longrightarrow S' + P \qquad (1)$$

The product spider molecule ($S'$) will be either a spider from the pool (if this reaction is for the addition of the last monomer of the last protein of an operon) or a spider clinging to the next position along the genome (when there are more genes left on the operon). The protein molecule ($P$) is only produced if the reaction represents the final base of a gene.

The first reaction in each pathway has an additional parameter that gives the probability that the spider can bind to the operon at the current time. This parameter regulates the operon and works similarly to a variable reaction rate. In this case, the rate "constant" for the reaction is $G_i K_T$, where $G_i$ is the activation of the $i$th operon (in the range $[0, 1]$) and $K_T$ is the transcription initiation rate constant for the spider. When this reaction is not the first of the operon, we choose a rate constant of 1 to expedite expression.

The activation of an operon ($G_i$) depends on the availability of chemicals in the metabolism that can switch the operon on or off.
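The rewriting of one operon into a chain of reactions of the form of equation (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Reaction`, `operon_pathway` and `POOL` are hypothetical, genes are given as digit strings, and it is assumed that only the bases inside genes consume a monomer (the promoter and control tokens are taken to be free steps).

```python
from dataclasses import dataclass
from typing import List, Optional

POOL = "S_pool"  # hypothetical label for the pool of free spiders


@dataclass
class Reaction:
    spider_in: str          # reactant spider state (S)
    monomer: int            # monomer consumed (M), matching the base read
    spider_out: str         # product spider state (S')
    protein: Optional[str]  # protein released (P), or None within a gene
    rate: float             # rate "constant" of this step


def operon_pathway(genes: List[str], activation: float, k_t: float) -> List[Reaction]:
    """Build one irreversible reaction per gene base of an operon.

    The first step uses rate G_i * K_T (activation * k_t); all later
    steps use rate 1, as described in the text.
    """
    steps = [(g, i) for g, gene in enumerate(genes) for i in range(len(gene))]
    reactions: List[Reaction] = []
    for n, (g, i) in enumerate(steps):
        gene = genes[g]
        first = n == 0
        last = n == len(steps) - 1
        end_of_gene = i == len(gene) - 1
        reactions.append(Reaction(
            spider_in=POOL if first else f"S_{n}",
            monomer=int(gene[i]),
            spider_out=POOL if last else f"S_{n + 1}",  # spider rejoins the pool at the end
            protein=gene if end_of_gene else None,      # P appears only on a gene's final base
            rate=activation * k_t if first else 1.0,
        ))
    return reactions
```

For an operon with genes "12" and "3", the sketch yields three reactions: the second releases the protein "12", and the third releases "3" and returns the spider to the pool.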
We model this activation by calculating the probability that a spider may attach to the start of an operon. The probability of a spider binding to a constitutive operon is 1. Inducible operons accept spiders only when a switch chemical is bound to the promoter region; repressible operons accept them only when none is bound. For inducible operons, the probability of accepting a spider is equal to the probability of an activator chemical being bound at that time to the promoter region ($\hat{p}_n$). In the case of repressible operons, the probability is one minus the probability of a blocker being bound to the promoter region ($1 - \hat{p}_n$).

Determination of the probability that a switching chemical is bound to the promoter region ($\hat{p}_n$) is more difficult because there may be many ($n$) such similarly shaped species able to bind to the promoter region. We calculate this probability by combining the probabilities of a particular chemical species binding to the promoter in isolation ($p_j$), which are simpler to quantify. The derivations are presented in [5]; the results are summarised below.

The probability that one molecule of a particular chemical species $j$ is bound to the promoter region (ignoring other species) is

$$p_j = 1 - e^{-k_j s_j} \qquad (2)$$

where $s_j$ is the concentration of species $j$ and

$$k_j = \frac{K}{(1 - q_j)\, e^{-n_j b}}. \qquad (3)$$

$q_j$ is the probability that switching chemical $j$ will not immediately bind to the promoter region. This is calculated by looking at all the possible ways chemical $j$ can bind to the promoter region and counting all the positions where there are no bonds (i.e. matches) between the chemical and the promoter region. $q_j$ is the ratio of the number of positions with no matches to the total number of positions. The exponential part of the denominator of equation (3) specifies the length of time chemical $j$ will bind to the promoter region.
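The three ingredients just described can be sketched in Python: the mismatch ratio for a switching chemical, the single-species binding probability of equation (2), and the mapping from a bound probability to an operon's activation. The function names are hypothetical. The alignment scheme in `mismatch_fraction` is an assumption (every offset at which the two digit strings overlap by at least one position); the paper defers the exact counting to [5]. The rate parameter of equation (2) is taken as an input rather than computed.

```python
import math


def mismatch_fraction(chemical: str, key: str) -> float:
    """Ratio of unmatched positions to all compared positions, over all alignments.

    ASSUMPTION: an alignment is any relative offset with at least one
    overlapping position; the exact scheme is defined in [5].
    """
    compared = 0
    mismatched = 0
    for offset in range(-(len(chemical) - 1), len(key)):
        for i, digit in enumerate(chemical):
            k = offset + i
            if 0 <= k < len(key):
                compared += 1
                if digit != key[k]:
                    mismatched += 1
    return mismatched / compared if compared else 1.0


def binding_probability(rate: float, concentration: float) -> float:
    """Equation (2): probability that one species is bound, ignoring competitors."""
    return 1.0 - math.exp(-rate * concentration)


def activation(kind: str, p_bound: float) -> float:
    """Probability that a spider may attach, given the bound probability."""
    if kind == "constitutive":
        return 1.0            # always ready for expression
    if kind == "inducible":
        return p_bound        # needs an activator on the promoter
    if kind == "repressible":
        return 1.0 - p_bound  # needs the promoter free of blockers
    raise ValueError(f"unknown operon kind: {kind}")
```

Under this alignment assumption, `mismatch_fraction("12", "12")` gives 0.5: the two shifted alignments each compare one position without a match, and the exact alignment compares two positions that both match.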
The binding time follows the Boltzmann distribution and depends on the average number of bonds between the chemical and the promoter region ($n_j$) and the relative strength of each bond ($b$, typically 0.25). $K$ is a constant used to calibrate the concentration of a switching chemical with the probability that the chemical will be bound to the promoter region. The actual value used ($1.0 \times 10^{3}$) is arbitrary, but its general relationship with the other parameters is important.

Given the probability of each chemical species binding to the promoter region in isolation, we may calculate the probability that a molecule of one of $n$ competing chemical species is bound to the same promoter region as

$$\hat{p}_n = \frac{p_1 + (1 - p_1)\, I(n)}{1 + (1 - p_1)\, I(n)} \qquad (4)$$

$$I(n) = \sum_{i=2}^{n} \frac{p_i}{1 - p_i} \qquad (5)$$

where $n$ is the number of competing switch chemical species and $p_i$ is the probability that a molecule of species $i$ will bind to the promoter region (assuming no other competition). There is a variable $\hat{p}_n$ for each inducible or repressible operon on the genome, and derivatives are added to the system of differential equations.

5 A Preliminary Experiment

An experiment was developed to provide a sanity check on the model. We ask the question: is it possible to evolve cells that are able to exist in a simple environment?

We use a GA to evolve a population of cells. Each cell is simulated in isolation inside its own environment. The environment causes the cell to grow by bathing it in chemicals that make the cell membrane grow.

The simulation involves building the cell from its initial chemical concentrations and genome. Initial chemical concentrations of a cell are derived from the final concentration values of the mother's chemicals, the mother being a random choice between the cell's two parents. Next, the genome is parsed into a list of operons. A graph of chemical reactions in the cell is then determined by matching the operon list with the available chemical species. The reaction graph is encoded into a system of non-linear differential equations. Additional terms are added to equations for diffusion of chemicals through the cellular membrane. A typical cell simulation might contain around 185 enzyme-catalysed reactions, 700 differential equations each containing 10 to 25 terms, around 200 chemical species and 10 enzymes. This may seem large, but compared to an actual cell, our simulations are mere caricatures.

The system of differential equations is numerically integrated to determine how the phenotype (or cell) changes over a time period.
Integration occurs from time ($t$) 0, until $t$ has passed a given value (1: ), or more than 2500 time steps are made, or one of the chemical concentrations in the phenotype has moved above the (arbitrary) value $1.0 \times 10^{-4}$. This latter event denotes cell death. When chemicals appear with concentration greater than that of one molecule, new reactions become possible and the system of differential equations is updated similarly to [2].

A steady-state GA with a population of 100 individuals evolves the cell models. Breeding occurs only when the population contains at least 75 individuals that have finished simulation. Fitness proportionate selection with the roulette wheel algorithm finds breeding pairs. Mutation (with probability of bit mutation 0.005), crossover (approximately four points per genome) and inversion (with probability 0.15 per genome) are applied as defined in [5].

The objective function is a combination of six metrics in the range $[0, 1]$ (also defined in [5]). Fitness components were derived with the intention of forming a canonical set. Fitness values are bounded above at 63, but as the metrics conflict with one another, the maximum fitness score cannot be realised. Qualitatively the metrics are:

(i) the absolute change in the cell's volume over the course of its lifetime relative to a target value;
(ii) the length of time the cell lived;
(iii) how closely correlated the genome switching regions are to the chemicals controlling cell growth;
(iv) the complexity of the metabolic reaction graph;
(v) the number of chemicals affecting cell growth;
(vi) the number of differential equations in the cell.

Additionally, the fitness of cells that died during the simulation is halved. Cells that died may have a fitness value greater than zero so as to help evolution at the beginning of the run. A simpler fitness function that contains only the first metric (i.e. keeping the volume constant) would seem more appropriate; however, this fitness function was not able to evolve cells that could live in the environment.
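The genetic operators named above can be sketched with textbook versions. These are assumptions for illustration only: the paper applies the operators as defined in [5], whose details may differ, and all function names here are hypothetical. The crossover takes a fixed number of cut points, whereas the text says approximately four per genome. A seeded `random.Random` is passed in so that runs are repeatable.

```python
import random
from typing import List, Sequence


def roulette_select(fitness: Sequence[float], rng: random.Random) -> int:
    """Fitness proportionate selection: index i is chosen with probability f_i / sum(f)."""
    total = sum(f for f in fitness if f > 0)
    if total <= 0:
        return rng.randrange(len(fitness))  # no usable fitness: pick uniformly
    pick = rng.uniform(0.0, total)
    running = 0.0
    chosen = 0
    for i, f in enumerate(fitness):
        if f <= 0:
            continue
        running += f
        chosen = i
        if pick <= running:
            break
    return chosen


def mutate(bits: List[int], rate: float, rng: random.Random) -> List[int]:
    """Flip each bit independently with probability `rate` (0.005 in the text)."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]


def invert(bits: List[int], prob: float, rng: random.Random) -> List[int]:
    """With probability `prob` (0.15 in the text), reverse one random segment."""
    if len(bits) < 2 or rng.random() >= prob:
        return list(bits)
    i, j = sorted(rng.sample(range(len(bits) + 1), 2))
    return bits[:i] + bits[i:j][::-1] + bits[j:]


def crossover(a: List[int], b: List[int], points: int, rng: random.Random) -> List[int]:
    """Multi-point crossover of two equal-length genomes; needs points < len(a)."""
    cuts = sorted(rng.sample(range(1, len(a)), points))
    parents = (a, b)
    child: List[int] = []
    src, prev = 0, 0
    for cut in cuts + [len(a)]:
        child.extend(parents[src][prev:cut])  # copy a segment, then switch parent
        src ^= 1
        prev = cut
    return child


def adjusted_fitness(raw: float, died: bool) -> float:
    """Halve the fitness of a cell that died during simulation."""
    return raw / 2.0 if died else raw
```

A breeding step under these assumptions would pick two parents with `roulette_select`, combine them with `crossover(..., 4, rng)`, and then apply `mutate(child, 0.005, rng)` and `invert(child, 0.15, rng)`.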
5.1 Results

An experiment was run evolving cells in the simple environment [5]. The first population of cells started with random genomes and simple initial chemical ensembles. The initial chemical ensemble included a basic set of "spider" chemicals and simple chemicals that a cell could use as building blocks, but no enzymes. Figure 3 shows a graph of the fitness attained. The X-axis represents each cell in the order they were run. Values near the maximum fitness were reached early (soon after offspring of the initial population are produced).

Figure 3: Graph showing the average and maximum fitness (Y-axis) attained for each cell simulation run (X-axis). (Figure labels: title "Fitness in Environment"; series "Average Fitness" and "Maximum Fitness"; X-axis "Cell ID or time"; Y-axis "Fitness".)

We observed a large diversity of genes in the final population of cells. This reflects the fact that there does not appear to be one "right" solution for the experiment with maximum fitness but a variety of valid approaches, all having similar (high) fitness. Genomes with different approaches to controlling a metabolism may be maintained as long as valid contexts for their use (i.e. initial metabolic conditions) also exist. This suggests that future experiments may benefit from speciation.

6 Conclusion

Use of the gene expression model in the experiment outlined above encourages us to formulate hypotheses associated with gene expression and regulation in EC.

Evolution of regulation requires a dynamic environment. We observed, in the experiment, that operons weren't dynamically regulated: operons were either always active or inactive throughout simulations. This seems to be a result of the simple static environment used: when changing stimuli are not presented to a cell throughout its lifetime, there is no selective pressure to adapt to them. An experiment to test this hypothesis requires a more dynamic environment.

Time lags due to the expression algorithm may increase the "expressive power". Gene switching is not instantaneous: there is a time lag between a switch chemical binding to the promoter and a change in the amount of protein produced. This is because regulation affects only spiders joining the operon from the pool. The size of this time lag for each gene is proportional to its distance from the promoter region. Evolution may conceivably take advantage of this by (i) placing genes that need fast switching earlier in the operon; (ii) using shorter genes when they need to be switched quickly; or (iii) using two operons with similar promoter regions for faster gene expression and regulation (parallel processing). We wish to investigate whether evolution takes advantage of these time lags.
Sensitivity of the genomic language to attack from genetic operators. The four special control codes used in the parallel genomic language cause semantic linkages within operons. The meaning of a base in an operon is not an intrinsic property, but is dependent on the preceding "special" tokens. This linkage implies that genetic operators may have a significant effect on a genome. For example, one mutation could obliterate an operon. Experiments are required to test the sensitivity of the genome and genomic language in this respect. Factors such as the degree of redundancy in the genomic language and the density of operons on the genome appear to be significant, but further investigation is required.

Although it is not clear whether gene expression enhances EC or what difficulty would be involved in incorporating it into existing applications, we feel encouraged that this model may be used to solve some problems and that it is flexible enough to test the above hypotheses. The next stage of investigation involves clarifying and testing the hypotheses and defining a more general discrete phenotype (incorporating the discrete spider expression model) for use in less biological problems.

References

[1] Alberts, Bray, et al. Molecular Biology of the Cell. Garland Publishing, New York, third edition.
[2] J. D. Farmer, S. A. Kauffman, and N. H. Packard. Autocatalytic replication of polymers. Physica 22D, pages 50-67.
[3] J. H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, Massachusetts, first MIT Press edition.
[4] N. Jacobi. Harnessing morphogenesis. Cognitive Science Research Paper 423, School of Cognitive and Computing Sciences, University of Sussex.
[5] P. J. Kennedy. Simulation of the Evolution of Single Celled Organisms with Genome, Metabolism, and Time-Varying Phenotype. PhD thesis, University of Technology, Sydney.
[6] R. S. Rosenberg. Simulation of Genetic Populations with Biochemical Properties. PhD thesis, University of Michigan, 1967.

