Replication and Mutation on Neutral Networks: Updated Version 2000


Christian Reidys, Christian V. Forst, Peter Schuster

SFI WORKING PAPER:

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.

NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.

SANTA FE INSTITUTE

Replication and Mutation on Neutral Networks

By Christian Reidys (b,c), Christian V. Forst (b), and Peter Schuster (a,c,†)

(a) Institut für Theoretische Chemie und Molekulare Strukturbiologie der Universität Wien, A-1090 Wien, Austria
(b) Los Alamos National Laboratory, Los Alamos, NM, U.S.A.
(c) Santa Fe Institute, Santa Fe, NM, U.S.A.

Bull. Math. Biol., in press

(†) Address correspondence to: Professor Peter Schuster, Institut für Theoretische Chemie und Molekulare Strukturbiologie der Universität Wien, Währingerstraße 17, A-1090 Wien, Austria. Phone: , Fax: tbi.univie.ac.at

Universität Wien: TBI Preprint No. pks

Reidys, Forst, Schuster: Replication on Neutral Networks

Keywords. Error threshold, neutral evolution, neutral network, random graph, RNA secondary structure.

Abstract. Folding of RNA sequences into secondary structures is viewed as a map that assigns a uniquely defined base pairing pattern to every sequence. The mapping is non-invertible since many sequences fold into the same minimum free energy (secondary) structure or shape. The preimages of this map, called neutral networks, are uniquely associated with the shapes and vice versa. Random graph theory is used to construct networks in sequence space which are suitable models for neutral networks. The theory of molecular quasispecies has been applied to replication and mutation on single-peak fitness landscapes. This concept is extended by considering evolution on degenerate multi-peak landscapes which originate from neutral networks by assuming that one particular shape is fitter than all others. On such a single-shape landscape the superior fitness value is assigned to all sequences belonging to the master shape. All other shapes are lumped together and their fitness values are averaged in a way that is reminiscent of mean field theory. Replication and mutation on neutral networks are modeled by phenomenological rate equations as well as by a stochastic birth-and-death model. In analogy to the error threshold in sequence space the phenotypic error threshold separates two scenarios: (i) a stationary (fittest) master shape surrounded by closely related shapes and (ii) populations drifting through shape space by a diffusion-like process. The error classes of the quasispecies model are replaced by distance classes between the master shape and the other structures. Analytical results are derived for single-shape landscapes; in particular, simple expressions are obtained for the mean fraction of master shapes in a population and for phenotypic error thresholds.

The analytical results are complemented by data obtained from computer simulation of the underlying stochastic processes. The predictions of the phenomenological approach on the single-shape landscape are very well reproduced by replication and mutation kinetics of tRNA$^{\mathrm{Phe}}$. Simulation of the stochastic process at a resolution of individual distance classes yields data which are in excellent agreement with the results derived from the birth-and-death model.

1. Neutral evolution and molecular evolutionary biology

The notion of neutral evolution was coined by Motoo Kimura [37, 38] in order to account for the well established fact that the majority of mutations recorded at the level of DNA (or RNA or protein) sequences in nature appear to be selectively neutral [39]. Most point mutations occur at frequencies that seem to be independent of the morphological changes observed with species and hence can be used to reconstruct phylogenies from molecular data [69]. This approximate independence represents the basis of a molecular clock measuring time in evolution (for systematic deviations and finer details see the "nearly-neutral theory" of Tomoko Ohta [43] and a recent paper [1]). In essence, the neutral theory states that a substantial fraction of point mutations has no measurable effect on fitness and assumes that the primary role of selection is elimination of deleterious variants. Thus, neutral mutations are often considered as an unavoidable byproduct of the molecular mechanism of evolution, eventually caused by the relations between biopolymer sequences, structures and functions. Emile Zuckerkandl, more recently, discussed a creative role of neutral mutations in the evolution of complexity [68]. Recent studies based on computer simulation of RNA structure optimization [14, 15, 33, 52] revealed a constructive role of random genetic drift: it substantially improves the search capacity of populations in sequence space. Here, we shall be concerned with an analysis of this phenomenon by means of mathematical models based on chemical reaction kinetics and stochastic processes.

The simplest successful experimental approach to study evolution was implemented by Sol Spiegelman [57] in his test-tube assay of RNA replication and mutation. His work was complemented by the seminal paper of Manfred Eigen [6] who presented a theory of molecular evolution based on chemical kinetics. Within this frame the concept of molecular quasispecies was developed [8, 9]. Evolutionary optimization and adaptation to the environment are not bound to the existence of cellular life. They also occur in vitro provided the following minimal requirements are fulfilled: a population of objects which are capable of replication, a replication mechanism that allows for heritable variation through mutation (and recombination), and a sufficiently rich reservoir of accessible genotypes and phenotypes. The third prerequisite needs explanation: sufficient capacity for variation is guaranteed by the combinatorial construction principle of biopolymers, which leads to numbers that grow exponentially with genotype length $n$ (there are, for example, $4^n$ different RNA sequences of chain length $n$). In principle, the same need not be true for phenotypes, but it seems to hold universally in nature because all phenotypes are built through combination from modules and the number of modules becomes larger with increasing lengths of genotypes. We can expect an exponential increase in the number of structures and state that both the set of sequences and the set of structures are of combinatorial complexity.

Basic to the Darwinian scenario is the fundamental dualism of evolution: all heritable variation is introduced into genotypes through mutations (and recombination), and variation is uncorrelated to selection, which operates on the phenotypes. The quasispecies describes the distribution of genotypes in a stationary population of individuals which replicate independently with a given mutation rate [8, 9]. For point mutations and position-independent mutation frequencies the error rate can be expressed by a single parameter ($p$). At low values of $p$ the quasispecies settles around a master sequence of highest fitness. At some critical error rate ($p_{cr}$) the quasispecies becomes unstable and then populations drift randomly through sequence space (see also section 3). Quasispecies have been studied on various model landscapes (for a review see [8]), the simplest being the single-peak landscapes with one privileged genotype having a higher fitness value than all others [42, 59] (this landscape is the result of a kind of mean field approximation to the real situation; it is obtained by averaging over the fitness values of all genotypes except the best one). Realistic landscapes derived from RNA folding have been studied as well [12, 13, 14, 15, 33] and, in general, the error threshold phenomenon is retained. There are also examples of ad hoc assumptions on fitness landscapes, often used in population genetics, where the transition from ordered to random replication is smooth [65].

In vitro evolution deals with replicating RNA molecules and represents the simplest conceivable system which does indeed fulfill all three requirements listed above. As correctly recognized already by Sol Spiegelman [57], RNA molecules are particularly well suited for detailed investigations since the genotype-phenotype dichotomy is encoded in a single molecule: the genotype is the sequence of nucleotides and the phenotype is the three-dimensional molecular structure which determines all fitness-relevant properties. In RNA test-tube evolution the relation between genotypes and phenotypes boils down to mapping RNA sequences onto molecular structures. Despite enormous progress in understanding three-dimensional RNA structures (for an excellent review see [3]), their correct prediction from known sequences is still something of an art, far from being routine, and in most cases not yet possible [56]. In addition, systematic studies on sequence-3D structure relations would exceed the capabilities of present-day computing facilities. Secondary structures, being physically relevant, coarse-grained approximations to full structures, are readily computed by means of efficient algorithms [31, 70]. This does not necessarily mean that secondary structures are predicted in full agreement with experimental data, but most evolutionarily relevant statistical properties of ensembles of RNA molecules are reflected correctly by means of the current techniques [60].

Secondary structures, in essence, are listings of base pairs formed through superposition of the RNA sequence upon a (two-dimensional) graph which is free of pseudoknots (see section 2.1). This notion of structure allows straightforward application of combinatorics to solve counting problems. In particular, the numbers of possible RNA secondary structures were derived by means of recursion formulas [62]. Constraints resulting from steric and energetic criteria reject structures with too small hairpin loops and too short stacks. Their numbers can also be computed through recursion [32]. Based on this approach the numbers of physically acceptable secondary structures¹ formed by the sequences of constant chain length $n$ were shown to fulfill the asymptotic expression [32, 53]:

$$S_n \approx 1.4848 \; n^{-3/2} \, (1.84892)^n .$$

Deviations from the exact results of the recursion are smaller than 1% for chain lengths $n \geq 1000$ [51]. The number of acceptable secondary structures, although exponentially increasing, is always much smaller than the numbers of natural (AUGC) or even of binary (GC or AU) RNA sequences, which are given by $4^n$ or $2^n$, respectively. The mapping from RNA sequences into (secondary) structures, which is central to the forthcoming analysis, is thus redundant or many-to-one.

¹ A structure is acceptable when its hairpin loops are at least three unpaired nucleotides long and when it does not contain single base pairs. Stabilizing contributions result from base pair stacking and require at least two neighboring base pairs. Structures with single base pairs may occur when global geometries or special interactions are favorable (see also [51, 55]).

Sequence-secondary structure relations are modeled as mappings from sequence space onto shape space. A structure $s_k$ is uniquely assigned to the sequence $x_j$: $s_k = f(x_j)$. At constant chain length $n$ this map is expressed by

$$f : \{\mathcal{Q}^n_\alpha ;\, d^h_{ij}\} \longrightarrow \{\mathcal{S}_n ;\, d^s_{ij}\} . \tag{1}$$

The Hamming distance of end-to-end aligned sequences, $d^h_{ij}$, is the metric in sequence space $\mathcal{Q}^n_\alpha$ and $d^s_{ij}$ is a properly defined structure distance serving as metric in shape space $\mathcal{S}_n$. Since both sequence and shape space are of combinatorial complexity, the mapping (1) has been characterized as a combinatory map. The pre-image of structure $s_k$ in sequence space is the set of all sequences forming structure $s_k$,

$$G[s_k] = f^{-1}(s_k) := \{ x_j \mid f(x_j) = s_k \} . \tag{2}$$

The neutral network $\mathcal{G}[s_k]$ is a graph on this set with edges connecting all pairs of sequences with Hamming distance one (figure 1). For many purposes sequence-structure maps can be approximated well by random graphs [45] which encapsulate generic features: all sequences folding into the same structure form a neutral network which is represented by a (random) graph. The random graph approach is based on a single parameter, $\lambda$, which expresses the fraction of neutral neighbors averaged over all members of the network. It provides a theoretical concept that allows for tuning neutrality from absence to full coverage of sequence space ($0 \leq \lambda \leq 1$). Neutral networks are the basic objects in RNA genotype-phenotype relations since they are mapped one-to-one onto structures and thus are elements of an invertible map from sets of sequences onto structures and vice versa.

Figure 1: A sketch of connected and disconnected neutral networks. The sequence space is drawn schematically as a two-dimensional grid with grid points (open circles) representing individual sequences (we remark that the real sequence space, $\mathcal{Q}^n_\alpha$, is an $n$-dimensional object). Sequences forming the same structure $s_k$ (shown on the left hand side of the diagram) are symbolized by full circles. They form the set $G[s_k]$, which is the pre-image of $s_k$ in sequence space. Connecting all pairs of sequences with Hamming distance one yields a graph called the neutral network $\mathcal{G}[s_k]$. Neutral networks may be connected (A) or disconnected (B). Generic disconnected networks consist of a largest "giant" component and many small islands [28, 29, 45].

An evaluation of structures resulting in a scalar quantity (fitness being an example) induces a landscape upon sequence space [49, 50, 58]. Fitness landscapes derived from folding RNA sequences into secondary structures represent the basis of the RNA model of molecular evolution [48, 52], which allows phenotypes to be treated explicitly and which is accessible to computer simulation and mathematical modeling. Optimization processes on neutral networks of RNA structures in sequence space combine the essentials of the quasispecies concept and neutral evolution. Computer simulations of evolutionary optimization and neutral evolution on a landscape derived from tRNA$^{\mathrm{Phe}}$ were reported recently [14, 15, 33]. They have shown that evolution occurs on two time scales: fast periods of adaptive walks on fitness landscapes interrupted by long epochs of random drift on neutral networks. Experimental data confirmed different predictions of the model [4, 5, 10, 44, 47]. RNA replication in the test-tube also provided the basis for applied molecular evolution, an already established new area of lively interest in biotechnology (see, for example, the special issue [64], the corresponding chapters in [21], and the recent review [66]). In addition, the quasispecies model has also been applied successfully to cases where phenotypes are still too complex to be modeled by computer simulation, for example to the evolution of viruses [5, 7]. Populations of RNA viruses share high genetic diversity with those of RNA molecules replicating in test-tubes. Although virus populations live in rapidly varying environments and presumably never reach stationarity, the quasispecies concept has nevertheless provided valuable insights into the dynamics of virus evolution.

This paper is organized in three main sections, 2 to 4.² Section 2 introduces the concept of neutral networks as exemplified by RNA secondary structures and presents a mathematical model based on random graph theory. It presents several definitions and reviews results required later on. Section 2.1 deals with RNA secondary structures and section 2.2 presents the random graph approach. In section 3 we introduce, define and analyze the concept of a phenotypic error threshold for replication of RNA molecules. Analytical expressions are derived from phenomenological kinetic equations of evolution on the single-shape landscape in section 3.1. A stochastic description of evolution on neutral networks is conceived as a replication-deletion process (section 3.2). Simplifications in full analogy with those made in the phenomenological approach lead to a birth-and-death process which can be analyzed by conventional tools (section 3.3). In sections 4.1 and 4.2 we test the results of the phenomenological approach and the birth-and-death model, respectively, by computer simulation on model landscapes and on a realistic landscape derived from tRNA$^{\mathrm{Phe}}$. The concluding section 5 discusses experimental evidence for and applications of the concept of neutral networks.

² As a hint for the reader we mention that the phenomenological approach in sections 3.1 and 4.1 is independent of the random graph model and hence does not require knowledge of section 2.2 and the definitions and results presented therein.
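The redundancy of the sequence-structure map claimed in section 1 can be made concrete by evaluating the asymptotic expression for $S_n$ and comparing it with the number of sequences. The following is a minimal sketch; the function names are illustrative and not taken from the paper, and the ratio of sequences to structures is only a crude average that says nothing about how unevenly sequences are distributed over structures.

```python
def acceptable_structures(n: int) -> float:
    """Asymptotic count S_n ~ 1.4848 * n**(-3/2) * 1.84892**n quoted in section 1."""
    return 1.4848 * n ** -1.5 * 1.84892 ** n


def sequences_per_structure(n: int, alphabet_size: int = 4) -> float:
    """Average number of sequences per acceptable structure, alphabet_size**n / S_n.

    Intended for moderate chain lengths; very large n overflows float arithmetic.
    """
    return float(alphabet_size) ** n / acceptable_structures(n)


if __name__ == "__main__":
    for n in (30, 76, 100):
        print(
            f"n={n:4d}  S_n={acceptable_structures(n):.3e}  "
            f"AUGC sequences per structure={sequences_per_structure(n, 4):.3e}  "
            f"binary sequences per structure={sequences_per_structure(n, 2):.3e}"
        )
```

Even for binary alphabets the average pre-image grows exponentially with chain length, which is the many-to-one property the analysis relies on.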

2. Neutral networks of RNA secondary structures

RNA secondary structures are listings of Watson-Crick and GU base pairs that can be represented by planar graphs. They contain neither knots nor pseudoknots, which is made precise by the following definition [62]: A secondary structure $s$ is a vertex-labeled graph on $n$ vertices with an adjacency matrix $A := (a_{i,k})_{1 \leq i,k \leq n}$ fulfilling three conditions: (1) $a_{i,i+1} = 1$ for $1 \leq i \leq n-1$, (2) for each $i$ there is at most one $k \neq i-1, i+1$ with $a_{i,k} = 1$, and (3) if $a_{i,j} = a_{k,\ell} = 1$ and $i < k < j$ then $i < \ell < j$. Vertices of the graph represent the individual nucleotides in the order defined by the RNA sequence $x$, which is a string of length $n$ over a nucleotide alphabet $\mathcal{A}$: $x = \{x_1, \ldots, x_n ;\; x_i \in \mathcal{A}\}$. Edges $(i,k)$ with $k \in \{i-1, i+1\}$ constitute the ribose-phosphate backbone of the molecule. An edge $(i,k)$ with $k \neq i-1, i, i+1$ is called a base pair of the secondary structure: $[i,k] \in s$. A vertex $i$ that is connected only to its neighbors in the backbone, $i-1$ and $i+1$, is called unpaired. The numbers of base pairs and unpaired nucleotides in the structure $s$ are denoted by $n_p(s)$ and $n_u(s)$, respectively. Accordingly, the chain length of the molecule is $n = n_u(s) + 2 n_p(s)$. Condition (3) guarantees that the secondary structure can be represented by a planar graph without knots or pseudoknots.

Modeling evolution of RNA molecules requires an extension of the conventionally considered "one sequence, one structure" relation to a mapping of whole sequence space onto shape space. The space of RNA sequences with constant chain length $n$ is a generalized hypercube of dimension $n$. For binary alphabets ($\alpha = 2$: $\mathcal{A} = \{G, C\}$ or $\mathcal{A} = \{A, U\}$) the sequence space $\mathcal{Q}^n_\alpha$ is the conventional hypercube. Four-letter alphabets ($\alpha = 4$: $\mathcal{A} = \{A, U, G, C\}$) require a generalization (see, for example, [49]). The mapping from sequence space onto shape space is many-to-one and hence a model of the map has to deal with the pre-images of individual structures in sequence space; these are the neutral networks. A kind of zeroth-order approach models neutral networks as random graphs. Random graph theory applied to hypercubes [46] provides useful tools for an analytical approach to these mappings of RNA molecules [45]. Here we shall briefly repeat a few results which are important for the forthcoming analysis.

2.1 Hypercubes, pairing schemes, and compatible sequences

Investigations of sequence-structure relations are simplified substantially when the chain length $n$ is a (constant) parameter. Then, the space of all sequences over an alphabet of size $\alpha$ is the generalized hypercube $\mathcal{Q}^n_\alpha$. The sequences are the elements of the vertex set, $\mathrm{v}[\mathcal{Q}^n_\alpha]$, of the hypercube. Admissible base pairs are defined through pairing rules subsumed in $\mathcal{B}$. A pairing rule on an alphabet $\mathcal{A}$ is a symmetric relation: for example, if AU is a base pair then UA is a base pair too (in natural RNA we have, for example, $\mathcal{A} = \{A, U, G, C\}$ with $\alpha = 4$ and $\mathcal{B} = \{AU, UA, UG, GU, GC, CG\}$ with $\beta = 6$). It is worth pointing out that the relation between sequence and structure is introduced only via the pairing rules. The secondary structure is characterized by a pairing scheme or set of (non-trivial) contacts:³

$$\Pi(s) := \{ [i,k] \mid a_{i,k} = 1 ;\; k \neq i-1, i+1 \} .$$

A sequence or a vertex of the hypercube, $x \in \mathrm{v}[\mathcal{Q}^n_\alpha]$, is compatible with $s$ if (and only if) the condition $[x_i, x_j] \in \mathcal{B}$ is fulfilled for all $[i,j] \in \Pi(s)$. In other words, the nucleotides in the positions $i$ and $j$ of a compatible sequence are capable of forming a base pair included in $\mathcal{B}$ when the pair $[i,j] \in \Pi(s)$. The set of all compatible sequences is denoted by $C[s]$. The cardinality of this set is the number of sequences which are compatible with structure $s$: $|C[s]| = \alpha^{n_u(s)} \, \beta^{n_p(s)}$. The inversion of the (structure) compatibility relation searches for the set of all structures that are compatible with a given sequence $x$: $S[x] = \{ s^{(0)}, s^{(1)}, \ldots, o, \ldots \}$, which contains the minimum free energy structure denoted by $s^{(0)}$ (or $s$) and all suboptimal structures [67] including the open chain, "$o$".

³ Trivial contacts are the bonds along the polynucleotide backbone, $[i-1, i]$ and $[i, i+1]$.

Considering the combinatory map, $f : \mathcal{Q}^n_\alpha \to \mathcal{S}_n$, we know a priori that the vertex set of the pre-image $f^{-1}(s)$, which consists of all sequences folding into the secondary structure $s$, is contained in the set of compatible sequences. For example, all neutral neighbors of a sequence $x$ are located in the set $C[f(x)]$. The induced subgraph $\mathcal{Q}^n_\alpha \big[ C[f(x)] \big]$, however, is not connected: it decomposes into hyperplanes defined by the particular choice of pairing rules.⁴ Therefore we introduce the graph $\mathcal{C}[s]$: Let $s$ be a secondary structure, then the graph of compatible sequences is

$$\mathcal{C}[s] := \mathcal{Q}^{n_u(s)}_\alpha \times \mathcal{Q}^{n_p(s)}_\beta .$$

Obviously $\mathcal{C}[s]$ has the vertex set $C[s]$ and, by definition of the product of graphs, two sequences $x, y \in C[s]$ are neighbors if they differ either in a single position $i$ which is unpaired in $s$, or in two positions $i$ and $j$ which form a base pair $[i,j] \in s$. Two graphs $\mathcal{C}[s]$, $\mathcal{C}[s']$ are isomorphic (see p. 346 in [63]) if (and only if) they have the same numbers of unpaired nucleotides and base pairs. Thus, two different secondary structures $s, s' \in \mathcal{S}_n$ may lead to the same graph of compatible sequences.

⁴ Accessibility through single point mutations leads to a partitioning of base pairs into two disconnected groups [30]: [AU] ↔ [GU] ↔ [GC] and [CG] ↔ [UG] ↔ [UA]. We note that the two triples of base pairs are characterized by constant positions of pyrimidines and purines, [RY] and [YR], respectively. Base pair exchanges may require one or two point mutations, [AU] ↔ [GU] and [AU] ↔ [UA], respectively. Elementary moves in the factorized space, $\mathcal{Q}^{n_u}_\alpha \times \mathcal{Q}^{n_p}_\beta$, are therefore not related by simple expressions to those in the regular sequence space, $\mathcal{Q}^n_\alpha$.

2.2 Neutral networks as random graphs

The combinatory map, $f : \mathcal{Q}^n_\alpha \to \mathcal{S}_n$, is a (non-invertible) mapping from sequence space into shape space. It defines the neutral network $\mathcal{G}[s]$ for a given secondary structure $s$ as the subgraph of the graph of compatible sequences, $\mathcal{C}[s]$, which is induced by the vertex set $f^{-1}(s)$:

$$\mathcal{G}[s] := \mathcal{C}[s] \big[ f^{-1}(s) \big] .$$
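The compatibility relation and the count of compatible sequences defined above can be illustrated in a few lines. This is a minimal sketch under two assumptions that are not part of the text: structures are written in dot-bracket notation (a common convention for secondary structures), and positions are 0-based. The function names are illustrative.

```python
# Pairing rules B for natural RNA as listed in the text (beta = 6).
PAIRING_RULES = {"AU", "UA", "UG", "GU", "GC", "CG"}
ALPHABET = "AUGC"  # alpha = 4


def pairing_scheme(structure: str) -> list[tuple[int, int]]:
    """Return the base pairs [i, k] (0-based) of a dot-bracket string."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for k, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(k)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), k))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def is_compatible(sequence: str, structure: str) -> bool:
    """True if every base pair [i, j] of the structure satisfies [x_i, x_j] in B."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    return all(sequence[i] + sequence[j] in PAIRING_RULES
               for i, j in pairing_scheme(structure))


def count_compatible(structure: str) -> int:
    """Number of compatible sequences, alpha**n_u * beta**n_p."""
    n_p = len(pairing_scheme(structure))
    n_u = len(structure) - 2 * n_p
    return len(ALPHABET) ** n_u * len(PAIRING_RULES) ** n_p


if __name__ == "__main__":
    s = "((...))"
    print(is_compatible("GCAAAGC", s), is_compatible("AAAAAAA", s), count_compatible(s))
```

Compatibility is only a necessary condition: a compatible sequence may still fold into a different minimum free energy structure, which is why the neutral network is a subgraph of the graph of compatible sequences.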

Neutral networks are modeled by random graphs in sequence space in order to provide a rigorous frame for the derivation of analytical results that can be used as a reference for neutral networks of RNA [16, 45]. The starting point for the construction of the model network is the set of compatible sequences $C[s]$ of a given structure $s$. Sequences are chosen at random from this set but unpaired nucleotides and base pairs are distinguished. Within the set of compatible sequences there are two elementary moves: base exchanges for unpaired nucleotides and pair exchanges for base pairs.⁴ Each sequence, $x_i$, has a certain number of neutral neighbors, $\nu^{(u)}_i$ in the base exchange neighborhood and $\nu^{(p)}_i$ in the pair exchange neighborhood. The fractions of neutral neighbors are given by

$$(\lambda_u)_i = \frac{\nu^{(u)}_i}{(\alpha - 1)\, n_u} \quad \text{and} \quad (\lambda_p)_i = \frac{\nu^{(p)}_i}{(\beta - 1)\, n_p} ,$$

respectively. Averaging over all sequences of the neutral net yields two neutrality parameters

$$\lambda_u = \frac{1}{|G[s]|} \sum_{i \in G[s]} (\lambda_u)_i \quad \text{and} \quad \lambda_p = \frac{1}{|G[s]|} \sum_{i \in G[s]} (\lambda_p)_i .$$

There is no a priori reason why the probability to create a neutral neighbor in sequence space should be the same for a base exchange and a pair exchange and, in general, $\lambda_u$ will be different from $\lambda_p$.

The regularities in shape space suggest the construction of a probability space formed by randomly induced subgraphs of the hypercube $\mathcal{Q}^n_\alpha$. Each subset $X \subset \mathrm{v}[\mathcal{Q}^n_\alpha]$ induces a subgraph $\mathcal{Q}^n_\alpha[X]$, establishing a one-to-one correspondence between $X \subset \mathrm{v}[\mathcal{Q}^n_\alpha]$ and $\mathcal{Q}^n_\alpha[X]$. The set of all induced subgraphs of $\mathcal{Q}^n_\alpha$ with $0 \leq \lambda \leq 1$ is called $\Omega(\mathcal{Q}^n_\alpha)$. A probability measure for the neutral net $G \in \Omega(\mathcal{Q}^n_\alpha)$ is derived by

$$\mu_{n,\lambda}\{G\} := \lambda^{|\mathrm{v}[G]|} \, (1 - \lambda)^{\alpha^n - |\mathrm{v}[G]|} .$$

Random graph theory states the existence of a threshold value for density and connectivity of randomly induced subgraphs on generalized hypercubes. A subgraph $G < \mathcal{Q}^n_\alpha$ is called dense in $\mathcal{Q}^n_\alpha$ if and only if $\overline{\mathrm{v}[G]} = \mathrm{v}[\mathcal{Q}^n_\alpha]$, where $\overline{\mathrm{v}[G]}$ denotes the closure of the vertex set, that is, $\mathrm{v}[G]$ together with all vertices adjacent to it (proofs are found in [45]; sketches of generic neutral networks are shown in figure 1).

Given a threshold value $\lambda^* := 1 - \sqrt[\alpha-1]{\alpha^{-1}}$, we obtain

$$\lim_{n \to \infty} \mu_{n,\lambda}\{G \text{ is dense}\} = \begin{cases} 1 & \text{for } \lambda > 1 - \sqrt[\alpha-1]{\alpha^{-1}} \\ 0 & \text{for } \lambda < 1 - \sqrt[\alpha-1]{\alpha^{-1}} . \end{cases}$$

Let $(\mathcal{Q}^n_\alpha)$ be a sequence of generalized hypercubes and $G < \mathcal{Q}^n_\alpha$ random induced subgraphs. Then

$$\lim_{n \to \infty} \mu_{n,\lambda}\{G \text{ is connected}\} = \begin{cases} 1 & \text{for } \lambda > 1 - \sqrt[\alpha-1]{\alpha^{-1}} \\ 0 & \text{for } \lambda < 1 - \sqrt[\alpha-1]{\alpha^{-1}} . \end{cases} \tag{3}$$

The result is readily extended to the factorized hypercube. Taking the graph product of the two randomly induced subgraphs corresponding to the unpaired and paired parts of the secondary structure $s$, as illustrated in diagram 1, we obtain a probability space of random induced subgraphs of $\mathcal{Q}^{n_u}_\alpha \times \mathcal{Q}^{n_p}_\beta$ as follows: Let $G_u \in \Omega(\mathcal{Q}^{n_u}_\alpha)$ and $G_p \in \Omega(\mathcal{Q}^{n_p}_\beta)$ be random subgraphs as introduced in the above model. We set

$$\mathcal{G}[s] := G_u \times G_p \quad \text{and} \quad \mu_{\lambda_u,\lambda_p}(\mathcal{G}[s]) := \mu_{n_u,\lambda_u}(G_u) \; \mu_{n_p,\lambda_p}(G_p) .$$

Then $\mu_{\lambda_u,\lambda_p}$ is a probability measure and $\mathcal{G}[s] < \mathcal{C}[s]$.

Diagram 1: Relations between the subgraphs referring to unpaired and paired parts of RNA secondary structures and their embedding in the space of compatible sequences.

Random graph theory provides simple analytical expressions that allow one to predict generic structures of neutral networks. The predictions are fulfilled very well unless certain structural features require modifications [28, 29]. Most relevant for replication-mutation dynamics is the fact that the neutral networks of common structures in $\mathcal{Q}^n_{\mathrm{AUGC}}$ almost always fulfill the connectivity condition [24].
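The random graph model above can be explored numerically for small binary hypercubes. The sketch below is an illustration under stated assumptions, not the construction used in [45]: it treats only the unfactorized binary case ($\alpha = 2$), selects every vertex independently with probability $\lambda$, and tests connectivity by depth-first search. All function names are hypothetical, and the empty subgraph is counted as not connected by convention. The threshold theorem is asymptotic in $n$, so estimates at the small $n$ reachable here are only indicative.

```python
import random


def connectivity_threshold(alpha: int) -> float:
    """Threshold lambda* = 1 - alpha**(-1/(alpha-1)) quoted in the text."""
    return 1.0 - alpha ** (-1.0 / (alpha - 1))


def sample_vertices(n: int, lam: float, rng: random.Random) -> set[int]:
    """Pick each of the 2**n vertices of the binary hypercube with probability lam."""
    return {v for v in range(2 ** n) if rng.random() < lam}


def is_connected(vertices: set[int], n: int) -> bool:
    """Depth-first search on the induced subgraph; neighbors differ in one bit."""
    if not vertices:
        return False  # convention chosen for this sketch
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for bit in range(n):
            w = v ^ (1 << bit)
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)


def connected_fraction(n: int, lam: float, trials: int = 200, seed: int = 0) -> float:
    """Fraction of sampled induced subgraphs that are connected."""
    rng = random.Random(seed)
    hits = sum(is_connected(sample_vertices(n, lam, rng), n) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("lambda* for alpha=2:", connectivity_threshold(2))
    print("lambda* for alpha=4:", round(connectivity_threshold(4), 4))
    for lam in (0.3, 0.5, 0.7, 0.9):
        print(lam, connected_fraction(10, lam))
```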

3. Phenotypic error thresholds

Fitness values are assigned to individual phenotypes and the resulting fitness landscape, $f_{G[m]} : \mathcal{Q}^n_\alpha \to \mathbb{R}^+$, sets the stage for replication dynamics on the neutral network. Realistic fitness landscapes are complicated objects but they can be approximated well by simplifications (see section 4.1). For an investigation of the generic features of error propagation in replicating populations, it is sufficient to use a single-peak landscape in shape space which distinguishes the master shape from other shapes by assigning higher fitness to it. Then, the neutral network of the master shape $\mathcal{G}[m]$ defines the fitness landscape

$$f_{G[m]}(v) := \begin{cases} \sigma > 1 & \text{if } v \in \mathrm{v}\big[\mathcal{G}[m]\big] \\ 1 & \text{if } v \in \mathrm{v}\big[\mathcal{Q}^n_\alpha\big] \setminus \mathrm{v}\big[\mathcal{G}[m]\big] . \end{cases}$$

At first the error threshold for the maintenance of phenotypes will be derived by means of a phenomenological approach based on the kinetic differential equation for replication and mutation [6]; then we shall apply a stochastic replication-deletion process in the derivation.

3.1 Genotypic and phenotypic error thresholds

Error thresholds in replication-mutation equations were studied extensively in systems lacking selective neutrality [6, 8, 9]. The frequencies of genotypes $x_i$ are denoted by $\xi_i$ and their time derivatives are determined by the kinetic equations

$$\dot{\xi}_i = \big( Q_{ii}\, a_i - d_i - E \big)\, \xi_i + \sum_{j \neq i} Q_{ij}\, a_j\, \xi_j ; \quad i, j = 1, \ldots, r .$$

Genotype frequencies are normalized, $\sum_{i=1}^{r} \xi_i = 1$. The rate constants for replication and degradation of the molecular species $x_i$ are $a_i$ and $d_i$, respectively. Correct replication and mutations are modeled as parallel reactions whose relative frequencies are contained in the mutation matrix $Q := \{Q_{ij} ;\; i, j = 1, \ldots, r\}$. The fraction of error copies derived from template $x_j$ yielding genotype $x_i$ is given by $Q_{ij}$. Combinations of rate constants and mutation rates turned out to be useful: (i) the excess productivity $e_i = a_i - d_i$ and its mean value averaged over the entire population, $E = \sum_{i=1}^{r} e_i \xi_i$, and (ii) the value matrix $W := \{ w_{ij} = a_j Q_{ij} - d_i \delta_{ij} ;\; i, j = 1, \ldots, r \}$ with $\delta_{ij} = 1$ or $0$ for $i = j$ and $i \neq j$, respectively. The diagonal elements of the value matrix are called selective values. For constant total numbers of genotypes and vanishing mutational backflow, $\lim \sum_{j \neq i} Q_{ij} a_j \xi_j \to 0$, the selective value of a genotype is tantamount to its fitness. Accordingly, the genotype with maximal selective value, $x_m$,

$$w_{mm} = \max( w_{ii} \mid i = 1, \ldots, r ) ,$$

dominates the population at selection equilibrium. It is called the master sequence. Degradation rates can be assumed to be equal, $d_1 = \ldots = d_r = d$, and then they cancel in the kinetic equations. The stationary value of the frequency of the master genotype ($\bar{\xi}_m$) is readily calculated from the conditions $\dot{\xi}_i = 0$; $i = 1, \ldots, r$. Indeed, the stationary distribution of genotype frequencies can be computed exactly as the solution of an eigenvalue problem [34, 61].

In correspondence with a single-peak landscape the master genotype ($x_m$) is distinguished from all other genotypes ($x_j$; $j \neq m$), which are lumped together to form a mutant cloud with $\sum_{j \neq m} \xi_j = 1 - \xi_m$. "Means except the master" are introduced for replication rate constants and excess productivities, $\bar{a} = \sum_{j \neq m} a_j \xi_j / (1 - \xi_m)$ and $\bar{e} = \sum_{j \neq m} (a_j - d)\, \xi_j / (1 - \xi_m)$, respectively. The former is used to define a superiority of the master sequence, $\sigma_m = a_m / \bar{a}$. At zero mutational backflow the stationary concentration of the master sequence becomes:

$$\bar{\xi}_m = \frac{w_{mm} - \bar{e}}{e_m - \bar{e}} = \frac{a_m Q_{mm} - \bar{a}}{a_m - \bar{a}} = \frac{\sigma_m Q_{mm} - 1}{\sigma_m - 1} .$$

A simple, but interestingly fairly accurate, zeroth-order approximation (see p. 177 in [8]) to the localization threshold of molecular quasispecies is obtained by computing the replication accuracy $Q_{mm}$ at which the stationary concentration of the master genotype vanishes:

$$Q_{mm}(\bar{\xi}_m = 0) = Q_{\min} = \sigma_m^{-1} .$$

Application of the uniform error rate model⁵ yields for the genotypic error threshold:

$$q_{\min} = 1 - p_{\max} = \sigma_m^{-1/n} , \tag{4}$$

where $p = 1 - q$ is the mutation rate per (nucleotide) site. In case of selective neutrality ($a_m \to \bar{a}$) the error threshold converges to the limit of absolute replication accuracy ($q_{\min} \to 1$). Thus, the model in its original form cannot be used to describe evolutionary stability of phenotypes in the selectively neutral case.

In order to deal with the neutral case we modify the kinetic equations in a straightforward way: genotypes are ordered with respect to non-increasing fitness; the first $k$ different genotypes are assumed to have maximal fitness, $w_1 = \ldots = w_k = \tilde{w}_m = w_{\max}$, as well as identical replication and degradation rate constants, $a_1 = \ldots = a_k = \tilde{a}_m$ and $d_1 = \ldots = d_k = \tilde{d}_m$. In addition, we define new variables, $\eta_\ell$ ($\ell = 1, \ldots, s$), which lump together all genotypes folding into the same phenotype:

$$\eta_\ell = \sum_{i = k_{\ell-1}+1}^{k_\ell} \xi_i \quad \text{with} \quad \sum_{\ell=1}^{s} \eta_\ell = \sum_{i=1}^{r} \xi_i = 1 .$$

The master phenotype is characterized by "m" ($m \equiv 1$, $k_0 = 0$, and $k_1 = k$), $\eta_m = \sum_{i=1}^{k} \xi_i$, and the following kinetic differential equation is obtained for the class of sequences forming the neutral network of the master phenotype:

$$\dot{\eta}_m = \sum_{i=1}^{k} \dot{\xi}_i = \eta_m \Big( \tilde{a}_m Q_{kk} - \tilde{d}_m - \sum_{\ell=1}^{s} \tilde{e}_\ell\, \eta_\ell \Big) + \sum_{i=1}^{k} \sum_{j \neq i} a_j\, Q_{ij}\, \xi_j .$$

⁵ The uniform error rate model [9] assumes that the single digit replication accuracy ($q$) is independent of nucleotide and site. Hence, we obtain $Q_{ii} = q^n$ ($i = 1, \ldots, r$) for polynucleotides of chain length $n$.
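The genotypic error threshold of equation (4) is easy to evaluate numerically. The sketch below assumes the uniform error rate model ($Q_{mm} = q^n$) and zero mutational backflow, as in the zeroth-order approximation of the text; `sigma` stands for the superiority of the master sequence, and the function names are illustrative.

```python
def genotypic_q_min(sigma: float, n: int) -> float:
    """Minimal single-digit accuracy q_min = sigma**(-1/n), equation (4)."""
    if sigma <= 1.0:
        raise ValueError("the master must be superior: sigma > 1")
    return sigma ** (-1.0 / n)


def master_frequency(q: float, sigma: float, n: int) -> float:
    """Stationary master frequency (sigma * Q_mm - 1) / (sigma - 1) with Q_mm = q**n.

    Zero mutational backflow is assumed; negative values indicate that the
    master is not maintained in this approximation.
    """
    if sigma <= 1.0:
        raise ValueError("the master must be superior: sigma > 1")
    return (sigma * q ** n - 1.0) / (sigma - 1.0)


if __name__ == "__main__":
    sigma, n = 10.0, 50
    q_min = genotypic_q_min(sigma, n)
    print("q_min =", q_min, " p_max =", 1.0 - q_min)
    for q in (1.0, 0.99, q_min):
        print(q, master_frequency(q, sigma, n))
```

As the superiority approaches one, `genotypic_q_min` approaches one as well, which is the numerical face of the statement that the original model cannot handle the selectively neutral case.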

The mean excess productivity of the population is, of course, independent of the choice of variables:

$$E = \sum_{\ell=1}^{s} \tilde e_\ell \eta_\ell = \sum_{i=1}^{r} e_i \xi_i.$$

In order to derive a suitable expression for a phenotypic error threshold we split the mutational backflow into two contributions, (i) mutational backflow on the neutral network and (ii) mutational backflow from genotypes outside the network:

$$\sum_{i=1}^{k} \sum_{j \neq i} a_j Q_{ij} \xi_j = \Bigl\{ \tilde a_m \sum_{i=1}^{k} \sum_{j=1,\, j \neq i}^{k} Q_{ij} \xi_j \Bigr\} + \Bigl\{ \sum_{i=1}^{k} \sum_{j=k+1}^{r} a_j Q_{ij} \xi_j \Bigr\}.$$

The first term is readily computed under two assumptions: (i) the fraction of selectively neutral neighbors ($\lambda_m$) is constant and (ii) mutation rates are equal ($Q_{ij} = Q_j$; $i,j = 1,\dots,k$; $i \neq j$) on the master network:

$$\sum_{i=1}^{k} \sum_{j=1,\, j \neq i}^{k} Q_{ij} \xi_j \approx \frac{\lambda_m (1 - Q_{mm})}{k-1} \sum_{i=1}^{k} \sum_{j=1,\, j \neq i}^{k} \xi_j = \frac{\lambda_m (1 - Q_{mm})}{k-1} \sum_{j=1}^{k} \sum_{i=1,\, i \neq j}^{k} \xi_j = \lambda_m (1 - Q_{mm})\, \eta_m.$$

Mutational backflow from other networks ($\eta_\ell$; $\ell \neq m$) need not be evaluated explicitly since it will be neglected in zeroth order, as was done in the derivation of the genotypic error threshold. The kinetic equation for the master phenotype can now be rewritten and takes the form

$$\dot\eta_m = \bigl( \tilde a_m \tilde Q_{mm} - \tilde d_m - E \bigr)\, \eta_m + \text{Mutational Backflow}.$$

It is formally identical with the equation for the genotypic concentration variables $\xi_i$; the actual difference between the two equations concerns the use of an effective replication accuracy that accounts for the degree of neutrality:

$$\tilde Q_{mm} = Q_{mm} + \lambda_m (1 - Q_{mm}) = q^n \bigl( \omega(q)\, \lambda_m + 1 \bigr) \quad\text{with}\quad \omega(q) = \frac{1}{q^n} - 1. \qquad (5)$$
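The step from the split backflow to the effective replication accuracy can be written out explicitly. The following is a sketch of that substitution, with $\xi_j$ denoting genotype frequencies, $\eta_m$ the frequency of the master phenotype and $\lambda_m$ the fraction of neutral neighbors. It uses the two assumptions stated above and $Q_{kk} = Q_{mm}$, which holds under the uniform error rate model.

```latex
\begin{align*}
\dot\eta_m
 &= \eta_m \bigl( \tilde a_m Q_{mm} - \tilde d_m - E \bigr)
    + \tilde a_m \sum_{i=1}^{k} \sum_{j=1,\, j \neq i}^{k} Q_{ij}\, \xi_j
    + \sum_{i=1}^{k} \sum_{j=k+1}^{r} a_j Q_{ij}\, \xi_j \\
 &\approx \eta_m \bigl( \tilde a_m Q_{mm} - \tilde d_m - E \bigr)
    + \tilde a_m \lambda_m (1 - Q_{mm})\, \eta_m
    + \text{(backflow from outside the network)} \\
 &= \bigl( \tilde a_m \bigl[ Q_{mm} + \lambda_m (1 - Q_{mm}) \bigr]
    - \tilde d_m - E \bigr)\, \eta_m
    + \text{(backflow from outside the network)}.
\end{align*}
```

The bracket in the last line is the effective accuracy: the on-network backflow is proportional to $\eta_m$ itself, so it can be absorbed into the replication term.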

Neglecting mutational backflow from non-master phenotypes we find, in analogy with the genotypic error threshold,

$$\tilde Q_{\min} = \bigl( Q_{mm} + \lambda_m (1 - Q_{mm}) \bigr)_{\min} = \sigma_m^{-1},$$

where $\sigma_m$ is the superiority of the "master phenotype". Introducing the uniform error rate model and neglecting mutational backflow, we obtain for the stationary frequency of master phenotypes:

$$\bar\eta_m(p) = \frac{\tilde Q_{mm}(p)\, \sigma_m - 1}{\sigma_m - 1} = \frac{(1-p)^n\, \sigma_m (1 - \lambda_m) + \lambda_m \sigma_m - 1}{\sigma_m - 1}. \qquad (6)$$

Application of the "zeroth-order approximation", $\bar\eta_m = 0$, yields the phenotypic error threshold:

$$q_{\min} = 1 - p_{\max} = \left( \frac{1 - \lambda_m \sigma_m}{(1 - \lambda_m)\, \sigma_m} \right)^{1/n}. \qquad (7)$$

The function $q = q_{\min}(n, \lambda_m, \sigma_m)$ is illustrated in figure 2. Limits are easily visualized: (i) the phenotypic error threshold converges to the genotypic value $q_{\min} = \sigma_m^{-1/n}$ in the limit $\lambda_m \to 0$ and (ii) the minimal replication accuracy $q_{\min}$ approaches zero in the limit $\lambda_m \to \sigma_m^{-1}$. The second case expresses the fact that the single digit accuracy plays no role in case of a sufficiently high degree of neutrality.

Recapitulating the results on stationary distributions of phenotypes derived in this section, we state that selective neutrality allows more replication errors to be tolerated than in the non-neutral case. Above the genotypic error threshold and below the phenotypic threshold, stationarity in phenotype space is accompanied by changing genotypes, corresponding to a population which drifts randomly [14, 15, 33] on the neutral network of the master phenotype. The master phenotype is conserved as long as the mutation rate is below the phenotypic threshold. In case the error rate increases above this critical value the population drifts through

Figure 2: The phenotypic error threshold from phenomenological equations. The upper part (A) shows the stationary frequency of master phenotypes, $\bar\eta_m$, as a function of the error rate $p$. The parameter values are: $n = 40$, $\sigma_m = 10$ and $\lambda_m = 0$ (gray), $0.025$, $0.050$, $0.1$ and $0.2$. The frequency $\bar\eta_m(p)$ shows a roughly exponential decrease until it becomes very small near the error threshold $p_{\max}$. Exceptions are the two highest curves, where the degree of neutrality is equal to or larger than the reciprocal superiority ($\lambda_m \geq \sigma_m^{-1}$) and hence the master phenotype is never lost from the population. In the lower part (B) we show the error threshold $p_{\max} = 1 - q_{\min}$ as a function of the degree of neutrality of the master phenotype, $\lambda_m$, for $n = 40$ and $\sigma_m = 10$. It is interesting to note that $q_{\min}$ approaches zero when $\lambda_m$ becomes $\sigma_m^{-1}$, in agreement with plot A showing that errors do not lead to a loss of the master phenotype.
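The curves described in the caption can be recomputed from equations (6) and (7). The sketch below uses the caption's parameters ($n = 40$, superiority 10) and writes `lam` for the degree of neutrality and `sigma` for the superiority. The function names, the clamping of negative frequencies to zero, and the convention of returning `1.0` once the threshold has disappeared are choices of this sketch.

```python
def phenotype_frequency(p, n, sigma, lam):
    """Equation (6): zeroth-order stationary frequency of the master phenotype."""
    q_eff = (1 - p) ** n * (1 - lam) + lam  # effective accuracy, equation (5)
    return max(0.0, (q_eff * sigma - 1.0) / (sigma - 1.0))


def phenotypic_threshold(n, sigma, lam):
    """Equation (7): p_max = 1 - q_min; returns 1.0 once lam >= 1/sigma."""
    if lam * sigma >= 1.0:
        return 1.0  # q_min has reached zero: the master phenotype is never lost
    q_min = ((1 - lam * sigma) / ((1 - lam) * sigma)) ** (1.0 / n)
    return 1.0 - q_min


for lam in (0.0, 0.025, 0.05, 0.1, 0.2):
    print(lam, round(phenotypic_threshold(40, 10.0, lam), 4))
```

With `lam = 0` the function reproduces the genotypic threshold of equation (4), and the threshold grows monotonically with the degree of neutrality until it disappears at `lam = 1/sigma`.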

both sequence space and shape space, and no stationary state is approached. It is particularly interesting to note that there is a degree of neutrality, related to the superiority of the master phenotype ($\lambda = \sigma^{-1}$), above which the error rate does not matter. In other words, the master phenotype will never be lost when the degree of neutrality exceeds the inverse superiority.

Replication-Deletion Process

For consistency with the random graph model of neutral networks we describe a general formalism for populations on graphs and their changes in time. A population $V$ is a (finite) family of vertices on the hypercube, $\{ v_i \mid i \in \mathbb{N}_N \} \subset v[Q^n_\alpha]$, where $\mathbb{N}_N$ denotes the set of natural numbers up to the population size $N$: $\mathbb{N}_N \equiv \{1, 2, \dots, N\}$.⁶ In the theory of point processes such a family may be identified with an integer valued measure:

$$V = (v_i \mid i \in \mathbb{N}_N) \mapsto \chi(v) := \sum_{i=1}^{N} g_{v_i}(v) \quad\text{with}\quad g_{v_i}(v) := \begin{cases} 1 & \text{for } v = v_i, \\ 0 & \text{for } v \neq v_i. \end{cases} \qquad (8)$$

This measure counts the numbers of individual genotypes present in the population. The population is divided into $\ell$ vertices on the neutral network $G[s]$ and $N - \ell$ other vertices. Formally, the vertices belonging to both the population and the neutral network are $(v_i \mid i \in \mathbb{N}_N,\ v_i \in v[G[s]])$, and consistently $\ell = \sum_{v \in v[G[s]]} \chi(v)$, the measure restricted to the vertex set of $G[s]$.

The replication-deletion process is known as the Moran model [41] in mathematical biology and consists of two strongly coupled random events: an arbitrarily chosen member of the population is subjected to replication, whereas another randomly chosen member is deleted. The population size strictly equals $N$ between the pairs of coupled events. In the notation introduced above, a replication-deletion event is a mapping from the family $(v_i \mid i \in \mathbb{N}_N)$ into the family $(v'_i \mid i \in \mathbb{N}_N)$: an ordered pair $(v_j, v_k)$ is drawn from the population $V$. Assuming a single peak fitness landscape with the superiority $\sigma$ for the master phenotype, the first vertex ($v_j$) is chosen with probability $\sigma\ell / [(N - \ell) + \sigma\ell]$ uniformly from the $\ell$ elements located on the neutral net $G[s]$, and with probability $(N - \ell) / [(N - \ell) + \sigma\ell]$ uniformly from the remaining $N - \ell$ elements. The second vertex ($v_k$) of the ordered pair is chosen with uniform probability $1/(N-1)$ from all vertices except the first one, $(v_i \neq v_j \mid i \in \mathbb{N}_N)$. Making use of the uniform error rate approximation, the two events can be formulated as follows:

(i) The vertex $v_j = (x_1, \dots, x_n)$ is mapped randomly into $v^* = (x^*_1, \dots, x^*_n)$ by mapping each coordinate $x_i$ with uniform probability $p$ into $x^*_i \neq x_i$ (with $x_i, x^*_i \in A$) and leaving it unchanged otherwise ($x^*_i = x_i$), and

(ii) the vertex ($v_k$) is deleted.

The random mapping (i), $v_j \mapsto v^*$, is a replication event. With probability $(1-p)^n$ we have $v^* \equiv v_j$ and an error-free copy is produced; otherwise a mutation occurs. The complete mapping (i) and (ii), $(v_j, v_k) \mapsto (v_j, v^*)$, is the elementary act of the replication-deletion process. Individual elementary acts are assumed to be independent events and, accordingly, time intervals $\hat T$ between consecutive events are exponentially distributed, $\mathrm{Prob}\{\hat T \geq t\} = \exp\{-[(N - \ell) + \sigma\ell]\, t\}$, on a time axis which is scaled by the mean fitness of the population.

Replication-deletion is modeled by a stochastic process in continuous time. As in the previous section 3.1 we lump together the neutral vertices ($v[G[s]]$) and all other vertices ($v[Q^n_\alpha] \setminus v[G[s]]$) [42]. The stochastic variable $\hat Y(t) \in \{0\} \cup \mathbb{N}_N$ describes the evolution of the number of elements on the neutral network.

⁶ For formal reasons which will become clear when we define the replication-deletion process we do not consider single vertices as populations: $N \geq 2$.
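The elementary act described above, steps (i) and (ii), can be sketched as a single function. This is an illustration under stated assumptions, not the authors' implementation: sequences are tuples over a binary alphabet (so a point mutation flips the bit), `sigma` is the superiority of the master phenotype, and `on_network` is a hypothetical predicate standing in for membership in the neutral network $G[s]$.

```python
import random


def replication_deletion_event(population, on_network, sigma, p, rng=random):
    """Apply one replication-deletion event in place to a list of bit tuples.

    `on_network` is a hypothetical predicate marking vertices of G[s]; it
    stands in for the sequence-to-structure map and is not from the paper.
    """
    # (i) choose the template: weight sigma on the network, 1 elsewhere
    weights = [sigma if on_network(v) else 1.0 for v in population]
    j = rng.choices(range(len(population)), weights=weights)[0]
    # copy with independent per-site error rate p (binary alphabet: flip)
    copy = tuple(x ^ 1 if rng.random() < p else x for x in population[j])
    # (ii) delete a different, uniformly chosen member; the copy takes its slot
    k = rng.choice([i for i in range(len(population)) if i != j])
    population[k] = copy
    return j, k


rng = random.Random(1)
pop = [(0,) * 8 for _ in range(20)]
for _ in range(100):
    replication_deletion_event(pop, lambda v: sum(v) <= 1, 10.0, 0.02, rng)
```

The population size stays constant because the copy replaces the deleted member. Waiting times between events are not simulated here; only the embedded sequence of events is.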
The population is split into two parts; precisely, the map $f(s)$ induces a bipartition of the population $V$ on $Q^n_\alpha$:

$$V_\mu := \{ v \in V \mid v \in v[G[s]] \} \quad\text{and}\quad V_\nu := \{ v \in V \mid v \notin v[G[s]] \},$$

where $V = V_\mu \cup V_\nu$. The elements of $V_\mu$ are called master phenotypes and those of $V_\nu$ are the mutant phenotypes. The analysis of the stochastic process described by $\hat Y(t)$ aims at a computation of the distribution function

$$P_\ell(t) = \mathrm{Prob}\{ \hat Y(t) = \ell \}; \quad \ell = 0, 1, \dots, N,$$

whose time dependence is commonly described by a master equation (see, for example, [17]):

$$\frac{dP_\ell}{dt} = P_{\ell+1,\ell}\, P_{\ell+1}(t) + P_{\ell-1,\ell}\, P_{\ell-1}(t) - \bigl( P_{\ell,\ell+1} + P_{\ell,\ell-1} \bigr) P_\ell(t).$$

Implicitly, the common assumption for chemical reactions is made that all transitions changing $\ell$ by more than one are of order $o(\Delta t)$ and thus vanish in the limit $\Delta t \to 0$ [40]. Non-zero transition rates are confined to $P_{\ell,\ell+1}$, $P_{\ell,\ell}$ and $P_{\ell,\ell-1}$, which suggests modeling replication-deletion dynamics by a birth-and-death process.

Birth-and-death model

A birth-and-death process is a Markovian process in continuous time which is completely described by the transition rates $P_{\ell,\ell+1}$, $P_{\ell,\ell}$ and $P_{\ell,\ell-1}$. The replication-deletion process with the approximations adopted above, and with a finite population of constant size $N$, is a birth-and-death process [25, 35, 36] that converges to a stationary distribution, defined by $dP_\ell/dt = 0\ \forall\, \ell$. The time axis can be transformed such that the probability for a replication-deletion event becomes independent of the population average. This "reactor time" $\tilde t$ is defined by

$$d\tilde t := \frac{N}{(\sigma - 1)\ell + N}\, dt,$$

that is, $dt$ is divided by the mean replication weight $((\sigma - 1)\ell + N)/N$ of the population. After scaling we obtain for the birth-and-death rates

$$\tilde P_{\ell,\ell+1} = \frac{N - \ell}{N(N-1)} \Bigl[ \sigma \ell\, \overline{W}_{\mu,\mu} + (N - \ell - 1)\, \overline{W}_{\nu,\mu} \Bigr] \quad\text{for } 0 \leq \ell \leq N-1,$$

$$\tilde P_{\ell,\ell-1} = \frac{\ell}{N(N-1)} \Bigl[ \sigma (\ell - 1)\, \overline{W}_{\mu,\nu} + (N - \ell)\, \overline{W}_{\nu,\nu} \Bigr] \quad\text{for } 1 \leq \ell \leq N,$$

$$\tilde P_{\ell,\ell} = 1 - \tilde P_{\ell,\ell+1} - \tilde P_{\ell,\ell-1},$$

and $\tilde P_{\ell,k} = 0$ otherwise. The mean values, $\overline{W}_{\cdot,\cdot}$, represent averages of the conditional probabilities, $W_{\cdot,\cdot}(v)$, over the whole neutral network $G[s]$ and refer to the mappings discussed for the replication-deletion process (see section 3.2). In particular,

(i) $W_{\mu,\mu}(v)$ is the conditional probability of a mapping of $v \in G[s]$ into $G[s]$: a master vertex maps into a master vertex,

(ii) $W_{\mu,\nu}(v)$ is the conditional probability of a mapping from $v \in G[s]$ into the set $Q^n_\alpha \setminus G[s]$: a master vertex maps into a vertex of a mutant phenotype,

(iii) $W_{\nu,\mu}(u)$ is the conditional probability of a mapping from $u \in Q^n_\alpha \setminus G[s]$ onto the neutral network $G[s]$: a mutant phenotype vertex maps into a master vertex, and

(iv) $W_{\nu,\nu}(u)$ is the conditional probability of a particular mapping within the set $Q^n_\alpha \setminus G[s]$: a mutant phenotype vertex maps into a mutant phenotype vertex.

By definition we have $W_{\mu,\mu}(v) + W_{\mu,\nu}(v) = 1$ and $W_{\nu,\mu}(u) + W_{\nu,\nu}(u) = 1$, and the same is, of course, true for the corresponding mean values. It is thus sufficient to compute the two probabilities $\overline{W}_{\mu,\mu}$ and $\overline{W}_{\nu,\mu}$ for a full description of the process.

Transition probabilities

By construction the neutral network $G[s]$ has binomially distributed vertex degrees, since compatible adjacent vertices are selected with independent probabilities, $\lambda_u$ and $\lambda_p$, respectively. Application of the birth-and-death process to the stochastic dynamics on the hypercube thus is an approximation which treats the neutral network $G[s]$ as a regular graph. The transition probability $\overline{W}_{\mu,\mu}$ on the product space $Q^{n_u}_\alpha \times Q^{n_p}_\beta$ is readily computed:

$$\overline{W}_{\mu,\mu} = (1-p)^n \bigl( \vartheta(p)\, \lambda_u + 1 \bigr) \bigl( \varphi(p)\, \lambda_p + 1 \bigr), \qquad (9)$$

with the two functions $\vartheta(p)$ and $\varphi(p)$ expressing the dependence of the coefficients of $\lambda_u$ and $\lambda_p$ on the mutation rate $p$:

$$\vartheta(p) := \frac{1}{(1-p)^{n_u}} - 1 \quad\text{and}\quad \varphi(p) := \left( \frac{p^2}{(\alpha - 1)(1-p)^2} + 1 \right)^{n_p} - 1.$$
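The scaled birth-and-death rates can be tabulated once the two mean transition probabilities are known. The sketch below treats them as given numbers, `w_mm` for master to master and `w_nm` for mutant to master, and obtains the complementary ones from normalization. It assumes that master templates carry the weight `sigma` in the selection step, as in the description of the replication-deletion event; the function name is illustrative.

```python
def birth_death_rates(ell, n_pop, sigma, w_mm, w_nm):
    """Scaled rates (up, down) for ell masters in a population of n_pop.

    w_mm: mean probability master -> master; w_nm: mutant -> master.
    The complementary probabilities follow from normalization.
    """
    w_mn = 1.0 - w_mm  # master -> mutant
    w_nn = 1.0 - w_nm  # mutant -> mutant
    norm = n_pop * (n_pop - 1)
    # birth: the copy lands on the network and a mutant is deleted
    up = (n_pop - ell) / norm * (sigma * ell * w_mm + (n_pop - ell - 1) * w_nm)
    # death: the copy lands off the network and a master is deleted
    down = ell / norm * (sigma * (ell - 1) * w_mn + (n_pop - ell) * w_nn)
    return up, down


rates = [birth_death_rates(ell, 50, 10.0, 0.6, 0.01) for ell in range(51)]
```

The boundary behavior follows from the prefactors: no death is possible at `ell = 0` and no birth at `ell = n_pop`.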

Computation of the remaining transition probability is simplified by the notion of compatibility classes. For a vertex $v \in v[Q^n_\alpha]$ the incompatibility distance with respect to $G[s]$ is

$$d(G[s], v) := \bigl| \{ [v_i, v_k] \mid [v_i, v_k] \notin B \wedge [i,k] \in \Pi(s) \} \bigr|$$

and the $i$-th compatibility class, $C_i[s]$, is defined by

$$C_i[s] := \{ v \in v[Q^n_\alpha] \setminus v[G[s]] \mid d(G[s], v) = i \} \quad \forall\, i = 0, \dots, n_p.$$

The class with $i = 0$ is the set of compatible sequences that do not form the master phenotype, and we have $C[s] = G[s] + C_0[s]$. The other compatibility classes with $i = 1, 2, \dots, n_p$ are incompatible with the master phenotype in one, two, $\dots$, $n_p$ base pairs, respectively. Next, we assume that the mutant elements of the population are uniformly distributed in the corresponding compatibility classes:

$$\mathrm{Prob}\{ \chi(v) = k \} \text{ is independent of } v \in v[Q^n_\alpha] \setminus v[G[s]] \text{ for } 0 \leq k \leq N.$$

The computation of the expectation values is simplified by the assumption of an alphabet in which the number of bases equals the number of base pairs: $\alpha = \beta$. In particular, we find for binary alphabets ($\alpha = \beta = 2$):

$$|C_i[s]| = \begin{cases} \binom{n_p}{i}\, 2^{n_p + n_u} & \text{for } 1 \leq i \leq n_p, \\ 2^{n_u + n_p} - |G[s]| & \text{for } i = 0, \end{cases}$$

and obtain for the expectation value:

$$\overline{W}_{\nu,\mu} = \lambda_p \sum_{h=1}^{n} p^h (1-p)^{n-h} \sum_{\ell=0}^{h} \binom{n_u}{h - \ell}\, \lambda_u^{\varepsilon} \sum_{i=0}^{\ell} \frac{N_i}{N - \ell} \binom{n_p - i}{\ell - i}\, 2^i (\alpha - 2)^{\ell - i} \quad\text{with}\quad \varepsilon = \begin{cases} 0 & \text{if } h - \ell = 0, \\ 1 & \text{otherwise,} \end{cases}$$

and $N_i$ being the number of vertices of the population in $C_i[s]$, i.e. with $i$ incompatible base pairs, so that $\sum_{i=0}^{n_p} N_i = N - \ell$. The quantities $N_i/(N - \ell)$ are the densities of sequences in the $i$-th compatibility classes. We shall assume that these densities are given by $|C_i[s]| / (\alpha^n - |G[s]|)$. In other words, sequences are uniformly distributed among their $i$-th compatibility classes.

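Stationary distributions

The stationary distribution can be derived by making use of the special form of the transition probabilities:

$$P_{\ell,\ell+1} = \gamma(\ell)\, \delta_+ \left( 1 + \frac{C_+}{\ell} \right) \quad\text{and}\quad P_{\ell,\ell-1} = \gamma(\ell)\, \delta_- \left( 1 + \frac{C_-}{N - \ell} \right) \quad\text{with}\quad \gamma(\ell) = \frac{(N - \ell)\,\ell}{N(N-1)},$$

and

$$\delta_+ = \sigma \overline{W}_{\mu,\mu} - \overline{W}_{\nu,\mu}, \quad C_+ = (N-1)\, \frac{\overline{W}_{\nu,\mu}}{\sigma \overline{W}_{\mu,\mu} - \overline{W}_{\nu,\mu}}; \qquad \delta_- = \overline{W}_{\nu,\nu} - \sigma \overline{W}_{\mu,\nu}, \quad C_- = (N-1)\, \frac{\sigma \overline{W}_{\mu,\nu}}{\overline{W}_{\nu,\nu} - \sigma \overline{W}_{\mu,\nu}},$$

respectively.⁷ The stationary distribution of the birth-and-death process is obtained from $\bar\pi_k = \pi_k / \sum_{k=1}^{N} \pi_k$ with

$$\pi_k = \prod_{\ell=1}^{k} \frac{P_{\ell-1,\ell}}{P_{\ell,\ell-1}} = \left( \frac{\sigma \overline{W}_{\mu,\mu} - \overline{W}_{\nu,\mu}}{\overline{W}_{\nu,\nu} - \sigma \overline{W}_{\mu,\nu}} \right)^{k-1} \frac{B(N, C_-)}{(k + C_+)\, B(1 + C_+, k)\, B(N - (k-1), C_-)} \cdot \frac{N(N-1)\, \overline{W}_{\nu,\mu}}{k \bigl[ \sigma (k-1)\, \overline{W}_{\mu,\nu} + (N - k)\, \overline{W}_{\nu,\nu} \bigr]} \qquad (10)$$

and $B(z - y, y) = \Gamma(z - y)\, \Gamma(y) / \Gamma(z)$, where $B$ and $\Gamma$ are the Beta and the Gamma function, respectively. The expectation value of the number of master vertices in the long time limit is then obtained by $\lim_{t \to \infty} E\{\hat Y(t)\} = E\{\bar Y\} = \sum_{k=1}^{N} k\, \bar\pi_k$.

The stationary distribution of the birth-and-death process is evaluated as a function of the mutation rate $p$. For this goal we approximate the factorials of the Beta function by means of Stirling's formula and plot the stationary distribution as a function of $p$ (figure 3). The maximum of this stationary distribution, although not identical with but very close to $E\{\bar Y\}$, corresponds to the stationary frequency of phenotypes ($\bar\eta_m$ in equation 6) shown in figure 2A.

⁷ The function $\gamma(\ell)$ happens to be independent of $\sigma$ in this particular case. It would, for example, depend on $\sigma$ if we used the unscaled time axis $t$ instead of $\tilde t$.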
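Instead of the closed form with Beta functions, the product of rate ratios can also be accumulated numerically. The sketch below does this in log space to avoid overflow for large populations. It is an illustration under assumptions: the mean transition probabilities `w_mm` (master to master) and `w_nm` (mutant to master) are supplied as plain numbers strictly between 0 and 1, `sigma` is the superiority, and the distribution is normalized over all states $0, \dots, N$, which is the usual convention for birth-and-death chains. The function names are illustrative.

```python
import math


def up_rate(ell, n_pop, sigma, w_mm, w_nm):
    """Scaled birth rate for ell masters."""
    return (n_pop - ell) / (n_pop * (n_pop - 1)) * (
        sigma * ell * w_mm + (n_pop - ell - 1) * w_nm)


def down_rate(ell, n_pop, sigma, w_mm, w_nm):
    """Scaled death rate for ell masters."""
    return ell / (n_pop * (n_pop - 1)) * (
        sigma * (ell - 1) * (1.0 - w_mm) + (n_pop - ell) * (1.0 - w_nm))


def stationary_distribution(n_pop, sigma, w_mm, w_nm):
    """Stationary probabilities pi[0..n_pop] of the number of masters."""
    log_w = [0.0]  # unnormalized weight of state 0 is 1
    for k in range(1, n_pop + 1):
        ratio = (math.log(up_rate(k - 1, n_pop, sigma, w_mm, w_nm))
                 - math.log(down_rate(k, n_pop, sigma, w_mm, w_nm)))
        log_w.append(log_w[-1] + ratio)
    peak = max(log_w)  # shift before exponentiating to avoid overflow
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


pi = stationary_distribution(20, 2.0, 0.9, 0.05)
mean_masters = sum(k * w for k, w in enumerate(pi))
```

Each weight satisfies the detailed-balance relation between neighboring states, which is what the product over rate ratios expresses.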

[Figure 3, panels A and B; axes: mutation rate $p$, number of masters, density]

Figure 3: The stationary distribution of the number of master phenotypes $\bar Y_\ell$ as a function of the mutation rate $p$. The upper part (A) shows a 3D-plot of the distribution $\bar\pi_k(p)$ and the lower part (B) is a contour plot of A. The neutral network $G[s]$ is assumed to be regular, and the parameters $\lambda_u = \lambda_p = 0.5$, $\sigma = 10$ and $N = 1000$ were used.

The error threshold in finite populations

The stationary distribution of the birth-and-death process is now applied to derive an expression for the phenotypic error threshold. Without loss of generality for all realistic situations, we assume that the population size $N$ is small compared to the total number of sequences: $N \ll |Q^n_\alpha| = \alpha^n$. In addition, it is straightforward to check that the size of a neutral network is usually much smaller than sequence space: $|G[s]| / \alpha^n \ll 1$. The number of compatible sequences fulfils the relation $|C[s]| = \alpha^{n_u} \beta^{n_p} \ll \alpha^{n_u + 2 n_p}$ for sufficiently long sequences and sufficiently large numbers of base pairs. Then, the probability $\mathrm{Prob}\{\bar Y = 0\} = \bar\pi_0$ becomes the maximum of the distribution function in the range of random replication,⁸ although $\hat Y = 0$ is not an absorbing state.

A stochastic extension of the zeroth-order condition for the occurrence of a phenotypic error threshold ($\bar\eta_m = 0$) is to ask for the condition that the variance of the stochastic variable $\hat Y(p)$ becomes larger than the square of its expectation value in the long time limit. It is meaningful to correct the expectation value by subtraction of its minimum value at random replication. Then, we find for the phenotypic error threshold of structure $s$ as master phenotype:

$$p_{\max}(n) := \max \left\{ p \;\Big|\; \mathrm{Var}\{\bar Y_s(p)\} \leq \left( E\{\bar Y_s(p)\} - \frac{|G[s]|}{\alpha^n} \right)^{2} \right\}. \qquad (11)$$

This criterion for the error threshold appears to be more widely applicable than one previously derived for the genotypic error threshold by means of a birth-and-death model [42]. Attempts to derive an analytical expression for $p_{\max}(n)$, however, have failed so far. Some numerically computed data for equation (11) are shown in table 3.

⁸ Random replication means that correct and incorrect digits are incorporated with equal probabilities. It occurs for a replication accuracy in the neighborhood of $q_{\mathrm{rnd}} = 1 - 1/\alpha$.
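Since no analytical expression is available, the criterion has to be evaluated on a grid of mutation rates. The sketch below shows one way to do that. It is only an illustration: `distribution_at` is a hypothetical callable that returns the stationary probabilities of the number of masters for a given mutation rate, `baseline` stands for the expectation at random replication, and the criterion is read as "variance not larger than the squared corrected expectation".

```python
def threshold_from_distributions(p_grid, distribution_at, baseline):
    """Largest p on the grid with Var{Y} <= (E{Y} - baseline)**2.

    `distribution_at(p)` is a hypothetical callable returning the stationary
    probabilities pi[0..N]; `baseline` is the value at random replication.
    Returns None if no grid point satisfies the criterion.
    """
    best = None
    for p in p_grid:
        pi = distribution_at(p)
        mean = sum(k * w for k, w in enumerate(pi))
        var = sum((k - mean) ** 2 * w for k, w in enumerate(pi))
        if var <= (mean - baseline) ** 2:
            best = p if best is None else max(best, p)
    return best


# Toy usage: a two-state variable with Prob{Y = 1} = 1 - p.
toy_threshold = threshold_from_distributions(
    [0.1, 0.3, 0.4, 0.7], lambda p: [p, 1.0 - p], 0.0)
```

The resolution of the result is limited by the grid, so a finer grid or a bisection around the last accepted point would be needed for accurate values.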
