Competing selective sweeps


Sebastian Bossert

Dissertation submitted for the doctoral degree of the Faculty of Mathematics and Physics of the Albert-Ludwigs-Universität Freiburg im Breisgau


Abstract

The principle of natural selection, characterised by Darwin and Wallace in 1858, completely and radically changed the way of thinking about the development of species. They were the first to point out that certain traits improve an individual's chance to survive and reproduce and are therefore inherited more frequently. Scientific progress and technical developments have since led to the identification of the underlying molecular mechanisms and to the decoding of DNA, the biopolymer that stores the genetic information of an individual. This progress enables scientists today to search for beneficial traits at the molecular level. To do so, they apply detection tools that make use of the existing DNA variation in a population. Such tools are often developed using theoretical models, and they try to find special DNA patterns indicating that selection might have acted in the respective DNA region. A pioneering theoretical approach that contributed to the development of the first detection tools was presented by Maynard Smith and Haigh in 1974. They analysed, in a theoretical framework, the process by which a new, strongly selected advantageous mutation becomes fixed in a population. Under the assumption of an otherwise neutral panmictic population, such a (single hard) selective sweep leads to a reduction of diversity around the selected locus. In the following years other scientists identified further properties that characterise such an evolution, e.g. an increased number of low- and high-frequency variants. In addition, the question arose to what extent these characteristics still hold when certain assumptions are modified. Such approaches and extensions are necessary to be able to identify different kinds of selective mechanisms and evolutions. In this work a situation is examined in which two selective sweeps within a narrow genomic region overlap in a sexually evolving population.
For such a competing sweeps situation, a mathematical model based on reasonable biological assumptions is first set up to identify what kinds of evolution can occur. Of particular interest is the probability that both beneficial alleles fix in cases where these alleles are not initially linked. To handle this question a graphical tool, the ancestral selection recombination graph, is used, which is based on a genealogical view of the population. This approach provides a limit result (for large selection coefficients) for the probability that both beneficial mutations eventually fix and enables us to analyse the role of selection, recombination and the population size. In particular, we establish that under certain starting conditions the fixation probability depends heavily on the population size. Although we mainly restrict ourselves to panmictic populations, we argue heuristically how the presented results can be extended to island models. The analytical examination is complemented by a simulation study. On the one hand, we analyse to what extent the derived limit formulas are suited for large finite populations. On the other hand, simulations are conducted to identify possible signatures of the considered scenario. These simulations show that fixation of both beneficial alleles leads, in the region between the two selected loci, to altered patterns compared to the single sweep case. The theoretical results suggest that this is attributable to the fact that in a competing sweeps setting different founders can appear and contribute to the genomic variation at fixation time. Altogether, the theoretical examination as well as the simulation study indicate that competing selective sweeps might be an explanation for strong haplotype patterns found in SNP data of Drosophila melanogaster.

Declaration of contributions as a co-author

This dissertation presents the results of my doctoral research from August 2012 until February. The research has been conducted in collaboration with other scientists. The thesis consists of two main parts, an analytical part and a corresponding simulation study. The first part, entitled "Fixation probabilities in a competing sweeps scenario", was supervised by Prof. Dr. Peter Pfaffelhuber. It presents theoretical limit results developed together with Prof. Dr. Peter Pfaffelhuber. I wrote the manuscript and proved the theorems. The second part builds on the analytical results and presents simulation results. Under the supervision of Prof. Dr. Peter Pfaffelhuber, I wrote the program to simulate data for the Wright-Fisher population. In this part additional simulation programs and statistics from other authors were used. The use of these programs is clearly indicated. I evaluated the obtained data and wrote the manuscript.

Publication of another topic

Besides the work on the dissertation, the following publication was developed in collaboration with Prof. Dr. Peter Pfaffelhuber: Bossert, Sebastian and Peter Pfaffelhuber. The Yule approximation for the site frequency spectrum after a selective sweep. PLoS ONE, 9(1). This publication grew out of my Diploma thesis entitled "Das Frequenzspektrum nach einem Selective Sweep" (The frequency spectrum after a selective sweep) and is not presented in this thesis.


Contents

1 General introduction
2 Introduction to competing selective sweeps
   Models
   Heuristics on the event of fixation
3 Fixation probabilities in a competing sweeps scenario
   Main result
   Proof
   Diffusion to ASRG
   Inside the ASRG
   Supporting lemmata
4 Simulation results
   Fixation probabilities
      Methods
      Comparison results
      Extensions
   Signatures for the fixation of the double beneficial type
      Methods
      Statistical results
5 Discussion of results
   Accuracy and validity of the limit result
   Fixation signatures
6 Conclusions and Outlook
A Appendix
   A.1 Addenda to the proof of Theorem
   A.2 Additional figures
   A.3 Extensions of the Wright-Fisher model
   A.4 Additional statistical results
   A.5 Contents of the supplemental CD
List of Figures
List of Tables
Bibliography

Chapter 1 General introduction

It is by now well known that the blueprint of every individual is stored in its DNA. During reproduction the DNA is copied, and modifications can occur in the process. Some of these modifications improve an individual's chances to survive and reproduce. While Darwin, the first scientist to describe this mechanism, could only see the outcome, and only when the modifications led to a change in the visible structure, nowadays (nearly) all molecular DNA information is available. O'Connell described the impact of Darwin's discovery metaphorically in the following words:

"One of the most exciting detective stories in modern times began with Darwin's theory of evolution." (O'Connell in Donnelly and Tavaré (1997))

And indeed one can interpret the progress in genetics as an increasingly successful search for evidence of evolution and the mechanisms that underlie it. For the analysis only data from today (or from some years ago) are available, while the causes lie far back. Figuratively speaking, the detective work consists in analysing and interpreting the present information to draw inferences about the past evolution that underlies the examined data. At first glance this investigation seems manageable, since there are essentially only four forces (mutation, selection, recombination, genetic drift) affecting a population's evolution. On closer inspection, however, these factors come in a huge diversity of forms and, furthermore, influence each other strongly. On the one hand these characteristics impede evolutionary research; on the other hand they contribute to the unique attractiveness of evolutionary biology. The tricky cases make for a fascinating detective story. Since evolutionary processes unfold over hundreds or thousands of generations, mathematical models have turned out to help derive knowledge about evolutionary events and their likelihood.
Such models were first developed by Haldane (1927), Fisher (1930), and Wright (1931) and helped to resolve the discrepancies between Darwin's evolutionary theory and Mendelian inheritance. With their work they made a decisive contribution to the standard model of evolution, the modern evolutionary synthesis (see e.g. Dobzhansky, 1937; Huxley, 1942; Mayr, 1942). During the past decades both the technical and the analytical tools in genetics have improved enormously. Crucial stages of the technological development were the first DNA sequencing by Sanger et al. (1977) and the complete sequencing of the human genome by the Human Genome Project (International Human Genome Sequencing Consortium, 2004). These days second-generation sequencing techniques (e.g. Illumina sequencing) are typically used, which have accelerated DNA sequencing once again while at the same time reducing its costs. On the analytical side, primarily the Moran model (Moran, 1958) and the introduction of the genealogical view, first recorded in Kingman's coalescent (Kingman, 1982), are to be mentioned. When based on reasonable assumptions, such mathematical models can yield helpful statistical methods, which deepen the understanding of the complex interrelationships between the evolutionary forces. This link between the theoretical and the practical side was put in a nutshell by John H. Gillespie (Gillespie, 2010):

"Population geneticists spend most of their time doing one of two things: describing the genetic structure of populations or theorizing on the evolutionary forces acting on populations. On a good day, these two activities mesh and true insights emerge."

Selection is the most important evolutionary force for shifts of allele frequencies in large populations and leads to a better adaptation to the environment. Therefore one major challenge in evolutionary theory is to identify genomic regions that appear to be affected by selection. Here one can distinguish between macro- and micro-evolutionary levels. While macro-evolutionary studies focus on selective events in the deep past, based on comparisons between different species, micro-evolutionary studies refer to smaller, more recent evolutionary changes within a population. Such micro-evolutionary studies are mainly based on statistical measures derived from single nucleotide polymorphism (SNP) variation data, where a SNP is a DNA sequence variation in which a single nucleotide differs between individuals of a population. In a follow-up study the functional consequences induced by the selected allele need to be characterized. By now there is a multitude of statistical methods for the identification of candidate regions under selection, which are reviewed e.g. in Vitti et al. (2013) or Wollstein and Stephan (2015). Some of these statistics will be treated in more detail in Section 4.2 and Section 5.2.
One reason for the large variety of statistics is that selection itself acts in a considerable variety of ways. An overview of the different forms of selection is given in Hartl and Clark (1997). In the following we concentrate on one type of selection, namely strong directional selection, and first review the theoretical achievements in this field. When a single beneficial mutation arises and rapidly sweeps to fixation, this process is called a selective sweep. In a seminal paper Maynard Smith and Haigh (1974) determined the consequences of such a process for the genomic region surrounding the selected locus. Under the assumption of an otherwise neutral population, a population-wide reduction in genetic diversity can be detected in the considered region. In the past 40 years many mathematical publications have built on this work and supplemented its findings. Two standard techniques for studying this fixation process are diffusion processes and coalescent processes (see e.g. Ewens, 2004, Chap. 5 or Patwa and Wahl, 2008). Using such approaches, Kaplan et al. (1989), Stephan et al. (1992), Braverman et al. (1995), and Etheridge et al. (2006) have analysed the whole fixation process. Besides the reduction in genetic diversity, the region neighbouring the selected locus has further characteristics. Fay and Wu (2000) were the first to report a U-shape of the frequency spectrum as a consequence of a selective sweep. A third feature is an elevated linkage disequilibrium score (LD score) on both sides of the selected locus, but not across the selected site (Stephan et al., 2006; Pfaffelhuber et al., 2008). LD is the non-random association of alleles at different loci and can be quantified by different scores. Based on these theoretical findings, statistical methods that are sensitive to this kind of signal were identified and developed (e.g. SweepFinder, SweeD, and OmegaPlus).
These methods have been applied successfully in numerous studies and again link theory and practice. In this way, signals of selective sweeps have been identified, for example, in Drosophila (Voigt et al., 2015), Arabidopsis (Huber et al., 2014), humans (Sabeti et al., 2007), and influenza (Strelkowa and Lässig, 2012).
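
The linkage disequilibrium mentioned above can be made concrete with a small sketch. It computes two widely used scores, the coefficient D and the squared correlation r², from haplotype counts at two biallelic loci. The function name and the allele labels are hypothetical and chosen only for illustration; the scores actually used later in the thesis may differ.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic loci from haplotype counts.

    The four arguments are the numbers of sampled haplotypes carrying
    each allele combination (illustrative labels, not the thesis notation).
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n  # frequency of allele B at the second locus
    # D measures the deviation from independent association of A and B.
    D = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    # r^2 is undefined if one of the loci is monomorphic in the sample.
    r_squared = D * D / denom if denom > 0 else float("nan")
    return D, r_squared


if __name__ == "__main__":
    print(linkage_disequilibrium(50, 0, 0, 50))    # complete association
    print(linkage_disequilibrium(25, 25, 25, 25))  # no association
```

A value of D equal to zero corresponds to independent association of the alleles at the two loci; r² rescales D so that it lies between 0 and 1.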

At this point a crucial aspect of interpreting the statistical results must be emphasized. The basic theoretical selection model assumes a special situation in which evolutionary forces like population structure or further beneficial (or deleterious) mutations are excluded. However, such forces almost certainly act in real populations, and they influence the evolutionary process and thereby the variation pattern. Hence, when the assumptions of the special (hard) selective sweep situation are not fulfilled, the statistics around a beneficial locus can change and the considered approach must be adapted. To take this issue into account, many authors have extended the basic model in various directions to obtain predictions for sequence diversity in these cases. For example, Pennings and Hermisson (2006a; 2006b) analysed the selective process under altered conditions: they replaced the assumption that exactly one mutant starts the sweep by the assumption that the sweep is based on standing genetic variation. Santiago and Caballero (2005) examined the effect of genetic hitchhiking in subdivided populations, and Kim and Stephan (2000) investigated the joint effect of background selection and genetic hitchhiking. Authors like Teshima and Przeworski (2006) or Ewing et al. (2011) have taken directional selection with different forms of dominance, such as recessive beneficial alleles, into account. The list of extensions is virtually endless. Nevertheless, the models often consist of only small adjustments and expansions of the basic model. So one might ask why only such small changes in the assumptions are made, rather than a combination of different changes that yields a realistic scenario. The answer is that even such small changes in the assumptions are hard enough to handle and analyse theoretically.
Hence more realistic scenarios have to be approached step by step, and at best every model extension and derived theoretical result helps to improve the understanding of the complex interaction of the evolutionary forces. In this thesis the following extension of the basic sweep model is considered: a situation is analysed in which two selective sweeps overlap. This means that a further beneficial mutation is assumed to arise at a different locus while the sweep of the first beneficial mutant is still in progress. Here the interaction between the different selected types and their loci must be incorporated, which leads to a considerable increase in complexity. Since more than one locus is under consideration, differences between asexual and sexual populations must be taken into account. In a sexual population the allelic types at two different loci can be inherited from different individuals (namely one from each parent) due to recombination events. Fisher (1930) and Muller (1932) were the first to point out that such recombination events can bring beneficial mutations onto the same genetic background. This can result in an evolutionary advantage compared to asexually evolving populations: in asexual populations, when different beneficial alleles appear on different backgrounds, only one of these alleles can fix in the whole population. Later, Hill and Robertson (1966) gave a theoretical reasoning for this evolutionary advantage (including simulations). They pointed out that selectively beneficial alleles at linked loci interfere with each other and that recombination can affect the fixation of the alleles by bringing together alleles that were not initially linked. Hence in such a situation the recombination probability between the beneficial loci plays a major role for the fixation probability.
While verbal arguments for the role of recombination in a situation with interfering selected alleles are easily given, a concrete analysis has to struggle with some difficulties (Taylor, 2007). Generally such a situation can be described as follows: a beneficial mutation, called A, with selective advantage $s_1$ is undergoing a selective sweep when a second beneficial mutation B (with selective advantage $s_2$) arises at a linked locus (on the wild-type background). An evolutionary model with two beneficial alleles has already been analysed by different authors, such as Stephan (1995), Barton (1995), Otto and Barton (1997), Chevin et al. (2008), Yu and Etheridge (2010), and Cuthbertson et al. (2012). Stephan (1995) investigates a two-locus, two-allele model with additive, directional selection and recombination, formulated in terms of a four-dimensional ordinary differential equation. Results on stochastic models for a panmictic population focus on fixation probabilities. Here panmictic means a population without any spatial structure and therefore with completely random mating.

Barton (1995) and Otto and Barton (1997) consider the case where the selective advantage $s_1$ of the first mutant A is larger than the advantage $s_2$ of the second positively selected mutant B and use a stochastic model to analyse the fixation probability. In that case the probability of fixation of the double mutant depends on the ratio $s_1/s_2$ of the selection coefficients and on the ratio between the recombination parameter $r$ and the selective advantage $s_1$, but only weakly on the effective population size. Yu and Etheridge (2010) and Cuthbertson et al. (2012) complement this with an analysis of the fixation probability when the first selection coefficient is smaller than the second ($s_2 > s_1$). They use various comparisons between the development of the different alleles and deterministic logistic growth curves. Furthermore, they use an ordinary differential equation for the spread of AB alleles (individuals with both beneficial alleles) to obtain an approximation for the fixation probability. Overall they observe a rather different behaviour in that case: now the fixation probability depends on the population size.

The two situations $s_2 > s_1$ and $s_1 > s_2$ need different approaches to obtain good approximation results. The reason is that the development of the first beneficial allele A can be analysed independently of the development of the B allele in the situation $s_1 > s_2$, while it depends heavily on this development in the situation $s_2 > s_1$. Therefore the approximation steps of Barton and of Cuthbertson et al. are each appropriate in only one of the two situations.
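
The deterministic logistic growth curves referred to above can be illustrated with a minimal sketch. It assumes the standard continuous-time dynamics dx/dt = s·x·(1 − x) for the frequency x of a single allele with advantage s in a haploid population; this parametrisation and both function names are assumptions made for illustration and need not match the exact curves used by the cited authors.

```python
import math


def logistic_frequency(x0, s, t):
    """Frequency at time t of an allele with advantage s, started at x0.

    Closed-form solution of dx/dt = s * x * (1 - x) (assumed dynamics).
    """
    growth = math.exp(s * t)
    return x0 * growth / (1.0 - x0 + x0 * growth)


def time_between(x0, x1, s):
    """Time the deterministic curve needs to rise from frequency x0 to x1."""
    return math.log(x1 * (1.0 - x0) / (x0 * (1.0 - x1))) / s


if __name__ == "__main__":
    N, s = 10_000, 0.05
    # Rising from one copy (1/N) to all but one copy takes 2*log(N-1)/s.
    print(time_between(1.0 / N, 1.0 - 1.0 / N, s))
    print(2.0 * math.log(N - 1) / s)
```

Under these assumptions the duration of a sweep grows only logarithmically in the population size N and is inversely proportional to s, which is why a second mutation with larger advantage can catch up with an ongoing sweep.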
In the last decades a new tool, which can be placed between practical real data and mathematical calculation, has become mainstream in population genetics. Simulation studies are model based and provide results where theoretical calculations are (so far) too challenging. Such studies have helped to gain further insight into evolutionary processes. We have already mentioned an early elementary simulation study by Hill and Robertson in 1966. Thanks to the technical advances in hardware and software, the simulation tools used today are much more mature and can simulate whole chromosomes over a multitude of generations. On the one hand, such simulations can be used for comparisons with real data, e.g. to infer the demographic history of a population. On the other hand, they allow the detailed analysis of special evolutionary scenarios and their influence on genetic patterns. For a review of current simulation tools for selection and their features we refer to Bank et al. (2014). The competing selective sweep situation was treated in a large simulation study by Chevin et al. (2008), which revealed interesting outcomes. These authors analysed different statistics like Tajima's D and the frequency spectrum at the moment of fixation of the double beneficial mutant. They pointed out that these statistics can look very different compared to a single selective sweep scenario. Furthermore, they argue that the case $s_2 > s_1$ is more likely to be encountered in real data exhibiting fixation at both selected loci, since in such a situation a smaller recombination probability is sufficient for a chance of fixation.

In this work the competing sweeps situation is treated comprehensively. Limit results for the fixation probability in the case $s_2 > s_1$ are given, based on novel analytical methods. So basically a situation similar to that in Cuthbertson et al. (2012) is studied, leading to comparable conclusions about the importance and influence of the selection and recombination coefficients. In contrast to their approach, however, we start with a diffusion model and use a combination of the ancestral selection graph and the ancestral recombination graph as a key tool for the proof of an analytical formula for the fixation probability of the double beneficial type AB. Besides these mathematical calculations, extensive simulations of the competing sweeps situation with different parameter choices for selection, recombination etc. were carried out.

These simulations are used to illustrate to what extent the limit results are suited for large finite populations. In addition, whole chromosomes of a model population were simulated, conditioned on the fixation of the double beneficial type, to analyse typical descriptive and inductive statistics at the moment of fixation. This study supplements the simulation results of Chevin et al. (2008) by linkage disequilibrium based statistics and up-to-date outlier tests for selection.

The thesis is organized as follows. Chapter 2 introduces the well-known population genetics models (Moran model and Wright-Fisher model) in a competing sweeps situation and the corresponding diffusion system. In its second part, heuristic arguments for possible evolutions are given based on these models. In the course of this analysis the results of Otto and Barton (1997) and Cuthbertson et al. (2012) are described in more detail. Chapter 3 presents the mathematical limit result for the fixation probability of the double beneficial type based on the diffusion system. Here special starting conditions and parameter ratios are assumed to obtain non-trivial results. The results highlight the influence of the different parameters, such as selection, recombination or population size, on the fixation probability. Furthermore, the graphical construction used gives insight into the genealogy of the different types and the expected number of recombinations leading to the double beneficial type. Chapter 4 contains the simulations. First, the simulations that check the validity of the limit result are presented. After this comparison some extensions of the simulations are described and discussed. In the second part the simulation software SLiM (Messer, 2013) is used for a whole chromosome study of the competing sweeps situation.
This simulation serves mainly to identify possible signatures of such a scenario and to examine whether standard statistics like Tajima's D are able to detect selection in such a scenario. Chapter 5 discusses both the mathematical and the simulation results. One important issue is whether the considered scenario is possible and conceivable in natural populations. The other important question concerns possible signatures of competing selective sweeps. To what extent do common statistics differ in a competing sweeps scenario compared to a classical selective sweep? How do the scores change in specific regions? Chapter 6 deals with further prospects and possible extensions of the considered scenario. For example, the analytical challenges of more general models are highlighted, potential additions to the simulation study are discussed, and possible applications to natural populations are debated.

Chapter 2 Introduction to competing selective sweeps

For a single beneficial mutant in a large diploid (otherwise neutral) population, the first (and still useful) approximation for the fixation probability, due to Haldane (1927), is more than 80 years old. The situation is completely different when selective sweeps are assumed to overlap. Here the interaction between the different selected types and their loci complicates the calculation of fixation probabilities. Therefore it took until the 1990s for workable approximation results in such an overlap scenario to be published. Before we comment on these results, models for the competing sweeps situation and their underlying assumptions are presented. We will use particular settings of the models, which we specify in the following. The differences to the models used for previously published approximation results will be discussed in Section 5.1.

In this thesis two of the most popular models in population genetics will be used, the Moran model and the Wright-Fisher model. In both models we assume a fixed population size N. The main difference between the two models is that populations evolve in discrete time in the Wright-Fisher model, whereas the Moran model takes a continuous time approach. In the Wright-Fisher model each individual of the new generation picks a parent from the previous generation according to a certain probability distribution. This procedure leads to fast changes in the population and explains the popularity of this model for simulations. On the other hand, analytical results are often difficult to derive in the Wright-Fisher model and easier to handle in a continuous context. In the Moran model reproduction events happen at certain rates. At each event only two different actions may happen: either one individual reproduces and replaces another one, or two individuals recombine and replace another one.
Under proper rescaling of the parameters both models converge, as $N \to \infty$, to a similar limit process. We will formulate our main result for this limit diffusion. The finite population models are introduced for several reasons. For instance, the reproduction process of the Moran model plays an important role in the proof of the main result. Hence, the Moran model serves as a visualization that helps to understand the evolution of the population viewed both forward and backward in time. The Wright-Fisher model presented here will be used for simulations to check the applicability of the limit results to large finite populations. Besides, the presentation of both models simplifies the comparison with other approximation results.
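
The single-sweep approximation attributed to Haldane (1927) at the start of this chapter can be checked numerically with a short sketch. It rests on an assumption that is not spelled out in the text: the mutant lineage is modelled as a branching process in which each mutant leaves a Poisson number of offspring with mean 1 + s. The survival probability p then solves 1 − p = exp(−(1 + s)·p) and is close to the classical value 2s for small s. The function name is hypothetical.

```python
import math


def survival_probability(s, iterations=200):
    """Survival probability of a branching process with Poisson(1+s) offspring.

    Solves 1 - p = exp(-(1 + s) * p) for the positive root by bisection.
    The Poisson offspring law is an assumption made for this sketch.
    """
    if s <= 0:
        return 0.0  # a (sub)critical process dies out almost surely

    def g(p):
        # g(p) = 1 - p - exp(-(1+s)p), written with expm1 for accuracy.
        return -p - math.expm1(-(1.0 + s) * p)

    lo, hi = min(s, 0.5), 1.0  # g(lo) > 0 and g(hi) < 0 for every s > 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for s in (0.001, 0.01, 0.1):
        print(s, survival_probability(s), 2 * s)
```

The printed values show that 2s slightly overestimates the branching-process survival probability and that the relative difference shrinks as s becomes small.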

2.1 Models

In contrast to asexually reproducing populations, in sexually reproducing ones chromosomes are mostly not passed down as intact units. Through the mechanism of meiosis every parent passes on one chromosome of its diploid chromosome pair. These two haploid chromosomes are fused to form a new diploid pair in the offspring. During this process recombination acts, which can have the effect that the allelic types at two distinct loci on one chromosome are inherited one from each parent. In particular, individuals with completely new allele combinations can be formed, and beneficial (or adverse) alleles can be brought together on one chromosome through this mechanism. As is customary, this complex recombination process is incorporated into models in a simplified way (see e.g. Ewens, 2004). In our context, where only two loci matter, we only need to specify how likely it is that both loci are inherited from the same parent or one from each parent, respectively. Hence, when this probability is assumed to be constant over time, only one recombination parameter $r$ is needed in the theoretical model.

Moreover, it is common to consider a haploid model of size 2N instead of a diploid model of size N when mainly the development of allelic frequencies is analysed. This simplification is appropriate when a dominance coefficient of 1/2 is assumed for the beneficial alleles and large populations are under consideration. We will also make use of this conversion, since we concentrate on this standard case throughout the whole thesis. The justification for this simplification is that the differences between the two models are negligible when the population size tends to infinity. More detailed mathematical arguments can be found in the standard population genetics literature, e.g. Durrett (2008). To simplify notation we will even replace the haploid population size 2N by N.
Consequently our models refer to diploid models of size N/2, and we assume implicitly that N is even. In this context it should be noted that in the following we will speak, somewhat inaccurately, of N individuals in the haploid models, although this technically refers to only one part of the diploid genome of an individual. This imprecision shortens notation.

Each individual in the (haploid) population carries an allele combination with either allele a or A and either allele b or B. Only this allele combination matters for the fitness of an individual. Here capital letters represent beneficial alleles and small letters the wild-type alleles. It is assumed that individuals which possess both beneficial alleles also have a fitness advantage. So altogether we have to distinguish between four different types, the complete wild-type ab and the beneficial types Ab, aB, and AB. These beneficial types have selective advantages $s_{Ab}$, $s_{aB}$, and $s_{AB}$ with $s_{Ab}, s_{aB}, s_{AB} > 0$. To facilitate the differentiation we assign a number to every type: Ab is denoted by 1, aB by 2 and type AB by 3. The wild-type has a special role and is labelled 0, since this type has no selective advantage ($s_{ab} = 0$). After this preparation the discrete Wright-Fisher model with two selected loci can be introduced.

Definition 2.1 (Wright-Fisher model). We consider a panmictic population with N (haploid) individuals, which evolves in discrete generations. The four possible types of the individuals are ab, Ab, aB and AB. The selective advantages are given by $s_{ab} = 0$ and $s_{Ab}, s_{aB}, s_{AB} > 0$. In order to obtain the (t+1)-st generation from the t-th, the following steps are performed.

Reproduction: Assume that in generation t, $n^t_{ab}$ individuals are of type ab, $n^t_{Ab}$ individuals are of type Ab, $n^t_{aB}$ individuals are of type aB and $n^t_{AB}$ individuals are of type AB. Then a parent of type ij with $i \in \{a, A\}$ and $j \in \{b, B\}$ is chosen with probability

$$\frac{n^t_{ij}\,(1+s_{ij})}{\sum_{i' \in \{a,A\},\, j' \in \{b,B\}} n^t_{i'j'}\,(1+s_{i'j'})}. \tag{2.1}$$

Recombination: With probability $1-r$ no recombination happens and the offspring is of the same type $ij$ as the parent. Otherwise (with probability $r$) a second individual of type $kl$ with $k \in \{a,A\}$ and $l \in \{b,B\}$ is chosen according to the probabilities of Eq. (2.1), and the offspring is generated as a combination of these two individuals. In this case the descendant is of type $il$ with probability $1/2$ and of type $kj$ with probability $1/2$.

Remark 2.2. i) There are various other ways to define the generational transition in the Wright-Fisher model. One example is to swap the reproduction and recombination steps, so that in the case of a recombination event the individuals are chosen according to the frequencies in generation $t$. As long as the selection coefficients and the recombination probability are small, this modification leads only to negligible differences in the evolution of the population (cf. Ewens, 2004). In particular the differences vanish in the limit $N \to \infty$ under the assumptions that the product of recombination probability and population size, $r_N N$, and the products of the selection coefficients and population size, $N s_i^N$ for $i \in \{1,2,3\}$, converge to constants.

ii) We have deliberately not included a starting value in the definition. This topic will be discussed later.

In the following we will assume large population sizes $N$, small positive recombination probabilities of order $O(1/N)$ and strongly selected mutations. So random drift plays only a minor role and selection is the driving force for considerable frequency changes. The case of only weakly selected alleles was analysed e.g. by McVean and Charlesworth (2000) and Gillespie (2001). The special feature of the recombination step is that it can create any of the four types, even if this type did not exist in the previous generation. This is of course only possible if different alleles are available in the population.
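The generational transition of Definition 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the encoding of the types as indices 0 to 3 (following the numbering $ab = 0$, $Ab = 1$, $aB = 2$, $AB = 3$) and the assumption that the $N$ offspring are drawn independently of each other (so that the new generation is multinomially distributed) are choices made here and are not taken from the text.

```python
import numpy as np

# Type index: 0 = ab, 1 = Ab, 2 = aB, 3 = AB (numbering used in the text).
# A type is encoded by its alleles (i, j): i = 1 means A, j = 1 means B.

def type_index(i, j):
    return i + 2 * j

def parent_probs(counts, s):
    """Eq. (2.1): probability of drawing a parent of each type."""
    w = np.asarray(counts, dtype=float) * (1.0 + np.asarray(s, dtype=float))
    return w / w.sum()

def offspring_probs(counts, s, r):
    """Type distribution of one offspring after reproduction and recombination."""
    p = parent_probs(counts, s)
    q = (1.0 - r) * p  # no recombination: the offspring copies its parent
    for first in range(4):
        i, j = first % 2, first // 2
        for second in range(4):
            k, l = second % 2, second // 2
            weight = r * p[first] * p[second] / 2.0
            q[type_index(i, l)] += weight  # offspring of type il
            q[type_index(k, j)] += weight  # offspring of type kj
    return q

def next_generation(counts, s, r, rng):
    """Assumption: the N offspring are drawn independently (multinomial sampling)."""
    N = int(np.sum(counts))
    return rng.multinomial(N, offspring_probs(counts, s, r))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    counts = np.array([0, 50, 50, 0])  # only Ab and aB present
    s = [0.0, 0.1, 0.2, 0.3]           # hypothetical selection coefficients
    print(next_generation(counts, s, r=0.01, rng=rng))
```

Starting from a population that contains only $Ab$ and $aB$, the probability that a single offspring is of type $AB$ is $r$ times the product of the two parent probabilities, which illustrates that a recombination event is needed to create a missing type.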
Once only one type is present in the population, meaning that all $N$ individuals are of the same type, no further changes are possible in the following generations. We call this event fixation, and the first time at which only one type exists is called the fixation time. We are interested in the fixation probability of type $AB$ under certain starting conditions. These starting conditions include that type $AB$ is not present in the population at the beginning. Hence (at least) one recombination event is needed so that fixation of $AB$ can occur. Furthermore, the fixation probability is only non-trivial as long as type $AB$ has the highest fitness advantage. Otherwise $AB$ is dominated by one of the other selected types and its fixation is extremely unlikely. So we assume that $s_{AB} > \max(s_{Ab}, s_{aB})$. Without loss of generality we also choose $\max(s_{Ab}, s_{aB}) = s_{aB}$. Thereby the fitness order of the different types is determined.

Next the Moran model is defined. In this model basic reproduction events and selective reproduction events are distinguished. Here the number coding of the selective advantages is used ($s_{Ab} = s_1$, $s_{aB} = s_2$ and $s_{AB} = s_3$).

Definition 2.3 (Moran model). We consider a haploid panmictic population of size $N$. Each individual is of one of the types $ab$, $Ab$, $aB$ or $AB$. For $t \in \mathbb{R}_+$ denote by $n^t_{ab}$ the number of individuals of type $ab$, by $n^t_{Ab}$ the number of individuals of type $Ab$, by $n^t_{aB}$ the number of individuals of type $aB$, and by $n^t_{AB}$ the number of individuals of type $AB$ at time $t$. The following events change the state of the process.

Reproduction: Every individual reproduces at rate $1/2$. In such a case a second individual (which might be the same individual as the first) is chosen at random from the population. This second individual dies and the first one splits into two.

Selective reproduction: Individuals of type $Ab$ reproduce additionally at rate $s_1$, individuals of type $aB$ additionally at rate $s_2$, and individuals of type $AB$ additionally at rate $s_3$. Here again a second individual is chosen at random and gets replaced.

Recombination: Every individual initiates a recombination event at rate $r$. Then a second individual is chosen at random and replaced by a new individual. When the first individual is of type $ij$ and the second one of type $kl$ with $i,k \in \{a,A\}$ and $j,l \in \{b,B\}$, then the new individual is of type $il$ with probability $1/2$ and of type $kj$ otherwise.

Such a model is best visualized by a graphical representation (see Fig. 2.1). Time runs from the bottom to the top. The unnumbered arrows denote basic reproduction events. Every individual sends such an arrow at rate $1/2$. All lines (including the line itself) are equally likely to be chosen as the tip of this arrow. Then the individual at the tip is replaced by an offspring of the individual at the tail. The additional selective reproduction events can be integrated in the graph as follows. Every line sends selection arrows at rate $s_3$, where again the tip is placed at random. Such an arrow gets the label 1 with probability $s_1/s_3$, the label 2 with probability $(s_2 - s_1)/s_3$ and the label 3 otherwise. Only individuals whose type number is equal to or higher than the label of the arrow can use such an arrow and place an offspring on its tip. In doing so, the selective reproduction rate of all individuals is as described in Def. 2.3: a type with number $k$ can use exactly the arrows with label at most $k$, which occur at rate $s_3 \cdot s_k/s_3 = s_k$. In Fig. 2.1 the arrow labelled 1 at the bottom (in the middle) cannot be used by type $ab$ since this type has no fitness advantage, for which reason this arrow is dashed. In contrast, the arrow labelled 2 can be used by type $aB$ since its position in the fitness ranking is 2. The arrows labelled $a$ and $b$ represent recombination events. Every line sends such a recombination arrow at rate $r$ and the tip is again chosen at random.
Then the individual present on the line at the tip is replaced by an individual with an allele combination from the individuals at the tail and at the tip. If the arrow is labelled $a$ (which happens with probability $1/2$), then the new individual gets its $a$ locus from the individual at the tail and its $b$ locus from the individual at the tip. Hence in the given example the arrow labelled $a$ leads to an individual of type $AB$ on the third line. If the arrow is labelled $b$, then it is the other way round and the $b$ locus comes from the individual at the tail. In keeping with these rules the types at the top can be identified when the types at the bottom are given.

Figure 2.1: Graphical representation of the Moran model: The unnumbered arrows characterise resampling events, while the numbered arrows specify selection events. The arrows labelled $a$ or $b$ refer to recombination events. Dashed arrows indicate that the type at the tail cannot use this arrow (because its fitness rank is too small).

Since the fixation probability is under examination, we are only interested in the number of individuals of the different types and not in the complex connections between them. For the given Moran model this development can be described by a (multidimensional) Markov jump process. Let $(N_{ij}(t))_{t \geq 0}$ be the number of lines of type $ij$, with $i \in \{a,A\}$ and $j \in \{b,B\}$, and let $r^+_{ij}$ and $r^-_{ij}$ be the rates at which type $ij$ increases and decreases by 1. Then, given $(N_{ab}(t), N_{Ab}(t), N_{aB}(t), N_{AB}(t)) = (n_{ab}, n_{Ab}, n_{aB}, n_{AB})$, the transition rates at time $t$ are given by

\begin{align*}
r^+_{ab} &= \frac{n_{ab}(N-n_{ab})}{2N} + \frac{r}{2N}\bigl(n_{ab}n_{Ab} + n_{ab}n_{aB} + 2n_{Ab}n_{aB}\bigr), \\
r^-_{ab} &= \frac{n_{ab}(N-n_{ab})}{2N} + \frac{s_1 n_{ab}n_{Ab}}{N} + \frac{s_2 n_{ab}n_{aB}}{N} + \frac{s_3 n_{ab}n_{AB}}{N} + \frac{r}{2N}\, n_{ab}\bigl(n_{Ab} + n_{aB} + 2n_{AB}\bigr), \\
r^+_{Ab} &= \frac{n_{Ab}(N-n_{Ab})}{2N} + \frac{s_1 n_{Ab}(N-n_{Ab})}{N} + \frac{r}{2N}\bigl(n_{Ab}n_{ab} + n_{Ab}n_{AB} + 2n_{AB}n_{ab}\bigr), \\
r^-_{Ab} &= \frac{n_{Ab}(N-n_{Ab})}{2N} + \frac{s_2 n_{Ab}n_{aB}}{N} + \frac{s_3 n_{Ab}n_{AB}}{N} + \frac{r}{2N}\bigl(n_{Ab}n_{ab} + n_{Ab}n_{AB} + 2n_{Ab}n_{aB}\bigr), \\
r^+_{aB} &= \frac{n_{aB}(N-n_{aB})}{2N} + \frac{s_2 n_{aB}(N-n_{aB})}{N} + \frac{r}{2N}\bigl(n_{aB}n_{ab} + n_{aB}n_{AB} + 2n_{AB}n_{ab}\bigr), \\
r^-_{aB} &= \frac{n_{aB}(N-n_{aB})}{2N} + \frac{s_1 n_{Ab}n_{aB}}{N} + \frac{s_3 n_{aB}n_{AB}}{N} + \frac{r}{2N}\bigl(n_{aB}n_{ab} + n_{aB}n_{AB} + 2n_{Ab}n_{aB}\bigr), \\
r^+_{AB} &= \frac{n_{AB}(N-n_{AB})}{2N} + \frac{s_3 n_{AB}(N-n_{AB})}{N} + \frac{r}{2N}\bigl(n_{aB}n_{AB} + n_{Ab}n_{AB} + 2n_{Ab}n_{aB}\bigr), \\
r^-_{AB} &= \frac{n_{AB}(N-n_{AB})}{2N} + \frac{s_1 n_{Ab}n_{AB}}{N} + \frac{s_2 n_{aB}n_{AB}}{N} + \frac{r}{2N}\bigl(n_{AB}n_{Ab} + n_{AB}n_{aB} + 2n_{ab}n_{AB}\bigr). \tag{2.2}
\end{align*}

Remark 2.4. i) In the Moran model a genome can have zero or two offspring (due to reproduction). Hence the Moran model can be viewed as a birth-death process. This feature makes the model analytically more tractable than the Wright-Fisher model, where a genome can theoretically have from 0 up to 2N descendants.

ii) In Cuthbertson et al. (2012) and Yu and Etheridge (2010) the dynamics of the system are slightly different. There the selective advantage is included in the standard resampling events via the probability of replacing another individual. Nevertheless both models lead to the same net rates $r^+_{ij} - r^-_{ij}$, if one bears in mind that in Cuthbertson et al. (2012) and Yu and Etheridge (2010) the total population size is $2N$ instead of $N$.

As already mentioned above, after proper rescaling both models, the Wright-Fisher model and the Moran model, lead to the same limit process for $N \to \infty$. This property holds very generally and is explained in detail in many fundamental books about population genetics, such as Ethier and Kurtz (1986), Ewens (2004) or Durrett (2008). Since the case with both selection and recombination is rarely treated, the connection in this special case is illustrated by some calculations.
Here we concentrate on the Moran model, where the verification is more illustrative.
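As a cross-check of the transition rates in Eq. (2.2), the rates can also be obtained mechanically by enumerating all ordered pairs of a first (reproducing or recombining) individual and a second (replaced) individual, following Def. 2.3. The sketch below does exactly that; the function name `moran_rates` and the encoding of the types as indices 0 to 3 ($ab = 0$, $Ab = 1$, $aB = 2$, $AB = 3$) are illustrative choices and not part of the text.

```python
def moran_rates(n, s, r):
    """Rates at which each type count jumps by +1 (up) and by -1 (down).

    n = (n_ab, n_Ab, n_aB, n_AB), s = (0, s_1, s_2, s_3), r = recombination rate.
    Types are encoded as index = i + 2*j with i = 1 for allele A, j = 1 for allele B.
    """
    N = sum(n)
    up, down = [0.0] * 4, [0.0] * 4

    def move(new, old, rate):
        # An individual of type `old` is replaced by one of type `new`.
        if new != old:
            up[new] += rate
            down[old] += rate

    for first in range(4):
        i, j = first % 2, first // 2
        for second in range(4):
            k, l = second % 2, second // 2
            # number of first individuals times chance that the tip has type `second`
            pairs = n[first] * n[second] / N
            move(first, second, (0.5 + s[first]) * pairs)  # neutral + selective reproduction
            move(i + 2 * l, second, 0.5 * r * pairs)       # recombinant of type il
            move(k + 2 * j, second, 0.5 * r * pairs)       # recombinant of type kj
    return up, down

if __name__ == "__main__":
    up, down = moran_rates((10, 20, 30, 40), (0.0, 0.1, 0.2, 0.3), r=0.05)
    print("r+ :", up)
    print("r- :", down)
```

Because every event replaces exactly one individual, the total upward and downward rates agree, and all rates vanish once a single type is fixed. Comparing the output with the closed-form expressions is a quick way to catch sign or index errors.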

To formulate the limit result, we have to index the recombination and selection parameters by the population size $N$. Furthermore, we will consider the frequencies of the different types instead of the total numbers. For this purpose the process

\[
(Y^N_{ab}(t), Y^N_{Ab}(t), Y^N_{aB}(t), Y^N_{AB}(t)) := (N_{ab}(t), N_{Ab}(t), N_{aB}(t), N_{AB}(t))/N
\]

is defined. Besides a rescaling of space, a rescaling of time is also needed. Denote

\[
(X^N_{ab}(t), X^N_{Ab}(t), X^N_{aB}(t), X^N_{AB}(t)) := (Y^N_{ab}(tN), Y^N_{Ab}(tN), Y^N_{aB}(tN), Y^N_{AB}(tN)).
\]

Under the assumptions $\lim_{N\to\infty} s_i^N N = \alpha_i$ for $i \in \{1,2,3\}$ and $\lim_{N\to\infty} N r_N = \rho$ one arrives at a multidimensional diffusion for $N \to \infty$.

Proposition 2.5 (Convergence to diffusion). Let $(X^N_{ab}(t), X^N_{Ab}(t), X^N_{aB}(t), X^N_{AB}(t))$ (described by Eq. (2.2)) be given, with $(X^N_{ab}(0), X^N_{Ab}(0), X^N_{aB}(0), X^N_{AB}(0)) \to (x_{ab}, x_{Ab}, x_{aB}, x_{AB})$ in distribution as $N \to \infty$, and let $\lim_{N\to\infty} N s_i^N = \alpha_i$ for $i \in \{1,2,3\}$ and $\lim_{N\to\infty} N r_N = \rho$. Then for $N \to \infty$ this system converges in distribution in $D(\mathbb{R}_+, [0,1]^4)$ with the Skorohod topology towards $(X_{ab}, X_{Ab}, X_{aB}, X_{AB})$, where $(X_{ab}, X_{Ab}, X_{aB}, X_{AB})$ is the solution of the stochastic differential equation

\begin{align*}
dX_{ab} &= \bigl[-\alpha_1 X_{ab}X_{Ab} - \alpha_2 X_{ab}X_{aB} - \alpha_3 X_{ab}X_{AB} + \rho\,(X_{Ab}X_{aB} - X_{ab}X_{AB})\bigr]\,dt \\
&\quad - \sqrt{X_{ab}X_{Ab}}\,dW_{01} - \sqrt{X_{ab}X_{aB}}\,dW_{02} - \sqrt{X_{ab}X_{AB}}\,dW_{03}, \\
dX_{Ab} &= \bigl[\alpha_1 X_{Ab}(1-X_{Ab}) - \alpha_2 X_{Ab}X_{aB} - \alpha_3 X_{Ab}X_{AB} + \rho\,(X_{AB}X_{ab} - X_{Ab}X_{aB})\bigr]\,dt \\
&\quad - \sqrt{X_{Ab}X_{ab}}\,dW_{10} - \sqrt{X_{Ab}X_{aB}}\,dW_{12} - \sqrt{X_{Ab}X_{AB}}\,dW_{13}, \\
dX_{aB} &= \bigl[\alpha_2 X_{aB}(1-X_{aB}) - \alpha_1 X_{Ab}X_{aB} - \alpha_3 X_{aB}X_{AB} + \rho\,(X_{AB}X_{ab} - X_{Ab}X_{aB})\bigr]\,dt \tag{2.3} \\
&\quad - \sqrt{X_{aB}X_{ab}}\,dW_{20} - \sqrt{X_{aB}X_{Ab}}\,dW_{21} - \sqrt{X_{aB}X_{AB}}\,dW_{23}, \\
dX_{AB} &= \bigl[\alpha_3 X_{AB}(1-X_{AB}) - \alpha_1 X_{Ab}X_{AB} - \alpha_2 X_{aB}X_{AB} + \rho\,(X_{Ab}X_{aB} - X_{ab}X_{AB})\bigr]\,dt \\
&\quad - \sqrt{X_{AB}X_{ab}}\,dW_{30} - \sqrt{X_{AB}X_{Ab}}\,dW_{31} - \sqrt{X_{AB}X_{aB}}\,dW_{32},
\end{align*}

where $(W_{kl})_{k>l}$ is a family of independent Brownian motions with $W_{kl} = -W_{lk}$ for $k \in \{1,2,3\}$ and $l \in \{0,1,2,3\}$, $X_{ab} + X_{Ab} + X_{aB} + X_{AB} = 1$, started in $(X_{ab}(0), X_{Ab}(0), X_{aB}(0), X_{AB}(0)) := x = (x_{ab}, x_{Ab}, x_{aB}, x_{AB})$.

Proof. The convergence of the starting value holds by assumption.
Due to classical results about convergence in distribution to diffusion processes (Ethier and Kurtz, 1986), the proof relies on the computation of the infinitesimal parameters and their convergence. We will not present all calculations in detail and concentrate on some characteristic rates. We analyse the drift and the covariance for $(X^N_{ab}(t-), X^N_{Ab}(t-), X^N_{aB}(t-), X^N_{AB}(t-)) = (x_{ab}, x_{Ab}, x_{aB}, x_{AB})$. The drift calculation for type $Ab$ is given by

\begin{align*}
\frac{d}{dt} E[X^N_{Ab}(t)] &= \frac{1}{N}\,(r^+_{Ab} - r^-_{Ab})\,N \\
&= s_1^N N\, x_{Ab}(1-x_{Ab}) - s_2^N N\, x_{Ab}x_{aB} - s_3^N N\, x_{Ab}x_{AB} + r_N N\,(x_{AB}x_{ab} - x_{Ab}x_{aB}) \\
&\xrightarrow{N\to\infty} \alpha_1 x_{Ab}(1-x_{Ab}) - \alpha_2 x_{Ab}x_{aB} - \alpha_3 x_{Ab}x_{AB} + \rho\,(x_{AB}x_{ab} - x_{Ab}x_{aB}).
\end{align*}

For the covariance terms between the different types, the convergence follows for instance for the types $ab$ and $Ab$ since

\[
\frac{d}{dt} E\bigl[(X^N_{ab}(t) - x_{ab})(X^N_{Ab}(t) - x_{Ab})\bigr] = -\frac{1}{N^2}\Bigl(N^2 x_{ab}x_{Ab} + s_1^N N^2 x_{ab}x_{Ab} + r_N N^2\Bigl(x_{Ab}x_{ab} + \frac{x_{Ab}x_{aB}}{2} + \frac{x_{AB}x_{ab}}{2}\Bigr)\Bigr).
\]

Because of $r_N = O(1/N)$ and $s_i^N = O(1/N)$ this leads to

\[
\frac{d}{dt} E\bigl[(X^N_{ab}(t) - x_{ab})(X^N_{Ab}(t) - x_{Ab})\bigr] = -x_{ab}x_{Ab} + O(1/N) \xrightarrow{N\to\infty} -x_{ab}x_{Ab}.
\]

The calculations for the other types and events are quite similar. Altogether the convergence follows using standard theory (see e.g. Karlin and Taylor, 1981, Chap. 15).

Remark 2.6. The drift terms of the diffusion system (2.3) consist of two parts, a selection part (composed of the terms with an $\alpha_i$ component) and a recombination part (composed of the terms with a $\rho$ component). The direction of the selection part for a given type depends on the configuration of the whole system. For example, conditioned on $X_{aB} = X_{AB} = 0$ the selection drift of $X_{Ab}$ is given by $\alpha_1 x_{Ab}(1-x_{Ab})$ and is strictly positive, while for high values of $X_{AB}$ the direction changes because of $\alpha_3 > \alpha_1$. The recombination drift shows a different dynamic, with the highest positive values when the considered type is absent. If for example $X_{AB} = 0$, then the recombination drift of type $AB$ is given by $\rho\, x_{Ab}x_{aB}$. This rate represents recombination events between types $Ab$ and $aB$ that lead to an increase of type $AB$, and it therefore plays an important role in the analysis of the fixation probability of type $AB$. For example in the extreme case $\rho = 0$, a situation without recombination, type $AB$ has no chance to escape from 0.

Naturally many features of the Moran model transfer to the diffusion system. When only one type is present in the population, no further changes happen afterwards. Formally, when $\max(X_{ab}(t), X_{Ab}(t), X_{aB}(t), X_{AB}(t)) = 1$, then for all times $s \geq t \geq 0$

\[
(X_{ab}(s), X_{Ab}(s), X_{aB}(s), X_{AB}(s)) = (X_{ab}(t), X_{Ab}(t), X_{aB}(t), X_{AB}(t)).
\]

So the process stays in one of the four states $(1,0,0,0)$, $(0,1,0,0)$, $(0,0,1,0)$ or $(0,0,0,1)$. We are interested in cases with fixation of type $AB$ and want to calculate their likelihood.
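The statements of Remark 2.6 about the drift of system (2.3) can be checked numerically. The following minimal sketch evaluates the drift vector for a given state; the function name `drift` and the example parameter values are hypothetical and chosen only for illustration.

```python
def drift(x, alpha, rho):
    """Drift vector of system (2.3) for the state x = (x_ab, x_Ab, x_aB, x_AB).

    alpha = (alpha_1, alpha_2, alpha_3) are the scaled selection coefficients,
    rho is the scaled recombination rate.
    """
    x0, x1, x2, x3 = x
    a1, a2, a3 = alpha
    # Recombination part: Ab and aB recombine into ab and AB, and vice versa.
    d = rho * (x1 * x2 - x0 * x3)
    return (
        -a1 * x0 * x1 - a2 * x0 * x2 - a3 * x0 * x3 + d,
        a1 * x1 * (1 - x1) - a2 * x1 * x2 - a3 * x1 * x3 - d,
        a2 * x2 * (1 - x2) - a1 * x1 * x2 - a3 * x2 * x3 - d,
        a3 * x3 * (1 - x3) - a1 * x1 * x3 - a2 * x2 * x3 + d,
    )

if __name__ == "__main__":
    x = (0.5, 0.3, 0.2, 0.0)   # type AB absent
    alpha = (10.0, 20.0, 30.0)  # hypothetical values with alpha_3 > alpha_2 > alpha_1
    print(drift(x, alpha, rho=2.0))
    print(drift(x, alpha, rho=0.0))
```

With type $AB$ absent, the last component reduces to $\rho\, x_{Ab} x_{aB}$, and it is zero for $\rho = 0$. The four components always sum to zero when the frequencies sum to one, in line with the constraint $X_{ab} + X_{Ab} + X_{aB} + X_{AB} = 1$.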
As explained, in such a situation it holds that $X_{AB}(\infty) = 1$. The probability of reaching a certain fixation state depends heavily on the starting situation of the system. Before this topic is discussed in detail we note that the convergence result of Proposition 2.5 carries over to the fixation probabilities.

Corollary 2.7. The probability of fixation of type $ij$ with $i \in \{a,A\}$ and $j \in \{b,B\}$ in the rescaled Moran model converges for $N \to \infty$ in probability to the fixation probability of this type in the diffusion system (2.3), when the starting value $X^N_0$ of the Moran model converges in distribution to the starting value $x$ of the diffusion. In formulas, using the notation $p^N_{ij}(\infty)$ for the event of fixation of type $ij$ in the Moran model,

\[
\lim_{N\to\infty} P_{X^N_0}\bigl(p^N_{ij}(\infty)\bigr) = P_x\bigl(X_{ij}(\infty) = 1\bigr).
\]

Proof. Since the evolution of the Moran model converges in distribution to the diffusion process according to Prop. 2.5, and the starting value of the Moran model converges, the convergence of the fixation probability is straightforward (see e.g. Ethier and Kurtz, 1986, Chap. 10, Cor. 2.7).

At this point we go back to the biological situation which shall be analysed with the presented models. This step back is taken to clarify the choice of the starting conditions. We want to understand the interaction of two strongly beneficial, partly linked alleles in a sexually reproducing population. These beneficial alleles appear by single mutation events. The case of recurrent beneficial mutation events is not treated here. We are interested in scenarios

where the beneficial alleles are not connected at first.¹ This means that they appear on different backgrounds at the beginning and only recombination can bring them together. When we neglect the biologically unrealistic case that the two mutations happen in exactly the same generation, there is a time interval in which only one beneficial type is present. During this phase the evolution is comparable to the evolution in a single sweep case. The beneficial type can survive or die out by chance. The latter is not interesting and we concentrate on situations with survival. Furthermore, we are only interested in situations with large selection coefficients. At some time point the second beneficial allele appears on a wild-type individual and the interesting part of the evolution starts. We want to start our analysis at this moment and try to calculate the fixation probability of both beneficial alleles. Hence in a Moran model the frequency of the second beneficial allele is $1/N$ at the beginning. Since this starting situation translates to a starting frequency $\delta \to 0$ of type $aB$ in the diffusion limit, the considered fixation probability has to be multiplied by a compensation term to get a proper limit result. This problem, as well as other required technical assumptions, will be treated later before the main result about the fixation probability is calculated.

Here one further biologically reasonable property of the starting situation is discussed. If we treat the arrival time of the second beneficial mutation as uniformly distributed over the time course of the sweep of the first mutation, we can assume that the second mutation happens while the frequency of the first is below $\epsilon$ or above $1-\epsilon$. The phase of a selective sweep between $\epsilon$ and $1-\epsilon$ is very short (for large selection coefficients) and can therefore be neglected. When the frequency of the first mutant is above $1-\epsilon$, the probability that the second mutation falls on the wild-type is smaller than $\epsilon$.
Hence this case is negligible as well. Therefore it is reasonable to assume that the frequency of the first mutant is below $\epsilon$ when the second beneficial allele appears.

In the next section heuristic arguments are presented to get an intuition for the impact of the different parameters and the possible scenarios. For that purpose the stochastic differential system (2.3) is analysed based on the introduced starting conditions, assuming that it describes the evolution of a large panmictic population of size $N$. In doing so we also comment on the published approximation results in the different situations.

2.2 Heuristics on the event of fixation

We start this section by summing up the properties and orders of magnitude of the different parameters presented in detail above. The selection coefficients of the different types are ordered by $s_{AB} > s_{aB} > s_{Ab} > s_{ab} = 0$. The recombination probability $r$ is rather small, so that there is a mid-size constant $G$ which bounds the product $rN < G$. In this section we use the stochastic differential equation system (2.3) somewhat imprecisely and assume that it describes the evolution of a large panmictic population of size $N$. Combining the large population size with the precondition of strongly selected beneficial alleles leads to large coefficients $\alpha_i$ in Eq. (2.3). So once a type reaches a certain frequency, its evolution is dominated by the deterministic tendency according to Eq. (2.3) and the stochastic effects play only a minor role. As described in Remark 2.6, this tendency of the different types depends on the configuration of the whole system. Due to these parameter assumptions the evolution can be split into different phases with different possible outcomes, and it ends with the fixation of one type.

¹ The other case, in which the second beneficial mutation happens on an individual which already carries the first beneficial allele, can be handled using classical results about sweeps. Only the selective coefficients of the different types are needed.
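The claim that the phase of a sweep between $\epsilon$ and $1-\epsilon$ is short for large selection coefficients can be made concrete with a small worked computation. As a simplifying assumption made only for this illustration, the noise in Eq. (2.3) is neglected and a single beneficial type with scaled selection coefficient $\alpha$ is considered against the wild-type, so that its frequency $x$ follows the logistic equation:

\[
\dot{x} = \alpha\, x(1-x)
\quad\Longrightarrow\quad
\frac{d}{dt}\log\frac{x}{1-x} = \alpha
\quad\Longrightarrow\quad
T_\epsilon = \frac{1}{\alpha}\Bigl[\log\frac{1-\epsilon}{\epsilon} - \log\frac{\epsilon}{1-\epsilon}\Bigr] = \frac{2}{\alpha}\log\frac{1-\epsilon}{\epsilon}.
\]

Here $T_\epsilon$ is the time needed to move from frequency $\epsilon$ to frequency $1-\epsilon$, measured on the diffusion time scale (units of $N$ generations). For fixed $\epsilon$ it is of order $1/\alpha$; for instance, with the hypothetical values $\alpha = 1000$ and $\epsilon = 0.01$ one gets $T_\epsilon = \frac{2}{1000}\log 99 \approx 0.0092$.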
