A coalescent model for the effect of advantageous mutations on the genealogy of a population

By Rick Durrett and Jason Schweinsberg

May 13, 2005

Abstract

When an advantageous mutation occurs in a population, the favorable allele may spread to the entire population in a short time, an event known as a selective sweep. As a result, when we sample $n$ individuals from a population and trace their ancestral lines backwards in time, many lineages may coalesce almost instantaneously at the time of a selective sweep. We show that as the population size goes to infinity, this process converges to a coalescent process called a coalescent with multiple collisions. A better approximation for finite populations can be obtained using a coalescent with simultaneous multiple collisions. We also show how these coalescent approximations can be used to get insight into how beneficial mutations affect the behavior of statistics that have been used to detect departures from the usual Kingman's coalescent.

1 Introduction

Our goal in this paper is to describe the coalescent processes that arise when we consider the genealogy of a population that is affected by repeated beneficial mutations. The starting point for this analysis will be the continuous-time population model introduced by Moran (1958). In this model, the population size is fixed at $2N$. Each individual independently lives for a time that is exponentially distributed with mean 1 and then is replaced by a new individual. The parent of the new individual is chosen at random from the $2N$ individuals, including the one being replaced. Note that we can think of the population as consisting of $2N$ chromosomes of $N$ diploid individuals, so each member of the population has just one parent.

Suppose we sample $n$ individuals at random from this population at time zero. To describe the genealogy of the sample, we will define the ancestral process, which will be a continuous-time Markov process $(\Psi_N(t), t \ge 0)$ whose state space is the set $\mathcal{P}_n$ of partitions of $\{1,\dots,n\}$.
The ancestral process describes the coalescence of lineages as we follow the ancestral lines of the sampled individuals backwards in time. More precisely, $\Psi_N(0)$ is the partition of $\{1,\dots,n\}$ into $n$ singletons, and $\Psi_N(t)$ is the partition of $\{1,\dots,n\}$ such that $i$ and $j$ are in the same block of $\Psi_N(t)$ if and only if the $i$th and $j$th individuals in the sample have the same ancestor at time $-Nt$. It is well known that the process $(\Psi_N(t), t \ge 0)$ is Kingman's coalescent, a coalescent process introduced by Kingman (1982). Kingman's coalescent is a $\mathcal{P}_n$-valued Markov process that starts from the partition of $\{1,\dots,n\}$ into singletons. All transitions involve exactly two blocks of the partition merging together, and each such transition occurs at rate one.

[Footnotes: Partially supported by NSF grants from the probability program ( and 935) and from a joint DMS/NIGMS initiative to support research in mathematical biology (137). Supported by an NSF Postdoctoral Fellowship.]

Within the last decade, progress has been made on describing the genealogy of populations in models that allow for natural selection. Krone and Neuhauser (1997) and Neuhauser and Krone (1997) studied a model in which each individual can be of type 1 or 2. An individual of type $i$ produces offspring at rate $\lambda_i$, with $\lambda_2 > \lambda_1$ so that type 2 is advantageous. Each new offspring replaces a randomly chosen individual from the population, and is the same type as its parent with probability $1 - u_N$ and the opposite type with probability $u_N$. Under certain assumptions, they show that the genealogy of a sample from the population can be described using what they call an ancestral selection graph. Additional work of Donnelly and Kurtz (1999) and Barton, Etheridge, and Sturm (2004) has incorporated recombination as well as selection into the model.

The ancestral selection graph arises in the limit as $N \to \infty$ in the case of weak selection, where the selective advantage $\lambda_2/\lambda_1 - 1$ and the mutation rates $u_N$ are $O(1/N)$. Then, as $N \to \infty$, the fraction of individuals with the favored allele can be approximated by a diffusion process. In this paper, we consider strong selection, where the selective advantage is $O(1)$. With strong selection, when a beneficial mutation occurs, there is a positive probability that the beneficial allele will spread to the entire population, an event known as a selective sweep. At the end of a selective sweep, the entire population has the favorable allele, and every member of the population will trace that favorable allele back to the individual that had the beneficial mutation that caused the selective sweep. However, the genealogy becomes more complicated when we consider recombination.
Diploid individuals usually do not inherit an identical copy of one of their parent's chromosomes. Instead, the inherited chromosome consists of pieces of each of a parent's two chromosomes. Since a chromosome is coming from two places, we need to consider the genealogy not of an entire chromosome but of a particular site of interest on the chromosome. When a selective sweep is caused by a beneficial mutation at a site other than the site of interest, many individuals may trace their gene at the site of interest back to the individual that had the beneficial mutation at the beginning of the selective sweep, while others may trace their gene at the site of interest to a different ancestor because of recombination between the two sites on the chromosome. This effect was first studied by Maynard Smith and Haigh (1974), who called it the hitchhiking effect.

As we will show, the typical duration of a selective sweep is only $O(\log N)$. Therefore, when we speed up time by a factor of $N$ to define the ancestral process, the selective sweep takes place almost instantaneously. Consequently, if we sample $n$ individuals some time after a selective sweep and define the ancestral process as before, the ancestral process behaves like Kingman's coalescent until we get back to the time of a selective sweep. At that time, many lineages may coalesce because they get traced back to the individual with the mutation that caused the selective sweep. This possibility was observed by Gillespie (2000), who referred to the resulting coalescent process as the pseudohitchhiking model. We will show that if selective sweeps happen repeatedly throughout the history of a population at times of a Poisson process, as proposed by Gillespie (2000), then under suitable assumptions the ancestral processes will converge as $N \to \infty$ to a coalescent with multiple collisions, which is a $\mathcal{P}_n$-valued Markov process in which many blocks of the partition can merge at once into a single block.
These coalescent processes were introduced by Pitman (1999) and Sagitov (1999).

While coalescents with multiple collisions are the limiting coalescent processes as $N \to \infty$, an improved approximation for finite $N$ can be obtained using a coalescent with simultaneous multiple collisions. Coalescents with simultaneous multiple collisions, which were introduced by Schweinsberg (2000) and Möhle and Sagitov (2001), are coalescent processes in which many blocks can merge at once into a single block, and many such mergers can occur simultaneously. They provide a better approximation than coalescents with multiple collisions in this context because, as noted by Barton (1998), Durrett and Schweinsberg (2004a), and Schweinsberg and Durrett (2004), multiple groups of lineages can coalesce at the time of a selective sweep.

Coalescents with multiple or simultaneous multiple collisions arise as limits of ancestral processes in populations that occasionally have very large families, because ancestral lines that go back to an individual with many offspring will coalesce at the same time. Coalescents with multiple collisions arise when a single large family is possible in a given generation, while coalescents with simultaneous multiple collisions arise when one generation can contain many large families. For more details, see Sagitov (1999, 2003), Möhle and Sagitov (2001), and Schweinsberg (2003). The results in this paper provide a different biological application of these coalescent processes.

The rest of this paper is organized as follows. In section 2, we describe our model for how the population evolves when there can be beneficial mutations. We state our main result, which is that the genealogy of this process converges to a coalescent with multiple collisions. In section 3, we present the improved approximation involving a coalescent with simultaneous multiple collisions. The next two sections are devoted to applications of these results. In section 4, we discuss how multiple mergers affect the number of segregating sites and pairwise differences in a sample of DNA.
These quantities are used in Tajima's D-statistic (see Tajima (1989)), which can be used to detect departures from the standard Kingman's coalescent. In section 5 we discuss how multiple mergers affect the number of mutations that appear on just a single individual in the sample, which is relevant to the test proposed by Fu and Li (1993) for detecting departures from Kingman's coalescent. Our results suggest that Fu and Li's test should have less power to detect selective sweeps, at least in large samples, than Tajima's D-statistic. Finally, in section 6, we prove the convergence and approximation theorems stated in sections 2 and 3.

2 Convergence to a coalescent with multiple collisions

In this section, we give a precise description of our model of a population that experiences beneficial mutations, and we state our main convergence theorem. We describe what happens following a single beneficial mutation in subsection 2.1, and we consider recurrent beneficial mutations in subsection 2.2. Then in subsection 2.3, we state the convergence result and give some examples.

2.1 The effect of a single beneficial mutation

In this subsection we describe how the population evolves after one of the $2N$ individuals experiences a beneficial mutation. We will denote the new favorable allele by $B$ and the other allele by $b$. We assume the relative fitnesses of the two alleles are 1 and $1-s$, so the $B$ alleles will tend to survive longer. Immediately after the mutation, one individual has the $B$ allele and $2N - 1$ have the $b$ allele.

Kaplan, Hudson, and Langley (1989) and Stephan, Wiehe, and Lenz (1992) proposed modeling the fraction of individuals $p(t)$ with the $B$ allele at time $t$ by using the logistic differential equation
\[ \frac{dp}{dt} = sp(1-p). \]
This approach has been popular in simulation studies. However, Durrett and Schweinsberg (2004a) showed that this approximation is not very accurate. Consequently, we will consider instead a modification to the Moran model that was studied by Durrett and Schweinsberg (2004a) and Schweinsberg and Durrett (2004).

At one site, each chromosome has a $B$ or $b$ allele, but we will be interested in the genealogy at another neutral site at which all alleles have the same fitness. As in the Moran model, each individual survives for a time that is exponentially distributed with mean 1, and then a replacement is proposed in which the parent of the proposed new individual is chosen at random from the $2N$ members of the population. However, to account for natural selection, whenever a replacement of a $B$ chromosome with a $b$ chromosome is proposed, the change is rejected with probability $s$. Also, to incorporate recombination into the model, we say that when a new individual is born, it inherits its alleles at both sites from the same parent with probability $1 - r$. However, with probability $r$, there is recombination between the two sites, so the new individual inherits its allele at the neutral site from its parent's other chromosome. Because we are treating an individual's two chromosomes as two separate members of the population, we model this by saying that, with probability $r$, the new individual inherits the two alleles from two ancestors chosen independently at random from the population.

Suppose the beneficial mutation appears on one chromosome at time 0, and let $X(t)$ be the number of chromosomes with the favorable allele at time $t$. Let $\tau = \inf\{t : X(t) \in \{0, 2N\}\}$ be the time at which either the $B$ or $b$ allele disappears from the population. Suppose we take a random sample of $n$ individuals from the population at time $\tau$.
Let $\Theta$ be the partition of $\{1,\dots,n\}$ such that $i$ and $j$ are in the same block of $\Theta$ if and only if the $i$th and $j$th individuals in the sample have the same ancestor at time zero when we follow the ancestral lines associated with the neutral site of interest. The partition $\Theta$ then describes how the beneficial mutation affects the genealogy of the sample.

We have the following result concerning the distribution of $\Theta$. Here $Q_{p,n}$, for $p \in [0,1]$, is the distribution of a random partition $\Pi$ obtained as follows. First, define a sequence of independent random variables $(\xi_i)_{i=1}^n$ such that $P(\xi_i = 1) = p$ and $P(\xi_i = 0) = 1 - p$ for $i = 1,\dots,n$. Then define $\Pi$ such that one block of $\Pi$ consists of $\{i \le n : \xi_i = 1\}$ and the remaining blocks of $\Pi$ are singletons.

Proposition 2.1. Fix $n \in \mathbb{N}$, and fix $s \in (0,1)$. Assume there is a constant $C_0$ such that $r \le C_0/(\log N)$ for all $N$. Let $\alpha = r\log(2N)/s$, and let $p = e^{-\alpha}$.

1. There exists a positive constant $C$, depending continuously on $s$ and $\alpha$ but not depending on $N$, such that
\[ |P(\Theta = \pi \mid X(\tau) = 2N) - Q_{p,n}(\pi)| \le C/(\log N) \]
for all $\pi \in \mathcal{P}_n$.

2. Let $\kappa$ be the partition of $\{1,\dots,n\}$ into singletons. There exists a constant $C$, depending continuously on $s$ and $\alpha$ but not depending on $N$, such that
\[ P(\Theta \ne \kappa \text{ and } X(\tau) = 0) \le CN^{-1/2}. \]

Note that in this proposition, the selective advantage $s$ is assumed to be fixed, but the recombination probability $r$ depends on $N$. Part 1 of the proposition, which is a restatement of Theorem 1.1 of Schweinsberg and Durrett (2004), implies that as $N \to \infty$, the distribution of $\Theta$, conditional on the event that a selective sweep occurs, converges to $Q_{p,n}$, where $p$ represents the approximate fraction of lineages that coalesce at the time of the selective sweep. Part 2 of the proposition, which we prove in Section 6, shows that lineages typically do not coalesce when the favorable $B$ allele dies out. The probability that a selective sweep occurs, and therefore Part 1 of the proposition applies, is $s/(1 - (1-s)^{2N})$ (see Durrett (2002) or Schweinsberg and Durrett (2004)).

2.2 A model with recurrent beneficial mutations

To model a population in which beneficial mutations can occur repeatedly, we assume that beneficial mutations at different points on the chromosome occur at times of a Poisson process. The selective advantage that these mutations provide and the rate of recombination between the site of interest and the site of the mutation will be random. When there is a beneficial mutation in the population, the population will evolve as described in the previous subsection. Between these times, the population will follow the standard Moran model.

To be more precise, we will consider the chromosome to be the line segment $[-L, L]$. Our goal will be to describe the genealogy of the site 0. For each $N$, the beneficial mutations will be governed by a Poisson process $K_N$ on $\mathbb{R} \times [-L, L] \times [0,1]$. If $(t, x, s)$ is a point in $K_N$, then at time $t$, a mutation, which provides a selective advantage of $s$, will appear at location $x$ on one of the $2N$ chromosomes. The intensity measure of $K_N$ will be $\lambda \times \mu_N$, where $\lambda$ denotes Lebesgue measure on $\mathbb{R}$ and $\mu_N$ is a finite measure on $[-L, L] \times [0,1]$ which governs the rates of beneficial mutations. The recombination probabilities will be determined by a function $r_N : [-L, L] \to [0,1]$. We assume that $r_N(0) = 0$ and $r_N$ is nonincreasing on $[-L, 0]$ and nondecreasing on $[0, L]$. Beginning at time $t$, the population will evolve according to the model described in the previous subsection of a population with a beneficial allele having selective advantage $s$ and recombination probability $r_N(x)$.
We let $\tau(t)$ denote the first time that the beneficial mutation that appears at time $t$ either disappears from the population or is present on all $2N$ chromosomes.

Let $T_N = \{t : (t, x, s) \text{ is a point in } K_N \text{ for some } x \text{ and } s\}$ be the times at which beneficial mutations are proposed. Note, however, that we cannot define the evolution of the population as explained above if, for some $t_1, t_2 \in T_N$, the intervals $[t_1, \tau(t_1)]$ and $[t_2, \tau(t_2)]$ overlap. There has been some work in the biology literature on the question of how a selective sweep is affected by another selective sweep happening at the same time (see, for example, Barton (1995), Gerrish and Lenski (1998), and Kim and Stephan (2003)). However, as we will show, in our model this overlap occurs too infrequently to have any effect on our results, so we avoid the issue of defining the population during periods of overlap by allowing a new beneficial mutation to occur only when there is no other beneficial mutation currently in the population. That is, beneficial mutations will occur at the times in
\[ T_N' = \{t \in T_N : \tau(u) < t \text{ for all } u \in T_N' \text{ such that } u < t\}. \]
Let
\[ I_N = \bigcup_{t \in T_N'} [t, \tau(t)]. \]
A beneficial mutation will be present in the population at time $u$ if and only if $u \in I_N$. For the intervals in $I_N$, the evolution of the population was defined in subsection 2.1. For the times in $\mathbb{R} \setminus I_N$, we will say that the population evolves according to the standard Moran model, so that the evolution of the population is well-defined for all of $\mathbb{R}$.

To define the ancestral process $\Psi_N = (\Psi_N(t), t \ge 0)$, we sample $n$ of the $2N$ individuals at random from the population at time zero. We then define $\Psi_N(t)$ to be the partition of $\{1,\dots,n\}$ such that $i$ and $j$ are in the same block of $\Psi_N(t)$ if and only if the $i$th and $j$th individuals in the sample got their allele at location 0 on the chromosome from the same ancestor at time $-Nt$. Note that we are again speeding up time by a factor of $N$ so that, if there are no beneficial mutations (i.e. if $\mu_N$ is the zero measure), the ancestral process $\Psi_N = (\Psi_N(t), t \ge 0)$ is Kingman's coalescent. When we do have beneficial mutations, the ancestral processes will converge as $N \to \infty$, under suitable conditions, to a coalescent with multiple collisions.

2.3 The main convergence theorem and examples

Pitman (1999) introduced coalescents with multiple collisions, in which many blocks of the partition can merge into one. These coalescent processes are in one-to-one correspondence with finite measures $\Lambda$ on $[0,1]$, and the coalescent process associated with a particular measure $\Lambda$ is called the $\Lambda$-coalescent. We will consider here only $\mathcal{P}_n$-valued coalescents because they are what we will need to approximate the genealogy of a sample of size $n$. However, the constructions can be extended, using Kolmogorov's Extension Theorem, to yield coalescent processes that take their values in the set of partitions of $\mathbb{N} = \{1, 2, \dots\}$.

Suppose $(\Pi_n(t), t \ge 0)$ is the $\mathcal{P}_n$-valued $\Lambda$-coalescent. Then $\Pi_n(0)$ is the partition of $\{1,\dots,n\}$ into singletons. If $\Pi_n(t)$ has $b$ blocks, then every possible transition involves merging $k$ of the blocks into one, where $2 \le k \le b$. Denoting the rate of this transition by $\lambda_{b,k}$, we have
\[ \lambda_{b,k} = \int_0^1 x^{k-2}(1-x)^{b-k}\,\Lambda(dx). \tag{2.1} \]
If $\Lambda = \delta_0$, where $\delta_0$ denotes a unit mass at zero, then every transition that involves two blocks merging into one happens at rate one, and no other transitions are possible. Thus, the $\delta_0$-coalescent is Kingman's coalescent.

The theorem below states that when we do have beneficial mutations, the ancestral processes converge as $N \to \infty$, under suitable conditions, to a coalescent with multiple collisions. The multiple mergers happen at times of selective sweeps.
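As a small numerical illustration of (2.1), the sketch below evaluates the rates $\lambda_{b,k}$ for a measure consisting of a mass at zero plus finitely many atoms in $(0,1]$. The function names `lambda_rate` and `total_merger_rate` and the representation of the measure as a list of `(location, weight)` pairs are assumptions made for this example only; they do not come from the paper.

```python
from math import comb


def lambda_rate(b, k, atoms, kingman_mass=1.0):
    """Rate lambda_{b,k} of (2.1) for Lambda = kingman_mass*delta_0 + sum_i w_i*delta_{x_i}.

    `atoms` is a list of (x_i, w_i) pairs with 0 < x_i <= 1 (hypothetical representation).
    """
    if not 2 <= k <= b:
        raise ValueError("need 2 <= k <= b")
    # The integrand x^(k-2) (1-x)^(b-k) equals 1 at x = 0 when k = 2 and 0 when k > 2,
    # so the mass at zero only contributes to pairwise mergers.
    rate = kingman_mass if k == 2 else 0.0
    for x, w in atoms:
        rate += w * x ** (k - 2) * (1 - x) ** (b - k)
    return rate


def total_merger_rate(b, atoms, kingman_mass=1.0):
    """Total rate of all mergers from b blocks: sum over k of C(b, k) * lambda_{b,k}."""
    return sum(comb(b, k) * lambda_rate(b, k, atoms, kingman_mass)
               for k in range(2, b + 1))
```

With `atoms=[]` this reduces to Kingman's coalescent, where the total merger rate from $b$ blocks is $\binom{b}{2}$. A single atom of weight $w$ at $p$ would be passed as `atoms=[(p, w)]`.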
Note that the convergence is in the sense of finite-dimensional distributions. Convergence in the stronger Skorohod topology does not hold because, during the short time intervals when selective sweeps are taking place, $\Psi_N$ may undergo multiple transitions.

Theorem 2.2. Let $\mu$ be a finite measure on $[-L, L] \times [0,1]$, and let $r : [-L, L] \to [0, \infty)$ be a bounded continuous function such that $r(0) = 0$ and $r$ is nonincreasing on $[-L, 0]$ and nondecreasing on $[0, L]$. Suppose that, as $N \to \infty$, the measures $N\mu_N$ converge weakly to $\mu$ and the functions $(\log N) r_N$ converge uniformly to $r$. Let $\eta$ be the measure on $(0,1]$ such that
\[ \eta([y,1]) = \int_{-L}^{L}\int_0^1 s\,1_{\{e^{-r(x)/s} \ge y\}}\,\mu(dx \times ds) \]
for all $y \in (0,1]$. Let $\Lambda$ be the measure on $[0,1]$ defined by $\Lambda = \delta_0 + \Lambda_0$, where $\Lambda_0(dx) = x^2\,\eta(dx)$. Let $\Pi = (\Pi(t), t \ge 0)$ be the $\mathcal{P}_n$-valued $\Lambda$-coalescent. Then, as $N \to \infty$, the finite-dimensional distributions of $\Psi_N$ converge to the finite-dimensional distributions of $\Pi$.

Note that in Theorem 2.2, the recombination probability is $O(1/(\log N))$. The function $r$ is assumed to be monotone on $[-L, 0]$ and $[0, L]$ because the greater the distance between 0 and the site of the mutation, the greater the likelihood of recombination between the two sites. Also, the rate of beneficial mutations is $O(1/N)$, so that the multiple mergers caused by selective sweeps and the ordinary mergers of two lineages at a time are happening on the same time scale. If the rate of selective sweeps were $o(1/N)$, then the multiple mergers would disappear in the limit. If selective sweeps occurred on a faster time scale than $O(1/N)$, then the multiple mergers would dominate for large $N$ and the limiting coalescent would have no $\delta_0$ component. Gillespie (2000) considers this possibility and proposes that it may explain why observed genetic variation does not appear to be as sensitive to population size as Kingman's coalescent model predicts. However, in this paper we focus on the case in which both types of mergers happen on the same time scale.

We now derive the limiting coalescent with multiple collisions in two natural examples.

Example 2.3. Consider the case in which we are concerned only with mutations at a single site, all of which have the same selective advantage. Fix $\alpha > 0$, and let $\mu_N = \alpha N^{-1}\delta_{(z,s)}$ for some $s \in (0,1]$ and $z \in [-L, L]$. This means that beneficial mutations that provide selective advantage $s$ appear on the chromosome at site $z$ at times of a Poisson process. The measures $N\mu_N$ converge to $\mu = \alpha\delta_{(z,s)}$. Assume that the recombination functions $r_N$ are defined such that the sequence $(\log N) r_N$ converges uniformly to $r$, and let $\beta = r(z)$. Then, for all $y \in (0,1]$, we have
\[ \eta([y,1]) = \int_{-L}^{L}\int_0^1 u\,1_{\{e^{-r(x)/u} \ge y\}}\,\mu(dx \times du) = s\alpha\,1_{\{e^{-\beta/s} \ge y\}}. \]
Therefore, $\eta$ consists of a mass $s\alpha$ at $p = e^{-\beta/s}$. It follows from Theorem 2.2 that the limiting coalescent process is the $\Lambda$-coalescent, where $\Lambda = \delta_0 + s\alpha p^2\delta_p$. Thus, in addition to the mergers involving just two blocks, we have coalescence events at times of a Poisson process in which we flip $p$-coins for each lineage and merge the lineages whose coins come up heads.

Example 2.4.
It is also natural to consider the case in which mutations occur uniformly along the chromosome. For simplicity, we will assume that the selective advantage $s$ is fixed. Let $\lambda$ denote Lebesgue measure on $[-L, L]$. Suppose $\mu_N = N^{-1}(\alpha\lambda \times \delta_s)$, so the measures $N\mu_N$ converge to $\mu = \alpha\lambda \times \delta_s$. To model recombination occurring uniformly along the chromosome, we assume that the functions $(\log N) r_N$ converge uniformly to the function $r(x) = \beta|x|$, so the probability of recombination is proportional to the distance between the two sites on the chromosome. For all $y \in (0,1]$, we have
\[ \eta([y,1]) = \alpha s\int_{-L}^{L} 1_{\{e^{-r(x)/s} \ge y\}}\,dx = 2\alpha s\int_0^L 1_{\{e^{-\beta x/s} \ge y\}}\,dx. \]
Since $e^{-\beta x/s} \ge y$ if and only if $x \le -(s/\beta)(\log y)$, we have
\[ \eta([y,1]) = \min\left\{ -\frac{2\alpha s^2}{\beta}\log y,\; 2\alpha s L \right\}. \]
Therefore, for $y \ge e^{-\beta L/s}$, we have
\[ \frac{d}{dy}\,\eta([y,1]) = -\frac{2\alpha s^2}{\beta y}. \]
Let $c = 2\alpha s^2/\beta$. It follows that $\eta$ has a density given by $g_L(y) = c/y$ for $e^{-\beta L/s} \le y \le 1$ and $g_L(y) = 0$ otherwise. By Theorem 2.2, the finite-dimensional distributions of the ancestral processes $\Psi_N$ converge to those of the $\Lambda$-coalescent, where $\Lambda = \delta_0 + \Lambda_0$ and $\Lambda_0$ has density $h_L(y) = y^2 g_L(y)$. Note that as $L \to \infty$, the density $h_L(y)$ converges to $h(y)$, where $h(y) = cy$ for $y \in [0,1]$ and $h(y) = 0$ otherwise. We can think of this as the limiting coalescent for an infinitely long chromosome.

Example 2.5. Finally, we show that any $\Lambda$-coalescent with a unit mass at zero can arise as a limit of ancestral processes in this model. We first show how to obtain coalescents of the form $\Lambda = \delta_0 + \Lambda_0$, where $\Lambda_0$ is a finite measure on $[\epsilon, 1]$ and $0 < \epsilon < 1$. Note that in Theorem 2.2, we have $\Lambda_0(dx) = x^2\,\eta(dx)$, so it suffices to show that $\mu$ and $r$ can be chosen to make $\eta$ an arbitrary finite measure on $[\epsilon, 1]$. Let $G : [\epsilon, 1] \to [0, \infty)$ be any nonincreasing left-continuous function. We will choose $\mu$ and $r$ so that $\eta([y,1]) = G(y)$ for $\epsilon \le y \le 1$ and $\eta([0,\epsilon)) = 0$. Let $L = -\frac{1}{2}\log\epsilon$, and let $\nu$ be the measure on $[-L, L]$ such that $\nu([-L, 0)) = 0$ and, for $\epsilon \le y \le 1$, $\nu([0, -\frac{1}{2}\log y]) = 2G(y)$. Suppose $r(x) = |x|$ and $\mu = \nu \times \delta_{1/2}$. Then, for $\epsilon \le y \le 1$,
\[ \eta([y,1]) = \int_{-L}^{L}\int_0^1 s\,1_{\{e^{-r(x)/s} \ge y\}}\,\mu(dx \times ds) = \frac{1}{2}\int_{-L}^{L} 1_{\{e^{-2|x|} \ge y\}}\,\nu(dx) = \frac{1}{2}\,\nu([0, -(\log y)/2]) = G(y), \]
as claimed. Thus, we can get the $\Lambda$-coalescent in the limit if $\Lambda_0((0,\epsilon)) = 0$. We can obtain an arbitrary $\Lambda$-coalescent by then taking a limit as $L \to \infty$ (or $\epsilon \to 0$) as in Example 2.4.

3 Approximation by a coalescent with simultaneous multiple collisions

A key ingredient in the proof of Theorem 2.2 is part 1 of Proposition 2.1. Part 1 of Proposition 2.1 says that, up to an error of $O(1/(\log N))$, we can approximate the effect of a selective sweep on the genealogy by flipping a $p$-coin for each lineage and merging the lineages whose coins come up heads. However, Durrett and Schweinsberg (2004a) observed in simulations that for $N$ between 10,000 and 1,000,000, the approximation in Proposition 2.1 works poorly, largely because it is possible for multiple groups of lineages to coalesce at the time of a selective sweep.
By taking this into account, they were able to give a more complicated approximation that works much better in simulations and has an error of only $O(1/(\log N)^2)$.

Before stating this result, we review Kingman's (1978) paintbox construction of exchangeable random partitions of $\{1,\dots,n\}$. Let
\[ \Delta = \Big\{ (x_1, x_2, \dots) : x_1 \ge x_2 \ge \dots \ge 0,\ \sum_{i=1}^{\infty} x_i \le 1 \Big\}, \]
and let $G$ be a probability measure on $\Delta$. We define a $G$-partition $\Pi$ of $\{1,\dots,n\}$ as follows. Let $Y = (Y_1, Y_2, \dots)$ be a $\Delta$-valued random variable with distribution $G$. Define a sequence $(Z_i)_{i=1}^n$ to be conditionally i.i.d. given $Y$ such that $P(Z_i = j \mid Y) = Y_j$ for all positive integers $j$ and $P(Z_i = 0 \mid Y) = 1 - \sum_{j=1}^{\infty} Y_j$. Then define $\Pi$ to be the partition such that distinct integers $i$ and $j$ are in the same block if and only if $Z_i = Z_j \ge 1$. We denote the distribution of a $G$-partition of $\{1,\dots,n\}$ by $Q_{G,n}$. Note that if $G$ is a unit mass at $(p, 0, 0, \dots)$, then $Q_{G,n} = Q_{p,n}$.
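The paintbox construction can be sketched in a few lines of code. The example below samples a partition of $\{1,\dots,n\}$ conditionally on a given frequency vector $Y$, which is passed in as a finite tuple `y` with trailing zeros omitted. The function name `paintbox_partition` and the list-of-lists representation of a partition are hypothetical choices for illustration, not notation from the paper.

```python
import random


def paintbox_partition(n, y, rng):
    """Sample a partition of {1,...,n} from Kingman's paintbox, given frequencies y.

    y is a finite sequence (y_1, y_2, ...) with y_j >= 0 and sum(y) <= 1
    (assumed representation). Integer i gets label j >= 1 with probability y_j
    and label 0 otherwise; label-0 integers become singletons.
    """
    blocks = {}      # label j >= 1 -> integers that received label j
    singletons = []  # integers that received label 0
    for i in range(1, n + 1):
        u = rng.random()
        acc = 0.0
        label = 0
        for j, yj in enumerate(y, start=1):
            acc += yj
            if u < acc:
                label = j
                break
        if label == 0:
            singletons.append([i])
        else:
            blocks.setdefault(label, []).append(i)
    # Return blocks ordered by their least element.
    return sorted(list(blocks.values()) + singletons)
```

Calling `paintbox_partition(n, (p,), rng)` gives a draw from $Q_{p,n}$: each integer joins the single non-trivial block with probability $p$, independently of the others.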

Next, we define a family of distributions $R(\theta, M)$ on $\Delta$ by using a stick-breaking construction. Let $\theta \in [0,1]$, and let $M$ be a positive integer. Let $(W_k)_{k=2}^M$ be independent random variables such that $W_k$ has a Beta$(1, k-1)$ distribution. Let $(\zeta_k)_{k=2}^M$ be a sequence of independent random variables such that $P(\zeta_k = 1) = \theta$ and $P(\zeta_k = 0) = 1 - \theta$ for all $k$. For $k = 2, 3, \dots, M$, let $V_k = \zeta_k W_k$. To perform the stick breaking, we first break off a fraction $W_M$ of the unit interval, then break off a fraction $W_{M-1}$ of what is left over, and so on until we get down to $W_2$. For $k = 2, \dots, M$, the length of the $k$th fragment is $\tilde{Y}_k = V_k\prod_{j=k+1}^{M}(1 - V_j)$, and the length of the first fragment is $\tilde{Y}_1 = \prod_{j=2}^{M}(1 - V_j)$. Note that $\sum_{k=1}^{M}\tilde{Y}_k = 1$. Let $Y = (Y_1, Y_2, \dots, Y_M, 0, 0, \dots)$ be the sequence obtained by ranking the interval lengths $\tilde{Y}_1, \dots, \tilde{Y}_M$ in decreasing order and then appending an infinite sequence of zeros. Finally, let $R(\theta, M)$ be the distribution of $Y$.

These distributions $R(\theta, M)$ were studied in Durrett and Schweinsberg (2004b), who used them to approximate the distribution of family sizes in a Yule process with infinitely many types. They arise in the proposition below because, after a beneficial mutation, the number of lineages with the $B$ allele that do not eventually die out can be approximated by a Yule process. The result below is Theorem 1.2 of Schweinsberg and Durrett (2004).

Proposition 3.1. Fix $n \in \mathbb{N}$, and fix $s \in (0,1)$. Assume there is a constant $C_0$ such that $r \le C_0/(\log N)$ for all $N$. Let $\alpha = r\log(2N)/s$, and let $p = e^{-\alpha}$. Then there exists a positive constant $C$, depending continuously on $s$ and $\alpha$ but not depending on $N$, such that
\[ |P(\Theta = \pi \mid X(\tau) = 2N) - Q_{R(r/s, \lfloor 2Ns \rfloor), n}(\pi)| \le C/(\log N)^2 \]
for all $\pi \in \mathcal{P}_n$, where $\lfloor m \rfloor$ denotes the greatest integer less than or equal to $m$.
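To make the stick-breaking definition of $R(\theta, M)$ concrete, the following sketch draws one sample of the ranked fragment lengths, omitting the infinite tail of zeros. The function name `sample_R` is a hypothetical label for this example, and the code follows the formulas for $\tilde{Y}_k$ stated above, using the standard-library `random.betavariate` for the Beta$(1, k-1)$ draws.

```python
import random


def sample_R(theta, M, rng):
    """Draw the ranked stick-breaking lengths (Y_1, ..., Y_M) with distribution R(theta, M)."""
    # V_k = zeta_k * W_k for k = 2, ..., M, with W_k ~ Beta(1, k-1) and zeta_k ~ Bernoulli(theta).
    V = {}
    for k in range(2, M + 1):
        zeta = 1 if rng.random() < theta else 0
        W = rng.betavariate(1, k - 1)
        V[k] = zeta * W
    lengths = []
    remaining = 1.0  # product of (1 - V_j) over the j already processed
    for k in range(M, 1, -1):
        lengths.append(V[k] * remaining)  # Y~_k = V_k * prod_{j > k} (1 - V_j)
        remaining *= 1 - V[k]
    lengths.append(remaining)             # Y~_1 = prod_{j = 2}^{M} (1 - V_j)
    return sorted(lengths, reverse=True)
```

Feeding the output of `sample_R(r / s, M, rng)` into a paintbox sampler would give one draw from $Q_{R(r/s, M), n}$. When `theta` is 0 no stick is ever broken, so the sample is a single fragment of length 1 followed by zeros.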
Because the improved approximation allows many groups of lineages to coalesce at the time of a selective sweep, this result suggests that, for finite $N$, a coalescent with simultaneous multiple collisions should provide a better approximation of the ancestral process than a coalescent with multiple collisions. Coalescents with simultaneous multiple collisions, which were studied by Möhle and Sagitov (2001), Schweinsberg (2000), and Bertoin and Le Gall (2003), have the property that many blocks can merge at once into a single block, and many such mergers can occur simultaneously.

Coalescents with simultaneous multiple collisions are in one-to-one correspondence with finite measures $\Xi$ on $\Delta$. Suppose $\pi$ is a partition of $\{1,\dots,n\}$ whose blocks are $B_1, \dots, B_m$, and suppose $\pi'$ is a partition of $\{1,\dots,n'\}$ with $n' \ge m$ whose blocks are $B_1', \dots, B_k'$. Following Bertoin and Le Gall (2003), define the coagulation of $\pi$ by $\pi'$ to be the partition whose blocks are given by $\bigcup_{j \in B_i'} B_j$ for $i = 1, \dots, k$.

Suppose $(\Pi_n(t), t \ge 0)$ is the $\mathcal{P}_n$-valued $\Xi$-coalescent. If there are $b$ blocks at time $t-$ and a merger occurs at time $t$, then there exists a unique partition $\pi \in \mathcal{P}_b$ such that $\Pi_n(t)$ is the coagulation of $\Pi_n(t-)$ by $\pi$. If $\pi$ has $r + s$ blocks, $s$ of which are singletons and the other $r$ of which have sizes $k_1, \dots, k_r \ge 2$, where $b = k_1 + \dots + k_r + s$, then the rate of this transition is
\[ \lambda_{b;k_1,\dots,k_r;s} = \int_{\Delta} Q_{\delta_x, b}(\pi)\Big(\sum_{j=1}^{\infty} x_j^2\Big)^{-1}\,\Xi_0(dx) + a\,1_{\{r=1,\,k_1=2\}}, \tag{3.1} \]
where $\delta_x$ denotes a unit mass at $x = (x_1, x_2, \dots)$ and $\Xi$ has been written as $a\delta_{(0,0,\dots)} + \Xi_0$ with $\Xi_0(\{(0,0,\dots)\}) = 0$. Coalescents with multiple collisions are a special case in which $\Xi$ is concentrated on points in which only the first coordinate is nonzero.
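The coagulation operation is easy to state in code. The sketch below is one possible reading of the definition, under two assumptions that the text does not spell out: the blocks $B_1, \dots, B_m$ of $\pi$ are listed in a fixed order (here, the order in which they appear in the input list), and indices $j > m$ appearing in $\pi'$ are ignored, since no block $B_j$ exists for them. The function name `coagulate` is hypothetical.

```python
def coagulate(pi, pi_prime):
    """Coagulation of pi by pi_prime.

    pi       : list of blocks B_1, ..., B_m (each a list of integers), in a fixed order.
    pi_prime : list of blocks of a partition of {1, ..., n'} with n' >= m.
    Each block B'_i of pi_prime yields the union of the B_j with j in B'_i.
    Assumption for this sketch: indices j > m are skipped, and empty unions are dropped.
    """
    m = len(pi)
    result = []
    for block in pi_prime:
        merged = sorted(x for j in block if j <= m for x in pi[j - 1])
        if merged:
            result.append(merged)
    return sorted(result)
```

For example, coagulating the partition with blocks $\{1,3\}$, $\{2\}$, $\{4,5\}$ by the partition with blocks $\{1,3\}$, $\{2\}$ merges the first and third blocks and leaves the second unchanged.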

Coalescents with multiple and simultaneous multiple collisions can be constructed from Poisson point processes (see Pitman (1999) and Schweinsberg (2000)). Consider a Poisson process on $(0, \infty) \times \mathcal{P}_n$ whose intensity measure is the product of Lebesgue measure on $(0, \infty)$ and a measure $L$ on $\mathcal{P}_n$ defined as follows. Let $S \subset \mathcal{P}_n$ be the set of all partitions consisting of one block of size 2 and $n - 2$ singletons. If $\pi \in \mathcal{P}_n$, let $L(\pi) = 0$ if $\pi$ is the partition consisting of $n$ singletons. Otherwise, let
\[ L(\pi) = \int_{\Delta} Q_{\delta_x, n}(\pi)\Big(\sum_{j=1}^{\infty} x_j^2\Big)^{-1}\,\Xi_0(dx) + a\,1_{\{\pi \in S\}}. \tag{3.2} \]
Since $L$ is a finite measure, it is easy to define $\Pi_n = (\Pi_n(t), t \ge 0)$ such that $\Pi_n(0)$ is the partition consisting of $n$ singletons and, at the times of points $(t, \pi)$ of the Poisson point process, the partition $\Pi_n(t)$ is the coagulation of $\Pi_n(t-)$ by $\pi$, and these are the only jump times of $\Pi_n$. This coalescent process is the $\mathcal{P}_n$-valued $\Xi$-coalescent. The construction of the $\Lambda$-coalescent is the same, except that if $\pi$ has at least one block that is not a singleton, we define
\[ L(\pi) = \int_0^1 Q_{p,n}(\pi)\,p^{-2}\,\Lambda_0(dp) + a\,1_{\{\pi \in S\}}, \tag{3.3} \]
where $\Lambda = a\delta_0 + \Lambda_0$ and $\Lambda_0(\{0\}) = 0$.

Under some additional assumptions, most significantly restricting the selective advantage resulting from each beneficial mutation to be at least $\epsilon > 0$, we are able to obtain bounds on the difference between the finite-dimensional distributions of $\Psi_N$ and the finite-dimensional distributions of the approximating coalescent process. Proposition 3.2 below shows that indeed the coalescent with simultaneous multiple collisions gives a more accurate approximation.

Proposition 3.2. Let $\mu$ be a finite measure on $[-L, L] \times [\epsilon, 1]$, where $\epsilon > 0$, and let $r : [-L, L] \to [0,1]$ be a function such that $r(0) = 0$ and $r$ is nonincreasing on $[-L, 0]$ and nondecreasing on $[0, L]$. Suppose that, for all $N$, we have $\mu_N = N^{-1}\mu$. Also, assume that $r_N(x) = r(x)/\log(2N)$ for all $N$ and $x$. Fix times $0 < u_1 < \dots < u_m$, and let $\pi_1, \dots, \pi_m \in \mathcal{P}_n$.

1. Define $\eta$ and $\Lambda$ as in Theorem 2.2. Let $\Pi = (\Pi(t), t \ge 0)$ be the $\mathcal{P}_n$-valued $\Lambda$-coalescent. Then there exists a constant $C$ such that
\[ |P(\Psi_N(u_i) = \pi_i \text{ for } i = 1,\dots,m) - P(\Pi(u_i) = \pi_i \text{ for } i = 1,\dots,m)| \le \frac{C}{\log N}. \]

2. Let $G_N$ be the measure on $\Delta$ such that for all measurable subsets $A$, we have
\[ G_N(A) = \int_{-L}^{L}\int_0^1 s\,R(r_N(x)/s, \lfloor 2Ns \rfloor)(A)\,\mu(dx \times ds). \]
Let $\Xi_N$ be the measure on $\Delta$ given by $\Xi_N = \delta_{(0,0,\dots)} + \Xi_{N,0}$, where $\Xi_{N,0}$ is defined by $\Xi_{N,0}(dx) = \big(\sum_{j=1}^{\infty} x_j^2\big)\,G_N(dx)$. Let $\Upsilon_N = (\Upsilon_N(t), t \ge 0)$ be the $\mathcal{P}_n$-valued $\Xi_N$-coalescent. Then there exists a constant $C$ such that
\[ |P(\Psi_N(u_i) = \pi_i \text{ for } i = 1,\dots,m) - P(\Upsilon_N(u_i) = \pi_i \text{ for } i = 1,\dots,m)| \le \frac{C}{(\log N)^2}. \]

4 Segregating sites and pairwise differences

One motivation for modeling a population that experiences recurrent selective sweeps by coalescents with multiple or simultaneous multiple collisions is that these coalescent models can provide insight into tests used to detect selective sweeps. In view of part 2 of Proposition 3.2 and the simulation results in Durrett and Schweinsberg (2004a), there should be little loss of accuracy in studying the behavior of these tests under the assumption that the genealogy of a sample follows a coalescent with simultaneous multiple collisions.

One commonly used test is based on Tajima's D-statistic (see Tajima (1989)). Given a sample of $n$ strands of DNA from the same region on a chromosome, let $\Delta_{ij}$ be the number of sites at which the $i$th and $j$th segments differ, and let $\Delta_n = \binom{n}{2}^{-1}\sum_{i<j}\Delta_{ij}$ be the average number of pairwise differences over the $\binom{n}{2}$ possible pairs. Let $S_n$ be the number of segregating sites in the sample, that is, the number of sites at which at least one pair of segments differs. Tajima's D-statistic compares the statistics $\Delta_n$ and $S_n$.

Suppose the ancestral history of a sample of $n$ individuals is given by a coalescent with multiple or simultaneous multiple collisions. Let $\lambda_b$ be the total rate of all mergers when the coalescent has $b$ blocks. Assume that, on the time scale of the coalescent process, mutations happen at rate $\theta/2$. Any mutation on the $i$th or $j$th lineage before these lineages coalesce will cause the $i$th and $j$th segments to differ at some site. Since the expected time for these lineages to coalesce is $\lambda_2^{-1}$, we have $E[\Delta_{ij}] = \theta\lambda_2^{-1}$. Therefore
\[ E[\Delta_n] = \theta\lambda_2^{-1}. \tag{4.1} \]
Note that $\lambda_2 = \Lambda([0,1])$ for coalescents with multiple collisions and $\lambda_2 = \Xi(\Delta)$ for coalescents with simultaneous multiple collisions.

To calculate the expected number of segregating sites, we note that any mutation in the ancestral tree before all $n$ lineages have coalesced into one adds to the number of segregating sites.
If, at some time, the coalescent has exactly $b$ blocks, the expected time that the coalescent has $b$ blocks is $\lambda_b^{-1}$. Let $G_n(b)$ be the probability that the coalescent, starting with $n$ blocks, will have exactly $b$ blocks at some time. Then
\[ E[S_n] = \frac{\theta}{2} \sum_{b=2}^{n} b \lambda_b^{-1} G_n(b). \tag{4.2} \]
Although we do not have a closed-form expression for $G_n(b)$, these quantities can be calculated recursively because (2.1) and (3.1) allow us to express $G_n(b)$ in terms of $G_k(b)$ for $k < n$. As a result, it would not be difficult to evaluate the expression in (4.2) numerically.

Suppose the ancestral process is given by Kingman's coalescent, which would be the case if there were no selective sweeps. Then $\lambda_b = \binom{b}{2}$ for all $b$. Also, the number of blocks never decreases by more than one at a time, so $G_n(b) = 1$ whenever $2 \le b \le n$. It follows that $E[\Delta_n] = \theta$ and
\[ E[S_n] = \frac{\theta}{2} \sum_{b=2}^{n} b \binom{b}{2}^{-1} = \theta \sum_{b=2}^{n} \frac{1}{b-1} = \theta h_{n-1}, \tag{4.3} \]
where $h_{n-1} = \sum_{i=1}^{n-1} (1/i)$. Thus, $E[\Delta_n - S_n/h_{n-1}] = 0$. This observation is the basis for Tajima's D-statistic, which is given by
\[ D = \frac{\Delta_n - S_n/h_{n-1}}{\sqrt{a_n S_n + b_n S_n (S_n - 1)}}, \tag{4.4} \]

where $a_n$ and $b_n$ are somewhat complicated constants that are chosen to make the variance of $D$ approximately one when the ancestral tree is given by Kingman's coalescent. See section 4.1 of Durrett (2002) for details.

After a selective sweep, the new mutants will tend to have low frequency. As a result, a recent selective sweep should decrease $\Delta_n$ more than $S_n$, causing the numerator of Tajima's D-statistic to be negative. Braverman et al. (1995) found in simulations that Tajima's D-statistic indeed tends to be negative after a selective sweep. Simonsen, Churchill, and Aquadro (1995) studied this question further and argued that unless the selective sweep was recent, Tajima's D-statistic had relatively little power to detect selective sweeps. See also Przeworski (2002), who discusses the power of Tajima's D-statistic to detect selective sweeps.

Our coalescent approximation allows us to obtain the following result regarding the expected number of segregating sites when the population experiences recurrent selective sweeps.

Proposition 4.1. Consider a $\Lambda$-coalescent in which $\Lambda = \delta_0 + \Lambda_0$, where $\Lambda_0(\{0\}) = 0$, or a $\Xi$-coalescent in which $\Xi = \delta_{(0,0,\dots)} + \Xi_0$ and $\Xi_0(\{(0,0,\dots)\}) = 0$. Let $\alpha_b = \lambda_b - \binom{b}{2}$. Suppose
\[ \sum_{b=2}^{\infty} \frac{\alpha_b \log b}{b^2} < \infty. \tag{4.5} \]
Then there exists a constant $\rho$ such that
\[ \lim_{n \to \infty} \left( E[S_n] - \theta h_{n-1} \right) = -\rho. \tag{4.6} \]
Furthermore, defining $G_\infty(b) = \lim_{n \to \infty} G_n(b)$, we have
\[ \rho = \frac{\theta}{2} \sum_{b=2}^{\infty} b \left( \binom{b}{2}^{-1} - \lambda_b^{-1} \right) + \frac{\theta}{2} \sum_{b=2}^{\infty} b \lambda_b^{-1} \left( 1 - G_\infty(b) \right). \tag{4.7} \]

The condition (4.5) prevents $\Lambda_0$ or $\Xi_0$ from having too much mass near zero. Note that (4.1) implies that $E[\Delta_n]$ decreases by a constant as a result of the beneficial mutations, while Proposition 4.1 implies that when (4.5) holds, $E[S_n/h_{n-1}]$ decreases by approximately $\rho/h_{n-1}$, which is $O(1/\log n)$. Therefore, Proposition 4.1 shows that for sufficiently large samples we do expect Tajima's D-statistic to be negative when the population is affected by recurrent selective sweeps. Before proving this proposition, we consider some examples.

Example 4.2. Suppose, as in Example 2.3, we have a $\Lambda$-coalescent in which $\Lambda = \delta_0 + s\alpha p^2 \delta_p$.
Since $p$-mergers occur at rate $s\alpha$, we have $\lambda_b \le \binom{b}{2} + s\alpha$ and thus $\alpha_b \le s\alpha$ for all $b$. Condition (4.5) follows immediately.

Suppose instead we have the $\Lambda$-coalescent of Example 2.4, where $\Lambda = \delta_0 + \Lambda_0$ and $\Lambda_0(dx) = cx \, dx$. Note that $\alpha_b$ is the same as the total merger rate of the $\Lambda_0$-coalescent when there are $b$ blocks. Using the fact that if $Z \sim \text{Binomial}(b, x)$ then $P(Z \ge 2) = 1 - (1-x)^b - bx(1-x)^{b-1}$,

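we have
\begin{align*}
\alpha_b &= \int_0^1 \left( 1 - (1-x)^b - bx(1-x)^{b-1} \right) x^{-2} \, \Lambda_0(dx) \\
&= c \int_0^1 \left( 1 - (1-x)^b - bx(1-x)^{b-1} \right) x^{-1} \, dx \\
&\le c \int_0^1 \left( 1 - (1-x)^b \right) x^{-1} \, dx \\
&= c \int_0^{1/b} \left( 1 - (1-x)^b \right) x^{-1} \, dx + c \int_{1/b}^1 \left( 1 - (1-x)^b \right) x^{-1} \, dx \\
&\le c \int_0^{1/b} b \, dx + c \int_{1/b}^1 x^{-1} \, dx = c(1 + \log b), \tag{4.8}
\end{align*}
which implies (4.5).

Example 4.3. Although (4.5) holds in the natural cases given in Examples 2.3 and 2.4, we show here that it does not hold for all coalescents. Suppose $\Lambda = \delta_0 + \Lambda_0$, where $\Lambda_0$ is the uniform distribution on $(0, 1)$. Note that there exists a constant $C > 0$ such that if $Z \sim \text{Binomial}(b, x)$ with $x \ge 1/b$ and $b \ge 2$, then $P(Z \ge 2) \ge C$. Therefore,
\begin{align*}
\alpha_b &= \int_0^1 \left( 1 - (1-x)^b - bx(1-x)^{b-1} \right) x^{-2} \, dx \\
&\ge \int_{1/b}^1 \left( 1 - (1-x)^b - bx(1-x)^{b-1} \right) x^{-2} \, dx \ge C \int_{1/b}^1 x^{-2} \, dx = C(b-1),
\end{align*}
so (4.5) does not hold in this case.

Proof of Proposition 4.1. When the coalescent has $n+1$ blocks, the probability that the next coalescence event will take the coalescent down to fewer than $n$ blocks is at most $[\lambda_{n+1} - \binom{n+1}{2}]/\lambda_{n+1}$. Therefore, if $b \le n$, then
\[ |G_{n+1}(b) - G_n(b)| \le \frac{\lambda_{n+1} - \binom{n+1}{2}}{\lambda_{n+1}} = \frac{\alpha_{n+1}}{\lambda_{n+1}} \le \frac{2\alpha_{n+1}}{n(n+1)}. \tag{4.9} \]
Therefore, when (4.5) holds, the sequence $(G_n(b))_{n=b}^{\infty}$ is Cauchy and thus has a limit $G_\infty(b)$. It follows from (4.2) and (4.3) that
\begin{align*}
E[S_n] - \theta h_{n-1} &= \frac{\theta}{2} \sum_{b=2}^{n} b \lambda_b^{-1} G_n(b) - \frac{\theta}{2} \sum_{b=2}^{n} b \binom{b}{2}^{-1} \\
&= -\frac{\theta}{2} \sum_{b=2}^{n} b \left( \binom{b}{2}^{-1} - \lambda_b^{-1} \right) - \frac{\theta}{2} \sum_{b=2}^{n} b \lambda_b^{-1} (1 - G_\infty(b)) + \frac{\theta}{2} \sum_{b=2}^{n} b \lambda_b^{-1} (G_n(b) - G_\infty(b)). \tag{4.10}
\end{align*}
To prove Proposition 4.1, we need to take the limit as $n \to \infty$ of the three terms on the right-hand side of (4.10).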

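For the first term, we note that
\[ \binom{b}{2}^{-1} - \lambda_b^{-1} = \frac{\lambda_b - \binom{b}{2}}{\lambda_b \binom{b}{2}} = \frac{\alpha_b}{\lambda_b \binom{b}{2}} \le \frac{4\alpha_b}{b^2 (b-1)^2}. \]
Therefore, when (4.5) holds, we have a summable series and
\[ \lim_{n \to \infty} \frac{\theta}{2} \sum_{b=2}^{n} b \left( \binom{b}{2}^{-1} - \lambda_b^{-1} \right) = \frac{\theta}{2} \sum_{b=2}^{\infty} b \left( \binom{b}{2}^{-1} - \lambda_b^{-1} \right). \tag{4.11} \]
For the second term, note that (4.9) and the fact that $G_b(b) = 1$ imply
\begin{align*}
\sum_{b=2}^{\infty} b \lambda_b^{-1} (1 - G_\infty(b)) &\le \sum_{b=2}^{\infty} b \binom{b}{2}^{-1} \sum_{m=b}^{\infty} \frac{2\alpha_{m+1}}{m(m+1)} \\
&= \sum_{m=2}^{\infty} \frac{4\alpha_{m+1}}{m(m+1)} \sum_{b=2}^{m} \frac{1}{b-1} \le \sum_{m=2}^{\infty} \frac{4\alpha_{m+1}}{m(m+1)} (1 + \log(m-1)),
\end{align*}
which is finite by (4.5). Therefore,
\[ \lim_{n \to \infty} \frac{\theta}{2} \sum_{b=2}^{n} b \lambda_b^{-1} (1 - G_\infty(b)) = \frac{\theta}{2} \sum_{b=2}^{\infty} b \lambda_b^{-1} (1 - G_\infty(b)). \tag{4.12} \]
Finally, for the third term,
\begin{align*}
\limsup_{n \to \infty} \sum_{b=2}^{n} b \lambda_b^{-1} |G_n(b) - G_\infty(b)| &\le \limsup_{n \to \infty} \sum_{b=2}^{n} \frac{2}{b-1} \sum_{m=n}^{\infty} \frac{2\alpha_{m+1}}{m(m+1)} \\
&\le \limsup_{n \to \infty} 4(1 + \log(n-1)) \sum_{m=n}^{\infty} \frac{\alpha_{m+1}}{m(m+1)} \\
&\le \limsup_{n \to \infty} \frac{4(1 + \log(n-1))}{\log n} \sum_{m=n}^{\infty} \frac{\alpha_{m+1} \log m}{m(m+1)} = 0 \tag{4.13}
\end{align*}
by (4.5). The proposition follows from (4.10), (4.11), (4.12), and (4.13).

5 The number of singletons

Fu and Li (1993) proposed another test to detect departures from Kingman's coalescent. They considered the ancestral tree in which the leaves are the $n$ individuals in the sample. They defined the branches connecting a leaf to an internal node to be external branches and the other branches to be internal branches. Let $\eta_e$ denote the number of mutations on external branches, and let $\eta_i$ be the number of mutations on internal branches. Every mutation produces a segregating site, so $\eta_e + \eta_i = S_n$. If a mutation occurs on an external branch, the mutant gene appears on just one of the $n$ individuals in the sample, while if a mutation occurs on an internal branch, the mutant gene appears on between $2$ and $n-1$ of the individuals in the sample. Therefore, to determine $\eta_e$, we simply count the number of mutations that appear on just one of the sampled chromosomes.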
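The counting rule for $\eta_e$, together with the definitions of $S_n$ and $\Delta_n$ from section 4, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample is coded as a 0/1 matrix with one row per chromosome and one column per site, each site has mutated at most once, and 1 marks the mutant allele (which presumes the ancestral state is known). The function names are hypothetical and do not come from the paper.

```python
def harmonic(m):
    """Return h_m = sum_{i=1}^m 1/i."""
    return sum(1.0 / i for i in range(1, m + 1))


def site_statistics(sample):
    """Return (S_n, Delta_n, eta_e, eta_i) for a 0/1 sample matrix.

    Assumption: rows are sampled chromosomes, columns are sites,
    and 1 marks the mutant allele at a site.
    """
    n = len(sample)
    columns = zip(*sample)
    # Keep the mutant count of every site where both alleles are present.
    counts = [c for c in (sum(col) for col in columns) if 0 < c < n]
    s_n = len(counts)
    # A mutation carried by exactly one chromosome lies on an external branch.
    eta_e = sum(1 for c in counts if c == 1)
    eta_i = s_n - eta_e
    # A site with c mutant copies separates c * (n - c) of the pairs.
    delta_n = sum(c * (n - c) for c in counts) / (n * (n - 1) / 2)
    return s_n, delta_n, eta_e, eta_i


def numerators(sample):
    """Return the numerators of Tajima's D and of Fu and Li's D."""
    n = len(sample)
    s_n, delta_n, eta_e, _ = site_statistics(sample)
    h = harmonic(n - 1)
    return delta_n - s_n / h, s_n - h * eta_e


# Hypothetical sample of n = 4 chromosomes and 4 sites.
example = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
print(site_statistics(example))
print(numerators(example))
```

The normalizing constants in the denominators of the two statistics are deliberately left out, since their values are not given in this section.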

Note that unless an outgroup is available, it will not be possible to distinguish between a mutation that appears on one of the sampled chromosomes and a mutation that appears on $n-1$ of the sampled chromosomes. Fu and Li (1993) proposed a modification of their test for when there is no outgroup, but for the analysis in this section, we assume that we have an outgroup that enables us to make this distinction.

Let $J_n$ be the sum of the lengths of the external branches. In terms of the associated coalescent process, $J_n$ is the sum, over $i$ between $1$ and $n$, of the amount of time that the integer $i$ is in a singleton block. Let $I_n$ be the sum of the lengths of the internal branches. Assuming, as before, that mutations occur at rate $\theta/2$ on the time scale of the coalescent process, we have $E[\eta_e \mid J_n] = (\theta/2) J_n$ and $E[\eta_i \mid I_n] = (\theta/2) I_n$.

Fu and Li's D-statistic is based on comparing $\eta_i$ with $(h_{n-1} - 1)\eta_e$. Note that $\eta_i - (h_{n-1} - 1)\eta_e = S_n - h_{n-1}\eta_e$. To see that this has mean zero when the ancestral tree is given by Kingman's coalescent, we follow the explanation on p. 163 of Durrett (2002). In the case of Kingman's coalescent, (4.3) gives $E[S_n] = \theta h_{n-1}$. Therefore, $E[S_n - h_{n-1}\eta_e] = \theta h_{n-1} - \theta h_{n-1} E[J_n]/2$, so it remains to show that $E[J_n] = 2$.

Let $K_n$ be the amount of time that the integer 1 is in a singleton block of the partition, so $E[J_n] = n E[K_n]$. Let $T_n$ be the amount of time before the first coalescence event, and note that $E[T_n] = 2/[n(n-1)]$. The probability that 1 coalesces with another integer at time $T_n$ is $2/n$, and this event is independent of $T_n$. If 1 does not coalesce at this time, then the expected additional time that 1 is a singleton is $E[K_{n-1}]$. Therefore, we get the recursion
\[ E[K_n] = \frac{2}{n} E[T_n] + \frac{n-2}{n} E[T_n + K_{n-1}] = \frac{2}{n(n-1)} + \frac{n-2}{n} E[K_{n-1}]. \]
Note that $E[K_2] = 1$, and then it is easy to show by induction that $E[K_n] = 2/n$ for all $n \ge 2$, and so $E[J_n] = 2$ for all $n \ge 2$, as claimed.
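The induction can be checked mechanically. The following sketch iterates the recursion for $E[K_n]$ in exact rational arithmetic, starting from $E[K_2] = 1$, so that the identity $E[K_n] = 2/n$ can be compared without rounding error. The function name is hypothetical.

```python
from fractions import Fraction


def expected_singleton_time(n):
    """E[K_n] under Kingman's coalescent, from the recursion
    E[K_m] = 2/(m(m-1)) + ((m-2)/m) E[K_{m-1}] with E[K_2] = 1."""
    e_k = Fraction(1)
    for m in range(3, n + 1):
        e_k = Fraction(2, m * (m - 1)) + Fraction(m - 2, m) * e_k
    return e_k


for n in (2, 3, 10, 100):
    e_k = expected_singleton_time(n)
    # The second printed value is E[J_n] = n * E[K_n].
    print(n, e_k, n * e_k)
```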
We can write Fu and Li's D-statistic as
\[ D = \frac{S_n - h_{n-1}\eta_e}{\sqrt{c_n S_n + d_n S_n^2}}, \tag{5.1} \]
where, as in (4.4), $c_n$ and $d_n$ are constants chosen to make the variance of the statistic approximately one when the genealogy is given by Kingman's coalescent. Details of the variance computation are given in section 4.2 in Durrett (2002), where an error of Fu and Li (1993) is corrected.

When multiple mergers cause many lineages to coalesce at once, one expects $I_n$ to be reduced more than $J_n$ because there is still an external branch associated with each leaf, but there are fewer internal branches because of multiple mergers. This would cause Fu and Li's D-statistic to be negative. The next proposition shows that this is indeed the case.

Proposition 5.1. Let $(\Pi_n(t), t \ge 0)$ be a $\mathcal{P}_n$-valued $\Lambda$-coalescent in which $\Lambda = \delta_0 + \Lambda_0$, where $\Lambda_0(\{0\}) = 0$, or a $\mathcal{P}_n$-valued $\Xi$-coalescent in which $\Xi = \delta_{(0,0,\dots)} + \Xi_0$ and $\Xi_0(\{(0,0,\dots)\}) = 0$. Let $\alpha_b = \lambda_b - \binom{b}{2}$, and suppose (4.5) holds. Then
\[ \lim_{n \to \infty} E[S_n - h_{n-1}\eta_e] = -\rho, \tag{5.2} \]
where $\rho$ is the constant defined in (4.7).
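For a concrete feel for Propositions 4.1 and 5.1, the expectations $E[S_n]$ and $E[J_n]$ can be computed exactly for moderate $n$ by recursion on the number of blocks. The sketch below is an illustration, not the authors' procedure. It assumes a $\Lambda$-coalescent with $\Lambda = \delta_0 + a\,\delta_p$, where the mass `a` and the location `p` are hypothetical parameters chosen for the example, and it uses the standard $\Lambda$-coalescent rate at which each particular group of $k$ out of $b$ blocks merges, $\int x^{k-2}(1-x)^{b-k}\,\Lambda(dx)$, which is assumed here to be what (2.1) refers to. The array `visit` plays the role of $G_n(b)$ in (4.2).

```python
from math import comb


def merger_rates(b, a, p):
    """Total rate at which exactly k of b blocks merge, for k = 2..b,
    when Lambda = delta_0 + a * delta_p (a and p are illustrative)."""
    rates = {}
    for k in range(2, b + 1):
        # Each particular k-tuple merges at rate a * p^(k-2) * (1-p)^(b-k).
        rate = comb(b, k) * a * p ** (k - 2) * (1 - p) ** (b - k)
        if k == 2:
            rate += comb(b, 2)  # pairwise mergers from the unit mass at 0
        rates[k] = rate
    return rates


def expected_s_and_j(n, theta, a, p):
    """Return (E[S_n], E[J_n]) by recursion on the number of blocks."""
    rates = {b: merger_rates(b, a, p) for b in range(2, n + 1)}
    lam = {b: sum(rates[b].values()) for b in rates}

    # visit[b]: probability that the block count, started at n, ever equals b.
    visit = [0.0] * (n + 1)
    visit[n] = 1.0
    for m in range(n, 1, -1):
        for k, rate in rates[m].items():
            visit[m - k + 1] += visit[m] * rate / lam[m]
    e_s = theta / 2 * sum(b * visit[b] / lam[b] for b in range(2, n + 1))

    # e_k[m]: expected time that {1} stays a singleton starting from m blocks.
    e_k = [0.0] * (n + 1)
    for m in range(2, n + 1):
        total = 1.0 / lam[m]
        for k, rate in rates[m].items():
            if k < m:
                # {1} avoids a merger of k blocks with probability (m - k) / m.
                total += rate / lam[m] * (m - k) / m * e_k[m - k + 1]
        e_k[m] = total
    return e_s, n * e_k[n]


theta, n = 1.0, 50
h = sum(1.0 / i for i in range(1, n))
for a in (0.0, 0.5):
    e_s, e_j = expected_s_and_j(n, theta, a, p=0.3)
    # Second column: E[S_n] - theta * h_{n-1};
    # third column: E[S_n - h_{n-1} eta_e], using E[eta_e] = (theta/2) E[J_n].
    print(a, e_s - theta * h, e_s - h * theta / 2 * e_j)
```

With `a = 0` the process is Kingman's coalescent, so both printed differences should vanish up to rounding, which serves as a sanity check on the recursion.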

The key to the proof of this proposition is the following lemma.

Lemma 5.2. Under the assumptions of Proposition 5.1, there is a positive constant $C$ such that
\[ 0 \le 2 - E[J_n] \le \frac{C}{n} \sum_{b=2}^{n} \frac{\alpha_b}{b} \tag{5.3} \]
for all $n \ge 2$.

The first inequality in (5.3), which does not require condition (4.5), shows that the expected sum of the lengths of the external branches is never greater than 2, which means that it is largest for Kingman's coalescent. The second inequality gives a rather sharp bound on the difference. Recall that in Example 2.3, we have $\alpha_b \le s\alpha$, so $2 - E[J_n] \le C'(\log n)/n$ for some other constant $C'$. In Example 2.4, (4.8) gives $\alpha_b \le c(1 + \log b) \le c(1 + \log n)$, which implies $2 - E[J_n] \le C''(\log n)^2/n$ for some constant $C''$. Thus, in these examples, the lengths of the external branches are affected very little by multiple mergers when the sample size is large. The reason is that, in large samples, a lot of coalescence occurs very quickly, so most ancestral lines have merged with at least one other ancestral line before the first multiple merger takes place.

Proof of Lemma 5.2. We start by proving the first inequality in (5.3) by induction. As before, let $K_n$ be the amount of time that the integer 1 is in a singleton block. We need to show that $E[K_n] \le 2/n$ for all $n \ge 2$. First, note that $E[K_2] = \lambda_2^{-1} \le 1$. Now, suppose for some $n \ge 3$, we have $E[K_j] \le 2/j$ for $j = 2, \dots, n-1$, and consider $E[K_n]$. Let $T_n$ be the time of the first merger when the coalescent starts with $n$ blocks, and let $B$ be the number of blocks involved in the merger at time $T_n$. Note that $B$ is independent of $T_n$. Conditional on $B$, the probability that 1 merges with at least one other block at time $T_n$ is $B/n$. If this does not happen, then at least $n - B + 1$ blocks remain after the merger, so by the induction hypothesis, the expected time after $T_n$ that $\{1\}$ will remain a singleton is at most $2/(n - B + 1)$. Therefore,
\[ E[K_n \mid T_n, B] \le \left( \frac{B}{n} \right) T_n + \left( \frac{n-B}{n} \right) \left( T_n + \frac{2}{n-B+1} \right) = T_n + \frac{2(n-B)}{n(n-B+1)}. \]
Since $B \ge 2$, we have $(n-B)/(n-B+1) \le (n-2)/(n-1)$.
Also, $E[T_n] = \lambda_n^{-1} \le 2/[n(n-1)]$, so
\[ E[K_n] \le \frac{2}{n(n-1)} + \frac{2(n-2)}{n(n-1)} = \frac{2}{n}, \]
which proves the first inequality.

The proof of the second inequality requires a coupling argument. Let $(\Pi_n(t), t \ge 0)$ be the coalescent process defined in the statement of Proposition 5.1, and let $(\Upsilon_n(t), t \ge 0)$ be Kingman's coalescent, started from the partition of $1, \dots, n$ into singletons. We may assume that the coalescent processes $\Pi_n$ and $\Upsilon_n$ are constructed from Poisson processes $N_1$ and $N_2$ respectively on $(0, \infty) \times \mathcal{P}_n$, as described in section 3. That is, whenever $(t, \pi)$ is a point of $N_1$, the partition $\Pi_n(t)$ is the coagulation of $\Pi_n(t-)$ by $\pi$, and whenever $(t, \pi)$ is a point of $N_2$, the partition $\Upsilon_n(t)$ is the coagulation of $\Upsilon_n(t-)$ by $\pi$. Furthermore, these are the only jump times of $\Pi_n$ and $\Upsilon_n$. Let $L_1$ and $L_2$ be the intensity measures of the second coordinate for the Poisson processes $N_1$ and $N_2$ respectively. Then, for $\pi \in \mathcal{P}_n$, we have $L_2(\pi) = 1$ if $\pi$ consists of one block of size 2 and $n-2$ singletons, and $L_2(\pi) = 0$ otherwise. Also, $L_1(\pi) \ge L_2(\pi)$ for

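all $\pi \in \mathcal{P}_n$. Therefore, we may assume that the Poisson processes $N_1$ and $N_2$ are coupled such that if $(t, \pi)$ is a point of $N_2$ then $(t, \pi)$ is a point of $N_1$. The points $(t, \pi)$ in both $N_1$ and $N_2$ correspond to mergers in which two blocks coalesce at a time, while the points $(t, \pi)$ in $N_1$ but not $N_2$ correspond to multiple mergers caused by selective sweeps.

To compare the two processes, note that $K_n = \inf\{t : \{1\} \text{ is not a singleton in } \Pi_n(t)\}$, and let $K_n' = \inf\{t : \{1\} \text{ is not a singleton in } \Upsilon_n(t)\}$. We have $E[J_n] = n E[K_n]$. By our previous results for Kingman's coalescent, we have $E[K_n'] = 2/n$, and so $2 - E[J_n] = n E[K_n' - K_n]$. Let $\tau = \inf\{t : \Pi_n(t) \ne \Upsilon_n(t)\}$, where we say $\tau = \infty$ if $\Pi_n(t) = \Upsilon_n(t)$ for all $t$. For $\pi \in \mathcal{P}_n$, denote by $|\pi|$ the number of blocks in $\pi$. Since $\Pi_n(t) = \Upsilon_n(t)$ for all $t < \tau$, we have
\[ 2 - E[J_n] = n E[K_n' - K_n] \le n E[(K_n' - \tau) 1_{\{\tau < K_n'\}}] = n \sum_{b=2}^{n} E[(K_n' - \tau) 1_{\{\tau < K_n'\}} 1_{\{|\Upsilon_n(\tau)| = b\}}]. \]
For $b = 1, 2, \dots, n$, define $T_b = \inf\{t : |\Upsilon_n(t)| = b\}$. If $\tau < K_n'$ and $|\Upsilon_n(\tau)| = b$, then $K_n' > T_b$. Therefore,
\[ 2 - E[J_n] \le n \sum_{b=2}^{n} E[K_n' - \tau \mid \{\tau < K_n'\} \cap \{|\Upsilon_n(\tau)| = b\}] \, P(\{K_n' > T_b\} \cap \{|\Upsilon_n(\tau)| = b\}). \tag{5.4} \]
If $\tau < K_n'$ and $|\Upsilon_n(\tau)| = b$, then $\{1\}$ is one of $b$ blocks of $\Upsilon_n(\tau)$, and by our previous results on Kingman's coalescent, the expected time before it merges with another block is $2/b$. Thus, we have
\[ E[K_n' - \tau \mid \{\tau < K_n'\} \cap \{|\Upsilon_n(\tau)| = b\}] = \frac{2}{b}. \tag{5.5} \]
Note that $K_n' > T_b$ whenever $\{1\}$ remains a singleton at the time that Kingman's coalescent is down to $b$ blocks. Whenever the coalescent goes from $j$ blocks to $j-1$, the probability that the integer 1 is involved in the merger is $2/j$, so
\[ P(K_n' > T_b) = \prod_{j=b+1}^{n} \left( 1 - \frac{2}{j} \right) \le \exp\left( -\sum_{j=b+1}^{n} \frac{2}{j} \right) \le \exp\left( 1 - 2 \int_b^n \frac{1}{x} \, dx \right) = \frac{e b^2}{n^2}. \tag{5.6} \]
If $|\Upsilon_n(\tau)| = b$, then both $\Pi_n$ and $\Upsilon_n$ have the same blocks at time $T_b$, but at time $\tau$ the process $\Pi_n$ has a transition but $\Upsilon_n$ does not. Since the total merger rate for $\Pi_n$ after time $T_b$ is $\lambda_b = \alpha_b + \binom{b}{2}$ and the total merger rate for $\Upsilon_n$ after time $T_b$ is $\binom{b}{2}$, we have
\[ P(|\Upsilon_n(\tau)| = b \mid K_n' > T_b) \le \frac{\alpha_b}{\lambda_b} \le \frac{2\alpha_b}{b(b-1)}. \tag{5.7} \]
Combining (5.4)-(5.7), we get
\[ 2 - E[J_n] \le \sum_{b=2}^{n} \frac{4e\alpha_b}{(b-1)n} \le \frac{C}{n} \sum_{b=2}^{n} \frac{\alpha_b}{b}, \]
which is the second inequality in (5.3).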
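The probability that $\{1\}$ is still a singleton when Kingman's coalescent is down to $b$ blocks can also be evaluated directly. The sketch below multiplies the factors $1 - 2/j$ for $j = b+1, \dots, n$ in exact arithmetic and compares the result with the telescoped value $b(b-1)/(n(n-1))$, which is a side observation not stated in the text and which is at most $(b/n)^2$. The function name is hypothetical.

```python
from fractions import Fraction


def prob_singleton_survives(n, b):
    """Probability that {1} avoids every pairwise merger while
    Kingman's coalescent goes from n blocks down to b blocks."""
    prob = Fraction(1)
    for j in range(b + 1, n + 1):
        # With j blocks, the next merger involves {1} with probability 2/j.
        prob *= 1 - Fraction(2, j)
    return prob


n = 12
for b in range(2, n + 1):
    exact = prob_singleton_survives(n, b)
    closed_form = Fraction(b * (b - 1), n * (n - 1))
    print(b, exact, closed_form, exact <= Fraction(b, n) ** 2)
```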

Proof of Proposition 5.1. We have
\[ E[S_n - h_{n-1}\eta_e] = (E[S_n] - \theta h_{n-1}) + h_{n-1}(\theta - E[\eta_e]) = (E[S_n] - \theta h_{n-1}) + \frac{h_{n-1}\theta}{2} (2 - E[J_n]). \tag{5.8} \]
By Proposition 4.1, $\lim_{n \to \infty} (E[S_n] - \theta h_{n-1}) = -\rho$. It thus remains only to show that the second term on the right-hand side of (5.8) goes to zero as $n \to \infty$. Let $\epsilon > 0$. By (4.5), there exists a positive integer $N$ such that
\[ \sum_{b=N}^{\infty} \frac{\alpha_b (1 + \log b)}{b^2} < \epsilon. \]
Therefore, by Lemma 5.2, and because $h_{n-1}/n \le (1 + \log b)/b$ whenever $b \le n$,
\begin{align*}
\limsup_{n \to \infty} \frac{h_{n-1}\theta}{2} (2 - E[J_n]) &\le \limsup_{n \to \infty} \frac{C h_{n-1} \theta}{2n} \sum_{b=2}^{n} \frac{\alpha_b}{b} \\
&\le \limsup_{n \to \infty} \frac{C h_{n-1} \theta}{2n} \sum_{b=2}^{N-1} \frac{\alpha_b}{b} + \frac{C\theta}{2} \limsup_{n \to \infty} \sum_{b=N}^{n} \frac{\alpha_b h_{n-1}}{nb} \\
&\le \frac{C\theta}{2} \sum_{b=N}^{\infty} \frac{\alpha_b (1 + \log b)}{b^2} \le C\theta\epsilon.
\end{align*}
Since this is true for all $\epsilon > 0$, and since $2 - E[J_n] \ge 0$ for all $n$ by Lemma 5.2, we have
\[ \lim_{n \to \infty} \frac{h_{n-1}\theta}{2} (2 - E[J_n]) = 0, \]
which completes the proof of the proposition.

We conclude this section with some comments about the power of Tajima's D-statistic and Fu and Li's D-statistic to detect selective sweeps. The numerators of these two statistics, which are $\Delta_n - S_n/h_{n-1}$ and $S_n - h_{n-1}\eta_e$, each have mean zero when the ancestral process is Kingman's coalescent. The expected values of these two numerators both converge to a negative constant as the sample size goes to infinity when multiple mergers can occur. These statistics are used to test for departures from Kingman's coalescent. If the goal is to test for multiple mergers caused by selective sweeps, one would reject the null hypothesis of no selective sweeps if the value of the statistic is too small (i.e. more negative than would be expected with Kingman's coalescent).

A natural question, then, is how much power these tests have to detect selective sweeps. While a full analysis of this question would require a simulation study, we can obtain some insight from the analytical results presented above. From the values of $a_n$ and $b_n$ in (4.4), which can be found in section 4.1 of Durrett (2002), we see that the standard deviation of the numerator of Tajima's D-statistic is $O(1)$ when the genealogy is given by Kingman's coalescent. However, from the values of $c_n$ and $d_n$ in (5.1), which can be found in section 4.2 of Durrett (2002), we see that the numerator of Fu and Li's D-statistic has a standard deviation which is $O(\log n)$. This means that, for large $n$, moderate negative values for the numerator of Fu and Li's D-statistic are not strong evidence against the null model of Kingman's coalescent, and thus a test based on Fu and Li's D-statistic will most likely have low power. These observations are consistent with simulation results of Simonsen, Churchill, and Aquadro (1995), who found that Tajima's D-statistic has more power to detect selective sweeps than Fu and Li's D-statistic.

Neither of these tests has the desirable feature of many tests in classical statistics, which is that for all $\alpha > 0$, the power of the level $\alpha$ test tends to 1 as the sample size $n$ tends to
