The coalescent coalescence theory

The coalescet coalescece theory Peter Beerli September 1, 009 Historical ote Up to 198 most developmet i populatio geetics was prospective ad developed expectatios based o situatios of today. Most work did provide expectatios about the future. With the easy availability of geetic data retrospective aalyses did catch up oly i phylogeetics (startig i the sixties). Oly Malécot, who pioeered lookig backwards i time i 1948, developed backwards lookig results i populatio geetics. Kigma expressed this lookig backwards i time approach as the coalescece of sampled lieages. He was ot the oly oe workig o such problem at the the time as with may great solutios it was i the air, see Hudso (1983) ad Tajima (1983). The coalescet A sample of gee copies is take at the preset time ad we are iterested i the acestral relatioship of these gee copies. We express time τ icreasig the further back i real time we go: τ 1 <τ meas that τ is further i the past tha τ 1. Kigma (198) ad Ewes (004) describe this backwards i time process with equivalece classes. Two copies are i the same equivalece class at time τ whe they have a commo acestor at that time. At time τ = 0 each idividual gee ca be cosidered i its ow equivalece class ad we could express this for a sample of =8 as φ 0 = {(a), (b), (c), (d), (e), (f), (g), (h)} Kigma s -coalescet describes the moves from φ 0 to a sigle equivalece class φ = {(a, b, c, d, e, f, g, h)}. 1

ISC-5317-Fall 009 Computatioal Evolutioary Biology a b c d e f a,b d,e d,e,f a,b,c a,b,c,d,e,f Figure 1: Example of the coalescece process All idividuals are i some equivalece relatio ξ ad we ca fid a ew equivalece relatio η by joiig two of the equivalece classes i ξ. This joiig process is called a coalescece, ad a series of such joiigs is called the coalescet or coalescece process. Figure gives a example of the relatioship of a sample ad the equivalece classes describig the process. It is assumed that the probability of a coalescece depeds o the waitig time δτ Prob(process i η at time τ + δτ process i ξ at time τ) =δτ (igorig higher order terms), ad if k is the umber of equivalece classes i ξ the Prob(process i ξ at time τ + δτ process i ξ at time τ) =1 δτ =1 k δτ We will see that if we apply the right time scale to τ the we will ed up i the more familiar terms that are commo i the applied populatio geetics literature. Kigma focused o the Caig model, ad sice the Wright-Fisher model is a special case of the results carry over easily. The coalescet is a approximatio to these models because it was developed o a cotiuous time scale whereas the Caig ad Wright-Fisher populatio models have discrete time. Ay fidigs usig this coalescet machiery eeds to be rescaled to the time scale of these discrete time models. I the coalescet framework oe has oly a sigle coalescet per ifiitesimal time period. This forces us to restrict the use of the coalescet to discrete time models were we ca guaratee that there is ot more tha oe coalescet evet occurrig per time period. For example, for a Wright-Fisher populatio we ca allow oly oe coalescet evet per geeratio. this soud rather restrictive but as log as the sample is much smaller tha the

ISC-5317-Fall 009 Computatioal Evolutioary Biology populatio N this situatio rarely occurs. Fu (006) calculates that a sample eeds to be less tha the square-root of N < N, whe we use the coalescet for a model such as the Wright-Fisher or some versio of the Caig model. Joe Felsestei shows this for the discrete case i the Wright-Fisher model i his course populatio geetics (http://evolutio.geetics.washigto.edu/pgbook/pgbook.html). The discrete coalescet ad the Wright-Fisher populatio model I the Wright-Fisher model we ca calculate the probability that two radom idividuals have a commo acestor oe geeratio earlier, very simply. Pickig two idividuals at radom, ad the decide the chace of havig a commo acestor: the first idividual has with probability 1 a acestor i the last geeratio ad the other had with probability 1/(N) the same acestor. I tur we also kow the probability that the two idividuals have o commo acestor i the last geeratio: 1 1/(N). Whe we have idividuals i the sample the we ca have pairs of idividuals that coalesce with a chace of 1/(N), therefore we have ever geeratio chace to see a coalescece of two idividuals is N = ( 1) 4N This rate of coalescece i the discrete Wright-Fisher model is similar to a coi toss where we wait with some rate (that the head appears), this looks like a geometric series, but i the case of the coalescece the series does ot use a fixed rate, but the rate chages because the coalescet evet reduces the umber i the sample by 1. The expected value of a geometric is approximately 1/rate, therefore the waitig time to reduce from lieages to 1 lieages is E(T ( 1 ) )= (1) 4N ( 1) =4N(1 1 ) () To reach the fial coalescet (the most recet commo acestor) we eed to wait util all lieages have coalesced we simply add up all the waitig times E(T MRCA )=4N k= 1 (3) 3

ISC-5317-Fall 009 Computatioal Evolutioary Biology The (cotious) coalescet ad the Wright-Fisher populatio model The coalescet process is i effect a sequece of 1 Poisso processes 1, with rates r k =,k =, 1,,..., describig the Poisso process at which two of the equivalece classes merge whe there are k equivalece classes. Sice these evets are comig from a Poisso distributio we ca calculate the expectatio for each iterval, which is 1/rate, here /(). All mergers of the equivalece classes are idepedet of each other so the expectatio of the whole coalescece process is E(T MRCA )= k= = 1 Compariso of the cotet of the sum with the coalescece rate makes clear that the Wright-Fisher populatio model is the stadard coalescet uits ad we eed to multiply by N to arrive at the more familiar geeratio time scale. k= The coalescet process results i a tree of a sample of idividuals. We call this ofte a geealogy as the the idividuals are typically from the same species or populatio (hece populatio size). Ofte the probability of the coalescet process for the Wright-Fisher populatio model is expressed as p(g N) = k= e u k k(k 1) 4N 4N, where u is expressed i geeratios. The expected T MRCA is the same as above. Ofte we will ot use a time scale i geeratios but geeratios mutatio rate ad the we would express the above formula as p(g Θ) = k= e u k k(k 1) Θ where Θ is 4 N e µ, withµ as the mutatio rate per geeratio ad site (whe usig sequece data), ad N e as the effective populatio size. Θ, Uder a strict Wright-Fisher populatio model 1 From Mathworld: A Poisso process is a process satisfyig the followig properties: 1. The umbers of chages i o-overlappig itervals are idepedet for all itervals.. The probability of exactly oe chage i a sufficietly small iterval is, where is the probability of oe chage ad is the umber of trials. 3. The probability of two or more chages i a sufficietly small iterval is essetially 0. I the limit of the umber of trials becomig large, the resultig distributio is called a Poisso distributio. 4

ISC-5317-Fall 009 Computatioal Evolutioary Biology Figure : Example of a coalescece structure i the Wright-Fisher model N = N e, but uder more biological scearios oe eeds to kow more about the life history of the species to traslate the N e ito real umbers. The coalescet ad the Mora populatio model The coalescet is a exact represetatio of the Mora model because the problems with the multiple coalescet evets i oe geeratio do ot occur. The Mora model allows oly oe lieage to chage at a give time. Therefore the limitatio to small sample size as we have see for the Wright-Fisher model is ot eeded. Usig our fidigs of the discussio of the Mora model earlier, but istead of thikig forward i time, thik backward i time. Lookig backwards we see that the Mora process is similar structured like the coalescece process. We have idividuals that are reduced i their acestry to 1,,... ad evetually to oe gee, the most commo recet acestor of the sampled idividuals. Assume that we are at a time where we have a sample of k idividuals, these are descedats of k 1 parets of oe of these parets was chose to reproduce ad the offsprig is 5

ISC-5317-Fall 009 Computatioal Evolutioary Biology i acestry of the sample of gees. The probability of this evet is ad with probability 1 (N), (N), the acestors remai at j. Tracig back this acestry the umber of death ad birth evets betwee the times whe there are j ad j 1 acestors follows a geometric distributio with parameters /(N) ad thus has a mea of E(u j )= (N) ow we ca assemble the expectatio for the time to the most recet commo acestor E(T MRCA )= k= =(N) (N) =(N) (1 1 ) k= 1 (4) If we assume that we sampled the whole populatio, where =N tha we derive the same result as with stadard (forward) theory. E(T MRCA )=(N) (1 1 )=(N) (1 1 )=N(N 1) N We also ca make the same observatio as we made with the Wright-Fisher populatio model. The coalescece time scale is by a factor (N) differet from the Mora model time scale (formula 4). We ca express the probability of the geealogy uder the Mora model usig the fact that the expoetial distributio is a good approximatio to the geometric distributio p(g N) = k= e u k k(k 1) (N) 1 (N), where u is expressed i geeratios. The expected T MRCA is the same as above. Ofte we will ot use a time scale i geeratios but geeratios mutatio rate ad the we would express the above formula as p(g Θ) = k= e u k k(k 1) Θ 4 Θ, where Θ is 4 N e µ, withµ as the mutatio rate per geeratio ad site (whe usig sequece data), ad N e as the effective populatio size. Uder a strict Wright-Fisher populatio model N = N e, but uder more biological scearios oe eeds to kow more about the life history of the species to traslate the N e ito real umbers. 6