The coalescent process

Introduction

Random drift can be seen in several ways:
- Forwards in time: variation in allele frequency
- Backwards in time: a process of inbreeding/coalescence

Allele frequencies

Random variation in reproduction causes random fluctuations in allele frequency:

$\mathrm{var}(\delta p) = \frac{pq}{2 N_e}$

After many generations, the distribution can be approximated by a diffusion. With random drift and mutation (P→Q at rate $\mu$, Q→P at rate $\nu$) the equilibrium distribution is:

$\mathrm{prob}(p) \sim p^{4 N_e \nu - 1} q^{4 N_e \mu - 1}$

[Figure: two plots of the equilibrium distribution of p under mutation and drift, for two values of $N_e$.]

The diffusion approximation can also include other forces, such as selection and migration. For example, the equilibrium distribution under mutation, random drift, and selection is:

$\mathrm{prob}(p) \sim p^{4 N_e \nu - 1} q^{4 N_e \mu - 1} \bar{W}^{2 N_e}$

With heterozygote advantage (fitnesses $1-s : 1 : 1-s$), $\bar{W} = 1 - s(p^2 + q^2)$, so that $\bar{W}^{2 N_e} \sim \exp[-2 N_e s (p^2 + q^2)]$.

[Figure: three plots of the equilibrium distribution of p under heterozygote advantage, for three values of s (left to right).]
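As a rough numerical illustration of the mutation-drift equilibrium above, the sketch below evaluates the density $p^{4 N_e \nu - 1} q^{4 N_e \mu - 1}$, normalised as a Beta distribution. The function name and the parameter values are assumptions chosen for illustration; they are not the values used for the original plots.

```python
import math

def equilibrium_density(p, Ne, mu, nu):
    """Normalised density of allele frequency p at mutation-drift equilibrium.

    Assumes prob(p) ~ p**(4*Ne*nu - 1) * (1 - p)**(4*Ne*mu - 1),
    i.e. a Beta(4*Ne*nu, 4*Ne*mu) distribution.
    """
    a = 4 * Ne * nu  # shape parameter attached to p (mutation Q -> P at rate nu)
    b = 4 * Ne * mu  # shape parameter attached to q (mutation P -> Q at rate mu)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

# Hypothetical values: a small population (4*Ne*mu < 1) gives a U-shaped
# density, a large one (4*Ne*mu > 1) a density peaked at intermediate p.
for Ne in (1_250, 50_000):
    values = [equilibrium_density(p, Ne, mu=1e-4, nu=1e-4) for p in (0.01, 0.5, 0.99)]
    print(Ne, values)
```

The contrast between the two population sizes reflects the point made in the notes: the shape depends on the products $N_e \mu$ and $N_e \nu$, not on the rates alone.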

The key parameters are $N_e \mu$, $N_e \nu$, $N_e s$, which give the strength of drift relative to mutation and selection.

Further reading: Kimura, The neutral theory of molecular evolution, Chap. 3

Identity by descent

Definition

Wright (1921, 1922), Haldane & Moshinsky (1939), Cotterman (1940) and Malécot (1948) developed the idea of identity by descent. Two genes are identical by descent if they descend from the same gene in some ancestral population.

Note:
- Identity by descent is distinct from identity in state.
- i.b.d. is defined relative to some ancestral reference population.
- Identity measures can extend to many genes; usually, however, we just deal with identity between pairs of genes. This is related to variance of allele frequency, correlation between genes, and homozygosity.
- Relationships among many genes are better thought of in terms of coalescence of lineages in a genealogy.

The probability of identity by descent is easily calculated for pedigrees, e.g. brother-sister mating:

[Figure: pedigree for brother-sister mating, including a case labelled "Genes are NOT ibd in this case".]

The probability of identity by descent is 1/4.

In general, the probability that two distinct genes in a diploid individual are i.b.d. is

$f = \sum_{\mathrm{loops}} \left(\frac{1}{2}\right)^{n-1} (1 + f_A)$,

where the sum is over all loops in the pedigree, n is the number of individuals in the loop, and $f_A$ is the identity between genes in the common ancestor. Note that the random element here is in segregation, not reproduction.
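The loop formula can be checked on the brother-sister example. The sketch below assumes the form $f = \sum (1/2)^{n-1}(1 + f_A)$, with n counting every individual in the loop including the inbred individual itself (so each loop through one parent of the sibs has n = 4). The function name and the input format are hypothetical.

```python
from fractions import Fraction

def inbreeding_from_loops(loops):
    """Sum (1/2)**(n - 1) * (1 + f_A) over pedigree loops.

    loops: iterable of (n, f_A) pairs, where n is the number of individuals
    in the loop (assumed to include the inbred individual) and f_A is the
    identity between genes in the common ancestor of that loop.
    """
    return sum(Fraction(1, 2) ** (n - 1) * (1 + Fraction(f_A)) for n, f_A in loops)

# Offspring of a brother-sister mating: one loop through each of the two
# shared parents (offspring, brother, parent, sister), parents not inbred.
sib_mating = [(4, 0), (4, 0)]
print(inbreeding_from_loops(sib_mating))  # 1/4
```

Exact fractions are used so that the result can be compared directly with the value 1/4 quoted in the notes.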

The increase in i.b.d. with random mating

Wright-Fisher model

Suppose that there are $N_t$ individuals in a haploid population. In the next generation, there are $N_{t+1}$, drawn randomly from all $N_t$ possible parents. Under this scheme, the number of offspring per individual is close to a Poisson distribution. The Wright-Fisher model also applies to a random-mating diploid population, provided that individuals are as likely to mate with themselves as with anyone else. Then, the probability that two genes are i.b.d. from the previous generation is $1/N_t$:

$f_{t+1} = \frac{1}{N_t} + \left(1 - \frac{1}{N_t}\right) f_t, \qquad f_0 = 0$

$h_{t+1} \equiv 1 - f_{t+1} = \left(1 - \frac{1}{N_t}\right) h_t$, hence $h_t = \prod_{i=0}^{t-1} \left(1 - \frac{1}{N_i}\right)$

With constant population size, $h_t$ declines by a factor $(1 - 1/N)$ per generation, approximately as $\exp(-t/N)$. The typical timescale for inbreeding and random drift is N generations.

With fluctuating sizes, $h_t$ declines (approximately) as $\exp\left(-\sum_{i=0}^{t-1} \frac{1}{N_i}\right) = \exp(-t/N_H)$, where $N_H$ is the harmonic mean population size.

Coalescence

The ancestry of a sample of neutral genes has a simple statistical distribution: the chance that any two lineages coalesce is $\frac{1}{2 N_e}$ per generation.

More precisely:
- Suppose that each gene leaves $\nu$ descendants.
- As $N \to \infty$, the probability that any pair of lineages coalesce, per generation, tends to $\frac{\mathrm{var}(\nu)}{2N}$, i.e. $N_e = N / \mathrm{var}(\nu)$.

The coalescent process refers to this limit, which is equivalent to the diffusion approximation.

An influential idea:
- DNA sequences are best described by their genealogy

- a variety of mutation models can be superimposed
- tracing back samples of alleles
- speeds up simulations
- gives statistical tests on sampled data

References

Hudson, R. (1990). Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7, 1-44.
Hudson, R. (1993). The how and why of generating gene genealogies. In Mechanisms of molecular evolution, ed. Takahata N & Clark AG, pp 23-36.
Donnelly, P. and S. Tavaré. (1995). Coalescents and genealogical structure under neutrality. Ann. Rev. Genet. 29, 401-421.
Rosenberg, N. A., and M. Nordborg. (2002). Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3, 380-390.

Properties of the coalescent process

The time during which there are k lineages is exponentially distributed with expectation $\frac{1}{\lambda} = \frac{2 N_e}{k(k-1)/2}$:

$P(t_k) = \exp(-\lambda t_k) \, \lambda \, dt_k$, where $\lambda = \frac{k(k-1)}{4 N_e}$

The genealogy is dominated by the deepest split

The expected depth of the tree is:

$2 N_e \left( \frac{2}{k(k-1)} + \frac{2}{(k-1)(k-2)} + \cdots + \frac{1}{6} + \frac{1}{3} + 1 \right) = 4 N_e \left( \left(\frac{1}{k-1} - \frac{1}{k}\right) + \left(\frac{1}{k-2} - \frac{1}{k-1}\right) + \cdots + \left(\frac{1}{2} - \frac{1}{3}\right) + \left(1 - \frac{1}{2}\right) \right) = 4 N_e \left(1 - \frac{1}{k}\right) \sim 4 N_e$ for large k.

Thus, the tree collapses to 2 lineages in ~$2 N_e$ generations; these take another $2 N_e$ generations to coalesce. Hence, pairwise measures are uninformative.

The expected length of the genealogy is ~ $4 N_e \log(1.78 k)$

The expected length of the tree is:

$2 N_e \left( k \frac{2}{k(k-1)} + (k-1) \frac{2}{(k-1)(k-2)} + \cdots + 4 \cdot \frac{1}{6} + 3 \cdot \frac{1}{3} + 2 \right) = 4 N_e \left( \frac{1}{k-1} + \frac{1}{k-2} + \cdots + \frac{1}{3} + \frac{1}{2} + 1 \right) = 4 N_e \sum_{j=1}^{k-1} \frac{1}{j} \sim 4 N_e \log(1.78 k)$ for large k.

The distribution of length is highly variable:

[Figure: distribution of tree length L against sample size n.] The dots show the quantiles at 0.001, 0.01, 0.1, 0.9, 0.99, 0.999.
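The two sums above are easy to evaluate directly. The following sketch, with hypothetical function names and an arbitrary $N_e$, computes the expected depth and total length of the genealogy from the expected interval times $4 N_e / (j(j-1))$, and prints the large-k approximation $4 N_e \log(1.78 k)$ alongside for comparison.

```python
import math

def expected_interval(j, Ne):
    """Expected time (in generations) during which there are j lineages."""
    return 4 * Ne / (j * (j - 1))

def expected_depth(k, Ne):
    """Expected time to the most recent common ancestor of k genes."""
    return sum(expected_interval(j, Ne) for j in range(2, k + 1))

def expected_length(k, Ne):
    """Expected total branch length: each interval is counted once per lineage."""
    return sum(j * expected_interval(j, Ne) for j in range(2, k + 1))

Ne = 10_000  # hypothetical effective population size
for k in (2, 10, 50):
    approx_length = 4 * Ne * math.log(1.78 * k)
    print(k, expected_depth(k, Ne), expected_length(k, Ne), approx_length)
```

For k = 2 the depth is $2 N_e$, half of the large-k limit of $4 N_e$, which is the sense in which the deepest split dominates the genealogy.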

Fluctuating population size

Changes in $N_e$ cause changes in timescale.

[Figure: the standard coalescent.]

Expanding populations → "star phylogeny"

[Figure: exponential growth, where the population was a small fraction of its current size at $T_{MRCA}$.]

Population bottlenecks → burst of coalescence

[Figure: a bottleneck equivalent to $N_e$ 'ordinary' generations of drift.]

Changing timescales

The "scaled time" is a measure of the total amount of genetic drift that has occurred:

$T = \int_0^t \frac{dt'}{2 N(t')}$

For a constant population size, $T = t / (2N)$. If the population is growing at a rate $\lambda$, and the present size is $N_0$, then $N = N_0 e^{-\lambda t}$ (with t measured back from the present), and so:

$T = \int_0^t \frac{e^{\lambda t'}}{2 N_0} \, dt' = \frac{1}{2 N_0 \lambda} \left( e^{\lambda t} - 1 \right)$

The parameter $\lambda$ is a measure of the amount of population growth over the current timescale set by population size, $N_0$.

[Figure: the transformation from actual time to scaled time for one value of $\lambda$.]
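A minimal sketch of the scaled-time calculation, assuming the definitions above: in discrete generations the integral becomes a sum of $1/(2N_i)$, which equals $t/(2N_H)$ with $N_H$ the harmonic mean, and for exponential growth the closed form $(e^{\lambda t} - 1)/(2 N_0 \lambda)$ applies. Function names and all numerical values are hypothetical.

```python
import math

def scaled_time_discrete(sizes):
    """Scaled time accumulated over generations with population sizes N_i."""
    return sum(1.0 / (2 * N) for N in sizes)

def harmonic_mean(sizes):
    return len(sizes) / sum(1.0 / N for N in sizes)

def scaled_time_growth(t, N0, lam):
    """Closed form for N(t) = N0 * exp(-lam * t), t measured back from the present."""
    return math.expm1(lam * t) / (2 * N0 * lam)

# A brief bottleneck contributes most of the drift in this hypothetical history.
sizes = [1000, 100, 1000, 10_000]
print(scaled_time_discrete(sizes), len(sizes) / (2 * harmonic_mean(sizes)))

# Exponential growth: compare the closed form with a generation-by-generation sum.
N0, lam, t = 10_000, 0.01, 100
by_generation = scaled_time_discrete(N0 * math.exp(-lam * (i + 0.5)) for i in range(t))
print(scaled_time_growth(t, N0, lam), by_generation)
```

The generation-by-generation sum evaluates the size at the midpoint of each generation, which is one of several reasonable discretisations.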

Branching processes

The coalescent process only applies to samples from a large population. If all genes are observed, we have a branching process, e.g. in discrete time: the number of offspring, i, follows a Poisson distribution with $E[i] = \lambda$.

[Figure: the probability of survival, P, against $\lambda$.]

More generally, for $\lambda \sim 1$, $P \sim 2(\lambda - 1)/\mathrm{var}(i)$.
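The survival probability of a Poisson branching process can be found numerically. The sketch below uses the standard result that the extinction probability Q = 1 - P is the smallest root of $Q = e^{\lambda (Q - 1)}$; that equation is not written out in the notes, and the function name is hypothetical. The result is compared with the approximation $2(\lambda - 1)/\mathrm{var}(i)$, using var(i) = $\lambda$ for a Poisson distribution.

```python
import math

def survival_probability(lam, tol=1e-14, max_iter=1_000_000):
    """Survival probability of a branching process with Poisson(lam) offspring.

    Iterates Q -> exp(lam * (Q - 1)) from Q = 0, which converges to the
    smallest root, i.e. the extinction probability. Returns 0 for lam <= 1,
    where extinction is certain.
    """
    if lam <= 1:
        return 0.0
    q = 0.0
    for _ in range(max_iter):
        q_next = math.exp(lam * (q - 1.0))
        if abs(q_next - q) < tol:
            break
        q = q_next
    return 1.0 - q_next

for lam in (1.05, 1.5, 2.0):
    print(lam, survival_probability(lam), 2 * (lam - 1) / lam)
```

The approximation is only expected to be close when $\lambda$ is near 1, as stated in the notes.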

[Figure: two genealogies of 20 genes (labelled a-t): one from the coalescent, and one for a sample from a branching process.]

Mutation

Infinite alleles

Assuming that every mutation generates a new allele, the probability of identity in allelic state ("homozygosity") is $F = \sum_t f_t (1 - \mu)^{2t}$, where $f_t$ is the distribution of coalescence times.

$F \sim E[e^{-2 \mu t}] = \int_0^\infty e^{-2 \mu t} f_t \, dt = \int_0^\infty e^{-2 \mu t} e^{-t/(2 N_e)} \frac{dt}{2 N_e} = \frac{1}{1 + 4 N_e \mu}$

Identity coefficients, F, can easily be calculated by going back in time one generation:

$F = (1 - \mu)^2 \left( \left(1 - \frac{1}{2 N_e}\right) F + \frac{1}{2 N_e} \right) \Rightarrow F = \frac{(1 - \mu)^2 \frac{1}{2 N_e}}{1 - \left(1 - \frac{1}{2 N_e}\right) (1 - \mu)^2} \sim \frac{1}{1 + 4 N_e \mu}$

Identity coefficients are generating functions for the distribution of coalescence times:

$F \sim E[e^{-2 \mu t}] \therefore F = 1$ when $\mu = 0$

$\frac{dF}{d\mu} \sim E[-2 t \, e^{-2 \mu t}] \therefore \frac{dF}{d\mu} = -2 E[t]$ when $\mu = 0$

$\frac{d^2 F}{d\mu^2} \sim E[4 t^2 e^{-2 \mu t}] \therefore \frac{d^2 F}{d\mu^2} = 4 E[t^2]$ when $\mu = 0$

More general models of mutation

Bases mutate at rate $\mu$, and change to A, T, G, C with equal probability. The probability of identity in state of two genes is:

$F = E\left[ \frac{1}{4} \left(1 - e^{-2 \mu t}\right) + e^{-2 \mu t} \right]$

Infinite sites

For DNA sequences, the 'infinite sites' model is more appropriate: each mutation is at a new site in the sequence. Two alleles may differ by mutations at 1, 2, ... sites, giving a measure of the time for which they have been diverging. If there are mutations on every internal branch, the genealogy can be reconstructed:

[Figure: genealogy of the genes a, e, d, b, c.]
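The one-generation recursion for the identity coefficient F under infinite alleles, given earlier in this section, can be iterated directly and compared with its fixed point and with the approximation $1/(1 + 4 N_e \mu)$. This is a minimal sketch; the function names and parameter values are assumptions.

```python
def identity_exact(Ne, mu):
    """Fixed point of F = (1 - mu)^2 * ((1 - 1/(2 Ne)) * F + 1/(2 Ne))."""
    c = 1.0 / (2 * Ne)
    keep = (1 - mu) ** 2  # probability that neither gene mutates in one generation
    return keep * c / (1 - keep * (1 - c))

def identity_by_iteration(Ne, mu, generations):
    """Iterate the recursion forward from F = 0."""
    c = 1.0 / (2 * Ne)
    F = 0.0
    for _ in range(generations):
        F = (1 - mu) ** 2 * ((1 - c) * F + c)
    return F

def identity_approx(Ne, mu):
    return 1.0 / (1 + 4 * Ne * mu)

Ne, mu = 1000, 1e-4  # hypothetical values, so that 4 * Ne * mu = 0.4
print(identity_exact(Ne, mu), identity_by_iteration(Ne, mu, 100_000), identity_approx(Ne, mu))
```

The three values agree closely when $\mu$ and $1/N_e$ are both small, which is the regime in which the continuous-time approximation is intended to hold.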

[Table: genes a-e (rows) scored 0/1 for mutations 1-6 (columns).]

To root the tree, we must know which mutations are derived, which requires an outgroup.

Any pair of sites which carried all four combinations is incompatible with a tree, which implies:
- recombination
- multiple mutations

The mean pairwise diversity, $\pi$, is just $E[2 \mu t] = 4 N_e \mu$.

The number of segregating sites, $n_s$, in a sample is proportional to the total length of the tree: $E[n_s] = \mu E[L]$, where $L = \sum_{j=2}^{k} j \, t_j$.

$E[n_s] = E[\mu L] = 4 N_e \mu \left( \frac{1}{k-1} + \frac{1}{k-2} + \cdots + \frac{1}{3} + \frac{1}{2} + 1 \right) \sim 4 N_e \mu \log(1.78 k)$

Under neutrality, we expect a definite relation between the number of segregating sites and the pairwise diversity.

Recombination

Ancestral graphs

With sexual reproduction, genomes have multiple ancestors. Ancestry is described by an ancestral graph.
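The summary statistics of the infinite-sites section can be computed from a 0/1 matrix of sequences. The sketch below, on a small made-up sample (not the table from the notes), counts segregating sites, computes the mean pairwise diversity, converts $n_s$ to an estimate of $4 N_e \mu$ by dividing by $\sum_{j=1}^{k-1} 1/j$ (often called Watterson's estimator), and applies the four-combination check for a pair of sites. All function names are hypothetical.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Number of sites at which the sample is polymorphic."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def pairwise_diversity(seqs):
    """Mean number of differences between pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(x != y for x, y in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / len(pairs)

def theta_from_segregating_sites(seqs):
    """n_s divided by sum_{j=1}^{k-1} 1/j, an estimate of 4 Ne mu."""
    a_k = sum(1.0 / j for j in range(1, len(seqs)))
    return segregating_sites(seqs) / a_k

def four_combinations_present(seqs, i, j):
    """True if sites i and j carry all four combinations (incompatible with a tree)."""
    return len({(s[i], s[j]) for s in seqs}) == 4

sample = ["00110", "00100", "10000", "10001"]  # hypothetical data, 4 genes x 5 sites
print(segregating_sites(sample), pairwise_diversity(sample), theta_from_segregating_sites(sample))
print(four_combinations_present(sample, 0, 2))
```

Under neutrality both the pairwise diversity and the rescaled number of segregating sites estimate the same quantity, which is the "definite relation" referred to above.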

- Coalescence amongst k lineages at a rate $\frac{k(k-1)}{2} \cdot \frac{1}{2 N_e}$
- Recombination at a rate $k r$
- Pattern depends on R, which is proportional to $N_e r$
- Each recombination generates a pair of unique junctions
- Junctions can disappear if they meet each other in a coalescence
- At any time, any one genome is distributed across several ancestral lineages

[Equation: a series expansion in powers of R, up to $O(R^4)$ (Derrida & Jung-Muller 1999).]

Example: R = 50

Number of ancestral lineages:

[Figure: number of ancestral lineages.]

A typical sample, with 8 ancestors:

[Figure: six sampled genomes represented by colours, at a fixed scaled time.]

Looking along the genome...

Different regions have different genealogies:

Patterns of diversity vary along the genome:

[Figure: numbers of segregating sites among the sampled genomes, in a sliding window along the genome.]

Mean number of pairwise differences:

[Figure: mean number of pairwise differences in a sliding window along the genome.]

Pedigrees - or an infinitely long genome

[Figure: probability of ancestor repetitions in the genealogical tree of King Edward III. The continuous and dashed lines show simulations of F[r] in closed populations of two different sizes, for the model of Derrida et al.]

Derrida, B., S. C. Manrubia, and D. H. Zanette. 1999. Statistical properties of genealogical trees. Physical Review Letters 82:1987-1990.

Forwards in time

What is the fate of a single ancestral genome? In an infinitely large population, this is a branching process.
- The chance that the pedigree will survive is ~80%
- Any finite piece of genome is certain to be lost, but very slowly

[Figure: the probability of survival of a neutral genome (S = 0) as a function of map length, R. From top to bottom, the curves show $P_t[R]$ for t = 0, 1, 2, ... 10; 20, 30, ... 100; and 200, 300, ... 1000 generations.]

[Figure: the distribution of blocks of genome that remain after 50 generations, for a given map length R. The two panels show two random realisations of this process. Each line represents one genome.]

[Figure: (a) The increase in mean block number over time (± standard error), compared with the expectation 1 + Rt. (b) The mean amount of ancestral material over time, compared with the constant expectation R. (c) The probability of survival, P, compared with the calculated value. (d) The distribution of block sizes at time t = 30 compared with the expectation.]

What do we see?

What is the relation between the ancestry of segments of genome, and the patterns we see?

Patil et al. 2001 Science 294:1719. 21,676,868 bases, 36000 SNPs; ~4000 "blocks" identified; ~2700 SNPs capture ~80% of haplotype variation.

What is the actual structure of these 20 chromosomes?

Selection on linked sites

Balancing selection

Complete linkage

Kreitman & Aguade (Genetics, 1986) observed excess polymorphism in the Adh region of D. melanogaster. Hudson, Kreitman & Aguade (Genetics, 1987) introduced the "HKA test" to detect balancing selection.

A polymorphism with two alleles P, Q divides linked markers into two separate gene pools. Eventually, there will be a set of alleles with homozygosity $\frac{1}{1 + 4 N p \mu}$ associated with P, and a distinct set associated with Q, with homozygosity $\frac{1}{1 + 4 N q \mu}$. The overall homozygosity is:

$F = \frac{p^2}{1 + 4 N \mu p} + \frac{q^2}{1 + 4 N \mu q}$

[Figure: 1 - F against p, for a range of values of $4 N \mu$ from the bottom curve to the top curve.]

Recombination

We must follow identities between genes both associated with P, $F_{PP}$, both with Q, $F_{QQ}$, or one with each, $F_{PQ}$:

$F'_{PP} = (1 - r q)^2 F_{PP} + 2 r q (1 - r q) F_{PQ} + r^2 q^2 F_{QQ}$

Assuming r small:

$dF_{PP} = 2 r q (F_{PQ} - F_{PP})$
$dF_{PQ} = r (q F_{QQ} + p F_{PP} - F_{PQ})$
$dF_{QQ} = 2 r p (F_{PQ} - F_{QQ})$

The effects of mutation and drift can be found in a similar way. Overall:

$dF_{PP} = -2 \mu F_{PP} + 2 r q (F_{PQ} - F_{PP}) + \frac{1 - F_{PP}}{2 N p}$
$dF_{PQ} = -2 \mu F_{PQ} + r (q F_{QQ} + p F_{PP} - F_{PQ})$
$dF_{QQ} = -2 \mu F_{QQ} + 2 r p (F_{PQ} - F_{QQ}) + \frac{1 - F_{QQ}}{2 N q}$

At equilibrium, dF = 0. The average F is:

$\bar{F} = \frac{2 + \rho - 4 p q \left(1 - N \mu (2 + 3 \rho + \rho^2)\right)}{2 + \rho + 4 N \mu \left(2 + (1 + 4 p q) \rho + p q \rho^2\right) + 16 N^2 \mu^2 p q (2 + 3 \rho + \rho^2)}$

where $\rho = r / \mu$.

Note that the effect is only over recombination rates of order $\mu$.

[Figure: plot of heterozygosity $(1 - \bar{F})$ against $r / \mu$, from 0 to 10.]

ss = Solve[{0 == -2 μ FPP + 2 r q (FPQ - FPP) + (1 - FPP)/(2 n p),
    0 == -2 μ FPQ + r (q FQQ + p FPP - FPQ),
    0 == -2 μ FQQ + 2 r p (FPQ - FQQ) + (1 - FQQ)/(2 n q)},
   {FPP, FPQ, FQQ}];
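As an independent check on the equilibrium, the three linear equations with dF = 0 can be solved numerically and compared with the closed-form average $\bar{F} = p^2 F_{PP} + 2 p q F_{PQ} + q^2 F_{QQ}$. This is a sketch in plain Python using Cramer's rule; the function names and parameter values are assumptions, and the closed form is the expression for $\bar{F}$ as reconstructed here from the garbled original.

```python
def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(m, v):
    """Solve the 3x3 linear system m x = v by Cramer's rule."""
    d = det3(m)
    return [det3([[v[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]) / d
            for col in range(3)]

def equilibrium_identities(N, mu, r, p):
    """Return (F_PP, F_PQ, F_QQ) at equilibrium (all three dF set to zero)."""
    q = 1 - p
    drift_p, drift_q = 1 / (2 * N * p), 1 / (2 * N * q)
    m = [[-2 * mu - 2 * r * q - drift_p, 2 * r * q, 0.0],
         [r * p, -2 * mu - r, r * q],
         [0.0, 2 * r * p, -2 * mu - 2 * r * p - drift_q]]
    return solve3(m, [-drift_p, 0.0, -drift_q])

def mean_identity(N, mu, r, p):
    q = 1 - p
    f_pp, f_pq, f_qq = equilibrium_identities(N, mu, r, p)
    return p * p * f_pp + 2 * p * q * f_pq + q * q * f_qq

def mean_identity_closed_form(N, mu, r, p):
    q, rho, x = 1 - p, r / mu, N * mu
    poly = 2 + 3 * rho + rho ** 2
    num = 2 + rho - 4 * p * q * (1 - x * poly)
    den = 2 + rho + 4 * x * (2 + (1 + 4 * p * q) * rho + p * q * rho ** 2) + 16 * x * x * p * q * poly
    return num / den

N, mu, r, p = 1000, 1e-4, 2e-4, 0.3  # hypothetical values
print(mean_identity(N, mu, r, p), mean_identity_closed_form(N, mu, r, p))
```

With r = 0 the numerical solution reduces to the complete-linkage result, since $F_{PQ} = 0$ and each class has homozygosity $1/(1 + 4 N \mu p)$ or $1/(1 + 4 N \mu q)$.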

({FPP, FPQ, FQQ, p^2 FPP + 2 p q FPQ + q^2 FQQ} /. ss[[1]] /.
     {μ -> γ μ, r -> γ ρ μ, n -> 1/(γ nn), q -> 1 - p} // Cancel) /. nn -> 1/n // Simplify

{((2 + ρ) (-1 + 4 n (-1 + p) μ (1 + p ρ)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) + 4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)),
 (ρ (-1 + 4 n (-1 + p) p μ (2 + ρ)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) + 4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)),
 ((2 + ρ) (-1 + 4 n p μ (-1 + (-1 + p) ρ)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) + 4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)),
 -(2 + ρ + 4 p (-1 + n μ (2 + 3 ρ + ρ^2)) - 4 p^2 (-1 + n μ (2 + 3 ρ + ρ^2)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) + 4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2))}

Plot[1 + (2 + ρ + 4 p (-1 + n μ (2 + 3 ρ + ρ^2)) - 4 p^2 (-1 + n μ (2 + 3 ρ + ρ^2)))/
     (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) + 4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)) /.
    {n -> 0.05/μ, p -> 1/2, ρ -> Abs[ρ]}, {ρ, 0, 10}, PlotRange -> {{0, 10}, {0, 1}}];

Selective sweeps

Fixation of a single favourable mutation carries with it a segment of linked genome.

[Figure: stages of a selective sweep, labelled in order: mutation; branching process; ns ≫ 1; deterministic increase; p ≪ 1; fixation; sample.]

[Figure: an example with $N = 10^5$, shown over a range of recombination rates r.]

Fixation takes ~ $\frac{\log(2N)}{s}$ generations, so a region of $r \sim \frac{s}{\log(2N)}$ has reduced diversity.

References

Maynard Smith, J., and J. Haigh. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23:23-35.
Hudson, R. R., and N. L. Kaplan. 1988. The coalescent process in models with selection and recombination. Genetics 120:831-840.
Kaplan, N. L., R. R. Hudson, and C. H. Langley. 1989. The hitch-hiking effect revisited. Genetics 123:887-899.
Barton, N. H. 2000. Genetic hitch-hiking. Philosophical Transactions of the Royal Society (London) B 355:1553-1562.
Kim, Y., and W. Stephan. 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160:765-777.
Gillespie, J. H. 2001. Is the population size of a species relevant to its evolution? Evolution 55:2161-2169.

Monte Carlo methods

Generalities

How can we make inferences from genetic data?
- statistics such as the number of segregating sites, pairwise diversity
- likelihood: the probability of observing the data, given some hypothesis

Statistical inference:
- significance tests
- likelihood
- Bayesian inference

Griffiths-Tavaré

Griffiths, R. C., and S. Tavaré. 1994. Simulating probability distributions in the coalescent. Theoretical Population Biology 46:131-159.

We observe some configuration of mutations:

[Table: genes a-e (rows) scored 0/1 for mutations at sites 1-9 (columns).]

This configuration was produced by this genealogy:

[Figure: genealogy of the genes e, c, d, a, b.]

This rooted genealogy cannot be fully reconstructed, because there were no mutations along the branches leading down to e and to {a,b,c,d}.

The algorithm (exact version):
- Work back along the genealogy, until the most recent mutation or coalescence.
- Sites can only lose a mutation if that mutation is represented only in one leaf; let there be J such sites. (In the example above, sites 6, 7, 8, 9 are singletons; J = 4.)
- A pair of lineages can only coalesce if they carry the same set of mutations; let there be K such pairs. In the example, there are no such possibilities: K = 0.
- With n lineages, the rate of events is $\lambda_n = n \frac{\theta}{2} + \frac{n(n-1)}{2}$; a sum is taken over these events, with the appropriate probability, and expressed in terms of the probabilities of the simpler configurations generated by loss of a mutation or coalescence.
- This sum over J + K possible previous configurations is weighted by the overall weight $\frac{1}{\lambda_n}$:

$P[S] = \frac{1}{\lambda_n} \left( \sum_{j=1}^{J} \frac{\theta}{2} P[S_j^*] + \sum_{k=1}^{K} P[S_k^*] \right)$, where $\lambda_n = n \frac{\theta}{2} + \frac{n(n-1)}{2}$

$S_j^*$ represents deletion of the j'th singleton site from S, and $S_k^*$ the coalescence of the k'th pair.

This algorithm becomes extremely slow for large numbers of mutations and lineages.

Monte Carlo version:

A Monte Carlo estimate can be made by sampling possible paths back through the genealogy, with relative probability $\frac{\phi}{2}$ for possible losses of mutations, and 1 for possible coalescences:

$P[S] = \left( \frac{\theta}{\phi} \right)^m E\left[ \prod_i \frac{1}{\lambda_i} \left( \frac{\phi}{2} J_i^* + K_i^* \right) \right]$

where $J_i^*$ is the number of possible losses of mutations, $K_i^*$ the number of possible coalescences, m the number of segregating sites, and i the current number of lineages.
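The exact recursion can be sketched as a memoised function. This is a literal reading of the recursion as stated in these notes, not the published Griffiths-Tavaré implementation, whose bookkeeping of sites and combinatorial factors may differ. The representation (each lineage as a collection of site labels), the function name, and the base case (a single lineage carrying no mutations has probability 1) are assumptions.

```python
from functools import lru_cache
from itertools import combinations

def configuration_probability(lineages, theta):
    """Evaluate P[S] = (1/lambda_n) * (sum_j (theta/2) P[S_j*] + sum_k P[S_k*]).

    lineages: iterable of iterables of site labels (all of one sortable type),
    one entry per sampled gene, listing the mutations that gene carries.
    """
    def canon(ls):
        return tuple(sorted(tuple(sorted(l)) for l in ls))

    @lru_cache(maxsize=None)
    def prob(state):
        n = len(state)
        if n == 1:
            # Assumed base case: only segregating sites are recorded, so the
            # last lineage must carry no mutations.
            return 1.0 if not state[0] else 0.0
        rate = n * theta / 2 + n * (n - 1) / 2
        carriers = {}
        for idx, lineage in enumerate(state):
            for site in lineage:
                carriers.setdefault(site, []).append(idx)
        total = 0.0
        # Loss of a mutation that is carried by exactly one lineage.
        for site, idxs in carriers.items():
            if len(idxs) == 1:
                new = list(state)
                new[idxs[0]] = tuple(s for s in state[idxs[0]] if s != site)
                total += (theta / 2) * prob(canon(new))
        # Coalescence of a pair of lineages carrying identical sets of mutations.
        for i, j in combinations(range(n), 2):
            if state[i] == state[j]:
                total += prob(canon(state[:j] + state[j + 1:]))
        return total / rate

    return prob(canon(lineages))

# Two genes with no differences: the recursion gives 1 / (1 + theta).
print(configuration_probability([[], []], theta=1.0))
# Two genes, one mutation carried by the first: (theta / 2) / (1 + theta)**2.
print(configuration_probability([[1], []], theta=1.0))
```

Memoisation helps only modestly: the number of distinct configurations still grows quickly with the numbers of lineages and mutations, which is the motivation for the Monte Carlo version.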

The parameter $\phi$ can be chosen arbitrarily: it should take a value which minimises the variance of the estimator. Note that while $\phi = \theta$ seems natural, it does not give an optimal estimator.

Other applications:

Joint estimation of recombination and mutation ($4 N_e r$, $4 N_e \mu$):

Kuhner, M. K., J. Yamato, and J. Felsenstein. 2000. Maximum likelihood estimation of recombination rates from population data. Genetics 156:1393-1409.
Fearnhead, P., and P. Donnelly. 2001. Estimating recombination rates from population genetic data. Genetics 159:1299-1318.

Estimation of population structure:

Beerli, P., and J. Felsenstein. 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proceedings of the National Academy of Sciences (U.S.A.) 98:4563-4568.