Wald Lecture 2 My Work in Genetics with Jason Schweinsbreg

Similar documents
Armitage and Doll (1954) Probability models for cancer development and progression. Incidence of Retinoblastoma. Why a power law?

Spatial Moran model. Rick Durrett... A report on joint work with Jasmine Foo, Kevin Leder (Minnesota), and Marc Ryser (postdoc at Duke)

Dynamics of the evolving Bolthausen-Sznitman coalescent. by Jason Schweinsberg University of California at San Diego.

Stochastic Demography, Coalescents, and Effective Population Size

Evolution in a spatial continuum

ON COMPOUND POISSON POPULATION MODELS

The Λ-Fleming-Viot process and a connection with Wright-Fisher diffusion. Bob Griffiths University of Oxford

How robust are the predictions of the W-F Model?

Computing likelihoods under Λ-coalescents

Frequency Spectra and Inference in Population Genetics

The genealogy of branching Brownian motion with absorption. by Jason Schweinsberg University of California at San Diego

The Wright-Fisher Model and Genetic Drift

The mathematical challenge. Evolution in a spatial continuum. The mathematical challenge. Other recruits... The mathematical challenge

Theoretical Population Biology

GENE genealogies under neutral evolution are commonly

The Combinatorial Interpretation of Formulas in Coalescent Theory

Demography April 10, 2015

Genetic Variation in Finite Populations

The dynamics of complex adaptation

arxiv: v1 [math.pr] 29 Jul 2014

Computational Systems Biology: Biology X

BIRS workshop Sept 6 11, 2009

arxiv: v3 [math.pr] 23 Jun 2009

Genetic hitch-hiking in a subdivided population

An introduction to mathematical modeling of the genealogical process of genes

Lecture 18 : Ewens sampling formula

Latent voter model on random regular graphs

Yaglom-type limit theorems for branching Brownian motion with absorption. by Jason Schweinsberg University of California San Diego

Measure-valued diffusions, general coalescents and population genetic inference 1

Crump Mode Jagers processes with neutral Poissonian mutations

Some mathematical models from population genetics

The effects of a weak selection pressure in a spatially structured population

A coalescent model for the effect of advantageous mutations on the genealogy of a population

Mathematical models in population genetics II

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

APM 541: Stochastic Modelling in Biology Diffusion Processes. Jay Taylor Fall Jay Taylor (ASU) APM 541 Fall / 61

arxiv:math/ v3 [math.pr] 5 Jul 2006

Mathematical Biology. Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model. Matthias Birkner Jochen Blath

WXML Final Report: Chinese Restaurant Process

The two-parameter generalization of Ewens random partition structure

The Moran Process as a Markov Chain on Leaf-labeled Trees

Population Genetics: a tutorial

Notes 20 : Tests of neutrality

Theoretical Population Biology

J. Stat. Mech. (2013) P01006

Stan Ulam once said: Branching Process Models of Cancer. I have sunk so low that my last paper contained numbers with decimal points.

Endowed with an Extra Sense : Mathematics and Evolution

6 Introduction to Population Genetics

Genealogies in Growing Solid Tumors

Evolutionary dynamics of cancer: A spatial model of cancer initiation

From Individual-based Population Models to Lineage-based Models of Phylogenies

Australian bird data set comparison between Arlequin and other programs

Surfing genes. On the fate of neutral mutations in a spreading population

Hitting Probabilities

Modern Discrete Probability Branching processes

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

6 Introduction to Population Genetics

Stochastic flows associated to coalescent processes

Computational statistics

THE x log x CONDITION FOR GENERAL BRANCHING PROCESSES

The Coalescence of Intrahost HIV Lineages Under Symmetric CTL Attack

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

Problems on Evolutionary dynamics

SWEEPFINDER2: Increased sensitivity, robustness, and flexibility

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences

Segregation versus mitotic recombination APPENDIX

Rare Alleles and Selection

Mathematical Models for Population Dynamics: Randomness versus Determinism

Ancestor Problem for Branching Trees

MANY questions in population genetics concern the role

Estimating Evolutionary Trees. Phylogenetic Methods

Wright-Fisher Models, Approximations, and Minimum Increments of Evolution

Selection and Population Genetics

Look-down model and Λ-Wright-Fisher SDE

CLOSED-FORM ASYMPTOTIC SAMPLING DISTRIBUTIONS UNDER THE COALESCENT WITH RECOMBINATION FOR AN ARBITRARY NUMBER OF LOCI

arxiv: v2 [q-bio.pe] 13 Jul 2011

arxiv: v1 [math.pr] 18 Sep 2007

GENETICS. On the Fixation Process of a Beneficial Mutation in a Variable Environment

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

Learning Session on Genealogies of Interacting Particle Systems

Almost giant clusters for percolation on large trees

ON THE TWO OLDEST FAMILIES FOR THE WRIGHT-FISHER PROCESS

Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates

Lecture 19 : Chinese restaurant process

Branching Processes II: Convergence of critical branching to Feller s CSB

The fate of beneficial mutations in a range expansion

A CHARACTERIZATION OF ANCESTRAL LIMIT PROCESSES ARISING IN HAPLOID. Abstract. conditions other limit processes do appear, where multiple mergers of

The coalescent process

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.

Statistical Tests for Detecting Positive Selection by Utilizing High. Frequency SNPs

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

A phase-type distribution approach to coalescent theory

Evolution of cooperation in finite populations. Sabin Lessard Université de Montréal

Competing selective sweeps

BRANCHING PROCESSES 1. GALTON-WATSON PROCESSES

A N-branching random walk with random selection

Properties of Statistical Tests of Neutrality for DNA Polymorphism Data

Lecture 4: Random Variables and Distributions

Effective population size and patterns of molecular evolution and variation

Resistance Growth of Branching Random Networks

Transcription:

Wald Lecture 2 My Work in Genetics with Jason Schweinsbreg Rick Durrett Rick Durrett (Cornell) Genetics with Jason 1 / 42

The Problem Given a population of size N, how long does it take until τ k the first time we have an individual with a prespecified sequence of k mutations? We use the Moran model. Initially all individuals are type 0. Each individual is subject to replacement at rate 1. A copy is made of an individual chosen at random from the population. Type j 1 mutates to type j at rate u j. Rick Durrett (Cornell) Genetics with Jason 2 / 42

Progression to Colon Cancer Luebeck and Moolgavakar (2002) PNAS fit a four stage model to incidence of colon cancer by age. Rick Durrett (Cornell) Genetics with Jason 3 / 42

k=2 : Iwasa, Michor, Nowak (2004) Genetics Theorem. If Nu 1 0 and N u 2 P(τ 2 > t/nu 1 u2 ) e t 10,000 simulations of n = 10 3, u 1 = 10 4, u 2 = 10 2 1 0.9 0.8 0.7 Probability Density 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.5 1 1.5 2 2.5 3 3.5 4 Rescaled Time Rick Durrett (Cornell) Genetics with Jason 4 / 42

Idea of Proof Since 1 s mutate to 2 s at rate u 2, τ 2 will occur when there have been O(1/u 2 ) births of individuals of type 1. The number of 1 s is roughly a symmetric random walk, so τ 2 will occur when the number of 1 s reaches O(1/ u 2 ). N >> 1/ u 2 guarantees that up to τ 2 the number of 1 s is o(n), so 1 mutations occur at rate Nu 1. The waiting time from the 1 mutation until the 2 mutant appears is of order 1/ u 2. For this to be much smaller than the overall waiting time 1/Nu 1 u2 we need Nu 1 << 1. Rick Durrett (Cornell) Genetics with Jason 5 / 42

Waiting for k mutations Total progeny of a critical binary branching process has P(ξ > k) Ck 1/2, so the sum of M such random variables is O(M 2 ). To get 1 individual of type 4, we need of order 1/u 4 births of type 3. 1/ u 4 mutations to type 3. 1/u 3 u4 births of type 2. 1/u 1/2 3 u 1/4 4 mutations to type 2. 1/u 2 u 1/2 3 u 1/4 4 births of type 1. 1/u 1/2 2 u 1/4 3 u 1/8 4 mutations to type 1. Rick Durrett (Cornell) Genetics with Jason 6 / 42

Durrett, Schmidt, Schweinsberg, Ann Prob. Probability type j has a type k descendant. r j,k = u 1/2 j+1 u1/4 j+2 u1/2k j k for 1 j < k Theorem. Let k 2. Suppose that: (i) Nu 1 0. (ii) For j = 1,..., k 1, u j+1 /u j > b j for all N. (iii) There is an a > 0 so that N a u k 0. (iv) Nr 1,k. Then for all t > 0, lim N P(τ k > t/nu 1 r 1,k ) = exp( t). Rick Durrett (Cornell) Genetics with Jason 7 / 42

When Nr 1,k Fixation of 1 before τ k and stochastic tunneling each have positive probability. Using convergence to the Wright-Fisher diffusion and the Feynman-Kac formula we can prove. Theorem. Let k 2. Assume (i), (ii), and (iii) from before. (iv) (Nr 1,k ) 2 γ > 0, and we let α = k=1 γ k / γ k (k 1)!(k 1)! k!(k 1)! > 1 k=1 then for all t > 0, lim N P(u 1 τ k > t) = exp( αt). Rick Durrett (Cornell) Genetics with Jason 8 / 42

Back to reality. Armitage and Doll (1954) Rick Durrett (Cornell) Genetics with Jason 9 / 42

Small time behavior Our results are appropriate for the regulatory sequence application since one is interested in the typical amount of time that the process takes. However, most cancers occur in less than 1% of the population so we are looking at the lower tail of the distribution. Let g k (t) = Q 1 (τ k t) where Q 1 is the probability for the branching process started with one type 1. In the case u j µ g j (t) = µg j 1 (t) (1 µ)g j (t) 2 2µg j (t) One can inductively solve the differential equations and finds If t << µ 1/2 then g k (t) µ k 1 t k 1 /(k 1)! Rick Durrett (Cornell) Genetics with Jason 10 / 42

Other results Schweinsberg (2008) studies all possible limits in the case µ j µ paper is on the arxiv When m = 3 the behavior changes at µ = N 2 N 4/3 N 1 N 2/3 Thus there are five regimes and four borderline cases. Rick Durrett (Cornell) Genetics with Jason 11 / 42

Moran model Each individual is replaced at rate 1. That is, individual x lives for an exponentially distributed amount with mean 1 and then is replaced. To replace individual x, we choose an individual at random from the population (including x itself) to be the parent of the new individual. Suppose that we have two alleles A and a, and let X t be the number of copies of A. The transition rates for X t are i i + 1 at rate b i = (2N i) i i 1 at rate d i = i 2N i 2N i 2N Rick Durrett (Cornell) Genetics with Jason 12 / 42

Kingman s coalescent Theorem When time is run at rate N, the genealogy of a sample of size n from the Moran model converges to Kingman s coalescent. Proof. If we look backwards in time, then when there are k lineages, each replacement leads to a coalescence with probability (k 1)/2N. If we run time at rate N, then jumps occur at rate N k/2n = k/2, so the total rate of coalescence is k(k 1)/2, the right rate for Kingman s coalescent. Rick Durrett (Cornell) Genetics with Jason 13 / 42

Directional Selection Fecundity selection. Suppose b s are born at a rate 1 s times that of B s. The transition rates for X t for the number of B s is now: i i i + 1 at rate b i = (2N i) 2N i i 1 at rate d i = i 2N i (1 s) 2N Embedded jump chain is a simple random walk that jumps up with probability p = 1/(2 s) and down with probability 1 p. Started with X 0 = i, B becomes fixed in the population (reaches 2N) with probability: 1 (1 s) i 1 (1 s) 2N Rick Durrett (Cornell) Genetics with Jason 14 / 42

Three phases of the fixation process 1 While the advantageous B allele is rare, the number of B s can be approximated by a supercritical branching process. 2 While the frequency of B s is [ɛ, 1 ɛ] there is very little randomness and it follows the solution of the logistic differential equation: du/dt = su(1 u). 3 While the disadvantageous b allele is rare, the number of a s can be approximated by a subcritical branching process. Rick Durrett (Cornell) Genetics with Jason 15 / 42

Hitchhiking Due to recombination, each chromosome you inherit from each parent is a mixture of their two chromosomes, with transitions between the two at points of a nonhomogeneous Poisson process. In the absence of recombination, fixation of an allele would result in every individual in the population having a copy of the associated chromosome. With recombination, changes in allele frequency occur only near the allele that went to fixation. Rick Durrett (Cornell) Genetics with Jason 16 / 42

Maynard-Smith and Haigh (1974) Alleles B and b have relative fitnesses 1 and 1-s, neutral locus with alleles A and a, recombination between the two has probability r. Let p 0 = frequency of B before the sweep (1/2N). Q t = P(A B). R t = P(A b). Theorem. Suppose Q 0 = 0. Under the logistic sweep model, which ignores the branching process phases 1 and 3, 2τ Q = R 0 (1 p 0 ) 0 re rt ds (1 p 0 ) + p 0 est Proof. R 0 (1 p 0 ) is the frequency of A before the sweep. In order for a sampled individual to have the A allele, its lineage must escape the sweep due to recombination. Rick Durrett (Cornell) Genetics with Jason 17 / 42

Hitchhiking = Population subdivision. b population. ignore....... B population 0 τ = (1/s) log(2n) 2τ Rick Durrett (Cornell) Genetics with Jason 18 / 42

Durrett and Schweinsberg (2004) Th. Pop. Biol. From the previous theorem, the probability a lineage escapes from the sweep by recombination is pinb = 2τ 0 re rt ds (1 p 0 ) + p 0 est Theorem. Under the logisitic sweep model, if N and r log(2n)/s a, pinb 1 e a. Biologists rule of thumb: hitchhiking is efficient if r < s and negligible if r s. (should be efficient if r s/(log(2n)) Rick Durrett (Cornell) Genetics with Jason 19 / 42

Effect on genealogies Approximation 1 Let p k,i = probability k lineages reduced to i by the sweep. Under the logistic sweep model, if N with r ln(2n)/s a and s(ln N) 2 then for j 2 p k,k j+1 ( ) k p j (1 p) k j j where p = e a and p k,k (1 p) k + kp(1 p) k 1. p-merger. Flip coins with probability p of heads for each lineage and coalesce all of those with heads. Need at least two heads to get a coalescence. Rick Durrett (Cornell) Genetics with Jason 20 / 42

Simulation results N = 10, 000, s = 0.1. Set r = 0.00516 so pinb 0.4. p2inb = P( both lineages escapes the sweep and do not coalesce). p2cinb = P( both lineages escape the sweep but coalesce). p1b1b = P( one lineage escapes but the other does not). p 22 = P( no coalescence ) = p2inb + p1b1b pinb p2inb p2cinb p1b1b p 22 Approx. 1 0.4 0.16 0 0.48 0.64 logistic ODE 0.39936 0.13814 0.09599 0.32646 0.46460 Moran sim 0.33656 0.10567 0.05488 0.35201 0.45769 Approx. 2 0.34065 0.10911 0.05100 0.36112 0.47203 Rick Durrett (Cornell) Genetics with Jason 21 / 42

Approximation 2 A stick breaking construction that leads to a coalescent with simultaneous multiple collisions. a 8 = a 9 = a 10 a 1 a 2 = a 3 = a 4 a 5 = a 6 a 7 a 11 I 1 I 4 I 6 I 7 I 10 Pieces of stick are coalesced lineages that escape due to recombination. Sampled individuals = points random on (0,1). Two in the same piece coalesce. I 1 may be marked ( ) or not (escapes sweep). Rick Durrett (Cornell) Genetics with Jason 22 / 42

a 8 = a 9 = a 10 a 1 a 2 = a 3 = a 4 a 5 = a 6 a 7 a 11 I 1 I 4 I 6 I 7 I 10 M = [2Ns] number of lineages with an infinite line of descent ξ l, 2 l M iid Bernoulli, 1 (recombination) with prob r/s. W l, 2 l M are beta(1, l 1) (fraction of lineages) V l = ξ l W l, T l = V l M i=l+1 (1 V i) a l = a l+1 T l, I l = [a l, a l+1 ] Proofs. Schweinsberg and Durrett (2005) Ann. Appl. Prob. Error is O(1/ log 2 N) versus O(1/ log N) for approx 1 Rick Durrett (Cornell) Genetics with Jason 23 / 42

Reduction of π = 0.01 due to a sweep Kim and Stephan (2002) > D & S (dashed) answer 9 x 10 3 8 7 6 5 4 3 2 1 0 0 5 10 15 20 25 30 35 40 position in kilobases Rick Durrett (Cornell) Genetics with Jason 24 / 42

A Drosophila Puzzle Begun and Aquadro (1992) observed that in Drosophila melanogaster there is a positive correlation between nucleotide diversity and recombination rates. Two explanations: Repeated episodes of hitchhiking caused by the fixation of newly arising advantageous mutations, which has a greater effect in regions of low recombination, because the average size of the region affected depends on the ratio s/r. Background selection (removal of deleterious alleles) which leads to a reduction of the effective population size has a greater impact in regions of low recombination, but does not change the site frequency spectrum. Rick Durrett (Cornell) Genetics with Jason 25 / 42

Λ-coalescents. Pitman, Möhle and Sagitov State is a partition. Sets in partition are lineages that have coalesced. ξ η is a k-merger if k sets in ξ collapse to one in η, and the rest of η does not change. q ξ,η = 1 Λ({0}) = 1. Kingman s coalescent. 0 p k 2 (1 p) ξ k Λ(dp) If λ = 1 0 p 2 Λ(dp) <, p-mergers with a random λ 1 p 2 Λ(dp) distributed p occur at rate λ. Rick Durrett (Cornell) Genetics with Jason 26 / 42

Durrett and Schweinsberg (2005) SPA Suppose that the recombination rate between 0 and x is β x. Mutations with a fixed selective advantage s occur in the population at rate γ per unit length. Theorem. The genealogies converge to a Λ coalescent with Λ = δ 0 + cy dy where c = 2γs 2 /β. Rick Durrett (Cornell) Genetics with Jason 27 / 42

Comparison with data on π. Stephan (1995) 0.012 0.01 0.008 0.006 0.004 0.002 0 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 N times recombination rate Rick Durrett (Cornell) Genetics with Jason 28 / 42

Large family sizes The original biological motivation for Λ-coalescents is that many species have a highly variable number of offspring. Cannings model Suppose that the 2N members of the population have offspring (ν 1,... ν 2N ). The ν i are exchangeable and sum to 2N. (Distribution depends on N.) Möhle (2000). Run time at rate 2N/var (ν i ). Convergence to Kingman s coalescent occurs if and only if E[ν 1 (ν 1 1)(ν 1 2)]/N 2 E[ν 1 (ν 1 1)]/N 0 In words, if and only if no triple mergers. Rick Durrett (Cornell) Genetics with Jason 29 / 42

Schweinsberg (2003) Stoch. Proc. Appl. Each individual has X i offspring (independent) then N are chosen to make the next generation. Part (c) of Theorem 4 shows Theorem. Suppose EX i = µ > 1 and P(X i k) Ck α with 1 < α < 2. Then, when time is run at rate 2N/var (ν i ) C N α 1, the genealogical process converges to a Λ-coalescent where Λ is the beta(2 α, α) distribution, i.e., Λ(dx) = x 1 α (1 x) α 1 B(2 α, α) where B(a, b) = Γ(a)Γ(b)/Γ(a + b), and Γ(a) = 0 x a 1 e x dx is the usual gamma function. Rick Durrett (Cornell) Genetics with Jason 30 / 42

Genealogy when α = 1.2 Rick Durrett (Cornell) Genetics with Jason 31 / 42

Genealogy when α = 1.9 Kingman Rick Durrett (Cornell) Genetics with Jason 32 / 42

Arnason (2004) cytochrome b data, 1278 cod 39 mutations define 59 haplotypes (mutation patterns): This indicates some sites were hit more than once, for if not, the number of haplotypes = 1 + the number of mutations Haplotype frequencies: 696, 193, 124, 112, 29, 15, 9, 7, 6, 5(3), 4(2), 3(6), 2(7), 1(32) Rick Durrett (Cornell) Genetics with Jason 33 / 42

Rick Durrett (Cornell) Genetics with Jason 34 / 42

Site frequency spectrum J. Berestycki, N. Berestycki, and Schweinsberg (2006a,b). Theorem Suppose we introduce mutations into the beta coalescent at rate θ, and let M n,k be the number of mutations affecting k individuals in a sample of size n. Then as n, M n,k (2 α)γ(α + k 2) a k = C α k α 3. S n Γ(α 1)k! When α = 2 this reduces to the 1/k behavior found in Kingman s coalescent. When k = 1, a k = 2 α. Rick Durrett (Cornell) Genetics with Jason 35 / 42

Data set 2 Boom, Boulding, and Beckenbach (1994) did a restriction enzyme digest of mtdna on a sample of 141 Pacific Oysters from British Columbia. They found 51 segregating sites and 30 singleton mutations, resulting in an estimate of α = 2 30 51 = 1.41 However, this estimate is biased. If the underlying data was generated by Kingman s coalescent, we would expect a fraction 1/ ln(141) = 0.202 of singletons, resulting in an estimate of α = 1.8. Rick Durrett (Cornell) Genetics with Jason 36 / 42

BBB α = 1.19 (uncorr: 1.41), Arnason α = 1.54 Rick Durrett (Cornell) Genetics with Jason 37 / 42

Segregating sites J. Berestycki, N. Berestycki, and Schweinsberg (2006a,b). Theorem Suppose we introduce infinite sites mutations into the beta coalescent at rate θ, and let S n be the number of segregating sites observed in a sample of size n. If 1 < α < 2 then as n S n θα(α 1)Γ(α) n2 α 2 α In Kingman s coalescent S n log n θ Rick Durrett (Cornell) Genetics with Jason 38 / 42

Simulation mean / formula : slow convergence Rick Durrett (Cornell) Genetics with Jason 39 / 42

Subsampling Arnason, α 1.50 (vs. 1.54) Rick Durrett (Cornell) Genetics with Jason 40 / 42

PRF likelihood of SFS Carlos Bustamante Rick Durrett (Cornell) Genetics with Jason 41 / 42

Estimation results: Emilia Huerta-Sanchez Now VIGRE postdoc, U.C. Berkeley Statisitcs. Rick Durrett (Cornell) Genetics with Jason 42 / 42