Computational Systems Biology: Biology X

Similar documents
6 Introduction to Population Genetics

6 Introduction to Population Genetics

The Wright-Fisher Model and Genetic Drift

How robust are the predictions of the W-F Model?

Genetic Variation in Finite Populations

Mathematical models in population genetics II

Demography April 10, 2015

The Combinatorial Interpretation of Formulas in Coalescent Theory

Evolution in a spatial continuum

Lecture 18 : Ewens sampling formula

Frequency Spectra and Inference in Population Genetics

URN MODELS: the Ewens Sampling Lemma

Notes on Population Genetics

Mathematical Population Genetics II

Effective population size and patterns of molecular evolution and variation

π b = a π a P a,b = Q a,b δ + o(δ) = 1 + Q a,a δ + o(δ) = I 4 + Qδ + o(δ),

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Mathematical Population Genetics II

ON COMPOUND POISSON POPULATION MODELS

Inventory Model (Karlin and Taylor, Sec. 2.3)

An introduction to mathematical modeling of the genealogical process of genes

Mutation, Selection, Gene Flow, Genetic Drift, and Nonrandom Mating Results in Evolution

So in terms of conditional probability densities, we have by differentiating this relationship:

Rare Alleles and Selection

BIRS workshop Sept 6 11, 2009

Endowed with an Extra Sense : Mathematics and Evolution

Lecture 14 Chapter 11 Biology 5865 Conservation Biology. Problems of Small Populations Population Viability Analysis

Population Genetics I. Bio

Processes of Evolution

NEUTRAL EVOLUTION IN ONE- AND TWO-LOCUS SYSTEMS

A phase-type distribution approach to coalescent theory

Stochastic Demography, Coalescents, and Effective Population Size

STAT 536: Genetic Statistics

The mathematical challenge. Evolution in a spatial continuum. The mathematical challenge. Other recruits... The mathematical challenge

Selection and Population Genetics

Introduction to population genetics & evolution

Challenges when applying stochastic models to reconstruct the demographic history of populations.

Evolutionary Theory. Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A.

Problems for 3505 (2011)

Outline of lectures 3-6

The Λ-Fleming-Viot process and a connection with Wright-Fisher diffusion. Bob Griffiths University of Oxford

Ancestral Inference in Molecular Biology. Simon Tavaré. Saint-Flour, 2001

Outline of lectures 3-6

Wald Lecture 2 My Work in Genetics with Jason Schweinsbreg

Dynamics of the evolving Bolthausen-Sznitman coalescent. by Jason Schweinsberg University of California at San Diego.

Page Max. Possible Points Total 100

Notes for MCTP Week 2, 2014

arxiv: v2 [q-bio.pe] 13 Jul 2011

Linking levels of selection with genetic modifiers

The effects of a weak selection pressure in a spatially structured population

1.3 Forward Kolmogorov equation

1. The diagram below shows two processes (A and B) involved in sexual reproduction in plants and animals.

The coalescent process

Wright-Fisher Models, Approximations, and Minimum Increments of Evolution

Introduction to Natural Selection. Ryan Hernandez Tim O Connor

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns.

Chapter 5 Evolution of Biodiversity

WXML Final Report: Chinese Restaurant Process

19. Genetic Drift. The biological context. There are four basic consequences of genetic drift:

THE SPATIAL -FLEMING VIOT PROCESS ON A LARGE TORUS: GENEALOGIES IN THE PRESENCE OF RECOMBINATION

Genetic variation in natural populations: a modeller s perspective

Follow links Class Use and other Permissions. For more information, send to:

On the inadmissibility of Watterson s estimate

Modeling Evolution DAVID EPSTEIN CELEBRATION. John Milnor. Warwick University, July 14, Stony Brook University

ANCESTRAL PROCESSES WITH SELECTION: BRANCHING AND MORAN MODELS

Reading for Lecture 13 Release v10

2 Discrete-Time Markov Chains

Quantitative trait evolution with mutations of large effect

I of a gene sampled from a randomly mating popdation,

Coalescent based demographic inference. Daniel Wegmann University of Fribourg

The fate of beneficial mutations in a range expansion

b. ( ) ( ) ( ) ( ) ( ) 5. Independence: Two events (A & B) are independent if one of the conditions listed below is satisfied; ( ) ( ) ( )

Population Genetics. Tutorial. Peter Pfaffelhuber, Pleuni Pennings, and Joachim Hermisson. February 2, 2009

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

#2 How do organisms grow?

Reproduction and Evolution Practice Exam

Population Structure

Gene Pool Recombination in Genetic Algorithms

Mathematical Models for Population Dynamics: Randomness versus Determinism

GENE genealogies under neutral evolution are commonly

Population genetics snippets for genepop

Surfing genes. On the fate of neutral mutations in a spreading population

Mathematical modelling of Population Genetics: Daniel Bichener

arxiv: v1 [q-bio.pe] 14 Jan 2016

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Outline of lectures 3-6

(Write your name on every page. One point will be deducted for every page without your name!)

NATURAL SELECTION FOR WITHIN-GENERATION VARIANCE IN OFFSPRING NUMBER JOHN H. GILLESPIE. Manuscript received September 17, 1973 ABSTRACT

arxiv: v6 [q-bio.pe] 21 May 2013

Outline for today s lecture (Ch. 13)

Heaving Toward Speciation

Tutorial on Theoretical Population Genetics

Inférence en génétique des populations IV.

The Origin of Species

Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates

Derivation of Itô SDE and Relationship to ODE and CTMC Models

Recap. Probability, stochastic processes, Markov chains. ELEC-C7210 Modeling and analysis of communication networks

Transcription:

Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA Human Population Genomics

Outline 1 2

Damn the Human Genomes. Small initial populations; genes too distant; pestered with transposons; feeble contrivance; could make a better one myself. Lord Jefferey (badly paraphrased)

Outline 1 2

Wright-Fisher model Model of population for genealogical relationship among genes Wright (1931) and Fisher (1930). Idealized haploid model of reproduction: Model of transmission of genes from one generation to the next in a population of fixed size; population of 2N genes, corresponding to N diploid or 2N haploid individuals. Each of the genes of generation t + 1 are obtained by copying the gene of a random individual from generation t; this process is repeated until 2N genes have been sampled to create the population at t + 1. A gene in generation t might not have any descendant in generation t + 1 and thus its lineage dies out.

Moran model An alternative model to Wright-Fisher Moran (1958) Moran model allows overlapping generations The population has 2N haploid individuals or genes A new generation is created from the previous one by sampling randomly randomly to give birth to a new gene, and one gene to die: The gene that dies is distinct from the one that gives birth. Population size remains fixed. The Moran model rules out the possibility of multiple coalescent events in the same generation (i.e., no more than two genes share the same common ancestor in the previous generation).

Thus, one out of ( 2N) 2 possible pairs has the desired coalescent property, Thus the natural time scale is in units of N(2N 1) Moran-generations, rather than in units of 2N Wright-Fisher genrations. After adjusting for the differences in time scales, the two models have approximately equivalent coalescence and fixation properties.

Assumptions of the Wright-Fisher Model Discrete and non-overlapping generations: For humans, a generation (from conception to reproduction) is assumed to be about 25 years. Haploid individuals vs. two subpopulation: Note that in practice, generation time differs for males and females, e.g., 30 vs. 20 years. If the selection does not involve heterosis, the difference has little quantitative consequence. The population size is constant: Population bottleneck effects not accounted for.

Assumptions of the Wright-Fisher Model All individuals are equally fit: Presence and strength of natural selection is ignored. The population has no geographical or social structure: It is a hard assumption to relax; but very important in modeling mechanism of reproduction in a real population. The genes do not recombine within the population: Mitochondria and Y chromosomes are possible exceptions... Must be modeled by an ancestral recombination graph.

Number of Descendants Number of descendants of a particular gene, i, in generation t: A stochastic variable. Let v i be the number of descendants of gene i in generation t... 1 i 2N. ( )( ) 2N 1 k ( Pr(v i = k) = 1 1 ) 2N k 1 k 2N 2N k! e 1. This is a binomial distribution Bin(m, p) (m = 2N; p = 1/2N) with a Poisson approximation Poisson(1).

[ The moment generating function is ψ(t) = 1 + (et 1) 2N and for v i, its mean is 1 and variance is 2N 1 ( 1 1 ) 2N 2N = 1 1 2N. If mean number had deviated from one, the population would grow without bound, or shrink to extinction. The covariance of the off-spring number for two genes i and j is Cov(v i, v j ) = E(v i v j ) E(v i )E(v j ) = 1 2N. The correlation coefficient is Corr(v i, v j ) = Cov(v i, v j ) Var(vi )Var(v j ) = 1 2N 1. ] 2N,

Covariance A negative covariance is expected because if gene i leaves many descendants in next generation, then gene i is more likely to leave few. However, v i and v j are almost independent of each other for large 2N. Note that the probability that a gene has no immediate descendant is Pr(v i = 0) = e 1. Thus approximately 0.63 fraction of all genes have descendants. In a few generations (i.e., relative to 2N) a randomly mating population descends from a small number of genes.

Descendants If d j denotes the probability that a gene in generation j leaves no descendant in the present generation, then d 1 = e 1 0.37. Furthermore, d j = k=0 1 k! e 1 (d j 1 ) k = e d j 1 1, for j > 1. For example, d 10 = 0.85 and d 50 = 0.96. An entire population of size 2N = 10, 000 descends from approximately 2N(1 d 50 ) = 400 genes 50 generations ago.

Outline 1 2

of a Sample of Two Genes What is the distribution of the waiting time until the MRCA (Most Recent Common Ancestor) of two genes sampled in a model with 2N genes? (a) The probability p that these two genes find an ancestor in the first generation back in time is p = 1 2N the first gene chooses its parent freely, the second must choose the same parent out of 2N possibilities; (b) The probability q that the two genes have different ancestors is therefore q = 1 1 2N. The probability that the two genes finds a common ancestor exactly j generations back is Pr(T 2 = j) = q j 1 p = ( 1 1 ) j 1 1 2N 2N.

Note Pr(T 2 j) = q j 1 p[1 + q + q 2 + ] = q j 1 1 e (j 1)/2N. Thus Pr(T 2 j) = p[1 + q + q 2 + q j 1 ] = 1 q j 1 e j/2n.

Note that these models assume a Markov Property: That is the probability of an event (such as coalescence) depends on the present state of the population The process has no memory of events prior to the present. It also implicitly assumes that the number of offsprings is distributed as a Poisson process with parameter 1. In reality the mean or variance of the number of offspring may deviate from the expected value 1 (with population bottlenecks, etc.) They result in significant deviation from the predicted model.

Statistics Thus T 2 Geo(1/2N) is geometrically distributed with parameter p = 1 2N. Hence, it has mean and variance Mean = E(T 2 ) = Variance = Var(T 2 ) = 1 p = 2N 1 p p 2 = 2N(2N 1). Thus the expected time until a MRCA is the same as the number of genes in the population.

of a Sample of n Genes The waiting time for k( n) genes to have less than k ancestral lineages: The probability that k genes have exactly k different ancestors in the previous generation is (2N 1) (2N 2) 2N 2N k 1 (2N k + 1) = 2N ( ) k 1 + o(1) = 1 2 2N. i=1 ( 1 i ) 2N

Thus, as before, we have Pr(T k = j) { 1 ( ) k 1 2 2N } j 1 ( ) k 1 2 2N. Thus T k has approximately a geometric distribution with parameter ( k 2) /(2N). Note that the times T2,, T n are independent.

Properties of Geometric Distributions Assume that t 2 > t 1. Then Pr(T > t 2 T > t 1 ) = Pr(T > t 2 t 1 ). Let S and T be two independent geometrically distributed random variables. S Geo(p) and T Geo(p ), then min(s, T) Geo(p + p pp ).

Properties of Exponential Distributions Assume that t 2 > t 1, and V Exp(a) and U Exp(b) are two independent exponentially distributed random variables. Then Pr(U > t 2 U > t 1 ) = Pr(T > t 2 t 1 ) E(V) = 1 a Var(V) = 1 a 2 E(U) = 1 Var(U) = 1 b b 2 a Pr(v < U) = a + b, and min(u, V) Exp(a + b)

Continuous Time Approximation One unit of time corresponds to the average time for two genes to find a common ancestor: E(T 2 ) = 2N generations. Time is scaled by a factor of 2N (or N or in some cases, 4N). Coalescent becomes independent of the population size. The structure of the coalescent process is the same for any population as lon as the sample size is small relative to population size 2N. n 2N. Only the time scale differs between populations when 2N varies.

Rescaling Time Let t = j/(2n), where j is time measured in generations. j = 2Nt. The waiting time, Tk c, in the continuos representation (for k genes to have k 1 ancestors) is exponentially distributed Tk c Exp(( k 2) ). Pr(T c k t) = 1 e (k 2)t.

Stochastic Algorithm to Sample Genealogies for n Genes Algorithm 1 Start with k = n genes. Repeat until k = 1: 1 Simulate the waiting time Tk c to the next event Tk c Exp(`k 2 ). 2 Choose a random pair (i, j) with 1 i < j k uniformly from the `k 2 possible pairs. 3 Merge i and j into one gene and decrease the sample size by one: k k 1.

Effective Population Size Most real populations show some form of reproductive structure: either due to geological proximity of individuals or due to social constraints. Also, the number of descendants of a gene in one generation does not follow the Poisson distribution with intensity one. For a real population, the population size of the haploid Wright-Fisher that best approximates the real population is called the effective population size N e. One could choose one of the following two: N (i) e = 1, or N(t) 2Pr(T 2 = 1) e = E(T 2). 2

(inbreeding effective population size) relates to the immediate past, where as N (t) e relates to the number of generations until an MRCA is found. N (i) e For the haploid Wright-Fisher model, both definitions agree N (i) e = N (t) e = N, since Pr(T 2 = 1) = 1 2N, and E(T 2) = 2N.

Diploid Model In the diploid model with N f = cn females and N m = (1 c)n males: ( Pr(T 2 = 1) = 1 1 ) N. 2N 8N f N m Hence N e 4c(1 c)n. There are other robust ways of defining effective population size: but the differences are minor.

Mutation Three interesting models: The infinite alleles model Kimura and Crow 1964 The infinite sites model Kimura 1969 The finite sites model - Jukes and Cantor 1969 Mutations are assumed to be selectively neutral. Thus the mutation process can be separated from the genealogical process.

In the absence of selection, the mutational process and the transmission of genes from one generation to the next are independent processes. Thus a sample configuration or n genes can be simulated using a two step procedure: 1 Simulate the genealogy of n genes; 2 Add mutations to the genealogy according to the chosen model.

The Wright-Fisher Model with Mutation Impose a process of mutation on top of the process of reproduction. Each gene chosen to be passed on is subject to a mutation with probability u. [[With probability 1 u the gene is copied without modification to the offspring, and with probability u it mutates.]] If we follow a lineage from the present time to the past, then with probability u the parental gene in generation t differs from the offspring gene at time t + 1. The probability that a lineage experiences the first mutation j generations back is Pr(T M = j) = u(1 u) j 1 u u 1 e uj.

Continuous Approximation If time is measured in units of 2N generations (like in coalescence) then Pr(T M j) = 1 (1 u) j 1 e θt/2 = Pr(T c M t), where t = j/(2n), θ = 4Nu and TM c is the time in 2N (assumed large) generations units. The parameter θ is called the population mutation rate or the scaled mutation rate. It also tells us about how fixation and mutations work against each other...

n > 2 Lineages Consider n disjoint lineages. The time until the first mutation event in any of the n lineages is exponentially distributed with parameter nθ/2. If we wait for mutation events of coalescence events then the parameter of the exponentially distributed waiting time is the sum of the two parameters, which is ( n 2 ) + nθ 2 n(n 1 + θ) =. 2

Whether the first event is a coalescence or a mutation is determined by a Bernoulli trial: With probability ( n 2) ( n ) 2 + nθ 2 the event is a coalescence; and With probability it is a mutation. = n 1 n 1 + θ, θ n 1 + θ,

Stochastic Algorithm to Sample Genealogies with Mutations Algorithm 1 Start with k = n genes (sample size). Repeat until k = 1: 1 Simulate the waiting time T c k to the next event Tk c Exp(k(k 1 + θ)/2). 2 With probability (k 1)/(k 1 + θ) the event is coalescence, and with probability θ/(k 1 + θ) the event is mutation. 3 Case : Choose a random pair (i, j) with 1 i < j k uniformly from the `k 2 possible pairs. Merge i and j into one gene and decrease the sample size by one: k k 1. 4 Case Mutation: Choose a lineage at random to leave. The sample size k remains unchanged.

[End of Lecture #10] ***THE END***