Coalescent based demographic inference. Daniel Wegmann University of Fribourg


Introduction
The current genetic diversity is the outcome of past evolutionary processes. Hence, we can use genetic diversity to tell stories about the past.
But this is a challenging task!
- The history of natural populations is usually complex.
- Several evolutionary processes can leave similar footprints (e.g. bottleneck vs. selection).
- Loci are not independent, but correlated realizations of the same process.
Novembre & Ramachandran (2011)

Qualitative inference
Traditionally, we have relied on qualitative inference.
Example: the out-of-Africa expansion via sequential founder effects in humans. Heterozygosity decays with distance from East Africa.
Ramachandran et al. (2005)

Model-based inference
- Patterns of genetic diversity may serve as evidence for or against stories of the evolutionary past.
- Such stories are usually vague ("serial founder effects").
- While the evidence may be strong, the argument remains verbal and is potentially subjective.
Model-based inference provides statistical support.
"Essentially, all models are wrong, but some are useful." (George E. P. Box)
Qualitative inference is key when constructing sensible models!

Rejection of a Null Model
The same as hypothesis testing in frequentist statistics: a null model M is rejected using a summary statistic s if
$P(s \geq s_{obs} \mid M) < \alpha$
By convention, $\alpha = 0.05$.
Often the null model is an isolated Wright-Fisher population of constant size.

Rejection of a Null Model: F-Statistics
- $F_{ST}$ may be used to reject a panmictic population in favor of a specific structure.
- $F_{IS}$ may be used to reject a panmictic population in favor of non-random mating (inbreeding or substructure).
The significance of F-statistics is usually assessed using permutation or randomization approaches.

Rejection of a Null Model: F-Statistics
[Figure; labels: Western, Eastern, Central, Bonobo. Wegmann & Excoffier, MBE, 2010]

Rejection of a Null Model: Tajima's D
Tajima's D compares two estimates of $\theta = 4N\mu$ for a Wright-Fisher population of constant size:
- one based on the number of segregating sites $S$
- one based on the average number of pairwise differences $\pi$
These estimates may differ when the assumptions of the Wright-Fisher population are violated. An expanding population, for instance, leads to a negative D.
Significance is usually assessed via simulations.
[Figure panels: Wright-Fisher population; expanding population]
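To make the comparison of the two estimates concrete, here is a minimal sketch of Tajima's D using the standard normalizing constants from Tajima (1989). The function name `tajimas_d` and the toy alignments are illustrative assumptions, not part of the slides. The statistic is only defined for at least four sequences, because for fewer the two estimates are always identical.

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences (n >= 4)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("Tajima's D needs at least 4 sequences")
    # S: number of alignment columns showing more than one allele
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if S == 0:
        return float("nan")  # undefined without segregating sites
    # pi: average number of differences over all pairs of sequences
    n_pairs = n * (n - 1) / 2
    pi = sum(sum(x != y for x, y in zip(s1, s2))
             for s1, s2 in combinations(seqs, 2)) / n_pairs
    # constants of the variance term (Tajima 1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # difference between the pairwise estimate and the S-based estimate of theta
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Hypothetical toy alignments: only singletons (excess of rare variants)
# versus two balanced haplotypes (excess of intermediate frequencies).
only_singletons = ["AAAA", "TAAA", "ATAA", "AATA"]
balanced = ["AA", "AA", "TT", "TT"]
print(tajimas_d(only_singletons), tajimas_d(balanced))
```

An excess of rare variants, as expected after an expansion, pushes D below zero; an excess of intermediate-frequency variants pushes it above zero.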

The Felsenstein Equation
The Likelihood Function: The probability of the data D given the parameters of the model Θ: $P(D \mid \Theta)$
Maximum Likelihood Inference: The maximum likelihood estimates are the values of Θ for which the likelihood $P(D \mid \Theta)$ is maximized.
Bayesian Statistics: The goal is to infer the probability of the parameters Θ given the data D. According to probability theory,
$P(\Theta \mid D) = \frac{P(D \mid \Theta) P(\Theta)}{P(D)} = \frac{P(D \mid \Theta) P(\Theta)}{\int_\Theta P(D \mid \Theta) P(\Theta) \, d\Theta}$
Here, $P(\Theta)$ is the prior probability, the probability of the parameter before looking at the data (yes, this is subjective!). $P(\Theta \mid D)$ is the posterior probability of the parameter after considering the data.

Felsenstein Equation: Mutation Model
Likelihood of sequence data given a genealogy
The link between sequencing data D and some demographic parameters Θ is the underlying, unknown genealogy. Given a genealogy $G_i$ and a mutation model µ, the likelihood of the data is straightforward to calculate.
Ind 1: aagacacaga gatagaccag
Ind 2: aagacgcaga gatagaccag
Ind 3: aagacacaga tatagacaag
Assuming all mutations occur with rate µ:
$P(D \mid G_i, \mu) = \prod_{b \in \{\text{Branches}\}} P(\#\text{mutations on } b \mid \text{length}(b), \mu)$
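The product over branches can be sketched as follows. This assumes (the slide does not say so) that the number of mutations on a branch is Poisson distributed with mean µ × length(b). The tree shape, the branch lengths and the assignment of mutations to branches are hypothetical and only loosely inspired by the three sequences above, where Ind 2 carries one private difference and Ind 3 carries two.

```python
import math

def poisson_pmf(k, mean):
    """Probability of k events under a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def genealogy_likelihood(branches, mu):
    """P(D | G, mu) as a product over branches.

    branches: list of (number_of_mutations, branch_length) tuples.
    Assumption: mutations on a branch are Poisson with mean mu * length.
    """
    likelihood = 1.0
    for n_mutations, length in branches:
        likelihood *= poisson_pmf(n_mutations, mu * length)
    return likelihood

# Hypothetical genealogy ((Ind 1, Ind 2), Ind 3), lengths in generations
branches = [
    (0, 1000.0),  # Ind 1
    (1, 1000.0),  # Ind 2: one private mutation
    (0, 1500.0),  # internal branch above (Ind 1, Ind 2)
    (2, 2500.0),  # Ind 3: two private mutations
]
print(genealogy_likelihood(branches, mu=2e-4))
```

The per-locus mutation rate `mu=2e-4` is likewise an arbitrary illustration value.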

The Felsenstein Equation
Calculating $P(D \mid \Theta)$ requires integrating over all possible genealogies, weighting each by its probability:
$P(D \mid \Theta, \mu) = \int_G P(D \mid G, \mu) \, P(G \mid \Theta) \, dG$
The Felsenstein Equation in practice: Unfortunately, this integral is impossible to solve analytically in all but some extremely simple models. In practice, we thus approximate this integral using a random sample of coalescent trees:
$P(D \mid \Theta, \mu) \approx \frac{1}{N} \sum_{i=1}^{N} P(D \mid G_i, \mu)$ where $G_i \sim P(G \mid \Theta)$
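One of the extremely simple models where the integral can be solved is a sample of two sequences from a constant-size population, which makes it a convenient check of the Monte Carlo approximation. The sketch below rests on assumptions not taken from the slides: the genealogy is reduced to a single coalescence time drawn from an exponential distribution with mean 2N generations, mutations are Poisson along both branches, and the number of sampled trees is called `n_trees` to avoid a clash with the population size N.

```python
import math
import random

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean**k / math.factorial(k)

def mc_likelihood(k_diffs, N, mu, n_trees=50_000, seed=1):
    """Monte Carlo estimate of P(D | N, mu) for two sequences with k_diffs differences."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # sample a genealogy: coalescence time in generations, mean 2N
        t = rng.expovariate(1.0 / (2 * N))
        # both branches have length t, so mutations are Poisson(2 * mu * t)
        total += poisson_pmf(k_diffs, 2 * mu * t)
    return total / n_trees

def exact_likelihood(k_diffs, N, mu):
    """Closed form for the same model: geometric in theta = 4 N mu."""
    theta = 4 * N * mu
    return theta**k_diffs / (1 + theta) ** (k_diffs + 1)

print(mc_likelihood(2, N=1000, mu=2.5e-4), exact_likelihood(2, N=1000, mu=2.5e-4))
```

With these illustration values θ = 1, so the exact likelihood of two differences is 1/8, and the Monte Carlo average should land close to it.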

Felsenstein Equation: Primer in Coalescent Theory
Coalescent theory: A population genetic theory that considers the history of a sample backward in time.
Coalescent event: Two sampled lineages have the same parent in the previous generation.
Probability to coalesce: Under random mating in a population of constant size, two lineages coalesce in the previous generation with probability
$\Pr(\text{2 lineages coalesce}) = \frac{1}{2N}$
Expected time $t_2$ until two lineages coalesce (time to the Most Recent Common Ancestor, MRCA): $E[t_2] = 2N$ generations.

Felsenstein Equation: Coalescence with multiple samples
Probability of coalescence:
$\Pr(\text{at least one coalescent event}) = \binom{k}{2} \frac{1}{2N} = \frac{k(k-1)}{4N}$
Intuitive explanation: the probability of coalescence among k lineages = the probability of coalescence among two lineages, $\frac{1}{2N}$, times the number of possible pairs, $\binom{k}{2}$.
Expected time $t_k$ until the first coalescence among k lineages: the expected waiting time until an event occurs for the first time is given by the inverse of the probability of the event!
$E[t_k] = \frac{1}{\binom{k}{2} \frac{1}{2N}} = \frac{2N}{\binom{k}{2}} = \frac{4N}{k(k-1)}$
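The waiting-time argument can be checked with a small simulation: in every generation a coalescent event happens with probability k(k-1)/(4N), and we count the generations until the first event. This is a sketch under the slide's approximation (it needs k(k-1) ≤ 4N so that the value is a valid probability); the function names and the choice N = 50 are illustrative assumptions.

```python
import random

def waiting_time(k, N, rng):
    """Generations until the first coalescent event among k lineages."""
    p = k * (k - 1) / (4 * N)  # per-generation probability of an event
    t = 1
    while rng.random() >= p:   # no event this generation, go one further back
        t += 1
    return t

def mean_waiting_time(k, N, reps=10_000, seed=1):
    rng = random.Random(seed)
    return sum(waiting_time(k, N, rng) for _ in range(reps)) / reps

N = 50
for k in (2, 5):
    expected = 4 * N / (k * (k - 1))
    print(k, mean_waiting_time(k, N), expected)
```

For k = 2 the simulated mean should be close to 2N, in line with the expected time to the MRCA of two lineages.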

Felsenstein Equation: Expected genealogy of n samples (lineages)
Height versus length of a genealogy of n samples:
$E[T_n] = 4N \left(1 - \frac{1}{n}\right)$
$E[L_n] = 4N \sum_{k=1}^{n-1} \frac{1}{k}$
[Figure: $E[L_n]$ and $E[T_n]$ (axis from 0N to 28N) against sample size n (2 to 512)]
Note: Adding additional samples increases the expected tree height only marginally, but increases the tree length a lot. Doubling the sample size increases the expected tree length by about $4N \ln 2 \approx 2.8N$.
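The two expectations are easy to tabulate, which reproduces the contrast between tree height and tree length for the sample sizes on the plot axis. A minimal sketch; the function names are assumptions, and N is set to 1 so that all values are in units of N generations.

```python
import math

def expected_height(n, N):
    """E[T_n] = 4N (1 - 1/n), in generations."""
    return 4 * N * (1 - 1 / n)

def expected_length(n, N):
    """E[L_n] = 4N * sum_{k=1}^{n-1} 1/k, in generations."""
    return 4 * N * sum(1 / k for k in range(1, n))

N = 1  # report everything in units of N generations
for n in (2, 4, 8, 16, 32, 64, 128, 256, 512):
    print(n, round(expected_height(n, N), 3), round(expected_length(n, N), 3))

# Gain in expected tree length when going from 256 to 512 samples
print(expected_length(512, N) - expected_length(256, N), 4 * math.log(2))
```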

Deep resequencing data set
Data set:
- 202 known or prospective drug target genes
- 14,002 individuals, of which 12,514 Europeans
- Median coverage of 27x and a call rate of 90.7%
Extensive quality control:
- Heterozygous concordance: 99.1% in 130 sample duplicates; 99.0% in comparison to 1000G trios
- Singleton concordance: 98.5% in 130 sample duplicates; 98.3% of 245 validated via Sanger
John Novembre, Matt Nelson
Wegmann & Nelson et al. 2012

Rare variants are only weakly affected by selection
[Figure: expected number of alleles with frequency x, for advantageous, neutral and disadvantageous alleles. Messer 2009]

Phenotypic Effect of Rare Variants
Rare variants have a strong, negative impact on the phenotype:
- 85% of nonsynonymous (NS) mutations are deleterious enough never to get fixed
- 75% are deleterious enough never to get common (MAF of 5%)
Similar patterns are found by PolyPhen.
Wegmann & Nelson et al. 2012

Joint inference of demography and mutation rates
Mutation rate µ and population size N have similar effects on genetic diversity.
[Figure panels: large population, low mutation rate; small population, large mutation rate]
If sample size > effective population size, the rate of recent coalescent events is independent of µ, which renders estimation of µ and N individually possible.
Problem: Likelihood calculation is intractable!
Wakeley and Takahashi 2002

Joint inference of demography and mutation rates
Using Monte Carlo simulations to approximate $P(SFS \mid \mu, N)$:
- Simulate genealogies with fixed parameter values
- Compute the average likelihood of the SFS across genealogies
Model: Africa, Asia, Europe; exponential growth in Europe; all other parameters fixed to the Schaffner estimates.
[Figure: Likelihood 1, Likelihood 2, Likelihood 3, combined into the Average Likelihood]
Nielsen 2000; Coventry et al. 2010

Joint inference of demography and mutation rates
- Rapid population growth in Europe
- Variable mutation rates across genes ($p < 10^{-16}$)
- Median mutation rate of $1.2 \times 10^{-8}$
- Lower than divergence-based estimates ($2.5 \times 10^{-8}$)
- But in good agreement with recent estimates from pedigrees
[Figure axes: mutation rate; population size (millions)]

Mode of Speciation in Rose Finches
In the classic view, geographic isolation was considered essential for speciation. However, recent evidence suggests that local adaptation and speciation may occur in the presence of gene flow if ecological selection is strong.
In birds, the Z-chromosome is known to play a vital role in speciation:
- Haldane's rule: in hybrids, fitness is lower in the hemizygous sex (females).
- Male sexually selected traits and female preference were mapped to the Z-chromosome in several species.
Prediction: If selection against hybrids is a driving force in speciation, gene flow will be interrupted earlier on the Z-chromosome than on autosomes.

Mode of Speciation in Rose Finches
Inferring isolation times for Z-linked and autosomal markers separately.
Carpodacus vinaceus (Himalaya), Carpodacus formosa (Taiwan)
Shou-Hsien Li

Two major difficulties
For realistic evolutionary models, analytical solutions of the likelihood function are usually very hard and often impossible to obtain. We will use two tricks:
1) Using summary statistics S instead of the full data D. The hope is that $P(D \mid \theta)$ is proportional to $P(S \mid \theta)$.
2) Using simulations to approximate the likelihood function $P(S \mid \theta)$.
Apply in a Bayesian setting: $P(\theta \mid D) \propto P(D \mid \theta) \, P(\theta)$ (posterior ∝ likelihood × prior).
Approximate Bayesian Computation (ABC)

Approximate Bayesian Computation (ABC): Standard ABC Algorithm
1) Defining statistics: the data are reduced to summary statistics ($S$, $\pi$, $F_{ST}$, $D$, ...)
2) Generating simulations according to the prior
3) Accepting close simulations
4) Estimating the posterior distribution
Tavaré et al. (1997); Weiss & von Haeseler (1998)
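The four steps can be sketched on a toy problem: estimating θ = 4Nµ from the number of segregating sites S in a sample of n sequences. Everything specific here is an assumption made for illustration: the uniform prior on θ, the tolerance `eps`, the observed value S = 10, and the simulator, which draws coalescent waiting times for a constant-size population and drops Poisson mutations on the resulting tree length.

```python
import random

def simulate_segregating_sites(theta, n, rng):
    """Simulate S for n sequences under the constant-size coalescent.

    Time is in units of 2N generations: k lineages coalesce at rate k(k-1)/2
    and mutations arise at rate theta/2 per unit of branch length.
    """
    tree_length = 0.0
    for k in range(n, 1, -1):
        tree_length += k * rng.expovariate(k * (k - 1) / 2)
    mean = theta / 2 * tree_length
    # Poisson draw: count unit-rate exponential arrivals before `mean`
    s, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        s += 1
        elapsed += rng.expovariate(1.0)
    return s

def abc_rejection(s_obs, n, prior_max, n_sims, eps, seed=1):
    """Return the theta values whose simulated S is within eps of s_obs."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, prior_max)            # draw from the prior
        s_sim = simulate_segregating_sites(theta, n, rng)
        if abs(s_sim - s_obs) <= eps:                  # accept close simulations
            accepted.append(theta)
    return accepted

posterior = abc_rejection(s_obs=10, n=10, prior_max=20.0, n_sims=20_000, eps=1)
print(len(posterior), sum(posterior) / len(posterior))  # sample size, posterior mean
```

The accepted values form a sample from the approximate posterior; a smaller `eps` gives a better approximation at the cost of fewer accepted simulations.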

Approximate Bayesian Computation (ABC): post-sampling regression adjustment
1) Defining statistics
2) Generating simulations according to the prior
3) Accepting close simulations
4) Post-sampling regression adjustment: regression to project points to $s_{obs}$. Assumption: no change in prior weight.
5) Estimating the posterior distribution
Beaumont et al. (2002); Blum & François (2009)
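The projection to $s_{obs}$ can be sketched with a single parameter and a single statistic. This is a simplified illustration, not the published method: it uses an unweighted least-squares line among the accepted simulations (Beaumont et al. use a weighted local-linear regression), and the toy model, where the statistic is the parameter plus standard normal noise, is entirely hypothetical.

```python
import random

def toy_abc(s_obs, n_sims, eps, seed=1):
    """Rejection step on a toy model: theta ~ U(0, 10), s = theta + N(0, 1)."""
    rng = random.Random(seed)
    thetas, stats = [], []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, 10.0)
        s = theta + rng.gauss(0.0, 1.0)
        if abs(s - s_obs) <= eps:
            thetas.append(theta)
            stats.append(s)
    return thetas, stats

def regression_adjust(thetas, stats, s_obs):
    """Shift each accepted theta along the fitted line to where s equals s_obs."""
    n = len(thetas)
    mean_t = sum(thetas) / n
    mean_s = sum(stats) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(stats, thetas))
    var = sum((s - mean_s) ** 2 for s in stats)
    beta = cov / var  # least-squares slope of theta on s
    return [t - beta * (s - s_obs) for t, s in zip(thetas, stats)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

thetas, stats = toy_abc(s_obs=5.0, n_sims=50_000, eps=1.0)
adjusted = regression_adjust(thetas, stats, s_obs=5.0)
print(variance(thetas), variance(adjusted))  # the adjustment narrows the sample
```

Because the adjustment removes the spread caused by accepting simulations that are only close to $s_{obs}$, a wider tolerance can be used without inflating the posterior as much.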

ABC-GLM
1) Defining statistics
2) Generating simulations according to the prior
3) Accepting close simulations
4) Fitting a simple likelihood model: it is easy to show that the posterior is proportional to the product of the truncated likelihood and the truncated prior. The truncated likelihood is assumed to follow a GLM (estimated via OLS); the truncated prior is estimated from the retained sample using Gaussian peaks.
5) Estimating the posterior distribution
Note: other models could be used, GLM was chosen due to laziness...
Chris Leuenberger
Leuenberger & Wegmann (2010)

Mode of Speciation in Rose Finches Joint posterior asymmetry observed in simulated data sets 51.5%

Cross River Gorilla
Olaf Thalmann
Thalmann et al. (2011)

Hybridizing ABC with Full Likelihood
Example: Estimating continuous trait evolution on phylogenetic trees
Data D: a backbone tree with clades of unknown phylogenetic relationships; trait values given as mean and variance within each clade.
$L(D \mid a, \sigma^2, \lambda, \mu) = \int_T L(D \mid a, \sigma^2, T) \, P(T \mid \lambda, \mu) \, dT$
Brownian model of trait evolution: $a$ = root state of trait, $\sigma^2$ = rate of trait evolution
Phylogenetic birth-death process: $\lambda$ = species birth rate, $\mu$ = species death rate
Inference: ABC-MCMC combined with Metropolis-Hastings
Graham Slater
Slater et al. (2011)
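As a rough illustration of the ABC-MCMC ingredient only, the sketch below runs a generic likelihood-free Metropolis-Hastings chain in the style of Marjoram et al. (2003) on a toy model. It is not the hybrid scheme of Slater et al.: the simulator (statistic = parameter plus standard normal noise), the uniform prior, the Gaussian proposal and all tuning values are hypothetical stand-ins. With a uniform prior and a symmetric proposal, the Metropolis-Hastings ratio is 1 inside the prior, so a proposal is accepted whenever its simulated statistic is close enough to the observation.

```python
import random

def abc_mcmc(s_obs, n_steps, eps, prior=(0.0, 10.0), proposal_sd=1.0,
             start=5.0, seed=1):
    """Likelihood-free MCMC on a toy model with s = theta + N(0, 1)."""
    rng = random.Random(seed)
    low, high = prior
    theta = start
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, proposal_sd)   # symmetric proposal
        if low <= proposal <= high:                      # zero prior density outside
            s_sim = proposal + rng.gauss(0.0, 1.0)       # simulate instead of evaluating a likelihood
            if abs(s_sim - s_obs) <= eps:
                theta = proposal                         # accept the move
        chain.append(theta)                              # otherwise the chain stays put
    return chain

chain = abc_mcmc(s_obs=5.0, n_steps=20_000, eps=1.0)
print(sum(chain) / len(chain))
```

In a hybrid scheme, parameters whose likelihood can be computed would be updated with an ordinary Metropolis-Hastings ratio, and only the remaining part would rely on the simulation-and-distance check shown here.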

Application to Body Size Evolution in Carnivora
Several members of the semiaquatic Pinnipedia attain very large body sizes. Did body size evolve faster among Pinnipedia than all other Carnivora?
Southern Elephant Seal: up to 4,000 kg
Walrus: up to 1,800 kg
Slater et al. (2011)

Conclusions
While often preferred, model-based inference in biology is challenging due to the stochasticity and complexity of realistic models. As a consequence, we often rely on
- approximate inference schemes: it may help to replace the full data with summary statistics. Approximate Bayesian Computation is an extremely flexible but crude approach.
- or approximate models: approximating models such that they fit standard inference schemes.
On the bright side: such techniques allow us to estimate what we are really interested in, rather than requiring us to shift to problems for which analytical solutions are available.