Diffusion Models in Population Genetics

Laura Kubatko (kubatko.2@osu.edu)
MBI Workshop on Spatially-varying stochastic differential equations, with application to the biological sciences
July 10, 2015

Population Genetics

- Population genetics: the study of genetic variation within a population.
- Assume that a gene has two alleles; call them A and a.
- The population is composed of N individuals who have two copies of each gene, so the possible genotypes are AA, Aa, and aa.
- The population evolves over time.
- We are interested in the composition of the population at generation t.
- We need a model for how a generation is derived from the previous generation.

Wright-Fisher Model

Assumptions:
- Population of 2N gene copies.
- Discrete, non-overlapping generations of equal size.
- Parents of the next generation of 2N genes are picked randomly with replacement from the preceding generation (genetic differences have no fitness consequences).
- The probability of a specific parent for a gene in the next generation is $1/(2N)$.
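The assumptions above can be sketched as a short simulation. This is a minimal illustration, not code from the talk: the function names, the starting count, and the population size are hypothetical choices, and only the Python standard library is used.

```python
import random

def wright_fisher_step(count_a, two_n, rng):
    """Sample the number of A copies in the next generation.

    Each of the 2N offspring copies picks a parent uniformly at random,
    with replacement, so it is of type A with probability count_a / two_n.
    """
    p = count_a / two_n
    return sum(1 for _ in range(two_n) if rng.random() < p)

def wright_fisher_path(count_a0, two_n, generations, seed=0):
    """Return [Y_0, Y_1, ..., Y_generations] for the neutral model."""
    rng = random.Random(seed)
    path = [count_a0]
    for _ in range(generations):
        path.append(wright_fisher_step(path[-1], two_n, rng))
    return path

# Hypothetical example: 2N = 20 gene copies, 10 of type A, 50 generations.
path = wright_fisher_path(count_a0=10, two_n=20, generations=50)
print(path)
```

With a population this small, paths typically hit 0 or 2N within a few dozen generations and then stay there.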

Wright-Fisher Model

Source: Popvizard, a Python program to simulate evolution under the WF and other models, written by Peter Beerli. http://people.sc.fsu.edu/~pbeerli/popvizard.tar.gz

The Wright-Fisher Model

View the Wright-Fisher model as a discrete-time Markov process.

Let $Y_t$ = number of alleles of type A in the population at generation $t$, with $0 \le Y_t \le 2N$ for $t = 0, 1, \ldots$

Define $p_{ij} = P(Y_{t+1} = j \mid Y_t = i)$. Then

$$p_{ij} = \begin{cases} \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(\frac{2N-i}{2N}\right)^{2N-j}, & j = 0, 1, \ldots, 2N \\ 0, & \text{otherwise} \end{cases}$$

States 0 and 2N are absorbing states: we can never leave these states.
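As a small check of the transition probabilities, the sketch below builds the full matrix for a toy population. The function name and the choice 2N = 10 are illustrative assumptions.

```python
from math import comb

def wf_transition_matrix(two_n):
    """Return P with P[i][j] = P(Y_{t+1} = j | Y_t = i) for the neutral model."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n  # probability that one offspring copy is of type A
        row = [comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
               for j in range(two_n + 1)]
        matrix.append(row)
    return matrix

P = wf_transition_matrix(10)  # toy population with 2N = 10

# Every row is a probability distribution over the next state.
print(max(abs(sum(row) - 1.0) for row in P))
# States 0 and 2N are absorbing: all mass stays on the same state.
print(P[0][0], P[10][10])
```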

The Wright-Fisher Model

Note that:

$$E(Y_{t+1} \mid Y_t = i) = 2N \left(\frac{i}{2N}\right) = i$$

$$\mathrm{Var}(Y_{t+1} \mid Y_t = i) = 2N \left(\frac{i}{2N}\right)\left(1 - \frac{i}{2N}\right)$$

So the expected number of A alleles remains the same, but the actual number may vary between 0 and 2N.

Classical approach: look at the limit as the population size $N \to \infty$, which gives Kingman's Coalescent Process.
- Widely used in population genetics and phylogenetics.
- Difficult to extend to handle features of the evolutionary process, such as selection.

Wright-Fisher Model as a Diffusion Process

Define a diffusion process $\{X_t\}_{t \ge 0}$ as a continuous-time Markov process with approximately Gaussian increments over small time intervals and for which the following three conditions hold for small $\delta t$ and $X_t = x$:

$$E(X_{t+\delta t} - X_t \mid X_t = x) = \mu(t,x)\,\delta t + o(\delta t)$$

$$E((X_{t+\delta t} - X_t)^2 \mid X_t = x) = \sigma^2(t,x)\,\delta t + o(\delta t)$$

$$E((X_{t+\delta t} - X_t)^k \mid X_t = x) = o(\delta t) \quad \text{for } k > 2$$

From Radu's slides, we had

$$dX_t = S(X_t)\,dt + \sigma(X_t)\,dW_t,$$

where $S(X_t)$ is the drift coefficient and $\sigma(X_t)$ is the diffusion coefficient.

For standard Brownian motion, $\mu(t,x) = 0$ and $\sigma^2(t,x) = 1$.

Wright-Fisher Model as a Diffusion Process

Let $Y_t$ be the number of A alleles in the population at generation $t$.

Let $X_t$ = proportion of A alleles in the population at generation $t$; $X_t = Y_t/(2N)$.

Let $X_t$ represent the continuous-time process (eventually measure time in units of $2N$ generations, as before).

Define $\Delta Y_t = Y_{t+1} - Y_t$ and $\Delta X_t = X_{t+1} - X_t$. Then

$$E(Y_{t+1} \mid X_t = x) = 2Nx$$

$$E(\Delta Y_t \mid X_t = x) = 0$$

$$E[(\Delta Y_t)^2 \mid X_t = x] = 2Nx(1-x)$$

$$E(\Delta X_t \mid X_t = x) = 0 = \mu(t,x) = \mu(x)$$

$$E((\Delta X_t)^2 \mid X_t = x) = \frac{x(1-x)}{2N} = \sigma^2(t,x) = \sigma^2(x)$$

Here time is still measured in generations, so these are per-generation moments.
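The per-generation moments can be confirmed numerically from the binomial transition probabilities. The sketch below is only a sanity check; the function name and the values 2N = 200 and i = 60 are arbitrary illustrative choices.

```python
from math import comb

def next_generation_moments(i, two_n):
    """Exact mean and variance of Y_{t+1} given Y_t = i, from the binomial pmf."""
    p = i / two_n
    pmf = [comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
           for j in range(two_n + 1)]
    mean = sum(j * q for j, q in enumerate(pmf))
    var = sum((j - mean) ** 2 * q for j, q in enumerate(pmf))
    return mean, var

two_n, i = 200, 60          # arbitrary example values
x = i / two_n               # current allele frequency X_t = x
mean_y, var_y = next_generation_moments(i, two_n)

mean_dx = (mean_y - i) / two_n   # E(Delta X_t | X_t = x), should be 0
var_dx = var_y / two_n**2        # E((Delta X_t)^2 | X_t = x), since the mean is 0

print(mean_dx, var_dx, x * (1 - x) / two_n)
```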

Wright-Fisher Model as a Diffusion Process

Now re-define $\Delta Y_t = Y_{t+\Delta t} - Y_t$ and $\Delta X_t = X_{t+\Delta t} - X_t$, where $\Delta t = \frac{1}{2N}$, and let $N \to \infty$, so that

$$E((\Delta X_t)^2 \mid X_t) = X_t(1 - X_t)\,\Delta t$$

The corresponding SDE is

$$dX_t = \sqrt{X_t(1 - X_t)}\,dW_t, \quad X_t \in [0,1],$$

where $W_t$ is standard Brownian motion. (See Pardoux, 2009, for a rigorous proof.)

The Wright-Fisher Model with Selection

Model for selection: suppose that allele A is superior to allele a, so that

$$p_x = \frac{2Nx(1+s)}{2Nx(1+s) + (2N - 2Nx)}$$

As before, let $N \to \infty$ and define $s = \beta/(2N)$. Then

$$E(\Delta X_t \mid X_t) \approx \beta X_t(1 - X_t)\,\Delta t$$

$$E((\Delta X_t)^2 \mid X_t) \approx X_t(1 - X_t)\,\Delta t$$

The corresponding SDE is

$$dX_t = \beta X_t(1 - X_t)\,dt + \sqrt{X_t(1 - X_t)}\,dW_t, \quad X_t \in [0,1]$$
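The drift approximation follows from a short calculation. The sketch below assumes, as in the neutral case, that the next generation is drawn as $Y_{t+1} \sim \mathrm{Bin}(2N, p_x)$ given $X_t = x$; the slide does not state this explicitly.

```latex
\begin{align*}
E(\Delta X_t \mid X_t = x) &= p_x - x
  = \frac{x(1+s)}{1 + sx} - x
  = \frac{s\,x(1-x)}{1 + sx} \\
\intertext{Substituting $s = \beta/(2N)$ and $\Delta t = 1/(2N)$, so that $s = \beta\,\Delta t$:}
E(\Delta X_t \mid X_t = x) &= \frac{\beta\,x(1-x)\,\Delta t}{1 + \beta x\,\Delta t}
  \approx \beta\,x(1-x)\,\Delta t \quad \text{as } N \to \infty .
\end{align*}
```

The second moment is $p_x(1-p_x)/(2N) + (p_x - x)^2$, and since $p_x \to x$ while $(p_x - x)^2$ is of order $(\Delta t)^2$, it reduces to $x(1-x)\,\Delta t$ to leading order.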

The Wright-Fisher Diffusion with Selection: Intuition

Use the Euler method (see Radu's lectures) to simulate from the WF diffusion model:

$$X(t_{i+1}) = X(t_i) + \beta X(t_i)(1 - X(t_i))(t_{i+1} - t_i) + \sqrt{t_{i+1} - t_i}\,\sqrt{X(t_i)(1 - X(t_i))}\,Z,$$

where $Z \sim N(0,1)$.

Python code to simulate this:
- $T = 0.05$
- Define $0 = t_0 < t_1 < \cdots < t_{N-1} < t_N = T$, equally spaced
- Vary $\beta$
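The update rule above translates almost directly into code. The following is a minimal sketch rather than the code shown in the talk: the function name, the starting frequency `x0`, and the clipping of each step to [0, 1] are assumptions added so that the square root stays defined.

```python
import math
import random

def euler_wf_path(x0, beta, t_end=0.05, n_steps=1000, seed=None):
    """Euler approximation of dX = beta*X*(1-X) dt + sqrt(X*(1-X)) dW on [0, t_end]."""
    rng = random.Random(seed)
    dt = t_end / n_steps          # equally spaced grid t_0 < t_1 < ... < t_N = T
    x = x0
    path = [x]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)   # Z ~ N(0, 1)
        x = x + beta * x * (1 - x) * dt + math.sqrt(dt) * math.sqrt(x * (1 - x)) * z
        # Assumption: clip to [0, 1]; the Euler step can overshoot the boundaries.
        x = min(1.0, max(0.0, x))
        path.append(x)
    return path

# Hypothetical usage: start at frequency 0.5 and vary beta.
for beta in (0.0, 2.0, 10.0):
    path = euler_wf_path(x0=0.5, beta=beta, seed=1)
    print(beta, round(path[-1], 4))
```

Once a clipped path reaches 0 or 1 it stays there, because both the drift and the diffusion terms vanish at the boundaries.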

The Wright-Fisher Diffusion with Selection: Intuition

[Figure: Euler simulations with β = 0, varying N]

[Figure: Euler simulations with N = 1000, varying β]

Application: Inferring Selection From Genome-scale Data

Diffusion models are becoming more widely used in analyzing genome-scale data.

Example: Williamson, S. H. et al. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. PNAS 102(22): 7882-7887.

Data: NIEHS Environmental Genome Project web site (http://egp.gs.washington.edu)
- Sequenced 301 genes associated with variation in response to environmental exposure.
- 90 individuals: 24 African Americans, 24 Asian Americans, 24 European Americans, 12 Mexican Americans, and 6 Native Americans.

Goal: Detect selection in different types of mutations; distinguish selection from other demographic factors, such as population size change.

Application: Inferring Selection From Genome-scale Data

Data are recorded as SNPs: bases in the DNA sequence at which there is variation across individuals.

Example data:

Taxon          Sequence
(A) Human      GCCGATGCCGATGCCGAA
(B) Chimp      GCCGTTGCCGTTGCCGTT
(C) Gorilla    GCGGAAGCGGAAGCGGAA

Keeping only the variable sites, this would be:

Taxon          Sequence
(A) Human      CATCATCAA
(B) Chimp      CTTCTTCTT
(C) Gorilla    GAAGAAGAA

Application: Inferring Selection From Genome-scale Data

Example SNP data:

Taxon          Sequence
(A) Human      CATCATCAA
(B) Chimp      CTTCTTCTT
(C) Gorilla    GAAGAAGAA

Record this as the site frequency spectrum (SFS), denoted by the vector $u$, where entry $u_i$ = number of SNP sites with $i$ copies of the derived allele.

For the example, assuming that the ancestral state is that found in Gorilla, we have $u = (4, 5)$.

If we let Human be ancestral, we'd have $u = (9, 0)$.
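The two spectra quoted above can be reproduced by counting, at each site, how many sequences differ from the ancestral taxon. This is an illustrative sketch; the function name and the dictionary layout are assumptions, and any base that differs from the ancestral one is treated as derived.

```python
snps = {
    "Human":   "CATCATCAA",
    "Chimp":   "CTTCTTCTT",
    "Gorilla": "GAAGAAGAA",
}

def site_frequency_spectrum(snps, ancestral_taxon):
    """Return [u_1, ..., u_{n-1}], where u_i counts sites with i derived copies."""
    ancestral = snps[ancestral_taxon]
    n = len(snps)                       # number of sequences in the sample
    u = [0] * (n - 1)
    for site, ancestral_base in enumerate(ancestral):
        # Derived copies: sequences whose base differs from the ancestral one.
        derived = sum(1 for seq in snps.values() if seq[site] != ancestral_base)
        if derived >= 1:
            u[derived - 1] += 1
    return u

print(site_frequency_spectrum(snps, "Gorilla"))  # [4, 5]
print(site_frequency_spectrum(snps, "Human"))    # [9, 0]
```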

Application: Inferring Selection From Genome-scale Data

Idea of analysis: write the likelihood function and obtain MLEs of the parameters of interest.

Likelihood function for a sample of $K$ SNPs:

$$L(\beta) = \prod_{k=1}^{K} \Pr(i_k, n_k \mid \beta),$$

where $\Pr(i_k, n_k \mid \beta)$ is the probability that SNP $k$ is at frequency $i_k/n_k$.

$\Pr(i_k, n_k \mid \beta)$ comes from the diffusion model. How?
- Williamson et al. (2005): use numerical methods to approximate the diffusion.
- Today: use a naive sampling method based on the Euler approximation.
- Ongoing work (with Radu Herbei and Jeff Gory): use exact sampling from the WF diffusion to implement a Bayesian version of the model.

Application: Inferring Selection From Genome-scale Data

Naive method:

1. Use the Euler method to simulate a path from the WF diffusion with selection parameter $\beta$, and record the final allele frequency, $q$.
2. For the $q$ from step 1, simulate the data for a SNP by drawing $Y \sim \mathrm{Bin}(2n, q)$, where $n$ is the number of people in the sample.
3. Repeat steps 1-2 a large number of times, say $M$ (the larger, the better), to generate a set of simulated $Y$ values, $Y_1, Y_2, \ldots, Y_M$.
4. Form the estimates

$$\hat{P}_i(\beta) = \frac{1}{M} \sum_{m=1}^{M} I(Y_m = i)$$

The approximate likelihood is then

$$\hat{L}(\beta) = \prod_{k=1}^{K} \hat{P}_{i_k}(\beta)$$
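The four steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the results in the talk: the starting frequency `x0 = 0.5`, the horizon `t_end = 0.05`, the number of Euler steps, the clipping to [0, 1], and all function names are hypothetical. The likelihood is computed on the log scale, and a SNP count that was never simulated gives a log-likelihood of minus infinity.

```python
import math
import random

def simulate_final_frequency(x0, beta, t_end, n_steps, rng):
    """Step 1: Euler path of the WF diffusion with selection; return X(T)."""
    dt = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x += beta * x * (1 - x) * dt + math.sqrt(dt * x * (1 - x)) * rng.gauss(0.0, 1.0)
        x = min(1.0, max(0.0, x))  # assumption: clip to keep x in [0, 1]
    return x

def estimate_count_probabilities(beta, n_people, n_reps,
                                 x0=0.5, t_end=0.05, n_steps=100, seed=0):
    """Steps 2-4: return [P_hat_0, ..., P_hat_{2n}] from n_reps (= M) simulations."""
    rng = random.Random(seed)
    counts = [0] * (2 * n_people + 1)
    for _ in range(n_reps):
        q = simulate_final_frequency(x0, beta, t_end, n_steps, rng)
        # Y ~ Bin(2n, q), drawn as a sum of 2n Bernoulli(q) trials.
        y = sum(1 for _ in range(2 * n_people) if rng.random() < q)
        counts[y] += 1
    return [c / n_reps for c in counts]

def approximate_log_likelihood(observed_counts, p_hat):
    """log of L_hat(beta) = prod_k P_hat_{i_k}(beta); -inf if any factor is 0."""
    total = 0.0
    for i in observed_counts:
        if p_hat[i] == 0.0:
            return float("-inf")
        total += math.log(p_hat[i])
    return total

p_hat = estimate_count_probabilities(beta=2.0, n_people=15, n_reps=500)
print(approximate_log_likelihood([14, 15, 16, 17], p_hat))
```

Because each $\hat{P}_i$ is a Monte Carlo frequency, small $M$ leaves many entries at exactly zero, which is one reason the slides say that larger $M$ is better.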

Application: Inferring Selection From Genome-scale Data

Does it work? Simulate data for 15 people and 100 SNPs with various values of β and M.

[Figure: results for β = 0.2]

[Figure: results for β = 2.0]

[Figure: results for β = 10.0]

Application: Inferring Selection From Genome-scale Data

Does it work?
- Take the maximum value of the approximate likelihood as the MLE.
- Repeat the simulation multiple times and look at properties of the MLEs.

True β    Number of reps    Mean MLE    MSE
2.0       30                2.10        2.13
10.0      15                10.25       3.58
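The summary columns in the table can be computed as sketched below. The grid search, the function names, the toy log-likelihood, and the three toy estimates are all made up for illustration; they are not the values or the procedure behind the table.

```python
def grid_mle(log_likelihood, beta_grid):
    """Return the grid value of beta that maximizes the given log-likelihood."""
    return max(beta_grid, key=log_likelihood)

def summarize_mles(mles, true_beta):
    """Return (mean of the MLEs, mean squared error around the true beta)."""
    mean = sum(mles) / len(mles)
    mse = sum((b - true_beta) ** 2 for b in mles) / len(mles)
    return mean, mse

# Toy stand-in for an approximate log-likelihood, peaked at beta = 2.
def toy_log_likelihood(beta):
    return -(beta - 2.0) ** 2

beta_grid = [0.5 * k for k in range(21)]          # 0.0, 0.5, ..., 10.0
beta_hat = grid_mle(toy_log_likelihood, beta_grid)

# Toy MLEs from three hypothetical replicate simulations.
mean_mle, mse = summarize_mles([1.5, 2.0, 2.5], true_beta=2.0)
print(beta_hat, mean_mle, mse)
```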

Conclusions

- Diffusion models are being increasingly used for data analysis in population genetics.
- Methods used for estimation are mostly based on numerical approximations, rather than on statistical techniques.
- Promising area of application, as availability of whole-genome sequence data is increasing rapidly.

References

Wakeley, J. (2009) Coalescent Theory: An Introduction. Roberts and Company.

Williamson, S. et al. (2005) Simultaneous inference of selection and population growth from patterns of variation in the human genome. PNAS 102(22): 7882-7887.

Gutenkunst, R. et al. (2009) Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data. PLoS Genetics 5(10): e1000695.

Pardoux, E. (2009) Probabilistic models of population genetics. http://www.cmi.univ-mrs.fr/~pardoux/enseignement/cours_genpop.pdf

dadi: Diffusion Approximation for Demographic Inference. https://code.google.com/p/dadi/

Thank you!