Frequency Spectra and Inference in Population Genetics

Similar documents
The Wright-Fisher Model and Genetic Drift

How robust are the predictions of the W-F Model?

Population Genetics: a tutorial

Evolution in a spatial continuum

Genetic Variation in Finite Populations

Mathematical models in population genetics II

6 Introduction to Population Genetics

Computational Systems Biology: Biology X

Demography April 10, 2015

Closed-form sampling formulas for the coalescent with recombination

Evolutionary Theory. Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A.

6 Introduction to Population Genetics

Estimating Evolutionary Trees. Phylogenetic Methods

Stochastic Demography, Coalescents, and Effective Population Size

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

The tree-valued Fleming-Viot process with mutation and selection

ON COMPOUND POISSON POPULATION MODELS

p(d g A,g B )p(g B ), g B

Endowed with an Extra Sense : Mathematics and Evolution

Introduction to population genetics & evolution

Diffusion Models in Population Genetics

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

The Combinatorial Interpretation of Formulas in Coalescent Theory

Supporting information for Demographic history and rare allele sharing among human populations.

So in terms of conditional probability densities, we have by differentiating this relationship:

Learning ancestral genetic processes using nonparametric Bayesian models

Introduction to Advanced Population Genetics

Quantitative trait evolution with mutations of large effect

Theoretical Population Biology

Inventory Model (Karlin and Taylor, Sec. 2.3)

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

The nested Kingman coalescent: speed of coming down from infinity. by Jason Schweinsberg (University of California at San Diego)

Notes 20 : Tests of neutrality

Selection and Population Genetics

Wald Lecture 2 My Work in Genetics with Jason Schweinsbreg

Lecture 18 : Ewens sampling formula

Supporting Information

Population Genetics I. Bio

The problem Lineage model Examples. The lineage model

Challenges when applying stochastic models to reconstruct the demographic history of populations.

122 9 NEUTRALITY TESTS

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

Ancestral Inference in Molecular Biology. Simon Tavaré. Saint-Flour, 2001

APM 541: Stochastic Modelling in Biology Diffusion Processes. Jay Taylor Fall Jay Taylor (ASU) APM 541 Fall / 61

Populations in statistical genetics

Elementary Applications of Probability Theory

WXML Final Report: Chinese Restaurant Process

Full Likelihood Inference from the Site Frequency Spectrum based on the Optimal Tree Resolution

STAT 536: Genetic Statistics

NATURAL SELECTION FOR WITHIN-GENERATION VARIANCE IN OFFSPRING NUMBER JOHN H. GILLESPIE. Manuscript received September 17, 1973 ABSTRACT

Coalescent based demographic inference. Daniel Wegmann University of Fribourg

Genetic Drift in Human Evolution

Wright-Fisher Models, Approximations, and Minimum Increments of Evolution

Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data

Notes on Population Genetics

Stationary Distribution of the Linkage Disequilibrium Coefficient r 2

Mathematical Population Genetics II

Surfing genes. On the fate of neutral mutations in a spreading population

Similarity Measures and Clustering In Genetics

SWEEPFINDER2: Increased sensitivity, robustness, and flexibility

Fitness landscapes and seascapes

The mathematical challenge. Evolution in a spatial continuum. The mathematical challenge. Other recruits... The mathematical challenge

Spring 2012 Math 541B Exam 1

EVOLUTIONARY DISTANCES

Supporting Information Text S1

Calculation of IBD probabilities

Problems on Evolutionary dynamics

Some mathematical models from population genetics

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

Intraspecific gene genealogies: trees grafting into networks

Stat 516, Homework 1

The coalescent process

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Lecture 14 Chapter 11 Biology 5865 Conservation Biology. Problems of Small Populations Population Viability Analysis

stochnotes Page 1

Calculation of IBD probabilities

NEUTRAL EVOLUTION IN ONE- AND TWO-LOCUS SYSTEMS

AARMS Homework Exercises

Mathematical Population Genetics. Introduction to the Stochastic Theory. Lecture Notes. Guanajuato, March Warren J Ewens

The Λ-Fleming-Viot process and a connection with Wright-Fisher diffusion. Bob Griffiths University of Oxford

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences

Inférence en génétique des populations IV.

It has been more than 25 years since Lewontin

Mathematical Population Genetics II

Effective population size and patterns of molecular evolution and variation

BIRS workshop Sept 6 11, 2009

Theoretical Population Biology

Statistical population genetics

π b = a π a P a,b = Q a,b δ + o(δ) = 1 + Q a,a δ + o(δ) = I 4 + Qδ + o(δ),

Population Structure

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

ROBUST METHODS FOR ESTIMATING ALLELE FREQUENCIES SHU-PANG HUANG

Theoretical Population Biology

Australian bird data set comparison between Arlequin and other programs

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester

Observation: we continue to observe large amounts of genetic variation in natural populations

Statistical population genetics

Transcription:

Frequency Spectra and Inference in Population Genetics Although coalescent models have come to play a central role in population genetics, there are some situations where genealogies may not lead to efficient inference. Unphased SNP data: with short reads, it may not be possible to accurately infer haplotypes. Mixtures: with short reads, it may not be possible to assemble sequences corresponding to different individuals. Selection: although the coalescent can be extended to include the effects of selection, in general the resulting processes are difficult to simulate. Complex models: likelihood surfaces may be difficult to reconstruct in models with many parameters using inherently noisy stochastic simulations of the coalescent. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 1 / 17

An alternative approach is to use the distribution of allele frequencies in sampled populations to make inferences about the evolutionary forces operating in those populations. The allele frequency spectrum (AFS) is the distribution of the frequencies or counts of derived alleles in a sample of individuals calculated over all segregating sites. The folded allele frequency spectrum (folded AFS) is the distribution of the frequencies or counts of minor alleles in a sample calculated over all segregating sites. The AFS and folded AFS will be similar when the derived alleles are also the minor alleles, but this will not always be the case. However, the folded AFS can always be calculated, whereas the AFS can only be calculated if we can reliably distinguish derived from ancestral alleles. When it can be calculated, the AFS will be more informative about the population history than the folded AFS. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 2 / 17

Allele frequency spectra of autosomal and X-linked genes in M. musculus The allele frequency spectrum for a sample of n chromosomes can be represented by a vector a = (a 0,, a n) where a i denotes the number of segregating sites at which the derived allele is carried by exactly i chromosomes. For example, a 1 is the number of singletons in the sample. Athanosios et al. (2014) Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 3 / 17

Folded and Unfolded AFS in D. melanogaster Cooper et al. (2015) Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 4 / 17

To use the allele frequency spectrum for inference, we need to be able to calculate its likelihood under models that account for the evolutionary processes thought to be operating on the population. The probability density of the frequency x of the derived allele will depend on the choice of the model and its parameter through some function f (x Θ). If n chromosomes are sampled at random from a population in which the frequency of the derived allele is x, the probability that it will be sampled k times is given by the binomial distribution: P(k x) = ( n k ) x k (1 x) n k. The unconditional probability of sampling the derived allele k times can be calculated by integrating the preceding conditional probability with respect to the density of x: ( ) 1 n P(k Θ) = f (x Θ) x k (1 x) n k dx k 0 Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 5 / 17

If we assume that the segregating sites are independent of one another, then the total probability of the allele frequency spectrum A = (a 1, a 2,, a n 1) can be calculated as follows: n 1 P(A Θ) = e P(k Θ) P(k Θ) a k a k! k=1. This assumes an infinite sites model, with new mutations produced by a Poisson process. It also assumes that the segregating sites in the sample are unlinked (and hence statistically uncorrelated conditional on Θ). With linked sites, we can treat this as a composite likelihood, which is known to give consistent estimates of parameter values under many neutral models, even though it underestimates the variance in these estimates. Bootstraps (conventional and parametric) can be used to obtain confidence intervals when working with composite likelihoods. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 6 / 17

Joint Allele Frequency Spectrum If individuals are sampled from different populations, the distribution of allele frequencies cross populations can be characterized using the joint allele frequency spectrum, which specifies the joint distribution of the counts of derived alleles in different populations. For example, with two populations, we need to keep track of the number of segregating sites at which the derived allele was sampled i times in the first population and j times in the second population. To evaluate the likelihood of the joint AFS, we then need a model that will allow us to calculate the joint probability distribution f (x 1, x 2) of the derived allele frequencies in the two populations. Dependence between the allele frequencies in the different populations will arise because of shared ancestry and migration. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 7 / 17

Joint Allele Frequency Spectra of Two Populations (Simulated) Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 8 / 17

In principal, the distribution of the derived allele frequencies in a population could be calculated under either the Wright-Fisher model or some other suitable Markov chain (e.g., the Moran model). In practice, this is difficult because usually analytical expressions are not available for these models, which instead must be solved using calculations involving large matrices. In addition, changing the effective population size changes the state space of the Wright-Fisher model, which will increase computational costs. Fortunately, these complications can be sidestepped to a degree by working with an approximation to the Wright-Fisher model which is accurate when the effective population size is sufficiently large, say N e > 100. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 9 / 17

The following figure shows a series of simulations of the Wright-Fisher model for 100 generations for N = 10 (blue), 100 (red), 1000 (orange), and 10, 000 (green). Notice that both the size of the fluctuations and the total change in p over 100 generations decrease as the population size is increased. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 10 / 17

A different picture emerges if we plot each sample path against time measured in units of N generations. Here the total rescaled time is t = 1. In this case, the jumps become smaller, but the total change in p over N generations does not tend to 0. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 11 / 17

Diffusion Processes in Population Genetics In fact, it can be shown that as N, the rescaled processes (p (N) (Nt) : t 0) converge to a limiting process known as a diffusion approximation. Diffusion processes are Markov processes with continuous sample paths, i.e., there are no jumps. Diffusion approximations can be derived for the Wright-Fisher model with mutation and selection provided that these other processes are sufficiently weak. The transition densities f (p; t) satisfy a partial differential equation known as the Kolmogorov forward equation: ( ) 2 x(1 x) φ(x; t) = φ(x; t) ( ) sx(1 x)φ(x; t) t x 2 4N e x This equation can be solved numerically using sophisticiated techniques available for PDE s that do not require stochastic simulations. Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 12 / 17

Diffusion approximations can also be derived for models with more than one population. In this case, the joint density of the allele frequencies in the different populations can be found by solving the following PDE: where t φ = 1 2 P { } 2 xi (1 x i ) φ xi 2 4ν i [{ P γ i x i (1 x i ) + x i i=1 i=1 ν i is the relative effective population size of population i; ) } P M i j (x j x i ) φ γ i = 2Ns i is the scaled selective coefficient of the derived allele in population i; M i j = 2Nm i j is the scaled migration rate from j to i. This must be solved numerically, but again this can be done without stochastic simulations. j=1 Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 13 / 17

Example: Out of Africa analysis Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 14 / 17

Example: Out of Africa analysis Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 15 / 17

Example: Settlement of the New World analysis Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 16 / 17

Example: Settlement of the New World analysis Jay Taylor (ASU) The Frequency Spectrum 23 Feb 2017 17 / 17