Notes 20 : Tests of neutrality

Similar documents
Lecture 19 : Chinese restaurant process

Lecture 18 : Ewens sampling formula

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

122 9 NEUTRALITY TESTS

Neutral Theory of Molecular Evolution

On the inadmissibility of Watterson s estimate

Australian bird data set comparison between Arlequin and other programs

Frequency Spectra and Inference in Population Genetics

Demography April 10, 2015

Notes 1 : Measure-theoretic foundations I

Notes 6 : First and second moment methods

6 Introduction to Population Genetics

Notes 9 : Infinitely divisible and stable laws

6 Introduction to Population Genetics

Genetic Variation in Finite Populations

Statistical Tests of Neutrality of Mutations

Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates

Endowed with an Extra Sense : Mathematics and Evolution

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

I of a gene sampled from a randomly mating popdation,

MATH5745 Multivariate Methods Lecture 07

Classical Selection, Balancing Selection, and Neutral Mutations

An introduction to mathematical modeling of the genealogical process of genes

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

CBA4 is live in practice mode this week exam mode from Saturday!

Lecture 19 : Brownian motion: Path properties I

Statistical Tests for Detecting Positive Selection by Utilizing High. Frequency SNPs

Lecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox

Statistics. Statistics

Stochastic Demography, Coalescents, and Effective Population Size

Wald Lecture 2 My Work in Genetics with Jason Schweinsbreg

Economics 520. Lecture Note 19: Hypothesis Testing via the Neyman-Pearson Lemma CB 8.1,

Computational Systems Biology: Biology X

MS&E 226: Small Data

LECTURE 5 HYPOTHESIS TESTING

The Combinatorial Interpretation of Formulas in Coalescent Theory

Lecture 18 - Selection and Tests of Neutrality. Gibson and Muse, chapter 5 Nei and Kumar, chapter 12.6 p Hartl, chapter 3, p.

Statistical Tests for Detecting Positive Selection by Utilizing. High-Frequency Variants

Selection against A 2 (upper row, s = 0.004) and for A 2 (lower row, s = ) N = 25 N = 250 N = 2500

Population genetics snippets for genepop

Notes 18 : Optional Sampling Theorem

Theory. Pattern and Process

The Structure of Genealogies in the Presence of Purifying Selection: a "Fitness-Class Coalescent"

Crump Mode Jagers processes with neutral Poissonian mutations

f (1 0.5)/n Z =

Properties of Statistical Tests of Neutrality for DNA Polymorphism Data

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

STAT 536: Genetic Statistics

Modern Discrete Probability Branching processes

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences

NEUTRALITY tests are among the most widely used tools

University of California San Diego and Stanford University and

7. Tests for selection

Lecture 13 : Kesten-Stigum bound

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Testing Statistical Hypotheses

Population Genetics I. Bio

Notes 15 : UI Martingales

The Wright-Fisher Model and Genetic Drift

MS&E 226: Small Data

Statistics 135 Fall 2008 Final Exam

Chapter 6 Expectation and Conditional Expectation. Lectures Definition 6.1. Two random variables defined on a probability space are said to be

Coalescent based demographic inference. Daniel Wegmann University of Fribourg

An inferential procedure to use sample data to understand a population Procedures

Maximum-Likelihood Estimation: Basic Ideas

Chapter Seven: Multi-Sample Methods 1/52

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Lecture 9 : Identifiability of Markov Models

PROBABILITY VITTORIA SILVESTRI

A coalescent model for the effect of advantageous mutations on the genealogy of a population

The Genealogy of a Sequence Subject to Purifying Selection at Multiple Sites

Mathematical Structures Combinations and Permutations

Testing Statistical Hypotheses

Theoretical Population Biology

Theoretical Population Biology

Introduction to the Analysis of Variance (ANOVA)

Topic: Sampling, Medians of Means method and DNF counting Date: October 6, 2004 Scribe: Florin Oprea

Applying the Benjamini Hochberg procedure to a set of generalized p-values

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

Applications of Analytic Combinatorics in Mathematical Biology (joint with H. Chang, M. Drmota, E. Y. Jin, and Y.-W. Lee)

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction

5 Introduction to the Theory of Order Statistics and Rank Statistics

ST5215: Advanced Statistical Theory

Reflections and Rotations in R 3

Bounds for pairs in partitions of graphs

2.1 Elementary probability; random sampling

THE MINIMUM MATCHING ENERGY OF BICYCLIC GRAPHS WITH GIVEN GIRTH

Stochastic Convergence, Delta Method & Moment Estimators

Mathematical Models for Population Dynamics: Randomness versus Determinism

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Genetics: Early Online, published on February 26, 2016 as /genetics Admixture, Population Structure and F-statistics

Math 494: Mathematical Statistics

Multivariate Statistical Analysis

Convergence Time to the Ewens Sampling Formula

Mathematical models in population genetics II

Estimating Evolutionary Trees. Phylogenetic Methods

Math Review Sheet, Fall 2008

Transcription:

Notes 0 : Tests of neutrality MATH 833 - Fall 01 Lecturer: Sebastien Roch References: [Dur08, Chapter ]. Recall: THM 0.1 (Watterson s estimator The estimator is unbiased for θ. Its variance is which converges to 0. θ W = S n h n, Var[θ W ] = θ 1 + θ g n, h n h n Also we will need the following result about the structure of the coalescent: THM 0. Assume that Π n i has sets of size λ 1,..., λ i where the sets are ordered such that the first one contains 1, the second contains the smallest remaining element, etc. Let π be a permutation on {1,..., i} and define µ l = λ π(l, for l = 1,..., i. Then the vector (µ 1,..., µ i is distributed uniformly over vectors summing to n. 1 A class of θ estimators A standard way to test whether the data is consistent with our model is to look at the difference between two different θ estimators. Most estimators in the literature can be expressed as follows. Let η k be the number of segregating sites where the mutated alleles has frequency k, for k = 1,..., n 1. This is known as the site frequency spectrum. (One can also define a similar notion when the ancestral states are not known, using the least common allele as a reference leading to 1

Notes 0: Tests of neutrality the folded site frequency spectrum. See [Dur08, Section 1.4]. For constants c n,k, k = 1,..., n 1, consider the estimator ˆθ = c n,k η k. For instance, taking c n,k = 1/h n for all k gives θ W. (Recall that h n = n 1 i=1 1/i. Other important examples include: Choosing c n,k = 1 if k = 1 and 0 otherwise gives θ F L = η 1. Choosing c n,k = k(n k ( n, leads to θ π, the pairwise differences. To see that these are unbiased and potentially derive more estimators, we compute: THM 0.3 We have E[η k ] = θ k. For instance E[θ π ] = ( 1 n i = 1. Proof:(Theorem 0.3 Imagine that at each level of the coalescent, the individuals are numbered uniformly at random. Let Jl k be the number of descendants of edge l on level k. Letting L m be the total branch length with m sampled descendants, note that k L m = 1{Jl k = m}, t k l=1 where t k is the time duration of level k. (Note that since each edge on level k has at least one sampled descendant, k must be smaller or equal to n m + 1 for l to possibly have m sampled descendants. So E[L m ] = i=1 k(k 1 kp[j k 1 = m].

Notes 0: Tests of neutrality 3 By Theorem 0., P[J k 1 = m] = ( n m 1 k ( n 1 k 1 that is, the number of ways k 1 numbers sum to n m divided by the number of ways k numbers sum to n. Hence, E[L m ] = k(k 1 k ( n m 1 k ( n 1 k 1, (n m 1! (n k! (n 1! (n m k + 1! (n m 1!(m 1! ( n k (n 1! m 1 ( (n m 1!(m 1! n 1 (n 1! m m, where we used that the sum on the third line counts the number of ways to pick i elements from n 1 when the smallest one has index m 1. To compute the variance of the difference statistics, we also need the covariance between η i and η j. This is done in [Dur08, Section.1]. However, since the distribution of the statistics is not known, the tests are performed based on simulations. Tajima s D and related statistics Two common difference statistics are θ π θ W and θ W θ F L. The former (standardized is known as Tajima s D statistic, the latter, as Fu and Li s D. Under a null hypothesis of neutral evolution in a homogeneously mixing population of constant size, both statistics should be roughly 0. (However, their variance does not go to 0 as n because, unlike θ W, θ π and θ F L are no consistent. Since we are not interested in estimating θ, this will not matter here. Any significant deviation indicates the possibility that our assumptions may not be satisfied. To understand the effect of selection or population structure/growth on the difference

Notes 0: Tests of neutrality 4 statistics, we note that ( k(n k θ π θ W = ( n 1 η k h n and ( 1 θ W θ F L = 1{k = 1} η k. h n In both cases, the coefficient of η k is negative for small enough k; positive for middle k s. In other words, an excess of singletons will tend to produce negative values and an excess of middle frequencies will tend to produce positive values. Typical examples of phenomena creating these effects are: and Population growth tends to produce a coalescent with long pendant edges (star-shaped genealogy, and therefore, an excess of singletons. Deleterious mutations will tend to produce very low frequency alleles, and thefore, an excess of singletons. Population isolation will tend to produce a delayed deepest coalescence (chicken-legs genealogy, and therefore, to create an excess of middle frequencies. Balancing selection will tend to produce an excess of heterozygotes, and therefore, an excess of middle frequencies. Further reading The material in this section was taken from Chapter of the excellent monograph [Dur08].

Notes 0: Tests of neutrality 5 References [Dur08] Richard Durrett. Probability models for DNA sequence evolution. Probability and its Applications (New York. Springer, New York, second edition, 008.