Statistical population genetics

Lecture 7: Infinite alleles model
Xavier Didelot, Dept of Statistics, Univ of Oxford
didelot@stats.ox.ac.uk

Infinite alleles model

We now discuss the effect of mutations. Kimura and Crow (1964) proposed the following mutational model:

Definition (The infinite alleles model). Each mutation creates a new allele.

[Figure: a sample of six genes carrying alleles A, A, B, C, D, C]

Data from the infinite alleles model can be represented as a vector $a = (a_1, \ldots, a_n)$, where $a_i$ is the number of alleles for which $i$ copies exist in the sample of size $n$. The vector $a$ is called the allelic partition of the data. It satisfies

$$n = \sum_{i=1}^{n} i\,a_i \quad\text{and}\quad K_n = \sum_{i=1}^{n} a_i,$$

where $K_n$ is the number of allele types. For example, for the sample A, A, B, C, D, C shown above, we have $n = 6$, $K_n = 4$ and $a = (2,2,0,0,0,0)$.
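As a minimal sketch of this definition, the allelic partition can be computed from a list of allele labels. The function name `allelic_partition` and the representation of a sample as a Python list of strings are assumptions made for illustration; they do not come from the lecture.

```python
from collections import Counter


def allelic_partition(sample):
    """Return a = (a_1, ..., a_n), where a_i counts the alleles seen exactly i times."""
    n = len(sample)
    # allele label -> number of copies of that allele in the sample
    copies_per_allele = Counter(sample)
    # number of copies -> number of alleles having that many copies
    multiplicities = Counter(copies_per_allele.values())
    return tuple(multiplicities.get(i, 0) for i in range(1, n + 1))


# The six-gene sample used in the example: A, A, B, C, D, C
sample = ["A", "A", "B", "C", "D", "C"]
a = allelic_partition(sample)
n = sum(i * a_i for i, a_i in enumerate(a, start=1))  # n = sum_i i * a_i
k = sum(a)                                            # K_n = sum_i a_i
print(a, n, k)  # (2, 2, 0, 0, 0, 0) 6 4
```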

Number of alleles

Theorem (Number of alleles).

$$P(K_n = k) = \frac{S(n,k)\,\theta^k}{\prod_{i=0}^{n-1} (\theta + i)}$$

where $S(n,k)$ is the unsigned Stirling number of the first kind.

Proof. If the last event was a coalescence, then just before that we had $n-1$ lineages and $k$ distinct alleles. If the last event was a mutation, then the mutating lineage carries a unique allele, and the $n-1$ other lineages contained $k-1$ distinct alleles. It follows that:

$$P(K_n = k) = \frac{\theta}{n-1+\theta}\,P(K_{n-1} = k-1) + \frac{n-1}{n-1+\theta}\,P(K_{n-1} = k)$$

with initial condition $P(K_1 = 1) = 1$. Solving this recursive equation gives the result.
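The agreement between the recursion and the closed form can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the value of θ is an arbitrary choice, and the Stirling numbers are assumed to be the unsigned ones, computed from the standard recurrence c(n,k) = (n-1) c(n-1,k) + c(n-1,k-1). Exact rational arithmetic is used so that the two sides can be compared with `==`.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling1(n, k):
    """Unsigned Stirling number of the first kind (standard recurrence)."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)


def prob_k_closed(n, k, theta):
    """Closed form: S(n,k) * theta^k / prod_{i=0}^{n-1} (theta + i)."""
    denom = Fraction(1)
    for i in range(n):
        denom *= theta + i
    return stirling1(n, k) * theta**k / denom


@lru_cache(maxsize=None)
def prob_k_recursive(n, k, theta):
    """Recursion from the proof, with P(K_1 = 1) = 1."""
    if k < 1 or k > n:
        return Fraction(0)
    if n == 1:
        return Fraction(1)
    # Last event was a mutation: n-1 other lineages carry k-1 alleles.
    mutation = theta / (n - 1 + theta) * prob_k_recursive(n - 1, k - 1, theta)
    # Last event was a coalescence: n-1 lineages already carried k alleles.
    coalescence = (n - 1) / (n - 1 + theta) * prob_k_recursive(n - 1, k, theta)
    return mutation + coalescence


theta = Fraction(3, 2)  # arbitrary illustrative value, not from the lecture
n = 6
for k in range(1, n + 1):
    assert prob_k_recursive(n, k, theta) == prob_k_closed(n, k, theta)
total = sum(prob_k_closed(n, k, theta) for k in range(1, n + 1))
print(total)  # 1
```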

[Figure: Number of alleles]

Ewens sampling formula

Theorem (Ewens sampling formula). The probability of an allelic partition $a$ in a sample of size $n$ is equal to:

$$P_n(a) = \frac{n!}{\prod_{i=0}^{n-1} (\theta + i)} \prod_{j=1}^{n} \left(\frac{\theta}{j}\right)^{a_j} \frac{1}{a_j!}$$

This formula is called the Ewens sampling formula (ESF) because it was discovered by Ewens (1972). The ESF has since been found to have many applications, and is thus an important result in theoretical probability.
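The formula is straightforward to evaluate directly. The following sketch is an illustration under stated assumptions: the names `esf` and `allelic_partitions` are hypothetical, θ is passed as an integer or `Fraction` so that arithmetic stays exact, and the value of θ used in the check is arbitrary. It evaluates the ESF and confirms that the probabilities of all allelic partitions of a sample of size 6 sum to one.

```python
from fractions import Fraction
from math import factorial


def esf(a, theta):
    """Ewens sampling formula for an allelic partition a = (a_1, ..., a_n)."""
    theta = Fraction(theta)
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = Fraction(1)
    for i in range(n):
        rising *= theta + i  # prod_{i=0}^{n-1} (theta + i)
    prob = Fraction(factorial(n)) / rising
    for j, a_j in enumerate(a, start=1):
        prob *= theta**a_j / (j**a_j * factorial(a_j))  # (theta/j)^{a_j} / a_j!
    return prob


def allelic_partitions(n):
    """Yield every vector a = (a_1, ..., a_n) with sum_j j * a_j = n."""
    def block_sizes(remaining, largest):
        if remaining == 0:
            yield []
            return
        for size in range(min(remaining, largest), 0, -1):
            for rest in block_sizes(remaining - size, size):
                yield [size] + rest

    for sizes in block_sizes(n, n):
        a = [0] * n
        for size in sizes:
            a[size - 1] += 1
        yield tuple(a)


print(esf((2, 2, 0, 0, 0, 0), 1))  # 1/16
total = sum(esf(a, Fraction(5, 2)) for a in allelic_partitions(6))
print(total)  # 1
```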

Proof. Let $e_i$ be the vector of size $n$ filled with zeros except for a one at the $i$-th position. We decompose $P_n(a)$ according to whether the last event was a coalescence ($C$) or a mutation ($M$):

$$P_n(a) = P(a \mid C)\,P(C) + P(a \mid M)\,P(M) = \frac{n-1}{n-1+\theta}\,P(a \mid C) + \frac{\theta}{n-1+\theta}\,P(a \mid M)$$

If the last event was a mutation, then the mutating lineage has a unique allelic type and the $n-1$ other lineages need to generate the rest of the profile, i.e. $a - e_1$, so that $P(a \mid M) = P_{n-1}(a - e_1)$. If $a_1 = 0$ then this probability is equal to zero.

If the last event was a coalescence, we decompose $P(a \mid C)$ according to all the profiles $a'$ of size $n-1$ that could be observed just before the coalescence:

$$P(a \mid C) = \sum_{a'} P_{n-1}(a')\,P(a \mid C, a')$$

The coalescence may have happened between any two genes that share the same allele in $a$. Let $j$ denote the number of copies in $a$ of the allele of the genes that coalesced. Given $j$, we have $a' = a - e_j + e_{j-1}$. Thus:

$$P(a \mid C) = \sum_{j=2}^{n} P_{n-1}(a - e_j + e_{j-1})\,P(a \mid C, a - e_j + e_{j-1})$$

The last term is the probability that the coalescence event involves one of the $(j-1)(a_{j-1}+1)$ genes whose allele has $j-1$ copies in $a - e_j + e_{j-1}$. Since there are $n-1$ genes in $a - e_j + e_{j-1}$, we have:

$$P(a \mid C, a - e_j + e_{j-1}) = \frac{(j-1)(a_{j-1}+1)}{n-1}$$

Putting this all together we get:

$$P_n(a) = \frac{\theta}{n-1+\theta}\,P_{n-1}(a - e_1) + \frac{n-1}{n-1+\theta} \sum_{j=2}^{n} \frac{(j-1)(a_{j-1}+1)}{n-1}\,P_{n-1}(a - e_j + e_{j-1})$$

with boundary condition $P_1(1) = 1$ and $P_n(a) = 0$ if any of the $a_j < 0$. Solving this recursion leads to the ESF.
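The recursion can be evaluated directly and compared with the closed-form ESF. The sketch below is one possible implementation, not the lecture's: the function names are hypothetical, θ is assumed to be a `Fraction`, and, as a simplifying assumption, all vectors keep the same length throughout the recursion (trailing zeros) instead of shrinking to length $n-1$.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial


def esf(a, theta):
    """Closed-form Ewens sampling formula."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    prob = Fraction(factorial(n))
    for i in range(n):
        prob /= theta + i
    for j, a_j in enumerate(a, start=1):
        prob *= theta**a_j / (j**a_j * factorial(a_j))
    return prob


@lru_cache(maxsize=None)
def esf_recursive(a, theta):
    """P_n(a) from the mutation/coalescence recursion."""
    if any(a_j < 0 for a_j in a):
        return Fraction(0)               # P_n(a) = 0 if any a_j < 0
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    if n == 1:
        return Fraction(1)               # boundary condition P_1(1) = 1

    def shifted(minus, plus=None):
        """Return a - e_minus (+ e_plus), using 1-based positions."""
        b = list(a)
        b[minus - 1] -= 1
        if plus is not None:
            b[plus - 1] += 1
        return tuple(b)

    # Mutation term: the previous profile was a - e_1.
    mutation = theta / (n - 1 + theta) * esf_recursive(shifted(1), theta)
    # Coalescence term: the previous profile was a - e_j + e_{j-1}.
    coalescence = Fraction(0)
    for j in range(2, len(a) + 1):
        weight = Fraction((j - 1) * (a[j - 2] + 1), n - 1)
        coalescence += weight * esf_recursive(shifted(j, j - 1), theta)
    return mutation + (n - 1) / (n - 1 + theta) * coalescence


theta = Fraction(5, 2)  # arbitrary illustrative value
for a in [(3, 0, 0), (1, 1, 0), (0, 0, 1), (2, 2, 0, 0, 0, 0)]:
    assert esf_recursive(a, theta) == esf(a, theta)
```

Memoisation with `lru_cache` matters here because the same smaller profiles are reached along many paths of the recursion.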

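Example

For a sample of size $n = 3$, there are three possible allelic profiles: $(3,0,0)$, $(1,1,0)$ and $(0,0,1)$, with respective probabilities:

$$P_3(3,0,0) = \frac{\theta}{\theta+2}\,P_2(2,0) = \frac{\theta^2}{(\theta+1)(\theta+2)}$$

$$P_3(1,1,0) = \frac{\theta}{\theta+2}\,P_2(0,1) + \frac{2}{\theta+2}\,P_2(2,0) = \frac{\theta}{\theta+2}\cdot\frac{1}{\theta+1} + \frac{2}{\theta+2}\cdot\frac{\theta}{\theta+1} = \frac{3\theta}{(\theta+1)(\theta+2)}$$

$$P_3(0,0,1) = \frac{2}{\theta+2}\,P_2(0,1) = \frac{2}{(\theta+1)(\theta+2)}$$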

Sufficiency of number of alleles

Definition (Sufficiency of a statistic). A statistic $T(X)$ is sufficient for an underlying parameter $\theta$ if the conditional distribution of the data $X$ given the statistic $T(X)$ does not depend on $\theta$, i.e.:

$$P(X \mid T(X), \theta) = P(X \mid T(X))$$

Theorem (Sufficiency of the number of alleles). The number of alleles is a sufficient statistic for parameter $\theta$.

Proof. Since the number of alleles $K_n$ is completely determined by the allelic profile $a$, the distribution of $a$ given $K_n$ reduces to:

$$P(a \mid K_n = k, \theta) = \frac{P_n(a)}{P(K_n = k)} = \frac{n!}{S(n,k)} \prod_{j=1}^{n} \frac{1}{j^{a_j}\,a_j!}$$

This distribution does not depend on $\theta$, therefore $K_n$ is sufficient for parameter $\theta$.
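The cancellation of θ can be illustrated numerically. In the sketch below, the function names are hypothetical and the θ values are arbitrary. It computes the conditional probability as the ratio of the ESF to the distribution of $K_n$, and checks that the result is the same for several values of θ. The Stirling numbers are assumed to be unsigned.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def stirling1(n, k):
    """Unsigned Stirling number of the first kind."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return (n - 1) * stirling1(n - 1, k) + stirling1(n - 1, k - 1)


def rising_factorial(theta, n):
    """prod_{i=0}^{n-1} (theta + i)."""
    result = Fraction(1)
    for i in range(n):
        result *= theta + i
    return result


def esf(a, theta):
    """Ewens sampling formula P_n(a)."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    prob = Fraction(factorial(n)) / rising_factorial(theta, n)
    for j, a_j in enumerate(a, start=1):
        prob *= theta**a_j / (j**a_j * factorial(a_j))
    return prob


def prob_k(n, k, theta):
    """P(K_n = k)."""
    return stirling1(n, k) * theta**k / rising_factorial(theta, n)


def conditional_given_k(a, theta):
    """P(a | K_n = k, theta), computed as P_n(a) / P(K_n = k)."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    k = sum(a)
    return esf(a, theta) / prob_k(n, k, theta)


a = (2, 2, 0, 0, 0, 0)
values = {conditional_given_k(a, Fraction(t)) for t in ("1/10", "1", "7/2", "40")}
print(values)  # {Fraction(9, 17)}
```

The set contains a single element, which is the sense in which the conditional distribution carries no information about θ.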

Example

Coyne (1976) studied the xanthine dehydrogenase gene (Xdh) of Drosophila persimilis by electrophoresis. This method reveals whether two genes are identical, but not how closely related they are. The infinite alleles model is therefore particularly well suited for the analysis of such data. The study found $K_n = 23$ alleles in a sample of $n = 60$ individuals with the following allelic profile:

$$a_1 = 18,\quad a_2 = 3,\quad a_4 = 1,\quad a_{32} = 1$$

What is the maximum likelihood estimator of $\theta$ based on these data?

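Since $K_n$ is sufficient for $\theta$, we estimate $\theta$ based on $K_n$ only. The likelihood of $\theta$ is:

$$L(\theta) = P(K_{60} = 23 \mid \theta) = \frac{S(60,23)\,\theta^{23}}{\prod_{i=0}^{59} (\theta + i)}$$

Taking the logarithm, $l(\theta) = \log L(\theta)$, and differentiating with respect to $\theta$ gives:

$$\frac{dl(\theta)}{d\theta} = \frac{23}{\theta} - \sum_{i=0}^{59} \frac{1}{\theta + i}$$

This is equal to zero when:

$$23 = \sum_{i=0}^{59} \frac{\theta}{\theta + i}$$

Solving this equation numerically gives a maximum likelihood estimate of $\hat\theta \approx 13.17$.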
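The likelihood equation has no closed-form solution, but its right-hand side increases with θ, so a simple bisection finds the root. The sketch below is one way to do this and is not necessarily the method used in the lecture. The function names, the search bracket and the tolerance are all assumptions chosen for illustration.

```python
def expected_alleles(theta, n):
    """Right-hand side of the likelihood equation: sum_{i=0}^{n-1} theta / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))


def mle_theta(k, n, lo=1e-6, hi=1e4, tol=1e-10):
    """Solve k = sum_i theta / (theta + i) by bisection (assumes 1 < k < n)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_alleles(mid, n) < k:
            lo = mid   # too few expected alleles: theta must be larger
        else:
            hi = mid   # too many expected alleles: theta must be smaller
    return (lo + hi) / 2


# Xdh data: K_n = 23 alleles in a sample of n = 60
theta_hat = mle_theta(k=23, n=60)
print(round(theta_hat, 2))
```

Under these assumptions the root lands close to the value 13.17 quoted above. The bracket `[lo, hi]` must contain the root, which holds here because the sum is close to 1 for very small θ and close to $n$ for very large θ.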

Summary

- In the infinite alleles model, each mutation creates a new allele
- The Ewens sampling formula gives the probability of a dataset occurring under this mutational model
- We derived the distribution of the number of alleles
- The number of alleles is a sufficient statistic in this model, making it very useful to draw inference from genetic data
- The infinite alleles model is particularly well suited to analyse data from electrophoresis