The coalescent coalescence theory

Similar documents
Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 19

Chapter 6 Sampling Distributions

Approximations and more PMFs and PDFs

6.3 Testing Series With Positive Terms

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

UNIT 2 DIFFERENT APPROACHES TO PROBABILITY THEORY

Queuing Theory. Basic properties, Markovian models, Networks of queues, General service time distributions, Finite source models, Multiserver queues

Module 1 Fundamentals in statistics

CS284A: Representations and Algorithms in Molecular Biology

Chapter 8: STATISTICAL INTERVALS FOR A SINGLE SAMPLE. Part 3: Summary of CI for µ Confidence Interval for a Population Proportion p

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 15

Sample Size Estimation in the Proportional Hazards Model for K-sample or Regression Settings Scott S. Emerson, M.D., Ph.D.

Homework 5 Solutions

Discrete Mathematics for CS Spring 2005 Clancy/Wagner Notes 21. Some Important Distributions

Statistics 511 Additional Materials

Parameter, Statistic and Random Samples

Random Variables, Sampling and Estimation

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Sequences. Notation. Convergence of a Sequence

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics and Probability Theory Spring 2012 Alistair Sinclair Note 15

Math 140 Introductory Statistics

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

NUMERICAL METHODS FOR SOLVING EQUATIONS

Lecture 2: April 3, 2013

Simulation. Two Rule For Inverting A Distribution Function

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Binomial Distribution

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.

Lesson 10: Limits and Continuity

KLMED8004 Medical statistics. Part I, autumn Estimation. We have previously learned: Population and sample. New questions

10.6 ALTERNATING SERIES

Empirical Distributions

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

( ) = p and P( i = b) = q.

Test of Statistics - Prof. M. Romanazzi

April 18, 2017 CONFIDENCE INTERVALS AND HYPOTHESIS TESTING, UNDERGRADUATE MATH 526 STYLE

SINGLE-CHANNEL QUEUING PROBLEMS APPROACH

Stochastic Simulation

Lecture 19: Convergence

Application to Random Graphs

CSE 527, Additional notes on MLE & EM

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Final Review for MATH 3510

Lecture 12: November 13, 2018

Massachusetts Institute of Technology

Random Models. Tusheng Zhang. February 14, 2013

Discrete probability distributions

Lecture 7: Properties of Random Samples

Machine Learning Brett Bernstein

Response Variable denoted by y it is the variable that is to be predicted measure of the outcome of an experiment also called the dependent variable

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

Statisticians use the word population to refer the total number of (potential) observations under consideration

Sequences, Mathematical Induction, and Recursion. CSE 2353 Discrete Computational Structures Spring 2018

BHW #13 1/ Cooper. ENGR 323 Probabilistic Analysis Beautiful Homework # 13

Exam II Covers. STA 291 Lecture 19. Exam II Next Tuesday 5-7pm Memorial Hall (Same place as exam I) Makeup Exam 7:15pm 9:15pm Location CB 234

Problems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman:

Kurskod: TAMS11 Provkod: TENB 21 March 2015, 14:00-18:00. English Version (no Swedish Version)

Introduction to Probability and Statistics Twelfth Edition

Math 113 Exam 3 Practice

Median and IQR The median is the value which divides the ordered data values in half.

BIOS 4110: Introduction to Biostatistics. Breheny. Lab #9

SEQUENCES AND SERIES

Problem Set 4 Due Oct, 12

Generalized Semi- Markov Processes (GSMP)

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Topic 6 Sampling, hypothesis testing, and the central limit theorem

First come, first served (FCFS) Batch

Agreement of CI and HT. Lecture 13 - Tests of Proportions. Example - Waiting Times

1 Inferential Methods for Correlation and Regression Analysis

4. Partial Sums and the Central Limit Theorem

CS/ECE 715 Spring 2004 Homework 5 (Due date: March 16)

The standard deviation of the mean

Estimation for Complete Data

Tests of Hypotheses Based on a Single Sample (Devore Chapter Eight)

Expectation and Variance of a random variable

Infinite Sequences and Series

Topic 9: Sampling Distributions of Estimators

1. Hilbert s Grand Hotel. The Hilbert s Grand Hotel has infinite many rooms numbered 1, 2, 3, 4

Lecture 1 Probability and Statistics

Chapter 22. Comparing Two Proportions. Copyright 2010 Pearson Education, Inc.

Stat 421-SP2012 Interval Estimation Section

Chapter 18 Summary Sampling Distribution Models

MATH/STAT 352: Lecture 15

Chapter 22. Comparing Two Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

CS 330 Discussion - Probability

Design and Analysis of Algorithms

HOMEWORK I: PREREQUISITES FROM MATH 727

For example suppose we divide the interval [0,2] into 5 equal subintervals of length

A quick activity - Central Limit Theorem and Proportions. Lecture 21: Testing Proportions. Results from the GSS. Statistics and the General Population

HOMEWORK 2 SOLUTIONS

RECENT COMMON ANCESTORS OF ALL PRESENT-DAY INDIVIDUALS

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

n outcome is (+1,+1, 1,..., 1). Let the r.v. X denote our position (relative to our starting point 0) after n moves. Thus X = X 1 + X 2 + +X n,

Chapter 6 Part 5. Confidence Intervals t distribution chi square distribution. October 23, 2008

Section 11.8: Power Series

On forward improvement iteration for stopping problems

Power and Type II Error

The Poisson Process *

Transcription:

The coalescet coalescece theory Peter Beerli September 1, 009 Historical ote Up to 198 most developmet i populatio geetics was prospective ad developed expectatios based o situatios of today. Most work did provide expectatios about the future. With the easy availability of geetic data retrospective aalyses did catch up oly i phylogeetics (startig i the sixties). Oly Malécot, who pioeered lookig backwards i time i 1948, developed backwards lookig results i populatio geetics. Kigma expressed this lookig backwards i time approach as the coalescece of sampled lieages. He was ot the oly oe workig o such problem at the the time as with may great solutios it was i the air, see Hudso (1983) ad Tajima (1983). The coalescet A sample of gee copies is take at the preset time ad we are iterested i the acestral relatioship of these gee copies. We express time τ icreasig the further back i real time we go: τ 1 <τ meas that τ is further i the past tha τ 1. Kigma (198) ad Ewes (004) describe this backwards i time process with equivalece classes. Two copies are i the same equivalece class at time τ whe they have a commo acestor at that time. At time τ = 0 each idividual gee ca be cosidered i its ow equivalece class ad we could express this for a sample of =8 as φ 0 = {(a), (b), (c), (d), (e), (f), (g), (h)} Kigma s -coalescet describes the moves from φ 0 to a sigle equivalece class φ = {(a, b, c, d, e, f, g, h)}. 1

ISC-5317-Fall 009 Computatioal Evolutioary Biology a b c d e f a,b d,e d,e,f a,b,c a,b,c,d,e,f Figure 1: Example of the coalescece process All idividuals are i some equivalece relatio ξ ad we ca fid a ew equivalece relatio η by joiig two of the equivalece classes i ξ. This joiig process is called a coalescece, ad a series of such joiigs is called the coalescet or coalescece process. Figure gives a example of the relatioship of a sample ad the equivalece classes describig the process. It is assumed that the probability of a coalescece depeds o the waitig time δτ Prob(process i η at time τ + δτ process i ξ at time τ) =δτ (igorig higher order terms), ad if k is the umber of equivalece classes i ξ the Prob(process i ξ at time τ + δτ process i ξ at time τ) =1 δτ =1 k δτ We will see that if we apply the right time scale to τ the we will ed up i the more familiar terms that are commo i the applied populatio geetics literature. Kigma focused o the Caig model, ad sice the Wright-Fisher model is a special case of the results carry over easily. The coalescet is a approximatio to these models because it was developed o a cotiuous time scale whereas the Caig ad Wright-Fisher populatio models have discrete time. Ay fidigs usig this coalescet machiery eeds to be rescaled to the time scale of these discrete time models. I the coalescet framework oe has oly a sigle coalescet per ifiitesimal time period. This forces us to restrict the use of the coalescet to discrete time models were we ca guaratee that there is ot more tha oe coalescet evet occurrig per time period. For example, for a Wright-Fisher populatio we ca allow oly oe coalescet evet per geeratio. this soud rather restrictive but as log as the sample is much smaller tha the

ISC-5317-Fall 009 Computatioal Evolutioary Biology populatio N this situatio rarely occurs. Fu (006) calculates that a sample eeds to be less tha the square-root of N < N, whe we use the coalescet for a model such as the Wright-Fisher or some versio of the Caig model. Joe Felsestei shows this for the discrete case i the Wright-Fisher model i his course populatio geetics (http://evolutio.geetics.washigto.edu/pgbook/pgbook.html). The discrete coalescet ad the Wright-Fisher populatio model I the Wright-Fisher model we ca calculate the probability that two radom idividuals have a commo acestor oe geeratio earlier, very simply. Pickig two idividuals at radom, ad the decide the chace of havig a commo acestor: the first idividual has with probability 1 a acestor i the last geeratio ad the other had with probability 1/(N) the same acestor. I tur we also kow the probability that the two idividuals have o commo acestor i the last geeratio: 1 1/(N). Whe we have idividuals i the sample the we ca have pairs of idividuals that coalesce with a chace of 1/(N), therefore we have ever geeratio chace to see a coalescece of two idividuals is N = ( 1) 4N This rate of coalescece i the discrete Wright-Fisher model is similar to a coi toss where we wait with some rate (that the head appears), this looks like a geometric series, but i the case of the coalescece the series does ot use a fixed rate, but the rate chages because the coalescet evet reduces the umber i the sample by 1. The expected value of a geometric is approximately 1/rate, therefore the waitig time to reduce from lieages to 1 lieages is E(T ( 1 ) )= (1) 4N ( 1) =4N(1 1 ) () To reach the fial coalescet (the most recet commo acestor) we eed to wait util all lieages have coalesced we simply add up all the waitig times E(T MRCA )=4N k= 1 (3) 3

ISC-5317-Fall 009 Computatioal Evolutioary Biology The (cotious) coalescet ad the Wright-Fisher populatio model The coalescet process is i effect a sequece of 1 Poisso processes 1, with rates r k =,k =, 1,,..., describig the Poisso process at which two of the equivalece classes merge whe there are k equivalece classes. Sice these evets are comig from a Poisso distributio we ca calculate the expectatio for each iterval, which is 1/rate, here /(). All mergers of the equivalece classes are idepedet of each other so the expectatio of the whole coalescece process is E(T MRCA )= k= = 1 Compariso of the cotet of the sum with the coalescece rate makes clear that the Wright-Fisher populatio model is the stadard coalescet uits ad we eed to multiply by N to arrive at the more familiar geeratio time scale. k= The coalescet process results i a tree of a sample of idividuals. We call this ofte a geealogy as the the idividuals are typically from the same species or populatio (hece populatio size). Ofte the probability of the coalescet process for the Wright-Fisher populatio model is expressed as p(g N) = k= e u k k(k 1) 4N 4N, where u is expressed i geeratios. The expected T MRCA is the same as above. Ofte we will ot use a time scale i geeratios but geeratios mutatio rate ad the we would express the above formula as p(g Θ) = k= e u k k(k 1) Θ where Θ is 4 N e µ, withµ as the mutatio rate per geeratio ad site (whe usig sequece data), ad N e as the effective populatio size. Θ, Uder a strict Wright-Fisher populatio model 1 From Mathworld: A Poisso process is a process satisfyig the followig properties: 1. The umbers of chages i o-overlappig itervals are idepedet for all itervals.. The probability of exactly oe chage i a sufficietly small iterval is, where is the probability of oe chage ad is the umber of trials. 3. The probability of two or more chages i a sufficietly small iterval is essetially 0. I the limit of the umber of trials becomig large, the resultig distributio is called a Poisso distributio. 4

ISC-5317-Fall 009 Computatioal Evolutioary Biology Figure : Example of a coalescece structure i the Wright-Fisher model N = N e, but uder more biological scearios oe eeds to kow more about the life history of the species to traslate the N e ito real umbers. The coalescet ad the Mora populatio model The coalescet is a exact represetatio of the Mora model because the problems with the multiple coalescet evets i oe geeratio do ot occur. The Mora model allows oly oe lieage to chage at a give time. Therefore the limitatio to small sample size as we have see for the Wright-Fisher model is ot eeded. Usig our fidigs of the discussio of the Mora model earlier, but istead of thikig forward i time, thik backward i time. Lookig backwards we see that the Mora process is similar structured like the coalescece process. We have idividuals that are reduced i their acestry to 1,,... ad evetually to oe gee, the most commo recet acestor of the sampled idividuals. Assume that we are at a time where we have a sample of k idividuals, these are descedats of k 1 parets of oe of these parets was chose to reproduce ad the offsprig is 5

ISC-5317-Fall 009 Computatioal Evolutioary Biology i acestry of the sample of gees. The probability of this evet is ad with probability 1 (N), (N), the acestors remai at j. Tracig back this acestry the umber of death ad birth evets betwee the times whe there are j ad j 1 acestors follows a geometric distributio with parameters /(N) ad thus has a mea of E(u j )= (N) ow we ca assemble the expectatio for the time to the most recet commo acestor E(T MRCA )= k= =(N) (N) =(N) (1 1 ) k= 1 (4) If we assume that we sampled the whole populatio, where =N tha we derive the same result as with stadard (forward) theory. E(T MRCA )=(N) (1 1 )=(N) (1 1 )=N(N 1) N We also ca make the same observatio as we made with the Wright-Fisher populatio model. The coalescece time scale is by a factor (N) differet from the Mora model time scale (formula 4). We ca express the probability of the geealogy uder the Mora model usig the fact that the expoetial distributio is a good approximatio to the geometric distributio p(g N) = k= e u k k(k 1) (N) 1 (N), where u is expressed i geeratios. The expected T MRCA is the same as above. Ofte we will ot use a time scale i geeratios but geeratios mutatio rate ad the we would express the above formula as p(g Θ) = k= e u k k(k 1) Θ 4 Θ, where Θ is 4 N e µ, withµ as the mutatio rate per geeratio ad site (whe usig sequece data), ad N e as the effective populatio size. Uder a strict Wright-Fisher populatio model N = N e, but uder more biological scearios oe eeds to kow more about the life history of the species to traslate the N e ito real umbers. 6