
ABC estimation of the scaled effective population size

Geoff Nicholls, DTC 07/05/08

Refer to http://www.stats.ox.ac.uk/~nicholls/dtc/tt08/ for material. We will begin with a practical on ABC estimation of the parameter of a Poisson distribution from a few independent realizations of a Poisson distributed random variable. We will then move on to the main 3 exercises.

Substitution model

Let $x$ and $y$ be two aligned sequences of DNA bases, $x = (x_1, x_2, \ldots, x_N)$ etc, with $x_i, y_i \in \{A, C, G, T\}$ say. Suppose $y$ descends from $x$ over a time interval of length $\delta$. In the neutral independent finite sites substitution model, events at each site are independent, we assume neutral evolution, and we ignore processes such as indels which change the sequence length. Since the sequences are aligned, the base at site $i$ in $y$ evolves by substitution from the base at site $i$ in $x$. Let

$$P_{a,b} = \Pr(y_i(\delta) = b \mid x_i(0) = a)$$

and

$$\pi_a = \Pr(x_i = a).$$

Here $P_{a,b}$ models the probability for different kinds of changes, as a function of time, and $\pi_a$ gives the equilibrium proportions of the different bases (since we arrive at sequence $x$ following simulation from the infinite past). It is not too hard to show that

$$\pi_b = \sum_a \pi_a P_{a,b}$$

for the equilibrium base frequency vector $\pi$, or in matrix notation $\pi = \pi P$.

Parameterization, and the rate matrix, Q. We will parameterize $P$ via the rates $Q_{a,b}$ for $a \to b$ transitions. Clearly $P_{a,b} = P_{a,b}(t)$. We can see that $P_{a,b}(0) = 0$ if $a \neq b$ and $P_{a,b}(0) = 1$ if $a = b$, so $P(0) = I_4$ where $I_4$ is the $4 \times 4$ identity matrix. Parameterize

$$P_{a,b}(\delta) = Q_{a,b}\,\delta + o(\delta) \quad (a \neq b),$$
$$P_{a,a}(\delta) = 1 + Q_{a,a}\,\delta + o(\delta),$$
$$P(\delta) = I_4 + Q\delta + o(\delta),$$

the last line in matrix notation. The little-o notation means that $f(\delta)$ is $o(\delta)$ if $f(\delta)/\delta \to 0$ as $\delta \to 0$. These equations are just the Taylor expansion of $P$, using $P(0) = I_4$. In order to get $\sum_b P_{a,b} = 1$ up to $o(\delta)$ we need $Q_{a,a} = -\sum_{b \neq a} Q_{a,b}$. The $Q_{a,b}$ ($a \neq b$) parameters are important physical parameters, modeling the instantaneous rate at which an $a$ is substituted by a $b$.
However the diagonal elements of $Q$ are determined from the other elements - we do not wish to model $Q_{a,a}$, the rate at which nothing happens!

In order to fix the large $\delta$ behavior of $P(\delta)$ we impose a homogeneous Markov structure: the future is independent of the past given the present. For $\tau \in (0, t)$,

$$P_{a,b}(t) = \sum_{c \in \{A,C,G,T\}} P_{a,c}(\tau)\, P_{c,b}(t - \tau),$$

or in matrix notation $P_{a,b}(t) = [P(\tau)P(t - \tau)]_{a,b}$ (multiply the matrices and take the $(a, b)$ element). We can use this idea to cut a macro-time interval into micro-time pieces, in which our defining equations for $P(\delta)$ hold good:

$$P(t) = P^2(t/2) = P^n(t/n) = \left(I_4 + Qt/n + o(t/n)\right)^n \to e^{Qt},$$

where the last step is obtained by taking the limit as $n \to \infty$ and using $(I + A\delta)^{1/\delta} \to e^A$ as $\delta \to 0$, where $A$ is a suitable matrix (meaning, one for which $e^A = I + A + A^2/2 + \ldots$ is defined). This last expression $P(t) = \exp(Qt)$ is a matrix exponential. We could compute it by diagonalizing $Q = UDU^{-1}$ with $D$ a diagonal matrix of eigenvalues, and writing $P(t) = U \exp(Dt) U^{-1}$ (when $Q$ is symmetric, as in the Jukes-Cantor example below, $U$ can be chosen orthogonal so that $U^{-1} = U^T$). Here $\exp(Dt)$ is a diagonal matrix with entries $\exp(Dt)_{ii} = \exp(D_{ii} t)$. We can compute matrix exponentials in MatLab using expm().

Units of time. You can check from the relation between $P$ and $Q$ that if $\pi$ is the equilibrium base frequency (EBF) vector for $P$, so $\pi = \pi P$, then $\pi Q = 0$. We usually scale $Q$ to have units of substitutions, so there is one substitution per unit time. $Q$ generates $n = -\sum_a \pi_a Q_{a,a}$ substitutions per unit time, so if this isn't equal to one we set $Q \leftarrow Q/n$ to get the desired units.

Jukes-Cantor Example. One very simple substitution model is the Jukes-Cantor model. In this model all the off-diagonal entries in $Q$ are equal. It follows that $\pi = (1, 1, 1, 1)/4$ - all bases appear in equal proportions over time at a site, or along sites. In order to get a model in units of substitutions we need

$$Q = \begin{pmatrix} -1 & 1/3 & 1/3 & 1/3 \\ 1/3 & -1 & 1/3 & 1/3 \\ 1/3 & 1/3 & -1 & 1/3 \\ 1/3 & 1/3 & 1/3 & -1 \end{pmatrix}.$$

This is a nice case where we can actually compute $P(t) = \exp(Qt)$ in a closed form:

$$P_{a,a}(t) = \left(1 + 3e^{-4t/3}\right)/4, \qquad P_{a,b}(t) = \left(1 - e^{-4t/3}\right)/4 \ \text{ for } a \neq b.$$

We can model other types of substitution, allow for an increased rate for transitions over transversions, and so on. The model is easily extended to codon substitution models, with $61 = 64 - 3$ rather than 4 possible states.

Simulation, down a branch. In order to simulate the substitution process from $x$ to $y$ we simulate $x_i \sim \pi$ for each $i = 1, 2, \ldots, N$ (eg, if $U_i \sim U(0, 1)$ then we set $x_i$ equal to the smallest $a$ such that $\sum_{j=1}^{a} \pi_j > U_i$ - this generates $x_i = a$ with probability $\pi_a$). We then compute $P(t) = \exp(Qt)$ using expm(Q*t).
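The limit $(I_4 + Qt/n)^n \to e^{Qt}$ can be checked numerically against the Jukes-Cantor closed form, taken here with diagonal entries $(1 + 3e^{-4t/3})/4$ and off-diagonal entries $(1 - e^{-4t/3})/4$ so that $P(0) = I_4$. The following is a minimal sketch in plain Python rather than MatLab; the function names are illustrative assumptions, and the repeated squaring with $n = 2^{20}$ is just one convenient way to evaluate the power, not a substitute for expm().

```python
import math

def jc_rate_matrix():
    """Jukes-Cantor Q, scaled to one substitution per unit time."""
    return [[-1.0 if a == b else 1.0 / 3.0 for b in range(4)] for a in range(4)]

def matmul(A, B):
    """Plain matrix product of two square lists-of-lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, squarings=20):
    """Approximate exp(Q t) by (I + Q t / n)^n with n = 2**squarings."""
    n = 2 ** squarings
    size = len(Q)
    # Micro-time step: P(t/n) is approximately I + Q t / n.
    P = [[(1.0 if a == b else 0.0) + Q[a][b] * t / n for b in range(size)]
         for a in range(size)]
    # Each squaring doubles the elapsed time: P(2s) = P(s) P(s).
    for _ in range(squarings):
        P = matmul(P, P)
    return P

def jc_closed_form(t):
    """Closed-form Jukes-Cantor transition matrix."""
    decay = math.exp(-4.0 * t / 3.0)
    same = (1.0 + 3.0 * decay) / 4.0
    diff = (1.0 - decay) / 4.0
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

if __name__ == "__main__":
    t = 0.5
    approx = transition_matrix(jc_rate_matrix(), t)
    exact = jc_closed_form(t)
    worst = max(abs(approx[a][b] - exact[a][b])
                for a in range(4) for b in range(4))
    print("largest entrywise difference:", worst)
```

The two matrices should agree to several decimal places, and each row of either one should sum to one.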
Now for each $i = 1, 2, \ldots, N$ we simulate $y_i$ according to the probabilities in row $x_i$ of $P(t)$,

$$y_i \sim [P_{x_i,1}, P_{x_i,2}, P_{x_i,3}, P_{x_i,4}].$$

This is the same kind of discrete variate simulation we did for $x_i$: we take $V_i \sim U(0, 1)$ and set $y_i$ equal to the smallest $b$ such that $\sum_{j=1}^{b} P_{x_i,j} > V_i$.

Simulation, on a tree. In order to simulate synthetic data on the leaves of a tree, we will simulate a root sequence from the equilibrium base frequencies, using the scheme we just gave for simulating $x$. Next, for each edge of the tree we simulate the sequence at the child node given the sequence at the parent node. We do this second part using the scheme we just gave for simulating $y$ given $x$.

A) Practical on Finite Sites.

1. Implement a function function nseq=simulate(mu,dt,oseq) taking as input a rate $\mu$ [subs/time], a time interval dt and a $1 \times L$ character array oseq (use A = 1, C = 2 etc). Your program should simulate the substitution process acting on the sequence, and output a simulated sequence nseq evolved from oseq over the time dt. Implement this for the Jukes-Cantor substitution process. Make this function as efficient as you can.
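One way to think about the Jukes-Cantor case of this practical is sketched below in Python (not the MatLab function the practical asks for). It assumes bases are coded 1 to 4 and uses a shortcut that only holds for Jukes-Cantor: under the closed form $P_{a,b}(t) = (1 - e^{-4t/3})/4$ for $a \neq b$, a site ends in a different base with total probability $3(1 - e^{-4\mu\,dt/3})/4$, and the new base is then uniform over the other three. This is equivalent to sampling from row $x_i$ of $P$, but avoids forming the matrix. The `rng` argument is an illustrative addition.

```python
import math
import random

def simulate(mu, dt, oseq, rng=random):
    """Evolve oseq (list of ints in 1..4) under Jukes-Cantor.

    mu is the substitution rate and dt the elapsed time, so the
    transition matrix is exp(mu * Q * dt) for the unit-rate JC matrix Q.
    """
    # Probability that a site holds a different base after time dt.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * mu * dt / 3.0))
    nseq = []
    for base in oseq:
        if rng.random() < p_change:
            # By symmetry the new base is uniform over the other three.
            others = [b for b in (1, 2, 3, 4) if b != base]
            nseq.append(rng.choice(others))
        else:
            nseq.append(base)
    return nseq

if __name__ == "__main__":
    random.seed(1)
    print(simulate(1.0, 0.5, [1, 2, 3, 4, 1, 2, 3, 4]))
```

For a general rate matrix this shortcut is not available, and the inverse-CDF scheme described above on row $x_i$ of $P(t)$ would be needed instead.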

The Coalescent and Coalescent Simulation

The Coalescent is a model of the genealogy of $L$ individuals. Let time (ie age) increase into the past. We start out with time measured in generations. Take a Wright-Fisher population (single sex, fixed population size $N_e$, non-overlapping generations). In this model of ancestry each individual in the present generation chooses their parent uniformly at random from among the $N_e$ individuals in the previous generation.

[Figure: three panels showing a Wright-Fisher population, a Wright-Fisher genealogy and a coalescent genealogy; labels M = 20 and t.]

Take two individuals at random from the present generation (ie without reference to their ancestry) and trace their ancestry back in time. How many generations $M$ must we go back in time before the lineages coalesce at the most recent common ancestor (MRCA)?

$$\Pr(M = m) = \frac{1}{N_e}\left(1 - \frac{1}{N_e}\right)^{m-1}$$

$$\Pr(M \le m) = \sum_{M'=1}^{m} \frac{1}{N_e}\left(1 - \frac{1}{N_e}\right)^{M'-1} = \frac{1}{N_e} \cdot \frac{1 - \left(1 - \frac{1}{N_e}\right)^{m}}{1 - \left(1 - \frac{1}{N_e}\right)} = 1 - \left(1 - \frac{1}{N_e}\right)^{m}$$

Let $T = M/N_e$ (time in units of $N_e$ generations). Consider what happens when $N_e$, the population size, is large. Using $\lim_{a \to 0} (1 + a)^{1/a} = e$,

$$\Pr(T \le t) = 1 - \left(1 - \frac{1}{N_e}\right)^{N_e t}, \qquad \lim_{N_e \to \infty} \Pr(T \le t) = 1 - e^{-t}.$$

That is, the distribution of the time back to the MRCA in the coalescent approximation to the WF model is $T \sim \mathrm{Exp}(1)$. Of course real populations are not infinite, but if the limiting distribution is Exp(1) then we expect that for large populations the distribution of $T$ will be well approximated by Exp(1). Many different micro-time-scale models (WF, Moran, ...) have this same large-time-scale limit model (coalescent).

Simulation (two leaf case). As we move back in time, the two lineages ancestral to the two selected individuals coalesce at instantaneous rate $\lambda = 1$. So, to sample a two leaf tree we simply simulate $T_1 \sim \mathrm{Exp}(1)$, the time back to the first coalescent event.

When we come to deal with the genealogy of $L$ individuals we have the possibility of 3 or more lineages coalescing at the same time. Trees with events of this kind in them have a probability which is $O(1/N_e)$ relative to trees involving only binary coalescent events (since the probability to have any particular 3-way merge is $1/N_e^2$ compared to $1/N_e$ for the 2-lineage case). For this reason multifurcations are not important for large $N_e$ WF populations.

Simulation (L leaf case). How do we define and simulate an $L$-leaf coalescent genealogy?

(1) Let $k = L$, $j = 1$, $i = L + j$, $t_1 = t_2 = \ldots = t_L = 0$ and $V = \{1, 2, \ldots, L\}$. Think of the labels in $V$ as the labels of the leaf nodes.

(2) As we move back in time, each pair of the $k$ ancestral lineages we are following coalesces independently at instantaneous rate $\lambda = 1$. The number of pairs of lineages is $k(k - 1)/2$ so the total instantaneous coalescent rate for $k$ lineages is $\lambda k(k - 1)/2 = k(k - 1)/2$.

(3) Simulate the length of the time interval $T_j \sim \mathrm{Exp}(k(k - 1)/2)$ back to the next (into the past) coalescent event. Set $t_i = t_{i-1} + T_j$ (so, $t_{L+1} = t_L + T_1$ with $t_L = 0$, for the first coalescence event). The label for the tree node representing this coalescence will be $i$.

(4) Conditional on there being a coalescence event at time $t_i$, the pair of lineages involved could be any pair with equal probability, so choose two lineages $a, b$ (ie two child nodes) at random from $V$ and set $V \leftarrow (V \setminus \{a, b\}) \cup \{i\}$.

(5) If $i = 2L - 1$ (so we should have $k = 2$ and $j = L - 1$) we are at the root so stop. Otherwise set $j \leftarrow j + 1$ (move to the next interval), $k \leftarrow k - 1$ (which has one less lineage) and $i = L + j$ (the label for the next node) and go back to Step 2.

Example: figure needed here.

B) Practical on the coalescent and Finite Sites.

1. Implement a function function s=coal(L) taking as input the number of leaves, and simulating the coalescent in natural units (ie $N_e$ generations for a population of size $N_e$). Your code should return a tree, which I recommend you implement as an array s(1:(2L-1)) of structures s(i) = struct('name', i, 'seq', [], 'age', 0, 'child', []); with s(i).name = i, s(i).seq a $1 \times N$ base character sequence (at this stage left [] empty), s(i).age the age of node i on the tree, which you simulate via the coalescent, and s(i).child a $1 \times 2$ array giving the labels of the child nodes of this node (and empty if i is a leaf), which again are determined in the coalescent simulation. Test your code by verifying that the mean root time is equal to $2(1 - 1/L)$.

2. Implement a function function s=sequence(s,i,mu) simulating the substitution process on the tree s starting from node i and generating sequences s(j).seq at all nodes j below i on the tree. Use the simulate() code you wrote for the previous practical. Note that the input $\mu$, the substitution rate, will have units [subs/$N_e$ gens] since the unit of time is $N_e$ generations. Note: this problem has a neat solution using a recursive implementation.
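Steps (1) to (5) of the L-leaf simulation can be sketched as follows. This is a hedged Python illustration rather than the MatLab structure array recommended in the practical: nodes are stored in a dictionary keyed by label, with field names copied from the suggested struct, and the `rng` argument is an illustrative addition.

```python
import random

def coal(L, rng=random):
    """Simulate an L-leaf coalescent tree in units of N_e generations.

    Returns a dict mapping node label (1..2L-1) to a dict with keys
    'name', 'seq', 'age' and 'child'. Leaves are 1..L, the root is 2L-1.
    """
    # Step 1: leaves at age 0, all of them active lineages.
    nodes = {i: {"name": i, "seq": [], "age": 0.0, "child": []}
             for i in range(1, L + 1)}
    active = list(range(1, L + 1))  # the set V
    t = 0.0
    k = L
    for j in range(1, L):           # L - 1 coalescence events in total
        i = L + j                   # label of the new internal node
        # Steps 2 and 3: waiting time with total rate k(k-1)/2.
        t += rng.expovariate(k * (k - 1) / 2.0)
        # Step 4: every pair of active lineages is equally likely.
        a, b = rng.sample(active, 2)
        active.remove(a)
        active.remove(b)
        active.append(i)
        nodes[i] = {"name": i, "seq": [], "age": t, "child": [a, b]}
        # Step 5: one fewer lineage for the next interval.
        k -= 1
    return nodes

if __name__ == "__main__":
    random.seed(0)
    L = 5
    reps = 20000
    mean_root = sum(coal(L)[2 * L - 1]["age"] for _ in range(reps)) / reps
    print("mean root age:", mean_root, "target:", 2 * (1 - 1 / L))
```

The check in the main block mirrors the test suggested in the practical: the average root age over many simulated trees should be close to $2(1 - 1/L)$.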

ABC for the mutation rate µ

We will estimate $\mu$ from sequence data $D$. Let $S$ be the unknown true genealogy, let $M$ be the unknown true substitution rate, and let $D$ be an $L \times N$ matrix of sequence data (ie, one $1 \times N$ vector giving the base character at each site for each sampled individual). How did nature generate the data (assume nature employs a coalescent and the Jukes-Cantor substitution model)?

First $S \sim \text{Coalescent}$, so the genealogy was created. The sequence evolution is neutral by assumption, so the sequences then evolve down this tree without affecting it. The genealogy just provides the railway tracks for the sequence evolution. So, $D \sim \text{Jukes-Cantor}$ given $S$. Let $x^{(n)}$ be the character sequence at node $n = 1, 2, \ldots, 2L - 1$. We simulate $x^{(2L-1)}$ (the sequence at the root) as $x^{(2L-1)}_i \sim \pi$, ie, as a draw from the equilibrium base frequency distribution. Next, for $j, k \in \{1, 2, \ldots, 2L - 1\}$ with $k$ the parent node of $j$ on the tree $S$,

$$x^{(j)}_i \sim P_{x^{(k)}_i,:}(t_k - t_j)$$

for $i = 1, 2, \ldots, N$. Here $P(t) = \exp(\mu Q t)$ with $\mu$ in units of substitutions per $N_e$ generations. $Q$ is the JC rate matrix and $t_k - t_j$ is the time elapsed between the times of nodes $k$ and $j$. Our ABC will recapitulate this observation process.

Now,

$$\pi(\mu, s \mid D) \propto p(\mu, s)\, p(D \mid \mu, s)$$

with $p(\mu, s) = p_1(\mu)\, p_2(s)$. Here $p_1(\mu)$ is a subjective prior for $\mu$ (see below), and $p_2(s)$ is the coalescent probability density over rooted $L$-leaf trees. $p(D \mid \mu, s)$ is the likelihood of the data. We want to avoid computing $p(D \mid \mu, s)$ so we use ABC for the simulation of $(\mu, s) \sim \pi(\mu, s \mid D)$.

ABC algorithm

Aim: draw $X \sim \pi(x \mid D) \propto p(D \mid x)\, p(x)$ (approximately).

Fix $\epsilon > 0$ and some measure of distance $d(D, D')$ between data sets.

1. Draw $\mu \sim p_1(\mu)$, $s \sim p_2(s)$ and $D' \sim p(D' \mid \mu, s)$.
2. While $d(D', D) > \epsilon$ do
      $\mu \sim p_1(\mu)$
      $s \sim p_2(s)$
      $D' \sim p(D' \mid \mu, s)$
   end
   Return $X = (\mu, s)$.

We can run this whole thing repeatedly to collect samples $(\mu_1, s_1), (\mu_2, s_2), \ldots, (\mu_M, s_M)$ and then plot a histogram of $\mu_1, \mu_2, \ldots, \mu_M$ to get the marginal posterior for $\mu \mid D$.

How to choose $d(D, D')$ and $\epsilon$?
Let $U(D)$ = the number of unique site patterns in $D$, and let

$$d(D, D') = |U(D) - U(D')|.$$

The idea here is that as $\mu$ increases, the number of unique site patterns increases, so this statistic should be sensitive to $\mu$. Choose $\epsilon$ as small as possible, but large enough so that the algorithm returns samples in a reasonable amount of time. Check that results are not sensitive to the choice of $\epsilon$.

Choose $p_1(\mu)$ equal to the uniform distribution on $0 \le \mu \le 1$. Why? $E(t_{2L-1}) = 2(1 - 1/L)$ since this is a coalescent (we saw this for $L = 2$) so the mean root time in the tree prior is between 1 and 2. Clearly $\mu \ge 0$. If $\mu$ is large enough that $\mu\, t_{2L-1}$ is of order one or larger then the data will be saturated with substitutions. We would usually know if we had saturated data (sequences with very ancient MRCAs). Where saturated data occurs, we cannot estimate $\mu$ other than to show that it is large - here of order one or larger.

Our ABC in slightly more detail is then:

ABC algorithm

Aim: draw $X \sim \pi(x \mid D) \propto p(D \mid x)\, p(x)$ (approximately).

Fix $\epsilon > 0$ and some measure of distance $d(D, D')$ between data sets.

0. Compute $U(D)$ for $U$ = # unique site patterns.
1. Draw $\mu \sim U(0, 1)$, $s \sim$ coalescent and $D' \sim p(D' \mid \mu, s)$.
2. Compute $U(D')$ and $d = |U(D') - U(D)|$.
3. If $d < \epsilon$ stop and return a sample $X = (\mu, s)$, and otherwise, go to 1.

C) Practical on ABC for the substitution rate from sequence data.

1. Implement a function function s=ssfs(s,i,mu) to compute a vector giving the number of times each unique site pattern appears. Note that [v,i,j]=unique(seq,'rows') returns the unique site patterns, and j can be used to construct SSFS.

2. Consider the two data sets hivseq.mat and synth.mat. If you load hivseq ndata; you get an $L \times N$ data matrix ndata. Similarly if you load synth sdata; you get an $L \times N$ data matrix sdata. Using ABC, estimate $\mu$. For the synthetic data the true value of $\mu$ was $\mu = 0.3$. Take an exponential prior with mean 1 or 0.1 or thereabouts. For the HIV data this is justified from previous studies. Note that a tree-drawing programme, show.m, will be available and may be useful if you have the right tree data structure.
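The rejection loop itself is easiest to see on a toy problem such as ABC for the parameter of a Poisson distribution from a few independent realizations. The Python sketch below makes several assumptions that are not taken from the notes: a uniform prior on (0, prior_max), the sample mean as summary statistic, the absolute difference of means as distance, and a simple multiplication-method Poisson sampler so that only the standard library is needed. The same structure applies to the tree problem once "draw from the prior" becomes "draw $\mu$ and a coalescent tree" and "simulate data" becomes "evolve sequences down the tree".

```python
import math
import random

def poisson(lam, rng=random):
    """Draw a Poisson(lam) variate by the multiplication method (small lam)."""
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def abc_sample(data, eps, prior_max=10.0, rng=random):
    """Return one approximate posterior draw of the Poisson rate.

    Assumed setup: prior lam ~ U(0, prior_max), summary = sample mean,
    distance = |mean(simulated) - mean(observed)|.
    """
    target = sum(data) / len(data)
    while True:
        lam = rng.uniform(0.0, prior_max)          # draw from the prior
        sim = [poisson(lam, rng) for _ in data]    # simulate a data set
        if abs(sum(sim) / len(sim) - target) < eps:
            return lam                             # accept if close enough

if __name__ == "__main__":
    random.seed(2)
    observed = [3, 5, 4, 2, 6, 4, 3, 5]            # made-up illustrative data
    draws = [abc_sample(observed, eps=0.25) for _ in range(500)]
    print("approximate posterior mean:", sum(draws) / len(draws))
```

A histogram of the accepted draws plays the role of the marginal posterior; shrinking eps trades run time for accuracy, as described above.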