Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree

Similar documents
Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Estimating Evolutionary Trees. Phylogenetic Methods

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Bayesian Phylogenetics

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem?

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Dr. Amira A. AL-Hosary

Inferring Speciation Times under an Episodic Molecular Clock

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

From Individual-based Population Models to Lineage-based Models of Phylogenies

A Bayesian Approach to Phylogenetics

Molecular Evolution & Phylogenetics

Taming the Beast Workshop

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Probability Distribution of Molecular Evolutionary Trees: A New Method of Phylogenetic Inference

Bayesian Phylogenetics:

Approximate Bayesian Computation: a simulation based approach to inference

Bayesian Inference and MCMC

Discrete & continuous characters: The threshold model

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

Bayesian Models for Phylogenetic Trees

Concepts and Methods in Molecular Divergence Time Estimation

Using the Birth-Death Process to Infer Changes in the Pattern of Lineage Gain and Loss. Nathaniel Malachi Hallinan

Constructing Evolutionary/Phylogenetic Trees

BIG4: Biosystematics, informatics and genomics of the big 4 insect groups- training tomorrow s researchers and entrepreneurs

Probabilistic Machine Learning

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

EVOLUTION INTERNATIONAL JOURNAL OF ORGANIC EVOLUTION

Intraspecific gene genealogies: trees grafting into networks

Bayesian Inference. Anders Gorm Pedersen. Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU)

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Chapter 16: Reconstructing and Using Phylogenies

Bagging During Markov Chain Monte Carlo for Smoother Predictions

How can molecular phylogenies illuminate morphological evolution?

Reading for Lecture 13 Release v10

EVOLUTIONARY DISTANCES

MCMC: Markov Chain Monte Carlo

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

arxiv: v2 [q-bio.pe] 18 Oct 2013

Inference of Viral Evolutionary Rates from Molecular Sequences

Phylogenetic Tree Reconstruction

New Inferences from Tree Shape: Numbers of Missing Taxa and Population Growth Rates

Estimating Trait-Dependent Speciation and Extinction Rates from Incompletely Resolved Phylogenies

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Statistical nonmolecular phylogenetics: can molecular phylogenies illuminate morphological evolution?

Bayesian Regression Linear and Logistic Regression

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics

A Probabilistic Framework for solving Inverse Problems. Lambros S. Katafygiotis, Ph.D.

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Testing macro-evolutionary models using incomplete molecular phylogenies

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

Markov Switching Regular Vine Copulas

Chapter 22 DETECTING DIVERSIFICATION RATE VARIATION IN SUPERTREES. Brian R. Moore, Kai M. A. Chan, and Michael J. Donoghue

An Introduction to Bayesian Inference of Phylogeny

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Answers and expectations

CPSC 540: Machine Learning

BIOLOGY 432 Midterm I - 30 April PART I. Multiple choice questions (3 points each, 42 points total). Single best answer.

Inferring Molecular Phylogeny

Modeling Trait Evolutionary Processes with More Than One Gene

Theory of Evolution Charles Darwin

Bayesian Methods in Multilevel Regression

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Bayesian Classification and Regression Trees

Bayesian Methods for Machine Learning

Advanced Machine Learning

Estimating a Binary Character s Effect on Speciation and Extinction

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

8/23/2014. Phylogeny and the Tree of Life

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Artificial Intelligence

Lecture 12: Bayesian phylogenetics and Markov chain Monte Carlo Will Freyman

Markov-Chain Monte Carlo

Bayesian inference & Markov chain Monte Carlo. Note 1: Many slides for this lecture were kindly provided by Paul Lewis and Mark Holder

Algorithms in Bioinformatics

Bayesian Networks BY: MOHAMAD ALSABBAGH

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Constructing Evolutionary/Phylogenetic Trees

Markov Chain Monte Carlo, Numerical Integration

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

A (short) introduction to phylogenetics

STA 4273H: Statistical Machine Learning

Contents. Part I: Fundamentals of Bayesian Inference 1

QTL model selection: key players

Anatomy of a species tree

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Introduction to Machine Learning CMU-10701

Monte Carlo Methods. Geoff Gordon February 9, 2006

One-minute responses. Nice class{no complaints. Your explanations of ML were very clear. The phylogenetics portion made more sense to me today.

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016

Monte Carlo Dynamically Weighted Importance Sampling for Spatial Models with Intractable Normalizing Constants

Phenotypic Evolution. and phylogenetic comparative methods. G562 Geometric Morphometrics. Department of Geological Sciences Indiana University

C.DARWIN ( )

Bayesian Nonparametric Regression for Diabetes Deaths

Forward Problems and their Inverse Solutions

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Postal address: Ruthven Museums Building, Museum of Zoology, University of Michigan, Ann Arbor,

Transcription:

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree Nicolas Salamin Department of Ecology and Evolution University of Lausanne Switzerland MEP05, Paris; Jun, 20 2005 p.1/26

Time and rates of divergence the temporal dimension contained in a calibrated tree also gives information about the tempo of lineages evolution through the time intervals between splits on the tree a formal mathematical description has to be developed to obtain estimates of speciation and extinction rates from a phylogenetic tree stochastic models of lineage growth, such as pure birth or birth/death process, have been extensively studied and are ideal for such a task MEP05, Paris; Jun, 20 2005 p.2/26

Investigating speciation rates Some of the approaches developed so far: birth and death process e.g. Nee et al. (1994), Bokma (2003) survival analysis Paradis (1997, 1998) non-parametric approach: distribution of the number of species within clades of a tree suggest lineages with higher diversification rates detection of the morphological or ecological characteristic associated possible Slowinski and Guyer (1993), Paradis (2005) MEP05, Paris; Jun, 20 2005 p.3/26

Lineage-through-time plot I the most widely used approach is the lineage-through-time plots Nee et al. (1994); Nee (2001) Barraclough and Nee (2001) MEP05, Paris; Jun, 20 2005 p.4/26

Lineage-through-time plot II departure from a constant speciation rate Barraclough and Nee (2001) the slope near the present asymptotically approaches the speciation rate, while the slope in the main body of the graph represents only the net speciation rate MEP05, Paris; Jun, 20 2005 p.5/26

Aims to develop a full maximum likelihood approach for the estimation of λ, the speciation rate, and µ, the extinction rate to develop a framework to take into account the uncertainty about branch length and topology while estimating the speciation and extinction rates to assess the method developed through computer simulations MEP05, Paris; Jun, 20 2005 p.6/26

Birth and death process probabilities lineages are independent, and rates are constant through time and lineages probability that one lineage does not speciate nor get extinct during time t (Kendall, 1948; Thompson, 1975) p 1 (t) = (λ µ)2 e (λ µ)t (λe (λ µ)t µ) 2 which, when λ = µ, simplify to p 1 (t) = 1 (1 + λt) 2 MEP05, Paris; Jun, 20 2005 p.7/26

Probability density of a simple tree MEP05, Paris; Jun, 20 2005 p.8/26

Probability density of a simple tree the probability density of this particular tree: P rob(5, t 1, t 2, t 3, t 4 λ, µ, T ) = p 1 (T )λdt 1 p 1 (t 1 )λdt 2 p 1 (t 2 )λdt 3 p 1 (t 3 )λdt 4 p 1 (t 4 ) 4 = p 1 (T )λ 4 p 1 (t i )dt i i=1 MEP05, Paris; Jun, 20 2005 p.8/26

Probability density for n lineages for any given labelled histories with n species, the probability density becomes: P rob(n, t 1,..., t n λ, µ, T ) = p 1 (T )λ n 1 (n 1)! n 1 i=1 p 1 (t i )dt i MEP05, Paris; Jun, 20 2005 p.9/26

Probability density for n lineages for any given labelled histories with n species, the probability density becomes: P rob(n, t 1,..., t n λ, µ, T ) = p 1 (T )λ n 1 (n 1)! n 1 i=1 p 1 (t i )dt i by conditioning on having n terminal species, the estimation of p 1 (T ) can be eluded: P rob(t 1,..., t n n, λ, µ, T ) = n 1 i=1 p 1 (t i )dt i T 0 p 1(u)du MEP05, Paris; Jun, 20 2005 p.9/26

Dealing with phylogenetic uncertainty to take into account uncertainty in the branch lengths and the topology, we should sum over all possible trees similar to the Bayesian approach (e.g. MrBayes), without specifying a prior distribution on parameters similar approach can be used in population genetics to estimate population size, migration rate, recombination rate, selection http://evolution.gs.washington.edu/lamarck - Joe Felsenstein s lab MEP05, Paris; Jun, 20 2005 p.10/26

Sampling all possible trees we could draw a random sample, but most trees are extremely implausible given the data. Such a random sample would have to be extremely large MEP05, Paris; Jun, 20 2005 p.11/26

Sampling all possible trees we could draw a random sample, but most trees are extremely implausible given the data. Such a random sample would have to be extremely large we could bootstrap the data, but slow and biased MEP05, Paris; Jun, 20 2005 p.11/26

Sampling all possible trees we could draw a random sample, but most trees are extremely implausible given the data. Such a random sample would have to be extremely large we could bootstrap the data, but slow and biased we can use an importance sampling approach: we concentrate sampling on those trees which are plausible and will contribute substantially to the likelihood of the data and parameters MEP05, Paris; Jun, 20 2005 p.11/26

Sampling all possible trees we could draw a random sample, but most trees are extremely implausible given the data. Such a random sample would have to be extremely large we could bootstrap the data, but slow and biased we can use an importance sampling approach: we concentrate sampling on those trees which are plausible and will contribute substantially to the likelihood of the data and parameters very versatile approach to estimate unknown and complex distribution MEP05, Paris; Jun, 20 2005 p.11/26

Monte-Carlo integration on trees Maximum likelihood estimation of speciation and extinction rate: we want to get f(t ) = P rob(t λ, µ) = L(λ, µ) MEP05, Paris; Jun, 20 2005 p.12/26

Monte-Carlo integration on trees Maximum likelihood estimation of speciation and extinction rate: we want to get f(t ) = P rob(t λ, µ) = L(λ, µ) but we can t MEP05, Paris; Jun, 20 2005 p.12/26

Monte-Carlo integration on trees Maximum likelihood estimation of speciation and extinction rate: we want to get f(t ) = P rob(t λ, µ) = L(λ, µ) but we can t, so we sample from the posterior probability for some driving values λ 0 and µ 0 of λ and µ g(t ) = P rob(t D, λ 0, µ 0 ) = P rob(d T )P rob(t λ 0, µ 0 ) P rob(d λ 0, µ 0 ) MEP05, Paris; Jun, 20 2005 p.12/26

then, L(λ, µ) = T f(t ) g(t )dt g(t ) = T P rob(d T ) P rob(d T ) P rob(d λ 0, µ 0 ) P rob(d λ 0, µ 0 ) P rob(t λ, µ) P rob(t λ 0, µ 0 ) L(λ 0, µ 0 )dt MEP05, Paris; Jun, 20 2005 p.13/26

then, L(λ, µ) = T f(t ) g(t )dt g(t ) = T P rob(d T ) P rob(d T ) P rob(d λ 0, µ 0 ) P rob(d λ 0, µ 0 ) P rob(t λ, µ) P rob(t λ 0, µ 0 ) L(λ 0, µ 0 )dt L(λ, µ) L(λ 0, µ 0 ) = T P rob(t λ, µ) P rob(t λ 0, µ 0 ) dt = E T [ ] P rob(t λ, µ) P rob(t λ 0, µ 0 ) MEP05, Paris; Jun, 20 2005 p.13/26

then, L(λ, µ) = T f(t ) g(t )dt g(t ) = T P rob(d T ) P rob(d T ) P rob(d λ 0, µ 0 ) P rob(d λ 0, µ 0 ) P rob(t λ, µ) P rob(t λ 0, µ 0 ) L(λ 0, µ 0 )dt L(λ, µ) L(λ 0, µ 0 ) = T P rob(t λ, µ) P rob(t λ 0, µ 0 ) dt = E T [ ] P rob(t λ, µ) P rob(t λ 0, µ 0 ) 1 n n i=1 P rob(t i λ, µ) P rob(t i λ 0, µ 0 ) which is an average of a likelihood ratio over the n trees T i sampled. MEP05, Paris; Jun, 20 2005 p.13/26

MCMC: Metropolis-Hastings method to draw a sample (or Markov chain) T 1,..., T n from a distribution proportional to a function g(t ), start from a tree T 0 and then: MEP05, Paris; Jun, 20 2005 p.14/26

MCMC: Metropolis-Hastings method to draw a sample (or Markov chain) T 1,..., T n from a distribution proportional to a function g(t ), start from a tree T 0 and then: 1. draw a change in T i from some "proposal distribution": T i T i+1 MEP05, Paris; Jun, 20 2005 p.14/26

MCMC: Metropolis-Hastings method to draw a sample (or Markov chain) T 1,..., T n from a distribution proportional to a function g(t ), start from a tree T 0 and then: 1. draw a change in T i from some "proposal distribution": T i T i+1 2. accept the change if a uniformly-distributed random number r satisfies r < P rob(t i T i+1 ) P rob(t i+1 T i ) L(D T i+1 )P rob(t i+1 λ 0, µ 0 ) L(D T i )P rob(t i λ 0, µ 0 ) MEP05, Paris; Jun, 20 2005 p.14/26

MCMC: Metropolis-Hastings method to draw a sample (or Markov chain) T 1,..., T n from a distribution proportional to a function g(t ), start from a tree T 0 and then: 1. draw a change in T i from some "proposal distribution": T i T i+1 2. accept the change if a uniformly-distributed random number r satisfies r < P rob(t i T i+1 ) P rob(t i+1 T i ) L(D T i+1 )P rob(t i+1 λ 0, µ 0 ) L(D T i )P rob(t i λ 0, µ 0 ) 3. go back to 1. many times (e.g. 1 mio times or generations ) MEP05, Paris; Jun, 20 2005 p.14/26

MCMC: Metropolis-Hastings method to draw a sample (or Markov chain) T 1,..., T n from a distribution proportional to a function g(t ), start from a tree T 0 and then: 1. draw a change in T i from some "proposal distribution": T i T i+1 2. accept the change if a uniformly-distributed random number r satisfies r < P rob(t i T i+1 ) P rob(t i+1 T i ) L(D T i+1 )P rob(t i+1 λ 0, µ 0 ) L(D T i )P rob(t i λ 0, µ 0 ) 3. go back to 1. many times (e.g. 1 mio times or generations ) If we do this long enough and other nice statistical properties hold, the set of T i will be a sample from the right distribution MEP05, Paris; Jun, 20 2005 p.14/26

The "proposal distribution" the idea is to wander through the tree space and calculate for each change the probability of going from one tree to another: pick a node u and its ancestor v MEP05, Paris; Jun, 20 2005 p.15/26

The "proposal distribution" the idea is to wander through the tree space and calculate for each change the probability of going from one tree to another: erase the branches connected to u MEP05, Paris; Jun, 20 2005 p.15/26

The "proposal distribution" the idea is to wander through the tree space and calculate for each change the probability of going from one tree to another: draw new heights u and v MEP05, Paris; Jun, 20 2005 p.15/26

The "proposal distribution" the idea is to wander through the tree space and calculate for each change the probability of going from one tree to another: reattach the nodes MEP05, Paris; Jun, 20 2005 p.15/26

Getting the right driving values ideally driving values used to sample the trees T i should be the (unknown) MLE of λ and µ if not, a bias is introduced in the analysis MEP05, Paris; Jun, 20 2005 p.16/26

Getting the right driving values ideally driving values used to sample the trees T i should be the (unknown) MLE of λ and µ if not, a bias is introduced in the analysis the solution taken so far is to update the driving values through multiple successive chains: start with some values of λ 0 and µ 0 MEP05, Paris; Jun, 20 2005 p.16/26

Getting the right driving values ideally driving values used to sample the trees T i should be the (unknown) MLE of λ and µ if not, a bias is introduced in the analysis the solution taken so far is to update the driving values through multiple successive chains: start with some values of λ 0 and µ 0 run 10 short chains (e.g. 100,000 generations), updating the values of λ and µ after each chain MEP05, Paris; Jun, 20 2005 p.16/26

Getting the right driving values ideally driving values used to sample the trees T i should be the (unknown) MLE of λ and µ if not, a bias is introduced in the analysis the solution taken so far is to update the driving values through multiple successive chains: start with some values of λ 0 and µ 0 run 10 short chains (e.g. 100,000 generations), updating the values of λ and µ after each chain run 1 long chain (e.g. 10,000,000 generations), to update λ and µ one last time MEP05, Paris; Jun, 20 2005 p.16/26

Getting the right driving values ideally driving values used to sample the trees T i should be the (unknown) MLE of λ and µ if not, a bias is introduced in the analysis the solution taken so far is to update the driving values through multiple successive chains: start with some values of λ 0 and µ 0 run 10 short chains (e.g. 100,000 generations), updating the values of λ and µ after each chain run 1 long chain (e.g. 10,000,000 generations), to update λ and µ one last time run 1 long chain (e.g. 10,000,000 generations), and use the trees sampled as the distribution of interest MEP05, Paris; Jun, 20 2005 p.16/26

Getting the right driving values ideally driving values used to sample the trees T i should be the (unknown) MLE of λ and µ if not, a bias is introduced in the analysis the solution taken so far is to update the driving values through multiple successive chains: start with some values of λ 0 and µ 0 run 10 short chains (e.g. 100,000 generations), updating the values of λ and µ after each chain run 1 long chain (e.g. 10,000,000 generations), to update λ and µ one last time run 1 long chain (e.g. 10,000,000 generations), and use the trees sampled as the distribution of interest maximize the likelihood surface to get the estimated values of λ and µ and their confidence interval MEP05, Paris; Jun, 20 2005 p.16/26

Speciate http://www2.unil.ch/phylo MEP05, Paris; Jun, 20 2005 p.17/26

Simulations Assessment of the current method through computer simulation: define the intervals of time between speciation events using a birth/death process with parameters λ and µ in expected number of substitution based on the times obtained, create corresponding topologies simulate DNA sequences, by evolving the different sites on the trees according to the GTR+Γ model of evolution for each set of DNA sequences, estimate the values of ˆλ and ˆµ and their confidence intervals MEP05, Paris; Jun, 20 2005 p.18/26

20 species, 1 replicate 1000 nucleotides, F84+Γ model used, λ = 1.200, µ = 0.800, estimated ˆλ = 4.645 and ˆµ = 0.034 MEP05, Paris; Jun, 20 2005 p.19/26

20 species, 100 replicates 1000 nucleotides simulated, F84+Γ model used True λ True µ Mean ˆλ Mean ˆµ SD ˆλ SD ˆµ 0.800 0.800 0.343 0.001 1.342 1.567 1.200 0.800 2.345 0.004 2.345 1.342 5.000 0.020 8.453 0.000 3.245 1.597 10000 nucleotides simulated, F84+Γ model used True λ True µ Mean ˆλ Mean ˆµ SD ˆλ SD ˆµ 0.800 0.800 0.566 0.236 0.846 1.084 1.200 0.800 1.543 0.334 0.453 0.984 5.000 0.020 4.576 1.324 1.087 1.234 MEP05, Paris; Jun, 20 2005 p.20/26

200 species, 1 replicate 1000 nucleotides, F84+Γ model used, λ = 5.000, µ = 0.020, estimated ˆλ = 5.156 and ˆµ = 0.067 MEP05, Paris; Jun, 20 2005 p.21/26

200 species, 100 replicates 1000 nucleotides simulated, F84+Γ model used True λ True µ Mean ˆλ Mean ˆµ SD ˆλ SD ˆµ 0.800 0.800 1.301 1.023 0.345 0.456 1.200 0.800 1.498 0.764 0.234 0.457 5.000 0.020 3.245 0.000 1.234 0.324 10000 nucleotides simulated, F84+Γ model used True λ True µ Mean ˆλ Mean ˆµ SD ˆλ SD ˆµ 0.800 0.800 0.904 0.645 0.123 0.213 1.200 0.800 1.023 0.986 0.046 0.127 5.000 0.020 4.675 0.295 0.233 0.345 MEP05, Paris; Jun, 20 2005 p.22/26

Conclusions conditioning on the number of species gives better parameter estimates remove the problem of obtaining the time of origin of the first lineage; although information is lost it becomes impossible to distinguish cases where λ > µ or µ > λ large amount of nucleotides are required to obtain accurate estimates many terminal species are also required in order to have enough information to estimate the parameters of the stochastic process MEP05, Paris; Jun, 20 2005 p.23/26

Some considerations strong assumption of a molecular clock required how to chose the driving values? Possible other solutions to investigate sampling from a mixture distribution stochastic approximation before starting the chain bridge sampling MEP05, Paris; Jun, 20 2005 p.24/26

Further work implementing more complex model, e.g. quantitative model of evolution of the rate of speciation and extinction MEP05, Paris; Jun, 20 2005 p.25/26

Further work implementing more complex model, e.g. quantitative model of evolution of the rate of speciation and extinction model selection can t rely on likelihood ratio test reversible jump Monte Carlo MEP05, Paris; Jun, 20 2005 p.25/26

Further work implementing more complex model, e.g. quantitative model of evolution of the rate of speciation and extinction model selection can t rely on likelihood ratio test reversible jump Monte Carlo Further MCMC approach: sampling the tree space and more... incomplete sampling of lineages Hypothesis testing by combining rates of speciation with morphological, ecological changes MEP05, Paris; Jun, 20 2005 p.25/26

Acknowledgements Joe Felsenstein Genome Science Department, University of Washington, Seattle, USA Elizabeth Thompson Statistics Department, University of Washington, Seattle, USA Mary Kuhner, Jon Yamato and Chul-Joo Kang Genome Science Department, University of Washington, Seattle, USA Bruce Rannala Medical Genetics Department, University of Alberta, Edmonton, Canada Brendan Murphy Statistics Department, Trinity College, Dublin, Ireland Funding: Swiss National Science Foundation MEP05, Paris; Jun, 20 2005 p.26/26