Dating r8s, multidistribute

Phylomethods Fall 2006 Dating r8s, multidistribute Jun G. Inoue

Software for Dating
Molecular clock
Relaxed molecular clock: r8s, multidistribute

r8s (Michael J. Sanderson, UC Davis)
Estimation of rates and times based on ML, smoothing, or other methods.

Methods of r8s - 1
LF (Langley-Fitch) local molecular clock: The user must indicate, for every branch in the tree, which rate parameter is associated with it, based on biological features (such as life history or generation time).
NPRS (Nonparametric Rate Smoothing): Relaxes the assumption of a molecular clock by using least-squares smoothing of local estimates of substitution rates. It relies on the observation that closely related lineages tend to have similar rates.

Methods of r8s - 2
PL (Penalized likelihood): Combines a parametric model having a different substitution rate on every branch with a nonparametric roughness penalty. Finds the combination $(\hat{R}, \hat{T})$ of R and T that obeys any constraints on node times and that maximizes $\log p(X \mid R, T) - \lambda \Phi(R)$.

PL method
Maximize the following: $\log p(X \mid R, T) - \lambda \Phi(R)$ (log likelihood minus penalty function)
X: data
R: rates
T: times
$\Phi(R)$: penalty function
$\lambda$: determines the contribution of the penalty function

PL method
$\log p(X \mid R, T) - \lambda \Phi(R)$ (log likelihood minus penalty function)
$\Phi(R)$: penalty function. The penalty function is the sum of two parts: the variance among rates for those branches that are directly attached to the root, and, for those branches that are not directly attached to the root, the squared differences between the rate on each branch and the rate on its parent branch.
$\lambda$: contribution of the penalty function. A cross-validation strategy selects the value of $\lambda$.
$\lambda$ small: extensive rate variation over time.
$\lambda$ large: clock-like pattern of rates.

PL method
$\log p(X \mid R, T) - \lambda \Phi(R)$ (log likelihood minus penalty function)
Note: r8s does not use the sequences themselves, so the log likelihood is not fully evaluated when finding R and T. Instead, the input branch lengths yield an integer-valued estimate of the number of nucleotide substitutions that have affected the sequence on each branch of the tree. The integer-valued estimate for a branch is treated as if it were a directly observed realization from a Poisson distribution. The mean of the Poisson distribution is determined from R and T as the product of the average rate and the time duration of the branch.
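The penalized likelihood objective described on the PL slides can be illustrated with a minimal Python sketch. This is not r8s code: the tree, branch names, counts, and rates are hypothetical, the variance among root-attached branches is taken as a population variance, and the function names are made up for illustration. It only shows how a Poisson log likelihood (mean = rate x duration) and the two-part roughness penalty combine.

```python
import math

def poisson_log_pmf(count, mean):
    """Log probability of observing `count` substitutions under Poisson(mean)."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def roughness_penalty(rates, parent):
    """Two-part penalty: variance among root-attached branches plus squared
    rate differences between every other branch and its parent branch."""
    root_rates = [rates[b] for b in rates if parent[b] is None]
    mean_root = sum(root_rates) / len(root_rates)
    # Assumption: population variance (divide by n).
    variance_root = sum((r - mean_root) ** 2 for r in root_rates) / len(root_rates)
    squared_diffs = sum(
        (rates[b] - rates[parent[b]]) ** 2 for b in rates if parent[b] is not None
    )
    return variance_root + squared_diffs

def penalized_log_likelihood(rates, durations, counts, parent, lam):
    """log p(X | R, T) - lambda * Phi(R), with a Poisson likelihood per branch."""
    loglik = sum(
        poisson_log_pmf(counts[b], rates[b] * durations[b]) for b in rates
    )
    return loglik - lam * roughness_penalty(rates, parent)

# Hypothetical toy tree: branches A and B attach to the root; C and D descend from A.
parent = {"A": None, "B": None, "C": "A", "D": "A"}
durations = {"A": 10.0, "B": 30.0, "C": 20.0, "D": 20.0}
counts = {"A": 5, "B": 14, "C": 9, "D": 22}  # integer substitution counts per branch

clock_rates = {b: 0.5 for b in parent}                      # one rate everywhere
uneven_rates = {"A": 0.5, "B": 0.47, "C": 0.45, "D": 1.1}   # rates follow the counts

for lam in (0.0, 1.0, 100.0):
    print(
        lam,
        penalized_log_likelihood(clock_rates, durations, counts, parent, lam),
        penalized_log_likelihood(uneven_rates, durations, counts, parent, lam),
    )
```

With a small λ the uneven rates are barely penalized, while a large λ pushes the objective toward the clock-like solution, matching the behavior described on the slide.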

r8s infile
- Tree with branch lengths from PAUP*
- Total numbers of substitutions
- Naming nodes
- Fixing times for some nodes
- Constraining times for some nodes
- Estimation method of rates and times

r8s terminal
./r8s -b -f Coe29b > Coe29b.out
(Coe29b: infile; Coe29b.out: outfile)

r8s outfile
- Node ages
- Constraints
- Rates
- ASCII tree
- Tree in parenthetical notation, including branch length information

multidistribute (Jeffrey L. Thorne, North Carolina State University)
Estimation of rates and times using a Bayesian method. The multidistribute package consists of two programs:
- estbranches estimates branch lengths and their variance-covariance matrix.
- multidivtime estimates rates and times using a Bayesian method.

multidistribute Flow Chart
1. baseml: estimation of model parameters. Files: testseq.gene1, Gene1.ctl, Gene1.tree -> Gene1.out, ogene1
2. paml2modelinf. Files: ogene1, testseq.gene1 -> modelinf.gene1
3. estbranches: estimation of branch lengths and their variance-covariance matrix. Files: modelinf.gene1, testseq, hmmcntrl.dat, example.tree -> out.oest.gene1, oest.gene1 (likewise oest.gene2, oest.gene3)
   Compare likelihood values: "grep lnl Gene*.out" and "grep 'FINAL LIKE' out.oes*"
4. multidivtime: estimation of rates and times based on the Bayesian method. Files: oest.gene1, oest.gene2, oest.gene3, multicntrl.dat, example.tree, inseed (numbers; "vi inseed") -> tree.example, node.example, samp.example, ratio.example

Posterior distribution
The Metropolis-Hastings algorithm is used to approximate $p(T, R, \nu \mid X, C)$:
$$p(T, R, \nu \mid X, C) = \frac{p(X, T, R, \nu \mid C)}{p(X \mid C)}$$
T: internal node times (T0, T1, ..., Tk)
R: rates (R0, R1, ..., Rk)
ν: parameter of rate autocorrelation (see next slide)
C: constraints on node times
X: data
Thorne and Kishino (2005)

Rate Autocorrelation
$$p(T, R, \nu \mid X, C) = \frac{p(X, T, R, \nu \mid C)}{p(X \mid C)}$$
ν: the rate autocorrelation parameter determines the prior distribution for the rates on different branches given the internal node times.
High ν: little rate autocorrelation.
Low ν: strong rate autocorrelation.
Thorne et al (1998)

Rate Autocorrelation
A rate R0 is sampled from the prior distribution (a gamma distribution), $p(R_0)$.
log(R2) and log(R3) follow a normal distribution with a mean equal to log(R0) and a variance equal to $\nu \times \frac{t_1 + t_2}{2}$, where t1 and t2 are the time durations labeled in the slide figure.
So the prior distribution from which R5 is sampled (through lognormal rate change) depends on, but will differ from, the prior distribution from which R0 is sampled. In this way, the joint probability density of the rates at all nodes on the tree is defined for a given value of the rate at the root node.
Thorne et al (1998)
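The rate autocorrelation prior can be sketched in a few lines of Python. This is an illustration under assumptions, not multidivtime code: the gamma shape and scale, the function names, and the use of a single `elapsed` argument for the time separating two rates ((t1 + t2)/2 in the slide's notation) are all hypothetical choices.

```python
import math
import random

def sample_root_rate(rng, shape=2.0, scale=0.5):
    """Draw the root rate R0 from a gamma prior (shape and scale are illustrative)."""
    return rng.gammavariate(shape, scale)

def sample_descendant_rate(rng, ancestor_rate, nu, elapsed):
    """Draw a descendant rate: log(rate) ~ Normal(log(ancestor_rate), nu * elapsed)."""
    sd = math.sqrt(nu * elapsed)
    return math.exp(rng.gauss(math.log(ancestor_rate), sd))

def log_rate_spread(nu, elapsed=1.0, n=2000, seed=0):
    """Standard deviation of log rate change from an ancestor rate of 1.0."""
    rng = random.Random(seed)
    changes = [
        math.log(sample_descendant_rate(rng, 1.0, nu, elapsed)) for _ in range(n)
    ]
    mean = sum(changes) / n
    return math.sqrt(sum((c - mean) ** 2 for c in changes) / n)

# Low nu: descendant rates stay close to the ancestral rate (strong autocorrelation).
print(log_rate_spread(nu=0.01))
# High nu: descendant rates wander far from the ancestral rate (little autocorrelation).
print(log_rate_spread(nu=1.0))
```

Chaining `sample_descendant_rate` from the root toward the tips gives one draw from the joint prior of rates for a fixed set of node times.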

Multilocus Data Analysis
[Figure: the same tree drawn for gene a and gene b, with node times T0-T3 and gene-specific rates Ra0-Ra5 and Rb0-Rb5]
Each gene is modeled as experiencing its own independent rate trajectory, i.e., the distributions of evolutionary rates among genes are a priori uncorrelated.
Rates at the root node for individual genes (Ra0 and Rb0) are independent realizations from a gamma distribution.
Two approaches for the rate autocorrelation parameter: νa = νb or νa ≠ νb.
Thorne et al (1998)

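Posterior Distribution
$$p(T, R, \nu \mid X, C) = \frac{p(X, T, R, \nu \mid C)}{p(X \mid C)}$$
Prior:
$$p(T, R, \nu \mid C) = p(R \mid T, \nu, C)\, p(T, \nu \mid C) = p(R \mid T, \nu, C)\, p(T \mid \nu, C)\, p(\nu \mid C)$$
Given T and R, the value of ν (and C) does not provide information about the data X. With branch lengths B = (B0, B1, ..., Bk), $p(X \mid T, R) = p(X \mid B)$. Therefore:
$$
\begin{aligned}
p(T, R, \nu \mid X, C)
&= \frac{p(X \mid T, R, \nu, C)\, p(T, R, \nu \mid C)}{p(X \mid C)} \\
&= \frac{p(X \mid T, R, \nu, C)\, p(R \mid T, \nu, C)\, p(T \mid \nu, C)\, p(\nu \mid C)}{p(X \mid C)} \\
&= \frac{p(X \mid T, R)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)}{p(X \mid C)} \\
&= \frac{p(X \mid B)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)}{p(X \mid C)}
\end{aligned}
$$
Thorne and Kishino (2005)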

Posterior Distribution
$$p(T, R, \nu \mid X, C) = \frac{p(X \mid B)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)}{p(X \mid C)}$$
Numerator: likelihood $p(X \mid B)$ times prior distribution $p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)$. Denominator: marginal probability of the data $p(X \mid C)$.
$p(X \mid B)$: Likelihood, approximated with a multivariate normal distribution centered about the ML estimates of the branch lengths B.
$p(R \mid T, \nu)$: The prior distribution of rate evolution. This probability is determined once the times T and the constant ν are determined.
$p(T \mid C)$: The density $p(T \mid C)$ is identical to the density $p(T)$ up to a proportionality constant. $p(T)$ is selected on the basis of the simplicity of the Yule process.
$p(\nu)$: The prior distribution for the rate change parameter.
Thorne and Kishino (2005)
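The multivariate normal approximation of the likelihood can be illustrated with a small Python sketch. It is restricted to two branches so that the covariance matrix can be inverted in closed form, and it assumes each branch length is the product of one branch rate and one time duration. The function names and all numbers are hypothetical; this is not the estbranches or multidivtime implementation.

```python
import math

def mvn_logpdf_2d(b, b_hat, cov):
    """Log density of a 2-D normal with mean b_hat and covariance cov, evaluated at b."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    d1, d2 = b[0] - b_hat[0], b[1] - b_hat[1]
    # Quadratic form d^T cov^{-1} d, using the closed-form inverse of a 2x2 matrix.
    quad = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    return -0.5 * (2.0 * math.log(2.0 * math.pi) + math.log(det) + quad)

def approx_log_likelihood(rates, durations, b_hat, cov):
    """Approximate log p(X | B), with B implied by rates and time durations."""
    b = [r * t for r, t in zip(rates, durations)]
    return mvn_logpdf_2d(b, b_hat, cov)

# Hypothetical ML branch lengths and their variance-covariance matrix.
b_hat = [0.10, 0.25]
cov = [[0.0004, 0.0001], [0.0001, 0.0009]]

# Rates and durations that reproduce the ML branch lengths score highest.
print(approx_log_likelihood([0.01, 0.025], [10.0, 10.0], b_hat, cov))
print(approx_log_likelihood([0.02, 0.025], [10.0, 10.0], b_hat, cov))
```

Because only the product of rate and time enters the approximation, many (R, T) combinations give the same likelihood, which is why the priors and the node-time constraints matter.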

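Metropolis-Hastings Algorithm
$$p(T, R, \nu \mid X, C) = \frac{p(X \mid B)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)}{p(X \mid C)}$$
Denominator:
$$p(X \mid C) = \iiint p(X \mid B)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)\; dT\, dR\, d\nu$$
This is not pretty.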

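Metropolis-Hastings Algorithm
$$p(T, R, \nu \mid X, C) = \frac{p(X \mid B)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)}{p(X \mid C)}$$
Acceptance probability for a proposed state (T', R', ν'):
$$r = \min\left(1,\; \frac{p(T', R', \nu' \mid X, C)}{p(T, R, \nu \mid X, C)} \times \frac{J(T, R, \nu \mid T', R', \nu')}{J(T', R', \nu' \mid T, R, \nu)}\right)$$
The second factor is the Hastings ratio. In the first factor the denominator $p(X \mid C)$ cancels:
$$\frac{p(T', R', \nu' \mid X, C)}{p(T, R, \nu \mid X, C)} = \frac{p(X \mid B')\, p(R' \mid T', \nu')\, p(T' \mid C)\, p(\nu')}{p(X \mid B)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)}$$
The Metropolis-Hastings algorithm is adopted to obtain an approximately random sample from $p(T, R, \nu \mid X, C)$. Repeating many cycles of this procedure of random proposals followed by acceptance or rejection produces a Markov chain with a stationary distribution that is the desired posterior distribution $p(T, R, \nu \mid X, C)$.
Thorne et al (1998)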
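The accept/reject cycle can be shown with a minimal Python sketch on a one-parameter toy problem. The target below is a stand-in (an unnormalized Gamma(shape=3, rate=1) density), not the divergence-time posterior, and the multiplier proposal, tuning value, and function names are assumptions chosen for illustration. The point is that only the ratio of unnormalized densities and the Hastings ratio are needed, so the normalizing constant never has to be computed.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of Gamma(shape=3, rate=1); stands in for likelihood x prior."""
    return 2.0 * math.log(x) - x

def metropolis_hastings(n_iter, x0=1.0, tuning=1.5, seed=1):
    """Sample a positive parameter using a multiplier proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_iter):
        # Proposal: x' = x * exp(tuning * (u - 0.5)), with u ~ Uniform(0, 1).
        x_new = x * math.exp(tuning * (rng.random() - 0.5))
        # log of [target(x') / target(x)] * [x' / x]; the factor x'/x is the
        # Hastings ratio of the multiplier proposal.
        log_r = log_target(x_new) - log_target(x) + math.log(x_new / x)
        # Accept with probability min(1, r); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < min(0.0, log_r):
            x = x_new
        samples.append(x)
    return samples

samples = metropolis_hastings(50000)
kept = samples[5000:]  # discard an initial burn-in period
print(sum(kept) / len(kept))  # should be close to 3, the mean of Gamma(3, 1)
```

A Hastings factor of the form x'/x is what a multiplicative proposal on a positive parameter produces.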

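Metropolis-Hastings Algorithm
$$r = \min\left(1,\; \frac{p(T', R', \nu' \mid X, C)}{p(T, R, \nu \mid X, C)} \times \frac{J(T, R, \nu \mid T', R', \nu')}{J(T', R', \nu' \mid T, R, \nu)}\right)$$
$$\frac{p(T', R', \nu' \mid X, C)}{p(T, R, \nu \mid X, C)} = \frac{p(X \mid B')\, p(R' \mid T', \nu')\, p(T' \mid C)\, p(\nu')}{p(X \mid B)\, p(R \mid T, \nu)\, p(T \mid C)\, p(\nu)}$$
Proposal step for ν:
$$r = \min\left(1,\; \frac{p(R \mid T, \nu')\, p(\nu')\, \nu'}{p(R \mid T, \nu)\, p(\nu)\, \nu}\right)$$
Proposal step for R:
$$r = \min\left(1,\; \frac{p(X \mid B')\, p(R' \mid T, \nu)\, R'_i}{p(X \mid B)\, p(R \mid T, \nu)\, R_i}\right)$$
Proposal step for T:
$$r = \min\left(1,\; \frac{p(R \mid T', \nu)\, p(T')\, (T'_y - T'_0)}{p(R \mid T, \nu)\, p(T)\, (T_y - T_0)}\right)$$
Mixing step:
$$r = \min\left(1,\; \frac{p(R' \mid T', \nu)\, p(T')}{p(R \mid T, \nu)\, p(T)}\right)$$
[The mixing step carries an additional factor in 1/M whose exponent is not legible in the transcript.]
Thorne et al (1998)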

multidistribute Flow Chart (repeated): paml2modelinf transforms baseml output files into estbranches input files.

modelinf.gene1
- F84 + G5 model
- 5 gamma categories
- Base frequencies (same values among categories)
- Relative rates of categories

multidistribute Flow Chart (repeated): estbranches estimates branch lengths and their variance-covariance matrix.

oest.gene1
- Tree topology with branch lengths
- Variance-covariance matrix of branch lengths

multidistribute Flow Chart (repeated): multidivtime estimates rates and times based on the Bayesian method.

multicntrl.dat
- Outfiles of 3 genes from estbranches
- MCMC generation settings
- A priori expected time between tip and root
- Mean of prior distribution for rate at root node
- Mean of prior for rate autocorrelation parameter ν
- Node constraints

multidistribute outfile
- out.gene1 summarizes posterior means of divergence times and rates, and other information
- tree.gene1 contains the tree definition of the chronogram
- ratio.gene1 contains relative probability ratios for the parameter values sampled by MCMC
- node.gene1 contains detailed information about the MCMC run
- samp.gene1 contains samples from the Markov chain; they can be useful for exploring convergence
Rutschmann (2005)

node.gene1
- Node number
- Posterior distribution of divergence time
- Standard deviation
- 95% credibility interval
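The kind of per-node summary listed for node.gene1 can be computed from MCMC samples with a short Python sketch. This does not parse any multidistribute file: the sample values are made up, the function name is hypothetical, and the interval is taken as an equal-tailed 95% interval, which is an assumption about how the credibility interval is defined.

```python
import statistics

def summarize_node_age(samples):
    """Posterior mean, standard deviation, and equal-tailed 95% interval of one node age."""
    # quantiles(..., n=40) returns 39 cut points; the first is the 2.5% point
    # and the last is the 97.5% point.
    cuts = statistics.quantiles(samples, n=40)
    return {
        "mean": statistics.mean(samples),
        "sd": statistics.stdev(samples),
        "ci95": (cuts[0], cuts[-1]),
    }

# Hypothetical sampled ages for one node (arbitrary time units).
node_age_samples = [float(age) for age in range(1, 101)]
summary = summarize_node_age(node_age_samples)
print(summary["mean"], summary["sd"], summary["ci95"])
```

Comparing such summaries between independent runs started from different seeds is one simple way to explore convergence.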

tree.gene1
The time-scaled tree is shown using TreeView.

Programs for molecular clock dating that relax the clock assumption
[Table not recovered in the transcript.] Modified from Renner (2005)

References
Aris-Brosou S, Yang Z. 2002. Effects of models of rate evolution on estimation of divergence dates with special reference to the metazoan 18S ribosomal RNA phylogeny. Syst Biol 51: 703-714.
Kishino H, Thorne JL. 2002. (in Japanese). 50: 17-31.
Kishino H, Thorne JL, Bruno WJ. 2001. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol Biol Evol 18: 352-361.
Renner SS. 2005. Relaxed molecular clocks for dating historical plant dispersal events. Trends in Plant Science 10: 550-558.
Rutschmann F. 2005. Bayesian molecular dating using PAML/multidivtime: A step-by-step manual, version 1.5 (July 2005). ftp://statgen.ncsu.edu/pub/thorne/bayesiandating1.5.pdf
Thorne JL, Kishino H, Painter IS. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol Biol Evol 15: 1647-1657.
Thorne JL, Kishino H. 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst Biol 51: 689-702.
Thorne JL, Kishino H. 2005. Estimation of divergence times from molecular sequence data. In: Statistics for Biology and Health. Nielsen R. (Ed). Springer, NY.
Thorne JL. 2003. http://statgen.ncsu.edu/thorne/multidivtime.html
Yang Z. 2003. http://workshop.molecularevolution.org/people/faculty/files/ziheng_2003.txt
Yang Z, Yoder AD. 2003. Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species. Syst Biol 52: 705-716.
Yoder AD, Yang ZH. 2004. Divergence dates for Malagasy lemurs estimated from multiple gene loci: geological and evolutionary context. Mol Ecol 13: 757-773.