Homoplasy. Selection of models of molecular evolution. Evolutionary correction. Saturation


Selection of models of molecular evolution

David Posada, University of Vigo, Spain, 2006
Graduate class in Phylogenetics, Campus Agrário de Vairão, Vila do Conde, Portugal, 6 April 2006

Homoplasy
Homoplasy indicates identity not produced by descent from a common ancestor.

Saturation
Multiple hits lead to saturation at some point in time.

Evolutionary correction
Models of substitution describe how sequences evolve through time and help correct for the saturation effect.

Models of nucleotide substitution
[Figure: instantaneous rate matrix Q, with substitution rate parameters a, b, c, d, e, f]
Substitution probabilities over a time t are obtained from the rate matrix:
$P(t) = e^{Qt}$

Proportion of invariable sites
If some sites do not change, fast-evolving sequences can show less divergence than slower ones.
A: rate = 0.005, p-inv = 0.2
B: rate = 0.02, p-inv = 0.5

Rate variation among sites
A discrete gamma distribution with four categories is commonly used.

[Figure: models of evolution]
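The invariable-sites example (A: rate = 0.005, p-inv = 0.2; B: rate = 0.02, p-inv = 0.5) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it assumes the Jukes-Cantor (JC69) model at the variable sites, which the slide does not specify, and the time points are hypothetical. Under those assumptions the expected divergence is (1 - p_inv) * 3/4 * (1 - exp(-4rt/3)), so each sequence saturates at (1 - p_inv) * 3/4.

```python
import math

def expected_divergence(rate, p_inv, t):
    """Expected proportion of sites differing from the ancestor after time t.

    Assumes JC69 at the variable sites (an illustrative choice, not stated on
    the slide) and a fraction p_inv of sites that never change.
    """
    p_diff_variable = 0.75 * (1.0 - math.exp(-4.0 * rate * t / 3.0))
    return (1.0 - p_inv) * p_diff_variable

# Rates and p-inv values are from the slide; the time points are hypothetical.
seq_a = {"rate": 0.005, "p_inv": 0.2}
seq_b = {"rate": 0.02, "p_inv": 0.5}

for t in (10, 50, 200, 500):
    d_a = expected_divergence(t=t, **seq_a)
    d_b = expected_divergence(t=t, **seq_b)
    print(f"t={t:4d}  A={d_a:.3f}  B={d_b:.3f}")
```

At short times the faster sequence B is more divergent, but B saturates at 0.375 while A keeps rising towards 0.6, so at long times the faster sequence shows less divergence.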

Models are abstractions
Models assume independence among sites, stationarity, equilibrium frequencies...
"All models are wrong, but indeed, ... some are useful." - Box, 1976

Models are important
Models have a clear influence on:
Parameter estimation (ti/tv, α, p_i, ...)
Topology estimation
Phylogenetic confidence
Hypothesis testing

Effect of model in topology estimation: HIV pol
[Figure: two neighbor-joining trees. A: NJ with K80. B: NJ with GTR+Γ.]
There are differences in the position of subtype A.

Statistical tests of models
Model adequacy is an absolute measure of the fit of a model to a particular data set.
Model choice is a relative procedure that selects the model giving the best fit to a particular data set.
We measure fit using the likelihood.

Likelihood
In phylogenetics [18, 19] we define the likelihood (L) as proportional to the probability of the data (D) given a model of evolution (M), a vector of K model parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, a tree topology ($\tau$), and a vector of S branch lengths $\nu = (\nu_1, \nu_2, \ldots, \nu_S)$:
$L = P(D \mid M, \theta, \tau, \nu)$

Statistical methods for model selection
Hierarchical LRTs
AIC
Bayes factors
BIC

Parsimony principle
All model selection methods are somehow based on the principle of parsimony.

Likelihood ratio tests (LRT)
$LRT = 2(\ell_1 - \ell_0)$
$\ell_1$ is the log of the maximized likelihood under the alternative hypothesis (complex model).
$\ell_0$ is the log of the maximized likelihood under the null hypothesis (simple model).
If the models are nested, the LRT statistic is approximately distributed as a $\chi^2$ (or a mixed $\chi^2$) with q degrees of freedom, where q is the difference in the number of free parameters between the two models [22].
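The likelihood ratio test can be sketched in a few lines. This is a minimal illustration, not the Modeltest implementation: the log-likelihood values are hypothetical, the function names are made up for the example, and only the plain $\chi^2$ case is handled (not the mixed $\chi^2$ needed when a parameter sits on a boundary). The $\chi^2$ tail probability is computed with the standard library only, using the recurrence for the regularized upper incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q = math.erfc(math.sqrt(half))  # closed form for df = 1
        a = 0.5
    else:
        q = math.exp(-half)             # closed form for df = 2
        a = 1.0
    # Recurrence: Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (LRT statistic, p-value) for two nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical maximized log-likelihoods: a simple model versus the same
# model with one extra free parameter (q = 1).
stat, p_value = likelihood_ratio_test(lnl_null=-4275.3, lnl_alt=-4270.0, extra_params=1)
print(f"LRT = {stat:.2f}, p = {p_value:.4f}")
```

A small p-value rejects the simple (null) model in favour of the complex one at the chosen significance level.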

Hierarchical LRTs

Problems with hLRTs
1. Maybe there is not an optimal model.
2. The optimal model depends on the significance level.
3. Local optima.

Bayes factors
Bayes factors [23] are the Bayesian analogue of the LRT [24].
$B_{ij} = \frac{P(D \mid M_i)}{P(D \mid M_j)}$
Evidence for $M_i$ is considered very strong if $B_{ij} > 150$, strong if $12 < B_{ij} < 150$, positive if $3 < B_{ij} < 12$, barely worth mentioning if $1 < B_{ij} < 3$, and negative (supports $M_j$) if $B_{ij} < 1$ [25].
Bayes factors have been used to infer the occurrence of recombination [26], to compare different phylogenetic hypotheses [27-29], and for model selection [30].

Posterior probabilities
For multiple models:
$P(M_i \mid D) = \frac{P(D \mid M_i) P(M_i)}{\sum_{r=1}^{R} P(D \mid M_r) P(M_r)}$
We generally use MCMC to approximate model likelihoods: harmonic mean of the likelihood.

Bayesian Information Criterion (BIC)
A simple approximation to the log marginal likelihood of a model:
$BIC = -2\ell + K \log n$
where $\ell$ is the maximized log-likelihood, K is the number of free parameters, and n is the sample size.
Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to selecting the model with the maximum posterior probability.
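The Bayes factor scale, the posterior model probabilities and the BIC can be sketched as follows. All numbers are hypothetical, the function names are made up for the example, and the log marginal likelihoods are assumed to have been estimated already (for instance by MCMC). Values of $B_{ij}$ that fall exactly on a threshold are not covered by the scale above, so the code assigns them to the lower category as an arbitrary choice.

```python
import math

def bayes_factor(log_marginal_i, log_marginal_j):
    """B_ij = P(D | M_i) / P(D | M_j), computed from log marginal likelihoods."""
    return math.exp(log_marginal_i - log_marginal_j)

def evidence_label(b_ij):
    """Verbal scale for the evidence in favour of M_i, as listed above."""
    if b_ij > 150:
        return "very strong"
    if b_ij > 12:
        return "strong"
    if b_ij > 3:
        return "positive"
    if b_ij > 1:
        return "barely worth mentioning"
    return "negative (supports M_j)"

def posterior_model_probs(log_marginals, priors=None):
    """P(M_i | D) for R models; equal priors unless given."""
    if priors is None:
        priors = [1.0 / len(log_marginals)] * len(log_marginals)
    log_terms = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    shift = max(log_terms)  # subtract the maximum to avoid underflow
    terms = [math.exp(x - shift) for x in log_terms]
    total = sum(terms)
    return [x / total for x in terms]

def bic(log_lik, k, n):
    """BIC = -2 * lnL + K * log(n)."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical log marginal likelihoods for three candidate models.
log_marginals = [-4270.1, -4272.6, -4280.0]
b_12 = bayes_factor(log_marginals[0], log_marginals[1])
print(f"B_12 = {b_12:.2f} ({evidence_label(b_12)})")
print([round(p, 4) for p in posterior_model_probs(log_marginals)])
print(round(bic(log_lik=-4270.0, k=10, n=1000), 2))
```

Working on the log scale matters in practice because phylogenetic likelihoods are far too small to exponentiate directly.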

Model selection uncertainty
We can establish a 95% credible set of models by summing the posterior probabilities from largest to smallest until the sum is just ≥ 0.95 (Occam's window).

Model averaging
We can make inferences based on the entire set of candidate models. For example, the overall posterior mean of the shape of the gamma distribution (α) would be:
$E(\alpha \mid D) = \sum_{i=1}^{G} \hat{\alpha}_i P(M_i \mid D)$

Parameter importance
The relative importance of the shape of the gamma distribution across all candidate models is:
$w_+(\alpha) = \sum_{i=1}^{R} P(D \mid M_i) P(M_i) I_\alpha(M_i)$
where $I_\alpha(M_i) = 1$ if α is in model $M_i$, and 0 otherwise.

Kullback-Leibler distance
$I(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x \mid \theta)}\right) dx$
I(f, g) is the information lost when using g to approximate f.
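The credible set and the posterior mean of α can be sketched as below. The posterior probabilities and the per-model estimates of α are hypothetical, the function names are made up for the example, and every candidate model is assumed to include α so that the posterior probabilities used in the average sum to one.

```python
def credible_set(posteriors, level=0.95):
    """Add models from most to least probable until the sum reaches `level`."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for name, prob in ranked:
        chosen.append(name)
        total += prob
        if total >= level:
            break
    return chosen, total

def posterior_mean(estimates, posteriors):
    """E(alpha | D) = sum over models of alpha_hat_i * P(M_i | D)."""
    return sum(estimates[name] * posteriors[name] for name in estimates)

# Hypothetical posterior probabilities and per-model estimates of alpha.
posteriors = {"GTR+G": 0.62, "GTR+I+G": 0.30, "HKY+G": 0.05, "HKY+I+G": 0.03}
alpha_hat = {"GTR+G": 0.46, "GTR+I+G": 0.54, "HKY+G": 0.44, "HKY+I+G": 0.52}

models, mass = credible_set(posteriors)
print(models, round(mass, 2))
print(round(posterior_mean(alpha_hat, posteriors), 4))
```

Here the two best models together reach only 0.92, so a third model is needed before the set covers at least 0.95 of the posterior probability.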

Akaike's information criterion (AIC)
The AIC estimates the expected K-L information. It penalizes the likelihood by the number of parameters (K):
$AIC = -2\ell + 2K$
We prefer the model with the smallest AIC. For small samples (roughly $n/K < 40$):
$AIC_c = AIC + \frac{2K(K+1)}{n - K - 1}$

Akaike differences ($\Delta_i$)
Akaike differences are AICs rescaled by the minimum AIC, so that the best model has $\Delta_i = 0$:
$\Delta_i = AIC_i - \min AIC$
As a rule of thumb [31]:
Models with $\Delta_i \leq 1$-2 receive substantial support and are considered when making inferences.
Models with $4 \leq \Delta_i \leq 7$ have considerably less support.
Models with $\Delta_i > 10$ receive essentially no support.

Akaike weights ($w_i$)
Akaike [32] suggested that $\exp(-\Delta_i/2)$ approximates the relative likelihood of a model given the data, $L(M_i \mid D)$. The Akaike weights for R models are:
$w_i = \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R} \exp(-\Delta_r/2)}$

Model selection uncertainty with the AIC
The Akaike weights are very useful for assessing model-selection uncertainty. We can interpret each weight as the probability that a model is the best approximation to the true model, given the data. We can establish a 95% confidence set of models for the best K-L model by summing the Akaike weights from largest to smallest until the sum is just ≥ 0.95.
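The AIC, the small-sample correction, the Akaike differences and the Akaike weights follow directly from the formulas above. The sketch below uses hypothetical log-likelihoods, parameter counts and sample size, and made-up function names. Which quantity should be used as the sample size n is an open question for alignments, so the value here is only a placeholder.

```python
import math

def aic(log_lik, k):
    """AIC = -2 * lnL + 2K."""
    return -2.0 * log_lik + 2.0 * k

def aicc(log_lik, k, n):
    """Small-sample corrected AIC; requires n > K + 1."""
    return aic(log_lik, k) + (2.0 * k * (k + 1)) / (n - k - 1)

def akaike_differences(scores):
    """Delta_i = AIC_i - min AIC."""
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}

def akaike_weights(scores):
    """w_i = exp(-Delta_i / 2) / sum_r exp(-Delta_r / 2)."""
    deltas = akaike_differences(scores)
    rel_lik = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(rel_lik.values())
    return {name: value / total for name, value in rel_lik.items()}

# Hypothetical (lnL, K) pairs for three candidate models.
fits = {"HKY+G": (-4270.0, 5), "GTR+G": (-4262.0, 9), "GTR+I+G": (-4261.6, 10)}
scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}

for name, weight in akaike_weights(scores).items():
    delta = akaike_differences(scores)[name]
    print(f"{name:8s} AIC={scores[name]:.2f} delta={delta:.2f} w={weight:.4f}")
```

Only the differences between AIC values matter for the weights, so adding the same constant to every AIC leaves them unchanged.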

Model averaging
We can also make inferences based on the entire set of candidate models using the AIC:
$\hat{\alpha} = \frac{\sum_{i=1}^{R} w_i I_\alpha(M_i) \alpha_i}{w_+(\alpha)}$
where
$w_+(\alpha) = \sum_{i=1}^{R} w_i I_\alpha(M_i)$
and $I_\alpha(M_i) = 1$ if α is in model $M_i$, and 0 otherwise.
Also, $w_+(\alpha)$ estimates the relative importance of any parameter.

[Figure: estimating phylogenies with 24 models]

Akaike weights

Model      AIC       Δ        w        Cumulative w
GTR+Γ      8541.28   0.00     0.6310   0.6310
GTR+I+Γ    8542.46   1.18     0.3493   0.9803
SYM+Γ      8549.15   7.88     0.0123   0.9926
SYM+I+Γ    8550.45   9.17     0.0064   0.9991
HKY+Γ      8555.18   13.91    0.0006   0.9997
HKY+I+Γ    8556.40   15.13    0.0003   1
K80+Γ      8564.98   23.70    0        1
K80+I+Γ    8566.22   24.94    0        1
GTR+I      8579.28   38.00    0        1
SYM+I      8590.49   49.22    0        1
HKY+I      8592.53   51.25    0        1
F81+Γ      8593.77   52.49    0        1
F81+I+Γ    8595.00   53.72    0        1
K80+I      8603.22   61.94    0        1
JC+Γ       8605.24   63.97    0        1
JC+I+Γ     8606.51   65.23    0        1
F81+I      8629.59   88.31    0        1
JC+I       8642.08   100.81   0        1
JC         8891.18   349.91   0        1
F81        8878.68   337.41   0        1
K80        8854.93   313.66   0        1
HKY        8845.17   303.90   0        1
SYM        8843.45   302.17   0        1
GTR        8831.24   289.96   0        1

[Figure: consensus tree for 24 models using Akaike weights]
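The Akaike weights in the table can be recomputed from the AIC column, and the same weights give the 95% confidence set and the relative importance $w_+$ of the rate-variation parameters. This is a sketch rather than the Modeltest code: the function names are made up, model names are written with G for Γ, and the results agree with the slide only up to rounding of the printed AIC values.

```python
import math

# AIC values as printed in the Akaike weights table (G stands for Gamma).
AIC_TABLE = {
    "GTR+G": 8541.28, "GTR+I+G": 8542.46, "SYM+G": 8549.15, "SYM+I+G": 8550.45,
    "HKY+G": 8555.18, "HKY+I+G": 8556.40, "K80+G": 8564.98, "K80+I+G": 8566.22,
    "GTR+I": 8579.28, "SYM+I": 8590.49, "HKY+I": 8592.53, "F81+G": 8593.77,
    "F81+I+G": 8595.00, "K80+I": 8603.22, "JC+G": 8605.24, "JC+I+G": 8606.51,
    "F81+I": 8629.59, "JC+I": 8642.08, "JC": 8891.18, "F81": 8878.68,
    "K80": 8854.93, "HKY": 8845.17, "SYM": 8843.45, "GTR": 8831.24,
}

def akaike_weights(scores):
    """w_i = exp(-Delta_i / 2) / sum_r exp(-Delta_r / 2)."""
    best = min(scores.values())
    rel_lik = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(rel_lik.values())
    return {m: v / total for m, v in rel_lik.items()}

def confidence_set(weights, level=0.95):
    """Models added from largest to smallest weight until the sum >= level."""
    chosen, total = [], 0.0
    for model, w in sorted(weights.items(), key=lambda item: item[1], reverse=True):
        chosen.append(model)
        total += w
        if total >= level:
            break
    return chosen

def importance(weights, component):
    """w_+ : summed weight of every model whose name includes `component`."""
    return sum(w for m, w in weights.items() if component in m.split("+")[1:])

weights = akaike_weights(AIC_TABLE)
print({m: round(w, 4) for m, w in list(weights.items())[:4]})
print(confidence_set(weights))
print(round(importance(weights, "G"), 4), round(importance(weights, "I"), 4))
```

With these values the 95% confidence set contains only the two best models, and almost all of the weight falls on models that include gamma-distributed rate variation.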

Parameter importance and model-averaged estimates (for 56 models)

Parameter      Importance   Model-averaged estimate
fA             0.9787       0.2926
fC             0.9787       0.2283
fG             0.9787       0.2552
fT             0.9787       0.2238
TiTv           0.0003       0.9000
rAC            0.5500       1.4872
rAG            0.5509       2.4336
rAT            0.5500       2.4315
rCG            0.5500       1.8640
rCT            0.5509       3.1793
pinv(I)        0.0000       0.4797
alpha(G)       0.6453       0.4631
pinv(I+IG)     0.3547       0.1649
alpha(G+IG)    1.0000       0.5412

AIC and Bayes
AIC approximates reality.
Bayesian approaches try to identify the most probable model.
If, as is usually the case, none of the models is correct, what sense can we make of prior or posterior odds?
Bayesian priors.
What exactly is the sample size n?

Model selection strategies
Good properties for model selection methods:

Property                                                                         hLRT   Bayesian   AIC
Applies easily to nonnested models                                               No     Yes        Yes
Allows for the simultaneous comparison of multiple models                        No     Yes        Yes
Does not depend on a subjective significance level                               No     Yes        Yes
Incorporates topology uncertainty                                                No     Yes*       No
Easy to compute                                                                  Yes    No*        Yes
Assesses model selection uncertainty                                             No     Yes        Yes
Allows model averaging                                                           No     Yes        Yes
Provides the possibility of specifying prior information for models              No     Yes*       Yes
Provides the possibility of specifying prior information for model parameters    No     Yes*       No
Designed to approximate, rather than to identify, truth                          No     No         Yes
*Not the BIC

Modeltest
Implements hLRT and AIC for 56 models.
It will implement the BIC and tools for model selection uncertainty and model averaging.
Available at http://darwin.uvigo.es (or search Google).
Runs on any OS.
So far, it needs PAUP*.
Web server.

Take home
1. Models are abstractions to learn about processes.
2. We can statistically select the best-fit model for our data. I recommend AIC or Bayes.
3. It is easy to implement using PAUP* and Modeltest.
"Model selection is a useful tool for research, but it is not a substitute for careful thinking and common sense reasoning" - Browne, 2000.

References
1. Felsenstein, J. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology 27, 401-410 (1978).
2. Huelsenbeck, J.P. & Hillis, D.M. Success of phylogenetic methods in the four-taxon case. Systematic Biology 42, 247-264 (1993).
3. Penny, D., Lockhart, P.J., Steel, M.A. & Hendy, M.D. The role of models in reconstructing evolutionary trees. in Models in Phylogenetic Reconstruction, Vol. 52 (eds. Scotland, R.W., Siebert, D.J. & Williams, D.M.) 211-230 (Clarendon Press, Oxford, 1994).
4. Bruno, W.J. & Halpern, A.L. Topological bias and inconsistency of maximum likelihood using wrong models. Molecular Biology and Evolution 16, 564-566 (1999).
5. Sullivan, J. & Swofford, D.L. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. Journal of Mammalian Evolution 4, 77-86 (1997).
6. Kelsey, C.R., Crandall, K.A. & Voevodin, A.F. Different models, different trees: The geographic origin of PTLV-I. Molecular Phylogenetics and Evolution 13, 336-347 (1999).
7. Buckley, T.R. & Cunningham, C.W. The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap support. Molecular Biology and Evolution 19, 394-405 (2002).
8. Buckley, T.R., Simon, C. & Chambers, G.K. Exploring among-site rate variation models in a maximum likelihood framework using empirical data: The effects of model assumptions on estimates of topology, edge lengths, and bootstrap support. Systematic Biology 50, 67-86 (2001).
9. Buckley, T.R. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Systematic Biology 51, 509-23 (2002).
10. Zhang, J. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Molecular Biology and Evolution 16, 868-875 (1999).
11. Sullivan, J. & Swofford, D.L. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Systematic Biology 50, 723-9 (2001).
12. Pupko, T., Huchon, D., Cao, Y., Okada, N. & Hasegawa, M. Combining Multiple Data Sets in a Likelihood Analysis: Which Models are the Best? Molecular Biology and Evolution 19, 2294-2307 (2002).
13. Tamura, K. Model selection in the estimation of the number of nucleotide substitutions. Molecular Biology and Evolution 11, 154-157 (1994).
14. Yang, Z., Goldman, N. & Friday, A. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Systematic Biology 44, 384-399 (1995).
15. Yang, Z. How often do wrong models produce better phylogenies? Molecular Biology and Evolution 14, 105-108 (1997).
16. Xia, X. Phylogenetic relationships among horseshoe crab species: Effect of substitution models in phylogenetic analysis. Systematic Biology 49, 87-100 (2000).
17. Gaut, B.S. & Lewis, P.O. Success of maximum likelihood phylogeny inference in the four-taxon case. Molecular Biology and Evolution 12, 152-162 (1995).
18. Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17, 368-376 (1981).
19. Goldman, N. Maximum likelihood inference of phylogenetic trees, with special reference to a Poisson process model of DNA substitution and to parsimony analyses. Systematic Zoology 39, 345-361 (1990).
20. Goldman, N. Simple diagnostic statistical test of models of DNA substitution. Journal of Molecular Evolution 37, 650-661 (1993).
21. Swofford, D.L., Olsen, G.J., Waddell, P.J. & Hillis, D.M. Phylogenetic Inference. in Molecular Systematics (eds. Hillis, D.M., Moritz, C. & Mable, B.K.) 407-514 (Sinauer Associates, Sunderland, MA, 1996).
22. Kendall, M. & Stuart, A. The Advanced Theory of Statistics, 240-252 (Charles Griffin, London, 1979).
23. Kass, R.E. & Raftery, A.E. Bayes factors. Journal of the American Statistical Association 90, 377-395 (1995).
24. Suchard, M.A., Kitchen, C.M., Sinsheimer, J.S. & Weiss, R.E. Hierarchical phylogenetic models for analyzing multipartite sequence data. Systematic Biology 52, 649-64 (2003).
25. Raftery, A.E. Hypothesis testing and model selection. in Markov Chain Monte Carlo in Practice (eds. Gilks, W.R., Richardson, S. & Spiegelhalter, D.J.) 163-187 (Chapman & Hall, London; New York, 1996).
26. Suchard, M.A., Weiss, R.E., Dorman, K.S. & Sinsheimer, J.S. Oh brother, where art thou? A Bayes factor test for recombination with uncertain heritage. Systematic Biology 51, 715-28 (2002).
27. Huelsenbeck, J.P. & Imennov, N.S. Geographic origin of human mitochondrial DNA: accommodating phylogenetic uncertainty and model comparison. Systematic Biology 51, 155-65 (2002).
28. Huelsenbeck, J.P., Rannala, B. & Larget, B. A Bayesian framework for the analysis of cospeciation. Evolution Int J Org Evolution 54, 352-64 (2000).
29. Suchard, M.A., Weiss, R.E. & Sinsheimer, J.S. Testing a Molecular Clock without an Outgroup: Derivations of Induced Priors on Branch-Length Restrictions in a Bayesian Framework. Systematic Biology 52, 48-54 (2003).
30. Suchard, M.A., Weiss, R.E. & Sinsheimer, J.S. Bayesian selection of continuous-time Markov chain evolutionary models. Molecular Biology and Evolution 18, 1001-13 (2001).
31. Burnham, K.P. & Anderson, D.R. Model selection and multimodel inference: a practical information-theoretic approach (Springer-Verlag, New York, NY, 2003).
32. Akaike, H. Information measures and model selection. International Statistical Institute 22, 277-291 (1983).