Homoplasy. Selection of models of molecular evolution. Evolutionary correction. Saturation

Selection of models of molecular evolution
David Posada, University of Vigo, Spain, 2006
Graduate class in Phylogenetics, Campus Agrário de Vairão, Vila do Conde, Portugal, 6 April 2006

Homoplasy
Homoplasy indicates identity not produced by descent from a common ancestor.

Saturation
Multiple hits lead to saturation at some point in time.

Evolutionary correction
Models of substitution describe how sequences evolve through time and help correct for the saturation effect.
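As a small illustration of saturation and its correction, the sketch below uses the Jukes-Cantor (JC69) model, which is the simplest substitution model and is chosen here only as an assumption for the example; the slides do not single it out at this point. Under JC69 the expected proportion of differing sites after d substitutions per site is p = 3/4 (1 - e^{-4d/3}), and the corrected distance is d = -3/4 ln(1 - 4p/3). The function names are hypothetical.

```python
import math


def jc69_expected_p(d):
    """Expected proportion of differing sites after d substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """JC69-corrected distance from an observed proportion of differing sites p."""
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the sequences are saturated and the correction is undefined.
        raise ValueError("JC69 correction requires 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The observed proportion p flattens out near 0.75 as the true distance grows
# (saturation), while the corrected distance recovers the true value.
for true_d in (0.1, 0.5, 1.0, 2.0):
    p = jc69_expected_p(true_d)
    print(f"true d = {true_d:.1f}  observed p = {p:.4f}  corrected d = {jc69_distance(p):.4f}")
```

The observed p can never exceed 0.75 under this model, which is why uncorrected distances underestimate divergence more and more as multiple hits accumulate.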

Models of nucleotide substitution
[Figure: instantaneous rate matrix Q, with substitution rates a, b, c, d, e, f.]
$$P(t) = e^{Qt}$$

Proportion of invariable sites
If some sites do not change, it is possible that fast-evolving sequences show less divergence than slower sequences.
[Figure legend: A: rate = 0.005, p-inv = 0.2; B: rate = 0.02, p-inv = 0.5.]

Rate variation among sites
A discrete gamma distribution with four categories is commonly used.

Models of evolution
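The effect of invariable sites can be sketched numerically with the two parameter sets shown on the slide (A: rate = 0.005, p-inv = 0.2; B: rate = 0.02, p-inv = 0.5). The sketch assumes, purely for illustration, that the variable sites evolve under JC69 and that the rate is in substitutions per variable site per unit time; neither assumption is stated in the slides, and the function name is hypothetical.

```python
import math


def expected_divergence(rate, p_inv, t):
    """Expected proportion of differing sites at time t.

    Assumptions (not from the slides): JC69 at the variable sites, and
    `rate` measured in substitutions per variable site per unit time.
    Invariable sites (a fraction p_inv of all sites) never differ.
    """
    d = rate * t  # substitutions per variable site
    return (1.0 - p_inv) * 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


A = {"rate": 0.005, "p_inv": 0.2}  # slow sequences, few invariable sites
B = {"rate": 0.02, "p_inv": 0.5}   # fast sequences, many invariable sites

for t in (10, 50, 200, 1000):
    pa = expected_divergence(t=t, **A)
    pb = expected_divergence(t=t, **B)
    print(f"t = {t:5d}  A: {pa:.4f}  B: {pb:.4f}")
```

Under these assumptions B diverges faster at first, but its divergence levels off at 0.75 x (1 - 0.5) = 0.375, while A eventually reaches 0.75 x (1 - 0.2) = 0.6, so the faster-evolving sequences end up looking less divergent.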

Models are abstractions
Models assume independence of sites, stationarity, equilibrium frequencies...
"All models are wrong, but indeed, ... some are useful." - Box, 1976

Models are important
Models have a clear influence on:
- Parameter estimation (ti/tv, $\alpha$, $p_i$, ...)
- Topology estimation
- Phylogenetic confidence
- Hypothesis testing

Effect of model in topology estimation: HIV pol
[Figure: (A) NJ tree under K80; (B) NJ tree under GTR+Γ.]
There are differences in the position of subtype A.

Statistical tests of models
Model adequacy is an absolute measure of the fit of a model to a particular data set.
Model choice is a relative procedure to select the model that results in the best fit to a particular data set.
We measure fit using the likelihood.

Likelihood
In phylogenetics (see [18, 19]) we define the likelihood ($L$) as (proportional to) the probability of the data ($D$) given a model of evolution ($M$), a vector of $K$ model parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, a tree topology ($\tau$), and a vector of $S$ branch lengths $\nu = (\nu_1, \nu_2, \ldots, \nu_S)$:
$$L = P(D \mid M, \theta, \tau, \nu)$$

Statistical methods for model selection
- Hierarchical LRTs
- Bayes factors
- BIC
- AIC

Parsimony principle
All model selection methods are somehow based on the principle of parsimony.

Likelihood ratio tests (LRT)
$$\mathrm{LRT} = 2(\ell_1 - \ell_0)$$
$\ell_1$ is the log of the maximized likelihood under the alternative hypothesis (complex model).
$\ell_0$ is the log of the maximized likelihood under the null hypothesis (simple model).
If the models are nested, the LRT is distributed as a $\chi^2$ (or a mixed $\chi^2$) with $q$ degrees of freedom, where $q$ is the difference in the number of free parameters between the two models [22].
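A likelihood ratio test between two nested models can be sketched as follows. The log-likelihoods and parameter counts are hypothetical numbers made up for the example, and the chi-square tail probability is computed with a small standard-library helper (an assumption of this sketch; a statistics package would normally be used). The mixed chi-square case mentioned above is not handled.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with integer df >= 1.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with z = x / 2.
    """
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # df = 2 starting point
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # df = 1 starting point
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, k_null, k_alt):
    """Return (LRT statistic, degrees of freedom, p-value) for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    q = k_alt - k_null
    return stat, q, chi2_sf(stat, q)


# Hypothetical values: a simple model with 0 free substitution parameters
# against a nested alternative with 3 additional free parameters.
stat, q, p_value = likelihood_ratio_test(lnl_null=-4420.5, lnl_alt=-4410.2,
                                         k_null=0, k_alt=3)
print(f"LRT = {stat:.2f}, df = {q}, p = {p_value:.2e}")
```

A small p-value rejects the simple (null) model in favor of the more complex one at the chosen significance level.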

Hierarchical LRTs

Problems with hLRTs
1. Maybe there is not an optimal model
2. The optimal model depends on the significance level
3. Local optima

Bayes factors
Bayes factors [23] are the Bayesian analogue of the LRT [24].
$$B_{ij} = \frac{P(D \mid M_i)}{P(D \mid M_j)}$$
Evidence for $M_i$ is considered very strong if $B_{ij} > 150$, strong if $12 < B_{ij} < 150$, positive if $3 < B_{ij} < 12$, barely worth mentioning if $1 < B_{ij} < 3$, and negative (supports $M_j$) if $B_{ij} < 1$ [25].
Used to infer the occurrence of recombination [26], to compare different phylogenetic hypotheses [27-29], and for model selection [30].

Posterior probabilities
For multiple models:
$$P(M_i \mid D) = \frac{P(D \mid M_i)\, P(M_i)}{\sum_{r=1}^{R} P(D \mid M_r)\, P(M_r)}$$
We generally use MCMC to approximate model likelihoods: harmonic mean of the likelihood.

Bayesian Information Criterion (BIC)
Simple approximation to the log marginal likelihood of a model:
$$\mathrm{BIC} = -2\ell + K \log n$$
Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to selecting the model with the maximum posterior probability.
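The three quantities on these slides (Bayes factor, posterior model probabilities, BIC) can be sketched in a few lines. The log marginal likelihoods below are hypothetical numbers invented for the example, all function names are illustrative, and the label returned for a Bayes factor of exactly 1 is an assumption because the scale above leaves that case open.

```python
import math


def bayes_factor(log_ml_i, log_ml_j):
    """B_ij = P(D | M_i) / P(D | M_j), computed from log marginal likelihoods."""
    return math.exp(log_ml_i - log_ml_j)


def evidence_for_mi(b_ij):
    """Verbal scale for the evidence in favor of M_i, as listed on the slide."""
    if b_ij > 150:
        return "very strong"
    if b_ij > 12:
        return "strong"
    if b_ij > 3:
        return "positive"
    if b_ij > 1:
        return "barely worth mentioning"
    if b_ij == 1:
        return "no preference"  # assumption: not covered by the slide's scale
    return "negative (supports M_j)"


def posterior_probabilities(log_mls, priors=None):
    """P(M_i | D) for R models; equal priors are assumed when none are given."""
    if priors is None:
        priors = [1.0 / len(log_mls)] * len(log_mls)
    logs = [lm + math.log(pr) for lm, pr in zip(log_mls, priors)]
    shift = max(logs)  # subtract the maximum to avoid numerical underflow
    unnormalized = [math.exp(value - shift) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bic(lnl, k, n):
    """BIC = -2 lnL + K log n (smaller is better)."""
    return -2.0 * lnl + k * math.log(n)


# Hypothetical log marginal likelihoods for three candidate models.
log_mls = [-4272.1, -4270.6, -4275.3]
b_21 = bayes_factor(log_mls[1], log_mls[0])
print(f"B_21 = {b_21:.2f}: {evidence_for_mi(b_21)}")
print("posteriors:", [round(p, 4) for p in posterior_probabilities(log_mls)])
```

Working on the log scale matters in practice because marginal likelihoods of sequence alignments are far too small to represent directly as floating-point numbers.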

Model selection uncertainty
We can establish a "95% credible set" of models by summing the posterior probabilities from largest to smallest until the sum is just $\geq 0.95$ (Occam's window).

Model averaging
We can make inferences based on the entire set of candidate models. For example, the overall posterior mean of the shape of the gamma distribution ($\alpha$) would be:
$$E(\alpha \mid D) = \sum_{i=1}^{G} \hat{\alpha}_i\, P(M_i \mid D)$$

Parameter importance
The relative importance of the shape of the gamma distribution across all candidate models is:
$$w_+(\alpha) = \sum_{i=1}^{R} P(D \mid M_i)\, P(M_i)\, I_\alpha(M_i),$$
where
$$I_\alpha(M_i) = \begin{cases} 1 & \text{if } \alpha \text{ is in model } M_i \\ 0 & \text{otherwise} \end{cases}$$

Kullback-Leibler distance
$$I(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x \mid \theta)}\right) dx$$
$I(f, g)$ is the information lost when using $g$ to approximate $f$.
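For intuition about the Kullback-Leibler distance, the sketch below evaluates its discrete analogue, I(f, g) = sum over x of f(x) log(f(x) / g(x)), replacing the integral with a sum. The "true" distribution f is a hypothetical set of base frequencies and g is the equal-frequency approximation; both are invented for the example, and the function name is hypothetical.

```python
import math


def kl_information(f, g):
    """Discrete Kullback-Leibler information lost when g approximates f."""
    if len(f) != len(g):
        raise ValueError("f and g must have the same number of categories")
    total = 0.0
    for f_x, g_x in zip(f, g):
        if f_x > 0.0:
            if g_x <= 0.0:
                return math.inf  # g rules out an outcome that f allows
            total += f_x * math.log(f_x / g_x)
    return total


# Hypothetical "true" base frequencies (A, C, G, T) and an approximating
# model that assumes equal frequencies.
f_true = (0.4, 0.3, 0.2, 0.1)
g_equal = (0.25, 0.25, 0.25, 0.25)

print(f"I(f, g) = {kl_information(f_true, g_equal):.4f}")
print(f"I(g, f) = {kl_information(g_equal, f_true):.4f}")
print(f"I(f, f) = {kl_information(f_true, f_true):.4f}")
```

The two directions give different values, so the K-L "distance" is not symmetric, and it is zero only when the approximating model matches f exactly.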

Akaike's information criterion (AIC)
The AIC estimates the expected K-L information. The AIC penalizes the likelihood by the number of parameters ($K$):
$$\mathrm{AIC} = -2\ell + 2K$$
We prefer the model with the smallest AIC.
For small samples (approximately $n/K < 40$):
$$\mathrm{AIC}_c = \mathrm{AIC} + \frac{2K(K+1)}{n - K - 1}$$

Akaike differences ($\Delta_i$)
AIC differences ($\Delta_i$) are AICs rescaled by the minimum AIC, so that the best model has $\Delta_i = 0$:
$$\Delta_i = \mathrm{AIC}_i - \min \mathrm{AIC}$$
As a rule of thumb [31]:
- $\Delta_i \leq$ 1-2: models receive substantial support and are considered when making inferences
- $4 \leq \Delta_i \leq 7$: models have considerably less support
- $\Delta_i > 10$: models receive essentially no support

Akaike weights ($w_i$)
Akaike [32] suggested that $\exp(-\tfrac{1}{2}\Delta_i)$ approximates the relative likelihood of a model given the data, $L(M_i \mid D)$.
Akaike weights ($w_i$) for $R$ models:
$$w_i = \frac{\exp(-\tfrac{1}{2}\Delta_i)}{\sum_{r=1}^{R} \exp(-\tfrac{1}{2}\Delta_r)}$$

Model selection uncertainty with the AIC
The Akaike weights are very useful for assessing model-selection uncertainty.
We can establish a 95% confidence set of models for the best K-L model by summing the Akaike weights from largest to smallest until the sum is just $\geq 0.95$.
We can interpret a weight as the probability that a model is the best approximation of the true model, given the data.
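The AIC machinery above (AIC, AICc, differences, weights and the 95% confidence set) can be sketched as follows. The six AIC values are the smallest ones from the table of 24 models shown later in these slides; restricting the comparison to six models, and writing the model labels with +G and +I, are assumptions of this sketch, so the weights differ slightly from the tabulated ones. Function names are hypothetical.

```python
import math


def aic(lnl, k):
    """AIC = -2 lnL + 2K."""
    return -2.0 * lnl + 2.0 * k


def aicc(lnl, k, n):
    """Small-sample AIC: AIC + 2K(K + 1) / (n - K - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > K + 1")
    return aic(lnl, k) + 2.0 * k * (k + 1) / (n - k - 1)


def akaike_weights(aic_values):
    """Return (differences, weights) for a list of AIC values."""
    best = min(aic_values)
    deltas = [value - best for value in aic_values]
    relative = [math.exp(-0.5 * delta) for delta in deltas]
    total = sum(relative)
    return deltas, [r / total for r in relative]


def confidence_set(names, weights, level=0.95):
    """Smallest set of top-ranked models whose weights sum to at least `level`."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    chosen, cumulative = [], 0.0
    for name, weight in ranked:
        chosen.append(name)
        cumulative += weight
        if cumulative >= level:
            break
    return chosen


# Six smallest AIC values from the slide's table (labels are an assumption).
names = ["GTR+G", "GTR+I+G", "SYM+G", "SYM+I+G", "HKY+G", "HKY+I+G"]
aics = [8541.28, 8542.46, 8549.15, 8550.45, 8555.18, 8556.40]

deltas, weights = akaike_weights(aics)
for name, delta, weight in zip(names, deltas, weights):
    print(f"{name:8s} delta = {delta:6.2f}  w = {weight:.4f}")
print("95% confidence set:", confidence_set(names, weights))
```

Because only differences between AIC values enter the weights, adding a constant to every AIC leaves the weights and the confidence set unchanged.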

Model averaging
We can also make inferences based on the entire set of candidate models using the AIC:
$$\hat{\bar{\theta}} = \frac{\sum_{i=1}^{R} w_i\, I_\theta(M_i)\, \hat{\theta}_i}{w_+(\theta)},$$
where
$$w_+(\theta) = \sum_{i=1}^{R} w_i\, I_\theta(M_i), \qquad I_\theta(M_i) = \begin{cases} 1 & \text{if } \theta \text{ is in model } M_i \\ 0 & \text{otherwise.} \end{cases}$$
Also, $w_+(\theta)$ estimates the relative importance of any parameter.

Estimating phylogenies with 24 models
[Figure.]

Akaike weights

Model      AIC       Δ        w        Cumulative w
GTR+Γ      8541.28     0.00   0.6310   0.6310
GTR+I+Γ    8542.46     1.18   0.3493   0.9803
SYM+Γ      8549.15     7.88   0.0123   0.9926
SYM+I+Γ    8550.45     9.17   0.0064   0.9991
HKY+Γ      8555.18    13.91   0.0006   0.9997
HKY+I+Γ    8556.40    15.13   0.0003   1
K80+Γ      8564.98    23.70   0        1
K80+I+Γ    8566.22    24.94   0        1
GTR+I      8579.28    38.00   0        1
SYM+I      8590.49    49.22   0        1
HKY+I      8592.53    51.25   0        1
F81+Γ      8593.77    52.49   0        1
F81+I+Γ    8595.00    53.72   0        1
K80+I      8603.22    61.94   0        1
JC+Γ       8605.24    63.97   0        1
JC+I+Γ     8606.51    65.23   0        1
F81+I      8629.59    88.31   0        1
JC+I       8642.08   100.81   0        1
JC         8891.18   349.91   0        1
F81        8878.68   337.41   0        1
K80        8854.93   313.66   0        1
HKY        8845.17   303.90   0        1
SYM        8843.45   302.17   0        1
GTR        8831.24   289.96   0        1

Consensus tree for 24 models using Akaike weights
[Figure.]
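The model-averaged estimate and the parameter importance defined above can be sketched for a single parameter, here the proportion of invariable sites. The Akaike weights are the six non-zero weights from the table above; the per-model estimates of the proportion of invariable sites are hypothetical numbers invented for the example, and the +G/+I labels and function names are assumptions of the sketch.

```python
def parameter_importance(weights, in_model):
    """w_+(theta): sum of Akaike weights over the models that contain theta."""
    return sum(w for w, present in zip(weights, in_model) if present)


def model_averaged_estimate(weights, estimates, in_model):
    """Weighted mean of the per-model estimates over models containing theta."""
    w_plus = parameter_importance(weights, in_model)
    if w_plus == 0.0:
        raise ValueError("no candidate model contains this parameter")
    weighted_sum = sum(w * est
                       for w, est, present in zip(weights, estimates, in_model)
                       if present)
    return weighted_sum / w_plus


# Akaike weights for the six best models in the table above.
models = ["GTR+G", "GTR+I+G", "SYM+G", "SYM+I+G", "HKY+G", "HKY+I+G"]
weights = [0.6310, 0.3493, 0.0123, 0.0064, 0.0006, 0.0003]

# Indicator I_theta(M_i): does the model include invariable sites (+I)?
has_pinv = ["+I" in name for name in models]

# Hypothetical per-model estimates of p-inv (None where it is not a parameter).
pinv_hat = [None, 0.16, None, 0.18, None, 0.15]

importance = parameter_importance(weights, has_pinv)
averaged = model_averaged_estimate(weights, pinv_hat, has_pinv)
print(f"importance of p-inv = {importance:.4f}")
print(f"model-averaged p-inv = {averaged:.4f}")
```

Dividing by w_+ rather than by the sum of all weights means the average is taken only over the models in which the parameter actually exists.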

Parameter importance and model-averaged estimates

Parameter      Importance   Model-averaged estimate
fa             0.9787       0.2926
fc             0.9787       0.2283
fg             0.9787       0.2552
ft             0.9787       0.2238
TiTv           0.0003       0.9000
rac            0.5500       1.4872
rag            0.5509       2.4336
rat            0.5500       2.4315
rcg            0.5500       1.8640
rct            0.5509       3.1793
pinv(i)        0.0000       0.4797
alpha(g)       0.6453       0.4631
pinv(i+ig)     0.3547       0.1649
alpha(g+ig)    1.0000       0.5412

* For 56 models.

AIC and Bayes
- AIC approximates reality; Bayesian approaches try to identify the most probable model.
- If, as is usually the case, none of the models is correct, what sense can we make of prior or posterior odds?
- Bayesian priors
- What is exactly the sample size n?

Model selection strategies
Good properties for model selection methods

Property                                                                 hLRT   Bayesian   AIC
Applies easily to nonnested models                                       No     Yes        Yes
Allows for the simultaneous comparison of multiple models                No     Yes        Yes
Does not depend on a subjective significance level                       No     Yes        Yes
Incorporates topology uncertainty                                        No     Yes*       No
Easy to compute                                                          Yes    No*        Yes
Assesses model selection uncertainty                                     No     Yes        Yes
Allows model averaging                                                   No     Yes        Yes
Allows specifying prior information for models                           No     Yes*       Yes
Allows specifying prior information for model parameters                 No     Yes*       No
Designed to approximate, rather than to identify, truth                  No     No         Yes

* Not the BIC.

Modeltest
- Implements hLRT and AIC for 56 models. It will implement the BIC and tools for model selection uncertainty and model averaging.
- Google: http://darwin.uvigo.es
- Any OS
- So far, it needs PAUP*
- Web-server

Take home
1. Models are abstractions to learn about processes.
2. We can statistically select the best-fit model for our data. I recommend AIC or Bayes.
3. It is easy to implement using PAUP* and Modeltest.

"Model selection is a useful tool for research, but it is not a substitute for careful thinking and common sense reasoning" - Browne, 2000.

References
1. Felsenstein, J. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology 27, 401-410 (1978).
2. Huelsenbeck, J.P. & Hillis, D.M. Success of phylogenetic methods in the four-taxon case. Systematic Biology 42, 247-264 (1993).
3. Penny, D., Lockhart, P.J., Steel, M.A. & Hendy, M.D. The role of models in reconstructing evolutionary trees. in Models in Phylogenetic Reconstruction, Vol. 52 (eds. Scotland, R.W., Siebert, D.J. & Williams, D.M.) 211-230 (Clarendon Press, Oxford, 1994).
4. Bruno, W.J. & Halpern, A.L. Topological bias and inconsistency of maximum likelihood using wrong models. Molecular Biology and Evolution 16, 564-566 (1999).
5. Sullivan, J. & Swofford, D.L. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. Journal of Mammalian Evolution 4, 77-86 (1997).
6. Kelsey, C.R., Crandall, K.A. & Voevodin, A.F. Different models, different trees: The geographic origin of PTLV-I. Molecular Phylogenetics and Evolution 13, 336-347 (1999).
7. Buckley, T.R. & Cunningham, C.W. The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap support. Molecular Biology and Evolution 19, 394-405 (2002).
8. Buckley, T.R., Simon, C. & Chambers, G.K. Exploring among-site rate variation models in a maximum likelihood framework using empirical data: The effects of model assumptions on estimates of topology, edge lengths, and bootstrap support. Systematic Biology 50, 67-86 (2001).
9. Buckley, T.R. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Systematic Biology 51, 509-23 (2002).
10. Zhang, J. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Molecular Biology and Evolution 16, 868-875 (1999).
11. Sullivan, J. & Swofford, D.L. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Systematic Biology 50, 723-9 (2001).
12. Pupko, T., Huchon, D., Cao, Y., Okada, N. & Hasegawa, M. Combining Multiple Data Sets in a Likelihood Analysis: Which Models are the Best? Molecular Biology and Evolution 19, 2294-2307 (2002).
13. Tamura, K. Model selection in the estimation of the number of nucleotide substitutions. Molecular Biology and Evolution 11, 154-157 (1994).
14. Yang, Z., Goldman, N. & Friday, A. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Systematic Biology 44, 384-399 (1995).
15. Yang, Z. How often do wrong models produce better phylogenies? Molecular Biology and Evolution 14, 105-108 (1997).
16. Xia, X. Phylogenetic relationships among horseshoe crab species: Effect of substitution models in phylogenetic analysis. Systematic Biology 49, 87-100 (2000).
17. Gaut, B.S. & Lewis, P.O. Success of maximum likelihood phylogeny inference in the four-taxon case. Molecular Biology and Evolution 12, 152-162 (1995).
18. Felsenstein, J. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17, 368-376 (1981).
19. Goldman, N. Maximum likelihood inference of phylogenetic trees, with special reference to a Poisson process model of DNA substitution and to parsimony analyses. Systematic Zoology 39, 345-361 (1990).
20. Goldman, N. Simple diagnostic statistical test of models of DNA substitution. Journal of Molecular Evolution 37, 650-661 (1993).
21. Swofford, D.L., Olsen, G.J., Waddell, P.J. & Hillis, D.M. Phylogenetic Inference. in Molecular Systematics (eds. Hillis, D.M., Moritz, C. & Mable, B.K.) 407-514 (Sinauer Associates, Sunderland, MA, 1996).
22. Kendall, M. & Stuart, A. The Advanced Theory of Statistics, 240-252 (Charles Griffin, London, 1979).
23. Kass, R.E. & Raftery, A.E. Bayes factors. Journal of the American Statistical Association 90, 377-395 (1995).
24. Suchard, M.A., Kitchen, C.M., Sinsheimer, J.S. & Weiss, R.E. Hierarchical phylogenetic models for analyzing multipartite sequence data. Systematic Biology 52, 649-64 (2003).
25. Raftery, A.E. Hypothesis testing and model selection. in Markov Chain Monte Carlo in practice (eds. Gilks, W.R., Richardson, S. & Spiegelhalter, D.J.) 163-187 (Chapman & Hall, London; New York, 1996).
26. Suchard, M.A., Weiss, R.E., Dorman, K.S. & Sinsheimer, J.S. Oh brother, where art thou? A Bayes factor test for recombination with uncertain heritage. Systematic Biology 51, 715-28 (2002).
27. Huelsenbeck, J.P. & Imennov, N.S. Geographic origin of human mitochondrial DNA: accommodating phylogenetic uncertainty and model comparison. Systematic Biology 51, 155-65 (2002).
28. Huelsenbeck, J.P., Rannala, B. & Larget, B. A Bayesian framework for the analysis of cospeciation. Evolution Int J Org Evolution 54, 352-64 (2000).
29. Suchard, M.A., Weiss, R.E. & Sinsheimer, J.S. Testing a Molecular Clock without an Outgroup: Derivations of Induced Priors on Branch-Length Restrictions in a Bayesian Framework. Systematic Biology 52, 48-54 (2003).
30. Suchard, M.A., Weiss, R.E. & Sinsheimer, J.S. Bayesian selection of continuous-time Markov chain evolutionary models. Molecular Biology and Evolution 18, 1001-13 (2001).
31. Burnham, K.P. & Anderson, D.R. Model selection and multimodel inference: a practical information-theoretic approach (Springer-Verlag, New York, NY, 2003).
32. Akaike, H. Information measures and model selection. International Statistical Institute 22, 277-291 (1983).