Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees
COMP 571, Luay Nakhleh, Rice University

Four Questions Need to be Answered
- What data should we use?
- Which method should we use?
- Which evolutionary model should we use?
- Which test should we use to assess the robustness of the prediction of particular tree features?

Desired Properties of the Data Used in a (Species) Phylogeny Reconstruction
- As we discussed before, reconstructing a tree that reflects the evolutionary history of a set of species is a hard task, and great care must be taken in choosing the data.
- An ideal choice is a genomic region that appears exactly once in every species and whose evolutionary history is identical to that of the species.
- The region should show little, if any, trace of horizontal gene transfer (HGT).
- The rate of change in the region should be fast enough to distinguish between closely related species, but not so fast that regions from very distantly related species cannot be reliably aligned.

Small Ribosomal Subunit rRNA
- The DNA sequence specifying the small ribosomal subunit rRNA (called 16S rRNA in prokaryotes) has been found to be one of the best genomic segments for this type of analysis, despite occurring in several copies in some genomes.
- We'll later illustrate the reconstruction of the evolutionary history of 38 bacterial species from the phylum Proteobacteria using the 16S rRNA sequence.

Choice of a Method
- Many computational methods exist for phylogeny reconstruction.
- These can be divided into two categories: distance-based methods and sequence-based methods.
- Distance-based methods first compute pairwise distances from the sequences, and then use these distances to obtain the tree.
- Sequence-based methods use the sequence alignment directly, and usually search the tree space using an optimality criterion defined on the columns of the alignment.
- Examples of distance-based methods include UPGMA (unweighted pair-group method using arithmetic averages), NJ (neighbor joining), and Fitch-Margoliash.
- Examples of sequence-based approaches include maximum parsimony (MP), maximum likelihood (ML), and Bayesian inference.
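To make the distance-based route concrete, here is a minimal sketch of UPGMA, one of the methods named above. It is an illustration under stated assumptions, not the implementation used for the trees in these slides: the function name `upgma`, the nested-tuple tree format, and the toy distance matrix are all hypothetical.

```python
from itertools import combinations


def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left_subtree, right_subtree, node_height);
    leaves are the taxon names. The output format is an assumption
    made for this sketch.
    """
    # Active clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by unordered id pair
    d = {frozenset((i, j)): matrix[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the closest pair of active clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2.0
        tree_a, n_a = clusters.pop(a)
        tree_b, n_b = clusters.pop(b)
        # Distance to each remaining cluster is the size-weighted average
        for c in clusters:
            d[frozenset((next_id, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b, height), n_a + n_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy ultrametric distances (made up for illustration)
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, matrix))
```

Each merge places the new node at half the distance between the joined clusters, which is why UPGMA always yields an ultrametric tree.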

Properties of Methods

Method              Distances?   Tree type     Single tree?   Tree score?   Tree test?
UPGMA               yes          ultrametric   yes            no            no
NJ                  yes          additive      yes            no            no
Fitch-Margoliash    yes          additive      yes            no            no
Minimum evolution   yes          additive      no             yes           yes
MP                  no           additive      no             yes           yes
ML                  no           additive      no             yes           yes
Bayesian            no           additive      no             yes           yes

Choice of a Model of Evolution
- As we discussed before, the p-distance is usually an underestimate of the true evolutionary distance.
- Therefore, a correction of the p-distance is necessary for phylogenetic methods.
- Models of evolution can be used to derive such distance corrections (we'll see some of the formulas later).
- [Figure: correlation between corrected and uncorrected distances]
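The slides defer the correction formulas. As an illustration of what such a correction looks like, the sketch below implements the standard Jukes-Cantor (JC) and Kimura two-parameter (K2P) distance formulas. The function names and example proportions are hypothetical and not taken from the lecture.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance from a p-distance.

    p is the fraction of aligned sites that differ between two sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("JC correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(p_ts, p_tv):
    """Kimura two-parameter distance.

    p_ts: fraction of sites showing a transition difference
    p_tv: fraction of sites showing a transversion difference
    """
    a = 1.0 - 2.0 * p_ts - p_tv
    b = 1.0 - 2.0 * p_tv
    if a <= 0.0 or b <= 0.0:
        raise ValueError("K2P correction is undefined for these proportions")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Example proportions (made up for illustration)
for p in (0.05, 0.30, 0.60):
    print(p, round(jc69_distance(p), 4))
```

The corrected distance always exceeds the p-distance for p > 0, and the gap widens as p grows, which matches the statement that the p-distance underestimates the true evolutionary distance.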

Choice of a Model of Evolution

Model   Base composition   Separate transition/transversion rates?   Identical transition rates?   Identical transversion rates?   Reference
JC      1:1:1:1            no                                        yes                           yes                             Jukes and Cantor (1969)
F81     variable           no                                        yes                           yes                             Felsenstein (1981)
K2P     1:1:1:1            yes                                       yes                           yes                             Kimura (1980)
HKY85   variable           yes                                       no                            no                              Hasegawa et al. (1985)
TN      variable           yes                                       no                            yes                             Tamura and Nei (1993)
K3P     variable           yes                                       no                            yes                             Kimura (1981)
SYM     1:1:1:1            yes                                       no                            no                              Zharkikh (1994)
GTR     variable           yes                                       no                            no                              Rodriguez et al. (1990)

- Choosing the evolutionary model is not a simple task.
- One approach is to try several models and select the tree that best fits the data (this doesn't work for methods such as NJ that don't provide a way to test goodness of fit).
- A way of comparing two evolutionary models, called the hierarchical likelihood ratio test (hLRT), can be used when one of the models is contained in (nested within) the other.
- In this method, both models must be assessed with the same tree topology.
- Using the ML method for each evolutionary model in turn, the branch lengths that give the highest likelihood of the tree are found.
- These two likelihood values can then be compared using the hLRT, a chi-squared test based on the difference in log-likelihoods and the difference in the numbers of parameters of the two models.
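A minimal sketch of the likelihood ratio test described above, assuming the usual form of the statistic: twice the difference in maximized log-likelihoods, compared against a chi-squared distribution whose degrees of freedom equal the number of extra parameters. The function names and the log-likelihood values are hypothetical. The chi-squared tail is computed with the standard library only, so no statistics package is assumed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (integer df >= 1)."""
    if x <= 0.0:
        return 1.0
    z = x / 2.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with a = df / 2.
    if df % 2:
        q, a = math.erfc(math.sqrt(z)), 0.5
    else:
        q, a = math.exp(-z), 1.0
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params):
    """Return (test statistic, p-value) for two nested models.

    Both log-likelihoods must come from the same tree topology.
    """
    stat = 2.0 * (lnl_complex - lnl_simple)
    return stat, chi2_sf(stat, extra_params)


# Made-up log-likelihoods: JC versus K2P, which adds one parameter
stat, p_value = likelihood_ratio_test(-2500.0, -2490.0, 1)
print(stat, p_value)
```

A small p-value suggests that the extra parameters of the richer model improve the fit by more than chance alone would explain.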

Choice of a Model of Evolution
- Two problems with the hLRT are (1) the requirement that the models be nested, and (2) the need to decide on the order of testing and the significance levels to use when choosing among many models.
- Two other criteria that have been proposed are:
  - The Akaike information criterion (AIC), defined as $\mathrm{AIC}_i = -2 \ln L_i + 2 p_i$, where $L_i$ is the maximum likelihood value of the optimal tree obtained using model $i$ in a calculation that has $p_i$ parameters (including the branch lengths as parameters).
  - The Bayesian information criterion (BIC), defined as $\mathrm{BIC}_i = -2 \ln L_i + p_i \ln N$, where $N$ is the size of the data set.

Choice of a Test of Tree Features
- One of the most commonly used tests for the reliability of the tree topology is the bootstrap.
- It has been shown, however, that bootstrap values may be conservative.
- The bootstrap measures the degree of support within the data for a particular branch, given the evolutionary model and tree reconstruction method.
- The bootstrap values give no indication of the robustness of these features to changes in the model or method.

Two Examples
- Phylogenetic analysis of a data set of 38 species of bacteria from the phylum Proteobacteria
- A gene tree for a superfamily of enzymes
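The AIC and BIC definitions given earlier translate directly into code. The sketch below is illustrative only: the log-likelihoods and parameter counts are made up, and taking the data set size N to be the alignment length is an assumption, since the slides say only "the size of the data set".

```python
import math


def aic(ln_l, n_params):
    """Akaike information criterion: -2 ln L + 2 p."""
    return -2.0 * ln_l + 2.0 * n_params


def bic(ln_l, n_params, n_sites):
    """Bayesian information criterion: -2 ln L + p ln N."""
    return -2.0 * ln_l + n_params * math.log(n_sites)


# Hypothetical fits: model -> (maximized log-likelihood, parameter count).
# Counts assume 5 branch lengths (4 taxa, unrooted) plus model parameters.
fits = {
    "JC": (-2500.0, 5),
    "K2P": (-2490.0, 6),
    "GTR": (-2487.0, 13),
}
n_sites = 1000  # assumed alignment length, used as N

best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print(best_aic, best_bic)
```

Lower values are better under both criteria. Unlike the hLRT, neither requires the models to be nested, and BIC penalizes extra parameters more heavily than AIC once ln N exceeds 2.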

Analysis of the Proteobacteria Data Set
- Data: 16S rRNA sequences from all 38 species.
- Using the AIC method, as implemented in MrAIC, GTR+Gamma was found to be the most appropriate model.
- Using GTR+Gamma, an unrooted ML tree was generated using the tool PhyML.
- [Figure: (A) the unrooted tree; (B) the same tree rooted so that the Deltaproteobacteria and Epsilonproteobacteria diverge from the other classes first; (C) the same tree rooted using the midpoint method]
- [Figure: UPGMA trees reconstructed from the same 16S rRNA data set with JC correction. (A) Alignment columns that have at least one gap are ignored. (B) Alignment columns are ignored only when one of the two sequences being compared has a gap.]
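The two gap treatments in panels (A) and (B) can give different distances from the same alignment. The following sketch shows the difference on a toy alignment. The function names, the "-" gap symbol, and the sequences are assumptions made for illustration and are not part of the original analysis.

```python
GAP = "-"


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns where either
    of the two sequences has a gap (the treatment in panel B)."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != GAP and y != GAP]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)


def complete_deletion(alignment):
    """Drop every column with a gap in any sequence (panel A)."""
    width = len(alignment[0])
    keep = [i for i in range(width) if all(s[i] != GAP for s in alignment)]
    return ["".join(s[i] for i in keep) for s in alignment]


# Toy alignment (made up for illustration)
alignment = ["ACGTACGT", "ACGTAC-T", "A-GTTCGT"]

# (B) only the gaps in the two sequences being compared matter
print(p_distance(alignment[0], alignment[2]))

# (A) gaps anywhere in the alignment remove the column for every pair
trimmed = complete_deletion(alignment)
print(p_distance(trimmed[0], trimmed[2]))
```

With treatment (A) every pair is compared over the same columns, at the cost of discarding data. With treatment (B) more sites are used, but different pairs are measured over different sets of columns.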

- [Figure: the condensed tree obtained by a bootstrap analysis of the tree in (A) on the previous slide. Branches with bootstrap values below 75% have been contracted.]

An example of a phylogenetic analysis of an enzyme superfamily
- This analysis would address two important issues:
  - Predicting the function of an enzyme that belongs to the superfamily, knowing only its sequence
  - Elucidating the evolution of enzyme function within the superfamily

The cofactor thiamine diphosphate (TDP) superfamily
- Some of the constituent families also use the cofactor flavin adenine dinucleotide (FAD).
- Unusually, in some cases FAD does not participate in the reaction, but it must still be bound for the enzyme to be active.
- The selected data set contains enzymes from various families.
- The evolution of the superfamily chosen in this analysis must have involved either the development or the loss of an FAD-binding site (or both), possibly on more than one occasion, and a correct reconstruction of the evolutionary history would reveal this.

- [Figure: the NJ tree for the TDP-dependent enzyme superfamily with (A) JC correction and (B) K2P correction. Red squares: proteins whose experimental structure is known and does not contain a bound FAD cofactor. Green squares: proteins known to bind FAD and use it as an active cofactor. Orange squares: proteins known to bind FAD but not to use it in catalysis.]
- [Figure: analysis of the same data set using the GTR+Gamma+I model and ML]
- [Figure: the ML tree rooted along the branch that separates the FAD-binding and non-FAD-binding proteins]

Questions?