Phylogenetics in the Age of Genomics: Prospects and Challenges Antonis Rokas Department of Biological Sciences, Vanderbilt University http://as.vanderbilt.edu/rokaslab
http://pubmed2wordle.appspot.com/
What is a Phylogenetic Tree? A phylogenetic tree is the mathematical structure used to depict the evolutionary history of a group of organisms or genes Phylogenetic trees show historical relationships, not similarities
Why are Phylogenetic Trees Useful? Benefits to Science: The origin and history of life Evolution of molecules, phenotypes and developmental mechanisms Benefits to Society: Human health (identification of disease agents & reservoirs) Agriculture (crops wild relatives) Biodiversity (conservation strategies)
CSI: Phylogeny A gastroenterologist was accused of second-degree murder for injecting his former girlfriend with blood obtained from an HIV type 1 (HIV-1)-infected patient under his care Phylogenetic analyses of HIV-1 sequences were admitted and used as evidence in the court Patient and Victim HIV-1 sequences Sequences from local population sample of HIV-1 infected individuals Metzker et al. (2002) PNAS
Phylogenetic Concepts and Terminology tree / topology taxon - taxa branches, nodes & internodes polytomy & resolution
Rooted and Unrooted Trees
Phylogenetic Concepts and Terminology
The Meaning of Vertical and Horizontal Axes in Trees
Estimating Phylogenies 1. Select an optimality criterion 2. Select a search strategy 3. Use selected search strategy to generate a series of trees, and apply selected optimality criterion to each, always keeping track of the best tree(s) examined thus far
Optimality Criteria Criterion Minimum Evolution Least Squares Parsimony Maximum Likelihood Bayesian Inference Data pairwise distances pairwise distances discrete characters discrete characters discrete characters Distance methods first convert an alignment into a matrix of pairwise distances Discrete character methods use the alignment as is and consider directly each site/character
Search Strategies There is a need for a search strategy because the number of possible trees goes up very fast with increasing number of species T BT ( ) (2n 5) n 3 Search Strategy Exhaustive Heuristic
Exhaustive Search
Heuristic Search Heuristic searches don t guarantee discovery of optimal topology The lazy subtree rearrangement (LSR) tree search strategy used by the RAxML program
Assessing Robustness in Inference: Bootstrap Characters are sampled with replacement to create many bootstrap replicate data sets equal in size to the original one Each bootstrap replicate data set is analysed (e.g., with parsimony, likelihood) 1234567... seq1: AAAATCGTGGGG seq2: ATAATGGTGGCG seq3: AAAATGGTGGCG seq4: ATAATGGTGGCG Agreement among the resulting trees is summarized with a majority-rule consensus tree The frequency of occurrence of clades, also known as bootstrap values, is a measure of support for those groups
The Problem of Incongruence Gene X Gene Y Species tree? Incongruence is pervasive in the phylogenetics literature Rokas & Chatzimanolis (2008) in Phylogenomics (W. J. Murphy, Ed.)
A Systematic Evaluation of Single Gene Phylogenies S. cerevisiae S. paradoxus S. mikatae S. bayanus Dataset: 106 genes on all 16 chromosomes totaling 127kb corresponding roughly to 1% of the genomic sequence, 2% of genes Analyses: Maximum Likelihood (ML) & Maximum Parsimony (MP) on nt data sets and MP on amino acid data sets Kellis et al. (2003) Nature
Incongruence in Shallow Time ML / MP Rokas et al. (2003) Nature
Concatenation of 106 Genes Yields a Single Yeast Phylogeny ML / MP on nt MP on aa Rokas et al. (2003) Nature
Phylogenetic Accuracy is Positively Correlated with Gene Number Murphy et al. (2001) Science Zanis et al. (2002) PNAS Rokas & Carroll (2005) Mol. Biol. Evol.
Reconstructing the Tree of Life Ciccarelli et al. (2006) Science James et al. (2006) Nature
Methods of Assessing Robustness are Subject to Systematic Error Rokas & Carroll (2006) PLoS Biol.; Rokas et al. (2003) Nature
Gene Trees Can Differ from Species Trees Degnan & Rosenberg (2009) Trends Ecol. Evol.
Inferring the Species Tree from Individual Gene Histories Concordance Factor: The proportion of the genome for which a clade is true STEM: Maximum likelihood estimation of species trees Kubatko et al.(2009) Bioinformatics BEST: Bayesian estimation of species trees Liu (2008) Bioinformatics Ané et al. (2007) Mol. Biol. Evol.
Lack of Resolution Due to Lack of Data or to Radiation? Balavoine & Adoutte (1998) Science / Rokas et al. (2003) Evol. Dev.
Lack of Resolution Among Most Metazoan Phyla 50 genes, ML/MP Rokas et al. (2005) Science
Biological Explanations for the Lack of Resolution Hypothesis I: Mutational Saturation the phylogenetic signal originally contained in these protein sequences has been erased by multiple substitutions Hypothesis II: Evolutionary Radiation the close spacing of cladogenetic events early in the history of metazoans limits their resolution
Time of Origin of Metazoa & Fungi 1) Fossils Metazoa: Xiao et al. (1998) Nature Fungi: Yuan et al. (2005) Science 2) Relative molecular clocks Study Douzery et al. (2004) PNAS Hedges et al. (2004) BMC Evol. Biol. Heckman et al. (2001) Science Metazoa 849 Myr 976 Myr 1177 Myr Fungi 727 Myr 968 Myr 1208 Myr 3) The way this set of 50 genes has evolved across the two Kingdoms is remarkably similar
A Remarkable Contrast in Phylogenetic Resolution Given: lack of resolution in Metazoa (using lots of genes) contrast in resolution between Metazoa and Fungi (using identical genes) The early history of animals is best viewed as an evolutionary radiation 50 genes, ML/MP Rokas et al. (2005) Science
Rapid Tempo of Cladogenesis Early in Metazoan History Valentine (2002) Ann. Rev. Ecol. Syst.
Incongruence in Deep Time Rokas (2008) Ann. Rev. Genet.
What Makes Animals Hard to Resolve But Yeasts Easy? Internode length: influences amount of phylogenetic signal (I) Homoplasy: independent evolution of identical characters (*, ) Rokas & Carroll (2006) PLOS Biol.
the independent evolution of identical character states (e.g., amino acid residues) in different branches of a phylogenetic tree that are not directly inherited from a common ancestor Homoplasy
Quantifying Homoplasy Across the Tree of Life Rokas & Carroll (2008) Mol. Biol. Evol.
Measuring Homoplasy Across the Tree of Life Rokas & Carroll (2008) Mol. Biol. Evol.
Measuring Homoplasy Across the Tree of Life
An Overabundance of Homoplastic Substitutions Dataset Obs(H) Exp(H) Obs(H)/Exp(H) Saccharomyces yeasts 4.7% 2.0% 2.4 Aspergillus filamentous fungi 6.9% 2.5% 2.8 Fungal phyla 6.7% 3.5% 1.9 Eukaryote phyla 7.5% 3.5% 2.1 Plants 2.3% 0.9% 2.7 Drosophila fruit-flies 2.6% 1.3% 2.0 Cetartiodactyls mammals 12.3% 4.9% 2.5 Paenungulata mammals 10.4% 5.0% 2.1 Metazoan phyla 7.4% 3.3% 2.3 Vertebrates 10.2% 3.2% 3.2 Average: 7.1% 2.3% 2.4 Rokas & Carroll (2008) Mol. Biol. Evol.
Excess Homoplasy is Specific to Homoplastic Substitutions Parsimony-informative sites Clade Obs(H) Exp(H) Obs(H)/Exp(H) Saccharomyces yeasts 3.3% 3.3% 0.0 Aspergillus filamentous ascomycetes 10.3% 9.8% 1.1 Fungal phyla 2.5% 2.2% 1.1 Eukaryotic phyla 2.3% 2.1% 1.1 Land plants 31.6% 31.7% 1.0 Drosophila fruit-flies 10.2% 10.1% 1.0 Cetartiodactyl mammals 4.5% 3.6% 1.3 Metazoan phyla 3.1% 2.8% 1.1 *Paenungulata mammals 5.4% 2.4% 2.3 *Vertebrates 2.9% 2.4% 1.2 Average: 7.6% 7.0% 1.1 Rokas & Carroll (2008) Mol. Biol. Evol.
Homoplasy Stems From Frequently Exchanged Amino Acids 190 possible interchanges among 20 amino acids - 75 can be achieved via a single nucleotide substitution - The other 115 require two or three substitutions 65% of observed interchanges is between the top12 most frequently observed amino acid interchanges a single mutational step away Rokas & Carroll (2008) Mol. Biol. Evol.
Conservative AA Substitutions Are Very Common in Alignments GAC/T (D) Aspartic Acid GAA/G (E) Glutamic Acid >Cimi >Acla >Afis >Afla >Afum >Anid >Anig >Aory >Ater EGAAGPIRFNMRLPESRYHCAGGITTIVVY AATTSPEDVETRKDDERVEHGGGITTVLVY AATASAEEFELRKEDERVEHGDGITTILIY SPAAAADEFDLQRDEDQNDYADSVSSVFIF AATTSAEEFELRKEDERVEHGDGITTILIY APAATADVSDTRLKQEQAEQAGGITAILVY APAAAPDDLETRLEEEQADYAGSVSSILIY SPAAAADEFDLQRDEDQNDYADSVSSVFIF SPAAAPDDVDTQREEDQNDYAGSVSSIFVF
What Are the Limits to Resolution? For internodes in ~ 600 my animal phylogeny Time Amount of Data _ 40 my -> 700 variable sites 10 my -> 2,800 variable sites 1 my -> 28,000 variable sites 0.1 my ->?????? variable sites 1 y ->?????? variable sites Philippe et al. (1994) Development
What If Mammals Had Diversified in the Cambrian? Springer et al. (2003) PNAS
107 Myr tree 600 Myr tree
Phylogenetic Accuracy is Inversely Correlated with Elapsed Time
Failure to Resolve Internodes < 10 Million Years Rokas et al. (2005) Science
Why Are There So Few Bushes? Data in, fully-resolved phylogenetic tree out
The Human / Chimp / Gorilla Tree Inform. Sites 8561 / 11293 1302 / 11293 1430 / 11293 Carroll (2003) Nature Patterson et al. (2006) Nature
Bushes in the Mammalian Tree Inform. Sites 22 25 21 Nishihara et al. (2009) PNAS
The Tetrapod / Lungfish / Coelacanth Bush Inform. Sites The three major lineages first 96 / 294 appeared within 20 30 million years ago, approximately 390 92 / 294 million years ago 106 / 294 44 genes, ML/MP/NJ Takezaki et al. (2004) Mol. Biol. Evol.
Mind the Gap Between Real Data and Models One can use the most sophisticated audio equipment to listen, for an eternity, to a recording of white noise and still not glean a useful scrap of information Rodrigo et al. (1994) Chapter in: Sponge in Time and Space; Biology, Chemistry, Paleontology
Acknowledgements http://as.vanderbilt.edu/rokaslab