A phylogenomic toolbox for assembling the tree of life

Similar documents
Parsimony via Consensus

reconciling trees Stefanie Hartmann postdoc, Todd Vision s lab University of North Carolina the data

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Points of View Matrix Representation with Parsimony, Taxonomic Congruence, and Total Evidence

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Homology Modeling. Roberto Lins EPFL - summer semester 2005

SDM: A Fast Distance-Based Approach for (Super)Tree Building in Phylogenomics

A (short) introduction to phylogenetics

New Tools for Visualizing Genome Evolution

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes

Anatomy of a species tree

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

BIOINFORMATICS: An Introduction

Dr. Amira A. AL-Hosary

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1

Introduction to Bioinformatics

Chapter 19 Organizing Information About Species: Taxonomy and Cladistics

Evolution, University of Auckland, New Zealand PLEASE SCROLL DOWN FOR ARTICLE

Axiomatic Opportunities and Obstacles for Inferring a Species Tree from Gene Trees

Algorithms in Bioinformatics

Algorithms for efficient phylogenetic tree construction

arxiv: v1 [q-bio.pe] 16 Aug 2007

SUPPLEMENTARY INFORMATION

Increasing Data Transparency and Estimating Phylogenetic Uncertainty in Supertrees: Approaches Using Nonparametric Bootstrapping

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Intraspecific gene genealogies: trees grafting into networks

Chapter 26 Phylogeny and the Tree of Life

Phylogenetic Tree Reconstruction

MiGA: The Microbial Genome Atlas

The Information Content of Trees and Their Matrix Representations

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

A Phylogenetic Network Construction due to Constrained Recombination

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

Phylogenetic inference

Supplementary Materials for

Computational methods for predicting protein-protein interactions

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Genomic insights into the taxonomic status of the Bacillus cereus group. Laboratory of Marine Genetic Resources, Third Institute of Oceanography,

New imputation strategies optimized for crop plants: FILLIN (Fast, Inbred Line Library ImputatioN) FSFHap (Full Sib Family Haplotype)

TheDisk-Covering MethodforTree Reconstruction

In order to compare the proteins of the phylogenomic matrix, we needed a similarity

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Consensus methods. Strict consensus methods

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

C3020 Molecular Evolution. Exercises #3: Phylogenetics

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Computational Structural Bioinformatics

Department of Computer Science, Technical University of Munich, Bolzmannstr. 3, 85747, Garching b. Mu nchen, Germany 2

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Using Bioinformatics to Study Evolutionary Relationships Instructions

Introduction to Evolutionary Concepts

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Constructing Evolutionary/Phylogenetic Trees


Preface. Contributors

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

9/19/2012. Chapter 17 Organizing Life s Diversity. Early Systems of Classification

An Adaptive Association Test for Microbiome Data

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the

Chapter 9 BAYESIAN SUPERTREES. Fredrik Ronquist, John P. Huelsenbeck, and Tom Britton

Bioinformatics and BLAST

Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses

Species Tree Inference using SVDquartets

Phylogenetic analyses. Kirsi Kostamo

Jed Chou. April 13, 2015

Predicting Protein Functions and Domain Interactions from Protein Interactions

Comparative Performance of Supertree Algorithms in Large Data Sets Using the Soapberry Family (Sapindaceae) as a Case Study

PhyloNet. Yun Yu. Department of Computer Science Bioinformatics Group Rice University

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Weighted Quartets Phylogenetics

Charles Semple, Philip Daniel, Wim Hordijk, Roderic D M Page, and Mike Steel

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Distances that Perfectly Mislead

ALGORITHMIC STRATEGIES FOR ESTIMATING THE AMOUNT OF RETICULATION FROM A COLLECTION OF GENE TREES

Finding a gene tree in a phylogenetic network Philippe Gambette

Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets

Consensus properties for the deep coalescence problem and their application for scalable tree search

Phylogenetic Networks, Trees, and Clusters

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

Effects of Gap Open and Gap Extension Penalties

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

Phylogenetic diversity and conservation

Constructing Evolutionary/Phylogenetic Trees

The practice of naming and classifying organisms is called taxonomy.

Molecular Evolution & Phylogenetics

Unsupervised Learning in Spectral Genome Analysis

SPECIATION. REPRODUCTIVE BARRIERS PREZYGOTIC: Barriers that prevent fertilization. Habitat isolation Populations can t get together

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

Trees from Trees: Construction of Phylogenetic Supertrees using Clann.

OMICS Journals are welcoming Submissions

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

Phylogenetics in the Age of Genomics: Prospects and Challenges

Phylogenomics. Jeffrey P. Townsend Department of Ecology and Evolutionary Biology Yale University. Tuesday, January 29, 13

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Transcription:

A phylogenomic toolbox for assembling the tree of life

or, The Phylota Project (http://www.phylota.org) UC Davis Mike Sanderson Amy Driskell U Pennsylvania Junhyong Kim Iowa State Oliver Eulenstein David Fernández-Baca NSF AToL grant DEB-0334963

Overall goal: Developing algorithms and software tools for exploiting the phylogenetic information in large sequence databases

Proteins Taxa QuickTimeª and a TIFF (LZW) decompressor are needed to see this picture. Increasing density Sparse data availability matrix for GenBank (rel. 137) green plant proteins (single copy potentially informative) Density = 0.002

Specific goals Characterizing the phylogenetic information content of sequence databases Extracting optimal collections of data sets from databases to build species trees Designing inputs for supermatrix and supertree construction Algorithms for targeted sequence efforts Further work on novel supertree methods (Flipdistance; additive distance methods)

Project status Year 1 completed Benchmark data sets of ~300,000 proteins constructed and posted on web site Progress on formal problem definitions for groves, quasi-bicliques, targeted sequencing Early prototypes of software tools posted on web site (bicliques, groves, clustering, sample visualization) Preliminary experiments with large sparse supermatrices and their associated supertrees

Phylogenetic information content of two sets of protein sequences from large databases (Driskell et al. 2004) Swiss-Prot rele ase 40.29 GenBank rele ase 137 (green plant) N umber of sequences in release 121,218 185,418 Total number of sequences clustered 121,218 185,089 Total number of taxa 7449 16,348 Total number of clusters 64,712 59,144 MINI MA L PHY LO GENET IC CLUSTERS Number of clusters 4214 (6.5%) 1365 (2.3%) Sequence coverage 41,812 (34%) 65,113 (35%) Taxon coverage 5538 (74%) 15,599 (95%) SINGLE -CO PY CLUSTE RS Number of clusters 3592 (5.6%) 853 (1.4%) Sequence coverage 28,742 (24%) 39,443 (21%) Taxon coverage 4404 (59%) 14,502 (89%) Density 0.0018 0.0021 GRO VES Minimum n umber of groves 123 15 Number of orphan clusters 67 7 Minimum n umber of clusters in largest grove 3183 814 Minimum n umber of sequences in largest grove 25,272 (21%) 38,700 (49%) Minimum n umber of taxa in largest grove 2695 (36%) 14,169 (87%) BICLIQ UES Number of nontrivial maxim al bicliques 43,576 5587 Sequence coverage 23,855 (20%) 15,092 (8.2%) Taxon coverage 1449 (19%) 4230 (26%) Number of clusters in biclique set 3187 (4.9%) 645 (1.1%) La rgest biclique (in terms of taxa ) 2 76 2 975 Largest biclique (in term sof clusters) 352 4 70 4

Proteins Finding complete supermatrices in a sparse database A complete matrix is one with no missing proteins Taxa

Sizes of the 8300 maximal biclique data sets 250 1 6492; 2 1950 200 Number of taxa 150 100 50 503 1 0 0 10 20 30 40 50 60 70 80 Number of proteins

Proteins Constructing incomplete supermatrices in a sparse database. What are efficient methods for assembling these matrices and for identifying optimal Taxa

Green plant supermatrix: 69 taxa 254 proteins QuickTimeª and a TIFF (LZW) decompressor are needed to see this picture. Density = 16% (13% of characters) 96,584 amino acids 2777 sequences

Groves: collections of trees that can be useful inputs to supertree construction (based on taxonomic overlap

Constructing Super-trees and Consensus trees from distance embedding Goal: Develop new algorithmic approaches to super-tree and consensus tree construction by characterizing trees as Additive Distance (AD) matrices. Rationale: For every positive edge-weighted tree, there is an invertible correspondence to an AD matrix. Therefore, we can view construction of super-trees in terms of space of additive distance matrices. Each tree or a gene cluster yields a projection from the hypothetical full distance matrix (corresponding to the full supertree). The problem is to extrapolate to the full matrix (Fig X). Advantages: For n total input taxa, the space of additive distances has O(n 2 ) embedding dimensions as opposed to O(2 n ) embedding dimensions in usual combinatorial algorithms such as Matrix Representation Parsimony (MRP). The histogram on the right shows a small simulation test where a 16-taxon tree was estimated from a random collection of k quartets. The results suggest that the AD embedding method more efficiently reconstructs the supertree compared to MRP. Furthermore, since the embedding is in metric space, it allows us to use standard methods to incorporate different confidence in branches as well as different confidence in data sets. 100 80 60 40 20 0 k=16 n k=64 nlogn Percent Correct Splits k=256 n 2 K -set k=4096 n 3 all taxa Ragan-Baum k = 256

The figure on the right shows one result from AD embedding method where we assembled 476 gene clusters of varying sizes and constructed a single supertree (numbers indicate bootstrap values). Next Steps: The method has been developed for generally dense coverage of submatrices as in the tree shown here. However, for largescale datasets as found in the Genbank, the submatrices very sparsely cover the relationship of the taxa. Thus many of the missing relationships need to be extrapolated. We are investigating the usage of AD matrix extrapolation method proposed by Makarenkov and Lapointe (Bioinformatics 20:2113). The tree is for 40 completely sequence microbial genomes with Yeast (Sce). The three letter taxon code can be found at http://ncbi.nlm.nih.gov

Faster Flipping for Supertree Construction Improvement on previous flipping program (Chen et al., Systematic Biology, 53(2):299 308, 2004) New method achieves speedup of at least n (= number of taxa) for NNI, SPR, and TBR by avoiding re-computation of flip scores for subtrees

Targeted sequencing algorithms: collaborations with other AToLs NemAToL Early Bird The green tree of life Eu-Tree