Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses


Authors
Jennifer Fouquier 1, Jai Ram Rideout 2, Evan Bolyen 2, John Chase 2, Arron Shiffer 2,3, Daniel McDonald 4, Rob Knight 5, J Gregory Caporaso 2,3 and Scott T. Kelley 1,4*

1 Graduate Program in Bioinformatics and Medical Informatics, San Diego State University, San Diego, CA, USA.
2 Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA.
3 Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, USA.
4 Department of Computer Science and BioFrontiers Institute, University of Colorado at Boulder, CO, USA.
5 Department of Pediatrics, and Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA.
6 Department of Biology, San Diego State University, San Diego, CA, USA.

Corresponding Author
Scott T. Kelley
Department of Biology
5500 Campanile Drive
San Diego State University
San Diego, CA, USA, 92182-4614
Phone: +1 619 206 8014
Email: skelley@mail.sdsu.edu

Abstract
ghost-tree is a bioinformatics tool that integrates sequence data from two genetic markers into a single phylogenetic tree that can be used for diversity analyses. Our approach uses one genetic marker whose sequences can be aligned across organisms spanning divergent taxonomic groups (e.g., fungal families) as a foundation phylogeny. A second, more rapidly evolving genetic marker is then used to build extension phylogenies for more closely related organisms (e.g., fungal species or strains) that are then grafted on to the foundation tree by mapping taxonomic names. We apply ghost-tree to graft fungal extension phylogenies derived from ITS sequences onto a foundation phylogeny derived from fungal 18S sequences. The result is a phylogenetic tree, compatible with the commonly used UNITE fungal database, that supports phylogenetic diversity analysis (e.g., UniFrac) of fungal communities profiled using ITS markers.

Availability: ghost-tree is pip-installable. All source code, documentation, and test code are available under the BSD license at https://github.com/jtfouquier/ghost-tree.

Introduction
In recent years, next-generation sequencing and bioinformatics methods have rapidly expanded our knowledge of the microbial world by enabling marker gene surveys of bacterial, archaeal and eukaryotic microbial communities. Phylogenetic diversity metrics such as Phylogenetic Diversity (PD) (Faith, 1992) and UniFrac (Lozupone and Knight, 2005) have improved resolution of community differences relative to their non-phylogenetic analogs, which were mostly developed for studying communities of macroorganisms (e.g., Chao1 and Bray-Curtis dissimilarity).

An ideal marker gene has highly conserved regions that facilitate development of PCR primers and multiple sequence alignment, and highly variable regions that support detailed taxonomic resolution. For some taxonomic groups, such as the fungi, an ideal marker gene does not exist. The small subunit ribosomal RNA (SSU) is highly conserved in the fungi, and therefore sequences from different species are too similar for detailed taxonomic assignment. The Internal Transcribed Spacer (ITS), a non-coding region found between the ribosomal genes, has thus become popular for achieving high-resolution taxonomic profiles of fungal communities. This marker is so poorly conserved, however, that no meaningful alignment is possible across distant fungal lineages, which makes phylogenetic diversity analyses impossible.

Here we present ghost-tree, an open-source bioinformatics software tool for creating phylogenetic trees using multiple genetic loci. Sequences from a more conserved locus are aligned across distant taxonomic lineages to build a foundation tree, and sequences from a less conserved locus are aligned for many groups of closely related taxa to create extension trees that are then grafted onto the foundation tree.
We apply ghost-tree to build foundation trees from the fungal 18S rRNA and extension trees from fungal ITS, to create a single ghost-tree that can be used in phylogenetic diversity analyses of fungal communities. There are thus two outcomes of this project: a tree for use with popular community analysis tools such as QIIME, and a software package for developing phylogenetic trees for other sets of marker genes.

Methods
ghost-tree takes as input (1) the Foundation Alignment (for example, the Silva 18S alignment), in which sequences are annotated with taxonomy; (2) the Extension Sequence Collection (for example, unaligned ITS sequences from the UNITE database); and (3) a taxonomy map, which contains taxonomic annotations of the sequences in (2). The Foundation Alignment is filtered to remove highly gapped and high-entropy positions, and FastTree is used to build a phylogenetic tree from the resulting filtered alignment. This is the Foundation Tree.

In parallel, the Extension Sequence Collection is clustered with SUMACLUST (http://metabarcoding.org/sumatra), resulting in an operational taxonomic unit (OTU) map that groups sequences into OTUs by percent identity. For each Extension Sequence OTU a consensus taxonomy is determined, and OTUs with the same consensus taxonomy are further grouped into a single OTU. The sequences in each OTU are then aligned, and FastTree is applied to build an Extension Tree. The OTU's consensus taxonomy is associated with the root of the Extension Tree. The taxonomy at the root of each Extension Tree is then used to graft that tree onto the tip of the Foundation Tree with the same taxonomy, resulting in the Ghost Tree.

We applied ghost-tree to build a phylogenetic tree from Silva (Ver. SSU 119.1) 18S sequences (the foundation) (Pruesse et al., 2007) and UNITE (Ver. 12_11_otus) ITS sequences (Kõljalg et al., 2013). This tree is available in the GitHub repository, and the ghost-tree workflow is illustrated in Supplementary Figure S1.
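
The grouping and grafting steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the ghost-tree implementation: the `Node` class, the function names, the majority-rule consensus, and the toy data are all assumptions made for this example, and the star-shaped extension trees stand in for trees that the real workflow builds by aligning each OTU and running FastTree.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal tree node; `name` is a taxonomy label or sequence ID."""
    name: Optional[str] = None
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def tips(self):
        """Return all leaf nodes below (or equal to) this node."""
        if not self.children:
            return [self]
        return [t for c in self.children for t in c.tips()]

    def to_newick(self) -> str:
        """Serialize the subtree as a Newick fragment (no trailing semicolon)."""
        label = self.name or ""
        if self.children:
            inner = ",".join(c.to_newick() for c in self.children)
            return f"({inner}){label}:{self.length}"
        return f"{label}:{self.length}"


def consensus_taxonomy(seq_ids, taxonomy_map):
    """Most common taxonomy among an OTU's sequences.

    Assumption: a simple majority rule; the tool's actual rule may differ.
    """
    counts = Counter(taxonomy_map[s] for s in seq_ids)
    return counts.most_common(1)[0][0]


def group_otus_by_taxonomy(otu_map, taxonomy_map):
    """Merge OTUs that share a consensus taxonomy into a single group."""
    groups = {}
    for seq_ids in otu_map.values():
        taxon = consensus_taxonomy(seq_ids, taxonomy_map)
        groups.setdefault(taxon, []).extend(seq_ids)
    return groups


def graft(node, extensions):
    """Replace each foundation tip whose name matches an extension root.

    The grafted extension tree inherits the branch length of the tip it replaces.
    """
    for i, child in enumerate(node.children):
        if not child.children and child.name in extensions:
            ext = extensions[child.name]
            ext.length = child.length
            node.children[i] = ext
        else:
            graft(child, extensions)
    return node


# Toy inputs (invented for illustration only).
taxonomy_map = {"its1": "Candida", "its2": "Candida",
                "its3": "Candida", "its4": "Malassezia"}
otu_map = {"otu0": ["its1", "its2"], "otu1": ["its3"], "otu2": ["its4"]}

groups = group_otus_by_taxonomy(otu_map, taxonomy_map)

# Placeholder extension trees: one star tree per taxon, rooted at the taxon name.
extensions = {taxon: Node(taxon, children=[Node(s, 0.1) for s in ids])
              for taxon, ids in groups.items()}

# Foundation tree whose tips are labeled with taxonomy.
foundation = Node(children=[Node("Candida", 0.3), Node("Malassezia", 0.5)])

ghost = graft(foundation, extensions)
print(ghost.to_newick() + ";")
```

After grafting, the tips of the resulting tree are the ITS sequence IDs, while the deeper branching order comes from the foundation tree. This is what allows ITS-profiled communities to be analyzed with tree-based metrics.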

ghost-tree is hosted on GitHub under the BSD open source software license. It is implemented in Python, using scikit-bio and Click, and adheres to the PEP8 Python style guide. ghost-tree is subject to continuous integration testing using Travis CI which, on each pull request, runs unit tests with nose, monitors code style using flake8, and monitors test coverage with Coveralls.

Results
To evaluate whether ghost-tree supports improved resolution in studies of fungal community analysis, we compiled two real-world ITS sequence datasets: one collection of human saliva samples (Ghannoum et al., 2010) and one of surfaces in public restrooms (Fouquier et al., in review). These communities are expected to differ substantially from one another (large effect size). We next simulated 10 samples based on each of the samples from these studies using QIIME's simsam.py workflow. This generates sample replicates that are phylogenetically similar to each other. We therefore consider the grouping of each set of simulated samples to be a small effect size.

We computed distances between all samples using three approaches: binary Jaccard, a qualitative non-phylogenetic diversity metric for which no tree is required; unweighted UniFrac, a qualitative phylogenetic metric of beta diversity, with a tree generated using MUSCLE and FastTree; and unweighted UniFrac with a tree generated with ghost-tree. Figure 1 contains principal coordinates analysis of these data based on binary Jaccard distances (Fig. 1a), UniFrac/ITS distances (Fig. 1b) and UniFrac/ghost-tree distances (Fig. 1c). UniFrac/ghost-tree resolves the small and the large effect sizes (ANOSIM R=0.38 and R=0.48, respectively), while binary Jaccard (R=0.02 and R=0.04, respectively) and unweighted UniFrac/ITS (R=0.19 and R=0.08, respectively) do not.

Summary
Widely used bacterial sequence databases such as GreenGenes (DeSantis et al., 2006) annotate 16S rRNA gene data, providing a reference tree and taxonomy for bacterial and archaeal community analysis.
ghost-tree facilitates development of phylogenetic trees that can be used in a similar way for marker genes that are less conducive to phylogenetic reconstruction at a global scale. We show that, as with bacterial community analysis, incorporating phylogeny in diversity metrics is useful for resolving differences in fungal communities. The Silva/UNITE-based ghost tree presented here integrates into a user's existing fungal analysis pipelines, and the ghost-tree software package can be used to develop phylogenetic trees for other marker gene sets that provide different taxonomic resolution, or for bridging genome trees with amplicon trees.

Acknowledgements
This work was supported in part by a grant from the Alfred P. Sloan Foundation.

Figure Legends
Figure 1. Principal Coordinates comparing samples based on (a) Binary Jaccard distances, (b) UniFrac distances where trees are computed by aligning ITS sequences using MUSCLE, and (c) UniFrac distances where trees are computed using ghost-tree. Blue points are simulated and real human saliva samples, and red points are simulated and real restroom surface samples. Plots were made using EMPeror software (Vázquez-Baeza et al., 2013).

Suppl. Figure S1. ghost-tree workflow diagram.
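
To make the difference between the metrics compared in Figure 1 concrete, the following sketch computes binary Jaccard and unweighted UniFrac distances on a toy tree. It is an illustrative pure-Python version written for this example, not the QIIME or scikit-bio code used in the study. The tree encoding (a dictionary mapping each node to its parent and branch length), the function names, and the sample contents are all assumptions.

```python
def binary_jaccard(a, b):
    """One minus (shared OTUs / total OTUs) on presence/absence sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def branches_to_root(tips, parent):
    """Nodes whose parent branch lies on a path from an observed tip to the root."""
    covered = set()
    for tip in tips:
        node = tip
        # Stop at the root (no parent entry) or once the path is already covered.
        while node in parent and node not in covered:
            covered.add(node)
            node = parent[node][0]
    return covered


def unweighted_unifrac(a, b, parent):
    """Branch length unique to one sample divided by branch length in either."""
    br_a = branches_to_root(a, parent)
    br_b = branches_to_root(b, parent)
    total = sum(parent[n][1] for n in br_a | br_b)
    unique = sum(parent[n][1] for n in br_a ^ br_b)
    return unique / total if total else 0.0


# Toy tree: node -> (parent, branch length). "root" has no entry.
# t1 and t2 are close relatives under clade A; t3 sits in the distant clade B.
parent = {
    "A": ("root", 1.0),
    "B": ("root", 1.0),
    "t1": ("A", 0.1),
    "t2": ("A", 0.1),
    "t3": ("B", 0.1),
}

s1, s2, s3 = {"t1"}, {"t2"}, {"t3"}

# No OTU is shared between any pair, so Jaccard cannot rank the pairs.
jac_close = binary_jaccard(s1, s2)
jac_far = binary_jaccard(s1, s3)

# UniFrac uses the tree, so the closely related pair scores as more similar.
uni_close = unweighted_unifrac(s1, s2, parent)
uni_far = unweighted_unifrac(s1, s3, parent)

print(jac_close, jac_far, uni_close, uni_far)
```

In this toy case both Jaccard distances are maximal, while UniFrac separates the closely related pair from the distant one. That is the kind of additional resolution a usable tree is meant to provide.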

References
DeSantis,T.Z., et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microb., 72, 5069-5072.
Faith,D.P. (1992) Conservation Evaluation and Phylogenetic Diversity. Biol. Conserv., 61, 1-10.
Ghannoum,M.A., et al. (2010) Characterization of the oral fungal microbiome (mycobiome) in healthy individuals. PLoS Pathog., 6, e1000713.
Kõljalg,U., et al. (2013) Towards a unified paradigm for sequence-based identification of Fungi. Mol. Ecol., 22, 5271-5277.
Lozupone,C. and Knight,R. (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Appl. Environ. Microb., 71, 8228-8235.
Pruesse,E., et al. (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res., 35, 7188-7196.
Quast,C., et al. (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res., 41, D590-596.
Vázquez-Baeza,E., et al. (2013) EMPeror: a tool for visualizing high-throughput microbial community data. GigaScience, 2:16. doi:10.1186/2047-217x-2-16

Figure 1

Supplemental Figure S1