Centrifuge: rapid and sensitive classification of metagenomic sequences

Similar documents
1. HyperLogLog algorithm

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Detailed overview of the primer-free full-length SSU rrna library preparation.

1 Abstract. 2 Introduction. 3 Requirements. 4 Procedure

Comparing whole genomes

Comparative genomics: Overview & Tools + MUMmer algorithm

MiGA: The Microbial Genome Atlas

Taxonomical Classification using:

Synteny Portal Documentation

SUPPLEMENTARY INFORMATION

Robert Edgar. Independent scientist

Microbiome: 16S rrna Sequencing 3/30/2018

ST-Links. SpatialKit. Version 3.0.x. For ArcMap. ArcMap Extension for Directly Connecting to Spatial Databases. ST-Links Corporation.

Open a Word document to record answers to any italicized questions. You will the final document to me at

SUPPLEMENTARY INFORMATION

Assigning Taxonomy to Marker Genes. Susan Huse Brown University August 7, 2014

Prac%cal Bioinforma%cs for Life Scien%sts. Week 14, Lecture 28. István Albert Bioinforma%cs Consul%ng Center Penn State

express: Streaming read deconvolution and abundance estimation applied to RNA-Seq

Amplicon Sequencing. Dr. Orla O Sullivan SIRG Research Fellow Teagasc

This document describes the process by which operons are predicted for genes within the BioHealthBase database.

Supplementary Information. Title: Accurate binning of metagenomic contigs via automated clustering sequences

A Bayesian taxonomic classification method for 16S rrna gene sequences with improved species-level accuracy

Bioinformatics Exercises

Taxonomy and Clustering of SSU rrna Tags. Susan Huse Josephine Bay Paul Center August 5, 2013

GenomeBlast: a Web Tool for Small Genome Comparison

An Automated Phylogenetic Tree-Based Small Subunit rrna Taxonomy and Alignment Pipeline (STAP)

Comparison of Three Fugal ITS Reference Sets. Qiong Wang and Jim R. Cole

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Techniques for generating phylogenomic data matrices: transcriptomics vs genomics. Rosa Fernández & Marina Marcet-Houben

SUPPLEMENTARY INFORMATION

BIOINFORMATICS LAB AP BIOLOGY

Using Bioinformatics to Study Evolutionary Relationships Instructions

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Heuristic Alignment and Searching

CLASSIFICATION UNIT GUIDE DUE WEDNESDAY 3/1

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Emily Blanton Phylogeny Lab Report May 2009

Biology 2.1 Taxonomy: Domain, Kingdom, Phylum. ICan2Ed.com

Reporting from DIMA Preparing for Data Analysis

Biology 160 Cell Lab. Name Lab Section: 1:00pm 3:00 pm. Student Learning Outcomes:

Compilation of GIS data for the Lower Brazos River basin

LEC1: Instance-based classifiers

Supplementary File 4: Methods Summary and Supplementary Methods

Tree Building Activity

GTRAC FAST R ETRIEVAL FROM C OMPRESSED C OLLECTIONS OF G ENOMIC VARIANTS. Kedar Tatwawadi Mikel Hernaez Idoia Ochoa Tsachy Weissman

Dynamic optimisation identifies optimal programs for pathway regulation in prokaryotes. - Supplementary Information -

OECD QSAR Toolbox v.3.4. Example for predicting Repeated dose toxicity of 2,3-dimethylaniline

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

The Evolution of DNA Uptake Sequences in Neisseria Genus from Chromobacteriumviolaceum. Cory Garnett. Introduction

/home/thierry/columbia/msongsdb/pyreport_tutorials/tutorial1/tutorial1.py January 23, 20111

Introduction to microbiota data analysis

Bioinformatics Chapter 1. Introduction

Please click the link below to view the YouTube video offering guidance to purchasers:

Depth distributions of alkaline phosphatase and phosphonate utilization genes in the North Pacific Subtropical Gyre

Microbial analysis with STAMP

Data Structures & Database Queries in GIS

A Repetitive Corpus Testbed

Chapter 17. Table of Contents. Objectives. Taxonomy. Classifying Organisms. Section 1 Biodiversity. Section 2 Systematics

Introduction to Sequence Alignment. Manpreet S. Katari

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

Draft document version 0.6; ClustalX version 2.1(PC), (Mac); NJplot version 2.3; 3/26/2012

PHYLOGENY AND SYSTEMATICS

Lecture 2. The Blast2GO annotation framework

Diversity, Productivity and Stability of an Industrial Microbial Ecosystem

Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses

Life in an unusual intracellular niche a bacterial symbiont infecting the nucleus of amoebae

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

Manual Railway Industry Substance List. Version: March 2011

Taxonomic Classification Note Taking Guide Keys Answers READ ONLINE

Bioinformatics I, WS 16/17, D. Huson, January 30,

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Microbial Taxonomy. Slowly evolving molecules (e.g., rrna) used for large-scale structure; "fast- clock" molecules for fine-structure.

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

Instructions for using N-CAST

Chapter 6 Meiosis And Mendel Painfreelutions

Bioinformatics for Biologists

Mutual Information & Genotype-Phenotype Association. Norman MacDonald January 31, 2011 CSCI 4181/6802

New SPOT Program. Customer Tutorial. Tim Barry Fire Weather Program Leader National Weather Service Tallahassee

Supplemental Materials

User Guide for Hermir version 0.9: Toolbox for Synthesis of Multivariate Stationary Gaussian and non-gaussian Series

Hot Spot / Kernel Density Analysis: Calculating the Change in Uganda Conflict Zones

Introduction to the SNP/ND concept - Phylogeny on WGS data

Hot Spot / Point Density Analysis: Kernel Smoothing

You w i ll f ol l ow these st eps : Before opening files, the S c e n e panel is active.

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

OECD QSAR Toolbox v.3.4. Step-by-step example of how to build and evaluate a category based on mechanism of action with protein and DNA binding

Supplementary Information

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Other resources. Greengenes (bacterial) Silva (bacteria, archaeal and eukarya)

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula

SPECIES OF ARCHAEA ARE MORE CLOSELY RELATED TO EUKARYOTES THAN ARE SPECIES OF PROKARYOTES.

Fundamentals of Computational Science

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Supplementary Materials for R3P-Loc Web-server

Transcription:

Centrifuge: rapid and sensitive classification of metagenomic sequences Daehwan Kim, Li Song, Florian P. Breitwieser, and Steven L. Salzberg Supplementary Material Supplementary Table 1 Supplementary Note References Centrifuge alternative setting comparisons

Supplementary Table 1. Centrifuge alternative setting comparisons Settings Genus sensitivity Genus precision Speed (reads/min) Memory usage (GB of RAM) Default 93.1 99.5 563,380 4.2 Index based on uncompressed sequences 93.7 99.5 369,231 6.9 FM index (with higher memory requirements) 93.1 99.5 943,396 9.6

Supplementary Note Optimization of FM-index to address the classification problem In contrast to alignment where mapped locations are reported (e.g. the 26,987,299 th base on chromosome 14), classifiers need only to label a read with a name at the genome level, species level, or higher taxonomic rank. Instead of using 4 or 8 bytes to store an exact location, we just use two bytes to store the genome or species ID. This decreases the index size by >20%. We also developed a parallelized version of FM index building, which allowed us to build our Centrifuge index for the NCBI nt database in only two days using 12 CPU cores, a process that previously required over two weeks. Evaluation of Centrifuge using two alternative settings Based on the same Mason simulated reads used in the main text, we also ran Centrifuge to perform additional evaluations by changing the settings, switching from compressed to uncompressed bacterial sequences, and also changing the FM index parameters to use a larger internal table as well as storing one genome ID per four bases (instead of per 16 bases), which together enables fast access in the FM index and reduces the number of operations in the FM index. In the first case, sensitivity and precision using the uncompressed sequences remained nearly the same compared to compressed sequences; i.e., compression did not affect sensitivity. In the second case, with an FM index 2.3 times larger than the default mode, processing speed increased by 67%, which is nearly as fast as Kraken. Building the Centrifuge index As part of the Centrifuge package available at the Centrifuge website (http://ccb.jhu.edu/software/centrifuge), we provide scripts for downloading the microbial reference genomes, contaminants and the human genome, and for building the database. The microbial genome sequences are post-processed with NCBI s dustmasker to mask low-complexity regions [1]. Low-complexity regions otherwise can lead to spurious matches. Furthermore the database includes the latest release of the UniVec database (ftp://ftp.ncbi.nlm.nih.gov/pub/univec), the EmVec database (ftp://ftp.ebi.ac.uk/pub/databases/emvec), and other artificial sequences. Simulating reads using Mason (version: 0.1) The 4,078 bacterial, 200 archaeal, human, and viral genomes are available for download at http://ccb.jhu.edu/software/centrifuge/downloads/centrifuge-suppl/data.tar.gz The compressed file, data.tar.gz, includes the following files: seqid+gid_to_taxid.map: the map of the sequence id and gid to taxonomy id and the following folders: taxonomy: taxonomy information such as taxonomy tree and genome names bacteria+archaea: the sequence we used to build the index for Centrifuge and other tools bacteria_sim10k: the simulated data set and its truth assignment. Though the name is simplified for brevity, the bacteria_sim10k.fa also contains reads from both bacteria and archaea.

1. Build the reference genome sim_ref.fa from which we simulate reads by concatenating the *.fna files in the folder bacteria+archaea/bacteria and bacteria+archaea/archaea. 2. Run Mason to simulate 20 million 100-bp reads with a per-base error rate of 3%: /path/to/mason illumina -i -n 100 -N 20000000 -pi 0.005 -pd 0.005 -pmm 0.02 -pmmb 0.01 -pmme 0.06 -o bacteria_sim20m.fa sim_ref.fa 3. Remove the reads from the genomes whose taxonomic id has no corresponding species or genus unit, which results in 19,851,369 reads left. 4. Randomly sample 10 million reads without replacement from the reads retained in the above step. This resulting file, bacteria_sim10m.fa, is used for the speed tests of Centrifuge and Kraken. 5. Randomly sample 10 thousand reads without replacement from bacteria_sim10m.fa to bacteria_sim10k.fa. This file is used for the accuracy tests of Centrifuge, Kraken, and Megablast. Running Centrifuge, Kraken, and Megablast Centrifuge (version: 1.0.1-beta): perl /path/to/centrifuge-compress.pl B_only taxonomy/ -map seqid_to_taxid.map o centrifuge_db /path/to/centrifuge-build p 4 conversion-table centrifuge_db.map taxonomy-tree taxonomy/nodes.dmp name-table taxonomy/names.dmp centrifuge_db.fa centrifuge_db /path/to/centrifuge --no-abundance -f -x centrifuge_db U bacteria_sim10k.fa Kraken (version: 0.10.5-beta): /path/to/kraken-build --build --db=kraken_db /path/to/kraken --db kraken_db/ --fasta-input bacteria_10k.fa Megablast (version: 2.2.30+): We concatenate all the *.fa and *.fna files in the bacteria+archaea folder and store the sequences to bacteria+archaea_dustmask.fa after running dustmask. /path/to/makeblastdb in bacteria+archaea_dustmask.fa dbtype nucl out megablast.db /path/to/makembindex input bacteria+archaea_dustmask.fa output megablast.db /path/to/blastn -query bacteria_sim10k.fa db megblast.db -index_name megablast.db - out megablast.out Downloading whole-genome sequencing data sets of 532 isolated bacterial genomes in the Short Read Archive On the NCBI website at http://www.ncbi.nlm.nih.gov, We used the following query (shown in blue) in the search field, and clicked Search button: ("2015/1/1"[PDAT] : "2015/9/23"[PDAT]) AND ("Bacteria"[Organism] OR "Bacteria"[Organism] OR "bacteria"[all Fields]) AND ("biomol dna"[properties] AND "strategy wgs"[properties]) On the result webpage, we followed the SRA link under Genomes. In the list of search results, we clicked Send to:, chose File and format Runinfo to create and download a file providing a summary of the experiments. Then we selected the experiments using HiSeq and MiSeq technology with spots (reads) between 200,000 and 2,000,000.

References 1. Morgulis A, Gertz EM, Schaffer AA, Agarwala R: A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol 2006, 13:1028-1040.