Assigning Taxonomy to Marker Genes. Susan Huse Brown University August 7, 2014

In a nutshell
Taxonomy is assigned by comparing your DNA sequences against a database of DNA sequences from known taxa.

Marker Genes
Bacteria: 16S (SSU rRNA)
Archaea: 16S (SSU rRNA)
Protist: 18S (SSU rRNA)
Fungi: ITS, LSU rRNA

Taxonomic Consistency
For meaningful community comparisons, taxonomic names must be consistent:
Names: one organism should have only one name
Levels: automated comparisons need consistent levels: Kingdom;Phylum;Class;Order;Family;Genus;Species
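One way to enforce consistent levels is to pad every lineage string to the same seven ranks before comparing samples. The sketch below is illustrative only: the function name `normalize_lineage` and the `"NA"` filler are assumptions, not part of any tool named in these slides.

```python
# Ranks in the order given on the slide.
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]

def normalize_lineage(lineage, filler="NA"):
    """Split a ';'-delimited lineage and pad it to all seven ranks.

    `filler` is a hypothetical placeholder for ranks that were not assigned.
    """
    names = [name.strip() for name in lineage.split(";") if name.strip()]
    if len(names) > len(RANKS):
        raise ValueError("lineage has more levels than the seven expected ranks")
    names += [filler] * (len(RANKS) - len(names))
    return dict(zip(RANKS, names))

example = normalize_lineage("Bacteria;Firmicutes;Clostridia")
print(example)
```

Lineages with extra sub-levels (subclass, suborder, and so on) would need to be collapsed to these seven ranks first; the sketch simply rejects them.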

Official Taxonomic Names
Bergey's Taxonomic Outline: manual of taxonomic names for bacteria
List of Prokaryotic names with Standing in Nomenclature (LPSN): vetting process
NCBI: similar taxonomy, but with multiple sub-levels (subclass, suborder, subfamily, tribe)
Fungi: UNITE
Archaea: also a work in progress

Primary Methods of Taxonomic Assignment
Marker genes: RDP, GAST, BLAST
Shotgun metagenomic: MetaPhlAn, AMPHORA

Ribosomal Database Project
Uses Bayesian k-mer matching
Wang et al. (2007) Appl Environ Microbiol 73(16):5261-7
http://rdp.cme.msu.edu

k-mer indexing
TGCTAGCTAGTGACATCCACAGAACTTTCCA
GAGATGGATTGGTGCCTTCGGGAACTGTGA
Ex. 4-mer indexing (counts over the first 13 nt, TGCTAGCTAGTGA):
TGCT 1
GCTA 2
CTAG 2
TAGC 1
AGCT 1
TAGT 1
AGTG 1
GTGA 1
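The 4-mer counts in this example can be reproduced with a few lines of Python. This is a minimal sketch of overlapping k-mer counting, not the RDP implementation; the function name `kmer_counts` is made up for illustration.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# First 13 nt of the example sequence on the slide.
counts = kmer_counts("TGCTAGCTAGTGA", k=4)
for kmer, n in counts.items():
    print(kmer, n)
```

A 13-nt sequence yields 10 overlapping 4-mers; GCTA and CTAG each occur twice, so only 8 distinct words are indexed.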

RDP Web Results
Reports:
  Taxonomy at each level
  Bootstrap value at each level (min 80% by default)
Levels: Domain, Phylum, Class, Order, Family, Genus
Example: Root[100%] Bacteria[100%] Proteobacteria[100%] Alphaproteobacteria[100%] Rhodospirillales[100%] Rhodospirillaceae[100%] Dongia[99%]

Bootstrap Confidence Estimation
The number of times a genus was selected out of 100 bootstrap trials was used as an estimate of confidence in the assignment to that genus.
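The bootstrap idea can be sketched with a toy k-mer classifier: classify the full query once, then re-classify random subsamples of its words and count how often the same genus wins. Everything below is a simplified illustration, not the RDP Classifier itself; the word size (k=4), the 0.5 pseudocount, the one-eighth subsample size, and the two toy reference sequences are all assumptions of this sketch.

```python
import math
import random

def kmers(seq, k):
    """List every overlapping k-mer in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(refs, k):
    """refs: {genus: [sequences]} -> {genus: (word -> n_seqs containing it, n_seqs)}."""
    model = {}
    for genus, seqs in refs.items():
        word_counts = {}
        for s in seqs:
            for w in set(kmers(s, k)):
                word_counts[w] = word_counts.get(w, 0) + 1
        model[genus] = (word_counts, len(seqs))
    return model

def best_genus(words, model):
    """Pick the genus with the highest summed log-probability of the words."""
    def score(genus):
        word_counts, n = model[genus]
        # Assumed smoothing: 0.5 pseudocount so unseen words do not give log(0).
        return sum(math.log((word_counts.get(w, 0) + 0.5) / (n + 1)) for w in words)
    return max(sorted(model), key=score)

def bootstrap_confidence(query, model, k=4, trials=100, seed=0):
    """Return (assigned genus, number of trials out of `trials` that agree)."""
    rng = random.Random(seed)
    words = kmers(query, k)
    assigned = best_genus(words, model)
    sample_size = max(1, len(words) // 8)  # assumed subsample size
    agree = 0
    for _ in range(trials):
        sample = [rng.choice(words) for _ in range(sample_size)]
        if best_genus(sample, model) == assigned:
            agree += 1
    return assigned, agree

# Hypothetical toy references, one sequence per genus.
toy_refs = {
    "GenusA": ["TGGTCTTGACATCCACAGAT"],
    "GenusB": ["ACGTACGTTTGACCGGAATC"],
}
toy_model = train(toy_refs, k=4)
genus, support = bootstrap_confidence("TGGTCTTGACATCCACAGAT", toy_model)
print(genus, support)
```

With a default cutoff of 80%, an assignment like this would be reported only if `support` were at least 80 of the 100 trials.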

Bootstrap Cutoff Values
Region  Measure                         0%      50%     80%
V3      % classified to genus           100%    92.4%   82.3%
V3      % classified to correct genus   92.0%   95.0%   98.1%
V4      % classified to genus           100%    97%     87.9%
V4      % classified to correct genus   92.8%   94.5%   95.7%
V6      % classified to genus           100%    73.5%   40.4%
V6      % classified to correct genus   79.0%   96.5%   98.7%
Based on 7,028 human gut sequences, RDP-classified at full length, then cut and reclassified.

Incremental Accuracy
V3                              0%      50%     80%
% classified to genus           100%    92.4%   82.3%
% classified to correct genus   92.0%   95.0%   98.1%
Assume 1000 reads:
At 80%:
  823 are classified to genus
  807 are identified correctly (823 × 0.981 ≈ 807)
  16 are identified incorrectly (823 - 807 = 16)
At 50%:
  924 are classified to genus (101 more)
  878 are correct (924 × 0.95 ≈ 878) (71 more)
  46 are incorrect (30 more)
About 30% of the added reads are incorrect (30/101).
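The arithmetic for the V3 region can be checked with a short script. This is only a worked calculation of the numbers on the slide; the function name `genus_calls` is made up for illustration.

```python
def genus_calls(n_reads, frac_classified, frac_correct):
    """Return (classified, correct, incorrect) read counts at one cutoff."""
    classified = round(n_reads * frac_classified)
    correct = round(classified * frac_correct)
    return classified, correct, classified - correct

at80 = genus_calls(1000, 0.823, 0.981)  # V3 at the 80% bootstrap cutoff
at50 = genus_calls(1000, 0.924, 0.950)  # V3 at the 50% bootstrap cutoff

added = at50[0] - at80[0]        # extra reads classified by lowering the cutoff
added_wrong = at50[2] - at80[2]  # how many of those extra reads are wrong
print(at80, at50, added, added_wrong, round(added_wrong / added, 2))
```

Lowering the cutoff from 80% to 50% classifies more reads, but the error rate among the newly classified reads is much higher than among the reads that already passed at 80%.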

GAST: Global Assignment of Sequence Taxonomy
GAST uses direct comparison to a reference database to assign taxonomy.
http://vamps.mbl.edu/resources/software.php
Huse et al. (2008) PLoS Genetics 4: e1000255

Reference Matching
Each sequence is matched to the nearest sequence in a reference database.
Uses global alignment (USEARCH) rather than local alignment (BLAST).
Distance is defined by the number of mismatches along the sequence alignment.
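As a rough illustration of a mismatch-based distance, the sketch below compares two sequences that are assumed to be already globally aligned to the same length, with no gaps. It does not perform the alignment itself, and the function name `mismatch_distance` and the two 20-nt toy sequences are illustrative, not part of GAST.

```python
def mismatch_distance(query, ref):
    """Fraction of mismatched positions between two pre-aligned sequences."""
    if len(query) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(q != r for q, r in zip(query, ref))
    return mismatches / len(query)

query = "TGGTCTTGACATCCACAGAT"
exact_ref = "TGGTCTTGACATCCACAGAT"  # identical to the query
near_ref = "TGGACTTGACATCCACAGAT"   # differs from the query at one position

print(mismatch_distance(query, exact_ref))
print(mismatch_distance(query, near_ref))
```

One mismatch over 20 aligned positions gives a distance of 1/20 = 0.05; a real pipeline would obtain the alignment from a global aligner before counting mismatches.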

Assigned Taxon Name
GAST uses a lowest common taxon method to assign taxonomy when a read is equidistant to more than one reference sequence.
Assignment requires a minimum of 66% concurrence (two-thirds majority).

GAST Flowchart
High-quality 16S tags + RefDB (e.g., RefV3V4, RefSSU) -> nearest RefSeq(s) and GAST distance -> consensus taxonomy (two-thirds majority)
Using tags cut to match primers is much more efficient.

GAST
R1: TGGTCTTGACATCCACAGAT
Q:  TGGTCTTGACATCCACAGAT
Query exactly matches 1 reference
RefID: R1
Distance: 0.0

GAST
R1: TGGACTTGACATCCACAGAT
Q:  TGGTCTTGACATCCACAGAT
    TGGTCTTGACATCGACAGAT
R2: TGGTCTTGGCATCTACAGAT
Query inexactly matches 2 references with two mismatches each
RefID: R1, R2
Distance = 1 / 20 = 0.05

Consensus Calculation
If the query equally matches multiple references:
  Firmicutes; Clostridia; Clostridiales; Lachnospiraceae (1 hit)
  Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium (4 hits)
  Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; perfringens (5 hits)
1. Calculate community size: 1 + 4 + 5 = 10
2. Lowest rank = species
3. Calculate species voting: 5 / 10 = 50% (less than 66%)
4. Next rank = genus
5. Calculate genus voting: (4 + 5) / 10 = 90%
6. Assign taxonomy to genus level
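The voting steps above can be sketched as a short function that walks from the deepest rank upward until one taxon holds at least a two-thirds share of the hits. This is an illustrative reading of the slide, not the GAST source code; the function name `consensus_taxonomy` and the input layout are assumptions.

```python
from collections import Counter

def consensus_taxonomy(hits, min_share=2 / 3):
    """hits: list of (lineage tuple, number of equally near references).

    Returns the deepest lineage prefix supported by at least `min_share`
    of all hits, or an empty tuple if no rank reaches that share.
    """
    total = sum(n for _, n in hits)
    max_depth = max(len(lineage) for lineage, _ in hits)
    for depth in range(max_depth, 0, -1):
        votes = Counter()
        for lineage, n in hits:
            # References not classified this deep cast no vote at this rank,
            # but still count toward the total.
            if len(lineage) >= depth:
                votes[lineage[:depth]] += n
        taxon, n_votes = votes.most_common(1)[0]
        if n_votes / total >= min_share:
            return taxon
    return ()

hits = [
    (("Firmicutes", "Clostridia", "Clostridiales", "Lachnospiraceae"), 1),
    (("Firmicutes", "Clostridia", "Clostridiales", "Clostridiaceae", "Clostridium"), 4),
    (("Firmicutes", "Clostridia", "Clostridiales", "Clostridiaceae", "Clostridium",
      "perfringens"), 5),
]
assigned = consensus_taxonomy(hits)
print(";".join(assigned))
```

For the slide's example, the species vote is 5 of 10 and fails the threshold, while the genus vote is 9 of 10, so the read is assigned to Clostridium.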

GAST Distance
GAST does not report a bootstrap value or confidence interval; it reports the distance to the nearest reference sequence.
If distance = 0, accuracy is good.
If distance = 0.05 (5%), the read is likely from the same family and maybe the same genus, but not the same species.

BLAST
The top BLAST hit can lead to unexpected results:
  hits to unclassified sequences
  hits to sources other than microbial SSU rRNA
  local alignments can be misleading
Supervised BLAST is an excellent tool.
Unsupervised BLAST can be dangerous.

Methods Comparison
RDP:
  considered the standard
  available via the RDP website, mothur, QIIME, or a local install
  works better on longer sequences, not as well on shorter sequences
GAST:
  works well for shorter sequences
  can go to species, depending on the reference database
BLAST to nt:
  often returns incorrect taxonomy
  not good for pipelines
  good for checking individual results

16S regions give similar results at the genus level
[Figure: genus-level composition of human gut microbiota from three datasets]
V3 (250 nt): N = 299,044; Other = 99
V6 (60 nt): N = 322,971; Other = 82
Full-length (1000 nt): N = 5,519; Other = 26

Reference Database Considerations
1. Size of the database: does it contain reference sequences similar to your data?
2. Taxonomy of the database: are the references classified to genus or species?
3. Quality of the database: are there chimeras or low-quality sequences?

Example Reference Databases
1. SILVA database (SSU, LSU): http://www.arb-silva.de/
2. RDP training set (SSU, ITS): http://rdp.cme.msu.edu/index.jsp
3. Greengenes: http://greengenes.lbl.gov/cgi-bin/nphindex.cgi

Specialized Databases
HOMD (Human Oral Microbiome Database): http://www.homd.org/
OSU CORE (oral): http://microbiome.osu.edu/
UNITE: http://unite.ut.ee/

Which Hypervariable Region?
Two regions are better than one: more information
Different regions have different specificity at the genus or species level
Different primer sets can have different biases
Depends on your samples

16S Specificity in SILVA
RefHVR sequences mapping        V3-V5   V6-V4
Unique taxon                    97.1%   97.5%
  If genus, unique genus        99.7%   99.8%
  If species, unique species    93.1%   93.6%
2 taxa (ambiguous)              2.3%    2.0%
2 lineages                      0.6%    0.5%
Unique lineage mapping          99%     99%
Assigning taxonomy is only as good as your reference.

Sources of Error in Taxonomic Analyses
Primer bias
Chimeras
Non-16S tag amplified
Discovery of novel 16S
Unrepresented in reference database
Low-quality references
Taxonomy not available
Incorrect taxonomy
Ambiguous hypervariable sequence
Reference biased toward most studied